Difference between revisions of "Datasets"

From Knowitall

Revision as of 03:20, 18 July 2013

Below is a list of potentially useful NLP datasets.

Contents

1 Corpora
2 Knowledge bases
3 Entities
4 Relations
5 Paraphrasing
6 Other resources

Corpora

Syntactic Ngrams over Time (2013)
Yelp Dataset Challenge: sample of Yelp data from Phoenix, AZ (2013)
Google N-grams: n-grams from a large corpus of books (2010)

Knowledge bases

Freebase Data Dumps: all Freebase triples, updated weekly, as well as a one-time dump of triples that were removed (2013)
Distributing the Edit History of Wikipedia Infoboxes: edit history of Wikipedia infoboxes

Entities

Freebase Annotations of the ClueWeb Corpora: entity-linked ClueWeb09 and ClueWeb12 (2013)
Learning from Big Data: 40 Million Entities in Context: named entities with document context (2013)
From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas: probabilistic mappings from strings to entities (2012)

Relations

50,000 Lessons on How to Read: a Relation Extraction Corpus: human-annotated judgments for "place of birth" and "attended an institution" (2013)

Paraphrasing

PPDB: The Paraphrase Database: paraphrases obtained from multi-lingual corpora (2013)

Other resources

Microsoft Research Downloads
