Difference between revisions of "Datasets"
From Knowitall
Line 8: | Line 8:
== Knowledge bases ==
* [https://developers.google.com/freebase/data Freebase Data Dumps]: all Freebase triples, updated weekly, as well as a one-time dump of triples that were removed (2013)
− * [http://googleresearch.blogspot.com/2013/05/distributing-edit-history-of-wikipedia.html Distributing the Edit History of Wikipedia Infoboxes]: edit history of Wikipedia info boxes.
+ * [http://googleresearch.blogspot.com/2013/05/distributing-edit-history-of-wikipedia.html Distributing the Edit History of Wikipedia Infoboxes]: edit history of Wikipedia info boxes (2013)
+ * [http://wiki.dbpedia.org/Datasets DBpedia Dataset]
== Entities ==
Line 23: | Line 24:
== Other resources ==
* [http://research.microsoft.com/apps/catalog/default.aspx?t=downloads Microsoft Research Downloads]
+ * [http://www.cse.unt.edu/~rada/downloads.html Rada Mihalcea's page]
Latest revision as of 03:25, 18 July 2013
Below is a list of potentially useful NLP datasets.
Corpora
- Syntactic Ngrams over Time (2013)
- Yelp Dataset Challenge: sample of Yelp data from Phoenix, AZ (2013)
- Google N-grams: N-grams from a large corpus of books (2010)
Knowledge bases
- Freebase Data Dumps: all Freebase triples, updated weekly, as well as a one-time dump of triples that were removed (2013)
- Distributing the Edit History of Wikipedia Infoboxes: edit history of Wikipedia infoboxes (2013)
- DBpedia Dataset
Entities
- Freebase Annotations of the ClueWeb Corpora: entity-linked ClueWeb09 and ClueWeb12 (2013)
- Learning from Big Data: 40 Million Entities in Context: named entities with document context (2013)
- From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas: probabilistic mappings from strings to entities (2012)
Relations
- 50,000 Lessons on How to Read: a Relation Extraction Corpus: human-annotated judgments for "place of birth" and "attended an institution" (2013)
Paraphrasing
- PPDB: The Paraphrase Database: paraphrases obtained from multi-lingual corpora (2013)