Datasets

Below is a list of potentially useful NLP datasets.

== Corpora ==
* [http://googleresearch.blogspot.com/2013/05/syntactic-ngrams-over-time.html Syntactic Ngrams over Time]: counts of syntactic (dependency-tree) n-grams over time, derived from the Google Books corpus (2013)
* [http://storage.googleapis.com/books/ngrams/books/datasetsv2.html Google N-grams]: n-grams from a large corpus of books (2010); see the reading sketch below
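A minimal reading sketch for one Google Books N-gram shard, assuming the v2 files are gzip-compressed TSV with the columns ngram, year, match_count, volume_count; the shard name below is just an example from the download page.

<pre>
import gzip
from collections import defaultdict

def total_counts(path, vocab):
    """Sum match_count across years for the n-grams listed in `vocab`."""
    totals = defaultdict(int)
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            # assumed columns: ngram, year, match_count, volume_count
            ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
            if ngram in vocab:
                totals[ngram] += int(match_count)
    return totals

# example shard name from the v2 download page; adjust to whichever file you fetch
print(total_counts("googlebooks-eng-all-1gram-20120701-a.gz", {"linguistics"}))
</pre>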

== Knowledge bases ==
* [https://developers.google.com/freebase/data Freebase Data Dumps]: all Freebase triples, updated weekly, as well as a one-time dump of triples that were removed (2013); see the sketch after this list
* [http://googleresearch.blogspot.com/2013/05/distributing-edit-history-of-wikipedia.html Distributing the Edit History of Wikipedia Infoboxes]: edit history of Wikipedia infoboxes (2013)
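A minimal sketch of scanning a Freebase dump for one property, assuming the dump is a gzipped file with one tab-separated triple (subject, predicate, object) per line; the file name and property substring are illustrative only.

<pre>
import gzip

def iter_triples(path):
    """Yield (subject, predicate, object) tuples from a gzipped dump,
    assuming one tab-separated triple per line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                yield fields[0], fields[1], fields[2]

# illustrative file name and property substring; check the dump documentation
# for the exact identifiers before relying on them
for subj, pred, obj in iter_triples("freebase-rdf-latest.gz"):
    if "people.person.date_of_birth" in pred:
        print(subj, obj)
</pre>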

== Entities ==
* [http://googleresearch.blogspot.com/2013/07/11-billion-clues-in-800-million.html Freebase Annotations of the ClueWeb Corpora]: entity-linked ClueWeb09 and ClueWeb12 (2013)
* [http://googleresearch.blogspot.com/2013/03/learning-from-big-data-40-million.html Learning from Big Data: 40 Million Entities in Context]: named entities with document context (2013)
* [http://googleresearch.blogspot.com/2012/05/from-words-to-concepts-and-back.html From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas]: probabilistic mappings from strings to entities (2012)

== Relations ==
* 50,000 Lessons on How to Read: a Relation Extraction Corpus: human-annotated judgments for the "place of birth" and "attended an institution" relations (2013)

== Paraphrasing ==
* PPDB: The Paraphrase Database: paraphrases extracted from multilingual parallel corpora (2013)
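A small parsing sketch for PPDB entries, assuming each line uses the "|||"-separated layout LHS ||| phrase ||| paraphrase ||| features ||| alignment (verify against the release README); the example line is a toy entry, not real PPDB data.

<pre>
def parse_ppdb_line(line):
    """Split one PPDB rule into its fields, assuming the layout
    LHS ||| phrase ||| paraphrase ||| feature=value ... ||| alignment."""
    lhs, phrase, paraphrase, features, alignment = (
        field.strip() for field in line.split("|||")
    )
    feats = dict(kv.split("=", 1) for kv in features.split())
    return lhs, phrase, paraphrase, feats, alignment

# toy entry in the assumed format, not taken from the actual database
example = "[NP] ||| the planet ||| the earth ||| p(e|f)=2.5 p(f|e)=3.1 ||| 0-0 1-1"
print(parse_ppdb_line(example))
</pre>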