Document-level Open IE
Revision as of 23:00, 13 November 2013

Goals

  • Extend sentence-based Open IE extractors to incorporate document-level reasoning, such as:
    • Coreference
    • Entity Linking
    • NER
    • Rules implemented for TAC 2013 Entity Linking
  • Define necessary data structures and interfaces by Oct-9
  • End-to-end system evaluation by Nov-11

Work Log

11-12

Implemented serialization so that extractor input can be saved to disk after pre-processing, saving us from redoing the following steps on each run:

  • Parsing
  • Chunking
  • Stemming
  • Coref
  • Sentence-level Open IE
  • NER tagging

This saves roughly 3 minutes per run (over 20 documents) and will greatly speed up development.
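As a rough sketch of this kind of caching (all class and path names here are illustrative, not the actual KnowItAll code), plain Java serialization is enough to persist the pre-processed document between runs:

    import java.io._

    // Hypothetical container for everything the extractor consumes per
    // document: parses, chunks, stems, coref clusters, sentence-level
    // extractions, and NER tags. A String payload stands in for all of it.
    case class PreprocessedDoc(id: String, payload: String)

    object PreprocessCache {
      private def cacheFile(id: String) = new File(s"cache/$id.ser")

      // Return the cached pre-processing result if present; otherwise run
      // the expensive pipeline once and save the result for the next run.
      def getOrCompute(id: String)(pipeline: => PreprocessedDoc): PreprocessedDoc = {
        val f = cacheFile(id)
        if (f.exists) {
          val in = new ObjectInputStream(new FileInputStream(f))
          try in.readObject().asInstanceOf[PreprocessedDoc] finally in.close()
        } else {
          val doc = pipeline
          f.getParentFile.mkdirs()
          val out = new ObjectOutputStream(new FileOutputStream(f))
          try { out.writeObject(doc); doc } finally out.close()
        }
      }
    }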

Started refactoring and cleaning up rules. Next step: get all substitution rules "on equal footing" programmatically so that a classifier can be built to rank them.
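A minimal sketch of what "equal footing" could mean in code (the trait and classifier here are hypothetical, not the actual interfaces): every rule proposes substitutions through the same interface and describes each proposal with features, so one model can score them all.

    // Every substitution rule proposes replacements the same way and
    // describes its proposal with features for a ranking classifier.
    trait SubstitutionRule {
      def name: String
      def apply(argument: String): Option[String]
      def features(argument: String, replacement: String): Map[String, Double]
    }

    // Stand-in for a trained model: a linear scorer over rule features.
    class RuleRanker(weights: Map[String, Double]) {
      def score(r: SubstitutionRule, arg: String, repl: String): Double =
        r.features(arg, repl).map { case (f, v) => weights.getOrElse(f, 0.0) * v }.sum

      // Among all rules that fire on an argument, keep the best proposal.
      def best(rules: Seq[SubstitutionRule], arg: String): Option[(String, Double)] =
        rules.flatMap(r => r(arg).map(repl => (repl, score(r, arg, repl))))
          .sortBy(-_._2).headOption
    }

The score could double as the per-substitution confidence measure discussed in the 11-8 entry below.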

11-8

Finished annotating data and discussed results. String substitution rules need to be tightened up, and a confidence measure over them would help greatly.

Extraction-level stats (emailed)

From 20 documents, there were 528 total extractions in all runs.

Rules-diff:
  • 206 extractions in diff
  • 75 baseline better
  • 99 rule-based system better
  • 33 bad extractions (neither better)

Coref-diff:
  • 280 extractions in diff
  • 105 baseline better
  • 115 coref+rule-based system better
  • 59 bad extractions (neither better)

I took a closer look at the "baseline better" cases to see where we were getting it wrong:

Rule-based system:
  • 49 strange string errors (e.g. "CDC" -> "CIENCE SLIGHTED IN")
  • 16 location errors (e.g. "Washington" [DC] -> "Washington, Georgia")
  • 8 entity disambiguation errors (e.g. "he" [Scott Peterson] -> "Laci Peterson")
  • 1 incorrect link (e.g. "the theory" linked to "Theory" in Freebase)
  • 75 total

Coref+rule-based system:
  • 49 strange string errors
  • 11 location errors
  • 13 entity disambiguation errors
  • 17 incorrect links
  • 6 coref errors (e.g. "make it clear that" -> "make the CDC clear that")
  • 105 total

Approximate running times over 20 documents:
  • Baseline: 45 sec
  • Rules: 45 sec
  • Rules+Coref: 230 sec

11-4

  • Released system output for evaluation:
    • "Rules" configuration, using rule-based best-mention disambiguation, NO Coref.
    • "Coref" configuration, using coref-assisted rule-based best-mention disambiguation. Entity Linking context also extended via coreference.
    • Entity Linking output, showing differences in entity links between each system configuration (and the baseline).

Next: Stephen, John and I will annotate the output and analyze performance.

10-25

Met with Stephen, John, and Michael. Items:

  • Create a (very simple) webapp for doc extractor
  • Clean up arguments before submitting them to the linker.
  • Replace best-mention substrings within an argument rather than substituting the best mention for the entire argument (see the sketch after this list).
  • Reformat evaluation output to show only extractions that have been annotated with additional info (diff)
  • Evaluate difference in linker performance with/without document-level info.
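As a toy illustration of the substring-replacement item above (hypothetical code; the real rules work over tokenized arguments with offsets), replacing only the mention's span preserves the rest of the argument:

    object BestMentionSubstitution {
      // Replace only the mention's span inside the argument instead of
      // discarding the whole argument string.
      def substitute(argument: String, mention: String, best: String): String =
        argument.replace(mention, best)

      def main(args: Array[String]): Unit = {
        // Whole-argument substitution would turn "the senator from Washington"
        // into just "Washington, D.C.", losing "the senator from".
        println(substitute("the senator from Washington", "Washington", "Washington, D.C."))
        // prints: the senator from Washington, D.C.
      }
    }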

10-18

Met with Stephen and John. Discussed:

  • Evaluation systems:
    • Baseline sentence extractor with entity linker, no coreference
    • Full system with best-mention finding rules
    • Full system without coreference.
  • Evaluation data:
    • Sample of 20-30 documents from TAC 2013.
    • Moving away from a QA/query-based approach, since the queries/questions would bias evaluation of the document extractor.
    • Instead, we will evaluate all (or a uniform sample) of extractions.
  • Evaluation criteria:
    • Extractions "correct" if their arguments are as unambiguous as possible given the document text.
    • Measure prec/yield using this metric and compare systems.
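A sketch of the metric under this criterion (assuming per-extraction correctness judgments; names are illustrative). Since there is no gold set of all true extractions, yield stands in for recall:

    object Metrics {
      case class EvalResult(precision: Double, yieldCount: Int)

      // Precision = fraction of annotated extractions judged correct;
      // yield = absolute number of correct extractions.
      def evaluate(judgments: Seq[Boolean]): EvalResult = {
        val correct = judgments.count(identity)
        EvalResult(correct.toDouble / judgments.size, correct)
      }
    }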

10-17

Completed: Integrated sentence-level Open IE and Freebase Linker, test run OK.

Next Goals:

  • Integrate best-mention finding rules.
    • First: Drop in code "as-is"
    • After: Factor out NER tagging, coref components
  • Fix issues with tracking character offsets (see the sketch after this list)
    • Offsets are not properly computed for Open IE extractions
    • Find a good way for retrieving document metadata by character offset.
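One standard way to keep offsets consistent (a sketch, not the project's actual fix) is to record each token's character interval at tokenization time and derive extraction offsets from token spans:

    object Offsets {
      // A token with its character interval in the original document text.
      case class Token(text: String, start: Int) { def end: Int = start + text.length }

      // Naive whitespace tokenizer that records character offsets, so any
      // span of tokens can be mapped back to document characters.
      def tokenize(doc: String): Vector[Token] = {
        val tokens = Vector.newBuilder[Token]
        var i = 0
        while (i < doc.length) {
          if (doc(i).isWhitespace) i += 1
          else {
            val start = i
            while (i < doc.length && !doc(i).isWhitespace) i += 1
            tokens += Token(doc.substring(start, i), start)
          }
        }
        tokens.result()
      }

      // Character interval of a token span [from, until), e.g. an argument.
      def charInterval(tokens: Vector[Token], from: Int, until: Int): (Int, Int) =
        (tokens(from).start, tokens(until - 1).end)
    }

With offsets carried this way, retrieving document metadata by character offset reduces to an interval lookup over token (or sentence) intervals.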

10-9

Short-term goal: define necessary interfaces and data structures by 10-11.

  • Implemented interfaces for:
    • Document
    • Sentence
    • Extraction
    • Argument/Relation
    • Coreference Mention
    • Coreference Cluster
    • Entity Link
  • Discussed interfaces at length with John and Michael
    • Interfaces to be incorporated into generic NLP tool library (nlptools); see the sketch after this list:
      • Document
      • Sentence
      • CorefResolver
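A rough sketch of how those interfaces might fit together (hypothetical, simplified signatures; the actual nlptools traits may differ):

    // Simplified stand-ins for the interfaces listed above.
    trait Sentence { def text: String; def offset: Int }

    trait Document {
      def text: String
      def sentences: Seq[Sentence]
    }

    case class Mention(text: String, offset: Int)

    case class CorefCluster(mentions: Seq[Mention]) {
      // A representative "best" mention, e.g. the longest in the cluster.
      def bestMention: Mention = mentions.maxBy(_.text.length)
    }

    trait CorefResolver {
      def resolve(doc: Document): Seq[CorefCluster]
    }

    // An argument or relation phrase with its character interval.
    case class Part(text: String, start: Int, end: Int)

    // A sentence-level extraction and an entity link for one of its arguments.
    case class Extraction(arg1: Part, rel: Part, arg2: Part)
    case class EntityLink(mention: Mention, freebaseId: String, score: Double)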