Document-level Open IE
Goals
- Extend sentence-based Open IE extractors to incorporate document-level reasoning, such as:
- Coreference
- Entity Linking
- NER
- Rules implemented for TAC 2013 Entity Linking
- Define necessary data structures and interfaces by Oct-9 (done)
- Preliminary End-to-end system evaluation by Nov-11 (done)
- Quantitatively determine how much this adds to the existing sentence-level Open IE
Work Log
11-22
Trained and evaluated a linear classifier over best mentions (instances of a rule application or best-mention resolution), which provides 95% precision at 90% yield over news data. Features include the rule type (person, organization, location), whether coreference info was used, and the ambiguity of a given mention. To do from here:
- Polish features:
- Include coreference info when deciding candidates for a rule (currently coreference is only considered after applying rules)
- Improve ambiguity measures: return a value that indicates the prominence of the chosen mention relative to the other ambiguous mentions
- Improve the location ambiguity measure: fix a technical issue reading the TIPSTER gazetteer, and consider city, state/province, and country ambiguity separately.
- Debug an issue where names are resolved to something only matching a prefix (e.g. Steven Miller -> Steven Tyler)
- Produce a formal evaluation
- How much does this help "Open IE?"
- How many extractions get annotated with additional, useful information?
- How many more links get found as a result of best mentions? Coref? Are they higher confidence links?
- How much does document-level processing increase running time, with and without coref?
- Code cleanup, packaging, and release.
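As a sketch, the classifier's feature encoding and scoring might look like the following. The feature set mirrors the one described above, but the function names and weights are illustrative, not the trained model:

```python
import math

RULE_TYPES = ["person", "organization", "location"]

def featurize(rule_type, used_coref, ambiguity):
    """Encode one rule application / best-mention resolution as a feature vector."""
    vec = [1.0 if rule_type == t else 0.0 for t in RULE_TYPES]  # one-hot rule type
    vec.append(1.0 if used_coref else 0.0)  # was coreference info used?
    vec.append(float(ambiguity))            # how ambiguous is the mention?
    return vec

# Illustrative weights only; the real model's weights come from training.
WEIGHTS = [0.8, 0.5, 0.3, 0.4, -1.2]
BIAS = 0.1

def confidence(rule_type, used_coref, ambiguity):
    """Linear score squashed through a sigmoid into a [0, 1] confidence."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, featurize(rule_type, used_coref, ambiguity)))
    return 1.0 / (1.0 + math.exp(-z))
```

With a negative weight on ambiguity, more ambiguous mentions get lower confidence, which matches the intent of the ambiguity features above.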
11-12
Implemented serialization so that extractor input can be saved to disk after pre-processing, saving us from redoing the following with each run:
- Parsing
- Chunking
- Stemming
- Coref
- Sentence-level Open IE
- NER tagging
This saves roughly 3 minutes per run (over 20 docs) and will greatly speed up development.
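A minimal sketch of this kind of cache, assuming Python's pickle for serialization; the container class and function names are illustrative, not the project's actual code:

```python
import os
import pickle
from dataclasses import dataclass, field

@dataclass
class PreprocessedDoc:
    """Illustrative container for pre-processed extractor input."""
    doc_id: str
    sentences: list = field(default_factory=list)  # parses, chunks, stems, NER, coref, extractions

def load_or_preprocess(doc_id, preprocess, cache_dir="cache"):
    """Return the cached pre-processed document, running the pipeline only on a miss."""
    path = os.path.join(cache_dir, doc_id + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    doc = preprocess(doc_id)  # expensive: parse, chunk, stem, coref, NER, sentence Open IE
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(doc, f)
    return doc
```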
Started refactoring and cleaning up rules. Next step: get all substitution rules "on equal footing" programmatically so that a classifier can be built to rank them.
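One way to put rule applications "on equal footing" is a single record type that every rule emits, so a classifier can rank candidates uniformly. This sketch is hypothetical; the field names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class RuleApplication:
    """One candidate substitution produced by any rule, in a uniform shape."""
    rule_type: str      # e.g. "person", "organization", "location"
    mention: str        # original argument text
    best_mention: str   # proposed substitution
    used_coref: bool    # whether coreference info contributed
    ambiguity: float    # score for competing candidates
```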
11-8
Finished annotating data and discussed results. String substitution rules need to be tightened up, and a confidence measure over them would help greatly.
Extraction-level stats
From 20 documents, there were 528 total extractions in all runs.
Rules diff:
- 206 extractions in diff
- 75 baseline better
- 99 rule-based system better
- 33 bad extractions (neither better)

Coref diff:
- 280 extractions in diff
- 105 baseline better
- 115 coref+rule-based system better
- 59 bad extractions (neither better)
I took a closer look at the "baseline better" cases to see where we were getting it wrong:
Rule-based system (75 total):
- 49 strange string errors, e.g. "CDC" -> "CIENCE SLIGHTED IN"
- 16 location errors (e.g. "Washington" [DC] -> "Washington, Georgia")
- 8 entity disambiguation errors, e.g. "he" [Scott Peterson] -> "Laci Peterson"
- 1 incorrect link (e.g. "the theory" linked to "Theory" in Freebase)

Coref+rule-based system (105 total):
- 49 strange string errors
- 11 location errors
- 13 entity disambiguation errors
- 17 incorrect links
- 6 coref errors (e.g. "make it clear that" -> "make the CDC clear that")
Approximate running times over 20 documents:
- Baseline: 45 sec
- Rules: 45 sec
- Rules+Coref: 230 sec
11-4
- Released system output for evaluation:
- "Rules" configuration, using rule-based best-mention disambiguation, NO Coref.
- "Coref" configuration, using coref-assisted rule-based best-mention disambiguation. Entity Linking context also extended via coreference.
- Entity Linking output, showing differences in entity links between each system configuration (and the baseline).
Next: Stephen, John and I will annotate the output and analyze performance.
10-25
Met with Stephen, John, and Michael. Items:
- Create a (very simple) webapp for doc extractor
- Clean up arguments before submitting them to the linker.
- Replace best-mention substrings rather than substituting best mentions for the entire argument.
- Reformat evaluation output to show only extractions that have been annotated with additional info (diff)
- Evaluate difference in linker performance with/without document-level info.
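The substring-replacement item above might look like this minimal sketch; the function name and the naive string matching are assumptions (real code would track character offsets rather than search):

```python
def substitute_best_mention(argument, mention, best_mention):
    """Replace only the mention substring inside the argument,
    rather than overwriting the whole argument with the best mention."""
    start = argument.find(mention)
    if start == -1:
        return argument  # mention not present; leave the argument untouched
    return argument[:start] + best_mention + argument[start + len(mention):]
```

For example, `substitute_best_mention("the mayor of Washington", "Washington", "Washington, D.C.")` keeps the surrounding argument text intact.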
10-18
Met with Stephen and John. Discussed:
- Evaluation systems:
- Baseline sentence extractor with entity linker, no coreference
- Full system with best-mention finding rules
- Full system without coreference.
- Evaluation data:
- Sample of 20-30 documents from TAC 2013.
- Moving away from a QA/query-based approach, since the queries/questions would bias evaluation of the document extractor.
- Instead, we will evaluate all (or a uniform sample) of extractions.
- Evaluation criteria:
- Extractions "correct" if their arguments are as unambiguous as possible given the document text.
- Measure prec/yield using this metric and compare systems.
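The precision/yield comparison could be computed as in this sketch; the names and the exact definition of yield here are assumptions:

```python
def precision_yield(annotations, total_attempted):
    """annotations: booleans, True if the extraction's arguments were judged
    as unambiguous as the document allows ("correct").
    total_attempted: extractions the system attempted overall."""
    correct = sum(annotations)
    precision = correct / len(annotations) if annotations else 0.0
    yield_frac = len(annotations) / total_attempted if total_attempted else 0.0
    return precision, yield_frac
```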
10-17
Completed: Integrated sentence-level Open IE and Freebase Linker, test run OK.
Next Goals:
- Integrate best-mention finding rules.
- First: Drop in code "as-is"
- After: Factor out NER tagging, coref components
- Fix issues with tracking character offsets
- Offsets are not properly computed for Open IE extractions
- Find a good way for retrieving document metadata by character offset.
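A minimal sketch of recovering character offsets for extraction tokens, assuming tokens appear verbatim and in order in the document text (names illustrative):

```python
def token_char_offsets(text, tokens):
    """Map each token to its (start, end) character offsets in text,
    scanning left to right so repeated tokens resolve in order."""
    offsets, cursor = [], 0
    for tok in tokens:
        start = text.index(tok, cursor)  # find the token at or after the cursor
        offsets.append((start, start + len(tok)))
        cursor = start + len(tok)
    return offsets
```

With offsets in hand, document metadata can be retrieved by character position as the item above describes.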
10-9
Short-term goal: define necessary interfaces and data structures by 10-11
- Implemented interfaces for:
- Document
- Sentence
- Extraction
- Argument/Relation
- Coreference Mention
- Coreference Cluster
- Entity Link
- Discussed interfaces at length with John and Michael
- Interfaces to be incorporated into generic NLP tool library (nlptools):
- Document
- Sentence
- CorefResolver
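Rendered as illustrative Python dataclasses, the interfaces above might nest roughly as follows; the real interface definitions differ in detail, and every field name here is an assumption:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class EntityLink:
    entity_id: str          # e.g. a Freebase id
    score: float = 0.0

@dataclass
class Argument:
    text: str
    link: Optional[EntityLink] = None  # filled in by the entity linker

@dataclass
class Extraction:
    arg1: Argument
    relation: str
    arg2: Argument

@dataclass
class Mention:
    text: str
    char_start: int
    char_end: int

@dataclass
class CorefCluster:
    mentions: List[Mention] = field(default_factory=list)

@dataclass
class Sentence:
    text: str
    extractions: List[Extraction] = field(default_factory=list)

@dataclass
class Document:
    doc_id: str
    text: str
    sentences: List[Sentence] = field(default_factory=list)
    clusters: List[CorefCluster] = field(default_factory=list)
```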