Document-level Open IE

Goals

  • Extend sentence-based Open IE extractors to incorporate document-level reasoning, such as:
    • Coreference
    • Entity Linking
    • NER
    • Rules implemented for TAC 2013 Entity Linking
  • Define necessary data structures and interfaces by Oct-9 (done)
  • Preliminary End-to-end system evaluation by Nov-11 (done)
  • Quantitatively determine how much this adds to present Open IE

Work Log

11-22

Trained and evaluated a linear classifier for best mentions (each instance is a rule application or a best-mention resolution); it achieves 95% precision at 90% yield over news data. Features include the rule type (person, organization, location), whether coreference info was used, and the ambiguity of a given mention. Todo from here:

  • Polish features:
    • Include coreference info when deciding candidates for a rule (currently coreference is only considered after applying rules)
    • Improve ambiguity measures: Return a value that indicates prominence of the chosen mention out of other ambiguous mentions
    • Improve the location ambiguity measure: fix a technical issue reading the TIPSTER gazetteer. Consider city, stateOrProvince, and Country ambiguity separately.
    • Debug an issue where names are resolved to something only matching a prefix (e.g. Steven Miller -> Steven Tyler)
  • Produce a formal evaluation
    • How much does this help "Open IE"?
      • How many extractions get annotated with additional, useful information?
      • How many more links get found as a result of best mentions? Coref? Are they higher confidence links?
      • How much does it increase running time to do Document-level processing with/without Coref?
  • Code cleanup, packaging, and release.
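
As a rough illustration of the classifier input described above, the sketch below shows one way the per-instance features (rule type, coreference usage, ambiguity) might be encoded in Scala; the BestMentionInstance fields and feature names are assumptions for illustration, not the actual code.

  // Hypothetical feature encoding for a best-mention instance.
  // Field and feature names are illustrative only.
  case class BestMentionInstance(
    ruleType: String,        // "person", "organization", or "location"
    usedCoref: Boolean,      // whether coreference info contributed to the resolution
    numCandidates: Int,      // number of ambiguous candidate mentions
    chosenProminence: Double // prominence of the chosen mention among the candidates
  )

  def features(inst: BestMentionInstance): Map[String, Double] = Map(
    ("ruleType=" + inst.ruleType) -> 1.0,
    "usedCoref"                   -> (if (inst.usedCoref) 1.0 else 0.0),
    "ambiguity"                   -> inst.numCandidates.toDouble,
    "prominence"                  -> inst.chosenProminence
  )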


11-12

Implemented serialization so that extractor input can be saved to disk after pre-processing, saving us from redoing the following steps with each run:

  • Parsing
  • Chunking
  • Stemming
  • Coref
  • Sentence-level Open IE
  • NER tagging

This saves roughly 3 minutes per run (over 20 docs) and will greatly speed up development.
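
A minimal sketch of what this caching could look like, assuming plain Java serialization and a hypothetical PreprocessedDoc container (not the project's actual types or file format):

  import java.io._

  // Hypothetical container for the expensive pre-processing results listed above;
  // the real pipeline's parses, chunks, coref clusters, etc. would live here.
  case class PreprocessedDoc(docId: String, payload: Array[Byte]) extends Serializable

  // Write the preprocessed document to disk once...
  def save(doc: PreprocessedDoc, file: File): Unit = {
    val out = new ObjectOutputStream(new FileOutputStream(file))
    try out.writeObject(doc) finally out.close()
  }

  // ...and reload it on later runs instead of re-running parsing, coref, NER, etc.
  def load(file: File): PreprocessedDoc = {
    val in = new ObjectInputStream(new FileInputStream(file))
    try in.readObject().asInstanceOf[PreprocessedDoc] finally in.close()
  }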

Started refactoring and cleaning up rules. Next step: get all substitution rules "on equal footing" programmatically so that a classifier can be built to rank them.
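
One way to put the rules "on equal footing" is a shared trait that every substitution rule implements, so a classifier can score each rule application uniformly; this is only a sketch, and the trait, method, and class names are assumptions:

  // Hypothetical common interface for substitution rules, so person, organization,
  // and location rules can all be ranked by the same classifier.
  trait SubstitutionRule {
    def name: String
    // Propose (originalMention, bestMention) substitutions for one document.
    def apply(docText: String): Seq[(String, String)]
  }

  // Example rule stub: expand acronyms to the full name seen elsewhere in the document.
  class AcronymRule extends SubstitutionRule {
    def name = "acronym"
    def apply(docText: String): Seq[(String, String)] = Seq.empty // placeholder
  }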

11-8

Finished annotating data and discussed results. String substitution rules need to be tightened up, and a confidence measure over them would help greatly.

Extraction-level stats

From 20 documents, there were 528 total extractions in all runs.

Rules-diff:
  • 206 extractions in diff
  • 75 baseline better
  • 99 rule-based system better
  • 33 bad extractions (neither better)

Coref-diff:
  • 280 extractions in diff
  • 105 baseline better
  • 115 coref+rule-based system better
  • 59 bad extractions (neither better)

I took a closer look at the "baseline better" cases to see where we were getting it wrong:

Rule-based system (75 total):
  • 49 strange string errors, e.g. "CDC" -> "CIENCE SLIGHTED IN"
  • 16 location errors (e.g. "Washington" [DC] -> "Washington, Georgia")
  • 8 entity disambiguation errors (e.g. "he" [Scott Peterson] -> "Laci Peterson")
  • 1 incorrect link (e.g. "the theory" linked to "Theory" in Freebase)

Coref+rule-based system (105 total):
  • 49 strange string errors
  • 11 location errors
  • 13 entity disambiguation errors
  • 17 incorrect links
  • 6 coref errors (e.g. "make it clear that" -> "make the CDC clear that")

Approximate running times over 20 documents:
  • Baseline: 45 sec
  • Rules: 45 sec
  • Rules+Coref: 230 sec

11-4

  • Released system output for evaluation:
    • "Rules" configuration, using rule-based best-mention disambiguation, NO Coref.
    • "Coref" configuration, using coref-assisted rule-based best-mention disambiguation. Entity Linking context also extended via coreference.
    • Entity Linking output, showing differences in Entitylinks between each system configuration (and baseline).

Next: Stephen, John and I will annotate the output and analyze performance.

10-25

Met with Stephen, John, and Michael. Items:

  • Create a (very simple) webapp for the doc extractor
  • Clean up arguments before submitting them to the linker.
  • Replace best-mention substrings rather than substituting best mentions for the entire argument (see the sketch after this list).
  • Reformat evaluation output to show only extractions that have been annotated with additional info (diff)
  • Evaluate difference in linker performance with/without document-level info.
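
The substring replacement mentioned above might look like the rough sketch below; the helper name and the example strings are illustrative, not actual project code:

  // Hedged sketch: replace only the resolved mention inside the argument text,
  // keeping the rest of the argument intact.
  def substituteBestMention(argText: String, mention: String, bestMention: String): String = {
    val idx = argText.indexOf(mention)
    if (idx < 0) argText
    else argText.substring(0, idx) + bestMention + argText.substring(idx + mention.length)
  }

  // e.g. substituteBestMention("the CDC director", "CDC", "Centers for Disease Control")
  //      would yield "the Centers for Disease Control director"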

10-18

Met with Stephen and John. Discussed:

  • Evaluation systems:
    • Baseline sentence extractor with entity linker, no coreference
    • Full system with best-mention finding rules
    • Full system without coreference.
  • Evaluation data:
    • Sample of 20-30 documents from TAC 2013.
    • Moving away from a QA/query-based approach, since the queries/questions would bias evaluation of the document extractor.
    • Instead, we will evaluate all (or a uniform sample) of extractions.
  • Evaluation criteria:
    • An extraction is "correct" if its arguments are as unambiguous as possible given the document text.
    • Measure precision/yield using this metric and compare systems.
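
A small sketch of the precision/yield computation this implies (the Annotation type is an assumption made here for illustration):

  // Hedged sketch: compute precision and yield for one system configuration
  // from per-extraction correctness annotations.
  case class Annotation(system: String, correct: Boolean)

  def precisionAndYield(annotations: Seq[Annotation], system: String): (Double, Int) = {
    val forSystem = annotations.filter(_.system == system)
    val correct   = forSystem.count(_.correct)
    val precision = if (forSystem.isEmpty) 0.0 else correct.toDouble / forSystem.size
    (precision, correct) // yield = number of correct extractions
  }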

10-17

Completed: Integrated sentence-level Open IE and Freebase Linker, test run OK.

Next Goals:

  • Integrate best-mention finding rules.
    • First: Drop in code "as-is"
    • After: Factor out NER tagging, coref components
  • Fix issues with tracking character offsets
    • Offsets are not properly computed for Open IE extractions
    • Find a good way to retrieve document metadata by character offset.
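
One possible approach to the metadata lookup, sketched under assumptions (the MetadataSpan type and labels are not from the project): index metadata on character intervals and return the span enclosing a given offset.

  // Hypothetical sketch: attach document metadata (headline, dateline, byline, ...)
  // to character intervals so it can be retrieved by offset.
  case class MetadataSpan(start: Int, end: Int, label: String)

  class MetadataIndex(spans: Seq[MetadataSpan]) {
    // A linear scan is fine for a handful of metadata spans per document;
    // a structure sorted by start offset would scale to larger inputs.
    def spanAt(offset: Int): Option[MetadataSpan] =
      spans.find(s => s.start <= offset && offset < s.end)
  }

  // e.g. new MetadataIndex(Seq(MetadataSpan(0, 42, "headline"))).spanAt(10)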

10-9

Short-term goal: define the necessary interfaces and data structures by 10-11.

  • Implemented interfaces for:
    • Document
    • Sentence
    • Extraction
    • Argument/Relation
    • Coreference Mention
    • Coreference Cluster
    • Entity Link
  • Discussed interfaces at length with John and Michael
    • Interfaces to be incorporated into generic NLP tool library (nlptools):
      • Document
      • Sentence
      • CorefResolver
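
A hedged sketch of how these interfaces might fit together as Scala traits; the names follow the list above, but the members shown are assumptions, not the actual nlptools definitions:

  // Illustrative trait sketch for the interfaces listed above.
  trait Document      { def text: String; def sentences: Seq[Sentence] }
  trait Sentence      { def text: String; def offset: Int }
  trait Argument      { def text: String; def link: Option[EntityLink] }
  trait Extraction    { def arg1: Argument; def relation: String; def arg2: Argument }
  trait Mention       { def text: String; def offset: Int }
  trait CorefCluster  { def mentions: Seq[Mention]; def bestMention: Mention }
  trait EntityLink    { def freebaseId: String; def score: Double }
  trait CorefResolver { def resolve(doc: Document): Seq[CorefCluster] }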