Document-level Open IE
Revision as of 23:00, 13 November 2013

Goals

  • Extend sentence-based Open IE extractors to incorporate document-level reasoning, such as:
    • Coreference
    • Entity Linking
    • NER
    • Rules implemented for TAC 2013 Entity Linking
  • Define necessary data structures and interfaces by Oct-9
  • End-to-end system evaluation by Nov-11

Work Log

11-12

Implemented serialization so that extractor input can be saved to disk after pre-processing, saving us from redoing the following steps on each run:

  • Parsing
  • Chunking
  • Stemming
  • Coref
  • Sentence-level Open IE
  • NER tagging

This saves roughly 3 minutes per run (over 20 documents) and will greatly speed up development.
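As a rough sketch of this kind of caching (all class and path names here are illustrative, not the actual KnowItAll code), plain Java serialization is enough to persist the pre-processed document between runs:

    import java.io._

    // Hypothetical container for everything the extractor consumes per
    // document: parses, chunks, stems, coref clusters, sentence-level
    // extractions, and NER tags. A String payload stands in for all of it.
    case class PreprocessedDoc(id: String, payload: String)

    object PreprocessCache {
      private def cacheFile(id: String) = new File(s"cache/$id.ser")

      // Return the cached pre-processing result if present; otherwise run
      // the expensive pipeline once and save the result for the next run.
      def getOrCompute(id: String)(pipeline: => PreprocessedDoc): PreprocessedDoc = {
        val f = cacheFile(id)
        if (f.exists) {
          val in = new ObjectInputStream(new FileInputStream(f))
          try in.readObject().asInstanceOf[PreprocessedDoc] finally in.close()
        } else {
          val doc = pipeline
          f.getParentFile.mkdirs()
          val out = new ObjectOutputStream(new FileOutputStream(f))
          try { out.writeObject(doc); doc } finally out.close()
        }
      }
    }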

Started refactoring and cleaning up rules. Next step: get all substitution rules "on equal footing" programmatically so that a classifier can be built to rank them.
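A minimal sketch of what "equal footing" could mean in code (the trait and classifier here are hypothetical, not the actual interfaces): every rule proposes substitutions through the same interface and describes each proposal with features, so one model can score them all.

    // Every substitution rule proposes replacements the same way and
    // describes its proposal with features for a ranking classifier.
    trait SubstitutionRule {
      def name: String
      def apply(argument: String): Option[String]
      def features(argument: String, replacement: String): Map[String, Double]
    }

    // Stand-in for a trained model: a linear scorer over rule features.
    class RuleRanker(weights: Map[String, Double]) {
      def score(r: SubstitutionRule, arg: String, repl: String): Double =
        r.features(arg, repl).map { case (f, v) => weights.getOrElse(f, 0.0) * v }.sum

      // Among all rules that fire on an argument, keep the best proposal.
      def best(rules: Seq[SubstitutionRule], arg: String): Option[(String, Double)] =
        rules.flatMap(r => r(arg).map(repl => (repl, score(r, arg, repl))))
          .sortBy(-_._2).headOption
    }

The score could double as the per-substitution confidence measure discussed in the 11-8 entry below.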

11-8

Finished annotating data and discussed results. String substitution rules need to be tightened up, and a confidence measure over them would help greatly.

Extraction-level stats (emailed)

From 20 documents, there were 528 total extractions in all runs.

Rules-diff:
  • 206 extractions in diff
  • 75 baseline better
  • 99 rule-based system better
  • 33 bad extractions (neither better)

Coref-diff:
  • 280 extractions in diff
  • 105 baseline better
  • 115 coref+rule-based system better
  • 59 bad extractions (neither better)

I took a closer look at the "baseline better" cases to see where we were getting it wrong:

Rule-based system:
  • 49 strange string errors (e.g. "CDC" -> "CIENCE SLIGHTED IN")
  • 16 location errors (e.g. "Washington" [DC] -> "Washington, Georgia")
  • 8 entity disambiguation errors (e.g. "he" [Scott Peterson] -> "Laci Peterson")
  • 1 incorrect link (e.g. "the theory" linked to "Theory" in Freebase)
  • 75 total

Coref+rule-based system:
  • 49 strange string errors
  • 11 location errors
  • 13 entity disambiguation errors
  • 17 incorrect links
  • 6 coref errors (e.g. "make it clear that" -> "make the CDC clear that")
  • 105 total

Approximate running times over 20 documents:
  • Baseline: 45 sec
  • Rules: 45 sec
  • Rules+Coref: 230 sec

11-4

  • Released system output for evaluation:
    • "Rules" configuration, using rule-based best-mention disambiguation, NO Coref.
    • "Coref" configuration, using coref-assisted rule-based best-mention disambiguation. Entity Linking context also extended via coreference.
    • Entity Linking output, showing differences in entity links between each system configuration (and the baseline).

Next: Stephen, John and I will annotate the output and analyze performance.

10-25

Met with Stephen, John, and Michael. Items:

  • Create a (very simple) webapp for doc extractor
  • Clean up arguments before submitting them to the linker.
  • Replace best-mention substrings within an argument rather than substituting the best mention for the entire argument (see the sketch after this list).
  • Reformat evaluation output to show only extractions that have been annotated with additional info (diff)
  • Evaluate difference in linker performance with/without document-level info.
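As a toy illustration of the substring-replacement item above (hypothetical code; the real rules work over tokenized arguments with offsets), replacing only the mention's span preserves the rest of the argument:

    object BestMentionSubstitution {
      // Replace only the mention's span inside the argument instead of
      // discarding the whole argument string.
      def substitute(argument: String, mention: String, best: String): String =
        argument.replace(mention, best)

      def main(args: Array[String]): Unit = {
        // Whole-argument substitution would turn "the senator from Washington"
        // into just "Washington, D.C.", losing "the senator from".
        println(substitute("the senator from Washington", "Washington", "Washington, D.C."))
        // prints: the senator from Washington, D.C.
      }
    }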

10-18

Met with Stephen and John. Discussed:

  • Evaluation systems:
    • Baseline sentence extractor with entity linker, no coreference
    • Full system with best-mention finding rules
    • Full system without coreference.
  • Evaluation data:
    • Sample of 20-30 documents from TAC 2013.
    • Moving away from a QA/query-based approach, since the queries/questions would bias evaluation of the document extractor.
    • Instead, we will evaluate all (or a uniform sample) of extractions.
  • Evaluation criteria:
    • Extractions "correct" if their arguments are as unambiguous as possible given the document text.
    • Measure prec/yield using this metric and compare systems.
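A sketch of the metric under this criterion (assuming per-extraction correctness judgments; names are illustrative). Since there is no gold set of all true extractions, yield stands in for recall:

    object Metrics {
      case class EvalResult(precision: Double, yieldCount: Int)

      // Precision = fraction of annotated extractions judged correct;
      // yield = absolute number of correct extractions.
      def evaluate(judgments: Seq[Boolean]): EvalResult = {
        val correct = judgments.count(identity)
        EvalResult(correct.toDouble / judgments.size, correct)
      }
    }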

10-17

Completed: Integrated sentence-level Open IE and Freebase Linker, test run OK.

Next Goals:

  • Integrate best-mention finding rules.
    • First: Drop in code "as-is"
    • After: Factor out NER tagging, coref components
  • Fix issues with tracking character offsets (see the sketch after this list)
    • Offsets are not properly computed for Open IE extractions
    • Find a good way for retrieving document metadata by character offset.
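One standard way to keep offsets consistent (a sketch, not the project's actual fix) is to record each token's character interval at tokenization time and derive extraction offsets from token spans:

    object Offsets {
      // A token with its character interval in the original document text.
      case class Token(text: String, start: Int) { def end: Int = start + text.length }

      // Naive whitespace tokenizer that records character offsets, so any
      // span of tokens can be mapped back to document characters.
      def tokenize(doc: String): Vector[Token] = {
        val tokens = Vector.newBuilder[Token]
        var i = 0
        while (i < doc.length) {
          if (doc(i).isWhitespace) i += 1
          else {
            val start = i
            while (i < doc.length && !doc(i).isWhitespace) i += 1
            tokens += Token(doc.substring(start, i), start)
          }
        }
        tokens.result()
      }

      // Character interval of a token span [from, until), e.g. an argument.
      def charInterval(tokens: Vector[Token], from: Int, until: Int): (Int, Int) =
        (tokens(from).start, tokens(until - 1).end)
    }

With offsets carried this way, retrieving document metadata by character offset reduces to an interval lookup over token (or sentence) intervals.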

10-9

Short-term goal: define necessary interfaces and data structures by 10-11.

  • Implemented interfaces for:
    • Document
    • Sentence
    • Extraction
    • Argument/Relation
    • Coreference Mention
    • Coreference Cluster
    • Entity Link
  • Discussed interfaces at length with John and Michael
    • Interfaces to be incorporated into generic NLP tool library (nlptools); see the sketch after this list:
      • Document
      • Sentence
      • CorefResolver
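A rough sketch of how those interfaces might fit together (hypothetical, simplified signatures; the actual nlptools traits may differ):

    // Simplified stand-ins for the interfaces listed above.
    trait Sentence { def text: String; def offset: Int }

    trait Document {
      def text: String
      def sentences: Seq[Sentence]
    }

    case class Mention(text: String, offset: Int)

    case class CorefCluster(mentions: Seq[Mention]) {
      // A representative "best" mention, e.g. the longest in the cluster.
      def bestMention: Mention = mentions.maxBy(_.text.length)
    }

    trait CorefResolver {
      def resolve(doc: Document): Seq[CorefCluster]
    }

    // An argument or relation phrase with its character interval.
    case class Part(text: String, start: Int, end: Int)

    // A sentence-level extraction and an entity link for one of its arguments.
    case class Extraction(arg1: Part, rel: Part, arg2: Part)
    case class EntityLink(mention: Mention, freebaseId: String, score: Double)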