Document-level Open IE

Goals

Extend sentence-based Open IE extractors to incorporate document-level reasoning, such as:
- Coreference
- Entity Linking
- NER
- Rules implemented for TAC 2013 Entity Linking
Define necessary data structures and interfaces by Oct-9
End-to-end system evaluation by Nov-11

Released system output for evaluation:
- "Rules" configuration, using rule-based best-mention disambiguation without coreference.
- "Coref" configuration, using coref-assisted rule-based best-mention disambiguation. Entity Linking context also extended via coreference.
- Entity Linking output, showing differences in entity links between each system configuration (and the baseline).
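
A minimal sketch of how the entity-link differences between two configurations could be computed. The function name, the argument keys, and the placeholder entity IDs are hypothetical and are not taken from the released output format.

```python
from typing import Dict, Optional, Tuple

# Hypothetical representation: argument key -> linked entity ID (None = no link).
Links = Dict[str, Optional[str]]


def diff_links(baseline: Links, system: Links) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return only the arguments whose entity link differs between two runs.

    Each value is a (baseline_link, system_link) pair. Arguments missing
    from one run are treated as unlinked (None) in that run.
    """
    keys = sorted(set(baseline) | set(system))
    return {
        key: (baseline.get(key), system.get(key))
        for key in keys
        if baseline.get(key) != system.get(key)
    }


# Placeholder keys and entity IDs, for illustration only.
baseline_links = {"doc1:arg1": "E1", "doc1:arg2": None}
coref_links = {"doc1:arg1": "E1", "doc1:arg2": "E2", "doc1:arg3": "E3"}
changed = diff_links(baseline_links, coref_links)
```

Running the same function for each configuration against the baseline gives one diff per configuration, which keeps unchanged links out of the annotators' way.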

Next: Stephen, John and I will annotate the output and analyze performance.

Met with Stephen, John, and Michael. Items:

Create a (very simple) webapp for the document extractor.
Clean up arguments before submitting them to the linker.
Replace best-mention substrings rather than substituting best mentions for the entire argument.
Reformat evaluation output to show only extractions that have been annotated with additional info (diff).
Evaluate difference in linker performance with/without document-level info.
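
The substring-replacement item above can be sketched as follows. This is an illustration only: the function name and the assumption that the mention span is given as character offsets relative to the argument text are hypothetical.

```python
def replace_best_mention(argument: str, start: int, end: int, best_mention: str) -> str:
    """Replace only the mention span [start, end) inside the argument text.

    Offsets are assumed to be relative to the argument string. The rest
    of the argument is preserved, unlike whole-argument substitution.
    """
    if not (0 <= start <= end <= len(argument)):
        raise ValueError("mention span lies outside the argument")
    return argument[:start] + best_mention + argument[end:]


argument = "the president of Ford"
mention = "Ford"
start = argument.find(mention)

# Substring replacement keeps the surrounding words of the argument.
resolved = replace_best_mention(argument, start, start + len(mention), "Ford Motor Company")

# Whole-argument substitution would instead discard "the president of".
whole_argument = "Ford Motor Company"
```

Here `resolved` is "the president of Ford Motor Company", whereas substituting the best mention for the entire argument changes what the argument refers to.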

Met with Stephen and John. Discussed:

Evaluation systems:
- Baseline sentence extractor with entity linker, no coreference
- Full system with best-mention finding rules
- Full system without coreference.
Evaluation data:
- Sample of 20-30 documents from TAC 2013.
- Moving away from QA/Query based approach, since the queries/questions will bias evaluation of the document extractor.
- Instead, we will evaluate all extractions (or a uniform sample of them).
Evaluation criteria:
- Extractions "correct" if their arguments are as unambiguous as possible given the document text.
- Measure precision and yield using this metric and compare systems.
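
A minimal sketch of the precision/yield computation over annotated extractions. It assumes that "yield" means the number of extractions judged correct and that each annotation is a single correct/incorrect judgment; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Judgment:
    extraction_id: str  # hypothetical identifier for an annotated extraction
    correct: bool       # True if arguments are as unambiguous as the document allows


def precision_and_yield(judgments: List[Judgment]) -> Tuple[float, int]:
    """Return (precision, yield) for one system's annotated extractions.

    Yield is taken to be the count of correct extractions (an assumption);
    precision is that count divided by the number of annotated extractions.
    """
    correct = sum(1 for j in judgments if j.correct)
    precision = correct / len(judgments) if judgments else 0.0
    return precision, correct


sample = [
    Judgment("e1", True),
    Judgment("e2", True),
    Judgment("e3", False),
    Judgment("e4", True),
]
precision, yield_count = precision_and_yield(sample)
```

Computing the pair for each system on the same sample of extractions makes the configurations directly comparable.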

Completed: integrated sentence-level Open IE with the Freebase linker; test run OK.

Next Goals:

Integrate best-mention finding rules.
- First: Drop in code "as-is"
- After: Factor out NER tagging, coref components
Fix issues with tracking character offsets
- Offsets are not properly computed for Open IE extractions
- Find a good way to retrieve document metadata by character offset.
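
One possible approach to offset-based lookup, sketched under the assumption that each sentence records its starting character offset in the document. The class and method names are hypothetical; a binary search over the sorted start offsets finds the sentence that contains a given document offset, and the same table converts sentence-local extraction offsets back to document offsets.

```python
import bisect
from typing import List


class OffsetIndex:
    """Maps between document-level and sentence-local character offsets."""

    def __init__(self, sentence_starts: List[int]):
        # Start offset of each sentence within the document, kept sorted.
        self.starts = sorted(sentence_starts)

    def sentence_at(self, char_offset: int) -> int:
        """Return the index of the sentence containing a document offset."""
        i = bisect.bisect_right(self.starts, char_offset) - 1
        if i < 0:
            raise ValueError("offset precedes the first sentence")
        return i

    def to_document_offset(self, sentence_index: int, local_offset: int) -> int:
        """Convert an offset within a sentence to a document-level offset."""
        return self.starts[sentence_index] + local_offset


# Three sentences starting at document offsets 0, 25 and 60.
index = OffsetIndex([0, 25, 60])
sentence_for_offset_30 = index.sentence_at(30)
document_offset = index.to_document_offset(1, 5)
```

Any per-sentence metadata (NER tags, coreference mentions, entity links) can then be stored against the sentence index and looked up from a document offset in logarithmic time.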

Short-term goal: define necessary interfaces and data structures by 10-11

Implemented interfaces for:
- Document
- Sentence
- Extraction
- Argument/Relation
- Coreference Mention
- Coreference Cluster
- Entity Link
Discussed interfaces at length with John and Michael
- Interfaces to be incorporated into generic NLP tool library (nlptools):
  - Document
  - Sentence
  - CorefResolver
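
For illustration, the interfaces listed above might look roughly like the following. This is a sketch in Python, not the project's actual definitions: every field name, method name, and type here is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol


@dataclass
class Sentence:
    text: str
    offset: int  # character offset of the sentence within the document


@dataclass
class Document:
    text: str
    sentences: List[Sentence]


@dataclass
class Mention:
    text: str
    offset: int  # document-level character offset of the mention


@dataclass
class CorefCluster:
    best_mention: Mention
    mentions: List[Mention]


@dataclass
class EntityLink:
    entity_id: str
    name: str
    score: float


@dataclass
class ExtractionPart:
    """An argument or relation phrase of an extraction."""
    text: str
    offset: int  # document-level character offset
    link: Optional[EntityLink] = None


@dataclass
class Extraction:
    arg1: ExtractionPart
    rel: ExtractionPart
    arg2: ExtractionPart
    sentence: Sentence


class CorefResolver(Protocol):
    """Anything that can produce coreference clusters for a document."""

    def resolve(self, document: Document) -> List[CorefCluster]:
        ...


text = "Ford was founded in 1903. It makes cars."
doc = Document(text, [Sentence(text[:25], 0), Sentence(text[26:], 26)])
cluster = CorefCluster(Mention("Ford", 0), [Mention("Ford", 0), Mention("It", 26)])
```

Storing document-level offsets on every mention and extraction part lets any component recover the underlying text with a slice of `Document.text`.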