100713Notes
Goals
1. Run and compare Mihai's reimplementation of Multir with the original Multir on the protobuf train and test inputs
2. Reimplement Distant Supervision component
- Rewrite distant supervision code in Java
- Have modules for semantic databases and training corpora
- Separate the process of training instance collection from feature generation
- Reimplement Multir input interface to deal with new training data format
Log
- October 8 2013
- Compared Mihai's reimplementation of Multir and the original Multir algorithm
Aggregate Extraction Precision/Recall Table at Highest Recall Level
Algorithm                   Precision   Recall
Mihai's Reimplementation    .328        .183
Original Multir             .372        .180
- This will serve as a benchmark as I try to refactor the Multir code into a more usable code base.
Sentential Extraction Precision/Recall Table for Original Multir Algorithm at Highest Recall Level
Precision   Recall
.843        .325
- The sentential extraction results for the original Multir algorithm differ from those in the paper: the highest recall level here is approximately 20 percentage points lower than the recall level reported in the paper.
- October 10 2013
I was able to load all of the input files specified below and store their contents in an Apache Derby DB accessed from Java code. The next task is to extract candidate entities from sentences in the corpus and annotate these sentences with relations from the semantic database by querying the Derby DB.
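The annotation step described above could be sketched roughly as follows. This is only an illustration under several assumptions: the class and method names are hypothetical, an in-memory map keyed by "Entity1\tEntity2" stands in for the Derby DB query, entity mentions are found by exact string matching on known entity names (a real system would more likely use a named-entity recognizer), and the output record is a simplified tab-separated line with offsets written as start:end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: names and record layout are illustrative, not the actual code.
final class SentenceAnnotator {

    // An entity name found in a sentence, with character offsets [start, end).
    static final class Mention {
        final String entity;
        final int start;
        final int end;

        Mention(String entity, int start, int end) {
            this.entity = entity;
            this.start = start;
            this.end = end;
        }
    }

    // Finds every exact occurrence of each known entity name in the sentence.
    static List<Mention> findMentions(String sentence, Map<String, String> entityByName) {
        List<Mention> mentions = new ArrayList<>();
        for (Map.Entry<String, String> entry : entityByName.entrySet()) {
            String name = entry.getKey();
            if (name.isEmpty()) {
                continue; // an empty name would match everywhere
            }
            int from = 0;
            int at;
            while ((at = sentence.indexOf(name, from)) >= 0) {
                mentions.add(new Mention(entry.getValue(), at, at + name.length()));
                from = at + name.length();
            }
        }
        return mentions;
    }

    // Pairs up mentions of different entities and emits one record per relation
    // that the (stand-in) semantic database holds for that ordered entity pair.
    static List<String> annotate(String sentence,
                                 Map<String, String> entityByName,
                                 Map<String, Set<String>> relationsByPair) {
        List<Mention> mentions = findMentions(sentence, entityByName);
        List<String> records = new ArrayList<>();
        for (Mention a : mentions) {
            for (Mention b : mentions) {
                if (a == b || a.entity.equals(b.entity)) {
                    continue;
                }
                Set<String> relations = relationsByPair.get(a.entity + "\t" + b.entity);
                if (relations == null) {
                    continue; // no known relation for this pair
                }
                for (String relation : relations) {
                    records.add(String.join("\t",
                            a.entity, a.start + ":" + a.end,
                            b.entity, b.start + ":" + b.end,
                            relation, sentence));
                }
            }
        }
        return records;
    }
}
```

Both directions of each pair are looked up because the semantic database rows are ordered (Entity1, Entity2); only the direction actually present in the database yields a record.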
Distant Supervision Input Specifications
1. Semantic Database - A Tab Separated File of the following format:
Entity1 \t Entity2 \t Relation
2. Semantic Database Entity Names - A Tab Separated File of the following format:
Entity \t EntityName
3. Target Relations File - A File with a newline separated list of relations:
Relation1 \n Relation2 \n ... \n RelationN
4. Corpus - A collection of raw text files
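As a rough illustration of the first three input formats, the following sketch parses them from lists of lines (which could come from, for example, Files.readAllLines). The class and method names are hypothetical, and the handling of malformed rows (silently skipping them) and the "Entity1\tEntity2" map key are assumptions made for the example, not part of the specification.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; names are illustrative, not from the actual code base.
final class DsInputParser {

    // Target relations file: one relation per line; blank lines are skipped.
    static Set<String> parseTargetRelations(List<String> lines) {
        Set<String> relations = new LinkedHashSet<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                relations.add(trimmed);
            }
        }
        return relations;
    }

    // Entity names file: Entity \t EntityName. One entity may have several names.
    static Map<String, List<String>> parseEntityNames(List<String> lines) {
        Map<String, List<String>> namesByEntity = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t", -1);
            if (fields.length != 2) {
                continue; // assumed policy: skip malformed rows
            }
            namesByEntity.computeIfAbsent(fields[0].trim(), k -> new ArrayList<>())
                         .add(fields[1].trim());
        }
        return namesByEntity;
    }

    // Semantic database: Entity1 \t Entity2 \t Relation.
    // Keeps only rows whose relation is a target relation.
    // The map key is "Entity1\tEntity2" (an assumption made for this sketch).
    static Map<String, Set<String>> parseTriples(List<String> lines, Set<String> targets) {
        Map<String, Set<String>> relationsByPair = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t", -1);
            if (fields.length != 3) {
                continue; // assumed policy: skip malformed rows
            }
            String relation = fields[2].trim();
            if (!targets.contains(relation)) {
                continue;
            }
            String key = fields[0].trim() + "\t" + fields[1].trim();
            relationsByPair.computeIfAbsent(key, k -> new LinkedHashSet<>()).add(relation);
        }
        return relationsByPair;
    }
}
```

Filtering by target relation at load time keeps the stored triples limited to the relations that will actually be used for annotation; whether the filtering happens here or later at query time is a design choice the specification leaves open.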
Distant Supervision Output Specifications
1. Human Readable Output:
Entity1 \t Entity1SentenceOffsets \t Entity2 \t Entity2SentenceOffsets \t Relation \t Sentence \t DocumentOffset \t DocumentId
2. Google Protocol Buffer Output for Input to Multir: