100713Notes


Goals

1. Run Mihai's reimplementation of Multir and the original Multir on the protobuf train and test inputs, and compare the results
2. Reimplement the distant supervision component

  • Rewrite distant supervision code in Java
  • Have modules for semantic databases and training corpora
  • Separate the process of training instance collection from feature generation
  • Reimplement Multir input interface to deal with new training data format


Log

October 8 2013
Compared Mihai's reimplementation of Multir and the original Multir algorithm
Aggregate Extraction Precision/Recall Table at Highest Recall Level

Algorithm                   Precision   Recall
Mihai's Reimplementation    .328        .183
Original Multir             .372        .180

This will serve as a benchmark as I try to refactor the Multir code into a more usable code base.


Sentential Extraction Precision/Recall Table for Original Multir Algorithm at Highest Recall Level

Precision   Recall
.843        .325


The sentential extraction results for the original Multir algorithm differ from the results in the paper: the highest recall level reached here is approximately 20 percentage points lower than the recall level reported in the paper.


October 10 2013

I was able to load all of the input files specified below and store their contents in an Apache Derby database using Java code. The next task is to extract candidate entities from sentences in the corpus and annotate these sentences with relations from the semantic database by querying the Derby database.
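
The annotation step described above could look roughly like the following sketch. It is only an illustration under several assumptions: plain in-memory maps stand in for the Derby tables, exact substring matching on entity names stands in for real entity extraction, and the class and method names (DistantSupervisionSketch, addEntityName, addTriple, annotate) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: in-memory maps replace the Derby DB queries.
class DistantSupervisionSketch {
    // entity id -> entity name (assumed to come from the entity names file)
    private final Map<String, String> entityNames = new HashMap<>();
    // "entity1Id\tentity2Id" -> relations holding between that ordered pair
    private final Map<String, List<String>> relationsByPair = new HashMap<>();

    void addEntityName(String entityId, String name) {
        entityNames.put(entityId, name);
    }

    void addTriple(String entity1, String entity2, String relation) {
        relationsByPair
            .computeIfAbsent(entity1 + "\t" + entity2, k -> new ArrayList<>())
            .add(relation);
    }

    // Returns one "entity1 \t entity2 \t relation" string for every known
    // relation whose two entity names both occur in the sentence.
    List<String> annotate(String sentence) {
        // Naive candidate extraction: an entity is "mentioned" if its name
        // appears verbatim in the sentence.
        List<String> mentioned = new ArrayList<>();
        for (Map.Entry<String, String> entry : entityNames.entrySet()) {
            if (sentence.contains(entry.getValue())) {
                mentioned.add(entry.getKey());
            }
        }
        List<String> annotations = new ArrayList<>();
        for (String e1 : mentioned) {
            for (String e2 : mentioned) {
                if (e1.equals(e2)) {
                    continue;
                }
                List<String> relations = relationsByPair.get(e1 + "\t" + e2);
                if (relations == null) {
                    continue;
                }
                for (String relation : relations) {
                    annotations.add(e1 + "\t" + e2 + "\t" + relation);
                }
            }
        }
        return annotations;
    }
}
```

In the real component the two lookups would presumably be queries against the Derby database, and the substring test would be replaced by a proper entity recognizer.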

Distant Supervision Input Specifications

1. Semantic Database - A Tab Separated File of the following format:

    Entity1 \t Entity2 \t Relation

2. Semantic Database Entity Names - A Tab Separated File of the following format:

    Entity \t EntityName

3. Target Relations File - A File with a newline separated list of relations:

    Relation1
    Relation2
    ...
    RelationN

4. Corpus - A collection of raw text files
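
As a rough illustration of how the first three inputs might be read, here is a minimal Java sketch. The class and method names (SemanticDbParser, Triple, parseTriple, parseEntityName, parseTargetRelations, filterByTargetRelations) are hypothetical, and the choice to keep only triples whose relation appears in the target relations file is an assumption rather than something stated in these notes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical parser for the tab-separated input files described above.
class SemanticDbParser {

    // One row of the semantic database file: Entity1 \t Entity2 \t Relation
    static final class Triple {
        final String entity1;
        final String entity2;
        final String relation;

        Triple(String entity1, String entity2, String relation) {
            this.entity1 = entity1;
            this.entity2 = entity2;
            this.relation = relation;
        }
    }

    // Parses one line of the semantic database file.
    static Triple parseTriple(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected 3 tab-separated fields: " + line);
        }
        return new Triple(fields[0].trim(), fields[1].trim(), fields[2].trim());
    }

    // Parses one line of the entity names file: Entity \t EntityName.
    // Returns a two-element array {entity, entityName}.
    static String[] parseEntityName(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length != 2) {
            throw new IllegalArgumentException("Expected 2 tab-separated fields: " + line);
        }
        return new String[] { fields[0].trim(), fields[1].trim() };
    }

    // Reads the newline-separated target relations, skipping blank lines.
    static Set<String> parseTargetRelations(List<String> lines) {
        Set<String> relations = new HashSet<>();
        for (String line : lines) {
            String relation = line.trim();
            if (!relation.isEmpty()) {
                relations.add(relation);
            }
        }
        return relations;
    }

    // Assumption: only triples whose relation is a target relation are kept.
    static List<Triple> filterByTargetRelations(List<String> dbLines, Set<String> targets) {
        List<Triple> kept = new ArrayList<>();
        for (String line : dbLines) {
            Triple triple = parseTriple(line);
            if (targets.contains(triple.relation)) {
                kept.add(triple);
            }
        }
        return kept;
    }
}
```

The lines passed to these methods could come from java.nio.file.Files.readAllLines on each input file.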


Distant Supervision Output Specifications

1. Human Readable Output:

  Entity1 \t Entity1SentenceOffsets \t Entity2 \t Entity2SentenceOffsets \t Relation \t Sentence \t DocumentOffset \t DocumentId
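
A line in the human readable format above could be produced with something like the following sketch. The names AnnotationFormatter, formatOffsets and formatLine are hypothetical, and the "start:end" encoding of the sentence offsets is an assumption, since the notes do not say how offsets are serialized.

```java
// Hypothetical formatter for one line of the human readable output.
class AnnotationFormatter {

    // Assumed offset encoding: "start:end" character offsets within the sentence.
    static String formatOffsets(int start, int end) {
        return start + ":" + end;
    }

    // Joins the eight fields with tabs, in the order given by the specification.
    static String formatLine(String entity1, int e1Start, int e1End,
                             String entity2, int e2Start, int e2End,
                             String relation, String sentence,
                             int documentOffset, String documentId) {
        // Tabs inside the sentence would break the column layout, so they
        // are replaced with spaces before writing.
        String safeSentence = sentence.replace('\t', ' ');
        return String.join("\t",
                entity1, formatOffsets(e1Start, e1End),
                entity2, formatOffsets(e2Start, e2End),
                relation, safeSentence,
                Integer.toString(documentOffset), documentId);
    }
}
```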

2. Google Protocol Buffer Output For Input to Multir: