100713Notes

Goals

1. Run and compare Mihai's reimplementation of Multir with Original Multir on protobuf train and test input
2. Reimplement Distant Supervision component

Rewrite distant supervision code in Java
Have modules for semantic databases and training corpora
Separate the process of training instance collection from feature generation
Reimplement Multir input interface to deal with new training data format

Log

October 8 2013

Compared Mihai's reimplementation of Multir and the original Multir algorithm

Aggregate Extraction Precision/Recall Table at Highest Recall Level
Algorithm	Precision	Recall
Mihai's Reimplementation	.328	.183
Original Multir	.372	.180

This will serve as a benchmark as I try to refactor the Multir code into a more usable code base.

Sentential Extraction Precision/Recall Table for Original Multir Algorithm at Highest Recall Level
Precision	Recall
.843	.325

The sentential extraction results for the original Multir algorithm differ from those in the paper: the highest recall level reached here is approximately 20 percentage points lower than the recall level reported in the paper.

October 10 2013

I was able to load all of the input files specified below and store their contents in an Apache Derby DB using Java code. The next task is to extract candidate entities from sentences in the corpus and annotate these sentences with relations from the semantic database by querying the Derby DB.
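
A rough sketch of the annotation step described above, to make the lookup concrete. This is not the actual implementation: an in-memory map stands in for the Derby DB queries, and the class name DistantSupervisionSketch, its methods, and the entity ids are all hypothetical. It assumes candidate entities in a sentence have already been resolved to semantic database ids.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: annotate a sentence's entity pairs with known relations. */
final class DistantSupervisionSketch {

    // Stand-in for the Derby DB: maps "entity1 \t entity2" to its relations.
    private final Map<String, List<String>> relationsByPair = new HashMap<>();

    private static String key(String entity1, String entity2) {
        return entity1 + "\t" + entity2;
    }

    /** Records one row of the semantic database. */
    void addFact(String entity1, String entity2, String relation) {
        relationsByPair
            .computeIfAbsent(key(entity1, entity2), k -> new ArrayList<>())
            .add(relation);
    }

    /**
     * Returns one "entity1 \t entity2 \t relation" string for every ordered
     * pair of distinct entities in the sentence that has a known relation.
     */
    List<String> annotate(List<String> sentenceEntities) {
        List<String> annotations = new ArrayList<>();
        for (String entity1 : sentenceEntities) {
            for (String entity2 : sentenceEntities) {
                if (entity1.equals(entity2)) {
                    continue; // skip pairing an entity with itself
                }
                List<String> relations = relationsByPair.getOrDefault(
                    key(entity1, entity2), Collections.<String>emptyList());
                for (String relation : relations) {
                    annotations.add(entity1 + "\t" + entity2 + "\t" + relation);
                }
            }
        }
        return annotations;
    }

    public static void main(String[] args) {
        DistantSupervisionSketch sketch = new DistantSupervisionSketch();
        sketch.addFact("E1", "E2", "R"); // placeholder ids and relation
        for (String line : sketch.annotate(Arrays.asList("E1", "E2", "E3"))) {
            System.out.println(line);
        }
    }
}
```

Pairs are treated as ordered because the semantic database rows distinguish Entity1 from Entity2. In the real pipeline, the map lookup would presumably be replaced by a query against the Derby DB.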

Distant Supervision Input Specifications

1. Semantic Database - A Tab Separated File of the following format:

    Entity1 \t Entity2 \t Relation

2. Semantic Database Entity Names - A Tab Separated File of the following format:

    Entity \t EntityName

3. Target Relations File - A File with a newline separated list of relations:

    Relation1
    Relation2
    ...
    RelationN

4. Corpus - A collection of raw text files
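
A minimal sketch of reading the semantic database file (input 1 above), assuming one triple per line and no header row. The class SemanticDbParser, its nested Triple type, and the sample values are hypothetical names for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical parser for lines of the form: Entity1 \t Entity2 \t Relation */
final class SemanticDbParser {

    /** One row of the semantic database file. */
    static final class Triple {
        final String entity1;
        final String entity2;
        final String relation;

        Triple(String entity1, String entity2, String relation) {
            this.entity1 = entity1;
            this.entity2 = entity2;
            this.relation = relation;
        }
    }

    /** Parses one line; rejects lines that do not have exactly three fields. */
    static Triple parseLine(String line) {
        // Limit -1 keeps trailing empty fields so malformed lines are detected.
        String[] fields = line.split("\t", -1);
        if (fields.length != 3) {
            throw new IllegalArgumentException(
                "Expected 3 tab-separated fields but found " + fields.length + ": " + line);
        }
        return new Triple(fields[0].trim(), fields[1].trim(), fields[2].trim());
    }

    /** Parses every non-blank line, preserving input order. */
    static List<Triple> parseLines(List<String> lines) {
        List<Triple> triples = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                triples.add(parseLine(line));
            }
        }
        return triples;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("E1\tE2\tR", "", "E3\tE4\tR");
        for (Triple t : parseLines(lines)) {
            System.out.println(t.entity1 + " -[" + t.relation + "]-> " + t.entity2);
        }
    }
}
```

The entity names file (input 2) could be handled the same way with a two-field check.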

Distant Supervision Output Specifications

1. Human Readable Output:

    Entity1 \t Entity1SentenceOffsets \t Entity2 \t Entity2SentenceOffsets \t Relation \t Sentence \t DocumentOffset \t DocumentId

2. Google Protocol Buffer Output For Input to Multir:
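
For the human-readable format (output 1 above), each record amounts to eight tab-joined fields. The sketch below shows one way to write such a line. The spec does not say how the sentence offsets are encoded, so the "start:end" form used here is an assumption, as are the class name HumanReadableOutput and the sample values.

```java
/** Hypothetical formatter for one line of the human-readable output. */
final class HumanReadableOutput {

    /**
     * Joins the eight output fields with tabs. Offsets are written as
     * "start:end", which is an assumed encoding, not one given in the spec.
     */
    static String formatLine(String entity1, int e1Start, int e1End,
                             String entity2, int e2Start, int e2End,
                             String relation, String sentence,
                             int documentOffset, String documentId) {
        // Replace tabs inside the sentence so the column count stays at eight.
        String cleanSentence = sentence.replace('\t', ' ');
        return String.join("\t",
            entity1, e1Start + ":" + e1End,
            entity2, e2Start + ":" + e2End,
            relation, cleanSentence,
            Integer.toString(documentOffset), documentId);
    }

    public static void main(String[] args) {
        // Placeholder values only.
        System.out.println(formatLine(
            "E1", 0, 5, "E2", 17, 23, "R",
            "A placeholder sentence.", 120, "doc1"));
    }
}
```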
