Difference between revisions of "Bobs Multir Updates"


Revision as of 22:18, 25 April 2014

April 25 2014 Update

Bob's Update

  • This Week
    1. Built new version of MultiR with two new classes (DBPopulatorSingleModel and DBPopulatorPartitionedModel), initially copies of ManualEvaluation and MultiModelManualEvaluation, as prototypes for new functionality to populate the Information Omnivore Database.
    2. The new classes have access to the same data as the ManualEvaluation and MultiModelManualEvaluation classes, but they will write two new files:
      1. The "ExtractorVoter" file is for populating most of the Info Omnivore DB tables with information on extractions, sentences, document-relative offsets, experiment names and dates, voter IDs, etc.
      2. The "Sentences" file will contain a list of all sentences (full original text strings) in the 300 documents in the Test subset of the full corpus, with associated DocIDs (of the document containing the sentence) and offsets (document-relative start and end for each sentence).
    3. Ran new output module on two cases: the baseline system, and the partitioned-model system with generalized features (to test build, etc). Most functionality was stubbed.
    4. Collected samples of Java program fragments scattered throughout MultiR that perform two basic functions: file I/O (creating Reader/Writer streams, etc) and storage of data needed for computing document-relative offsets (HashSet, HashMap, Hashtable, etc). I will use these as starting points for new code to perform these functions in the two new classes.
  • Next Week
    1. Print Sentences file. Target completion: Monday.
      1. We need to read Test Corpus sentences into hash tables (indexed by sentence ID, a unique integer), storing document name, sentence document-relative offset (computed from data in other files), and sentence text.
      2. These offsets must be computed from information currently held in some MultiR data structures and some WebWare6 files but not currently printed out by MultiR.
      3. These structures will be used to print the file "Sentences". I will initially print a partially-stubbed version, filling in actual information as the implementation of the offset-computation proceeds.
      4. This file will be used to populate the Sentence DB table. The file will contain all Test Corpus sentences, whether or not MultiR made extractions from them.
    2. Minor modification to Stephen Soderland's original design spec. Target completion: Monday or Tuesday.
      1. Original spec incorporated two output files, one for extractions and a second for votes on relations. New spec will define a single output file (called "ExtractorVoter") to combine these items. Using a single file will make database population easier, because the tables being populated from this input file need essentially all the data items contained within this file.
      2. The new spec (with a single "ExtractorVoter" file) is consistent with an alternate spec by Stephen Jonany for entry of Named-Entity-Link voted data into the Info Omnivore DB. His spec is almost identical to Stephen Soderland's modulo the merging of the two files.
      3. The new spec also needs the addition of a "Score" field (inadvertently omitted in earlier designs), which (in the MultiR version) will hold the sum of the feature weights for each extraction. The design is based on the current dump to Standard Output of this information by the ManualEvaluation module.
    3. Complete implementation of printing of ExtractorVoter file. Target completion: Tuesday.
      1. This file ("ExtractorVoter") needs all the data extracted (sentences, relations, etc) and computed (document-relative offsets) as described above.
    4. Run MultiR on two experiments. Target completion: Tuesday.
      1. Once implementation of DBPopulatorSingleModel and DBPopulatorPartitionedModel classes is complete and debugged, I will run the system on two experiments, "Baseline" (original features, non-partitioned model) and "GeneralizedPartitioned" (additional generalized features and partitioned triple model) to generate the actual data files to be used for the initial Info Omnivore population.
    5. Print two additional files. Target completion: Tuesday.
      1. Two small additional files are needed to populate the DB, one ("Workers") for the Worker table and one ("Relations") naming all relations extracted in a MultiR run for the Relation table.
      2. These (for our initial MultiR run) are small files (1 to 10 or 20 lines) and will be created by hand.
      3. The reason for using files rather than hand DB entry is to maximize uniformity of DB creation procedures across future usages.
    6. Build Sentences table in Info Omnivore database. Target completion: Wednesday.
      1. This step consists of programming the PostgreSQL database to read the MultiR "Sentences" output file to construct the Sentences table.
    7. Build the next three tables in Info Omnivore database. Target completion: Thursday or Friday.
      1. This step consists of programming the PostgreSQL database to read the MultiR "ExtractorVoter" output file to construct the following tables: RelationAnnotation, RelationExtractor, and VoteRelation.
    8. Build final two tables in Info Omnivore database. Target completion: Friday.
      1. This step involves programming the PostgreSQL database to populate the final two tables (Worker and Relation) either by hand (for the simple initial MultiR runs) or from the two hand-generated files, "Workers" and "Relations".
    9. Populate Info Omnivore DB from NEL files from Stephen Jonany. Target completion: Friday.
      1. If population of DB from MultiR files succeeds, the next activity will be to populate the DB from NEL files provided by Stephen Jonany.
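
The "Print Sentences file" step above could be sketched roughly as follows. This is a minimal illustration and not MultiR code: the class name SentencesFileSketch, the SentenceRecord fields, and the tab-separated column order (sentence ID, document name, start, end, text) are all assumptions, since the report does not specify the file layout.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; names and file layout are illustrative, not taken from MultiR. */
final class SentencesFileSketch {

    /** Data stored per sentence: document name, document-relative offsets, and text. */
    static final class SentenceRecord {
        final String docName;
        final int start; // document-relative start offset (inclusive)
        final int end;   // document-relative end offset (exclusive)
        final String text;

        SentenceRecord(String docName, int start, int end, String text) {
            this.docName = docName;
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    /** Formats one row as: sentenceId TAB docName TAB start TAB end TAB text (assumed layout). */
    static String formatRow(int sentenceId, SentenceRecord r) {
        // Tabs and newlines inside the sentence would break the row layout, so flatten them.
        String cleanText = r.text.replace('\t', ' ').replace('\n', ' ');
        return sentenceId + "\t" + r.docName + "\t" + r.start + "\t" + r.end + "\t" + cleanText;
    }

    /** Writes every sentence in the table, one row per line, in ascending sentence-ID order. */
    static void writeSentences(Map<Integer, SentenceRecord> sentences, Writer out) {
        PrintWriter pw = new PrintWriter(out);
        // Copy into a TreeMap so the output order does not depend on hash ordering.
        for (Map.Entry<Integer, SentenceRecord> e : new TreeMap<>(sentences).entrySet()) {
            pw.print(formatRow(e.getKey(), e.getValue()));
            pw.print('\n');
        }
        pw.flush();
    }

    public static void main(String[] args) {
        // Hash table indexed by sentence ID (a unique integer), as described in the plan.
        Map<Integer, SentenceRecord> sentences = new HashMap<>();
        sentences.put(2, new SentenceRecord("doc-A", 15, 24, "Bob left."));
        sentences.put(1, new SentenceRecord("doc-A", 0, 14, "Alice met Bob."));

        StringWriter buffer = new StringWriter();
        writeSentences(sentences, buffer);
        System.out.print(buffer);
    }
}
```

Writing every sentence in the table, whether or not an extraction refers to it, matches the stated requirement that the file contain all Test Corpus sentences. A stubbed version could fill the offset fields with placeholder values until the offset computation is in place.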

April 20 2014 Update

Bob's Update

  • This Week
    1. Did preliminary planning, code exploration, etc, to begin implementation of MultiR output module for populating Information Omnivore Database.
    2. Preliminary exploratory implementation included doing a build of a trial MultiR system with minimal modifications (to see if build process is working properly).
    3. After fixing a minor bug (missing import), the trial build worked. Reran the baseline versus generalized-nonpartitioned versus partitioned experiment to test functionality; it worked OK. Ran only on the Test corpus subset, to save time.
  • Next Week
    1. Will build new version of MultiR with two new classes, initially copies of ManualEvaluation and MultiModelManualEvaluation, as prototypes for new functionality needed for writing output files to populate Info Omnivore Database.
    2. The new classes will perform essentially the same function (and have access to the same data) as the ManualEvaluation and MultiModelManualEvaluation classes but will write three new files (rather than dumping to Standard Output):
      1. Extractions file (data to populate most Info Omnivore DB tables), containing information on extractions, sentences, and document-relative offsets (these offsets must be computed from information currently held in some MultiR data structures and some WebWare6 files but not currently printed out by MultiR).
      2. VoteRelation file containing information on extractions with associated experiment names and dates, voter IDs, etc.
      3. Sentences file containing a list of all sentences (full original text strings) contained within the 300 documents in the Test subset of the full corpus, with associated DocIDs (of the doc containing the sentence) and offsets (document-relative start and end for each sentence).
    3. Will run this new output module on two cases: the baseline system, and the partitioned-model system with generalized features. The associated 6 output files will serve as inputs for Omnivore DB population.
    4. Next step (hopefully started this week) will be actual population of Omnivore DB from the six just-described files.
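
The report does not say how the document-relative offsets will be derived from the MultiR data structures and WebWare6 files. As one possible approach only, the sketch below assumes that the full document text and its sentences (in document order) are both available, and locates each sentence by searching forward from the end of the previous one. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of one way to compute document-relative sentence offsets. */
final class OffsetSketch {

    /**
     * Returns one {start, end} pair per sentence, relative to the start of the document.
     * The end offset is exclusive. Sentences must be given in document order.
     */
    static List<int[]> documentRelativeOffsets(String docText, List<String> sentencesInOrder) {
        List<int[]> offsets = new ArrayList<>();
        int cursor = 0;
        for (String sentence : sentencesInOrder) {
            // Search from the cursor so a repeated sentence maps to its later occurrence.
            int start = docText.indexOf(sentence, cursor);
            if (start < 0) {
                throw new IllegalArgumentException(
                        "Sentence not found at or after offset " + cursor + ": " + sentence);
            }
            int end = start + sentence.length();
            offsets.add(new int[] {start, end});
            cursor = end;
        }
        return offsets;
    }

    public static void main(String[] args) {
        String doc = "Bob left. Alice stayed. Bob left.";
        List<String> sentences = Arrays.asList("Bob left.", "Alice stayed.", "Bob left.");
        for (int[] span : documentRelativeOffsets(doc, sentences)) {
            System.out.println(span[0] + "\t" + span[1] + "\t" + doc.substring(span[0], span[1]));
        }
    }
}
```

This only works if the stored sentence strings match the document text character for character. If tokenization or whitespace normalization has altered the sentences, the offsets would have to come from token positions instead.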

April 13 2014 Update

Bob's Update

  • This Week
    1. Studied MultiR, Git, etc.
    2. Ran complete Baseline(Unpartitioned-Model) versus Baseline-Features(Partitioned-Model) versus Generalized-Features with-and-without model partitioning. All look reasonable except baseline-unpartitioned, which may have been compromised by a bad distant-supervision output file. John replaced the file (earlier runs with a different file and later runs with the fixed file looked OK, so I may have used a temporarily bad file). The other three look OK: in general, generalizing the features seems to help a little but partitioning the model helps a lot. Results are in three subdirectories of "/projects/WebWare6/Multir/Evaluations":
      1. Baseline_v_Generalized_and_Partitioned_v_Non/
      2. GeneralizedFeatures_NonPartitioned/
      3. GeneralizedFeatures_Partitioned/
    3. Prepared for Database design for Information Omnivore, including a meeting with Stephen and Lydia Chilton on Friday.
  • Next Week
    1. Main activity will focus on implementation of the Information Omnivore Database.
    2. May rerun some MultiR experiment if needed to resolve ambiguities mentioned above.

April 4 2014 Update

Bob's Update

  • This Week
    1. Continued familiarization with MultiR codebase (and Git, and Eclipse, etc).
    2. Ran part of 2-by-2 experiment: the Baseline+Generalized Features, No-Partitioning case.
  • Next Week
    1. Will finish (if needed) and write up results of 4-way (2-by-2, Generalized-added/Baseline-only versus Partitioned/No-Partitioned) experiment.
    2. Highest priority: Starting discussions with users and preliminary design of Information Omnivore Database to hold results/data/etc for all experiments in the NLP/Crowdsourcing groups. This will extend some preliminary design work by Stephen and will start with consultations with potential users (probably Lydia Chilton first). Will lead to implementation over next month or so.
    3. Lower priority (i.e., back-burner): Design of a tracing mechanism for the MultiR project which (hopefully) will elucidate the role of features in the training process, so we can track the cause of apparent overgeneralizations. This hopefully will lead to a general tool that can be used to track internal behavior of MultiR and other learning systems.

March 21 2014 Update

  • Bob's update
    1. See document [1]