Bob's MultiR Updates

May 9 2014 Update

Bob's Update

  • This Week
    1. Completed programs to generate Sentences file for both BaselineModel and Partitioned(Generalized-features)Model.
      1. Ran program on both Baseline and Partitioned model cases - produced identical output files (as expected).
      2. Placed output files in "/projects/WebWare6/InfoOmnivore/MultiR/TestSentences".
    2. Nearly completed the program to print the ExtractionsVotes file; one offset calculation remains (will probably finish today or tomorrow).
    3. Am hand-generating additional files ("Workers" and "Relations"), to complement Sentences and ExtractionsVotes files for DB population.
    4. Minor modification to spec for ExtractionsVotes file: sentenceIndex (integer) deleted and sentenceText (string) added.
    5. Will place all new code into GitHub upon completion and testing.
  • Next Week
    1. Monday
      1. Complete ExtractionsVotes file printer, finish two additional files, and update GitHub (if not yet completed and correct by previous Friday).
      2. Build 1->3 mapping program: reads ExtractionsVotes file and writes three output files, one to populate each of the InfoOmnivore DB tables defined for this file (independent Java program in new package).
      3. Create InfoOmnivore tables (i.e., begin PostgreSQL programming to create database).
    2. Tuesday or Wednesday
      1. Populate DB tables from existing files (Sentences, ExtractionsVotes -> three table-based subsets, Workers, Relations).
        1. Must be careful about order of table population: If a foreign key exists, must populate the "target" table first (the one pointed to by the foreign key, the "pointee") before the "source" table (the one containing the foreign key, the "pointer").
    3. Rest of week
      1. Will repeat the same process with the NEL files provided by Stephen Jonany to create a NEL database, just as with the files I created for MultiR DB population.
        1. Basically this involves iterating over the same sequence of steps as with MultiR (i.e., programming a function to map the NEL main file(s) to the subset files needed for population of individual tables, etc).
        2. This mapper function will handle missing votes by simply not generating a line for them in the file used to populate the Votes table. Will test this by explicitly changing some votes to 0.5 in the MultiR version.
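
The population-order rule noted above (fill the table a foreign key points to before the table that holds the key) amounts to a topological sort of the tables. Below is a minimal Java sketch of that idea, not the project's actual code. The class name PopulationOrder is hypothetical, and the foreign-key relationships passed in are an assumption of the caller, since the Info Omnivore schema is not spelled out here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: orders tables so each foreign-key target is populated first. */
final class PopulationOrder {

    /**
     * @param references maps each table to the tables its foreign keys point at
     *                   (an empty list for a table with no foreign keys)
     * @return table names in a safe population order (Kahn's topological sort)
     */
    static List<String> order(Map<String, List<String>> references) {
        // pending: number of referenced ("pointee") tables not yet populated.
        Map<String, Integer> pending = new LinkedHashMap<>();
        // dependents: for each pointee, the "pointer" tables waiting on it.
        Map<String, List<String>> dependents = new LinkedHashMap<>();
        for (String table : references.keySet()) {
            pending.put(table, 0);
            dependents.put(table, new ArrayList<>());
        }
        for (Map.Entry<String, List<String>> e : references.entrySet()) {
            for (String target : e.getValue()) {
                if (!pending.containsKey(target)) {
                    throw new IllegalArgumentException("Unknown table: " + target);
                }
                pending.merge(e.getKey(), 1, Integer::sum);
                dependents.get(target).add(e.getKey());
            }
        }

        // Tables with no outstanding foreign keys can be populated immediately.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }

        List<String> result = new ArrayList<>();
        while (!ready.isEmpty()) {
            String table = ready.remove();
            result.add(table);
            // Populating this table may unblock the tables that point at it.
            for (String dependent : dependents.get(table)) {
                if (pending.merge(dependent, -1, Integer::sum) == 0) {
                    ready.add(dependent);
                }
            }
        }
        if (result.size() != references.size()) {
            throw new IllegalStateException("Foreign keys form a cycle");
        }
        return result;
    }
}
```

For illustration only: if VoteRelation referenced Worker and RelationAnnotation, and RelationAnnotation referenced Sentence and Relation (assumed relationships), the returned order would place Sentence and Relation before RelationAnnotation, and both Worker and RelationAnnotation before VoteRelation.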


May 2 2014 Update

Bob's Update

  • This Week
    1. Completed implementation of generation of Sentences file (program integrated into DBPopulatorSingleModel and same code will be copied into DBPopulatorPartitionedModel).
      1. This program computes document-relative sentence and relation offsets, as needed according to spec described below.
    2. Ran above system on Baseline/SingleModel case, generating Sentences file (seems correct; placed in /projects/WebWare6/InfoOmnivore/InputFiles/MultiR_Runs/BaselineSentenceFile).
    3. Added two files to /projects/WebWare6/InfoOmnivore/DesignDocs : MultiR_to_Information_Omnivore_DB_Populator.docx and *.pdf (spec for MultiR output files for DB input processing).
  • Next Week
    1. Add code to DBPopulatorPartitionedModel class to generate Sentences file (completed) and ExtractionsVotes file (in progress at this point).
    2. Complete program for generation of ExtractionsVotes file (as per spec file added above).
    3. Run MultiR DB-output-file generation for Sentences file on GeneralizedFeatures/PartitionedModel (should be identical to one already generated for Baseline model).
    4. Complete run of MultiR DB-output-file generation on GeneralizedFeatures/PartitionedModel experiment (generating both Sentences and ExtractionsVotes files).
    5. Print two small additional files (by hand), one ("Workers") for the Worker table and one ("Relations") naming all relations extracted in a MultiR run for the Relation table.
    6. Build Sentences table in Info Omnivore database (using the Sentences files).
    7. Build the next three tables in Info Omnivore database (using the ExtractionsVotes files).
    8. Build final two tables in Info Omnivore database ("Workers" and "Relations") by hand.
    9. If population of DB from MultiR files succeeds, the next activity will be to populate the DB from NEL files provided by Stephen Jonany.
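
The document-relative sentence offsets mentioned above can be illustrated with a small Java sketch. It is not the code in DBPopulatorSingleModel. It assumes that each sentence appears verbatim, and in order, in the document text, which may not match how MultiR actually stores its data. The names SentenceOffsets, SentenceRecord, index, and toDocumentOffset are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: computes document-relative offsets for sentences. */
final class SentenceOffsets {

    /** One row of a Sentences-style file: document, offsets, and full text. */
    static final class SentenceRecord {
        final String docId;
        final int start; // document-relative, inclusive
        final int end;   // document-relative, exclusive
        final String text;

        SentenceRecord(String docId, int start, int end, String text) {
            this.docId = docId;
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    /**
     * Indexes the sentences of one document by sentence ID.
     * Assumption: sentences occur verbatim and in order inside docText.
     */
    static Map<Integer, SentenceRecord> index(
            String docId, String docText, List<String> sentences, int firstSentenceId) {
        Map<Integer, SentenceRecord> byId = new LinkedHashMap<>();
        int cursor = 0; // search resumes after the previous sentence
        int id = firstSentenceId;
        for (String sentence : sentences) {
            int start = docText.indexOf(sentence, cursor);
            if (start < 0) {
                throw new IllegalArgumentException(
                        "Sentence not found in " + docId + ": " + sentence);
            }
            int end = start + sentence.length();
            byId.put(id++, new SentenceRecord(docId, start, end, sentence));
            cursor = end;
        }
        return byId;
    }

    /** Converts a sentence-relative offset (e.g. of a relation argument) to a document-relative one. */
    static int toDocumentOffset(SentenceRecord sentence, int sentenceRelativeOffset) {
        if (sentenceRelativeOffset < 0 || sentenceRelativeOffset > sentence.text.length()) {
            throw new IllegalArgumentException("Offset outside sentence: " + sentenceRelativeOffset);
        }
        return sentence.start + sentenceRelativeOffset;
    }
}
```

Searching forward from the end of the previous sentence keeps repeated sentences within a document from all resolving to the first occurrence.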


April 25 2014 Update

Bob's Update

  • This Week
    1. Built new version of MultiR with two new classes (DBPopulatorSingleModel and DBPopulatorPartitionedModel), initially copies of ManualEvaluation and MultiModelManualEvaluation, as prototypes for new functionality to populate the Information Omnivore Database.
    2. The new classes have access to the same data as the ManualEvaluation and MultiModelManualEvaluation classes, but they will write two new files:
      1. The "ExtractorVoter" file is for populating most of the Info Omnivore DB tables with information on extractions, sentences, document-relative offsets, experiment names and dates, voter IDs, etc.
      2. The "Sentences" file will contain a list of all sentences (full original text strings) in the 300 documents in the Test subset of the full corpus, with associated DocIDs (of the document containing the sentence) and offsets (document-relative start and end for each sentence).
    3. Ran new output module on two cases: the baseline system, and the partitioned-model system with generalized features (to test build, etc). Most functionality was stubbed.
    4. Accumulated samples of Java program fragments scattered throughout MultiR to perform two basic functions: file I/O (creating Reader/Writer streams, etc) and storage of data needed for computing document-relative offsets (HashSet, HashMap, Hashtable, etc). Will use these as starting points for new code to perform these functions in the two new classes.
  • Next Week
    1. Print Sentences file. Target completion: Monday.
      1. We need to read Test Corpus sentences into hash tables (indexed by sentence ID, a unique integer), storing document name, sentence document-relative offset (computed from data in other files), and sentence text.
      2. These offsets must be computed from information currently held in some MultiR data structures and some WebWare6 files but not currently printed out by MultiR.
      3. These structures will be used to print the file "Sentences". I will initially print a partially-stubbed version, filling in actual information as the implementation of the offset-computation proceeds.
      4. This file will be used to populate the Sentence DB table. The file will contain all Test Corpus sentences, whether or not MultiR made extractions from them.
    2. Minor modification to Stephen Soderland's original design spec. Target completion: Monday or Tuesday.
      1. Original spec incorporated two output files, one for extractions and a second for votes on relations. New spec will define a single output file (called "ExtractorVoter") to combine these items. Using a single file will make database population easier, because the tables being populated from this input file need essentially all the data items contained within this file.
      2. The new spec (with a single "ExtractorVoter" file) is consistent with an alternate spec by Stephen Jonany for entry of Named-Entity-Link voted data into the Info Omnivore DB. His spec is almost identical to Stephen Soderland's modulo the merging of the two files.
      3. The new spec also needs the addition of a "Score" field (inadvertently omitted in earlier designs), which (in the MultiR version) will hold the sum of the feature weights for each extraction. The design is based on the current dump to Standard Output of this information by the ManualEvaluation module.
    3. Complete implementation of printing of ExtractorVoter file. Target completion: Tuesday.
      1. This file ("ExtractorVoter") needs all the data extracted (sentences, relations, etc) and computed (document-relative offsets) as described above.
    4. Run MultiR on two experiments. Target completion: Tuesday.
      1. Once implementation of DBPopulatorSingleModel and DBPopulatorPartitionedModel classes is complete and debugged, I will run the system on two experiments, "Baseline" (original features, non-partitioned model) and "GeneralizedPartitioned" (additional generalized features and partitioned triple model) to generate the actual data files to be used for the initial Info Omnivore population.
    5. Print two additional files. Target completion: Tuesday.
      1. Two small additional files are needed to populate the DB, one ("Workers") for the Worker table and one ("Relations") naming all relations extracted in a MultiR run for the Relation table.
      2. These (for our initial MultiR run) are small files (1 to 10 or 20 lines) and will be created by hand.
      3. The reason for using files rather than hand DB entry is to maximize uniformity of DB creation procedures across future usages.
    6. Build Sentences table in Info Omnivore database. Target completion: Wednesday.
      1. This step consists of programming the PostgreSQL database to read the MultiR "Sentences" output file to construct the Sentences table.
    7. Build the next three tables in Info Omnivore database. Target completion: Thursday or Friday.
      1. This step consists of programming the PostgreSQL database to read the MultiR "ExtractorVoter" output file to construct the following tables: RelationAnnotation, RelationExtractor, and VoteRelation.
    8. Build final two tables in Info Omnivore database. Target completion: Friday.
      1. This step involves programming the PostgreSQL database to populate the final two tables (Worker and Relations) either by hand (for the simple initial MultiR runs) or from the two hand-generated files, "Workers" and "Relations".
    9. Populate Info Omnivore DB from NEL files from Stephen Jonany. Target completion: Friday.
      1. If population of DB from MultiR files succeeds, the next activity will be to populate the DB from NEL files provided by Stephen Jonany.
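
The plan above to feed three tables (RelationAnnotation, RelationExtractor, and VoteRelation) from the single "ExtractorVoter" file can be sketched as a small splitter. This is an illustration only. The seven-column tab-separated layout and the assignment of columns to tables are invented for the example and are not taken from the spec, and the class name ExtractorVoterSplitter is hypothetical. The sketch also skips the vote row when the vote field is empty, following the missing-vote rule described in the May 9 update. It works on lists of lines rather than files to stay self-contained.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: splits combined extraction/vote lines into per-table rows. */
final class ExtractorVoterSplitter {

    /** Tab-separated rows destined for each of the three tables. */
    static final class Split {
        final List<String> relationAnnotation = new ArrayList<>();
        final List<String> relationExtractor = new ArrayList<>();
        final List<String> voteRelation = new ArrayList<>();
    }

    // ASSUMED column layout of one input line (not from the real spec):
    // 0 extractionId, 1 sentenceText, 2 relation,
    // 3 experiment,   4 score,        5 workerId, 6 vote (may be empty)
    private static final int FIELD_COUNT = 7;

    static Split split(List<String> lines) {
        Split out = new Split();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // tolerate blank lines
            }
            // A limit of -1 keeps a trailing empty field (a missing vote).
            String[] f = line.split("\t", -1);
            if (f.length != FIELD_COUNT) {
                throw new IllegalArgumentException(
                        "Expected " + FIELD_COUNT + " fields: " + line);
            }
            out.relationAnnotation.add(String.join("\t", f[0], f[1], f[2]));
            out.relationExtractor.add(String.join("\t", f[0], f[3], f[4]));
            // Missing vote: generate no line for the vote table.
            if (!f[6].isEmpty()) {
                out.voteRelation.add(String.join("\t", f[0], f[5], f[6]));
            }
        }
        return out;
    }
}
```

Each of the three lists could then be written to its own file and loaded into the matching table.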


April 20 2014 Update

Bob's Update

  • This Week
    1. Did preliminary planning, code exploration, etc, to begin implementation of MultiR output module for populating Information Omnivore Database.
    2. Preliminary exploratory implementation included doing a build of a trial MultiR system with minimal modifications (to see if build process is working properly).
    3. After fixing a minor bug (missing import), the trial build worked. Ran the system on a repeat of the baseline versus generalized-nonpartitioned versus partitioned experiment to test functionality; it worked OK. Ran only on the Test corpus subset, to save time.
  • Next Week
    1. Will build new version of MultiR with two new classes, initially copies of ManualEvaluation and MultiModelManualEvaluation, as prototypes for new functionality needed for writing output files to populate Info Omnivore Database.
    2. The new classes will perform essentially same function (and have access to same data) as the ManualEvaluation and MultiModelManualEvaluation classes but will write three new files (rather than dumping to Standard Output):
      1. Extractions file: data to populate most Info Omnivore DB tables, containing information on extractions, sentences, and document-relative offsets (these offsets must be computed from information currently held in some MultiR data structures and some WebWare6 files but not currently printed out by MultiR).
      2. VoteRelation file containing information on extractions with associated experiment names and dates, voter IDs, etc.
      3. Sentences file containing a list of all sentences (full original text strings) contained within the 300 documents in the Test subset of the full corpus, with associated DocIDs (of the doc containing the sentence) and offsets (document-relative start and end for each sentence).
    3. Will run this new output module on two cases: the baseline system, and the partitioned-model system with generalized features. The associated 6 output files will serve as inputs for Omnivore DB population.
    4. Next step (hopefully started this week) will be actual population of Omnivore DB from the six just-described files.


April 13 2014 Update

Bob's Update

  • This Week
    1. Studied MultiR, Git, etc.
    2. Ran complete Baseline(Unpartitioned-Model) versus Baseline-Features(Partitioned-Model) versus Generalized-Features with-and-without model partitioning. All look reasonable except baseline-unpartitioned, which may have been compromised by a bad distant-supervision output file. John replaced the file; earlier runs with a different file and later runs with the fixed file looked OK, so I may have used a temporarily bad file. The other three look OK: in general, generalizing the features seems to help a little, but partitioning the model helps a lot. Results are in three subdirectories of "/projects/WebWare6/Multir/Evaluations":
      1. Baseline_v_Generalized_and_Partitioned_v_Non/
      2. GeneralizedFeatures_NonPartitioned/
      3. GeneralizedFeatures_Partitioned/
    3. Prepared for Database design for Information Omnivore, including a meeting with Stephen and Lydia Chilton on Friday.
  • Next Week
    1. Main activity will focus on implementation of the Information Omnivore Database.
    2. May rerun some MultiR experiment if needed to resolve ambiguities mentioned above.


April 4 2014 Update

Bob's Update

  • This Week
    1. Continued familiarization with MultiR codebase (and Git, and Eclipse, etc).
    2. Ran part of 2-by-2 experiment: the Baseline+Generalized Features, No-Partitioning case.
  • Next Week
    1. Will finish (if needed) and write up results of 4-way (2-by-2, Generalized-added/Baseline-only versus Partitioned/No-Partitioned) experiment.
    2. Highest priority: Starting discussions with users and preliminary design of Information Omnivore Database to hold results/data/etc for all experiments in the NLP/Crowdsourcing groups. This will extend some preliminary design work by Stephen and will start with consultations with potential users (probably Lydia Chilton first). Will lead to implementation over next month or so.
    3. Lower priority (i.e., back-burner): Design of a tracing mechanism for the MultiR project which (hopefully) will elucidate the role of features in the training process, so we can track the cause of apparent overgeneralizations. This hopefully will lead to a general tool that can be used to track the internal behavior of MultiR and other learning systems.


March 21 2014 Update

  • Bob's update
    1. See document [1]