Difference between revisions of "Pattern Learning"

From Knowitall

Revision as of 22:48, 30 November 2011

Building the bootstrapping data

Determining target relations

  1. Restrict the high-quality set of ClueWeb extractions to those with proper-noun arguments
  2. Choose the most frequent relations from this set
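
The two steps above can be sketched as follows. This is only an illustration: the tuple layout (arg1 POS tags, relation, arg2 POS tags), the function names, and the cutoff `k` are assumptions, not part of the original pipeline.

```python
from collections import Counter

PROPER_NOUN_TAGS = {"NNP", "NNPS"}

def is_proper_noun_arg(pos_tags):
    """True when the argument is non-empty and every token is NNP or NNPS."""
    return bool(pos_tags) and all(tag in PROPER_NOUN_TAGS for tag in pos_tags)

def top_relations(extractions, k):
    """Return the k most frequent relations among proper-noun extractions.

    extractions: iterable of (arg1_tags, relation, arg2_tags) tuples
    (hypothetical layout chosen for this sketch).
    """
    counts = Counter(
        rel for a1, rel, a2 in extractions
        if is_proper_noun_arg(a1) and is_proper_noun_arg(a2)
    )
    return [rel for rel, _ in counts.most_common(k)]
```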

Determining target extractions (seeds)

  1. Start with the clean, chunked dataset of ReVerb extractions from ClueWeb.
  2. Apply Jonathan Berant's relation string normalization.
  3. Filter extractions so that each one's normalized relation string matches a target relation and its arguments contain only DT, IN, NNP, and NNPS tokens.
  4. Filter extractions
    1. Remove extraction strings that occur fewer than three times.
    2. Remove extractions with single- or double-letter arguments, optionally ending with a period.
  5. Filter arguments
    1. Remove inc, ltd, vehicle, turn, page, site
    2. Remove arguments that are 2 or fewer characters
  6. Measure the occurrence of the arguments.
  7. Keep extractions from the target relations whose arguments occur commonly (at least 20 times).
  8. Remove target relations which have fewer than 15 seeds
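
A minimal sketch of steps 4 through 8, assuming extractions are plain (arg1, relation, arg2) string tuples with one tuple per occurrence. How argument occurrences are measured is not specified here, so the sketch counts them over the surviving extractions, and it treats the thresholds as "at least"; both are assumptions, as are all function and parameter names.

```python
import re
from collections import Counter

STOP_ARGS = {"inc", "ltd", "vehicle", "turn", "page", "site"}
# One or two letters, optionally followed by a period.
SHORT_ARG = re.compile(r"^[A-Za-z]{1,2}\.?$")

def bad_arg(arg):
    """True for blacklisted, very short, or initial-like arguments."""
    return (arg.lower() in STOP_ARGS
            or len(arg) <= 2
            or SHORT_ARG.match(arg) is not None)

def select_seeds(extractions, min_extr=3, min_arg=20, min_seeds=15):
    """Filter raw extraction occurrences down to seed extractions."""
    extr_counts = Counter(extractions)
    # Steps 4 and 5: frequency and argument filters.
    kept = [e for e, n in extr_counts.items()
            if n >= min_extr and not bad_arg(e[0]) and not bad_arg(e[2])]
    # Step 6: measure argument occurrences (assumed: over kept extractions).
    arg_counts = Counter()
    for e in kept:
        arg_counts[e[0]] += extr_counts[e]
        arg_counts[e[2]] += extr_counts[e]
    # Step 7: keep extractions whose arguments are both common.
    seeds = [e for e in kept
             if arg_counts[e[0]] >= min_arg and arg_counts[e[2]] >= min_arg]
    # Step 8: drop relations with too few seeds.
    per_rel = Counter(rel for _, rel, _ in seeds)
    return [e for e in seeds if per_rel[e[1]] >= min_seeds]
```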

Lemma grep

  1. Search corpus for all sentences that contain the lemmas in a target extraction.
  2. Remove duplicate sentences (sentence*extraction pairs must be unique).
  3. For each sentence*extraction pair, search for a pattern that connects the lemmas.
    1. The pattern must start with arg1.
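
Steps 1 and 2 can be sketched as below, assuming each sentence is already a list of lemmas and each extraction is a tuple of lemma strings (parts may contain several space-separated lemmas). The pattern search over the dependency graph in step 3 is not shown. `lemma_grep` as a function name is hypothetical.

```python
def lemma_grep(sentences, extractions):
    """Return unique (sentence, extraction) pairs where the sentence
    contains every lemma of the extraction.

    sentences: list of lemma lists; extractions: list of string tuples.
    """
    seen = set()
    matches = []
    for sent in sentences:
        sent_lemmas = set(sent)
        for extr in extractions:
            needed = {lemma for part in extr for lemma in part.split()}
            key = (tuple(sent), extr)
            # Keep each sentence*extraction pair only once.
            if needed <= sent_lemmas and key not in seen:
                seen.add(key)
                matches.append(key)
    return matches
```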

Reducing the patterned results

  1. Don't allow patterns that contain punct edges or edges with non-word ([^\w]) characters
  2. Remove patterns that occur fewer than 10 times.
  3. Remove extractions that have an (extraction, pattern) pair that occurs anomalously frequently.
    1. There was a single one: (hotel reservation, be make, online) occurred 32k times; the next most frequent occurred 8k times.
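
A rough sketch of these filters, assuming each row is an (extraction, pattern) pair and a pattern is represented simply as a tuple of dependency edge labels. No threshold for "anomalously frequently" is given here; the `ratio` test against the runner-up is a hypothetical stand-in for what may well have been manual inspection.

```python
import re
from collections import Counter

NON_WORD = re.compile(r"[^\w]")

def reduce_patterns(rows, min_count=10):
    """Drop rows whose pattern has punct or non-word edge labels,
    then drop rows whose pattern is rarer than min_count."""
    rows = [(e, p) for e, p in rows
            if not any(lbl == "punct" or NON_WORD.search(lbl) for lbl in p)]
    counts = Counter(p for _, p in rows)
    return [(e, p) for e, p in rows if counts[p] >= min_count]

def top_pair_if_anomalous(rows, ratio=3.0):
    """Return the most frequent (extraction, pattern) pair if it occurs
    more than `ratio` times as often as the runner-up, else None."""
    ranked = Counter(rows).most_common(2)
    if len(ranked) == 2 and ranked[0][1] > ratio * ranked[1][1]:
        return ranked[0][0]
    return None
```

With the counts reported above (32k against 8k), a ratio of 3 would flag the hotel reservation pair.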

Executing the Extractor

Generalized Extractor

  1. Apply pattern to sentence
  2. Remove matches with an adjacent `neg` edge
  3. Convert the match into an extraction
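
The negation filter in step 2 might look like the sketch below. It assumes a sentence's dependency graph is a list of (governor, label, dependent) edges and a match is the set of nodes the pattern covered; "adjacent" is read as "touching any matched node". These representations and names are assumptions.

```python
def has_adjacent_neg(match_nodes, edges):
    """True if any `neg` edge touches a node in the match."""
    return any(
        label == "neg" and (gov in match_nodes or dep in match_nodes)
        for gov, label, dep in edges
    )

def filter_negated(matches, edges):
    """Keep only matches with no adjacent `neg` edge."""
    return [m for m in matches if not has_adjacent_neg(m, edges)]
```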

Specific Extractor

  1. Run the generalized extractor with the pattern from a (pattern, relation) pair
  2. Keep any extractions where the relations match
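
As a sketch, the specific extractor is a thin wrapper around the generalized one. Here `generalized_extract` is a hypothetical callable returning (arg1, relation, arg2) tuples, and "match" is assumed to mean equality of already-normalized relation strings.

```python
def specific_extract(sentence, pattern, relation, generalized_extract):
    """Run the generalized extractor with `pattern` and keep only the
    extractions whose relation equals the paired `relation`."""
    return [
        extr for extr in generalized_extract(sentence, pattern)
        if extr[1] == relation
    ]
```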

LDA Extractor

  1. Run the generalized extractor with a pattern
  2. Remove extractions with "relation strings" that don't match any target relation
  3. Keep the best associated target relation by maximizing P(p | r), where p is the pattern and r is a target relation
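
Step 3 amounts to an argmax over candidate relations. The sketch below assumes the probabilities are available as a nested dictionary `prob[r][p]` holding P(p | r), and that the candidate target relations for an extraction were already determined in step 2. The table layout and the function name are assumptions.

```python
def best_relation(pattern, candidates, prob):
    """Return the candidate relation r maximizing P(pattern | r),
    or None when no candidate gives the pattern nonzero probability."""
    scored = [(prob.get(r, {}).get(pattern, 0.0), r) for r in candidates]
    if not scored:
        return None
    score, rel = max(scored)
    return rel if score > 0 else None
```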