Pattern Learning

Building the bootstrapping data

Determining target relations

  1. Restrict the high-quality set of ClueWeb extractions to those with proper noun arguments.
  2. Choose the most frequent relations from this restricted set (a sketch of these two steps follows below).
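
A minimal Python sketch of these two steps, for illustration only: the extraction tuple layout, the use of the DT/NNP/NNPS tag set mentioned in the next section as the proper-noun test, and the top-k cutoff are assumptions, not the project's actual code.

  # Hypothetical sketch of steps 1-2; tuple layout and cutoff are assumptions.
  from collections import Counter

  PROPER_NOUN_TAGS = {"DT", "NNP", "NNPS"}

  def has_proper_noun_args(arg1_tags, arg2_tags):
      # True when every argument token is a determiner or proper noun.
      return all(t in PROPER_NOUN_TAGS for t in arg1_tags + arg2_tags)

  def top_relations(extractions, k=100):
      # Count relation strings over proper-noun extractions; keep the k most frequent.
      counts = Counter(rel for (a1, rel, a2, a1_tags, a2_tags) in extractions
                       if has_proper_noun_args(a1_tags, a2_tags))
      return [rel for rel, _ in counts.most_common(k)]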

Determining target extractions

  1. Start with the clean, chunked dataset of ReVerb extractions from ClueWeb.
  2. Apply Jonathan Berant's relation string normalization.
  3. Filter relations so each relation's normalized relation string matches a target relation and the argument tokens are tagged only DT, NNP, or NNPS.
  4. Filter extractions:
    1. Remove extraction strings that occur fewer than three times.
    2. Remove extractions with single- or double-letter arguments, optionally ending with a period.
  5. Count how often each argument occurs in the remaining extractions.
  6. Keep extractions from the target relations whose arguments occur commonly (at least 20 times); a sketch of this filter chain follows below.
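
Below is a minimal Python sketch of this filter chain. The tuple layout, the normalize_relation argument (standing in for Jonathan Berant's normalization, which is not reimplemented here), and the threshold names are assumptions for illustration only.

  # Hypothetical sketch of the target-extraction filter; not the project's actual code.
  import re
  from collections import Counter

  ALLOWED_TAGS = {"DT", "NNP", "NNPS"}
  SHORT_ARG = re.compile(r"^[A-Za-z]{1,2}\.?$")  # one or two letters, optional period

  def target_extractions(extractions, target_relations, normalize_relation,
                         min_string_count=3, min_arg_count=20):
      # Keep extractions whose normalized relation is a target relation and
      # whose argument tokens are all tagged DT/NNP/NNPS.
      kept = [(a1, rel, a2) for (a1, rel, a2, a1_tags, a2_tags) in extractions
              if normalize_relation(rel) in target_relations
              and all(t in ALLOWED_TAGS for t in a1_tags + a2_tags)]

      # Drop extraction strings seen fewer than three times.
      string_counts = Counter(kept)
      kept = [e for e in kept if string_counts[e] >= min_string_count]

      # Drop extractions with one- or two-letter arguments (optional trailing period).
      kept = [e for e in kept
              if not SHORT_ARG.match(e[0]) and not SHORT_ARG.match(e[2])]

      # Keep only extractions whose arguments occur commonly (at least 20 times).
      arg_counts = Counter(a for (a1, _, a2) in kept for a in (a1, a2))
      return [e for e in kept
              if arg_counts[e[0]] >= min_arg_count and arg_counts[e[2]] >= min_arg_count]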

Reducing the lemma grep results

  1. Remove patterns that occur fewer than 5 times.
  2. Remove duplicate sentences.
  3. Remove extractions whose (extraction, pattern) pair occurs anomalously frequently (a sketch of these steps follows below).
    1. There was a single such pair: (hotel reservation, be make, online) occurred 32k times; the next most frequent pair occurred 8k times.
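
A minimal Python sketch of this reduction, assuming rows of (sentence, extraction, pattern); the row format and the 10k anomaly cutoff are assumptions, the cutoff being an illustrative value between the 8k and 32k counts noted above.

  # Hypothetical sketch of the lemma-grep reduction; row format and cutoff are assumptions.
  from collections import Counter

  def reduce_grep_results(rows, min_pattern_count=5, max_pair_count=10000):
      # Drop patterns that occur fewer than 5 times.
      pattern_counts = Counter(pattern for (_, _, pattern) in rows)
      rows = [r for r in rows if pattern_counts[r[2]] >= min_pattern_count]

      # Drop duplicate sentences, keeping the first occurrence of each.
      seen, unique = set(), []
      for sentence, extraction, pattern in rows:
          if sentence not in seen:
              seen.add(sentence)
              unique.append((sentence, extraction, pattern))

      # Drop (extraction, pattern) pairs that occur anomalously often.
      pair_counts = Counter((e, p) for (_, e, p) in unique)
      return [r for r in unique if pair_counts[(r[1], r[2])] <= max_pair_count]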