Difference between revisions of "Pattern Learning"
From Knowitall
Revision as of 19:25, 25 October 2011
Building the bootstrapping data
Determining target relations
- Restrict the high-quality set of ClueWeb extractions to those with proper-noun arguments
- Choose the most frequent relations from this set
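The two steps above can be sketched roughly as follows. This is only an illustration, not the code actually used: the tuple layout `(arg1_tags, relation, arg2_tags)`, the function names, and the cutoff `k` are all hypothetical, and "proper noun" is assumed here to mean that every token is tagged NNP or NNPS.

```python
from collections import Counter

# Assumption: an argument counts as a proper noun if all of its POS tags
# are NNP or NNPS. The page does not spell out the exact test.
PROPER_NOUN_TAGS = {"NNP", "NNPS"}


def is_proper_noun_arg(tags):
    """Return True if the argument is non-empty and fully proper-noun tagged."""
    return bool(tags) and all(tag in PROPER_NOUN_TAGS for tag in tags)


def top_relations(extractions, k):
    """Return the k most frequent relations among proper-noun extractions.

    `extractions` is an iterable of hypothetical
    (arg1_tags, relation, arg2_tags) tuples.
    """
    counts = Counter(
        relation
        for arg1_tags, relation, arg2_tags in extractions
        if is_proper_noun_arg(arg1_tags) and is_proper_noun_arg(arg2_tags)
    )
    return [relation for relation, _ in counts.most_common(k)]
```

The number of target relations to keep is not stated on this page, so `k` is left as a parameter.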
Determining target extractions
- Start with the clean, chunked dataset of ReVerb extractions from ClueWeb.
- Apply Jonathan Berant's relation string normalization.
- Filter relations so each relation's normalized relation string matches a target relation and the arguments only contain DT, NNP, and NNPS.
- Remove extraction strings that occur fewer than three times.
- Measure the occurrence of the arguments.
- Keep extractions from the target relations that have arguments that occur commonly (20 times).
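A minimal sketch of the filtering steps above, assuming the relation strings have already been normalized. The `Extraction` record, the function name, and the parameter names are hypothetical. Two details are guesses, since the page does not specify them: argument occurrences are counted over the extractions that survive the earlier filters, and both arguments must reach the threshold.

```python
from collections import Counter, namedtuple

# Hypothetical record for one chunked ReVerb extraction whose relation
# string has already been normalized.
Extraction = namedtuple(
    "Extraction", ["arg1", "arg1_tags", "relation", "arg2", "arg2_tags"]
)

ALLOWED_TAGS = {"DT", "NNP", "NNPS"}


def select_target_extractions(
    extractions, target_relations, min_extraction_count=3, min_arg_count=20
):
    """Return the set of (arg1, relation, arg2) triples that pass all filters."""
    # Keep target relations whose arguments contain only DT, NNP, and NNPS.
    candidates = [
        e
        for e in extractions
        if e.relation in target_relations
        and set(e.arg1_tags) <= ALLOWED_TAGS
        and set(e.arg2_tags) <= ALLOWED_TAGS
    ]

    # Remove extraction strings that occur fewer than min_extraction_count times.
    triple_counts = Counter((e.arg1, e.relation, e.arg2) for e in candidates)
    frequent = [
        e
        for e in candidates
        if triple_counts[(e.arg1, e.relation, e.arg2)] >= min_extraction_count
    ]

    # Measure argument occurrence (assumption: over the remaining extractions).
    arg_counts = Counter()
    for e in frequent:
        arg_counts[e.arg1] += 1
        arg_counts[e.arg2] += 1

    # Keep extractions whose arguments both occur commonly (assumption: both).
    return {
        (e.arg1, e.relation, e.arg2)
        for e in frequent
        if arg_counts[e.arg1] >= min_arg_count
        and arg_counts[e.arg2] >= min_arg_count
    }
```

The defaults mirror the thresholds listed above (3 and 20); treating "20 times" as "at least 20" is an interpretation.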
Reducing the lemma grep results
- Remove patterns that occur fewer than 5 times.
- Remove duplicate sentences.
- Remove extractions that have an (extraction, pattern) pair that occurs anomalously frequently.
- There was a single such pair: (hotel reservation, be make, online) occurred 32k times; the next most frequent pair occurred 8k times.
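The reduction steps above might look something like the sketch below. The row layout `(extraction, pattern, sentence)`, the function name, and the parameter names are hypothetical. The page gives no rule for "anomalously frequently", so the cutoff is exposed as the parameter `max_pair_count`; its default of 10000 is only a guess that would separate the 32k pair from the 8k pair. Whether pair counts were taken before or after removing duplicate sentences is also not stated; this sketch counts them afterwards.

```python
from collections import Counter


def reduce_matches(rows, min_pattern_count=5, max_pair_count=10000):
    """Filter lemma grep results given as (extraction, pattern, sentence) rows."""
    rows = list(rows)

    # Remove patterns that occur fewer than min_pattern_count times.
    pattern_counts = Counter(pattern for _, pattern, _ in rows)
    rows = [row for row in rows if pattern_counts[row[1]] >= min_pattern_count]

    # Remove duplicate sentences, keeping the first row for each sentence.
    seen_sentences = set()
    unique_rows = []
    for row in rows:
        sentence = row[2]
        if sentence not in seen_sentences:
            seen_sentences.add(sentence)
            unique_rows.append(row)

    # Remove every extraction that takes part in an (extraction, pattern)
    # pair occurring more than max_pair_count times (hypothetical cutoff).
    pair_counts = Counter(
        (extraction, pattern) for extraction, pattern, _ in unique_rows
    )
    anomalous = {
        extraction
        for (extraction, _), count in pair_counts.items()
        if count > max_pair_count
    }
    return [row for row in unique_rows if row[0] not in anomalous]
```

Here an extraction is any hashable value, for example the tuple `("hotel reservation", "be make", "online")`.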