Pattern Learning
From Knowitall
Revision as of 22:48, 30 November 2011
Building the bootstrapping data
Determining target relations
- Restrict the high-quality set of ClueWeb extractions to those with proper-noun arguments
- Choose the most frequent relations from this set
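The two steps above can be sketched as follows. This is a toy illustration, not the actual Knowitall code: the extraction tuples, POS-tag fields, and `top_n` cutoff are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical ClueWeb-style extractions:
# (arg1, relation, arg2, arg1 POS tags, arg2 POS tags).
extractions = [
    ("Barack Obama", "be president of", "United States", ["NNP", "NNP"], ["NNP", "NNP"]),
    ("Seattle", "be located in", "Washington", ["NNP"], ["NNP"]),
    ("the dog", "run in", "the park", ["DT", "NN"], ["DT", "NN"]),
    ("Microsoft", "be located in", "Redmond", ["NNP"], ["NNP"]),
]

PROPER = {"NNP", "NNPS"}

def target_relations(extractions, top_n=2):
    """Keep extractions whose arguments are all proper nouns,
    then rank the surviving relations by frequency."""
    counts = Counter(
        rel for a1, rel, a2, t1, t2 in extractions
        if set(t1) <= PROPER and set(t2) <= PROPER
    )
    return [rel for rel, _ in counts.most_common(top_n)]
```

On the toy data, "run in" is dropped (common-noun arguments) and "be located in" ranks first (two proper-noun occurrences).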
Determining target extractions (seeds)
- Start with the clean, chunked dataset of ReVerb extractions from ClueWeb.
- Apply Jonathan Berant's relation string normalization.
- Filter relations so each relation's normalized relation string matches a target relation and the arguments only contain DT, IN, NNP, and NNPS.
- Filter extractions
- Remove extraction strings that occur fewer than three times.
- Remove extractions with single or double letter arguments, optionally ending with a period.
- Filter arguments
- Remove inc, ltd, vehicle, turn, page, site
- Remove arguments that are 2 or fewer characters
- Measure the occurrence of the arguments.
- Keep extractions from the target relations whose arguments occur commonly (at least 20 times).
- Remove target relations that have fewer than 15 seeds.
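A minimal sketch of the seed-filtering steps, assuming extractions arrive as a dictionary of (arg1, relation, arg2) triples with counts. The thresholds come from the list above; the data layout and function name are assumptions for illustration.

```python
from collections import Counter

# Argument blacklist from the list above.
BLACKLIST = {"inc", "ltd", "vehicle", "turn", "page", "site"}

def filter_seeds(triple_counts, arg_min=20, seed_min=15, triple_min=3):
    """Filter (arg1, relation, arg2) seed candidates by the rules above."""
    # Drop rare extraction strings and blacklisted or too-short arguments.
    triples = [
        t for t, c in triple_counts.items()
        if c >= triple_min
        and all(a.lower() not in BLACKLIST and len(a) > 2 for a in (t[0], t[2]))
    ]
    # Keep extractions whose arguments occur commonly across the survivors.
    arg_counts = Counter(a for t in triples for a in (t[0], t[2]))
    triples = [t for t in triples
               if arg_counts[t[0]] >= arg_min and arg_counts[t[2]] >= arg_min]
    # Drop relations with too few seeds.
    rel_counts = Counter(t[1] for t in triples)
    return [t for t in triples if rel_counts[t[1]] >= seed_min]
```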
Lemma grep
- Search corpus for all sentences that contain the lemmas in a target extraction.
- Remove duplicate sentences ((sentence, extraction) pairs must be unique).
- For each (sentence, extraction) pair, search for a pattern that connects the lemmas.
- The pattern must start with arg1.
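The sentence search above can be approximated like this. For brevity the sketch matches on lowercased tokens rather than true lemmas; a real implementation would lemmatize both the sentence and the extraction first.

```python
def lemma_grep(sentences, extraction_lemmas):
    """Return the sentences that contain every lemma of the target extraction.
    Token equality stands in for lemma matching in this sketch."""
    target = set(extraction_lemmas)
    return [s for s in sentences if target <= set(s.lower().split())]
```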
Reducing the patterned results
- Don't allow patterns that contain punct edges or edges with non-word ([^\w]) characters
- Remove patterns that occur less than 10 times.
- Remove extractions whose (extraction, pattern) pair occurs anomalously frequently.
- There was a single such case: (hotel reservation, be make, online) occurred 32k times; the next most frequent occurred 8k times.
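The pattern-reduction filters can be sketched as follows. Patterns are modeled here as tuples of dependency edge labels, which is an assumption about the representation; the `punct` check and the non-word ([^\w]) check mirror the rules above.

```python
import re
from collections import Counter

def reduce_patterns(patterns, min_count=10):
    """Drop patterns containing a punct edge or an edge with non-word
    characters, then drop patterns seen fewer than min_count times."""
    counts = Counter(patterns)
    return {
        p: c for p, c in counts.items()
        if c >= min_count
        and all(e != "punct" and not re.search(r"[^\w]", e) for e in p)
    }
```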
Executing the Extractor
Generalized Extractor
- Apply pattern to sentence
- Remove matches with an adjacent `neg` edge
- Convert the match into an extraction
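A sketch of the generalized extractor's last two steps, assuming a pattern match maps slot names to matched tokens and the dependency graph is a list of (head, label, dependent) edges; both representations are assumptions for the example.

```python
def extract(match, edges):
    """Reject a pattern match if any matched node has an adjacent `neg`
    dependency edge; otherwise convert it into an (arg1, rel, arg2) triple."""
    nodes = set(match.values())
    for head, label, dep in edges:
        if label == "neg" and (head in nodes or dep in nodes):
            return None  # negated match, discard
    return (match["arg1"], match["rel"], match["arg2"])
```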
Specific Extractor
- Run the generalized extractor with the pattern from a (pattern, relation) pair
- Keep any extractions where the relations match
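The specific extractor wraps the generalized one, sketched here with `run_pattern` standing in for a call to the generalized extractor (a hypothetical callable returning an (arg1, rel, arg2) triple or None):

```python
def specific_extract(sentence_edges, pattern_relation_pairs, run_pattern):
    """Run the generalized extractor for each (pattern, relation) pair and
    keep only extractions whose relation string matches the pair's relation."""
    results = []
    for pattern, relation in pattern_relation_pairs:
        ext = run_pattern(pattern, sentence_edges)
        if ext is not None and ext[1] == relation:
            results.append(ext)
    return results
```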
LDA Extractor
- Run the generalized extractor with a pattern
- Remove extractions with "relation strings" that don't match any target relation
- Keep the best associated target relation by maximizing P(p | r).
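The LDA extractor's relation-selection step can be sketched as below. The `rel_match` callable (mapping a relation string to candidate target relations) and the {(pattern, relation): probability} table are assumed data structures, not the actual implementation.

```python
def lda_extract(relation_string, pattern, rel_match, p_pattern_given_rel):
    """Drop the extraction if its relation string matches no target relation;
    otherwise keep the target relation maximizing P(pattern | relation)."""
    candidates = rel_match(relation_string)
    if not candidates:
        return None  # relation string matches no target relation
    return max(candidates, key=lambda r: p_pattern_given_rel.get((pattern, r), 0.0))
```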