Revision as of 00:38, 16 November 2011

Building the bootstrapping data

Start with the clean, chunked dataset of ReVerb extractions from ClueWeb.
Apply Jonathan Berant's relation string normalization.
Filter extractions so that each normalized relation string matches a target relation and the arguments contain only tokens tagged DT, NNP, or NNPS.
Filter extractions
1. Remove extraction strings that occur fewer than three times.
2. Remove extractions with single- or double-letter arguments, optionally ending with a period.
Filter arguments
1. Remove the arguments inc, ltd, vehicle, turn, page, and site
2. Remove arguments that are 2 or fewer characters
Measure the occurrence of the arguments.
Keep extractions from the target relations whose arguments occur commonly (at least 20 times).
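
The extraction and argument filters above can be sketched roughly as follows. This is only an illustration, not the code that was actually run: the function name `filter_extractions`, the tuple layout `(arg1, rel, arg2)`, and the parameter names are hypothetical. It also assumes that a blacklisted or too-short argument causes the whole extraction to be dropped, and that "occur commonly" means each argument appears at least 20 times among the surviving extractions.

```python
import re
from collections import Counter

# Arguments listed in the "Filter arguments" step.
STOP_ARGS = {"inc", "ltd", "vehicle", "turn", "page", "site"}
# One or two letters, optionally followed by a period (e.g. "A." or "of").
SHORT_LETTER_ARG = re.compile(r"[A-Za-z]{1,2}\.?")


def filter_extractions(extractions, min_extraction_count=3, min_arg_count=20):
    """extractions: list of (arg1, rel, arg2) tuples, one per occurrence."""
    # Remove extraction strings that occur fewer than three times.
    counts = Counter(extractions)
    kept = [e for e in extractions if counts[e] >= min_extraction_count]

    # Remove extractions with single- or double-letter arguments.
    kept = [e for e in kept
            if not (SHORT_LETTER_ARG.fullmatch(e[0])
                    or SHORT_LETTER_ARG.fullmatch(e[2]))]

    # Assumption: a bad argument removes the whole extraction.
    def bad_arg(arg):
        return arg.lower() in STOP_ARGS or len(arg) <= 2

    kept = [e for e in kept if not (bad_arg(e[0]) or bad_arg(e[2]))]

    # Measure how often each argument occurs, then keep extractions
    # whose arguments both meet the threshold (an assumed reading).
    arg_counts = Counter()
    for arg1, _, arg2 in kept:
        arg_counts[arg1] += 1
        arg_counts[arg2] += 1
    return [e for e in kept
            if arg_counts[e[0]] >= min_arg_count
            and arg_counts[e[2]] >= min_arg_count]
```

The thresholds are left as parameters so they can be tuned without touching the filter logic.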

Search corpus for all sentences that contain the lemmas in a target extraction.
Remove duplicate sentences (sentence*extraction pairs must be unique).
For each sentence*extraction pair, search for a pattern that connects the lemmas.
1. The pattern must start with arg1.

Remove patterns that occur fewer than 5 times.
Remove extractions that have an (extraction, pattern) pair that occurs anomalously frequently.
1. There was a single one: (hotel reservation, be make, online) occurred 32k times; the next most frequent pair occurred 8k times.
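
A minimal sketch of the two pattern-level filters follows. The notes do not say how the anomalous pair was identified, so the ratio test here (the top pair is at least `outlier_ratio` times as frequent as the runner-up) is purely an assumption for illustration, as are the function name `filter_patterns` and the input layout.

```python
from collections import Counter


def filter_patterns(pairs, min_pattern_count=5, outlier_ratio=4.0):
    """pairs: list of (extraction, pattern) tuples, one per sentence match.

    Returns (kept_pairs, dropped_extractions).
    """
    # Remove patterns that occur fewer than min_pattern_count times.
    pattern_counts = Counter(pattern for _, pattern in pairs)
    pairs = [(e, p) for e, p in pairs
             if pattern_counts[p] >= min_pattern_count]

    # Hypothetical outlier test: compare the most frequent
    # (extraction, pattern) pair against the second most frequent.
    ranked = Counter(pairs).most_common(2)
    dropped = set()
    if len(ranked) == 2 and ranked[0][1] >= outlier_ratio * ranked[1][1]:
        dropped.add(ranked[0][0][0])  # the extraction of the top pair

    # Remove every pair belonging to a dropped extraction.
    kept = [(e, p) for e, p in pairs if e not in dropped]
    return kept, dropped
```

With counts like the ones reported above (32k against 8k), a ratio of 4 would flag the top pair; in practice the cutoff may well have been chosen by inspection.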
