Pattern Learning

Further Work

Bootstrapping

Patterns

  • Reduce the POS tag restriction (collapse VB, VBZ, VBN, etc.)
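
One way to picture the POS tag collapse is a small mapping that folds every fine-grained verb tag into a single coarse tag. This is only a sketch: the function names are hypothetical, and it assumes the tags are the Penn Treebank verb tags and that "VB" is used as the coarse label.

```python
# Assumption: Penn Treebank verb tags; "VB" is the chosen coarse tag.
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def collapse_postag(tag):
    """Map any verb tag to "VB"; leave every other tag unchanged."""
    return "VB" if tag in VERB_TAGS else tag


def collapse_pattern_postags(tags):
    """Apply the collapse to the POS restrictions of one pattern."""
    return [collapse_postag(tag) for tag in tags]
```

With this, a pattern restricted to VBZ would also match a node tagged VBN, since both restrictions become "VB".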

Building the bootstrapping data

Determining target relations

  1. Restrict the high-quality set of ClueWeb extractions to those with proper noun arguments.
  2. Choose the most frequent relations from this set.
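
The two steps above can be sketched as a filter followed by a frequency count. The record layout (a dict with "rel", "arg1_tags", "arg2_tags") and the function names are hypothetical, and "proper noun arguments" is read here as "each argument contains at least one NNP or NNPS token", which is an assumption.

```python
from collections import Counter

PROPER_NOUN_TAGS = {"NNP", "NNPS"}


def has_proper_noun_args(extraction):
    """Assumed test: each argument contains at least one proper-noun tag."""
    return all(
        any(tag in PROPER_NOUN_TAGS for tag in tags)
        for tags in (extraction["arg1_tags"], extraction["arg2_tags"])
    )


def target_relations(extractions, top_n):
    """Return the top_n most frequent relations among the kept extractions."""
    counts = Counter(e["rel"] for e in extractions if has_proper_noun_args(e))
    return [rel for rel, _ in counts.most_common(top_n)]
```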

Determining target extractions (seeds)

  1. Start with the clean, chunked dataset of ReVerb extractions from ClueWeb.
  2. Apply Jonathan Berant's relation string normalization.
  3. Filter extractions so that each normalized relation string matches a target relation and the arguments contain only tokens tagged DT, IN, NNP, or NNPS.
  4. Filter extractions
    1. Remove extractions that occur a single time.
    2. Remove extractions with single- or double-letter arguments, optionally ending with a period.
  5. Filter arguments
    1. Remove the arguments inc, ltd, vehicle, turn, page, site
    2. Remove arguments that are 2 or fewer characters
  6. Measure the occurrence of the arguments.
  7. Keep extractions from the target relations that have arguments that occur commonly (20 times).
  8. Remove target relations which have fewer than 15 seeds
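
Steps 3 through 8 can be sketched as a single filtering pass. Everything about the data layout is an assumption: each input item is one occurrence of an extraction, given as a hypothetical tuple (arg1, rel, arg2, arg1_tags, arg2_tags) whose relation is already normalized. "Occur commonly (20 times)" is read as "at least 20 occurrences", and argument counts are taken over the extractions that survive steps 3 to 5.

```python
import re
from collections import Counter

ALLOWED_ARG_TAGS = {"DT", "IN", "NNP", "NNPS"}
STOP_ARGS = {"inc", "ltd", "vehicle", "turn", "page", "site"}
SHORT_ARG = re.compile(r"[a-z]{1,2}\.?")  # one or two letters, optional period


def bad_argument(arg):
    """Steps 4.2, 5.1 and 5.2: reject short or blacklisted arguments."""
    text = arg.strip().lower()
    if SHORT_ARG.fullmatch(text):
        return True
    text = text.rstrip(".")
    return text in STOP_ARGS or len(text) <= 2


def select_seeds(occurrences, target_rels, min_arg_count=20, min_seeds=15):
    # Step 3: target relation and allowed argument tags only.
    kept = [(a1, rel, a2) for a1, rel, a2, t1, t2 in occurrences
            if rel in target_rels and set(t1) | set(t2) <= ALLOWED_ARG_TAGS]
    freq = Counter(kept)
    # Steps 4 and 5: drop singletons and extractions with bad arguments.
    candidates = {e: n for e, n in freq.items()
                  if n > 1 and not bad_argument(e[0]) and not bad_argument(e[2])}
    # Step 6: count how often each argument occurs.
    arg_counts = Counter()
    for (a1, _, a2), n in candidates.items():
        arg_counts[a1] += n
        arg_counts[a2] += n
    # Step 7: keep extractions whose arguments are both common.
    seeds = [e for e in candidates
             if arg_counts[e[0]] >= min_arg_count
             and arg_counts[e[2]] >= min_arg_count]
    # Step 8: drop relations with too few seeds.
    per_rel = Counter(rel for _, rel, _ in seeds)
    return [e for e in seeds if per_rel[e[1]] >= min_seeds]
```

The two thresholds default to the values listed in the steps (20 and 15) but are parameters so that the sketch can be tried on small samples.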

Lemma grep

  1. Search corpus for all sentences that contain the lemmas in a target extraction.
  2. Remove duplicate sentences (sentence*extraction pairs must be unique).
  3. For each sentence*extraction pair, search for a pattern that connects the lemmas.
    1. The pattern must start with arg1.
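
Steps 1 and 2 above can be sketched with an inverted index from lemmas to sentences. The representation is hypothetical: each sentence is a list of lemmas, each extraction is a tuple of lemmatized strings, and every lemma of the extraction (including function words) is assumed to be required. The pattern search of step 3 needs a dependency parse and is not shown.

```python
from collections import defaultdict


def build_index(sentences):
    """Map each lemma to the set of sentence ids that contain it."""
    index = defaultdict(set)
    for sid, lemmas in enumerate(sentences):
        for lemma in lemmas:
            index[lemma].add(sid)
    return index


def lemma_grep(sentences, extractions):
    """Return unique (sentence id, extraction) pairs where all lemmas occur."""
    index = build_index(sentences)
    seen = set()
    pairs = []
    for extraction in extractions:
        lemmas = {lemma for part in extraction for lemma in part.split()}
        if not lemmas:
            continue
        ids = set.intersection(*(index.get(lemma, set()) for lemma in lemmas))
        for sid in sorted(ids):
            # Deduplicate on sentence content, not on sentence id.
            key = (tuple(sentences[sid]), extraction)
            if key not in seen:
                seen.add(key)
                pairs.append((sid, extraction))
    return pairs
```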

Reducing the patterned results

  1. Don't allow patterns that contain punct edges or edges with non-word ([^\w]) characters
  2. Remove patterns that occur fewer than 10 times.
  3. Remove extractions that have an (extraction, pattern) pair that occurs anomalously frequently.
    1. There was a single one: (hotel reservation, be make, online) occurred 32k times; the next most frequent pair occurred 8k times.
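
The reduction steps above can be sketched as follows. The representation is hypothetical: each result is one (extraction, pattern) occurrence, and a pattern is a tuple of dependency edge labels. The page does not give a cutoff for "anomalously frequently", so the `max_pair_count` default of 10,000 is only an assumed value that happens to separate the 32k pair from the 8k pair.

```python
import re
from collections import Counter

NON_WORD = re.compile(r"[^\w]")


def valid_pattern(edge_labels):
    """Step 1: no punct edges and no non-word characters in any edge label."""
    return all(label != "punct" and not NON_WORD.search(label)
               for label in edge_labels)


def reduce_results(pairs, min_pattern_count=10, max_pair_count=10_000):
    pairs = [(e, p) for e, p in pairs if valid_pattern(p)]
    pattern_counts = Counter(p for _, p in pairs)
    pair_counts = Counter(pairs)
    # Step 3: extractions that take part in an anomalously frequent pair.
    anomalous = {e for (e, _), n in pair_counts.items() if n > max_pair_count}
    # Step 2 and step 3 applied together.
    return [(e, p) for e, p in pairs
            if pattern_counts[p] >= min_pattern_count and e not in anomalous]
```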

Extracting

Collapsing Dependencies

Collapsing noun edges

In the following sentences, only the nn edge between "Barack" and "Obama" should be collapsed.

  1. US president Barack Obama declared victory yesterday.
  2. US president Barack Obama likes to drink beer.
  3. Barack Obama, the president of the US, has a wife.

Executing the Extractor

Generalized Extractor

  1. Apply pattern to sentence
  2. Remove matches with an adjacent `neg` edge
  3. Convert the match into an extraction
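
Step 2 of the generalized extractor can be sketched as a filter over pattern matches. The data layout is hypothetical: a dependency graph is a list of (head, label, dependent) edges, and a match is the set of nodes the pattern matched. "Adjacent" is read here as "the neg edge touches a matched node", which is an assumption.

```python
def has_adjacent_neg(match_nodes, edges):
    """True if any neg edge touches a node that is part of the match."""
    return any(
        label == "neg" and (head in match_nodes or dep in match_nodes)
        for head, label, dep in edges
    )


def filter_negated(matches, edges):
    """Keep only the matches without an adjacent neg edge."""
    return [m for m in matches if not has_adjacent_neg(m, edges)]
```

For "Obama does not like beer", a match over "Obama", "like", and "beer" would be dropped because of the edge neg(like, not).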

Specific Extractor

  1. Run the generalized extractor with the pattern from a (pattern, relation) pair
  2. Keep any extractions where the relations match

LDA Extractor

  1. Run the generalized extractor with a pattern
  2. Remove extractions with "relation strings" that don't match any target relation
  3. Keep best associated target relation by maximizing P(p | r)
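
Step 3 of the LDA extractor amounts to an argmax over candidate relations. A minimal sketch, assuming the probabilities P(p | r) have already been estimated and stored in a hypothetical dict keyed by (pattern, relation), and that the candidate list holds the target relations that survived step 2:

```python
def best_relation(pattern, candidate_relations, p_pattern_given_relation):
    """Return the candidate r that maximizes P(pattern | r), or None."""
    scored = [
        (p_pattern_given_relation.get((pattern, rel), 0.0), rel)
        for rel in candidate_relations
    ]
    if not scored:
        return None
    score, rel = max(scored)
    # A zero score means the pattern was never associated with any candidate.
    return rel if score > 0 else None
```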

Expanding Arguments

Patterns match single nodes for the relation and for each argument, so the matched nodes must be expanded into full phrases. For example, a noun relation is expanded over certain modifier edges ("det", "amod", "num", "nn", "poss", "quantmod"). An argument (presumed to be a noun, though it may be an adjective) is additionally expanded over prepositions. Relations are not, because the preposition itself is part of the template and the prepositional phrase is usually the argument.
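
The basic expansion can be sketched as a traversal that follows only the listed modifier edges. The graph layout (a dict from a node to its outgoing (label, child) edges) and the function name are hypothetical; the sketch also assumes that preposition edges carry labels beginning with "prep" and that expansion is applied transitively.

```python
NOUN_MODIFIERS = {"det", "amod", "num", "nn", "poss", "quantmod"}


def expand_node(node, children, is_argument):
    """Collect the nodes reachable from `node` over the allowed edges."""
    expanded = {node}
    stack = [node]
    while stack:
        current = stack.pop()
        for label, child in children.get(current, []):
            # Arguments also follow preposition edges; relations do not.
            follow = label in NOUN_MODIFIERS or (
                is_argument and label.startswith("prep")
            )
            if follow and child not in expanded:
                expanded.add(child)
                stack.append(child)
    return expanded
```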

Arguments

Care must be taken that arguments do not expand over a relation; otherwise some relational noun extractions become impossible (for example, "US president Barack Obama"). Arguments should expand over rcmod, infmod, partmod, and ref edges, but only if no other argument or relation lies inside that subcomponent of the graph (all-or-nothing expansion).
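
The all-or-nothing rule can be sketched as follows: collect the whole subcomponent below a clausal edge, and add it only if it is disjoint from the nodes already claimed by the relation or by another argument. The graph layout (a dict from a node to its (label, child) edges) and the names are hypothetical.

```python
CLAUSE_LABELS = {"rcmod", "infmod", "partmod", "ref"}


def subcomponent(start, children, stop):
    """All nodes reachable from `start` without passing through `stop`."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen or node in stop:
            continue
        seen.add(node)
        stack.extend(child for _, child in children.get(node, []))
    return seen


def expand_argument(arg, children, reserved):
    """Expand `arg` over clausal edges, all-or-nothing per subcomponent.

    `reserved` holds the nodes of the relation and of the other argument.
    """
    expanded = {arg}
    for label, child in children.get(arg, []):
        if label in CLAUSE_LABELS:
            component = subcomponent(child, children, stop={arg})
            if component.isdisjoint(reserved):
                expanded |= component
    return expanded
```

The `stop` set keeps the traversal from walking back into the argument itself, which can happen when the dependency graph contains cycles.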

Relations

Noun relations expand over the same noun modifiers as noun arguments, so long as they do not intersect an argument (otherwise we can't get "Iranian president Mack"). Verb relations may extend over advmod edges, but only if those edges reach a node that is adjacent in the source sentence. Consider, for instance, expanding over the advmod subcomponents in the following sentence: "On the three and twentieth day of the seventh month he sent the people away to their tents , joyful and glad of heart for the goodness that the LORD had shown to David , and to Solomon , and to Israel his people ." In general, expanding over non-adjacent nodes is dangerous.

Relations should not extend over prt (phrasal verb particle) edges, but particles are sometimes misidentified as advmod, especially when separated from the verb ("The people were sent merrily away."). Ideally we would get "sent away", but bad parses make this impossible.
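
The adjacency restriction on advmod edges can be sketched with token positions. This is a hypothetical layout: nodes are token indices in the source sentence, edges are (head, label, dependent) triples, and "adjacent" is assumed to mean directly next to the verb token.

```python
def expand_verb_relation(verb_index, edges):
    """Add advmod dependents of the verb only when they sit next to it."""
    expanded = {verb_index}
    for head, label, dep in edges:
        if head == verb_index and label == "advmod" and abs(dep - head) == 1:
            expanded.add(dep)
    return expanded
```

For "He quickly sent the people away" (tokens 0 to 5), with "away" parsed as advmod, the relation grows to "quickly sent" but does not pick up the distant "away", which matches the limitation described above.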

Extractions

Slots

prepc

  1. After winning the Superbowl, the Saints are top dogs of the NFL.
  2. He purchased it without paying a premium.
  3. After winning the lottery, James becomes an Epicurean.
  4. Two months after joining the European Union , Bulgaria began attracting increasing interest towards local real estates.

partmod/advcl

  1. Having won the lottery, James becomes an Epicurean.