Rule Learner/Rules

From Knowitall

Rules are functions from extractions (triples) to ontological relations. Here is an example of a rule.

   speaksLanguage {
       Entity=PatternConstraint("ARGUMENT1", "<>* <type='Person'>+ <>*")
       Language=PatternConstraint("ARGUMENT2", "<>* <type='Language'>+ <>*")
       PatternConstraint("PREDICATE", "<string='fluent'> <string='in'>")
   }

The rule maps onto an ontological relation named "speaksLanguage" with arguments "Entity" and "Language". The first two lines of the rule specify the arguments (Entity and Language); each argument ultimately has a value associated with it. The third line is a constraint. Constraints behave like arguments (they must evaluate to true), but their values are not used in the ontological relation.

Each constraint specifies the part of the relation it pertains to. This can be ARGUMENT1, ARGUMENT2, PREDICATE, LEFT, RIGHT, or RELATION. The first three apply to parts of the triple. The next two apply to the sentence to the left or right of the extraction. The last, RELATION, applies to the whole extraction. RELATION can be useful when the desired information spans triple boundaries (for example, if you wanted to extract goals, you may want the text "is to conquer the world" from (the evil man's goal, is to, conquer the world)).

The PatternConstraint is one of many types of constraints. I primarily use the PatternConstraint because it is very flexible. The other constraints primarily exist for rule learning, where a generalization step must be defined. The patterns are token-based patterns. There is a separate file that gives examples and information about patterns, but I will give a few details here.

If the pattern has no matching groups, the entire matched text of the pattern is used. For example, the pattern "<type='Person'>+ foo" matches one or more tokens of type "Person" followed by the text "foo", and yields the entire matched text. Alternatively, if the pattern has a matching group, as in "(<type='Person'>+) foo", the constraint still requires that the text "foo" follow the "Person" tokens, but it yields only the text under the type "Person" as the argument value.

In the above example, "<>* <type='Person'>+ <>*" matches the entire extraction part (in this case, ARGUMENT1) if it contains the type "Person". Empty angle brackets (<>) act as a single-token wildcard. The entire argument is matched because there could be an extraction such as (King Tut's daughter, is fluent in, Egyptian), which we want to map onto isFluentIn(King Tut's daughter, Egyptian) and not isFluentIn(King Tut, Egyptian), even though the latter is probably true (consider the predicate "marries" for an example where the alternate rule results in a false ontological relation).
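The wildcard and matching-group behavior can be illustrated with an ordinary regular expression over an assumed token encoding. This is only a sketch of the semantics, not the real pattern engine; the "string/TYPE" encoding (with "O" for untyped tokens) is an assumption made for the illustration.

```python
import re

# Assumption: each token is encoded as "string/TYPE" ("O" = untyped), so a
# token-level pattern can be mimicked with an ordinary regex.
def encode(tokens):
    return "".join(f"{s}/{t} " for s, t in tokens)

ANY = r"(?:\S+/\S+ )"        # <>              : any one token
PERSON = r"(?:\S+/Person )"  # <type='Person'> : one Person-typed token

# "<>* <type='Person'>+ <>*": matches the whole argument when it
# contains at least one Person-typed token.
whole = re.compile(ANY + "*" + PERSON + "+" + ANY + "*")
# "(<type='Person'>+)" in context: only the typed span is captured.
# The leading wildcard is lazy so the capture starts at the first Person token.
grouped = re.compile(ANY + "*?(" + PERSON + "+)" + ANY + "*")

arg1 = [("King", "Person"), ("Tut", "Person"), ("'s", "O"), ("daughter", "O")]
text = encode(arg1)

full_value = whole.fullmatch(text).group(0)  # the entire argument matched
captured = grouped.fullmatch(text).group(1)  # only the "Person" span captured
print(full_value)
print(captured)
```

With no group, the whole argument ("King Tut 's daughter") is the value; with a group, only "King Tut" is.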

Another consideration is that extractions from the RelationalNounExtractor can be very useful (for example, with the hasParent relation), but their predicates do not contain a verb. This is because the tokens of the extraction are taken literally from the sentence (not ideal, but a fact of the system today). So when the extractor sees "Bar's son Foo" and builds the extraction (Foo, [is] the son [of], Bar), "is" and "of" are not actually available. I have been handling this so far by checking that the first token either is not a verb or is "is", with "(?: <!pos='vb.'> | <lemma='is'>)", followed by whatever else makes sense ("<>* <string='son'>").

I mentioned there are other types of constraints. Here is an example of an alternate way to write the first rule in this document.

   speaksLanguage {
       Entity=PartConstraint("ARGUMENT1")
       Language=PartConstraint("ARGUMENT2")
       PatternConstraint("PREDICATE", "<string='fluent'> <string='in'>")
       TypeConstraint("ARGUMENT1", "Person")
       TypeConstraint("ARGUMENT2", "Language")
   }

Here the arguments are changed to PartConstraints, which always match the specified part of the relation. Since the arguments always match (and thus don't really constrain anything), we need the last two constraints to ensure the appropriate types appear in the extraction.
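Rule application, then, is just a conjunction of constraints, with the argument constraints additionally yielding values. The following is a toy interpretation of those assumed semantics, not the real API; the extraction representation (each part mapped to (string, type) token pairs) and the helper names are assumptions.

```python
# Toy semantics: a rule fires only if every argument constraint yields a
# value and every plain constraint evaluates to true.
def apply_rule(rule, extraction):
    """Return {argument name: text} if the rule fires, else None."""
    bindings = {}
    for name, constraint in rule["arguments"]:
        value = constraint(extraction)
        if value is None:
            return None
        bindings[name] = value
    if all(c(extraction) for c in rule["constraints"]):
        return bindings
    return None

def part(which):            # PartConstraint: always matches, yields the part's text
    return lambda ex: " ".join(s for s, _ in ex[which])

def has_type(which, t):     # TypeConstraint: part must contain a token of type t
    return lambda ex: any(ty == t for _, ty in ex[which])

def contains(which, text):  # stand-in for the PREDICATE PatternConstraint
    return lambda ex: text in " ".join(s for s, _ in ex[which])

speaks_language = {
    "arguments": [("Entity", part("ARGUMENT1")),
                  ("Language", part("ARGUMENT2"))],
    "constraints": [contains("PREDICATE", "fluent in"),
                    has_type("ARGUMENT1", "Person"),
                    has_type("ARGUMENT2", "Language")],
}

extraction = {
    "ARGUMENT1": [("Alice", "Person")],
    "PREDICATE": [("is", "O"), ("fluent", "O"), ("in", "O")],
    "ARGUMENT2": [("French", "Language")],
}
result = apply_rule(speaks_language, extraction)
print(result)  # {'Entity': 'Alice', 'Language': 'French'}
```

If any constraint fails (say, ARGUMENT2 has no "Language" token), the rule yields nothing.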

You may want to add general logic so that rules do not fire against extractions when the predicate contains "no" or "not".


Old Rule Information (some sections may still apply)

Rules used to be represented with an XML syntax. Here is an example.

<rule> 
  <form name="FounderOf">
    <argument type="TypeConstraint" part="argument1" name="Founder">
      <descriptor>Person</descriptor>
    </argument>
    <argument type="TypeConstraint" part="argument2" name="Organization">
      <descriptor>Organization</descriptor>
    </argument>
  </form>
  <constraints>
    <constraint type="TermConstraint" part="predicate">
      <term>founder</term>
    </constraint>
  </constraints>
</rule>


This rule will extract an ontological relation from an extraction if that extraction's predicate contains the string "founder", its first argument contains the type "Person", and its second argument contains the type "Organization". The arguments of the ontological relation are defined by the text under the types "Person" and "Organization". The arguments require CaptureConstraints, that is, constraints that can be resolved as text when matched. Here is a more complicated rule.

 <rule>
   <form name="hasFather">
     <argument part="ARGUMENT1" type="TypeConstraint" name="Son">
       <descriptor>Person</descriptor>
     </argument>
     <argument part="ARGUMENT2" type="TypeConstraint" name="Father">
       <descriptor>Person</descriptor>
     </argument>
   </form>
   <constraints>
     <constraint part="PREDICATE" type="SequenceConstraint">
       <term>
         <lemma>the</lemma>
       </term>
       <term>
         <lemma>son</lemma>
       </term>
     </constraint>
   </constraints>
 </rule>

This rule will extract an ontological relation from an extraction if the extraction's predicate contains "the son" and both arguments contain the type "Person". The relation's arguments will be the text under the "Person" types.
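The SequenceConstraint check above can be pictured as a sliding window over the predicate tokens: each term must be satisfied by the corresponding token in some contiguous run. This is a sketch of those assumed semantics, not the system's implementation (see the code or javadoc for that); the dict-based token representation is an assumption.

```python
# Assumed SequenceConstraint semantics: the predicate must contain a
# contiguous run of tokens where each token satisfies every feature
# (lemma, pos, ...) its corresponding term specifies.
def sequence_match(terms, tokens):
    n = len(terms)
    return any(
        all(tok.get(k) == v
            for term, tok in zip(terms, tokens[i:i + n])
            for k, v in term.items())
        for i in range(len(tokens) - n + 1)
    )

terms = [{"lemma": "the"}, {"lemma": "son"}]
predicate = [{"lemma": "be", "pos": "VBZ"}, {"lemma": "the", "pos": "DT"},
             {"lemma": "son", "pos": "NN"}, {"lemma": "of", "pos": "IN"}]
print(sequence_match(terms, predicate))                            # True
print(sequence_match([{"lemma": "son", "pos": "VB"}], predicate))  # False
```

A term that specifies both lemma and pos (as in the later example) simply adds another feature the window token must satisfy.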

The SequenceConstraint is rather verbose. In fact, this rule could also be represented with a StringConstraint.

 <rule>
   <form name="hasFather">
     <argument part="ARGUMENT1" type="TypeConstraint" name="Son">
       <descriptor>Person</descriptor>
     </argument>
     <argument part="ARGUMENT2" type="TypeConstraint" name="Father">
       <descriptor>Person</descriptor>
     </argument>
   </form>
   <constraints>
     <constraint part="PREDICATE" type="StringConstraint">
       <string>the son</string>
     </constraint>
   </constraints>
 </rule>
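One upside of the XML format is that it can be read with any standard XML parser. A minimal sketch (not the system's actual loader) that pulls the pieces out of the StringConstraint rule above:

```python
import xml.etree.ElementTree as ET

# The StringConstraint rule from the text, read with the stdlib parser.
RULE = """<rule>
  <form name="hasFather">
    <argument part="ARGUMENT1" type="TypeConstraint" name="Son">
      <descriptor>Person</descriptor>
    </argument>
    <argument part="ARGUMENT2" type="TypeConstraint" name="Father">
      <descriptor>Person</descriptor>
    </argument>
  </form>
  <constraints>
    <constraint part="PREDICATE" type="StringConstraint">
      <string>the son</string>
    </constraint>
  </constraints>
</rule>"""

root = ET.fromstring(RULE)
name = root.find("form").get("name")
args = [(a.get("name"), a.get("part"), a.get("type"), a.findtext("descriptor"))
        for a in root.find("form").findall("argument")]
constraints = [(c.get("type"), c.get("part"), c.findtext("string"))
               for c in root.find("constraints").findall("constraint")]
print(name)         # hasFather
print(args)
print(constraints)  # [('StringConstraint', 'PREDICATE', 'the son')]
```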

The SequenceConstraint is necessary because you may have a more complicated rule. Consider the following.


 <rule>
   <form name="hasFather">
     <argument part="ARGUMENT1" type="TypeConstraint" name="Son">
       <descriptor>Person</descriptor>
     </argument>
     <argument part="ARGUMENT2" type="TypeConstraint" name="Father">
       <descriptor>Person</descriptor>
     </argument>
   </form>
   <constraints>
     <constraint part="PREDICATE" type="SequenceConstraint">
       <term>
         <lemma>the</lemma>
       </term>
       <term>
         <lemma>son</lemma>
         <pos>NN</pos>
       </term>
     </constraint>
   </constraints>
 </rule>

In this rule, there is a further constraint that the part of speech of "son" must be "NN" (common noun). The other reason for using SequenceConstraint is that it better defines what it means to generalize the rule. A StringConstraint, at present, does not define the generalize method because there is no intuitive way to do so. The SequenceConstraint does (to see how it performs, look at the code or the javadoc). The XML format, while easy to serialize and deserialize, is rather obtuse, so each rule has a toString (and toMultilineString) method.

   hasFather {
       Son=TypeConstraint(ARGUMENT1, person)
       Father=TypeConstraint(ARGUMENT2, person)
   }
   constraints {
       SequenceConstraint(PREDICATE, [Term(lemma="the"), Term(lemma="son")])
   }

Ahh, much better. At least to me.