Difference between revisions of "Rule Learner/XML Representation"

From Knowitall

Revision as of 22:57, 20 April 2011

I have a rule language in place and executable. The language has changed slightly to be consistent with the existing XML serialization and to allow for greater flexibility. Below is an example rule for FounderOf, followed by a few of the results it produces.


 <rule> 
   <form name="FounderOf">
     <argument type="TypeConstraint" part="argument1">
       <descriptor>Person</descriptor>
     </argument>
     <argument type="TypeConstraint" part="argument2">
       <descriptor>Organization</descriptor>
     </argument>
   </form>
   <constraints>
     <constraint type="TermConstraint" part="predicate">
       <term>founder</term>
     </constraint>
   </constraints>
 </rule>


 2496: FounderOf(Shaykh Hasan al-Banna, Muslim Brotherhood)
 12286: FounderOf(Hasan al-Banna, Muslim Brotherhood)
 52603: FounderOf(Muhammad, Islam)
 52625: FounderOf(Muhammad, Islam)


The <form> section defines the constraints that are used to create the parameters of the resulting ontological relation. The <constraints> section defines additional constraints that an extraction must satisfy. The rule language is easily extensible because the "type" attribute specifies a class name, and new classes are easy to write. For example, I could write a LemmaConstraint which matches on lemmas, or a WordNetConstraint which behaves like the WordNet constraints you presently use.
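
As a rough illustration of this extensibility, the sketch below shows one way a constraint class could be selected from the value of the "type" attribute. Everything in it is an assumption made for the example: the Constraint interface, the Token class, the registry, and the bodies of TermConstraint and LemmaConstraint are hypothetical and are not taken from the actual rule runner, which may well use reflection on the class name instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class ConstraintSketch {
    /** Hypothetical token: surface string plus normalized form (the "norm" attribute). */
    public static final class Token {
        public final String string;
        public final String norm;
        public Token(String string, String norm) { this.string = string; this.norm = norm; }
    }

    /** Hypothetical interface: a constraint accepts or rejects one part of an extraction. */
    public interface Constraint {
        boolean matches(List<Token> part);
    }

    /** Matches when any token's surface string equals the term, ignoring case. */
    public static final class TermConstraint implements Constraint {
        private final String term;
        public TermConstraint(String term) { this.term = term; }
        public boolean matches(List<Token> part) {
            for (Token t : part) if (t.string.equalsIgnoreCase(term)) return true;
            return false;
        }
    }

    /** Matches when any token's normalized form equals the term, ignoring case. */
    public static final class LemmaConstraint implements Constraint {
        private final String term;
        public LemmaConstraint(String term) { this.term = term; }
        public boolean matches(List<Token> part) {
            for (Token t : part) if (t.norm.equalsIgnoreCase(term)) return true;
            return false;
        }
    }

    /** Registry keyed by the "type" attribute value (an assumption; reflection would also work). */
    private static final Map<String, Function<String, Constraint>> REGISTRY = new HashMap<>();
    static {
        REGISTRY.put("TermConstraint", TermConstraint::new);
        REGISTRY.put("LemmaConstraint", LemmaConstraint::new);
    }

    /** Creates the constraint named by type, configured with the given term. */
    public static Constraint create(String type, String term) {
        Function<String, Constraint> factory = REGISTRY.get(type);
        if (factory == null) throw new IllegalArgumentException("Unknown constraint type: " + type);
        return factory.apply(term);
    }

    /** Builds a token list from alternating (string, norm) values. */
    public static List<Token> tokens(String... pairs) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i + 1 < pairs.length; i += 2) out.add(new Token(pairs[i], pairs[i + 1]));
        return out;
    }

    public static void main(String[] args) {
        // Illustrative predicate tokens; the norm values are assumed, not real normalizer output.
        List<Token> nounForm = tokens("was", "be", "the", "the", "founder", "founder", "of", "of");
        List<Token> verbForm = tokens("founded", "found");
        System.out.println(create("TermConstraint", "founder").matches(nounForm)); // true
        System.out.println(create("TermConstraint", "found").matches(verbForm));   // false
        System.out.println(create("LemmaConstraint", "found").matches(verbForm));  // true
    }
}
```

With this arrangement, adding a new constraint type would only require writing one class and registering it under the name used in the rule XML.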

The input to the "rule runner" is an XML representation of relations. The syntax is slightly different from what you were using, for a few reasons: I had written the relation serialization some months ago; it deserializes into existing objects that are compatible with ReVerb; our class recognizer does not distinguish between NER classes and other classes; and the representation is more readable and compact. It is very easy (three lines of code) to generate this XML if you are using ReVerb.


 // chunkedSentence: ChunkedSentence, reverbExtractor: ReVerbExtractor
 Sentence sentence = new Sentence(chunkedSentence);
 sentence.addExtractions(reverbExtractor.extract(chunkedSentence));
 sentence.toXmlElement();


Here is an XML example. I find this serialization much more readable because the tokens and types are listed only once per sentence. Note that the text attributes are completely redundant: they exist only for readability, and omitting them produces a more compact representation. The start and end attributes are token offsets into the <tokens> list, counted from zero, with the end offset exclusive.


 <sentence id="1" text="The summaries and quoted excerpts in TOSIR issues are intended for the non-profit , background use by members of the United States intelligence and law enforcement communities in furtherance of their professional dutiesand the summaries and quoted excerpts are subject to the copyright protections associated with the original sources .">
   <tokens>
     <token string="The" norm="the" pos="DT" chunk="B-NP" />
     <token string="summaries" norm="summary" pos="NNS" chunk="I-NP" />
     <token string="and" norm="and" pos="CC" chunk="I-NP" />
     <token string="quoted" norm="quote" pos="VBN" chunk="I-NP" />
     <token string="excerpts" norm="excerpt" pos="NNS" chunk="I-NP" />
     <token string="in" norm="in" pos="IN" chunk="B-PP" />
     <token string="TOSIR" norm="tosir" pos="NN" chunk="B-NP" />
     <token string="issues" norm="issue" pos="NNS" chunk="I-NP" />
     <token string="are" norm="be" pos="VBP" chunk="B-VP" />
     <token string="intended" norm="intend" pos="VBN" chunk="I-VP" />
     <token string="for" norm="for" pos="IN" chunk="B-PP" />
     <token string="the" norm="the" pos="DT" chunk="B-NP" />
     <token string="non-profit" norm="non-profit" pos="JJ" chunk="I-NP" />
     <token string="," norm="" pos="," chunk="I-NP" />
     <token string="background" norm="background" pos="NN" chunk="I-NP" />
     <token string="use" norm="use" pos="NN" chunk="I-NP" />
     <token string="by" norm="by" pos="IN" chunk="B-PP" />
     <token string="members" norm="member" pos="NNS" chunk="B-NP" />
     <token string="of" norm="of" pos="IN" chunk="I-NP" />
     <token string="the" norm="the" pos="DT" chunk="I-NP" />
     <token string="United" norm="unite" pos="NNP" chunk="I-NP" />
     <token string="States" norm="state" pos="NNPS" chunk="I-NP" />
     <token string="intelligence" norm="intelligence" pos="NN" chunk="I-NP" />
     <token string="and" norm="and" pos="CC" chunk="I-NP" />
     <token string="law" norm="law" pos="NN" chunk="I-NP" />
     <token string="enforcement" norm="enforcement" pos="NN" chunk="I-NP" />
     <token string="communities" norm="community" pos="NNS" chunk="I-NP" />
     <token string="in" norm="in" pos="IN" chunk="B-PP" />
     <token string="furtherance" norm="furtherance" pos="NN" chunk="B-NP" />
     <token string="of" norm="of" pos="IN" chunk="I-NP" />
     <token string="their" norm="their" pos="PRP$" chunk="I-NP" />
     <token string="professional" norm="professional" pos="JJ" chunk="I-NP" />
     <token string="dutiesand" norm="dutiesand" pos="CC" chunk="O" />
     <token string="the" norm="the" pos="DT" chunk="B-NP" />
     <token string="summaries" norm="summary" pos="NNS" chunk="I-NP" />
     <token string="and" norm="and" pos="CC" chunk="I-NP" />
     <token string="quoted" norm="quote" pos="VBN" chunk="I-NP" />
     <token string="excerpts" norm="excerpt" pos="NNS" chunk="I-NP" />
     <token string="are" norm="be" pos="VBP" chunk="B-VP" />
     <token string="subject" norm="subject" pos="JJ" chunk="B-ADJP" />
     <token string="to" norm="to" pos="TO" chunk="B-PP" />
     <token string="the" norm="the" pos="DT" chunk="B-NP" />
     <token string="copyright" norm="copyright" pos="NN" chunk="I-NP" />
     <token string="protections" norm="protection" pos="NNS" chunk="I-NP" />
     <token string="associated" norm="associate" pos="VBN" chunk="B-VP" />
     <token string="with" norm="with" pos="IN" chunk="B-PP" />
     <token string="the" norm="the" pos="DT" chunk="B-NP" />
     <token string="original" norm="original" pos="JJ" chunk="I-NP" />
     <token string="sources" norm="source" pos="NNS" chunk="I-NP" />
     <token string="." norm="" pos="." chunk="O" />
   </tokens>
   <extractions>
     <extraction id="1" text="TOSIR issues are intended for the non-profit , background use">
       <predicate start="8" end="11" text="are intended for" />
       <argument start="6" end="8" text="TOSIR issues" />
       <argument start="11" end="16" text="the non-profit , background use" />
     </extraction>
     <extraction id="2" text="the summaries and quoted excerpts are subject to the copyright protections">
       <predicate start="38" end="41" text="are subject to" />
       <argument start="33" end="38" text="the summaries and quoted excerpts" />
       <argument start="41" end="44" text="the copyright protections" />
     </extraction>
   </extractions>
   <types>
     <type descriptor="Nation" start="20" end="22" text="United States" />
     <type descriptor="StanfordLocation" start="20" end="22" text="United States" />
     <type descriptor="location" start="20" end="22" text="United States" />
   </types>
 </sentence>
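
To show why the text attributes are redundant, here is a minimal sketch that rebuilds the text of each predicate and argument from the token list alone, using the standard Java DOM parser. The class name SpanResolver and its methods are hypothetical and are not part of the rule runner. The embedded input is an abbreviated version of the example above, reduced to the ten tokens of the first extraction with the offsets re-indexed to match, and with the text attributes left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class SpanResolver {
    /** Abbreviated input: first extraction only, offsets re-indexed, no text attributes. */
    public static final String EXAMPLE =
        "<sentence id=\"1\"><tokens>"
        + "<token string=\"TOSIR\" norm=\"tosir\" pos=\"NN\" chunk=\"B-NP\" />"
        + "<token string=\"issues\" norm=\"issue\" pos=\"NNS\" chunk=\"I-NP\" />"
        + "<token string=\"are\" norm=\"be\" pos=\"VBP\" chunk=\"B-VP\" />"
        + "<token string=\"intended\" norm=\"intend\" pos=\"VBN\" chunk=\"I-VP\" />"
        + "<token string=\"for\" norm=\"for\" pos=\"IN\" chunk=\"B-PP\" />"
        + "<token string=\"the\" norm=\"the\" pos=\"DT\" chunk=\"B-NP\" />"
        + "<token string=\"non-profit\" norm=\"non-profit\" pos=\"JJ\" chunk=\"I-NP\" />"
        + "<token string=\",\" norm=\"\" pos=\",\" chunk=\"I-NP\" />"
        + "<token string=\"background\" norm=\"background\" pos=\"NN\" chunk=\"I-NP\" />"
        + "<token string=\"use\" norm=\"use\" pos=\"NN\" chunk=\"I-NP\" />"
        + "</tokens><extractions><extraction id=\"1\">"
        + "<predicate start=\"2\" end=\"5\" />"
        + "<argument start=\"0\" end=\"2\" />"
        + "<argument start=\"5\" end=\"10\" />"
        + "</extraction></extractions></sentence>";

    /** Joins the tokens in [start, end) with single spaces. */
    public static String spanText(List<String> tokens, int start, int end) {
        return String.join(" ", tokens.subList(start, end));
    }

    /** Reads the "string" attribute of every token element, in document order. */
    public static List<String> readTokens(Element sentence) {
        List<String> tokens = new ArrayList<>();
        NodeList nodes = sentence.getElementsByTagName("token");
        for (int i = 0; i < nodes.getLength(); i++) {
            tokens.add(((Element) nodes.item(i)).getAttribute("string"));
        }
        return tokens;
    }

    /** Returns one "tag: text" line per predicate or argument, rebuilt from the tokens. */
    public static List<String> resolveSpans(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element sentence = doc.getDocumentElement();
            List<String> tokens = readTokens(sentence);
            List<String> resolved = new ArrayList<>();
            NodeList extractions = sentence.getElementsByTagName("extraction");
            for (int i = 0; i < extractions.getLength(); i++) {
                NodeList parts = extractions.item(i).getChildNodes();
                for (int j = 0; j < parts.getLength(); j++) {
                    if (!(parts.item(j) instanceof Element)) continue; // skip whitespace text nodes
                    Element part = (Element) parts.item(j);
                    int start = Integer.parseInt(part.getAttribute("start"));
                    int end = Integer.parseInt(part.getAttribute("end"));
                    resolved.add(part.getTagName() + ": " + spanText(tokens, start, end));
                }
            }
            return resolved;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read sentence XML", e);
        }
    }

    public static void main(String[] args) {
        for (String line : resolveSpans(EXAMPLE)) System.out.println(line);
    }
}
```

Run on the abbreviated input, this prints "predicate: are intended for", "argument: TOSIR issues" and "argument: the non-profit , background use", which are the same strings the text attributes carry in the full example.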