Schema Enforcer

In my previous post I introduced you to GATE, the software I use and help to develop at work. Over the last ten years I've developed a number of processing resources (PRs are like plugins) for GATE. Some of these plugins have made it into the main GATE distribution (the Chemistry Tagger and the Noun Phrase Chunker being the most successful), whilst I've allowed others to slowly die. I still have quite a few, developed either for my own pet projects or for work, that should really be made available for everyone to use. The problem tends to be that they need cleaning up and documenting before they are released. I've now made a start on cleaning up the PRs that I think are useful, and in this post I'll introduce you to the first of these that I've managed to commit to the main GATE SVN repository: the Schema Enforcer.

The idea for the Schema Enforcer started to germinate in my head during a long afternoon trying to teach people how to manually annotate documents using GATE Teamware. In essence we want people who are familiar with a set of documents to mark up the entities within the documents that they believe are interesting or relevant to a given task. We then treat these manually annotated documents as a gold standard for evaluating automatic systems that create the same annotations.

It turns out that if you can pre-annotate the documents with an automatic system and have the annotators correct and add to the existing annotations, they not only find the task easier to understand but also tend to annotate a document more quickly, which usually saves us money.

When processing a document in GATE you tend to find that applications create a lot of annotations that are not actually required. For example, GATE creates a SpaceToken annotation over each blank space. These can be really useful when creating other more complex annotations but no human is ever going to need to look at them. So when pre-annotating documents for Teamware what I (and most other people) do is simply create a new annotation set into which we copy any annotation types which we are asking the annotators to create or correct (we usually do this using the Annotation Set Transfer PR rather than by hand). The problem with simply copying annotations from one set to another is that this does nothing to check that the annotation features conform to any set of guidelines. Whilst odd features are less of an issue than intermediate or temporary annotations, they can still be quite distracting.

In Teamware, when starting an annotation process, you specify the annotations that can be created using XML-based annotation schemas. These define the type of the annotation, its features, and for some features the set of permitted values. For example, here is a schema defining a Location annotation.

<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2000/10/XMLSchema">
  <element name="Location">
    <complexType>
      <attribute name="locType" use="required" value="other">
        <simpleType>
          <restriction base="string">
            <enumeration value="region"/>
            <enumeration value="airport"/>
            <enumeration value="city"/>
            <enumeration value="country"/>
            <enumeration value="county"/>
            <enumeration value="other"/>
          </restriction>
        </simpleType>
      </attribute>

      <attribute name="requires-attention" use="optional" type="boolean"/>
      <attribute name="comment" use="optional" type="string"/>
    </complexType>
  </element>
</schema>

You should be able to see from this that a Location annotation can have three features (referred to as attributes in the schema): locType, requires-attention, and comment. The last two features are fairly self-explanatory but the locType feature requires a little explanation. locType is an enumerated feature; that is, it can only take one of the six values specified in the schema. What this means is that an annotator cannot decide to create a Location annotation with a locType set to, for instance, beach, as that is not one of the defined values. In this case they would probably set locType to other and use the comment feature to say that it is actually a beach. Also note that locType is a required feature, which means you can't choose not to set its value.
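
To make the structure of the schema concrete, here is a minimal sketch in Python (not part of GATE, which is written in Java) that reads the Location schema above using only the standard library and lists each feature, whether it is required, and any permitted values. The function name read_schema and the dictionary layout are purely illustrative assumptions, not anything GATE provides.

```python
import xml.etree.ElementTree as ET

# The Location schema from this post, embedded so the sketch is self-contained.
SCHEMA = """<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2000/10/XMLSchema">
  <element name="Location">
    <complexType>
      <attribute name="locType" use="required" value="other">
        <simpleType>
          <restriction base="string">
            <enumeration value="region"/>
            <enumeration value="airport"/>
            <enumeration value="city"/>
            <enumeration value="country"/>
            <enumeration value="county"/>
            <enumeration value="other"/>
          </restriction>
        </simpleType>
      </attribute>
      <attribute name="requires-attention" use="optional" type="boolean"/>
      <attribute name="comment" use="optional" type="string"/>
    </complexType>
  </element>
</schema>"""

# Namespace prefix in the form ElementTree expects for tag lookups.
NS = "{http://www.w3.org/2000/10/XMLSchema}"


def read_schema(xml_text):
    """Return (annotation_type, {feature_name: rule}) for a single-element schema.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    element = root.find(NS + "element")
    features = {}
    for attr in element.iter(NS + "attribute"):
        # Enumerated features list their permitted values as <enumeration> children.
        values = [e.get("value") for e in attr.iter(NS + "enumeration")]
        features[attr.get("name")] = {
            "required": attr.get("use") == "required",
            "type": attr.get("type"),  # None when the type is given by an enumeration
            "values": values or None,
        }
    return element.get("name"), features


ann_type, features = read_schema(SCHEMA)
for name, rule in features.items():
    print(ann_type, name, rule)
```

Running this lists the three features of Location, with locType marked as required and carrying its six permitted values.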

The idea I had should now be obvious: why not use the schemas to drive the copying of annotations from one annotation set to another? After a little bit of experimenting this idea became the Schema Enforcer PR. Details of exactly how to use the PR can be found in the main GATE manual, but in essence the Schema Enforcer will copy an annotation if and only if:
  • the type of the annotation matches one of the supplied schemas, and
  • all required features are present and valid (i.e. each meets the conditions listed below for being copied to the 'clean' annotation).
Each feature of an annotation is copied to the new annotation if and only if:
  • the feature name matches a feature in the schema describing the annotation,
  • the value of the feature is of the same type as specified in the schema, and
  • if the schema defines the feature as an enumerated type, the value matches one of the permitted values.
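
The rules above can be sketched in a few lines of Python. This is only an illustration of the logic, assuming annotations are plain dictionaries and the schema has already been reduced to a dictionary of feature rules. The names SCHEMAS, feature_ok, and enforce are hypothetical, and the real PR is a Java component whose implementation may differ in detail.

```python
# Hypothetical stand-in for parsed schemas: annotation type -> feature rules.
SCHEMAS = {
    "Location": {
        "locType": {
            "required": True,
            "type": str,
            "values": {"region", "airport", "city", "country", "county", "other"},
        },
        "requires-attention": {"required": False, "type": bool, "values": None},
        "comment": {"required": False, "type": str, "values": None},
    }
}


def feature_ok(value, rule):
    """A value is valid if it has the right type and, for enumerated
    features, is one of the permitted values."""
    if not isinstance(value, rule["type"]):
        return False
    return rule["values"] is None or value in rule["values"]


def enforce(annotations, schemas):
    """Return cleaned copies of the annotations that conform to a schema."""
    clean = []
    for ann in annotations:
        rules = schemas.get(ann["type"])
        if rules is None:
            continue  # no schema describes this annotation type
        # Keep only features named in the schema whose values are valid.
        kept = {
            name: value
            for name, value in ann["features"].items()
            if name in rules and feature_ok(value, rules[name])
        }
        # Copy the annotation only if every required feature survived.
        if all(name in kept for name, rule in rules.items() if rule["required"]):
            clean.append({
                "type": ann["type"],
                "start": ann["start"],
                "end": ann["end"],
                "features": kept,
            })
    return clean


# Made-up input purely for illustration.
source = [
    {"type": "SpaceToken", "start": 5, "end": 6, "features": {"kind": "space"}},
    {"type": "Location", "start": 0, "end": 5,
     "features": {"locType": "city", "rule": "Gazetteer1"}},
    {"type": "Location", "start": 10, "end": 15, "features": {"locType": "beach"}},
]

clean = enforce(source, SCHEMAS)
print(clean)
```

Here only the first Location survives, and it loses the stray rule feature. The SpaceToken is dropped because no schema describes it, and the second Location is dropped because beach is not a permitted locType, which leaves its required feature missing.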

I've now made use of this PR in two different projects and it really does make life easier. Not only can I be sure that the annotations people get to correct in Teamware actually match the annotation guidelines, but it also provides a really easy way of producing a 'clean' annotation set as the output of a GATE application. But don't just take my word for it!

"nice one, mark - very useful! i've had these problems before too, but used jape grammars instead - your approach is much nicer!"

"I think it would be nice if whoever gets to teach Teamware at FIG doesn't get snagged by the non-standard annotations that came up on Tuesday. ;-)"

So if you already develop GATE applications and think that you'd like to add the Schema Enforcer to your pipeline, you can find it in the main GATE SVN repository or just grab a recent nightly build.
