An event is a dynamic situation that unfolds. Most events describe an interaction or relationship between objects. A span within a document that refers to a single event is an event mention. Sample event mentions include:
Once the schema is defined, use Rosette Adaptation Studio (RAS) to annotate documents containing event mentions to train an events model. Event mentions in text have multiple components with nuanced relationships to each other, making annotating events much more complex than annotating for named entity recognition and extraction.
Event recognition analyzes unstructured text and extracts event mentions. When extracting event mentions from text, each mention is of a specific, pre-defined type. Each type has a schema which specifies the details, including key phrases and roles, that characterize the event type. Each key phrase and role uses extractors to define how the object is extracted from the text.
The first task in defining your event schema is defining the set of event types you want to recognize. When extracting events, you don't extract all possible event types; you only extract the event types of interest. It's important to recognize which sorts of events are significant and will be mentioned frequently in your domain. Consider the set of entities and events that are likely to be mentioned in the documents you will be analyzing. The goal is to train a model to extract only the event types that are meaningful to your operation.
For example, if you're analyzing financial documents, bankrupting and acquiring events will be of interest, but flying or battle events would not be of interest. If, however, you're analyzing travel blogs, recognizing flying events may be important. Alternatively, if you're analyzing military reports, flying and battle events may be relevant.
Let's consider a project that tracks troop movements and battles between military units: troop_movement and battle may be the only two event types you need.
You could make these event types more granular (e.g. aerial_battle, tank_battle, etc.), but a larger number of possible event types will result in a smaller number of training examples for each type. This will make it more difficult for a machine learning classifier to learn the patterns. A better solution might be to have a more generic event type (like battle) that captures more fine-grained distinctions with roles such as mode_of_battle indicating the concepts aerial, tank, or artillery. In general, you should try to create as few event types as possible to extract the information you’re really interested in.
Each event type has one or more key phrases. A key phrase is a word in the text that evokes the given event type. Rosette uses key phrases to identify candidate event mentions in the text.
Let's consider the troop_movement example. If you were reading a document, what words would you look for to indicate that it was discussing troop movements? Drove, flew, took off, landed, arrived, moved are all potential key phrases.
Looking at the keyword flew in more detail, what about other tenses of the word flew? Words like flying, fly, flies. You don't want to have to list every possible version of the word.
Rosette identifies candidate keywords in your documents by using an extractor. For key phrases, the extractor will look for the exact words or the lemmas of the words you define.
Rosette uses the candidate key phrase to identify event types in the text.
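The lemma-based matching described above can be illustrated with a minimal Python sketch. The `LEMMAS` table and the `find_key_phrase_candidates` function are hypothetical stand-ins written for this example; they are not part of Rosette, which relies on its own morphological analysis.

```python
# Toy lemma table (assumption); a real system would use a morphological analyzer.
LEMMAS = {
    "flew": "fly", "flying": "fly", "flies": "fly", "fly": "fly",
    "moved": "move", "moving": "move", "moves": "move", "move": "move",
}

def find_key_phrase_candidates(text, key_lemmas):
    """Return (position, token) pairs whose lemma is one of key_lemmas."""
    candidates = []
    for position, raw in enumerate(text.split()):
        token = raw.strip(".,;:!?").lower()
        lemma = LEMMAS.get(token, token)  # fall back to the token itself
        if lemma in key_lemmas:
            candidates.append((position, token))
    return candidates

print(find_key_phrase_candidates("The unit flew north, then moved east.", {"fly", "move"}))
# [(2, 'flew'), (5, 'moved')]
```

Listing only the lemmas `fly` and `move` is enough to pick up the inflected forms, which is the reason you do not have to enumerate every version of a word.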
Event mentions include more objects than the key phrase. These other objects are usually entity mentions, i.e. people, places, times, and other mentions which add detail to the key phrase. For a flying event, with a key phrase of flew, you may want to know who flew? Where did they go? When did they go? What kind of aircraft did they fly on? The people, locations, times, and aircraft are all entity mentions that have roles.
Roles detail how the entity mention relates to the event. They answer the questions: What does this entity do in the event? What role does it play?
Let's look at a troop_movement event. What types of entities might we expect to find? What types of roles? Some possible roles include:
- Mover: the people or organization moving
- Origin: where the trip originates
- Destination: where the trip ends
- Mode of transportation: the vehicle used in the movement
- Date: the date of the movement
There may be more roles in an event mention than you are interested in capturing. For example, let's assume you want to know who flew, but you don't care about when they flew. You would define the role of traveler, but would not define a role for date or time. Part of defining the schema for an event model is determining which roles are important to your organization and task.
Roles are generic categories, such as traveler, origin, and destination. When annotating event mentions, you tag extracted entities with the role they perform in the event mention. Extractors define the rules used to extract role candidates from text.
A role can be required or optional. If required, an event mention will not be extracted without the role. You should only mark a role as required if it must always be in the event mention. Let's look at some examples for a flight scenario.
Bob flew from Boston to Los Angeles on Wednesday.
The key phrase and roles are:
- Key phrase: flew
- Traveler: Bob
- Origin: Boston
- Destination: Los Angeles
- Date: Wednesday
Let's assume the destination is marked as required in the schema definition. In this case, only one of the following event mentions will be extracted.
Bob flew to Los Angeles.
Bob flew from Boston on Wednesday.
The second event mention will not be extracted, since it does not contain the required role, even if it is annotated.
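The effect of a required role can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Rosette's implementation: `RoleDef`, `EventMention`, `FLIGHT_ROLES`, and `is_extractable` are hypothetical names, and the role names follow the flight example above.

```python
from dataclasses import dataclass, field

@dataclass
class RoleDef:
    name: str
    required: bool = False

@dataclass
class EventMention:
    key_phrase: str
    roles: dict = field(default_factory=dict)  # role name -> text span

# Hypothetical flight schema: only the destination is required.
FLIGHT_ROLES = [
    RoleDef("traveler"),
    RoleDef("origin"),
    RoleDef("destination", required=True),
    RoleDef("date"),
]

def is_extractable(mention, role_defs):
    """A mention is kept only if every required role is filled."""
    return all(r.name in mention.roles for r in role_defs if r.required)

first = EventMention("flew", {"traveler": "Bob", "destination": "Los Angeles"})
second = EventMention("flew", {"traveler": "Bob", "origin": "Boston", "date": "Wednesday"})

print(is_extractable(first, FLIGHT_ROLES))   # True
print(is_extractable(second, FLIGHT_ROLES))  # False
```

Because the check looks only at required roles, the first mention is kept even though it has no origin or date.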
Determining Events and Key Phrases
It can be difficult to define when you need to separate the events you are trying to extract into different event types. Some events might be very similar, but the roles in the event have different perspectives to the key phrase. In this case, you will want to create separate event types. Otherwise, the model may have difficulty determining the correct roles.
For example, let's consider a commerce event for buying and selling show tickets. One way to model this would be to create a single commerce event that includes both buying and selling.
- Event: commerce event
- Key Phrases: buy, obtain, sell, distribute
- Roles: buyer, seller, show
Let's consider a couple of events:
In these examples, the model will have difficulty correctly identifying the buyer and the seller if both belong to the same event type. The event model cannot distinguish the different perspectives the roles may have based on the key phrase; all key phrases in a single event type are expected to have the same relationship to the roles.
Therefore, we strongly recommend that when key phrases have different relationships to the roles, they should be separated into separate event types.
- Event: selling event
- Key Phrases: sell, distribute
- Roles: buyer, seller, show
A similar example would be the events entering and exiting. While they may have the same roles (person, from location, to location, time), the perspective of the person to the locations is different for each key phrase.
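As a rough sketch, the split can be pictured as two schema entries that share roles but own different key phrases. The dictionary layout and the `event_type_for` helper are hypothetical, and the key phrases listed for the buying event are an assumption (the commerce key phrases that remain once sell and distribute are moved to the selling event).

```python
SHARED_ROLES = ("buyer", "seller", "show")

# Hypothetical representation: one entry per perspective.
EVENT_TYPES = {
    "buying": {"key_phrases": {"buy", "obtain"}, "roles": SHARED_ROLES},   # assumed key phrases
    "selling": {"key_phrases": {"sell", "distribute"}, "roles": SHARED_ROLES},
}

def event_type_for(key_lemma, event_types):
    """Return the name of the event type that owns this key phrase, or None."""
    for name, definition in event_types.items():
        if key_lemma in definition["key_phrases"]:
            return name
    return None

print(event_type_for("sell", EVENT_TYPES))  # selling
print(event_type_for("buy", EVENT_TYPES))   # buying
```

With this layout, every key phrase in an event type stands in the same relationship to the buyer and seller roles.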
Rosette has multiple techniques to identify candidate key phrases and roles in text. For example, it can match a list of words, or it can match all the lemmas for a given word. Using Rosette Entity Extractor (REX), it can also identify entity mentions of specific entity types. Extractors define the rules and techniques used to identify role and key phrase candidates in the text. While any extractor type can be used to define roles, only morphological extractors can be used to identify key phrase candidates.
Once defined, extractors are reusable in multiple schemas. An extractor named location may be defined as the standard REX entity type Location. It could be used in troop_movement events as well as travel events, as each of them has roles involving locations.
The currently supported extractor types are:
- Entity: A list of REX entity types. You can use the standard, pre-defined REX entity types or train a custom model to extract other entity types. The custom model must be loaded in Rosette Server to define an entity extractor with custom entity types.
- Semantic: A list of words or phrases. Any word whose meaning is similar to one of these words will match. For example, an extractor of meeting will match assembly, gathering, conclave. Rosette uses word vector similarity to identify similar words. While a semantic extractor can be defined by a phrase, it will only identify single words as candidate roles.
- Morphological: A list of words. When a word is added to this list, it is immediately converted to and stored as its lemma. Words with the same lemmatization will match. For example, a morphological extractor for go will match going, went, goes, gone. This is the only extractor type valid for key phrases.
- Exact: A list of words or phrases. Exact will match any words on the list, whether they are identified as entity types or not. For example, you could have a list of common modes of transportation, including armored personnel carrier and specific types of tanks.
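The four extractor types can be compared side by side in a minimal Python sketch. Everything here is an assumption made for illustration: the toy entity table, lemma table, and two-dimensional word vectors stand in for REX, Rosette's lemmatizer, and its word vectors; the 0.9 similarity threshold is arbitrary; items are plain strings; and the sketch only looks at single tokens.

```python
import math

# Toy resources (assumptions) standing in for REX, a lemmatizer, and word vectors.
TOY_ENTITIES = {"boston": "LOCATION", "bob": "PERSON"}
TOY_LEMMAS = {"went": "go", "going": "go", "goes": "go", "gone": "go"}
TOY_VECTORS = {"meeting": (0.9, 0.1), "assembly": (0.85, 0.2), "tank": (0.1, 0.95)}

def cosine(a, b):
    """Cosine similarity between two 2-d vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def matches(extractor, token):
    """Return True if the token is a candidate for this extractor."""
    kind, items = extractor["kind"], extractor["items"]
    word = token.lower()
    if kind == "entity":          # token was tagged with one of the listed entity types
        return TOY_ENTITIES.get(word) in items
    if kind == "morphological":   # token shares a lemma with a listed word
        return TOY_LEMMAS.get(word, word) in items
    if kind == "exact":           # token appears on the list verbatim
        return word in items
    if kind == "semantic":        # token's vector is close to a listed word's vector
        vec = TOY_VECTORS.get(word)
        return vec is not None and any(
            item in TOY_VECTORS and cosine(vec, TOY_VECTORS[item]) >= 0.9
            for item in items
        )
    raise ValueError(f"unknown extractor kind: {kind}")

print(matches({"kind": "morphological", "items": ["go"]}, "went"))       # True
print(matches({"kind": "semantic", "items": ["meeting"]}, "assembly"))   # True
print(matches({"kind": "semantic", "items": ["meeting"]}, "tank"))       # False
```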
Rosette event extraction takes advantage of the advanced entity extraction capabilities provided by Rosette Entity Extractor (REX). REX uses pre-trained statistical models to extract the following entity types:
- Location
- Organization
- Person
- Title
- Product
You can also use custom-trained entity extraction models, trained by the Rosette Model Training Suite, to extract additional entity types. These models are loaded into Rosette Server. They can be called in the default configuration or through a custom profile.
REX also includes rule-based extractors, including statistical regex extractors that can extract additional entity types such as:
- Date
- Time
- Credit Card numbers
- Phone Numbers
The rule-based extractors are not returned by default. To use rule-based REX extractors, modify the supplementalRegularExpressionPaths setting in the REX configuration file (rex-factory-config.yaml). You can also add custom regex files to create new Exact extractors.
Note
Any models, gazetteers, and regular expressions used when training a model must also be used when performing event extraction. Use the same custom profile to configure REX for model training and event extraction. The custom profile is set in the schema definition for event model training.
Role types define the rules that are used to identify a piece of text as a candidate for a specific role or key phrase. A role type is made up of one or more extractors and is reusable.
Multiple extractors can be included in a role type definition. They are combined as a union - all possible candidates extracted are included.
Example 1. Combining Extractors
This definition matches any REX location, one synonym for New York City, or any words with meanings similar to city or state.
{entities: [LOC], exact: ["the big apple"], semantic: ["city", "state"]}
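The union behavior can be sketched in Python as follows. This is a simplified illustration, not Rosette code: each extractor is modeled as a hypothetical predicate over a candidate span, and the two lookup sets merely pretend to be REX output and word-vector similarity results.

```python
# Hypothetical stand-ins for real extractor results.
REX_LOCATIONS = {"boston", "los angeles"}        # spans REX is assumed to tag as LOC
SIMILAR_TO_CITY_OR_STATE = {"town", "province"}  # assumed vector-similarity hits

def entity_loc(span):
    return span.lower() in REX_LOCATIONS

def exact_list(span):
    return span.lower() in {"the big apple"}

def semantic_city_state(span):
    return span.lower() in SIMILAR_TO_CITY_OR_STATE

def role_type_candidates(spans, extractors):
    """Union semantics: keep a span if at least one extractor accepts it."""
    return [s for s in spans if any(accepts(s) for accepts in extractors)]

spans = ["Bob", "Boston", "the Big Apple", "town", "Wednesday"]
location_role_type = [entity_loc, exact_list, semantic_city_state]
print(role_type_candidates(spans, location_role_type))
# ['Boston', 'the Big Apple', 'town']
```

Adding another extractor to the list can only widen the set of candidates; it never removes a span that an existing extractor already accepts.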
Example 2. Defining the Movers Role Type
Let's look at another example. Let's say we want to identify the movers in the troop movement schema. What are potential movers?
- People: Any entity extracted as a person. This could be defined as an entity extractor named person-entity.
- Groups: Specific troop organizations, such as special forces or battalion or squad. This could be defined as an exact extractor, a list of terms you've identified as groups that are movers. Let's call this extractor troop_groups.
The movers role type would include both the person-entity and troop_groups extractors. The mover role would have the role type of movers. This role type could be used by other roles as well.
Role types are also used to define the rules for extracting key phrase candidates. How would we extract the key phrases for the troop movement schema? What are some key phrases we could be looking for? Words such as fly, drive, and move are all potential key phrases. We would also want other tenses and versions of those words. For example, given move, we would also want to extract moved, moves, moving.
You could define a single morphological extractor move-morphological-key that lists the specific words and specifies that all lemmas for the words are also matches. The troop_movement_key role type would use the move-morphological-key extractor.
Example 3. move-morphological-key extractor
{
  "name": "move-morphological-key",
  "kind": "morphological",
  "items": [
    {"surface-form": "march"},
    {"surface-form": "fly"},
    {"surface-form": "drive"},
    {"surface-form": "traverse"},
    {"surface-form": "move"}
  ]
}
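As a hedged illustration of how such a definition might be consumed, the Python sketch below parses the extractor and enforces the rule that key phrases may only use morphological extractors. The enclosing braces around the definition and the `validate_key_phrase_extractor` helper are assumptions made for this example; they are not taken from Rosette.

```python
import json

# The Example 3 definition, wrapped in braces (assumption) so it parses as one JSON object.
EXTRACTOR_JSON = """
{
  "name": "move-morphological-key",
  "kind": "morphological",
  "items": [
    {"surface-form": "march"},
    {"surface-form": "fly"},
    {"surface-form": "drive"},
    {"surface-form": "traverse"},
    {"surface-form": "move"}
  ]
}
"""

def validate_key_phrase_extractor(definition):
    """Reject any extractor kind other than morphological for key phrases."""
    if definition["kind"] != "morphological":
        raise ValueError(
            f"{definition['name']}: key phrases need a morphological extractor"
        )
    return True

extractor = json.loads(EXTRACTOR_JSON)
surface_forms = [item["surface-form"] for item in extractor["items"]]

print(validate_key_phrase_extractor(extractor))  # True
print(surface_forms)  # ['march', 'fly', 'drive', 'traverse', 'move']
```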
Role types are generic categories, while role mentions are specific instances of those categories. Extractors define the specific rules to extract the role candidates. Extractors are combined into role types.
The events schema describes the events and roles you want to extract. Defining an ontology of event types and roles, along with the extractors that will find the events and roles in the text, is a complex and difficult task. It must be complete and accurate before you start training models. Semantic frames are one type of structured representation of a situation involving various participants and roles, such as an event. Existing resources which define semantic frames can be helpful in identifying and describing your event schema.
Let's consider a system that tracks troop movements and battles between military units. In this case, TROOP_MOVEMENT and BATTLE may be the only two event types necessary.
Once you have selected an event type, try writing a general description of what you want to extract for the event type. Think about the most basic elements of events of that type and how they relate to each other. The goal is to be able to describe the event in a way that generalizes all the possible event mentions that you will want to extract.
In a TROOP_MOVEMENT event, one or more soldiers (MOVER) move from one location (ORIGIN) to another (DESTINATION), possibly in some sort of vehicle (MODE_OF_TRANSPORTATION).
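One way to check that a description like this is complete is to write it down as a small data structure. The sketch below uses hypothetical Python classes (`Role`, `EventType`); the role type names `location` and `transport_modes` are assumed for illustration, and this is not the format Rosette uses to store schemas.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Role:
    name: str
    role_type: str          # name of the reusable role type that finds candidates
    required: bool = False

@dataclass(frozen=True)
class EventType:
    name: str
    key_role_type: str      # role type used to find key phrase candidates
    roles: tuple

TROOP_MOVEMENT = EventType(
    name="TROOP_MOVEMENT",
    key_role_type="troop_movement_key",
    roles=(
        Role("MOVER", "movers"),
        Role("ORIGIN", "location"),                            # assumed role type name
        Role("DESTINATION", "location"),                       # assumed role type name
        Role("MODE_OF_TRANSPORTATION", "transport_modes"),     # assumed role type name
    ),
)

print([role.name for role in TROOP_MOVEMENT.roles])
# ['MOVER', 'ORIGIN', 'DESTINATION', 'MODE_OF_TRANSPORTATION']
```

Each element named in the prose description should appear as exactly one role; anything you cannot place is a sign that the description or the schema needs another pass.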
Defining an ontology of event types and role relations that is consistent and extends to all possible cases is very challenging. We strongly recommend you research existing resources, such as FrameNet, before trying to build something from scratch. It is possible that an ontology specific to your domain is already available somewhere. You can also use the existing event types in an ontology like FrameNet as inspiration for designing your own event types, as well as a verification check for the schema you define.
Resources for Modeling Your Domain
When modeling events for annotating in English, it may be helpful to rely on existing resources, such as the semantic frames proposed by FrameNet. A semantic frame is a sort of prototype for a situation, and the English FrameNet provides a dictionary of semantic frames and annotated examples. Some semantic frames and their constituent frame elements may align well with your concept of an event type and its roles.
If we look at the description of the “Motion” frame in FrameNet, we observe something fairly similar to what we came up with for TROOP_MOVEMENT:
Some entity (Theme) starts out in one place (Source) and ends up in some other place (Goal), having covered some space between the two (Path). Alternatively, the Area or Direction in which the Theme moves or the Distance of the movement may be mentioned.
The Motion frame describes more general movement events, not just troop movements, so obviously there is no mention of soldiers. But apart from this, there is a clear correspondence between many of the frame element types in Motion and the role relations in TROOP_MOVEMENT:
- Theme: MOVER
- Source: ORIGIN
- Goal: DESTINATION
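The correspondence above is simple enough to record as a lookup table. The following Python sketch is purely illustrative: `MOTION_TO_TROOP_MOVEMENT` and `to_roles` are hypothetical names, and the annotated frame elements in the usage example are invented sample text.

```python
# Frame element -> role mapping taken from the list above.
MOTION_TO_TROOP_MOVEMENT = {
    "Theme": "MOVER",
    "Source": "ORIGIN",
    "Goal": "DESTINATION",
}

def to_roles(frame_elements, mapping):
    """Rename frame elements to schema roles; drop elements with no mapped role."""
    return {
        mapping[name]: text
        for name, text in frame_elements.items()
        if name in mapping
    }

# Invented sample annotation for illustration only.
annotated = {"Theme": "the battalion", "Source": "the base", "Goal": "the border", "Path": "through the valley"}
print(to_roles(annotated, MOTION_TO_TROOP_MOVEMENT))
# {'MOVER': 'the battalion', 'ORIGIN': 'the base', 'DESTINATION': 'the border'}
```

Frame elements without a counterpart in the schema, such as Path here, are simply left out.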
FrameNet uses terminology from the linguistic field of semantic role labeling, but you are free to name your role relations whatever is most descriptive and intuitive for developers and annotators.
You’ll also notice that Motion has more role types than TROOP_MOVEMENT (Path, Area, Direction, Distance). These may give you ideas for roles that you haven’t considered, but might be useful for your application. On the other hand, some of these roles are optional and appear very infrequently, which makes them difficult for a machine learning model to learn; this is exacerbated by the fact that some roles apply to long spans of text with complex grammatical constructions. We recommend keeping your list of roles as short and simple as possible to model the information you need, while ensuring that there are many examples of each role in the data you annotate.
Tip
If you can’t find anything in FrameNet that aligns with your description of an event type and roles, that’s an indication that the event type may be ill-defined.