An entity is an object of potential interest, such as a person, organization, location, or date. When you process a document, locating its entities can help you classify the document and determine what kinds of relevant data it is likely to contain.
The Rosette entity extractor endpoint uses statistical models, pattern matching (regular expressions), and exact matching (gazetteers) to identify entities in input text.
Using contextual features specified by a computational linguist and a substantial body of news stories in which entities have been tagged by native speakers, Basis Technology has developed an averaged perceptron (AP) processor and statistical models for extracting a variety of entities in a number of languages.
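To give a feel for the averaged perceptron technique named above, here is a minimal Python sketch of a generic multiclass averaged perceptron. It is not Basis Technology's implementation: the function names, the feature strings (such as "word=Paris"), and the label set are all hypothetical, and a real extractor would use far richer contextual features and sequence-level decoding.

```python
from collections import defaultdict


def predict(weights, features, labels):
    """Return the label with the highest summed feature weight (first label wins ties)."""
    scores = {
        label: sum(weights.get((feature, label), 0.0) for feature in features)
        for label in labels
    }
    return max(labels, key=lambda label: scores[label])


def train_averaged_perceptron(examples, labels, epochs=5):
    """Train on (features, gold_label) pairs and return the averaged weights."""
    weights = defaultdict(float)  # (feature, label) -> current weight
    totals = defaultdict(float)   # (feature, label) -> sum of the weight over all steps
    steps = 0
    for _ in range(epochs):
        for features, gold in examples:
            guess = predict(weights, features, labels)
            if guess != gold:
                # Standard perceptron update: reward the gold label, penalize the guess.
                for feature in features:
                    weights[(feature, gold)] += 1.0
                    weights[(feature, guess)] -= 1.0
            steps += 1
            # Naive averaging: accumulate the full weight vector after every step.
            for key, value in weights.items():
                totals[key] += value
    return {key: total / steps for key, total in totals.items()}


# Hypothetical toy training data: hand-written features for three tokens.
examples = [
    (["word=Paris", "capitalized"], "LOCATION"),
    (["word=Alice", "capitalized", "prev=Ms."], "PERSON"),
    (["word=the"], "NONE"),
]
labels = ["NONE", "PERSON", "LOCATION"]
model = train_averaged_perceptron(examples, labels)
```

Averaging the weights over every training step, rather than keeping only the final values, tends to make the model less sensitive to the order of the last few examples it saw.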
PATTERN MATCHING PROCESSOR: REGULAR EXPRESSIONS
The entity extractor includes regular expressions for finding language-specific entities and generic entities that may appear in a variety of languages.
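As an illustration of pattern matching, the following Python sketch finds two kinds of generic entities with regular expressions. The patterns, the type names EMAIL and URL, and the function name are assumptions made for this example; they are not the expressions shipped with the entity extractor.

```python
import re

# Hypothetical patterns for illustration only; deliberately simplified.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "URL": re.compile(r"https?://[^\s<>\"]+"),
}


def find_pattern_entities(text):
    """Return (type, matched text, start, end) tuples sorted by start offset."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((entity_type, match.group(), match.start(), match.end()))
    return sorted(found, key=lambda item: item[2])


sample = "Write to info@example.com or visit https://example.com/docs today."
entities = find_pattern_entities(sample)
```

Because patterns like these do not depend on vocabulary, the same expression can be applied to text in many languages, whereas a language-specific pattern would be registered only for the language it targets.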
EXACT MATCHING PROCESSOR: GAZETTEERS
The entity extractor uses gazetteers to return exact matches. The distribution includes binary gazetteers for each language and a number of entity types, and a cross-language gazetteer for corporations. To match entities despite differences in whitespace, gazetteer entries and potential matches are space-normalized (any amount of whitespace between words is treated as a single space).
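The space normalization described above can be sketched in Python as follows. This is a simplified illustration that assumes a plain in-memory dictionary rather than the binary gazetteer format, and the entry "Acme  Widgets   Inc." and all function names are hypothetical.

```python
import re


def normalize_spaces(text):
    """Collapse every run of whitespace to a single space and trim the ends."""
    return " ".join(text.split())


def build_gazetteer(entries):
    """Map each space-normalized entry to its entity type."""
    return {normalize_spaces(name): entity_type for name, entity_type in entries}


def find_gazetteer_matches(text, gazetteer):
    """Return (type, normalized match) pairs for entries that occur in the text."""
    matches = []
    for name, entity_type in gazetteer.items():
        # Accept any run of whitespace wherever the normalized entry has one space.
        words = (re.escape(word) for word in name.split(" "))
        pattern = r"(?<!\w)" + r"\s+".join(words) + r"(?!\w)"
        for match in re.finditer(pattern, text):
            matches.append((entity_type, normalize_spaces(match.group())))
    return matches


# Hypothetical entry with irregular spacing, matched against text with a line break.
gazetteer = build_gazetteer([("Acme  Widgets   Inc.", "ORGANIZATION")])
hits = find_gazetteer_matches("She joined Acme Widgets\n Inc. in 2010.", gazetteer)
```

Normalizing both the entry and the candidate text means that a line break or double space inside a name does not prevent an otherwise exact match.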