Object extraction

DictaScope Tokenizer

A lexical analyzer that parses input text and outputs the set of labeled text objects (tokens) found in it.

When processing natural-language text, the module's main purpose is to identify text objects and facts, such as:

I. Objects

  • persons
  • positions (job titles)
  • teams
  • organizations (commercial and non-commercial)
  • districts
  • cities
  • regions / states
  • countries
  • geographical objects
  • dates
  • quantitative indicators
  • statements by persons
  • operating systems

II. Facts concerning persons

  • position
  • place of work
  • date of birth / age
  • place of birth

The identified objects and facts are reduced to a base form (normalized).

The module ships with sample rule sets for identifying and normalizing text objects of some of the listed categories, as well as facts.

Analysis in DictaScope Tokenizer is driven by rules written in a special-purpose language. You can create rule sets for extracting and normalizing text objects of arbitrary complexity, or adapt the existing ones to specific requirements (product names, biographical details, bibliographic references).
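The idea of rule-driven extraction with normalization can be illustrated with a minimal Python sketch. This is not the DictaScope rule language or API, neither of which is described here; the Rule and Token classes, the two sample rules, and the extract function are hypothetical names introduced only for illustration, and plain regular expressions stand in for real rules.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical structures for illustration; not part of DictaScope.
@dataclass
class Rule:
    category: str                          # object category, e.g. "date"
    pattern: re.Pattern                    # how to find the object in text
    normalize: Callable[[re.Match], str]   # how to reduce it to a base form

@dataclass
class Token:
    category: str
    text: str        # surface form as it appears in the input
    normalized: str  # base (normalized) form
    start: int
    end: int

MONTHS = {
    "january": 1, "february": 2, "march": 3, "april": 4,
    "may": 5, "june": 6, "july": 7, "august": 8,
    "september": 9, "october": 10, "november": 11, "december": 12,
}

def _normalize_date(m: re.Match) -> str:
    # "3 March 1975" -> "1975-03-03" (no calendar validation in this sketch)
    month = MONTHS[m.group("month").lower()]
    return f"{int(m.group('year')):04d}-{month:02d}-{int(m.group('day')):02d}"

def _normalize_quantity(m: re.Match) -> str:
    # "42.5km" -> "42.5 km"
    return f"{m.group('value')} {m.group('unit')}"

RULES: List[Rule] = [
    Rule(
        "date",
        re.compile(
            r"\b(?P<day>\d{1,2}) (?P<month>" + "|".join(MONTHS) + r") (?P<year>\d{4})\b",
            re.IGNORECASE,
        ),
        _normalize_date,
    ),
    Rule(
        "quantity",
        re.compile(r"\b(?P<value>\d+(?:\.\d+)?) ?(?P<unit>kg|km)\b"),
        _normalize_quantity,
    ),
]

def extract(text: str, rules: List[Rule] = RULES) -> List[Token]:
    """Apply every rule to the text and return tokens in order of appearance."""
    tokens = []
    for rule in rules:
        for m in rule.pattern.finditer(text):
            tokens.append(
                Token(rule.category, m.group(0), rule.normalize(m), m.start(), m.end())
            )
    return sorted(tokens, key=lambda t: t.start)
```

Adding a new object category in this sketch means appending one more Rule with its own pattern and normalizer, which loosely mirrors the idea of extending a rule set for product names or bibliographic references.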

Input format: plain text. Results can be output as XML.

The program requires a morphological dictionary to run.

The program is delivered as a dynamic library for Windows and FreeBSD.
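As a rough sketch of what XML output for extracted objects might look like, the Python snippet below serializes a list of objects with the standard xml.etree.ElementTree module. The element and attribute names (document, text, objects, object, category, start, end, normalized) and the tokens_to_xml function are assumptions made for this example; the actual DictaScope XML schema is not specified here.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Each object: (category, start offset, end offset, normalized form).
ExtractedObject = Tuple[str, int, int, str]

def tokens_to_xml(text: str, objects: List[ExtractedObject]) -> str:
    """Serialize the input text and its extracted objects to an XML string.

    The schema used here is hypothetical and chosen only for illustration.
    """
    root = ET.Element("document")
    ET.SubElement(root, "text").text = text
    container = ET.SubElement(root, "objects")
    for category, start, end, normalized in objects:
        element = ET.SubElement(
            container,
            "object",
            {
                "category": category,
                "start": str(start),
                "end": str(end),
                "normalized": normalized,
            },
        )
        # Keep the surface form as element content.
        element.text = text[start:end]
    return ET.tostring(root, encoding="unicode")
```

For example, passing the text "Born on 3 March 1975." with a single date object covering "3 March 1975" yields one object element whose content is the surface form and whose normalized attribute holds the base form.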