2000 - Automatic recognition of name entities for information extraction and retrieval

"Automatic recognition of name entities for information extraction and retrieval" (in collaboration). 
In Proceedings of the 21st Symposium on Applied Linguistics, organised by the Department of Philology and Linguistics, Aristotle University of Thessaloniki, pp. 131-143.

This article describes a system, currently under development, for the automatic recognition of named entities in free text. The system was developed within the project "PENED99 - ECONOMY" and is intended to be incorporated into information extraction and retrieval systems. The task consists of recognizing and classifying named entities (persons, organizations, place names, temporal expressions, numerical expressions) in accordance with the standards of the international conferences for the evaluation of information extraction systems (Message Understanding Conferences, MUC), adapted to Greek data.

In the first stage, the recognition system receives as input a text that has already passed through the stages of surface-structure identification (sentences, words, abbreviations, etc.) and grammatical classification. Named entities are then recognized by means of lists of known names and methods for recognizing unknown words. In the second stage, the partially labeled text is passed through a sequence of pattern rules. In the third and last stage, the final recognition and categorization of the named entities is carried out using a pattern grammar based on finite-state techniques, in which the rules are translated into finite automata using known parsing techniques. Finally, a reference memory stores, at every moment, the alternative formulations of each named entity known so far. For the development of the tool (training of the lists, extraction of the grammatical rules), a corpus of about 120,000 words was used, and a corpus of 30,000 words was used for the initial evaluation. This task required the design and implementation of a prototype database for the collection and recording of the 120,000-word corpus, which was built by the author.
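
The staged recognition described above can be illustrated with a minimal sketch. It is not the authors' implementation: the gazetteer entries, the title list, the single rule, and the function names are all hypothetical, the toy data is English rather than Greek, and Python regular expressions stand in for the compiled finite automata. Each token is reduced to a one-letter class so that a pattern match over the class string maps directly back to token positions.

```python
import re

# Hypothetical gazetteer of known names (stage 1); real lists would be far larger.
GAZETTEER = {"Thessaloniki": "LOCATION", "Athens": "LOCATION"}

# Hypothetical trigger words that announce a person name.
PERSON_TITLES = {"Mr.", "Mrs.", "Dr."}

# Hypothetical pattern rule (stages 2 and 3): a title followed by one or more
# unknown capitalised words is a PERSON. The regex plays the role of an automaton.
RULES = [(re.compile(r"T(C+)"), "PERSON")]


def classify(token, label):
    """Map a token to a one-letter class: K known, T title, C capitalised, o other."""
    if label is not None:
        return "K"
    if token in PERSON_TITLES:
        return "T"
    if token[:1].isupper():
        return "C"
    return "o"


def recognise(tokens):
    # Stage 1: label tokens found in the list of known names.
    labels = [GAZETTEER.get(t) for t in tokens]
    classes = "".join(classify(t, l) for t, l in zip(tokens, labels))

    # Stages 2 and 3: apply the pattern rules to the class string.
    memory = {}  # reference memory: name parts seen so far and their category
    for pattern, category in RULES:
        for match in pattern.finditer(classes):
            for i in range(match.start(1), match.end(1)):
                labels[i] = category
                memory[tokens[i]] = category

    # Reference memory: label shorter mentions of an entity already recognised.
    for i, token in enumerate(tokens):
        if labels[i] is None and token in memory:
            labels[i] = memory[token]
    return list(zip(tokens, labels))


if __name__ == "__main__":
    text = "Mr. Nikos Papadopoulos visited Thessaloniki . Papadopoulos spoke ."
    for token, label in recognise(text.split()):
        print(token, label)
```

In this sketch the second mention, "Papadopoulos" on its own, matches no rule and is labeled only because the reference memory recorded it when the full name was recognized.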


Get the full paper here - Aristotle University of Thessaloniki Digital repository
