Lasp I - Statistical language and speech processing
This project aims at building a Hidden Markov Model POS tagger and testing it on actual data. It is developed as a Midterm Project for Language and Speech Processing at the University of Amsterdam, Netherlands, 2005/2006.
Team members:
- Gideon Borensztajn
- Jiří Iša
Introduction
Current technology places a strong emphasis on automatic text understanding, information retrieval and communication with the user. The Language and Speech Processing lectures aim to explain statistical language modelling. To give students a better and deeper understanding of the topic, they are asked to develop a Midterm Project.
The Midterm Project consists of several steps. Each of them involves several techniques that bring us closer to the final goal - a Part Of Speech Tagger (POSTagger).
The difficulty of language processing arises mainly from ambiguity - the fact that a word may have several meanings and several part of speech tags. For instance, "list" might be a noun or a verb. If several ambiguities are combined in a single sentence, it may become extremely difficult to understand the text automatically.
There are several approaches to text processing. One of them is to write the grammar of the language manually. This has proved to be very difficult, time consuming and sensitive to slight changes. The other way is to use supervised learning methods. The ultimate goal is to develop a learner that, given a training text tagged by a human, can extract the language knowledge by itself. One such option is Hidden Markov Models (HMMs), which are based on statistics gathered from the text.
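To make the idea concrete, here is a minimal sketch of a bigram HMM tagger in Python. It is not the implementation from the report: the function names (train, log_prob, viterbi), the toy corpus, the tag names and the add-one smoothing are all assumptions chosen for illustration. Training counts tag-to-tag transitions and tag-to-word emissions in human-tagged sentences, and tagging picks the most probable tag sequence with the Viterbi algorithm.

```python
import math
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence boundary markers


def train(tagged_sentences):
    """Count tag transitions and word emissions in human-tagged sentences."""
    transitions = defaultdict(Counter)  # previous tag -> next tag -> count
    emissions = defaultdict(Counter)    # tag -> word -> count
    vocabulary = set()
    for sentence in tagged_sentences:
        previous = START
        for word, tag in sentence:
            transitions[previous][tag] += 1
            emissions[tag][word] += 1
            vocabulary.add(word)
            previous = tag
        transitions[previous][END] += 1
    return transitions, emissions, vocabulary


def log_prob(counter, key, num_outcomes):
    """Add-one smoothed log probability (an assumed, very simple smoothing)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + num_outcomes))


def viterbi(words, transitions, emissions, vocabulary):
    """Return the most probable tag sequence for a list of words."""
    if not words:
        return []
    tags = list(emissions)
    n_tags = len(tags) + 1          # all tags plus the end marker
    n_words = len(vocabulary) + 1   # known words plus one slot for unknown words
    # best[i][tag] = (log score of the best path ending in tag, previous tag)
    best = [{}]
    for tag in tags:
        score = (log_prob(transitions[START], tag, n_tags)
                 + log_prob(emissions[tag], words[0], n_words))
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            emit = log_prob(emissions[tag], words[i], n_words)
            best[i][tag] = max(
                (best[i - 1][prev][0] + log_prob(transitions[prev], tag, n_tags) + emit, prev)
                for prev in tags
            )
    # Choose the best final tag, including the transition to the end marker.
    last = max(tags, key=lambda t: best[-1][t][0] + log_prob(transitions[t], END, n_tags))
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


# Toy training data (invented for this example), with "list" as noun and verb.
corpus = [
    [("the", "DET"), ("list", "NOUN"), ("is", "VERB"), ("long", "ADJ")],
    [("they", "PRON"), ("list", "VERB"), ("the", "DET"), ("names", "NOUN")],
    [("we", "PRON"), ("list", "VERB"), ("the", "DET"), ("items", "NOUN")],
    [("the", "DET"), ("names", "NOUN"), ("are", "VERB"), ("long", "ADJ")],
]
model = train(corpus)
print(viterbi(["the", "list"], *model))
print(viterbi(["they", "list", "the", "names"], *model))
```

On this toy corpus the tagger resolves the ambiguity of "list" from its context: it is tagged as a noun after "the" and as a verb after "they". A real tagger would need a much larger tagged corpus and a more careful treatment of unknown words than add-one smoothing.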
Report
The final report is available for download - midterm.pdf.