DutchSemCor

DutchSemCor: (“Dutch corpus with word senses from the Cornetto database”): NWO-project 380-70-011 (2009-2012)

Project coordinator of DutchSemCor: which aims to deliver a one-million word Dutch corpus that is fully sense-tagged with senses and domain tags from the Cornetto database (STEVIN project STE05039). 250K words of this corpus will be manually tagged. The remainder will be automatically tagged using three different word-sense-disambiguation systems (WSD), and will be validated by human annotators. The corpus data will be based on existing corpus material collected in the projects CGN, D-CoI and SoNaR. These corpora have already been automatically annotated with morpho-syntactic tags and structures. The corpora will be extended where necessary to find sufficient examples for meanings of words that are less frequent and do not appear in the above corpora.

The resulting corpus, for which we aim to offer the same balance in types of text as these basic resources, will be extremely rich in terms of lexical semantic information. Its availability will enable many new lines of research and technology developments for the Dutch language. In particular, it will enable research into the relation between language form and language interpretation, and as such it will be applicable in the fields of cognitive science, (psycho-)linguistics, language learning and language teaching, semantic web applications, information retrieval, machine translation, text mining, and document interpretation (summarization, topic segmentation).

We foresee that the corpus will create new directions of research and technology development on a par with current developments for English. (News VUA.)

CLTL

the Computational Linguistics & Text Mining Lab

DutchSemCor