Please note that you need to log in to GitHub before you can access the thesis GitHub repositories
Master Linguistics: track Text Mining
Jan van Casteren (2020) Automatic Attribution Extraction From Dutch News Articles: A Beginning (full thesis ♦ thesis github ♦ research at: eScience Center – inside the filter bubble)
abstract: This thesis presents the first full experimental setup for researching Automatic Attribution Extraction from Dutch texts. The intended end products of the thesis work are: 1) reliable annotation guidelines for Attribution Extraction from Dutch, 2) two corpora (development and evaluation) and 3) two preliminary baseline classifiers: a rule-based classification system and a statistical classifier (SVM). As there is no previous work on the topic for the Dutch language in particular, the focal point of this project is the creation, evaluation, and improvement of the annotation guidelines. The development corpus is used to verify the quality of the annotation guidelines and to attempt to improve them. The evaluation corpus, then, is used to evaluate the performance of the preliminary baseline classifiers. Finally, we aim to provide suggestions for further development of the newly created environment for researching Automatic Attribution Extraction from Dutch news articles.
Peter Caine (2020). Mind the gap: A comparison of linguistic vs deep-learning approaches to aspect extraction and aspect category detection (full thesis ♦ thesis github)
abstract: Aspect Extraction (AE) and Aspect Category Detection (ACD) form two crucial subtasks in Aspect-Based Sentiment Analysis (ABSA). Since the beginning of this approach to sentiment analysis, two approaches have dominated the area. Early researchers adopted linguistic approaches based on other basic NLP tasks, such as POS-tagging, parsing, rules or ranking mechanisms (Marrese-Taylor and Matsuo, 2017). More recently, word embeddings have been employed as the primary features for input to deep learning (DL) neural networks. This thesis reviews studies and selects systems from both approaches to re-implement, with the aim of evaluating the output to uncover qualitative differences. It was found that, while the performance of DL systems was uniformly better on both tasks, inspection of the output revealed that the output of DL systems could often be unreliable and offer very few implementable strategies to improve performance, while the simplest linguistic system was much more amenable to targeted strategies. Some strategies are explored for synthesising the strengths of both approaches to boost performance statistics, although questions remain about what can be measured with these metrics and whether the pursuit of high-scoring systems might ultimately not be the most meaningful aim.
Luca Meima (2020) Finding potentially HIV defining conditions in medical reports (full thesis ♦ thesis github ♦ internship at https://mytomorrows.com/)
abstract: Human immunodeficiency virus (HIV) is still one of the most challenging global public health issues. An underlying problem is that 45-50% of the people living with HIV in Western Europe are diagnosed late. Therefore, Erasmus Medical Center (EMC) wants to optimize HIV testing. In collaboration with the company MyTomorrows, EMC aims to create a clinical decision support (CDS) system that selects patients presenting with a specific HIV indicator. These patients should be recommended an HIV test. HIV indicators are specific medical conditions, risk factors and medications associated with an HIV prevalence of >0.1%. These are stated in the electronic medical records (EMR). The project of my internship at MyTomorrows involves the first step in creating this system: identifying HIV indicators in Dutch narrative EMR data. For identification I use MyTomorrows' tool QuickUMLS, a medical concept extraction tool that maps biomedical terms to the Unified Medical Language System (UMLS). The contribution of this project lies in showing how this tool can facilitate the identification of HIV indicators. Accordingly, the research question of the internship project is: Which improvements can be made in identifying HIV indicator conditions, risk factors and related medications in Dutch narrative clinical data by using QuickUMLS as part of the CDS system? To answer this question, first the coverage of UMLS is checked. The results of processing the lists of HIV indicators show that the coverage of the Dutch lexicon is small. However, it demonstrates that using the English version is a useful addition to the Dutch lexicon, because it contains more risk factors and related medications. Secondly, results are obtained for processing 50 Dutch clinical notes with the Dutch and English UMLS versions separately and together. This method is complemented with regular expressions, which are used to correct for the HIV indicators left unrecognized by QuickUMLS.
The method is evaluated using the commonly used measures for extraction systems: precision, recall and F-score. Results show that evaluation scores are high when both versions and regular expressions are used. However, recall should be improved, for it is very important that all HIV indicators are recognized: if an HIV indicator is not recognized, the patient is not tested, which could have major implications if the patient turns out to have HIV. If the speed of the system is a prerequisite, this research recommends using only the Dutch lexicon and increasing its recall, since iterating over two lexicons takes up to 4.7 times as long. This can be done by adding the list of HIV indicators to the Dutch lexicon; because it is a finite list, this is a manageable task. Using regular expressions is not recommended, because they do not capture lexical variants. However, they are useful as long as the coverage of the lexicon has not yet been increased. The generalizability of the results is limited, since not all HIV indicators are captured in the 50 clinical notes. Further research should focus on deciding whether an identified HIV indicator is (still) a present HIV indicator for the patient.
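The regular-expression fallback described above can be sketched as follows. The indicator terms and patterns below are invented for illustration; they are not the thesis' actual indicator lists.

```python
import re

# Hypothetical fallback list: a few HIV indicator terms that the concept
# extractor may have missed, matched case-insensitively.
INDICATOR_PATTERNS = [
    r"\btuberculose\b",          # tuberculosis (Dutch)
    r"\bkaposi[- ]?sarcoom\b",   # Kaposi sarcoma (Dutch)
    r"\bcandidiasis\b",
]
compiled = [re.compile(p, re.IGNORECASE) for p in INDICATOR_PATTERNS]

def find_indicators(note):
    """Return the indicator terms found in a clinical note."""
    return [m.group(0) for rx in compiled for m in rx.finditer(note)]

print(find_indicators("Patient met orale candidiasis en tuberculose."))
```

Note that, as the abstract points out, such patterns do not capture lexical variants (e.g. misspellings or inflected forms) unless each variant is added explicitly.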
Eva Zegelaar (2020) An Automatic Emotion & Purpose Classifier for Dutch Tweets Written by Members of the Dutch Parliament (full thesis ♦ thesis github ♦ internship at: https://reddata.nl/)
abstract: Twitter has a considerable body of tweets posted by politicians, and research suggests that these can be influential in political affairs (Duncombe, 2019). In Red Data’s Haagsefeiten [“Facts of The Hague”] website archive, more than 1 million tweets by members of the Dutch Parliament are accessible. The goal of this thesis is to classify these tweets into meaningful emotion and purpose categories. The selection of the labels is based on an agreement study in which trial annotation rounds took place. The emotion category resulted in a Kappa score of 0.416 and the binary proactivity category in a Kappa score of 0.314. These scores are considered low; therefore a third category, polarity, was introduced, resulting in a Kappa score of 0.561. The agreement study shows that there is an identifiable presence of positive and negative labels, but less so for complex emotions and proactivity labels. Two systems were implemented in this study: a baseline SVM and a state-of-the-art CNN-BiLSTM with pre-trained Dutch word embeddings. The system that performed best is the SVM with polarity labels, scoring a weighted F1-score of 0.59 and an accuracy of 0.60. The research concludes that the SVM works well with less training data and well-distributed polarity labels. To improve the performance of the SVM, feature engineering and the implementation of word embeddings are two possible solutions. The CNN-BiLSTM needs more training data. Lastly, to improve the quality of the emotion and proactivity labels, a new agreement study, possibly with newly proposed labels, is necessary.
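For reference, the Cohen's Kappa statistic reported in agreement studies like the one above corrects raw agreement for chance. A minimal sketch, with invented toy polarity annotations:

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(ann_a) == len(ann_b)
    n = len(ann_a)
    # Observed agreement: fraction of items both annotators label identically.
    observed = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(ann_a), Counter(ann_b)
    labels = set(ann_a) | set(ann_b)
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Invented annotations for six tweets by two annotators:
a = ["pos", "neg", "neg", "pos", "neu", "neg"]
b = ["pos", "neg", "pos", "pos", "neu", "neg"]
print(round(cohens_kappa(a, b), 3))
```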
Research Master Linguistics: track Human Language Technology
András Aponyi (2020) Estimating Translation Quality Using Distributed Representations of Words and Sentences (full thesis ♦ thesis github ♦ internship at https://www.taus.net/)
abstract: In recent years, there has been growing interest in the language industry in methods to estimate the quality of machine-translated texts without access to reference translations. Therefore, quality estimation (QE) has become an active field of research in natural language processing. Moreover, significant breakthroughs have also been achieved in research related to distributed representations of words, commonly referred to as word embeddings. Inspired by the distributional hypothesis, these can provide explicit semantic information about words and the sentences that contain them. In this thesis, I present a quality estimation pipeline in which these two fields of research come together to produce sentence-level binary labels that denote the quality of hypothetical translations. First, I adapt an existing QE framework to predict sentence-level translation quality scores on a domain-specific data set for the English-French language pair. In addition, I propose SemScore, a translation quality metric that measures semantic similarity between two sentences in different languages based on distributed word representations. Finally, I combine the predicted translation quality scores and SemScore values in a binary classification task to assess whether the new metric can complement existing QE methods. I find that the SemScore feature can improve classification outcomes in a neural model when compared to using predicted HTER scores alone, but it is not suitable for predicting translation quality on its own. However, since the pipeline consists of a variety of steps that rely heavily on each other, I conclude that it might be possible to achieve higher classification accuracy by improving each component individually.
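A metric in the spirit of SemScore — cosine similarity between averaged word vectors of a source sentence and its translation — can be sketched in a few lines. The tiny "bilingual" embedding table below is invented for illustration; the thesis uses pre-trained distributed representations.

```python
import math

# Invented toy cross-lingual embedding space (3 dimensions).
emb = {
    "cat":    [0.90, 0.10, 0.00],
    "chat":   [0.85, 0.15, 0.00],   # French "cat"
    "sleeps": [0.10, 0.80, 0.20],
    "dort":   [0.12, 0.75, 0.25],   # French "sleeps"
}

def sentence_vector(tokens):
    """Average the word vectors of the tokens we have embeddings for."""
    vecs = [emb[t] for t in tokens if t in emb]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

src = sentence_vector(["cat", "sleeps"])
hyp = sentence_vector(["chat", "dort"])
# A high score suggests the hypothesis is semantically close to the source.
print(round(cosine(src, hyp), 3))
```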
Klaudia Bartosiak (2020) Towards Formalizing Eligibility Criteria of Clinical Trials: Biomedical Entity Linking (full thesis not available ♦ thesis github ♦ internship at https://mytomorrows.com/)
abstract: This thesis contributes to the research on transforming free-text eligibility criteria of clinical trials into a machine-understandable format. Achieving the thesis’ objective involved two components. The first consisted of identifying a suitable pipeline of tasks to convert eligibility criteria of clinical trials into a machine-understandable representation. The second involved delivering a tool that provides a solution to one of the tasks included in the pipeline. The reason for conducting this research is that, despite many proposed ways to represent eligibility criteria, only a few works describe how to extract the knowledge from free text to achieve the target representations. The existing solutions, in turn, are often incomplete or access to them is limited. From the identified pipeline, we decided to deliver a tool for biomedical entity linking. This component maps entity mentions in free text to the corresponding terms in a medical knowledge base named UMLS (Unified Medical Language System). The motivation for this choice is the lack of access to the source code of the existing high-quality solutions for biomedical entity linking. Therefore, they could not be used in the target pipeline and a new solution was needed. Thus, we adapted an existing general-purpose entity linking approach to the biomedical domain and further evaluated its performance on a set of eligibility criteria of clinical trials. The micro-F1 score of the biomedical entity linking system is 0.114, indicating that the performance of the tool is unsatisfactory. Three research questions guided this study: What tasks need to be addressed in order to create a system that transforms free-text eligibility criteria of clinical trials into a machine-readable format? What will be the performance of a domain-adapted entity linking tool on a biomedical corpus? And how will it perform on eligibility criteria of clinical trials?
Suzana Bašic (2020) Color as a Discriminative Property for Establishing Object Identity in Human-Robot Communication (full thesis not available ♦ thesis github ♦ research project: CLTL – make robots talk and think)
abstract: This thesis explores the discriminative potential of color for disambiguating between different object instances of the same type in human-robot communication. Since the overarching goal is to improve human-machine collaboration in the physical world through communication in natural language, the problem is approached both from a natural language understanding (NLU) and a natural language generation (NLG) perspective. The experiments presented in this work are of a preliminary nature and are meant to shed light on the potential of color as a discriminative property for object instance disambiguation. More specifically, they are meant to explore the usefulness of distributed representations of color terms for disambiguating between different object instances. Color names are processed using an existing mapping of RGB values to a broad range of human-generated color names. Distributional approaches with GloVe embeddings are compared to string matching baselines in both the NLG and NLU experiments. A survey was conducted among native speakers of English to 1) generate input for the NLU models and 2) externally validate the output of NLG models. Despite certain difficulties, largely arising from the current limitations of computer vision, the results seem promising, especially in the area of NLG. However, to obtain better results, especially with respect to NLU, more sophisticated approaches are necessary. Multimodal embeddings that jointly encode visual and linguistic information seem to be the most promising approach for future work. Furthermore, color should ultimately be combined with other properties, like size, to minimize the inherent ambiguity.
Lauren Green (2020) Semi-supervised Classification of Occupations using Pseudo-Labelling and Information Extraction (full thesis not available ♦ internship at https://greple.de/)
Ngan Nguyen (2020) Clickbait anatomy: Identifying clickbait with machine learning (full thesis ♦ thesis github)
abstract: This research focuses on the exploration of linguistic patterns in clickbait, aiming to distinguish clickbait from serious formal news using quantitative and qualitative analyses. Two significant findings about the nature of clickbait are discovered: (1) there are noticeable changes in terms of syntactic structures and topics in clickbait headlines, and (2) the contents of a clickbait article can provide valuable discourse-level information that can be used to differentiate clickbait from non-clickbait. Based on the results of the analysis, three types of features are selected for the machine learning systems: stylometric features with encoded sequential part-of-speech and dependency tags, word embeddings, and document embeddings. The best system, which uses the Support Vector Machine algorithm and word embedding features, achieves precision and recall scores of 0.82, as well as 82% accuracy.
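The precision, recall and accuracy figures quoted above follow the standard definitions for a binary classifier. A minimal sketch with invented toy labels:

```python
def precision_recall_f1(gold, pred, positive="clickbait"):
    """Precision, recall and F1 for one positive class (standard definitions)."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented gold and predicted labels for five headlines:
gold = ["clickbait", "news", "clickbait", "clickbait", "news"]
pred = ["clickbait", "clickbait", "clickbait", "news", "news"]
print(precision_recall_f1(gold, pred))
```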
Lisa Vasileva (2020) Machine Translation Detection for Neural Machine Translation Scenario (full thesis ♦ internship at https://www.taus.net/)
abstract: This thesis explores the task of Machine Translation Detection for Neural Machine Translation (NMT) architectures. Previous work in the field of MT detection focuses on Statistical Machine Translation (SMT) and aims to model the types of errors SMT systems produce in order to distinguish machine-produced text from human-produced text. However, NMT architectures are more sophisticated than SMT and are known to produce fewer errors. They also boast more fluent and natural-sounding output, leaving fewer possibilities to build automatic systems capable of distinguishing machine-translated from human-translated text based on error detection. In order to tackle the task of MT detection for NMT, in this thesis I explore an alternative approach, inspired by translationese studies. This area of Translation Studies has widely explored the underlying differences between translated texts and texts originally created in a given language, and has experimentally established that there are inherent differences between translated and original texts, not related to the quality of translation. Building on the hypothesis that machine-translated and human-translated texts exhibit translationese features differently, I explore this approach to MT detection for the NMT scenario.
Jonathan Schaller (2020) Cross-domain evaluation of a question-answering classifier (full thesis not available)
abstract: Question-answering is one of the essential tasks in Natural Language Processing. Still, the field is focusing too much on experiments on individual datasets. This research thesis aims at examining how far a machine-learning classification system trained on an open-domain dataset is applicable to a specialised, closed-domain dataset. To examine this, two classifiers, an SVM and an LSTM model, are trained on a dataset from the SemEval competition and tested on the academic open-domain dataset WikiQA as well as the real-world closed-domain dataset CompCorpus. The results are evaluated in a qualitative and quantitative analysis, giving ideas for improving the methodology, revealing structural problems in the datasets, and resulting in a discussion of the generalisability of question-answering models trained on open-domain datasets.
Karen Goes (2019) Exploring text mining techniques to structure a digitised catalogue (full thesis ♦ internship at: https://www.kb.nl/)
abstract: This research aims to obtain structured data from digitised Brinkman catalogue volumes and to identify which text mining techniques can be used for this task. The Brinkman catalogue lists the books, journal titles and maps published in the Netherlands since 1846. The volumes have been digitised by the National Library of the Netherlands using Optical Character Recognition (OCR). However, this data is unstructured, uncorrected and cannot be processed computationally. Since the data is uncorrected, an analysis is performed to identify which volumes meet the requirements for further processing. The entries are then formed by identifying the start and end of each entry. Each entry contains metadata about a book, such as the author, title, publisher and retail price. This metadata is extracted from the entries using different text mining techniques, namely a rule-based system including regular expressions, a probabilistic context-free grammar, and named entity recognition. The extracted information is improved using external knowledge, such as a list of Dutch surnames for authors and Dutch and Belgian city names. For the three most recent volumes the National Library has provided evaluation data that is already digitally available. To evaluate the remaining volumes, evaluation data has been manually created for 100 entries. From the evaluation results the conclusion is drawn that the rule-based method with regular expressions is the best technique. The extracted data is transformed into a standard format for bibliographic metadata, which makes it structured and searchable.
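A minimal sketch of the rule-based approach with regular expressions, assuming a simplified, hypothetical entry layout (real OCR'd Brinkman entries are far noisier and more varied, which is why the thesis combines several techniques):

```python
import re

# Invented example entry: author. title. place: publisher. price (guilders)
entry = "Jansen, P. De geschiedenis van Amsterdam. Amsterdam: Querido. f 12,50"

pattern = re.compile(
    r"^(?P<author>[^.]+)\.\s+"      # author, up to the first period
    r"(?P<title>[^.]+)\.\s+"        # title
    r"(?P<place>[^:]+):\s*"         # place of publication
    r"(?P<publisher>[^.]+)\.\s*"    # publisher
    r"f\s*(?P<price>[\d,]+)$"       # retail price in guilders
)
m = pattern.match(entry)
print(m.groupdict())
```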
Liza King (2018) Modals and Measles: Computational linguistic investigations into modal use in the vaccination debate (full thesis)
abstract: This thesis presents investigations into the automatic processing of modal auxiliary senses in the vaccination debate. The primary aims of this research are to identify the most informative features for machine learning classification with regard to modal sense disambiguation, as well as to discover whether modal senses are a useful feature in predicting vaccination stance. Support Vector Machines and boosted trees with gradient descent are used to test a combination of context and subject features for the modal sense disambiguation task, and the SVM is used to ascertain whether incorporating modal senses as a feature aids in predicting vaccination stance. The majority of experiments conducted improve on most-frequent-class baselines, and the addition of subject features is shown to be largely informative. Different classifiers have different strengths: the boosted trees experiments gain more correct predictions for the dynamic sense, and the SVM gains more correct predictions for the epistemic-dynamic sense. Investigations regarding the informativeness of modal senses as a predictor of vaccination stance are not conclusive, and it is suggested that further research, with a greater corpus and additional features, is needed.
Benedetta Torsi (2018) Detecting claims in a cross-register corpus (full thesis)
abstract: This work is aimed at determining the best annotation scheme for our data, at producing a corpus annotated with argumentation information, and at training a system to identify sentences containing claims. The chosen corpus contains texts belonging to different registers. Most corpora that have been annotated with argumentation information thus far contain texts from a specific genre. Usually, they are composed of persuasive texts with a defined structure. It is of interest to identify claims in user-generated text, in which it is rare to find structured arguments. This type of data contains precious information on the way people form opinions about a topic. In the literature, the basic components of the argument are the premise and the claim. Most annotation tasks that have been proposed until now include the identification and the consequent distinction of these two elements. In practice, the differentiation between these components is fuzzy, making it difficult to perform the task automatically. Therefore, this work focuses only on the claim component. In particular, our goal is to identify claims made by authors who engage in the vaccination debate, as it deals with public health and safety. The research question that leads this exploration is the following: What is a definition of “claim” that is feasible for its identification in a cross-register corpus?
Pia Sommerauer (2017) From old to new racism? Investigating known dangers in distributional semantic approaches to conceptual change (full thesis)
abstract: This thesis explores the methodological approaches used to investigate changes in the complex concept of RACISM with Distributional Semantic Models (DSMs) in natural language. The concept of RACISM has been researched extensively in a number of fields and is said to have undergone a shift from an old, biological interpretation to a new understanding in cultural terms. It can be expected that this shift is reflected in large collections of natural language. The currently used ways of studying conceptual change by analyzing meaning changes in individual words based on their distribution encompass a number of sources for variations in the results that do not reflect actual, conceptual changes, which leads to unreliable conclusions. The research carried out in this thesis has two main goals. Firstly, it aims at operationalizing expected changes in the conceptual system in such a way that they can be investigated by means of DSMs, which can represent changes in word meaning purely on the basis of their distribution in natural language. This requires a kind of translation of concepts into words and their changing semantic relations to each other. Secondly, this thesis aims at examining the conclusions drawn from such an approach with respect to the known dangers of variations specific to distributional semantic approaches. The results indicate that almost none of the initial conclusions about the conceptual change in RACISM as reflected in language can withstand this examination. While the insights about the reflection of the conceptual changes in language thus remain limited, a number of highly relevant insights into the methodological approaches used to study conceptual change could be gained.
Chantal van Son (2015) Towards a Dutch frame-semantic parser (full thesis ♦ research project: CLTL-newsreader)
abstract: In computational linguistics, frame-semantic parsing refers to the task of automatically extracting frame-semantic structures from text. Whereas most research in this area has focused on English, this thesis explores methods for frame-semantic parsing in Dutch. Instead of creating a FrameNet-like resource for this language, it is proposed to start from the predicate-argument structures generated by a Dutch PropBank-style semantic role labeller and to exploit information provided by resources like SemLink, the Predicate Matrix or a corpus of cross-annotations for frame and frame element identification. These resources provide mappings between the predicates and roles of FrameNet, PropBank, VerbNet and WordNet, which makes it possible to map the English frames and frame elements from FrameNet onto the Dutch predicates and their arguments, provided that there is a way to translate the Dutch predicate to its English equivalent. For this purpose, alignments between the Dutch and English WordNets are used, as well as machine translations. The results show that these resources indeed offer great potential for frame-semantic parsing in a cross-lingual setting. The best system achieved F1-scores of 0.41 and 0.49 on frame identification and frame element identification respectively, and there is still room for improvement.
Femke Klaver (2014) Authorship attribution of forum posts (full thesis ♦ internship at: https://www.tno.nl/nl/)
abstract: The anonymity of dark webs such as Tor facilitates a relatively safe environment for criminal activity. Because criminals tend to leave as few traces as possible, identification of those users is often difficult. One trace they do leave, however, is their writing style. The aim of this research is to investigate the discriminative possibilities of certain writing style characteristics, or stylometric features, when applied to messages posted on a forum on Tor. Two different problems regarding authorship analysis are investigated. The first problem entails identifying users that use multiple aliases. We look at the differences between two sets of texts to see whether or not they are written by the same person. Our approach shows very promising results when character bi- and trigrams are used as features: these features return a precision score of 0.98 and a recall score of 0.90. The second problem focuses on attributing the correct author to a text. Experiments are carried out using instances containing feature values of single messages and average feature values of combinations of messages as input data. In this thesis we show that, when we use combinations of 5 messages, we are able to select the correct user from a set of 67 users with an accuracy of around 83%. Both approaches deliver very promising results, showing that writing style characteristics are indeed very helpful in discriminating between users. While this research focused solely on forum messages, the same approaches can be applied to data from other sources. They can also be helpful for linking data from different sources to each other.
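Character bi- and trigram features of the kind used in this thesis can be extracted in a few lines of Python; the example text below is invented.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams, a simple but effective stylometric feature."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

bigrams = char_ngrams("see you tmrw", 2)
trigrams = char_ngrams("see you tmrw", 3)
print(bigrams.most_common(3))
```

Each author's messages can then be represented as a vector of n-gram frequencies and compared across texts or aliases.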