Web Services

 


IMPORTANT: These web services are intended for DEMONSTRATION PURPOSES ONLY. If you want to use them in a production environment, please contact us.


 

This page lists the Web Services (WS) developed by the CLTL group. Most of these WS are linguistic processors: each one applies some process to an input text and generates an output text. The WS are grouped by the task they perform.

1.- Using / calling the Web Services

To call these web services, send the input file in a POST request to the web-service URL. The easiest way is to use curl, a command-line tool for transferring data with URL syntax. It supports most existing protocols (HTTP, IMAP, POP3, FTP, Telnet…) and can transfer data from or to a server without user interaction.

To run this command-line tool, open a terminal window (or console) on your machine. In the most basic case, the input text is in a file (for example “my_input.txt”), the URL of the WS has the form https://some.server:port/my_ws, and the result is stored in an output file (for example “my_output.txt”). Services can also be chained with pipes, because the output of one service can be the input of the next. The following example sends a Dutch sentence through the tokenizer, TreeTagger and the chunker, and stores the result in “my_output.kaf”:

echo 'Dit is een mooie en gezellig hotel in Amsterdam' | curl --data-binary @- 'ic.vupr.nl:8081/tokenizer?lang=nl' | curl --data-binary @- ic.vupr.nl:8081/treetagger_kaf | curl --data-binary @- ic.vupr.nl:8081/chunker_nl > my_output.kaf
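For the single-service case, with the input in one file and the result in another, the call can be wrapped in a small shell function. This is only a minimal sketch: the function name call_ws and the WS_BASE variable are made up for illustration, the server address is the one used in the examples on this page, and --silent, --show-error and --fail are standard curl options added here so that HTTP errors are reported on the terminal rather than saved as if they were results.

```shell
# Base address taken from the examples on this page (assumption: adjust it
# if the services are hosted elsewhere).
WS_BASE="${WS_BASE:-http://ic.vupr.nl:8081}"

# call_ws SERVICE INPUT_FILE OUTPUT_FILE   (hypothetical helper)
# Sends INPUT_FILE as the body of a POST request and stores the response.
call_ws() {
    # --data-binary sends the file unchanged (newlines are kept);
    # --fail makes curl exit with an error code on HTTP errors.
    curl --silent --show-error --fail \
         --data-binary "@$2" \
         "$WS_BASE/$1" > "$3"
}
```

A call would then look like: call_ws 'tokenizer?lang=nl' my_input.txt my_output.kaf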

If you open the URL of a web service in your browser (a GET request), you will get further information and usage examples for that web service.
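The same information can be fetched from the terminal, since curl sends a GET request when no data option is given. The sketch below assumes the server address used in the examples on this page; the function name ws_info is made up for illustration.

```shell
# Base address taken from the examples on this page (assumption).
WS_BASE="${WS_BASE:-http://ic.vupr.nl:8081}"

# ws_info SERVICE   (hypothetical helper)
# Plain GET request: prints the usage page of the service to stdout.
ws_info() {
    curl --silent --show-error --fail "$WS_BASE/$1"
}
```

For example, ws_info chunker_nl would print the usage page of the chunker.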

2.- Web Services

This section lists all the WS developed by the CLTL group. These are the main kinds of linguistic processors that we have:

  1. Tokenizers
    • Open-nlp tokenizer and sentence splitter
    • Opener tokenizer (based on perl)
  2. Part-of-speech taggers
    • TreeTagger (from plain text)
    • TreeTagger (from token KAF)
  3. Parsers
    • Stanford parser
    • Alpino parser (from plain text)
    • Alpino parser (from KAF/NAF files)
  4. Chunkers
    • Open-nlp noun phrase chunker trained on Alpino data
  5. Word Sense Disambiguation systems
    • UKB (Dutch and English)
    • Support Vector Machines (Dutch)
  6. Other
    • Hotel property tagger
    • Hotel polarity tagger
    • Opinion detector
  7. Opinion miners
    • Hotel basic opinion miner
    • MPQA trained deluxe opinion miner
  8. Complete pipelines for English and Dutch
    • Basic version
    • Deluxe version
  9. NER taggers
    • Opener NER

2.1 Tokenizers

2.1.1 Open-nlp tokenizer and sentence splitter

  • Short description: performs sentence splitting and text tokenization
  • Language: Dutch
  • Input: UTF-8 plain text
  • Output: KAF text with token layer
  • URL: http://ic.vupr.nl:8081/tokenizer
  • Example file output: here
  • Version: basic

This web service performs two tasks:

  1. Sentence splitting: detects the boundaries of the sentences in the input text
  2. Tokenization: splits each sentence into single tokens

This web service is based on Open-nlp, a machine learning toolkit for natural language processing. Specifically, we use the Dutch models trained on the CoNLL-X Alpino data, which are available at http://opennlp.sourceforge.net/models-1.5/

2.1.2 Opener Tokenizer

  • Short description: performs sentence splitting and text tokenization. It is rule-based and implemented in Perl
  • Language: language independent
  • Input: UTF-8 plain text
  • Parameters: i18n code for the language must be specified (nl, en, es, fr, it, de)
  • Output: KAF text with token layer
  • Original URL: https://github.com/opener-project/tokenizer-base
  • URL: ic.vupr.nl:8081/opener_tokenizer
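Because this tokenizer needs a language code, it can help to validate the code before sending anything. The sketch below makes two assumptions that this page does not confirm: that the code is passed as a query parameter named lang (as in the tokenizer and pipeline examples elsewhere on this page), and that the service is reachable at the server address used in those examples. The function name opener_tokenize is made up for illustration.

```shell
# Base address taken from the examples on this page (assumption).
WS_BASE="${WS_BASE:-http://ic.vupr.nl:8081}"

# opener_tokenize LANG   (hypothetical helper)
# Reads plain text from stdin and writes the service response to stdout.
opener_tokenize() {
    # Accept only the language codes listed for this service.
    case "$1" in
        nl|en|es|fr|it|de) ;;
        *) echo "unsupported language code: $1" >&2; return 2 ;;
    esac
    # Assumption: the code is sent as "?lang=CODE"; check the usage page
    # of the service for the actual parameter name.
    curl --silent --show-error --fail --data-binary @- \
         "$WS_BASE/opener_tokenizer?lang=$1"
}
```

Usage: echo 'This is a nice hotel' | opener_tokenize en > my_output.kaf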

2.2 Part-of-speech taggers

2.2.1 TreeTagger (from plain text)

  • Short description: performs part-of-speech and lemma annotation
  • Language: English and Dutch
  • Input: UTF-8 plain text
  • Output: KAF text with token and term layer
  • URL: ic.vupr.nl:8081/treetagger_plain_to_kaf
  • Example output file: here
  • Version: lite

This WS is a wrapper around TreeTagger, a tool for annotating text with part-of-speech and lemma information. The WS takes the input text, calls TreeTagger and generates an output KAF file with the token and term layers. TreeTagger performs the tokenization and also assigns the information in the term layer (lemmas and PoS tags). There are two different WS for English and Dutch, so that no parameter is needed to specify the language; make sure you use the correct version for your language.

2.2.2 TreeTagger (from KAF)

  • Short description: performs part-of-speech and lemma annotation
  • Language: Dutch
  • Input: KAF text with token layer
  • Output: KAF text with token and term layer
  • URL: ic.vupr.nl:8081/treetagger_kaf_to_kaf
  • Example output file: here
  • Version: basic

This is another wrapper around TreeTagger, a tool for annotating text with part-of-speech and lemma information. In this case the input for the Web Service has to be a KAF file with the token layer. The program calls TreeTagger and creates the term layer containing the PoS and lemma information, linking the new terms to the existing tokens (the token layer given in the input KAF is not modified in the output KAF). This Web Service works with Dutch text.
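Since this service expects KAF with a token layer, a natural way to use it is right after the tokenizer. The sketch below chains the two calls through a temporary file, so that a failure of the first step is noticed before the second one runs (a plain pipe would hide it). The function names post and tag_dutch are made up for illustration, and the server address is the one used in the examples on this page.

```shell
# Base address taken from the examples on this page (assumption).
WS_BASE="${WS_BASE:-http://ic.vupr.nl:8081}"

# post SERVICE   (hypothetical helper): POSTs stdin, prints the response.
post() {
    curl --silent --show-error --fail --data-binary @- "$WS_BASE/$1"
}

# tag_dutch INPUT_TXT OUTPUT_KAF   (hypothetical helper)
# Step 1: plain text -> KAF with token layer (tokenizer).
# Step 2: token KAF  -> KAF with token and term layer (TreeTagger).
tag_dutch() {
    tokens=$(mktemp) || return 1
    if post "tokenizer?lang=nl" < "$1" > "$tokens" &&
       post "treetagger_kaf_to_kaf" < "$tokens" > "$2"; then
        rc=0
    else
        rc=1
    fi
    rm -f "$tokens"     # always remove the intermediate file
    return "$rc"
}
```

Usage: tag_dutch my_input.txt my_output.kaf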


2.3 Parsers

2.3.1 Stanford parser

  • Short description: performs syntactic analysis in English
  • Language: English
  • Input: UTF-8 plain text
  • Output: KAF text with token, term, dependency and chunk layer
  • URL: ic.vupr.nl:8081/stanford_plain_to_kaf
  • Example output file: here
  • Version: basic

This Web Service is a wrapper around the Stanford parser. This parser analyzes the syntactic structure of the input text, detecting phrases and dependencies between constituents, and generating a structure tree of the text. This tool works with English text.

2.3.2 Alpino parser

  • Short description: performs syntactic analysis in Dutch
  • Language: Dutch
  • Input: UTF-8 plain text
  • Output: KAF text with token, term, dependency and chunk layer
  • URL: ic.vupr.nl:8081/alpino_plain_to_kaf
  • Example output file: here
  • Version: basic

Like the previous parser, Alpino analyzes the syntactic structure of the text and provides additional information such as chunks and dependencies between elements in the text. This tool works for Dutch input text.

2.3.3 Alpino parser with KAF/NAF

  • Short description: performs syntactic analysis in Dutch
  • Language: Dutch
  • Input: KAF/NAF file with text and term layer
  • Output: input KAF/NAF extended with dependency layer
  • URL: http://ic.vupr.nl:8081/alp_dep_parser

This module obtains the dependencies from an input KAF/NAF file by calling the Alpino parser. The output is also KAF or NAF.


2.4 Chunkers

2.4.1 Open-nlp noun phrase chunker trained on Alpino data

  • Short description: detects chunks (noun phrases) in Dutch
  • Language: Dutch
  • Input: KAF with term layer
  • Output: KAF extended with chunk layer
  • URL: http://ic.vupr.nl:8081/chunker_nl
  • Example output file: here
  • Version: basic

This Web Service implements a noun phrase detector (or chunker) for Dutch. The module is based on Machine Learning, and the open-nlp toolkit has been used to train the model. The training data has been compiled from the Alpino treebank, which is annotated with noun phrases among other kinds of information; the other information has been discarded in our case. The XML format of Alpino has been converted to the tab-separated format required by open-nlp, which is the same format as defined in the CoNLL shared task.


2.5 Word Sense Disambiguation Systems

2.5.1 UKB

Unsupervised and knowledge-based WSD system, for Dutch and English.

2.5.2 Support Vector Machines WSD

WSD system trained with Support Vector Machines on the data generated by the DutchSemcor project. It works for Dutch text, in plain or KAF/NAF format, and is only available through the mod-wsgi interface.

  • URL: http://ic.vupr.nl:8081/svm_wsd (click on this URL for parameters and usage)
  • Input: plain text, KAF or NAF
  • Output: semcor XML format (if input is plain), KAF or NAF

2.6 Other

2.6.1 Hotel property tagger

  • Short description: detects properties of a hotel in Dutch and English text
  • Language: Dutch and English
  • Input: KAF with term layer
  • Output: KAF extended with property layer
  • URL: ic.vupr.nl:8081/hotel_property_tagger
  • Example output file: here
  • Version: lite

This module implements a detector and tagger of hotel properties, such as: room, staff, internet, breakfast, cleanliness… It’s based on a list of properties derived from a hotel lexicon and implements a basic lookup algorithm using that property list. Furthermore, this list has been propagated automatically through WordNet using the hyponymy relations and a propagation algorithm developed within the CLTL group.

2.6.2 Hotel polarity tagger

  • Short description: assigns positive/negative polarities to the terms in a text
  • Language: Dutch, English and German
  • Input: KAF with term layer
  • Output: KAF with term layer extended with polarities
  • URL: http://ic.vupr.nl:8081/polarity_tagger
  • Example output file: here
  • Version: basic

This module implements a tagger that assigns a polarity (positive or negative) to the terms of an input text. The program also detects intensifiers (like “very”, “heel”) and polarity shifters (mainly negators). For this task we use a polarity lexicon created for the hotel domain, in which the words are tagged with their polarities. It is therefore important to note that the tagger assigns polarities according to the hotel domain (a word could be positive in the hotel domain and negative in a sports domain).

2.7 Opinion miners

2.7.1 Hotel basic opinion miner

  • Short description: detects opinions and their elements (targets, holders and expressions) in texts
  • Language: Dutch, English, German, French, Italian and Spanish
  • Input: KAF with term layer containing polarities
  • Output: KAF extended with opinion layer
  • URL: http://ic.vupr.nl:8081/opinion_miner_basic
  • Example output file: here
  • Version: basic

This tool implements an opinion detector in English and Dutch. It takes as input KAF text with the term layer annotated with polarities, for instance to indicate that the lemma “goedkoop” has a positive polarity. From this evidence of positive and negative words, the system tries to extract whole opinions in three different steps:

  1. Expression detection: detects the complete expression in case there are intensifiers or modifiers of the words with polarity
  2. Target detection: finds what the previous expression is about
  3. Holder detection: finds who is expressing the opinion

This system is based on a set of rules to perform the three steps detailed.
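Since the opinion miner needs polarities in the term layer, it is typically called right after the polarity tagger. The sketch below is a generic chain runner that sends a file through a list of services in order and stops at the first failure. The function name run_chain is made up for illustration, the server address is the one used in the examples on this page, and the input is assumed to be a KAF file that already has a term layer.

```shell
# Base address taken from the examples on this page (assumption).
WS_BASE="${WS_BASE:-http://ic.vupr.nl:8081}"

# run_chain INPUT OUTPUT SERVICE...   (hypothetical helper)
# Sends INPUT through every SERVICE in order; the response of one service
# is the request body of the next. OUTPUT is written only if all succeed.
run_chain() {
    in_file=$1; out_file=$2; shift 2
    cur=$(mktemp) || return 1
    nxt=$(mktemp) || { rm -f "$cur"; return 1; }
    cp "$in_file" "$cur" || { rm -f "$cur" "$nxt"; return 1; }
    for service in "$@"; do
        if ! curl --silent --show-error --fail \
                  --data-binary "@$cur" "$WS_BASE/$service" > "$nxt"; then
            echo "service failed: $service" >&2
            rm -f "$cur" "$nxt"
            return 1
        fi
        mv "$nxt" "$cur"    # the response becomes the next input
    done
    mv "$cur" "$out_file" && rm -f "$nxt"
}
```

Usage: run_chain terms.kaf opinions.kaf polarity_tagger opinion_miner_basic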

2.7.2 MPQA trained deluxe opinion miner

This system has been trained on the MPQA data following a machine learning approach, so it works only for English text. The whole process is divided into 2 main tasks:

  1. Detecting chunks of text that represent opinion entities (holders, targets and positive/negative opinion expressions). For this task, 3 models have been trained using a Conditional Random Field library (crfsuite)
  2. Detecting the relations between the extracted entities in 1), for creating the opinion triples (holder+target+expression). For this task, 2 models have been trained using a Support Vector Machine library (svmlight)

Description of the webservice:

  • Input: a KAF file (with token and term layer, preferably also with polarity information, entities, constituents and dependencies)
  • Output: a KAF file extended with the opinion layer
  • URL: http://ic.vupr.nl:8081/opinion_miner_deluxe

2.8 Complete systems

These web services can be used to run our whole pipeline starting from plain English or Dutch text. Depending on the version used, different linguistic processors will be called.

2.8.1 Basic version

These are the modules applied by the basic version, which works for English and Dutch.

  • Tokenizer open-nlp
  • TreeTagger pos-tagger
  • Polarity tagger
  • NER (entity tagger)  trained on opener by EHU on conll data
  • Hotel aspect tagger
  • Rule-based opinion miner
  • WSD trained on DutchSemcor data

URL for the webservice:

  • ic.vupr.nl:8081/pipeline_basic
  • Usage: the language needs to be specified as a parameter in the curl command
    • cat file.en.txt | curl --data-binary @- 'ic.vupr.nl:8081/pipeline_basic?lang=en'
    • cat file.nl.txt | curl --data-binary @- 'ic.vupr.nl:8081/pipeline_basic?lang=nl'

 

2.8.2 Deluxe version

These are the modules applied by the deluxe version, which works for English and Dutch.

  • Tokenizer open-nlp
  • TreeTagger pos-tagger
  • Polarity tagger
  • NER (entity tagger)  trained on opener by EHU on conll data
  • Hotel aspect tagger
  • Dependency parser (newsreader EHU-srl for English, opener Alpino for Dutch)
  • Constituency parser (opener stanford for English, opener Alpino for Dutch)
  • Opinion miner deluxe: trained with CRF and SVM on a small set of hotel reviews annotated by VU in 2013
  • SVM-based WSD trained on DutchSemcor data

URL for the webservice:

  • ic.vupr.nl:8081/pipeline_deluxe
  • Usage: the language needs to be specified as a parameter in the curl command
    • cat file.en.txt | curl --data-binary @- 'ic.vupr.nl:8081/pipeline_deluxe?lang=en'
    • cat file.nl.txt | curl --data-binary @- 'ic.vupr.nl:8081/pipeline_deluxe?lang=nl'
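Both pipelines follow the same calling convention, so one small wrapper can cover them. The sketch below checks the version and the language before calling the service. The function name run_pipeline is made up for illustration; the URLs and the lang parameter are the ones shown in the usage lines for the basic and deluxe versions.

```shell
# Base address taken from the examples on this page (assumption).
WS_BASE="${WS_BASE:-http://ic.vupr.nl:8081}"

# run_pipeline VERSION LANG   (hypothetical helper)
# Reads plain text from stdin and writes the pipeline output to stdout.
run_pipeline() {
    case "$1" in
        basic|deluxe) ;;
        *) echo "unknown pipeline version: $1" >&2; return 2 ;;
    esac
    case "$2" in
        en|nl) ;;
        *) echo "unsupported language: $2" >&2; return 2 ;;
    esac
    curl --silent --show-error --fail --data-binary @- \
         "$WS_BASE/pipeline_$1?lang=$2"
}
```

Usage: run_pipeline deluxe nl < file.nl.txt > file.nl.kaf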


2.9 NER taggers

Opener NER

NER tagger developed within the OpeNER project, for different languages.

  • Input: a KAF file with at least text and term layer
  • Output: extended KAF file with entities
  • Mod-wsgi URL: ic.vupr.nl:8081/opener_ner


3. Extracting events from text

The following services can be used to extract events from text. The services need to be called in the order specified below, where the output of one module is the input for the next module. The languages available are English and Dutch.

1. Parsers

  • Dutch (Alpino)
  • English (Stanford)
  • Input stream is plain text in UTF-8
  • Output stream is KAF
  • Description: applies tokenization, lemmatisation, pos-tagging, some named entity recognition, chunking and dependency labelling of the text and generates the output in KAF.
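The first step can be sketched as a function that picks the parser for the language and sends the plain text to it. This assumes that the parsers meant here are the plain-text parser services described in section 2.3 (alpino_plain_to_kaf and stanford_plain_to_kaf), which this section does not state explicitly. The function names and the WS_BASE variable are made up for illustration.

```shell
# Base address taken from the examples on this page (assumption).
WS_BASE="${WS_BASE:-http://ic.vupr.nl:8081}"

# parser_for_lang LANG   (hypothetical helper)
# Prints the name of the parser service for the given language.
parser_for_lang() {
    case "$1" in
        nl) echo "alpino_plain_to_kaf" ;;
        en) echo "stanford_plain_to_kaf" ;;
        *)  echo "no parser for language: $1" >&2; return 2 ;;
    esac
}

# parse_text LANG   (hypothetical helper)
# Reads UTF-8 plain text from stdin and writes the parser's KAF to stdout.
parse_text() {
    service=$(parser_for_lang "$1") || return 2
    curl --silent --show-error --fail --data-binary @- "$WS_BASE/$service"
}
```

Usage: parse_text en < my_input.txt > parsed.kaf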

2. Multiword tagger:

3. WSD

3.1 Language-independent processing of text

Steps 1 through 5 require special software and resources for each language, and eventually produce a conceptual representation of the text in KAF. Two more modules are needed to extract events from this representation. These modules are not yet installed as web services:

6. Kybot
URL: runs local
input stream: KAF
output stream: KAF

7. Coreference or SEM layer
URL: runs local
input stream: KAF
output stream: KAF

 


