Preprocessing Tools

From FachschaftSprachwissenschaft

This page is a collection of text preprocessing tools, as described here: PreprocessingPresentation.pdf.

Before using any of these tools, check its license terms and make sure you comply with them.

Sentence Segmentation

Segmentation of a string of text into sentences.
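
To make the task concrete, here is a minimal rule-based sketch in plain Python. It is not how the OpenNLP Sentence Detector works (that tool uses a trained model); the abbreviation list and the function name split_sentences are assumptions made up for this illustration.

```python
import re

# Hypothetical abbreviation list for illustration; trained tools learn such cues from data.
ABBREVIATIONS = {"dr.", "prof.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace and an uppercase letter."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:match.end()].strip()
        last_token = candidate.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # do not split after a known abbreviation
        sentences.append(candidate)
        start = match.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)  # whatever follows the last boundary
    return sentences
```

Calling split_sentences("Dr. Smith arrived late. He apologized!") keeps "Dr. Smith arrived late." together as one sentence. Rules like these break quickly on real text, which is why the trained tools listed below are usually the better choice.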

OpenNLP Sentence Detector

Usage: Download, Maven or UIMA. The documentation provides examples for training, application and evaluation.

The German sentence detector was trained on the TIGER corpus, the English one on OpenNLP training data, but you can easily train either on other data.

Programming language: Java

Languages: English, German, Dutch, Danish, Swedish, Portuguese.

SentenceDetector

Tokenization

OpenNLP Tokenizer

Usage: Download, Maven, or UIMA. The documentation provides examples for the application and training.

The German tokenizer was trained on the TIGER corpus, the English one on OpenNLP training data, but you can easily train either on other data.

Programming language: Java

Languages: English, German, Dutch, Danish, Swedish, Portuguese.

OpenNLP Tokenizer

Stanford Word Segmenter

Usage: Command line invocation or Java API. For Chinese, two different models are available.

Programming language: Java

Languages: Arabic, Chinese

Word Segmenter

Stemming

Removal of affixes (reduction to the word stem). An overview of the output for different algorithms can be found here.
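
As a rough illustration of affix removal, the following plain-Python sketch strips a few English suffixes. It is a toy, not the Snowball or Porter algorithm; the suffix list, the minimum stem length, and the name naive_stem are assumptions chosen for this example.

```python
# Hypothetical suffix list, ordered so longer suffixes are tried first.
# This is not the Snowball rule set.
SUFFIXES = ("ations", "ation", "ings", "ing", "ed", "es", "s")

def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_length:
            return lowered[: -len(suffix)]
    return lowered  # no suffix applied
```

The result need not be a real word: naive_stem("organizations") gives "organiz". That is typical of stemmers in general and is the main difference from lemmatization.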

Snowball stemmer

The Snowball stemmer page and software are no longer actively maintained, but the stemmer can still be used and remains one of the best-known stemmers.

Usage: The Snowball stemmer can be used within NLTK or by downloading the package for the language of your choice.

Programming language: Snowball language (and C and Java, wrappers for Python, Perl, PHP and C++ available)

Languages: English, German, and many others (e.g. French, Spanish, Norwegian, Russian, Finnish)

SnowballStemmer

NLTK

NLTK contains several other stemmers besides the Snowball stemmer.

Programming language: Python

Lemmatization

Removal of inflectional affixes (reduction to the lexical entry).
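
The idea of mapping a word form to its lexical entry can be sketched as a dictionary lookup. The lexicon, the coarse POS labels, and the function name lemmatize below are made up for illustration; the tools listed in this section use far larger resources or trained models.

```python
# Tiny hypothetical lexicon mapping (word form, coarse POS) to a lemma.
LEXICON = {
    ("went", "VERB"): "go",
    ("mice", "NOUN"): "mouse",
    ("better", "ADJ"): "good",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}

def lemmatize(word, pos):
    """Return the lexicon lemma, falling back to the lowercased form."""
    key = (word.lower(), pos)
    return LEXICON.get(key, word.lower())
```

The two entries for "saw" show why lemmatizers usually need part-of-speech information: lemmatize("saw", "VERB") gives "see", while lemmatize("saw", "NOUN") gives "saw".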

Stanford Lemmatizer

Usage: The lemmatizer is a (minor) annotator component of the Stanford CoreNLP API. It can be used from the command line or be integrated in a Java program (download or Maven).

Programming language: Java (Python, Perl, JavaScript and other wrappers are available)

Languages: English, Spanish, Chinese, German and Arabic

Stanford lemmatizer

Mate Lemmatizer

Usage: Unfortunately, the Mate tools source files contain practically no Javadoc. However, code examples showing how to use the lemmatizer are available. There is also a DKPro component, but it does not use the latest version of the lemmatizer.

Programming language: Java

Languages: German, English, French, Spanish

Mate Lemmatizer

TreeTagger (integrated lemmatizer)

The TreeTagger lemmatizes its input in addition to tagging.

Usage: Both the downloadable TreeTagger archive and the website of the TreeTagger wrapper for Java (tt4j) provide examples of how to use the TreeTagger (command line and Java).

Programming language: Wrappers for Python, Java, Perl, R, Ruby

Languages: German, English, French, Italian, Dutch, Spanish and many others

TreeTagger

Stop word filtering

Removal of high-frequency words or of words listed on a stopword list.

There are several stopword lists available on the web. However, stopwords are very task-specific, so it's important to choose the proper list carefully.
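
Both variants, filtering against a list and filtering by frequency, can be sketched in a few lines of plain Python. The mini stopword list and the function names are assumptions for illustration only; a real list is much longer and should be chosen for the task at hand.

```python
from collections import Counter

# Hypothetical mini stopword list; real lists are longer and task-specific.
STOPWORDS = {"the", "a", "of", "and", "is", "in"}

def filter_by_list(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear on the stopword list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]

def filter_by_frequency(tokens, top_n):
    """Drop the top_n most frequent word types of the given token sequence."""
    counts = Counter(t.lower() for t in tokens)
    frequent = {word for word, _ in counts.most_common(top_n)}
    return [t for t in tokens if t.lower() not in frequent]
```

The frequency variant needs no external list, but it only makes sense on a reasonably large corpus, and it can remove content words that happen to be frequent in a specialized domain.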

NLTK

NLTK contains a Stopword Corpus with stopwords for 11 languages.

Programming language: Python

POS Tagging

Assigning a part-of-speech (verb/noun/...) to each word in a corpus.

Stanford POS Tagger

Usage: The tagger ships with several trained models; the English models use the Penn Treebank tag set.

Programming language: Java

Languages: English, Arabic, Chinese, French, German

StanfordPOSTagger

OpenNLP POS Tagger

Usage: Contained in the OpenNLP library (a machine learning based toolkit).

Programming language: Java

Languages: English, German, Dutch, Spanish, Swedish, Portuguese

OpenNLP PosTagger

TreeTagger POS Tagger

Usage:

Programming language: Wrappers for Python, Java, Perl, R, Ruby

Languages: German, English, French, Italian, Dutch, Spanish, Bulgarian, Russian, Portuguese, Galician, Chinese, Swahili, Slovak, Slovenian, Latin, Estonian, Polish

TreeTagger

Chunk Parsing

Identifying chunks (segments) of a sentence and classifying them (mostly as NP, VP, ...).
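
A minimal sketch of the idea, assuming the input is already POS-tagged with Penn Treebank style tags: the function below groups tokens that match the pattern DT? JJ* NN+ into NP chunks. The pattern and the name chunk_noun_phrases are illustrative assumptions; the chunkers listed here use trained models instead of a single hand-written rule.

```python
def chunk_noun_phrases(tagged_tokens):
    """Group (word, tag) pairs matching DT? JJ* NN+ into NP chunks."""
    chunks = []
    i = 0
    n = len(tagged_tokens)
    while i < n:
        j = i
        if tagged_tokens[j][1] == "DT":  # optional determiner
            j += 1
        while j < n and tagged_tokens[j][1].startswith("JJ"):  # any adjectives
            j += 1
        noun_start = j
        while j < n and tagged_tokens[j][1].startswith("NN"):  # one or more nouns
            j += 1
        if j > noun_start:
            # At least one noun was found, so emit the chunk.
            chunks.append(("NP", [word for word, _ in tagged_tokens[i:j]]))
            i = j
        else:
            i += 1  # no NP starts at this position
    return chunks
```

Tokens that do not belong to a noun phrase are simply skipped; a fuller chunker would also label verb and prepositional chunks.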

OpenNLP Chunker

Usage: Contained in the OpenNLP library (a machine learning based toolkit).

Programming language: Java

Languages: English, German, Dutch, Spanish, Swedish, Portuguese

OpenNLP Chunker

TreeTagger Chunker

Usage:

Programming language: Wrappers for Python, Java, Perl, R, Ruby

Languages: English, German, French, and Spanish.

TreeTagger

Dependency Parsing

Identifying the dependency relations that hold between words and characterizing them.
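
The output of a dependency parser is commonly stored as one head index and one relation label per token. The sketch below shows such a representation in plain Python. The class and function names are assumptions for illustration, the relation labels follow a Universal Dependencies style naming that the parsers below need not share, and the example sentence is annotated by hand, not produced by any of these parsers.

```python
from dataclasses import dataclass

@dataclass
class Token:
    index: int      # 1-based position in the sentence
    form: str       # surface word
    head: int       # index of the governing token, 0 for the root
    relation: str   # label of the relation to the head

def dependents_of(tokens, head_index):
    """Return the forms of all tokens governed by the token at head_index."""
    return [t.form for t in tokens if t.head == head_index]

def root_of(tokens):
    """Return the token attached to the artificial root (head 0)."""
    return next(t for t in tokens if t.head == 0)

# Hand-annotated example, not parser output.
sentence = [
    Token(1, "She", 2, "nsubj"),
    Token(2, "reads", 0, "root"),
    Token(3, "books", 2, "obj"),
]
```

Here root_of(sentence) is the token "reads", and dependents_of(sentence, 2) lists its subject and object.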

Mate Parser

Usage: Can be downloaded or tested online. It includes two different dependency parsers, a graph-based parser, and a joint tagger and transition-based parser.

Programming language: Java

Languages: English, German

MateParser

Malt Parser

Usage: Can be downloaded online. It can induce a parsing model from treebank data and can parse new data using an induced model.

Programming language: C and Java

Languages: English and other languages

MaltParser

Stanford Parser

Usage: Can be downloaded or used online. Outputs dependencies and phrase structure trees.

Programming language: Java

Languages: English, Chinese, German, Arabic, Italian, Bulgarian, Portuguese

Stanford Parser

Tool collections

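Most of the toolkits below are published under the Apache License, Version 2.0. Stanford CoreNLP is an exception: it is released under the GNU General Public License. Check the license of each tool before using it.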

OpenNLP

Apache OpenNLP is a natural language processing toolkit from the Apache Software Foundation. It contains several of the tools mentioned above as well as many others (all for English, some for other languages).

Usage: Apache OpenNLP is available as a download, via Maven, and as a UIMA integration.

Programming language: Java

Languages: many

Apache OpenNLP

NLTK

Another toolkit is NLTK (Natural Language Toolkit), which was also mentioned above. It contains not only many preprocessing tools (stemming, tagging, classification, ...) but also interfaces to corpora and lexical resources such as WordNet.

Programming language: Python

Languages: many

NLTK

Stanford CoreNLP

The Stanford Natural Language Processing Group provides many tools, with different models and for various languages (depending on the tool). As with OpenNLP, you can download the tools, use them via a Maven dependency, or use them as UIMA components.

Core NLP

DKPro Core

The last collection of preprocessing tools mentioned here is DKPro Core from TU Darmstadt. It wraps many of the tools on this page (Stanford NLP, TreeTagger, OpenNLP, and many more) as UIMA components, which makes building NLP pipelines very convenient once you have figured out how UIMA works. I recommend using uimaFIT instead, which is a little easier.

Programming language: Java

Languages: many

DKPro Core