Preprocessing Tools

From FachschaftSprachwissenschaft

Revision as of 10:21, 30 August 2015

This page is a collection of text preprocessing tools, as described here: PreprocessingPresentation.pdf.

Sentence Segmentation

Segmentation of a string of text into sentences.
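To illustrate what the tools below have to solve, here is a minimal rule-based sketch in Python. It is not how the OpenNLP Sentence Detector works (that tool uses a trained model); the function name `split_sentences` and the tiny abbreviation list are assumptions made for this example only.

```python
import re

# Hypothetical, deliberately small abbreviation list; real tools learn such cases from data.
ABBREVIATIONS = {"dr.", "prof.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """Split after '.', '!' or '?' when whitespace and an uppercase letter follow."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:match.end()].strip()
        last_token = candidate.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a sentence end
        sentences.append(candidate)
        start = match.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences

print(split_sentences("Dr. Smith arrived late. He apologized! Was it enough?"))
```

The abbreviation check shows why plain punctuation splitting is not enough: without it, "Dr." would be treated as a sentence of its own.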

OpenNLP Sentence Detector

Programming language: Java

Languages: English, German, Dutch, Danish, Swedish, Portuguese

SentenceDetector

Tokenization

Segmentation of a text into tokens (words, punctuation marks, ...).
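As a rough illustration, the following Python sketch splits an English sentence with a single regular expression. The pattern and the name `tokenize` are assumptions for this example and do not reflect the behaviour of any tool listed here. Languages written without spaces between words (e.g. Chinese) cannot be handled this way and need a dedicated word segmenter.

```python
import re

# Words (optionally joined by an internal hyphen or apostrophe) or single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def tokenize(sentence):
    """Return the list of word and punctuation tokens in the sentence."""
    return TOKEN_PATTERN.findall(sentence)

print(tokenize("Don't panic, it's a well-known issue."))
```

Keeping "Don't" and "well-known" as single tokens is a design choice of this sketch; other tokenizers split clitics such as "n't" into separate tokens.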

OpenNLP Tokenizer

Programming language: Java

Languages: English, German, Dutch, Danish, Swedish, Portuguese

OpenNLP Tokenizer

Stanford Word Segmenter

Programming language: Java

Languages: Arabic, Chinese

Word Segmenter

Stemming

Removal of affixes (reduction to the word stem).
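The idea can be sketched with a toy suffix stripper in Python. This is far cruder than the Snowball algorithms; the suffix list, the minimum stem length and the name `stem` are assumptions chosen for this example.

```python
# Hypothetical suffix list for English, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word, min_stem_length=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[:-len(suffix)]
    return word

for form in ("walking", "walked", "walks", "sing"):
    print(form, "->", stem(form))
```

A stem does not have to be a real word: this sketch turns "running" into "runn", which is the kind of case real stemmers handle with additional rules.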

Snowball stemmer

Programming language: Snowball language (also C and Java; wrappers for Python, Perl, PHP and C++ are available)

Languages: English, German, and many others (e.g. French, Spanish, Norwegian, Russian, Finnish)

SnowballStemmer

Lemmatization

Removal of inflectional affixes (reduction to the lexical entry).
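Unlike stemming, lemmatization needs lexical knowledge, because irregular forms cannot be reduced by cutting off a suffix. A minimal Python sketch with a dictionary lookup shows the difference; the miniature lexicon and the name `lemmatize` are hypothetical, and the lemmatizers listed below use much larger resources or trained models.

```python
# Hypothetical miniature lexicon mapping inflected forms to their lemmas.
LEXICON = {
    "went": "go",
    "gone": "go",
    "mice": "mouse",
    "was": "be",
    "were": "be",
}

def lemmatize(token):
    """Look the token up in the lexicon; unknown tokens are returned unchanged."""
    return LEXICON.get(token.lower(), token)

print([lemmatize(t) for t in ["The", "mice", "went", "home"]])
```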

Stanford Lemmatizer

Programming language: Java (Python, Perl, JavaScript and other wrappers are available)

Languages: English, Spanish, Chinese, German and Arabic

Stanford lemmatizer (http://nlp.stanford.edu/software/corenlp.shtml)

Mate Lemmatizer

Programming language: Java

Languages: German, English, French, Spanish

Mate Lemmatizer (https://code.google.com/p/mate-tools/)

Stop word filtering
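
Stop word filtering is usually understood as dropping very frequent function words (articles, prepositions, ...) that carry little content. A minimal Python sketch, assuming an already tokenized input and a hypothetical hand-picked stop word list:

```python
# Hypothetical, very short English stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def remove_stop_words(tokens):
    """Keep only tokens whose lowercased form is not in the stop word list."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "history", "of", "the", "language"]))
```

Which words count as stop words depends on the language and on the task, so the list is best treated as a parameter.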

POS Tagging

Assigning a part-of-speech (verb/noun/...) to each word in a corpus.
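The simplest possible baseline is a lookup tagger, sketched below in Python. The lexicon, the Penn Treebank style tag names and the function name `tag` are assumptions for this example; the taggers listed below use statistical models that also take the context of a word into account.

```python
# Hypothetical lexicon with one tag per word form (Penn Treebank style tags assumed).
TAG_LEXICON = {
    "the": "DT",
    "a": "DT",
    "old": "JJ",
    "dog": "NN",
    "cat": "NN",
    "chased": "VBD",
}

def tag(tokens, default_tag="NN"):
    """Pair each token with its lexicon tag; unknown words get the default tag."""
    return [(token, TAG_LEXICON.get(token.lower(), default_tag)) for token in tokens]

print(tag(["The", "dog", "chased", "a", "squirrel"]))
```

A lookup tagger cannot resolve ambiguous forms (e.g. a word that can be a noun or a verb), which is the main reason context-sensitive taggers exist.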

Stanford POS Tagger

StanfordPOSTagger

OpenNLP POS Tagger

OpenNLP PosTagger

TreeTagger POS Tagger

TreeTagger

Chunk Parsing

Identifying chunks (non-overlapping segments) of a sentence and classifying them (mostly as NP, VP, ...).
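A crude noun phrase chunker over POS-tagged input can be sketched in Python as follows. The tag sets (Penn Treebank style) and the name `chunk_noun_phrases` are assumptions for this example; the chunkers listed below are trained models and also recognize other chunk types.

```python
# Tags that may occur inside a noun phrase, and tags that count as nouns (assumed tag set).
NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP"}
NOUN_TAGS = {"NN", "NNS", "NNP"}

def chunk_noun_phrases(tagged_tokens):
    """Group maximal runs of NP-internal tags; keep a run only if it contains a noun."""
    chunks = []
    current = []
    # The sentinel pair at the end flushes a chunk that reaches the end of the sentence.
    for token, tag in list(tagged_tokens) + [(None, None)]:
        if tag in NP_TAGS:
            current.append((token, tag))
            continue
        if any(t in NOUN_TAGS for _, t in current):
            chunks.append(("NP", [word for word, _ in current]))
        current = []
    return chunks

tagged = [("The", "DT"), ("old", "JJ"), ("dog", "NN"), ("chased", "VBD"),
          ("a", "DT"), ("cat", "NN"), (".", ".")]
print(chunk_noun_phrases(tagged))
```

The sketch merges two adjacent noun phrases into one chunk, so it only approximates what a real chunker does.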

OpenNLP Chunker

OpenNLP Chunker

TreeTagger Chunker

TreeTagger

Dependency Parsing

Identifying the dependency relations that hold between the words of a sentence and labeling them (e.g. as subject or object).
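The result of dependency parsing is commonly represented as a set of head-dependent pairs with a relation label. The Python sketch below encodes a hand-written analysis (not the output of any parser listed here) of "The dog chased a cat"; the relation labels and all names are assumptions for this example, and each parser uses its own label set.

```python
from collections import namedtuple

Dependency = namedtuple("Dependency", ["head", "relation", "dependent"])

# Token indices are 1-based; index 0 is an artificial ROOT node.
TOKENS = ["ROOT", "The", "dog", "chased", "a", "cat"]
DEPENDENCIES = [
    Dependency(0, "root", 3),   # chased is the root of the sentence
    Dependency(3, "nsubj", 2),  # dog is the subject of chased
    Dependency(2, "det", 1),    # The is the determiner of dog
    Dependency(3, "obj", 5),    # cat is the object of chased
    Dependency(5, "det", 4),    # a is the determiner of cat
]

def dependents_of(head_index, dependencies):
    """Return (relation, word) pairs for all direct dependents of the given head."""
    return [(d.relation, TOKENS[d.dependent]) for d in dependencies if d.head == head_index]

def has_single_heads(dependencies, n_tokens):
    """Check that every token 1..n_tokens is the dependent of exactly one head."""
    return sorted(d.dependent for d in dependencies) == list(range(1, n_tokens + 1))

print(dependents_of(3, DEPENDENCIES))
print(has_single_heads(DEPENDENCIES, len(TOKENS) - 1))
```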

Mate Parser

Usage: Can be downloaded or tested online. It includes two different dependency parsers: a graph-based parser, and a joint tagger and transition-based parser.

Programming language: Java

Languages: English, German

MateParser

Malt Parser

MaltParser

Stanford Parser

Usage: Can be downloaded or used online. Outputs dependencies and phrase structure trees.

Programming language: Java

Languages: English, Chinese, German, Arabic, Italian, Bulgarian, Portuguese

Stanford Parser

Tool collections

NLTK

DKPro