Preprocessing Tools

From FachschaftSprachwissenschaft
Revision as of 10:17, 30 August 2015

This page is a collection of text preprocessing tools, as described here: PreprocessingPresentation.pdf.

Sentence Segmentation

Segmentation of a string of text into sentences.

OpenNLP Sentence Detector

Languages: English, German, Dutch, Danish, Swedish, Portuguese.

Programming language: Java

SentenceDetector: http://opennlp.apache.org/documentation/manual/opennlp.html#tools.sentdetect
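
To illustrate what a sentence segmenter takes in and returns, here is a minimal sketch in Python. It is not how the OpenNLP Sentence Detector works (that tool uses a trained model); it assumes a naive rule that a sentence ends at ".", "!" or "?" followed by whitespace and an uppercase letter. The function name split_sentences is hypothetical.

```python
import re

# Naive boundary rule (an assumption for illustration only):
# sentence-final punctuation, then whitespace, then an uppercase letter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZÄÖÜ])")

def split_sentences(text):
    """Split a string into sentences at the naive boundary pattern."""
    return [s.strip() for s in _BOUNDARY.split(text) if s.strip()]

print(split_sentences("This is one sentence. Here is another! Is this a third?"))
# ['This is one sentence.', 'Here is another!', 'Is this a third?']

# The rule breaks on abbreviations, which is why trained detectors exist:
print(split_sentences("Dr. Smith came."))
# ['Dr.', 'Smith came.']
```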

Tokenization

Stemming

Lemmatization

Stop word filtering

POS Tagging

Assigning a part-of-speech tag (verb, noun, ...) to each word in a corpus.

Stanford POS Tagger

StanfordPOSTagger

OpenNLP POS Tagger

OpenNLP PosTagger

TreeTagger POS Tagger

TreeTagger
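
The input and output of a POS tagger can be sketched with a toy lookup tagger in Python. This is only an illustration of the data shape (a list of word/tag pairs); the taggers listed above use statistical models, not a fixed dictionary. The lexicon, the Penn Treebank style tag names, and the function name tag are assumptions made for this example.

```python
# Toy lexicon (hypothetical); tags follow Penn Treebank style for illustration.
LEXICON = {"the": "DT", "dog": "NN", "barks": "VBZ", "loudly": "RB"}

def tag(tokens, default="NN"):
    """Return (token, tag) pairs by dictionary lookup.

    Words missing from the lexicon receive the `default` tag.
    """
    return [(t, LEXICON.get(t.lower(), default)) for t in tokens]

print(tag(["The", "dog", "barks", "loudly"]))
# [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ'), ('loudly', 'RB')]
```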

Chunk Parsing

Identifying chunks (segments) of a sentence and classifying them, mostly as phrase types such as NP or VP.

OpenNLP Chunker

OpenNLP Chunker

TreeTagger Chunker

TreeTagger
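
As a rough sketch of the idea, the Python snippet below groups POS-tagged tokens into noun-phrase chunks. It assumes a single hand-written rule (a run of determiner, adjective and noun tags that contains at least one noun) and Penn Treebank style tags; the chunkers listed above are trained tools and do not work this way. The function name np_chunks is hypothetical.

```python
def np_chunks(tagged):
    """Collect maximal runs of DT/JJ/NN* tokens that contain a noun."""
    chunks, current = [], []
    # A sentinel token closes the last run at the end of the sentence.
    for word, tag in list(tagged) + [("", "END")]:
        if tag in ("DT", "JJ") or tag.startswith("NN"):
            current.append((word, tag))
        else:
            # Keep the run only if it actually contains a noun.
            if any(t.startswith("NN") for _, t in current):
                chunks.append(" ".join(w for w, _ in current))
            current = []
    return chunks

tagged = [("The", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
          ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
print(np_chunks(tagged))
# ['The quick fox', 'the lazy dog']
```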

Dependency Parsing

Identifying the dependency relations that hold between the words of a sentence and labeling each relation with its type.

Mate Parser

Usage: Can be downloaded or tested online. It includes two different dependency parsers: a graph-based parser, and a transition-based parser that performs tagging jointly.

Programming language: Java

Languages: English, German

MateParser

Malt Parser

MaltParser

Stanford Parser

Usage: Can be downloaded or used online. Outputs dependencies and phrase structure trees.

Programming language: Java

Languages: English, Chinese, German, Arabic, Italian, Bulgarian, Portuguese

Stanford Parser
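
The output of a dependency parser can be pictured as one head and one relation label per word. The Python sketch below stores a hand-annotated sentence in that form and reads the root and the dependents back out. The analysis, the Universal Dependencies style labels (det, nsubj, root), and the helper names are assumptions for illustration; none of this is produced by or taken from the parsers listed above.

```python
# Each token: (index, form, head index, relation). Head 0 marks the root.
# Hand-written analysis, not actual parser output.
sentence = [
    (1, "The", 2, "det"),
    (2, "dog", 3, "nsubj"),
    (3, "barks", 0, "root"),
]

def root(sentence):
    """Return the form of the token whose head index is 0."""
    return next(form for _, form, head, _ in sentence if head == 0)

def dependents(sentence, head_index):
    """Return the forms of all tokens attached to the given head."""
    return [form for _, form, head, _ in sentence if head == head_index]

print(root(sentence))           # barks
print(dependents(sentence, 3))  # ['dog']
print(dependents(sentence, 2))  # ['The']
```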

Tool collections

NLTK

DKPro