CRAN Task View: Natural Language Processing

Natural language processing has come a long way since its foundations were laid in the 1940s and 1950s (for an introduction see, e.g., Jurafsky and Martin (2008): Speech and Language Processing, Pearson Prentice Hall). This CRAN task view collects relevant R packages that support computational linguists in analysing speech and language on a variety of levels, with a focus on words, syntax, semantics, and pragmatics.

In recent years, we have elaborated a framework to be used in packages dealing with the processing of written material: the package tm. Extension packages in this area are highly recommended to interface with tm's basic routines and useRs are cordially invited to join in the discussion on further developments of this framework package. To get into natural language processing, the cRunch service and tutorials may be helpful.

Frameworks:

tm provides a comprehensive text mining framework for R. The Journal of Statistical Software article Text Mining Infrastructure in R gives a detailed overview and presents techniques for count-based analysis methods, text clustering, text classification and string kernels.
tm.plugin.dc allows for distributing corpora across storage devices (local files or Hadoop Distributed File System).
tm.plugin.mail helps with importing mail messages from archive files such as used in Thunderbird (mbox, eml).
tm.plugin.factiva allows importing press/Web corpora from Dow Jones Factiva.
RcmdrPlugin.temis is an Rcommander plug-in providing an integrated solution for a series of text mining tasks, such as importing and cleaning a corpus, and for analyses such as term and document counts, vocabulary tables, term co-occurrences, document similarity measures, time series analysis, correspondence analysis and hierarchical clustering.
openNLP provides an R interface to OpenNLP, a collection of natural language processing tools including a sentence detector, tokenizer, POS tagger, shallow and full syntactic parser, and named-entity detector, using the Maxent Java package for training and using maximum entropy models.
Trained models for English and Spanish to be used with openNLP are available from http://datacube.wu.ac.at/ as packages openNLPmodels.en and openNLPmodels.es, respectively.
RWeka is an interface to Weka, a collection of machine learning algorithms for data mining tasks written in Java. Especially useful in the context of natural language processing is its functionality for tokenization and stemming.
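
As a rough illustration of the count-based workflow that tm supports, the following minimal sketch builds a small corpus and derives a document-term matrix from it. The three example documents are made up, and the calls assume a reasonably recent version of tm (one that provides VCorpus and content_transformer); it is a sketch, not a recommended pipeline.

```r
library(tm)

# Three toy documents (made up for illustration).
docs <- c("Text mining with R.",
          "R supports text mining and clustering.",
          "Clustering groups similar documents.")

# Wrap the character vector in a corpus.
corpus <- VCorpus(VectorSource(docs))

# Basic cleaning: lower-case the text and strip punctuation.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)

# Documents become rows, terms become columns.
dtm <- DocumentTermMatrix(corpus)

inspect(dtm)
findFreqTerms(dtm, lowfreq = 2)  # terms occurring at least twice overall
```

The resulting matrix is the usual starting point for the clustering and classification techniques mentioned above. With the default settings assumed here, very short terms (such as "r") are not kept as columns.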

Words (lexical DBs, keyword extraction, string manipulation, stemming):

wordnet provides an R interface to WordNet, a large lexical database of English.
R's base package already provides a rich set of character manipulation routines. See help.search(keyword = "character", package = "base") for more information on these capabilities.
RKEA provides an R interface to KEA (Version 5.0). KEA (for Keyphrase Extraction Algorithm) allows for extracting keyphrases from text documents. It can be used either for free indexing or for indexing with a controlled vocabulary.
gsubfn can be used for certain parsing tasks such as extracting words from strings by content rather than by delimiters. demo("gsubfn-gries") shows an example of this in a natural language processing context.
tau contains basic string manipulation and analysis routines needed in text processing such as dealing with character encoding, language, pattern counting, and tokenization.
Snowball provides the Snowball stemmers which contain the Porter stemmer and several other stemmers for different languages. See the Snowball webpage for details.
SnowballC provides exactly the same API as Rstem, but uses a slightly different design of the C libstemmer library from the Snowball project. It also supports two more languages.
Rstem (available from Omegahat) is an alternative interface to a C version of Porter's word stemming algorithm.
KoNLP provides a collection of conversion routines (e.g. Hangul to Jamos), stemming, and part-of-speech tagging by interfacing with Lucene's HanNanum analyzer. In version 0.0-8.0, the documentation is sparse and still needs some help.
koRpus is a diverse collection of functions for automatic language detection, hyphenation, several indices of lexical diversity (e.g., type token ratio, HD-D/vocd-D, MTLD) and readability (e.g., Flesch, SMOG, LIX, Dale-Chall). See the web page for more information.
zipfR offers some statistical models for word frequency distributions. The utilities include functions for loading, manipulating and visualizing word frequency data and vocabulary growth curves. The package also implements several statistical models for the distribution of word frequencies in a population. (The name of this library derives from the most famous word frequency distribution, Zipf's law.)
maxent is an implementation of maximum entropy modelling that minimises memory consumption for very large data sets.
wordcloud provides a visualisation similar to the famous Wordle ones: it distributes features horizontally and vertically in a pleasing layout, with the font size scaled by frequency.
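
To give a flavour of the word-level operations listed above, here is a minimal sketch that uses only base R's character routines to tokenize a string, count word frequencies, and compute a type-token ratio (one of the lexical diversity indices mentioned for koRpus). The sample sentence and the simple ASCII-only tokenization rule are assumptions made for illustration; the dedicated packages handle encodings, languages, and stemming far more carefully.

```r
# Toy input text (made up for illustration).
text <- "The cat sat on the mat. The dog sat on the log."

# Lower-case, then split on runs of characters that are not a-z.
tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
tokens <- tokens[nzchar(tokens)]  # drop any empty strings left by the split

# Word frequencies, most frequent first.
freq <- sort(table(tokens), decreasing = TRUE)

# Type-token ratio: number of distinct word forms divided by number of tokens.
ttr <- length(unique(tokens)) / length(tokens)

print(freq)
print(ttr)
```

A frequency table of this kind is also the sort of input that frequency-based tools such as wordcloud or zipfR work from, although each package expects its own input format.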

Semantics:

lsa provides routines for performing a latent semantic analysis with R. The basic idea of latent semantic analysis (LSA) is that texts have a higher-order (latent semantic) structure which, however, is obscured by word usage (e.g. through the use of synonyms or polysemy). By using conceptual indices that are derived statistically via a truncated singular value decomposition (a two-mode factor analysis) over a given document-term matrix, this variability problem can be overcome. The article Investigating Unstructured Texts with Latent Semantic Analysis gives a detailed overview and demonstrates the use of the package with examples from the area of technology-enhanced learning.
topicmodels provides an interface to the C code for Latent Dirichlet Allocation (LDA) models and Correlated Topics Models (CTM) by David M. Blei and co-authors and the C++ code for fitting LDA models using Gibbs sampling by Xuan-Hieu Phan and co-authors.
lda implements Latent Dirichlet Allocation and related models similar to LSA and topicmodels.
kernlab allows creating and computing with string kernels, such as full string, spectrum, or bounded range string kernels. It can directly use the document format used by tm as input.
skmeans helps with clustering providing several algorithms for spherical k-means partitioning.
movMF provides another clustering alternative (mixtures of von Mises-Fisher distributions are fitted to the unit-length vectors).
textir is a suite of tools for text and sentiment mining.
textcat provides support for n-gram based text categorization.
corpora offers utility functions for the statistical analysis of corpus frequency data.
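
The truncated singular value decomposition behind latent semantic analysis can be sketched with base R alone. The example below does not use the lsa package; the tiny document-term matrix, its counts, and the choice of k = 2 latent dimensions are all assumptions made for illustration. Documents d1 and d2 use "car" and "automobile" in different proportions, and the sketch compares their cosine similarity before and after projection into the latent space.

```r
# Toy document-term matrix (rows = documents, columns = terms); counts are made up.
dtm <- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 0,
    0, 0, 1, 2,
    0, 0, 2, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("d1", "d2", "d3", "d4"),
                  c("car", "automobile", "flower", "petal"))
)

k <- 2         # number of latent dimensions to keep (an arbitrary choice here)
s <- svd(dtm)  # full decomposition: dtm = U diag(d) t(V)

# Document coordinates in the k-dimensional latent space: U_k scaled by d_k.
doc_vecs <- s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], nrow = k)
rownames(doc_vecs) <- rownames(dtm)

# Cosine similarity between two numeric vectors.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

sim_raw  <- cosine(dtm["d1", ], dtm["d2", ])            # similarity on raw counts
sim_same <- cosine(doc_vecs["d1", ], doc_vecs["d2", ])  # same topic, latent space
sim_diff <- cosine(doc_vecs["d1", ], doc_vecs["d3", ])  # different topics

print(c(raw = sim_raw, latent_same = sim_same, latent_diff = sim_diff))
```

In this toy case the two vehicle documents become more similar in the latent space than they are on raw counts, while documents about different topics remain dissimilar. Real applications usually weight the matrix first and choose k empirically.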

Pragmatics:

qdap helps with quantitative discourse analysis of transcripts.

Maintainer:	Fridolin Wild, Knowledge Media Institute (KMi), The Open University, UK
Contact:	fridolin.wild at open.ac.uk
Version:	2014-01-01