DM Answers

Data Mining with STATISTICA

Extending STATISTICA Text Mining Capabilities using R Integration

12th of April, 2014

Last summer I kicked off my blog with a series of posts about text mining PubMed journal articles using STATISTICA Text Miner. In the fall of 2013 a reader asked about adding phrases to the text analysis. STATISTICA allows phrases to be specified, but it has no provision for counting phrases in the documents, so there is no way to tell from within STATISTICA which phrases occur often enough to be worth including in the analysis. I suggested to the reader that R could possibly be used to generate n-grams (sequences of n consecutive words), which in turn could be specified in STATISTICA.

I didn’t give it much thought for a few months, until I came upon another text mining project. I then set out to find a method to get the n-grams within STATISTICA using R. The following video shows the results of my research. I found that the tm and RWeka packages could be used together to get the desired results. This of course presumes that you have WEKA installed on your computer, along with Java, which the RWeka package uses to access WEKA.

If you would like to try the code with a text mining project within STATISTICA, save the following text in a text editor such as Notepad with an extension of .R. When the file is opened in STATISTICA, it will be automatically recognized as an R macro.

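# Assumption: the body of this check is missing from the original listing.
# Clearing JAVA_HOME is a common workaround that lets rJava locate Java itself.
if (Sys.getenv("JAVA_HOME") != "") Sys.setenv(JAVA_HOME = "")
library(tm)
library(RWeka)
# ActiveDataSet is supplied by STATISTICA; the text is in the first column.
pubmed <- ActiveDataSet
pubmed <- pubmed[, 1]
doc.vec <- VectorSource(pubmed)
# VCorpus is used so that recent tm versions honor the custom tokenizer.
doc.corpus <- VCorpus(doc.vec)
# Despite its name, this tokenizer returns both 2-grams and 3-grams.
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 3))
# The stop word list is cut off in the original listing; extend it as needed.
skipWords <- function(x) removeWords(x, c("i", "me", "my", "myself", "we", "our", "ours", "ourselves"))
# tm_reduce applies these from right to left, so stemming runs first.
funcs <- list(stripWhitespace, skipWords, removePunctuation, content_transformer(tolower), stemDocument)
y <- tm_map(doc.corpus, FUN = tm_reduce, tmFuns = funcs)
tdm <- TermDocumentMatrix(y, control = list(tokenize = BigramTokenizer))
# List every phrase that occurs at least 20 times across all documents.
findFreqTerms(tdm, lowfreq = 20, highfreq = Inf)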
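
If you want to see what the macro is doing without installing anything, here is a minimal base R sketch of the same idea: split each document into words, build the 2- and 3-word phrases, and keep the ones that reach a minimum count. It is not the tm/RWeka method above, only an illustration. The function names (tokenize_words, make_ngrams, frequent_ngrams), the sample documents and the threshold of 2 are all made up for this example, and it skips stemming and stop word removal.

```r
# Hypothetical helper: lowercase the text, replace punctuation with spaces,
# and split on spaces into word tokens.
tokenize_words <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z0-9 ]+", " ", text)
  words <- strsplit(text, " +")[[1]]
  words[nzchar(words)]
}

# Hypothetical helper: build every n-gram of size min_n..max_n from one document.
make_ngrams <- function(text, min_n = 2, max_n = 3) {
  words <- tokenize_words(text)
  grams <- character(0)
  for (n in min_n:max_n) {
    if (length(words) < n) next
    starts <- seq_len(length(words) - n + 1)
    grams <- c(grams, vapply(
      starts,
      function(i) paste(words[i:(i + n - 1)], collapse = " "),
      character(1)
    ))
  }
  grams
}

# Hypothetical helper: count n-grams over all documents and keep those that
# occur at least min_count times in total.
frequent_ngrams <- function(docs, min_count = 2, min_n = 2, max_n = 3) {
  all_grams <- unlist(lapply(docs, make_ngrams, min_n = min_n, max_n = max_n))
  counts <- table(all_grams)
  counts <- counts[counts >= min_count]
  sort(counts, decreasing = TRUE)
}

# Made-up sample documents standing in for the first column of the data set.
docs <- c("Text mining of journal articles",
          "Text mining with phrases",
          "Phrases in journal articles")
result <- frequent_ngrams(docs, min_count = 2)
print(result)
```

The phrases that survive the threshold are the candidates you would then specify as phrases in STATISTICA. For a real project the tm and RWeka version is the better choice, since it also handles stemming and stop words.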
