DM Answers

Data Mining with STATISTICA

Sleep Survey

28th of July, 2016
   
               
   

Creating a corpus from transactional data

10th of May, 2015

First off, I would like to say Happy Mother’s Day to all the mothers out there.  I would especially like to honor my wife, my mom, and my mom’s mom.  These are great women that I owe a lot to.  I would not be the man I am today without their positive influence in my life.

Now on to my latest dmanswers blog post using R for a text mining project.  I have a use case where I need to take some transactions from a checking account and use the R tm package to create a term-document matrix.  The term-document matrix will then be used to create a predictive model for a transaction category.  I will assume that these transactions will be accessible through an ODBC connection to a database.  For the purposes of this blog post, a comma-delimited text file has been created from a fake data set, which is shown below (Table 1).  This data set was imported into R using the following command:

Dataset <- read.table("C:/temp/Book1.csv", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
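
For completeness, pulling the same transactions over ODBC instead of from the CSV could look something like this (a sketch only; the DSN, table, and column names are hypothetical placeholders, not details from the actual database):

# Hypothetical ODBC pull of the transaction text using the RODBC package
library(RODBC)
ch <- odbcConnect("checking_dsn")   # placeholder DSN
Dataset <- sqlQuery(ch, "SELECT Transactions FROM checking_transactions")
close(ch)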

Table 1
Transactions
1 PURCHASED AUTHORIZED GROCERY ON 05/08 STORE AAAAA
2 BILL PAY GAS COMPANY
3 PURCHASED AUTHORIZED GROCERY ON 05/06 STORE AAAAA
4 PURCHASE AUTHORIZED ON 05/01 FAST FOOD CCCC
5 PURCHASE AUTHORIZED ON 05/01 MOVIE THEATRE DDDDDDD
6 CREDIT CARD E-PAYMENT
7 SCHOOL LUNCH
8 BILL PAY MORTGAGE COMPANY EEEEE
9 BILL PAY CAR LOAN 111111
10 PURCHASE AUTHORIZED ON 04/28 GAS STATION FFFFFFFFF
11 PURCHASE AUTHORIZED ON 04/15 ELECTRIC COMPANY
12 PURCHASE AUTHORIZED ON 04/13 PIZZA RESTAURANT
13 PURCHASED AUTHORIZED GROCERY ON 04/12 STORE GGGGG
14 PURCHASE AUTHORIZED ON 04/08 GAS STATION 2222222
15 CAR INSURANCE
16 TRANSFER TO SAVINGS
17 PURCHASE AUTHORIZED ON 05/01 FAST FOOD HHH
18 DENTIST
19 CHECK

Typically, individual text files are imported into a corpus for analysis.  Most examples I have looked at on the internet fit into this category.  This is a unique situation where the text will be stored in a data frame after being queried from a database.  I was curious whether there were any built-in functions that could take the data frame and convert it into a corpus.  The following reference was a good resource for answering my question:

http://cran.r-project.org/web/packages/tm/vignettes/extensions.pdf

For the use case, I do not need to specify the title, authors, or topics of the documents.  I just need each row of the data frame imported as a separate document.  After reading through the reference above, I wondered if I could get away without specifying the unneeded metadata for my use case.  I tried the following command:

corpus <- VCorpus(DataframeSource(Dataset))

This command successfully imported the contents of each row into a corpus with 19 separate documents.  I also verified the contents of the corpus by using the inspect command on the corpus object.  Everything seems to have worked correctly when importing the data frame into the corpus.  I hope to use the tm package to do some text mining on this corpus in the coming weeks.
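
For readers who want to reproduce that check, the verification and a first term-document matrix can look something like this (a minimal sketch; the control options are illustrative assumptions, not the settings I will ultimately use for the category model):

inspect(corpus[1])   # confirm the first transaction came through as a document
# A first term-document matrix from the corpus; the cleanup options below
# are illustrative assumptions, not the final modeling settings
tdm <- TermDocumentMatrix(corpus,
                          control = list(tolower = TRUE,
                                         removePunctuation = TRUE,
                                         removeNumbers = TRUE))
inspect(tdm)         # term counts for each of the 19 transactions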

Year Hiatus

26th of April, 2015

About a year ago I took a new job with Zions Bancorporation as a Data Scientist.  The group I work with does not currently use STATISTICA.  However, there is an openness to give STATISTICA a chance.  I do not currently have a license for my home computer so I have not been able to blog about STATISTICA for my own personal uses.

In the meantime, I have been honing my R programming skills.  I feel like this will be a good investment.  I took an R programming course through Coursera back in January.  If I am able to acquire a STATISTICA license for my personal use, I will continue blogging about the integration of STATISTICA and R.  If not, I am thinking about doing a few projects completely in R and reporting them on my blog.  So stay tuned in the coming weeks as I consider my options for starting my blog up again.

Extending STATISTICA Text Mining Capabilities using R Integration

12th of April, 2014

Last summer I kicked off my blog with a series of posts about text mining PubMed journal articles using STATISTICA Text Miner.  In the fall of 2013, a reader asked a question about adding phrases to the text analysis.  STATISTICA allows phrases to be specified.  The problem is that there is no provision for counting phrases within STATISTICA, so there is no easy way to know which phrases to include in the analysis.  I suggested to the reader that R could possibly be used to generate n-grams, which in turn could be specified in STATISTICA.

I didn’t give it much thought for a few months until I came upon another text mining project.  I set forth with resolve to find a method to get the n-grams within STATISTICA using R.  The following video shows the results of my research.  I found that the tm and RWeka packages could be used together to get the desired results.  This of course presumes that you have WEKA installed on your computer to be accessed through the RWeka package.

If you would like to try the code with a text mining project within STATISTICA, save the following text in a text editor such as Notepad with an extension of .R.  When the file is opened in STATISTICA, it will be automatically recognized as an R macro.

# Clear JAVA_HOME so RWeka uses its own Java configuration
if (Sys.getenv("JAVA_HOME") != "")
  Sys.setenv(JAVA_HOME = "")
library(RWeka)
library(tm)
# Pull the active STATISTICA data set into R and keep the first (text) column
pubmed <- ActiveDataSet
pubmed <- pubmed[,1]
# Build a corpus with one document per row of text
doc.vec <- VectorSource(pubmed)
doc.corpus <- Corpus(doc.vec)
# Tokenizer that produces two- and three-word phrases (bigrams and trigrams)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 3))
# Remove common English stop words (plus "epub") before tokenizing
skipWords <- function(x) removeWords(x, c("i","me","my","myself","we","our","ours","ourselves",
"you","your","yours","yourself","yourselves","he","him","his","himself","she","her","hers","herself",
"it","its","itself","they","them","their","theirs","themselves","what","which","who","whom",
"this","that","these","those","am","is","are","was","were","be","been","being","have","has","had",
"having","do","does","did","doing","a","an","the","and","but","if","or","because","as","until","while",
"of","at","by","for","with","about","against","between","into","through","during","before","after",
"above","below","to","from","up","down","in","out","on","off","over","under","again","further",
"then","once","here","there","when","where","why","how","all","any","both","each","few","more",
"most","other","some","such","no","nor","not","only","own","same","so","than","too","very","epub"))
# Clean up the corpus: strip whitespace, drop stop words and punctuation, lower-case, and stem
funcs <- list(stripWhitespace, skipWords, removePunctuation, tolower, stemDocument)
y <- tm_map(doc.corpus, FUN = tm_reduce, tmFuns = funcs)
# Term-document matrix built with the n-gram tokenizer
tdm <- TermDocumentMatrix(y, control = list(tokenize = BigramTokenizer))
# List the phrases that occur at least 20 times
findFreqTerms(tdm, lowfreq = 20, highfreq = Inf)
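
If it helps to capture those results, the frequent phrases can also be written out to a text file for pasting into the STATISTICA phrase list (a small sketch; the file path is just a placeholder):

# Save the frequent n-grams so they can be added as phrases in STATISTICA
freq.phrases <- findFreqTerms(tdm, lowfreq = 20, highfreq = Inf)
writeLines(freq.phrases, "C:/temp/frequent_phrases.txt")   # placeholder path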

Predictive Quality Control Summary

29th of March, 2014

Okay, so I lied!  There will be another predictive quality control blog post.  I decided there needed to be one more post summarizing everything I have presented over the last few months concerning predictive quality control.  I created a PowerPoint presentation and recorded a YouTube video.  As always, I would be interested in any feedback you have for me on the content of my blog posts.  I promise next time there will be a new topic. 🙂

Boosted Trees for Predictive Quality Control

18th of March, 2014

I would like to start off by congratulating StatSoft on the recent results for the Gartner Magic Quadrant for Advanced Analytics:

http://www.statsoft.com/Company/About-Us/Reviews/2014-Published-Reviews#gartner-adv-MQ2014

In the last blog post I revisited the quality control method I first talked about back on October 6, 2013.  I would like to wrap things up by talking briefly about the use of boosted trees for predictive quality control.  Anyone who has used boosted trees in STATISTICA Data Miner knows that they can be a very powerful technique for predictive modeling.  They can also be used for variable selection, which makes them useful for predictive quality control.

In the following video I will demonstrate how to use boosted trees to get an importance plot with the machine data set provided previously.  I will also discuss how the results compare to the other two methods presented in previous posts and share some tips on how to get the best results.
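
For readers who want to experiment outside of STATISTICA, here is a rough analogue of the same idea in R (a sketch only; the gbm package stands in for STATISTICA's boosted trees, and machine.data and Defect are hypothetical names for the machine data set and its quality outcome):

library(gbm)
# Fit a boosted trees model to the (hypothetical) machine data set
fit <- gbm(Defect ~ ., data = machine.data,
           distribution = "bernoulli",   # assumes a binary pass/fail outcome
           n.trees = 500, interaction.depth = 3, shrinkage = 0.05)
# Relative influence of each predictor, the analogue of the importance plot
summary(fit)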

 

Revisiting Predictive Quality Control

2nd of March, 2014

I introduced a method for predictive quality control in my October 6, 2013 blog post.  I am going to revisit this topic by refining the method to use Predictor Screening instead of Feature Selection.  The R-square value is included in the Predictor Screening output, which is useful for evaluating the quality of the results.  A case is made for using the Data Health node prior to performing Predictor Screening.
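
To give a rough sense of what an R-square based screen does, here is a univariate sketch in R (machine.data and Response are hypothetical names; this is only an analogue for a continuous target, not STATISTICA's actual Predictor Screening algorithm):

# Fit a one-predictor linear model for each candidate and record its R-square
predictors <- setdiff(names(machine.data), "Response")
r2 <- sapply(predictors, function(p)
  summary(lm(reformulate(p, response = "Response"), data = machine.data))$r.squared)
# Rank the candidate predictors from strongest to weakest
sort(r2, decreasing = TRUE)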

Improved performance over Regular Expressions using Split String

15th of February, 2014

When employing the CRISP-DM data mining model, a good portion of the time is spent in the data preparation phase.  Last week I talked about how to pick off values from a string using regular expressions.  This has significance to me right now because it applies to a work-related project.  In that post I shared some fake data on a similar magnitude to my work problem, along with some example code.  Did you try it out?  On my computer the macro was completing in about a minute and a half.

I was not happy with this performance because I know that the data I am working with in real life is about an order of magnitude larger.  This translates into roughly a twenty-minute wait using regular expressions to parse the strings.  I wanted to see if I could improve the performance.  I posted a question on the MSDN website and someone suggested using Split String.  That ended up being a great suggestion.  I put some code into my macros to measure the time to complete in seconds.  With the regular expression it was taking approximately 110 seconds on average to complete.  With the split string code it was taking about five seconds!  That is an awesome improvement.  With my larger data set at work, I was able to get the strings parsed in approximately 40 seconds instead of 21 minutes!

Anyway, I would like to thank the people who replied to my question on the MSDN website.  This has led to a vast improvement in performance for my macro and has cut down the time spent in the data preparation phase of CRISP-DM.  Now I can get on to the data mining.

If you would like to try this out, please get the data from the last post.  I’ll put both macros below for you to test out.

Macro using regular expressions (remember to check Microsoft VBScript Regular Expressions Version 5.5 in the references):

Sub Main
    StartTime = Now()
    ' Regular expression that matches each run of digits followed by a comma
    ' or at the end of the string (i.e., the values to the right of the "=")
    Dim re As New RegExp
    Dim mc As MatchCollection
    With re
        .Pattern = "([0-9]+)(?=\,)|([0-9]+$)"
        .Global = True
        .IgnoreCase = True
    End With
    Dim S1 As Spreadsheet
    Set S1 = ActiveDataSet
    ' Output spreadsheet: one row per case, 51 columns of parsed values
    Dim S2 As New Spreadsheet
    S2.SetSize(S1.NumberOfCases,51)
    S2.Visible = True
    Dim mStr As String
    For j = 1 To S1.NumberOfCases
        mStr = S1.Cells(j,1).Text
        Set mc = re.Execute(mStr)
        ' Write each match into its own column
        For i = 1 To mc.Count
            S2.Cells(j,i) = mc.Item(i - 1).Value
        Next
    Next
    EndTime = Now()
    ElapsedTime = (EndTime - StartTime)*24*60*60
    MsgBox(ElapsedTime & " seconds")
End Sub

 

Macro using split string:

Sub Main
    StartTime = Now()
    Dim S1 As Spreadsheet
    Set S1 = ActiveDataSet
    ' Output spreadsheet: one row per case, 51 columns of parsed values
    Dim S2 As New Spreadsheet
    S2.SetSize(S1.NumberOfCases,51)
    S2.Visible = True
    Dim ValStr As String
    Dim splitList As Variant
    Dim values(50) As Double
    For i = 1 To S1.NumberOfCases
        ValStr = S1.Cells(i,1)
        ' Break the string into "position=value" pairs
        splitList = Split(ValStr, ",")
        ' Keep everything after the "=" in each pair and convert it to a number
        For j = 0 To 50
            values(j) = CDbl(Mid(splitList(j), InStr(splitList(j), "=") + 1))
        Next
        ' Write the whole row of values to the spreadsheet in one call
        S2.CData(i) = values
    Next
    EndTime = Now()
    ElapsedTime = (EndTime - StartTime)*24*60*60
    MsgBox(ElapsedTime & " seconds")
End Sub

Video showing performance gains:

Regular Expressions in STATISTICA

9th of February, 2014

I have been working in my free time over the last two weeks on implementing Regular Expressions (regex) in STATISTICA.  If you are not familiar with regex, they can be very useful for searching through text strings and pulling out the relevant information you want for data mining.  For instance, I could have a string that looks like the following:

0=2,1=5,2=3,3=4,4=10

For this example, the numbers to the left of the equals sign designate a position, and the numbers to the right of the equals sign are values corresponding to the respective positions.  Regex gives you the ability to get the parts of a string you want.  In this case I want the values to the right of the equals sign.  To begin, all you have to do is activate Microsoft VBScript Regular Expressions Version 5.5 in the references for an SVB macro.  The hardest thing, in my opinion, is coming up with the regex pattern to match the data you want to pick off from the string.  Microsoft has a good reference for regular expressions:

http://msdn.microsoft.com/en-us/library/ms974570.aspx

Here is a link to some made up data for this example (tab delimited text file):

Strings with values

And here is some starter code (Don’t forget to add the reference to the VBScript Regular Expressions 5.5):

'#Language "WWB-COM"
Sub Main
    ' Regular expression that matches each run of digits followed by a comma
    ' or at the end of the string (i.e., the values to the right of the "=")
    Dim re As New RegExp
    Dim mc As MatchCollection
    With re
        .Pattern = "([0-9]+)(?=\,)|([0-9]+$)"
        .Global = True
        .IgnoreCase = True
    End With
    Dim S1 As Spreadsheet
    Set S1 = ActiveDataSet
    ' Output spreadsheet: one row per case, 51 columns of parsed values
    Dim S2 As New Spreadsheet
    S2.SetSize(S1.NumberOfCases,51)
    S2.Visible = True
    Dim mStr As String
    For j = 1 To S1.NumberOfCases
        mStr = S1.Cells(j,1).Text
        Set mc = re.Execute(mStr)
        ' Write each match into its own column
        For i = 1 To mc.Count
            S2.Cells(j,i) = mc.Item(i - 1).Value
        Next
    Next
End Sub

Typically there are many different ways to get the same matches.  See if you can figure out why I used this particular pattern.  Can you come up with some alternate patterns that work?

This macro completes in approximately 1.5 minutes on my computer.  This is longer than I want it to take.  My assumption is that the inner for loop is taking the most time.  I am looking into a way to transfer a MatchCollection to an array and then to a STATISTICA spreadsheet, thus avoiding the inner for loop.  I hope to have something working by Monday evening.  If I do, I will post a video to show the comparison between the two methods.  In the meantime, I would encourage you to set a baseline by running the macro with the attached data.

Delay in next blog post

5th of February, 2014

Hi everyone,

My youngest son has been sick this past week, so my wife and I have not gotten much sleep.  Consequently, I have not been able to put in the time necessary to get out a good blog post.  I still want to further test the in-place database capabilities of STATISTICA Data Miner with a larger data set than can reside in my computer's memory.  I also want to address a question about n-grams that was posed after one of my text mining posts a few months back.  You will see the results of these projects in the coming weeks.  I have a lot of other ideas forming in the back of my mind, but I am always open to taking on projects posed by my readers.  Please feel free to post questions or suggestions below.  I apologize for not being able to get a post out this past weekend.  I hope to be able to post on Saturday, February 8th.

Regards,

Toby Barrus