DM Answers

Data Mining with STATISTICA

Regular Expressions in STATISTICA

9th of February, 2014

I have been working in my free time over the last two weeks on implementing regular expressions (regex) in STATISTICA.  If you are not familiar with regex, they can be very useful for searching through text strings and pulling out the relevant information you want for data mining.  For instance, I could have a string that looks like the following:

0=2,1=5,2=3,3=4,4=10

For this example, the numbers to the left of the equals signs designate positions and the numbers to the right of the equals signs are the values corresponding to those positions.  Regex gives you the ability to get the parts of a string you want; in this case I want the values to the right of the equals signs.  To begin, all you have to do is activate the Microsoft VBScript Regular Expressions Version 5.5 reference for an SVB macro.  The hardest thing, in my opinion, is coming up with the regex pattern to match the data you want to pick off from the string.  Microsoft has a good reference for regular expressions:

http://msdn.microsoft.com/en-us/library/ms974570.aspx

Here is a link to some made-up data for this example (tab-delimited text file):

Strings with values

And here is some starter code (don’t forget to add the reference to Microsoft VBScript Regular Expressions 5.5):

'#Language "WWB-COM"
Sub Main
    ' Requires a reference to Microsoft VBScript Regular Expressions 5.5
    Dim re As New RegExp
    Dim mc As MatchCollection
    With re
        ' Match a run of digits followed by a comma, or a run of digits at
        ' the end of the string: the values right of each equals sign
        .Pattern = "([0-9]+)(?=\,)|([0-9]+$)"
        .Global = True
        .IgnoreCase = True
    End With
    Dim S1 As Spreadsheet
    Set S1 = ActiveDataSet
    Dim S2 As New Spreadsheet
    S2.SetSize(S1.NumberOfCases,51)
    S2.Visible = True
    Dim mStr As String
    Dim i As Long, j As Long
    For j = 1 To S1.NumberOfCases
        mStr = S1.Cells(j,1).Text
        Set mc = re.Execute(mStr)
        ' Write each match to its own column of the output spreadsheet
        For i = 1 To mc.Count
            S2.Cells(j,i) = mc.Item(i - 1).Value
        Next
    Next
End Sub

Typically there are many different ways to get the same matches.  See if you can figure out why I used this particular pattern.  Can you come up with some alternate patterns that work?
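(If you want a hint: one shorter pattern that should produce the same matches is \d+(?=,|$), which matches any run of digits followed by either a comma or the end of the string.  I have not timed it against the pattern above, so treat it as a suggestion rather than the answer.)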

This macro completes in approximately 1.5 minutes on my computer.  This is longer than I want it to take.  My assumption is that the inner For loop is taking the most time.  I am looking into a way to transfer a MatchCollection to an array and then to a STATISTICA spreadsheet, thus avoiding the inner For loop.  I hope to have something working by Monday evening.  If I do, I will post a video to show the comparison between the two methods.  In the meantime, I would encourage you to set a baseline by running the macro with the attached data.

Delay in next blog post

5th of February, 2014

Hi everyone,

My youngest son has been sick this past week, so my wife and I have not gotten much sleep.  Consequently I have not been able to put in the time necessary to get out a good blog post.  I still want to further test the in-place database capabilities of STATISTICA Data Miner with a data set larger than can reside in my computer’s memory.  I also want to address a question about n-grams that was posed after one of my text mining posts a few months back.  You will see the results of these projects in the coming weeks.  I have a lot of other ideas forming in the back of my mind, but I am always open to taking on projects posed by my readers.  Please feel free to post questions or suggestions below.  I apologize for not being able to get a post out this past weekend.  I hope to be able to post on Saturday, February 8th.

Regards,

Toby Barrus

STATISTICA vs Rapid Miner Round 2: Aggregating Big Data

18th of January, 2014

In the last post I presented my findings on the time required to load the mortgage data set into memory.  Remember that the task was possible with STATISTICA but failed with Rapid Miner.  I wanted to devise a test using the streaming capability of Rapid Miner and the In Place Database functionality of STATISTICA so I could actually compare results.  The mortgage data contains a State_Code column, and I would like to get an aggregate count of the number of loans by state.

The first step was to install a MySQL database on my computer.  I then uploaded the mortgage .csv file from the last blog post to the database.  I also downloaded and installed the MySQL ODBC connectors, and then created an ODBC connection for the MySQL database.

Now that the database and connectors were in place I aggregated the state code with the following methods and corresponding time results:

Table of times to aggregate

Notice that having the .sta file in memory leads to the fastest results.  So it looks like if it is possible to load the file into memory, this is the best way to go.  Another takeaway is that the ability to write a query against the In Place Database is a nice feature in STATISTICA.  This gave me the ability to return only the state code column, which led to significantly better performance than returning all the data.  This is one area where STATISTICA is a more polished product than Rapid Miner.
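For reference, the aggregation itself boils down to a single GROUP BY query.  Here is a minimal sketch of that query run from R through DBI (the table name lar_2012 and the connection details are my own assumptions for illustration, not taken from the setup above):

library(DBI)

# Connect to the local MySQL database (credentials are placeholders)
con <- dbConnect(RMySQL::MySQL(), dbname = "hmda",
                 user = "root", password = "change_me")

# Only State_Code and the counts ever leave the database,
# which is the whole point of pushing the query to the server
counts <- dbGetQuery(con,
  "SELECT State_Code, COUNT(*) AS loan_count
   FROM lar_2012
   GROUP BY State_Code")

dbDisconnect(con)

The same idea applies to the In Place Database query in STATISTICA: the less data that has to cross the connection, the faster the aggregation completes.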

Yes, Rapid Miner can stream a database, which allows big data sets to be analyzed.  However, there is currently no way to limit what is returned by the streaming feature: the whole database table must be streamed.  An attribute filter operator can be placed after the Stream Database operator, but this does not reduce the time for the process to complete.  The fact that the whole table must be returned is definitely hurting performance in this case.

There is some evidence here that the JDBC connector is faster than the ODBC connector.  I am working on getting an OLE DB connector directly to MySQL that could be used by STATISTICA.  If I am successful, I will follow up on this in a later blog post.

If money is an issue, then Rapid Miner can stream a database for a big data project.  Just keep in mind that you will either have to accept the longer times needed to stream a whole table or do additional work to create a view of the desired columns within the database.  If time to produce results is a concern, then I think STATISTICA is the clear winner here.

Look for another blog post in two weeks where I will continue to compare the In Place Database feature from STATISTICA and the Streaming Database operator in Rapid Miner.

 

STATISTICA vs. Rapid Miner on a Big Data Problem

4th of January, 2014

I acquired some public mortgage data from www.ffiec.gov for today’s big data challenge between STATISTICA and Rapid Miner.  If you would like to duplicate this challenge on your own computer, flat files can be obtained from the following link:

http://www.ffiec.gov/hmda/hmdaflat.htm

You will see three headings on this web page:  TS & MSA Info, LAR, and File Formats & Documentation (PDFs).  The data can be obtained by clicking the links under the LAR heading.  Descriptions of the data sets can be found under the File Formats & Documentation (PDFs) heading.  For today’s challenge I clicked on ALL under 2012.  The resulting downloaded zip file was 427 MB in size.  When the zip file was unpacked, the resulting CSV file was 3.1 GB in size.

This is a simple challenge:  Can the respective software load the mortgage data into memory?

Three trials were conducted on an HP Envy desktop computer with an i7 processor and 12 GB of RAM, under the same environmental conditions.

The first trial was conducted with the trial version of Rapid Miner version 6.  After about 15 minutes the following error message was received:

error csv file import to Rapid Miner 6

Therefore the total time to load and the resulting file size are unknown with the trial version of Rapid Miner 6.

The second trial was conducted with Rapid Miner 5.3, which works virtually the same as version 6 when importing CSV files.  Approximately twenty minutes after the Read CSV operator was launched, the following message was displayed:

error csv file import to Rapid Miner 5_3

The third trial was conducted with STATISTICA version 12, which took six minutes to read in the 18.7 million records.  The resulting STATISTICA spreadsheet (.sta) was 7.8 GB in size.

Here is a video capture of the three trials:

 

Please post your experiences loading the mortgage data set in the comments section below.  I’ll post again in two weeks with another Big Data challenge between STATISTICA and Rapid Miner.

My Linear Model for the NBA data posted in the last blog entry…

17th of December, 2013

I’ll get to the model for the NBA data in a minute.  First, please allow me a small rant.  Sometimes the simple things are the best.  I have found this to be true with data mining models.  Some might ask: why build a linear model when it is so much sexier to build a neural network or a support vector machine?  My reply is to pose a question in return: the simple things are easier to explain, so if the performance is equivalent, why add unneeded complexity?

I really like Data Miner Recipes in STATISTICA Data Miner.  It provides the most satisfying automated data mining experience I have found among commercial or free data mining tools.  Unfortunately, linear models are not included in Data Miner Recipes.  I understand that linear models aren’t appropriate for large data sets, where everything becomes significant, but it would still be nice to have them as an option that could be checked or unchecked according to the situation.  If this option existed, it would be real slick to quickly compare the results of a neural network versus a linear model on a small data mining project.  To me, the results of a linear model should be the baseline of performance for the non-linear models.  If the linear model outperforms these other models, then there is no question which option to select.  On the other hand, a non-linear model needs to be a clear winner to justify selecting it over the linear model.  I feel justified in making this statement because of the added burden on the data miner to explain what is going on inside the black box of a non-linear model.  So there is my rant for the day.

I followed the steps for building a linear model essentially as laid out by Data Miner Recipes.  I started by checking for correlations between the input variables.  I found that TS% and EFF FG% were highly correlated with OFF EFF, so I removed TS% and EFF FG% from the analysis.  I did not do feature selection because I intended to use Best Subsets or Backwards Selection for the linear model.  The rule of thumb I use is that there must be at least 5 cases to estimate each effect in the model; with 30 NBA teams, that works out to roughly 30/5 = 6 effects, so at most 5 to 6 input variables should be included in the team winning percentage model.  Even after removing TS% and EFF FG%, there are not enough cases to estimate all the first-order effects for the remaining input variables.

From the Advanced Models menu I selected General Regression and then General Linear Models.  This allowed me to select Backwards Stepwise under the Options tab.  I found in this analysis that the reduced model was the same whether I used Backwards Stepwise or Best Subsets.  Only OFF EFF and DEF EFF were significant in predicting winning percentage in the reduced model.  This makes sense: teams that are the best defensively and offensively should win the most.  Overall, this simple model does a pretty good job of predicting winning percentage (Table 1).

Table 1

Model

 

The observed versus predicted values graphically demonstrate the high R^2 value recorded in Table 1 (see Figure 1).

Figure 1

Observed Values vs. Predicted

 

For comparison, I fit a neural network that achieved a comparable R^2 value, but its observed versus predicted graph seemed to indicate over-fitting.  I also ran the Feature Selection tool after removing the correlated variables, and DEF EFF and OFF EFF had the highest F-values.  This was a nice confirmation that the linear model I had arrived at was a good fit.
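If you would like to sanity-check the reduced model outside of STATISTICA, here is a minimal sketch in R (the file name and column names are my assumptions about the CSV from the last post; adjust them to match the actual headers):

# Fit the reduced model: winning percentage on offensive and defensive efficiency
nba <- read.csv("NBA_team_stats_2013-12-04.csv")    # hypothetical file name
fit <- lm(WIN_PCT ~ OFF_EFF + DEF_EFF, data = nba)  # hypothetical column names
summary(fit)                                        # coefficients and R^2
plot(fitted(fit), nba$WIN_PCT,                      # observed vs. predicted
     xlab = "Predicted", ylab = "Observed")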

What did you come up with for a model?  Do you get a comparable R^2 value?  Did you check for over-fitting?  Do you agree that linear models are underutilized for data mining or do you feel like my rant is off base?

I would be curious to hear your feedback.  Please post your comments or questions below.  Have a great week.  I hope to post next weekend.  I am planning a comparison between the new version of Rapid Miner and STATISTICA using a big data problem.

Building a Linear Model in STATISTICA for NBA Team Basketball Data

10th of December, 2013

I have lost my voice so I am not going to produce a video this time around.  I would like to throw out a challenge problem.  I will discuss the results in a blog post next weekend.

I collected team statistics for the current NBA basketball season from the ESPN website, current as of December 4th.  I paired these with each team’s winning percentage for the games up to that same day.  I propose that you create a linear model predicting winning percentage with the given statistics as inputs.

Here is the data file in csv format:

NBA Team Statistics from ESPN up to 12-4-2013

Post your model or questions in the comments section below.

Have some fun, and be on the lookout for a new video next week demonstrating how I used STATISTICA to create the linear model.

R Integration with STATISTICA: Heat Maps

20th of November, 2013

Today I am going to graph some data from the current NBA season using a heat map.  Heat maps are not a standard graph in STATISTICA, so we need to use the R integration functionality to extend STATISTICA’s graphing capabilities.  This first requires R to be installed on your computer.  You will also need the latest maintenance release of STATISTICA (V12AUPD008).  Once these conditions are met, when you launch STATISTICA you will be asked whether you would like to integrate R into STATISTICA.  When you click Yes, the COMadaptR package is automatically installed and the R integration is complete.

R_Integration_1

At this point, all you need to do is create an R script and open it in STATISTICA, and it will be recognized as an SVB macro.  Here are the NBA data and the R script if you would like to duplicate the analysis.  Note that you will need to save the R script text file with a .R extension for it to work.

Top_50_relevant_stats_up_to_11-16-2013

NBA_heatmap_2
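For a sense of what such a script can look like, here is a minimal sketch (this is not the exact script linked above; the file name and the Player column are my assumptions about the data layout):

# Build a heat map from the NBA player statistics (base R, no extra packages)
nba <- read.csv("Top_50_relevant_stats_up_to_11-16-2013.csv")  # hypothetical name
m <- as.matrix(nba[, sapply(nba, is.numeric)])  # keep only the numeric columns
rownames(m) <- nba$Player                       # assumed player-name column
heatmap(m, scale = "column")                    # scale within each statistic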

The following YouTube video explains the process of creating the heat map.

 

Please feel free to post a comment or question below.  I hope to have another exciting post in two weeks. 🙂

 

Multivariate SPC

5th of November, 2013

I would like to continue talking about Statistical Process Control, but this time I would like to focus on the case where multiple correlated metrics are being monitored.  Should each of the metrics be monitored separately, or would it make more sense to consider them together in a multivariate analysis?  I hope to answer this question in today’s blog post.

There is an R package called MSQC that provides some multivariate SPC data with correlated inputs.  The data set I will consider today is called bimetal1.  It consists of five physical characteristics for 28 bimetal thermometers.  The thermometers are constructed by fusing together two metals that expand and contract at different rates.  The five recorded physical characteristics should stay fairly constant, but is this enough evidence to say that each of the thermometers is good?

I will demonstrate in the video below that the metrics are correlated.  I believe this fact answers the question above: since the metrics are correlated, we should consider them together in the analysis.  If the metrics are considered separately, we miss out on the additional information contained in the correlations.  A potential special cause could be indicated when a particular sample breaks from the correlation seen in the past.
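If you have R installed, you can pull the data and draw a multivariate chart directly from the MSQC package.  Here is a minimal sketch (assuming MSQC’s mult.chart interface; check the package documentation if the call differs in your version):

# Load the bimetal thermometer data and look at the correlations
library(MSQC)
data(bimetal1)
cor(bimetal1)                        # the five characteristics are correlated

# Hotelling T^2 chart: monitors all five metrics jointly
mult.chart(bimetal1, type = "t2")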

Here is the data set in Excel format for those that would like to replicate the analysis shown in the video:

bimetal1

Notice in the following video that I made a mistake when I created the ellipse on the scatterplot.  If I had used the normal ellipse with .95 and .99 confidence instead of range, the conclusions drawn from the scatterplot would have been consistent with the multivariate SPC charts.

 

 

As I mentioned in the video, I wish I could have come up with a public-domain data set that showed no signals on the individual charts but a signal on the multivariate SPC chart.  I hope data point number 20 was enough to help you see the potential of multivariate SPC charts.  I know from personal experience that these types of charts can be very useful when multiple correlated metrics are being monitored; unfortunately, those data are proprietary and I cannot share them.

Next time I plan on demonstrating the Predictive SPC charting functionality from the STATISTICA QC Miner package.  Until then, I hope you enjoy discovering Multivariate SPC charting in STATISTICA.  If you find any other errors or if you have any questions, please feel free to make a comment below.

SPC charts signal when to use Predictive Quality Control

20th of October, 2013

In the last blog post I demonstrated a method for generating clues for root cause analysis using the Feature Selection tool in STATISTICA Data Miner.  In essence, I was trying to predict whether any of the inputs could explain changes in the output.  This process has been referred to by others as Predictive Quality Control.

Today I want to talk about the situations in which we would want to use this Predictive Quality Control method.  Obviously, if the data collected from the process look good, there is no need to investigate small fluctuations in the metric of interest.  So this raises the question: when does it become necessary to perform root cause analysis?

To help answer this question, I would like to refer back to the machine data from the last blog post.  Remember that the output was a quality score and the inputs consisted of various machines, machine users, and material lots.  Let’s assume that the data came from one day’s worth of production, and that we had previously collected 19 days of production data.

Here is what the SPC chart looks like for this hypothetical situation:

Quality Score SPC Chart

The following video takes this hypothetical scenario into account and motivates the use of statistical process control charts to signal when to use the Predictive Quality Control method discussed last time.
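To make the signaling idea concrete, here is a minimal sketch of an individuals chart in R using the qcc package (the 20 daily quality scores below are made-up placeholders, not the actual machine data):

# Individuals chart: one quality score per production day
library(qcc)
score <- c(91, 93, 92, 90, 94, 92, 91, 93, 92, 90,
           93, 92, 91, 94, 92, 93, 91, 92, 90, 84)  # placeholder data; day 20 dips
qcc(score, type = "xbar.one")  # 3-sigma limits; a point outside them is the
                               # signal to launch the Predictive QC analysis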