DM Answers

Data Mining with STATISTICA

Improved performance over Regular Expressions using Split String

15th of February, 2014

When employing the CRISP-DM data mining model, a good portion of the time is spent in the data preparation phase.  Last week I talked about how to pick off values from a string using regular expressions.  This is a project that has significance to me right now because this applies to a work related project.  I posted some fake data that is on a similar magnitude as my work problem and some example code last week.  Did you try it out?  With my computer the macro was completing in about a minute and a half.

I was not happy with this performance because I know that the data I am working with in real life is about one magnitude larger.  This translates into about a twenty minute wait using regular expressions to parse the strings.  I wanted to see if I could improve the performance.   I posted a question on the MSDN website and someone suggested using Split String.  That ended up being a great suggestion.  I put some code into my macros to measure the time to complete in seconds.  With the regular expression it was taking approximately 110 seconds on average to complete.  With the split string code it was taking about five seconds!  That is an awesome improvement.  With my larger data set at work, I was able to get the strings parsed in approximately 40 seconds instead of 21 minutes!

Anyway, I would like to thank the people that posted in reply to my question on the MSDN website.  This has led to a vast improvement in performance for my macro and has cut down the time spent in the data preparation phase of CRISM-DM.  Now I can get on to the data mining.

If you would like to try this out, please get the data from the last post.  I’ll put both macros below for you to test out.

Macro using regular expressions (remember to check Microsoft VBScript Regular Expressions Version 5.5 in the references):

Sub Main
StartTime = Now()
Dim re As New RegExp
Dim mc As MatchCollection
With re
.Pattern = “([0-9]+)(?=\,)|([0-9]+$)”
.Global = True
.IgnoreCase = True
End With
Dim S1 As Spreadsheet
Set S1 = ActiveDataSet
Dim S2 As New Spreadsheet
S2.SetSize(S1.NumberOfCases,51)
S2.Visible = True
Dim mStr As String
For j = 1 To S1.NumberOfCases
mStr = S1.Cells(j,1).Text
Set mc = re.Execute(mStr)
For i = 1 To mc.Count
S2.Cells(j,i) = mc.Item(i – 1).Value
Next
Next
EndTime = Now()
ElapsedTime = (EndTime – StartTime)*24*60*60
MsgBox(ElapsedTime & ” seconds”)
End Sub

 

Macro using split string:

Sub Main
StartTime = Now()
Dim S1 As Spreadsheet
Set S1 = ActiveDataSet
Dim S2 As New Spreadsheet
S2.SetSize(S1.NumberOfCases,51)
S2.Visible = True
Dim ValStr As String
Dim splitList As Variant
Dim values(50) As Double
For i = 1 To S1.NumberOfCases
ValStr = S1.Cells(i,1)
splitList = Split(ValStr, “,”)
For j= 0 To 50
values(j) = CDbl(Mid(splitList(j),InStr(splitList(j),”=”)+1,InStr(splitList(j),Right(splitList(j),1))+1))
Next
S2.CData(i) = values
Next
EndTime = Now()
ElapsedTime = (EndTime – StartTime)*24*60*60
MsgBox(ElapsedTime & ” seconds”)
End Sub

Video showing performance gains:

Leave a Reply

Your email address will not be published. Required fields are marked *

*

*