I have been working the last two weeks in my free time on implementing Regular Expressions(regex) in STATISTICA.  If you are not familiar with regex, they can be very useful to search through text strings and pull out relevant information you want for data mining.  For instance I could have a string that looks like the following:


For this example, the numbers to the left of equals sign designate a position and the numbers to the right of the equals sign are values corresponding to the respective positions.  Regex gives you the ability to get the parts of string you want.  In this case I want the values to the right of the equals sign.  To begin, all you have to do is activate the Microsoft VBScript Regular Expressions Version 5.5 in the references for a SVB macro.  The hardest thing in my opinion is coming up with the regex pattern to match the data you want to pick off from the string.  Microsoft has a good reference for regular expressions:


Here is a link to some made up data for this example (tab delimited text file):

Strings with values

And here is some starter code (Don’t forget to add the reference to the VBScript Regular Expressions 5.5):

‘#Language “WWB-COM”
Sub Main
Dim re As New RegExp
Dim mc As MatchCollection
With re
.Pattern = “([0-9]+)(?=\,)|([0-9]+$)”
.Global = True
.IgnoreCase = True
End With
Dim S1 As Spreadsheet
Set S1 = ActiveDataSet
Dim S2 As New Spreadsheet
S2.Visible = True
Dim mStr As String
For j = 1 To S1.NumberOfCases
mStr = S1.Cells(j,1).Text
Set mc = re.Execute(mStr)
For i = 1 To mc.Count
S2.Cells(j,i) = mc.Item(i – 1).Value
End Sub

Typically there are many different ways to get the same matches.  See if you can figure out why I used this particular pattern.  Can you come up with some alternate patterns that work?

This macro completes in approximately 1.5 minutes for my computer.  This is longer than I want it to take.  My assumption is that the inner for loop is taking the most time.  I am looking into a way to transfer a MatchCollection to an array and then to a STATISTICA spreadsheet thus avoiding doing the inner for loop.  I hope to have something working by Monday evening.  If I do, I will post a video to show the comparison between the two methods.  In the meantime, I would encourage you set a baseline by running the macro with the attached data.