DM Answers

Data Mining with STATISTICA

Predicting Home Prices

22nd of September, 2013

I spent the past few days collecting some housing data near where I live.  I have taken out any identifying information and changed a few of the values, but for the most part this is a real-life data set.

House Closing Prices Sept 2013_subset3

I used a STATISTICA Data Miner workspace for this data mining project.



Here is the summary of the output from the Data Health Node:



The only default I changed in the Data Miner Workspace was setting the GLM to use second-order interactions.  All other options were left at their defaults.  I checked the models' predictions against the known closing prices of homes sold in the last year.  The goodness-of-fit statistics show that the GLM does the best job of predicting the closing prices from the input data.
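The comparison comes down to standard goodness-of-fit statistics.  As a rough illustration (the prices below are made up, not taken from the attached data set), R-squared for one model's predictions against the actual closing prices can be computed like this:

```python
# Hypothetical actual vs. predicted closing prices (not the real data).
actual = [250_000, 310_000, 199_000, 275_000]
pred = [245_000, 305_000, 210_000, 280_000]

mean_actual = sum(actual) / len(actual)
ss_res = sum((a - p) ** 2 for a, p in zip(actual, pred))  # residual sum of squares
ss_tot = sum((a - mean_actual) ** 2 for a in actual)      # total sum of squares
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # 0.97
```

The model with the highest R-squared (equivalently, the smallest residual error) is the one predicting best, which in this project was the GLM.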



I would be interested to see if someone can do better with the data provided.  Let me know how it goes.  I would like to see the prediction of my current home value based on the analysis. 🙂

I hope you find the attached data interesting and have some fun data mining with it.  Have a great weekend and I hope to hear some feedback on your experience using STATISTICA for this project.


STATISTICA Data Health Node

7th of September, 2013

I mentioned in my last post on demand forecasting that I had used a new feature in STATISTICA called the Data Health Node.  I would like to go into more detail today about the usefulness of this new feature.  I think you will find it as impressive as I did.  If you have any questions, please feel free to post them below.

Demand Forecasting

24th of August, 2013

I was able to obtain a data set describing characteristics of 184 gas stations.  I eliminated many of the variables because I do not know what they describe.  Of the variables that remain, the main ones of interest relate to marketing campaigns; I also included demographic information.  I want to see if the demand for gas (gas volume) can be predicted from the variables in the data set.  It will be interesting to see whether one of the marketing campaigns is a good predictor.  That knowledge would help a manager decide whether spending money on a similar marketing campaign is justified by the expected increase in demand for gasoline.

I used the new Data Health Node in a STATISTICA Data Mining Workspace to eliminate some of the redundant data, and I would recommend this new node to everyone.  The report does not take long to run and the resulting information is very useful; recreating all of its output manually within STATISTICA would take a considerable amount of time.  It was great that the node could remove the redundant data automatically so I could move on to model building right away.  I was so impressed that I plan to devote a whole blog post to the Data Health Node in the near future.
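STATISTICA does not publish the node's exact rules, but one kind of redundancy it catches, a column that merely duplicates another, can be sketched in a few lines of Python over hypothetical gas-station columns:

```python
# Hypothetical gas-station columns; "ad_spend_usd" duplicates "ad_spend".
data = {
    "gas_volume":   [10.0, 12.0, 9.0, 14.0],
    "ad_spend":     [1.0, 2.0, 0.5, 3.0],
    "ad_spend_usd": [1.0, 2.0, 0.5, 3.0],
}

keep, seen = [], set()
for name, col in data.items():
    key = tuple(col)
    if key in seen:
        continue  # redundant: identical to a column already kept
    seen.add(key)
    keep.append(name)

print(keep)  # ['gas_volume', 'ad_spend']
```

The real node performs many more checks (missing values, outliers, invariant columns, and so on), so treat this only as intuition for why automating the report saves so much time.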

Please refer to the following YouTube video where I describe the gas station data set along with the regression model to predict the demand for gasoline.

If you have questions about the model I built, or would like to obtain a copy of the data set, please post a comment and I will get in touch with you.  Thanks for reading and I hope you had a great weekend!


Stock Quote Forecast using STATISTICA Automated Neural Networks

13th of August, 2013

I am posting a short tutorial on how to take the technical indicator data collected using R and create a forecast using STATISTICA Automated Neural Networks.  First, a disclaimer: I cannot be held responsible for any trading losses you might sustain from using the techniques discussed here.  This tutorial is meant to demonstrate how to use STATISTICA to make a forecast; I do not recommend using this information to guide decisions about trading in the stock market.  If you do use the techniques discussed in this tutorial, you do so at your own risk.

The .csv file created using the information from the last blog post will need to be imported into STATISTICA.  The data will need some formatting to make it usable, which will not be covered in detail here.  Note that the first column in the data set will typically not have a variable name, so the column names end up offset by one.  There are several ways to deal with this; one solution is to remove the first column and align the variable names to the correct columns in Microsoft Excel before importing the data.
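Alternatively, the realignment can be scripted.  Here is a minimal Python sketch, with a made-up two-line stand-in for the exported file, that prepends a name for the unnamed row-name column:

```python
# write.table() in R leaves the row-name column without a header, so the
# remaining names sit one column too far to the left. Prepending a header
# for that column realigns everything. The file content here is made up.
raw = 'Open High Low Close\n"2010-06-01" 33.1 33.4 32.9 33.2\n'
lines = raw.splitlines()
lines[0] = "Date " + lines[0]   # give the row-name column a header
fixed = "\n".join(lines) + "\n"
print(fixed.splitlines()[0])  # Date Open High Low Close
```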

Once the .csv is imported and saved as a .sta file, you will need to add five cases to the end of the .sta file.  These blank cases provide the space needed to hold the forecast.  Complete this step before watching the following video.

Technical Indicators

11th of August, 2013

Looking over the formulas for technical indicators, I decided to use an R package called quantmod to create the historical data set for Yahoo.  Why reinvent the wheel?  The technical indicators of interest for this project were RSI, Bollinger Bands, Stochastics, and MACD.  I chose them purely for demonstration purposes and won't go into their formulas here; plenty of information is available on the internet for anyone who wants to know how to calculate these indicators from the raw data.  I do not recommend that anyone use the model that will eventually be built to guide trading decisions.  If you decide to create your own model or recreate the model created for this blog, you are responsible for any stock market trading losses you might experience; use of the information on this blog is at your own risk.  If you would like to eventually do some real trading, I would encourage you to first do some paper trading.

I will assume that readers have some background in using R.  If this is not the case, don't despair: the R code included below should get you started.  Note that R can be integrated into STATISTICA, but I chose not to demonstrate that capability in this blog post.  The historical stock quote data and the technical indicators were saved to a text file, which was then imported into STATISTICA.  You will need R installed on your computer along with the packages TTR and quantmod.  Once R is installed, enter install.packages("TTR", dependencies=TRUE) at the R command prompt to install the TTR package and any supporting packages; install quantmod the same way.  Once quantmod and TTR are installed, cut and paste the following code into R:

library(quantmod)   # also loads TTR, which supplies the indicator functions
getSymbols("YHOO", from="2010-06-01", to="2013-09-09")
YHOOcomb <- cbind(YHOO, RSI(Cl(YHOO)), BBands(HLC(YHOO)), stoch(HLC(YHOO)), MACD(Cl(YHOO)))
write.table(YHOOcomb, file="C:/Temp/YAHOO.csv")


Once the csv file is created, import it into Microsoft Excel and ensure the formatting is correct.  Modify as needed and save the data as an Excel file, which can then be imported into STATISTICA.  I have run out of time to produce the video demonstrating how to take this data and make a forecast using neural networks; I hope to have the video produced by Monday evening.  In the meantime, I would encourage you to get a copy of the Yahoo data along with the technical indicators calculated by R.

Automated Trading Recommendations

8th of August, 2013

Hey, everyone.  I just got back from an extended vacation, so sorry for the delay in getting my latest blog post out.  I want to explore how STATISTICA can be used for automated trading recommendations.  Some of the legwork has already been done: collecting stock quote data using YQL was covered in the last post.

I need to do some research on technical indicators.  I have used STATISTICA Automated Neural Networks for forecasting in the past and found them powerful yet very easy to use.  This coming weekend I hope to have enough research done to put these two things together.  Again, I apologize for the delay.  I promise it will be worth the wait! 🙂

Automating Data Collection from YQL Tables

20th of July, 2013

Last time, I demonstrated how to use Yahoo Query Language to get real-time stock quotes from Yahoo Finance.  I would encourage you to have the SVB code from that post running smoothly before proceeding with today's post, in which I will show how to automate the data collection by running that SVB code from the Windows Task Scheduler.

The first step is to create a batch file.  This can be done in a plain text editor such as Notepad.  The path in the following command line code will need to be pointed to the location you saved the SVB code from the last post.  The following code opens an instance of STATISTICA and runs the specified macro.

Example code for batch file:

"C:\Program Files\StatSoft\STATISTICA 12\statist.exe" /runmacro="C:\temp\StockQuote.svb"

Once you have typed the text into Notepad, save it with a file extension of .bat; I named my batch file StockQuote.bat.  Notepad defaults to a .txt extension, so it is very important that you add .bat to the end of the filename when saving.

Creating an automated task using a Windows Scheduled Task

Now you can proceed to create a Windows scheduled task to run this batch file.  Keep in mind that the stock market opens at 9:30 AM EDT and closes at 4 PM EDT.  I live in the Mountain time zone so I will specify 7:30 AM as the start time and 2 PM as the stop time.
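The same task can also be registered from a command prompt with schtasks.  The task name and the five-minute interval below are illustrative choices of mine, not part of the original setup:

```shell
schtasks /create /tn "StockQuote" /tr "C:\temp\StockQuote.bat" /sc minute /mo 5 /st 07:30 /et 14:00 /k
```

Here /st and /et bound the runs to market hours in Mountain time, and /k stops the task at the end time.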

Importing Open Source Data in XML format using Yahoo Query Language

6th of July, 2013

I am excited to share an open source of interesting data.  The data set discussed today can be accessed using Yahoo Query Language (YQL) and the XML import techniques I have been talking about recently.  The first step is to create a url that will capture the data.  This can be done on the YQL console.

On the right of the page you will see a frame with the heading DATA TABLES.  Underneath this heading there is a link for Show Community Tables; click on this link.  Now you can either use the search feature or navigate to the data of interest.  For today's example, navigate down to the yahoo tables, expand the yahoo tree, and select the finance quotes table.

A query for the yahoo finance data will now be displayed under Your YQL Statement, and the results of the query will be displayed below it in XML format.  Below the results you will find THE REST QUERY.  Modify the YQL statement to include the stock symbol you are interested in; in this example we will query data for Yahoo.  Copy the url under THE REST QUERY.

The following code can be used to import the XML.  Create a blank macro in STATISTICA and paste the code below into the macro window.  Add a reference to Microsoft XML Services version 3.0.  Paste the url into the constant after the equals sign; you will need to add quotation marks around the url, which I have already done for you in the example code.

Code to import Yahoo stock quote in XML format:

Const Yahoo = "*"
Sub Main
Dim objXML As DOMDocument
Set objXML = CreateObject("MSXML2.DOMDocument")
objXML.async = False
objXML.validateOnParse = False
objXML.resolveExternals = False
Set objXMLHTTP = CreateObject("MSXML2.XMLHTTP")
objXMLHTTP.Open "GET", Yahoo, False
objXMLHTTP.setRequestHeader "Content-type", "text/html"
objXMLHTTP.send ' the request must be sent before reading the response
sAns = objXMLHTTP.responseText
bAns = objXML.loadXML(sAns) ' Ensure you have a valid XML response
If Not bAns Then Exit Sub ' no valid XML came back
End Sub


Run the code and look at the resulting XML with a tool such as XML Notepad.  Here is a screenshot of some of the results I obtained using the code:

[Screenshot: XML Notepad view of the Yahoo quote results]

I was interested in the values for AskRealtime and BidRealtime.  The following code picks off those values and writes them to a STATISTICA spreadsheet.  Notice the XPath expressions used to pick off the values of interest.  Also note that you will need an open STATISTICA spreadsheet with one case and three variables.  The spreadsheet should look like the following screen shot:



Code to write XML results to STATISTICA spreadsheet:

'#Language "WWB-COM"
Option Base 1
Const Yahoo = "*"
Sub Main
Set s = ActiveDataSet
Dim objXML As DOMDocument
Set objXML = CreateObject("MSXML2.DOMDocument")
objXML.async = False
objXML.validateOnParse = False ' MSXML can misbehave when an external DTD is referenced
objXML.resolveExternals = False
Set objXMLHTTP = CreateObject("MSXML2.XMLHTTP")
objXMLHTTP.Open "GET", Yahoo, False
objXMLHTTP.setRequestHeader "Content-type", "text/html"
objXMLHTTP.send ' the request must be sent before reading the response
sAns = objXMLHTTP.responseText
bAns = objXML.loadXML(sAns) ' Ensure you have a valid XML response
If Not bAns Then GoTo EmptySetTrap
Set oRoot = objXML.documentElement
Set oItemNodes = oRoot.selectNodes("//quote")
Dim oNode As IXMLDOMNode
For Each oNode In oItemNodes
	sTime = Now()
	Set sAskNode = oNode.selectSingleNode("./AskRealtime")
	If Not sAskNode Is Nothing Then sAsk = sAskNode.Text Else sAsk = ""
	Set sBidNode = oNode.selectSingleNode("./BidRealtime")
	If Not sBidNode Is Nothing Then sBid = sBidNode.Text Else sBid = ""
	s.AddCases(s.NumberOfCases, 1)
	cell = s.NumberOfCases
	s.SetData(cell, 1, sTime)
	s.SetData(cell, 2, sAsk)
	s.SetData(cell, 3, sBid)
Next oNode
EmptySetTrap:
End Sub


Run the code and see what results are written to the spreadsheet.  Explore some of the other data available on YQL using the techniques discussed today.  If you have questions, please feel free to leave a comment below the blog post.  Next time I will share some information on how to automate the collection of the XML shown today using a Windows Scheduled Task.  Have some fun in the meantime. 😉
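For anyone who wants to check the extraction logic outside STATISTICA, the same //quote walk can be sketched in Python.  The XML snippet below is a hypothetical stand-in with the same shape as the YQL response:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the YQL finance response; the real feed has
# the same <quote><AskRealtime/><BidRealtime/></quote> structure.
xml_text = """<query><results>
  <quote symbol="YHOO">
    <AskRealtime>28.51</AskRealtime>
    <BidRealtime>28.49</BidRealtime>
  </quote>
</results></query>"""

root = ET.fromstring(xml_text)
rows = []
for quote in root.iter("quote"):  # same idea as the //quote XPath
    ask = quote.findtext("AskRealtime", default="")
    bid = quote.findtext("BidRealtime", default="")
    rows.append((ask, bid))
print(rows)  # [('28.51', '28.49')]
```

Each tuple corresponds to one case written to the spreadsheet, minus the timestamp column.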

Automated Document Classification of PubMed XML

22nd of June, 2013

I have produced a video showing how to tie together the topics from the past few posts.  This is done by recording a STATISTICA Visual Basic (SVB) macro while performing the following tasks.  First you will see how to deploy new documents using an existing text mining project.  Next you will see how to rapidly deploy the previously created Boosted Tree model to the text mining results.  Finally, the recorded macro is used to automate these two tasks.  Very cool!!!  I hope you have an appreciation for how powerful this concept is for automatically classifying documents.  I would be interested to hear about anyone's successes or failures with this process.  Feel free to post your questions or comments below.

Automated Document Classification Video