In the last post I presented my findings on the time needed to load the mortgage data set into memory.  Remember that the task was possible with STATISTICA but failed with Rapid Miner.  I wanted to devise a test using the streaming capability of Rapid Miner and the In Place Database functionality of STATISTICA so I could actually compare the results.  The mortgage data has a State_Code column, and I would like to get an aggregate count of the number of loans by state.

The first step was to install a MySQL database on my computer.  I then uploaded the mortgage .csv file from the last blog post to the database.  I also downloaded and installed the MySQL ODBC connectors and created an ODBC connection for the MySQL database.

With the database and connectors in place, I aggregated the loans by state code using the methods below, with the time each one took:

Table of times to aggregate

Notice that having the .sta file in memory leads to the fastest results.  So if it is possible to load the file into memory, that is the best way to go.  Another takeaway is that the ability to write a query against the In Place Database is a nice feature in STATISTICA.  It let me return only the state code column, which led to significantly better performance than returning all the data.  This is where STATISTICA shows itself to be a much more polished product than Rapid Miner.
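To make the idea of returning only the state code column concrete, here is a minimal sketch in Python.  It is not the query I ran in STATISTICA.  It assumes a table named mortgage with made-up loan_id and loan_amount columns next to State_Code, and it uses an in-memory SQLite database as a stand-in for the MySQL server so it runs anywhere.

```python
import sqlite3
from collections import Counter

# SQLite stands in for the MySQL server; the table name, the extra columns
# and the sample rows are all hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mortgage (loan_id INTEGER, State_Code TEXT, loan_amount REAL)"
)
conn.executemany(
    "INSERT INTO mortgage VALUES (?, ?, ?)",
    [(1, "TX", 150000.0), (2, "CA", 320000.0), (3, "TX", 98000.0), (4, "OK", 75000.0)],
)

# The query names a single column, so only State_Code values are returned.
rows = conn.execute("SELECT State_Code FROM mortgage").fetchall()

# Count the loans per state on the client side.
loans_by_state = Counter(state for (state,) in rows)
print(dict(loans_by_state))

conn.close()
```

Each row that comes back holds one value instead of every column in the table, which is where the savings come from.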

Yes, Rapid Miner can stream a database, which allows big data sets to be analyzed.  However, there is currently no way to limit what the streaming feature returns.  The whole database table must be streamed.  An attribute filter operator can be placed after the stream database operator, but this does not shorten the time for the process to complete.  Having to return the whole table is definitely hurting performance in this case.
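The following sketch illustrates why filtering after the stream does not help.  It is plain Python, not Rapid Miner's operators: an in-memory SQLite database stands in for MySQL, the mortgage table and its loan_id and loan_amount columns are assumptions, and the batch size of 2 is arbitrary.  The point is only that every column of every row is fetched before the unwanted ones are thrown away.

```python
import sqlite3
from collections import Counter

# SQLite stands in for the MySQL server; table layout and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mortgage (loan_id INTEGER, State_Code TEXT, loan_amount REAL)"
)
conn.executemany(
    "INSERT INTO mortgage VALUES (?, ?, ?)",
    [(1, "TX", 150000.0), (2, "CA", 320000.0), (3, "TX", 98000.0), (4, "OK", 75000.0)],
)

# Stream the whole table in small batches, the way a streaming reader would.
cursor = conn.execute("SELECT * FROM mortgage")
state_idx = [col[0] for col in cursor.description].index("State_Code")

loans_by_state = Counter()
values_transferred = 0
while True:
    batch = cursor.fetchmany(2)
    if not batch:
        break
    for row in batch:
        values_transferred += len(row)       # every column has already arrived
        loans_by_state[row[state_idx]] += 1  # the "filter" happens only here

print(dict(loans_by_state), values_transferred)
conn.close()
```

With four rows and three columns, twelve values cross the connection to produce counts that need only four of them.  On a real mortgage table with many more columns, the gap grows accordingly.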

There is some evidence here that the JDBC connector is faster than the ODBC connector.  I am working on getting an OLE DB connector directly to MySQL that could be used by STATISTICA.  If I am successful, I will follow up on this in a later blog post.

If money is an issue, then Rapid Miner can stream a database for a Big Data project.  Just keep in mind that you will need to allow for the longer time it takes to stream a whole table, or do the additional work of creating a view of the desired column within the database.  If time to produce results is a concern, then I think STATISTICA is the clear winner here.
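For the view workaround, a minimal sketch might look like the following.  The view name mortgage_state is something I made up for illustration, as are the loan_id and loan_amount columns, and SQLite again stands in for MySQL.  I have not timed this against Rapid Miner, so treat it as an outline of the idea and not as a measured result.

```python
import sqlite3
from collections import Counter

# SQLite stands in for the MySQL server; table layout and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mortgage (loan_id INTEGER, State_Code TEXT, loan_amount REAL)"
)
conn.executemany(
    "INSERT INTO mortgage VALUES (?, ?, ?)",
    [(1, "TX", 150000.0), (2, "CA", 320000.0), (3, "TX", 98000.0), (4, "OK", 75000.0)],
)

# The view exposes only the column of interest (the view name is an assumption).
conn.execute("CREATE VIEW mortgage_state AS SELECT State_Code FROM mortgage")

# A tool that can only read a whole table can be pointed at the view instead.
cursor = conn.execute("SELECT * FROM mortgage_state")
columns = [col[0] for col in cursor.description]
loans_by_state = Counter(state for (state,) in cursor.fetchall())

print(columns, dict(loans_by_state))
conn.close()
```

Reading the whole view now returns one column per row, so a reader that cannot restrict its own query still pulls only the state codes.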

Look for another blog post in two weeks where I will continue to compare the In Place Database feature from STATISTICA and the Streaming Database operator in Rapid Miner.