DM Answers

Data Mining with STATISTICA

Text Mining PubMed XML with STATISTICA

9th of June, 2013


The main idea behind automated document classification is emulating a human expert.  I performed a search on “migraine genetics” in the last blog post.  In Tutorial V of Practical Text Mining by Gary Miner et al., expert classifications are given for the same PubMed search.  Practical Text Mining is a great book and I recommend it.  While I cannot share the file from the disk included with the book, in this post I will show you the results of text mining these expert classifications using STATISTICA Text Miner.

Someone who has not used PubMed might ask why this document classification is necessary.  The fact is that many of the results returned by this particular PubMed search will not show a relationship between genetics and migraines.  An expert must sift through the search results to determine which articles are relevant to their research.  The advantage of a document classification algorithm is that this sifting step can be automated, freeing up the researcher to do more important tasks.

To begin with, the abstracts for all the journal articles returned by the “migraine genetics” search are needed.  I hope the abstract is one of the things you tried importing on your own over the last two weeks.  In case you did not, here is the XPath to get the abstract:
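For readers working outside STATISTICA, here is a minimal Python sketch of the same extraction.  It assumes the standard PubMed XML export layout, in which the abstract sits at /PubmedArticleSet/PubmedArticle/MedlineCitation/Article/Abstract/AbstractText; the sample XML and the function name extract_abstracts are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a PubMed XML export (contents are invented for the example).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>111</PMID>
      <Article>
        <ArticleTitle>First example</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">Migraine runs in families.</AbstractText>
          <AbstractText Label="RESULTS">A candidate gene was found.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>222</PMID>
      <Article>
        <ArticleTitle>No abstract here</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

# Path to the abstract, relative to each PubmedArticle element (assumed layout).
ABSTRACT_PATH = "MedlineCitation/Article/Abstract/AbstractText"


def extract_abstracts(xml_text):
    """Return {PMID: abstract text} for every article that has an abstract."""
    root = ET.fromstring(xml_text)
    abstracts = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        # Structured abstracts have several AbstractText nodes; join them.
        parts = ["".join(node.itertext()).strip()
                 for node in article.findall(ABSTRACT_PATH)]
        if parts:
            abstracts[pmid] = " ".join(parts)
    return abstracts


abstracts = extract_abstracts(SAMPLE_XML)
print(abstracts)
```

Articles without an abstract are skipped rather than stored as empty strings, so the abstract count can be lower than the number of search results.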


The resulting dataset for my search conducted for the last blog post included 1984 abstracts.  We will hold onto this dataset and score it once the document classification algorithm has been created.  If you were an expert in the genetic causes of migraines, you could classify a random subset of the abstracts and then hand these classifications over to a data miner, who would create the classification model.  The idea is to see if there is something that distinguishes the text classified as relevant to genetics from the text that is not.

To create this classification model, the abstract text must be converted to a numeric representation of the data accessible to classification algorithms such as Decision Trees or Neural Networks.  This is where STATISTICA Text Miner comes into play.
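To make "numeric representation" concrete, here is a minimal Python sketch of a term-document count matrix, where each row is a document and each column is a word.  This is only an illustration of the general idea and not STATISTICA Text Miner's actual procedure; the stop word list, the sample sentences, and the function names are my own assumptions.

```python
import re
from collections import Counter

# A deliberately tiny stop word list (an assumption for this example).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text, keep alphabetic words, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def term_document_matrix(documents):
    """Return (vocabulary, rows) where rows[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


docs = [
    "A gene mutation is linked to migraine.",
    "Migraine treatment in the clinic and migraine diaries.",
]
vocabulary, rows = term_document_matrix(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Each row is now a numeric vector that a decision tree or neural network can take as input.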

Text Mining

Here is a breakdown table of the expert classifications given in Practical Text Mining Tutorial V:


There are 76 documents that were expertly classified as being relevant to migraine genetics.  Forty-one of the documents were classified as Non-relevant.  Classification models tend to be more accurate when the categories are equally represented.  This is a rule of thumb which is loosely met with this data set.  If you were expertly classifying your own search results, you would need to pause and do that now.  Assuming we can create a good model with a low misclassification rate, there is potentially a good return on investment.  A model created quickly from 117 cases (76 relevant plus 41 non-relevant) and then used to score 1984 documents could lead to big savings in expert classification time.

Here is a short video demonstrating the conversion of abstract text to numerical data that can be used to create a classification model:


This video demonstrates how to create the classification model using STATISTICA Data Miner:


Creating the text mining results and the classification model took less than 10 minutes.  I am sure it would take longer for a human to determine the expert classifications for these 76 documents.  Think about the time savings when extending the classification algorithm to the entire data set!
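For readers without STATISTICA Data Miner, the train-on-labelled-text, then-classify-new-text idea can be sketched in a few lines of Python.  Note that this uses a small multinomial Naive Bayes classifier with Laplace smoothing, which is a stand-in I chose for brevity and not the model built in the video; the training sentences and labels are invented.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_docs):
    """labeled_docs is a list of (text, label) pairs; returns the fitted counts."""
    doc_counts = Counter()              # number of documents per label
    word_counts = defaultdict(Counter)  # word frequencies per label
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return doc_counts, word_counts, vocabulary


def classify(model, text):
    """Return the label with the highest log posterior for the given text."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are ignored
                # Laplace (add-one) smoothing avoids zero probabilities.
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training examples standing in for expert classifications.
training = [
    ("A gene mutation was linked to familial migraine", "Relevant"),
    ("Genetic variants and heritability of migraine", "Relevant"),
    ("Migraine treatment with medication in the clinic", "Non-relevant"),
    ("Headache diary and treatment outcomes", "Non-relevant"),
]
model = train(training)
print(classify(model, "a novel gene mutation in migraine families"))
print(classify(model, "treatment outcomes in the headache clinic"))
```

With real data you would hold back some of the expert-classified abstracts to estimate the misclassification rate before trusting the model on unlabelled documents.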

I will do a quick follow-up post next week to show how to take the classification model and use it to score the 1984 abstracts.  If you have any questions about how I created the classification model, please feel free to post them.  Good luck with your data mining this week.

  1. Dear Toby,

    I have found much valuable information on your web page, thank you for sharing.
    Could you please advise me how to include phrase analysis in text mining in STATISTICA (if it is possible)? I know the option “Phrases” in the Text mining settings, but it requires manual insertion of such phrases or upload of a phrase list. I wonder if there is any option for automatic recognition of phrases based on the frequency and co-occurrence of two or three words (e.g. “sustainable development”; “genetically modified organisms”, etc.) or if there is any option to use delimiters (such as “,” or “;”) to tell STATISTICA when to analyse several words together as a phrase, for example if I use exported keywords from some database (such as PubMed, WOS, …). This is usually included in some free online SEO tools, but it would definitely be much better to have such a feature included in STATISTICA directly.

    Best regards,


    • Hi Miloslav,

      I agree that automatic recognition of phrases within STATISTICA would be ideal. I have even requested this functionality from StatSoft in the past. So far it hasn’t been added. 🙁 After I import the XML into STATISTICA, I create an Excel file of the output and import it into RapidMiner. I then use the n-grams function in RapidMiner to find all the common phrases. Then I manually input the phrases into STATISTICA.

      If you have R integrated with STATISTICA, another option would be to use the textcat package. That way you could return the phrases directly to STATISTICA. I think these are your best options until this is included in STATISTICA, unless you want to create a STATISTICA Visual Basic macro yourself to generate all the n-grams (do a Google search for an n-gram algorithm). 🙂
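      As a rough illustration of the n-gram idea, here is a minimal Python sketch (not a STATISTICA macro) that counts word n-grams and keeps the ones that recur. It splits on punctuation first, so delimiters such as “,” or “;” act as phrase boundaries. The function name, the min_count threshold, and the sample keyword strings are assumptions for the example.

```python
import re
from collections import Counter


def frequent_phrases(documents, n=2, min_count=2):
    """Return (phrase, count) pairs for word n-grams seen at least min_count times."""
    counts = Counter()
    for doc in documents:
        # Split on punctuation so a phrase never spans a delimiter.
        for segment in re.split(r"[.,;:!?]", doc.lower()):
            words = re.findall(r"[a-z]+", segment)
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return [(p, c) for p, c in counts.most_common() if c >= min_count]


docs = [
    "sustainable development; genetically modified organisms",
    "Policies for sustainable development.",
    "Risks of genetically modified organisms, revisited",
]
print(frequent_phrases(docs, n=2))
print(frequent_phrases(docs, n=3))
```

      The resulting list could then be saved to a text file and loaded as a phrase list in the Text mining settings.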


