The main idea behind automated document classification is to emulate a human expert. In the last blog post, I performed a PubMed search on “migraine genetics”. In Tutorial V of Practical Text Mining by Gary Miner et al., expert classifications are given for the results of that same search. Practical Text Mining is a great book, and I recommend it. While I cannot share the file from the disk included with the book, in this post I will show you the results of text mining these expert classifications using STATISTICA Text Miner.
Someone who has not used PubMed might ask why this document classification is necessary. In fact, many of the results returned by this particular search do not show a relationship between genetics and migraines, so an expert must sift through the results to determine which articles are relevant to their research. The advantage of a document classification algorithm is that this sifting step can be automated, freeing up the researcher for more important tasks.
To begin, we need the abstracts for all the journal articles returned by the “migraine genetics” search. I hope the abstract is one of the fields you tried importing on your own over the last two weeks. In case you did not, here is the XPath to get the abstract:
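As a sketch of what that XPath does in practice, here is how the abstract can be pulled out of PubMed's efetch XML with Python's standard library. The tiny XML snippet below is a hand-made stand-in, not a real PubMed response, but it follows the standard layout where the abstract sits under MedlineCitation/Article/Abstract/AbstractText:

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for a PubMed efetch XML result (real responses contain
# many more elements, such as authors, journal info, and MeSH headings).
sample = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <ArticleTitle>Example migraine genetics study</ArticleTitle>
        <Abstract>
          <AbstractText>Familial hemiplegic migraine is linked to ion channel genes.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

root = ET.fromstring(sample)
# ElementTree supports a subset of XPath; a relative path to every
# AbstractText element collects one abstract string per article.
abstracts = [node.text for node in root.findall(".//Abstract/AbstractText")]
print(abstracts[0])
```

Running the same `findall` against a full efetch download would yield one row of abstract text per returned article, ready to import into the text miner.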
The resulting dataset for the search conducted for the last blog post included 1984 abstracts. We will hold onto this dataset and score it once the document classification model has been created. If you were an expert in the genetic causes of migraines, you could classify a random subset of the abstracts and then hand the classifications over to a data miner, who would create the classification model. The idea is to see whether something distinguishes the text classified as relevant to genetics from the text that is not.
To create this classification model, the abstract text must be converted to a numeric representation that classification algorithms such as decision trees or neural networks can work with. This is where STATISTICA Text Miner comes into play.
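To make the idea concrete, here is a bare-bones sketch of that conversion in plain Python. It is not STATISTICA's implementation, just the common tf-idf approach: tokenize each abstract, count term frequencies, and down-weight terms that appear in most documents so that distinctive words carry more signal:

```python
import math
from collections import Counter

def term_document_matrix(docs):
    """Build a tf-idf weighted term-document matrix from raw text strings.

    Each row corresponds to one document; each column to one vocabulary
    term. Terms that occur in every document get a weight of zero.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for doc in tokenized for term in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized for term in set(doc))
    matrix = []
    for doc in tokenized:
        tf = Counter(doc)
        # tf-idf weight: raw count times log(N / document frequency).
        matrix.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, matrix

docs = ["migraine gene variant found", "migraine triggers include stress"]
vocab, matrix = term_document_matrix(docs)
```

In this toy example, "migraine" occurs in both documents, so its column is all zeros, while "gene" is weighted positively in the first row only. The numeric matrix is what gets handed to the classification algorithm.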
Here is a breakdown table of the expert classifications given in Practical Text Mining Tutorial V:

    Category        Count
    Relevant        76
    Non-relevant    41
Seventy-six of the documents were expertly classified as Relevant to migraine genetics, and 41 were classified as Non-relevant. Classification models tend to be more accurate when the categories are equally represented; this rule of thumb is loosely met with this dataset. If you were classifying your own search results, you would need to pause and do that now. Assuming we can create a good model with a low misclassification rate, there is potentially a good return on investment: a model built quickly from 107 classified cases and then used to score 1984 documents could save a great deal of expert classification time.
Here is a short video demonstrating the conversion of abstract text to numerical data that can be used to create a classification model:
This video demonstrates how to create the classification model using STATISTICA Data Miner:
Creating the text mining results and the classification model took less than 10 minutes. I am sure it would take a human longer than that to expertly classify these 76 documents. Think of the time savings when the classification model is extended to the entire dataset!
Next week I will do a quick follow-up post showing how to use the classification model to score the 1984 abstracts. If you have any questions about how I created the model, please feel free to post them. Good luck with your data mining this week.