First off, I would like to say Happy Mother’s Day to all the mothers out there.  I would especially like to honor my wife, my mom, and my mom’s mom.  These are great women that I owe a lot to.  I would not be the man I am today without their positive influence in my life.

Now on to my latest dmanswers blog post, which uses R for a text mining project.  I have a use case where I need to take some transactions from a checking account and use the R tm package to create a term-document matrix.  The term-document matrix will then be used to build a predictive model for the transaction category.  I will assume that these transactions will eventually be accessible through an ODBC connection to a database.  For the purposes of this blog post, I created a comma-delimited text file from a fake data set, which is shown below (Table 1).  This data set was imported into R using the following command.

Dataset <- read.table("C:/temp/Book1.csv", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)

Table 1
Transactions
1 PURCHASED AUTHORIZED GROCERY ON 05/08 STORE AAAAA
2 BILL PAY GAS COMPANY
3 PURCHASED AUTHORIZED GROCERY ON 05/06 STORE AAAAA
4 PURCHASE AUTHORIZED ON 05/01 FAST FOOD CCCC
5 PURCHASE AUTHORIZED ON 05/01 MOVIE THEATRE DDDDDDD
6 CREDIT CARD E-PAYMENT
7 SCHOOL LUNCH
8 BILL PAY MORTGAGE COMPANY EEEEE
9 BILL PAY CAR LOAN 111111
10 PURCHASE AUTHORIZED ON 04/28 GAS STATION FFFFFFFFF
11 PURCHASE AUTHORIZED ON 04/15 ELECTRIC COMPANY
12 PURCHASE AUTHORIZED ON 04/13 PIZZA RESTAURANT
13 PURCHASED AUTHORIZED GROCERY ON 04/12 STORE GGGGG
14 PURCHASE AUTHORIZED ON 04/08 GAS STATION 2222222
15 CAR INSURANCE
16 TRANSFER TO SAVINGS
17 PURCHASE AUTHORIZED ON 05/01 FAST FOOD HHH
18 DENTIST
19 CHECK

Typically, individual text files are imported into a corpus for analysis.  Most examples I have looked at on the internet fit into this category.  My situation is different: the text will be stored in a data frame after being queried from a database.  I was curious whether there were any built-in functions that could take the data frame and convert it into a corpus.  The following reference was a good resource for answering my question:

http://cran.r-project.org/web/packages/tm/vignettes/extensions.pdf

For the use case, I do not need to specify the title, authors, or topics of the documents.  I just need each row of the data frame imported as a separate document.  After reading through the reference above, I wondered if I could get away without specifying the metadata that my use case does not need.  I tried the following command:

corpus <- VCorpus(DataframeSource(Dataset))

This command successfully imported the contents of each row into a corpus with 19 separate documents.  I also verified the contents of the corpus by calling inspect on the corpus object.  Everything seems to have worked correctly when importing the data frame into the corpus.  I hope to use the tm package to do some text mining on this corpus in the coming weeks.
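
A note for anyone trying this on a more recent tm release: as far as I can tell, versions 0.7 and later expect the data frame passed to DataframeSource to have a doc_id column followed by a text column, so the one-column data frame above may be rejected as is. The sketch below is one way to work around that and then build the term-document matrix mentioned at the start of the post. It is only an illustration under stated assumptions: the small Dataset is a stand-in made from the first four rows of Table 1, the column name Transactions is taken from the table header, and the helper data frame docs is a name made up for this example.

```r
library(tm)

# Stand-in for the imported data frame: the first four rows of Table 1.
Dataset <- data.frame(
  Transactions = c(
    "PURCHASED AUTHORIZED GROCERY ON 05/08 STORE AAAAA",
    "BILL PAY GAS COMPANY",
    "PURCHASED AUTHORIZED GROCERY ON 05/06 STORE AAAAA",
    "PURCHASE AUTHORIZED ON 05/01 FAST FOOD CCCC"
  ),
  stringsAsFactors = FALSE
)

# Assumption: newer tm releases want "doc_id" first and "text" second.
# "docs" is a hypothetical helper name used only in this sketch.
docs <- data.frame(
  doc_id = as.character(seq_len(nrow(Dataset))),
  text   = Dataset$Transactions,
  stringsAsFactors = FALSE
)

# Each row of the data frame becomes one document in the corpus.
corpus <- VCorpus(DataframeSource(docs))
length(corpus)

# Terms are the rows and documents are the columns of the matrix.
tdm <- TermDocumentMatrix(corpus)
inspect(tdm)
```

With the full data set, the same steps should give one document per transaction. Any cleanup such as removing the dates or the masked store names would be a separate step and is not shown here.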