First off, I would like to say Happy Mother’s Day to all the mothers out there. I would especially like to honor my wife, my mom, and my mom’s mom. These are great women that I owe a lot to. I would not be the man I am today without their positive influence in my life.
Now on to my latest dmanswers blog post, which uses R for a text mining project. My use case is to take transactions from a checking account and use the R tm package to create a term-document matrix, which will then be used to build a predictive model for the transaction category. I will assume that these transactions are accessible through an ODBC connection to a database. For the purposes of this blog post, I created a comma-delimited text file from a fake data set, which is shown below (Table 1). The data set was imported into R using the following command.
Dataset <- read.table("C:/temp/Book1.csv", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
1 PURCHASED AUTHORIZED GROCERY ON 05/08 STORE AAAAA
2 BILL PAY GAS COMPANY
3 PURCHASED AUTHORIZED GROCERY ON 05/06 STORE AAAAA
4 PURCHASE AUTHORIZED ON 05/01 FAST FOOD CCCC
5 PURCHASE AUTHORIZED ON 05/01 MOVIE THEATRE DDDDDDD
6 CREDIT CARD E-PAYMENT
7 SCHOOL LUNCH
8 BILL PAY MORTGAGE COMPANY EEEEE
9 BILL PAY CAR LOAN 111111
10 PURCHASE AUTHORIZED ON 04/28 GAS STATION FFFFFFFFF
11 PURCHASE AUTHORIZED ON 04/15 ELECTRIC COMPANY
12 PURCHASE AUTHORIZED ON 04/13 PIZZA RESTAURANT
13 PURCHASED AUTHORIZED GROCERY ON 04/12 STORE GGGGG
14 PURCHASE AUTHORIZED ON 04/08 GAS STATION 2222222
15 CAR INSURANCE
16 TRANSFER TO SAVINGS
17 PURCHASE AUTHORIZED ON 05/01 FAST FOOD HHH
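For anyone who wants to follow along without the CSV file, here is a minimal sketch that feeds a few rows of Table 1 to the same read.table call through its text argument. The column name Description is my assumption, since the header of the real file is not shown, and stringsAsFactors=FALSE is added so the text stays as character data.

```r
# Hypothetical stand-in for C:/temp/Book1.csv: a header line plus a few rows of Table 1.
# The column name "Description" is an assumption; the real file's header is not shown.
csv_text <- "Description
BILL PAY GAS COMPANY
PURCHASE AUTHORIZED ON 05/01 FAST FOOD CCCC
CREDIT CARD E-PAYMENT
SCHOOL LUNCH
TRANSFER TO SAVINGS"

# Same arguments as the original command, reading from a string instead of a file.
Dataset <- read.table(text = csv_text, header = TRUE, sep = ",",
                      na.strings = "NA", dec = ".", strip.white = TRUE,
                      stringsAsFactors = FALSE)

str(Dataset)   # structure of the imported data frame
nrow(Dataset)  # number of transactions read
```

Running str() right after the import is a quick way to confirm that each transaction landed in its own row before handing the data frame to tm.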
Typically, individual text files are imported into a corpus for analysis, and most examples I have found on the internet fit into this category. My situation is different: the text will be stored in a data frame after being queried from a database. I was curious whether there were any built-in functions that could take the data frame and convert it into a corpus. The following reference was a good resource to answer my question:
For this use case, I do not need to specify the title, authors, or topics of the documents. I just need each row of the data frame imported as a separate document. After reading through the reference above, I wondered if I could get away without specifying the unneeded metadata. I tried the following command:
corpus <- VCorpus(DataframeSource(Dataset))
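One caveat if you try this with a recent tm release: as far as I know, DataframeSource now expects the first column of the data frame to be named doc_id and the second to be named text, and it rejects a data frame that lacks them. The sketch below shows one way to reshape the data before building the corpus. The Description column and the docs data frame are hypothetical names used only for illustration.

```r
library(tm)

# Stand-in data frame; "Description" is a hypothetical column name.
Dataset <- data.frame(
  Description = c("BILL PAY GAS COMPANY", "SCHOOL LUNCH", "TRANSFER TO SAVINGS"),
  stringsAsFactors = FALSE
)

# Newer tm releases expect the first two columns to be named doc_id and text.
docs <- data.frame(
  doc_id = as.character(seq_len(nrow(Dataset))),  # unique id per row
  text   = Dataset$Description,                   # the transaction text
  stringsAsFactors = FALSE
)

corpus <- VCorpus(DataframeSource(docs))
length(corpus)  # one document per row of the data frame
```

Any further columns in the data frame are kept as document-level metadata, so a transaction category could ride along with the text if you need it later.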
This command successfully imported the contents of each row into a corpus with 19 separate documents. I verified the contents by running the inspect command on the corpus object, and the data frame appears to have been imported correctly. I hope to use the tm package to do some text mining on this corpus in the coming weeks.
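As a preview of the next step, here is a rough sketch of checking a document and building the term-document matrix mentioned at the start of the post. To keep it short and independent of the tm version, it builds the corpus with VectorSource from a character vector rather than with DataframeSource. The transactions vector is a hypothetical stand-in for the real data.

```r
library(tm)

# Hypothetical stand-in for the transaction text column.
transactions <- c("BILL PAY GAS COMPANY",
                  "PURCHASE AUTHORIZED ON 04/28 GAS STATION FFFFFFFFF",
                  "BILL PAY CAR LOAN 111111")

# VectorSource treats each element of the vector as one document.
corpus <- VCorpus(VectorSource(transactions))

inspect(corpus[[1]])        # print the first document with its metadata
as.character(corpus[[1]])   # just the text of the first document

# Rows are terms, columns are documents, cells are term counts.
tdm <- TermDocumentMatrix(corpus)
inspect(tdm)
```

The defaults used by TermDocumentMatrix (such as lowercasing and a minimum word length) can be changed through its control argument, which I expect to tune once I start modeling.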