Sunday, February 7, 2010

Data Supporting a Hypothesis

A while back I blogged about the obvious consequence of eliminating pre-existing condition exclusions from health insurance. It seemed obvious to me, at least, but perhaps the proponents of the plan weren't convinced that it necessarily followed. The excerpt below offers a data point in support of my conclusion.

in reference to:

"In a state with 19 million people, 88 New Yorkers between the ages of 18 and 24—88!—have bought WellPoint's best-selling individual insurance product because insurance laws make it perfectly rational not to acquire costly coverage until people need it."
- The Weekend Interview with Angela Braly: 'A Wasted Opportunity'

My, what fun it would be to be able to mine WellPoint's actuarial data!

Random Sampling

If anyone needs random sampling done on all US firms that have operated since 1960, let me know. I spent a few hours yesterday putting together a nice little [R] program that does exactly that. The purpose is just to find a random sample of non-bankrupt company codes of an appropriate size (in this case, matched in size to my bankrupt population), so that you can then submit the list to a WRDS query and get data to test. In the spirit of open source sharing, here's the code:

#First: find all companies that operated in the desired time period using a WRDS query. Then import the data.

NonBankrupt<-AllFirms[AllFirms$costat == "A" ,"gvkey"] #select non-bankrupt firms
SampleSize<-1112 #Change as needed; 1112 is just the size of my bankrupt sample
rSampleNonBankrupt<-sample(NonBankrupt, SampleSize, replace=FALSE)
write.csv(rSampleNonBankrupt, "/THESIS/rSampleNonBankrupt.csv")

I know what you're thinking: "that took a few hours to write?" Well, yes. There were some kinks to iron out with data types and such. Presumably it would have taken much less time if I really knew what I was doing (so: next time it'll take five minutes). There's still a minor issue with the format of the output: WRDS needs a list with one entry per line, not CSV. I'm trying to work out how to do that without bringing it into a spreadsheet.
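For anyone who wants to try the sampling step without a WRDS extract, here is a minimal sketch on made-up data. Everything in the toy AllFirms table is hypothetical. It also makes two assumptions that are not in the code above: that the extract has one row per firm-year (hence the unique() call, so no firm can be drawn twice), and that gvkey is stored as character so leading zeros survive. The seed is arbitrary and only there to make the draw repeatable.

```r
# Hypothetical stand-in for the WRDS extract: one row per firm-year,
# so each gvkey appears three times.
AllFirms <- data.frame(
  gvkey  = rep(sprintf("%06d", 1001:1050), each = 3),
  costat = rep(rep(c("A", "I"), times = c(40, 10)), each = 3),
  stringsAsFactors = FALSE
)

# Keep each firm with costat "A" once; without unique(), a firm with
# several firm-year rows could be drawn more than once.
NonBankrupt <- unique(AllFirms[AllFirms$costat == "A", "gvkey"])

SampleSize <- 25  # assumed size for this toy example
set.seed(2010)    # arbitrary seed so the same sample can be redrawn
rSampleNonBankrupt <- sample(NonBankrupt, SampleSize, replace = FALSE)
```

If the real extract already has one row per firm, unique() changes nothing, so it is harmless to leave in.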

This represents the final stage of my thesis; now all I need to do is bring in the non-bankrupt data and run it through the same tests as I did with the bankrupt data. I had been thinking of doing this for a while, but I was stymied by the size of the data and the memory issues it presents. To my shame, I had forgotten that a random sample of non-bankrupt firms is enough for a statistically valid comparison; I had been thinking that I should run the analysis on the entire population. *forehead smack* Statistics 101. Nevertheless, I could overcome the memory issues by setting up an SQL server and querying it dynamically from [R]... it's just not necessary in this case. I'd still like to develop the capability, 'cause I imagine there are problems where the random sample itself will present memory issues.

Added: Well, that was easy. I just replaced the last line in the code above with this simpler one:
write(rSampleNonBankrupt, ncolumns=1, "/THESIS/rSampleNonBankrupt.txt")
Does the trick without any complaints.
Also, I discovered that the package "ff" handles large datasets. Soooo, no need for SQL servers. I also found a multicore allocation package, which would be extremely awesome if I had more than two cores. Maybe I can use COB's servers?
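One more note on the one-code-per-line output: if the codes are held as character strings, base R's writeLines() is another way to get the same kind of file, and reading it back with readLines() is a quick check that nothing was mangled. This is only a sketch; the three codes are made up, and tempfile() is used so it runs anywhere (swap in the real path).

```r
# Hypothetical sample of company codes, kept as character strings.
rSampleNonBankrupt <- c("001004", "001045", "012141")

# tempfile() is used only so the sketch runs anywhere;
# substitute the real output path.
outFile <- tempfile(fileext = ".txt")
writeLines(rSampleNonBankrupt, outFile)

# Read the file back to confirm one code per line, leading zeros intact.
readBack <- readLines(outFile)
```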