
Sunday, February 7, 2010

Random Sampling

If anyone needs random sampling done on all US firms that have operated since 1960, let me know. I spent a few hours yesterday putting together a nice little [R] program that does exactly that. The purpose is just to draw a random sample of non-bankrupt company codes of an appropriate size (in this case, matched to the size of my bankrupt population), so that you can then submit the list to a WRDS query and pull the data for testing. In the spirit of open source sharing, here's the code:

#First: find all companies that operated in the desired time period using a WRDS query. Then import the data.

AllFirms <- read.csv("/THESIS/AllNonBankrupt.csv")
NonBankrupt <- AllFirms[AllFirms$costat == "A", "gvkey"] #select non-bankrupt (active) firms
NonBankrupt <- levels(as.factor(NonBankrupt)) #unique gvkeys as character strings
SampleSize <- 1112 #change as needed; 1112 is just the size of my bankrupt sample
rSampleNonBankrupt <- sample(NonBankrupt, SampleSize, replace=FALSE)
write.csv(rSampleNonBankrupt, "/THESIS/rSampleNonBankrupt.csv")

I know what you're thinking: "that took a few hours to write?" Well, yes. There were some kinks to iron out with data types and such. Presumably it would have taken much less time if I really knew what I was doing (so: next time it'll take five minutes). There's still a minor issue with the format of the output: WRDS needs a list with one entry per line, not a CSV. I'm trying to work out how to do that without round-tripping through a spreadsheet.
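For the curious, here's a guess at the sort of data-type kink I mean (a made-up illustration, not my thesis data): read.csv imports string columns as factors by default, and factors don't behave like plain character vectors.

x <- factor(c("001004", "001004", "001045")) #gvkey-style codes, read in as a factor
as.integer(x) #returns 1 1 2: the internal level codes, not the keys
levels(as.factor(x)) #returns "001004" "001045": unique keys as character strings, as in the code above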

This represents the final stage of my thesis; now all I need to do is bring in the non-bankrupt data and run it through the same tests as I did with the bankrupt data. I had been thinking of doing this for a while, but I was stymied by the sheer size of the data and the memory issues it presents. To my shame, I had forgotten that a random sample of non-bankrupt firms would produce a statistically valid comparison; I had been thinking that I should run the analysis on the entire population. *forehead smack* Statistics 101. Nevertheless, I could overcome the memory issues by setting up an SQL server and querying it dynamically from [R]... it's just not necessary in this case. I'd still like to develop the capability, 'cause I imagine there are problems where even the random sample itself will present memory issues.
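In case I ever do go that route, here's a minimal sketch of what querying from [R] might look like, using SQLite via the DBI and RSQLite packages as one option. The database file, table, and column names are hypothetical; the point is that the database draws the sample itself, so the full population never has to fit in memory:

library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "/THESIS/firms.db") #hypothetical SQLite database file
#let SQLite draw the random sample via ORDER BY RANDOM():
rSample <- dbGetQuery(con,
  "SELECT gvkey FROM
     (SELECT DISTINCT gvkey FROM firms WHERE costat = 'A')
   ORDER BY RANDOM() LIMIT 1112")
dbDisconnect(con)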

Added: Well, that was easy. I just replaced the last line in the code above with this simpler one:
write(rSampleNonBankrupt, "/THESIS/rSampleNonBankrupt.txt", ncolumns=1) #one gvkey per line
Does the trick without any complaints.
Also, I discovered that the package "ff" handles large datasets. Soooo, no need for SQL servers. I also found a multicore allocation package, which would be extremely awesome if I had more than two cores. Maybe I can use COB's servers?
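A rough sketch of both ideas, assuming the same file as above (I haven't stress-tested either package yet):

library(ff)
#read.csv.ffdf stores the columns in chunked files on disk rather than in RAM:
AllFirms <- read.csv.ffdf(file="/THESIS/AllNonBankrupt.csv")
nrow(AllFirms) #the whole population is addressable without the memory hit

library(multicore)
#mclapply is a parallel drop-in for lapply; mc.cores sets the number of workers
results <- mclapply(1:100, function(i) i^2, mc.cores=2)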
