Friday, January 22, 2010

"Best Fit" function in [R]?

Is there a function in R that will find the best-fitting distribution among all the different possible ones? The fitdist() function is quite capable of testing the fit of a specific distribution and finding the optimal parameters, but it requires the user to specify which distribution to test. Given how many candidate distributions there are, the naive approach is trial and error: fit each one and compare the fit statistics. Of course, a knowledgeable analyst could look at a plot of the raw data and make an educated guess as to which distribution generated it, but there are certainly cases where this is not practical, e.g., when there is a very large number of vectors that may each have come from a different distribution.

Having a function that tested a handful of the most likely distributions and found the best fit seems like a useful (and fairly obvious) tool. I know that such tools exist outside of R (for example, in the Arena data analyzer and BestFit from Palisade), but as yet I haven't been able to find a previously written R function that does it. The Google results may be confounded by the fact that R^2 is a fit statistic returned by all such fitting functions, so a search for "R" turns up the statistic rather than [R], the language. Forevermore I shall type it in brackets to avoid the misinterpretation.
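To make the idea concrete, here is a minimal sketch of what such a brute-force function might look like. It is only an illustration under some assumptions of mine: it uses fitdistr() from the MASS package as the fitting engine, ranks the candidates by AIC (lower is better), and the name best_fit and the default candidate list are hypothetical, not an existing [R] function.

```r
library(MASS)  # provides fitdistr(); MASS ships with standard R installations

# Hypothetical helper: fit each candidate distribution by maximum likelihood
# and rank the fits by AIC (lower is better).
best_fit <- function(x, candidates = c("normal", "lognormal", "gamma",
                                       "weibull", "exponential")) {
  rows <- lapply(candidates, function(d) {
    # Some fits can fail (e.g. lognormal on non-positive data); record NA then.
    fit <- tryCatch(suppressWarnings(fitdistr(x, d)),
                    error = function(e) NULL)
    if (is.null(fit)) {
      return(data.frame(distribution = d, loglik = NA_real_, aic = NA_real_,
                        stringsAsFactors = FALSE))
    }
    k <- length(fit$estimate)  # number of estimated parameters
    data.frame(distribution = d,
               loglik = fit$loglik,
               aic = -2 * fit$loglik + 2 * k,
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, rows)
  out[order(out$aic), ]  # failed fits (NA) sort to the bottom
}

# Example: rank the candidates for a simulated gamma sample
set.seed(1)
x <- rgamma(500, shape = 2, rate = 1)
print(best_fit(x))
```

The first row of the result is the candidate with the lowest AIC. With closely related families (gamma, Weibull, lognormal) the AIC values can be nearly tied, so the ranking is a guide rather than a verdict.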

One approach I was thinking of taking (aside from the brute-force approach, which obviously becomes expensive with large data sets) is to compute a normal Q-Q plot, fit a polynomial to the residuals from the straight reference line, and use the information about concavity and inflection points to heuristically guide the selection of the distribution to fit. For example: residuals with a 2nd-degree polynomial fit that is concave up are likely to come from a distribution that is skewed right (at least according to my old SAS book). Such an observation could point us to the gamma distribution right away, and we wouldn't need to test the rest.
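A rough sketch of that heuristic in base [R] follows. The function name qq_skew_hint is hypothetical, and treating the sign of the squared term as the whole story is my simplifying assumption; it only covers the concavity part of the idea, not inflection points.

```r
# Hypothetical helper: guess the direction of skew from the curvature of
# the normal Q-Q plot.
qq_skew_hint <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)   # theoretical (qq$x) vs. sample (qq$y) quantiles
  theo <- qq$x
  samp <- qq$y
  res <- resid(lm(samp ~ theo))      # departures from the straight Q-Q line
  quad <- lm(res ~ theo + I(theo^2)) # 2nd-degree polynomial fit to the residuals
  curvature <- coef(quad)[["I(theo^2)"]]
  list(curvature = curvature,
       hint = if (curvature > 0) "right-skewed" else "left-skewed")
}

# Example: an exponential sample has a long right tail
set.seed(1)
print(qq_skew_hint(rexp(500)))
```

A usable version would need a threshold below which the curvature is treated as "roughly symmetric", and it still leaves the choice among the right-skewed families (gamma, lognormal, Weibull) open.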

Anyway, I'll search a bit more, and if no such function exists, I'll write it. I must admit, I'm excited by the prospect of contributing to the [R] community. I've been very impressed with how effective and cohesive this far-flung group of obscure programmers in an obscure language seems to be, and I'd like to be a part of that.

Likewise, I'm well on my way to becoming part of the LaTeX community. I've transcribed two-and-a-half (of twelve) volumes of Professor Levy's handwritten notes. I'm getting much faster, but also running into new challenges. I will be an expert ere long.
