August 23, 2008

Data transformation in R

Filed under: R project — izabela @ 3:28 pm

I just dug a very old data set out from the bottom of my drawer. The story is long, but in essence what I am trying to do is make the data look normal (normally distributed). Let’s see how that goes in R.
Step 1 - I drew Q-Q plots for all 73 of my variables, only to discover that 5 look decently close to normal. Most of the others look long-tail skewed, or long-tail symmetric at best:

Non-transformed data

For anybody interested, two simple commands in R:

qqnorm(variable1, main = "text you want above plot")
qqline(variable1)
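With 73 variables, typing these two commands by hand gets tedious. Below is a minimal sketch that loops over every numeric column of a data frame and writes one Q-Q plot per page to a PDF. The helper name qq_all, the toy data frame, and the output file are assumptions made for illustration; they are not part of the original analysis.

```r
# Draw a normal Q-Q plot for every numeric column of a data frame,
# one plot per page of a PDF file. `qq_all` is a hypothetical helper.
qq_all <- function(df, file = "qqplots.pdf") {
  numeric_cols <- names(df)[sapply(df, is.numeric)]
  pdf(file)
  on.exit(dev.off())           # close the device even if a plot fails
  for (col in numeric_cols) {
    x <- df[[col]]
    x <- x[!is.na(x)]          # drop missing values explicitly
    qqnorm(x, main = col)      # column name as the plot title
    qqline(x)
  }
  invisible(numeric_cols)      # names of the columns that were plotted
}

# Made-up example data (not the data set from this post)
set.seed(1)
toy <- data.frame(a = rnorm(50), b = rexp(50), id = rep("s", 50))
plotted <- qq_all(toy, file = tempfile(fileext = ".pdf"))
```

Non-numeric columns such as id are skipped, so the same call should work on a mixed data frame.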

Step 2 - the basic thing to do, a log transformation:

log.data <- log(data)

and re-check the Q-Q plots. It seemed that the transformation helped the long-tailed skewed data:

Log transformation

but not the symmetric data. It also seemed to mess up the data that had previously looked normal :(. Anyway, 17 variables still needed to be taken care of.
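Since the log helped only the skewed variables, one option is to transform just a chosen subset of columns and leave the rest alone. Below is a minimal sketch, assuming the data sit in a data frame and the names of the skewed columns are already known. The helper log_selected and the toy data are hypothetical. The helper also refuses non-positive values, for which log() gives -Inf or NaN.

```r
# Log-transform only the named columns of a data frame.
# `log_selected` is an illustrative helper, not an existing R function.
log_selected <- function(df, cols) {
  for (col in cols) {
    x <- df[[col]]
    if (any(x <= 0, na.rm = TRUE)) {
      stop("column '", col, "' has non-positive values; log is undefined there")
    }
    df[[col]] <- log(x)
  }
  df
}

# Made-up example: one skewed column, one that should stay untouched
toy <- data.frame(skewed = c(1, 10, 100, 1000), fine = c(4.8, 5.1, 4.9, 5.2))
toy_log <- log_selected(toy, cols = "skewed")
```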

Step 3 - while I was at it, I tried a square root transformation, with an effect similar to the above. It mostly helped the long-tailed skewed variables, especially the ones with outliers at the long-tailed end. By the way, to do it in R (I know, it’s basic):

sqrt(data)
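Judging dozens of Q-Q plots by eye is slow, so a numeric summary can help rank the variables before and after each transformation. Below is a minimal sketch using the moment-based sample skewness (0 for a symmetric sample, positive for a long right tail). The skewness function here is a hand-written helper, not a base R function, and the sample is made up.

```r
# Moment-based sample skewness: third central moment divided by
# the 1.5th power of the second central moment.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up right-skewed sample: exact log-normal quantiles
x <- exp(qnorm(ppoints(200)))

# Compare the raw data with its square root and its log
skews <- c(raw = skewness(x), sqrt = skewness(sqrt(x)), log = skewness(log(x)))
round(skews, 2)
```

For this constructed sample the square root lowers the skewness and the log removes it; real variables will rarely be that tidy.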

Step 4 - looking through some literature, I found a paper suggesting hyperbolic sine (sinh) or inverse hyperbolic sine (asinh) transformations for symmetric non-normal data. The idea is to apply them to median-adjusted data.
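Below is a minimal sketch of that idea, assuming "median adjusted" means subtracting the median before transforming. Dividing by a scale (here the MAD) is an extra assumption of this sketch, not something taken from the paper, and asinh_median is a hypothetical name. asinh() pulls long tails in, so it suits long-tailed symmetric data, while sinh() stretches the tails instead.

```r
# Inverse hyperbolic sine of median-centered data.
# The rescaling by `scale` is an assumption of this sketch.
asinh_median <- function(x, scale = mad(x, na.rm = TRUE)) {
  if (!is.finite(scale) || scale <= 0) scale <- 1   # fall back to no rescaling
  asinh((x - median(x, na.rm = TRUE)) / scale)
}

# Made-up symmetric, long-tailed sample: t quantiles with 2 degrees of freedom
x <- qt(ppoints(201), df = 2)
y <- asinh_median(x)

# The transformation is monotone, so the ordering of the observations is kept;
# only the spacing in the tails changes.
range(x)
range(y)
```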

Step 5 - I got ready for some power transformations.
boxcox in the MASS library works only with lm and aov models, so I dismissed it quickly.
I Googled the bctrans command in the alr3 package. An error quickly stopped me:
Error in optim(start, neg.kernel.profile.logL, hessian = TRUE, method = "L-BFGS-B", :
L-BFGS-B needs finite values of 'fn'

Looks like maybe two variables are too similar, as suggested here?
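For a single variable, the Box-Cox parameter can also be estimated without any regression model. Below is a minimal base-R sketch, assuming strictly positive data, that maximizes the profile log-likelihood over lambda with optimize(). The helpers boxcox_transform and boxcox_lambda are hand-written for illustration and are not the MASS or alr3 functions; the search interval of -2 to 2 is an arbitrary choice.

```r
# Box-Cox transform of a positive vector for a given lambda.
boxcox_transform <- function(x, lambda) {
  if (abs(lambda) < 1e-6) log(x) else (x^lambda - 1) / lambda
}

# Maximum-likelihood estimate of lambda for one variable.
boxcox_lambda <- function(x, interval = c(-2, 2)) {
  x <- x[!is.na(x)]
  stopifnot(all(x > 0))                  # Box-Cox is defined for positive data only
  n <- length(x)
  sum_log_x <- sum(log(x))
  profile_loglik <- function(lambda) {
    y <- boxcox_transform(x, lambda)
    # normal log-likelihood with the variance profiled out, plus the Jacobian term
    -n / 2 * log(mean((y - mean(y))^2)) + (lambda - 1) * sum_log_x
  }
  optimize(profile_loglik, interval = interval, maximum = TRUE)$maximum
}

x <- exp(qnorm(ppoints(200)))            # made-up log-normal sample
lambda_hat <- boxcox_lambda(x)
x_bc <- boxcox_transform(x, lambda_hat)
```

Checking each variable for zeros or negative values first seems worthwhile, since the transformation is undefined there. boxcox in MASS also accepts a formula, so an intercept-only fit such as boxcox(x ~ 1) is possibly another route for a single variable.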

Before I figured out Box-Cox in R, the project went back into the drawer. Anyway, some thoughts on data transformations from a chemist, not a statistician…
What I find interesting is that in a multivariate data set like mine you have all kinds of distributions: some are closer to normality, some are further from it, and almost every one is one of a kind. By applying one transformation to the whole set, you affect all of them, including the variables that looked fine to begin with.
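One way around this is to choose the transformation separately for each variable. Below is a minimal sketch, assuming the Shapiro-Wilk W statistic is an acceptable yardstick for "closest to normal"; that choice is an assumption of this sketch, not something from the original analysis. The helper best_transform is hypothetical, and the candidates are limited to the log and square root tried above.

```r
# For one variable, return the name of the candidate transformation
# whose result has the highest Shapiro-Wilk W statistic.
best_transform <- function(x) {
  x <- x[!is.na(x)]
  candidates <- list(none = identity)
  if (all(x > 0))  candidates$log  <- log    # log needs positive values
  if (all(x >= 0)) candidates$sqrt <- sqrt   # sqrt needs non-negative values
  # W is closer to 1 for more normal-looking samples
  # (shapiro.test() accepts between 3 and 5000 values)
  w <- sapply(candidates, function(f) shapiro.test(f(x))$statistic[["W"]])
  names(which.max(w))
}

# Made-up example: one log-normal column, one already normal column
z <- qnorm(ppoints(200))
toy <- data.frame(skewed = exp(z), fine = 20 + 5 * z)
sapply(toy, best_transform)
```

A single statistic will not capture everything a Q-Q plot shows, so this is better treated as a first pass than as a replacement for looking at the plots.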
Just to compare how different transformations affected my “good” and “bad” data:

Transformations comparison

And I am not convinced that this is a good thing. I also realize that statisticians have used transformations for years and have probably shown them to be superior to working with untransformed data. But I would love to see some comments from people working with “real life” problems.
