Data transformation in R
I just dug a very old data set out of the bottom of my drawer. The story is long, but in essence I am trying to make the data look normal (normally distributed). Let's see how that goes in R.
Step 1- I drew Q-Q plots for all 73 of my variables, only to discover that just 5 look decently close to normal. Most of the rest look long-tailed and skewed, or long-tailed and symmetric at best:
For anybody interested, two simple commands in R:
qqnorm(variable1, main='text you want above plot')
qqline(variable1)
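With dozens of variables it is quicker to loop over the columns than to type the two commands for each one. A minimal sketch, assuming the variables sit in a data frame of numeric columns; the data frame below is simulated and its column names are made up for illustration:

```r
# Hypothetical stand-in for the real data: a data frame of numeric columns
set.seed(1)
data <- data.frame(
  variable1 = rnorm(100),       # roughly normal
  variable2 = rlnorm(100),      # long right tail
  variable3 = rt(100, df = 3)   # symmetric, heavy tails
)

# One Q-Q plot per column, arranged side by side
old_par <- par(mfrow = c(1, ncol(data)))
for (name in names(data)) {
  qqnorm(data[[name]], main = name)
  qqline(data[[name]])
}
par(old_par)  # restore the previous plot layout
```

For 73 variables the grid would need to be split over several pages, for example by plotting 12 columns at a time.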
Step 2- the basic thing to do, a log transformation:
log.data <- log(data)
and re-check the Q-Q plots. It seemed that the transformation helped the long-tailed skewed data:
but not the symmetric ones. It also seemed to mess up data that previously looked normal :(. Anyway, 17 variables still needed to be taken care of.
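Since the log is only defined for strictly positive values, and can hurt variables that already looked fine, one option is to transform only some of the columns. A minimal sketch on simulated data (the column names are hypothetical); here the selection rule is simply "all values positive", but the logical vector could just as well be filled in by hand after looking at the Q-Q plots:

```r
set.seed(2)
data <- data.frame(
  skewed1 = rlnorm(50),
  skewed2 = rexp(50),
  normal1 = rnorm(50)   # contains negative values, log() would give NaN
)

# Columns where every value is strictly positive, so log() is defined
positive <- vapply(data, function(x) all(x > 0, na.rm = TRUE), logical(1))

# Transform only those columns and leave the rest untouched
log.data <- data
log.data[positive] <- lapply(data[positive], log)

names(data)[!positive]  # columns that were skipped
```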
Step 3- while I was at it, I tried a square root transformation, with an effect similar to the one above. It mostly helped the long-tailed skewed variables, especially the ones with outliers in the long tail. By the way, to do it in R (I know, it's basic):
sqrt(data)
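Eyeballing 73 Q-Q plots after every transformation gets tiring, so a single number per variable can help with a first pass. A rough sketch using the sample skewness (third standardized moment), written by hand to avoid extra packages; the `skewness` helper and the simulated gamma variable are illustrative assumptions, not part of the original analysis:

```r
# Sample skewness: about 0 for symmetric data, positive for a long right tail
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(3)
x <- rgamma(200, shape = 2)  # simulated right-skewed, strictly positive variable

# Compare the raw variable with its square root and log versions
c(raw = skewness(x), sqrt = skewness(sqrt(x)), log = skewness(log(x)))
```

The square root is a milder correction than the log, so it tends to suit moderately skewed variables, while the log can overshoot and produce a left tail.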
Step 4 - looking through some literature, I found a paper suggesting hyperbolic sine (sinh) or inverse hyperbolic sine (asinh) transformations for symmetric non-normal data. The idea is to apply them to median-adjusted data.
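A minimal sketch of what this could look like for a long-tailed symmetric variable, assuming "median adjusted" means subtracting the median. Dividing by the MAD before applying asinh is an extra assumption of this sketch, not something taken from the paper, and `asinh_centered` and `kurtosis` are made-up helper names:

```r
# asinh on median-centred data; the scale argument is an assumption of this sketch
asinh_centered <- function(x, scale = mad(x, na.rm = TRUE)) {
  asinh((x - median(x, na.rm = TRUE)) / scale)
}

# Sample kurtosis: about 3 for normal data, larger for heavy tails
kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2
}

set.seed(4)
x <- rt(501, df = 2)     # simulated symmetric variable with heavy tails
y <- asinh_centered(x)

c(raw = kurtosis(x), asinh = kurtosis(y))

qqnorm(y, main = "asinh of median-centred data")
qqline(y)
```

asinh pulls in long tails; sinh does the opposite and stretches them, so it would be the candidate for symmetric data with tails that are too short.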
Step 5- I got ready for some power transformations.
boxcox in the MASS library works only with lm and aov models, so I dismissed it quickly.
I Googled the bctrans command in the alr3 package. An error quickly stopped me:
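Even with that restriction, a single variable can be handed to boxcox by wrapping it in an intercept-only model. A hedged sketch on simulated data, assuming the variable is strictly positive (Box-Cox requires that); the lambda grid and the variable `x` are arbitrary choices for illustration:

```r
library(MASS)

set.seed(5)
x <- rlnorm(200)  # simulated strictly positive, right-skewed variable

# Intercept-only model: there are no predictors, only the variable itself
fit <- lm(x ~ 1, y = TRUE)

# Profile log-likelihood over a grid of lambda values, without plotting
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Apply the Box-Cox transformation; lambda = 0 corresponds to the log
x_bc <- if (abs(lambda_hat) < 1e-8) log(x) else (x^lambda_hat - 1) / lambda_hat

lambda_hat
```

This would have to be repeated per variable, which fits the observation that each variable needs its own treatment.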
Error in optim(start, neg.kernel.profile.logL, hessian = TRUE, method = "L-BFGS-B", :
L-BFGS-B needs finite values of ‘fn’
Looks like maybe two variables are too similar, as suggested here?
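Whether that is really the cause here is unknown, but the suspicion is easy to check before calling the optimizer again. A sketch on simulated data that lists pairs of nearly identical variables and, since Box-Cox power transformations need strictly positive values, also flags columns with zeros or negatives. The 0.99 cutoff and the column names are arbitrary assumptions:

```r
set.seed(6)
a <- rlnorm(100)
data <- data.frame(
  a = a,
  b = a * 1.001 + rnorm(100, sd = 1e-4),  # almost a copy of a
  c = rlnorm(100)
)

# Pairwise correlations, keeping each pair once (lower triangle only)
r <- cor(data, use = "pairwise.complete.obs")
r[upper.tri(r, diag = TRUE)] <- NA

# Pairs with absolute correlation above the cutoff
idx <- which(abs(r) > 0.99, arr.ind = TRUE)
similar <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)
similar

# Columns with non-positive values, which a power transformation cannot handle
non_positive <- names(data)[vapply(data, function(x) any(x <= 0, na.rm = TRUE), logical(1))]
non_positive
```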
Before I figured out Box-Cox in R, the project went back into the drawer. Anyway, some thoughts on data transformations from a chemist, not a statistician...
What I find interesting is that in a multivariate data set like mine you have all kinds of distributions: some are closer to normality and some are further from it, and almost each one is one of a kind. By applying one transformation to all of them, you affect them all.
Just to compare how different transformations affected my “good” and “bad” data:
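A comparison of this kind can also be put into numbers. A minimal sketch on simulated data, with one "good" (roughly normal) and one "bad" (right-skewed) variable, using the Shapiro-Wilk W statistic, where values closer to 1 mean closer to normal. The variables and the `w_stat` helper are made up for illustration and are not the original data:

```r
set.seed(7)
data <- data.frame(
  good = rnorm(200, mean = 10, sd = 1),  # positive and roughly normal
  bad  = rlnorm(200)                     # long right tail
)

# Shapiro-Wilk W statistic as a plain number
w_stat <- function(x) unname(shapiro.test(x)$statistic)

# Rows: transformation, columns: variable
comparison <- sapply(data, function(x) {
  c(raw = w_stat(x), log = w_stat(log(x)), sqrt = w_stat(sqrt(x)))
})
round(comparison, 3)
```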
And I am not convinced that this is a good thing. I also realize that statisticians have used it for years and have probably shown it to be superior to working with untransformed data. But I would love to see some comments from people working with "real life" problems.