August 23, 2008

Data transformation in R

Filed under: R project — izabela @ 3:28 pm

I just dug out from the bottom of my drawer a very old set of data. The story is long, but in an essence what I am trying to do is to make the data look normal (normally distributed). Let’s see how it would progress in R.
Step 1- I drew Q-Q plots for all my 73 variables only to discover, that 5 look decently close to normal. Most of them look long-tail skewed, long-tail symmetric at best:

Non-transformed data

For anybody interested, two simple commands in R:

qqnorm(variable1, main=’text you want above plot’)
qqline(variable1)

Step 2- basic thing to do, log transformation:

log.data <- log(data)

and re-check the Q-Q plots. Seemed that the transformation helped the long-tailed skewed data:

Log transformation

but not the symmetric. It also seemed to mess up previously looking normal data :(. Anyway, 17 variables still needed to be taken care of.

Step 3- when I was at it, I tried square root transformation, with effect similar as above. It mostly helped the long-tailed skewed, especially the ones with outliers on this end. By the way, to do it in R (I know,it’s basic):

sqrt(data)

Step 4 - looking through some literature, I found a paper with idea of hyperbolic sinus (sinh) or inverse hyperbolic sinus (asinh) transformations for symmetric non-normal data. The idea is to do it on median adjusted data.

Step 5- I got ready for some power transformations.
Boxcox in MASS library works only with lm and aov models, so I dismissed it quickly.
I Googled bctrans command in alr3 package. Error quickly stopped me:
Error in optim(start, neg.kernel.profile.logL, hessian = TRUE, method = “L-BFGS-B”, :
L-BFGS-B needs finite values of ‘fn’

Looks like maybe two variables are too similar, as suggested here?

Before I figured out Box-Cox in R, the project went back into the drawer. Anyway, some thoughts on data transformations by a chemist, not statistician….
What I find interesting is that in case of multivariate data set, like mine, you have all kind of distributions, and some are closer and some are further from normality, almost each one is of its kind. By trying to apply one transformation to all of them, you also affect them all.
Just to compare how different transformations affected my “good” and “bad” data:

Tranformations comparison

And I am not convinced that this is a good thing. I also realize that statisticians used it for years and probably proved it superior to not transformed data. But I would love to see some comments from people working with “real life” problems.

Technorati Tags: , , , , ,

July 3, 2008

Terrible NA’s

Filed under: R project — izabela @ 4:32 pm

In R project, most of stuff is simple unless you have NA values (missing data). I struggled with them already in PCA. This time, I am not talking about any complicated programming, models or what not. No. All I want to do in this moment is to calculate a median!
I discovered on more than one page on the Net that summary functions like mean(data), median(data), var(data) or range(data) when data set contains NA, require na.rm=T to work:

median(data, na.rm=T)

Surprisingly, summary(data) will work without NA argument. Who would guess.
But, to make my life more miserable, mean(data, na.rm=T) on my data set works, while median(data, na.rm=T) produces only an error :( :

Error in median.default(set) : need numeric data

(Problem unsolved so far. E-mailed my R-guru, but he is on vacation till next week)

After few e-mail exchange with my R guru, there are following conclusions:

median(as.matrix(data), na.rm=T)

gives you median for whole data set, not variable by variable. Useless.

median(data[,3],na.rm=T)

gives you median for variable in column 3. If you have 120 variables, I guess you need to do it one by one. Not more helpful.
Bottom line? No solution at this time…

Technorati Tags: , , , , , ,

R for Dummies

Filed under: R project — izabela @ 3:57 pm

Just a short note on what I just discovered on the Internet, looking for some R stuff. It turns out, there is something like R-Commander. In simple words- R with pull-down menus and basic features, something like R for Dummies :). And by basic features I mean data summaries and graphs, t-test, ANOVA, clusters, PCA, distributions and some more!
I am installing it on my computer right now. All you need to do is to install package Rcmdr, and load it. If you are missing any packages (quite many, in my case), you get a message and R takes care of it by itself.
R-Commander itself seems to be rather intuitive and easy to use.

Technorati Tags: R-Commander,

July 2, 2008

Hotelling T2 test

Filed under: R project — izabela @ 12:06 pm

This time an easy assignment for R project. I needed Hotelling T-test, to compare multivariate means between two data sets. It turns out, all you need to do is to load the package ICSNP, and use function HotellingsT2. OK, with R there always needs to be a trick, and you always have to learn something else meantime. And here, you need to have you data divided into two sets, by group, with no group labels. So, if you have everything in one data frame, function subset comes in handy. Nicely described in Verzani book, the function allows you to restrict rows using logical phrase in subset= or columns, using numbers in select= as in example below:

X <- subset(data, subset = group == 'group1', select = 2:10)
Y <- subset(data, subset = group == 'group2', select = 2:10)

and then it is simple as that:

HotellingsT2 ( X, Y, test = ‘f’ )

The other option for test is ‘chi’, if your data is not-so-perfectly-normal (you can check package manual for more details).

Short note- this needs to be told what to do with outliers. na.action=na.omit or similar solves the problem.

Technorati Tags: , , , ,

November 29, 2007

Heatmap in R

Filed under: R project — izabela @ 4:45 pm

When I decided to switch from SAS to R project, it was because somebody told me that multivariate analysis and plots like heatmap are much easier to do in R. I guess somebody should have told me what I learned much later, that R has a steep learning curve. I guess my is already flattening a bit, but I just recently spend two weeks figuring out how to draw heatmap. Maybe because all the needed information were not available in one spot, and -’x’ must contain finite values only- and -’x’ must be a numeric matrix - in Google search results in brilliant responses of those who master R in this style - if your R tells you the data set is wrong, it is, so look for idiotic entries. Aha. Or read your manual, this one is even more irritating.
Anyway, let me write how to avoid some traps and looking for help from professionals, who usually make you just feel more stupid then you are, unfortunately :(.

You may want to do scaling using scale function, as described with PCA, or not. Up to you. The first trick then is to get rid of your sample/group names, or whatever text you have in first column:

data.names <- data[,1]
data.set <- data[,-1]

Next, make sure your data is in matrix rather than data frame:

data.map <- as.matrix(data.set)

The heatmap, opposite to PCA, takes missing data without any problem, just leaves you with blank spots. You just have to be careful, if you color scheme contains white color as well.

Now all you do is:

heatmap(data.map)

Well, maybe you don’t like colors too much. You can personalize your heatmap a bit:

heatmap(data.map, col=topo.colors, labRow=data.names, xlab=’variables’, ylab=’samples’)

However, the colors schemes are limited to:

cm.colors
topo.colors
terrain.colors
rainbow
heat.colors

Also, I am sure you could figure out how to add color legend, but what for, if the appropriate function is there for you. Just download gplots library:

heatmap.2(data.map, col=greenred(256), density.info=’none’, trace=’none’, cexRow=0.5, labRow=data.names)

You have some additional color schemes, and you can read more about it here.

Technorati Tags: ,

November 28, 2007

Principal Component Analysis in R

Filed under: R project — izabela @ 7:40 pm

The one supposedly easy thing to do in R project was biplot. Well, indeed it is, when you have your Principal Component Analysis (PCA) calculated right. Which may not necessarily be easy.

Let’s start from easier function princomp. The one problem I found with it was dealing with missing data. As all PCA procedures will disregard samples where even one number is missing. You may as well remove them before the analysis, if you don’t feel like re-calculating/replacing/dealing with them some other way:

data.na <- na.omit(data)

I also found a problem with column containing sample names, groups etc. so remove them as well by putting them into another “file”:

data.names <- data.na[,1]
data.pc <- data.na[,-1]

and now you are ready to roll:

pc.data <- princomp(data.pc, cor=T, scores=T)

To see how the eigenvalues looks like (scree plot):

plot(pc.data)
abline(h=1)

and you can make a decision how many PCs to retain based on how many PCs have eigenvalues more than 1, which may be better visible from:

summary(pc.data)

This also gives you information about variance explained by each PC and cumulative variance. Nice. Now, biplot:

biplot(pc.data, xlabs=data.names)

To get loadings and scores, you just type:

pc.data$loadings
pc.data$scores

So if you need fancier plots:

plot(pc.data$scores[,2]~pc.data$scores[,1], xlab=’PC1 (% of total variance)’, ylab=’PC2 (% of total variance), type=’n')
text(pc.data$scores[1], pc.data$scores[2], labels=data.names)

or even symbol or color coding plots:

libarary(lattice)
xyplot(pc.data$scores[,2]~pc.data$scores[,1],
xlab=’PC1 (% of total variance)’, ylab=’PC2 (% of total variance),
col=’black’,
pch=c(1,2,3 etc.), cex=1.2,

key=list(points=list(pch=c(1,2,3 etc.)),
text=list(c(’name1′, ‘name 2′, ‘name 3′ etc.))))

I find it much easier than in SAS….

Now, in case your matrix is “wider than longer” meaning more variables than samples, the princomp will tell you to get lost, by some nice error message. Of course, beauty of R, there is a solution. Function prcomp is there for you, as explained here doing Q-mode PCA other then R-mode PCA, for those who know what it means. You just need to figure out how to deal with it. First, how to put together the code:
pc.data <- prcomp(data.set, cor=T)


I find it most appealing. I tried scaling in the function, and I am not saying there is anything wrong with it. But you may as well do it before, by scale function:

data.scaled <- scale(data, scale=T, center=T)

Going back to our PCA, now the trick is to extract information we want- meaning eigenvalues, scores and loadings.

summary(pc.data)

gives us desired proportion of variance and cumulative proportion, but…. well, just look at numbers in first line, and you know something is wrong. You are right, these obviously are not eigenvalues, and as you can read in R help for the function this is because of calculations are done bu singular value decomposition. I will use elbow on scree plot to decide how many PCs to retain, although I am sure the professionals know better how to work around it…
For that, I suggest rather than using plot to use

plot(pc.data$sdev)

and make the decision visually , but this requires already knowing how the output looks like. Which gets me to the point that, of course, there is no pc.data$loadings and pc.data$scores in prcomp. However, let me start from the fact, that biplot works as it should:

biplot(pc.data, xlabs=data.names)

Just in case you really have extreme number of variables which is making you plot impossible to look at, I found this trick working beautifully:

biplot(pc.data, xlabs=data.names, col=c(’black’, ‘none’)

for those interested, if you use white as second color, it will affect the labels on sample names directly below, setting the color to none is really superior. Don’t worry about error message thou, as you done it on purpose and got the results you wanted.
This takes care of scores plotting. If you want to plot loadings on PC1 and PC2, or another pair, the simples solution would just be to set first color to none and second to black, respectively. However, if you want to plot loadings just on one PC, well, it requires this knowledge, which was not necessarily easy to find on the net, but is for example explained here:

pc.data$scores = pc.data$x
pc.data$loadings=pc.data$rotation

Thus, your code may look like that:

plot(pc.data$rotation[1:number of variables]~variables.scale, type=’l', xlab=’x scale’, ylab=’loadings on PC1′)

Now, plotting loadings on PC2, you need your first number in range be number of samples+1, second number be number of samples times 2, if you get the idea….

NOTE - there is a pca function in labdsv library, if you want to check it here. But.. The great advantage is that you can limit number of extracted PCs and the scores and loadings are in output by these names.But… it has a limiting plotting capabilities for general applications.

Technorati Tags: , , , , , , , , , ,

August 28, 2007

Power calculations in R

Filed under: R project, University — izabela @ 11:01 am

My first huge disappointment with R. My boss wanted me to do simple power calculations to see how adding 2 animals to one of groups will improve statistical power. Yeah, I know, you are not supposed to do “retrospective power calculations”. Tell it to people providing lab animals, they always give you couple more just in case something happened to one or two you have. So, sometimes we have to do just that. Well, not in R.
There is this nice and simple power.t.test function, with
n= for number of subjects in a group
delta = for difference between groups
sd= for standard deviation
sig.level= for desired alpha
power= for power, obviously
All you need to do is to define all but one and the remaining is calculated. Great, but how about if your groups are of unequal sizes? There is always Google? Not this time.
There is nice package pwr, if you happen to work on proportions and can use beauty of binomial distribution. That’s it.
Solution? I opened SAS and did what I needed in 1 min.
Proc power works like power.t.test, but you can do calculations with different sd’s in groups, different group sizes and different experimental design. Wonder, when R catches up…

Technorati Tags: , , ,

August 22, 2007

Dunnett all-to-control test in R

Filed under: R project, University — izabela @ 1:24 pm

OK, it is going to be a first of series of blog entries on my struggle with R project. I consider myself an amateur in statistics and in R, so they are going to be without any theoretical background and discussions of the math underneath the stuff I am going to write about, so any comments are welcome.

Anyway, my task was to run Dunnett test on set of data from some animal study. If it was my experiment, I would run one-way ANOVA and maybe happily discover some other relationships significant and were unable to make sense out of it in the paper. This time, not my project, and I was told Dunnett is all they need.

OK, thanks God for Google, I found nice article in R News from December 2002, where the stuff was quite well explained, read my data into R, one line of code:
simint (variable1~group, conf.level=0.95, alternative=’two.sided’

got my results and….disturbing:

Warning message:
’simint.default’ is deprecated.
Use ‘glht’ instead.
See help(”Deprecated”) and help(”multcomp-deprecated”).

Great! In plain English, I just wasted my time discovering something that maybe next time I need to run same test won’t be there any more.

OK, count down from 150, and start looking for glht.

Of course, something I just minutes ago did in one line, now requires no less then three:

model <- aov(variable1~group)
results <- glht(model, linefc=mcp(group='Dunnett'), alternative='t' )
summary(results)

Maybe it is better though, as I got all significance levels at once.

Technorati Tags: R project, ,

Powered by WordPress.
Theme by Ron and Andrea. Background image from Gimp Patterns. Theme images created using The GIMP 2.2.8.