November 29, 2007

Heatmap in R

Filed under: R project — izabela @ 4:45 pm

When I decided to switch from SAS to R project, it was because somebody told me that multivariate analysis and plots like heatmap are much easier to do in R. I guess somebody should have told me what I learned much later, that R has a steep learning curve. I guess my is already flattening a bit, but I just recently spend two weeks figuring out how to draw heatmap. Maybe because all the needed information were not available in one spot, and -’x’ must contain finite values only- and -’x’ must be a numeric matrix - in Google search results in brilliant responses of those who master R in this style - if your R tells you the data set is wrong, it is, so look for idiotic entries. Aha. Or read your manual, this one is even more irritating.
Anyway, let me write how to avoid some traps and looking for help from professionals, who usually make you just feel more stupid then you are, unfortunately :(.

You may want to do scaling using scale function, as described with PCA, or not. Up to you. The first trick then is to get rid of your sample/group names, or whatever text you have in first column:

data.names <- data[,1]
data.set <- data[,-1]

Next, make sure your data is in matrix rather than data frame:

data.map <- as.matrix(data.set)

The heatmap, opposite to PCA, takes missing data without any problem, just leaves you with blank spots. You just have to be careful, if you color scheme contains white color as well.

Now all you do is:

heatmap(data.map)

Well, maybe you don’t like colors too much. You can personalize your heatmap a bit:

heatmap(data.map, col=topo.colors, labRow=data.names, xlab=’variables’, ylab=’samples’)

However, the colors schemes are limited to:

cm.colors
topo.colors
terrain.colors
rainbow
heat.colors

Also, I am sure you could figure out how to add color legend, but what for, if the appropriate function is there for you. Just download gplots library:

heatmap.2(data.map, col=greenred(256), density.info=’none’, trace=’none’, cexRow=0.5, labRow=data.names)

You have some additional color schemes, and you can read more about it here.

Technorati Tags: ,

November 28, 2007

Principal Component Analysis in R

Filed under: R project — izabela @ 7:40 pm

The one supposedly easy thing to do in R project was biplot. Well, indeed it is, when you have your Principal Component Analysis (PCA) calculated right. Which may not necessarily be easy.

Let’s start from easier function princomp. The one problem I found with it was dealing with missing data. As all PCA procedures will disregard samples where even one number is missing. You may as well remove them before the analysis, if you don’t feel like re-calculating/replacing/dealing with them some other way:

data.na <- na.omit(data)

I also found a problem with column containing sample names, groups etc. so remove them as well by putting them into another “file”:

data.names <- data.na[,1]
data.pc <- data.na[,-1]

and now you are ready to roll:

pc.data <- princomp(data.pc, cor=T, scores=T)

To see how the eigenvalues looks like (scree plot):

plot(pc.data)
abline(h=1)

and you can make a decision how many PCs to retain based on how many PCs have eigenvalues more than 1, which may be better visible from:

summary(pc.data)

This also gives you information about variance explained by each PC and cumulative variance. Nice. Now, biplot:

biplot(pc.data, xlabs=data.names)

To get loadings and scores, you just type:

pc.data$loadings
pc.data$scores

So if you need fancier plots:

plot(pc.data$scores[,2]~pc.data$scores[,1], xlab=’PC1 (% of total variance)’, ylab=’PC2 (% of total variance), type=’n')
text(pc.data$scores[1], pc.data$scores[2], labels=data.names)

or even symbol or color coding plots:

libarary(lattice)
xyplot(pc.data$scores[,2]~pc.data$scores[,1],
xlab=’PC1 (% of total variance)’, ylab=’PC2 (% of total variance),
col=’black’,
pch=c(1,2,3 etc.), cex=1.2,

key=list(points=list(pch=c(1,2,3 etc.)),
text=list(c(’name1′, ‘name 2′, ‘name 3′ etc.))))

I find it much easier than in SAS….

Now, in case your matrix is “wider than longer” meaning more variables than samples, the princomp will tell you to get lost, by some nice error message. Of course, beauty of R, there is a solution. Function prcomp is there for you, as explained here doing Q-mode PCA other then R-mode PCA, for those who know what it means. You just need to figure out how to deal with it. First, how to put together the code:
pc.data <- prcomp(data.set, cor=T)


I find it most appealing. I tried scaling in the function, and I am not saying there is anything wrong with it. But you may as well do it before, by scale function:

data.scaled <- scale(data, scale=T, center=T)

Going back to our PCA, now the trick is to extract information we want- meaning eigenvalues, scores and loadings.

summary(pc.data)

gives us desired proportion of variance and cumulative proportion, but…. well, just look at numbers in first line, and you know something is wrong. You are right, these obviously are not eigenvalues, and as you can read in R help for the function this is because of calculations are done bu singular value decomposition. I will use elbow on scree plot to decide how many PCs to retain, although I am sure the professionals know better how to work around it…
For that, I suggest rather than using plot to use

plot(pc.data$sdev)

and make the decision visually , but this requires already knowing how the output looks like. Which gets me to the point that, of course, there is no pc.data$loadings and pc.data$scores in prcomp. However, let me start from the fact, that biplot works as it should:

biplot(pc.data, xlabs=data.names)

Just in case you really have extreme number of variables which is making you plot impossible to look at, I found this trick working beautifully:

biplot(pc.data, xlabs=data.names, col=c(’black’, ‘none’)

for those interested, if you use white as second color, it will affect the labels on sample names directly below, setting the color to none is really superior. Don’t worry about error message thou, as you done it on purpose and got the results you wanted.
This takes care of scores plotting. If you want to plot loadings on PC1 and PC2, or another pair, the simples solution would just be to set first color to none and second to black, respectively. However, if you want to plot loadings just on one PC, well, it requires this knowledge, which was not necessarily easy to find on the net, but is for example explained here:

pc.data$scores = pc.data$x
pc.data$loadings=pc.data$rotation

Thus, your code may look like that:

plot(pc.data$rotation[1:number of variables]~variables.scale, type=’l', xlab=’x scale’, ylab=’loadings on PC1′)

Now, plotting loadings on PC2, you need your first number in range be number of samples+1, second number be number of samples times 2, if you get the idea….

NOTE - there is a pca function in labdsv library, if you want to check it here. But.. The great advantage is that you can limit number of extracted PCs and the scores and loadings are in output by these names.But… it has a limiting plotting capabilities for general applications.

Technorati Tags: , , , , , , , , , ,

Powered by WordPress.
Theme by Ron and Andrea. Background image from Gimp Patterns. Theme images created using The GIMP 2.2.8.