[R] random forest and vegetation data

ahelmore at umd.edu
Thu Jan 31 23:18:04 CET 2008


Hi there,

I am an environmental studies master’s student trying to get my thesis out the door.  I am also a newbie at trees in general, but I like what I see in the literature about the random forest algorithm.  I think I get the general gist of things, but even after doing the reading I’m unclear about how I could be getting the results I’m seeing.  I am obviously missing something about how the split points in the final tree are decided.

I’ve been using random forests in image classification by entering split values into decision tree classifiers, and that has seemed to work very well.  The map output appears legitimate, and withheld data gives confusion matrices similar to the prediction errors from the random forest.  This leads me to assume that the split points are effective.
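In case it helps to see what I mean, here is roughly how I have been pulling split values out of a forest.  This is only a sketch; it assumes the randomForest package, and the data frame and object names (train, class, rf.fit) are made up:

library(randomForest)

## Fit a forest on the training pixels ("train" is a made-up data
## frame with a factor response "class" and the band values as
## predictors)
rf.fit <- randomForest(class ~ ., data = train, ntree = 500)

## Out-of-bag error rate and confusion matrix reported by the forest
print(rf.fit)

## One individual tree, with its split variables and split points --
## these are the values I have been copying into the image classifier
getTree(rf.fit, k = 1, labelVar = TRUE)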

However, now that I’ve turned to the ecological portion of my analysis, with a data set that contains few variable levels and lots of zeros, the splitting node information suddenly stops making sense.

Here is my situation.  I have a matrix of study plots, each belonging to one of three elevation classes, with percent cover class data for 15 plant species associated with each plot:

plot	elev	sp1	sp2	sp3… sp15
1	3	0	2	6…      5
2	0	0	0	1…      0
etc.

The species data are ordered factors from 0-9.  When I run the algorithm using species cover values to predict elevation class, two species alone come up as the best predictors.  That makes ecological sense in this setting, given the species ranges in question.
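For reference, the call I am using looks roughly like this (a sketch only; the data frame name "plots" and the column layout follow the example above, and elev is treated as a factor):

library(randomForest)

## plots: one row per study plot; column 1 = plot ID,
## column 2 = elev (elevation class), columns 3:17 = sp1 ... sp15
plots$elev <- as.factor(plots$elev)

## Drop the plot ID column and predict elevation class from the
## species cover values
rf.elev <- randomForest(elev ~ ., data = plots[, -1],
                        ntree = 500, importance = TRUE)

## Variable importance -- this is where the two species stand out
importance(rf.elev)
varImpPlot(rf.elev)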

Here’s my difficulty, though.  The split point values can’t be interpreted, as far as I can tell.  I’m getting split points of, say, 1.5 and 2.5 for a species whose cover is either 0 (absent) or 4 and above.  So obviously the split points in the final tree are being generated in some way I don’t understand.  Averaged?

I’ve tried running the algorithm with the data as factors, as ordered factors, and as numerical variables, just to see if I could gain insight into what’s going on, but I’m coming up clueless.  My literature hunt turns up repeated instances of folks saying that the final tree can’t be interpreted the way other trees are, but not much on just why that might be.
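Concretely, the three versions I have tried look something like this (again just a sketch; the species columns are assumed to sit in columns 3 through 17 of the "plots" data frame):

library(randomForest)

## Elevation class as a factor, as above
plots$elev <- as.factor(plots$elev)
sp.cols <- 3:17

## 1. Species cover as plain (unordered) factors
plots.fac <- plots
plots.fac[sp.cols] <- lapply(plots.fac[sp.cols], factor)
rf.fac <- randomForest(elev ~ ., data = plots.fac[, -1])

## 2. Species cover as ordered factors
plots.ord <- plots
plots.ord[sp.cols] <- lapply(plots.ord[sp.cols], factor, ordered = TRUE)
rf.ord <- randomForest(elev ~ ., data = plots.ord[, -1])

## 3. Species cover as plain numeric values
plots.num <- plots
plots.num[sp.cols] <- lapply(plots.num[sp.cols],
                             function(x) as.numeric(as.character(x)))
rf.num <- randomForest(elev ~ ., data = plots.num[, -1])

## Compare the split variables and split points of the first tree
## in each forest
getTree(rf.fac, k = 1, labelVar = TRUE)
getTree(rf.ord, k = 1, labelVar = TRUE)
getTree(rf.num, k = 1, labelVar = TRUE)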

Some folks talk about the final tree being “averaged,” while others say the “mode” is employed (which doesn’t make sense to me if I’m getting split values of 1.5 and 2.5).  If the trees are only good as black-box predictors (which is of course a very useful thing in itself), should I even be using the node information in my image classifications?

As you can see, I’m missing some rather important point here.  Can anyone enlighten me?

Thanks,
A

