[BioC] Fwd: How to decide which distance metric to use for microarray data clustering?

Wed Oct 7 18:13:27 CEST 2009

Hi Peng,

On Oct 7, 2009, at 11:54 AM, Peng Yu wrote:

> On Wed, Oct 7, 2009 at 10:04 AM, Sean Davis <seandavi at gmail.com> wrote:
>>
>>
>> On Wed, Oct 7, 2009 at 10:49 AM, Peng Yu <pengyu.ut at gmail.com> wrote:
>>>
>>> Besides the distance metrics, there are other things that may also be
>>> important. For example, multiple probesets map to a same gene. I can
>>> do clustering on probeset values or on averaged probeset values of
>>> genes. What factors should I consider when I make this kind of
>>> decisions?
>>>
>>
>> It is generally best not to average probes.  You could choose one to be
>> representative of each gene, but averaging is not the best way to go.
>
> Is there any justification why it is not good to average probes?

There is a very informative discussion that touches on this topic on the
BioC list from back in April 2009. I have it flagged with the
intention of going back to it to work out some examples myself, but
alas, haven't yet done so.

Anyway, this is the thread:

http://thread.gmane.org/gmane.science.biology.informatics.conductor/22758

While I recommend you read the whole thing, if you go about nine
messages deep, you'll find a post by James MacDonald (April 24th) with
the following comment:

"""Yes. You are missing the fact that the data from Affy probes  
usually are not normally distributed. In fact, it is not uncommon for
a given probeset to have widely divergent intensity levels for its
component probes. Because of the fact that the mean is not robust to
outliers, people long ago abandoned methods based on a normal
distribution."""
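
To make the point about outliers concrete, here is a minimal sketch in
Python (used only for illustration; it is not from the thread). The
intensity values are made up, not real Affy data, and the helper name
summarize is hypothetical. It compares how far the mean and the median
move when one probe in a probeset has a widely divergent intensity.

```python
from statistics import mean, median

# Hypothetical log2 intensities for the probes of a single probeset.
# These numbers are invented for illustration; the last one plays the
# role of a probe with a widely divergent intensity.
probes = [7.9, 8.1, 8.0, 8.2, 13.5]


def summarize(values):
    """Return the mean and median of a list of probe intensities."""
    return {"mean": mean(values), "median": median(values)}


# Summaries without and with the divergent probe.
clean = summarize(probes[:-1])
with_outlier = summarize(probes)

# How far each summary moved once the divergent probe was included.
mean_shift = abs(with_outlier["mean"] - clean["mean"])
median_shift = abs(with_outlier["median"] - clean["median"])

print(f"mean shift:   {mean_shift:.2f}")
print(f"median shift: {median_shift:.2f}")
```

With these invented values the mean moves by roughly one log2 unit
while the median barely changes, which is the sense in which the mean
is "not robust to outliers". Real probeset summarization methods are
more involved than a plain median; this only illustrates the quoted
argument.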

Hope that's helpful,

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
   |  Memorial Sloan-Kettering Cancer Center
   |  Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact