[BioC] Fwd: How to decide which distance metric to use for microarray data clustering?
Steve Lianoglou
mailinglist.honeypot at gmail.com
Wed Oct 7 18:13:27 CEST 2009
Hi Peng,
On Oct 7, 2009, at 11:54 AM, Peng Yu wrote:
> On Wed, Oct 7, 2009 at 10:04 AM, Sean Davis <seandavi at gmail.com> wrote:
>>
>>
>> On Wed, Oct 7, 2009 at 10:49 AM, Peng Yu <pengyu.ut at gmail.com> wrote:
>>>
>>> Besides the distance metrics, there are other things that may also be
>>> important. For example, multiple probesets map to the same gene. I can
>>> do clustering on probeset values or on averaged probeset values of
>>> genes. What factors should I consider when I make this kind of
>>> decision?
>>>
>>
>> It is generally best not to average probes. You could choose one to be
>> representative of each gene, but averaging is not the best way to go.
>
> Is there any justification why it is not good to average probes?
There is a very informative discussion touching on this topic on the
BioC list from back in April 2009. I flagged it, intending to go back
and work out some examples myself, but alas, I haven't done so yet.
Anyway, this is the thread:
http://thread.gmane.org/gmane.science.biology.informatics.conductor/22758
While I recommend you read the whole thing, if you go ~9 messages
deep, you'll find a post by James MacDonald (April 24th) with the
following comment:
"""Yes. You are missing the fact that the data from Affy probes usually
are not normally distributed. In fact, it is not uncommon for a given
probeset to have widely divergent intensity levels for its component
probes. Because of the fact that the mean is not robust to outliers,
people long ago abandoned methods based on a normal distribution."""
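To make the robustness point concrete, here is a minimal sketch in Python.
The intensity values are made up for illustration (they are not real Affy
data), and the variable names are my own. It only shows how a single
divergent probe shifts the mean while leaving the median alone; it is not
meant as a recommended summarization method.

```python
from statistics import mean, median

# Hypothetical log2 intensities for the probes of one probeset.
# The last value is a deliberate outlier (assumption, not real data).
probes = [7.0, 7.5, 8.0, 8.5, 14.0]

avg = mean(probes)    # pulled toward the outlying probe
med = median(probes)  # middle value, unaffected by how extreme the outlier is

print(f"mean   = {avg:.2f}")
print(f"median = {med:.2f}")
```

With these numbers the mean lands above four of the five probes, while the
median stays at the middle probe.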
Hope that's helpful,
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact