[BioC] Removing duplicate probes from expressionset
Martin Morgan
mtmorgan at fhcrc.org
Sun Apr 15 01:37:55 CEST 2012
On 04/14/2012 03:59 PM, Angela McDonald wrote:
> Hello,
>
> I am wondering how to remove duplicate probes from an expression set in Bioconductor. I have tried to use nsFilter with no success.
>
> When I use the following:
>
> featureFilter(xenexp, require.entrez=TRUE, remove.dupEntrez=TRUE)
>
> The error I get is:
>
> Error in rowQ(exprs(imat), which) :
> cannot calculate order statistic on object with 2 columns
>
> The xenexp expression set includes two samples on the mgu74av2 array
Hi Angela --
featureFilter decides which of several probesets sharing an ENTREZ id
to keep by retaining the probeset with the largest interquartile range
across samples. With only two samples the order statistics needed for
that interquartile range cannot be calculated, leading to the error
above.
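To see why two samples are a problem, here is a minimal base-R sketch.
It assumes the quartiles are taken as order statistics at positions
floor(0.25 * n) and ceiling(0.75 * n); that is my reading of how the
row-wise IQR is computed and may differ between genefilter versions.
The helper order_stat_iqr is hypothetical and not part of any package.

```r
## Hypothetical helper (not from genefilter): IQR from order statistics.
order_stat_iqr <- function(x) {
    n <- length(x)
    lo <- floor(0.25 * n)     # position of the lower order statistic
    hi <- ceiling(0.75 * n)   # position of the upper order statistic
    if (lo < 1 || hi > n)
        stop("cannot take order statistic ", lo, " of ", n, " values")
    s <- sort(x)
    s[hi] - s[lo]
}

## Four samples: positions 1 and 3 both exist, so a value is returned.
iqr4 <- order_stat_iqr(c(5, 1, 9, 3))

## Two samples: the lower position is floor(0.5) = 0, which does not exist.
res <- tryCatch(order_stat_iqr(c(5, 1)),
                error = function(e) conditionMessage(e))
```

Under that assumption, any per-probeset statistic that is well defined
for two values (range, mean, maximum) avoids the problem.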
From looking at the source for featureFilter

  > featureFilter
  function (eset, require.entrez = TRUE, require.GOBP = FALSE,
      require.GOCC = FALSE, require.GOMF = FALSE, require.CytoBand = FALSE,
      remove.dupEntrez = TRUE, feature.exclude = "^AFFX")
  {
  [...]

you'll see that duplicate probes are removed by the lines

      if (remove.dupEntrez) {
          uniqGenes <- findLargest(featureNames(eset), rowIQRs(eset),
              annotation(eset))
          eset <- eset[uniqGenes, ]
      }

So, after consulting ?findLargest, you could use some statistic other
than rowIQRs (row interquartile range) to select which probeset to
retain. For example, using the 'sample.ExpressionSet' data and keeping,
for each gene, the probeset with the largest range for subsequent
analysis:
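
  library(Biobase)     # ExpressionSet, exprs(), featureNames(), annotation()
  library(genefilter)  # findLargest()
  data(sample.ExpressionSet)
  eset <- sample.ExpressionSet
  ## per-probeset range (max - min) across samples
  rng <- apply(exprs(eset), 1, function(x) diff(range(x)))
  ## one probeset per Entrez id (needs the chip's annotation package)
  uniqGenes <- findLargest(featureNames(eset), rng, annotation(eset))
  eset <- eset[uniqGenes,]
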
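The selection step can also be sketched in base R on toy data, which
may help when deciding which statistic to use with only two samples.
This is an illustration of the idea (keep, per gene, the probeset with
the largest statistic), not the genefilter implementation; the probe
names, Entrez ids and expression values below are invented.

```r
## Toy data: 5 probesets x 2 samples; probe and Entrez ids are made up.
expr <- matrix(c(7.1, 7.9,
                 6.0, 6.2,
                 8.5, 5.5,
                 4.0, 4.1,
                 9.0, 9.3),
               ncol = 2, byrow = TRUE,
               dimnames = list(paste0("probe", 1:5), c("s1", "s2")))
entrez <- c(probe1 = "101", probe2 = "101", probe3 = "202",
            probe4 = NA, probe5 = "202")

## With two samples the range is just the absolute difference.
rng <- abs(expr[, "s1"] - expr[, "s2"])

## Group the statistic by gene (split() drops probes whose id is NA)
## and keep the name of the probeset with the largest value per gene.
keep <- sapply(split(rng, entrez), function(x) names(which.max(x)))
filtered <- expr[keep, , drop = FALSE]
```

Replacing rng with, say, rowMeans(expr) would instead keep the most
highly expressed probeset per gene; which statistic is appropriate
depends on the downstream analysis.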
Note that you're asking to remove duplicate Entrez gene identifiers
rather than duplicate probesets. It is not uncommon to perform an
analysis without removing duplicates, anticipating that probesets from
the same gene will convey qualitatively similar signals in the results.
Also, the small sample size restricts the type of analysis possible
anyway, so the usual motivation for removing duplicates (reducing the
number of statistical tests) may not be relevant.
Martin
>
> Thank you so much,
>
> Angela
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793