[Bioc-sig-seq] coverage ranges included in gene ranges and input data
Patrick Aboyoun
paboyoun at fhcrc.org
Thu Oct 22 20:41:52 CEST 2009
Florence,
Below is code I would use to find the median coverage depth within specified ranges. This snippet includes a simplification to the coverage method that I just checked into svn and so will be available tomorrow at bioconductor.org via biocLite. Does this code meet your needs and seem reasonable to you?
> suppressMessages(library(IRanges))
>
> TSS_Dm_Anno_ch <-
+ RangedData(IRangesList(
+ "2L" = IRanges(start=c(2, 3, 5, 8), width=2),
+ "2R" = IRanges(start=c(1, 4, 6), width=3)))
> polII_cov_wt_ch <-
+ RangedData(IRangesList(
+ "2L" = IRanges(start=c(1, 4, 6), width=c(3, 2, 4)),
+ "2R" = IRanges(start=c(1, 2, 3, 6), width=c(3, 3, 3, 4))),
+ score = c(2, 4, 7, 3, 1, 1, 1))
> ## requires IRanges >= 1.3.96
> polIIcoverage <- coverage(polII_cov_wt_ch, weight = "score")
> ##otherwise:
> ##polIIcoverage <- coverage(polII_cov_wt_ch, weight = values(polII_cov_wt_ch)[,"score"])
> cov_med <- aggregate(polIIcoverage, ranges(TSS_Dm_Anno_ch), median)
> TSS_Dm_Anno_ch$med_cov <- unlist(cov_med)
> TSS_Dm_Anno_ch
RangedData with 7 ranges and 1 value column on 2 sequences
colnames(1): med_cov
names(2): 2L 2R
> as.data.frame(TSS_Dm_Anno_ch)
space start end width med_cov
1 2L 2 3 2 2.0
2 2L 3 4 2 3.0
3 2L 5 6 2 5.5
4 2L 8 9 2 7.0
5 2R 1 3 3 4.0
6 2R 4 6 3 1.0
7 2R 6 8 3 1.0
> sessionInfo()
R version 2.10.0 RC (2009-10-18 r50160)
i386-apple-darwin9.8.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] IRanges_1.3.96
Michael Lawrence wrote:
> On Thu, Oct 22, 2009 at 3:38 AM, Florence Cavalli <florence at ebi.ac.uk>wrote:
>
>
>> Hello,
>>
>> I am working with ChIP-seq data. I want to get the median of the coverage
>> for each gene that I have in a RangedData.
>> My coverage is a RangedData object as well. I extracted the ranges of my
>> two
>> objects and I used the %in% function but it gives me the ranges that are
>> totally or partly in a particular gene. I wish to be more restrictive and
>> to
>> include only the ranges that are entirely in the gene.
>>
>> Here is an example.
>>
>>
>>> class(TSS_Dm_Anno_ch)
>>>
>> [1] "RangedData"
>> attr(,"package")
>> [1] "IRanges"
>>
>>> region.TU=ranges(TSS_Dm_Anno_ch)
>>> region.TU
>>>
>> CompressedIRangesList: 7 ranges
>> sequences(7): "2L" "2R" "3L" "3R" "4" "U" "X"
>>
>>> class(region.TU)
>>>
>> [1] "CompressedIRangesList"
>> attr(,"package")
>> [1] "IRanges"
>>
>>
>>> polII_cov_wt_ch
>>>
>> RangedData: 6931803 ranges by 1 column on 7 sequences
>> colnames(1): score
>> names(7): 2L 2R 3L 3R 4 U X
>>
>>> values.ranges=ranges(polII_cov_wt_ch)
>>> values.ranges
>>>
>> CompressedIRangesList: 7 ranges
>> sequences(7): "2L" "2R" "3L" "3R" "4" "U" "X"
>>
>> ## Get the borders for each gene, i,e the row numbers in values.ranges (min
>> and max) which correspond to ranges that are in the gene (g) range.
>> borders=lapply(names(values.ranges), function(ch)
>> t(sapply(1:nrow(as.data.frame(region.TU[[ch]])),function(g)
>> range(which(values.ranges[[ch]] %in% region.TU[[ch]][g])))))
>>
>>
>
> And I use the following function to obtain the median of the coverage on the
>
>> ranges included in the borders
>>
>> getMedianInRegionFromRangedData <- function(M.values,ranges){
>> median.values=sapply(1:nrow(ranges), function(r)
>> median(M.values[ranges[r,1]:ranges[r,2],]$score))
>> return(median.values)
>> }
>>
>>
>>
> Assuming that the RangedData represents a run-length encoded coverage
> vector, the above code is taking the median of the run values, not the
> median of the actual coverage. That is, as far as I can tell, you are not
> weighting the median calculation by the run lengths.
>
> Maybe that's intentional, but anyway RangedData is designed for storing
> variables on a set of ranges. A value of a variable, like score, applies to
> the range, not to every element of the range. In other words, RangedData is
> not formally a run-length encoding, so it is not a very convenient
> representation of coverage.
>
> Coverage is more naturally represented as a sequence, e.g. an Rle object.
> You could construct an RleList from the RangedData (which would be easy if
> your RangedData has the zero regions). Then form an RleViews(List) using the
> gene ranges and viewApply() over it.
>
> Maybe others have better ideas..
>
>
>
>> median.values=unlist(lapply(names(borders), function(ch)
>> getMedianInRegionFromRangedData(polII_cov_wt_ch[ch],borders[[ch]])))
>>
>> I am sure my code should be improved, using a mapply I guess!?
>>
>> I haven't found yet the appropriate function to be more restrictive.
>> So far, I did it using something like
>> min(which(start(values.ranges[[ch]])>start(region.TU[[ch]])[g])) and
>> max(which(start(values.ranges[[ch]])<end(region.TU[[ch]])[g]))-1 to get
>> these borders but it's not efficient at all.
>> Should I use a seqselect or Find functions of the Sequence class?
>>
>> One more question, is there any particular function in the chipseq package
>> or in an other package to help us to use the input data, taking it into
>> account to analyse the real signal?
>>
>> Many thanks in advance,
>> Florence
>>
>>
>>> sessionInfo()
>>>
>> R version 2.10.0 alpha (2009-10-04 r49926)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
>> [5] LC_MONETARY=C LC_MESSAGES=en_GB.UTF-8
>> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] BSgenome.Dmelanogaster.UCSC.dm3_1.3.11
>> [2] chipseq_0.1.27
>> [3] ShortRead_1.3.40
>> [4] lattice_0.17-26
>> [5] BSgenome_1.13.16
>> [6] Biostrings_2.13.53
>> [7] IRanges_1.3.94
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.5.8 grid_2.10.0 hwriter_1.1 tools_2.10.0
>>
>> ----------------------------------------
>> Florence Cavalli
>> PhD student, Luscombe group
>> EMBL-European Bioinformatics Institute
>> Wellcome Trust Genome Campus
>> Hinxton
>> CB10 1SD
>> Cambridge
>> UK
>> email:florence at ebi.ac.uk <email%3Aflorence at ebi.ac.uk> <
>> email%3Aflorence at ebi.ac.uk <email%253Aflorence at ebi.ac.uk>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
More information about the Bioc-sig-sequencing
mailing list