[Bioc-sig-seq] coverage ranges included in gene ranges and input data

Thu Oct 22 20:41:52 CEST 2009

Florence,
Below is code I would use to find the median coverage depth within specified ranges. This snippet includes a simplification to the coverage method that I just checked into svn and so will be available tomorrow at bioconductor.org via biocLite. Does this code meet your needs and seem reasonable to you?

> suppressMessages(library(IRanges))
> 
> TSS_Dm_Anno_ch <-
+   RangedData(IRangesList(
+              "2L" = IRanges(start=c(2, 3, 5, 8), width=2),
+              "2R" = IRanges(start=c(1, 4, 6), width=3)))

> polII_cov_wt_ch <-
+   RangedData(IRangesList(
+              "2L" = IRanges(start=c(1, 4, 6), width=c(3, 2, 4)),
+              "2R" = IRanges(start=c(1, 2, 3, 6), width=c(3, 3, 3, 4))),
+              score = c(2, 4, 7, 3, 1, 1, 1))

> ## requires IRanges >= 1.3.96
> polIIcoverage <- coverage(polII_cov_wt_ch, weight = "score")
> ##otherwise:
> ##polIIcoverage <- coverage(polII_cov_wt_ch, weight = values(polII_cov_wt_ch)[,"score"])

> cov_med <- aggregate(polIIcoverage, ranges(TSS_Dm_Anno_ch), median)

> TSS_Dm_Anno_ch$med_cov <- unlist(cov_med)

> TSS_Dm_Anno_ch
RangedData with 7 ranges and 1 value column on 2 sequences
colnames(1): med_cov
names(2): 2L 2R

> as.data.frame(TSS_Dm_Anno_ch)
  space start end width med_cov
1    2L     2   3     2     2.0
2    2L     3   4     2     3.0
3    2L     5   6     2     5.5
4    2L     8   9     2     7.0
5    2R     1   3     3     4.0
6    2R     4   6     3     1.0
7    2R     6   8     3     1.0

> sessionInfo()
R version 2.10.0 RC (2009-10-18 r50160) 
i386-apple-darwin9.8.0 

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] IRanges_1.3.96

Michael Lawrence wrote:
> On Thu, Oct 22, 2009 at 3:38 AM, Florence Cavalli <florence at ebi.ac.uk>wrote:
>
>   
>> Hello,
>>
>> I am working with ChIP-seq data. I want to get the median of the coverage
>> for each gene that I have in a RangedData.
>> My coverage is a RangedData object as well. I extracted the ranges of my
>> two
>> objects and I used the %in% function but it gives me the ranges that are
>> totally or partly in a particular gene. I wish to be more restrictive and
>> to
>> include only the ranges that are entirely in the gene.
>>
>> Here is an example.
>>
>>     
>>> class(TSS_Dm_Anno_ch)
>>>       
>> [1] "RangedData"
>> attr(,"package")
>> [1] "IRanges"
>>     
>>> region.TU=ranges(TSS_Dm_Anno_ch)
>>> region.TU
>>>       
>> CompressedIRangesList: 7 ranges
>> sequences(7): "2L" "2R" "3L" "3R" "4" "U" "X"
>>     
>>> class(region.TU)
>>>       
>> [1] "CompressedIRangesList"
>> attr(,"package")
>> [1] "IRanges"
>>
>>     
>>> polII_cov_wt_ch
>>>       
>> RangedData: 6931803 ranges by 1 column on 7 sequences
>> colnames(1): score
>> names(7): 2L 2R 3L 3R 4 U X
>>     
>>> values.ranges=ranges(polII_cov_wt_ch)
>>> values.ranges
>>>       
>> CompressedIRangesList: 7 ranges
>> sequences(7): "2L" "2R" "3L" "3R" "4" "U" "X"
>>
>> ## Get the borders for each gene, i,e the row numbers in values.ranges (min
>> and max) which correspond to ranges that are in the gene (g) range.
>> borders=lapply(names(values.ranges), function(ch)
>> t(sapply(1:nrow(as.data.frame(region.TU[[ch]])),function(g)
>> range(which(values.ranges[[ch]] %in% region.TU[[ch]][g])))))
>>
>>     
>
> And I use the following function to obtain the median of the coverage on the
>   
>> ranges included in the borders
>>
>> getMedianInRegionFromRangedData <- function(M.values,ranges){
>>  median.values=sapply(1:nrow(ranges), function(r)
>> median(M.values[ranges[r,1]:ranges[r,2],]$score))
>>  return(median.values)
>> }
>>
>>
>>     
> Assuming that the RangedData represents a run-length encoded coverage
> vector, the above code is taking the median of the run values, not the
> median of the actual coverage. That is, as far as I can tell, you are not
> weighting the median calculation by the run lengths.
>
> Maybe that's intentional, but anyway RangedData is designed for storing
> variables on a set of ranges. A value of a variable, like score, applies to
> the range, not to every element of the range. In other words, RangedData is
> not formally a run-length encoding, so it is not a very convenient
> representation of coverage.
>
> Coverage is more naturally represented as a sequence, e.g. an Rle object.
> You could construct an RleList from the RangedData (which would be easy if
> your RangedData has the zero regions). Then form an RleViews(List) using the
> gene ranges and viewApply() over it.
>
> Maybe others have better ideas..
>
>
>   
>>  median.values=unlist(lapply(names(borders), function(ch)
>> getMedianInRegionFromRangedData(polII_cov_wt_ch[ch],borders[[ch]])))
>>
>> I am sure my code should be improved, using a mapply I guess!?
>>
>> I haven't found yet the appropriate function to be more restrictive.
>> So far, I did it using something like
>> min(which(start(values.ranges[[ch]])>start(region.TU[[ch]])[g])) and
>> max(which(start(values.ranges[[ch]])<end(region.TU[[ch]])[g]))-1 to get
>> these borders but it's not efficient at all.
>> Should I use a seqselect or Find functions of the Sequence class?
>>
>> One more question, is there any particular function in the chipseq package
>> or in an other package to help us to use the input data, taking it into
>> account to analyse the real signal?
>>
>> Many thanks in advance,
>> Florence
>>
>>     
>>> sessionInfo()
>>>       
>> R version 2.10.0 alpha (2009-10-04 r49926)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>>  [5] LC_MONETARY=C              LC_MESSAGES=en_GB.UTF-8
>>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] BSgenome.Dmelanogaster.UCSC.dm3_1.3.11
>> [2] chipseq_0.1.27
>> [3] ShortRead_1.3.40
>> [4] lattice_0.17-26
>> [5] BSgenome_1.13.16
>> [6] Biostrings_2.13.53
>> [7] IRanges_1.3.94
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.5.8 grid_2.10.0   hwriter_1.1   tools_2.10.0
>>
>> ----------------------------------------
>> Florence Cavalli
>> PhD student, Luscombe group
>> EMBL-European Bioinformatics Institute
>> Wellcome Trust Genome Campus
>> Hinxton
>> CB10 1SD
>> Cambridge
>> UK
>> email:florence at ebi.ac.uk <email%3Aflorence at ebi.ac.uk> <
>> email%3Aflorence at ebi.ac.uk <email%253Aflorence at ebi.ac.uk>>
>>
>>        [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>     
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>