[Bioc-sig-seq] Comparing two chipseq position sets

Thu May 21 01:58:47 CEST 2009

Hi Ivan,

Ivan Gregoretti wrote:
> Hello Nicolas and everybody,
> 
> Have you managed to get the output of the peak finder MACS into a
> genomeIntervals object?
> 
> 
> The output of MACS as seen through a data.frame object looks like this
> 
>> head(A, 10)
>     chr   start     end length summit tags Tpvalue fold_enrichment  FDR
> 1  chr1 3660782 3662707   1926    749   64  492.04           38.44 0.11
> 2  chr1 4481743 4482656    914    179   14   83.64           10.01 0.88
> 3  chr1 4482814 4484003   1190    671   18  108.48           16.43 0.17
> 4  chr1 4561321 4562262    942    472   13  136.26           30.96 0.04
> 5  chr1 4774888 4776304   1417    781   60  674.41           78.07 0.15
> 6  chr1 4797292 4798822   1531    813   56  382.46           42.73 0.06
> 7  chr1 4847808 4848846   1039    533   26  325.36           82.17 0.05
> 8  chr1 5008094 5009386   1293    973   20  117.98           13.28 0.08
> 9  chr1 5009515 5010046    532    179   10   94.86           24.65 0.38
> 10 chr1 5010096 5010583    488    170    6   52.31           20.54 5.39
> 
> 
> My attempt to make it a genomeIntervals object goes as far as this:
> 
> giA <- new("Genome_intervals", A[ ,2:3], closed=c(TRUE,TRUE),
> annotation=data.frame(A[ ,1], inter_base=FALSE, strand="+", A[ ,4:9]))
> 
> The operation reports no errors but it doesn't really work because I can't
> use it. For instance, I can't make a simple call to the first three elements
> like this
> 
>> giA[1:3]
> Error in .nextMethod(x, i, j, ..., drop) : subscript out of bounds

Because your giA object is broken (have you tried validObject on it?
or to just display it?)

Try:

   giA_annotation <- data.frame(seq_name=A[ ,1], # note the required named field!
                                inter_base=FALSE,
                                strand="+",
                                A[ ,4:9])

   giA <- new("Genome_intervals",
              as.matrix(A[ ,2:3]),
              closed=c(TRUE,TRUE),
              annotation=giA_annotation)

   > validObject(giA)
   [1] TRUE

Maybe the authors of the genomeIntervals package should not let you
build broken Genome_intervals instances so easily...

Cheers,
H.

> 
> Any insights?
> 
> Thank you,
> 
> Ivan
> 
> 
> 
> Ivan Gregoretti, PhD
> National Institute of Diabetes and Digestive and Kidney Diseases
> National Institutes of Health
> 5 Memorial Dr, Building 5, Room 205.
> Bethesda, MD 20892. USA.
> Phone: 1-301-496-1592
> Fax: 1-301-496-9878
> 
> 
> On Thu, May 7, 2009 at 11:29 AM, Nicolas Delhomme <delhomme at embl.de> wrote:
> 
>> Hi Ivan,
>>
>> I'm not yet so familiar with iRanges, just starting to use it. By the way,
>> I use the opportunity to thank the guys behing those libraries (ShortRead,
>> chipseq, iRanges, etc.) as they are doing a tremendous work. Chapeau bas
>> messieurs!
>> Back to your question, the fact that A and B have different size is not a
>> problem for genomeIntervals. The interval_overlap function will return a
>> list, which for every interval of A will have a vector of the position in B
>> of the corresponding overlapping interval.
>> I agree, this is confusing, so here is the bit of code applied to for your
>> example:
>>
>> require(genomeIntervals)
>>
>> A.set<-new("Genome_intervals",
>>                matrix(
>>                        c(3660781,3662707,
>>                        4481742,4482656,
>>                        4482813,4484003,
>>                        4561320,4562262,
>>                        4774887,4776304,
>>                        4797291,4798822,
>>                        4847807,4848846,
>>                        5008093,5009386,
>>                        5009514,5010046,
>>                        5010095,5010583),ncol=2,byrow=TRUE),
>>                closed=c(TRUE,TRUE),
>>                annotation=data.frame(
>>                seq_name=rep("chr1",10),
>>                inter_base=FALSE,
>>                strand="+"
>>                )
>>           )
>>
>> B.set<-new("Genome_intervals",
>>                matrix(
>>                        c(
>>                         3659579,3662079,
>>                          4773791,4776291,
>>                           4797473,4799973,
>>                            4847394,4849894,
>>                             5007460,5009960,
>>                              5072753,5075253,
>>                               6204242,6206742,
>>                                7078730,7081230,
>>                                 9282452,9284952,
>>                                  9683423,9685923),ncol=2,byrow=TRUE),
>>                closed=c(TRUE,TRUE),
>>                annotation=data.frame(
>>                seq_name=rep("chr1",10),
>>                inter_base=FALSE,
>>                strand="+"
>>                )
>>           )
>>
>> A.B.overlap<-interval_overlap(A.set,B.set)
>>
>> A.B.overlap
>>
>> [[1]]
>> [1] 1
>>
>> [[2]]
>> integer(0)
>>
>> [[3]]
>> integer(0)
>>
>> [[4]]
>> integer(0)
>>
>> [[5]]
>> [1] 2
>>
>> [[6]]
>> [1] 3
>>
>> [[7]]
>> [1] 4
>>
>> [[8]]
>> [1] 5
>>
>> [[9]]
>> [1] 5
>>
>> [[10]]
>> integer(0)
>>
>> As you can see the first interval in A overlap with the first in B; the
>> fifth in A with the second in B and so on...
>>
>> The comment from Steve are very good, I've read about it quite often in the
>> literature, especially for ChIP-chip but I don't know if there's a package
>> around. Let me know if you find/know about one.
>>
>> Best,
>>
>> ---------------------------------------------------------------
>> Nicolas Delhomme
>>
>> High Throughput Functional Genomics Center
>>
>> European Molecular Biology Laboratory
>>
>> Tel: +49 6221 387 8426
>> Email: nicolas.delhomme at embl.de
>> Meyerhofstrasse 1 - Postfach 10.2209
>> 69102 Heidelberg, Germany
>> ---------------------------------------------------------------
>>
>>
>>
>> On 7 May 2009, at 16:02, Ivan Gregoretti wrote:
>>
>>  Hello Steve, Nicolas and Michael,
>>> I agree with all of you: it is not a trivial question.
>>>
>>> I asked the bioc-sig-seq listers because I thought, --Hey, this must
>>> be the everyday's question of the genome analyst.
>>>
>>> Say you ran your chipseq under condition A and then you ran it under
>>> condition B. Then you have to decide whether A and B made any
>>> difference. It doesn't get any simpler than that!
>>>
>>> I can't compare the two means or the two dispersions. I have to
>>> compare pairs. The problem is that it is not trivial to unambiguously
>>> determine which spot in B must be paired with each spot in A. To start
>>> with, A and B may have different numbers of loci (ie 15000 versus
>>> 18000).
>>>
>>> I'll take a look at genomeIntervals and IRanges.
>>>
>>> By the way, Michael, would you let me know as soon as the new IRanges
>>> documentation comes out? You guys were working on something, I
>>> understand.
>>>
>>> Thank you all,
>>>
>>> Ivan
>>>
>>> Ivan Gregoretti, PhD
>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>> National Institutes of Health
>>> 5 Memorial Dr, Building 5, Room 205.
>>> Bethesda, MD 20892. USA.
>>> Phone: 1-301-496-1592
>>> Fax: 1-301-496-9878
>>>
>>>
>>>
>>> On Thu, May 7, 2009 at 9:24 AM, Michael Lawrence <mflawren at fhcrc.org>
>>> wrote:
>>>
>>>>
>>>> On Wed, May 6, 2009 at 12:40 PM, Ivan Gregoretti <ivangreg at gmail.com>
>>>> wrote:
>>>>
>>>>> Hello Bioc-sig-seq,
>>>>>
>>>>> Say you run your ChIP-seq and find binding positions like this
>>>>>
>>>>> chr1  3660781  3662707
>>>>> chr1  4481742  4482656
>>>>> chr1  4482813  4484003
>>>>> chr1  4561320  4562262
>>>>> chr1  4774887  4776304
>>>>> chr1  4797291  4798822
>>>>> chr1     4847807  4848846
>>>>> chr1  5008093  5009386
>>>>> chr1  5009514  5010046
>>>>> chr1  5010095  5010583
>>>>> ...[many more loci and chromosomes]...
>>>>>
>>>>> Then you want to compare it to published data like this
>>>>>
>>>>> chr1  3659579  3662079
>>>>> chr1  4773791  4776291
>>>>> chr1  4797473  4799973
>>>>> chr1  4847394  4849894
>>>>> chr1  5007460  5009960
>>>>> chr1  5072753  5075253
>>>>> chr1  6204242  6206742
>>>>> chr1  7078730  7081230
>>>>> chr1  9282452  9284952
>>>>> chr1  9683423  9685923
>>>>> ...[many more loci and chromosomes]...
>>>>>
>>>>> What method would you use to test whether these two lists are
>>>>> significantly different?
>>>>>
>>>> This is a tough statistical question that probably needs to be a bit more
>>>> specific, but as far as technical tools, in addition to genomeIntervals
>>>> there is the IRanges package and its efficient "overlap" function.
>>>> IRanges
>>>> is well integrated with the rest of sequence analysis infrastructure in
>>>> Bioconductor.
>>>>
>>>>
>>>>> Any pointer would be appreciated.
>>>>>
>>>>> Ivan
>>>>>
>>>>> Ivan Gregoretti, PhD
>>>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>>>> National Institutes of Health
>>>>> 5 Memorial Dr, Building 5, Room 205.
>>>>> Bethesda, MD 20892. USA.
>>>>> Phone: 1-301-496-1592
>>>>> Fax: 1-301-496-9878
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-sig-sequencing mailing list
>>>>> Bioc-sig-sequencing at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>
>>>>
>>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319