[Bioc-sig-seq] Comparing two chipseq position sets

Tue May 12 16:07:54 CEST 2009

Hi Wolfgang and all,

That looks like a non-trivial paper to read. I'll take a look at it.
I'm interested to see how bad can you do by choosing the simplest
randomization method. If it is not so bad, simplicity wins!

Thank you,

Ivan

Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1592
Fax: 1-301-496-9878

On Tue, May 12, 2009 at 9:32 AM, Wolfgang Huber <huber at ebi.ac.uk> wrote:
>
> Hi all,
>
> this paper addresses a related question:
>
> Non Parametric Methods for Genomic Inference
> PJ Bickel, N Boley, JB Brown, H Huang, NR Zhang
> http://encodestatistics.org/publications/statpaper-July31.pdf
>
> I don't claim to understand it though. Probably you are well served by an ad
> hoc measure of (the distribution of) interval overlap(s).
>
>        Best wishes
>        Wolfgang
>
> Ivan Gregoretti ha scritto:
>>
>> Hello Nicolas and everybody,
>>
>> I wanted to explore genomeIntervals before replying to the list. I have
>> now.
>>
>> Indeed, genomeIntervals is extremely powerful and it should make the
>> comparision between two sets less difficult.
>>
>> I'll try to come up with a clean and simple comparison method first
>> and then move on towards addressing Steve G's questions. Sooner or
>> later chipseqers will have to address them anyway.
>>
>> Thank you everybody,
>>
>> Ivan
>>
>>
>>
>>
>> Ivan Gregoretti, PhD
>> National Institute of Diabetes and Digestive and Kidney Diseases
>> National Institutes of Health
>> 5 Memorial Dr, Building 5, Room 205.
>> Bethesda, MD 20892. USA.
>> Phone: 1-301-496-1592
>> Fax: 1-301-496-9878
>>
>>
>>
>> On Thu, May 7, 2009 at 11:29 AM, Nicolas Delhomme <delhomme at embl.de>
>> wrote:
>>>
>>> Hi Ivan,
>>>
>>> I'm not yet so familiar with iRanges, just starting to use it. By the
>>> way, I
>>> use the opportunity to thank the guys behing those libraries (ShortRead,
>>> chipseq, iRanges, etc.) as they are doing a tremendous work. Chapeau bas
>>> messieurs!
>>> Back to your question, the fact that A and B have different size is not a
>>> problem for genomeIntervals. The interval_overlap function will return a
>>> list, which for every interval of A will have a vector of the position in
>>> B
>>> of the corresponding overlapping interval.
>>> I agree, this is confusing, so here is the bit of code applied to for
>>> your
>>> example:
>>>
>>> require(genomeIntervals)
>>>
>>> A.set<-new("Genome_intervals",
>>>               matrix(
>>>                       c(3660781,3662707,
>>>                       4481742,4482656,
>>>                       4482813,4484003,
>>>                       4561320,4562262,
>>>                       4774887,4776304,
>>>                       4797291,4798822,
>>>                       4847807,4848846,
>>>                       5008093,5009386,
>>>                       5009514,5010046,
>>>                       5010095,5010583),ncol=2,byrow=TRUE),
>>>               closed=c(TRUE,TRUE),
>>>               annotation=data.frame(
>>>               seq_name=rep("chr1",10),
>>>               inter_base=FALSE,
>>>               strand="+"
>>>               )
>>>          )
>>>
>>> B.set<-new("Genome_intervals",
>>>               matrix(
>>>                       c(
>>>                        3659579,3662079,
>>>                         4773791,4776291,
>>>                          4797473,4799973,
>>>                           4847394,4849894,
>>>                            5007460,5009960,
>>>                             5072753,5075253,
>>>                              6204242,6206742,
>>>                               7078730,7081230,
>>>                                9282452,9284952,
>>>                                 9683423,9685923),ncol=2,byrow=TRUE),
>>>               closed=c(TRUE,TRUE),
>>>               annotation=data.frame(
>>>               seq_name=rep("chr1",10),
>>>               inter_base=FALSE,
>>>               strand="+"
>>>               )
>>>          )
>>>
>>> A.B.overlap<-interval_overlap(A.set,B.set)
>>>
>>> A.B.overlap
>>>
>>> [[1]]
>>> [1] 1
>>>
>>> [[2]]
>>> integer(0)
>>>
>>> [[3]]
>>> integer(0)
>>>
>>> [[4]]
>>> integer(0)
>>>
>>> [[5]]
>>> [1] 2
>>>
>>> [[6]]
>>> [1] 3
>>>
>>> [[7]]
>>> [1] 4
>>>
>>> [[8]]
>>> [1] 5
>>>
>>> [[9]]
>>> [1] 5
>>>
>>> [[10]]
>>> integer(0)
>>>
>>> As you can see the first interval in A overlap with the first in B; the
>>> fifth in A with the second in B and so on...
>>>
>>> The comment from Steve are very good, I've read about it quite often in
>>> the
>>> literature, especially for ChIP-chip but I don't know if there's a
>>> package
>>> around. Let me know if you find/know about one.
>>>
>>> Best,
>>>
>>> ---------------------------------------------------------------
>>> Nicolas Delhomme
>>>
>>> High Throughput Functional Genomics Center
>>>
>>> European Molecular Biology Laboratory
>>>
>>> Tel: +49 6221 387 8426
>>> Email: nicolas.delhomme at embl.de
>>> Meyerhofstrasse 1 - Postfach 10.2209
>>> 69102 Heidelberg, Germany
>>> ---------------------------------------------------------------
>>>
>>>
>>>
>>> On 7 May 2009, at 16:02, Ivan Gregoretti wrote:
>>>
>>>> Hello Steve, Nicolas and Michael,
>>>>
>>>> I agree with all of you: it is not a trivial question.
>>>>
>>>> I asked the bioc-sig-seq listers because I thought, --Hey, this must
>>>> be the everyday's question of the genome analyst.
>>>>
>>>> Say you ran your chipseq under condition A and then you ran it under
>>>> condition B. Then you have to decide whether A and B made any
>>>> difference. It doesn't get any simpler than that!
>>>>
>>>> I can't compare the two means or the two dispersions. I have to
>>>> compare pairs. The problem is that it is not trivial to unambiguously
>>>> determine which spot in B must be paired with each spot in A. To start
>>>> with, A and B may have different numbers of loci (ie 15000 versus
>>>> 18000).
>>>>
>>>> I'll take a look at genomeIntervals and IRanges.
>>>>
>>>> By the way, Michael, would you let me know as soon as the new IRanges
>>>> documentation comes out? You guys were working on something, I
>>>> understand.
>>>>
>>>> Thank you all,
>>>>
>>>> Ivan
>>>>
>>>> Ivan Gregoretti, PhD
>>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>>> National Institutes of Health
>>>> 5 Memorial Dr, Building 5, Room 205.
>>>> Bethesda, MD 20892. USA.
>>>> Phone: 1-301-496-1592
>>>> Fax: 1-301-496-9878
>>>>
>>>>
>>>>
>>>> On Thu, May 7, 2009 at 9:24 AM, Michael Lawrence <mflawren at fhcrc.org>
>>>> wrote:
>>>>>
>>>>> On Wed, May 6, 2009 at 12:40 PM, Ivan Gregoretti <ivangreg at gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Hello Bioc-sig-seq,
>>>>>>
>>>>>> Say you run your ChIP-seq and find binding positions like this
>>>>>>
>>>>>> chr1  3660781  3662707
>>>>>> chr1  4481742  4482656
>>>>>> chr1  4482813  4484003
>>>>>> chr1  4561320  4562262
>>>>>> chr1  4774887  4776304
>>>>>> chr1  4797291  4798822
>>>>>> chr1     4847807  4848846
>>>>>> chr1  5008093  5009386
>>>>>> chr1  5009514  5010046
>>>>>> chr1  5010095  5010583
>>>>>> ...[many more loci and chromosomes]...
>>>>>>
>>>>>> Then you want to compare it to published data like this
>>>>>>
>>>>>> chr1  3659579  3662079
>>>>>> chr1  4773791  4776291
>>>>>> chr1  4797473  4799973
>>>>>> chr1  4847394  4849894
>>>>>> chr1  5007460  5009960
>>>>>> chr1  5072753  5075253
>>>>>> chr1  6204242  6206742
>>>>>> chr1  7078730  7081230
>>>>>> chr1  9282452  9284952
>>>>>> chr1  9683423  9685923
>>>>>> ...[many more loci and chromosomes]...
>>>>>>
>>>>>> What method would you use to test whether these two lists are
>>>>>> significantly different?
>>>>>
>>>>> This is a tough statistical question that probably needs to be a bit
>>>>> more
>>>>> specific, but as far as technical tools, in addition to genomeIntervals
>>>>> there is the IRanges package and its efficient "overlap" function.
>>>>> IRanges
>>>>> is well integrated with the rest of sequence analysis infrastructure in
>>>>> Bioconductor.
>>>>>
>>>>>> Any pointer would be appreciated.
>>>>>>
>>>>>> Ivan
>>>>>>
>>>>>> Ivan Gregoretti, PhD
>>>>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>>>>> National Institutes of Health
>>>>>> 5 Memorial Dr, Building 5, Room 205.
>>>>>> Bethesda, MD 20892. USA.
>>>>>> Phone: 1-301-496-1592
>>>>>> Fax: 1-301-496-9878
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-sig-sequencing mailing list
>>>>>> Bioc-sig-sequencing at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>
>>>> _______________________________________________
>>>> Bioc-sig-sequencing mailing list
>>>> Bioc-sig-sequencing at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>
> --
>
> Best wishes
>     Wolfgang
>
> ------------------------------------------------
> Wolfgang Huber, EMBL, http://www.ebi.ac.uk/huber
>