[Bioc-sig-seq] Comparing two chipseq position sets
Ivan Gregoretti
ivangreg at gmail.com
Fri May 8 17:41:52 CEST 2009
Hello Nicolas and everybody,
I wanted to explore genomeIntervals before replying to the list. I have now.
Indeed, genomeIntervals is extremely powerful and it should make the
comparision between two sets less difficult.
I'll try to come up with a clean and simple comparison method first
and then move on towards addressing Steve G's questions. Sooner or
later chipseqers will have to address them anyway.
Thank you everybody,
Ivan
Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1592
Fax: 1-301-496-9878
On Thu, May 7, 2009 at 11:29 AM, Nicolas Delhomme <delhomme at embl.de> wrote:
> Hi Ivan,
>
> I'm not yet so familiar with iRanges, just starting to use it. By the way, I
> use the opportunity to thank the guys behing those libraries (ShortRead,
> chipseq, iRanges, etc.) as they are doing a tremendous work. Chapeau bas
> messieurs!
> Back to your question, the fact that A and B have different size is not a
> problem for genomeIntervals. The interval_overlap function will return a
> list, which for every interval of A will have a vector of the position in B
> of the corresponding overlapping interval.
> I agree, this is confusing, so here is the bit of code applied to for your
> example:
>
> require(genomeIntervals)
>
> A.set<-new("Genome_intervals",
> matrix(
> c(3660781,3662707,
> 4481742,4482656,
> 4482813,4484003,
> 4561320,4562262,
> 4774887,4776304,
> 4797291,4798822,
> 4847807,4848846,
> 5008093,5009386,
> 5009514,5010046,
> 5010095,5010583),ncol=2,byrow=TRUE),
> closed=c(TRUE,TRUE),
> annotation=data.frame(
> seq_name=rep("chr1",10),
> inter_base=FALSE,
> strand="+"
> )
> )
>
> B.set<-new("Genome_intervals",
> matrix(
> c(
> 3659579,3662079,
> 4773791,4776291,
> 4797473,4799973,
> 4847394,4849894,
> 5007460,5009960,
> 5072753,5075253,
> 6204242,6206742,
> 7078730,7081230,
> 9282452,9284952,
> 9683423,9685923),ncol=2,byrow=TRUE),
> closed=c(TRUE,TRUE),
> annotation=data.frame(
> seq_name=rep("chr1",10),
> inter_base=FALSE,
> strand="+"
> )
> )
>
> A.B.overlap<-interval_overlap(A.set,B.set)
>
> A.B.overlap
>
> [[1]]
> [1] 1
>
> [[2]]
> integer(0)
>
> [[3]]
> integer(0)
>
> [[4]]
> integer(0)
>
> [[5]]
> [1] 2
>
> [[6]]
> [1] 3
>
> [[7]]
> [1] 4
>
> [[8]]
> [1] 5
>
> [[9]]
> [1] 5
>
> [[10]]
> integer(0)
>
> As you can see the first interval in A overlap with the first in B; the
> fifth in A with the second in B and so on...
>
> The comment from Steve are very good, I've read about it quite often in the
> literature, especially for ChIP-chip but I don't know if there's a package
> around. Let me know if you find/know about one.
>
> Best,
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> High Throughput Functional Genomics Center
>
> European Molecular Biology Laboratory
>
> Tel: +49 6221 387 8426
> Email: nicolas.delhomme at embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
> ---------------------------------------------------------------
>
>
>
> On 7 May 2009, at 16:02, Ivan Gregoretti wrote:
>
>> Hello Steve, Nicolas and Michael,
>>
>> I agree with all of you: it is not a trivial question.
>>
>> I asked the bioc-sig-seq listers because I thought, --Hey, this must
>> be the everyday's question of the genome analyst.
>>
>> Say you ran your chipseq under condition A and then you ran it under
>> condition B. Then you have to decide whether A and B made any
>> difference. It doesn't get any simpler than that!
>>
>> I can't compare the two means or the two dispersions. I have to
>> compare pairs. The problem is that it is not trivial to unambiguously
>> determine which spot in B must be paired with each spot in A. To start
>> with, A and B may have different numbers of loci (ie 15000 versus
>> 18000).
>>
>> I'll take a look at genomeIntervals and IRanges.
>>
>> By the way, Michael, would you let me know as soon as the new IRanges
>> documentation comes out? You guys were working on something, I
>> understand.
>>
>> Thank you all,
>>
>> Ivan
>>
>> Ivan Gregoretti, PhD
>> National Institute of Diabetes and Digestive and Kidney Diseases
>> National Institutes of Health
>> 5 Memorial Dr, Building 5, Room 205.
>> Bethesda, MD 20892. USA.
>> Phone: 1-301-496-1592
>> Fax: 1-301-496-9878
>>
>>
>>
>> On Thu, May 7, 2009 at 9:24 AM, Michael Lawrence <mflawren at fhcrc.org>
>> wrote:
>>>
>>>
>>> On Wed, May 6, 2009 at 12:40 PM, Ivan Gregoretti <ivangreg at gmail.com>
>>> wrote:
>>>>
>>>> Hello Bioc-sig-seq,
>>>>
>>>> Say you run your ChIP-seq and find binding positions like this
>>>>
>>>> chr1 3660781 3662707
>>>> chr1 4481742 4482656
>>>> chr1 4482813 4484003
>>>> chr1 4561320 4562262
>>>> chr1 4774887 4776304
>>>> chr1 4797291 4798822
>>>> chr1 4847807 4848846
>>>> chr1 5008093 5009386
>>>> chr1 5009514 5010046
>>>> chr1 5010095 5010583
>>>> ...[many more loci and chromosomes]...
>>>>
>>>> Then you want to compare it to published data like this
>>>>
>>>> chr1 3659579 3662079
>>>> chr1 4773791 4776291
>>>> chr1 4797473 4799973
>>>> chr1 4847394 4849894
>>>> chr1 5007460 5009960
>>>> chr1 5072753 5075253
>>>> chr1 6204242 6206742
>>>> chr1 7078730 7081230
>>>> chr1 9282452 9284952
>>>> chr1 9683423 9685923
>>>> ...[many more loci and chromosomes]...
>>>>
>>>> What method would you use to test whether these two lists are
>>>> significantly different?
>>>
>>> This is a tough statistical question that probably needs to be a bit more
>>> specific, but as far as technical tools, in addition to genomeIntervals
>>> there is the IRanges package and its efficient "overlap" function.
>>> IRanges
>>> is well integrated with the rest of sequence analysis infrastructure in
>>> Bioconductor.
>>>
>>>>
>>>> Any pointer would be appreciated.
>>>>
>>>> Ivan
>>>>
>>>> Ivan Gregoretti, PhD
>>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>>> National Institutes of Health
>>>> 5 Memorial Dr, Building 5, Room 205.
>>>> Bethesda, MD 20892. USA.
>>>> Phone: 1-301-496-1592
>>>> Fax: 1-301-496-9878
>>>>
>>>> _______________________________________________
>>>> Bioc-sig-sequencing mailing list
>>>> Bioc-sig-sequencing at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>>
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>
More information about the Bioc-sig-sequencing
mailing list