[Bioc-sig-seq] Comparing two chipseq position sets

Thu May 7 17:29:53 CEST 2009

Hi Ivan,

I'm not yet so familiar with iRanges, just starting to use it. By the  
way, I use the opportunity to thank the guys behing those libraries  
(ShortRead, chipseq, iRanges, etc.) as they are doing a tremendous  
work. Chapeau bas messieurs!
Back to your question, the fact that A and B have different size is  
not a problem for genomeIntervals. The interval_overlap function will  
return a list, which for every interval of A will have a vector of the  
position in B of the corresponding overlapping interval.
I agree, this is confusing, so here is the bit of code applied to for  
your example:

require(genomeIntervals)

A.set<-new("Genome_intervals",
		matrix(
			c(3660781,3662707,
			4481742,4482656,
			4482813,4484003,
			4561320,4562262,
			4774887,4776304,
			4797291,4798822,
			4847807,4848846,
			5008093,5009386,
			5009514,5010046,
			5010095,5010583),ncol=2,byrow=TRUE),
            	closed=c(TRUE,TRUE),
            	annotation=data.frame(
              	seq_name=rep("chr1",10),
              	inter_base=FALSE,
              	strand="+"
		)
            )

B.set<-new("Genome_intervals",
		matrix(
			c(
			 3659579,3662079,
			  4773791,4776291,
			   4797473,4799973,
			    4847394,4849894,
			     5007460,5009960,
			      5072753,5075253,
			       6204242,6206742,
			        7078730,7081230,
				 9282452,9284952,
				  9683423,9685923),ncol=2,byrow=TRUE),
            	closed=c(TRUE,TRUE),
            	annotation=data.frame(
              	seq_name=rep("chr1",10),
              	inter_base=FALSE,
              	strand="+"
		)
            )

A.B.overlap<-interval_overlap(A.set,B.set)

A.B.overlap

[[1]]
[1] 1

[[2]]
integer(0)

[[3]]
integer(0)

[[4]]
integer(0)

[[5]]
[1] 2

[[6]]
[1] 3

[[7]]
[1] 4

[[8]]
[1] 5

[[9]]
[1] 5

[[10]]
integer(0)

As you can see the first interval in A overlap with the first in B;  
the fifth in A with the second in B and so on...

The comment from Steve are very good, I've read about it quite often  
in the literature, especially for ChIP-chip but I don't know if  
there's a package around. Let me know if you find/know about one.

Best,

---------------------------------------------------------------
Nicolas Delhomme

High Throughput Functional Genomics Center

European Molecular Biology Laboratory

Tel: +49 6221 387 8426
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------

On 7 May 2009, at 16:02, Ivan Gregoretti wrote:

> Hello Steve, Nicolas and Michael,
>
> I agree with all of you: it is not a trivial question.
>
> I asked the bioc-sig-seq listers because I thought, --Hey, this must
> be the everyday's question of the genome analyst.
>
> Say you ran your chipseq under condition A and then you ran it under
> condition B. Then you have to decide whether A and B made any
> difference. It doesn't get any simpler than that!
>
> I can't compare the two means or the two dispersions. I have to
> compare pairs. The problem is that it is not trivial to unambiguously
> determine which spot in B must be paired with each spot in A. To start
> with, A and B may have different numbers of loci (ie 15000 versus
> 18000).
>
> I'll take a look at genomeIntervals and IRanges.
>
> By the way, Michael, would you let me know as soon as the new IRanges
> documentation comes out? You guys were working on something, I
> understand.
>
> Thank you all,
>
> Ivan
>
> Ivan Gregoretti, PhD
> National Institute of Diabetes and Digestive and Kidney Diseases
> National Institutes of Health
> 5 Memorial Dr, Building 5, Room 205.
> Bethesda, MD 20892. USA.
> Phone: 1-301-496-1592
> Fax: 1-301-496-9878
>
>
>
> On Thu, May 7, 2009 at 9:24 AM, Michael Lawrence  
> <mflawren at fhcrc.org> wrote:
>>
>>
>> On Wed, May 6, 2009 at 12:40 PM, Ivan Gregoretti  
>> <ivangreg at gmail.com> wrote:
>>>
>>> Hello Bioc-sig-seq,
>>>
>>> Say you run your ChIP-seq and find binding positions like this
>>>
>>> chr1  3660781  3662707
>>> chr1  4481742  4482656
>>> chr1  4482813  4484003
>>> chr1  4561320  4562262
>>> chr1  4774887  4776304
>>> chr1  4797291  4798822
>>> chr1     4847807  4848846
>>> chr1  5008093  5009386
>>> chr1  5009514  5010046
>>> chr1  5010095  5010583
>>> ...[many more loci and chromosomes]...
>>>
>>> Then you want to compare it to published data like this
>>>
>>> chr1  3659579  3662079
>>> chr1  4773791  4776291
>>> chr1  4797473  4799973
>>> chr1  4847394  4849894
>>> chr1  5007460  5009960
>>> chr1  5072753  5075253
>>> chr1  6204242  6206742
>>> chr1  7078730  7081230
>>> chr1  9282452  9284952
>>> chr1  9683423  9685923
>>> ...[many more loci and chromosomes]...
>>>
>>> What method would you use to test whether these two lists are
>>> significantly different?
>>
>> This is a tough statistical question that probably needs to be a  
>> bit more
>> specific, but as far as technical tools, in addition to  
>> genomeIntervals
>> there is the IRanges package and its efficient "overlap" function.  
>> IRanges
>> is well integrated with the rest of sequence analysis  
>> infrastructure in
>> Bioconductor.
>>
>>>
>>> Any pointer would be appreciated.
>>>
>>> Ivan
>>>
>>> Ivan Gregoretti, PhD
>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>> National Institutes of Health
>>> 5 Memorial Dr, Building 5, Room 205.
>>> Bethesda, MD 20892. USA.
>>> Phone: 1-301-496-1592
>>> Fax: 1-301-496-9878
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing