[Bioc-sig-seq] Comparing two chipseq position sets
Ivan Gregoretti
ivangreg at gmail.com
Thu May 21 21:34:07 CEST 2009
Hi Michael,
I read rtracklayer.pdf and I got a partial victory.
Can you show us with an example an example?
For instance, in my genomeIntervals workflow I do
1)
> # read in wild type
> A <- read.table("./wildtype_peaks.xls",
+ header=FALSE, skip=14,
+ col.names=c("chr", "start", "end", "length", "summit",
+ "tags", "Tpvalue", "fold_enrichment", "FDR"))
2)
> # converting the MACS output to a genomeIntervals object
> giA <- new("Genome_intervals", as.matrix(A[ ,2:3]), closed=c(TRUE,TRUE),
annotation=data.frame(seq_name=A[ ,1], inter_base=FALSE, A[ ,4:9]))
3)
> head(giA)
Object of class Genome_intervals
6 base intervals and 0 inter-base intervals(*):
chr1 [3660782, 3662707]
chr1 [4481743, 4482656]
chr1 [4482814, 4484003]
chr1 [4561321, 4562262]
chr1 [4774888, 4776304]
chr1 [4797292, 4798822]
annotation:
seq_name inter_base length summit tags Tpvalue fold_enrichment FDR
1 chr1 FALSE 1926 749 64 492.04 38.44 0.11
2 chr1 FALSE 914 179 14 83.64 10.01 0.88
3 chr1 FALSE 1190 671 18 108.48 16.43 0.17
4 chr1 FALSE 942 472 13 136.26 30.96 0.04
5 chr1 FALSE 1417 781 60 674.41 78.07 0.15
6 chr1 FALSE 1531 813 56 382.46 42.73 0.06
can you show us how to do steps 2) and 3) your way?
Then, with genomeIntervals I would find overlapping peaks like this:
intervals_overlap(giaA, giB)
Can you show us how to do that the non-genomeIntervals way?
Thank you
Ivan
Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1592
Fax: 1-301-496-9878
On Thu, May 21, 2009 at 2:15 PM, Michael Lawrence <mflawren at fhcrc.org>wrote:
>
>
> 2009/5/21 Ivan Gregoretti <ivangreg at gmail.com>
>
>> Hello Hervé,
>>
>> With your help and a tip from Julien Gagneur, the author of
>> genomeIntervals
>> I got it to work very nicely.
>>
>> So, in a nutshell, be aware that there are two objects in this package:
>> Genome_intervals_stranded and Genome_intervals. If you are chipseqing, you
>> need the second one. (Don't expect a call to strand() to work then!!)
>>
>> Also, I want to clarify that although IRanges is the king of simplicity,
>
>
> IRanges is anything but simple, and is hiding behind its "infrastructure
> package" status to avoid providing a vignette that would advertise all of
> its features.
>
> In IRanges, there is a class called RangesList that is useful for holding
> ranges on separate chromosomes (we use the more general term 'spaces'). One
> can call overlap() on a RangesList, and it will match up the chromosomes by
> name and return the overlapping regions as a RangesMatchingList. For certain
> operations there are conveniences. For example, %in% can be used to return a
> logical vector indicating whether a range in one list overlaps one in
> another list (in the devel version).
>
> As for holding extra variables, like a genomic annotation "track", this is
> the RangedData class. One can create a RangedData directly or import one
> from BED, GFF or WIG files with the rtracklayer package. The rtracklayer
> vignette describes the RangedData class fairly well, even though it's
> actually defined by IRanges.
>
> For example, I downloaded the repeat-masker BED track directly from UCSC
> using rtracklayer, which yields a RangedData. I then created a RangedData of
> chip-seq peaks using ShortRead, and coverage+slice in the IRanges package.
> To see which peaks overlap a repeat region across an entire genome, and
> store it as a variable:
>
> peaks$in.repeat <- ranges(peaks) %in% ranges(repeats)
>
> The IRanges developers apologize for not getting the vignette put together
> in a timely fashion. Hopefully we will have one soon.
>
> Michael
>
>
>>
>> this time it should be a second best option. Chipseqers need to compare
>> multiple chromosomes at the same time and also keep track of annotation
>> features. genomeIntervals was designed with that in mind.
>>
>> You showed us a very nicely primer on overlapping analysis with IRanges.
>> That illustrates that IRanges can be used to create funtions for ranges
>> comparision, however, that would be like re-writing genomeIntervals.
>>
>> At the end of this message, find a copy of a simple workflow to read peaks
>> from the MACS 1.3.5 program output and compute the number of overlapping
>> peaks. (Why using MACS instead of bioconductor's chipseq should be another
>> discussion.)
>>
>> Thank you!
>>
>> Ivan
>>
>> Ivan Gregoretti, PhD
>> National Institute of Diabetes and Digestive and Kidney Diseases
>> National Institutes of Health
>> 5 Memorial Dr, Building 5, Room 205.
>> Bethesda, MD 20892. USA.
>> Phone: 1-301-496-1592
>> Fax: 1-301-496-9878
>>
>>
>> *********************************************************************************
>>
>> > require(genomeIntervals)
>> Loading required package: genomeIntervals
>> Loading required package: intervals
>> Loading required package: Biobase
>>
>> Welcome to Bioconductor
>>
>> Vignettes contain introductory material. To view, type
>> 'openVignette()'. To cite Bioconductor, see
>> 'citation("Biobase")' and for packages 'citation(pkgname)'.
>>
>> >
>> > # read in wild type
>> > A <- read.table("./wildtype_peaks.xls",
>> + header=FALSE, skip=14,
>> + col.names=c("chr", "start", "end", "length", "summit",
>> + "tags", "Tpvalue", "fold_enrichment",
>> "FDR"))
>> > # read in mutant
>> > B <- read.table("./mutant_peaks.xls",
>> + header=FALSE, skip=14,
>> + col.names=c("chr", "start", "end", "length", "summit",
>> + "tags", "Tpvalue", "fold_enrichment",
>> "FDR"))
>> >
>> >
>> > # select genomic pieces of interest (no random or mitochondrial chr)
>> > myChromosomes <- c( 'chr1', 'chr2', 'chr3', 'chr4', 'chr5',
>> + 'chr6', 'chr7', 'chr8', 'chr9', 'chr10',
>> + 'chr11', 'chr12', 'chr13', 'chr14', 'chr15',
>> + 'chr16', 'chr17', 'chr18', 'chr19',
>> + 'chrX', 'chrY')
>> > A <- A[ A$chr %in% myChromosomes, ]
>> > B <- B[ A$chr %in% myChromosomes, ]
>> >
>> >
>> > # remove unused factors
>> > A[] <- lapply(A, "[", drop=TRUE)
>> > B[] <- lapply(B, "[", drop=TRUE)
>> >
>> >
>> > # peek at the peaks (as data.frame)
>> > head(A)
>> chr start end length summit tags Tpvalue fold_enrichment FDR
>> 1 chr1 3660782 3662707 1926 749 64 492.04 38.44 0.11
>> 2 chr1 4481743 4482656 914 179 14 83.64 10.01 0.88
>> 3 chr1 4482814 4484003 1190 671 18 108.48 16.43 0.17
>> 4 chr1 4561321 4562262 942 472 13 136.26 30.96 0.04
>> 5 chr1 4774888 4776304 1417 781 60 674.41 78.07 0.15
>> 6 chr1 4797292 4798822 1531 813 56 382.46 42.73 0.06
>> > head(B)
>> chr start end length summit tags Tpvalue fold_enrichment FDR
>> 1 chr1 3660943 3662070 1128 316 60 750.21 95.13 0.18
>> 2 chr1 4483371 4483833 463 183 12 90.00 15.42 0.11
>> 3 chr1 4561625 4562191 567 312 19 169.06 26.97 0.04
>> 4 chr1 4774935 4776419 1485 734 64 849.54 197.77 0.18
>> 5 chr1 4797256 4798908 1653 820 85 789.10 102.26 0.15
>> 6 chr1 4847116 4848877 1762 1038 67 300.77 20.93 0.06
>> >
>> >
>> > # total number of peaks
>> > dim(A)[1]
>> [1] 14848
>> > dim(B)[1]
>> [1] 15668
>> >
>> >
>> > # peaks at FDR equal or less than 5%
>> > dim(A[A$FDR <= 0.05, ])[1]
>> [1] 6330
>> > dim(B[A$FDR <= 0.05, ])[1]
>> [1] 6665
>> >
>> >
>> > ## histogram of peak widths (for FDR equal or less than 5%)
>> > #hist(A[A$FDR <= 0.05, ]$length)
>> > #hist(B[B$FDR <= 0.05, ]$length)
>> >
>> > # converting the MACS output to a genomeIntervals object
>> > giA <- new("Genome_intervals", as.matrix(A[ ,2:3]), closed=c(TRUE,TRUE),
>> annotation=data.frame(seq_name=A[ ,1], inter_base=FALSE, A[ ,4:9]))
>> > giB <- new("Genome_intervals", as.matrix(B[ ,2:3]), closed=c(TRUE,TRUE),
>> annotation=data.frame(seq_name=B[ ,1], inter_base=FALSE, B[ ,4:9]))
>> >
>> > # number of peaks in giA that are overlapped by at least one peak in giB
>> > length(which(lapply(interval_overlap(giA, giB), function(x) is.na
>> (x[1]))
>> == FALSE))
>> [1] 13265
>> > # number of peaks in giA that are NOT overlapped by any peak in giB
>> > length(which(lapply(interval_overlap(giA, giB), function(x) is.na
>> (x[1]))
>> == TRUE))
>> [1] 1583
>>
>> [[alternative HTML version deleted]]
>>
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://stat.ethz.ch/pipermail/bioc-sig-sequencing/attachments/20090521/afe5bdd7/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: wildtype_peaks.xls
Type: application/vnd.ms-excel
Size: 791387 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioc-sig-sequencing/attachments/20090521/afe5bdd7/attachment-0002.xls>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mutant_peaks.xls
Type: application/vnd.ms-excel
Size: 837094 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioc-sig-sequencing/attachments/20090521/afe5bdd7/attachment-0003.xls>
More information about the Bioc-sig-sequencing
mailing list