[BioC] New to Bioconductor is there a better way?

Hervé Pagès hpages at fhcrc.org
Sat Mar 17 05:49:48 CET 2012


Hi Brian,

Since you are new to Bioconductor, you may not be aware that there
is a much more convenient container than data.frame for storing
the kind of information you are dealing with: the GRanges container.

   library(GenomicRanges)

   # One width-1 range per SNP position.
   snps <- GRanges(seqnames=snps$CHR,
                   ranges=IRanges(start=snps$POS, width=1))

   # The data frame in your example is named 'region'.
   regions <- GRanges(seqnames=region$CHR,
                      ranges=IRanges(start=region$START,
                                     end=region$STOP))
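
Extra columns such as DAT do not have to be dropped: they can travel
with the ranges as metadata columns. A minimal sketch using the toy
SNP table from your post (the names snps_df and snps_gr are only
illustrative, chosen so that the original data frame is not
overwritten):

```r
library(GenomicRanges)

# Toy SNP table copied from the original post.
snps_df <- data.frame(CHR = c("1", "2", "2", "3", "X"),
                      POS = c(295L, 640L, 670L, 100L, 1100L),
                      DAT = 1:5,
                      stringsAsFactors = FALSE)

# Extra named arguments to GRanges() become metadata columns.
snps_gr <- GRanges(seqnames = snps_df$CHR,
                   ranges = IRanges(start = snps_df$POS, width = 1),
                   DAT = snps_df$DAT)

mcols(snps_gr)$DAT       # the DAT values, in the original row order
as.data.frame(snps_gr)   # back to a plain data frame when needed
```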

On 03/15/2012 07:05 AM, Kasper Daniel Hansen wrote:
> This is the way to do it.
>
> There is a convenience function called subsetByOverlaps(), you can
> probably guess what it does.

Yep.
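
For example, with the toy data from your post, a sketch along these
lines should keep exactly the three SNPs you listed (DAT 1, 3 and 5).
The names snps_df, region_df, snps_gr and regions_gr are only
illustrative:

```r
library(GenomicRanges)

# Toy data copied from the original post.
snps_df <- data.frame(CHR = c("1", "2", "2", "3", "X"),
                      POS = c(295L, 640L, 670L, 100L, 1100L),
                      DAT = 1:5,
                      stringsAsFactors = FALSE)
region_df <- data.frame(CHR   = c("1", "1", "2", "2", "2", "X"),
                        START = c(10L, 210L, 430L, 650L, 810L, 1090L),
                        STOP  = c(100L, 350L, 630L, 675L, 850L, 1111L),
                        stringsAsFactors = FALSE)

# DAT is carried along as a metadata column.
snps_gr <- GRanges(seqnames = snps_df$CHR,
                   ranges = IRanges(start = snps_df$POS, width = 1),
                   DAT = snps_df$DAT)
regions_gr <- GRanges(seqnames = region_df$CHR,
                      ranges = IRanges(start = region_df$START,
                                       end = region_df$STOP))

# Keep only the SNPs that fall inside at least one region.
kept <- subsetByOverlaps(snps_gr, regions_gr)

# Alternative: a logical vector for subsetting the original data frame.
hits <- overlapsAny(snps_gr, regions_gr)
res <- snps_df[hits, ]
```

Here kept is a GRanges object, while res stays a plain data frame,
which may be handier if the downstream code expects one.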

I would also recommend that you have a look at the various vignettes
in the GenomicRanges package to familiarize yourself with the basic
infrastructure.

Cheers,
H.

>
> Kasper
>
> On Thu, Mar 15, 2012 at 10:01 AM, Davis, Brian <Brian.Davis at uth.tmc.edu> wrote:
>> I'm very new to Bioconductor (first time to use it) but not to R.  I have a solution to my problem but being new to Bioconductor I'm wondering if there isn't a more appropriate/better way to solve my problem.
>>
>>
>> I have a data frame of chromosome/position pairs (along with other data for the location).  For each pair I need to determine if it is within a given data frame of ranges.  I need to keep only the pairs that are within any of the ranges for further processing.
>>
>>
>>
>> Example:
>>
>> snps<-NULL
>> snps$CHR<-c("1","2","2","3","X")
>> snps$POS<-as.integer(c(295,640,670,100,1100))
>> snps$DAT<-seq_along(snps$CHR)
>> snps<-as.data.frame(snps, stringsAsFactors=FALSE)
>>
>> snps
>>   CHR  POS DAT
>> 1   1  295   1
>> 2   2  640   2
>> 3   2  670   3
>> 4   3  100   4
>> 5   X 1100   5
>>
>> region<-NULL
>> region$CHR<-c("1","1","2","2","2","X")
>> region$START<-as.integer(c(10,210,430,650,810,1090))
>> region$STOP<-as.integer(c(100,350,630,675,850,1111))
>> region<-as.data.frame(region, stringsAsFactors=FALSE)
>>
>> region
>>   CHR START STOP
>> 1   1    10  100
>> 2   1   210  350
>> 3   2   430  630
>> 4   2   650  675
>> 5   2   810  850
>> 6   X  1090 1111
>>
>> The result I need would look like
>>
>> Res
>>  CHR  POS DAT
>>    1  295   1
>>    2  670   3
>>    X 1100   5
>>
>> My current data set is ~100K SNP entries, and my regions table has ~200K entries. I have ~1500 files to go through.
>>
>>
>>
>> My current solution is:
>>
>> library(GenomicRanges)
>> snplist<-with(snps, GRanges(CHR, IRanges(POS, POS)))
>> locations<-with(region, GRanges(CHR, IRanges(START, STOP)))
>> olaps<-findOverlaps(snplist, locations)
>>
>> then I can easily use olaps to subset as needed.  Just trying to see if there are other functions / ways to go about solving this in an effort to learn.
>>
>> Thanks,
>>
>> Brian Davis
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
