[BioC] New to Bioconductor is there a better way?
Hervé Pagès
hpages at fhcrc.org
Sat Mar 17 05:49:48 CET 2012
Hi Brian,
Since you are new to Bioconductor maybe you are not aware there
is a much more convenient container than data.frame for storing
the kind of information you are dealing with: the GRanges container.
  library(GenomicRanges)
  ## 'snps' and 'region' are the two data frames from your example.
  snps <- GRanges(seqnames=snps$CHR,
                  ranges=IRanges(start=snps$POS, width=1))
  regions <- GRanges(seqnames=region$CHR,
                     ranges=IRanges(start=region$START,
                                    end=region$STOP))
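Note that the calls above keep only the coordinates, so the DAT column is
lost. Here is a minimal sketch of one way to carry it along as a metadata
column. It assumes a reasonably recent GenomicRanges (where extra named
arguments to GRanges() become metadata columns and mcols() reads them
back); the object name snps_gr is made up here so the data frame is not
overwritten.

```r
library(GenomicRanges)

# Toy data copied from the example quoted below.
snps <- data.frame(CHR=c("1", "2", "2", "3", "X"),
                   POS=c(295L, 640L, 670L, 100L, 1100L),
                   DAT=1:5,
                   stringsAsFactors=FALSE)

# Extra named arguments (here DAT) are stored as metadata columns.
snps_gr <- GRanges(seqnames=snps$CHR,
                   ranges=IRanges(start=snps$POS, width=1),
                   DAT=snps$DAT)

# The metadata travels with the ranges through subsetting.
mcols(snps_gr)$DAT
```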
On 03/15/2012 07:05 AM, Kasper Daniel Hansen wrote:
> This is the way to do it.
>
> There is a convenience function called subsetByOverlaps(), you can
> probably guess what it does.
Yep.
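For the toy data in the original post, a call to subsetByOverlaps()
could look like the sketch below. The object names (snps_gr, regions_gr,
hits, res) are made up for illustration, and the conversion back to a
data.frame is just one possible way to do it.

```r
library(GenomicRanges)

# SNPs as width-1 ranges, with DAT kept as a metadata column.
snps_gr <- GRanges(c("1", "2", "2", "3", "X"),
                   IRanges(start=c(295L, 640L, 670L, 100L, 1100L), width=1),
                   DAT=1:5)

# Regions of interest.
regions_gr <- GRanges(c("1", "1", "2", "2", "2", "X"),
                      IRanges(start=c(10L, 210L, 430L, 650L, 810L, 1090L),
                              end=c(100L, 350L, 630L, 675L, 850L, 1111L)))

# Keep only the SNPs that overlap at least one region.
hits <- subsetByOverlaps(snps_gr, regions_gr)

# Back to a data.frame with the original column names.
res <- data.frame(CHR=as.character(seqnames(hits)),
                  POS=start(hits),
                  DAT=mcols(hits)$DAT,
                  stringsAsFactors=FALSE)
res
```

If I am not mistaken, snps_gr[overlapsAny(snps_gr, regions_gr)] should
select the same SNPs, in case you prefer a logical index.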
I would also recommend you have a look at the various vignettes
in the GenomicRanges package to get you familiarized with the basic
infrastructure.
Cheers,
H.
>
> Kasper
>
> On Thu, Mar 15, 2012 at 10:01 AM, Davis, Brian <Brian.Davis at uth.tmc.edu> wrote:
>> I'm very new to Bioconductor (first time to use it) but not to R. I have a solution to my problem but being new to Bioconductor I'm wondering if there isn't a more appropriate/better way to solve my problem.
>>
>>
>> I have a data frame of chromosome/position pairs (along with other data for the location). For each pair I need to determine if it is within a given data frame of ranges. I need to keep only the pairs that are within any of the ranges for further processing.
>>
>> Example:
>>
>> snps<-NULL
>> snps$CHR<-c("1","2","2","3","X")
>> snps$POS<-as.integer(c(295,640,670,100,1100))
>> snps$DAT<-seq(1:length(snps$CHR))
>> snps<-as.data.frame(snps, stringsAsFactors=FALSE)
>>
>> snps
>>   CHR  POS DAT
>> 1   1  295   1
>> 2   2  640   2
>> 3   2  670   3
>> 4   3  100   4
>> 5   X 1100   5
>>
>> region<-NULL
>> region$CHR<-c("1","1","2","2","2","X")
>> region$START<-as.integer(c(10,210,430,650,810,1090))
>> region$STOP<-as.integer(c(100,350,630,675,850,1111))
>> region<-as.data.frame(region, stringsAsFactors=FALSE)
>>
>> region
>>   CHR START STOP
>> 1   1    10  100
>> 2   1   210  350
>> 3   2   430  630
>> 4   2   650  675
>> 5   2   810  850
>> 6   X  1090 1111
>>
>> The result I need would look like
>>
>> Res
>> CHR  POS DAT
>>   1  295   1
>>   2  670   3
>>   X 1100   5
>>
>> My current data set is ~100K snp entries, and my regions table has ~200K entries. I have ~1500 files to go through.
>>
>> My current solution is:
>>
>> library(GenomicRanges)
>> snplist<-with(snps, GRanges(CHR, IRanges(POS, POS)))
>> locations<-with(region, GRanges(CHR, IRanges(START, STOP)))
>> olaps<-findOverlaps(snplist, locations)
>>
>> Then I can easily use olaps to subset as needed. I am just trying to see if there are other functions / ways to go about solving this, in an effort to learn.
>>
>> Thanks,
>>
>> Brian Davis
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319