[BioC] Getting the start and end positions of a list of genes
Cook, Malcolm
MEC at stowers.org
Mon Jun 18 18:07:54 CEST 2012
Hi,
I'll get you a step further:
On 6/17/12 5:57 PM, "Vincent Carey" <stvjc at channing.harvard.edu> wrote:
>good spec, but i can't get through the whole thing just now. this could
>get you started
>
>source("http://bioconductor.org/biocLite.R")
> biocLite("TxDb.Athaliana.BioMart.plantsmart12")
>library(TxDb.Athaliana.BioMart.plantsmart12)
>txdb = TxDb.Athaliana.BioMart.plantsmart12
>tr = transcriptsBy(txdb, by="gene")
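# assuming that for each gene's coordinates you want the extreme start
# and end of its (potentially multiple) transcripts:
# (reduce(tr) would only merge overlapping transcripts; range() spans all of them)
gene.gr <- range(tr)  # a GRangesList: one spanning range per gene (per chromosome and strand)
gene.df <- as.data.frame(gene.gr)  # one row per range, labelled with its gene identifier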
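To make the "extreme starts and ends" idea concrete without installing anything, here is a minimal base-R sketch on a toy table. The coordinates are copied from the transcript listing quoted further down; the data frame `tx` and its column names are my own stand-ins for illustration, not something the TxDb package returns.

```r
# Toy transcript table; values copied from the quoted `tr` output,
# column names are illustrative assumptions.
tx <- data.frame(
  gene   = c("AT1G01010", "AT1G01020", "AT1G01020", "AT1G01030"),
  chrom  = c("1", "1", "1", "1"),
  start  = c(3631L, 5928L, 6790L, 11649L),
  end    = c(5899L, 8737L, 8737L, 13714L),
  strand = c("+", "-", "-", "-"),
  stringsAsFactors = FALSE
)

# One row per gene: smallest transcript start and largest transcript end.
gene.start <- aggregate(start ~ gene + chrom + strand, data = tx, FUN = min)
gene.end   <- aggregate(end   ~ gene + chrom + strand, data = tx, FUN = max)
gene.span  <- merge(gene.start, gene.end, by = c("gene", "chrom", "strand"))
gene.span
```

AT1G01020 has two transcripts in the toy table, so its row is the one where taking the minimum start and maximum end actually matters.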
Now it's a matter of renaming the columns and selecting from the BioMart
data just the rows for your identifiers (checking that they are all
present, and complaining if not).
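A minimal sketch of that last step in base R, under some assumptions: `gene.df` here is a small stand-in table with toy values taken from the quoted output, its column names are guesses at what the coercion yields, and the query list is hypothetical (the last identifier is deliberately absent so the check has something to report). Locus identifiers in the table are upper case while the ones in the original question are mixed case, so the sketch normalises case before matching.

```r
# Stand-in for gene.df; toy values from the quoted output,
# column names are assumptions.
gene.df <- data.frame(
  gene     = c("AT1G01010", "AT1G01020", "AT1G01030"),
  seqnames = c("1", "1", "1"),
  start    = c(3631L, 5928L, 11649L),
  end      = c(5899L, 8737L, 13714L),
  stringsAsFactors = FALSE
)

# Hypothetical query list; the last ID is not in the toy table.
gene.list <- c("At1g01010", "AT1g01030", "At5g35790")

# Normalise case, then look up each query in the table.
idx <- match(toupper(gene.list), gene.df$gene)

# Complain about identifiers that were not found.
missing.ids <- gene.list[is.na(idx)]
if (length(missing.ids) > 0) {
  warning("no coordinates found for: ", paste(missing.ids, collapse = ", "))
}

# Table in the layout asked for in the question; unmatched genes keep NA.
gene.pos <- data.frame(
  gene               = gene.list,
  chromosome         = gene.df$seqnames[idx],
  nuc_sequence_start = gene.df$start[idx],
  nuc_sequence_end   = gene.df$end[idx],
  stringsAsFactors   = FALSE
)
gene.pos
```

Using `match()` keeps the output in the same order as the input list, which `merge()` would not guarantee.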
Cheers,
Malcolm Cook
>
>> tr
>GRangesList of length 33602:
>$AT1G01010
>GRanges with 1 range and 2 elementMetadata cols:
> seqnames ranges strand | tx_id tx_name
> <Rle> <IRanges> <Rle> | <integer> <character>
> [1] 1 [3631, 5899] + | 9694 AT1G01010.1
>
>$AT1G01020
>GRanges with 2 ranges and 2 elementMetadata cols:
> seqnames ranges strand | tx_id tx_name
> [1] 1 [5928, 8737] - | 29355 AT1G01020.1
> [2] 1 [6790, 8737] - | 29354 AT1G01020.2
>
>$AT1G01030
>GRanges with 1 range and 2 elementMetadata cols:
> seqnames ranges strand | tx_id tx_name
> [1] 1 [11649, 13714] - | 26358 AT1G01030.1
>
>...
><33599 more elements>
>---
>seqlengths:
> 3 4 1 5 2 Pt Mt
> NA NA NA NA NA NA NA
>
>you could use an org.At* package a bit more simply, use the CHRLOC and
>CHRLOCEND elements. please look at the metadata page of bioconductor.org
>INSTALL node for your organism. this should be a standard use case or
>faq, perhaps
>
>
>
>On Sun, Jun 17, 2012 at 6:33 PM, Josh [guest] <guest at bioconductor.org> wrote:
>
>>
>> Dear listserv,
>>
>> I am a long-time R user, novice Bioconductor user. I am quickly
>> realizing they are not the same thing. I have a very basic question
>> that I hope you can help me with.
>>
>> I have a list of genes in Arabidopsis thaliana. I want to input this
>> list into R/Bioconductor and output a table listing the start and end
>> positions of each gene.
>>
>> Specific code that will get the job done will be the most helpful for
>> me. Also, please tell me the specific packages and databases and such
>> I must load into memory. I am a total newbie at this.
>>
>> Thanks in advance,
>> -----------------------------------
>> Josh Banta, Ph.D
>> Assistant Professor
>> Department of Biology
>> The University of Texas at Tyler
>> Tyler, TX 75799
>> Tel: (903) 565-5655
>> http://plantevolutionaryecology.org
>>
>> -- output of sessionInfo():
>>
>> > gene.pos <- data.frame(matrix(nrow = 3, ncol = 4))
>> > gene.list <- c("At5g35790", "AT5g60910", "AT1g16560")
>> > gene.pos[,1] <- gene.list
>> > colnames(gene.pos) <- c("gene", "chromosome", "nuc_sequence_start" ,
>> "nuc_sequence_end")
>> >
>> > gene.pos
>> gene chromosome nuc_sequence_start nuc_sequence_end
>> 1 At5g35790 NA NA NA
>> 2 AT5g60910 NA NA NA
>> 3 AT1g16560 NA NA NA
>> >
>> > #now what? How do I fill in the blanks?
>>
>> --
>> Sent via the guest posting facility at bioconductor.org.
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>