[BioC] Finding coding SNPs with predictCoding
Alex Gutteridge
alexg at ruggedtextile.com
Thu Apr 12 10:28:25 CEST 2012
Yup, same from me.
On 06.04.2012 00:25, Thomas Girke wrote:
> Hi Valerie,
>
> Thanks for the update and for implementing these functionalities.
> From my perspective, reporting in 1-letter amino acid code is totally
> sufficient.
>
> Thanks,
>
> Thomas
>
>
>
> On Thu, Apr 05, 2012 at 08:47:37PM +0000, Valerie Obenchain wrote:
>> Hi Thomas, Alex,
>>
>> Changes below have been made in release 1.2.4 and devel 1.3.4.
>>
>> predictCoding :
>>
>> - Now returns a proteinID that is the triplet number in the protein
>>
>> - The refAA and varAA are the 1-letter amino acid code but not the
>> 3-letter code (i.e., 'G' but not 'Gly'). We don't currently have a
>> function that returns the 3-letter code. Is this of interest/use?
>>
>> - Over the next dev cycle I will implement methods for annotations
>> (subject argument) to be gff/gtf. Currently genomes (seqSource
>> argument)
>> can be a BSgenome or fasta file.
>>
>>
>> locateVariants:
>>
>> To add flexibility for finding variants in a particular region the
>> function now dispatches on a 'region' argument (e.g.,
>> CodingVariants(),
>> IntronVariants(), etc.). I've included a SpliceSiteVariants() that
>> counts ranges that overlap with any portion of the first 2 or last 2
>> nucleotides of an intron. Called as,
>>
>> locateVariants(query, subject, range=SpliceSiteVariants())
>>
>> Thanks again for the suggestions.
>>
>> Valerie
>>
>>
>> On 03/05/2012 01:25 AM, Alex Gutteridge wrote:
>> > Hi Valerie,
>> >
>> > I'd exactly echo what Thomas wrote (thanks Thomas!). My specific
>> use
>> > case is mapping coding SNPs to PDB protein structures, so
>> something
>> > that encodes 'His88 (CAT) -> Gln88 (CAA)' is all I need.
>> >
>> > Alex Gutteridge.
>> >
>> > On 03.03.2012 00:58, Thomas Girke wrote:
>> >> Hi Valerie,
>> >>
>> >> Based on my experience the position in the complete protein
>> (rather than
>> >> CDS) sequence would be the most important piece of information to
>> record
>> >> here. For instance, if a SNP changes the 88th triplet CAT (His)
>> in the
>> >> ORF of a transcript to CAA (Gln) then you want to record it like
>> this:
>> >> His88 (CAT) -> Gln88 (CAA). This way the user can map this change
>> to a
>> >> protein structure or inject it into the corresponding protein
>> sequence
>> >> without any additional remapping or coding.
>> >>
>> >> Another feature to consider are SNPs affecting splice sites
>> (commonly
>> >> first and last two nucleotides of an intron).
>> >>
>> >> If possible it would be also very useful to support users who
>> want to
>> >> work with their custom genomes and annotations files provided as
>> FASTA
>> >> and GFF/GTF files, respectively.
>> >>
>> >> Best,
>> >>
>> >> Thomas
>> >>
>> >>
>> >> On Fri, Mar 02, 2012 at 11:31:57PM +0000, Valerie Obenchain
>> wrote:
>> >>> Slight change to this -
>> >>>
>> >>> I'm now returning the following new columns,
>> >>>
>> >>> \item{\code{seqTxLoc}}{
>> >>> Location in transcript-based coordinates of the first
>> >>> nucleotide in
>> >>> the codon sequence to be translated. This position
>> >>> corresponds to the
>> >>> first nucleotide in both the \code{refSeq} and
>> \code{varSeq}
>> >>> columns.
>> >>> }
>> >>> \item{\code{varTxLoc}}{
>> >>> Location in transcript-based coordinates of the first
>> >>> nucleotide in
>> >>> the variant. This value will be the same as
>> \code{seqTxLoc}
>> >>> when the
>> >>> variant starts exactly at the beginning of a codon.
>> >>> }
>> >>> \item{\code{varCdsLoc}}{
>> >>> Location in cds-based coordinates of the first nucleotide
>> in
>> >>> the variant. This position is relative to the start of
>> the
>> >>> cds region
>> >>> defined in the \code{subject} annotation.
>> >>> }
>> >>> \item{\code{subjStrand}}{
>> >>> The strand of the \code{subject} the variant matched.
>> >>> \code{predictCoding}
>> >>> determines which variants fall in a coding region by
>> finding the
>> >>> overlaps
>> >>> between the \code{query} and \code{subject}. The
>> \code{query}
>> >>> may be
>> >>> un-stranded \sQuote{*} but the \code{subject} annotation
>> will
>> >>> have a strand.
>> >>> }
>> >>>
>> >>>
>> >>> You are interested in 'protein coordinates'. Does 'varCdsLoc'
>> described
>> >>> above meet the need or are you looking for the actual codon
>> number in
>> >>> the coding sequence? I am interested in hearing more about what
>> you are
>> >>> doing with the protein coordinates, how you are using them. It
>> would
>> >>> help us better design future functions.
>> >>>
>> >>> Thanks,
>> >>> Valerie
>> >>>
>> >>> On 03/02/2012 01:11 AM, Alex Gutteridge wrote:
>> >>> > Thanks Valerie - much appreciated!
>> >>> >
>> >>> > On 01.03.2012 21:30, Valerie Obenchain wrote:
>> >>> >> A 'txLoc' column has been added to the output of
>> predictCoding.
>> >>> >> Available in devel version 1.1.57.
>> >>> >>
>> >>> >> Valerie
>> >>> >>
>> >>> >>
>> >>> >> On 02/28/2012 08:20 AM, Valerie Obenchain wrote:
>> >>> >>> Good suggestion. Yes, predictCoding is does this internally.
>> I'll
>> >>> >>> post back here when this has been added.
>> >>> >>>
>> >>> >>> Valerie
>> >>> >>>
>> >>> >>>
>> >>> >>>
>> >>> >>> On 02/28/2012 01:49 AM, Alex Gutteridge wrote:
>> >>> >>>> Hi Valerie,
>> >>> >>>>
>> >>> >>>> Thanks everything works great now. One small feature
>> request -
>> >>> >>>> would it be hard to output the protein sequence position of
>> the
>> >>> >>>> coding SNPs? At the moment once I've run predictCoding I'm
>> >>> >>>> re-extracting the cds and working out the position of each
>> coding
>> >>> >>>> SNP so I can see where in the protein sequence it is, but
>> it seems
>> >>> >>>> like this is probably just replicating what predictCoding
>> must be
>> >>> >>>> doing internally anyway?
>> >>> >>>>
>> >>> >>>> Alex Gutteridge
>> >>> >>>
>> >>> >>>
>> >>> >>> On 02/24/2012 10:39 AM, Valerie Obenchain wrote:
>> >>> >>>> Hi Alex,
>> >>> >>>>
>> >>> >>>> Thanks for the bug report. The cdsID was taken from an
>> overlap
>> >>> >>>> between the query and GRangesList of cds by transcripts.
>> This gave
>> >>> >>>> the correct transcript number but (incorrectly) took the
>> first cds
>> >>> >>>> number in the list by default. Now fixed in devel 1.1.55.
>> >>> >>>>
>> >>> >>>> I've also updated the man page.
>> >>> >>>>
>> >>> >>>> Valerie
>> >>> >>>>
>> >>> >>>>
>> >>> >>>>
>> >>> >>>> On 02/24/2012 02:08 AM, Alex Gutteridge wrote:
>> >>> >>>>> On 22.02.2012 18:58, Herv? Pag?s wrote:
>> >>> >>>>>> Hi Alex,
>> >>> >>>>>>
>> >>> >>>>>> On 02/22/2012 03:56 AM, Alex Gutteridge wrote:
>> >>> >>>>>
>> >>> >>>>> [...]
>> >>> >>>>>
>> >>> >>>>>>> But the predictCoding call gives this error:
>> >>> >>>>>>>
>> >>> >>>>>>> Error in .setSeqNames(x, value) :
>> >>> >>>>>>> The replacement value for isActiveSeq must be a logical
>> vector,
>> >>> >>>>>>> with
>> >>> >>>>>>> names that match the seqlevels of the object
>> >>> >>>>>>
>> >>> >>>>>> The error message doesn't help much but I think the pb is
>> >>> that you
>> >>> >>>>>> didn't rename chMT properly. Try to do this:
>> >>> >>>>>>
>> >>> >>>>>> seqlevels(snps) <- gsub("chrMT", "chrM",
>> seqlevels(snps))
>> >>> >>>>>>
>> >>> >>>>>> before you start the for(eg in entrez.ids){..} loop
>> again.
>> >>> >>>>>>
>> >>> >>>>>> Cheers,
>> >>> >>>>>> H.
>> >>> >>>>>
>> >>> >>>>> Thanks Herv? that nailed it. I'm having some difficulty
>> >>> joining up
>> >>> >>>>> the output of predictCoding() with the query SNPs though.
>> If
>> >>> >>>>> someone could point out where the disconnect in my
>> thinking is I
>> >>> >>>>> would appreciate it!
>> >>> >>>>>
>> >>> >>>>> Here's my (now edited down) script:
>> >>> >>>>>
>> >>> >>>>> library(BSgenome.Hsapiens.UCSC.hg19)
>> >>> >>>>> library(VariantAnnotation)
>> >>> >>>>> library(SNPlocs.Hsapiens.dbSNP.20110815)
>> >>> >>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>> >>> >>>>>
>> >>> >>>>> entrez.ids = c('6335')
>> >>> >>>>> txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene
>> >>> >>>>>
>> >>> >>>>> snps = getSNPlocs(c("ch1","ch2"),as.GRanges=T)
>> >>> >>>>> seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps))
>> >>> >>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
>> >>> >>>>>
>> >>> >>>>> gene.list = cdsBy(txdb19, by="gene")
>> >>> >>>>> vsd.list = gene.list[entrez.ids]
>> >>> >>>>> cds.list = cdsBy(txdb19,by="tx")
>> >>> >>>>>
>> >>> >>>>> eg = entrez.ids[1]
>> >>> >>>>>
>> >>> >>>>> snp.idx = unique(queryHits(findOverlaps(snps,
>> vsd.list[[eg]])))
>> >>> >>>>> eg.snps = snps[snp.idx]
>> >>> >>>>> iupac = values(eg.snps)[,"alleles_as_ambig"]
>> >>> >>>>> eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac]))
>> >>> >>>>> variant.alleles =
>> >>> >>>>>
>> >>>
>> DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"")[[1]])
>> >>>
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>> aa =
>> >>> >>>>>
>> >>>
>> predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=variant.alleles)
>> >>> >>>>>
>> >>> >>>>> #####
>> >>> >>>>>
>> >>> >>>>> Then if I query the predictCoding results in aa in an
>> interactive
>> >>> >>>>> session I get the following (see inline comments for what
>> I think
>> >>> >>>>> should be happening, but I must be misinterpreting what
>> queryID
>> >>> >>>>> means)
>> >>> >>>>>
>> >>> >>>>> The docs for predictCoding() contain a small typo
>> >>> >>>>> (s/queryHits/queryID), but otherwise seem clear?
>> >>> >>>>>
>> >>> >>>>> Columns include ?queryID?, ?consequence?, ?refSeq?,
>> ?varSeq?,
>> >>> >>>>> ?refAA?, ?varAA?, ?txID?, ?geneID?, and ?cdsID?.
>> >>> >>>>>
>> >>> >>>>> ?queryHits? The ?queryHits? column provides a map
>> back to
>> >>> the
>> >>> >>>>> variants in the original ?query?. If the ?query?
>> was a
>> >>> >>>>> ?VCF?
>> >>> >>>>> object this index corresponds to the row in the
>> >>> >>>>> ?GRanges? in
>> >>> >>>>> the ?rowData? slot. If ?query? was an expanded
>> >>> ?GRanges?,
>> >>> >>>>> ?RangedData? or ?RangesList? the index
>> corresponds to
>> >>> >>>>> the row
>> >>> >>>>> in the expanded object.
>> >>> >>>>>
>> >>> >>>>> #####
>> >>> >>>>>
>> >>> >>>>>> aa[1,]
>> >>> >>>>> DataFrame with 1 row and 9 columns
>> >>> >>>>> queryID consequence refSeq varSeq
>> >>> refAA
>> >>> >>>>> <integer> <factor> <DNAStringSet> <DNAStringSet>
>> <AAStringSet>
>> >>> >>>>> 1 1 nonsynonymous CTC ATC
>> >>> L
>> >>> >>>>> varAA txID geneID cdsID
>> >>> >>>>> <AAStringSet> <character> <factor> <integer>
>> >>> >>>>> 1 I 10921 6335 33668
>> >>> >>>>>> #So the first SNP (queryID: 1) is nonsynonymous and maps
>> to tx
>> >>> >>>>>> '10921' and cds '33668'.
>> >>> >>>>>> #If I look at the first query SNP I get this:
>> >>> >>>>>> eg.snps.exp[aa[1,'queryID'],]
>> >>> >>>>> GRanges with 1 range and 2 elementMetadata values:
>> >>> >>>>> seqnames ranges strand | RefSNP_id
>> >>> >>>>> alleles_as_ambig
>> >>> >>>>> <Rle> <IRanges> <Rle> | <character> <character>
>> >>> >>>>> [1] chr2 [167055370, 167055370] * | 111558968
>> >>> >>>>> R
>> >>> >>>>> ---
>> >>> >>>>> seqlengths:
>> >>> >>>>> chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21
>> chr22
>> >>> chrX
>> >>> >>>>> chrY chrM
>> >>> >>>>> NA NA NA NA NA NA ... NA NA
>> NA NA
>> >>> >>>>> NA NA
>> >>> >>>>>> #So SNP 1 is at 167055370 on chr2
>> >>> >>>>>> #But if I check tx '10921' I see that the cds overlapping
>> >>> >>>>>> 167055370 is actually '33651'
>> >>> >>>>>> #And cds '33668' is at the other end of the tx:
>> >>> >>>>>> cds.list[[aa[1,'txID']]]
>> >>> >>>>> GRanges with 26 ranges and 3 elementMetadata values:
>> >>> >>>>> seqnames ranges strand |
>> cds_id
>> >>> >>>>> cds_name
>> >>> >>>>> <Rle> <IRanges> <Rle> | <integer> <character>
>> >>> >>>>> [1] chr2 [167168009, 167168266] - |
>> 33668 <NA>
>> >>> >>>>> [2] chr2 [167163466, 167163584] - |
>> 33667 <NA>
>> >>> >>>>> [3] chr2 [167163020, 167163109] - |
>> 33666 <NA>
>> >>> >>>>> [4] chr2 [167162302, 167162430] - |
>> 33647 <NA>
>> >>> >>>>> [5] chr2 [167160748, 167160839] - |
>> 33646 <NA>
>> >>> >>>>> [6] chr2 [167159600, 167159812] - |
>> 33645 <NA>
>> >>> >>>>> [7] chr2 [167151109, 167151172] - |
>> 33644 <NA>
>> >>> >>>>> [8] chr2 [167149741, 167149882] - |
>> 33643 <NA>
>> >>> >>>>> [9] chr2 [167144947, 167145153] - |
>> 33642 <NA>
>> >>> >>>>> ... ... ... ... ...
>> ...
>> >>> >>>>> ...
>> >>> >>>>> [18] chr2 [167099012, 167099166] - |
>> 33659 <NA>
>> >>> >>>>> [19] chr2 [167094604, 167094777] - |
>> 33658 <NA>
>> >>> >>>>> [20] chr2 [167089850, 167089972] - |
>> 33657 <NA>
>> >>> >>>>> [21] chr2 [167085201, 167085482] - |
>> 33656 <NA>
>> >>> >>>>> [22] chr2 [167084180, 167084233] - |
>> 33655 <NA>
>> >>> >>>>> [23] chr2 [167083077, 167083214] - |
>> 33654 <NA>
>> >>> >>>>> [24] chr2 [167060870, 167060974] - |
>> 33653 <NA>
>> >>> >>>>> [25] chr2 [167060465, 167060735] - |
>> 33652 <NA>
>> >>> >>>>> [26] chr2 [167055182, 167056374] - |
>> 33651 <NA>
>> >>> >>>>> exon_rank
>> >>> >>>>> <integer>
>> >>> >>>>> [1] 2
>> >>> >>>>> [2] 3
>> >>> >>>>> [3] 4
>> >>> >>>>> [4] 5
>> >>> >>>>> [5] 6
>> >>> >>>>> [6] 7
>> >>> >>>>> [7] 8
>> >>> >>>>> [8] 9
>> >>> >>>>> [9] 10
>> >>> >>>>> ... ...
>> >>> >>>>> [18] 19
>> >>> >>>>> [19] 20
>> >>> >>>>> [20] 21
>> >>> >>>>> [21] 22
>> >>> >>>>> [22] 23
>> >>> >>>>> [23] 24
>> >>> >>>>> [24] 25
>> >>> >>>>> [25] 26
>> >>> >>>>> [26] 27
>> >>> >>>>> ---
>> >>> >>>>> seqlengths:
>> >>> >>>>> chr1 chr2 ...
>> >>> >>>>> chr18_gl000207_random
>> >>> >>>>> 249250621 243199373 ...
>> >>> >>>>> 4262
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>
>> >>> >>>> _______________________________________________
>> >>> >>>> Bioconductor mailing list
>> >>> >>>> Bioconductor at r-project.org
>> >>> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >>> >>>> Search the archives:
>> >>> >>>>
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >>> >>>
>> >>> >>> _______________________________________________
>> >>> >>> Bioconductor mailing list
>> >>> >>> Bioconductor at r-project.org
>> >>> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >>> >>> Search the archives:
>> >>> >>>
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >>> >
>> >>>
>> >>> _______________________________________________
>> >>> Bioconductor mailing list
>> >>> Bioconductor at r-project.org
>> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >>> Search the archives:
>> >>>
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >
>>
--
Alex Gutteridge
More information about the Bioconductor
mailing list