[BioC] Finding coding SNPs with predictCoding

Sun Mar 4 05:23:35 CET 2012

Hi Thomas,

Thanks very much for the ideas and specific examples. These are very 
helpful.

Take care,
Valerie

On 03/02/12 16:58, Thomas Girke wrote:
> Hi Valerie,
>
> Based on my experience the position in the complete protein (rather than
> CDS) sequence would be the most important piece of information to record
> here. For instance, if a SNP changes the 88th triplet CAT (His) in the
> ORF of a transcript to CAA (Gln) then you want to record it like this:
> His88 (CAT) ->  Gln88 (CAA). This way the user can map this change to a
> protein structure or inject it into the corresponding protein sequence
> without any additional remapping or coding.
>
> Another feature to consider are SNPs affecting splice sites (commonly
> first and last two nucleotides of an intron).
>
> If possible it would be also very useful to support users who want to
> work with their custom genomes and annotations files provided as FASTA
> and GFF/GTF files, respectively.
>
> Best,
>
> Thomas
>
>
> On Fri, Mar 02, 2012 at 11:31:57PM +0000, Valerie Obenchain wrote:
>> Slight change to this -
>>
>> I'm now returning the following new columns,
>>
>>     \item{\code{seqTxLoc}}{
>>         Location in transcript-based coordinates of the first nucleotide in
>>         the codon sequence to be translated. This position corresponds to the
>>         first nucleotide in both the \code{refSeq} and \code{varSeq} columns.
>>       }
>>       \item{\code{varTxLoc}}{
>>         Location in transcript-based coordinates of the first nucleotide in
>>         the variant. This value will be the same as \code{seqTxLoc} when the
>>         variant starts exactly at the beginning of a codon.
>>       }
>>       \item{\code{varCdsLoc}}{
>>         Location in cds-based coordinates of the first nucleotide in
>>         the variant. This position is relative to the start of the cds region
>>         defined in the \code{subject} annotation.
>>       }
>>       \item{\code{subjStrand}}{
>>         The strand of the \code{subject} the variant matched.
>> \code{predictCoding}
>>         determines which variants fall in a coding region by finding the
>> overlaps
>>         between the \code{query} and \code{subject}. The \code{query} may be
>>         un-stranded \sQuote{*} but the \code{subject} annotation will
>> have a strand.
>>       }
>>
>>
>> You are interested in 'protein coordinates'. Does 'varCdsLoc' described
>> above meet the need or are you looking for the actual codon number in
>> the coding sequence? I am interested in hearing more about what you are
>> doing with the protein coordinates, how you are using them. It would
>> help us better design future functions.
>>
>> Thanks,
>> Valerie
>>
>> On 03/02/2012 01:11 AM, Alex Gutteridge wrote:
>>> Thanks Valerie - much appreciated!
>>>
>>> On 01.03.2012 21:30, Valerie Obenchain wrote:
>>>> A 'txLoc' column has been added to the output of predictCoding.
>>>> Available in devel version 1.1.57.
>>>>
>>>> Valerie
>>>>
>>>>
>>>> On 02/28/2012 08:20 AM, Valerie Obenchain wrote:
>>>>> Good suggestion. Yes, predictCoding is does this internally. I'll
>>>>> post back here when this has been added.
>>>>>
>>>>> Valerie
>>>>>
>>>>>
>>>>>
>>>>> On 02/28/2012 01:49 AM, Alex Gutteridge wrote:
>>>>>> Hi Valerie,
>>>>>>
>>>>>> Thanks everything works great now. One small feature request -
>>>>>> would it be hard to output the protein sequence position of the
>>>>>> coding SNPs? At the moment once I've run predictCoding I'm
>>>>>> re-extracting the cds and working out the position of each coding
>>>>>> SNP so I can see where in the protein sequence it is, but it seems
>>>>>> like this is probably just replicating what predictCoding must be
>>>>>> doing internally anyway?
>>>>>>
>>>>>> Alex Gutteridge
>>>>>
>>>>> On 02/24/2012 10:39 AM, Valerie Obenchain wrote:
>>>>>> Hi Alex,
>>>>>>
>>>>>> Thanks for the bug report. The cdsID was taken from an overlap
>>>>>> between the query and GRangesList of cds by transcripts. This gave
>>>>>> the correct transcript number but (incorrectly) took the first cds
>>>>>> number in the list by default. Now fixed in devel 1.1.55.
>>>>>>
>>>>>> I've also updated the man page.
>>>>>>
>>>>>> Valerie
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 02/24/2012 02:08 AM, Alex Gutteridge wrote:
>>>>>>> On 22.02.2012 18:58, Herv? Pag?s wrote:
>>>>>>>> Hi Alex,
>>>>>>>>
>>>>>>>> On 02/22/2012 03:56 AM, Alex Gutteridge wrote:
>>>>>>> [...]
>>>>>>>
>>>>>>>>> But the predictCoding call gives this error:
>>>>>>>>>
>>>>>>>>> Error in .setSeqNames(x, value) :
>>>>>>>>> The replacement value for isActiveSeq must be a logical vector,
>>>>>>>>> with
>>>>>>>>> names that match the seqlevels of the object
>>>>>>>> The error message doesn't help much but I think the pb is that you
>>>>>>>> didn't rename chMT properly. Try to do this:
>>>>>>>>
>>>>>>>>    seqlevels(snps)<- gsub("chrMT", "chrM", seqlevels(snps))
>>>>>>>>
>>>>>>>> before you start the for(eg in entrez.ids){..} loop again.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> H.
>>>>>>> Thanks Herv? that nailed it. I'm having some difficulty joining up
>>>>>>> the output of predictCoding() with the query SNPs though. If
>>>>>>> someone could point out where the disconnect in my thinking is I
>>>>>>> would appreciate it!
>>>>>>>
>>>>>>> Here's my (now edited down) script:
>>>>>>>
>>>>>>> library(BSgenome.Hsapiens.UCSC.hg19)
>>>>>>> library(VariantAnnotation)
>>>>>>> library(SNPlocs.Hsapiens.dbSNP.20110815)
>>>>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>>>>>
>>>>>>> entrez.ids = c('6335')
>>>>>>> txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene
>>>>>>>
>>>>>>> snps   = getSNPlocs(c("ch1","ch2"),as.GRanges=T)
>>>>>>> seqlevels(snps)<- gsub("ch", "chr", seqlevels(snps))
>>>>>>> seqlevels(snps)<- gsub("chrMT", "chrM", seqlevels(snps))
>>>>>>>
>>>>>>> gene.list = cdsBy(txdb19, by="gene")
>>>>>>> vsd.list = gene.list[entrez.ids]
>>>>>>> cds.list = cdsBy(txdb19,by="tx")
>>>>>>>
>>>>>>> eg = entrez.ids[1]
>>>>>>>
>>>>>>> snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]])))
>>>>>>> eg.snps = snps[snp.idx]
>>>>>>> iupac   = values(eg.snps)[,"alleles_as_ambig"]
>>>>>>> eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac]))
>>>>>>> variant.alleles =
>>>>>>> DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"")[[1]])
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> aa =
>>>>>>> predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=variant.alleles)
>>>>>>>
>>>>>>> #####
>>>>>>>
>>>>>>> Then if I query the predictCoding results in aa in an interactive
>>>>>>> session I get the following (see inline comments for what I think
>>>>>>> should be happening, but I must be misinterpreting what queryID
>>>>>>> means)
>>>>>>>
>>>>>>> The docs for predictCoding() contain a small typo
>>>>>>> (s/queryHits/queryID), but otherwise seem clear?
>>>>>>>
>>>>>>> Columns include ?queryID?, ?consequence?, ?refSeq?, ?varSeq?,
>>>>>>>       ?refAA?, ?varAA?, ?txID?, ?geneID?, and ?cdsID?.
>>>>>>>
>>>>>>>       ?queryHits? The ?queryHits? column provides a map back to the
>>>>>>>            variants in the original ?query?. If the ?query? was a
>>>>>>> ?VCF?
>>>>>>>            object this index corresponds to the row in the
>>>>>>> ?GRanges? in
>>>>>>>            the ?rowData? slot. If ?query? was an expanded ?GRanges?,
>>>>>>>            ?RangedData? or ?RangesList? the index corresponds to
>>>>>>> the row
>>>>>>>            in the expanded object.
>>>>>>>
>>>>>>> #####
>>>>>>>
>>>>>>>> aa[1,]
>>>>>>> DataFrame with 1 row and 9 columns
>>>>>>>      queryID   consequence         refSeq         varSeq         refAA
>>>>>>> <integer>  <factor>  <DNAStringSet>  <DNAStringSet>  <AAStringSet>
>>>>>>> 1         1 nonsynonymous            CTC            ATC            L
>>>>>>>            varAA        txID   geneID     cdsID
>>>>>>> <AAStringSet>  <character>  <factor>  <integer>
>>>>>>> 1             I       10921     6335     33668
>>>>>>>> #So the first SNP (queryID: 1) is nonsynonymous and maps to tx
>>>>>>>> '10921' and cds '33668'.
>>>>>>>> #If I look at the first query SNP I get this:
>>>>>>>> eg.snps.exp[aa[1,'queryID'],]
>>>>>>> GRanges with 1 range and 2 elementMetadata values:
>>>>>>>        seqnames                 ranges strand |   RefSNP_id
>>>>>>> alleles_as_ambig
>>>>>>> <Rle>  <IRanges>  <Rle>  |<character>  <character>
>>>>>>>    [1]     chr2 [167055370, 167055370]      * |   111558968
>>>>>>>       R
>>>>>>>    ---
>>>>>>>    seqlengths:
>>>>>>>      chr1  chr2  chr3  chr4  chr5  chr6 ... chr20 chr21 chr22  chrX
>>>>>>> chrY  chrM
>>>>>>>        NA    NA    NA    NA    NA    NA ...    NA    NA    NA    NA
>>>>>>>    NA    NA
>>>>>>>> #So SNP 1 is at 167055370 on chr2
>>>>>>>> #But if I check tx '10921' I see that the cds overlapping
>>>>>>>> 167055370 is actually '33651'
>>>>>>>> #And cds '33668' is at the other end of the tx:
>>>>>>>> cds.list[[aa[1,'txID']]]
>>>>>>> GRanges with 26 ranges and 3 elementMetadata values:
>>>>>>>         seqnames                 ranges strand   |    cds_id
>>>>>>> cds_name
>>>>>>> <Rle>  <IRanges>  <Rle>    |<integer>  <character>
>>>>>>>     [1]     chr2 [167168009, 167168266]      -   |     33668<NA>
>>>>>>>     [2]     chr2 [167163466, 167163584]      -   |     33667<NA>
>>>>>>>     [3]     chr2 [167163020, 167163109]      -   |     33666<NA>
>>>>>>>     [4]     chr2 [167162302, 167162430]      -   |     33647<NA>
>>>>>>>     [5]     chr2 [167160748, 167160839]      -   |     33646<NA>
>>>>>>>     [6]     chr2 [167159600, 167159812]      -   |     33645<NA>
>>>>>>>     [7]     chr2 [167151109, 167151172]      -   |     33644<NA>
>>>>>>>     [8]     chr2 [167149741, 167149882]      -   |     33643<NA>
>>>>>>>     [9]     chr2 [167144947, 167145153]      -   |     33642<NA>
>>>>>>>     ...      ...                    ...    ... ...       ...
>>>>>>> ...
>>>>>>>    [18]     chr2 [167099012, 167099166]      -   |     33659<NA>
>>>>>>>    [19]     chr2 [167094604, 167094777]      -   |     33658<NA>
>>>>>>>    [20]     chr2 [167089850, 167089972]      -   |     33657<NA>
>>>>>>>    [21]     chr2 [167085201, 167085482]      -   |     33656<NA>
>>>>>>>    [22]     chr2 [167084180, 167084233]      -   |     33655<NA>
>>>>>>>    [23]     chr2 [167083077, 167083214]      -   |     33654<NA>
>>>>>>>    [24]     chr2 [167060870, 167060974]      -   |     33653<NA>
>>>>>>>    [25]     chr2 [167060465, 167060735]      -   |     33652<NA>
>>>>>>>    [26]     chr2 [167055182, 167056374]      -   |     33651<NA>
>>>>>>>         exon_rank
>>>>>>> <integer>
>>>>>>>     [1]         2
>>>>>>>     [2]         3
>>>>>>>     [3]         4
>>>>>>>     [4]         5
>>>>>>>     [5]         6
>>>>>>>     [6]         7
>>>>>>>     [7]         8
>>>>>>>     [8]         9
>>>>>>>     [9]        10
>>>>>>>     ...       ...
>>>>>>>    [18]        19
>>>>>>>    [19]        20
>>>>>>>    [20]        21
>>>>>>>    [21]        22
>>>>>>>    [22]        23
>>>>>>>    [23]        24
>>>>>>>    [24]        25
>>>>>>>    [25]        26
>>>>>>>    [26]        27
>>>>>>>    ---
>>>>>>>    seqlengths:
>>>>>>>                      chr1                  chr2 ...
>>>>>>> chr18_gl000207_random
>>>>>>>                 249250621             243199373 ...
>>>>>>> 4262
>>>>>>>
>>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor