[BioC] Complementing alleles in a VCF

Martin Morgan mtmorgan at fhcrc.org
Thu Apr 19 15:34:16 CEST 2012


Hi Alex --

On 04/19/2012 02:39 AM, Alex Gutteridge wrote:
> Hi All,
>
> I'm having some difficulty with VCF handling and feeding 1000 genome
> VCFs into predictCoding from the VariantAnnotation package. Can anyone
> help?
>
> When I read 1000 genomes VCFs the SNPs come through with no strand
> specified e.g (where vcf.file and param are a 1000 genomes VCF file and
> a GRanges respectively):
>
>> vcf = readVcf(vcf.file, "hg19", param)
>> fixed(vcf)
> GRanges with 2610 ranges and 5 elementMetadata cols:
> seqnames ranges strand | paramRangeID
> <Rle> <IRanges> <Rle> | <factor>
> rs186838828 chr2 [167051756, 167051756] * | SCN9A
> rs191667986 chr2 [167051865, 167051865] * | SCN9A
> rs139483482 chr2 [167051900, 167051900] * | SCN9A
> rs182687583 chr2 [167052080, 167052080] * | SCN9A
> rs115766730 chr2 [167052144, 167052144] * | SCN9A
> rs186613025 chr2 [167052168, 167052168] * | SCN9A
> rs73017538 chr2 [167052328, 167052328] * | SCN9A
> rs114327563 chr2 [167052375, 167052375] * | SCN9A
> rs191401619 chr2 [167052418, 167052418] * | SCN9A
> ... ... ... ... ... ...
> rs73025590 chr2 [167231750, 167231750] * | SCN9A
> rs141453198 chr2 [167231812, 167231812] * | SCN9A
> rs16852069 chr2 [167231890, 167231890] * | SCN9A
> rs181276399 chr2 [167231932, 167231932] * | SCN9A
> rs185839773 chr2 [167232251, 167232251] * | SCN9A
> rs191091185 chr2 [167232439, 167232439] * | SCN9A
> rs148362057 chr2 [167232446, 167232446] * | SCN9A
> rs141521157 chr2 [167232450, 167232450] * | SCN9A
> rs1881440 chr2 [167232463, 167232463] * | SCN9A
> REF ALT QUAL FILTER
> <DNAStringSet> <DNAStringSetList> <numeric> <character>
> rs186838828 T ######## 100 PASS
> rs191667986 T ######## 100 PASS
> rs139483482 T ######## 100 PASS
> rs182687583 G ######## 100 PASS
> rs115766730 G ######## 100 PASS
> rs186613025 A ######## 100 PASS
> rs73017538 A ######## 100 PASS
> rs114327563 A ######## 100 PASS
> rs191401619 C ######## 100 PASS
> ... ... ... ... ...
> rs73025590 A ######## 100 PASS
> rs141453198 C ######## 100 PASS
> rs16852069 A ######## 100 PASS
> rs181276399 G ######## 100 PASS
> rs185839773 A ######## 100 PASS
> rs191091185 C ######## 100 PASS
> rs148362057 A ######## 100 PASS
> rs141521157 A ######## 100 PASS
> rs1881440 C ######## 100 PASS
> ---
> seqlengths:
> chr2
> NA
>
> When I feed these to predictCoding(), genes on the complement strand are
> not dealt with correctly - the ALT alleles given in the VCF are not
> complemented first, so the variant codons are not translated right.

A definitive answer won't be available until next week. Probably a more 
straight-forward way, but...

>
>> txdb = TxDb.Hsapiens.UCSC.hg19.knownGene
>> aa = predictCoding(vcf.filt, txdb, Hsapiens)
>
> It seems like predictCoding should be able to deal with genes on the
> complement strand, but maybe it requires the strand to be set correctly
> in the VCF? I've tried and failed to find a way of editing the VCF - if
> nothing else to just complement the ALT alleles before running
> predictcoding, but it doesn't seem to work. Though the manual seems to
> suggest it is possible. I didn't find a way to actually set the strand
> directly either:
>
> ‘alt(x)’, ‘alt(x) <- value’: Returns or sets the alternate allele data
> from the ALT column of the VCF file. ‘value’ can be a
> ‘DNAStringSet’ or a ‘CharacterList’ (for a structural VCF
> file).
>
>> tmp.alt = complement(unlist(values(alt(vcf))[["ALT"]]))
>> tmp.alt
> A DNAStringSet instance of length 2610
> width seq
> [1] 1 T
> [2] 1 G
> [3] 1 G
> [4] 1 T
> [5] 1 T
> [6] 1 C
> [7] 1 C
> [8] 1 C
> [9] 1 C
> ... ... ...
> [2602] 1 A
> [2603] 1 T
> [2604] 1 C
> [2605] 1 T
> [2606] 1 C
> [2607] 1 C
> [2608] 1 C
> [2609] 1 C
> [2610] 1 T
>> alt(vcf) = tmp.alt
> Error in function (classes, fdef, mtable) :
> unable to find an inherited method for function "alt<-", for signature
> "VCF", "DNAStringSet"

as an example

   fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
   vcf <- readVcf(fl, "hg19")

you can set the strand on vcf with

 > strand(rowData(vcf))
'factor' Rle of length 5 with 1 run
   Lengths: 5
   Values : *
Levels(3): + - *
 > strand(rowData(vcf)) <- "+"
 > strand(rowData(vcf))
'factor' Rle of length 5 with 1 run
   Lengths: 5
   Values : +
Levels(3): + - *

(probably there should be strand,VCF, and strand<-,VCF,ANY methods).

For alt, it seems like DNAStringSetList needs a constructor

DNAStringSetList <-
     function(...)
{
     listData <- list(...)
     if (length(listData) == 1 && is.list(listData[[1L]]))
         listData <- listData[[1L]]
     ok <- sapply(listData, is, "DNAStringSet")
     if (!all(ok))
         listData <- lapply(listData, as, "DNAStringSet")
     IRanges:::newCompressedList("DNAStringSetList", listData)
}

and then

 > orig <- values(alt(vcf))[["ALT"]]
 > dna <- complement(unlist(orig))
 > alt(vcf) <- relist(dna, orig)

(It's confusing, though convenient, how alt() returns a GRanges, but 
alt<- expects 'value' to be a DNAStringSetList; also as you note the 
docs say there should be a VCF,DNAStringSet method for alt<-, but there 
doesn't seem to be one (it wouldn't be appropriate in your case, because 
ALT is a DNAStringSetList).

Martin

>
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] RColorBrewer_1.0-5
> [2] caTools_1.12
> [3] bitops_1.0-4.1
> [4] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1
> [5] GenomicFeatures_1.8.1
> [6] org.Hs.eg.db_2.7.1
> [7] RSQLite_0.11.1
> [8] DBI_0.2-5
> [9] AnnotationDbi_1.18.0
> [10] Biobase_2.16.0
> [11] VariantAnnotation_1.2.5
> [12] Rsamtools_1.8.3
> [13] BSgenome.Hsapiens.UCSC.hg19_1.3.17
> [14] BSgenome_1.24.0
> [15] Biostrings_2.24.1
> [16] GenomicRanges_1.8.3
> [17] IRanges_1.14.2
> [18] BiocGenerics_0.2.0
>
> loaded via a namespace (and not attached):
> [1] biomaRt_2.12.0 grid_2.15.0 lattice_0.20-6 Matrix_1.0-6
> [5] RCurl_1.91-1 rtracklayer_1.16.1 snpStats_1.6.0 splines_2.15.0
> [9] stats4_2.15.0 survival_2.36-12 tools_2.15.0 XML_3.9-4
> [13] zlibbioc_1.2.0
>


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793



More information about the Bioconductor mailing list