[BioC] readVcf - ALT allele as CompressedCharacterList instead of DNAStringSetList

Wed Oct 3 18:54:50 CEST 2012

Hi Francesco,

Yes, the DNAStringSetList is intended to handle multiple alternate 
alleles. ALT should only be made a CompressedList when structural 
variants are present.

Try the following with your file,

debug(VariantAnnotation:::.formatALT)
fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
readVcf(fl, "hg19")
debug: {
     if (is.null(x))
         return(NULL)
     structural <- grep("<", x, fixed = TRUE)
     if (!identical(integer(0), structural))
         seqsplit(x, seq_len(length(x)))
     else .toDNAStringSetList(x)
}

Here 'x' is the alternate allele values,

Browse[2]> x
[1] "A"      "A"      "G,T"    "."      "G,GTCT"

My guess is that if you inspect these you'll see at least one structural 
variant in there.

Valerie

On 10/03/2012 07:23 AM, Lescai, Francesco wrote:
> Hi there,
> I was importing a new VCF file generated with the GATK v2 HaplotypeCaller, to compare several parameters with a previous one on the same samples generated with UnifiedGenotyper.
>
> "thanks" to an error in predictCoding, I discovered the new VCF file formatted the ALT allele field as a CompressedCharacterList instead of a DNAStringSetList.
>
> see here
>
>> head(fixed(haplo),n=2L)
> GRanges with 2 ranges and 5 metadata columns:
>                seqnames           ranges strand | paramRangeID            REF
>                   <Rle>         <IRanges>   <Rle>  |<factor>  <DNAStringSet>
>    rs139570490        1 [865738, 865738]      * |<NA>               A
>      rs4372192        1 [876499, 876499]      * |<NA>               A
>                                      ALT      QUAL      FILTER
>                <CompressedCharacterList>  <numeric>  <character>
>    rs139570490                         G   4280.02        PASS
>      rs4372192                         G  11733.02        PASS
>    ---
>    seqlengths:
>      1 10 11 12 13 14 15 16 17 18 19  2 20 21 22  3  4  5  6  7  8  9  X  Y
>     NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
>
>> head(fixed(uni),n=2L)
> GRanges with 2 ranges and 5 metadata columns:
>              seqnames           ranges strand | paramRangeID            REF
>                 <Rle>         <IRanges>   <Rle>  |<factor>  <DNAStringSet>
>     1:865738        1 [865738, 865738]      * |<NA>               A
>    rs4372192        1 [876499, 876499]      * |<NA>               A
>                             ALT      QUAL      FILTER
>              <DNAStringSetList>  <numeric>  <character>
>     1:865738           ########   5055.97        PASS
>    rs4372192           ########  13526.97        PASS
>    ---
>    seqlengths:
>      1 10 11 12 13 14 15 16 17 18 19  2 20 21 22  3  4  5  6  7  8  9  X  Y
>     NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
>
> The only very important difference I could find between the two files, is that in the one generated with UnifiedGenotyper there is one single allele in the ALT field, while in the new VCF file, there are several variants with multiple alternative alleles
>
> [me at wilder test]$ grep -v "##" union.filtered.vcf | cut -f 1,2,3,4,5 | grep ","
> [me at wilder test]$ grep -v "##" haploC.filtered.both.vcf | cut -f 1,2,3,4,5 | grep ","
> 1 1247578 . T G,TGG
> 1 1920434 rs144487103 TAA TA,TAAA
> 1 2303347 rs60972860 CAA CA,CAAA
> 1 3753032 rs36051675 GTT GTTTTT,GTTTTTTTT
> 1 6291918 rs148639379 TA TAA,T
> 1 7913184 rs34962249 CAAAA CAAA,CAAAAA
> 1 10357206 rs58344165 ATTT ATT,ATTTT,ATTTTT
> 1 14057425 rs35643856 GA GAA,G
> 1 15654737 rs71000392 CT CTT,C
> [...cut...]
>
> but I thought a DNAStringSetList was meant to accommodate precisely the case where you have multiple alternative alleles, wasn't it?
>
> thanks for your help,
> Francesco
>
>
>
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C
>   [3] LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915
>   [5] LC_MONETARY=en_US.iso885915    LC_MESSAGES=en_US.iso885915
>   [7] LC_PAPER=C                     LC_NAME=C
>   [9] LC_ADDRESS=C                   LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] VariantAnnotation_1.3.32 Rsamtools_1.9.31         Biostrings_2.25.12
> [4] GenomicRanges_1.9.66     IRanges_1.15.48          BiocGenerics_0.3.2
>
> loaded via a namespace (and not attached):
>   [1] AnnotationDbi_1.19.46  Biobase_2.17.8         biomaRt_2.13.2
>   [4] bitops_1.0-4.1         BSgenome_1.25.9        colorspace_1.1-1
>   [7] DBI_0.2-5              dichromat_1.2-4        digest_0.5.2
> [10] GenomicFeatures_1.9.44 ggplot2_0.9.2.1        grid_2.15.0
> [13] gtable_0.1.1           labeling_0.1           MASS_7.3-21
> [16] memoise_0.1            munsell_0.4            parallel_2.15.0
> [19] plyr_1.7.1             proto_0.3-9.2          RColorBrewer_1.0-5
> [22] RCurl_1.91-1           reshape2_1.2.1         RSQLite_0.11.2
> [25] rtracklayer_1.17.21    scales_0.2.2           stats4_2.15.0
> [28] stringr_0.6.1          tools_2.15.0           XML_3.95-0.1
> [31] zlibbioc_1.3.0
>
>
> ---------------------------------------------------------------------------------
> Francesco Lescai, PhD, EDBT
> Senior Research Associate in Genome Analysis
> University College London
> Faculty of Population Health Sciences
> Dept. Genes, Development&  Disease
> ICH - Molecular Medicine Unit, GOSgene team
> 30 Guilford Street
> WC1N 1EH London UK
>
> email: f.lescai at ucl.ac.uk<mailto:f.lescai at ucl.ac.uk>
> phone: +44.(0)207.905.2274
> [ext: 2274]
> --------------------------------------------------------------------------------
>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor