[BioC] readVcf - ALT allele as CompressedCharacterList instead of DNAStringSetList
Valerie Obenchain
vobencha at fhcrc.org
Wed Oct 3 18:54:50 CEST 2012
Hi Francesco,
Yes, the DNAStringSetList is intended to handle multiple alternate
alleles. ALT should only be made a CompressedList when structural
variants are present.
Try the following with your file,
debug(VariantAnnotation:::.formatALT)
fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
readVcf(fl, "hg19")
debug: {
if (is.null(x))
return(NULL)
structural <- grep("<", x, fixed = TRUE)
if (!identical(integer(0), structural))
seqsplit(x, seq_len(length(x)))
else .toDNAStringSetList(x)
}
Here 'x' is the alternate allele values,
Browse[2]> x
[1] "A" "A" "G,T" "." "G,GTCT"
My guess is that if you inspect these you'll see at least one structural
variant in there.
Valerie
On 10/03/2012 07:23 AM, Lescai, Francesco wrote:
> Hi there,
> I was importing a new VCF file generated with the GATK v2 HaplotypeCaller, to compare several parameters with a previous one on the same samples generated with UnifiedGenotyper.
>
> "thanks" to an error in predictCoding, I discovered the new VCF file formatted the ALT allele field as a CompressedCharacterList instead of a DNAStringSetList.
>
> see here
>
>> head(fixed(haplo),n=2L)
> GRanges with 2 ranges and 5 metadata columns:
> seqnames ranges strand | paramRangeID REF
> <Rle> <IRanges> <Rle> |<factor> <DNAStringSet>
> rs139570490 1 [865738, 865738] * |<NA> A
> rs4372192 1 [876499, 876499] * |<NA> A
> ALT QUAL FILTER
> <CompressedCharacterList> <numeric> <character>
> rs139570490 G 4280.02 PASS
> rs4372192 G 11733.02 PASS
> ---
> seqlengths:
> 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y
> NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
>
>> head(fixed(uni),n=2L)
> GRanges with 2 ranges and 5 metadata columns:
> seqnames ranges strand | paramRangeID REF
> <Rle> <IRanges> <Rle> |<factor> <DNAStringSet>
> 1:865738 1 [865738, 865738] * |<NA> A
> rs4372192 1 [876499, 876499] * |<NA> A
> ALT QUAL FILTER
> <DNAStringSetList> <numeric> <character>
> 1:865738 ######## 5055.97 PASS
> rs4372192 ######## 13526.97 PASS
> ---
> seqlengths:
> 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y
> NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
>
> The only very important difference I could find between the two files, is that in the one generated with UnifiedGenotyper there is one single allele in the ALT field, while in the new VCF file, there are several variants with multiple alternative alleles
>
> [me at wilder test]$ grep -v "##" union.filtered.vcf | cut -f 1,2,3,4,5 | grep ","
> [me at wilder test]$ grep -v "##" haploC.filtered.both.vcf | cut -f 1,2,3,4,5 | grep ","
> 1 1247578 . T G,TGG
> 1 1920434 rs144487103 TAA TA,TAAA
> 1 2303347 rs60972860 CAA CA,CAAA
> 1 3753032 rs36051675 GTT GTTTTT,GTTTTTTTT
> 1 6291918 rs148639379 TA TAA,T
> 1 7913184 rs34962249 CAAAA CAAA,CAAAAA
> 1 10357206 rs58344165 ATTT ATT,ATTTT,ATTTTT
> 1 14057425 rs35643856 GA GAA,G
> 1 15654737 rs71000392 CT CTT,C
> [...cut...]
>
> but I thought a DNAStringSetList was meant to accommodate precisely the case where you have multiple alternative alleles, wasn't it?
>
> thanks for your help,
> Francesco
>
>
>
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C
> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915
> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] VariantAnnotation_1.3.32 Rsamtools_1.9.31 Biostrings_2.25.12
> [4] GenomicRanges_1.9.66 IRanges_1.15.48 BiocGenerics_0.3.2
>
> loaded via a namespace (and not attached):
> [1] AnnotationDbi_1.19.46 Biobase_2.17.8 biomaRt_2.13.2
> [4] bitops_1.0-4.1 BSgenome_1.25.9 colorspace_1.1-1
> [7] DBI_0.2-5 dichromat_1.2-4 digest_0.5.2
> [10] GenomicFeatures_1.9.44 ggplot2_0.9.2.1 grid_2.15.0
> [13] gtable_0.1.1 labeling_0.1 MASS_7.3-21
> [16] memoise_0.1 munsell_0.4 parallel_2.15.0
> [19] plyr_1.7.1 proto_0.3-9.2 RColorBrewer_1.0-5
> [22] RCurl_1.91-1 reshape2_1.2.1 RSQLite_0.11.2
> [25] rtracklayer_1.17.21 scales_0.2.2 stats4_2.15.0
> [28] stringr_0.6.1 tools_2.15.0 XML_3.95-0.1
> [31] zlibbioc_1.3.0
>
>
> ---------------------------------------------------------------------------------
> Francesco Lescai, PhD, EDBT
> Senior Research Associate in Genome Analysis
> University College London
> Faculty of Population Health Sciences
> Dept. Genes, Development& Disease
> ICH - Molecular Medicine Unit, GOSgene team
> 30 Guilford Street
> WC1N 1EH London UK
>
> email: f.lescai at ucl.ac.uk<mailto:f.lescai at ucl.ac.uk>
> phone: +44.(0)207.905.2274
> [ext: 2274]
> --------------------------------------------------------------------------------
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list