[BioC] Why is writeVcf() slow?
Valerie Obenchain
vobencha at fhcrc.org
Sat Oct 27 00:14:45 CEST 2012
Hi Pete,
Thanks for sending the data. I confirmed your numbers for writing out
the sample file.
> load('~/Downloads/slow_writeVcf_example_data.RData')
> vcf <- vcf.4aff.constrained
> vcf
class: VCF
dim: 2668 7
genome: mm9
...
> system.time(writeVcf(vcf, "mm9.vcf"))
user system elapsed
164.362 0.548 165.378
Warning message:
In .Method(..., deparse.level = deparse.level) :
number of rows of result is not a multiple of vector length (arg 1)
Sample data from the package with ~10K variants was also slow.
> fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
> vcf <- readVcf(fl, "hg19")
> vcf
class: VCF
dim: 10376 5
genome: hg19
...
> system.time(writeVcf(vcf, "hg19.vcf"))
user system elapsed
1113.089 2.052 1117.831
Warning message:
In .Method(..., deparse.level = deparse.level) :
number of rows of result is not a multiple of vector length (arg 1)
The warning came from a problem when creating the FORMAT lines for the
geno fields when we both list and non-list matrices are present. The
speed issues I identified were parsing bottlenecks when preparing the
info and geno data for writing.
With the modifications these two examples are showing a >95% speed increase.
## Note that the VCF class structure has changed in VariantAnnotation ##
in devel. VCF is now a VIRTUAL class with subclasses of CollapsedVCF ##
and ExpandedVCF. Old VCF instances must be updated to CollapsedVCF ##
objects. See devel ?VCF for details.
> load("~/Downloads/slow_writeVcf_example_data.RData")
> vcf <- updateObject(vcf.4aff.constrained)
> vcf
class: CollapsedVCF
dim: 2668 7
genome: mm9
...
> system.time(writeVcf(vcf, "mm9.vcf"))
user system elapsed
2.276 0.060 2.339
Again the sample file from the package,
> fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
> vcf <- readVcf(fl, "hg19")
> vcf
class: CollapsedVCF
dim: 10376 5
genome: hg19
...
> system.time(writeVcf(vcf, "hg19.vcf"))
user system elapsed
4.717 0.012 4.738
Fixes are in VariantAnnotation 1.5.7 in devel and 1.4.2 in release.
Thanks for reporting this.
Valerie
On 10/23/2012 09:08 PM, hickey at wehi.EDU.AU wrote:
> Valerie - I'll send you a subset of the VCF off list.
>
> Tim - This isn't something I'd considered. I'd always thought of VCF as
> just another delimited format but that post highlights the "intricacies"
> of VCF that I hadn't considered. In saying that, the files I'm dealing
> with here are pretty small (300K variants on 7 samples, 21Mb bgzipped),
> especially compared to the 1KG VCFs which are multi-Gb in size and
> contain thousands of samples.
>
> Pete
>
>
> On 24/10/2012, at 1:52 PM, Tim Triche, Jr. wrote:
>
>> is part of this an issue with the VCF format itself?
>>
>> http://campagnelab.org/evaluating-goby-against-the-1000-genome-genotype-calls-and-why-is-vcf-so-inefficient/
>>
>> or is that a separate concern?
>>
>> thanks,
>>
>> --t
>>
>>
>> On Tue, Oct 23, 2012 at 7:30 PM, Valerie Obenchain <vobencha at fhcrc.org
>> <mailto:vobencha at fhcrc.org>> wrote:
>>
>> Hi Peter,
>>
>> I'd say yes, the warning is a cause for concern. Is the vcf you're
>> working with publicly available? If not, can you send a smaller
>> subset that still produces the error?
>>
>> Valerie
>>
>>
>>
>> On 10/23/12 17:29, hickey at wehi.EDU.AU <mailto:hickey at wehi.EDU.AU>
>> wrote:
>>
>> I am using the writeVcf() function in the VariantAnnotation
>> package, which I understand to be 'under construction' from
>> the documentation. I am wondering why writeVcf() is taking so
>> long to do it's job and if there is anything I can do to speed
>> this up? For example, I have a VCF object containing some
>> 36,000 variants that I want to write to file and this takes
>> nearly 3 hours. It also produces a warning that I am unsure
>> how to interpret, i.e. should I be worried by it? The VCF
>> written to disk appears to be valid.
>>
>> Many thanks for your help.
>>
>> ## Code
>>
>> library(VariantAnnotation)
>> vcf<- readVcf(file =
>> 'UnifiedGenotyper.autosomal.__snps.indels.vcf.gz', genome
>> = 'mm9')
>> vcf
>>
>> class: VCF
>> dim: 292155 7
>> genome: mm9
>> exptData(1): header
>> fixed(4): REF ALT QUAL FILTER
>> info(21): AC AF ... SB STR
>> geno(5): AD DP GQ GT PL
>> rownames(292155): chr1:3003110 chr1:3012398 ... chr19:61340696
>> chr19:61340708
>> rowData values names(1): paramRangeID
>> colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385
>> 506/Bcl2#409
>> colData names(1): Samples
>>
>> ## Some subsetting of the 'vcf' object to create
>> vcf.2aff.constrained
>>
>> vcf.2aff.constrained
>>
>> class: VCF
>> dim: 35514 7
>> genome: mm9
>> exptData(1): header
>> fixed(4): REF ALT QUAL FILTER
>> info(21): AC AF ... SB STR
>> geno(5): AD DP GQ GT PL
>> rownames(35514): chr1:3087036 chr1:3183065 ... chr19:61340449
>> chr19:61340453
>> rowData values names(1): paramRangeID
>> colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385
>> 506/Bcl2#409
>> colData names(1): Samples
>>
>> system.time(writeVcf(obj = vcf.2aff.constrained, filename
>> =
>> 'UnifiedGenotyper.autosomal.__snps.indels.constrained.__genotypes.2aff.vcf'))
>>
>> user system elapsed
>> 10653.768 0.244 10659.163
>> Warning message:
>> In .Method(..., deparse.level = deparse.level) :
>> number of rows of result is not a multiple of vector length
>> (arg 1)
>>
>> sessionInfo()
>>
>> R version 2.15.1 (2012-06-22)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=C LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods
>> base
>>
>> other attached packages:
>> [1] VariantAnnotation_1.4.1 Rsamtools_1.10.1
>> Biostrings_2.26.2
>> [4] GenomicRanges_1.10.2 IRanges_1.16.2
>> BiocGenerics_0.4.0
>>
>> loaded via a namespace (and not attached):
>> [1] AnnotationDbi_1.20.1 Biobase_2.18.0 biomaRt_2.14.0
>> [4] bitops_1.0-4.1 BSgenome_1.26.1 DBI_0.2-5
>> [7] GenomicFeatures_1.10.0 parallel_2.15.1 RCurl_1.95-1.1
>> [10] RSQLite_0.11.2 rtracklayer_1.18.0 stats4_2.15.1
>> [13] tools_2.15.1 XML_3.95-0.1 zlibbioc_1.4.0
>>
>> ------------------------------__--
>> Peter Hickey,
>> PhD Student/Research Assistant,
>> Bioinformatics Division,
>> Walter and Eliza Hall Institute of Medical Research,
>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>> Ph: +613 9345 2324 <tel:%2B613%209345%202324>
>>
>> hickey at wehi.edu.au <mailto:hickey at wehi.edu.au>
>> http://www.wehi.edu.au <http://www.wehi.edu.au/>
>>
>>
>> __________________________________________________________________________
>> The information in this email is confidential and
>> intend...{{dropped:8}}
>>
>> _________________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>> https://stat.ethz.ch/mailman/__listinfo/bioconductor
>> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>> Search the archives:
>> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>
>>
>> _________________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>> https://stat.ethz.ch/mailman/__listinfo/bioconductor
>> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>> Search the archives:
>> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>
>>
>>
>>
>> --
>> /A model is a lie that helps you see the truth./
>> /
>> /
>> Howard Skipper
>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>>
>
> --------------------------------
> Peter Hickey,
> PhD Student/Research Assistant,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> Ph: +613 9345 2324
>
> hickey at wehi.edu.au <mailto:hickey at wehi.edu.au>
> http://www.wehi.edu.au
>
>
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:6}}
More information about the Bioconductor
mailing list