[Bioc-sig-seq] Filtering Bowtie alignments (has nosubject before)
João Fadista
Joao.Fadista at agrsci.dk
Thu Jan 22 10:50:02 CET 2009
Hi,
Dan Koboldt will present a poster at this year's Marco Island sequencing meeting evaluating short read aligners. He has compiled a partial list of aligners at: http://www.massgenomics.org/2009/01/short-read-aligners-update-at-agbt.html
Cheers,
João
-----Original Message-----
From: bioc-sig-sequencing-bounces at r-project.org [mailto:bioc-sig-sequencing-bounces at r-project.org] On Behalf Of delhomme at embl.de
Sent: Thursday, January 22, 2009 9:51 AM
To: Martin Morgan
Cc: bioc-sig-sequencing at r-project.org
Subject: Re: [Bioc-sig-seq] Filtering Bowtie alignments (has nosubject before)
Hi Martin,
Thanks for that! Sorry that I forgot to put a subject in the original post.
I'm currently looking for some serious benchmarking of the different alignment software available (maq, bowtie, zoom, pairwiseAlignment in R, etc...) and although I've been scanning the literature and forums quite a lot, I didn't find anything done recently. Has anybody heard of something like that?
Cheers,
Nico
Quoting Martin Morgan <mtmorgan at fhcrc.org>:
> Hi --
>
> delhomme at embl.de writes:
>
>> Hi,
>>
>> I'm currently comparing the alignment generated by bowtie and maq on
>> the data used in the HilbertVis pdf file of the ShortRead package
>> (NCBI SRA
>> accession:
>> SRA000206). From the original .seq and .prb solexa files, I have
>> generated the fastq files and aligned them to the ref. genome using
>> bowtie and maq.
>>
>> I load both map files in R using readAligned. Then when filtering my
>> reads, I realized that most of the default filter functions have been
>> written for maq and are therefore not necessarily available for
>> bowtie.
>
> Not really for MAQ, but you're right that the filters make assumptions
> about data available for filtering.
>
>> I'm actually interested in the alignQualityFilter in order to sort
>> those reads which map several times in the genome (as described in
>> the page 22 of the "I/0 and Quality Assessment using ShortRead" of
>> last November bioC tutorial session). But, this method relies on the
>> quality slot of the object returned by the alignQuality accessor,
>> which is NA in the case of a bowtie alignment. This is actually
>> normal, as the final quality mapping value stored in this object is
>> only available in the maq map file and now in the bowtie one.
>>
>> So I'd just would like to know if there would be another way to
>> filter out the reads mapping multiple times in the genome when using
>> an AlignedRead object obtained from a bowtie map file.
>
>> Sorry that this sounds quite confusing... Here is the actual code and
>> what
>> happens:
>>
>> filt <- alignQualityFilter(threshold=1)
>>
>> # works
>> aln<-readAligned( "H3K4me1", "run13_lane4.map", type="MAQMap",
>> filter=filt)
>>
>> # fails
>> aln2<-readAligned( "H3K4me1.bowtie", "run13_lane4.map",
>> type="Bowtie",
>> filter=filt)
>
> As far as I can tell, Bowtie relies on the read identifier being
> unique for each read. So you can
>
> aln2 <- readAligned("H3K4me1.bowtie", "run13_lane4.map",
> type="Bowtie")
> aln2[!srduplicated(id(aln2))]
>
> Several important caveats. This requires ShortRead >1.1.37 (available
> in svn now, with BiocLite Wednesday after about 1 Seattle time, all
> being well; the original bowtie parser [intentionally] ignored the
> id). Note that the above keeps the _first_ of the duplicated
> identifiers; MAQ chooses a random read to report (though the random
> number can be set on the MAQ run and hence obtain repeatable results,
> or all alignments can be printed out, but the MAQ format when
> reporting all reads is, I believe, different from that expected by
> ShortRead and readAligned).
>
> You can create your own more elaborate filters, and define them (e.g.,
> to be used in readAligned) with the srFilter function, see ?srFilter
> for an example.
>
> Hope that helps,
>
> Martin
>
>> Error in x at ranges[i] : subscript contains NAs
>>
>> Any ideas how to filter a bowtie type alignment for reads mapping
>> multiple time in the genome?
>>
>> Cheers,
>>
>> N.
>>
>>
>>
>>
>>
>>> sessionInfo()
>> R version 2.9.0 Under development (unstable) (2009-01-04 r47472)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_U
>> S.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC
>> _NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDE
>> NTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] HilbertVis_1.1.4 ShortRead_1.1.33 lattice_0.17-20 BSgenome_1.11.8
>> [5] Biobase_2.3.9 Biostrings_2.11.22 IRanges_1.1.33
>>
>> loaded via a namespace (and not attached):
>> [1] grid_2.9.0 Matrix_0.999375-17 tools_2.9.0
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center 1100
> Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M2 B169
> Phone: (206) 667-2793
>
_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
More information about the Bioc-sig-sequencing
mailing list