[BioC] Biostring: print sequence alignment to file

Martin Preusse martin.preusse at googlemail.com
Wed Apr 18 12:20:31 CEST 2012


Hi,  

I just found this function to print a pairwise alignments in blocks. Doesn't add the match/mismatch indicators between sequences, but might be a starting point:

http://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html#viewing-a-long-pairwise-alignment


Cheers
Martin



Am Mittwoch, 18. April 2012 um 12:16 schrieb Martin Preusse:

> Hi everybody,  
>  
> I think the output format depends on the purpose of the alignment.  
>  
> A pairwise sequence alignment is usually done to compare two sequences base by base. In my case, I compare sequencing results of cloned expression constructs with the desired sequence. Thus, the best output format would be "BLAST like".
>  
> seq1: 1 ATCTGC 7
> | | | . . |
> seq2: 1 ATCAAC 7
>  
> When doing MSA, most people might rather be interested in the consensus sequence. E.g. in the context of conservation between species.
>  
> So write.PairwiseAlignedXStringSet() and write.MultipleAlignment() are quite different and BLAST doesn't make much sense for multiple alignments. This means it would be best to put the output in the PairwiseAlignment/MultipleAlignment and not to the XStringSet, right?
>  
> This is an overview of sequence alignment formats used by EMBOSS:
> http://emboss.sourceforge.net/docs/themes/AlignFormats.html
>  
> 'pair' or 'markx0' would be perfectly fine.
>  
>  
> Cheers
> Martin
>  
>  
>  
> Am Dienstag, 17. April 2012 um 22:13 schrieb Thomas Girke:
>  
> > Hi Hervé,
> >  
> > To me, the most basic and versatile MSA or pairwise alignment format to output
> > to would be FASTA since it is compatible with almost any other alignment
> > editing software. For text-based viewing purposes my preference would be
> > to also output to a format similar to the one shown in the following
> > example. When there are only two sequences then one could show instead
> > of a consensus line the pipe characters between the two sequences to
> > indicate identical residues which mimics the blast output. A more
> > standardized version of this pairwise alignment format can be found
> > here:
> > http://emboss.sourceforge.net/apps/cvs/emboss/apps/needle.html  
> >  
> > library(Biostrings)
> > p450 <- read.AAStringSet("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/p450.mul", "fasta")  
> >  
> > StringSet2html <- function(msa=p450, file="p450.html", start=1, end=length(p450[[1]]), counter=20, browser=TRUE, ...) {
> > if(class(msa)=="AAStringSet") msa <- AAStringSet(msa, start=start, end=end)
> > if(class(msa)=="DNAStringSet") msa <- DNAStringSet(msa, start=start, end=end)
> > msavec <- sapply(msa, toString)
> > offset <- (counter-1)-nchar(nchar(msavec[1]))
> > legend <- paste(paste(paste(paste(rep(" ", offset), collapse=""), format(seq(0,  
> > nchar(msavec[1]), by=counter)[-1])), collapse=""), collapse="")
> > consensus <- consensusString(msavec, ambiguityMap=".", ...)
> > msavec <- paste(msavec, rowSums(as.matrix(msa) != "-"), sep=" ")
> > msavec <- paste(format(c("", names(msa), "Consensus"), justify="left"), c(legend, msavec,  
> > consensus), sep=" ")
> > msavec <- c("<html><pre>", msavec, "</pre></html>")
> > writeLines(msavec, file)
> > if(browser==TRUE) { browseURL(file) }
> > }
> > StringSet2html(msa=p450, file="p450.html", start=1, end=length(p450[[1]]), counter=20, browser=T, threshold=1.0)  
> > StringSet2html(msa=p450, file="p450.html", start=450, end=470, counter=20, browser=T, threshold=1.0)  
> >  
> >  
> > Thomas
> >  
> > On Tue, Apr 17, 2012 at 07:43:30PM +0000, Hervé Pagès wrote:
> > > Hi Thomas,
> > >  
> > > On 04/17/2012 11:49 AM, Thomas Girke wrote:
> > > > What about providing an option in pairwiseAlignment to output to the
> > > > MultipleAlignment class in Biostrings and then write the latter to
> > > > different alignment formats?
> > >  
> > >  
> > >  
> > >  
> > >  
> > >  
> > > Or we could provide coercion methods to switch between
> > > PairwiseAlignedXStringSet and MultipleAlignment.
> > >  
> > > Anyway that kind of moves Martin's problem from having a
> > > write.PairwiseAlignedXStringSet() function that produces BLAST output
> > > to having a write.MultipleAlignment() function that produces BLAST
> > > output. For the specific case of BLAST output, would it make sense
> > > to support it for MultipleAlignment? Can someone point me to an example
> > > of such output? Or even better, to the specs of such format?
> > >  
> > > Note that right now there is the write.phylip() function in Biostrings
> > > for writing a MultipleAlignment object to a file but the Phylip format
> > > looks very different from the BLAST output:
> > >  
> > > hpages at latitude:~$ head -n 20 phylip_test.txt
> > > 9 2343
> > > Mask 0000000000 0000000000 0000000000 0000000000 0000000000
> > > Human -----TCCCG TCTCCGCAGC AAAAAAGTTT GAGTCGCCGC TGCCGGGTTG
> > > Chimp ---------- ---------- ---------- ---------- ----------
> > > Cow ---------- ---------- ---------- ---------- ----------
> > > Mouse ---------- ---------- --AAAAGTTG GAGTCTTCGC TTGAGAGTTG
> > > Rat ---------- ---------- ---------- ---------- ----------
> > > Dog ---------- ---------- ---------- ---------- ----------
> > > Chicken ---------- ----CGGCTC CGCAGCGCCT CACTCGCGCA GTCCCCGCGC
> > > Salmon GGGGGAGACT TCAGAAGTTG TTGTCCTCTC CGCTGATAAC AGTTGAGATG
> > >  
> > > 0000000000 0000000000 0000000000 0001111111 1111111111
> > > CCAGCGGAGT CGCGCGTCGG GAGCTACGTA GGGCAGAGAA GTCA-TGGCT
> > > ---------- ---------- ---------- ---------- ---A-TGGCT
> > > ---------- ---------- ---------- ---GAGAGAA GTCA-TGGCT
> > > CCAGCGGAGT CGCGCGCCGA CAGCTACGCG GCGCAGA-AA GTCA-TGGCT
> > > ---------- ---------- ---------- ---------- ---A-TGGCT
> > > ---------- ---------- ---------- ---------- ---A-TGGCT
> > > AGGGCCGGGC AGAGGCGCAC GCAGCTCCCC GGGCGGCCCC GCTC-CAGCC
> > > CGCATATTAT TATTACCTTT AGGACAAGTT GAATGTGTTC GTCAACATCT
> > >  
> > > Thanks!
> > > H.
> > >  
> > > >  
> > > > Thomas
> > > >  
> > > > On Tue, Apr 17, 2012 at 05:59:24PM +0000, Hervé Pagès wrote:
> > > > > Hi Martin,
> > > > >  
> > > > > On 04/16/2012 04:06 AM, Martin Preusse wrote:
> > > > > > Hi Charles,
> > > > > >  
> > > > > > thanks! Your solution allows to print the two alignment strings separately.
> > > > > >  
> > > > > > I was thinking of an output as generated by alignment tools:
> > > > > >  
> > > > > > AGT-TCTAT
> > > > > > | | | | | | | | |
> > > > > > AGTATCTAT
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > > This looks like BLAST output. Is this what you have in mind? Note that
> > > > > there are many alignment tools and many ways to output the result to a
> > > > > file. I'm not really familiar with the BLAST output format. Is it
> > > > > specified somewhere? Would that make sense to add something like a
> > > > > write.PairwiseAlignedXStringSet() function to Biostrings for writing
> > > > > the result of pairwiseAlignment() to a file? We could do this and
> > > > > support the BLAST format if that's a commonly used format.
> > > > >  
> > > > > Thanks,
> > > > > H.
> > > > >  
> > > > > >  
> > > > > > For this I would have to write a function to output the strings in blocks of e.g. 60 nucleotides, right?
> > > > > >  
> > > > > > Cheers
> > > > > > Martin
> > > > > >  
> > > > > >  
> > > > > >  
> > > > > > Am Freitag, 13. April 2012 um 19:21 schrieb Chu, Charles:
> > > > > >  
> > > > > > > write.XStringSet
> > > > > >  
> > > > > >  
> > > > > > _______________________________________________
> > > > > > Bioconductor mailing list
> > > > > > Bioconductor at r-project.org (mailto:Bioconductor at r-project.org)
> > > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > > > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > > --
> > > > > Hervé Pagès
> > > > >  
> > > > > Program in Computational Biology
> > > > > Division of Public Health Sciences
> > > > > Fred Hutchinson Cancer Research Center
> > > > > 1100 Fairview Ave. N, M1-B514
> > > > > P.O. Box 19024
> > > > > Seattle, WA 98109-1024
> > > > >  
> > > > > E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
> > > > > Phone: (206) 667-5791
> > > > > Fax: (206) 667-1319
> > > > >  
> > > > > _______________________________________________
> > > > > Bioconductor mailing list
> > > > > Bioconductor at r-project.org (mailto:Bioconductor at r-project.org)
> > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> > > >  
> > >  
> > >  
> > >  
> > >  
> > >  
> > >  
> > >  
> > > --  
> > > Hervé Pagès
> > >  
> > > Program in Computational Biology
> > > Division of Public Health Sciences
> > > Fred Hutchinson Cancer Research Center
> > > 1100 Fairview Ave. N, M1-B514
> > > P.O. Box 19024
> > > Seattle, WA 98109-1024
> > >  
> > > E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
> > > Phone: (206) 667-5791
> > > Fax: (206) 667-1319
> >  
>  



More information about the Bioconductor mailing list