[Bioc-sig-seq] identifying a common motif in a set of sequences
Muino, Jose
jose.muino at wur.nl
Tue Feb 9 14:07:17 CET 2010
Hi,
You could try the base R "sub" function to strip the pattern from the
reads. I am not sure whether there is a more efficient way, but it
should work.
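A minimal sketch of what I mean, using base R only. The reads are copied
from your message below; the variable names (reads, motif, has_motif,
trimmed, prefix_counts) are just for illustration. I am assuming the reads
are available as a plain character vector; for a DNAStringSet `x`,
as.character(x) should give you one. The last step simply tabulates the
first 20 bases of each read, which is a crude way to see which leading
patterns recur, not a proper motif finder.

```r
# A few reads copied from the quoted message (illustrative subset only)
reads <- c(
  "GGCCACGCGTCGACTAGTACTTAAAAATATCGCACG",
  "GGCCACGCGTCGACTAGTACAGAAAAGACCGTGACT",
  "GTTTCCCAGTCACGGTCATGCTTCCTGTTTCCCAGC",
  "CAGTCACGGTCAAAAAATACATACTAAACACCTACT"
)
motif <- "GGCCACGCGTCGACTAGTAC"

# Which reads start with the motif?
has_motif <- startsWith(reads, motif)

# Remove the motif, but only when it sits at the start of the read;
# reads without it are returned unchanged
trimmed <- sub(paste0("^", motif), "", reads)

# Tabulate the first 20 bases of every read, most frequent first
prefix_counts <- sort(table(substr(reads, 1, 20)), decreasing = TRUE)
```

On 16 million reads the table() step may be slow and memory hungry, so
it may be worth trying it on a random subset first.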
By the way, if you google the sequence (GGCCACGCGTCGACTAGTAC) you will
find it in several papers. I have the impression that sometimes it is
used as a primer for the generation of the first cDNA strand.
Dr. Jose M Muino
Plant Research International B.V.
P.O. Box 619, 6700 AP Wageningen, The Netherlands
Phone: +0317-481122.
E-mail: jose.muino at wur.nl
http://www.pri.wur.nl
> -----Original Message-----
> From: bioc-sig-sequencing-bounces at r-project.org
> [mailto:bioc-sig-sequencing-bounces at r-project.org] On Behalf
> Of Johannes Rainer
> Sent: Tuesday 9 February 2010 13:37
> To: bioc-sig-sequencing at r-project.org
> Subject: [Bioc-sig-seq] identifying a common motif in a set
> of sequences
>
> dear all,
>
> I'm wondering if there is already a function implemented in
> any Bioconductor package that can identify a common
> sequence pattern in a set of sequences.
>
> I'm asking this because in my ChIP-seq data only about 3 million
> of the 20 million reads can be aligned to the (human) genome
> (using bowtie), and, looking at the sequences that cannot
> be aligned (see below), there seem to be certain recurring
> sequence patterns (like GGCCACGCGTCGACTAGTAC). I have
> absolutely no idea where these sequences could come from.
> They are not adapter or primer sequences, since I've aligned
> all adapter/primer sequences I've got from the provider
> against these sequences.
>
> Is there any way to extract common sequence patterns (like
> GGCCACGCGTCGACTAGTAC) from these sequences in an automated manner?
> Besides that, has anybody experienced the same problem?
>
> bests, jo
>
>
> A DNAStringSet instance of length 16196935
> width seq
> [1] 36 GGCCCCGCGTCGCCTAGTACTACATAAACAATGACC
> [2] 36 GGCGATGACCTTCTTGTGACCGTTGTGCATGCCGNC
> [3] 36 GTTTCCCAGTCACGGTCATGCTTCCTGTTTCCCAGC
> [4] 36 GTTTCCCAGTCACGGTCGTCCTTTTATTCTGACCTG
> [5] 36 GGCCACGCGTCGACTAGTACTTAAAAATATCGCACG
> [6] 36 GGCCACGCGTCGACTAGTACAGAAAAGACCGTGACT
> [7] 36 GGCCACGCGTCGACTAGTACAAAGGACATCACGCCG
> [8] 36 GGCCACGCGTCGACTAGTACAGAGTAAACAACGACC
> [9] 36 CAGTCACGGTCAAAAAATACATACTAAACACCTACT
> ... ... ...
> [16196927] 36 CAGTCACGGTCTGGCGGNATNNTTTTTGTACTAGTC
> [16196928] 36 TAGCCAGCCAAGCCAGCNAANNCAGCCATCCAGCCA
> [16196929] 36 GCGCCCCTGTCGCGGACNACNNGTAAGCAGCTCTCT
> [16196930] 36 ACTACACCCCTTAGCAANGANNATCTGAGCCTCCAT
> [16196931] 36 ACTACAAGCAAACAGTGNTCNNCTATGGTCCAGATC
> [16196932] 36 GCAGCCACGTCCCGATCNCCNNTTTGAGTGCGTGCG
> [16196933] 36 GGCCACGCGTCGACTAGNACNNCGAAAAATACGACC
> [16196934] 36 GGCCACGCGTCGACTAGTACNNAAAAAACAACGCCT
> [16196935] 36 AGTCACGGTCAAGTAACACANNAACAGAAAACCAAA
>
> --
> Johannes Rainer, PhD
> Bioinformatics Group,
> Division Molecular Pathophysiology,
> Biocenter, Medical University Innsbruck, Fritz-Pregl-Str
> 3/IV, 6020 Innsbruck, Austria and Tyrolean Cancer Research
> Institute Innrain 66, 6020 Innsbruck, Austria
>
> Tel.: +43 512 570485 13
> Email: johannes.rainer at i-med.ac.at
> johannes.rainer at tcri.at
> URL: http://bioinfo.i-med.ac.at