[Bioc-sig-seq] Adapter removal
Herve Pages
hpages at fhcrc.org
Fri Jul 18 06:16:04 CEST 2008
Hi Krys,
And to add up on Patrick's comments, I'd like to mention a particular
way of using the new pairwiseAlignment() function in order to find
the "edit distance" (i.e. the min nb of subs/ins/dels required for
transforming one string into another) between the linker and the set
of reads.
Let's say the reads are in 'dict1':
> dict1
A DNAStringSet instance of length 3506447
width seq
[1] 35 CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
[2] 35 ATTGGAGAATTATAGTAGTAGATATTAATGATTGT
[3] 35 TCAGGTTGATCCACTCCACTACGGCTGCTTTTCTT
[4] 35 ATTACAAAATTACTTTACCAAGATTCTCGAAGATG
[5] 35 ATTTACTTTTTCCTTTCCAATATTGAAGTCTTGAA
[6] 35 ATATTTGATGTAAAATAAGTAACAAAGAAACTCAC
[7] 35 GATCAACATAAAGATGAACTTTCACAAATTAATGC
[8] 35 GTATTCTATACAATTGAGTGTATCATCATCTCTAA
[9] 35 TGTTTTTCTGACATTTTCATACATTTTGTTGATGG
... ... ...
[3506439] 35 CTGCTTCAAATGTTGTGACATTCACTTATCTGTCT
[3506440] 35 CACACAAGATATGGAAGATAATAGTCTCTCTAGGC
[3506441] 35 CCGTGATTTTCAGTTTTCTCGCCATATTCACGTTC
[3506442] 35 CAAAGAGTACAATGCCAAGCATTTCAAGCAGCACT
[3506443] 35 CACGTGCCGCCTCGATGGTCGCATTAAGTGCGAGC
[3506444] 35 CCACCTGTGTGACATCACAGAGGAGGCAGCAGGTG
[3506445] 35 ACAAAATAAGATAGTATATAAGTAGAAAATTCACA
[3506446] 35 GTGGGTTGGCATGGTCAAACTTAACTGGAAATTTC
[3506447] 35 AATATCCCATTGAAAGGAAAATATATACATATAAT
And your linker is:
linker <- DNAString("GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG")
By using this special substitution matrix:
> mat <- matrix(-1L, nrow = 4, ncol = 4)
> diag(mat) <- 0L
> rownames(mat) <- colnames(mat) <- DNA_ALPHABET[1:4]
> mat
A C G T
A 0 -1 -1 -1
C -1 0 -1 -1
G -1 -1 0 -1
T -1 -1 -1 0
and a gap opening score of 0 and a gap extension score of -1, then
the scores returned by
scores <- pairwiseAlignment(dict1, linker, substitutionMatrix=mat,
gapOpening=0, gapExtension=-1, scoreOnly=TRUE)
are the opposites of the edit distances between the reads and the
linker.
As Patrick mentioned, the above call is very fast because the strings
to align are very short on both side of the comparison (will take
between 1 and 2 minutes on a modern PC).
Then by looking at the right end of the "edit distance" histogram:
> table(scores)
scores
-31 -30 -29 -28 -27 -26 -25 -24 -23 -22 -21
17 80458 6632 2924 4924 10647 33937 93756 219541 412330 594273
-20 -19 -18 -17 -16 -15 -14 -13 -12 -11 -10
648006 535821 340020 166160 63729 19741 4840 1092 343 165 228
-9 -8 -7 -6 -5 -4 -3 -2
291 502 868 1802 3554 14293 65037 180516
we can see that there are a lot of reads that are very close to the
linker: 259846 of them are at a distance <= 4. To "see" those reads:
> dict1[which(scores >= -4)]
A DNAStringSet instance of length 259846
width seq
[1] 35 GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTTGA
[2] 35 GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTAGA
[3] 35 GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAA
[4] 35 GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTAAA
[5] 35 GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTAGA
[6] 35 GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAA
[7] 35 GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTAGA
[8] 35 GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTAGA
[9] 35 GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTAGA
... ... ...
[259838] 35 GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTCAA
[259839] 35 GATCGGAAGAGCTCGTATGCCGTCTTCTGATTAAA
[259840] 35 GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAA
[259841] 35 GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAA
[259842] 35 GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAA
[259843] 35 GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTTAA
[259844] 35 GATCGGAAGAGCTCGTATGCCGTCTTCTTCTTGAA
[259845] 35 GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTCTA
[259846] 35 GATCGGAAGAGCTCGTATGCCGTATTCTGCTTAGA
This method finds reads that differ from the linker by
a small number of insertions and/or deletions, in addition
to the substitutions:
> dict1[which(scores == -4)]
A DNAStringSet instance of length 14293
width seq
[1] 35 GATCGGAAGAGCTCGTATGCGTCTTCTGCTTAGAT
[2] 35 GATCGGAAGAGCTCGTATGCCGTTTCTGCTTAGAT
[3] 35 GATCGGAAGAGCTCGTATGCGTCTTCTGCTTAGAT
[4] 35 GATCGGAAGAGCTCGGTATGCCGTTTTCTGTTTCG
[5] 35 GATCGGAAGAGCTCGTATGCCGTTTTCTGCTTATA
[6] 35 GATCGGAAGAGCTCGTATGCCGCCTTCTGCTTCAC
[7] 35 GATCGGAAGAGCTCGTATGCCCTCTTCTGCCTCGA
[8] 35 GATCGGAAGAGCTCGTATGCCGTCTTCTTCTTTAA
[9] 35 GATCGGAAGAGCTCGTATGCCGACTTCTGCTTCAA
... ... ...
[14285] 35 GATCGGAAGAGCTCGTATGCCGTCTTCTGTTTCTA
[14286] 35 GATCGGAAGAGCTCGTATGCCGTCTTCTCCTTTAA
[14287] 35 GATCGGAAGAGCTCGTATGCCGTCTTATGCTTTAA
[14288] 35 GATCGGAAGCGCTCGTATGCCGTCTTCTGCTTCAC
[14289] 35 GATCGGAAGAGCTCGTATGCCGTCTTCTTCTTTAA
[14290] 35 GATCGGAAGAGCTCGTATGCCGTCTTCTTCTTTTA
[14291] 35 GATCGGAAGAGCTCGTATGCGTCTTCTGCTTAGAT
[14292] 35 GATCGGAAGAGCTCGTATGCCCTCTTCTGCTTTAA
[14293] 35 GATCGGAAGAGCTCGTATGCCGTCTTCTGATTAAA
To remove them from 'dict1':
cleandict1 <- dict1[-which(scores >= -4)]
Hope this helps.
H.
More information about the Bioc-sig-sequencing
mailing list