[Bioc-sig-seq] Bioc short read directions

Wed Apr 2 21:22:57 CEST 2008

Hi Harris,

Hope you don't mind if I put this back to the list.

Harris A. Jaffee wrote:
>>>>
> 
> Thanks to both of you for all the work!
> 
> Maybe I missed some old discussion, but is there a facility to specify 
> wild-cards in the reads,
> say "." or "N"?  These should match any of A, C, G, or T.  In my mind, 
> these are (exact) matches.

Not yet. But wild-cards in the subject will be supported soon (I'm working
on this and hopefully this will be ready before the end of the week).

> I think the user should only have to mention the concept of mismatch (or 
> better, fuzzy match), in
> the call, if he uses "N".  A few percent of our data contain unknowns, 
> like this.  (The Solexa
> machine may actually output ".".  That's what I see.)  I could get 
> better statistics if you want.
> 
> Also, can the user put a global upper limit on the number of 
> (exact/fuzzy/mis-) matches, per read?
> For example, he may feel that more than 10 matches is equivalent to 
> matching an infinite number of
> times.  I know that one can filter the result, but why make more trouble 
> for the software?

Nothing like that for now. If you are counting only (countPDict()), counting
all the counts for everybody (even when some reads have an insane number
of matches) is not more work for the current algo than stopping the count
for the reads that have reached a given limit (actually stopping the count
for those reads would actually be a little more work since you would need to
compare the count with the global limit before you decide to increment it or
not).
With matchPDict(), reporting all the matches, even for reads that have an
insane number of them, will of course require a lot of memory and slow down
things so in this case some global upper limit would indeed make sense.
Something that could be added later but not a priority.

> 
> Nobody seems to have mentioned it, but what about a "both strand" mode?  
> If RC is reverse-complement,
> this feature would basically automate the first statement here:
> 
>     Dict = PDict(c(patterns, RC(patterns)))
>         matchPDict(Dict,Seq)
> 
> The user would just pass 'patterns' and have to say whether he wants 
> forward and reverse matches
> distinguished.  The result would be of length 2*length(patterns) as it 
> is now, if they should be,
> but of length length(patterns) if they can be combined.

Instead of putting this at the PDict() level, I would rather build some
higher level function _on top_ of PDict() that would handle this.

> 
> I wasn't watching closely, but you mentioned FASTA input.  Is pure text 
> input allowed, without
> the FASTA comment lines?  I wasn't aware of this file input, so I have 
> been reading in from files
> myself, which are usually the output of "sort | uniq -c" applied to the 
> data.  So the bookkeeping
> is not so painful for UNIX users.  :-)

This should not prevent PDict()/matchPDict() to handle the duplicated reads
properly. Then it's up to the user to remove or keep them before s/he
builds the PDict object.

Cheers,
H.

> 
> Sorry for the dumb questions.  I have been off on other things.