[Bioc-sig-seq] find overlaps compatible with a transcript

Sun Sep 19 22:24:02 CEST 2010

Hi Elizabeth,

AFAIK we don't have anything for detecting this kind of "compatibility"
between a gapped read and the exons in a transcript.

Here is a function that detects whether a gapped read is "compatible"
with the exons of a given transcript:

## 'gread' must be a GRanges object of length >= 2 representing the
## gapped read.
## 'exons' must be a GRanges object of length >= 2 representing the
## exons of a given transcript assumed to be already ordered by rank.
## Returns an integer vector of length 2:
##   - If the ranges in 'gread' are compatible with 'exons', then
##     returns the ranks of the first and last exons spanned by 'gread';
##   - Otherwise returns 2 NAs.
gappedReadAndExonsCompatibility <- function(gread, exons)
{
     seqname1 <- unique(seqnames(gread))
     strand1 <- unique(strand(gread))
     seqname2 <- unique(seqnames(exons))
     strand2 <- unique(strand(exons))
     if (length(seqname1) != 1L || length(strand1) != 1L
      || length(seqname2) != 1L || length(strand2) != 1L)
         stop("trans-splicing is not supported, sorry")
     if (strand1 == "*" || strand2 == "*")
         stop("strand * is not supported (must be + or -)")
     ## 'gread' and 'exons' are not on the same chromosome/strand:
     if (seqname1 != seqname2 || strand1 != strand2)
         return(c(NA_integer_, NA_integer_))
     if (length(gread) < 2L || length(exons) < 2L)
         stop("'length(gread)' and 'length(exons)' must be >= 2")
     if (strand1 == "-")
         exons <- rev(exons)
     ## Determine rank of first exon that fits:
     first_exon <- which(start(gread)[1L] >= start(exons) &
                         end(gread)[1L] == end(exons))
     ## Determine rank of last exon that fits
     last_exon <- which(start(gread)[length(gread)] == start(exons) &
                        end(gread)[length(gread)] <= end(exons))
     if (length(first_exon) > 1L)
         stop("unsupported splicing: ",
              "first range in 'gread' fits more than 1 exon")
     if (length(last_exon) > 1L)
         stop("unsupported splicing: ",
              "last range in 'gread' fits more than 1 exon")
     if (length(first_exon) == 0L || is.na(first_exon)
      || length(last_exon) == 0L || is.na(last_exon)
      || length(gread) != last_exon - first_exon + 1L)
         return(c(NA_integer_, NA_integer_))
     ## Check exons between first and last exons:
     middle_ranges <- seq_len(length(gread) - 2L) + 1L
     middle_exons <- seq_len(length(gread) - 2L) + first_exon
     if (any(start(gread)[middle_ranges] != start(exons)[middle_exons])
      || any(end(gread)[middle_ranges] != end(exons)[middle_exons]))
         return(c(NA_integer_, NA_integer_))
     ans <- c(first_exon, last_exon)
     if (strand1 == "-")
         ans <- length(exons) - rev(ans) + 1L
     ans
}

Then with:

exons <- GRanges(
     seqnames=rep("chr1", 6),
     ranges=IRanges(
         start=c(601,1001,1401,2001,2501,3001),
         end=c(700,1090,1600,2100,2800,3050)),
     strand="+"
)

gread <- GRanges(
     seqnames=rep("chr1", 3),
     ranges=IRanges(
         start=c(1085,1401,2001),
         end=c(1090,1600,2033)),
     strand="+"
)

## Compatible and spans exons 2 to 4:
 > gappedReadAndExonsCompatibility(gread, exons)
[1] 2 4
## Compatible and spans exons 1 to 3:
 > gappedReadAndExonsCompatibility(gread, exons[-1])
[1] 1 3
## Not compatible:
 > gappedReadAndExonsCompatibility(gread, exons[-3])
[1] NA NA

So after you've used findOverlaps() to find hits between your gapped
reads and your transcripts, you could do a second pass and filter those
hits using gappedReadAndExonsCompatibility(). The function is not
vectorized to work directly on GRangesList objects so you'll have
to use a loop. But since the space that you explore now is much
reduced (thanks to initial findOverlaps filtering), performance
should still be ok.

Let is know if that works for you (and I could add this utility function
to the GenomicFeatures package).

Cheers,
H.

On 09/13/2010 11:18 AM, Elizabeth Purdom wrote:
> Hello,
>
> I am using a TranscriptDb and trying to find overlaps with transcripts.
> For example, I have a gapped alignment and I want to see what
> transcripts it is compatible with. If txdb is my TranscriptDb, and gr is
> my gapped alignment as a GenomicRanges object, I can do findOverlaps to
> see if my read overlaps in any way overlaps with the individual exons of
> the transcript, but not whether it overlaps with the implied transcript.
> For example, if my gapped read overlaps exon 1,2,3 of the transcript, it
> can only be compatible if it overlaps in a particular way (it must
> contain the end of exon 1, the beginning of exon 3, and all of exon 2).
>
> Is there a way to check this? This is probably answered somewhere, but I
> can't seem to find it.
>
> Thanks,
> Elizabeth
>
> An example:
>  > txdb <- loadFeatures(system.file("extdata",
> "UCSC_knownGene_sample.sqlite", package="GenomicFeatures"))
>  > exByTx<-exonsBy(newtxdb$txdb,"tx")
> #this is compatible
>  > grOk<-GRanges(seqnames =c("chr1", "chr1", "chr1"), ranges
> =IRanges(c(2000,2476,3084),c(2090,2584,3089)), strand =rep("*",3))
> #this is not
>  > grNotOk<-GRanges(seqnames =c("chr1", "chr1", "chr1"),ranges =
> IRanges(c(2000,2500,3084),c(2090,2584,3089)),
> strand =rep("*",3))
> #both overlap the same set of transcripts, but the the second is not
> compatible with either transcript
>  > findOverlaps(GRangesList(grOk,grNotOk),exByTx)
> An object of class "RangesMatching"
> Slot "matchMatrix":
> query subject
> [1,] 1 1
> [2,] 1 2
> [3,] 2 1
> [4,] 2 2
>
> Slot "DIM":
> [1] 2 135
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319