[Bioc-sig-seq] find overlaps compatible with a transcript
Hervé Pagès
hpages at fhcrc.org
Sun Sep 19 22:24:02 CEST 2010
Hi Elizabeth,
AFAIK we don't have anything for detecting this kind of "compatibility"
between a gapped read and the exons in a transcript.
Here is a function that detects whether a gapped read is "compatible"
with the exons of a given transcript:
## 'gread' must be a GRanges object of length >= 2 representing the
## gapped read.
## 'exons' must be a GRanges object of length >= 2 representing the
## exons of a given transcript assumed to be already ordered by rank.
## Returns an integer vector of length 2:
## - If the ranges in 'gread' are compatible with 'exons', then
## returns the ranks of the first and last exons spanned by 'gread';
## - Otherwise returns 2 NAs.
gappedReadAndExonsCompatibility <- function(gread, exons)
{
seqname1 <- unique(seqnames(gread))
strand1 <- unique(strand(gread))
seqname2 <- unique(seqnames(exons))
strand2 <- unique(strand(exons))
if (length(seqname1) != 1L || length(strand1) != 1L
|| length(seqname2) != 1L || length(strand2) != 1L)
stop("trans-splicing is not supported, sorry")
if (strand1 == "*" || strand2 == "*")
stop("strand * is not supported (must be + or -)")
## 'gread' and 'exons' are not on the same chromosome/strand:
if (seqname1 != seqname2 || strand1 != strand2)
return(c(NA_integer_, NA_integer_))
if (length(gread) < 2L || length(exons) < 2L)
stop("'length(gread)' and 'length(exons)' must be >= 2")
if (strand1 == "-")
exons <- rev(exons)
## Determine rank of first exon that fits:
first_exon <- which(start(gread)[1L] >= start(exons) &
end(gread)[1L] == end(exons))
## Determine rank of last exon that fits
last_exon <- which(start(gread)[length(gread)] == start(exons) &
end(gread)[length(gread)] <= end(exons))
if (length(first_exon) > 1L)
stop("unsupported splicing: ",
"first range in 'gread' fits more than 1 exon")
if (length(last_exon) > 1L)
stop("unsupported splicing: ",
"last range in 'gread' fits more than 1 exon")
if (length(first_exon) == 0L || is.na(first_exon)
|| length(last_exon) == 0L || is.na(last_exon)
|| length(gread) != last_exon - first_exon + 1L)
return(c(NA_integer_, NA_integer_))
## Check exons between first and last exons:
middle_ranges <- seq_len(length(gread) - 2L) + 1L
middle_exons <- seq_len(length(gread) - 2L) + first_exon
if (any(start(gread)[middle_ranges] != start(exons)[middle_exons])
|| any(end(gread)[middle_ranges] != end(exons)[middle_exons]))
return(c(NA_integer_, NA_integer_))
ans <- c(first_exon, last_exon)
if (strand1 == "-")
ans <- length(exons) - rev(ans) + 1L
ans
}
Then with:
exons <- GRanges(
seqnames=rep("chr1", 6),
ranges=IRanges(
start=c(601,1001,1401,2001,2501,3001),
end=c(700,1090,1600,2100,2800,3050)),
strand="+"
)
gread <- GRanges(
seqnames=rep("chr1", 3),
ranges=IRanges(
start=c(1085,1401,2001),
end=c(1090,1600,2033)),
strand="+"
)
## Compatible and spans exons 2 to 4:
> gappedReadAndExonsCompatibility(gread, exons)
[1] 2 4
## Compatible and spans exons 1 to 3:
> gappedReadAndExonsCompatibility(gread, exons[-1])
[1] 1 3
## Not compatible:
> gappedReadAndExonsCompatibility(gread, exons[-3])
[1] NA NA
So after you've used findOverlaps() to find hits between your gapped
reads and your transcripts, you could do a second pass and filter those
hits using gappedReadAndExonsCompatibility(). The function is not
vectorized to work directly on GRangesList objects so you'll have
to use a loop. But since the space that you explore now is much
reduced (thanks to initial findOverlaps filtering), performance
should still be ok.
Let is know if that works for you (and I could add this utility function
to the GenomicFeatures package).
Cheers,
H.
On 09/13/2010 11:18 AM, Elizabeth Purdom wrote:
> Hello,
>
> I am using a TranscriptDb and trying to find overlaps with transcripts.
> For example, I have a gapped alignment and I want to see what
> transcripts it is compatible with. If txdb is my TranscriptDb, and gr is
> my gapped alignment as a GenomicRanges object, I can do findOverlaps to
> see if my read overlaps in any way overlaps with the individual exons of
> the transcript, but not whether it overlaps with the implied transcript.
> For example, if my gapped read overlaps exon 1,2,3 of the transcript, it
> can only be compatible if it overlaps in a particular way (it must
> contain the end of exon 1, the beginning of exon 3, and all of exon 2).
>
> Is there a way to check this? This is probably answered somewhere, but I
> can't seem to find it.
>
> Thanks,
> Elizabeth
>
> An example:
> > txdb <- loadFeatures(system.file("extdata",
> "UCSC_knownGene_sample.sqlite", package="GenomicFeatures"))
> > exByTx<-exonsBy(newtxdb$txdb,"tx")
> #this is compatible
> > grOk<-GRanges(seqnames =c("chr1", "chr1", "chr1"), ranges
> =IRanges(c(2000,2476,3084),c(2090,2584,3089)), strand =rep("*",3))
> #this is not
> > grNotOk<-GRanges(seqnames =c("chr1", "chr1", "chr1"),ranges =
> IRanges(c(2000,2500,3084),c(2090,2584,3089)),
> strand =rep("*",3))
> #both overlap the same set of transcripts, but the the second is not
> compatible with either transcript
> > findOverlaps(GRangesList(grOk,grNotOk),exByTx)
> An object of class "RangesMatching"
> Slot "matchMatrix":
> query subject
> [1,] 1 1
> [2,] 1 2
> [3,] 2 1
> [4,] 2 2
>
> Slot "DIM":
> [1] 2 135
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-sig-sequencing
mailing list