[BioC] illuminaHumanv4 mappings

Tue Sep 27 15:05:08 CEST 2011

Hi Mark,

Thanks for pointing out this issue, as it does deserve more
clarification. The RefSeq IDs used for the package do not come
directly from the Illumina manifest file. Rather, we have taken the
probe sequences and re-mapped them to the genome and transcriptome.
The RefSeq IDs that we assign during this re-mapping are the basis for
the set of standard mappings provided by the AnnotationDbi
infrastructure.

However, as far as I know, probes that map to multiple Entrez IDs are
automatically filtered out. You can use the toggleProbes function to
change the usual mapping so that it returns all values.

> allEGs = toggleProbes(illuminaHumanv4ENTREZID, "all")

> mget(ids, allEGs)
$ILMN_1651944
[1] NA

$ILMN_1807510
[1] NA

$ILMN_1696806
[1] "100528016" "1500"

$ILMN_1663159
[1] NA

$ILMN_2293511
[1] "100528016" "1500"

So two of the probes *do* have mappings, but they do not get mapped to
gene symbols because there is not one unique EntrezID.
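
To make the filtering concrete, here is a minimal base-R sketch that
mimics it on the Entrez IDs printed above. It is only an illustration
of the "one unique Entrez ID" rule, not the AnnotationDbi
implementation; the helper single_only is a hypothetical name, and the
values are copied by hand from the output above rather than read from
the package.

```r
# Entrez IDs per probe, copied from the toggleProbes(..., "all") output above
all_egs <- list(
  ILMN_1651944 = NA_character_,
  ILMN_1807510 = NA_character_,
  ILMN_1696806 = c("100528016", "1500"),
  ILMN_1663159 = NA_character_,
  ILMN_2293511 = c("100528016", "1500")
)

# Hypothetical helper: keep an ID only when it is the single non-NA value
single_only <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 1L) x else NA_character_
}

# What a "unique mappings only" view would return: NA for all five probes
default_view <- vapply(all_egs, single_only, character(1))

# Probes that were dropped because they hit more than one Entrez ID
is_multi <- vapply(all_egs, function(x) sum(!is.na(x)) > 1L, logical(1))
multi_mapped <- names(all_egs)[is_multi]
print(multi_mapped)
```

Here multi_mapped holds ILMN_1696806 and ILMN_2293511, the two probes
that have mappings but no single Entrez ID.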

Aside from the usual Bioconductor mappings, we have added other
information collected during our re-annotation to the package. Of most
interest here are the Probe Quality score and the Coding Zone.

> unlist(mget(ids, illuminaHumanv4PROBEQUALITY))
ILMN_1651944 ILMN_1807510 ILMN_1696806 ILMN_1663159 ILMN_2293511
       "Bad"   "No match"    "Perfect"        "Bad"    "Perfect"

> unlist(mget(ids, illuminaHumanv4CODINGZONE))
ILMN_1651944 ILMN_1807510 ILMN_1696806 ILMN_1663159 ILMN_2293511
  "Intronic"           NA      "5pUTR"   "Intronic"      "5pUTR"

So one probe doesn't match any part of the genome, two map to
introns, and the other two map uniquely to a genomic location, but at
the 5' end of a gene. We did do our own mapping to Gene Symbol
(independent of the mapping done by Bioconductor), which would
correctly assign these probes to CTNND1. However, these mappings are
not currently part of the released packages. We plan to include them
in the next release, though.
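
As a rough sketch of how the two extra annotations can be combined to
triage probes, the base-R snippet below tabulates the values shown
above and picks out the probes by quality and coding zone. The data
frame annot and its column names are assumptions made for
illustration; with the package loaded, the same vectors would come
from unlist(mget(ids, illuminaHumanv4PROBEQUALITY)) and
unlist(mget(ids, illuminaHumanv4CODINGZONE)).

```r
ids <- c("ILMN_1651944", "ILMN_1807510", "ILMN_1696806",
         "ILMN_1663159", "ILMN_2293511")

# Values copied by hand from the PROBEQUALITY and CODINGZONE output above
annot <- data.frame(
  probe       = ids,
  quality     = c("Bad", "No match", "Perfect", "Bad", "Perfect"),
  coding_zone = c("Intronic", NA, "5pUTR", "Intronic", "5pUTR"),
  stringsAsFactors = FALSE
)

# Probes whose sequence matched perfectly
perfect <- annot$probe[annot$quality == "Perfect"]

# Probes landing in introns; %in% treats the NA coding zone as FALSE
intronic <- annot$probe[annot$coding_zone %in% "Intronic"]

# Probes with no coding zone at all (no genomic match)
unmatched <- annot$probe[is.na(annot$coding_zone)]

print(table(annot$quality))
```

Filtering on quality before looking up gene-level annotation is one
possible design choice; which quality grades to keep depends on the
analysis.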

Best wishes,

Mark

On Thu, Sep 22, 2011 at 10:58 AM, Mark Cowley <m.cowley at garvan.org.au> wrote:
> Dear list,
> I've read the illuminaHumanv4.db.pdf, and it's not clear to me how the mappings are built. From the short package description, I thought the RefSeq IDs from the Illumina array manifest would be used, but according to the pdf manual, I think it's ACCNUM, but we're not told where the ACCNUM is derived from (from ?illuminaHumanv4ACCNUM: "For chip packages such as this, the ACCNUM mapping comes directly from the manufacturer.").
>
> I raise the question, since within the illuminaHumanv4SYMBOL table, there are no probes for the CTNND1 gene, whereas according to the manifest file, there are 5 probes that should map to that gene:
>
> from the manifest:
> $ grep -w CTNND1 HumanHT-12_V4_0_R2_15002873_B.txt | cut -f3,6,5,14
> #Search_Key     ILMN_Gene       RefSeq_ID       Symbol
> XM_943087.1     CTNND1  XM_943087.1     ILMN_1651944
> XM_937008.1     CTNND1  XM_937008.1     ILMN_1807510
> XM_943098.1     CTNND1  NM_001085458.1  ILMN_1696806
> XM_943098.1     CTNND1  XM_943098.1     ILMN_1663159
> NM_001331.1     CTNND1  NM_001331.1     ILMN_2293511
>
> # from the illuminaHumanv4.db package
> require(illuminaHumanv4.db)
>> ids <- c("ILMN_1651944", "ILMN_1807510", "ILMN_1696806", "ILMN_1663159", "ILMN_2293511")
>> unlist(mget(ids, illuminaHumanv4SYMBOL))
> ILMN_1651944 ILMN_1807510 ILMN_1696806 ILMN_1663159 ILMN_2293511
>          NA           NA           NA           NA           NA
>> unlist(mget(ids, illuminaHumanv4REFSEQ))
> ILMN_1651944 ILMN_1807510 ILMN_1696806 ILMN_1663159 ILMN_2293511
>          NA           NA           NA           NA           NA
> # why are there no REFSEQID's for these probes?
>
>> mget(ids, illuminaHumanv4ACCNUM)
> $ILMN_1651944
> [1] NA
> $ILMN_1807510
> [1] NA
> $ILMN_1696806
>  [1] "NM_001085458" "NM_001085459" "NM_001085460" "NM_001085461" "NM_001085462"
>  [6] "NM_001085463" "NM_001085464" "NM_001085465" "NM_001085466" "NM_001085467"
> [11] "NM_001085468" "NM_001085469" "NM_001331"    "NR_037646"
> $ILMN_1663159
> [1] NA
> $ILMN_2293511
>  [1] "NM_001085458" "NM_001085459" "NM_001085460" "NM_001085461" "NM_001085462"
>  [6] "NM_001085463" "NM_001085464" "NM_001085465" "NM_001085466" "NM_001085467"
> [11] "NM_001085468" "NM_001085469" "NM_001331"    "NR_037646"
>
> # all of these RefSeq ID's correspond to Entrez Gene ID 1500, CTNND1 catenin (cadherin-associated protein), delta 1 [ Homo sapiens ]
> # why do 3 probes not have an ACCNUM?
>
>
> If I BLAST all 5 probes, the 3 probes with NA in the ACCNUM (see above) all align to NG_029078.1 (=CTNND1), but not to NM_001331 (=CTNND1), and the 2 probes with lots of ACCNUM ID's align to both NG_029078.1 and NM_001331 amongst many others.
> mget(ids, illuminaHumanv4PROBESEQUENCE)
>>ILMN_1651944 -> NG_029078.1
> GAAGGACCCTCCCCCGCTTCATAGTTTATGAATGCGAGAGTTGGTAAGGG
>>ILMN_1807510 -> NG_029078.1
> CGGTCATTCTCTGCCATCCCTAGAAAGAATGTCCAATCCACTGCCTTTGT
>>ILMN_1696806 -> NG_029078.1, NM_001331, many others
> GACCATCCCAAAAAGGAAGTGCACCTTGGAGCCTGTGGAGCTCTCAAGAA
>>ILMN_1663159 -> NG_029078.1
> GCCTATTCTTTAGCCTCCATTCCTATCTGTATTGCATACTGTAACTCCAA
>>ILMN_2293511 -> NG_029078.1, NM_001331, many others
> ATCCAGACTTTGGGTCGTGATTTCCGCAAGAATGGCAATGGGGGACCTGG
>
>
>
> I'd really love to get to the bottom of this, as the R annotation packages are very rich, but missing IDs make it hard to know whether they're better than the manufacturer's manifest files.
>
> cheers,
> Mark
> -----------------------------------------------------
> Mark Cowley, PhD
>
> Pancreatic Cancer Program | Peter Wills Bioinformatics Centre
> Garvan Institute of Medical Research, Sydney, Australia
> -----------------------------------------------------
>
>
>> sessionInfo()
> R version 2.13.1 (2011-07-08)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] en_AU.UTF-8/en_AU.UTF-8/C/C/en_AU.UTF-8/en_AU.UTF-8
>
> attached base packages:
> [1] graphics  datasets  grDevices utils     grid      stats     methods
> [8] base
>
> other attached packages:
>  [1] illuminaHumanv4.db_1.10.0 org.Hs.eg.db_2.5.0
>  [3] RSQLite_0.9-4             DBI_0.2-5
>  [5] AnnotationDbi_1.14.1      limma_3.8.3
>  [7] mjcdev_1.0                Cairo_1.4-9
>  [9] metaGSEA_1.0.2            pwbc_1.0.3
> [11] lumidat_1.0.1             lumi_2.4.0
> [13] nleqslv_1.8.6             updateR_1.0.4
> [15] roxygen_0.1-3             digest_0.5.0
> [17] codetools_0.2-8           haselst_0.1
> [19] blat_0.1                  genomics_0.1
> [21] mjcbase_0.1               GEOquery_2.19.2
> [23] cor_0.1                   xtable_1.5-6
> [25] rgl_0.92.798              qvalue_1.26.0
> [27] igraph_0.5.5-2            graph_1.30.0
> [29] XML_3.4-2                 SparseM_0.89
> [31] Biobase_2.12.2            sos_1.3-1
> [33] brew_1.0-6                gplots_2.8.0
> [35] caTools_1.12              bitops_1.0-4.1
> [37] gdata_2.8.1               gtools_2.6.2
>
> loaded via a namespace (and not attached):
>  [1] affy_1.30.0           affyio_1.20.0         annotate_1.30.0
>  [4] hdrcde_2.15           KernSmooth_2.23-6     lattice_0.19-30
>  [7] MASS_7.3-13           Matrix_0.999375-50    methylumi_1.8.0
> [10] mgcv_1.7-6            nlme_3.1-101          preprocessCore_1.14.0
> [13] RCurl_1.6-7           tcltk_2.13.1          tools_2.13.1
>