[BioC] get chr position for a batch of human SNPs

Fri Sep 23 16:21:38 CEST 2011

Dear Herve,

Thank you so much for your explanation. It is extremely helpful for me
to understand the difference between different versions of dbSNP
database.

I will also read the main page of the most recent SNPlocs packages,
such as SNPlocs.Hsapiens.dbSNP.20100427 to get more idea
of how the SNPs were curated.

Thanks again for your information.
Shirley

2011/9/23 Hervé Pagès <hpages at fhcrc.org>:
> Hi Shirley,
>
> The following page explains the difference between Build 36.1
> and Build 36.3:
>
>  http://www.ncbi.nlm.nih.gov/genome/guide/human/release_notes.html
>
> In short, the *reference* genome assembly is the same for builds 36.1,
> 36.2 and 36.3. In those 3 builds, the reference assembly corresponds
> to UCSC hg18.
>
> Note that what NCBI calls a "build" can contain other assemblies in
> addition to the reference assembly. For example builds 36.1 and 36.2
> contain only the reference assembly but Build 36.3 also contains
> the HuRef and Celera assemblies.
>
> All the SNP positions relative to the HuRef or Celera assemblies were
> ignored when the SNPlocs.Hsapiens.dbSNP.20090506 package was made.
> Only the positions relative to the "reference assembly" were used
> (i.e. positions relative to hg18).
>
> I've tried to refine a little bit the SNPlocs packages over the
> time, and, in the more recent ones (starting with
> SNPlocs.Hsapiens.dbSNP.20100427), the man page provides some
> details on how the SNPs were curated before inclusion in the
> package. In particular SNPs mapped to more than 1 location on
> the reference genome were dropped.
>
> Hope this clarifies,
>
> H.
>
>
> On 11-09-22 05:48 PM, shirley zhang wrote:
>>
>> Dear Herve,
>>
>> Thanks for your quick reply.
>>
>> I've checked the link you suggested. The SNPs in
>> "SNPlocs.Hsapiens.dbSNP.20090506" do map the hg18 genome (NCBI Build
>> 36.1). But when I manually check the hg18 positions for a few SNPs on
>> dbSNP at NCBI. I can only get position information for NCBI Build 36.3
>> and NCBI Build 37. What is the difference between NCBI Build 36.3 and
>> NCBI Build 36.1? Are they basically the same? Further, in dbSNP at
>> NCBI, even for NCBI Build 36.3, there are 3 versions of the hg18
>> position per SNP (each version has a different Group Label, i.e.
>> "reference", "Celera" or "HuRef"). Could you kindly tell me what's the
>> difference among these 3 versions?
>>
>> For my 2nd question, getting SNP position for SNPs without knowing the
>> chr information, since I need to map SNPids to hg18 positions very
>> often in my research, after all SNPs from all of chromosomes  were
>> retrieved using "SNPlocs.Hsapiens.dbSNP.20090506",  I combined them
>> into one big data frame and saved it on the server. But later on if I
>> need to map SNPids to hg19, I will follow your suggestions by creating
>> a big GRanges object using more recent SNPlocs packages.
>>
>> Thanks a lot for your detailed explanation. It is very helpful.
>>
>> Shirley
>>
>> 2011/9/22 Hervé Pagès<hpages at fhcrc.org>:
>>>
>>> Hi Shirley,
>>>
>>> On 11-09-22 11:29 AM, shirley zhang wrote:
>>>>
>>>> Dear All,
>>>>
>>>> I am planing to map the SNPids to hg18 positions (chr and position)
>>>> for a huge list of human snps. I've tried the package
>>>> "SNPlocs.Hsapiens.dbSNP.20090506" and have 2 questions regarding this
>>>> package:
>>>>
>>>> 1. Do the SNPs in this package map the hg18 genome (NCBI Build 36.3
>>>> with Group Label "reference" instead of "Celera" or "HuRef"?
>>>
>>> Yes, they are mapped to hg18. See:
>>>
>>>
>>>
>>> http://bioconductor.org/packages/release/data/annotation/html/SNPlocs.Hsapiens.dbSNP.20090506.html
>>>
>>> and the man page of the package for additional details:
>>>
>>>  >  library(SNPlocs.Hsapiens.dbSNP.20090506)
>>>  >  ?SNPlocs.Hsapiens.dbSNP.20090506
>>>
>>>>
>>>> 2. If I don't know the chr information (seqname), can I obtain the
>>>> position with dbSNP Id only?
>>>
>>> Unfortunately, because SNPs are stored in one data frame per
>>> chromosome, if you don't know the chr then you need to load and
>>> query each data frame individually.
>>>
>>> With more recent SNPlocs packages (e.g.
>>> SNPlocs.Hsapiens.dbSNP.20100427), provision was added to
>>> let the user load SNPs from more than 1 chromosome in a single
>>> GRanges object, so you can do something like:
>>>
>>>  ## Load all the SNPs in a big GRanges object (takes about
>>>  ## 13 minutes and requires 6GB of RAM!):
>>>  all_snps<- getSNPlocs(names(getSNPcount()), as.GRanges=TRUE)
>>>
>>>  ## Use the rs ids to set the names (takes about 6 minutes):
>>>  names(all_snps)<- paste("rs", elementMetadata(all_snps)$RefSNP_id,
>>>                           sep="")
>>>
>>>  ## Then extract your SNPs from the big GRanges object (again,
>>>  ## this can take a long time, depending on how many SNPs you
>>>  ## extract):
>>>  my_rs_ids<- sample(names(all_snps), 1000)
>>>  my_snps<- all_snps[my_rs_ids]
>>>
>>> However, please note that, starting with
>>> SNPlocs.Hsapiens.dbSNP.20100427 (i.e. dbSNP Build 131),
>>> SNPs are mapped to GRCh37 (UCSC hg19) instead of hg18.
>>>
>>> Hope this helps,
>>> H.
>>>
>>>>
>>>> Further, I find dbSNP batch queries a little more difficult to work
>>>> with because they map to different versions of the hg18 like Celera,
>>>> HumanRef, etc.Can anybody let me know a better option to get hg18 chr
>>>> position with the most popular or confident version of dbSNP?
>>>>
>>>> Thanks in advance
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>>>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>

-- 
Xiaoling (Shirley) Zhang

M.D., Ph.D. (Bioinformatics)
Boston University, Boston, MA
Tel: (857) 233-9862
Email: zhangxl at bu.edu