[R] using (g)sub for efficient string handling (was Re: transforming one column into 2 columns)

Sat Feb 2 19:40:03 CET 2008

That actually reminds me of a problem I had to tackle a while ago.

Say I have the following:

txt <- c("Variation_0001 // chr1:1083805-1283805 // Array CGH //  
15286789 // Iafrate et al. (2004) // CopyNumber /// Variation_5452 //  
chr1:1142956-1147823 // Computational mapping of resequencing  
traces // 16902084 // Mills et al. (2006) // CopyNumber",  
"Variation_4192 // chr1:2062347-2242269 // Array CGH // 17160897 //  
Wong et al. (2007) // CopyNumber /// Variation_4193 //  
chr1:2145626-2314237 // Array CGH // 17160897 // Wong et al. (2007) //  
CopyNumber /// Variation_8246 // chr1:2224111-3755284 // Affymetrix  
500K and 100K SNP Mapping Arrays // 17638019 // Zogopoulos et al.  
(2007) // CopyNumber", "Variation_8246 // chr1:2224111-3755284 //  
Affymetrix 500K and 100K SNP Mapping Arrays // 17638019 // Zogopoulos  
et al. (2007) // CopyNumber")

For each record, I'm interested in keeping the following:

results <- c("Variation_0001;Variation_5452",  
"Variation_4192;Variation_4193;Variation_8246", "Variation_8246")

My solution was:

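# Split each record on " // " or " /// ", keep only the fields that
# contain "Variation_", and join them with ";".
theNames <- function(tmp)
   sapply(strsplit(tmp, " /+ "),
          function(y)
          paste(y[grep("Variation_", y)],
                collapse=";"))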
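
As a quick check, here is a minimal sketch that applies the function to two abbreviated records. The vector name `short` is made up for illustration, and its fields are shortened versions of the records above, not the full data.

```r
# Same function as above, repeated so the sketch is self-contained
theNames <- function(tmp)
   sapply(strsplit(tmp, " /+ "),
          function(y)
          paste(y[grep("Variation_", y)],
                collapse=";"))

# Abbreviated records (fields shortened for illustration)
short <- c(paste0("Variation_0001 // chr1:1083805-1283805 // Array CGH /// ",
                  "Variation_5452 // chr1:1142956-1147823 // CopyNumber"),
           "Variation_8246 // chr1:2224111-3755284 // CopyNumber")

ids <- theNames(short)
print(ids)
```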

What I really wanted, though, was a regular expression that matches
everything except "Variation_\\d+", so that a single gsub() call could
do the job. For example, something like:

gsub( NOT "Variation_\\d+", ";", txt, perl=TRUE)

Suggestions?
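
One direction that may get close, offered as a sketch rather than a definitive answer: instead of negating the pattern, either extract the matches with gregexpr() and regmatches() (regmatches() is only available in more recent versions of R) and paste them together, or use a lazy match in gsub() followed by a cleanup sub(). The names `keepIDs1`, `keepIDs2` and `short` are made up for illustration, and the sample records are abbreviated.

```r
# Abbreviated records (fields shortened for illustration)
short <- c(paste0("Variation_0001 // chr1:1083805-1283805 // Array CGH /// ",
                  "Variation_5452 // chr1:1142956-1147823 // CopyNumber"),
           "Variation_8246 // chr1:2224111-3755284 // CopyNumber")

# Option 1: extract every match, then join the matches of each record
keepIDs1 <- function(x) {
  m <- gregexpr("Variation_\\d+", x, perl = TRUE)
  vapply(regmatches(x, m), paste, character(1), collapse = ";")
}

# Option 2: lazily skip ahead to each ID and keep only the ID plus ";",
# then drop the final ";" together with the unmatched tail.
# Assumes every record has at least one ID and no ";" anywhere else.
keepIDs2 <- function(x) {
  y <- gsub("(?s).*?(Variation_\\d+)", "\\1;", x, perl = TRUE)
  sub(";[^;]*$", "", y)
}

res1 <- keepIDs1(short)
res2 <- keepIDs2(short)
print(res1)
print(res2)
```

Option 1 does not depend on what the rest of the record contains, so it is probably the safer of the two.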

b

On Feb 2, 2008, at 1:03 PM, Peter Dalgaard wrote:

> Benilton Carvalho wrote:
>> help("strsplit")
>> b
>>
> Yes, but...
>
> The postprocessing gets a bit awkward. It might be easier to use
> sub() to get rid of the first/last bit of the string, i.e.
>
> C2 <- sub("^.*:", "",  Col)
> C1 <- sub(":.*$", "",  Col)
>
> An orthogonal idea is
>
> con <- textConnection(as.character(Col))
> read.table(con, sep=":")
> close(con)
>
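
Both ideas quoted above can be tried on the sample values from the original question. This is a minimal sketch, assuming Col is a plain character vector; the names `res1` and `res2` are made up for illustration, and the column names follow the question.

```r
# Sample values taken from the original question
Col <- c("chr1:71310034", "chr15:37759058", "chr22:18262638",
         "chrUn:31337214", "chr10_random:4369261", "chrUn:3545097")

# Idea 1: strip everything after / before the colon with sub()
Col_a <- sub(":.*$", "", Col)
Col_b <- as.numeric(sub("^.*:", "", Col))
res1 <- data.frame(Col_a, Col_b, stringsAsFactors = FALSE)

# Idea 2: let read.table() split each line on the colon
con <- textConnection(Col)
res2 <- read.table(con, sep = ":", col.names = c("Col_a", "Col_b"),
                   stringsAsFactors = FALSE)
close(con)

print(res1)
print(res2)
```

The sub() version keeps full control over the types of the new columns, while read.table() guesses them from the data.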
>> On Feb 2, 2008, at 12:43 PM, joseph wrote:
>>
>>> Hello
>>>
>>> I have a data frame and one of its columns is as follows:
>>>
>>> Col
>>> chr1:71310034
>>> chr15:37759058
>>> chr22:18262638
>>> chrUn:31337214
>>> chr10_random:4369261
>>> chrUn:3545097
>>>
>>> I would like to get rid of the colon (:) and replace this column
>>> with two new columns containing the terms on each side of the
>>> colon. The new columns should look as follows:
>>>
>>> Col_a          Col_b
>>> chr1           71310034
>>> chr14          23354088
>>> chr15          37759058
>>> chr22          18262638
>>> chrUn          31337214
>>> chr10_random   4369261
>>> chrUn          3545097
>>>
>>> Any help will be much appreciated
>>>
>>> Joseph
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>
>
> -- 
>  O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
> c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
> (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
>