[R] Help searching a matrix for only certain records
Matt Borkowski
mathias1979 at yahoo.com
Mon Mar 4 03:10:15 CET 2013
I appreciate all the feedback on this. I ended up using this line to solve my problem, just because I stumbled upon it first...
> alldata <- alldata[alldata$REC.TYPE == "SAO " | alldata$REC.TYPE == "FM-15",,drop=FALSE]
But I think Jim's solution would work equally well. I was a bit confused by the relative complexity of the data frames solution, as it seems to take more steps than necessary.
Thanks again for the input!
-Matt
Again, thanks for the feedback!
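
For reference, the two `==` comparisons joined by `|` in the subsetting line of this message can be collapsed into a single `%in%` test. The sketch below is only an illustration: the `toy` data frame and its values are invented, not taken from the real dataset.

```r
# Toy stand-in for the real data; all values are invented for illustration.
toy <- data.frame(
  REC.TYPE = c("SAO ", "FM-15", "OTHER", "SAO ", "FM-16"),
  VALUE    = c(1.5, 2.5, 3.5, 4.5, 5.5),
  stringsAsFactors = FALSE
)

# Form used in the message: two comparisons joined by |
keep_or <- toy[toy$REC.TYPE == "SAO " | toy$REC.TYPE == "FM-15", , drop = FALSE]

# Single membership test with %in%; easier to extend to more record types
wanted  <- c("SAO ", "FM-15")
keep_in <- toy[toy$REC.TYPE %in% wanted, , drop = FALSE]

# Both forms keep the same three rows on this toy data
stopifnot(identical(keep_or$REC.TYPE, keep_in$REC.TYPE))
```

One difference worth knowing: if REC.TYPE contains missing values, `==` produces NA in the index (which yields rows of NA), whereas `%in%` returns FALSE for them.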
--- On Sun, 3/3/13, arun <smartpink111 at yahoo.com> wrote:
> From: arun <smartpink111 at yahoo.com>
> Subject: Re: [R] Help searching a matrix for only certain records
> To: "Matt Borkowski" <mathias1979 at yahoo.com>
> Cc: "R help" <r-help at r-project.org>, "jim holtman" <jholtman at gmail.com>
> Date: Sunday, March 3, 2013, 1:29 PM
> Hi,
> You could also use ?data.table()
>
> n<- 300000
> set.seed(51)
> mat1<- as.matrix(data.frame(REC.TYPE=
> sample(c("SAO","FAO","FL-1","FL-2","FL-15"),n,replace=TRUE),Col2=rnorm(n),Col3=runif(n),stringsAsFactors=FALSE))
> dat1<- as.data.frame(mat1,stringsAsFactors=FALSE)
> table(mat1[,1])
> #
> # FAO FL-1 FL-15 FL-2 SAO
> #60046 60272 59669 59878 60135
> system.time(x1 <- subset(mat1, grepl("(SAO|FL-15)",
> mat1[, "REC.TYPE"])))
> #user system elapsed
> # 0.076 0.004 0.082
> system.time(x2 <- subset(mat1, mat1[, "REC.TYPE"] %in%
> c("SAO", "FL-15")))
> # user system elapsed
> # 0.028 0.000 0.030
>
> system.time(x3 <- mat1[match(mat1[, "REC.TYPE"], c("SAO", "FL-15"), nomatch = 0) != 0, , drop = FALSE])
> #user system elapsed
> # 0.028 0.000 0.028
> table(x3[,1])
> #
> #FL-15 SAO
> #59669 60135
>
>
> library(data.table)
>
> dat2<- data.table(dat1)
> system.time(x4<- dat2[match(REC.TYPE,c("SAO",
> "FL-15"),nomatch=0)!=0,,drop=FALSE])
> # user system elapsed
> #0.024 0.000 0.025
> table(x4$REC.TYPE)
>
> #FL-15 SAO
> #59669 60135
> A.K.
>
> ----- Original Message -----
> From: jim holtman <jholtman at gmail.com>
> To: Matt Borkowski <mathias1979 at yahoo.com>
> Cc: "r-help at r-project.org" <r-help at r-project.org>
> Sent: Sunday, March 3, 2013 11:52 AM
> Subject: Re: [R] Help searching a matrix for only certain records
>
> If you are using matrices, then here are several ways of doing it for a matrix with 300,000 rows. You can determine whether a difference of 0.1 seconds is important in terms of the performance you are after. It is taking you more time to type in the statements than it is taking them to execute:
>
> > n <- 300000
> > testdata <- matrix(
> + sample(c("SAO ", "FL-15", "Other"), n, TRUE, prob = c(1,2,1000))
> + , nrow = n
> + , dimnames = list(NULL, "REC.TYPE")
> + )
> > table(testdata[, "REC.TYPE"])
>
> FL-15 Other SAO
> 562 299151 287
> > system.time(x1 <- subset(testdata, grepl("(SAO |FL-15)", testdata[, "REC.TYPE"])))
> user system elapsed
> 0.17 0.00 0.17
> > system.time(x2 <- subset(testdata, testdata[, "REC.TYPE"] %in% c("SAO ", "FL-15")))
> user system elapsed
> 0.05 0.00 0.05
> > system.time(x3 <- testdata[match(testdata[, "REC.TYPE"]
> + , c("SAO ", "FL-15")
> + , nomatch = 0) != 0
> + ,, drop = FALSE]
> + )
> user system elapsed
> 0.03 0.00 0.03
> > identical(x1, x2)
> [1] TRUE
> > identical(x2, x3)
> [1] TRUE
> >
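
One caveat when comparing the approaches above: `grepl()` matches substrings, so it agrees with `%in%` only as long as no other record type happens to contain "SAO " or "FL-15". A small sketch with invented record types (the values "FL-150" and "XSAO " are hypothetical, used only to show the difference):

```r
# Invented record types; "FL-150" and "XSAO " are hypothetical values
# chosen because they contain the wanted strings as substrings.
rec <- c("SAO ", "FL-15", "FL-150", "XSAO ", "Other")

# Unanchored pattern: also matches values that merely contain a target
loose <- grepl("(SAO |FL-15)", rec)

# Anchored pattern: the whole value must equal one of the targets
anchored <- grepl("^(SAO |FL-15)$", rec)

# Exact membership test
exact <- rec %in% c("SAO ", "FL-15")

print(loose)     # TRUE TRUE TRUE TRUE FALSE
print(anchored)  # TRUE TRUE FALSE FALSE FALSE
print(exact)     # TRUE TRUE FALSE FALSE FALSE
```

If the column only ever holds the few known codes, the unanchored pattern is fine; otherwise `%in%` or the anchored pattern is the safer choice.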
>
>
> On Sun, Mar 3, 2013 at 11:22 AM, Jim Holtman <jholtman at gmail.com> wrote:
> > There are "more efficient" ways of doing many of the operations, but you probably won't see any difference unless you have very large objects (several hundred thousand entries) or have to do it many times. My background is in computer performance, and for the most part I have found that the easiest, most straightforward ways are fine most of the time.
> >
> > a more efficient way might be:
> >
> > testdata <- testdata[match(testdata$REC.TYPE, c('SAO ', 'FL-15'), nomatch = 0) != 0, ]
> >
> > you can always use 'system.time' to determine how long actions take.
> >
> > for multiple comparisons use %in%
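
A note on argument order when using `match()` for this kind of filtering: the first argument is what gets looked up, and the result has one entry per element of that first argument. The sketch below uses an invented vector of record types to show why the column has to come first when the goal is a per-row keep/drop decision.

```r
# Made-up column of record types (illustrative only).
rec    <- c("Other", "SAO ", "FL-15", "SAO ", "Other", "FL-15")
wanted <- c("SAO ", "FL-15")

# match(wanted, rec): one result per wanted value, giving the position of
# its FIRST occurrence in rec, so indexing with it keeps at most two rows.
first_hits <- match(wanted, rec)

# match(rec, wanted): one result per row; 0 where the row is not wanted.
keep <- match(rec, wanted, nomatch = 0) != 0

# %in% is a readable shorthand for the same per-row test.
same <- identical(keep, rec %in% wanted)

print(first_hits)   # 2 3
print(which(keep))  # 2 3 4 6
print(same)         # TRUE
```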
> >
> > Sent from my iPad
> >
> > On Mar 3, 2013, at 9:22, Matt Borkowski <mathias1979 at yahoo.com> wrote:
> >
> >> Thank you for your response Jim! I will give this one a try! But a couple of follow-up questions...
> >>
> >> In my search for a solution, I had seen something stating match() is much more efficient than subset() and will cut down significantly on computing time. Is there any truth to that?
> >>
> >> Also, I found the following solution which works for matching a single condition, but I couldn't quite figure out how to modify it to search for both my acceptable conditions...
> >>
> >>> testdata <- testdata[testdata$REC.TYPE == "SAO",,drop=FALSE]
> >>
> >> -Matt
> >>
> >>
> >>
> >>
> >> --- On Sun, 3/3/13, jim holtman <jholtman at gmail.com> wrote:
> >>
> >> From: jim holtman <jholtman at gmail.com>
> >> Subject: Re: [R] Help searching a matrix for only certain records
> >> To: "Matt Borkowski" <mathias1979 at yahoo.com>
> >> Cc: r-help at r-project.org
> >> Date: Sunday, March 3, 2013, 8:00 AM
> >>
> >> Try this:
> >>
> >> dataset <- subset(dataset, grepl("(SAO |FL-15)", REC.TYPE))
> >>
> >>
> >> On Sun, Mar 3, 2013 at 1:11 AM, Matt Borkowski <mathias1979 at yahoo.com> wrote:
> >>> Let me start by saying I am rather new to R and generally consider myself to be a novice programmer...so don't assume I know what I'm doing :)
> >>>
> >>> I have a large matrix, approximately 300,000 x 14. It's essentially a 20-year dataset of 15-minute data. However, I only need the rows where the column I've named REC.TYPE contains the string "SAO " or "FL-15".
> >>>
> >>> My horribly inefficient solution was to search the matrix row by row, test the REC.TYPE column and essentially delete the row if it did not match my criteria. Essentially...
> >>>
> >>>> j <- 1
> >>>> for (i in 1:nrow(dataset)) {
> >>>>   if (dataset$REC.TYPE[j] != "SAO " && dataset$REC.TYPE[j] != "FL-15") {
> >>>>     dataset <- dataset[-j,]
> >>>>   } else {
> >>>>     j <- j+1
> >>>>   }
> >>>> }
> >>>
> >>> After watching my code get through only about 10% of the matrix in an hour and slowing with every row...I figure there must be a more efficient way of pulling out only the records I need...especially when I need to repeat this for another 8 datasets.
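
A likely reason the row-by-row loop is so slow is that every `dataset[-j,]` call builds a new copy of all remaining rows, so the total work grows roughly with the square of the row count. Computing the keep/drop decision for all rows at once and subsetting a single time avoids this. The sketch below compares the two on a small invented dataset (1,000 rows rather than 300,000; the column `VALUE` is hypothetical).

```r
# Small invented dataset standing in for the real 300,000-row one.
set.seed(1)
dataset <- data.frame(
  REC.TYPE = sample(c("SAO ", "FL-15", "Other"), 1000, replace = TRUE),
  VALUE    = rnorm(1000),
  stringsAsFactors = FALSE
)

# Row-by-row deletion, as in the loop above: each slow[-j, ] call
# creates a fresh copy of the remaining rows.
slow <- dataset
j <- 1
for (i in seq_len(nrow(dataset))) {
  if (slow$REC.TYPE[j] != "SAO " && slow$REC.TYPE[j] != "FL-15") {
    slow <- slow[-j, , drop = FALSE]
  } else {
    j <- j + 1
  }
}

# Vectorised version: decide for all rows at once, then subset one time.
keep <- dataset$REC.TYPE %in% c("SAO ", "FL-15")
fast <- dataset[keep, , drop = FALSE]

# Both approaches keep the same rows in the same order.
stopifnot(identical(slow$REC.TYPE, fast$REC.TYPE))
stopifnot(identical(slow$VALUE, fast$VALUE))
```

Wrapping each version in `system.time()` on the real data should make the difference visible.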
> >>>
> >>> Can anyone point me in the right direction?
> >>>
> >>> Thanks!
> >>>
> >>> Matt
> >>>
> >>> ______________________________________________
> >>> R-help at r-project.org
> mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained,
> reproducible code.
> >>
> >>
> >>
> >> --
> >> Jim Holtman
> >> Data Munger Guru
> >>
> >> What is the problem that you are trying to solve?
> >> Tell me what you want to do, not how you want to do
> it.
> >>
>
>
>
> --
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
> Tell me what you want to do, not how you want to do it.
>