[R] How to transpose it in a fast way?

Fri Mar 8 18:31:43 CET 2013

On Mar 8, 2013, at 6:01 AM, Jan van der Laan wrote:

> 
> You could use the fact that scan reads the data rowwise, and the fact that arrays are stored columnwise:
> 
> # generate a small example dataset
> exampl <- array(letters[1:25], dim=c(5,5))
> write.table(exampl, file="example.dat", row.names=FALSE, col.names=FALSE,
>    sep="\t", quote=FALSE)
> 

This might avoid creation of some of the intermediate copies:

MASS::write.matrix( matrix( scan("example.dat", what=character()), 5,5), file="fil.out")

I tested it up to a 5000 x 5000 file:

> exampl <- array(letters[1:25], dim=c(5000,5000))
> MASS::write.matrix( matrix( scan("example.dat", what=character()), 5000,5000), file="fil.out")
Read 25000000 items
> 

I did not time it exactly; probably 5-10 minutes. The exampl object takes 200,001,400 bytes, and the run did not noticeably stress my machine: most of my RAM remained untouched. I'm going out on errands and will time a 10K x 10K test case inside a system.time() call. scan() did report successfully reading 100000000 items fairly promptly.
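A timing run of that kind could be sketched as follows. This is only an illustration under stated assumptions: a small n stands in for the 5000 x 5000 and 10K x 10K cases, the file names are temporary placeholders, the genotype-like strings are made up, and the MASS package is assumed to be installed.

```r
# Sketch only: n, the file names and the test values are assumptions.
n <- 200
in_file  <- tempfile(fileext = ".dat")
out_file <- tempfile(fileext = ".out")

exampl <- array(sample(c("AA", "AC", "CT", "TC"), n * n, replace = TRUE),
                dim = c(n, n))
write.table(exampl, file = in_file, row.names = FALSE, col.names = FALSE,
            sep = "\t", quote = FALSE)

# scan() reads the file row-wise and matrix() fills column-wise,
# so the matrix handed to write.matrix() is already the transpose.
timing <- system.time(
  MASS::write.matrix(matrix(scan(in_file, what = character(), quiet = TRUE),
                            n, n),
                     file = out_file)
)
print(timing)

# Read the output back (byrow = TRUE recovers the file layout)
# and confirm that it is the transpose of the input.
result <- matrix(scan(out_file, what = character(), quiet = TRUE),
                 n, n, byrow = TRUE)
stopifnot(identical(result, t(exampl)))
```

The elapsed time will depend heavily on the machine and on n, so the small case says little about the 60000 x 60000 file beyond confirming that the round trip is correct.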

-- 
David.

> # and read...
> d <- scan("example.dat", what=character())
> d <- array(d, dim=c(5,5))
> 
> t(exampl) == d
> 
> 
> Although this is probably faster, it doesn't help with the large size. You could use the n option of scan to read chunks/blocks and feed those to, for example, an ff array (which you ideally have preallocated).
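The chunked approach Jan describes might look roughly like the sketch below. It is an illustration, not his code: a plain preallocated character matrix stands in for the ff array, the dimensions are assumed to be known in advance, and block_rows and the file name are hypothetical.

```r
# Sketch only: a plain matrix stands in for the preallocated ff array.
nr <- 6                      # rows in the input file (assumed known)
nc <- 4                      # columns in the input file (assumed known)
block_rows <- 2              # hypothetical block size, in input rows

in_file <- tempfile(fileext = ".dat")
exampl <- matrix(paste0("v", seq_len(nr * nc)), nr, nc)
write.table(exampl, file = in_file, row.names = FALSE, col.names = FALSE,
            sep = "\t", quote = FALSE)

transposed <- matrix(NA_character_, nc, nr)   # preallocated target
con <- file(in_file, open = "r")
row_start <- 1
while (row_start <= nr) {
  rows <- min(block_rows, nr - row_start + 1)
  # n caps the number of values read in this call; the open connection
  # keeps its position, so the next call continues where this one stopped.
  chunk <- scan(con, what = character(), n = rows * nc, quiet = TRUE)
  # Values arrive row-wise, so filling column-wise yields transposed columns.
  transposed[, row_start:(row_start + rows - 1)] <- matrix(chunk, nc, rows)
  row_start <- row_start + rows
}
close(con)

stopifnot(identical(transposed, t(exampl)))
```

Only one block of input rows is held as an intermediate at any time; the target itself still has to fit somewhere, which is where a disk-backed array would come in.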
> 
> HTH,
> 
> Jan
> 
> 
> 
> 
> peter dalgaard <pdalgd at gmail.com> schreef:
> 
>> On Mar 7, 2013, at 01:18 , Yao He wrote:
>> 
>>> Dear all:
>>> 
>>> I have a big data file of 60000 columns and 60000 rows like that:
>>> 
>>> AA AC AA AA .......AT
>>> CC CC CT CT.......TC
>>> ..........................
>>> .........................
>>> 
>>> I want to transpose it, and the output is a new file like this:
>>> AA CC ............
>>> AC CC............
>>> AA CT.............
>>> AA CT.........
>>> ....................
>>> ....................
>>> AT TC.............
>>> 
>>> The key point is that I can't read it into R with read.table() because the
>>> data is too large, so I tried this:
>>> c<-file("silygenotype.txt","r")
>>> geno_t<-list()
>>> repeat{
>>> line<-readLines(c,n=1)
>>> if (length(line)==0)break  #end of file
>>> line<-unlist(strsplit(line,"\t"))
>>> geno_t<-cbind(geno_t,line)
>>> }
>>> write.table(geno_t,"xxx.txt")
>>> 
>>> It works, but it is too slow. How can I optimize it?
>> 
>> 
>> As others have pointed out, that's a lot of data!
>> 
>> You seem to have the right idea: If you read the columns line by line there is nothing to transpose. A couple of points, though:
>> 
>> - The cbind() is a potential performance hit since it copies the list every time around. Preallocate with geno_t <- vector("list", 60000) and then assign
>> geno_t[[i]] <- <etc>
>> 
>> - You might use scan() instead of readLines, strsplit
>> 
>> - Perhaps consider the data type as you seem to be reading strings with 16 possible values (I suspect that R already optimizes string storage to make this point moot, though.)
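The first two points above can be sketched together. This is a minimal illustration under assumptions: the number of lines is taken as known (as the preallocation suggestion implies), the file names are temporary placeholders, and the small random genotype matrix is made up for the example.

```r
# Sketch only: sizes, file names and test data are assumptions.
nr <- 5                      # number of lines in the input (assumed known)
nc <- 3
in_file  <- tempfile(fileext = ".txt")
out_file <- tempfile(fileext = ".txt")

geno <- matrix(sample(c("AA", "AC", "CT", "TC"), nr * nc, replace = TRUE),
               nr, nc)
write.table(geno, file = in_file, row.names = FALSE, col.names = FALSE,
            sep = "\t", quote = FALSE)

geno_t <- vector("list", nr)          # allocated once, never regrown
con <- file(in_file, open = "r")
for (i in seq_len(nr)) {
  # scan() replaces the readLines() + strsplit() pair: one line per call
  geno_t[[i]] <- scan(con, what = character(), nlines = 1, quiet = TRUE)
}
close(con)

# Each list element is one input row, i.e. one column of the transpose,
# so a single cbind() at the end builds the result.
result <- do.call(cbind, geno_t)
write.table(result, file = out_file, row.names = FALSE, col.names = FALSE,
            sep = "\t", quote = FALSE)

stopifnot(identical(unname(result), t(geno)))
```

The single cbind() after the loop replaces the per-iteration cbind() of the original code, so the list is copied once rather than on every line.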
>> 
>> --
>> Peter Dalgaard, Professor
>> Center for Statistics, Copenhagen Business School
>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>> Phone: (+45)38153501
>> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 

David Winsemius
Alameda, CA, USA