[R] pairwise cross tabulation tables
    AndyZon 
    andyzon at gmail.com
       
    Fri Jan 11 23:32:00 CET 2008
    
    
  
Thanks a million, Chuck! I think I got it.
Have a good weekend!
Andy
Charles C. Berry wrote:
> 
> On Thu, 10 Jan 2008, AndyZon wrote:
> 
>>
>> Thank you so much, Chuck!
>>
>> This is brilliant, I just tried some dichotomous variables, it was really
>> fast.
> 
> 
> Yes, and if you are on a multicore system with multithreaded linear 
> algebra, crossprod() will distribute the job across the cores making the 
> elapsed time shorter (by almost half on my Core 2 Duo MacBook as long as I 
> have nothing else gobbling up CPU cycles)!
> 
>>
>> Most categorical variables I am interested in are 3 levels, they are
>> actually SNPs, I want to look at their interactions. My question is:
>> after
>> generating 0-1 codings, like 00, 01, 10, how should I use "crossprod()"?
>> Should I just apply this function on these 2*n columns (originally I have
>> n
>> variables), and then operate on the generated cell counts?
> 
> 
> If I followed you here, and you have ONLY those three categories, then 
> yes.
> 
> Try a test case with perhaps 3 SNPs and a few subjects. Table the results 
> the old fashioned way via table() or xtabs() or even by hand. Then look at 
> what crossprod( test.case } gives you.
> 
> ---
> 
> If '11' shows up you'll have to use a 'contr.treatment' style 
> approach. ( run 'example( contrasts )' and look at what is going on).
> 
> Guess what these give:
> 
>  	contrasts( factor( c( "00","01","10" ) ) )
>  	contrasts( factor( c( "00","01","10","11" ) ) )
> 
> then run them if you have trouble seeing why '11' changes the picture.
> 
> ---
> 
> BTW, what I said (below) suggests that crossprod() returns integer 
> values, but its storage.mode is actually "double".
> 
> HTH,
> 
> Chuck
> 
> I am confused
>> about this.
>>
>> Your input will be greatly appreciated.
>>
>> Andy
> 
> 
> 
> 
>>
>>
>>
>> Charles C. Berry wrote:
>>>
>>> On Wed, 9 Jan 2008, AndyZon wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> I have a huge number of categorical variables, say at least 10000, and
>>>> I
>>>> put
>>>> them into a matrix, each column is one variable. The question is: how
>>>> can
>>>> I
>>>> make all of the pairwise cross tabulation tables efficiently? The
>>>> straightforward solution is to use for-loops by looping two indexes on
>>>> the
>>>> table() function, but it was just too slow. Is there a more efficient
>>>> way
>>>> to
>>>> do that? Any guidance will be greatly appreciated.
>>>
>>> The totals are merely the crossproducts of a suitably constructed binary
>>> (zero-one) matrix is used to encode the categories. See
>>> '?contr.treatment'
>>> if you cannot grok 'suitably constructed'.
>>>
>>> If the categories are all dichotomies coded as 0:1, you can use
>>>
>>>  	res <- crossprod( dat )
>>>
>>> to find the totals for the (1,1) cells
>>>
>>> If you need the full tables, you can get them from the marginal totals
>>> using
>>>
>>>  	diag( res )
>>>
>>> to get the number in each '1' margin and
>>>
>>>  	nrow(dat)
>>>
>>> to get the table total from which the numbers in each '0' margin by
>>> subtracting the corresponding '1' margin.
>>>
>>> With dichotomous variables, dat has 10000 columns and you will only need
>>> 10000^2 integers or about 0.75 Gigabytes to store the 'res'. And it
>>> takes
>>> about 20 seconds to run 1000 rows on my MacBook. Of course, 'res' has a
>>> redundant triangle
>>>
>>> This approach generalizes to any number of categories:
>>>
>>> To extend this to more than two categories, you will need to do for each
>>> such column what model.matrix(~factor( dat[,i] ) ) does by default
>>> ( using 'contr.treatment' ) - construct zero-one codes for all but one
>>> (reference) category.
>>>
>>> Note that with 10000 trichotomies, you will have a result with
>>>
>>>  	10000^2 * ( 3-1 )^2
>>>
>>> integers needing about 3 Gigabytes, and so on.
>>>
>>> HTH,
>>>
>>> Chuck
>>>
>>> p.s. Why on Earth are you doing this????
>>>
>>>
>>>>
>>>> Andy
>>>> --
>>>> View this message in context:
>>>> http://www.nabble.com/pairwise-cross-tabulation-tables-tp14723520p14723520.html
>>>> Sent from the R help mailing list archive at Nabble.com.
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>> Charles C. Berry                            (858) 534-2098
>>>                                              Dept of Family/Preventive
>>> Medicine
>>> E mailto:cberry at tajo.ucsd.edu	            UC San Diego
>>> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego
>>> 92093-0901
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>
>> -- 
>> View this message in context:
>> http://www.nabble.com/pairwise-cross-tabulation-tables-tp14723520p14744086.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
> 
> Charles C. Berry                            (858) 534-2098
>                                              Dept of Family/Preventive
> Medicine
> E mailto:cberry at tajo.ucsd.edu	            UC San Diego
> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
> 
-- 
View this message in context: http://www.nabble.com/pairwise-cross-tabulation-tables-tp14723520p14766784.html
Sent from the R help mailing list archive at Nabble.com.
    
    
More information about the R-help
mailing list