[R] How to do the same thing for all levels of a column?

Tue Jul 24 18:44:45 CEST 2012

... and I neglected to mention that f = myfile[,2]

Sigh....  More coffee needed.

-- Bert
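
Putting the correction and the definition of f together, a minimal self-contained sketch might look like the following. The data frame is rebuilt compactly from the dput() output quoted further down; the names residues and weighted_prop are illustrative and not from the thread, and lapply() is used instead of sapply() only because the result is a list either way (positions have different numbers of residues).

```r
# Sample data from the thread, rebuilt compactly (same values as the dput output).
myfile <- data.frame(
  Proteins  = factor(c("p1", "p2", "p3", "p4")),
  Time_zero = c(0.0050723, 0.0002731, 9.76e-05, 0.0002077),
  X1 = factor(c("L", "T", "L", "R")),
  X2 = factor(c("E", "E", "M", "E")),
  X3 = factor(c("Y", "N", "Y", "Y")),
  X4 = factor(c("I", "L", "Q", "L")),
  X5 = factor(c("I", "V", "I", "I")),
  X6 = factor(c("P", "P", "P", "S")),
  X7 = factor(c("D", "G", "E", "E")),
  X8 = factor(c("A", "A", "C", "A"))
)

f <- myfile[, 2]                # Time_zero weights, as clarified above
residues <- myfile[, -c(1, 2)]  # drop Proteins and Time_zero, keep X1..X8

# For each position: sum the weights per residue, then scale so they add to 1.
weighted_prop <- lapply(residues, function(x) prop.table(tapply(f, x, sum)))

weighted_prop$X1  # L is about 0.9149, matching the output quoted below
```

Each element of weighted_prop is named by the residues that occur at that position, so a single ratio can be looked up as, for example, weighted_prop$X1["L"].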

On Tue, Jul 24, 2012 at 9:43 AM, Bert Gunter <bgunter at gene.com> wrote:
> Sorry. Typo in my previous. Should be:
>
>> sapply(myfile[,-c(1,2)],function(x)prop.table(tapply(f,x,sum)))
> $X1
>          L          R          T
> 0.91491320 0.03675651 0.04833030
>
> $X2
>         E         M
> 0.9827278 0.0172722
>
> $X3
>         N         Y
> 0.0483303 0.9516697
>
> $X4
>         I         L         Q
> 0.8976410 0.0850868 0.0172722
>
> $X5
>         I         V
> 0.9516697 0.0483303
>
> $X6
>          P          S
> 0.96324349 0.03675651
>
> $X7
>         D         E         G
> 0.8976410 0.0540287 0.0483303
>
> $X8
>         A         C
> 0.9827278 0.0172722
>
>
>
> On Tue, Jul 24, 2012 at 9:37 AM, Bert Gunter <bgunter at gene.com> wrote:
>> OK, I admit it: I re-read what you wrote and now I'm confused. Is:
>>
>>> sapply(myfile[,-c(1,2)],function(x)prop.table(tapply(f,x)))
>>
>>             X1       X2        X3       X4     X5  X6    X7  X8
>> [1,] 0.1428571 0.2 0.2857143 0.125 0.2 0.2 0.125 0.2
>> [2,] 0.4285714 0.2 0.1428571 0.250 0.4 0.2 0.375 0.2
>> [3,] 0.1428571 0.4 0.2857143 0.375 0.2 0.2 0.250 0.4
>> [4,] 0.2857143 0.2 0.2857143 0.250 0.2 0.4 0.250 0.2
>>
>> what you want?
>>
>> -- Bert
>> On Tue, Jul 24, 2012 at 9:17 AM, Bert Gunter <bgunter at gene.com> wrote:
>>> The OP's request is a bit ambiguous to me: at a given residue, do you
>>> wish to calculate the proportions for only those amino acids that
>>> appear at that residue, or do you wish to include the proportions for
>>> all amino acids, some of which might then be 0?
>>>
>>> Assuming the former, then I don't think one needs to go to the lengths
>>> described by John below.
>>>
>>> Using your example (thanks!), the following seems to suffice:
>>>
>>>> sapply(myfile[,-c(1,2)],function(x)prop.table(table(x)))
>>>
>>> $X1
>>> x
>>>    L    R    T
>>> 0.50 0.25 0.25
>>>
>>> $X2
>>> x
>>>    E    M
>>> 0.75 0.25
>>>
>>> $X3
>>> x
>>>    N    Y
>>> 0.25 0.75
>>>
>>> $X4
>>> x
>>>    I    L    Q
>>> 0.25 0.50 0.25
>>>
>>> $X5
>>> x
>>>    I    V
>>> 0.75 0.25
>>>
>>> $X6
>>> x
>>>    P    S
>>> 0.75 0.25
>>>
>>> $X7
>>> x
>>>    D    E    G
>>> 0.25 0.50 0.25
>>>
>>> $X8
>>> x
>>>    A    C
>>> 0.75 0.25
>>>
>>>
>>> This could, of course, then be modified to add zero proportions for
>>> all non-appearing amino acids.
>>>
>>> -- Cheers,
>>> Bert
>>>
>>> On Tue, Jul 24, 2012 at 8:18 AM, John Kane <jrkrideau at inbox.com> wrote:
>>>>
>>>>    I think this does what you want using two packages, plyr and reshape2, that
>>>>    you may have to install.  If so, install.packages(c("plyr", "reshape2"))
>>>>    should do the trick.
>>>>    library(plyr)
>>>>    library(reshape2)
>>>>    # using the supplied data frame 'myfile' from below
>>>>    time0total = sum(myfile[,2])
>>>>    mydata  <-  myfile[, 2:10]
>>>>    md1  <-  melt(mydata, id = "Time_zero")
>>>>    ddply(md1, .(variable, value), summarise, sum = sum(Time_zero)/time0total)
>>>>
>>>>
>>>>    John Kane
>>>>    Kingston ON Canada
>>>>
>>>>    -----Original Message-----
>>>>    From: zj29 at cornell.edu
>>>>    Sent: Tue, 24 Jul 2012 10:25:21 -0400
>>>>    To: jrkrideau at inbox.com
>>>>    Subject: Re: [R] How to do the same thing for all levels of a column?
>>>>
>>>>    Hi John,
>>>>    Thank you for the tips. My apologies about the unreadable sample data...
>>>>    So here is the output of the sample data, and hopefully it works this time
>>>>    :)
>>>>    myfile <- structure(list(
>>>>      Proteins = structure(1:4, .Label = c("p1", "p2", "p3", "p4"),
>>>>        class = "factor"),
>>>>      Time_zero = c(0.0050723, 0.0002731, 9.76e-05, 0.0002077),
>>>>      X1 = structure(c(1L, 3L, 1L, 2L), .Label = c("L", "R", "T"), class = "factor"),
>>>>      X2 = structure(c(1L, 1L, 2L, 1L), .Label = c("E", "M"), class = "factor"),
>>>>      X3 = structure(c(2L, 1L, 2L, 2L), .Label = c("N", "Y"), class = "factor"),
>>>>      X4 = structure(c(1L, 2L, 3L, 2L), .Label = c("I", "L", "Q"), class = "factor"),
>>>>      X5 = structure(c(1L, 2L, 1L, 1L), .Label = c("I", "V"), class = "factor"),
>>>>      X6 = structure(c(1L, 1L, 1L, 2L), .Label = c("P", "S"), class = "factor"),
>>>>      X7 = structure(c(1L, 3L, 2L, 2L), .Label = c("D", "E", "G"), class = "factor"),
>>>>      X8 = structure(c(1L, 1L, 2L, 1L), .Label = c("A", "C"), class = "factor")),
>>>>      .Names = c("Proteins", "Time_zero", "X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8"),
>>>>      row.names = c(NA, 4L), class = "data.frame")
>>>>    And here is my original question:
>>>>    Basically, I have a bunch of protein sequences composed of different amino
>>>>    acid residues, and each residue is represented by an uppercase letter. I
>>>>    want  to  calculate the ratio of different amino acid residues at each
>>>>    position of the proteins.
>>>>
>>>>    If  I  name  this table as myfile.txt, I have the following scripts to
>>>>    calculate the ratio of each amino acid residue at position 1:
>>>>
>>>>    # showing levels of the 3rd column, which means the types of residues
>>>>
>>>>    >myfile[,3]
>>>>
>>>>
>>>>    # calculating the ratio of L
>>>>
>>>>    >list=c(which(myfile[,3]=="L"))
>>>>
>>>>    >time0total=sum(myfile[,2])
>>>>
>>>>    >AA_L=0
>>>>
>>>>    >for (i in 1:length(list)){AA_L=sum(myfile[list[[i]],2]+AA_L)}
>>>>
>>>>    >ratio_L=AA_L/time0total
>>>>
>>>>
>>>>    So how can I write a script to do the same thing for the other two levels (T
>>>>    and R) in column 3, and also do this for every column that contains amino
>>>>    acid residues?
>>>>
>>>>    Thanks a lot!
>>>>
>>>>    Regards,
>>>>
>>>>    Zhao
>>>>    2012/7/24 John Kane <[1]jrkrideau at inbox.com>
>>>>
>>>>      First thing is to supply the data in a usable format.  As is, it is
>>>>      essentially unreadable.  All R-beginners do this. :)
>>>>      Have a look at the dput function  (?dput) for a good way to supply sample
>>>>      data in an email.
>>>>      If you have a large dataset probably a few dozen lines of data would be
>>>>      fine.
>>>>      Something like dput(head(mydata)) should be fine.  Just copy and paste the
>>>>      output into your email.
>>>>      Welcome to R.  I think you will like it.
>>>>      John Kane
>>>>      Kingston ON Canada
>>>>
>>>>    > -----Original Message-----
>>>>    > From: [2]zj29 at cornell.edu
>>>>    > Sent: Mon, 23 Jul 2012 18:01:11 -0400
>>>>    > To: [3]r-help at r-project.org
>>>>    > Subject: [R] How to do the same thing for all levels of a column?
>>>>    >
>>>>    > Dear all,
>>>>    >
>>>>    >
>>>>    >
>>>>    > I am an R beginner, and I am looking for a way to do the same thing for
>>>>    > all levels of a column in a table.
>>>>    >
>>>>    > Basically, I have a bunch of protein sequences composed of different
>>>>    > amino acid residues, and each residue is represented by an uppercase
>>>>    > letter. I want to calculate the ratio of different amino acid residues
>>>>    > at each position of the proteins. Here is an example table:
>>>>    >
>>>>    > Proteins  Time_zero  1  2  3  4  5  6  7  8
>>>>    > p1        0.0050723  L  E  Y  I  I  P  D  A
>>>>    > p2        0.0002731  T  E  N  L  V  P  G  A
>>>>    > p3        9.757E-05  L  M  Y  Q  I  P  E  C
>>>>    > p4        0.0002077  R  E  Y  L  I  S  E  A
>>>>    >
>>>>    >
>>>>    > If I name this table as myfile.txt, I have the following scripts to
>>>>    > calculate the ratio of each amino acid residue at position 1:
>>>>    >
>>>>    > # showing levels of the 3rd column, which means the types of residues
>>>>    >
>>>>    > >myfile[,3]
>>>>    >
>>>>    >
>>>>    >
>>>>    > # calculating the ratio of L
>>>>    >
>>>>    > >list=c(which(myfile[,3]=="L"))
>>>>    >
>>>>    > >time0total=sum(myfile[,2])
>>>>    >
>>>>    > >AA_L=0
>>>>    >
>>>>    > >for (i in 1:length(list)){AA_L=sum(myfile[list[[i]],2]+AA_L)}
>>>>    >
>>>>    > >ratio_L=AA_L/time0total
>>>>    >
>>>>    >
>>>>    >
>>>>    > So how can I write a script to do the same thing for the other two levels
>>>>    > (T and R) in column 3, and also do this for every column that contains
>>>>    > amino acid residues?
>>>>    >
>>>>    >
>>>>    >
>>>>    > Many thanks for any help you could give me on this topic! :)
>>>>    >
>>>>    >
>>>>    >
>>>>    > Regards,
>>>>    >
>>>>    > Zhao
>>>>    > --
>>>>    > Zhao JIN
>>>>    > Ph.D. Candidate
>>>>    > Ruth Ley Lab
>>>>    > 467 Biotech
>>>>    > Field of Microbiology, Cornell University
>>>>    > Lab: 607.255.4954
>>>>    > Cell: 412.889.3675
>>>>    >
>>>>
>>>>      >       [[alternative HTML version deleted]]
>>>>      >
>>>>      > ______________________________________________
>>>>      > [4]R-help at r-project.org mailing list
>>>>      > [5]https://stat.ethz.ch/mailman/listinfo/r-help
>>>>      > PLEASE do read the posting guide
>>>>      > [6]http://www.R-project.org/posting-guide.html
>>>>      > and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>
>>>> References
>>>>
>>>>    1. mailto:jrkrideau at inbox.com
>>>>    2. mailto:zj29 at cornell.edu
>>>>    3. mailto:r-help at r-project.org
>>>>    4. mailto:R-help at r-project.org
>>>>    5. https://stat.ethz.ch/mailman/listinfo/r-help
>>>>    6. http://www.R-project.org/posting-guide.html
>>>>    7. http://www.inbox.com/marineaquarium
>>>>    8. http://www.inbox.com/earth
>>>>    9. http://www.inbox.com/earth
>>>
>>>
>>>
>>> --
>>>
>>> Bert Gunter
>>> Genentech Nonclinical Biostatistics
>>>
>>> Internal Contact Info:
>>> Phone: 467-7374
>>> Website:
>>> http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm

-- 

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
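
The first reply quoted above notes that the result could be modified to add zero proportions for all non-appearing amino acids. One possible way to do that is sketched below. It assumes the alphabet is the 20 standard one-letter amino-acid codes, which the thread does not state; the names aa and full_prop are illustrative only.

```r
# Sample data from the thread, rebuilt compactly (same values as the dput output).
myfile <- data.frame(
  Proteins  = factor(c("p1", "p2", "p3", "p4")),
  Time_zero = c(0.0050723, 0.0002731, 9.76e-05, 0.0002077),
  X1 = factor(c("L", "T", "L", "R")),
  X2 = factor(c("E", "E", "M", "E")),
  X3 = factor(c("Y", "N", "Y", "Y")),
  X4 = factor(c("I", "L", "Q", "L")),
  X5 = factor(c("I", "V", "I", "I")),
  X6 = factor(c("P", "P", "P", "S")),
  X7 = factor(c("D", "G", "E", "E")),
  X8 = factor(c("A", "A", "C", "A"))
)

f <- myfile[, 2]                # Time_zero weights
residues <- myfile[, -c(1, 2)]  # positions X1..X8

# Assumption: the 20 standard one-letter amino-acid codes.
aa <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]

full_prop <- sapply(residues, function(x) {
  s <- setNames(numeric(length(aa)), aa)    # start every residue at zero
  sums <- tapply(f, as.character(x), sum)   # weighted sums for residues present
  s[names(sums)] <- sums                    # fill in the observed ones
  s / sum(s)                                # scale to proportions
})

round(full_prop[c("L", "R", "T", "W"), "X1"], 4)
```

Because every position now has the same 20 entries, sapply() simplifies the result to a matrix with one row per amino acid and one column per position, which is easier to compare across positions than the list form.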