[R] How to do the same thing for all levels of a column?
Bert Gunter
gunter.berton at gene.com
Tue Jul 24 18:43:24 CEST 2012
Sorry, typo in my previous message. With f <- myfile$Time_zero, it should be:
> sapply(myfile[,-c(1,2)],function(x)prop.table(tapply(f,x,sum)))
$X1
L R T
0.91491320 0.03675651 0.04833030
$X2
E M
0.9827278 0.0172722
$X3
N Y
0.0483303 0.9516697
$X4
I L Q
0.8976410 0.0850868 0.0172722
$X5
I V
0.9516697 0.0483303
$X6
P S
0.96324349 0.03675651
$X7
D E G
0.8976410 0.0540287 0.0483303
$X8
A C
0.9827278 0.0172722
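
A self-contained version of the call above, as a minimal sketch. It assumes that f is the Time_zero column (the thread does not show its definition, but this choice reproduces the X1 values printed above) and rebuilds the example data by hand. lapply is used instead of sapply only so that the result is always a list.

```r
# Example data from the thread, rebuilt by hand from the dput() output
myfile <- data.frame(
  Proteins  = c("p1", "p2", "p3", "p4"),
  Time_zero = c(0.0050723, 0.0002731, 9.76e-05, 0.0002077),
  X1 = c("L", "T", "L", "R"),
  X2 = c("E", "E", "M", "E"),
  X3 = c("Y", "N", "Y", "Y"),
  X4 = c("I", "L", "Q", "L"),
  X5 = c("I", "V", "I", "I"),
  X6 = c("P", "P", "P", "S"),
  X7 = c("D", "G", "E", "E"),
  X8 = c("A", "A", "C", "A"),
  stringsAsFactors = TRUE
)

# Assumption: `f` in the thread is the Time_zero column
f <- myfile$Time_zero

# For each residue column: sum Time_zero per residue, then rescale to sum to 1
weighted <- lapply(myfile[, -c(1, 2)],
                   function(x) prop.table(tapply(f, x, sum)))

weighted$X1
```

Each element of weighted is a named vector of Time_zero-weighted proportions for one position, so weighted$X1["L"] corresponds to ratio_L in the original question.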
On Tue, Jul 24, 2012 at 9:37 AM, Bert Gunter <bgunter at gene.com> wrote:
> OK, I admit it: I re-read what you wrote and now I'm confused. Is:
>
>> sapply(myfile[,-c(1,2)],function(x)prop.table(tapply(f,x)))
>
> X1 X2 X3 X4 X5 X6 X7 X8
> [1,] 0.1428571 0.2 0.2857143 0.125 0.2 0.2 0.125 0.2
> [2,] 0.4285714 0.2 0.1428571 0.250 0.4 0.2 0.375 0.2
> [3,] 0.1428571 0.4 0.2857143 0.375 0.2 0.2 0.250 0.4
> [4,] 0.2857143 0.2 0.2857143 0.250 0.2 0.4 0.250 0.2
>
> what you want?
>
> -- Bert
> On Tue, Jul 24, 2012 at 9:17 AM, Bert Gunter <bgunter at gene.com> wrote:
>> The OP's request is a bit ambiguous to me: at a given residue, do you
>> wish to calculate the proportions for only those amino acids that
>> appear at that residue, or do you wish to include the proportions for
>> all amino acids, some of which might then be 0?
>>
>> Assuming the former, then I don't think one needs to go to the lengths
>> described by John below.
>>
>> Using your example (thanks!), the following seems to suffice:
>>
>>> sapply(myfile[,-c(1,2)],function(x)prop.table(table(x)))
>>
>> $X1
>> x
>> L R T
>> 0.50 0.25 0.25
>>
>> $X2
>> x
>> E M
>> 0.75 0.25
>>
>> $X3
>> x
>> N Y
>> 0.25 0.75
>>
>> $X4
>> x
>> I L Q
>> 0.25 0.50 0.25
>>
>> $X5
>> x
>> I V
>> 0.75 0.25
>>
>> $X6
>> x
>> P S
>> 0.75 0.25
>>
>> $X7
>> x
>> D E G
>> 0.25 0.50 0.25
>>
>> $X8
>> x
>> A C
>> 0.75 0.25
>>
>>
>> This could, of course, then be modified to add zero proportions for
>> all non-appearing amino acids.
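
One way to make that modification, as a sketch: give every position the same set of factor levels before tabulating. The vector aa below is an assumption (the residues that occur anywhere in the example); substituting the 20 standard one-letter codes would work the same way. The example data are rebuilt by hand.

```r
# Example data from the thread, rebuilt by hand from the dput() output
myfile <- data.frame(
  Proteins  = c("p1", "p2", "p3", "p4"),
  Time_zero = c(0.0050723, 0.0002731, 9.76e-05, 0.0002077),
  X1 = c("L", "T", "L", "R"),
  X2 = c("E", "E", "M", "E"),
  X3 = c("Y", "N", "Y", "Y"),
  X4 = c("I", "L", "Q", "L"),
  X5 = c("I", "V", "I", "I"),
  X6 = c("P", "P", "P", "S"),
  X7 = c("D", "G", "E", "E"),
  X8 = c("A", "A", "C", "A"),
  stringsAsFactors = TRUE
)

# Assumption: the residues of interest are those seen anywhere in the table
aa <- sort(unique(unlist(lapply(myfile[, -c(1, 2)], as.character))))

# Re-level every column to the common set, so absent residues count as 0
props <- sapply(myfile[, -c(1, 2)],
                function(x) prop.table(table(factor(x, levels = aa))))

props[, "X1"]
```

Because every column now yields a vector of the same length, sapply returns a matrix with one row per residue and one column per position instead of a list.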
>>
>> -- Cheers,
>> Bert
>>
>> On Tue, Jul 24, 2012 at 8:18 AM, John Kane <jrkrideau at inbox.com> wrote:
>>>
>>> I think this does what you want using two packages, plyr and reshape2, that
>>> you may have to install. If so, install.packages(c("plyr", "reshape2")) should
>>> do the trick.
>>> library(plyr)
>>> library(reshape2)
>>> # using supplied data frame "myfile" from below
>>> time0total = sum(myfile[,2])
>>> mydata <- myfile[, 2:10]
>>> md1 <- melt(mydata, id = "Time_zero")
>>> ddply(md1, .(variable, value), summarise, sum = sum(Time_zero)/time0total)
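
For readers without plyr and reshape2, roughly the same table can be produced in base R. This is a sketch, not John's code: the long-format data frame long, the result res and its column prop are names made up here, and the example data are rebuilt by hand.

```r
# Example data from the thread, rebuilt by hand from the dput() output
myfile <- data.frame(
  Proteins  = c("p1", "p2", "p3", "p4"),
  Time_zero = c(0.0050723, 0.0002731, 9.76e-05, 0.0002077),
  X1 = c("L", "T", "L", "R"),
  X2 = c("E", "E", "M", "E"),
  X3 = c("Y", "N", "Y", "Y"),
  X4 = c("I", "L", "Q", "L"),
  X5 = c("I", "V", "I", "I"),
  X6 = c("P", "P", "P", "S"),
  X7 = c("D", "G", "E", "E"),
  X8 = c("A", "A", "C", "A"),
  stringsAsFactors = TRUE
)

pos <- names(myfile)[3:10]   # the residue columns X1..X8

# Long format: one row per (protein, position) pair, stacked column by column
long <- data.frame(
  Time_zero = rep(myfile$Time_zero, times = length(pos)),
  variable  = rep(pos, each = nrow(myfile)),
  value     = unlist(lapply(myfile[pos], as.character), use.names = FALSE)
)

# Sum Time_zero within each position/residue pair, then divide by the total
res <- aggregate(Time_zero ~ variable + value, data = long, FUN = sum)
res$prop <- res$Time_zero / sum(myfile$Time_zero)

res[order(res$variable, res$value), ]
```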
>>>
>>>
>>> John Kane
>>> Kingston ON Canada
>>>
>>> -----Original Message-----
>>> From: zj29 at cornell.edu
>>> Sent: Tue, 24 Jul 2012 10:25:21 -0400
>>> To: jrkrideau at inbox.com
>>> Subject: Re: [R] How to do the same thing for all levels of a column?
>>>
>>> Hi John,
>>> Thank you for the tips. My apologies about the unreadable sample data...
>>> So here is the output of the sample data, and hopefully it works this time
>>> :)
>>> myfile <- structure(list(Proteins = structure(1:4, .Label = c("p1", "p2",
>>> "p3", "p4"), class = "factor"), Time_zero = c(0.0050723, 0.0002731,
>>> 9.76e-05, 0.0002077), X1 = structure(c(1L, 3L, 1L, 2L), .Label = c("L",
>>> "R", "T"), class = "factor"), X2 = structure(c(1L, 1L, 2L, 1L
>>> ), .Label = c("E", "M"), class = "factor"), X3 = structure(c(2L,
>>> 1L, 2L, 2L), .Label = c("N", "Y"), class = "factor"), X4 = structure(c(1L,
>>> 2L, 3L, 2L), .Label = c("I", "L", "Q"), class = "factor"), X5 =
>>> structure(c(1L,
>>> 2L, 1L, 1L), .Label = c("I", "V"), class = "factor"), X6 = structure(c(1L,
>>> 1L, 1L, 2L), .Label = c("P", "S"), class = "factor"), X7 = structure(c(1L,
>>> 3L, 2L, 2L), .Label = c("D", "E", "G"), class = "factor"), X8 =
>>> structure(c(1L,
>>> 1L, 2L, 1L), .Label = c("A", "C"), class = "factor")), .Names =
>>> c("Proteins",
>>> "Time_zero", "X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8"), row.names =
>>> c(NA,
>>> 4L), class = "data.frame")
>>> And here is my original question:
>>> Basically, I have a bunch of protein sequences composed of different amino
>>> acid residues, and each residue is represented by an uppercase letter. I
>>> want to calculate the ratio of different amino acid residues at each
>>> position of the proteins.
>>>
>>> If I name this table as myfile.txt, I have the following scripts to
>>> calculate the ratio of each amino acid residue at position 1:
>>>
>>> # showing levels of the 3rd column, which means the types of residues
>>>
>>> >myfile[,3]
>>>
>>>
>>> # calculating the ratio of L
>>>
>>> >list=c(which(myfile[,3]=="L"))
>>>
>>> >time0total=sum(myfile[,2])
>>>
>>> >AA_L=0
>>>
>>> >for (i in 1:length(list)){AA_L=sum(myfile[list[[i]],2]+AA_L)}
>>>
>>> >ratio_L=AA_L/time0total
>>>
>>>
>>> So how can I write a script to do the same thing for the other two levels (T
>>> and R) in column 3, and also do this for every column that contains amino
>>> acid residues?
>>>
>>> Thanks a lot!
>>>
>>> Regards,
>>>
>>> Zhao
>>> 2012/7/24 John Kane <[1]jrkrideau at inbox.com>
>>>
>>> First thing is to supply the data in a useable format. As it is, it is
>>> essentially unreadable. All R-beginners do this. :)
>>> Have a look at the dput function (?dput) for a good way to supply sample
>>> data in an email.
>>> If you have a large dataset probably a few dozen lines of data would be
>>> fine.
>>> Something like dput(head(mydata)) should be fine. Just copy and paste the
>>> output into your email.
>>> Welcome to R. I think you will like it.
>>> John Kane
>>> Kingston ON Canada
>>>
>>> > -----Original Message-----
>>> > From: [2]zj29 at cornell.edu
>>> > Sent: Mon, 23 Jul 2012 18:01:11 -0400
>>> > To: [3]r-help at r-project.org
>>> > Subject: [R] How to do the same thing for all levels of a column?
>>> >
>>> > Dear all,
>>> >
>>> >
>>> >
>>> > I am an R beginner, and I am looking for a way to do the same thing for
>>> > all levels of a column in a table.
>>> >
>>> >
>>> >
>>> > Basically, I have a bunch of protein sequences composed of different
>>> > amino
>>> > acid residues, and each residue is represented by an uppercase letter. I
>>> > want to calculate the ratio of different amino acid residues at each
>>> > position of the proteins. Here is an example table:
>>> >
>>> > Proteins  Time_zero  1  2  3  4  5  6  7  8
>>> > p1        0.0050723  L  E  Y  I  I  P  D  A
>>> > p2        0.0002731  T  E  N  L  V  P  G  A
>>> > p3        9.757E-05  L  M  Y  Q  I  P  E  C
>>> > p4        0.0002077  R  E  Y  L  I  S  E  A
>>> >
>>> >
>>> > If I name this table as myfile.txt, I have the following scripts to
>>> > calculate the ratio of each amino acid residue at position 1:
>>> >
>>> > # showing levels of the 3rd column, which means the types of residues
>>> >
>>> > >myfile[,3]
>>> >
>>> >
>>> >
>>> > # calculating the ratio of L
>>> >
>>> > >list=c(which(myfile[,3]=="L"))
>>> >
>>> > >time0total=sum(myfile[,2])
>>> >
>>> > >AA_L=0
>>> >
>>> > >for (i in 1:length(list)){AA_L=sum(myfile[list[[i]],2]+AA_L)}
>>> >
>>> > >ratio_L=AA_L/time0total
>>> >
>>> >
>>> >
>>> > So how can I write a script to do the same thing for the other two levels
>>> > (T and R) in column 3, and also do this for every column that contains
>>> > amino acid residues?
>>> >
>>> >
>>> >
>>> > Many thanks for any help you could give me on this topic! :)
>>> >
>>> >
>>> >
>>> > Regards,
>>> >
>>> > Zhao
>>> > --
>>> > Zhao JIN
>>> > Ph.D. Candidate
>>> > Ruth Ley Lab
>>> > 467 Biotech
>>> > Field of Microbiology, Cornell University
>>> > Lab: 607.255.4954
>>> > Cell: 412.889.3675
>>> >
>>>
>>> > [[alternative HTML version deleted]]
>>> >
>>> > ______________________________________________
>>> > [4]R-help at r-project.org mailing list
>>> > [5]https://stat.ethz.ch/mailman/listinfo/r-help
>>> > PLEASE do read the posting guide
>>> > [6]http://www.R-project.org/posting-guide.html
>>> > and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>> References
>>>
>>> 1. mailto:jrkrideau at inbox.com
>>> 2. mailto:zj29 at cornell.edu
>>> 3. mailto:r-help at r-project.org
>>> 4. mailto:R-help at r-project.org
>>> 5. https://stat.ethz.ch/mailman/listinfo/r-help
>>> 6. http://www.R-project.org/posting-guide.html
>>> 7. http://www.inbox.com/marineaquarium
>>> 8. http://www.inbox.com/earth
>>> 9. http://www.inbox.com/earth
>>
>>
>>
>> --
>>
>> Bert Gunter
>> Genentech Nonclinical Biostatistics
>>
>> Internal Contact Info:
>> Phone: 467-7374
>> Website:
>> http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm