Skip to content

Instantly share code, notes, and snippets.

@pedroj
Created February 21, 2012 00:01
Show Gist options
  • Select an option

  • Save pedroj/1872426 to your computer and use it in GitHub Desktop.

Select an option

Save pedroj/1872426 to your computer and use it in GitHub Desktop.
Table manipulation functions from a post by Campbell Webb
*******************************************************************
*** Solutions to a question I posted on the R-news mailing list ***
*** about simple table manipulation ***
*** Campbell Webb post in R-help ***
*******************************************************************
THE QUESTION:
Hi all,
I am fairly new to S-plus (actually R, on linux), and am continually
frustrated at how hard it seems to be to do the most basic table/matrix
manipulations. So, if anyone has some spare time for a challenge, I pose
the following problem...
Of several canned stats packages, I have found the table manipulation
options in JMP to be the most helpful/intuitive. The options there
are Group, Split, Stack, Join, Subset, Transpose, Sort. What would be
the simplest S-plus/R code to do the same? Some of these are quite easy
but I leave them in for completeness. A full solution to this will be an
invaluable resource - apologies/request for directions if it already
exists!
In detail, with examples:
1) GROUP: Turn this table(A):
A a 1
A a 3
A a 1
A b 1
A b 2
B c 2
B d 1
into this (B):
Group1 Group2 Nrows SumOfCol3
A a 3 5
A b 2 3
B c 1 2
B d 1 1
2) SPLIT: Turn col3 of the second table (B) into this (C):
a b c d
A 3 2 . .
B . . 1 1
3) STACK: Turn the above table (C) into this (D):
A a 3
A b 2
A c .
A d .
B a .
B b .
B c 1
B d 1
4) JOIN these tables (E, F):
A 1
B 2
C 3
A a
B a
B a
B b
C c
A b
to give (G):
A a 1
B a 2
B a 2
B b 2
C c 3
A b 1
5) SUBSET Table G to give H:
A a 1
A b 1
6) TRANSPOSE H to give:
A A
a b
1 1
7) and finally SORT, by up to 3 keys.
Solutions referring to data frames would be most helpful. A big thanks to
anyone who responds!
Cam
This message was distributed by [email protected].
THE ANSWERS:
1111111111111111111111111111111111111111111111111111111111111111111111111111111
********** From: Bill Dunlap <[email protected]> **********
> 1) GROUP: Turn this table(A):
>
> A a 1
> A a 3
> A a 1
> A b 1
> A b 2
> B c 2
> B d 1
>
> into this (B):
>
> Group1 Group2 Nrows SumOfCol3
> A a 3 5
> A b 2 3
> B c 1 2
> B d 1 1
You can use aggregate(), but you only get 1 new column for each call
to aggregate(). E.g.,
> aggregate(A$Number, by=A[,c("Group1","Group2")], FUN=sum)
Group1 Group2 x
1 A a 5
2 A b 3
3 B c 2
4 B d 1
> aggregate(A$Number, by=A[,c("Group1","Group2")], FUN=length)
Group1 Group2 x
1 A a 3
2 A b 2
3 B c 1
4 B d 1
Use, e.g., cbind() to paste together the columns of interest:
> B <- aggregate(A$Number, by=A[,c("Group1","Group2")], FUN=length)
> names(B)[3] <- "Nrows"
> B <- cbind(B,SumOfCol3=aggregate(A$Number, by=A[,c("Group1","Group2")], FUN=sum)$x)
> B
Group1 Group2 Nrows SumOfCol3
1 A a 3 5
2 A b 2 3
3 B c 1 2
4 B d 1 1
aggregate() is a pretty general purpose function and there exist versions
which are optimized for FUN=sum or length in slightly restricted circumstances.
They are much faster when you are dealing with a lot of categories.
Check in the S-news archives for aggsum or agglength.
********** From: Ed Kademan <[email protected]> **********
## I am doing the following in R.
A <- data.frame(
c1 = c('A', 'A', 'A', 'A', 'A', 'B', 'B'),
c2 = c('a', 'a', 'a', 'b', 'b', 'c', 'd'),
c3 = c(1, 3, 1, 1, 2, 2, 1))
problem1 <-
function(A)
{
nRows <- as.vector(table(A[1:2]))
sumOfCol3 <- as.vector(tapply(A[[3]], A[1:2], sum))
groups <- expand.grid(lapply(A[1:2], levels))
dimnames(groups)[[2]][1:2] <- c('Group1', 'Group2')
cbind(groups, nRows, sumOfCol3)[nRows > 0,]
}
B <- problem1(A)
********** From: Frank Harrell <[email protected]> **********
Some of what you want is worked out in
http://hesweb1.med.virginia.edu/biostat/s/doc/splus.pdf
section 4.2
- My libraries will be available in R in a few weeks
********** From: "Patrick Connolly" <[email protected]> **********
Check out aggregate(). It'll do most of what you want.
********** From: [email protected] **********
2 calls to aggregate() do this
2222222222222222222222222222222222222222222222222222222222222222222222222222222
********** From: Bill Dunlap <[email protected]> **********
> 2) SPLIT: Turn col3 of the second table (B) into this (C):
>
> a b c d
> A 3 2 . .
> B . . 1 1
The fastest way to get this result is to start with the raw data A:
> table(A$Group1, A$Group2)
a b c d
A 3 2 0 0
B 0 0 1 1
Do you really want missing values (NA's in Splus) where table puts 0's?
Do you really want to be able to turn B directly into C?
********** From: Ed Kademan <[email protected]> **********
## I am not turning B into C as you specified, but maybe this is ok.
C <- table(A[1:2])
********** From: [email protected] **********
see my attached function table.trans (originally Richard Heiberger's code)
table.trans(df[,3],df[1:2])
> table.trans
function(X, INDICES, ...)
{
len.ind <- length(INDICES)
IN1 <- interaction(INDICES[ - len.ind], drop = T)
IN2 <- INDICES[[len.ind]]
a <- tapply(X, data.frame(IN1, IN2), c, ...)
cbind(INDICES[seq(1, nrow(INDICES), length(levels(IN2))), -
len.ind], a)
}
HELP FILE:
filename="table_trans.d"
.BG
.FN table.trans
.TL
Transpose a data frame with 'by' variables
.DN
Transposes a data frame with SAS-style 'by' variables
.CS
table.trans(X, INDICES, sep="#")
.RA
.AG X
The single data vector to transpose
.AG INDICES
All columns except the last are the 'by' variables.
The last column is a factor whose levels are the names
of the transposed columns.
.OA
.AG sep
Delimiter used internally. Shouldn't occur as a character in any of the 'by' variables
.RT
A data frame with columns corresponding to 'by' variables
and extra columns corresponding to levels of last column
of INDICES whose value comes from X
.SE
none
.DT
Transposes all values for vector X into columns with names given by the
levels of the last column of INDICES. Missing entries are denoted by NA.
Usually the 'by' variables in INDICES will be factors or discrete values.
.br
The inverse of this function is `reshape', which folds columns into rows.
.SH REFERENCES
Adapted from code by Richard M Heiberger in s-news 22 Mar 1996
.SA
`reshape', as.data.frame.array
.EX
> names(dimnames(iris)) <- c("flower", "measurement", "species")
> iris1 <- as.data.frame.array(iris, row.dims = 1:3)
> iris1[1:10, ]
x.1 flower measurement species
1 5.1 1 Sepal L. Setosa
2 4.9 2 Sepal L. Setosa
3 4.7 3 Sepal L. Setosa
4 4.6 4 Sepal L. Setosa
5 5.0 5 Sepal L. Setosa
6 5.4 6 Sepal L. Setosa
7 4.6 7 Sepal L. Setosa
8 5.0 8 Sepal L. Setosa
9 4.4 9 Sepal L. Setosa
10 4.9 10 Sepal L. Setosa
> iris.meas <- table.trans(iris1[, 1], iris1[, c(2, 4, 3)])
> iris.meas[1:10, ]
flower species Sepal.L. Sepal.W. Petal.L. Petal.W.
1 1 Setosa 5.1 3.5 1.4 0.2
2 1 Versicolor 7.0 3.2 4.7 1.4
3 1 Virginica 6.3 3.3 6.0 2.5
4 2 Setosa 4.9 3.0 1.4 0.2
5 2 Versicolor 6.4 3.2 4.5 1.5
6 2 Virginica 5.8 2.7 5.1 1.9
7 3 Setosa 4.7 3.2 1.3 0.2
8 3 Versicolor 6.9 3.1 4.9 1.5
9 3 Virginica 7.1 3.0 5.9 2.1
10 4 Setosa 4.6 3.1 1.5 0.2
> samp <- sample(nrow(iris1), 20)
> iris.meas <- table.trans(iris1[ - samp, 1], iris1[ - samp, c(2, 3, 4)])
> iris.meas[1:10, ]
flower measurement Setosa Versicolor Virginica
1 1 Sepal L. 5.1 7.0 6.3
2 1 Sepal W. 3.5 3.2 3.3
3 1 Petal L. 1.4 4.7 6.0
4 1 Petal W. 0.2 1.4 2.5
5 2 Sepal L. 4.9 6.4 5.8
6 2 Sepal W. 3.0 3.2 2.7
7 2 Petal L. 1.4 4.5 5.1
8 2 Petal W. 0.2 1.5 1.9
9 3 Sepal L. 4.7 6.9 7.1
10 3 Sepal W. 3.2 3.1 3.0
.KW misc
.WR
3333333333333333333333333333333333333333333333333333333333333333333333333333333
********** From: Bill Dunlap <[email protected]> **********
> 3) STACK: Turn the above table (C) into this (D):
>
> A a 3
> A b 2
> A c .
> A d .
> B a .
> B b .
> B c 1
> B d 1
I think an equivalent result is
> as.data.frame.array(table(A$Group1, A$Group2), row.dims=1:2)
X Dim1 Dim2
1 3 A a
2 0 B a
3 2 A b
4 0 B b
5 0 A c
6 1 B c
7 0 A d
8 1 B d
See the sorting section on how to reorder it to your taste.
The GUI has a pretty flexible stacking function.
********** From: Ed Kademan <[email protected]> **********
problem3 <- function(C)
{
C <- t(C)
cbind(expand.grid(dimnames(C))[2:1], as.vector(C))
}
D <- problem3(C)
********** From: [email protected] **********
see my attached function reshape()
reshape(df,col.unfol=1:4)
> reshape
function(obj, col.copy = NULL, col.unfold = 1:dim(obj)[2], label = "label",
value = "value")
{
n <- dim(obj)[1]
p <- length(col.unfold)
res <- vector("list", length(col.copy) + 2)
names(res) <- c(if(is.null(col.copy)) NULL else dimnames(obj[,
col.copy, drop = F])[[2]], label, value)
for(i in seq(along = col.copy)) {
res[[i]] <- rep(obj[, col.copy[i]], p)
}
res[[label]] <- rep(dimnames(obj[, col.unfold, drop = F])[[2]],
rep(n, p))
res[[value]] <- as.vector(data.matrix(obj[, col.unfold, drop = F]))
as.data.frame(res)
}
HELP FILE:
filename="reshape.d"
.BG
.FN reshape
.TL
Reshape, unfold or transpose a data frame
.DN
Reshape, unfold or transpose a data frame.
.CS
reshape(obj, col.copy=NULL, col.unfold=1:dim(obj)[2],
label="label", value="value")
.RA
.AG obj
the data frame to reshape
.OA
.AG col.copy
The columns of `obj' to copy to result
.AG col.unfold
The columns of `obj' to be unfolded to rows
.AG label
the name of the label column
.AG value
the name of the value column
.RT
A data frame consisting of the copied columns replicated p times
(where p is number of unfolded columns) and two extra columns:
label contains the name of the unfolded column
value contains the value
.SE
none
.DT
see code below. The inverse of this function is table.trans.
.SH WARNINGS
the data in the unfolded columns should all have the same mode
.SA
`table.trans', as.data.frame.array
.EX
> air[1:5, ]
ozone radiation temperature wind
1 3.448217 190 67 7.4
2 3.301927 118 72 8.0
3 2.289428 149 74 12.6
4 2.620741 313 62 11.5
5 2.843867 299 65 8.6
> reshape(air[1:5,],2,c(1,3,4),"attribute")
radiation attribute value
1 190 ozone 3.448217
2 118 ozone 3.301927
3 149 ozone 2.289428
4 313 ozone 2.620741
5 299 ozone 2.843867
6 190 temperature 67.000000
7 118 temperature 72.000000
8 149 temperature 74.000000
9 313 temperature 62.000000
10 299 temperature 65.000000
11 190 wind 7.400000
12 118 wind 8.000000
13 149 wind 12.600000
14 313 wind 11.500000
15 299 wind 8.600000
> reshape(air[1:5,],label="attribute")
attribute value
1 ozone 3.448217
2 ozone 3.301927
3 ozone 2.289428
4 ozone 2.620741
5 ozone 2.843867
6 radiation 190.000000
7 radiation 118.000000
8 radiation 149.000000
9 radiation 313.000000
10 radiation 299.000000
11 temperature 67.000000
12 temperature 72.000000
13 temperature 74.000000
14 temperature 62.000000
15 temperature 65.000000
16 wind 7.400000
17 wind 8.000000
18 wind 12.600000
19 wind 11.500000
20 wind 8.600000
.KW misc
.WR
4444444444444444444444444444444444444444444444444444444444444444444444444444444
********** From: Bill Dunlap <[email protected]> **********
> 4) JOIN these tables (E, F):
>
> A 1
> B 2
> C 3
>
> A a
> B a
> B a
> B b
> C c
> A b
>
> to give (G):
>
> A a 1
> B a 2
> B a 2
> B b 2
> C c 3
> A b 1
Use merge():
> E<-data.frame(Group1=c("A","B","C"), Number=1:3)
> # F is a reserved word, so I call it FF
> FF <- data.frame(Group1=c("A","B","B","B","C","A"), Group2=c("a","a","a","b","c","b"))
> G <- merge(FF, E)
> G
Group1 Group2 Number
1 A a 1
2 A b 1
3 B a 2
4 B a 2
5 B b 2
6 C c 3
You get a different order but I think it is an equivalent table.
********** From: Ed Kademan <[email protected]> **********
E <- data.frame(c1 = c('A', 'B', 'C'), c2 = 1:3)
F <- data.frame(c1 = c('A', 'B', 'B', 'B', 'C', 'A'),
c2 = c('a', 'a', 'a', 'b', 'c', 'b'))
## F is a very bad name for a variable in R, by the way.
G <- cbind(F, E[[2]][match(F[[1]], E[[1]])])
********** From: [email protected] **********
merge(df1,df2) # if fields have common name
5555555555555555555555555555555555555555555555555555555555555555555555555555555
********** From: Bill Dunlap <[email protected]> **********
> 5) SUBSET Table G to give H:
>
> A a 1
> A b 1
You didn't say what subset this represents. To get the first 2 rows
use G[1:2,]. To get the rows with Number==1 use G[ G$Number==1,].
To get the rows with Group1=="A" use G[ G$Group1=="A",]. To get the
rows with Group1=="A" and Group2 != "c" use G[G$Group1=="A" & G$Group2!="c",].
********** From: Ed Kademan <[email protected]> **********
H <- G[G[[1]] == 'A',]
## I am assuming you just want those rows whose first column contains
## 'A.'
********** From: [email protected] **********
df[df[,1] == "A",]
also see help(subscript)
6666666666666666666666666666666666666666666666666666666666666666666666666666666
********** From: Bill Dunlap <[email protected]> **********
> 6) TRANSPOSE H to give:
>
> A A
> a b
> 1 1
Transpose is harder because data frames must have only one type of data
per column. Our transpose function converts the data frame into a
matrix (which has one type for the whole matrix) and returns the transpose
of that matrix. You would have to convert that back to a data frame:
> data.frame(t(H))
X1 X2
Group1 A A
Group2 a b
Number 1 1
********** From: Ed Kademan <[email protected]> **********
transposeH <- t(H)
********** From: [email protected] **********
t(df)
7777777777777777777777777777777777777777777777777777777777777777777777777777777
********** From: Bill Dunlap <[email protected]> **********
> 7) and finally SORT, by up to 3 keys.
Use order() and subscripting. E.g.,
> G[ order(G$Group1, G$Group2, G$Number), ]
Group1 Group2 Number
1 A a 1
2 A b 1
3 B a 2
4 B a 2
5 B b 2
6 C c 3
> G[ order(G$Number, G$Group2), ]
Group1 Group2 Number
1 A a 1
2 A b 1
3 B a 2
4 B a 2
5 B b 2
6 C c 3
********** From: Ed Kademan <[email protected]> **********
dfr <- as.data.frame(matrix(sample(5, 120, replace = T), 30, 4))
sortedDfr <- dfr[do.call('order', dfr[1:3]),]
## The first column is the primary key, the second is the secondary
## key, and the third is tertiary.
********** From: [email protected] **********
use order()
e.g. df[order(df$a,df$b,df$c),]
===============================================================================
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment