I've been studying memory usage, especially for serialize(). For my tests I'm creating a data frame with 50M rows of doubles, which occupies 400MB. I'm using /usr/bin/time -v to gauge peak memory usage. (In my tests the R interpreter itself always has an overhead of about 20MB, which is why 420MB is reported instead of 400MB.)
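As a sanity check of that ~20MB baseline (not one of the runs above; the exact figure will vary with platform and R version), you can time an essentially empty script:

/usr/bin/time -v Rscript -e 'invisible(NULL)' 2>&1|grep resident|grep Max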
jorgem@ubuntu:~$ cat df.R
di <- data.frame(runif(50e6,1,10))
jorgem@ubuntu:~$ /usr/bin/time -v Rscript df.R 2>&1|grep resident|grep Max
Maximum resident set size (kbytes): 421332
If we add serialization to memory, the peak rises to 1.2GB:
jorgem@ubuntu:~$ cat df-ser.R
di <- data.frame(runif(50e6,1,10))
ser <- serialize(di,NULL)
jorgem@ubuntu:~$ /usr/bin/time -v Rscript df-ser.R 2>&1|grep resident|grep Max
Maximum resident set size (kbytes): 1202544
In this case serialize() has an overhead of 800MB at its peak. After the serialization is done, memory usage settles at 800MB (the original data frame plus the serialized copy), but at its peak R uses 1.2GB.
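A quick way to see where the 800MB that remains comes from (my own check, not one of the timed runs above) is to compare object sizes inside R:

di <- data.frame(runif(50e6,1,10))
ser <- serialize(di, NULL)
format(object.size(di), units = "MB")   # roughly 400MB for the data frame
format(object.size(ser), units = "MB")  # roughly 400MB for the serialized raw vector
gc()   # both copies stay live, so about 800MB remains resident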
When serializing to a file, serialize() doesn't incur any overhead:
jorgem@ubuntu:~/projects/R-3.1.2$ cat df-ser-file.R
di <- data.frame(runif(50e6,1,10))
#ser <- serialize(di,NULL)
tf2 <- tempfile()
con <- file(tf2, open = "wb")
serialize(di, con)
close(con)
jorgem@ubuntu:~/projects/R-3.1.2$ /usr/bin/time -v Rscript df-ser-file.R 2>&1|grep resident|grep Max
Maximum resident set size (kbytes): 421440
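For completeness, the file written this way round-trips with unserialize() on a connection opened in binary mode. A small sketch, continuing from the df-ser-file.R script above (not part of the measured runs):

con <- file(tf2, open = "rb")
di2 <- unserialize(con)
close(con)
identical(di, di2)   # should be TRUE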
If we write the file to /run/shm (shared memory), the execution time is almost the same as serializing to memory, while memory usage is far lower (400MB vs 1.2GB):
jorgem@ubuntu:~/projects/R-3.1.2$ cat df-ser-file-shm.R
di <- data.frame(runif(50e6,1,10))
#tf2 <- tempfile()
tf2 <- '/run/shm/testfile'
con <- file(tf2, open = "wb")
serialize(di, con)
close(con)
jorgem@ubuntu:~/projects/R-3.1.2$ /usr/bin/time -v Rscript df-ser-file-shm.R
NULL
Command being timed: "Rscript df-ser-file-shm.R"
User time (seconds): 3.86
System time (seconds): 0.25
Maximum resident set size (kbytes): 421692
jorgem@ubuntu:~/projects/R-3.1.2$ /usr/bin/time -v Rscript df-ser.R
Command being timed: "Rscript df-ser.R"
User time (seconds): 3.71
System time (seconds): 0.66
Maximum resident set size (kbytes): 1204192
Writing to a regular file on disk takes much longer because of slow I/O (about 9 seconds vs 3 seconds when writing to shared memory).
I've tried protocol buffers, just out of curiosity.
Dependencies:
sudo apt-get install libcurl4-openssl-dev
sudo apt-get install libprotobuf-dev
sudo apt-get install libprotoc-dev
R package:
install.packages('RProtoBuf')
With protobuf the memory peak is even worse (1.8GB):
jorgem@ubuntu:~/projects/R-3.1.2$ cat df-ser-protobuf.R
di <- data.frame(runif(50e6,1,10))
library(RProtoBuf)
msg <- serialize_pb(di, NULL)
jorgem@ubuntu:~/projects/R-3.1.2$ /usr/bin/time -v Rscript df-ser-protobuf.R 2>&1|grep resident|grep Max
Maximum resident set size (kbytes): 1847676
- Serializing to an object in memory has a big overhead (a peak of 1.2GB for a 400MB data frame).
- Serializing to a file doesn't have that overhead.
- Serializing to a file in shared memory could be an interesting option (see the sketch below).
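To illustrate that last point, here is a rough sketch of how the raw bytes could be obtained without the 1.2GB peak: serialize to a file in /run/shm, then read the bytes back with readBin. This is just an idea suggested by the measurements above, not something I've profiled; the /run/shm path is Linux-specific, and in principle the peak should stay around the original plus the serialized copy (~800MB) rather than 1.2GB.

di <- data.frame(runif(50e6,1,10))
tf2 <- '/run/shm/testfile'
con <- file(tf2, open = "wb")
serialize(di, con)    # writing to the shm file peaked around 420MB in the runs above
close(con)
ser <- readBin(tf2, what = "raw", n = file.info(tf2)$size)  # should hold the same bytes serialize(di, NULL) returns
unlink(tf2)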