I've been studying memory usage, especially for serialize(). For my tests I'm creating a data frame with 50M rows of doubles, which occupies 400MB. I'm using /usr/bin/time -v to gauge peak memory usage. (In my tests the R interpreter itself always has an overhead of about 20MB, which is why 420MB is reported instead of 400MB.)
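As a sanity check of that ~20MB baseline (not one of the runs above; the exact figure will vary with platform and R version), you can time an essentially empty script:

/usr/bin/time -v Rscript -e 'invisible(NULL)' 2>&1|grep resident|grep Max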
jorgem@ubuntu:~$ cat df.R
di <- data.frame(runif(50e6,1,10))
jorgem@ubuntu:~$ /usr/bin/time -v Rscript df.R 2>&1|grep resident|grep Max
Maximum resident set size (kbytes): 421332
If we add serialization to memory, the peak rises to 1.2GB:
jorgem@ubuntu:~$ cat df-ser.R
di <- data.frame(runif(50e6,1,10))
ser <- serialize(di,NULL)
jorgem@ubuntu:~$ /usr/bin/time -v Rscript df-ser.R 2>&1|grep resident|grep Max
Maximum resident set size (kbytes): 1202544
In this case serialize() has an overhead of 800MB at its peak. After the serialization is done, memory usage settles at 800MB (the original data frame plus the serialized copy), but at its peak R uses 1.2GB.
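A quick way to see where the 800MB that remains comes from (my own check, not one of the timed runs above) is to compare object sizes inside R:

di <- data.frame(runif(50e6,1,10))
ser <- serialize(di, NULL)
format(object.size(di), units = "MB")   # roughly 400MB for the data frame
format(object.size(ser), units = "MB")  # roughly 400MB for the serialized raw vector
gc()   # both copies stay live, so about 800MB remains resident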
When serializing to a file, serialize() doesn't incur any overhead:
jorgem@ubuntu:~/projects/R-3.1.2$ cat df-ser-file.R
di <- data.frame(runif(50e6,1,10))
#ser <- serialize(di,NULL)
tf2 <- tempfile()
con <- file(tf2, open = "wb")
serialize(di, con)
close(con)
jorgem@ubuntu:~/projects/R-3.1.2$ /usr/bin/time -v Rscript df-ser-file.R 2>&1|grep resident|grep Max
Maximum resident set size (kbytes): 421440
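For completeness, the file written this way round-trips with unserialize() on a connection opened in binary mode. A small sketch, continuing from the df-ser-file.R script above (not part of the measured runs):

con <- file(tf2, open = "rb")
di2 <- unserialize(con)
close(con)
identical(di, di2)   # should be TRUE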
If we write the file to /run/shm (shared memory), the execution time is almost the same as serializing to memory, while memory usage is far lower (400MB vs 1.2GB):
jorgem@ubuntu:~/projects/R-3.1.2$ cat df-ser-file-shm.R
di <- data.frame(runif(50e6,1,10))
#tf2 <- tempfile()
tf2 <- '/run/shm/testfile'
con <- file(tf2, open = "wb")
serialize(di, con)
close(con)
jorgem@ubuntu:~/projects/R-3.1.2$ /usr/bin/time -v Rscript df-ser-file-shm.R
NULL
Command being timed: "Rscript df-ser-file-shm.R"
User time (seconds): 3.86
System time (seconds): 0.25
Maximum resident set size (kbytes): 421692
jorgem@ubuntu:~/projects/R-3.1.2$ /usr/bin/time -v Rscript df-ser.R
Command being timed: "Rscript df-ser.R"
User time (seconds): 3.71
System time (seconds): 0.66
Maximum resident set size (kbytes): 1204192
Writing to a regular file on disk takes much longer because of slow I/O (about 9 seconds vs 3 seconds when writing to shared memory).
I've tried protocol buffers, just out of curiosity.
Dependencies:
sudo apt-get install libcurl4-openssl-dev
sudo apt-get install libprotobuf-dev
sudo apt-get install libprotoc-dev
R package:
install.packages('RProtoBuf')
With protobuf the memory peak is even worse (1.8GB):
jorgem@ubuntu:~/projects/R-3.1.2$ cat df-ser-protobuf.R
di <- data.frame(runif(50e6,1,10))
library(RProtoBuf)
msg <- serialize_pb(di, NULL)
jorgem@ubuntu:~/projects/R-3.1.2$ /usr/bin/time -v Rscript df-ser-protobuf.R 2>&1|grep resident|grep Max
Maximum resident set size (kbytes): 1847676
- Serializing to an object in memory has a big overhead (a peak of 1.2GB for a 400MB data frame).
- Serializing to a file doesn't have that overhead.
- Serializing to a file in shared memory could be an interesting option (see the sketch below).
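To illustrate that last point, here is a rough sketch of how the raw bytes could be obtained without the 1.2GB peak: serialize to a file in /run/shm, then read the bytes back with readBin. This is just an idea suggested by the measurements above, not something I've profiled; the /run/shm path is Linux-specific, and in principle the peak should stay around the original plus the serialized copy (~800MB) rather than 1.2GB.

di <- data.frame(runif(50e6,1,10))
tf2 <- '/run/shm/testfile'
con <- file(tf2, open = "wb")
serialize(di, con)    # writing to the shm file peaked around 420MB in the runs above
close(con)
ser <- readBin(tf2, what = "raw", n = file.info(tf2)$size)  # should hold the same bytes serialize(di, NULL) returns
unlink(tf2)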