Jorge Martinez jorgemarsal

Intro

As you all know, R users tend to install packages from CRAN using “install.packages”. Making DR available there would greatly help adoption.

Options

Source package
Binary package
Hybrid

Source package

Prereq

sudo apt-get install libxt-dev
sudo apt-get install texinfo
sudo apt-get install texlive-latex-base
sudo apt-get install texlive-fonts-extra

Compile && install

wget http://cran.r-project.org/src/base/R-3/R-3.1.2.tar.gz

Introduction

Debugging workers and executors is hard because they are started automatically. One possible approach is to sleep for a few seconds when the programs start. This gives us time to attach a debugger before the programs do anything.

Implementation

One option is to create 2 files: /tmp/r_executor_startup_sleep_secs and /tmp/r_executor_startup_sleep_secs . The first thing the workers and executors do is check whether the corresponding file exists. If it does, the process sleeps for the number of seconds specified in the file:

$ cat /tmp/r_executor_startup_sleep_secs

30
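
The check described above could look roughly like the following shell sketch. The real workers and executors are not shell scripts, so this only illustrates the logic. The function name startup_sleep is hypothetical, and its optional second argument (the command used to sleep) exists only so the behaviour can be tried without actually waiting.

```shell
# Hypothetical helper illustrating the startup check; this is NOT the
# actual worker/executor code.
#   $1: path of the control file
#   $2: command used to sleep (assumption: defaults to "sleep")
startup_sleep() {
    sleep_file="$1"
    sleep_cmd="${2:-sleep}"

    # No control file: start immediately.
    [ -f "$sleep_file" ] || return 0

    # Take the first token of the first non-empty line of the file.
    secs=$(awk 'NF { print $1; exit }' "$sleep_file")

    # Ignore anything that is not a plain non-negative integer.
    case "$secs" in
        ''|*[!0-9]*) return 0 ;;
    esac

    echo "PID $$ sleeping ${secs}s; attach a debugger now" >&2
    "$sleep_cmd" "$secs"
}

# Example call:
#   startup_sleep /tmp/r_executor_startup_sleep_secs
```

While the process sleeps, a debugger can be attached to it, for example with gdb -p followed by the process ID.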

This is one possible flow for backporting fixes to old branches.

Let's say we want to backport commit f54200217d57c64bdeac93192aa3ff9fc53d5890 to branch DistR-1_0_x.

First we create a local DistR-1_0_x branch:

$ git checkout -b DistR-1_0_x remotes/origin/DistR-1_0_x

Then we backport the commit with git cherry-pick:
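
The command itself is missing from this page. A plausible version, assuming the local DistR-1_0_x branch is checked out, is sketched below. The -x flag and the wrapper name backport_commit are additions for illustration and do not come from the original text.

```shell
# Hypothetical wrapper; run it from inside the repository with the
# target branch (e.g. DistR-1_0_x) checked out.
backport_commit() {
    # -x appends "(cherry picked from commit <hash>)" to the new commit
    # message, which keeps the backport traceable.
    git cherry-pick -x "$1"
}

# Usage with the commit named in the text:
#   backport_commit f54200217d57c64bdeac93192aa3ff9fc53d5890
```

If the cherry-pick stops on a conflict, resolve the conflicting files, stage them with git add, and finish with git cherry-pick --continue.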

Intro

I've been studying the memory usage, especially for serialize. For my tests I'm creating a data frame with 50M rows of doubles that occupies 400MB. I'm using /usr/bin/time -v to gauge memory usage. (In my tests R always has an overhead of about 20MB, which is why 420MB is reported instead of 400MB.)

jorgem@ubuntu:~$ cat df.R 
di <- data.frame(runif(50e6,1,10))
jorgem@ubuntu:~$ /usr/bin/time -v Rscript df.R 2>&1|grep resident|grep Max
	Maximum resident set size (kbytes): 421332

If we add serialization the memory peak is 1.2GB:

The issue

In some compiler versions (e.g. GCC 4.6.4 in Ubuntu), when compiling with -rdynamic, two functions with the same name (e.g. dataptr, defined both in routines.h and barrier.cpp) are placed in the dynamic symbol table. Subsequently, when we call R_GetCCallable("Rcpp", "dataptr"), we get the wrong function at runtime.

When Rcpp registers dataptr it means this function (in barrier.cpp):

// [[Rcpp::register]]
void* dataptr(SEXP x){
    return DATAPTR(x);
}

Story of the Rcpp hang

Intro

Lately we've been seeing hangs when running distributedR_start(). This only happens with some compiler versions. E.g. GCC 4.8.2 on Ubuntu hangs while GCC 4.6.4 doesn't.

Let's see what's going on:

library(distributedR)

	import jinja2
	import json
	import logging
	import os
	import requests
	import tempfile

	import pykube.config
	import pykube.http

	import numpy as np
	import seaborn as sns
	import matplotlib.pyplot as plt

	sns.set(style="white", context="talk")

	# Set up the matplotlib figure
	f, (ax1) = plt.subplots(1, 1, figsize=(10, 6), sharex=True)

	# Specify data

	FROM centos:centos7

	# install required packages
	RUN yum -y install vim openssh-server sudo glibc tar openssh-clients initscripts

	# create user
	RUN useradd --create-home jorgem
	RUN mkdir -p /home/jorgem/.ssh/
	ADD id_rsa.pub /home/jorgem/.ssh/id_rsa.pub
	ADD id_rsa /home/jorgem/.ssh/id_rsa