Jorge Martinez jorgemarsal

jorgemarsal / k8s.py
Created January 7, 2016 10:21
Helper functions to get started with Kubernetes API
import jinja2
import json
import logging
import os
import requests
import tempfile
import pykube.config
import pykube.http
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="white", context="talk")
# Set up the matplotlib figure
f, (ax1) = plt.subplots(1, 1, figsize=(10, 6), sharex=True)
# Specify data
jorgemarsal / Dockerfile
Created June 5, 2015 16:31
docker-example
FROM centos:centos7
# install required packages
RUN yum -y install vim openssh-server sudo glibc tar openssh-clients initscripts
# create user
RUN useradd --create-home jorgem
RUN mkdir -p /home/jorgem/.ssh/
ADD id_rsa.pub /home/jorgem/.ssh/id_rsa.pub
ADD id_rsa /home/jorgem/.ssh/id_rsa

Intro

As you all know, R users tend to install packages from CRAN using “install.packages”. Making DR available there would greatly help adoption.

Options

  • Source package
  • Binary package
  • Hybrid

Source package

Introduction

Debugging workers and executors is hard because they are started automatically. One possible approach is to sleep for a few seconds when each program starts. This gives us time to attach a debugger before the program does anything.

Implementation

One option is to create two files: /tmp/r_executor_startup_sleep_secs and /tmp/r_executor_startup_sleep_secs. The first thing the workers and executors do is check whether their file exists. If it does, the process sleeps for the number of seconds specified in the file:

$ cat /tmp/r_executor_startup_sleep_secs 

30
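
A minimal sketch of that startup check, written in Python purely for illustration (the language of the real workers and executors is not stated here, and the function names are hypothetical). It reads the file from the example above and sleeps for the number of seconds it contains; a missing or malformed file is treated as "no delay", which is an assumption.

```python
import os
import time

# Path taken from the example above.
SLEEP_FILE = "/tmp/r_executor_startup_sleep_secs"


def read_startup_delay(path=SLEEP_FILE):
    """Return the delay in whole seconds stored in `path`.

    Returns 0 when the file is missing, unreadable, or does not
    contain a non-negative integer (assumed behaviour).
    """
    try:
        with open(path) as fh:
            seconds = int(fh.read().strip())
    except (OSError, ValueError):
        return 0
    return max(seconds, 0)


def sleep_at_startup(path=SLEEP_FILE, sleep=time.sleep):
    """Call this first thing in the process; returns the delay applied."""
    delay = read_startup_delay(path)
    if delay > 0:
        # Print the pid so a debugger can be attached to this process.
        print(f"pid {os.getpid()}: sleeping {delay}s, attach a debugger now")
        sleep(delay)
    return delay


if __name__ == "__main__":
    sleep_at_startup()
```

The `sleep` parameter exists only so the delay can be replaced in tests; removing the file restores normal startup.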

This is one possible flow for backporting fixes to old branches.

Let's say we want to backport commit f54200217d57c64bdeac93192aa3ff9fc53d5890 to branch DistR-1_0_x.

First we create a local DistR-1_0_x branch:

$ git checkout -b DistR-1_0_x remotes/origin/DistR-1_0_x

Then we backport the commit with git cherry-pick:

Intro

I've been studying memory usage, especially for serialize. For my tests I'm creating a data frame with 50M rows of doubles, which occupies 400MB. I'm using /usr/bin/time -v to gauge memory usage. (In my tests R always has an overhead of about 20MB, which is why 420MB is reported instead of 400MB.)

jorgem@ubuntu:~$ cat df.R 
di <- data.frame(runif(50e6,1,10))
jorgem@ubuntu:~$ /usr/bin/time -v Rscript df.R 2>&1|grep resident|grep Max
	Maximum resident set size (kbytes): 421332
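
The arithmetic behind those figures can be checked with a short sketch. It assumes 8 bytes per double and, like the text, reads the reported "kbytes" as units of 1000 bytes; the helper name is hypothetical. If the figure is instead in units of 1024 bytes, the computed overhead comes out nearer 31MB than 21MB.

```python
ROWS = 50_000_000        # 50e6 values, as in df.R
BYTES_PER_DOUBLE = 8     # assumed size of an R double

# Size of the raw column data in bytes.
data_bytes = ROWS * BYTES_PER_DOUBLE


def overhead_mb(reported_kbytes, bytes_per_kbyte=1000):
    """Overhead in MB (1 MB = 1e6 bytes) above the raw data size.

    `bytes_per_kbyte` is an assumption about the unit used by the
    "Maximum resident set size" line; pass 1024 for the binary unit.
    """
    return (reported_kbytes * bytes_per_kbyte - data_bytes) / 1e6


if __name__ == "__main__":
    print(f"data size: {data_bytes / 1e6:.0f} MB")
    print(f"overhead (1000-byte kbytes): {overhead_mb(421332):.1f} MB")
    print(f"overhead (1024-byte kbytes): {overhead_mb(421332, 1024):.1f} MB")
```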

If we add serialization the memory peak is 1.2GB:

The issue

With some compiler versions (e.g. GCC 4.6.4 on Ubuntu), compiling with -rdynamic places two functions with the same name (e.g. dataptr, defined in both routines.h and barrier.cpp) in the dynamic symbol table. As a result, when we call R_GetCCallable("Rcpp", "dataptr") we get the wrong function at runtime.

When Rcpp registers dataptr it means this function (in barrier.cpp):

// [[Rcpp::register]]
void* dataptr(SEXP x){
    return DATAPTR(x);
}

Story of the Rcpp hang

Intro

Lately we've been seeing hangs when running distributedR_start(). This only happens with some compiler versions, e.g. GCC 4.8.2 on Ubuntu hangs while GCC 4.6.4 doesn't.

Let's see what's going on:

library(distributedR)