@peterfoley
Created December 7, 2012 20:43
Demo code for spinning up a doRedis cluster

Why doRedis?

doRedis is a foreach backend that is robust to node failure and that lets you add or remove worker nodes on the fly. Those are always useful features, but they are absolute requirements when you're running on EC2 spot instances that can shut down without warning. By using doRedis, you can move processing that would have required more reliable on-demand instances onto vastly cheaper spot instances.

Implementation Sketch

If you're starting from scratch, the basic steps in running a simple personal cluster are:

  • Set up security groups for your master and worker nodes
  • Start an on-demand instance in the master security group and install redis along with R and the packages you need
  • Create an AMI from that instance
  • Use that AMI to start however many worker nodes you want as spot instances
  • Start redis on the master
  • Start up R on the master, attach the doRedis backend, and use foreach to your heart's content

I assume a certain level of knowledge about starting/stopping instances on EC2, so if you get confused by some of the EC2-related steps, look at some of the previous R user group materials. There are also a huge number of tutorials online, and this one gives fairly concise steps for Ubuntu instances: https://help.ubuntu.com/community/EC2StartersGuide

I also assume that you're connecting from a Linux or Mac machine -- I don't run EC2 control tools from Windows, so I can't tell you what to do.

Security groups

In the AWS console, pick the region you want to run everything in. On the left side, go to Network & Security > Security Groups, click Create Security Group, and name it "master" or another name you like. Create another one called "workers". Copy the worker Security Group's Group ID (something like sg-12abc345), then add a custom TCP rule for your master security group to allow access to port 6379 from your worker security group.
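If you'd rather script this step, the same groups and rule can be created from the command line. The sketch below uses the newer unified aws CLI (which postdates the EC2 API tools this gist otherwise assumes); the group names match the console steps above.

# create the two groups (EC2-Classic style, referencing groups by name)
aws ec2 create-security-group --group-name master --description "doRedis master"
aws ec2 create-security-group --group-name workers --description "doRedis workers"
# let instances in the workers group reach redis on the master
aws ec2 authorize-security-group-ingress --group-name master \
  --protocol tcp --port 6379 --source-group workers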

Master Setup

Start up an EBS-backed on-demand instance in your master security group (I'm partial to Canonical's Ubuntu images). At this point, you also have to decide whether you want your instance to be HVM-based (to run on the cluster compute and m3 instance types) or paravirtual (the older, traditional node types).

After the master instance has booted, SSH in and install R and whatever packages you'll need. When you're done and logged out of the master, select the master instance in the AWS console and click Actions > Create Image (EBS AMI). That will shut down your instance while it creates the image that you'll use on the workers.
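As a concrete sketch, on a recent Ubuntu AMI the install step looks roughly like this; add whatever system libraries your CRAN packages need on top.

# install R plus a compiler toolchain for building packages from source
sudo apt-get update
sudo apt-get install -y r-base-core build-essential
# install doRedis (pulls in foreach and rredis) from CRAN
sudo Rscript -e 'install.packages("doRedis", repos="http://cran.r-project.org")'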

Start Workers

After the AMI is created, go to Images > AMIs in the AWS console, where you should see the image you just created. Click Spot Request at the top, then pick the number of instances you want, their type, your maximum price, and the availability zone (it's cheapest to keep them in the same AZ as the master). If you start instances in stages, make sure they all use the same key pair.

Assuming your maximum price was higher than the current spot price (see Instances > Spot Requests > Pricing History in the AWS console), it will take anywhere from a few seconds to a few minutes for the spot requests to work through the AWS bidding system and actually start booting nodes.
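If you'd rather script the request than click through the console, the newer aws CLI can do the same thing. Everything in the launch specification below is a placeholder -- substitute your real AMI ID, instance type, key pair name, and bid.

# request 4 spot instances from the worker AMI
aws ec2 request-spot-instances --spot-price "0.10" --instance-count 4 \
  --launch-specification '{"ImageId":"ami-12abc345","InstanceType":"m1.large","KeyName":"mykey","SecurityGroups":["workers"]}'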

Once the nodes are booted, you want to grab a list of all your worker instances so that you can start/stop R worker processes on them as needed. If you've got the aws command-line tools installed (see http://aws.amazon.com/developertools/351), and your workers are in their own security group, you can do something like this to dump all their host names into a single bash variable.

workerSG="sg-abc12345" # put your actual worker security group ID here
# keep the lines that mention the worker security group and pull out the hostname field
workers=$(ec2-describe-instances | grep "$workerSG" | cut -f4)

Once that variable is defined in your terminal, you can copy/paste little snippets to run things on all the worker nodes. Obviously, if you're doing this a lot, you're probably better off building Python or R tools to handle the control for you, but for a quick personal cluster, it's hard to beat copy/pasting into a terminal.
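For example, a quick loop like this (using the same key placeholder as the snippets below) confirms that every worker is reachable:

# sanity check: run one command on each worker node
for h in $workers; do
  ssh -i [[[path to EC2 worker private key]]] ubuntu@$h uptime
done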

The two examples below (don't run them yet) will start/stop doRedis worker processes on all the worker nodes. Replace the host argument with the hostname of your master instance, and bump up n to the number of cores on the worker node (or to fit within the memory limits of the worker nodes if the jobs are memory-heavy).

# start workers on each spot instance
for h in $workers; do
  ssh -i [[[path to EC2 worker private key]]] ubuntu@$h <<EOF
    ((echo 'library(doRedis); startLocalWorkers(n=1, queue="rdemo", host="[[[hostname of redis server]]]");' | R --slave --no-save >/dev/null 2>&1 &) &) &
EOF
done

# kill all worker processes
for h in $workers; do
  (ssh -i [[[path to EC2 worker private key]]] ubuntu@$h killall R &) &
done

Start Redis on the master

On the master, follow the directions on the redis download page to build redis. Then start up screen and run redis-server. It'll occasionally dump snapshot files (dump.rdb) to the directory where you start redis-server, so you might want to run it from somewhere outside the source directory.
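For reference, the build-and-run steps look something like this; the tarball URL is the one the download page points at for the current stable release.

# build redis from source in your home directory
wget http://download.redis.io/redis-stable.tar.gz
tar xzf redis-stable.tar.gz
cd redis-stable && make
# run the server inside screen from a scratch directory, so its
# snapshot files don't clutter the source tree
mkdir ~/redis-data && cd ~/redis-data
screen ~/redis-stable/src/redis-server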

Now that redis is running on the master, you can start an R process on the master and attach it to the redis database. Here, the job queue is called rdemo, and since the redis server is on the default port on the same host as the master R process, we can skip the hostname and port arguments.

library(doRedis)
registerDoRedis('rdemo')
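If redis were on another host or a non-default port, you'd spell those arguments out; the call above is shorthand for the fully specified form (defaults shown for illustration):

registerDoRedis('rdemo', host='localhost', port=6379)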

You can then spin up all the workers using the copy/paste bash code above (making sure that you give them the same job queue name that you used on the master), and then foreach %dopar% loops will get processed in parallel by the worker nodes.

# look at the node hostnames
allnames <- foreach(i=1:100) %dopar% {system('hostname', intern=TRUE)}
unique(unlist(allnames))

Tips for using doRedis

Only run one master process per redis database

In our experience, things break when you run multiple master processes against one redis database, particularly if you accidentally start two masters with the same job queue name. You can get back results from past jobs (sometimes without any warning), among other very bad outcomes.

One benefit of running on EC2 is that if you've got another job that needs to run at the same time, you can just spin up another master and its own set of workers, so it's easy to avoid problems.

When you restart the master process, clear the database

When the master R process fails for whatever reason, it can leave a bunch of junk in the redis database. If you're using a single database per master, then you can just use flushdb or flushall to delete everything and start from scratch. When you do that, though, all the worker processes will quit after a timeout, so you'll have to start them up again when you restart the master R process.
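From the master, that's a one-liner with the CLI that ships with redis (this assumes the server is on the default port on localhost):

# delete every key in the current redis database
redis-cli flushdb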

Don't pass large objects back from workers

In particular when you're running parallel model jobs or other things that might have large data objects attached, don't pass the results back to the master directly. Redis is an in-memory database, and if you're running a large cluster or a large number of jobs, it's very easy to fill up the redis store.

Modeling functions that attach the fitted data to their return values are particularly bad culprits here since you can end up with many duplicates of the same data filling up your redis server.
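One workable pattern is to save the big objects to disk on the workers and only return paths. Everything in this sketch (mydata, the grouping, the /mnt/results directory) is hypothetical, and it assumes the result files land somewhere you can retrieve them from later:

# return file paths instead of fitted models
results <- foreach(i=1:100) %dopar% {
  fit <- lm(y ~ x, data=mydata[mydata$group == i, ])
  fit$model <- NULL                          # drop the embedded copy of the data
  path <- sprintf("/mnt/results/fit_%03d.rds", i)
  saveRDS(fit, path)                         # stash the big object on the worker
  path                                       # only a short string goes back through redis
}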

For large clusters, tweak some things in sysctl

If you're passing a lot of data to/from workers, Redis likes vm.overcommit_memory=1. Each worker process opens at least 2 connections to the redis server (sometimes more), so with more than 500 cores running (e.g. 32 cc2.8xlarge worker instances with 16 processes each), you can hit the maximum number of connections. To raise that ceiling, increase maxclients in redis.conf; you may also need to bump the sysctl setting for fs.file-max and raise the default ulimit -n above 1024.
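Concretely, those knobs look like this on the master (the values are illustrative -- pick limits that fit your cluster size):

# allow redis to fork for snapshots even under memory pressure
sudo sysctl vm.overcommit_memory=1
# raise the system-wide and per-shell open-file limits
sudo sysctl fs.file-max=100000
ulimit -n 4096          # run this in the shell that starts redis-server
# and in redis.conf:
#   maxclients 4096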
