The problem is that R already schedules and runs the parallelization tasks, so we need to expose that to the broader cluster and allow interconnections between containers.
One option for this would be to set up a redis container to manage parallel tasks via the doRedis package in R (this limits our parallelization options, though, since some code doesn't parallelize via foreach but instead via functions like parApply).
Try ECS: set up several container instances and specify each slave and master node as taking up the full resources of its container instance. There are essentially a few big steps here.
If going the redis route, the steps are roughly:
- set up a redis container, which is easy since there is already a widely available redis image on Docker Hub. Set up the redis container as a task on ECS and have it share the master's container instance (or give it its own container instance, preferably a micro/medium instance).
- ensure the appropriate port is exposed. By default redis uses 6379 and the redis image already exposes it, but it may be worth declaring it in the ECS task definition JSON too. Also, name the container redis so links resolve to that hostname.
- launch the master container as an ECS task, with a link to the redis container.
- in the master, run R, import the data, load doRedis, and set up the jobs queue. Write the parallelizable job in R using foreach, and register the redis backend so R sends tasks to the redis server and its job queue (see the sketch below).
- launch the slave containers, each with a link to the redis container.
- in each slave container, launch R, load doRedis, and register the processors as workers with the redis server, pointing at the same jobs queue.
- instructions for this are in the [doRedis documentation](https://cran.r-project.org/web/packages/doRedis/vignettes/doRedis.pdf)
Note that, doing things this way, the R code must use %dopar% for parallelization. On the plus side, the process should be fault-tolerant and, more or less, load-balanced.
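In R, the doRedis flow looks roughly like the sketch below. The queue name "jobs" and the hostname "redis" (from the container link) are assumptions, not fixed names; adjust to whatever is actually configured.

```r
## --- master container ---
library(doRedis)

# point foreach at the redis job queue; "redis" is the linked container's
# hostname and "jobs" is an arbitrary queue name (both assumptions)
registerDoRedis("jobs", host = "redis", port = 6379)

# any %dopar% loop now gets farmed out to whichever workers are listening;
# .packages loads packages on the workers, but they still have to be
# installed in the slave image (see the caveats at the end)
res <- foreach(i = 1:1000, .combine = c, .packages = c("stats")) %dopar% {
  mean(rnorm(1e5, mean = i))
}

removeQueue("jobs")   # clean up the queue when finished

## --- each slave container ---
library(doRedis)

# register one worker per core, pointed at the same queue and redis host
startLocalWorkers(n = parallel::detectCores(), queue = "jobs", host = "redis")
```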
If going the ssh route, the big steps are:
- first, each docker container must have ssh installed and sshd running with a specific key pair set up. The ideal way to do this would be to build the image with the requisite R install, install ssh, generate the public key and host key, and save both on the image. Baking keys into the image is potentially a colossal security issue, but it shouldn't be too bad if we set up the links right.
- run containers from the image as tasks on ECS, again specifying that each task takes up the whole CPU and memory of its container instance, i.e. one task per container instance. The containers also need to run in some sort of persistent daemon mode so they linger and wait for a connection after the run command. When running each slave task (i.e. slave container), make sure links to the master container are specified; that will help communication. You can pick the instance and override the scheduler so that containers run on specific instances with the appropriate computing resources; the key seems to be using the start-task command instead of run-task. This will, however, require defining one task for each slave container instance. If going the redis route, run this step after setting up the master and have each slave container start up and register with the redis server; the ssh config is less of an issue in that case. If using ssh, each container must expose port 22 somehow; you can probably set up the VPC on AWS to expose 22 to other container instances while not exposing 22 to the internet.
- run the master container (which will delegate the parallelization tasks, etc.). If not using redis, provide links to each of the slave containers; if using redis, this is not so much of a problem. If not using redis (i.e. using ssh), you need to tell R to register a SOCK cluster using one connection per processor on each node (see the sketch after this list).
- run the code in R on the master. Save the output to S3 or wherever.
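For the ssh route, the master-side R code would look roughly like this sketch; the slave hostnames, core count, and ssh user are placeholders for whatever the container instances actually expose (the parallel package opens the ssh connections for remote SOCK nodes).

```r
## --- master container (ssh/SOCK route) ---
library(parallel)
library(doParallel)
library(foreach)

slaves <- c("slave-1", "slave-2")   # placeholder hostnames / private IPs
cores  <- 4                         # processors available on each slave

# one ssh connection per processor on each node; "root" is a placeholder for
# whichever account holds the baked-in key
cl <- makePSOCKcluster(rep(slaves, each = cores), user = "root")
registerDoParallel(cl)

res <- foreach(i = 1:1000, .combine = c) %dopar% {
  mean(rnorm(1e5, mean = i))
}

stopCluster(cl)
```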
Some important caveats:
- if the parallelizable function requires special packages, then those need to be installed on each slave container instance.
- containers will not terminate automatically!!! You need to keep an eye on the process and shut things down accordingly. It may be worthwhile to set up Lambda or CloudWatch to terminate/notify when the processing task is completed.
- In theory, you should be able to use 1 image for slave and master containers and use that same container image in your local environment for development. Just be sure to either save and commit the container image from the master node or export your processed data to an external location (s3, etc) before terminating the ecs cluster.
- one last note on the doRedis package: you must set options('redis:num'=TRUE) (in the current version, 1.1.1). It's probably a good idea to set this in the .Rprofile in the container image.
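For example, a single line appended to the image's .Rprofile (the exact path depends on the base image) would cover every R session in the container:

```r
# required by doRedis 1.1.1, per the note above
options('redis:num' = TRUE)
```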