This is an easy workaround to enable Blender rendering on a supercomputing cluster. The samples shown here are based on the Euler cluster of ETH Zürich (IBM LSF batch system).
So here's the premise: for scenes that take around a minute or less to render, performance is actually worse if you render on all of the cards with a single instance of Blender than if each card runs its own instance. This is because (AFAIK) some additional time is needed to collect the render results from each card and stitch them together. That time is a short, roughly fixed duration, so it's negligible on larger/longer render jobs. On shorter render jobs, however, the 'stitch time' has a much more significant impact.
I ran into this on a machine I render on that has 4 Quadro K6000s in it. To render animations, I ended up writing a few little scripts to launch 4 separate instances of Blender, each one tied to one GPU. Overall render time was much shorter with that setup than with one instance of Blender using all 4 GPUs.
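To make the idea concrete, here is a minimal sketch of the one-instance-per-GPU approach. It is not the script described in this post, just an illustration under a few assumptions: the cards are CUDA devices, so the standard `CUDA_VISIBLE_DEVICES` environment variable can hide all but one card from each process; the .blend file is already set up for GPU rendering; and `blender` is on the PATH. The helper names `build_jobs` and `launch` are made up for this example. Frames are interleaved across the instances with Blender's `-s` (start frame), `-e` (end frame) and `-j` (frame step) options.

```python
import os
import subprocess


def build_jobs(blend_file, start, end, num_gpus, blender="blender"):
    """Return one (command, env) pair per GPU.

    Instance i renders frames start+i, start+i+num_gpus, ... up to end,
    so the instances never render the same frame twice.
    """
    jobs = []
    for gpu in range(num_gpus):
        first_frame = start + gpu
        if first_frame > end:
            # Fewer frames than GPUs: nothing left for this card.
            break
        cmd = [
            blender, "-b", blend_file,   # -b: run without the UI
            "-s", str(first_frame),      # first frame for this instance
            "-e", str(end),              # last frame of the animation
            "-j", str(num_gpus),         # step over frames owned by other instances
            "-a",                        # render the animation (must come last)
        ]
        # Assumption: hiding the other cards via CUDA_VISIBLE_DEVICES is
        # enough to tie this Blender process to a single GPU.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        jobs.append((cmd, env))
    return jobs


def launch(jobs):
    """Start all instances at once and wait for them; returns exit codes."""
    procs = [subprocess.Popen(cmd, env=env) for cmd, env in jobs]
    return [p.wait() for p in procs]
```

With a hypothetical file name, `launch(build_jobs("scene.blend", 1, 250, 4))` would start four background instances, one per card, each rendering every fourth frame.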
The setup works basically like this... I have the following Python script (it can be anywhere on your hard drive, so long as you remember the path to it).