Using CASA can be slow at times. There are many tasks that have to work in a serial way and thus cannot be parallelized. Often, these tasks have to be run for a number of independent files which opens up an opportunity for a manual, kind of brute-force parallelization: just execute multiple instances of CASA and let each do only one specific task. This is quite easy to implement in python. The function run_in_casa in my python_helpers collection is a wrapper to launch CASA and pass commands to it. It takes the approach of passing the commands as a file to execute rather than directly connecting (e.g. to stdin). This is simple but robust and has worked very well for me for years. With this method to execute CASA, parallel instances can be created with the multiprocessing package.
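For orientation, here is a minimal sketch of what such a wrapper might look like. It is not the actual python_helpers implementation, and the casa command-line flags are an assumption about the local installation:

import os
import subprocess
import tempfile

def run_in_casa(commands):
    """Sketch: write the CASA commands to a temporary script and let a fresh
    CASA instance execute it. The real run_in_casa may differ in detail."""
    with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as f:
        f.write(commands)
        script = f.name
    try:
        # Assumes 'casa' is on PATH and accepts these typical flags.
        subprocess.run(['casa', '--nogui', '--nologger', '-c', script], check=True)
    finally:
        os.remove(script)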
The following is meant to be run in python3 since I will use the very convenient f-strings. In python2 these have to be replaced with str() conversions.
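As a quick illustration of the difference, using the kind of command string built below:

i = 42  # some value to interpolate into the command string

# Python 3: f-string
command = f"importfits(fitsimage='{i}.fits', imagename='{i}.image')"

# Python 2: the same string with explicit str() conversion and concatenation
command = "importfits(fitsimage='" + str(i) + ".fits', imagename='" + str(i) + ".image')"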
As a basic example, say I want to import 100 fits files into CASA images. The fits files are named 1.fits to 100.fits.
Create a list with the commands that should be executed by CASA:
import numpy as np
commands = [f"importfits(fitsimage='{i}.fits', imagename='{i}.image')" for i in np.arange(1, 101)]
For longer commands it is useful to use multiline strings. That way the code can be nicely formatted. For example:
commands = []
for i in np.arange(1, 101):
    commands.append(f"""import numpy as np
importfits(fitsimage='{i}.fits',
           imagename='{i}.image'
           )""")
Note that indentation requires some care with multiline strings. If the importfits is indented to match import numpy, CASA will complain about a wrongly indented code block.
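To make the pitfall concrete, here is a sketch of the broken variant (same hypothetical file names as above); the script it generates would raise an IndentationError inside CASA:

import numpy as np

commands = []
for i in np.arange(1, 101):
    # Broken: inside the multiline string, importfits is indented although it
    # is not part of any block, so CASA refuses to run the generated script.
    commands.append(f"""import numpy as np
    importfits(fitsimage='{i}.fits',
               imagename='{i}.image')""")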
Since importing data is I/O intensive, I do not want to use all 40 cores available to me because the processes would overwhelm the available I/O performance and end up running slower than necessary. Instead, I want to use only 10 cores.
from multiprocessing import Pool
pool = Pool(10)
Once set up, a multiprocessing.Pool instance just needs a function to execute and a list of arguments. It then runs the function (run_in_casa) with each of the arguments in the list (the commands list) in as many parallel processes as specified above (here: 10). This works because the run_in_casa function takes the CASA commands to execute as its one argument. The CASA instances will close themselves automatically after executing their commands.
from python_helpers import run_in_casa
pool.map(run_in_casa, commands)
Note: Since all CASA instances print their output to the terminal, it can get messy. That's the catch with this crude method of parallelization. For information or debugging, the CASA log files are much more helpful, and since each CASA instance is independent, the log files do not get mixed up.
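If the interleaved terminal output becomes a nuisance, one option (building on the hypothetical wrapper sketch above, not the actual python_helpers code) is to silence the subprocesses and rely on the log files entirely:

import os
import subprocess
import tempfile

def run_in_casa_quiet(commands):
    # Hypothetical variant: discard the terminal output of each CASA instance
    # so parallel runs do not clutter the terminal; details stay in the logs.
    with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as f:
        f.write(commands)
        script = f.name
    try:
        subprocess.run(['casa', '--nogui', '--nologger', '-c', script],
                       check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    finally:
        os.remove(script)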
pool.close()
pool.join()
And that's it. The complete example in one piece:
import numpy as np
from multiprocessing import Pool
from python_helpers import run_in_casa

commands = [f"importfits(fitsimage='{i}.fits', imagename='{i}.image')" for i in np.arange(1, 101)]

pool = Pool(10)
pool.map(run_in_casa, commands)
pool.close()
pool.join()