I’ve been learning how to use PyMC to fit Bayesian models over the last few months, and recently I decided I wanted to figure out how to sample separate chains in parallel on different cores on whatever machine I’m using.
I poked around a bit, looking at various multi-processing Python packages, and then I found this IPython notebook that walks through a fairly simple example using parallel processing machinery in IPython.
It’s easy to run multiple chains of a PyMC model in serial by writing a for loop and calling the model’s sample method once for each chain (see here for details about model fitting in PyMC). But if you want lots of samples in each chain, or if it takes a long time to get even a small number of samples from your model, it seems very inefficient to sample your chains in serial.
Okay, so here’s a very quick and dirty rundown of how to run chains in parallel.
First, you need IPython and PyMC installed, and you need a model script that does everything (i.e., reads in data, defines your model, writes your samples to disk, etc…). Let’s call your model script model_script.py.
The IPython notebook linked above seems, in my opinion, to make some of this more complicated than it needs to be: it imports PyMC on each core, then pushes the data to each core, and only then feeds the model script to the cores. If your script works when you call it with Python from the command line, it should work as described below.
Let’s say you want three chains. Open a terminal window and, after the prompt ($), type:
$ ipcluster start -n 3
Now, open another terminal window (or tab) and start IPython. Once that’s up and running, type:
In : from IPython.parallel import Client
In : client = Client()
In : direct = client[:]
In : direct.block = True
In : model_script = open('model_script.py').read()
In : direct.execute(model_script)
I don’t have a particularly deep understanding of what all is going on here, but I know it works. For the time being, that’s good enough for me. You can, and I probably should (and maybe even will, at some point), read all about ipcluster and various associated tools here.
Anyway, assuming, as mentioned above, that your model_script.py saves your samples for you (rather than, e.g., only keeping them in working memory for use in IPython), you should be good to go.
For what it’s worth, I ran into a bit of a problem when I first got this working. Specifically, the separate cores were creating databases with identical names in rapid succession, so only the last core to create a database was actually saving anything.
I fixed this by using something that I joked about when I first learned of its existence, namely Unix time, which is the number of seconds that have elapsed since 00:00:00 UTC on January 1, 1970.
I was fiddling with the time module in Python, trying to get unique time stamps to add to the database names (since I can’t figure out how to feed each core a unique string to add to its database name), and realized that time.time() returns a number with a few decimal places. I tried adding the following to the database names, on the assumption that somewhere around the fourth, fifth, or sixth decimal place of Unix time there would be measurable delays between the creation of the databases on the different cores.
clock_string = str(int(time.time()*1000000))
It works, and it looks like I could even shave a couple of zeros off and still get distinct names.
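Putting that together, here’s a sketch of a small helper that builds a unique database name from microsecond-resolution Unix time. The prefix and the pickle extension are placeholders, not anything PyMC requires:

```python
import time

def unique_db_name(prefix='model_samples'):
    # Microseconds since the Unix epoch. The cores create their databases
    # far enough apart in time that these strings come out distinct.
    clock_string = str(int(time.time() * 1000000))
    return '%s_%s.pickle' % (prefix, clock_string)

# In model_script.py, the name would then go to the database backend, e.g.:
#   pymc.MCMC(model_variables, db='pickle', dbname=unique_db_name())
```

Since each core runs its own copy of the script, each call to unique_db_name() happens at a slightly different instant, which is what keeps the names from colliding.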