> module Main where

How to write hybrid CPU/GPU programs with Haskell
-------------------------------------------------

What's better than programming a GPU with a high-level,
Haskell-embedded DSL (domain-specific language)? Well, perhaps
writing portable CPU/GPU programs that utilize both pieces of
silicon--with dynamic load-balancing between them--would fit the
bill.

This is one of the heterogeneous programming scenarios supported by
our new **meta-par** packages. A draft paper [can be found
here](http://www.cs.indiana.edu/~rrnewton/papers/meta-par_submission.pdf),
which explains the mechanism for building parallel schedulers out of
"mix-in" components. In this post, however, we will skip over that
and take a look at CPU/GPU programming specifically.

This post assumes familiarity with the **monad-par** parallel
programming library, [which is described in this
paper](http://www.cs.indiana.edu/~rrnewton/papers/haskell2011_monad-par.pdf).

Getting Started
-------------------------------------------------

First, we install the just-released [**meta-par-accelerate**
package](http://hackage.haskell.org/package/meta-par-accelerate):

    cabal install meta-par-accelerate

And then we import the following module:

> import Control.Monad.Par.Meta.AccSMP

This provides a scheduler that combines Accelerate GPU-EDSL execution
with monad-par multicore CPU scheduling. It also reexports the
[**ParAccelerate** type
class](http://www.cs.indiana.edu/~rrnewton/haddock/abstract-par-accelerate/Control-Monad-Par-Accelerate.html#t:ParAccelerate),
which provides the ability to launch GPU computations from within a
**Par** computation.

Next, we also import Accelerate itself so that we can express **Acc**
computations that can run on the GPU:

> import Data.Array.Accelerate
> import Data.Array.Accelerate.CUDA as CUDA

(By the way, this blog post is an executable literate Haskell file
[that can be found here](GITHUB_GIST).)

Now we are ready to create a trivial Accelerate computation:

> triv :: Acc (Scalar Float)
> triv = let arr = generate (constant (Z :. (10::Int))) (\ i -> 3.3 )
>        in  fold (+) 0 arr

We could run this directly using CUDA, which would print out
**Array (Z) [33.0]**, Accelerate's way of saying **33.0**
(i.e. it's a zero-dimensional array):

> runDirect = print (CUDA.run triv)

If we are instead inside a Par computation, we simply use **runAcc** or
**runAccWith**:

> runWPar1 = print (runPar (runAcc triv))
> runWPar2 = print (runPar (runAccWith CUDA.run triv))

The former uses the default Accelerate implementation. The latter
specifies which Accelerate implementation to use. After all, there
might ultimately be several -- OpenCL, CUDA, plus various CPU backends.

(In the future, we plan to add the ability to change the default
Accelerate backend either at the **runPar** site, or statically. Stay
tuned for that. But for now just use **runAccWith**.)

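To illustrate what "which implementation" means, here is a small
sketch (not part of the compiled code in this post) that passes the
reference interpreter from the **accelerate** package to
**runAccWith** instead of the CUDA backend; the interpreter's **run**
function has the same shape as **CUDA.run**:

    -- Hypothetical variant: evaluate the same Par computation on the
    -- CPU via the reference interpreter rather than on the GPU.
    import qualified Data.Array.Accelerate.Interpreter as Interp

    runWInterp :: IO ()
    runWInterp = print (runPar (runAccWith Interp.run triv))
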
One might at this point observe that it is possible to use **CUDA.run**
directly within a **Par** computation. This is true. The advantage
of using **runAcc** is that it informs the **Par** scheduler of what's
going on. The scheduler can therefore execute other work on the CPU
core that would otherwise be waiting for the GPU.

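For example, here is a sketch (hypothetical; it is not part of this
post's compiled code) of overlapping a GPU computation with CPU work
inside one **Par** computation, assuming the **spawn**, **spawn_**,
and **get** operations that the scheduler re-exports from monad-par:

    -- Launch the GPU task asynchronously, spawn some CPU work, then
    -- wait for both.  While the GPU task is pending, the scheduler is
    -- free to run the CPU task (or any other work) on this core.
    overlap :: Par Float
    overlap = do
      gpuFut <- spawn_ (runAccWith CUDA.run triv)                -- GPU task
      cpuFut <- spawn  (return (Prelude.sum [1.0 .. 100000.0]))  -- CPU task
      gpuArr <- get gpuFut
      cpuVal <- get cpuFut
      return (gpuArr `indexArray` Z + cpuVal)
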
An application could achieve the same effect by creating a dedicated
thread to talk to the GPU, but **forkIO** doesn't mix well with a pure
computation, and it's easier to let **meta-par** handle it.

The second benefit of integrated scheduling is that the scheduler can
automatically divide work between the CPU and GPU. Eventually, when
there are [full-featured, efficient CPU-backends for
Accelerate](https://github.com/HIPERFIT/accelerate-opencl), this will
happen transparently. For now you need to use **unsafeHybrid**,
described in the next section.

Finally, our [soon-forthcoming CPU/GPU/Distributed
schedulers](https://github.com/simonmar/monad-par/tree/4332a2dc6fab7ccdb702ad5b285e052f62b43c14/meta-par-dist-tcp)
can make more intelligent decisions if they know where all the calls
to GPU computations occur.

Hybrid CPU/GPU workloads
-------------------------------------------------

The [meta-par](http://hackage.haskell.org/package/meta-par) and
[meta-par-accelerate](http://hackage.haskell.org/package/meta-par-accelerate)
packages, as currently released, include a generalized work-stealing
infrastructure.

The relevant point for our purposes here is that the CPU and GPU can
each steal work from one another. Work-stealing is by no means the
most sophisticated CPU/GPU partitioning strategy on the scene. Much
literature has been written on the subject, and the approaches can get
quite elaborate (for example, modeling memory transfer time). However,
as on regular multicores, work-stealing provides an admirable
combination of simplicity and efficacy. For example, if a given
program runs much better on the CPU or the GPU respectively, then that
device will end up doing more of the work.

In the current release, we use [unsafeHybridWith, documented
here](http://www.cs.indiana.edu/~rrnewton/haddock/abstract-par-accelerate/Control-Monad-Par-Accelerate.html#v:unsafeHybrid),
to spawn a task with two separate implementations--one CPU and one
GPU--leaving the scheduler to choose between them. Here's a silly
example:

> hybrid :: Par (IVar Float)
> hybrid = unsafeHybridWith CUDA.run (`indexArray` Z) (return 33.0, triv)

> runHybrid = print (runPar (hybrid >>= get))

The call to **unsafeHybridWith** is passed a task that consists of
separate CPU (**return 33.0**) and GPU (**triv**) components.

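As a slightly less trivial sketch (hypothetical; the helper below is
not part of this post's compiled code), the same pattern can give a
summation both a list-based CPU implementation and an Accelerate GPU
implementation, with the conversion function turning the GPU's
zero-dimensional result array back into a plain **Float**:

    -- CPU branch: fold the Haskell list.  GPU branch: fold an
    -- Accelerate array.  The scheduler decides which one runs.
    hybridSum :: [Float] -> Par (IVar Float)
    hybridSum xs =
      unsafeHybridWith CUDA.run (`indexArray` Z)
                       (return (Prelude.sum xs), gpuSum)
      where
        gpuSum :: Acc (Scalar Float)
        gpuSum = fold (+) 0 (use (fromList (Z :. Prelude.length xs) xs))
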
Further generalizations
-----------------------------

* ParAccelerate

  Actually, thanks to GHC's support for [Constraint Kinds](URL_TODO),
  it is possible to generalize this even further, abstracting over not
  just Accelerate implementations but over different kinds of EDSLs.

* ParOffChip

  Such a class's type parameters are general enough to encapsulate
  both Accelerate (an **Arrays** constraint with the **Acc** type) and
  CloudHaskell-style remote calls (a **Serializable** constraint with
  the **Closure** type); a rough sketch follows after this list.

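Here is a rough sketch (hypothetical names and signatures, intended
only to convey the shape of the idea) of what such a generalized class
could look like with Constraint Kinds, with the constraint and the
off-chip computation type as class parameters:

    {-# LANGUAGE ConstraintKinds, KindSignatures, MultiParamTypeClasses #-}
    import GHC.Exts (Constraint)

    -- 'c' might be Arrays (Accelerate) or Serializable (CloudHaskell);
    -- 'f' might be Acc or Closure; 'p' is the Par monad in question.
    class Monad p => ParOffChip (c :: * -> Constraint) (f :: * -> *) p where
      -- Run an off-chip computation and wait for its result.
      runOffChip     :: c a => f a -> p a
      -- Like runAccWith: run it with an explicitly chosen backend.
      runOffChipWith :: c a => (f a -> a) -> f a -> p a
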
That said, we haven't yet seen a strong motivation for generalizing
the interface to this extent. (And there's always the danger that the
interface becomes difficult to use due to ambiguity errors from the
type checker.) If you have one, let us know!

Notes for Hackers
-----------------------------

If you want to work with the github repositories, you need to have GHC
7.4 and the latest cabal-install (0.14.0). You can check everything
out here:

    git clone git://github.com/simonmar/monad-par.git --recursive

Try **make mega-install-gpu** if you already have CUDA installed on your machine.

Appendix: Documentation Links
-----------------------------

* [accelerate-cuda](http://www.cs.indiana.edu/~rrnewton/haddock/accelerate-cuda/)

> main = do putStrLn "hi"
>           runDirect
>           runWPar1
>           runWPar2

> tmp :: Par (Scalar Float)
> tmp = runAcc triv