> module Main where

How to write hybrid CPU/GPU programs with Haskell
-------------------------------------------------

What's better than programming a GPU with a high-level,
Haskell-embedded DSL (domain-specific language)? Well, perhaps
writing portable CPU/GPU programs that utilize both pieces of
silicon--with dynamic load-balancing between them--would fit the
bill.

This is one of the heterogeneous programming scenarios supported by
our new **meta-par** packages. A draft paper [can be found
here](http://www.cs.indiana.edu/~rrnewton/papers/meta-par_submission.pdf),
which explains the mechanism for building parallel schedulers out of
"mix-in" components. In this post, however, we will skip over that
and take a look at CPU/GPU programming specifically.

This post assumes familiarity with the **monad-par** parallel
programming library, [which is described in this
paper](http://www.cs.indiana.edu/~rrnewton/papers/haskell2011_monad-par.pdf).

Getting Started
-------------------------------------------------

First, we install the just-released [**meta-par-accelerate**
package](http://hackage.haskell.org/package/meta-par-accelerate):

    cabal install meta-par-accelerate

And then we import the following module:

> import Control.Monad.Par.Meta.AccSMP

This provides a scheduler that combines Accelerate GPU-EDSL execution
with monad-par multicore CPU scheduling. It also reexports the
[**ParAccelerate** type
class](http://www.cs.indiana.edu/~rrnewton/haddock/abstract-par-accelerate/Control-Monad-Par-Accelerate.html#t:ParAccelerate),
which provides the ability to launch GPU computations from within a
**Par** computation.

Next, we also import Accelerate itself so that we can express **Acc**
computations that can run on the GPU:

> import Data.Array.Accelerate
> import Data.Array.Accelerate.CUDA as CUDA

(By the way, this blog post is an executable literate Haskell file
[that can be found here](GITHUB_GIST).)

Now we are ready to create a trivial Accelerate computation:

> triv :: Acc (Scalar Float)
> triv = let arr = generate (constant (Z :. (10::Int))) (\ i -> 3.3 )
>        in  fold (+) 0 arr

We could run this directly using CUDA, which would print out
**Array (Z) [33.0]**, Accelerate's way of saying **33.0**
(i.e. it's a zero-dimensional array):

> runDirect = print (CUDA.run triv)

If we are instead inside a Par computation, we simply use **runAcc** or
**runAccWith**:

> runWPar1 = print (runPar (runAcc triv))
> runWPar2 = print (runPar (runAccWith CUDA.run triv))

The former uses the default Accelerate implementation. The latter
specifies which Accelerate implementation to use. After all, there
might ultimately be several -- OpenCL, CUDA, plus various CPU backends.

(In the future, we plan to add the ability to change the default
Accelerate backend either at the **runPar** site, or statically. Stay
tuned for that. But for now just use **runAccWith**.)

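To illustrate what "which implementation" means, here is a small
sketch (not part of the compiled code in this post) that passes the
reference interpreter from the **accelerate** package to
**runAccWith** instead of the CUDA backend; the interpreter's **run**
function has the same shape as **CUDA.run**:

    -- Hypothetical variant: evaluate the same Par computation on the
    -- CPU via the reference interpreter rather than on the GPU.
    import qualified Data.Array.Accelerate.Interpreter as Interp

    runWInterp :: IO ()
    runWInterp = print (runPar (runAccWith Interp.run triv))
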
One might at this point observe that it is possible to use **CUDA.run**
directly within a **Par** computation. This is true. The advantage
of using **runAcc** is that it informs the **Par** scheduler of what's
going on. The scheduler can therefore execute other work on the CPU
core that would otherwise be waiting for the GPU.

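For example, here is a sketch (hypothetical; it is not part of this
post's compiled code) of overlapping a GPU computation with CPU work
inside one **Par** computation, assuming the **spawn**, **spawn_**,
and **get** operations that the scheduler re-exports from monad-par:

    -- Launch the GPU task asynchronously, spawn some CPU work, then
    -- wait for both.  While the GPU task is pending, the scheduler is
    -- free to run the CPU task (or any other work) on this core.
    overlap :: Par Float
    overlap = do
      gpuFut <- spawn_ (runAccWith CUDA.run triv)                -- GPU task
      cpuFut <- spawn  (return (Prelude.sum [1.0 .. 100000.0]))  -- CPU task
      gpuArr <- get gpuFut
      cpuVal <- get cpuFut
      return (gpuArr `indexArray` Z + cpuVal)
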
An application could achieve the same effect by creating a dedicated
thread to talk to the GPU, but **forkIO** doesn't mix well with a pure
computation, and it's easier to let **meta-par** handle it.

The second benefit of integrated scheduling is that the scheduler can
automatically divide work between the CPU and GPU. Eventually, when
there are [full-featured, efficient CPU-backends for
Accelerate](https://github.com/HIPERFIT/accelerate-opencl), this will
happen transparently. For now you need to use **unsafeHybrid**,
described in the next section.

Finally, our [soon-forthcoming CPU/GPU/Distributed
schedulers](https://github.com/simonmar/monad-par/tree/4332a2dc6fab7ccdb702ad5b285e052f62b43c14/meta-par-dist-tcp)
can make more intelligent decisions if they know where all the calls
to GPU computations occur.

Hybrid CPU/GPU workloads
-------------------------------------------------

The [meta-par](http://hackage.haskell.org/package/meta-par) and
[meta-par-accelerate](http://hackage.haskell.org/package/meta-par-accelerate)
packages, as currently released, include a generalized work-stealing
infrastructure.

The relevant point for our purposes here is that the CPU and GPU can
each steal work from one another. Work-stealing is by no means the
most sophisticated CPU/GPU partitioning strategy on the scene. Much
literature has been written on the subject, and the approaches can get
quite elaborate (for example, modeling memory transfer time). However,
as on regular multicores, work-stealing provides an admirable
combination of simplicity and efficacy. For example, if a given
program runs much better on the CPU or the GPU respectively, then that
device will end up doing more of the work.

In the current release, we use [unsafeHybridWith, documented
here](http://www.cs.indiana.edu/~rrnewton/haddock/abstract-par-accelerate/Control-Monad-Par-Accelerate.html#v:unsafeHybrid),
to spawn a task with two separate implementations--one CPU and one
GPU--leaving the scheduler to choose between them. Here's a silly
example:

> hybrid :: Par (IVar Float)
> hybrid = unsafeHybridWith CUDA.run (`indexArray` Z) (return 33.0, triv)

> runHybrid = print (runPar (hybrid >>= get))

The call to **unsafeHybridWith** is passed a task that consists of
separate CPU (**return 33.0**) and GPU (**triv**) components.

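As a slightly less trivial sketch (hypothetical; the helper below is
not part of this post's compiled code), the same pattern can give a
summation both a list-based CPU implementation and an Accelerate GPU
implementation, with the conversion function turning the GPU's
zero-dimensional result array back into a plain **Float**:

    -- CPU branch: fold the Haskell list.  GPU branch: fold an
    -- Accelerate array.  The scheduler decides which one runs.
    hybridSum :: [Float] -> Par (IVar Float)
    hybridSum xs =
      unsafeHybridWith CUDA.run (`indexArray` Z)
                       (return (Prelude.sum xs), gpuSum)
      where
        gpuSum :: Acc (Scalar Float)
        gpuSum = fold (+) 0 (use (fromList (Z :. Prelude.length xs) xs))
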
Further generalizations
-----------------------------

* ParAccelerate

  Actually, thanks to GHC's support for [Constraint Kinds](URL_TODO),
  it is possible to generalize this even further, abstracting over not
  just Accelerate implementations but over different kinds of EDSLs.

* ParOffChip

  Such a class's type parameters are general enough to encapsulate
  both Accelerate (an **Arrays** constraint with the **Acc** type) and
  CloudHaskell-style remote calls (a **Serializable** constraint with
  the **Closure** type); a rough sketch follows after this list.

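Here is a rough sketch (hypothetical names and signatures, intended
only to convey the shape of the idea) of what such a generalized class
could look like with Constraint Kinds, with the constraint and the
off-chip computation type as class parameters:

    {-# LANGUAGE ConstraintKinds, KindSignatures, MultiParamTypeClasses #-}
    import GHC.Exts (Constraint)

    -- 'c' might be Arrays (Accelerate) or Serializable (CloudHaskell);
    -- 'f' might be Acc or Closure; 'p' is the Par monad in question.
    class Monad p => ParOffChip (c :: * -> Constraint) (f :: * -> *) p where
      -- Run an off-chip computation and wait for its result.
      runOffChip     :: c a => f a -> p a
      -- Like runAccWith: run it with an explicitly chosen backend.
      runOffChipWith :: c a => (f a -> a) -> f a -> p a
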
That said, we haven't yet seen a strong motivation for generalizing
the interface to this extent. (And there's always the danger that the
interface becomes difficult to use due to ambiguity errors from the
type checker.) If you have one, let us know!

Notes for Hackers
-----------------------------

If you want to work with the github repositories, you need to have GHC
7.4 and the latest cabal-install (0.14.0). You can check everything
out here:

    git clone git://github.com/simonmar/monad-par.git --recursive

Try **make mega-install-gpu** if you already have CUDA installed on your machine.

Appendix: Documentation Links
-----------------------------

* [accelerate-cuda](http://www.cs.indiana.edu/~rrnewton/haddock/accelerate-cuda/)

> main = do putStrLn "hi"
>           runDirect
>           runWPar1
>           runWPar2

> tmp :: Par (Scalar Float)
> tmp = runAcc triv