jvoorhis · July 9, 2010 19:57
diff --git a/Galois Tech Talk: Requirements and Performance of Data Intensive, Irregular Applications b/Galois Tech Talk: Requirements and Performance of Data Intensive, Irregular Applications
 2010 07 09 10:30

 Galois Tech Talk: Requirements and Performance of Data Intensive, Irregular Applications

 Iavor S. Diatchki
  Pacific NW National Laboratory

 Analytics
  Security
    Anomaly detection
  Semantic web
    Trends

 Data is graph based
  Linear algebra methods fall down
  favors special architectures
    Cray XMT used in speaker's research 
      http://en.wikipedia.org/wiki/Cray_XMT
      http://www.cray.com/products/xmt/
      BSD Microkernel schedules only user code on Threadstorm CPUs
        500 MHz maximal processor count from 256 to 8192, and maximum memory to 128 TB
        single core may run up to 15 programs simultaneously
      system IO performed by Opterons running Linux
        code manually moved to compute cores via software
      global address space
    sponsored by Cray

 low diameter graphs
  work explodes
  difficult to partition
  high % of nodes visited

 scale-free graphs
  few nodes are highly connected hubs
  difficult to partition
  concentrates in few nodes

 graph methods
  paths (shortest path, min/max, etc)
  strutcures (SCC, etc)
  groups (coloring, etc)
  orderings (topological, etc)

 graph properties
  degree distribution – normal vs scale-free
  planar / non-planar
  static / dynamic
  weighted / unweighted
  typed / untyped edges

 challenges
  memory intensive
  low locality
  hide latency w/ parallelism
  low computation:communication
  fine grained synchronization (per entity)
  work is imbalanced
    favors work-stealing?

 software implications
  !! place data near computation
  access data in order / reuse
  !! partition into tasks
  minimize communication / synchronization
  avoid modifying shared data

 multithreading
  avoid stalling cpu pipeline
    context switch on memory cache miss
    high memory bandwidth 
    many threads per core, small thread state
    PNNL: software tools to facilitate use of local memory

 Maximum matching
  matching is a subset of edges with no incident vertices
  maximize cost function
  Hoepman's Algorithm
    greedy, fast, terminating

 Parallel partitioning
  Halappanavar, Dobrina, Pothen
  partition graph on shared memory machine
  independent computation followed by shared computation

 Tools
  METIS Partitioner

 DataFlow algorithm for maximum matching
  nodes propose by setting signals on edges
    state is Maybe Bit per (edge x vertex)
    fine grain synchronization
  performance insensitive to structure / weight
  !! deadlock prone
    pre-emption is too expensive (XMT does not pre-empt by default!)
    manually order tasks to guarantee progress
      heaviest edge will always match
      add is-active field to vertices
        if CAS succeeds, make recursive call to visit connected vertex
          would benefit from CPS?
  techniques extend to graph coloring, other trad algorithms


 Summary
  unstrutured / irregular data is graph-centric
  dataflow techniques useful for parallelizing graph algorithms
  parallel architecture increases productivity

 Q/A
  PNNL not researching FPGA / GPGPU yet
    other research in progress, future area of interest
	2010 07 09 10:30

	Galois Tech Talk: Requirements and Performance of Data Intensive, Irregular Applications

	Iavor S. Diatchki
	Pacific NW National Laboratory

	Analytics
	Security
	Anomaly detection
	Semantic web
	Trends

	Data is graph based
	Linear algebra methods fall down
	favors special architectures
	Cray XMT used in speaker's research
	http://en.wikipedia.org/wiki/Cray_XMT
	http://www.cray.com/products/xmt/
	BSD Microkernel schedules only user code on Threadstorm CPUs
	500 MHz maximal processor count from 256 to 8192, and maximum memory to 128 TB
	single core may run up to 15 programs simultaneously
	system IO performed by Opterons running Linux
	code manually moved to compute cores via software
	global address space
	sponsored by Cray

	low diameter graphs
	work explodes
	difficult to partition
	high % of nodes visited

	scale-free graphs
	few nodes are highly connected hubs
	difficult to partition
	concentrates in few nodes

	graph methods
	paths (shortest path, min/max, etc)
	strutcures (SCC, etc)
	groups (coloring, etc)
	orderings (topological, etc)

	graph properties
	degree distribution – normal vs scale-free
	planar / non-planar
	static / dynamic
	weighted / unweighted
	typed / untyped edges

	challenges
	memory intensive
	low locality
	hide latency w/ parallelism
	low computation:communication
	fine grained synchronization (per entity)
	work is imbalanced
	favors work-stealing?

	software implications
	!! place data near computation
	access data in order / reuse
	!! partition into tasks
	minimize communication / synchronization
	avoid modifying shared data

	multithreading
	avoid stalling cpu pipeline
	context switch on memory cache miss
	high memory bandwidth
	many threads per core, small thread state
	PNNL: software tools to facilitate use of local memory

	Maximum matching
	matching is a subset of edges with no incident vertices
	maximize cost function
	Hoepman's Algorithm
	greedy, fast, terminating

	Parallel partitioning
	Halappanavar, Dobrina, Pothen
	partition graph on shared memory machine
	independent computation followed by shared computation

	Tools
	METIS Partitioner

	DataFlow algorithm for maximum matching
	nodes propose by setting signals on edges
	state is Maybe Bit per (edge x vertex)
	fine grain synchronization
	performance insensitive to structure / weight
	!! deadlock prone
	pre-emption is too expensive (XMT does not pre-empt by default!)
	manually order tasks to guarantee progress
	heaviest edge will always match
	add is-active field to vertices
	if CAS succeeds, make recursive call to visit connected vertex
	would benefit from CPS?
	techniques extend to graph coloring, other trad algorithms


	Summary
	unstrutured / irregular data is graph-centric
	dataflow techniques useful for parallelizing graph algorithms
	parallel architecture increases productivity

	Q/A
	PNNL not researching FPGA / GPGPU yet
	other research in progress, future area of interest