Jeff Larkin jefflarkin

Background

OpenACC defines data behavior according to whether the data resides in discrete or shared memory. When memory is discrete, the specification defines specific data operations for each data clause, along with implicit data clauses for data that is not listed explicitly. When memory is shared, any data clauses that are present may be ignored. As an optimization, an implementation may instead use data clauses as hints. I have historically thought of these hints in terms of CUDA Unified/Managed Memory, with preferred-location and prefetching hints. A few cases have been brought to my attention that illustrate how this thinking may not be sufficient.
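The following is a minimal sketch of the distinction, assuming a C code base. The `scale` function is hypothetical and not from the original. With discrete memory, the `copyin` and `copyout` clauses are what move `x` to the device and `y` back. With shared memory, an implementation may ignore them or treat them only as placement or prefetch hints. Compiled without OpenACC support, the pragmas are ignored and the loop runs serially on the host.

```c
#include <stddef.h>

/* Hypothetical example: y = alpha * x.
 * Discrete memory: copyin moves x[0:n] to the device before the loop and
 * copyout moves y[0:n] back to the host when the data region ends.
 * Shared memory: host and device see the same memory, so an implementation
 * may ignore these clauses or use them only as placement/prefetch hints. */
void scale(size_t n, double alpha, const double *restrict x, double *restrict y)
{
    #pragma acc data copyin(x[0:n]) copyout(y[0:n])
    {
        #pragma acc parallel loop
        for (size_t i = 0; i < n; ++i)
            y[i] = alpha * x[i];
    }
}
```

Either way the results are the same. The clauses change only where data lives and when it moves.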

Modifying an allocation during an asynchronous region

I have been made aware of an application that uses the pattern below extensively. A temporary array is allocated locally (in the example below, it is an automatic array), and dynamic data lifetimes are used to expose it to the device asynchronously. As a result, the function may return, deallocating the automatic array, before all operations on that array have completed.
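The original example is cut off here, so the following is only a hedged reconstruction of the pattern described, not the application's code. `fill_with_scratch`, `TMP_N`, and the use of async queue 1 are hypothetical. Suppose memory is shared and the data clauses are ignored. The queued kernels then operate on `tmp`'s stack storage directly, so returning early would leave them touching memory that no longer belongs to `tmp`. The final `wait(1)` is one way to prevent that.

```c
#include <stddef.h>

/* Hypothetical capacity of the scratch array (not from the original). */
#define TMP_N 1024

/* Sketch: a local automatic array is exposed to the device with dynamic
 * data lifetimes (enter/exit data) on async queue 1.
 * Returns 0 on success, -1 if n does not fit in the scratch array. */
int fill_with_scratch(double *out, int n)
{
    if (n < 0 || n > TMP_N)
        return -1;

    double tmp[TMP_N]; /* automatic: storage ends when this function returns */

    /* Begin device lifetimes for tmp and out without blocking the host. */
    #pragma acc enter data create(tmp[0:n], out[0:n]) async(1)

    #pragma acc parallel loop present(tmp[0:n]) async(1)
    for (int i = 0; i < n; ++i)
        tmp[i] = 0.5 * i;

    #pragma acc parallel loop present(tmp[0:n], out[0:n]) async(1)
    for (int i = 0; i < n; ++i)
        out[i] = tmp[i] + 1.0;

    /* End the lifetimes: copy out back to the host and discard tmp. */
    #pragma acc exit data copyout(out[0:n]) delete(tmp[0:n]) async(1)

    /* Block until everything queued on queue 1 has finished. Without this
     * wait, the function could return (releasing tmp's stack storage) while
     * the operations above are still pending. */
    #pragma acc wait(1)
    return 0;
}
```

Removing the `wait(1)` reproduces the hazard. That variant is only safe if every caller waits on queue 1 before `tmp`'s stack frame can be reused, which an automatic array cannot guarantee.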

Biography

Jeff Larkin is the Director of HPC Architecture in NVIDIA's HPC Software Product team, where he leads teams responsible for HPC, Quantum, and CAE/EDA software architecture and technical marketing engineering. He is passionate about the advancement and adoption of parallel programming models for High Performance Computing. He was previously a member of NVIDIA's Developer Technology group, specializing in performance analysis and optimization of high performance computing applications. Jeff is also the chair of the OpenACC technical committee and has worked in both the OpenACC and OpenMP standards bodies. Before joining NVIDIA, Jeff worked in the Cray Supercomputing Center of Excellence, located at Oak Ridge National Laboratory. Jeff holds a B.S. in Computer Science from Furman University and an M.S. in Computer Science from the University of Tennessee, where he was a member of the Innovative Computing Lab.

Headshot

Socia

Keybase proof

I hereby claim:

I am jefflarkin on github.
I am jefflarkin (https://keybase.io/jefflarkin) on keybase.
I have a public key whose fingerprint is 13D8 BCCC 1A50 57D3 FB4F E33A 950C 167F 0C1A 041D

To claim this, I am signing this object:

	#
	# Klipper configuration file for Anycubic i3 MEGA S
	#
	# This config file contains settings of all printer pins (steppers, sensors) for Anycubic i3 mega S with TMC2208 Drivers with stock plug orientation
	# Klipper firmware should be compiled for the atmega2560
	#
	# Config file includes
	# - Original or 2208(2209) rotated by cable drivers
	# - Mesh bed leveling: BLtouch (3DTouch sensor from Triangelab)
	# - Manual meshed bed leveling (commented out)

	# This should be added AFTER the FindCUDA macro has been run
	IF(USE_NVTX)
	IF(HAVE_CUDA)
	ADD_DEFINITIONS(-DUSE_NVTX)
	LINK_DIRECTORIES("${CUDA_TOOLKIT_ROOT_DIR}/lib64")
	LINK_LIBRARIES("nvToolsExt")
	ENDIF(HAVE_CUDA)
	ENDIF(USE_NVTX)
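One common way to consume the `USE_NVTX` definition added by this CMake fragment is a pair of range macros that compile away when NVTX is disabled. The macro names `PUSH_RANGE`/`POP_RANGE` and the function `sum_with_ranges` are illustrative assumptions. `nvtxRangePushA` and `nvtxRangePop` are the NVTX calls declared in `nvToolsExt.h`.

```c
/* Illustrative range macros (names are assumptions, not from the original).
 * With -DUSE_NVTX (as added by ADD_DEFINITIONS above), ranges are forwarded
 * to NVTX; otherwise they expand to no-ops and need no extra library. */
#ifdef USE_NVTX
#include <nvToolsExt.h>
#define PUSH_RANGE(name) nvtxRangePushA(name)
#define POP_RANGE()      nvtxRangePop()
#else
#define PUSH_RANGE(name) ((void)0)
#define POP_RANGE()      ((void)0)
#endif

/* Hypothetical instrumented function: the whole loop appears as one named
 * range on the profiler timeline when NVTX is enabled. */
double sum_with_ranges(const double *x, int n)
{
    PUSH_RANGE("sum_with_ranges");
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += x[i];
    POP_RANGE();
    return s;
}
```

Keeping the NVTX dependency behind a single compile definition lets the same source build with or without `nvToolsExt` linked.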

	#!/bin/bash
	# USAGE: Add between aprun options and executable
	# For Example: aprun -n 16 -N 1 ./foo arg1 arg2
	# Becomes: aprun -n 16 -N 1 ./cpu_profile.sh ./foo arg1 arg2

	# Give each rank a separate file
	LOG=cpu_profile_$ALPS_APP_PE.nvprof

	# Stripe each profile file by 1 to share the load on large runs
	if [ ! -f "$LOG" ] ; then

	#!/bin/bash
	#FIXME Add usage() function to improve documentation

	# Enable exposure of the specified GPIO pin (0-8)
	gpio_enable()
	{
	if [[ ("$1" -lt 0) || ("$1" -gt 8) ]] ; then
	echo "Valid pins are 0-8"
	return 1
	fi

	#!/bin/bash
	# USAGE: Add between aprun options and executable
	# For Example: aprun -n 16 -N 1 ./foo arg1 arg2
	# Becomes: aprun -n 16 -N 1 ./nvprof.sh ./foo arg1 arg2
	export PMI_NO_FORK=1

	# Give each rank a separate file
	LOG=timeline_$ALPS_APP_PE.nvprof

	# Set the process and context names

	#include <pthread.h>
	#include <nvToolsExt.h>
	#include <nvToolsExtCudaRt.h>
	// Setup event category name
	{{fn name MPI_Init}}
	nvtxNameCategoryA(999, "MPI");
	{{callfn}}
	int rank;
	PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
	char name[256];