Bulat-Ziganshin · April 13, 2023 06:01
diff --git a/bulat-gpgpu-links.txt b/bulat-gpgpu-links.txt
 https://devtalk.nvidia.com/default/topic/933827/cuda-programming-and-performance/fast-256-bin-histogram/
 http://www.cse.uconn.edu/~zshi/course/cse5302/ref/chhugani08sorting.pdf
 http://link.springer.com/chapter/10.1007/978-3-642-23397-5_16
 http://arxiv.org/abs/1008.2849 Faster Radix Sort via Virtual Memory and Write-Combining Jan Wassenberg, Peter Sanders

 https://devtalk.nvidia.com/default/topic/378826/cuda-programming-and-performance/my-speedy-sgemm/post/2703033/#2703033
 https://devtalk.nvidia.com/default/topic/390366/cuda-programming-and-performance/instruction-latency/post/2768197/#2768197
 https://devtalk.nvidia.com/default/topic/913832/cuda-programming-and-performance/sum-reduction-working-in-fermi-kepler-and-maxwell/
 https://devtalk.nvidia.com/default/topic/776043/cuda-programming-and-performance/whats-new-in-maxwell-sm_52-gtx-9xx-/1
 https://devtalk.nvidia.com/default/topic/690631/cuda-programming-and-performance/so-whats-new-about-maxwell-/post/4305310/#4305310
 https://devtalk.nvidia.com/default/topic/878664/cuda-programming-and-performance/custom-memory-allocator-for-cuda-desired/post/4755097/#4755097
 https://devtalk.nvidia.com/default/topic/695408/first-impressions-of-cuda-6-managed-memory/
 https://devtalk.nvidia.com/default/topic/878455/cuda-programming-and-performance/gtx750ti-and-buffers-gt-1gb-on-win7/post/4672378/#4672378
 https://devtalk.nvidia.com/default/topic/638031/ldg-versus-textures/
 http://stackoverflow.com/questions/17004557/how-to-avoid-tlb-miss-and-high-global-memory-replay-overhead-in-cuda-gpus
 https://devtalk.nvidia.com/default/topic/873995/global-memory-access-bottleneck/
 http://www.techenablement.com/inside-nvidias-unified-memory-multi-gpu-limitations-and-the-need-for-a-cudamadvise-api-call/
 https://parallel-computing.pro/index.php/9-cuda/43-openmp-4-0-on-nvidia-cuda-gpus
 GPGPU-Sim 3.x Manual http://archive.is/krK9N
 https://devtalk.nvidia.com/default/topic/928796/gpu-accelerated-libraries/moderngpu-2-0/
 https://devtalk.nvidia.com/default/topic/844924/announcements/cudapad-and-its-source-code-are-now-available-for-download-/
 CUB: http://on-demand.gputechconf.com/gtc/2015/video/S5617.html
     https://www.microway.com/hpc-tech-tips/cub-action-simple-examples-using-cub-template-library/
 http://on-demand.gputechconf.com/gtc-express/2013/videos/understanding-parallel-graph-algorithms.mp4
 http://on-demand.gputechconf.com/gtc/2013/webinar/essential-optimization-techniques-for-nvidia-kepler-and-fermi-architecture.mp4
 https://devtalk.nvidia.com/default/topic/799429/cuda-programming-and-performance/possible-to-use-the-cuda-math-api-integer-intrinsics-to-find-the-nth-unset-bit-in-a-32-bit-int/post/4407256/#4407256
 https://devtalk.nvidia.com/default/topic/804281/cuda-programming-and-performance/maxwell-integer-mul-mad-instruction-counts
 https://devtalk.nvidia.com/default/topic/980740/cuda-programming-and-performance/xmad-meaning/
 http://stackoverflow.com/questions/35566178/instruction-replay-in-cuda/35593124#35593124
 https://devtalk.nvidia.com/default/topic/937736/cuda-programming-and-performance/saturated-16-bit-1-15-float-hack/post/4887190/#4887190
 http://stackoverflow.com/questions/37732735/nvprof-option-for-bandwidth
 https://devtalk.nvidia.com/default/topic/1006066/cuda-programming-and-performance/pascal-l1-cache
 https://devtalk.nvidia.com/default/topic/1009766/cuda-programming-and-performance/single-gpu-core-vs-single-cpu-core
 http://www.hardware.fr/articles/948-2/gp104-7-2-milliards-transistors-16-nm.html
    http://www.hardware.fr/articles/951-2/polaris-10-5-7-milliards-transistors-14-nm.html

 The way I learned SASS:
 1. ptx manual: http://docs.nvidia.com/cuda/parallel-thread-execution/
 2. http://docs.nvidia.com/cuda/cuda-binary-utilities/#instruction-set-ref
 3. https://github.com/laanwj/decuda
 4. read the wiki of the asfermi project: https://github.com/hyqneuron/asfermi/wiki
 5. read the Kepler SASS manual: https://hpc.aliyun.com/doc/keplerAssemblerUserGuide
 6. there is also maxas, but its docs don't describe the instructions

 low-level benchmarks:
 http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
 http://www.stuffedcow.net/research/cudabmk Demystifying GPU Microarchitecture through Microbenchmarking
 http://asg.ict.ac.cn/dgemm/microbenchs.tar.gz
 http://repository.lib.ncsu.edu/ir/bitstream/1840.16/9585/1/etd.pdf
 https://hal.inria.fr/file/index/docid/789958/filename/112_Lai.pdf
 http://hgpu.org/?p=14541 Dissecting GPU Memory Hierarchy through Microbenchmarking
 http://hgpu.org/?p=16616 Understanding Latency Hiding on GPUs by Vasily Volkov


 a few books with low-level GPU details:
 http://www.cudahandbook.com/
 Shane Cook "CUDA Programming: A Developer’s Guide to Parallel Computing with GPUs"
 Rob Farber "CUDA Application Design and Development"
 David Kirk, Wen-mei Hwu "Programming Massively Parallel Processors"

 Talks:
 http://on-demand-gtc.gputechconf.com/gtc-quicklink/9BNvqKX
 http://on-demand.gputechconf.com/gtc/2013/presentations/S3466-Programming-Guidelines-GPU-Architecture.pdf
 http://on-demand.gputechconf.com/gtc/2016/presentation/s6807-angerer-dynamic-parallelism.pdf


 AMD:
 https://radeonopencompute.github.io/documentation.html
 http://developer.amd.com/wordpress/media/2013/06/2620_final.pdf
 http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/documentation/
 http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf
 https://forum.beyond3d.com/posts/1721467/
 https://forum.beyond3d.com/threads/amd-southern-islands-7-series-speculation-rumour-thread.50220/page-22#post-1515943
 https://github.com/SunsetQuest/Asm4GCN
 https://realhet.wordpress.com/
 http://x.pgy.hu/~worm/het/hp/GCN_Reference_Card.html

 http://www.asmcommunity.net/forums/topic/?id=30544


 Intel:
 https://software.intel.com/en-us/articles/introduction-to-gen-assembly

 ARM:
 http://www.anandtech.com/show/10375/arm-unveils-bifrost-and-mali-g71/2