
Dietmar Suoch didito

@sebbbi
sebbbi / BetterBuffers.txt
Created February 28, 2019 05:04
Better buffers
All current buffer types in shading languages are slightly different ways to present homogeneous arrays (a single struct or type repeated N times in memory).
DirectX has raw buffers (RWByteAddressBuffer), but these are limited to 32-bit integer types, and the implementation doesn't require natural alignment for wide loads, resulting in suboptimal codegen on Nvidia GPUs.
Complex use cases, such as tree traversal in spatial data structures (physics, ray tracing, etc.), require a non-homogeneous data structure. You want different node payloads and a tight memory layout.
The ability to mix 8/16/32-bit data types and 1d/2d/4d vectors in the same data structure, to facilitate GPU wide loads (max bandwidth), is crucial for complex use cases like this.
On the other hand, we want better, more readable and maintainable code syntax than DirectX raw buffers, without manual bit packing/extracting and reinterpret casting. The goal should be to allow modern GPUs to use sub-register addressing (SDWA on AMD hardware). Saving both ALU and register
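To make the "manual bit packing/extracting" complaint concrete, here is a minimal CPU-side C sketch (not shader code) of reading mixed 8/16-bit fields out of a buffer that can only be loaded as 32-bit words. The node layout, the field order, and every function name below are hypothetical and only serve as an illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical node header packed into one 32-bit word:
 *   bits  0..7  : node type
 *   bits  8..15 : flags
 *   bits 16..31 : child index
 * A second 32-bit word holds the payload. */

/* Load one 32-bit word at a byte offset, mimicking a raw-buffer load
 * that only works at 32-bit granularity. */
static uint32_t load_u32(const uint32_t *buffer, uint32_t byte_offset) {
    assert(byte_offset % 4 == 0); /* this sketch insists on natural alignment */
    return buffer[byte_offset / 4];
}

/* Manual extraction: every sub-word field costs a shift and/or a mask. */
static uint32_t node_type(uint32_t word0)  { return word0 & 0xFFu; }
static uint32_t node_flags(uint32_t word0) { return (word0 >> 8) & 0xFFu; }
static uint32_t node_child(uint32_t word0) { return word0 >> 16; }

/* Manual packing: the inverse of the three extractors above. */
static uint32_t pack_node_header(uint32_t type, uint32_t flags, uint32_t child) {
    return (type & 0xFFu) | ((flags & 0xFFu) << 8) | ((child & 0xFFFFu) << 16);
}
```

Each field access here costs extra shift and mask instructions and keeps the whole 32-bit word live in a register. That appears to be the overhead the text hopes typed mixed-width buffers and sub-register addressing could remove.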
@nbouteme
nbouteme / ss-fs.glsl
Last active October 29, 2024 09:54
Skyward Sword Brush shader. Accurately emulates what's done with TEVs in a shader. Does NOT include the blurring pass.
#version 300 es
precision highp float;
in vec2 UV;
out vec4 out_color;
uniform float ratio, time;
uniform sampler2D texture0;
const float PI_3 = 1.0471975512;
@niklas-ourmachinery
niklas-ourmachinery / reducing-build-times.md
Created January 24, 2019 16:30
Reducing build times by 20 % with a one line change

While experimenting with the /d2cgsummary and /d1reportTime flags described by Aras here and here, I noticed that one of our functions was consistently showing up in the Anomalistic Compile Times section:

1>	Anomalistic Compile Times: 2
1>		create_truth_types: 0.643 sec, 2565 instrs
1>		draw_nodes: 0.180 sec, 5348 instrs
/*This is free and unencumbered software released into the public domain.
Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.
In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit

Latency Comparison Numbers (~2012)

Name
L1 cache reference                0.5 ns
Branch mispredict                   5 ns
L2 cache reference                  7 ns          14x L1 cache
Mutex lock/unlock                  25 ns
Main memory reference             100 ns          20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy    3,000 ns   3 us

State of Roblox graphics API across all platforms, with percentage deltas since EOY 2018. Updated December 29 2019.

Windows

API              Share
Direct3D 11+     85%  (+5%)
Direct3D 10.1    8.5% (-1.5%)
Direct3D 10.0    5.5% (-2.5%)
Direct3D 9       1%   (-1%)
@negrinho
negrinho / latency.txt
Created July 16, 2018 21:11 — forked from jboner/latency.txt
Latency Numbers Every Programmer Should Know
Latency Comparison Numbers Simplified (~2012)
----------------------------------   log2   log10
L1 cache reference                      0       0   ~ 1 ns
Branch mispredict                       3       1
L2 cache reference                      4       1
Mutex lock/unlock                       6       2
Main memory reference                   8       2
Compress 1K bytes with Zippy           13       4
Send 1K bytes over 1 Gbps network      14       4
Read 4K randomly from SSD*             18       5
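The log2 and log10 columns look like rounded logarithms of each latency relative to the L1 cache reference (0.5 ns in the ~2012 table earlier on this page). That reading is an inference from the numbers, not something the gist states. The C sketch below rounds a logarithm to the nearest integer under that assumption. The helper name `rounded_log` is hypothetical, and it compares squared values instead of calling the math library.

```c
#include <assert.h>

/* Returns the integer k nearest to log_base(ratio), for ratio >= 1.
 * k is the rounded logarithm when base^(k-0.5) <= ratio < base^(k+0.5),
 * which is equivalent to base^(2k-1) <= ratio^2 < base^(2k+1).
 * Working with squares avoids fractional powers and libm calls. */
static int rounded_log(double ratio, double base) {
    assert(ratio >= 1.0 && base > 1.0);
    const double squared = ratio * ratio;
    double threshold = base; /* base^(2k+1) for k = 0 */
    int k = 0;
    while (squared >= threshold) { /* ratio is at least base^(k+0.5) */
        threshold *= base * base;
        ++k;
    }
    return k;
}
```

For example, a 100 ns main memory reference is 200 times the 0.5 ns baseline. `rounded_log(200.0, 2.0)` gives 8 and `rounded_log(200.0, 10.0)` gives 2, matching that row of the table.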
@jspohr
jspohr / microsecs.c
Last active January 7, 2026 06:16
Avoid overflow when converting time to microseconds
// Taken from the Rust code base: https://github.com/rust-lang/rust/blob/3809bbf47c8557bd149b3e52ceb47434ca8378d5/src/libstd/sys_common/mod.rs#L124
// Computes (value*numer)/denom without overflow, as long as both
// (numer*denom) and the overall result fit into i64 (which is the case
// for our time conversions).
int64_t int64MulDiv(int64_t value, int64_t numer, int64_t denom) {
    int64_t q = value / denom;
    int64_t r = value % denom;
    // Decompose value as (value/denom*denom + value%denom),
    // substitute into (value*numer)/denom and simplify.
    // r < denom, so (denom*numer) is the upper bound of (r*numer)
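The preview stops before the return statement. Following the decomposition described in the comments, a complete version might look like the sketch below. The final expression is derived from those comments rather than copied from the gist, and the `assert` on `denom` is an addition for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Computes (value * numer) / denom without overflowing the intermediate
 * product, as long as numer * denom and the final result fit into int64_t. */
int64_t int64MulDiv(int64_t value, int64_t numer, int64_t denom) {
    assert(denom != 0); /* added guard, not part of the original preview */
    int64_t q = value / denom;
    int64_t r = value % denom;
    /* value = q * denom + r, therefore
     * value * numer / denom = q * numer + (r * numer) / denom.
     * |r| < |denom|, so r * numer is bounded in magnitude by denom * numer. */
    return q * numer + r * numer / denom;
}
```

As a usage example, converting 10,000,000,000,123 ticks of a 10 MHz counter to microseconds is `int64MulDiv(10000000000123LL, 1000000, 10000000)`. It yields 1,000,000,000,012, whereas the naive `value * numer` would exceed the `int64_t` range before the division.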
@BeRo1985
BeRo1985 / MiniSoftFP32.pas
Last active July 30, 2018 02:19
MiniSoftFP32 - A simple small software 32-bit single precision floating point implementation
unit MiniSoftFP32; // Copyright (C) 2018, Benjamin "BeRo" Rosseaux (benjamin@rosseaux.de) - License: CC0
// Disclaimer / Notice of caution:
// Attention, this code implements only the basic functions, but for example not the correct handling of
// Infinity, NaN, division-by-zero special cases and so on.
// In short, this code is only intended for demystifying the base floating point arithmetics (using 32-bit
// single precision floating point values in this implementation).
{$ifdef fpc}
{$mode delphi}
{$if defined(cpu386) or defined(cpuamd64)}
{$asmmode intel}