OpenGL Black Bible

Author: Dale Weiler

Preface

The following is a writeup of things I've independently discovered or been told over the years about how to utilize OpenGL effectively. I don't guarantee everything in here to be factually correct, though several others share similar ideas to those presented here. Take these at your own peril; like everything else, nothing is absolute, so profile to be sure and check the standards if something here seems incorrect. The things presented here I've come to accept as a safe subset of OpenGL that is supported practically everywhere, and a safe bet to use if you don't want to get caught up in the details.

Chapter 1. Things to avoid

Versions and Profiles

OpenGL has too many versions, most of which are garbage and should be avoided; a safe bet is to utilize the 3.2 Core Profile. Newer versions may be appealing, but unless you're planning on producing an AAA-quality title, all those new features can be ignored. There is something to be said about compute shaders, however those are still reasonably accessible through extensions. That being said, some other nice benefits of using 3.2 in particular are that it's mostly the basis of GL ES 3 and WebGL, so not much effort is needed to reach even more targets. Similarly, support for 3.x GL is fair if you check the usual suspects [Steam Hardware Survey] and [Wolfire's Support Matrix]. When requesting this version of OpenGL, the concept of "Core Profile" is pretty important, since Compatibility Profiles contain all the junk from previous versions of OpenGL. You would think this would be a good thing; however, and here is the first claim: Compatibility Profiles run faster on Nvidia. I've never confirmed this, but I've read of it in several places, including from former Nvidia driver developers. So it may be beneficial to request Core on everything but Nvidia.
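For reference, here is a minimal sketch of requesting a 3.2 Core context, assuming GLFW as the windowing library (any equivalent, such as SDL, works the same way):

```c
#include <GLFW/glfw3.h>

int main(void) {
    if (!glfwInit())
        return 1;
    /* Ask for exactly the 3.2 Core Profile discussed above. */
    glfwWindowHint(GLFW_CONTEXT_VERSION_MAJOR, 3);
    glfwWindowHint(GLFW_CONTEXT_VERSION_MINOR, 2);
    glfwWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_CORE_PROFILE);
    /* Required alongside Core on macOS to get anything newer than 2.1. */
    glfwWindowHint(GLFW_OPENGL_FORWARD_COMPAT, GLFW_TRUE);
    GLFWwindow *window = glfwCreateWindow(1280, 720, "gl32", NULL, NULL);
    if (!window) {
        glfwTerminate();
        return 1;
    }
    glfwMakeContextCurrent(window);
    /* ... load GL function pointers, render ... */
    glfwDestroyWindow(window);
    glfwTerminate();
    return 0;
}
```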

Geometry shaders

One of the important features of our selected version of OpenGL is that it gained support for geometry shaders. Geometry shaders are a neat way to generate geometry, and they sit in a part of the pipeline that permits this. However, unless you're on Intel, geometry shaders are going to be considerably slower than doing the equivalent work on the CPU with a streaming vertex buffer. The reason for this is that both Nvidia and AMD hardware require geometry shaders to make a round-trip through memory. This is required because GL mandates that the output of a geometry shader be rendered in input order. That does not map well to Nvidia or AMD hardware, where the fixed-function hardware that does the rendering must consume geometry shader outputs serially. That serial consumption creates a synchronization point, which for the parallel nature of GPUs is a bottleneck, so instead the shader needs to buffer its output, and there are very few places where you can safely buffer this data: either on-chip cache, which is limited in size, or off-chip DRAM. AMD uses on-chip cache, but has to do a lot of work to deal with wrapping, which means geometry-heavy shaders often cause huge stall points, whereas Nvidia uses off-chip DRAM, which has high latency that cannot be hidden. Intel does not have these problems because their threads, unlike AMD's and Nvidia's, have their own huge register file. There is something else to be said about the lack of support for geometry shaders in GLES and WebGL; by avoiding them, you make your code base that much more portable to different flavors of OpenGL too.

Tessellation shaders

The evaluation and control shaders for tessellation suffer from similar problems to geometry shaders. They're also incredibly specific in terms of what they're meant to be used for; the fact that they're a shader at all is a testament to what happens when you shoehorn in ways to achieve popular rendering techniques. On a personal level, these things complicate the pipeline and actually introduce a cost even when not used (in terms of their interaction with other things, which in turn equates to added validation cost).

Cubemaps

Unless you're only planning on supporting Nvidia hardware, cubemaps are going to give you a bad time. There is no longer any fixed-function hardware for cubemaps on modern GPUs; they are emulated in an unrolled fashion by the driver, using a packing strategy that is implementation-defined. Cubemaps are definitely useful when sampling in a shader for environment reflections, skybox rendering and a few other common things, but all of those things can still be done the manual way with cube geometry and individual faces. Lots of people utilize cubemaps when doing shadow mapping for omni-directional lighting. This is rather wrong, and it still surprises me that people continue to do this, since you need border pixels in the cubemap for filter taps, which is not possible without an extension. You're far better off doing this manually; at least then you can control the layout and exploit that layout for a marginal performance gain, as sketched below.
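As an illustration of the manual approach, here is a minimal sketch that renders the six faces of an omni-directional shadow map into a single 2D depth texture laid out as a 3x2 atlas; face_view_matrix() and draw_shadow_casters() are hypothetical helpers standing in for your own math and scene code.

```c
/* Render six 90-degree views of the scene into a 3x2 atlas of FACE_SIZE
 * tiles inside one depth-only FBO, instead of a cubemap. */
#define FACE_SIZE 512

void render_point_shadow_atlas(GLuint shadow_fbo, const float light_pos[3]) {
    glBindFramebuffer(GL_FRAMEBUFFER, shadow_fbo); /* 1536x1024 depth texture */
    glClear(GL_DEPTH_BUFFER_BIT);
    for (int face = 0; face < 6; face++) {
        /* Columns 0..2, rows 0..1 of the atlas; you control this layout,
         * so the sampling shader can exploit it. */
        glViewport((face % 3) * FACE_SIZE, (face / 3) * FACE_SIZE,
                   FACE_SIZE, FACE_SIZE);
        /* Hypothetical helpers: a per-face view matrix and a scene walk. */
        draw_shadow_casters(face_view_matrix(face, light_pos));
    }
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
}
```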

Uniform Buffer Objects

Uniform buffer objects are a misused feature in modern-day OpenGL. There is a way to use them correctly, but chances are you're never going to hit a realistic scenario where using them correctly will offer a performance advantage over the traditional uniform calls to update data. This is not surprising, because UBOs were never meant to replace uniform calls. They were created for two reasons: 1) for updating very large uniform data sets, specifically ones far larger than uniform calls can reasonably handle, and 2) for sharing the same uniform data with more than one shader program. This goes without saying, but I've heard stories that several vendors implement UBOs with a texture, which is far more costly to read (due to DRAM latency) than the register-resident values of typical uniforms.
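For completeness, a minimal sketch of the second legitimate use case, sharing one uniform block across multiple programs; the block name "Camera" and the two program handles are illustrative assumptions.

```c
/* Share one UBO across several programs via a common binding point. */
GLuint ubo;
glGenBuffers(1, &ubo);
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferData(GL_UNIFORM_BUFFER, sizeof(float) * 16 * 2, NULL, GL_DYNAMIC_DRAW);

/* Attach the buffer to binding point 0 once... */
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);

/* ...then point each program's "Camera" block at that binding point. */
GLuint programs[] = { program_a, program_b }; /* hypothetical handles */
for (int i = 0; i < 2; i++) {
    GLuint index = glGetUniformBlockIndex(programs[i], "Camera");
    if (index != GL_INVALID_INDEX)
        glUniformBlockBinding(programs[i], index, 0);
}
/* Updating the buffer once now updates "Camera" for both programs. */
```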

MapBuffer

There is no safe way to use glMapBuffer without the appropriate glUnmapBuffer call before a draw command, because sourcing a mapped buffer is undefined. For this reason, MapBuffer is practically useless, since it forces a synchronization point between client and server. Just avoid it at all costs. glMapBufferRange with the appropriate access flags to map unsynchronized is far faster. Never unmap it; keep it resident forever and deal with synchronization manually through the use of fence objects. The standard (and safe) approach is to have as many fence objects as you do mapped buffers (for double buffering) and just query completion state on the fence; don't actually wait. Fine-tune the buffer count and size for the workload. Or treat some range as "staging" and the other range as "source". There are lots of ways to misuse this, so be careful.
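A minimal sketch of the double-buffered, fence-queried pattern described above, assuming a two-region ring buffer allocated as REGIONS * region_size; note that truly keeping a buffer mapped across draws requires GL_ARB_buffer_storage, so this sketch remaps unsynchronized per update.

```c
/* Two-region ring buffer guarded by fences; we only ever *query* the fence. */
#define REGIONS 2

static GLsync fences[REGIONS];
static int    region;

static void *acquire_region(GLuint buffer, GLsizeiptr region_size) {
    region = (region + 1) % REGIONS;
    if (fences[region]) {
        /* Query-only wait: a timeout of zero just polls completion state. */
        GLenum state = glClientWaitSync(fences[region], 0, 0);
        if (state != GL_ALREADY_SIGNALED && state != GL_CONDITION_SATISFIED)
            return NULL; /* still in flight; caller should back off or grow */
        glDeleteSync(fences[region]);
        fences[region] = 0;
    }
    glBindBuffer(GL_ARRAY_BUFFER, buffer);
    return glMapBufferRange(GL_ARRAY_BUFFER, region * region_size, region_size,
                            GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
}

static void release_region(void) {
    glUnmapBuffer(GL_ARRAY_BUFFER);
    /* ... issue draws sourcing this region ... */
    fences[region] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}
```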

State changes

This is a pretty well-understood concept in OpenGL, however it's not exactly as simple as just avoiding state changes. Not all state changes are the same, and in nearly all cases state changes are deferred until much later, typically right up to the draw call itself. This is actually what people complain about when they say draw calls are expensive. The issuing of the draw call is practically costless; what is costly is all the state changes "queued" up to the call, and the validation on top to ensure the series of state changes even constitutes a valid state to begin issuing a draw command in. In either case, a good general rule is not only to avoid state changes, but to organize draw calls and information in such a way that you avoid having to make changes at all, for instance batching by material (as sketched below). This goes without saying, but this can only be taken so far, since draw call order does matter for things like alpha blending.
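A minimal sketch of the batching idea, assuming a hypothetical draw-record struct; sorting by a material key means state only changes at material boundaries.

```c
#include <stdlib.h>

/* Hypothetical draw record: everything needed to issue one draw. */
typedef struct {
    unsigned material; /* sort key: shader + textures + blend state, etc. */
    unsigned mesh;
} draw_t;

static int by_material(const void *a, const void *b) {
    const draw_t *x = a, *y = b;
    return (x->material > y->material) - (x->material < y->material);
}

static void submit(draw_t *draws, size_t count) {
    qsort(draws, count, sizeof *draws, by_material);
    unsigned bound = ~0u;
    for (size_t i = 0; i < count; i++) {
        if (draws[i].material != bound) {
            bind_material(draws[i].material); /* hypothetical: the only state change */
            bound = draws[i].material;
        }
        draw_mesh(draws[i].mesh); /* hypothetical draw submission */
    }
}
```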

One such example of a nasty state change to avoid is depth/stencil mask and test state. This one is particularly nasty because changing it often results in a shader recompilation. The most common place both of these (and often scissor too) are changed is when clearing the render target, because depth/stencil mask (as well as scissor) is respected when doing a glClear. The problem is that glClear is often implemented internally by the driver as a fullscreen quad being rendered, which has a shader itself. So when state is changed here, the shader used for the glClear operation itself gets recompiled. Instead, it's best to ensure your last few draw commands switch the state back to what is needed for the glClear to work.
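A minimal sketch of what "state glClear expects" usually means in practice; the exact set depends on which buffers you clear.

```c
/* Put write masks and scissor back before clearing, so the driver's
 * internal clear path doesn't see "unusual" state. */
glDisable(GL_SCISSOR_TEST);
glDepthMask(GL_TRUE);       /* depth writes must be on for the depth clear */
glStencilMask(0xFF);        /* likewise for stencil                        */
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);
```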

Chapter 2. Things to watch out for

Framebuffer objects

Framebuffer objects, the de facto way to do offscreen rendering to a texture, need to be treated very carefully in many ways. In particular, not all vendors get attachments for them correct. It's also very easy to accidentally misuse the attachments and have it silently work on one platform and fail on another. Never, under any circumstances, have multiple attachments with different formats. If you have one color attachment of GL_RGBA8, then all other attachments should be GL_RGBA8. You're still allowed to have depth and stencil attachments; however, it's good practice to avoid individual depth and stencil attachments in favor of combined depth-stencil formats, which is what the hardware uses. The general rule is: if you plan on clearing depth and stencil individually, for whichever reason, then do not utilize a combined depth-stencil format; if, however, you're planning on clearing them at the same time, then always use a combined depth-stencil format. Not following this simple rule will result in some really annoying stalls on a variety of hardware, especially tiled-deferred hardware.
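A minimal sketch following both rules above: matching color attachment formats, and a combined depth-stencil attachment (GL_DEPTH24_STENCIL8). The 1280x720 size is an assumption, and a current 3.2 context is assumed.

```c
GLuint fbo, color[2], depth_stencil;
glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);

/* All color attachments share one format (GL_RGBA8 here). */
glGenTextures(2, color);
for (int i = 0; i < 2; i++) {
    glBindTexture(GL_TEXTURE_2D, color[i]);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 1280, 720, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, NULL);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0 + i,
                           GL_TEXTURE_2D, color[i], 0);
}

/* One combined depth-stencil attachment instead of two separate ones. */
glGenRenderbuffers(1, &depth_stencil);
glBindRenderbuffer(GL_RENDERBUFFER, depth_stencil);
glRenderbufferStorage(GL_RENDERBUFFER, GL_DEPTH24_STENCIL8, 1280, 720);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_STENCIL_ATTACHMENT,
                          GL_RENDERBUFFER, depth_stencil);

if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE) {
    /* handle incomplete framebuffer */
}
```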

Speaking of tiled-deferred hardware: it's never a good idea to render to an FBO and then immediately source the attachment textures in a draw call, as that will always result in a synchronization point due to the nature of how tiled-deferred rendering works. Instead, utilizing more FBOs so the results can be sourced much later is more beneficial, though it can chew up video memory quickly; a good example would be shadow mapping. Atlasing is going to be a win for tiled-deferred / mobile here, but will likely be slower on desktop. So you may want different rendering paths if you're looking to achieve the best performance on both types of hardware.

Read backs

It's never safe to read back depth or stencil. The standard does permit you to do this, but Intel in particular is notoriously bad here, and you'll pretty much always get inconsistent results across Intel driver versions and hardware. If you need to read back depth or stencil, sampling it in a shader and writing it out to a color attachment via a fullscreen quad or triangle may be the better approach; plus, it's a nice place to do linearization of depth too.

If you're streaming read backs for things like light injection passes for global illumination, radiosity, or just recording frames for video, always use pixel buffer objects; never use glReadPixels directly into client memory. PBOs allow you to do a non-stalling, asynchronous read back in a safe and consistent manner. They're also good for doing a non-hitching screenshot, which can be nice if someone accidentally hits the screenshot keybind in your game and it isn't this jarring experience.
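A minimal sketch of the PBO pattern, assuming a 1280x720 RGBA framebuffer: with a pack buffer bound, glReadPixels returns immediately, and the map a frame (or more) later is where the data actually lands. In real code you'd round-robin two or more PBOs so the map never touches an in-flight transfer.

```c
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, 1280 * 720 * 4, NULL, GL_STREAM_READ);

/* Frame N: kick off the read back. With a pack buffer bound, the last
 * argument is an offset into that buffer, not a client pointer. */
glReadPixels(0, 0, 1280, 720, GL_RGBA, GL_UNSIGNED_BYTE, (void *)0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

/* Frame N+1 (or later): fetch the result without stalling frame N. */
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
const void *pixels = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0,
                                      1280 * 720 * 4, GL_MAP_READ_BIT);
if (pixels) {
    /* ... copy/encode the frame ... */
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
```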

Program binary

One of the extensions supported by our choice of GL is the program binary extension. It's very neat in that it lets you serialize program state into a binary representation which you can reload and reuse to avoid the cost of compiling shaders on subsequent runs. The problem is that the output isn't compatible with anything but the machine that produced it, and sometimes driver version changes can even break the produced binaries. This is easy enough to protect against and is expected by people who know the extension. What is less known is that even if a specific hardware and software configuration claims to support the extension, that doesn't actually mean it supports it: you have to query the number of "program binary formats" that are supported, and if that is zero, then program binaries are not supported. As of writing, the only configuration that does this is Intel with Mesa, but it's still something to watch out for.
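A minimal sketch of that extra check (via GL_ARB_get_program_binary, the extension form usable on 3.2); `program` is assumed to be a linked program handle.

```c
GLint formats = 0;
glGetIntegerv(GL_NUM_PROGRAM_BINARY_FORMATS, &formats);
if (formats == 0) {
    /* Extension is advertised but unusable: fall back to compiling
     * shaders from source every run. */
} else {
    GLint length = 0;
    glGetProgramiv(program, GL_PROGRAM_BINARY_LENGTH, &length);
    void *binary = malloc(length);
    GLenum format;
    glGetProgramBinary(program, length, NULL, &format, binary);
    /* Cache `binary` and `format` together; reload later with
     * glProgramBinary(program, format, binary, length). */
}
```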

Samplers

Sampler objects are fine to use; I encourage them, since they map more appropriately to other graphics APIs. That's not what this section is about, though: it's about specifying sampler slots for a shader with glUniform*. For some reason or another, Intel continues to only support specifying a sampler slot with glUniform1i; if you try to use glUniform1iv, which I have in the past because I was feeling weird one day, it won't work.
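In other words (assuming a current context and a sampler uniform named "u_diffuse", which is an illustrative name):

```c
GLint loc = glGetUniformLocation(program, "u_diffuse");

glUniform1i(loc, 0);          /* portable: works everywhere */

GLint unit = 0;
glUniform1iv(loc, 1, &unit);  /* equivalent per the spec, but
                               * reportedly broken on Intel   */
```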

Chapter 3. Streaming

General rules about streaming

If you're streaming any type of buffer data, the only two usage hints you care about are GL_DYNAMIC_DRAW and GL_STREAM_DRAW. The general rule is: if you're creating some data and want to render it only once, as soon as you're finished uploading it, then you use GL_STREAM_DRAW. If you're going to be reusing the same buffer but changing its data a lot, then you use GL_DYNAMIC_DRAW. Never, under any circumstances, use GL_STATIC_DRAW for streaming contents.

When streaming data, it may be beneficial to double buffer or triple buffer your contents, that is, prepare the data for the next frame or the frame after the next. When doing things this way, orphan the buffer just sourced to inform the driver of this behavior; that is, call glBufferData with null data and the same size, which lets the driver hand you fresh storage while draws sourcing the old storage are still in flight. Do not use glBufferSubData to orphan. This goes without saying, but the same is also true for textures.
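A minimal sketch of the orphan-then-fill pattern; `stream_vbo`, `vertices`, and `vertex_bytes` are assumed names for your streaming buffer and the data to upload.

```c
#define STREAM_SIZE (1 << 20) /* illustrative: 1 MiB of streaming space */

glBindBuffer(GL_ARRAY_BUFFER, stream_vbo);

/* Orphan: same size, NULL data. The old storage lives on until pending
 * draws finish; we immediately get a fresh block to fill. */
glBufferData(GL_ARRAY_BUFFER, STREAM_SIZE, NULL, GL_STREAM_DRAW);

/* Now fill the fresh storage. */
glBufferSubData(GL_ARRAY_BUFFER, 0, vertex_bytes, vertices);
```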

Through various personal experimentation, I've concluded there is no guaranteed best way to do partial buffer or texture updates that is performant everywhere. What I've discovered instead is that for specifying a complete replacement of an existing buffer, glBufferData wins on Nvidia, whereas glBufferSubData wins on AMD and Intel. What is more concerning is that for partial updates, sometimes glBufferData is also faster, depending on how large the partial update is.

What continues to be true regardless of vendor is that if you're packing a lot of data inside a vertex buffer sourced by several different draws, it's best to always keep your vertices aligned on a natural 16-byte boundary; this is even true on mobile. This is not much of a concern for GL_STATIC_DRAW, though, where it appears the implementation does its own optimization on the initially specified data anyway. Keep this in mind for streaming.
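As an illustration, a hypothetical streamed vertex padded so each vertex (and thus each draw's base offset) lands on a 16-byte boundary:

```c
/* 32 bytes per vertex: a multiple of 16, so any vertex offset keeps
 * attribute data 16-byte aligned within the streamed buffer. */
typedef struct {
    float position[3];     /* 12 bytes */
    float u, v;            /*  8 bytes */
    unsigned char rgba[4]; /*  4 bytes */
    unsigned char pad[8];  /* explicit padding up to 32 */
} vertex_t;

_Static_assert(sizeof(vertex_t) % 16 == 0, "keep vertices 16-byte aligned");
```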
