This article is a distillation of months of reading about the PS1 GPU and its inner workings. Writing a PS1 emulator is no mean feat, so I would like to help anyone who finds that there is not enough info online about the GPU. I tried to keep things as simple as possible so that the article is understandable by anyone with some basic familiarity with how GPUs work. There might be mistakes in my thinking throughout, so any corrections are welcome.

Introduction
The PS1 GPU is arguably one of the most complex parts of the whole machine. It does not have any shader support; instead it has specialized functions for drawing specific shapes (see the NOCASH PSX docs for more info on those). The simple ones such as "Draw monochrome quad" are easy to understand and implement. But what about the textured ones? I had no clue how these worked. There is some information online, but I could not wrap my head around why there are things called "Texture Pages" and "CLUT tables", or how texels get fetched from VRAM. So I am writing this article for anyone who is in the same position as me, to fill that information gap.

Texture Pages - What are those?
UPDATE: It is recommended to also check out @dgrips comment https://www.reddit.com/r/EmuDev/comments/fmhtcn/article_the_ps1_gpu_texture_pipeline_and_how_to/fs1x8n4?utm_medium=android_app&utm_source=share which has some additional information I failed to include in this article and also some corrections.
The GPU has its own dedicated VRAM to which the CPU has no direct access. It is 1MB in size and is usually represented in memory as an array of 16bit values:
uint16_t vram[1024 * 512];
We will get to why shortly. The VRAM is divided into chunks called texture pages. Each one is always 256 * 256 pixels, but their size in VRAM depends on their colour depth. As their name suggests, these blocks contain the textures that the game is currently using. But if the console has a way to keep track of where each texture resides, why is there any need to put textures in texture pages? Well, if you know how OpenGL works, in order to sample a texture we need a texture coordinate to sample a texel from. The PS1 GPU also uses texture coordinates to sample textures, but they are not expressed in the 0-1 range, as the PS1 does not have any floating-point capabilities. They are expressed with integers instead. Looking at the NOCASH docs we can see the PS1 has a specific format in which it stores texture coordinates:
Vertex1 (YyyyXxxxh)
Texcoord1+Palette (ClutYyXxh)
Vertex2 (YyyyXxxxh)
Texcoord2+Texpage (PageYyXxh)
Vertex3 (YyyyXxxxh)
Texcoord3 (0000YyXxh)
Each parameter is a 32bit unsigned integer. So the YyXx part of the texture coordinates we are interested in is 16bits long. Splitting that in two to get the x and y coordinates leaves us with only 8bits for each. As we all know, an 8bit number can reach up to 255 (texture coordinates are unsigned). But this is a problem, as our VRAM is bigger than that. How did the designers combat this? Because texture pages are always 256 * 256 pixels, the texture coordinates can simply be made relative to the start of the texture page. So a coordinate of (0, 0) does not necessarily mean that we should fetch the first pixel in VRAM. You can read more in this document: http://hitmen.c02.at/files/docs/psx/psx.pdf (just search for the "Texture Page" section).
PS1 VRAM divided into texture pages
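Concretely, the page and CLUT positions can be recovered from the command words. This is my own sketch following the bit layout described in the psx-spx docs (an illustration, not a full command decoder):

```cpp
#include <cstdint>

struct PageAttr { int x, y, depth; };
struct ClutAttr { int x, y; };

/* The low halfword of a texcoord parameter holds the coordinates. */
int texcoord_x(uint32_t param) { return param & 0xFF; }
int texcoord_y(uint32_t param) { return (param >> 8) & 0xFF; }

/* The "Page" attribute lives in bits 16-31 of the Texcoord2 parameter:
   bits 0-3 give the page x base in steps of 64 halfwords, bit 4 the
   y base in steps of 256 lines, bits 7-8 the colour depth mode. */
PageAttr decode_texpage(uint32_t param) {
    uint32_t attr = param >> 16;
    PageAttr p;
    p.x = (attr & 0xF) * 64;
    p.y = ((attr >> 4) & 0x1) * 256;
    p.depth = (attr >> 7) & 0x3; /* 0 = 4bit, 1 = 8bit, 2 = 15bit */
    return p;
}

/* The "Clut" attribute lives in bits 16-31 of the Texcoord1 parameter:
   bits 0-5 give x in steps of 16 halfwords, bits 6-14 give y. */
ClutAttr decode_clut(uint32_t param) {
    uint32_t attr = param >> 16;
    ClutAttr c;
    c.x = (attr & 0x3F) * 16;
    c.y = (attr >> 6) & 0x1FF;
    return c;
}
```

The struct and function names here are mine; only the bit positions come from the docs.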
So let's say we want to fetch a texel from the VRAM pictured above. First we have to figure out the desired texture page. The texture page is given to us by the command itself: if you look closely, the 4th parameter above carries the "Page" attribute in its upper 16bits. Looking at the docs it is trivial to extract the x, y coordinates of the page (if you are stuck, look here: https://problemkaputt.de/psx-spx.htm#gpurenderingattributes at the Texpage attribute, specifically the page x base and page y base bits). After that, we just have to add the texture coordinate and fetch the pixel pointed to by the final value. Right? It is actually not so simple. Before we fetch the texel we have to take something else into account: the colour depth mentioned earlier.

Colour depths and CLUT tables - Why bother?
Textures on the PS1 can be stored in 3 different formats: a 4bit mode, an 8bit mode and a 16bit direct mode. The mode determines how many bits each pixel takes up in VRAM. When a texture is stored in 4bit mode each pixel is 4bits long, so each 16bit VRAM value has 4 pixels packed together. Similarly, in 8bit mode each pixel takes up 8bits, so each 16bit VRAM value has 2 pixels packed together. Finally, the 16bit mode is the simplest, as each 16bit VRAM value represents a single pixel. This is why the VRAM is usually expressed as an array of 16bit values instead of bytes: it is more convenient when you have to fetch a 16bit texel.

Computer architecture tells us that when a pixel is represented by n bits, it can express up to 2^n colours. This means that a 4bit pixel can express up to 2^4 = 16 colours and an 8bit pixel up to 2^8 = 256 colours. It was mentioned before that a texture page's size in VRAM depends on the colour depth. If the colour depth is 4bit, then each 16bit VRAM value contains 4 pixels, so the texture page is 256 / 4 = 64 VRAM values wide. The same goes for the other modes: 256 / 2 = 128 for 8bit and 256 / 1 = 256 for 16bit.

The 4/8bit modes utilize something called a CLUT (Colour LookUp Table). This is essentially a palette of colours and it is also stored in VRAM (if you look closely it is located at number 3 in the picture above). There can be multiple CLUT tables in VRAM. Their size depends on the colour depth mode they are going to be used with: a CLUT used in 4bit mode has 2^4 = 16 entries, and one used in 8bit mode has 2^8 = 256 entries. Why is that? Remember the CLUT is a palette, so it contains all the possible colours that can be referenced in the colour depth mode it is used in.
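To make the size arithmetic concrete, here is a tiny helper (my own illustration, not part of any real emulator) that computes how wide a 256-pixel texture page is, measured in 16bit VRAM values:

```cpp
/* Width of a 256-pixel-wide texture page in 16bit VRAM values.
   4bit mode packs 4 pixels per value, 8bit packs 2, 16bit packs 1. */
int page_width_in_vram(int bpp) {
    int pixels_per_value = 16 / bpp; /* 4, 2 or 1 */
    return 256 / pixels_per_value;   /* 64, 128 or 256 */
}
```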
If you are familiar with Gameboy emulation, this is similar to how the Gameboy uses palettes to get the final pixel colour to display on the screen. When a 4/8bit pixel is fetched, it is used as an index into the CLUT table, which gives the resulting colour back. This method of texturing is called paletted textures. Only the 16bit mode does not use a CLUT, because a table would need 2^16 entries * 2 bytes = 2^17 bytes of VRAM, which is a lot of memory. Now that these details are known, in order to understand the texel fetching process let's take a look at a piece of code from a PS1 emulator called Project-PSX which sums up this process pretty well (I changed it a bit for simplicity's sake, without removing any functionality):
int get_texel_4bit(int x, int y, Point2D clut, Point2D page) {
    ushort texel = VRAM.Read(page.x + x / 4, page.y + y);
    int index = (texel >> (x % 4) * 4) & 0xF;
    return VRAM.Read(clut.x + index, clut.y);
}
This function might look daunting at first, but in reality it is very simple. It takes as input the x, y texture coordinates, the CLUT location in VRAM and the texture page location in VRAM. The first line fetches the texel with the process detailed in the texture page section. But why does it divide x by 4? Because x is a texture coordinate, it is basically an offset into the texture page in pixels, and a pixel should not be confused with a VRAM value. Let's say we want to fetch a 4bit texel that is 2 pixels after the start of the texture page (which means x = 2). Because we are in 4bit mode, a VRAM value is laid out like this:
| start
|             one VRAM value               |
| pixel 0 | pixel 1 | pixel 2 | pixel 3 |
| 0 0 0 0 | 0 0 0 0 | 0 0 0 0 | 0 0 0 0 |
It becomes apparent that we are interested in pixel 2, as it is +2 pixels after the start of the page. If just the value x was added, the result would be the VRAM value at offset +2 from the start of the texture page. If you look closely, x / 4 is integer division. This is important because it is not possible to fetch a pixel directly; we first have to find the VRAM value it is located in and then extract it from there. For x -> 0, 1, 2, 3 the division returns 0, which is the correct VRAM value. For x -> 4, 5, 6, 7 the division returns 1, which is the next VRAM value, and so on. Now that the correct VRAM value is found, we have to extract the correct bit range of the pixel. This is done with the expression x % 4, which returns 0, 1, 2 or 3 and tells us which pixel we need to extract. The resulting value is then used as an index into the CLUT table to get the final colour of the pixel. From that CLUT value it is possible to extract the individual RGB components with this formula (which can also be used with the 16bit depth mode):
red = (pixel_val << 3) & 0xf8;
green = (pixel_val >> 2) & 0xf8;
blue = (pixel_val >> 7) & 0xf8;
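Put together as a self-contained helper (the struct name is mine), this is how a 16bit VRAM colour expands to 8bit-per-channel RGB:

```cpp
#include <cstdint>

struct RGB { uint8_t r, g, b; };

/* Convert a 16bit VRAM colour (5 bits per channel: red in bits 0-4,
   green in bits 5-9, blue in bits 10-14) to 8 bits per channel. Each
   5bit value is shifted into the top bits of a byte, so results
   range from 0 to 248. */
RGB to_rgb888(uint16_t pixel_val) {
    RGB c;
    c.r = (pixel_val << 3) & 0xf8;
    c.g = (pixel_val >> 2) & 0xf8;
    c.b = (pixel_val >> 7) & 0xf8;
    return c;
}
```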
And we are done! This process can be repeated for all colour depths, with the exception of the 16bit depth which is much simpler as it does not use a CLUT table. Below are the other two functions:
int get_texel_8bit(int x, int y, Point2D clut, Point2D page) {
    ushort texel = VRAM.Read(x / 2 + page.x, y + page.y);
    int index = (texel >> (x % 2) * 8) & 0xFF;
    return VRAM.Read(clut.x + index, clut.y);
}

int get_texel_16bit(int x, int y, Point2D page) {
    return VRAM.Read(x + page.x, y + page.y);
}
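To double check the shift arithmetic used by these functions, here is a tiny standalone sketch of the 4bit pixel extraction, using a made-up packed value:

```cpp
#include <cstdint>

/* Extract the n-th 4bit pixel from a packed 16bit VRAM value.
   Pixel 0 sits in the lowest 4 bits, pixel 3 in the highest. */
int extract_4bit_pixel(uint16_t value, int x) {
    return (value >> ((x % 4) * 4)) & 0xF;
}
```

With the made-up value 0x4321, pixel 0 is 1 and pixel 2 is 3, matching the layout diagram from earlier.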
Bonus: Possible OpenGL implementation
You probably noticed the "and how to emulate it" part of this article's title. So I am going to present a possible implementation using OpenGL. This method is my own and is not guaranteed to be correct, but it has worked pretty well in my testing so far. The process can easily be implemented with a software rasterizer, and there are plenty of those on Github. Doing the same thing with a hardware renderer is much more difficult. The problem is that paletted textures require copying the texels into a temporary buffer, looping over all of them to get the colour value from the CLUT table (if they are in 4/8bit mode), and then sending them to the GPU as a texture. This can happen many times per frame, every frame, so it ends up way slower than a software renderer. There are complex texture management systems which avoid all this copying (notably Pete's OGL plugin: https://github.com/iCatButler/pcsxr/blob/62467b86871aee3d70c7445f3cb79f0858ec566e/plugins/peopsxgl/texture.c#L35), but I think there is an easier way. My idea is to make the VRAM a texture, find the texture coordinates of the texture we want to sample in the OpenGL 0-1 format, and then apply the CLUT lookup in the fragment shader. To retain fast performance, using a PBO for texture uploads is recommended if available. In order to properly handle the multiple texture depths we will keep 3 VRAM textures in memory: one with 4bit pixels, one with 8bit pixels and the standard one with 16bit pixels. First, it is a good idea to create a VRAM class which will manage the reads/writes to the buffers. Note that this method only applies when not using draw call batching; if you want to batch draw calls, the data must be sent as a vertex attribute!
/* ----------------------------- vram.h ------------------------------ */
class VRAM {
public:
    /* Creates PBOs for all VRAM textures. */
    void init();
    /* Transfers the pixel data from the CPU to the GPU. */
    void upload_to_gpu();

    uint16_t read(uint32_t x, uint32_t y);
    void write(uint32_t x, uint32_t y, uint16_t data);

public:
    uint32_t pbo4, pbo8, pbo16;
    /* The pixel arrays mapped to the PBOs. */
    uint8_t* ptr4;
    uint8_t* ptr8;
    uint16_t* ptr16;
    /* The OpenGL textures. */
    uint32_t texture4, texture8, texture16;
};
/* ----------------------------- vram.cpp ------------------------------ */
void VRAM::init()
{
    uint32_t buffer_mode = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT;

    /* 16bit VRAM pixel buffer (1024 * 512 halfwords = 1MB). */
    glGenBuffers(1, &pbo16);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo16);
    glBufferStorage(GL_PIXEL_UNPACK_BUFFER, 1024 * 512 * 2, nullptr, buffer_mode);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

    glGenTextures(1, &texture16);
    glBindTexture(GL_TEXTURE_2D, texture16);
    /* Set the texture wrapping parameters. */
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
    /* Set texture filtering parameters. */
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    /* Allocate space on the GPU; 16bit pixels need GL_UNSIGNED_SHORT. */
    glTexImage2D(GL_TEXTURE_2D, 0, GL_R16, 1024, 512, 0, GL_RED, GL_UNSIGNED_SHORT, nullptr);

    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo16);
    ptr16 = (uint16_t*)glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, 1024 * 512 * 2, buffer_mode);
    /* 4bit VRAM pixel buffer (4 pixels per halfword -> 4096 * 512 bytes). */
    glGenBuffers(1, &pbo4);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo4);
    glBufferStorage(GL_PIXEL_UNPACK_BUFFER, 1024 * 512 * 4, nullptr, buffer_mode);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

    glGenTextures(1, &texture4);
    glBindTexture(GL_TEXTURE_2D, texture4);
    /* Set the texture wrapping parameters. */
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
    /* Set texture filtering parameters. */
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    /* Allocate space on the GPU. */
    glTexImage2D(GL_TEXTURE_2D, 0, GL_R8, 4096, 512, 0, GL_RED, GL_UNSIGNED_BYTE, nullptr);

    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo4);
    ptr4 = (uint8_t*)glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, 1024 * 512 * 4, buffer_mode);
    /* 8bit VRAM pixel buffer (2 pixels per halfword -> 2048 * 512 bytes). */
    glGenBuffers(1, &pbo8);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo8);
    glBufferStorage(GL_PIXEL_UNPACK_BUFFER, 1024 * 512 * 2, nullptr, buffer_mode);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

    glGenTextures(1, &texture8);
    glBindTexture(GL_TEXTURE_2D, texture8);
    /* Set the texture wrapping parameters. */
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
    /* Set texture filtering parameters. */
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    /* Allocate space on the GPU. */
    glTexImage2D(GL_TEXTURE_2D, 0, GL_R8, 2048, 512, 0, GL_RED, GL_UNSIGNED_BYTE, nullptr);

    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo8);
    ptr8 = (uint8_t*)glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, 1024 * 512 * 2, buffer_mode);
}
void VRAM::upload_to_gpu()
{
    /* Upload 16bit texture (note the GL_UNSIGNED_SHORT data type). */
    glBindTexture(GL_TEXTURE_2D, texture16);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo16);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 1024, 512, GL_RED, GL_UNSIGNED_SHORT, 0);
    /* Upload 4bit texture. */
    glBindTexture(GL_TEXTURE_2D, texture4);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo4);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 4096, 512, GL_RED, GL_UNSIGNED_BYTE, 0);
    /* Upload 8bit texture. */
    glBindTexture(GL_TEXTURE_2D, texture8);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo8);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 2048, 512, GL_RED, GL_UNSIGNED_BYTE, 0);
}
uint16_t VRAM::read(uint32_t x, uint32_t y)
{
    int index = (y * 1024) + x;
    return ptr16[index];
}

void VRAM::write(uint32_t x, uint32_t y, uint16_t data)
{
    int index = (y * 1024) + x;
    /* Write data as one 16bit pixel. */
    ptr16[index] = data;
    /* Write data as two 8bit pixels. */
    ptr8[index * 2 + 0] = (uint8_t)data;
    ptr8[index * 2 + 1] = (uint8_t)(data >> 8);
    /* Write data as four 4bit pixels. */
    ptr4[index * 4 + 0] = (uint8_t)(data & 0xf);
    ptr4[index * 4 + 1] = (uint8_t)((data >> 4) & 0xf);
    ptr4[index * 4 + 2] = (uint8_t)((data >> 8) & 0xf);
    ptr4[index * 4 + 3] = (uint8_t)((data >> 12) & 0xf);
}
On every VRAM write we write to all 3 buffers, splitting the value into the required ranges for each. Notice that we set up the textures with GL_RED and GL_R8/GL_R16. This is done because each value represents a single pixel, and formats like GL_RGB group values together, which is something we do not want. So each pixel value gets stored in the red channel. Because OpenGL normalizes all colour values to the 0-1 range, we have to scale the sampled value back up before using it (by 255 for the 4/8bit textures, or 65535 for the 16bit one). Also note that the upload_to_gpu function should be called when a VRAM transfer completes, as well as in the Vblank update function of the emulator, in order to keep the textures up to date.
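The splitting done in write can be isolated into a pure helper (a standalone sketch of mine, not tied to the GL objects above) to show exactly which values land in each buffer:

```cpp
#include <cstdint>

/* Split one 16bit VRAM value the way VRAM::write does: two bytes
   for the 8bit buffer and four nibbles for the 4bit buffer. */
struct SplitPixel {
    uint8_t bytes[2];   /* 8bit mode pixels */
    uint8_t nibbles[4]; /* 4bit mode pixels */
};

SplitPixel split_vram_value(uint16_t data) {
    SplitPixel s;
    s.bytes[0] = (uint8_t)data;
    s.bytes[1] = (uint8_t)(data >> 8);
    for (int i = 0; i < 4; i++)
        s.nibbles[i] = (data >> (i * 4)) & 0xf;
    return s;
}
```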
Now, in order to sample texels we have to calculate the texture coordinates to sample from. This can be done with the following function:
glm::vec2 calc_tex_coords(int tx, int ty, int x, int y, int bpp)
{
    double r = 16 / bpp; /* Pixels packed per 16bit VRAM value: 4, 2 or 1. */
    double xc = (tx * r + x) / (1024.0 * r);
    double yc = (ty + y) / 512.0;
    return glm::vec2(xc, yc);
}
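As a quick sanity check of calc_tex_coords, here is a self-contained version using plain doubles instead of glm (the name calc_uv is mine), along with a couple of example values:

```cpp
#include <utility>

/* Same mapping as calc_tex_coords, with plain doubles instead of glm.
   tx, ty: page location in 16bit VRAM units; x, y: pixel offset into
   the page; bpp: colour depth (4, 8 or 16). */
std::pair<double, double> calc_uv(int tx, int ty, int x, int y, int bpp) {
    double r = 16.0 / bpp; /* pixels packed per 16bit VRAM value */
    double u = (tx * r + x) / (1024.0 * r);
    double v = (ty + y) / 512.0;
    return { u, v };
}
```

For instance, the 16bit pixel at VRAM column 512 maps to u = 0.5, and pixel 0 of a 4bit page based at halfword 64 maps to u = 256 / 4096 = 0.0625.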
Here (tx, ty) is the texture page location in VRAM, (x, y) are the texture coordinates (the offset into the texture page) and bpp is the colour depth of the texture, which can be 4, 8 or 16. Then a vertex buffer can be built with the vertices of the primitive we want to draw. The vertex shader is also very simple:
#version 430 core
layout (location = 0) in vec2 vpos;
layout (location = 1) in vec2 tex_coord;

out vec2 tex_coords;
uniform ivec2 offset = ivec2(0);

void main()
{
    /* Add the draw offset. */
    vec2 pos = vpos + vec2(offset);
    /* Transform x from 0-640 to the -1..1 NDC range. */
    float posx = pos.x / 640 * 2 - 1;
    /* Transform y from 0-480 to the -1..1 NDC range (flipped). */
    float posy = pos.y / 480 * (-2) + 1;
    /* Emit vertex. */
    gl_Position = vec4(posx, posy, 0.0, 1.0);
    tex_coords = tex_coord;
}
The fragment shader is where the magic happens. Before drawing, the colour depth of the texture and the CLUT contents are uploaded as uniforms (the CLUT as a uniform array of int). The shader then performs the lookup and outputs the final texel.
#version 430 core
in vec2 tex_coords;
out vec4 frag_color;

uniform int texture_depth;
/* Used for paletted texture lookup. */
uniform sampler2D texture_sample4;
uniform sampler2D texture_sample8;
uniform sampler2D texture_sample16;
uniform int clut4[16];
uniform int clut8[256];

vec4 split_colors(int data)
{
    vec4 color;
    color.r = float((data << 3) & 0xf8);
    color.g = float((data >> 2) & 0xf8);
    color.b = float((data >> 7) & 0xf8);
    color.a = 255.0f;
    return color;
}

vec4 sample_texel()
{
    if (texture_depth == 4) {
        vec4 index = texture(texture_sample4, tex_coords);
        int texel = clut4[int(index.r * 255.0)];
        return split_colors(texel) / vec4(255.0f);
    }
    else if (texture_depth == 8) {
        vec4 index = texture(texture_sample8, tex_coords);
        int texel = clut8[int(index.r * 255.0)];
        return split_colors(texel) / vec4(255.0f);
    }
    else {
        /* GL_R16 is normalized, so scale by 65535 to recover the value. */
        int texel = int(texture(texture_sample16, tex_coords).r * 65535.0);
        return split_colors(texel) / vec4(255.0f);
    }
}

void main()
{
    frag_color = sample_texel();
}
If you implement it right you should get something like this:
This rendering method tested in 4bit mode.
Please note that I have only tested this method in 4bit mode, but since it works there I assume it will work for the other colour depths too. We now have an efficient way of rendering textured geometry, with all the benefits modern shader-based OpenGL provides. I emphasize again that there may be mistakes in my thinking or secret details I do not know about. There are also topics I have not covered, such as semi-transparency, pixel masking and draw call batching. Thank you for reading this article.