Last active
April 5, 2018 02:05
-
-
Save drhelius/3730564 to your computer and use it in GitHub Desktop.
Gameboy's video hardware
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Nitty Gritty Gameboy Cycle Timing | |
--------------------------------- | |
A document about the down and dirty timing of the Gameboy's video hardware. | |
Written by: Kevin Horton | |
Version: 0.01 (preliminary) | |
My findings here are based on the original DMG, Super Gameboy, and GB | |
Pocket. All three appear to behave identically during testing, and the | |
SGB was used for all the reverse engineering. | |
An HP54645D mixed signal oscilloscope/logic analyzer was connected to the | |
SGB using a 16 wire pod. A 20 pin ribbon cable with IDC plug on the end | |
was soldered to various points on the SGB, and a pin header was inserted | |
into the IDC plug so that the pod connectors could be plugged in to | |
monitor the goings-on of the hardware. | |
--- | |
I have discovered some interesting things about how the Gameboy fetches | |
VRAM data in general. First, it will actually stop clocking the LCD and | |
stall it if it needs to fetch something and is not ready to send the | |
data out quite yet. | |
Secondly, the window function is a restarting of the data fetching | |
state machine, which is used to read the background tiles. The window | |
is triggered N clocks after the start of rendering, where N is determined | |
by the value in the xwindow register. | |
So, without further delay... | |
Scanline timing | |
--------------- | |
During the discussion of scanline timing, I will be ignoring Y timing | |
totally, since Y timing is unrelated to VRAM access patterns. The Y | |
timing only affects WHICH KIND of VRAM access occurs, and does not | |
affect it in any other way. | |
There are a couple cases that will be discussed, from simplest to most | |
complex. | |
* * * * * | |
The first case is what I call the degenerate case: xwindow is set to 0ffh | |
which disables the window totally, and then xscroll is adjusted. | |
There are only 8 different possible cases. | |
These types of access take from 173.5 to 180.5 cycles. The reasoning for | |
the half cycle will be described later. | |
The access pattern looks like this: | |
B01 - (6 cycles) fetch Background nametable byte, then 2 tile planes | |
B01s - (167.5 + (xscroll % 7) cycles) fetch another tile and sprite | |
window. | |
Where: | |
B = reading the background tile # (i.e. out of 1800h from the first | |
nametable) | |
0,1 = where the tile graphics are fetched. bitplanes 0 and 1. | |
s = sprite window. the sprite hardware will insert reads here if needed. | |
Each access to VRAM (B, 0, 1, s) takes 2 cycles to occur. A "cycle" is | |
exactly 1 period of the main input clock to the gameboy CPU chip. This | |
is nominally 4.19MHz approximately. | |
The last four accesses (B01s) is repeated until the proper number of | |
cycles has elapsed. | |
In the xscroll = 0 case, it will run for 167.5 cycles. This has an | |
interesting side effect- any tile access that is not complete just | |
gets unceremoniously cut off. This means that there will be 20 complete | |
tile accesses (B01s) and then 7.5 clocks worth of a 21st access, | |
cutting off the last half cycle of the sprite window. | |
In the xscroll = 2 case, it will run for 169.5 cycles. Similar to above, | |
this will result in 21 complete B01s accesses, and then 1.5 cycles of the | |
B fetch on a 22nd access. | |
This pattern repeats until xscroll = 7 (taking a total of 180.5 cycles) | |
until snapping back to 173.5 cycles when xscroll = 8. | |
The total number of cycles taken is (173.5 + (xscroll % 7)). | |
Now, you have been wondering what this extra 1/2 cycle business is about. | |
Well, it has to do with how the display clock is generated. The display | |
clock is generated via inverting the main input clock. I suspect it was | |
done so that the video hardware can get the data to the LCD ready on the | |
falling edge of the main clock. The display clocks data in on the RISING | |
edge of the display clock, thus necessitating the inverted display clock | |
relative to the main clock. | |
That causes the vram access pattern to be extended 1/2 clock on the end | |
to accommodate the inverted clock. | |
* * * * * | |
So, that takes care of the easy case. Now for what happens when the | |
window is used. | |
NOTE: When xwindow = 00h or xwindow = 0a6h, different things happen. | |
I will explain them later on. For now, the following information | |
holds when ((xwindow > 00h) && (xwindow < 0a6h)). | |
Interestingly, adding the window does not change a whole lot- in fact, | |
it simply restarts the whole fetch sequence over again, no matter where | |
it was! The timing is generated like so: | |
B01 (6 cycles) fetch background tile nametable+bitplanes | |
B01_ (1 to 172 cycles) ((xscroll % 7) + xwindow + 1) | |
W01 (6 cycles) fetch window tile nametable+bitplanes | |
W01_ (1.5 to 166.5 cycles) (166.5 - xwindow) | |
As can be seen, it's very similar to the first case. Only now, | |
the number of B12_ accesses is controlled by the xscroll value | |
and the xwindow value. As before, when the number of cycles has | |
elapsed, the access pattern is just cut short, and the W12 (W = window | |
nametable entry) access starts. | |
This window access pattern is identical to the background one, except | |
the window nametable is being accessed instead. | |
Turning on the window incurs a 6 cycle penalty, so the total number of | |
cycles taken is (173.5 + 6 + (xscroll % 7)). | |
* * * * * | |
OK, now things get slightly strange. When xwindow = 0, some slightly | |
different rules come into effect. | |
When (xscroll % 7) = 0 to 6, things work a bit different. Timing | |
looks like this: | |
B01B (7 cycles) technically the last B is part of the B01s pattern. | |
W01 (6 cycles) as above, the start of the window access pattern | |
W01s (167.5 to 173.5 cycles) (167.5 + (scroll % 7)) | |
As before, the W01s pattern is repeated for the required number of | |
cycles. When the count has expired, the access pattern is just cut off. | |
This takes 180.5 to 186.5 cycles. | |
When (xscroll % 7) = 7, then the timing is slightly modified version | |
of the above. The access pattern is identical to when (xscroll % 7) = 6 | |
except an extra cycle is inserted in the first sprite window, causing | |
the total amount to be 187.5 cycles. | |
* * * * * | |
When xwindow = 0a6h, then timing is identical to when the window is | |
disabled, i.e. 173.5 to 180.5 cycles. The difference is the window | |
nametable is used instead of the background nametable. Rendering | |
starts from the SECOND tile of each line, however. The net effect of | |
this is, the window register appears to be scrolled 8 pixels to the right | |
(if xscroll = 0). | |
Only the lower 3 bits of the xscroll register are used in this mode. | |
It shifts the the window left 0-7 cycles. Taking into account the first | |
paragraph above about the nametable, the net effect is the window appears | |
to be xscrolled 8 to 15 pixels left. | |
Effective xscroll = (xscroll % 7) + 8 | |
The top scanline of the screen is ALWAYS reflecting the very first | |
scanline of the window, when ywindow is less than or equal to 08fh. | |
The second scanline of the screen will reflect the second scanline of | |
the window, but ONLY when ywindow = 00h. Any other ywindow value will | |
result in the background showing for the first scanline. | |
The other scanlines of the screen (lines 3-144) will show the window, | |
*starting from the third window scanline* depending on the ywindow value. | |
This is hard to describe, but the effect is simple: | |
GB line: (ywindow = 0) | |
1 window 1 | |
2 window 2 | |
3 window 3 | |
4 window 4 | |
GB line: (ywindow = 1) | |
1 window 1 | |
2 background 2 | |
3 window 3 | |
4 window 4 | |
GB line: (ywindow = 2) | |
1 window 1 | |
2 background 2 | |
3 background 3 | |
4 window 3 | |
5 window 4 | |
6 window 5 | |
GB line: (ywindow = 3) | |
1 window 1 | |
2 background 2 | |
3 background 3 | |
4 background 4 | |
5 window 3 | |
6 window 4 | |
---- | |
What addresses are read during rendering | |
---------------------------------------- | |
Referring back to the fetch patterns above, I will go through what | |
addresses are read. | |
For the degenerate case, the access pattern looks like this: | |
B01 | |
B01s (repeated 20-22x) | |
Assuming that xscroll = 0, and yscroll = 0, and we have the background | |
reading the 9800h nametable: | |
B01 (reads 9800h) | |
B01s (reads 9800h, 9801h, 9802h, 9803h...9814h) | |
This is fairly simple: it just starts at the very upper left char of the | |
nametable and starts reading. It ends up reading 9800h TWICE. The | |
first access is just thrown away and is never used. It's here, because | |
it helps during windowing (I will describe later in the window section). | |
The way the characters are read is performed something like this: | |
At the start of the scanline: | |
1) latch the current character address | |
2) read a character from the address | |
3) read another character from the SAME address | |
4) increment address | |
5) read another character | |
6) repeat 4 and 5 20-22 times. | |
In step 1, the address we latch is calculated like this: | |
yscroll is the yscroll register on the GB CPU | |
xscroll is the xscroll register on the GB CPU | |
ycounter is the current scanline we are rendering (0-143) | |
whichnt is bit 3 of the LCD control reg on the GB CPU | |
ybase = (yscroll + ycounter) // calculates the effective vis. scanline | |
charaddr = (0x9800 | (whichnt << 10) | ((ybase & 0xf8) << 2) | | |
((xscroll & 0xf8) >> 3) | |
Another way to represent this address: | |
15 0 | |
------------------- | |
1001 1NYY YYYX XXXX | |
N = nametable # | |
Y = upper 5 bits of ybase | |
X = upper 5 bits of xscroll (which is then incremented between chars) | |
In step 2, we read from charaddr, and throw the result away | |
In step 3, we read from charaddr again and use it for the first vis. char | |
In step 4, *only the lower 5 bits* of charaddr is incremented | |
In step 5, we read the next character | |
Then, we repeat it enough times to fill out the scanline. | |
Once the nametable entry is fetched, we have to fetch the tile planes. | |
Depending on the state of the "BG & Window Tile Data Select" register, | |
which is LCD control bit 4, tile accesses are done one of two ways. | |
ntbyte = the nametable byte we read from the above NT address. | |
if (lcdcontrol[4]) tileaddr = (ntbyte << 4) | ((ysub & 0x7) << 1) | |
else tileaddr = (0x1000 - (ntbyte << 4)) | ((ysub & 0x7) << 1) | |
We will read the desired bytes for the tile data from tileaddr and | |
tileaddr+1 | |
Notice that xscroll's lower 3 bits don't SEEM to play into any of the | |
calculations above... this is because xscroll[2:0] does not affect | |
which characters are fetched in any way. Fine xscroll (lower 3 bits | |
of xscroll) only adjust the timing during LCD writing (explained | |
below). | |
* * * * * | |
So, this is all fine and good.. but what happens during windowing? | |
It's not much different than the above. | |
During a typical VRAM access pattern with the window, it looks something | |
like this: | |
B01 | |
B01s (repeated N times) | |
W01 | |
W01s (repeated M times) | |
The background rendering sequence is identical to the background only | |
sequence described previously. | |
When the window accesses start, the address calculation is similar... | |
First, a typical reading sequence: | |
B01 (9800h) | |
B01s (9800h, 9801h, 9802h, ...) | |
W01 (9C00h) | |
W01s (9C01h, 9C02h, ...) | |
The first change is that the first W01 access is NOT thrown away. | |
There is no duplicated read here as in the background read. | |
The nametable read is calculated like so: | |
windnt = the window nametable (LCD control bit 6) | |
basew = (ycount - yscroll) // calculates the effective window scanline | |
charaddr = (0x9800 | (windnt << 10) | ((basew & 0xf8) << 2) | |
Another way to represent this address: | |
15 0 | |
------------------- | |
1001 1NYY YYY0 0000 | |
N = nametable # | |
Y = upper 5 bits of basew | |
We simply read characters starting from the charaddr address, and | |
increment it each time we read a character until the scanline is | |
finished. | |
The tile plane address is calculated the same as it is calculated for | |
the background reads. | |
That wraps up the actual VRAM access patterns. | |
--- | |
LCD write timing | |
---------------- | |
Before I can describe how the LCD timing works, I have to first explain | |
how the LCD itself works. | |
The LCD is composed of a 2 bit wide by 159 bit deep shift register, | |
where the input pixels are shifted. Each rising edge of the display | |
clock, data is shifted one stage down the register. | |
When the display latch signal is activated, this shift register's value | |
is latched into the LCD column drivers. | |
The shift register is only 159 bits- the input data is used as the 160th | |
bit for latching into the LCD column drivers. | |
* The first pixel shifted into the register appears on the first column | |
* The last pixel shifted in appears on the second to last column | |
* The input pixel data on the input lines appears on the first column | |
Terrible ASCII: | |
DCLK: the display pixel clock | |
data0/1: the 2 bit pixel data | |
lat: the latch signal. when pulsed, latches the shift reg. data | |
bias: the LCD bias voltage (contrast wheel adjusts this) | |
inv: the LCD inverse signal (explained later) | |
+-----------------+ | |
LCD DCLK o-------|CLK | | |
| | | |
| 159*2 bit S/R | | |
LCD data0 o---*--|D0 | | |
LCD data1 o-*-+--|D1 | | |
| | | pix 1 -> 159 | | |
| | +-----------------+ | |
| | | ........... | | |
+--------------------------+ | |
| pix 0 pix 1 -> 159 | | |
| | | |
LCD lat o-|latch 160*2 bit latch | | |
| | | |
| 160*2 outputs | | |
+--------------------------+ | |
| .................... | | |
+--------------------------+ | |
| | | |
| 160 output drivers | | |
LCDbias o-|bias | | |
LCD inv o-|invert | | |
| | | |
| LCD outputs | | |
+--------------------------+ | |
|||||||||||||||||||||||| | |
+---------------------------------+ | |
| col 159 <- col 0 | | |
| LCD columns | | |
| | | |
| | | |
| LCD display | | |
| | | |
| | | |
| | | |
| | | |
| | | |
+---------------------------------+ | |
So now that the LCD column driving and latching has been explained, | |
the display timing that follows should make a bit more sense. | |
Because the LCD is clocked, unlike a CRT, this means that the hardware | |
has the ability to stop clocking the LCD for awhile if it feels like it. | |
The GB video hardware indeed does do this, and even uses it to advantage | |
during scrolling, sprite fetching, and starting the window rendering. | |
When windowing is disabled, the display clock always runs for the last | |
159 cycles. This is very interesting to me, because that means the | |
video hardware is actively shifting pixels out to the display, but | |
some of these pixels DO NOT HAVE A CORRESPONDING DCLK! This is how | |
fine X scroll is achieved- the first 0 to 7 pixels are just thrown away. | |
They get shifted out, but since the display clock is not running, they | |
do not get shifted in. By delaying these cycles, the display data will | |
shift left from 0 to 7 pixels. | |
The windowing function works the same way- the pixel where the window | |
is started will restart the rendering engine and thus allow single pixel | |
precision on where the window starts on the LCD. | |
During the first 6 cycles of the window fetch, the LCD clock is stalled. | |
This lets the pipeline fill and then display clocking resumes after the | |
6 cycle delay. | |
That takes care of the timing of display data clocking. | |
Now for the interesting part about how data is read and shifted out the | |
LCD data pins: | |
When the nametable fetch starts, the LAST tile data read will be latched | |
into two 8 bit shift registers, and then shift out the data pins one | |
pixel per clock. No matter what. | |
So, referring to the access pattern again: | |
B01 read the first tile | |
B01s latch the pixel data into the output shift registers | |
B01s latch the pixel data into the output shift registers | |
B01s... | |
Since each B01s access takes exactly 8 cycles, the output shift registers | |
will be exactly refilled when they are empty, and continue the output | |
data sending without interruption. | |
Fine xscroll is effected by controlling the point in this process where | |
the display clocking is started relative to the start of the rendering | |
phase. The data will always shift out the pixel data pins at the same | |
point in the render cycle, but since the DCLK is started earlier or | |
later, the point where the LCD starts latching data changes relative to | |
the data. This causes a 0-7 shift in the data on the LCD. | |
The afore-mentioned output shift registers will blindly shift out their | |
8 pixels of data without stopping, except when the LCD hardware is | |
stalled by a sprite fetch (described later). Thus, the timing of the | |
VRAM reads determines the amount of fine xscroll on the background | |
and on the window. | |
--- |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment