drhelius · April 5, 2018 02:05
diff --git a/Gameboy's video hardware b/Gameboy's video hardware
 Nitty Gritty Gameboy Cycle Timing
                  ---------------------------------
                  
                  
 A document about the down and dirty timing of the Gameboy's video hardware.

 Written by:  Kevin Horton
 Version: 0.01 (preliminary)

 My findings here are based on the original DMG, Super Gameboy, and GB
 Pocket.  All three appear to behave identically during testing, and the
 SGB was used for all the reverse engineering.

 An HP54645D mixed signal oscilloscope/logic analyzer was connected to the
 SGB using a 16 wire pod.  A 20 pin ribbon cable with IDC plug on the end
 was soldered to various points on the SGB, and a pin header was inserted
 into the IDC plug so that the pod connectors could be plugged in to 
 monitor the goings-on of the hardware.

 ---


 I have discovered some interesting things about how the Gameboy fetches
 VRAM data in general.  First, it will actually stop clocking the LCD and
 stall it if it needs to fetch something and is not ready to send the
 data out quite yet.

 Secondly, the window function is a restarting of the data fetching
 state machine, which is used to read the background tiles.  The window
 is triggered N clocks after the start of rendering, where N is determined
 by the value in the xwindow register.

 So, without further delay...

 Scanline timing
 ---------------

 During the discussion of scanline timing, I will be ignoring Y timing
 totally, since Y timing is unrelated to VRAM access patterns.  The Y
 timing only affects WHICH KIND of VRAM access occurs, and does not
 affect it in any other way.

 There are a couple cases that will be discussed, from simplest to most
 complex.

 * * * * *

 The first case is what I call the degenerate case: xwindow is set to 0ffh
 which disables the window totally, and then xscroll is adjusted.

 There are only 8 different possible cases.

 These types of access take from 173.5 to 180.5 cycles.  The reasoning for
 the half cycle will be described later.

 The access pattern looks like this:

 B01   - (6 cycles) fetch Background nametable byte, then 2 tile planes
 B01s  - (167.5 + (xscroll & 7) cycles) fetch another tile and sprite
         window.

 Where:
 B = reading the background tile # (i.e. out of 1800h from the first
 nametable)
 0,1 = where the tile graphics are fetched.  bitplanes 0 and 1.
 s = sprite window.  the sprite hardware will insert reads here if needed.

 Each access to VRAM (B, 0, 1, s) takes 2 cycles to occur.  A "cycle" is
 exactly 1 period of the main input clock to the gameboy CPU chip.  This
 is nominally about 4.19MHz.

 The last four accesses (B01s) are repeated until the proper number of
 cycles has elapsed.

 In the xscroll = 0 case, it will run for 167.5 cycles.  This has an
 interesting side effect- any tile access that is not complete just
 gets unceremoniously cut off. This means that there will be 20 complete
 tile accesses (B01s) and then 7.5 clocks worth of a 21st access,
 cutting off the last half cycle of the sprite window.

 In the xscroll = 2 case, it will run for 169.5 cycles. Similar to above,
 this will result in 21 complete B01s accesses, and then 1.5 cycles of the 
 B fetch on a 22nd access.

 This pattern continues up to xscroll = 7 (taking a total of 180.5 cycles)
 before snapping back to 173.5 cycles when xscroll = 8.

 The total number of cycles taken is (173.5 + (xscroll & 7)).
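
 As a rough sketch of the arithmetic above (Python, with hypothetical
 function names), taking the fine scroll as the low three bits of xscroll
 (xscroll & 7), which is what yields exactly 8 cases:

```python
def degenerate_cycles(xscroll):
    """Total render cycles for one scanline with the window disabled."""
    fine = xscroll & 7            # only the low 3 bits matter: 8 cases
    return 173.5 + fine


def b01s_breakdown(xscroll):
    """Split the repeating B01s phase into complete fetches and leftover cycles."""
    run = 167.5 + (xscroll & 7)   # cycles spent in the repeating B01s pattern
    complete = int(run // 8)      # each B01s group is 4 accesses * 2 cycles
    leftover = run - 8 * complete # cycles of the final, cut-off group
    return complete, leftover
```

 For xscroll = 0 this gives 20 complete groups plus 7.5 cycles, and for
 xscroll = 2 it gives 21 complete groups plus 1.5 cycles, matching the two
 cases worked through above.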

 Now, you may have been wondering what this extra 1/2 cycle business is about.
 Well, it has to do with how the display clock is generated.  The display
 clock is generated via inverting the main input clock.  I suspect it was
 done so that the video hardware can get the data to the LCD ready on the
 falling edge of the main clock.  The display clocks data in on the RISING
 edge of the display clock, thus necessitating the inverted display clock
 relative to the main clock.

 That causes the vram access pattern to be extended 1/2 clock on the end
 to accommodate the inverted clock.

 * * * * *

 So, that takes care of the easy case.  Now for what happens when the
 window is used.

 NOTE: When xwindow = 00h or xwindow = 0a6h, different things happen.
 I will explain them later on.  For now, the following information
 holds when ((xwindow > 00h) && (xwindow < 0a6h)).

 Interestingly, adding the window does not change a whole lot- in fact,
 it simply restarts the whole fetch sequence over again, no matter where
 it was!  The timing is generated like so:

 B01   (6 cycles) fetch background tile nametable+bitplanes
 B01_  (1 to 172 cycles) ((xscroll & 7) + xwindow + 1)
 W01   (6 cycles) fetch window tile nametable+bitplanes
 W01_  (1.5 to 166.5 cycles)  (166.5 - xwindow)

 As can be seen, it's very similar to the first case.  Only now,
 the number of B01_ accesses is controlled by the xscroll value
 and the xwindow value.  As before, when the number of cycles has
 elapsed, the access pattern is just cut short, and the W01 (W = window
 nametable entry) access starts.

 This window access pattern is identical to the background one, except
 the window nametable is being accessed instead.

 Turning on the window incurs a 6 cycle penalty, so the total number of
 cycles taken is (173.5 + 6 + (xscroll & 7)).
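
 The four phases can be added up in a small sketch (Python; the function
 name is hypothetical, and the fine scroll is again taken as xscroll & 7).
 It only covers the range where these rules hold, 00h < xwindow < 0a6h:

```python
def window_scanline_phases(xscroll, xwindow):
    """Return ([(phase, cycles), ...], total) for a scanline with the window on."""
    if not 0x00 < xwindow < 0xA6:
        raise ValueError("xwindow = 00h and 0a6h follow different rules")
    fine = xscroll & 7
    phases = [
        ("B01", 6.0),                         # first background fetch
        ("B01_", float(fine + xwindow + 1)),  # background until the window hits
        ("W01", 6.0),                         # fetch restart for the window
        ("W01_", 166.5 - xwindow),            # window until the end of the line
    ]
    total = sum(cycles for _, cycles in phases)
    return phases, total
```

 The xwindow terms cancel in the sum, which is why the total depends only
 on the fine scroll: 173.5 + 6 + (xscroll & 7).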

 * * * * *

 OK, now things get slightly strange.  When xwindow = 0, some slightly
 different rules come into effect.

 When (xscroll & 7) = 0 to 6, things work a bit differently.  Timing
 looks like this:

 B01B (7 cycles) technically the last B is part of the B01s pattern.
 W01  (6 cycles) as above, the start of the window access pattern
 W01s (167.5 to 173.5 cycles) (167.5 + (xscroll & 7))

 As before, the W01s pattern is repeated for the required number of 
 cycles.  When the count has expired, the access pattern is just cut off.

 This takes 180.5 to 186.5 cycles. 

 When (xscroll & 7) = 7, the timing is a slightly modified version
 of the above.  The access pattern is identical to when (xscroll & 7) = 6
 except an extra cycle is inserted in the first sprite window, causing
 the total amount to be 187.5 cycles.
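
 A minimal sketch of the xwindow = 0 totals (Python, hypothetical function
 name, fine scroll taken as xscroll & 7):

```python
def xwindow0_cycles(xscroll):
    """Total render cycles for one scanline when xwindow = 0."""
    fine = xscroll & 7
    if fine < 7:
        # B01B (7) + W01 (6) + W01s (167.5 + fine)
        return 7 + 6 + (167.5 + fine)
    # fine == 7: same pattern as fine == 6, plus one extra cycle
    # inserted in the first sprite window
    return 7 + 6 + (167.5 + 6) + 1
```

 This reproduces the 180.5 to 186.5 range for fine scroll 0 to 6 and the
 187.5 cycle special case for fine scroll 7.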

 * * * * *

 When xwindow = 0a6h, then timing is identical to when the window is 
 disabled, i.e. 173.5 to 180.5 cycles.  The difference is the window
 nametable is used instead of the background nametable.  Rendering
 starts from the SECOND tile of each line, however.  The net effect of
 this is, the window register appears to be scrolled 8 pixels to the right
 (if xscroll = 0).

 Only the lower 3 bits of the xscroll register are used in this mode.
 It shifts the window left 0-7 cycles.   Taking into account the first
 paragraph above about the nametable, the net effect is the window appears
 to be xscrolled 8 to 15 pixels left.

 Effective xscroll = (xscroll & 7) + 8

 The top scanline of the screen is ALWAYS reflecting the very first
 scanline of the window, when ywindow is less than or equal to 08fh.

 The second scanline of the screen will reflect the second scanline of
 the window, but ONLY when ywindow = 00h.  Any other ywindow value will
 result in the background showing for the second scanline.

 The other scanlines of the screen (lines 3-144) will show the window,
 *starting from the third window scanline* depending on the ywindow value.

 This is hard to describe, but the effect is simple:

 GB line:   (ywindow = 0)
 1   window 1
 2   window 2
 3   window 3
 4   window 4

 GB line:   (ywindow = 1)
 1   window 1
 2   background 2
 3   window 3
 4   window 4

 GB line:   (ywindow = 2)
 1   window 1
 2   background 2
 3   background 3
 4   window 3
 5   window 4
 6   window 5

 GB line:   (ywindow = 3)
 1   window 1
 2   background 2
 3   background 3
 4   background 4
 5   window 3
 6   window 4
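
 The four tables can be condensed into one rule.  The sketch below (Python,
 hypothetical function name) is only an extrapolation from the tabulated
 cases, and assumes ywindow is no greater than 08fh so that line 1 always
 shows the window:

```python
def a6_line_source(line, ywindow):
    """Source of a screen line when xwindow = 0a6h.

    line is the 1-based screen line, as in the tables above.
    Returns ("window", n) or ("background", n).
    """
    if line == 1:
        return ("window", 1)           # top line always shows window line 1
    if ywindow == 0:
        return ("window", line)        # window lines track screen lines
    if line <= ywindow + 1:
        return ("background", line)    # background until the window begins
    # the window resumes from its third scanline
    return ("window", line - ywindow + 1)
```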

 ----


 What addresses are read during rendering
 ----------------------------------------

 Referring back to the fetch patterns above, I will go through what
 addresses are read.

 For the degenerate case, the access pattern looks like this:

 B01 
 B01s (repeated 20-22x)

 Assuming that xscroll = 0, and yscroll = 0, and we have the background
 reading the 9800h nametable:

 B01  (reads 9800h)
 B01s (reads 9800h, 9801h, 9802h, 9803h...9814h)

 This is fairly simple: it just starts at the very upper left char of the
 nametable and starts reading.   It ends up reading 9800h TWICE.  The
 first access is just thrown away and is never used.  It is there because
 it helps during windowing (I will describe later in the window section).

 The way the characters are read is performed something like this:

 At the start of the scanline:

 1) latch the current character address
 2) read a character from the address
 3) read another character from the SAME address
 4) increment address
 5) read another character
 6) repeat 4 and 5 20-22 times.

 In step 1, the address we latch is calculated like this:

 yscroll is the yscroll register on the GB CPU
 xscroll is the xscroll register on the GB CPU
 ycounter is the current scanline we are rendering (0-143)
 whichnt is bit 3 of the LCD control reg on the GB CPU

 ybase = (yscroll + ycounter)   // calculates the effective vis. scanline

 charaddr = (0x9800 | (whichnt << 10) | ((ybase & 0xf8) << 2) |
            ((xscroll & 0xf8) >> 3))

 Another way to represent this address:

 15                0
 -------------------
 1001 1NYY YYYX XXXX

 N = nametable #
 Y = upper 5 bits of ybase
 X = upper 5 bits of xscroll (which is then incremented between chars)


 In step 2, we read from charaddr, and throw the result away
 In step 3, we read from charaddr again and use it for the first vis. char
 In step 4, *only the lower 5 bits* of charaddr are incremented
 In step 5, we read the next character

 Then, we repeat it enough times to fill out the scanline.
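
 Steps 1 to 6 can be sketched as follows (Python; the function name and the
 count parameter are hypothetical, the address formula is the one given
 above):

```python
def bg_char_addresses(xscroll, yscroll, ycounter, whichnt, count=21):
    """Nametable addresses read on one background scanline, in order.

    count is the number of B01s groups (20-22 in the text); the result
    has count + 1 entries because of the discarded first read.
    """
    ybase = yscroll + ycounter
    # step 1: latch the starting character address
    charaddr = (0x9800 | (whichnt << 10) | ((ybase & 0xF8) << 2)
                | ((xscroll & 0xF8) >> 3))
    reads = [charaddr]       # step 2: first read, result thrown away
    reads.append(charaddr)   # step 3: same address, first visible char
    for _ in range(count - 1):
        # step 4: increment only the low 5 bits, so the row wraps around
        charaddr = (charaddr & ~0x1F) | ((charaddr + 1) & 0x1F)
        reads.append(charaddr)   # step 5
    return reads
```

 With xscroll = yscroll = ycounter = 0 and whichnt = 0 this yields 9800h,
 9800h, 9801h ... 9814h.  Because only the low 5 bits increment, a line
 that starts at column 31 wraps back to column 0 of the same row instead
 of running into the next row.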

 Once the nametable entry is fetched, we have to fetch the tile planes.

 Depending on the state of the "BG & Window Tile Data Select" register,
 which is LCD control bit 4, tile accesses are done one of two ways.


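 ntbyte = the nametable byte we read from the above NT address.
 ysub   = the row within the tile (presumably the low 3 bits of the
          effective scanline; it is not defined elsewhere in this document)

 if (lcdcontrol[4]) tileaddr = (ntbyte << 4) | ((ysub & 0x7) << 1)
 else tileaddr = (0x1000 + (signed(ntbyte) << 4)) | ((ysub & 0x7) << 1)

 Here tileaddr is an offset from the start of VRAM (8000h), and
 signed(ntbyte) treats the nametable byte as a signed 8 bit value
 (-128 to 127).

 We will read the desired bytes for the tile data from tileaddr and
 tileaddr+1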
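
 A sketch of the tile plane address calculation (Python, hypothetical
 function name).  Two assumptions are made here: ysub is taken to be the
 row within the tile, and for the LCD control bit 4 = 0 case the nametable
 byte is treated as a signed offset from VRAM offset 1000h, which is the
 commonly documented behaviour.  Addresses are offsets from the start of
 VRAM (8000h):

```python
def tile_row_addresses(ntbyte, ysub, lcdc_bit4):
    """Return the VRAM offsets of bitplane 0 and bitplane 1 for one tile row."""
    row = (ysub & 0x7) << 1              # 2 bytes per tile row
    if lcdc_bit4:
        base = ntbyte << 4               # unsigned index, tiles at 0000h-0FFFh
    else:
        # assumption: signed index around 1000h (tiles at 0800h-17FFh)
        signed = ntbyte - 0x100 if ntbyte & 0x80 else ntbyte
        base = 0x1000 + (signed << 4)
    tileaddr = base | row
    return tileaddr, tileaddr + 1
```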

 Notice that xscroll's lower 3 bits don't SEEM to play into any of the
 calculations above... this is because xscroll[2:0] does not affect
 which characters are fetched in any way.  Fine xscroll (lower 3 bits
 of xscroll) only adjusts the timing during LCD writing (explained
 below).


 * * * * *

 So, this is all fine and good.. but what happens during windowing?
 It's not much different than the above.  

 During a typical VRAM access pattern with the window, it looks something
 like this:

 B01
 B01s (repeated N times)
 W01
 W01s (repeated M times)

 The background rendering sequence is identical to the background only
 sequence described previously.

 When the window accesses start, the address calculation is similar...

 First, a typical reading sequence:

 B01  (9800h)
 B01s (9800h, 9801h, 9802h, ...)
 W01  (9C00h)
 W01s (9C01h, 9C02h, ...)

 The first change is that the first W01 access is NOT thrown away.
 There is no duplicated read here as in the background read.

 The nametable read is calculated like so:

 windnt = the window nametable (LCD control bit 6)

 basew = (ycounter - ywindow)   // calculates the effective window scanline

 charaddr = (0x9800 | (windnt << 10) | ((basew & 0xf8) << 2))

 Another way to represent this address:

 15                0
 -------------------
 1001 1NYY YYY0 0000

 N = nametable #
 Y = upper 5 bits of basew


 We simply read characters starting from the charaddr address, and 
 increment it each time we read a character until the scanline is
 finished.

 The tile plane address is calculated the same as it is calculated for
 the background reads.
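
 The window nametable reads can be sketched the same way (Python; the
 function name and count parameter are hypothetical).  This assumes the
 window row is the current scanline minus the window Y position, called
 ywindow here:

```python
def window_char_addresses(ycounter, ywindow, windnt, count=20):
    """Nametable addresses read for the window part of one scanline."""
    basew = ycounter - ywindow           # assumed effective window scanline
    charaddr = 0x9800 | (windnt << 10) | ((basew & 0xF8) << 2)
    # no duplicated first read: the window starts at column 0 of its row
    # and simply increments for each character
    return [charaddr + i for i in range(count)]
```

 With windnt = 1 on the first window scanline this gives 9C00h, 9C01h,
 9C02h and so on, as in the reading sequence above.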


 That wraps up the actual VRAM access patterns.

 ---

 LCD write timing
 ----------------

 Before I can describe how the LCD timing works, I have to first explain
 how the LCD itself works.

 The LCD is composed of a 2 bit wide by 159 bit deep shift register,
 where the input pixels are shifted.  Each rising edge of the display
 clock, data is shifted one stage down the register. 

 When the display latch signal is activated, this shift register's value
 is latched into the LCD column drivers.

 The shift register is only 159 bits- the input data is used as the 160th
 bit for latching into the LCD column drivers.

 * The first pixel shifted into the register appears on the first column
 * The last pixel shifted in appears on the second to last column
 * The input pixel data on the input lines appears on the last column
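
 A toy model of this arrangement (Python, hypothetical class name).  It
 assumes the reading in which the oldest pixel in the register drives the
 first column and the live input lines supply the 160th (last) column:

```python
class LcdColumnRegister:
    """159-stage, 2 bit wide shift register plus pass-through input."""

    STAGES = 159

    def __init__(self):
        self.stages = [0] * self.STAGES   # stages[0] holds the newest pixel

    def clock(self, pixel):
        """Rising DCLK edge: shift every pixel one stage down."""
        self.stages = [pixel & 3] + self.stages[:-1]

    def latch(self, input_pixel):
        """Latch pulse: return the 160 column values, first column first."""
        # oldest pixel sits at the far end of the register -> first column;
        # the pixel still on the input lines becomes the last column
        return list(reversed(self.stages)) + [input_pixel & 3]
```

 Clocking in 159 pixels and then latching with the 160th still on the
 data lines fills all 160 columns.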

 Terrible ASCII:


 DCLK:    the display pixel clock
 data0/1: the 2 bit pixel data
 lat:     the latch signal.  when pulsed, latches the shift reg. data
 bias:    the LCD bias voltage (contrast wheel adjusts this)
 inv:     the LCD inverse signal (explained later)


                 +-----------------+
 LCD DCLK o-------|CLK              |
                 |                 |
                 |  159*2 bit S/R  |
 LCD data0 o---*--|D0               |
 LCD data1 o-*-+--|D1               |
            | |  |   pix 1 ->  159 |
            | |  +-----------------+
            | |    | ........... |
          +--------------------------+
          | pix 0     pix 1 ->  159  |
          |                          |
 LCD lat o-|latch    160*2 bit latch  |
          |                          |
          |      160*2 outputs       |
          +--------------------------+
            | .................... |
          +--------------------------+
          |                          |
          |    160 output drivers    |
 LCDbias o-|bias                      |
 LCD inv o-|invert                    |
          |                          |
          |        LCD outputs       |
          +--------------------------+
            ||||||||||||||||||||||||
       +---------------------------------+
       |      col 159   <-    col 0      |     
       |           LCD columns           |     
       |                                 |     
       |                                 |     
       |           LCD display           |     
       |                                 |     
       |                                 |     
       |                                 |     
       |                                 |     
       |                                 |     
       +---------------------------------+
       
       
 So now that the LCD column driving and latching has been explained,
 the display timing that follows should make a bit more sense.

 Because the LCD is clocked, unlike a CRT, this means that the hardware
 has the ability to stop clocking the LCD for a while if it feels like it.
 The GB video hardware indeed does do this, and even uses it to advantage
 during scrolling, sprite fetching, and starting the window rendering.

 When windowing is disabled, the display clock always runs for the last
 159 cycles.  This is very interesting to me, because that means the
 video hardware is actively shifting pixels out to the display, but
 some of these pixels DO NOT HAVE A CORRESPONDING DCLK!  This is how
 fine X scroll is achieved- the first 0 to 7 pixels are just thrown away.
 They get shifted out, but since the display clock is not running, they
 do not get shifted in.  By delaying these cycles, the display data will
 shift left from 0 to 7 pixels.
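
 The effect of withholding the display clock can be illustrated with a
 simplified sketch (Python; the function and parameter names are
 hypothetical, and the model ignores sprites and the window).  Pixels are
 always shifted out, but the LCD only takes one in when DCLK is running:

```python
def visible_pixels(pixel_stream, fine_scroll, width=160):
    """Return the pixels the LCD actually shifts in for one scanline."""
    shifted_in = []
    for cycle, pixel in enumerate(pixel_stream):
        # DCLK is held off for the first (fine_scroll & 7) pixels, so
        # those pixels leave the data pins but are never clocked in
        dclk_running = cycle >= (fine_scroll & 7)
        if dclk_running and len(shifted_in) < width:
            shifted_in.append(pixel)
    return shifted_in
```

 With a fine scroll of 3, the first pixel that reaches the display is the
 fourth one shifted out, so the picture appears moved 3 pixels to the left.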

 The windowing function works the same way- the pixel where the window
 is started will restart the rendering engine and thus allow single pixel
 precision on where the window starts on the LCD.

 During the first 6 cycles of the window fetch, the LCD clock is stalled.
 This lets the pipeline fill and then display clocking resumes after the
 6 cycle delay.

 That takes care of the timing of display data clocking.  

 Now for the interesting part about how data is read and shifted out the
 LCD data pins:

 When the nametable fetch starts, the LAST tile data read will be latched
 into two 8 bit shift registers, and then shift out the data pins one
 pixel per clock.  No matter what.

 So, referring to the access pattern again:

 B01   read the first tile
 B01s  latch the pixel data into the output shift registers
 B01s  latch the pixel data into the output shift registers
 B01s...

 Since each B01s access takes exactly 8 cycles, the output shift registers
 will be exactly refilled when they are empty, and continue the output
 data sending without interruption.
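
 The pair of 8 bit output shift registers can be sketched like this
 (Python, hypothetical function name).  The bit order is an assumption not
 stated in this document: bit 7 is taken to shift out first, and bitplane 0
 is taken to be the low bit of each 2 bit pixel:

```python
def shift_out_tile_row(plane0, plane1):
    """Shift one tile row out of the two 8 bit registers, one pixel per clock."""
    pixels = []
    for _ in range(8):
        bit0 = (plane0 >> 7) & 1          # assumed: MSB leaves first
        bit1 = (plane1 >> 7) & 1
        pixels.append((bit1 << 1) | bit0) # 2 bit pixel on data0/data1
        plane0 = (plane0 << 1) & 0xFF     # both registers shift together
        plane1 = (plane1 << 1) & 0xFF
    return pixels
```

 After 8 clocks both registers are empty, which is exactly when the next
 B01s group has its two bitplane bytes ready to be latched in.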

 Fine xscroll is effected by controlling the point in this process where
 the display clocking is started relative to the start of the rendering
 phase.  The data will always shift out the pixel data pins at the same
 point in the render cycle, but since the DCLK is started earlier or
 later, the point where the LCD starts latching data changes relative to
 the data.  This causes a 0-7 shift in the data on the LCD.

 The aforementioned output shift registers will blindly shift out their
 8 pixels of data without stopping, except when the LCD hardware is
 stalled by a sprite fetch (described later).  Thus, the timing of the
 VRAM reads determines the amount of fine xscroll on the background
 and on the window.

 ---