Accesses to high physical memory addresses bypass the normal cache hierarchy and use a special I/O bus instead. While this is useful for relatively low-speed peripherals, it has performance limitations when used with coprocessors such as texture fetch units:
- It can only write a single 32-bit word at a time. For vectorized compute code, this requires 32 instructions to copy a vector value into or out of it (getlane/write).
- Every read or write incurs a rollback, since the I/O bus is shared by all cores and transactions must be arbitrated globally.
- There is no way for peripherals to make a thread wait until they are ready, so threads must poll them, which wastes processor cycles and clogs up the bus.
- The bus does not have a concept of thread IDs, which requires some form of arbitration at the software or coprocessor level.
The proposal is to create a new bus for use with high-speed peripherals. This would not replace the low-speed bus, but would address different use cases and constraints.
The new bus would be specific to a single core and not shared globally by all cores. This would eliminate the overhead of arbitration among cores (although a peripheral could implement its own arbitration scheme). It would have the following interface:

    interface coprocessor_bus_interface;
        logic write_en;
        logic read_en;
        scalar_t address;
        thread_idx_t thread_idx;
        vector_t write_data;
        vector_t read_data;
        logic ack;
        local_thread_bitmap_t wake_bitmap;

        modport master(output write_en, read_en, address, thread_idx, write_data,
            input read_data, ack, wake_bitmap);
        modport slave(input write_en, read_en, address, thread_idx, write_data,
            output read_data, ack, wake_bitmap);
    endinterface

On the cycle after a write or read is issued, the peripheral would either assert or deassert the ack signal:
- If the ack signal is asserted, the peripheral was ready and the thread can continue executing without a rollback. If this was a read, the signal 'read_data' will contain data from the peripheral.
- If the ack signal is deasserted, the pipeline will suspend the thread. The peripheral can later wake it by asserting the thread's bit in 'wake_bitmap', which contains one bit per thread.
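The handshake above could be tracked on the core side with one wait bit per thread. The following is only a sketch under stated assumptions, not the actual pipeline logic: the module name `coprocessor_wait_tracker`, the `NUM_THREADS` parameter, the `request_en`/`request_thread` ports, and the use of plain ports instead of the interface (so the block stands alone) are all hypothetical.

```systemverilog
// Hypothetical core-side bookkeeping for the ack/wake handshake.
module coprocessor_wait_tracker
    #(parameter NUM_THREADS = 4)
    (input                                clk,
    input                                 reset,
    input                                 request_en,      // write_en | read_en
    input [$clog2(NUM_THREADS) - 1:0]     request_thread,  // thread_idx of the request
    input                                 ack,
    input [NUM_THREADS - 1:0]             wake_bitmap,
    output logic[NUM_THREADS - 1:0]       thread_waiting);

    logic request_pending;
    logic[$clog2(NUM_THREADS) - 1:0] pending_thread;
    logic[NUM_THREADS - 1:0] suspend_oh;

    // The response arrives the cycle after the request. If ack is low
    // at that point, mark the requesting thread for suspension.
    always_comb
    begin
        suspend_oh = '0;
        if (request_pending && !ack)
            suspend_oh[pending_thread] = 1'b1;
    end

    always_ff @(posedge clk, posedge reset)
    begin
        if (reset)
        begin
            request_pending <= 1'b0;
            pending_thread <= '0;
            thread_waiting <= '0;
        end
        else
        begin
            // Remember this cycle's request so ack can be checked next cycle.
            request_pending <= request_en;
            pending_thread <= request_thread;

            // A wake clears the wait bit and takes priority over a suspend
            // in the same cycle, so a wakeup is never lost.
            thread_waiting <= (thread_waiting | suspend_oh) & ~wake_bitmap;
        end
    end
endmodule
```

Giving the wake priority over a simultaneous suspend is a design choice in this sketch; the proposal itself does not specify how that case is resolved.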
Peripherals can add FIFOs for writes and reads, only blocking threads when the FIFOs are full or empty.
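A write FIFO of this kind might look like the following on the peripheral side. This is a minimal sketch, not part of the proposal: the module name `fifo_write_peripheral`, its parameters, and the `dequeue_en`/`data_valid`/`data_out` consumer ports are hypothetical, plain ports stand in for the slave modport so the block is self-contained, reads are omitted, and it assumes a woken thread reissues its write.

```systemverilog
// Hypothetical peripheral that buffers writes and only blocks when full.
// FIFO_DEPTH is assumed to be a power of two and at least 2.
module fifo_write_peripheral
    #(parameter NUM_THREADS = 4,
    parameter DATA_WIDTH = 512,
    parameter FIFO_DEPTH = 4)
    (input                                clk,
    input                                 reset,
    // Same roles as the slave side of the proposed bus
    input                                 write_en,
    input [$clog2(NUM_THREADS) - 1:0]     thread_idx,
    input [DATA_WIDTH - 1:0]              write_data,
    output logic                          ack,
    output logic[NUM_THREADS - 1:0]       wake_bitmap,
    // Hypothetical consumer that drains the FIFO
    input                                 dequeue_en,
    output logic                          data_valid,
    output logic[DATA_WIDTH - 1:0]        data_out);

    localparam PTR_WIDTH = $clog2(FIFO_DEPTH);

    logic[DATA_WIDTH - 1:0] fifo_data[FIFO_DEPTH];
    logic[PTR_WIDTH - 1:0] head;
    logic[PTR_WIDTH - 1:0] tail;
    logic[PTR_WIDTH:0] count;
    logic[NUM_THREADS - 1:0] waiters;
    logic full;
    logic do_enqueue;
    logic do_dequeue;

    assign full = (count == FIFO_DEPTH);
    assign data_valid = (count != 0);
    assign data_out = fifo_data[head];
    assign do_dequeue = dequeue_en && data_valid;

    // Accept a write if there is room, or if an entry leaves this cycle.
    assign do_enqueue = write_en && (!full || do_dequeue);

    always_ff @(posedge clk)
    begin
        if (do_enqueue)
            fifo_data[tail] <= write_data;
    end

    always_ff @(posedge clk, posedge reset)
    begin
        if (reset)
        begin
            head <= '0;
            tail <= '0;
            count <= '0;
            waiters <= '0;
            ack <= 1'b0;
            wake_bitmap <= '0;
        end
        else
        begin
            // ack is valid the cycle after the request.
            ack <= do_enqueue;
            if (do_enqueue)
                tail <= tail + 1'b1;

            if (do_dequeue)
                head <= head + 1'b1;

            count <= count + do_enqueue - do_dequeue;

            if (do_dequeue)
            begin
                // Space was freed: pulse the wake bits of all blocked threads.
                wake_bitmap <= waiters;
                waiters <= '0;
            end
            else
            begin
                wake_bitmap <= '0;
                // Rejected write (ack stays low): remember which thread to wake.
                if (write_en && full)
                    waiters[thread_idx] <= 1'b1;
            end
        end
    end
endmodule
```

Waking every blocked thread when a single slot frees up keeps the logic small; threads that lose the race are rejected again and go back into `waiters`. A peripheral could instead wake one thread at a time, which is the kind of local arbitration the proposal leaves up to the peripheral.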
- Functional
  - Need to create a dummy peripheral in the testbench.
  - Need a test that blocks on read/write and one that does not block.
  - Test thread resume.
- Performance