It turns out that if you're just doing inference, Llama can be written very concisely. This implementation includes paged attention. Speculative decoding could also be added for a further speed boost, but it's fairly verbose and was left out to keep the implementation clean.
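The paged-attention part is easiest to see as a block-table indirection over the KV cache: keys and values live in fixed-size physical pages, and each sequence holds a table mapping logical positions to pages. Below is a minimal sketch of that idea in PyTorch; the names (`PagedKVCache`, `PAGE_SIZE`, `block_tables`) are illustrative assumptions, not this repo's actual API:

```python
# Minimal sketch of a paged KV cache; names are hypothetical, not this repo's API.
import torch

PAGE_SIZE = 16  # tokens per physical page

class PagedKVCache:
    def __init__(self, num_pages, n_heads, head_dim, dtype=torch.float16):
        # One physical pool of pages, shared by all sequences.
        self.k_pages = torch.zeros(num_pages, PAGE_SIZE, n_heads, head_dim, dtype=dtype)
        self.v_pages = torch.zeros(num_pages, PAGE_SIZE, n_heads, head_dim, dtype=dtype)
        self.free_pages = list(range(num_pages))
        self.block_tables = {}  # seq_id -> list of physical page indices

    def append(self, seq_id, pos, k, v):
        # Map a logical token position to (page slot in the table, offset in page).
        table = self.block_tables.setdefault(seq_id, [])
        page_slot, offset = divmod(pos, PAGE_SIZE)
        if page_slot == len(table):          # sequence needs a fresh page
            table.append(self.free_pages.pop())
        page = table[page_slot]
        self.k_pages[page, offset] = k
        self.v_pages[page, offset] = v

    def gather(self, seq_id, length):
        # Materialize the first `length` cached positions for attention.
        table = self.block_tables[seq_id]
        k = self.k_pages[table].reshape(-1, *self.k_pages.shape[2:])[:length]
        v = self.v_pages[table].reshape(-1, *self.v_pages.shape[2:])[:length]
        return k, v
```

The payoff is that sequences no longer need contiguous cache memory, so pages can be allocated on demand and recycled when a sequence finishes; a production kernel would attend through the block table directly rather than gathering into a contiguous tensor as this sketch does.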
Download the Llama files and place them in a directory ./Llama3.2-3B
(or whatever flavor of Llama you want).
Your directory structure should look like:
./Llama3.2-3B/consolidated.00.pth
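Once the files are in place, a quick sanity check is to load them. This is a hedged sketch assuming the standard Meta download layout, which ships a `params.json` alongside the consolidated weights; adjust the path for whatever flavor you downloaded:

```python
# Sanity-check sketch; assumes the standard Meta checkpoint layout,
# not necessarily everything this repo consumes.
import json
from pathlib import Path
import torch

ckpt_dir = Path("./Llama3.2-3B")

# The consolidated checkpoint is a plain state dict of tensors.
weights = torch.load(ckpt_dir / "consolidated.00.pth",
                     map_location="cpu", weights_only=True)
print(f"{len(weights)} tensors loaded")

# Meta downloads also include the model hyperparameters.
params = json.loads((ckpt_dir / "params.json").read_text())
print(params)  # dim, n_layers, n_heads, ...
```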