@brson
Last active December 13, 2015 20:18

This is a prototype task scheduler, written in Rust, that is driven by an abstract event loop, along with the beginnings of an I/O interface for translating asynchronous I/O to synchronous I/O via scheduler context switches. It is part of issue #4419.

It is not ready to merge, but this is an opportunity to review and discuss.

While I am primarily interested in proving the integration of I/O with the scheduler, this is also written with a number of other goals in mind:

  • Replacing C++ code with Rust
  • Cleaning up stack management (#2044, #4479, #4480)
  • Supporting the work stealing algorithm (#3095)
  • Providing scheduling operations for different scenarios, some of which avoid memory synchronization
  • Simplifying the lifecycle and multithreading complexity of runtime types

What is implemented:

  • Context switching operations (scheduler->task, task->scheduler, task->task, blocking)
  • An opaque event loop and I/O abstraction using objects
  • Safer uv bindings that build on std::uv::ll
  • Very simple TCP reads and writes

Unimplemented:

  • Anything related to multithreading
  • Much of I/O, including handling any case where a task isn't available to respond to an I/O event
  • Platforms other than x86_64 and Unix

Scheduler design

I am trying to use ownership to make the relationships between scheduler types clear, and as a result this scheduler is structured quite differently from the current one.

The core idea here is that tasks are owned, and code that owns a task is free to schedule it. During the lifetime of a task, ownership transfers between schedulers and the objects on which the task is blocked. The state of a task (blocked vs. running vs. dead, etc.) is encoded in its ownership.
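
As a rough sketch of that idea (the types and names here are hypothetical, with `Box` standing in for `~` owned pointers): whichever object holds the task owns it, so "blocked" just means the task is parked inside the thing it is waiting on, and waking it is a plain ownership transfer back to a scheduler.

```rust
struct Task {
    // saved registers, stack, etc.
}

struct Scheduler {
    // Tasks the scheduler owns are runnable.
    run_queue: Vec<Box<Task>>,
}

struct PendingRead {
    // While a read is outstanding, the blocked task is owned by the
    // I/O request, not by any scheduler.
    blocked_task: Option<Box<Task>>,
}

impl Scheduler {
    fn schedule(&mut self, task: Box<Task>) {
        // Owning the task is what makes scheduling it legal.
        self.run_queue.push(task);
    }
}

impl PendingRead {
    fn complete(&mut self, sched: &mut Scheduler) {
        // On completion, ownership moves back to a scheduler, which is
        // exactly what "the task becomes runnable again" means here.
        if let Some(task) = self.blocked_task.take() {
            sched.schedule(task);
        }
    }
}
```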

For comparison, in the existing scheduler the task is always owned by a scheduler but is atomically reference counted, so other entities (like pipes that a task is blocked on) can also hold pointers to it. The resulting task lifecycle is quite complicated, and I think unnecessarily so.

I believe this model works with pipes very well, though I don't have a complete understanding of pipes yet.

You'll notice this code is still written as a very big test case, for ease of development.

The most important submodules:

  • rt::sched - The core of the scheduler, containing both Scheduler and Task
  • rt::uv - libuv bindings that build on uv::ll
  • rt::io - The runtime's abstract I/O interface, implemented for a specific event loop
  • rt::uvio - The implementation of io for libuv

In this iteration, Scheduler is very simple, mostly providing a few primitive context-switching operations, and I like that, so I'll probably break it into multiple types: one as currently written, another dealing with scheduling policy and multithreading. As for multithreading, schedulers are intended to behave like actors, mostly working on single-threaded state and occasionally communicating with other schedulers through messages (the current implementation relies more on shared state, locks, and signals).
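
A hypothetical sketch of that split (none of these names except SchedulerPolicy exist in the patch, and `std::sync::mpsc` is only a stand-in for whatever message mechanism schedulers end up using):

```rust
use std::sync::mpsc::Receiver;

struct Task;

// Primitive context-switch operations only; no policy decisions here.
struct Scheduler;

impl Scheduler {
    /// scheduler -> task: resume a task this scheduler owns.
    fn resume_task(&mut self, _task: Box<Task>) { /* switch contexts */ }

    /// task -> scheduler: give up the CPU and return to the scheduler.
    fn yield_to_scheduler(&mut self) { /* switch contexts */ }

    /// Block the running task, handing its ownership to `f` (which will
    /// typically stash the task in whatever it is blocked on).
    fn block_running_task(&mut self, _f: impl FnOnce(Box<Task>)) { /* ... */ }
}

/// Messages schedulers send one another instead of sharing state.
enum SchedulerMsg {
    Wake,
    RunTask(Box<Task>),
    Shutdown,
}

/// Policy and multithreading live in a separate type that wraps the
/// single-threaded primitives and processes an inbox of messages.
struct SchedulerPolicy {
    sched: Scheduler,
    inbox: Receiver<SchedulerMsg>,
}
```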

rt::io is the runtime's internal, abstract I/O interface. It is used entirely through opaque ~objects and is intended to underpin other, user-facing I/O modules. The I/O interface here should be considered a proof of concept, as a real design will require a lot of consideration.
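
A minimal sketch of what "entirely opaque ~objects" means in practice (trait and method names are invented, and `Box<dyn Trait>` stands in for `~Trait`):

```rust
// The runtime only ever sees these traits; the libuv-backed concrete
// types stay hidden behind owned trait objects.
trait EventLoop {
    fn run(&mut self);
    fn io(&mut self) -> &mut dyn IoFactory;
}

trait IoFactory {
    fn tcp_connect(&mut self, addr: &str) -> Option<Box<dyn Stream>>;
}

trait Stream {
    fn read(&mut self, buf: &mut [u8]) -> Option<usize>;
    fn write(&mut self, buf: &[u8]) -> Option<()>;
}

// A scheduler is handed an owned event loop object at construction time
// and never needs to know it is actually the uv implementation.
fn drive(mut event_loop: Box<dyn EventLoop>) {
    event_loop.run();
}
```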

Measurements

I've only done one very simple measurement, comparing pure TCP read performance between node 0.6 and this scheduler. The results indicate this code is about on par with node. I take that as a good sign, though you might expect us to beat node, given its dynamic-language overhead and the extra abstractions in its libraries compared to this simple Rust code. perf indicates we spend about 50% of the time in the kernel, then the usual suspects: upcall_call_shim_on_c_stack, pthread_getspecific, malloc - things that can be tackled in increments.

Concerns

~objects don't work

I don't actually remember whether I tested them, but AIUI they don't work correctly, so I've inserted some placeholder typedefs to try to make the code look like it's using objects when it is not. Hopefully it won't be too hard to do the conversion. Until ~objects work, uv needs to live in core.
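
For illustration, the placeholder trick looks roughly like this (the names are made up; the point is that the alias can later be swapped for a real owned trait object without touching call sites):

```rust
// The interface ~objects would eventually expose...
trait EventLoop {
    fn run(&mut self);
}

// ...and the one concrete uv-backed implementation we have today.
struct UvEventLoop;
impl EventLoop for UvEventLoop {
    fn run(&mut self) { /* drive libuv */ }
}

// Intended, once owned trait objects work:
//   type EventLoopObject = Box<dyn EventLoop>;
// Placeholder typedef used for now:
type EventLoopObject = UvEventLoop;
```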

Allocation

There are a lot of small allocations here, particularly in the uv bindings, which rely heavily on owned closures, but also in the io interface, which uses objects. Because it is much more pleasant to just use closures everywhere than to figure out exactly how to thread data around without allocating, I think this tradeoff is good in the short term. The particularly bad allocations can be optimized as needed; the ~object allocations are considerably harder to eliminate.
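
A sketch of the pattern driving those allocations (a hypothetical binding, not the real uv API): each callback registration boxes an owned closure, and every boxed closure is a separate heap allocation.

```rust
// Hypothetical uv-style watcher: pleasant to use, allocation-happy.
struct TcpWatcher {
    on_read: Option<Box<dyn FnOnce(&[u8])>>,
}

impl TcpWatcher {
    fn read_start(&mut self, cb: Box<dyn FnOnce(&[u8])>) {
        // Each call heap-allocates the closure environment; threading the
        // data through by hand would avoid the allocation at the cost of
        // much clumsier code.
        self.on_read = Some(cb);
    }
}
```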

What happens when I/O types are sent?

We talked about this in the meeting last week and it's a big problem. I now think that I/O types (those defined in rt::io, which form the basis for any I/O APIs) must be sendable, the reason being that one connection per task is going to be the right way to do networking, at least for the near future. The basic server will listen, accept, then capture the connection in a new task, as sketched below.
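
A sketch of that basic server shape, using std::net and std::thread purely as stand-ins for the runtime's own I/O types and task spawn (none of this is the rt::io API):

```rust
use std::io::Read;
use std::net::{TcpListener, TcpStream};
use std::thread;

fn handle_connection(mut stream: TcpStream) {
    let mut buf = [0u8; 4096];
    // Read until the peer closes the connection.
    while let Ok(n) = stream.read(&mut buf) {
        if n == 0 { break; }
        // ... process buf[..n] ...
    }
}

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:8080")?;
    for stream in listener.incoming() {
        let stream = stream?;
        // The accepted connection is moved into a fresh task; this move is
        // exactly why the I/O types have to be sendable.
        thread::spawn(move || handle_connection(stream));
    }
    Ok(())
}
```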

So, assuming that I/O types are sendable, there is going to be a lot of complexity in making those I/O types ensure that the task they are currently associated with is running on the correct scheduler. Importantly, task pinning is not sufficient, since the I/O types are bound to a specific scheduler, not to a specific task. Nor will it be sufficient to simply fail if an I/O type migrates to the wrong scheduler, because we have no way to prevent that from happening accidentally (that I can see). It's going to be ugly, and could involve polluting our nice single-threaded I/O paths with some concessions to memory synchronization.

Selecting

Related to the above point about one connection per task, we need to be able to make select work with various I/O requests, particularly for reading and listening. Not only that, we have to be able to select on both pipes and I/O types simultaneously (so that a task can listen for incoming connections as well as for a signal to stop listening). I have no idea yet how to do this, nor do I even know how pipes implement this currently.

Adapting work stealing to non-strict parallel computations

The work stealing algorithm described in 'Scheduling Multithreaded Computations by Work Stealing' is for 'strict' (fork/join style) computations, which is not what we have. We could do something ad hoc to add randomness, but I'd rather have something known to work. I do think the work stealing approach makes a lot of sense for us, especially now that we have I/O callbacks that need to be scheduled with high priority (so cold CPU-bound tasks get stolen by schedulers not doing I/O).
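
For concreteness, the scheduling discipline under discussion looks roughly like this (a sketch only: a VecDeque behind a mutex stands in for the lock-free deque of #4877, and victim selection is the "ad hoc randomness" question raised above):

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

type Work = Box<dyn FnOnce() + Send>;

#[derive(Clone)]
struct WorkQueue(Arc<Mutex<VecDeque<Work>>>);

impl WorkQueue {
    fn new() -> WorkQueue {
        WorkQueue(Arc::new(Mutex::new(VecDeque::new())))
    }
    // The owning scheduler pushes and pops at the back (LIFO keeps its
    // own recently spawned work hot in cache)...
    fn push(&self, w: Work) { self.0.lock().unwrap().push_back(w); }
    fn pop(&self) -> Option<Work> { self.0.lock().unwrap().pop_back() }
    // ...while thieves take from the front (the oldest, coldest work).
    fn steal(&self) -> Option<Work> { self.0.lock().unwrap().pop_front() }
}

fn run_scheduler(local: WorkQueue, victims: Vec<WorkQueue>) {
    loop {
        let next = local.pop().or_else(|| victims.iter().find_map(|q| q.steal()));
        match next {
            Some(work) => work(),
            // A real scheduler would park here or poll the event loop for
            // I/O callbacks rather than exiting.
            None => break,
        }
    }
}
```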

Fairness and I/O timeouts

Right now we have a model that does not enforce any sort of time accounting. I kind of like this for performance and simplicity, but it means that if CPU-bound and I/O-bound tasks are mixed on a single scheduler, a CPU-bound task can run without ever yielding to let I/O happen. I suspect that with this merging of I/O and the scheduler we are going to end up wanting some utility functions for setting up groups of schedulers specifically for I/O or specifically for processing.

Next

The immediate goal, pending feedback, will be to get this patch into core. Beyond that there are a number of parallel development paths, primarily I/O design, multithreading and integration.

Begin working on user-facing I/O library (#4248)

This just barely scratches the surface of I/O. Designing an I/O library is itself a huge effort. I think I would like to approach this top down: first figure out how we want a synchronous I/O library to be designed, then figure out how to connect it to uv through rt::io.

When doing this design it's important to consider Rust's unique constraints, in particular the relationship of pipes and the scheduler to I/O. Pipes and I/O readers/writers share a lot in common and need to interoperate in various ways (particularly with select). We also need to consider this in the context of the existing core::io - what is or isn't working there.

I'm no expert on I/O libraries so I think I'd like to hash this out on the mailing lists, perhaps in conjunction with a survey of other languages' I/O.

Multithreading

  • Create a parallel deque (#4877)
  • Port pipes to this scheduler (#5018)
  • Encapsulate scheduling 'policy' in one place (SchedulerPolicy) - so we can experiment easily
  • Teach schedulers to communicate and steal work

Integration

  • Fix ~object and convert code to use it
  • Rewrite net::ip to not use uv (#4956)
  • Move uv to core or its own crate (#5019)
  • Context switching for remaining platforms (#5020)
  • start lang item (#3406) - so we can start running test cases on this scheduler
  • Bindings for win32 TLS - this only has pthread bindings
  • Stack growth and stack switching - this time stack logic goes into rt::stack and not into the task itself. Start by adapting the current scheme to the new scheduler, but consider potential upcoming changes to the FFI.
  • Logging (#5021) - Figure out how to make logging work in global, scheduler, and task context, in the old runtime or in this runtime. Optional complete redesign (#3309).

I want to hold off on adding linked failure because I think the current implementation is still too complex.
