Thursday, May 1, 2025

Revamping my Linux kernel scheduler in Rust

Recap

Some time ago, I built a kernel scheduler in Rust, called scx_rustland. It runs entirely in user space, powered by sched_ext, a technology in the Linux kernel that lets you write CPU schedulers as BPF programs.

The original goal of my project was to create a proof of concept: a working example to demonstrate the potential of user-space scheduling. The idea was to use sched_ext and BPF to capture scheduling events in the kernel and pass them to a Rust program in user space, which makes the scheduling decisions and dispatches tasks back to the kernel.
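
To make the flow concrete, here is a minimal Rust sketch of that roundtrip (the types, the BpfChannel trait and schedule_once() are purely illustrative, not the real scx_rustland API): BPF queues tasks for user space, the Rust side applies its policy, and the decisions are dispatched back to the kernel.

    use std::cmp::Reverse;

    // Illustrative types only; the real scheduler exchanges richer data with BPF.
    struct QueuedTask { pid: i32, cpu: i32, vruntime: u64 }
    struct DispatchedTask { pid: i32, cpu: i32, slice_ns: u64 }

    // Stand-in for the BPF <-> user-space communication channel.
    trait BpfChannel {
        fn dequeue_task(&mut self) -> Option<QueuedTask>;
        fn dispatch_task(&mut self, task: DispatchedTask);
    }

    fn schedule_once(bpf: &mut dyn BpfChannel, runqueue: &mut Vec<QueuedTask>) {
        // 1) Drain the tasks queued by the BPF side since the last pass.
        while let Some(task) = bpf.dequeue_task() {
            runqueue.push(task);
        }
        // 2) Apply the scheduling policy (here: lowest vruntime runs first).
        runqueue.sort_by_key(|t| Reverse(t.vruntime));
        // 3) Send the decision back to the kernel, which performs the context switch.
        if let Some(next) = runqueue.pop() {
            bpf.dispatch_task(DispatchedTask {
                pid: next.pid,
                cpu: next.cpu,
                slice_ns: 5_000_000, // 5ms time slice, arbitrary for this example
            });
        }
    }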

Surprisingly, even that early prototype outperformed the default Linux scheduler, but only in a very specific scenario: testing the responsiveness of a videogame under heavy system load. (You can see this in action in this video, where I play Terraria at 60fps while compiling the kernel in the background.)

Since then, the project has evolved significantly. One key improvement was switching to BPF ring buffers for the communication between BPF and user space. This enabled lockless, syscall-free message passing through memory shared between kernel and user space, eliminating bottlenecks and making the scheduler viable beyond demos and niche use cases.
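
As an illustration of the user-space side of such a channel (a hypothetical sketch, not rustland’s actual plumbing), this is roughly how a BPF ring buffer can be consumed with the libbpf-rs crate; the map reference and handle_queued_task() are placeholders:

    use libbpf_rs::RingBufferBuilder;

    // Placeholder: decode the fixed-size event pushed by BPF and feed it to
    // the scheduler's runqueue.
    fn handle_queued_task(_data: &[u8]) {}

    fn consume_events(events_map: &libbpf_rs::Map) -> Result<(), libbpf_rs::Error> {
        let mut builder = RingBufferBuilder::new();
        // The callback runs for every record committed by the BPF side into
        // the ring buffer, which lives in memory shared with the kernel.
        builder.add(events_map, |data: &[u8]| {
            handle_queued_task(data);
            0 // a non-zero return value stops consumption
        })?;
        let ring = builder.build()?;

        loop {
            // Drain every record already available: no copies and no syscalls
            // are needed for data that is already in the shared ring (poll()
            // could be used instead to block waiting for new events).
            ring.consume()?;
            // ... run a scheduling pass on the tasks collected above ...
        }
    }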

Eventually, part of this work grew into a framework: scx_rustland_core, a Rust crate for building fully functional Linux schedulers, designed to abstract away all the boilerplate and offer an easy scheduling development playground.

The original Rust scheduler (scx_rustland) was then completely rewritten on top of this new framework.

Tackling the overhead of user-space scheduling

One of the weaknesses of user-space scheduling is the overhead of the BPF-to-user-space roundtrip. For workloads focused on responsiveness this usually isn’t a big issue, but for more throughput-oriented workloads the “bubbles” in the scheduling pipeline, caused by synchronizing BPF and the user-space scheduler, can be significant.

To address this, the scx_rustland_core scheduling pipeline has been re-architected and improved.

Now, when a task releases a CPU, we check whether other tasks want to run and we always “append” the user-space scheduler itself at the end of the queue. This way tasks can pile up and the scheduler can process them in “bursts”, making its prioritization logic more effective. If there’s no pending work to process, the CPU can go idle, but right before entering the actual idle state we opportunistically check for pending tasks again and immediately wake up the user-space scheduler if needed.
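
In pseudo-Rust, the new dispatch path behaves roughly like this (a toy model with made-up names; the real logic lives in the BPF part of scx_rustland_core):

    use std::collections::VecDeque;

    const SCHEDULER_TASK: u32 = 0; // stands in for the user-space scheduler itself

    // Toy model of the re-architected dispatch path, mirroring its decisions
    // with plain task ids instead of real kernel tasks.
    struct Pipeline {
        queued: VecDeque<u32>,     // tasks waiting for a user-space decision
        dispatched: VecDeque<u32>, // tasks already dispatched by user space
    }

    impl Pipeline {
        // Called when a CPU becomes free.
        fn pick_next(&mut self) -> Option<u32> {
            // Run tasks that user space has already dispatched first.
            if let Some(task) = self.dispatched.pop_front() {
                return Some(task);
            }
            // If tasks are piling up waiting for a decision, run the user-space
            // scheduler *after* them, so it wakes up once and prioritizes the
            // whole burst in a single pass.
            if !self.queued.is_empty() {
                return Some(SCHEDULER_TASK);
            }
            None // nothing to do: the CPU may go idle
        }

        // Called right before the CPU actually enters the idle state.
        fn check_before_idle(&self) -> Option<u32> {
            // Opportunistic re-check: if work showed up in the meantime, wake
            // the user-space scheduler instead of letting the CPU idle.
            if !self.queued.is_empty() || !self.dispatched.is_empty() {
                return Some(SCHEDULER_TASK);
            }
            None
        }
    }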

This tighter scheduling loop reduced latency and maximized CPU utilization, as we can see in the following trace, collected during a parallel kernel build and visualised using Perfetto:

In the trace, you can see how the older rustland version introduced ~60us gaps between tasks across all the CPUs. The new version virtually eliminates those bubbles (there are only rare ~5us gaps just before the user-space scheduler itself runs).

Another improvement was handling task wakeups directly in BPF: now, when a task wakes up and idle CPUs are available, the BPF code picks a nearby CPU (preferably the one the task previously ran on) and dispatches the task there immediately, bypassing the user-space roundtrip entirely. This fast path improves both latency and throughput.
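
Conceptually, the CPU selection on wakeup looks like this (an illustrative sketch in Rust; the actual fast path is implemented in BPF):

    // Pick an idle CPU close to where the task ran last; names and types here
    // are purely illustrative.
    fn pick_idle_cpu(prev_cpu: usize, idle_mask: &[bool]) -> Option<usize> {
        // Best case: the CPU the task ran on last is idle, so its caches are
        // likely still warm and the task can be dispatched there right away.
        if idle_mask.get(prev_cpu) == Some(&true) {
            return Some(prev_cpu);
        }
        // Otherwise fall back to any idle CPU (a real implementation would
        // prefer CPUs sharing the same LLC or NUMA node first).
        idle_mask.iter().position(|&idle| idle)
    }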

The runqueue design of scx_rustland_core has also been improved a bit, even though scx_rustland still uses a global runqueue, which works well for small-to-medium systems and allows perfect load balancing. On large systems with many cores, however, the global queue can become a bottleneck due to contention. I’m planning to address this in the future by introducing per-NUMA or per-LLC runqueues and providing a proper API, so that other schedulers based on scx_rustland_core can use such topology-aware runqueues.
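
As a rough idea of where this is heading, a per-LLC runqueue could look something like the following sketch (hypothetical, not part of the current scx_rustland_core API):

    use std::collections::BTreeMap;

    // Hypothetical per-LLC runqueues; scx_rustland today still uses a single
    // global runqueue.
    struct TopologyRunQueues {
        per_llc: Vec<BTreeMap<u64, u32>>, // per-domain queues: deadline -> task id
        cpu_to_llc: Vec<usize>,           // CPU id -> LLC domain id
    }

    impl TopologyRunQueues {
        fn enqueue(&mut self, cpu: usize, deadline: u64, task: u32) {
            // Queue the task in the domain of the CPU it last ran on
            // (deadline ties are ignored for simplicity).
            let llc = self.cpu_to_llc[cpu];
            self.per_llc[llc].insert(deadline, task);
        }

        fn dequeue(&mut self, cpu: usize) -> Option<u32> {
            let llc = self.cpu_to_llc[cpu];
            // Serve the earliest deadline from the local LLC domain first...
            if let Some((_, task)) = self.per_llc[llc].pop_first() {
                return Some(task);
            }
            // ...and only steal from other domains when the local one is empty.
            self.per_llc
                .iter_mut()
                .find_map(|rq| rq.pop_first().map(|(_, task)| task))
        }
    }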

Benchmarking the Improvements

To evaluate the new improvements, I ran a subset of benchmarks from the Phoronix Test Suite, ranging from throughput-oriented workloads, such as kernel builds and LLM inference, to more latency-oriented workloads, such as schbench, nginx and PostgreSQL.

Test system

Old vs new scx_rustland

Notable results: nginx: +77% in requests/sec, PostgreSQL: +26% in transactions/sec, and schbench 99.9th-percentile latency improving from 9ms to 3.4ms. These significant gains highlight the effect of a more efficient scheduling pipeline and better BPF/user-space synchronization.

scx_rustland vs EEVDF

Some notable results: schbench tail latency improved by a massive +75.5%; interestingly, scx_rustland outperformed EEVDF in nginx throughput by a notable +17.5% (likely thanks to its strong prioritization of short-lived, latency-sensitive tasks); PostgreSQL and FFmpeg latency metrics also saw small but consistent gains (this was expected due to the particular scheduling policy implemented by scx_rustland).

Conclusion

With the new design and optimizations in scx_rustland_core, user-space scheduling feels closer to being a viable option for many real-world use cases. In multiple benchmarks, scx_rustland even outperformed the in-kernel EEVDF scheduler. That’s quite impressive and definitely gives off some microkernel vibes.

But let’s be clear: moving the kernel scheduler to a user-space Rust program doesn’t magically make everything better. There’s always some overhead (even if it’s minimal now), and most of the performance gains come from the specific scheduling policy implemented by the scheduler.

Where user-space scheduling really shines is in specialization and integration. With sched_ext, we now have the flexibility to build task schedulers tailored to specific workloads and load/unload them at runtime. And with scx_rustland_core, we can also easily integrate other user-space components, libraries, and services directly into our scheduling decisions.

For example, an exciting experiment for future work could be integrating AI into the scheduler: letting a model predict optimal time slices, deadlines, or CPU affinity based on observed task behavior. That could unlock optimizations that traditional schedulers simply can’t achieve.

I think this is an interesting area to explore in the future.