In the first
part of this series we covered the basic implementation details of
scx_rustland
: a fully-functional Linux scheduler written in
Rust that runs in user space.
If having a Linux scheduler that run in user-space wasn’t enough, we can push the concept even further and consider the possibility to evolve this project into a generic framework that allows to implement any scheduling policy in user-space, using Rust.
The primary advantage of such a framework would lie in significantly lowering the bar of scheduling development. Using this framework, developers could just focus solely on crafting the scheduling policy, without delving into complex kernel internal details. This would make scheduling development and testing really accessible to a broader audience.
Moreover, as already mentioned in part 1, operating in user-space provides access to a plethora of tools, libraries, debuggers, etc., that really help to make the development environment much more “comfortable”.
Now, the question arises: how can we realize all of this?
Implementation
For those who have followed the previous post, you are already aware
that scx_rustland
is made of two main components: an eBPF
part, responsible for implementing the low-level interface to
sched-ext/eBPF, using libbpf-rs, and the Rust
code operating in user-space.
Between these two layers there is actually an additional layer: a
Rust module (bpf.rs
) that implements the low-level
communication between the Rust code and the eBPF code.
What if we could further abstract this module and relocate both the
eBPF code and bpf.rs
to a separate standalone crate?
By doing so, we could simply import this crate into our project and use the generic scheduling API to implement a fully functional Linux scheduler.
This is precisely what I’ve recently been focused on: developing a
new Rust crate named scx_rustland_core
, which is integrated
into the scx tools. Its purpose is to accomplish exactly this
abstraction.
A first version of this crate has been merged already in the scx repository.
API
The main challenge of this project is to figure out the best API to achieve both implicitly and efficiency, and this is probably going to be a long process (so, the API described below is subject to probable alterations in the near future).
The scx_rustland_core
crate provides a
BpfScheduler
struct that represents the “connector” to the
eBPF code.
BpfScheduler
provides the following public methods:
pub fn dequeue_task(&mut self) -> Result<Option<QueuedTask>, libbpf_rs::Error>
pub fn dispatch_task(&mut self, task: &DispatchedTask) -> Result<(), libbpf_rs::Error>
The former can be used to receive a task queued to the scheduler, the latter can be used to send a task to the dispatcher.
Between the functions dequeue_task()
and
dispatch_task()
, the scheduler can decide to store tasks
within internal data structures, determine their order of execution, on
which CPU run them, and for how long.
Enqueued tasks and dispatched tasks are represented as following:
pub struct QueuedTask {
pub pid: i32, // pid that uniquely identifies a task
pub cpu: i32, // CPU where the task is running (-1 = exiting)
pub cpumask_cnt: u64, // cpumask generation counter
pub sum_exec_runtime: u64, // Total cpu time
pub nvcsw: u64, // Voluntary context switches
pub weight: u64, // Task static priority
}
pub struct DispatchedTask {
pub pid: i32, // pid that uniquely identifies a task
pub cpu: i32, // target CPU selected by the scheduler
pub cpumask_cnt: u64, // cpumask generation counter
pub payload: u64, // task payload (used for debugging)
}
To assign a specific CPU to task the scheduler can change the
attribute cpu
within the DispatchedTask
struct. If the special value NO_CPU
is specified, the
dispatcher will execute the task on the first CPU available.
Moreover, to decide the amount of time that each task can run on the assigned CPU, a global time slice is used: there is a default global time slice and a global effective time slice, that can be adjusted dynamically by the scheduler using the following methods:
pub fn set_effective_slice_us(&mut self, slice_us: u64)
pub fn get_effective_slice_us(&mut self) -> u64
TODO: as a future improvement I’m planning to add also a local time
slice to the DispatchedTask
struct. This will give the
possibility to set a different time slice to each task, and override the
global effective time slice on a per-task basis.
Last, but not least, an additional method is provided to notify the eBPF component if the user-space scheduler has still some pending work to complete:
pub fn update_tasks(&mut self, nr_queued: Option<u64>, nr_scheduled: Option<u64>) {
nr_queued
is a counter that represents the amount of
queued tasks that still need to be processed by the user-space
scheduler, nr_scheduled
represents the amount of tasks that
have been currently queued into the scheduler and still need to be
dipsatched.
For example, it is possible to notify the eBPF dipatcher that the scheduler doesn’t have any pending work using this method as following:
.update_tasks(Some(0), Some(0));
scx_rustland refactoring
scx_rustland
has been rewritten on top of
scx_rustland_core
and the scheduler code is a
lot more compact:
$ git diff --stat origin/scx-user~9..origin/scx-user scheds/rust/scx_rustland/
...
9 files changed, 40 insertions(+), 1592 deletions(-)
This is purely a code refactoring, performance-wise
scx_rustland
can still achieve the same performance as
before
[ I can still play AAA games, such as Baldur’s Gate 3, CS2, etc., while recompiling the kernel in the background and achieve a higher fps than the default Linux scheduler. ]
And if the scheduler isn’t ideal for a particular workload we can simply switch to a different scx scheduler or move back to the default Linux scheduler, at run-time and with zero downtime.
Example
As a practical example, to demonstrate how to use
scx_rustland_core
, I’ve added a new Rust scheduler to the
scx schedulers, called scx_rlfifo
.
The scheduler is a plain FIFO scheduler (which may not be very thrilling), but its simplicity facilitates its use as a template for implementing more complex scheduling policies.
The entire code is compact enough that can fit in this blog post:
// Copyright (c) Andrea Righi <andrea.righi@canonical.com>
// This software may be used and distributed according to the terms of the
// GNU General Public License version 2.
mod bpf_skel;
pub use bpf_skel::*;
pub mod bpf_intf;
mod bpf;
use bpf::*;
use scx_utils::Topology;
use std::sync::atomic::AtomicBool;
use std::sync::atomic::Ordering;
use std::sync::Arc;
use std::time::{Duration, SystemTime};
use anyhow::Result;
struct Scheduler<'a> {
bpf: BpfScheduler<'a>,
}
impl<'a> Scheduler<'a> {
fn init() -> Result<Self> {
let topo = Topology::new().expect("Failed to build host topology");
let bpf = BpfScheduler::init(5000, topo.nr_cpus() as i32, false, false, false)?;
Ok(Self { bpf })
}
fn now() -> u64 {
SystemTime::now()
.duration_since(SystemTime::UNIX_EPOCH)
.unwrap()
.as_secs()
}
fn dispatch_tasks(&mut self) {
loop {
// Get queued taks and dispatch them in order (FIFO).
match self.bpf.dequeue_task() {
Ok(Some(task)) => {
// task.cpu < 0 is used to to notify an exiting task, in this
// case we can simply ignore the task.
if task.cpu >= 0 {
let _ = self.bpf.dispatch_task(&DispatchedTask {
pid: task.pid,
cpu: task.cpu,
cpumask_cnt: task.cpumask_cnt,
payload: 0,
});
}
// Give the task a chance to run and prevent overflowing the dispatch queue.
std::thread::yield_now();
}
Ok(None) => {
// Notify the BPF component that all tasks have been scheduled and dispatched.
self.bpf.update_tasks(Some(0), Some(0));
// All queued tasks have been dipatched, add a short sleep to reduce
// scheduler's CPU consuption.
std::thread::sleep(Duration::from_millis(1));
break;
}
Err(_) => {
break;
}
}
}
}
fn print_stats(&mut self) {
let nr_user_dispatches = *self.bpf.nr_user_dispatches_mut();
let nr_kernel_dispatches = *self.bpf.nr_kernel_dispatches_mut();
let nr_cancel_dispatches = *self.bpf.nr_cancel_dispatches_mut();
let nr_bounce_dispatches = *self.bpf.nr_bounce_dispatches_mut();
let nr_failed_dispatches = *self.bpf.nr_failed_dispatches_mut();
let nr_sched_congested = *self.bpf.nr_sched_congested_mut();
println!(
"user={} kernel={} cancel={} bounce={} fail={} cong={}",
nr_user_dispatches, nr_kernel_dispatches,
nr_cancel_dispatches, nr_bounce_dispatches,
nr_failed_dispatches, nr_sched_congested,
);
}
fn run(&mut self, shutdown: Arc<AtomicBool>) -> Result<()> {
let mut prev_ts = Self::now();
while !shutdown.load(Ordering::Relaxed) && !self.bpf.exited() {
self.dispatch_tasks();
let curr_ts = Self::now();
if curr_ts > prev_ts {
self.print_stats();
prev_ts = curr_ts;
}
}
self.bpf.shutdown_and_report()
}
}
fn main() -> Result<()> {
let mut sched = Scheduler::init()?;
let shutdown = Arc::new(AtomicBool::new(false));
let shutdown_clone = shutdown.clone();
ctrlc::set_handler(move || {
shutdown_clone.store(true, Ordering::Relaxed);
})?;
sched.run(shutdown)
}
Conclusion
Evolving scx_rustland
into a generic scheduling
framework in Rust can help to make scheduling development accessible to
a broader audience.
In particular, it has the potential to promote a stronger connection between academia and real-world kernel development, by helping researchers to experiment with and test new scheduling theories within a real kernel environment, but in a safe way.
Rust, eBPF, and all the debugging tools available in user-space can significantly mitigate the risks of bugs, compared with the traditional kernel development process. And even in presence of bugs (such as deadlock, or starvation), their impact is significantly less critical: the worst-case scenario may involve a brief freeze lasting for 5 seconds, subsequently, the sched-ext watchdog will intervene, restoring the default Linux scheduler.
In conclusion, sched-ext really seems to have the potential to bring a lot of new concepts and different approaches to Linux scheduling, making projects like this a tangible reality.
Hopefully, in a future not too far, we will see this feature
available in the upstream kernel, making technologies like
scx_rustland_core
really available to everyone.