Overview
The CPU scheduler can play a significant role in saving energy. We typically talk about Energy Aware Scheduling (EAS) when scheduling decisions can affect the energy consumed by the CPUs. EAS relies on an Energy Model (EM) of the CPUs to select the most energy-efficient CPU for each task, with a minimal impact on throughput.
Effective energy-saving techniques can also be applied by maximizing the idle time of the CPUs. Modern CPUs’ power consumption is heavily influenced by how long they remain idle, sometimes yielding more significant energy savings than Dynamic Voltage and Frequency Scaling (DVFS) techniques (e.g., the cpufreq governor).
Forcing CPUs to stay idle
The scheduler can force specific CPUs to stay idle by not assigning tasks to them. However, it is essential to find a good balance between energy savings and performance. For example, a drastic solution could be scheduling all tasks on a single CPU while keeping the others idle. This can save energy in emergency situations, such as when a battery-powered device is nearly out of power, but it can severely degrade overall system performance.
Energy saving with scx_rustland
scx_rustland uses a very simple, yet effective, strategy to save energy: when a CPU enters an idle state, it attempts to keep it idle if other CPUs are active, even if there are tasks queued for scheduling.
Since scx_rustland uses a vruntime-based policy, latency-sensitive tasks are likely to be placed at the top of the queue. Thus, active CPUs can quickly dispatch these tasks, maintaining system responsiveness. CPU-intensive tasks, on the other hand, will spend more time in the scheduler queue, waiting for an active CPU.
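To make the vruntime ordering concrete, here is a minimal sketch in C. It is not scx_rustland's actual code: the struct task layout and the pick_next() helper are hypothetical names used only for illustration. The idea is simply that the task with the smallest vruntime is served first, so a task that mostly sleeps (and therefore accumulates little vruntime) tends to get ahead of a CPU-bound one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical task descriptor, for illustration only. */
struct task {
	int pid;
	uint64_t vruntime;	/* CPU time accumulated so far */
};

/*
 * Return the index of the task with the smallest vruntime,
 * or -1 if the queue is empty.
 */
static int pick_next(const struct task *queue, size_t len)
{
	int best = -1;

	for (size_t i = 0; i < len; i++) {
		if (best < 0 || queue[i].vruntime < queue[best].vruntime)
			best = (int)i;
	}
	return best;
}

/* Small usage example with made-up values. */
static void pick_next_example(void)
{
	struct task queue[] = {
		{ .pid = 100, .vruntime = 900 },	/* CPU-bound build job */
		{ .pid = 101, .vruntime = 20 },		/* mostly sleeping, interactive */
		{ .pid = 102, .vruntime = 450 },
	};

	/* The interactive task (index 1) is picked first. */
	assert(pick_next(queue, 3) == 1);
	/* An empty queue yields no candidate. */
	assert(pick_next(queue, 0) == -1);
}
```

A linear scan keeps the sketch short; a real scheduler would more likely keep the tasks in an ordered structure so that the next task can be found without walking the whole queue.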
This approach intentionally introduces bubbles in the schedule, so the overall throughput of the system is strongly reduced (CPUs are explicitly under-utilized). However, this “CPU throttling” mostly affects background CPU-intensive tasks, so it helps save power without compromising system responsiveness.
For this reason, this strategy is disabled by default; it can be enabled by starting the scheduler with the --low-power option.
Implementation
The implementation of this strategy is also very simple, technically just a one-liner.
All the logic is implemented in the rustland_update_idle() callback, which is executed when a CPU changes its idle state:
/*
 * A CPU is about to change its idle state.
 */
void BPF_STRUCT_OPS(rustland_update_idle, s32 cpu, bool idle)
{
	/*
	 * Don't do anything if we exit from an idle state, a CPU owner will
	 * be assigned in .running().
	 */
	if (!idle)
		return;
	/*
	 * A CPU is now available, notify the user-space scheduler that tasks
	 * can be dispatched.
	 */
	if (usersched_has_pending_tasks()) {
		set_usersched_needed();
		/*
		 * Wake up the idle CPU, so that it can immediately accept
		 * dispatched tasks.
		 */
		if (!low_power || !nr_running)
			scx_bpf_kick_cpu(cpu, 0);
	}
}
In low-power mode the key part is:
	...
	if (usersched_has_pending_tasks()) {
		...
		/*
		 * Wake up the idle CPU, so that it can immediately accept
		 * dispatched tasks.
		 */
		if (!low_power || !nr_running)
			scx_bpf_kick_cpu(cpu, 0);
	}
	...
The variable nr_running keeps track of the active CPUs, and scx_bpf_kick_cpu() is used, in this context, to immediately wake up a CPU when it enters an idle state.
In general, immediately waking up the CPU at this point is entirely reasonable if there are still tasks waiting to be scheduled (see the usersched_has_pending_tasks() check a few lines above).
However, in a “low power” scenario we can avoid immediately waking up the CPU if other CPUs are active (nr_running != 0), in order to maximize the effectiveness of the idle state and save power, at the cost of throttling CPU-intensive tasks even more.
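The kick condition can be isolated into a tiny standalone C function to see how it behaves in each case. This is only a sketch for reasoning about the logic: should_kick_idle_cpu() is a hypothetical helper that does not exist in scx_rustland, and the uint64_t type for nr_running is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of scx_rustland: decide whether a CPU
 * that just became idle should be kicked immediately.
 *
 * low_power:  true when the scheduler was started with --low-power
 * nr_running: number of CPUs currently active (type is an assumption)
 */
static bool should_kick_idle_cpu(bool low_power, uint64_t nr_running)
{
	/* Same condition used in rustland_update_idle(). */
	return !low_power || !nr_running;
}

/* Walk through the four cases of the condition. */
static void kick_policy_example(void)
{
	/* Normal mode: always kick, regardless of other CPUs. */
	assert(should_kick_idle_cpu(false, 0));
	assert(should_kick_idle_cpu(false, 3));

	/* Low-power mode, no active CPU: kick, so pending tasks can run. */
	assert(should_kick_idle_cpu(true, 0));

	/* Low-power mode, other CPUs active: leave this CPU idle. */
	assert(!should_kick_idle_cpu(true, 3));
}
```

Only the last case differs from normal mode, which is why the change amounts to a single extra condition in the callback.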
Result
The benefits of the low-power mode can be illustrated with the following test case:
- play a video game (Terraria) while recompiling the kernel
- measure game performance (fps) and core power consumption (W)
- compare the result of normal mode vs low-power mode
Results:
               | Game performance | Power consumption |
---------------+------------------+-------------------+
normal mode    | 60 fps           | 6 W               |
low-power mode | 60 fps           | 3 W               |
As these results show, game performance was essentially unaffected by low-power mode, while core power consumption was cut in half.
Real-world tests showed an increase of around 20-30% in laptop battery life using scx_rustland in low-power mode with typical workloads like reading emails, web browsing, listening to music, and compiling code.
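Note that a lower power draw does not automatically mean less energy per unit of work, because energy is power multiplied by time. The game runs at the same 60 fps in both modes, so there the work done is the same, but the duration of the kernel build is not reported in this test. The sketch below shows how the two ratios combine; the energy_ratio() helper and the 1.5x and 2x slowdown figures are hypothetical, chosen only to illustrate the arithmetic.

```c
#include <assert.h>

/*
 * Energy needed to finish a fixed job, relative to normal mode.
 * Energy is power multiplied by time, so the ratios multiply as well.
 *
 * power_ratio: average power in low-power mode / average power in normal mode
 * time_ratio:  job duration in low-power mode / job duration in normal mode
 */
static double energy_ratio(double power_ratio, double time_ratio)
{
	return power_ratio * time_ratio;
}

static void energy_ratio_example(void)
{
	/* 3 W vs 6 W from the table, with a hypothetical 1.5x longer job. */
	assert(energy_ratio(3.0 / 6.0, 1.5) == 0.75);

	/* Break-even: at half the power, the job may take twice as long. */
	assert(energy_ratio(0.5, 2.0) == 1.0);

	/* 30% less power but 50% more time costs more energy overall. */
	assert(energy_ratio(0.7, 1.5) > 1.0);
}
```

A ratio below 1.0 means the job costs less energy in low-power mode; above 1.0 the mode only stretches the same battery over a longer period while completing less work.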
Conclusion
This experiment highlights the ease and effectiveness of using sched_ext and scx_rustland for kernel scheduling development.
The ability to quickly edit-compile-run kernel scheduling changes represents a significant improvement over traditional methods that require kernel recompilation and rebooting, with potential for catastrophic results in case of bugs.
The simplicity of the code change also demonstrates how easy it can be to implement and test theories aimed at improving performance, responsiveness, or energy savings. Moreover, operating in user-space (remember that scx_rustland performs 100% of the scheduling decisions in user-space) can be particularly advantageous for debugging and profiling.
Future development
While the described technique for energy saving is simple and effective, there is room for improvement.
For instance, incorporating topology awareness could make the low-power mode less aggressive in keeping CPUs idle. Avoiding idle states for CPUs in the same core as an active CPU, for example, could minimize unnecessary throttling of CPU-intensive tasks while maintaining, potentially, the same level of energy savings.
I noticed the claimed 30% power savings, but no information on task time required. For example, if you have a 30% power savings but a 50% increase in time taken to perform a task, you've increased the energy requirement of that task by 5% and can now do less on the same amount of battery—it just takes longer to do less. What does this look like in terms of energy consumption per unit work completed?
The 20-30% is just a rough estimate of how long it takes for my battery to go from 80% to 20%. I tried to measure that over the week, trying to use my laptop regularly. The result was quite variable; obviously it depends on the particular workload that I'm running, and my kernel builds were consistently slower, but I also got more time to use the laptop for other activities (i.e., reading emails, navigating the web), which I tried to keep track of and measure, and I got the +20-30% extra time with scx_rustland running in low-power mode. Overall, I may have ended up using the same energy considering the extended time to complete the builds (I'll do better measurements, tracking the total build time as well), but even so, having a temporary power cap that gives you some extra time to reach a power supply and still be able to use your laptop pretty much at full speed can be useful.
Hi Andrea, how do you get the information in the first figure?
I'm using this command: `sudo turbostat --header_iterations 10 -S -s PkgWatt,CorWatt,GFXWatt,RAMWatt,CoreTmp,PkgTmp,AvgMHz,IRQ`
Mmm... so, let's say the system is completely idle. At some point, one task is given to a CPU and is scheduled there (I don't really know what the purpose of rustland_update_idle() is and when it's called, so I can't be sure of how/when this happens); from now on, nr_running will be > 0, and hence no other CPU will be woken up and given tasks to schedule?
That seems a bit too much... what am I missing? :-) Maybe there are other ways for CPUs to be woken and get tasks to run?
@Dario yeah, that's a good question, thanks for asking. There are some pieces that I didn't mention in the post. If there are idle CPUs in the system, tasks will be dispatched directly to them, ignoring the low-power logic. As soon as the system becomes overcommitted (= more tasks than CPUs), the low-power logic kicks in. So, let's say we start a parallel kernel build while using the system normally (with other tasks running). The CPUs will all be filled with tasks, we trigger the low-power logic, CPUs will be slowly drained down to only 1 active CPU, then they will be filled again with tasks, and so on. This is somewhat similar to token-bucket throttling, where the number of CPUs represents the budget of tokens. This is obviously bad in terms of performance, because we are not using the CPUs at their full capacity, but it forces them to spend more cycles in idle state, reducing power consumption. At the same time, the scheduler massively prioritizes latency-sensitive tasks, which usually end up at the top of the scheduling queue, so there will always be at least 1 CPU available to dispatch them. The consequence is that we end up throttling mostly the CPU-intensive tasks, so the system is still quite responsive.