arighi's blog - Andrea Righi, Linux Kernel Engineer at Canonical

<h1>Writing a scheduler for Linux in Rust that runs in user-space (part 2)</h1>
<p>In the <a
href="https://arighi.blogspot.com/2024/02/writing-scheduler-for-linux-in-rust.html">first
part</a> of this series we covered the basic implementation details of
<code>scx_rustland</code>: a fully-functional Linux scheduler written in
Rust that runs in user space.</p>
<p>If having a Linux scheduler that runs in user-space wasn’t enough, we
can push the concept even further and consider the possibility of
evolving this project into a <strong>generic framework</strong> for
implementing any scheduling policy in user-space, using Rust.</p>
<p>The primary advantage of such a framework would lie in significantly
lowering the barrier to scheduler development. Using this framework,
developers could focus solely on crafting the scheduling policy,
without delving into complex kernel internals. This would make
scheduler development and testing accessible to a much broader
audience.</p>
<p>Moreover, as already mentioned in part 1, operating in user-space
provides access to a plethora of tools, libraries, debuggers, etc., that
really help to make the development environment much more
“comfortable”.</p>
<p>Now, the question arises: how can we realize all of this?</p>
<h2 id="implementation">Implementation</h2>
<p>For those who have followed the previous post, you are already aware
that <code>scx_rustland</code> is made of two main components: an eBPF
part, responsible for implementing the low-level interface to
sched-ext/eBPF, using <a
href="https://github.com/libbpf/libbpf-rs">libbpf-rs</a>, and the Rust
code operating in user-space.</p>
<p>Between these two layers there is actually an additional layer: a
Rust module (<code>bpf.rs</code>) that implements the low-level
communication between the Rust code and the eBPF code.</p>
<p>What if we could further abstract this module and relocate both the
eBPF code and <code>bpf.rs</code> to a separate standalone crate?</p>
<p>By doing so, we could simply import this crate into our project and
use the generic scheduling API to implement a fully functional Linux
scheduler.</p>
<p>This is precisely what I’ve recently been focused on: developing a
new Rust crate named <code>scx_rustland_core</code>, which is integrated
into the scx tools. Its purpose is to accomplish exactly this
abstraction.</p>
<p>A first version of this crate has been merged already in the <a
href="https://github.com/sched-ext/scx/pull/161">scx repository</a>.</p>
<h2 id="api">API</h2>
<p>The main challenge of this project is to figure out the best API to
achieve both simplicity and efficiency, and this is probably going to be
a long process (so, the API described below is likely to change in the
near future).</p>
<p>The <code>scx_rustland_core</code> crate provides a
<code>BpfScheduler</code> struct that represents the “connector” to the
eBPF code.</p>
<p><code>BpfScheduler</code> provides the following public methods:</p>
<pre><code> pub fn dequeue_task(&mut self) -> Result<Option<QueuedTask>, libbpf_rs::Error>
pub fn dispatch_task(&mut self, task: &DispatchedTask) -> Result<(), libbpf_rs::Error></code></pre>
<p>The former can be used to receive a task queued to the scheduler, the
latter can be used to send a task to the dispatcher.</p>
<p>Between the calls to <code>dequeue_task()</code> and
<code>dispatch_task()</code>, the scheduler can decide to store tasks
within internal data structures, determining their order of execution,
which CPU to run them on, and for how long.</p>
<p>Enqueued tasks and dispatched tasks are represented as follows:</p>
<pre><code>pub struct QueuedTask {
    pub pid: i32,              // pid that uniquely identifies a task
    pub cpu: i32,              // CPU where the task is running (-1 = exiting)
    pub cpumask_cnt: u64,      // cpumask generation counter
    pub sum_exec_runtime: u64, // Total cpu time
    pub nvcsw: u64,            // Voluntary context switches
    pub weight: u64,           // Task static priority
}

pub struct DispatchedTask {
    pub pid: i32,         // pid that uniquely identifies a task
    pub cpu: i32,         // target CPU selected by the scheduler
    pub cpumask_cnt: u64, // cpumask generation counter
    pub payload: u64,     // task payload (used for debugging)
}</code></pre>
<p>To assign a specific CPU to a task, the scheduler can change the
attribute <code>cpu</code> within the <code>DispatchedTask</code>
struct. If the special value <code>NO_CPU</code> is specified, the
dispatcher will execute the task on the first CPU available.</p>
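<p>To make the flow concrete, here is a minimal, BPF-free sketch of what happens between dequeuing and dispatching. The structs mirror the ones shown above, and the policy is a plain FIFO that keeps each task on the CPU it was last seen on; in a real scheduler the input would come from <code>dequeue_task()</code> and the output would be fed to <code>dispatch_task()</code>:</p>

```rust
// Stand-ins mirroring the scx_rustland_core structs shown above.
pub struct QueuedTask {
    pub pid: i32,
    pub cpu: i32,
    pub cpumask_cnt: u64,
    pub sum_exec_runtime: u64,
    pub nvcsw: u64,
    pub weight: u64,
}

pub struct DispatchedTask {
    pub pid: i32,
    pub cpu: i32,
    pub cpumask_cnt: u64,
    pub payload: u64,
}

// A trivial policy: dispatch tasks in arrival order (FIFO), keeping each
// task on the CPU it was last seen on and dropping exiting tasks
// (cpu == -1).
fn schedule(queued: Vec<QueuedTask>) -> Vec<DispatchedTask> {
    queued
        .into_iter()
        .filter(|t| t.cpu >= 0)
        .map(|t| DispatchedTask {
            pid: t.pid,
            cpu: t.cpu, // or NO_CPU, to let the dispatcher pick any idle CPU
            cpumask_cnt: t.cpumask_cnt,
            payload: 0,
        })
        .collect()
}
```

<p>A smarter policy would simply reorder the vector (or replace the <code>cpu</code> hint with <code>NO_CPU</code>) before returning it.</p>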
<p>Moreover, to decide the amount of time that each task can run on the
assigned CPU, a global time slice is used: there is a default global
time slice and a global effective time slice, that can be adjusted
dynamically by the scheduler using the following methods:</p>
<pre><code> pub fn set_effective_slice_us(&mut self, slice_us: u64)
pub fn get_effective_slice_us(&mut self) -> u64</code></pre>
<p>TODO: as a future improvement I’m planning to also add a per-task
time slice to the <code>DispatchedTask</code> struct. This will make it
possible to assign a different time slice to each task, overriding the
global effective time slice on a per-task basis.</p>
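<p>As a sketch of how these methods could be used (an illustrative policy, not what <code>scx_rustland</code> actually implements, and with made-up constants), the effective slice could be derived from the number of tasks waiting to run and then applied with <code>set_effective_slice_us()</code> on every scheduling cycle:</p>

```rust
// Hypothetical helper: scale a base time slice by the number of tasks
// waiting to run, clamped to a minimum to avoid excessive preemption.
// Both constants are made-up values for illustration.
const BASE_SLICE_US: u64 = 5000;
const MIN_SLICE_US: u64 = 250;

fn effective_slice_us(nr_waiting: u64) -> u64 {
    (BASE_SLICE_US / (nr_waiting + 1)).max(MIN_SLICE_US)
}
```

<p>The more tasks are waiting, the shorter the slice, which reduces the average wait time when the system is overloaded.</p>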
<p>Last, but not least, an additional method is provided to notify the
eBPF component if the user-space scheduler has still some pending work
to complete:</p>
<pre><code> pub fn update_tasks(&mut self, nr_queued: Option<u64>, nr_scheduled: Option<u64>)</code></pre>
<p><code>nr_queued</code> is a counter that represents the number of
queued tasks that still need to be processed by the user-space
scheduler; <code>nr_scheduled</code> represents the number of tasks that
are currently queued inside the scheduler and still need to be
dispatched.</p>
<p>For example, it is possible to notify the eBPF dispatcher that the
scheduler doesn’t have any pending work using this method as
follows:</p>
<pre><code> .update_tasks(Some(0), Some(0));</code></pre>
<h2 id="scx_rustland-refactoring">scx_rustland refactoring</h2>
<p><code>scx_rustland</code> has been rewritten on top of
<code>scx_rustland_core</code> and the scheduler code is <strong>a
lot</strong> more compact:</p>
<pre><code> $ git diff --stat origin/scx-user~9..origin/scx-user scheds/rust/scx_rustland/
...
9 files changed, 40 insertions(+), 1592 deletions(-)</code></pre>
<p>This is purely a code refactoring; performance-wise
<code>scx_rustland</code> can still achieve the same results as
before.</p>
<p>[ I can still play AAA games, such as Baldur’s Gate 3, CS2, etc.,
while recompiling the kernel in the background and achieve a higher fps
than the default Linux scheduler. ]</p>
<p>And if the scheduler isn’t ideal for a particular workload we can
simply switch to a different scx scheduler or move back to the default
Linux scheduler, at run-time and with zero downtime.</p>
<h2 id="example">Example</h2>
<p>As a practical example, to demonstrate how to use
<code>scx_rustland_core</code>, I’ve added a new Rust scheduler to the
scx schedulers, called <a
href="https://github.com/sched-ext/scx/tree/main/scheds/rust/scx_rlfifo"><code>scx_rlfifo</code></a>.</p>
<p>The scheduler is a plain FIFO scheduler (which may not be very
thrilling), but its simplicity facilitates its use as a template for
implementing more complex scheduling policies.</p>
<p>The entire code is compact enough that it can fit in this blog post:</p>
<pre><code>// Copyright (c) Andrea Righi <andrea.righi@canonical.com>
// This software may be used and distributed according to the terms of the
// GNU General Public License version 2.
mod bpf_skel;
pub use bpf_skel::*;
pub mod bpf_intf;

mod bpf;
use bpf::*;

use scx_utils::Topology;

use std::sync::atomic::AtomicBool;
use std::sync::atomic::Ordering;
use std::sync::Arc;
use std::time::{Duration, SystemTime};

use anyhow::Result;

struct Scheduler<'a> {
    bpf: BpfScheduler<'a>,
}

impl<'a> Scheduler<'a> {
    fn init() -> Result<Self> {
        let topo = Topology::new().expect("Failed to build host topology");
        let bpf = BpfScheduler::init(5000, topo.nr_cpus() as i32, false, false, false)?;
        Ok(Self { bpf })
    }

    fn now() -> u64 {
        SystemTime::now()
            .duration_since(SystemTime::UNIX_EPOCH)
            .unwrap()
            .as_secs()
    }

    fn dispatch_tasks(&mut self) {
        loop {
            // Get queued tasks and dispatch them in order (FIFO).
            match self.bpf.dequeue_task() {
                Ok(Some(task)) => {
                    // task.cpu < 0 is used to notify an exiting task, in this
                    // case we can simply ignore the task.
                    if task.cpu >= 0 {
                        let _ = self.bpf.dispatch_task(&DispatchedTask {
                            pid: task.pid,
                            cpu: task.cpu,
                            cpumask_cnt: task.cpumask_cnt,
                            payload: 0,
                        });
                    }
                    // Give the task a chance to run and prevent overflowing
                    // the dispatch queue.
                    std::thread::yield_now();
                }
                Ok(None) => {
                    // Notify the BPF component that all tasks have been
                    // scheduled and dispatched.
                    self.bpf.update_tasks(Some(0), Some(0));
                    // All queued tasks have been dispatched, add a short sleep
                    // to reduce the scheduler's CPU consumption.
                    std::thread::sleep(Duration::from_millis(1));
                    break;
                }
                Err(_) => {
                    break;
                }
            }
        }
    }

    fn print_stats(&mut self) {
        let nr_user_dispatches = *self.bpf.nr_user_dispatches_mut();
        let nr_kernel_dispatches = *self.bpf.nr_kernel_dispatches_mut();
        let nr_cancel_dispatches = *self.bpf.nr_cancel_dispatches_mut();
        let nr_bounce_dispatches = *self.bpf.nr_bounce_dispatches_mut();
        let nr_failed_dispatches = *self.bpf.nr_failed_dispatches_mut();
        let nr_sched_congested = *self.bpf.nr_sched_congested_mut();

        println!(
            "user={} kernel={} cancel={} bounce={} fail={} cong={}",
            nr_user_dispatches, nr_kernel_dispatches,
            nr_cancel_dispatches, nr_bounce_dispatches,
            nr_failed_dispatches, nr_sched_congested,
        );
    }

    fn run(&mut self, shutdown: Arc<AtomicBool>) -> Result<()> {
        let mut prev_ts = Self::now();

        while !shutdown.load(Ordering::Relaxed) && !self.bpf.exited() {
            self.dispatch_tasks();

            let curr_ts = Self::now();
            if curr_ts > prev_ts {
                self.print_stats();
                prev_ts = curr_ts;
            }
        }
        self.bpf.shutdown_and_report()
    }
}

fn main() -> Result<()> {
    let mut sched = Scheduler::init()?;

    let shutdown = Arc::new(AtomicBool::new(false));
    let shutdown_clone = shutdown.clone();
    ctrlc::set_handler(move || {
        shutdown_clone.store(true, Ordering::Relaxed);
    })?;

    sched.run(shutdown)
}</code></pre>
<h2 id="conclusion">Conclusion</h2>
<p>Evolving <code>scx_rustland</code> into a generic scheduling
framework in Rust can help to make scheduling development accessible to
a broader audience.</p>
<p>In particular, it has the potential to promote a stronger connection
between academia and real-world kernel development, by helping
researchers to experiment with and test new scheduling theories within a
real kernel environment, but in a safe way.</p>
<p>Rust, eBPF, and all the debugging tools available in user-space can
significantly mitigate the risk of bugs, compared with the traditional
kernel development process. And even in the presence of bugs (such as
deadlocks or starvation), their impact is significantly less critical:
the worst-case scenario may involve a brief freeze lasting up to 5
seconds, after which the sched-ext watchdog will intervene, restoring
the default Linux scheduler.</p>
<p>In conclusion, sched-ext really seems to have the potential to bring
a lot of new concepts and different approaches to Linux scheduling,
making projects like this a tangible reality.</p>
<p>Hopefully, in a not-too-distant future, we will see this feature
available in the upstream kernel, making technologies like
<code>scx_rustland_core</code> really available to everyone.</p>
<h1>Writing a scheduler for Linux in Rust that runs in user-space</h1>
<h1 id="overview">Overview</h1>
<p>I’ve decided to start a series of blog posts to cover some details
about <code>scx_rustland</code>, my little Linux scheduler written in
Rust that runs in user-space.</p>
<p>This project started for fun over the Christmas break, mostly because
I wanted to learn more about sched-ext and I also needed some motivation
to keep practicing Rust (which I’m still learning).</p>
<p>In this series of articles I would like to focus on some
implementation details to better explain how this scheduler works.</p>
<p>A scheduler is a kernel component that needs to:</p>
<ul>
<li><p>determine the order of execution of all the tasks that want to
run (<strong>ranking</strong>)</p></li>
<li><p>determine where each task needs to run (<strong>target
CPU</strong>)</p></li>
<li><p>determine for how long a task can run (<strong>time
slice</strong>)</p></li>
</ul>
<p>The main goal of this project is to prove that we can channel these
operations into a regular user-space program and still have a system
that performs as well as one using an in-kernel scheduler.</p>
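<p>Conceptually, those three duties can be captured in a tiny interface. This is just an illustrative sketch (not an actual API of <code>scx_rustland</code>), with a trivial FIFO policy filling it in:</p>

```rust
// Illustrative only: the three decisions any scheduler has to make,
// expressed as a Rust trait. Pid is a stand-in task identifier.
type Pid = i32;

trait SchedulingPolicy {
    // Ranking: pick the next task to run among the runnable ones.
    fn pick_next(&mut self, runnable: &[Pid]) -> Option<Pid>;
    // Target CPU: choose where the picked task should run.
    fn select_cpu(&mut self, task: Pid, prev_cpu: i32) -> i32;
    // Time slice: decide for how long it may run, in microseconds.
    fn time_slice_us(&mut self, task: Pid) -> u64;
}

// A trivial FIFO implementation of the interface.
struct Fifo;

impl SchedulingPolicy for Fifo {
    fn pick_next(&mut self, runnable: &[Pid]) -> Option<Pid> {
        runnable.first().copied()
    }
    fn select_cpu(&mut self, _task: Pid, prev_cpu: i32) -> i32 {
        prev_cpu // keep the task where it ran last
    }
    fn time_slice_us(&mut self, _task: Pid) -> u64 {
        5000
    }
}
```

<p>Everything that follows in this post is, in essence, the plumbing needed to let a regular user-space process answer these three questions for the kernel.</p>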
<h1 id="pros-and-cons-of-a-user-space-scheduler">Pros and cons of a
user-space scheduler</h1>
<p>The most noticeable benefits of a user-space scheduler are the
following:</p>
<ul>
<li><p>Availability of a large pool of languages (e.g., Rust),
libraries, debugging and profiling tools.</p></li>
<li><p>Lowering the barrier to CPU scheduling experimentation:
implementing and testing a particular scheduling policy can be done
inside a regular user-space process; it doesn’t require rebooting into a
new kernel, and in case of bugs the user-space scheduler will just crash
while sched-ext transparently restores the default Linux
scheduler.</p></li>
</ul>
<p>Downside of a user-space scheduler:</p>
<ul>
<li><p>Overhead: even if the scheduler itself is not really a
CPU-intensive workload (unless the system is massively overloaded), the
communication between kernel and user-space is going to add some
overhead, so the goal is to reduce this overhead as much as
possible.</p></li>
<li><p>Protection: the user-space task that implements the scheduling
policy needs some special protection; if it’s blocked (e.g., due to a
page fault), tasks can’t be scheduled. So, in order to avoid deadlocks,
the user-space scheduler should never block indefinitely waiting for
some action performed by another task (a page fault is a good
example of such “forbidden” behavior).</p></li>
</ul>
<h1 id="how-does-it-work">How does it work?</h1>
<p>First of all, this scheduler has nothing to do with Rust-for-Linux
(the kernel subsystem for writing kernel modules in Rust).</p>
<p>In fact, the Rust part of the scheduler runs 100% in user-space and
all the scheduling decisions are also made in user-space.</p>
<p>The connection with the kernel happens thanks to eBPF and sched-ext:
together they make it possible to channel all the scheduling events to a
user-space program, which then communicates the tasks to be executed
back to the kernel, via eBPF as well.</p>
<h2 id="ebpf-component">eBPF component</h2>
<p>sched-ext is a Linux kernel feature (not upstream yet - hopefully
it’ll be upstream soon) that makes it possible to implement a scheduler
in eBPF.</p>
<p>Basically, all you have to do is implement some callbacks and
register them in a <code>struct sched_ext_ops</code>.</p>
<p><code>scx_rustland</code> implements the following callbacks in the
<a
href="https://github.com/sched-ext/scx/blob/v0.1.7/scheds/rust/scx_rustland/src/bpf/main.bpf.c">eBPF
component</a>:</p>
<pre><code>/*
 * Scheduling class declaration.
 */
SEC(".struct_ops.link")
struct sched_ext_ops rustland = {
        .select_cpu = (void *)rustland_select_cpu,
        .enqueue = (void *)rustland_enqueue,
        .dispatch = (void *)rustland_dispatch,
        .running = (void *)rustland_running,
        .stopping = (void *)rustland_stopping,
        .update_idle = (void *)rustland_update_idle,
        .set_cpumask = (void *)rustland_set_cpumask,
        .cpu_release = (void *)rustland_cpu_release,
        .init_task = (void *)rustland_init_task,
        .exit_task = (void *)rustland_exit_task,
        .init = (void *)rustland_init,
        .exit = (void *)rustland_exit,
        .flags = SCX_OPS_ENQ_LAST | SCX_OPS_KEEP_BUILTIN_IDLE,
        .timeout_ms = 5000,
        .name = "rustland",
};</code></pre>
<p>The workflow is the following:</p>
<ul>
<li><code>.select_cpu()</code> implements the logic to assign a target
CPU to a task that wants to run: typically you have to decide whether to
keep the task on the same CPU or migrate it to a different one (for
example, if the current CPU is busy). If we can find an idle CPU at this
stage there’s no reason to call the scheduler: the task can be
immediately dispatched here.</li>
</ul>
<pre><code>s32 BPF_STRUCT_OPS(rustland_select_cpu, struct task_struct *p, s32 prev_cpu,
                   u64 wake_flags)
{
        bool is_idle = false;
        s32 cpu;

        cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
        if (is_idle) {
                /*
                 * Using SCX_DSQ_LOCAL ensures that the task will be executed
                 * directly on the CPU returned by this function.
                 */
                dispatch_task(p, SCX_DSQ_LOCAL, 0, 0);
                __sync_fetch_and_add(&nr_kernel_dispatches, 1);
        }
        return cpu;
}</code></pre>
<p>If we can’t find an idle CPU, this step will just return the
previously used CPU, which can be used as a hint for the user-space
scheduler (keeping tasks on the same CPU has multiple benefits, such as
reusing hot caches and avoiding any kind of migration overhead).
However, this decision is not the final one: the user-space scheduler
can decide to move the task to a different CPU if needed.</p>
<p><strong>NOTE: bypassing the user-space scheduler when we can find an
idle CPU can strongly improve the responsiveness of certain “low
latency” workloads, such as gaming for example.</strong></p>
<ul>
<li><p>Once a tentative CPU has been determined for the task, we enter
the <code>.enqueue()</code> callback: here you would typically store the
task in a queue, tree, or any other data structure that determines the
proper order of execution of the different tasks that want to
run.</p>
<p>In rustland the <code>.enqueue()</code> callback is used to store
tasks into a <code>BPF_MAP_TYPE_QUEUE</code> BPF map called
<code>queued</code>; this represents the first connection with the
user-space counterpart. Items in this queue are managed in a
producer/consumer way: the BPF part is the producer, user-space is the
consumer.</p></li>
</ul>
<pre><code>void BPF_STRUCT_OPS(rustland_enqueue, struct task_struct *p, u64 enq_flags)
{
        ...
        /*
         * Add tasks to the @queued list, they will be processed by the
         * user-space scheduler.
         *
         * If @queued list is full (user-space scheduler is congested) tasks
         * will be dispatched directly from the kernel (re-using their
         * previously used CPU in this case).
         */
        get_task_info(&task, p, false);
        dbg_msg("enqueue: pid=%d (%s)", p->pid, p->comm);
        if (bpf_map_push_elem(&queued, &task, 0)) {
                sched_congested(p);
                dispatch_task(p, SHARED_DSQ, 0, enq_flags);
                __sync_fetch_and_add(&nr_kernel_dispatches, 1);
                return;
        }
        __sync_fetch_and_add(&nr_queued, 1);
}</code></pre>
<p>At this point it’s up to the user-space counterpart to determine the
proper order of execution of tasks and where they need to run;
user-space has the option to maintain the CPU assignment determined by
the built-in idle selection logic or pick another CPU.</p>
<ul>
<li><p>Once the order of execution is determined, tasks are stored into
another <code>BPF_MAP_TYPE_QUEUE</code> called <code>dispatched</code>,
again in a producer/consumer way, but this time the producer is the
user-space part and the consumer is the BPF part.</p></li>
<li><p>Then the workflow goes back to the BPF part. The dispatch path
operates using multiple per-CPU dispatch queues (DSQ) and a global
dispatch queue.</p>
<p>The per-CPU DSQs are used to dispatch tasks on specific CPUs, while
the global DSQ is used to dispatch tasks on the first CPU that becomes
available (usually when the user-space doesn’t specify any preference to
run the task on a particular CPU).</p>
<p>When a CPU becomes ready to dispatch tasks, the
<code>.dispatch()</code> callback is called; if there are tasks in the
<code>dispatched</code> queue they will be bounced to the target CPU’s
dispatch queue (DSQ), or to the global dispatch queue, based on the
user-space scheduler’s decision.</p>
<pre><code>void BPF_STRUCT_OPS(rustland_dispatch, s32 cpu, struct task_struct *prev)
{
        /*
         * Check if the user-space scheduler needs to run, and in that case try
         * to dispatch it immediately.
         */
        dispatch_user_scheduler();

        /*
         * Consume all tasks from the @dispatched list and immediately try to
         * dispatch them on their target CPU selected by the user-space
         * scheduler (at this point the proper ordering has been already
         * determined by the scheduler).
         */
        bpf_repeat(MAX_ENQUEUED_TASKS) {
                struct task_struct *p;
                struct dispatched_task_ctx task;

                /*
                 * Pop first task from the dispatched queue, stop if dispatch
                 * queue is empty.
                 */
                if (bpf_map_pop_elem(&dispatched, &task))
                        break;

                /* Ignore entry if the task doesn't exist anymore */
                p = bpf_task_from_pid(task.pid);
                if (!p)
                        continue;
                /*
                 * Check whether the user-space scheduler assigned a different
                 * CPU to the task and migrate (if possible).
                 *
                 * If no CPU has been specified (task.cpu < 0), then dispatch
                 * the task to the shared DSQ and rely on the built-in idle CPU
                 * selection.
                 */
                dbg_msg("usersched: pid=%d cpu=%d cpumask_cnt=%llu payload=%llu",
                        task.pid, task.cpu, task.cpumask_cnt, task.payload);
                if (task.cpu < 0)
                        dispatch_task(p, SHARED_DSQ, 0, 0);
                else
                        dispatch_task(p, cpu_to_dsq(task.cpu), task.cpumask_cnt, 0);
                bpf_task_release(p);
                __sync_fetch_and_add(&nr_user_dispatches, 1);
        }

        /* Consume all tasks enqueued in the current CPU's DSQ first */
        bpf_repeat(MAX_ENQUEUED_TASKS) {
                if (!scx_bpf_consume(cpu_to_dsq(cpu)))
                        break;
        }

        /* Consume all tasks enqueued in the shared DSQ */
        bpf_repeat(MAX_ENQUEUED_TASKS) {
                if (!scx_bpf_consume(SHARED_DSQ))
                        break;
        }
}</code></pre></li>
<li><p>The <code>.running()</code> and <code>.stopping()</code>
callbacks are called respectively when a task starts its execution on a
CPU and when it releases the CPU; rustland uses this information to keep
track of the CPUs that are idle or busy, sharing this information with
the user-space counterpart (via the <code>cpu_map</code> BPF map
array).</p></li>
</ul>
<pre><code>/*
 * Task @p starts on its selected CPU (update CPU ownership map).
 */
void BPF_STRUCT_OPS(rustland_running, struct task_struct *p)
{
        s32 cpu = scx_bpf_task_cpu(p);

        dbg_msg("start: pid=%d (%s) cpu=%ld", p->pid, p->comm, cpu);
        /*
         * Mark the CPU as busy by setting the pid as owner (ignoring the
         * user-space scheduler).
         */
        if (!is_usersched_task(p))
                set_cpu_owner(cpu, p->pid);
}

/*
 * Task @p stops running on its associated CPU (update CPU ownership map).
 */
void BPF_STRUCT_OPS(rustland_stopping, struct task_struct *p, bool runnable)
{
        s32 cpu = scx_bpf_task_cpu(p);

        dbg_msg("stop: pid=%d (%s) cpu=%ld", p->pid, p->comm, cpu);
        /*
         * Mark the CPU as idle by setting the owner to 0.
         */
        if (!is_usersched_task(p)) {
                set_cpu_owner(scx_bpf_task_cpu(p), 0);
                /*
                 * Kick the user-space scheduler immediately when a task
                 * releases a CPU and speculate on the fact that most of the
                 * time there is another task ready to run.
                 */
                set_usersched_needed();
        }
}</code></pre>
<ul>
<li>Both the <code>.stopping()</code> and <code>.update_idle()</code>
callbacks are used as checkpoints to wake up the user-space scheduler
(since the scheduler is a regular user-space task, it needs some logic
to schedule itself).</li>
</ul>
<pre><code>void BPF_STRUCT_OPS(rustland_update_idle, s32 cpu, bool idle)
{
        /*
         * Don't do anything if we exit from an idle state, a CPU owner will
         * be assigned in .running().
         */
        if (!idle)
                return;
        /*
         * A CPU is now available, notify the user-space scheduler that tasks
         * can be dispatched.
         */
        if (usersched_has_pending_tasks()) {
                set_usersched_needed();
                /*
                 * Wake up the idle CPU, so that it can immediately accept
                 * dispatched tasks.
                 */
                scx_bpf_kick_cpu(cpu, 0);
        }
}
</code></pre>
<p>There is also a periodic heartbeat timer that kicks the user-space
scheduler to prevent triggering the sched-ext watchdog when the system
is almost idle (since in this condition we won’t hit any of the wake-up
points).</p>
<pre><code>static int usersched_timer_fn(void *map, int *key, struct bpf_timer *timer)
{
        int err = 0;

        /* Kick the scheduler */
        set_usersched_needed();

        /* Re-arm the timer */
        err = bpf_timer_start(timer, NSEC_PER_SEC, 0);
        if (err)
                scx_bpf_error("Failed to arm stats timer");

        return 0;
}</code></pre>
<ul>
<li>Lastly, the <code>.set_cpumask()</code> callback is used to detect
when a task changes its affinity; the scheduler will try to honor the
affinity by looking at the cpumask (we check the validity of the cpumask
using a generation number that is incremented every time the
<code>.set_cpumask()</code> callback is executed).</li>
</ul>
<pre><code>void BPF_STRUCT_OPS(rustland_set_cpumask, struct task_struct *p,
                    const struct cpumask *cpumask)
{
        struct task_ctx *tctx;

        tctx = lookup_task_ctx(p);
        if (!tctx)
                return;

        tctx->cpumask_cnt++;
}</code></pre>
<h2 id="user-space-component-rust">User-space component (Rust)</h2>
<p>The user-space part is fully implemented in Rust as a regular
user-space program. The address space is shared with the eBPF part, so
some variables can be accessed and modified directly, while the
communication of tasks happens using the <code>bpf()</code> syscall,
accessing the <code>queued</code> and <code>dispatched</code> maps.</p>
<p><strong>NOTE: we could make this part more efficient by using eBPF
ring buffers, which would allow direct access to the maps without using
a syscall (there’s ongoing work on this - patches are welcome if you
want to contribute).</strong></p>
<p>The user-space part is made of four components:</p>
<ul>
<li><p><a
href="https://github.com/sched-ext/scx/blob/v0.1.7/scheds/rust/scx_rustland/src/bpf.rs">eBPF
abstraction layer</a>: this part implements some Rust abstractions to
hide the internal eBPF details, so that the scheduler itself can be
implemented in a more abstract and understandable way, focusing only
on the details of the implemented scheduling policy.</p></li>
<li><p>A custom memory allocator <a
href="https://github.com/sched-ext/scx/blob/v0.1.7/scheds/rust/scx_rustland/src/bpf/alloc.rs">RustLandAllocator</a>:
as mentioned in the pros and cons section, if the user-space scheduler
is blocked on a page fault, no other task can be scheduled; but we may
need to schedule some kernel threads to resolve the page fault, hence
the deadlock. To prevent this condition, the user-space scheduler locks
all of its memory, via <code>mlockall()</code>, and uses a custom memory
allocator that operates on a pre-allocated memory area. Quite tricky,
but this prevents page faults in the user-space scheduler
task.</p></li>
<li><p>A <a
href="https://github.com/sched-ext/scx/blob/v0.1.7/scheds/rust/scx_rustland/src/topology.rs">CPU
topology abstraction</a>: simple library to detect the current system
CPU topology (this part will be improved in the future and moved to a
more generic place, so that other schedulers may benefit from
it).</p></li>
<li><p>The <a
href="https://github.com/sched-ext/scx/blob/v0.1.7/scheds/rust/scx_rustland/src/main.rs">scheduling
policy</a> itself, implemented in a totally abstracted way: the
scheduler uses a simple vruntime-based policy (similar to CFS) with a
little trick to detect interactive tasks and boost their priority (the
trick is to look at the number of voluntary context switches: a task
that releases the CPU without using its full assigned time slice is
likely to be interactive).</p>
<p>All tasks are stored in a <a
href="https://github.com/sched-ext/scx/blob/v0.1.7/scheds/rust/scx_rustland/src/main.rs#L184">BTreeSet</a>
ordered by their weighted vruntime and dispatched on the CPUs selected
by the sched-ext built-in idle selection logic (unless their assigned
CPU becomes busy, in which case the task will be dispatched on the
first CPU that becomes available).</p>
<p>For the time slice assigned to each task the scheduler uses a
variable time slice approach: it starts with a fixed time slice (20ms)
that is scaled down based on the number of tasks waiting to be scheduled
(the more the system becomes overloaded, the shorter the assigned time
slice becomes; this can help to reduce the average wait time, making the
system more responsive when it is overloaded).</p></li>
</ul>
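<p>The vruntime bookkeeping described above can be condensed into a few lines. Here is an illustrative sketch (field names and the weight base of 100 are made up for this post, this is not the actual <code>scx_rustland</code> code): CPU time is charged inversely to the task’s weight, and the <code>BTreeSet</code> keeps tasks sorted so the smallest vruntime runs first.</p>

```rust
use std::collections::BTreeSet;

// Illustrative sketch of vruntime-based ordering: vruntime advances more
// slowly for high-weight (high-priority) tasks, so they get picked more
// often.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct Task {
    vruntime: u64, // first field: the BTreeSet orders by vruntime, then pid
    pid: i32,
}

// Charge `delta_exec` ns of CPU time, scaled inversely by weight
// (here weight 100 is assumed to be the "normal", nice-0 priority).
fn update_vruntime(vruntime: u64, delta_exec: u64, weight: u64) -> u64 {
    vruntime + delta_exec * 100 / weight
}

// The task with the smallest weighted vruntime runs first.
fn pick_next(tasks: &mut BTreeSet<Task>) -> Option<Task> {
    tasks.pop_first()
}
```

<p>Tasks that sleep a lot (interactive ones) accumulate little vruntime, so they naturally sort toward the front of the set when they wake up.</p>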
<h1 id="conclusion">Conclusion</h1>
<p>That’s all for now; the goal of this post (probably the first of
several) is to give an idea of how the scheduler works.</p>
<p>The scheduler is still under development, but some early results are
very <a
href="https://www.youtube.com/watch?v=oCfVbz9jvVQ">promising</a>.</p>
<p><strong>NOTE: keep in mind that in this video the scheduler was still
in an early stage, since then it has been improved a lot in terms of
stability, robustness and performance.</strong></p>
<p>In the next post I will cover more technical details, mentioning some
open issues and plans for future development and improvements.</p>
<p>I’m also planning to run more benchmarks with this scheduler (using
the <a href="https://www.phoronix-test-suite.com/">Phoronix test
suite</a>) and share some results, so stay tuned!</p>
<h1 id="references">References</h1>
<ul>
<li><p><a href="https://github.com/sched-ext/sched_ext">sched-ext
kernel</a></p></li>
<li><p><a href="https://github.com/sched-ext/scx">sched-ext schedulers
and tools</a></p></li>
<li><p><a
href="https://github.com/sched-ext/scx/tree/main/scheds/rust/scx_rustland">scx_rustland
source code</a></p></li>
</ul>
<h1>Implement your own kernel CPU scheduler in Ubuntu with sched-ext</h1>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.3.1/styles/default.min.css" integrity="sha512-3xLMEigMNYLDJLAgaGlDSxpGykyb+nQnJBzbkQy2a0gyVKL2ZpNOPIj1rD8IPFaJbwAgId/atho1+LBpWu5DhA==" crossorigin="anonymous" referrerpolicy="no-referrer" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.3.1/highlight.min.js" integrity="sha512-Pbb8o120v5/hN/a6LjF4N4Lxou+xYZ0QcVF8J6TWhBbHmctQWd8O6xTDmHpE/91OjPzCk4JRoiJsexHYg4SotQ==" crossorigin="anonymous" referrerpolicy="no-referrer"></script>
<script>hljs.highlightAll();</script>
<h3>What is sched-ext?</h3>
sched-ext is a new scheduling class introduced in the Linux kernel that provides a mechanism to implement scheduling policies as BPF (Berkeley Packet Filter) programs <a href="#one">[1]</a>. Such programs can also be connected to user-space counterparts to defer scheduling decisions to regular user-space processes.
<br />
<br />
<h3>State of the art</h3>
The idea of "pluggable" schedulers is not new: it was initially proposed in 2004 <a href="#two">[2]</a>, but at that time it was strongly rejected in order to prioritize the creation of a single generic scheduler (one to rule them all), which ended up being the “completely fair scheduler” (CFS).
However, with BPF and the sched-ext scheduling class, we now have the possibility to easily and quickly implement and test scheduling policies, making the “pluggable” approach an effective tool for experimentation.
<br />
<br />
<h3>What is the main benefit of sched-ext?</h3>
The ability to implement custom scheduling policies via BPF greatly lowers the difficulty of testing new scheduling ideas (much easier than changing CFS or replacing it with a different scheduler). With this feature researchers or developers can test their own scheduler in a safe way, without even needing to reboot the system.
<br />
<br />
<h3>How to use sched-ext in Ubuntu?</h3>
Unfortunately sched-ext is not yet available in the upstream Linux kernel; at the moment it is only available as a patch set on the Linux kernel mailing list <a href="#three">[3]</a> (and it is unlikely to be applied upstream in the near future, because there are still some concerns and potential issues that need to be addressed).
However, it is possible to use an experimental version of the Ubuntu linux-unstable kernel <a href="#four">[4]</a> <a href="#five">[5]</a> that includes the sched-ext patch set (keep in mind that this kernel is very experimental - do not use it in production!).
<br />
<br />
<h3>How to implement a custom scheduler?</h3>
The following example implements a “toy” CPU scheduler that passes all the scheduling “enqueue” events to a user-space task, which processes them in FIFO order and sends the corresponding “dispatch” events back to the kernel.
First of all, let’s implement the BPF program:
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace;
color: #000000; background-color: #eee;
font-size: 12px; border: 1px dashed #999999;
line-height: 14px; padding: 5px;
overflow: auto; width: 100%">
<code style="color:#000000;word-wrap:normal;">
/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Copyright 2023 Canonical Ltd.
 */
#include "scx_common.bpf.h"
#include "scx_toy.h"

char _license[] SEC("license") = "GPL";

/*
 * This contains the PID of the scheduler task itself (initialized in
 * scx_toy.c).
 */
const volatile s32 usersched_pid;

/* Set when the user-space scheduler needs to run */
static bool usersched_needed;

/* Notify the user-space counterpart when the BPF program exits */
struct user_exit_info uei;

/* Enqueue statistics */
u64 nr_failed_enqueues, nr_kernel_enqueues, nr_user_enqueues;

/*
 * BPF map to store enqueue events.
 *
 * The producer of this map is this BPF program, the consumer is the user-space
 * scheduler task.
 */
struct {
    __uint(type, BPF_MAP_TYPE_QUEUE);
    __uint(max_entries, MAX_TASKS);
    __type(value, struct scx_toy_enqueued_task);
} enqueued SEC(".maps");

/*
 * BPF map to store dispatch events.
 *
 * The producer of this map is the user-space scheduler task, the consumer is
 * this BPF program.
 */
struct {
    __uint(type, BPF_MAP_TYPE_QUEUE);
    __uint(max_entries, MAX_TASKS);
    __type(value, s32);
} dispatched SEC(".maps");

/* Return true if the target task "p" is a kernel thread */
static inline bool is_kthread(const struct task_struct *p)
{
    return !!(p->flags & PF_KTHREAD);
}

/* Return true if the target task "p" is the user-space scheduler task */
static bool is_usersched_task(const struct task_struct *p)
{
    return p->pid == usersched_pid;
}

/*
 * Dispatch the user-space scheduler directly.
 */
static void dispatch_user_scheduler(void)
{
    struct task_struct *p;

    if (!usersched_needed)
        return;
    p = bpf_task_from_pid(usersched_pid);
    if (!p)
        return;
    usersched_needed = false;
    scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
    bpf_task_release(p);
}

void BPF_STRUCT_OPS(toy_enqueue, struct task_struct *p, u64 enq_flags)
{
    struct scx_toy_enqueued_task task = {
        .pid = p->pid,
    };

    /*
     * The user-space scheduler will be dispatched only when needed from
     * toy_dispatch(), so we can skip it here.
     */
    if (is_usersched_task(p))
        return;

    if (is_kthread(p)) {
        /*
         * We want to dispatch kernel threads directly here for
         * efficiency reasons, rather than passing the events to the
         * user-space scheduler counterpart.
         */
        __sync_fetch_and_add(&nr_kernel_enqueues, 1);
        scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
        return;
    }

    if (bpf_map_push_elem(&enqueued, &task, 0)) {
        /*
         * We couldn't push the task to the "enqueued" map: dispatch
         * the task here and register the failure in the failure
         * counter.
         */
        __sync_fetch_and_add(&nr_failed_enqueues, 1);
        scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
    } else {
        /*
         * The enqueue event will be processed and the task will be
         * dispatched in user-space by the scheduler task, so mark the
         * scheduler task itself as ready to run.
         */
        __sync_fetch_and_add(&nr_user_enqueues, 1);
        usersched_needed = true;
    }
}

void BPF_STRUCT_OPS(toy_dispatch, s32 cpu, struct task_struct *prev)
{
    struct task_struct *p;
    s32 pid;

    dispatch_user_scheduler();

    /*
     * Get a dispatch event from user-space and dispatch the corresponding
     * task.
     */
    if (bpf_map_pop_elem(&dispatched, &pid))
        return;

    p = bpf_task_from_pid(pid);
    if (!p)
        return;
    scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
    bpf_task_release(p);
}

s32 BPF_STRUCT_OPS(toy_init)
{
    /* Apply the "toy" scheduling class to all the tasks in the system */
    scx_bpf_switch_all();

    return 0;
}

void BPF_STRUCT_OPS(toy_exit, struct scx_exit_info *ei)
{
    /* Notify the user-space counterpart that the BPF program terminated */
    uei_record(&uei, ei);
}

SEC(".struct_ops.link")
struct sched_ext_ops toy_ops = {
    .enqueue  = (void *)toy_enqueue,
    .dispatch = (void *)toy_dispatch,
    .init     = (void *)toy_init,
    .exit     = (void *)toy_exit,
    .name     = "toy",
};
</code>
</pre>
Then we can implement the user-space counterpart, which receives the “enqueue” events and dispatches the corresponding tasks back to the kernel:
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace;
color: #000000; background-color: #eee;
font-size: 12px; border: 1px dashed #999999;
line-height: 14px; padding: 5px;
overflow: auto; width: 100%">
<code style="color:#000000;word-wrap:normal;">
/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Copyright 2023 Canonical Ltd.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sched.h>
#include <signal.h>
#include <assert.h>
#include <libgen.h>
#include <pthread.h>
#include <bpf/bpf.h>
#include <sys/mman.h>
#include <sys/queue.h>
#include <sys/syscall.h>
#include "user_exit_info.h"
#include "scx_toy.skel.h"
#include "scx_toy.h"

const char help_fmt[] =
"A toy sched_ext scheduler.\n"
"\n"
"See the top-level comment in .bpf.c for more details.\n"
"\n"
"Usage: %s\n"
"\n"
"  -h            Display this help and exit\n";

static volatile int exit_req;

/*
 * Descriptors used to communicate enqueue and dispatch events with the BPF
 * program.
 */
static int enqueued_fd, dispatched_fd;

static struct scx_toy *skel;

static void sigint_handler(int dummy)
{
    exit_req = 1;
}

/* Thread that periodically prints enqueue statistics */
static void *run_stats_printer(void *arg)
{
    while (!exit_req) {
        __u64 nr_failed_enqueues, nr_kernel_enqueues, nr_user_enqueues, total;

        nr_failed_enqueues = skel->bss->nr_failed_enqueues;
        nr_kernel_enqueues = skel->bss->nr_kernel_enqueues;
        nr_user_enqueues = skel->bss->nr_user_enqueues;
        total = nr_failed_enqueues + nr_kernel_enqueues + nr_user_enqueues;

        printf("\e[1;1H\e[2J");
        printf("o-----------------------o\n");
        printf("|  BPF SCHED ENQUEUES   |\n");
        printf("|-----------------------|\n");
        printf("| kern:      %10llu |\n", nr_kernel_enqueues);
        printf("| user:      %10llu |\n", nr_user_enqueues);
        printf("| failed:    %10llu |\n", nr_failed_enqueues);
        printf("| --------------------- |\n");
        printf("| total:     %10llu |\n", total);
        printf("o-----------------------o\n\n");
        sleep(1);
    }
    return NULL;
}

static int spawn_stats_thread(void)
{
    pthread_t stats_printer;

    return pthread_create(&stats_printer, NULL, run_stats_printer, NULL);
}

/* Send a dispatch event to the BPF program */
static int dispatch_task(s32 pid)
{
    int err;

    err = bpf_map_update_elem(dispatched_fd, NULL, &pid, 0);
    if (err) {
        fprintf(stderr, "Failed to dispatch task %d\n", pid);
        exit_req = 1;
    }
    return err;
}

/* Receive all the enqueue events from the BPF program */
static void drain_enqueued_map(void)
{
    struct scx_toy_enqueued_task task;

    while (!bpf_map_lookup_and_delete_elem(enqueued_fd, NULL, &task))
        dispatch_task(task.pid);
}

/*
 * Scheduler main loop: get enqueue events from the BPF program, process them
 * (no-op) and send dispatch events to the BPF program.
 */
static void sched_main_loop(void)
{
    while (!exit_req && !uei_exited(&skel->bss->uei)) {
        drain_enqueued_map();
        sched_yield();
    }
}

int main(int argc, char **argv)
{
    struct bpf_link *link;
    int opt, err;

    signal(SIGINT, sigint_handler);
    signal(SIGTERM, sigint_handler);
    libbpf_set_strict_mode(LIBBPF_STRICT_ALL);

    skel = scx_toy__open();
    assert(skel);

    skel->rodata->usersched_pid = getpid();
    assert(skel->rodata->usersched_pid > 0);

    while ((opt = getopt(argc, argv, "h")) != -1) {
        switch (opt) {
        default:
            fprintf(stderr, help_fmt, basename(argv[0]));
            return opt != 'h';
        }
    }

    /*
     * It's not always safe to allocate in a user-space scheduler, as an
     * enqueued task could hold a lock that we require in order to be able
     * to allocate.
     */
    err = mlockall(MCL_CURRENT | MCL_FUTURE);
    if (err) {
        fprintf(stderr, "Failed to prefault and lock address space: %s\n",
            strerror(errno));
        return err;
    }

    assert(!scx_toy__load(skel));

    /* Initialize the file descriptors used to communicate with the BPF program */
    enqueued_fd = bpf_map__fd(skel->maps.enqueued);
    dispatched_fd = bpf_map__fd(skel->maps.dispatched);
    assert(enqueued_fd > 0);
    assert(dispatched_fd > 0);

    /* Start the thread that periodically prints the enqueue statistics */
    err = spawn_stats_thread();
    if (err) {
        fprintf(stderr, "Failed to spawn stats thread: %s\n", strerror(err));
        goto destroy_skel;
    }

    /* Register the BPF program */
    link = bpf_map__attach_struct_ops(skel->maps.toy_ops);
    assert(link);

    /* Call the scheduler main loop */
    sched_main_loop();

    /* Unregister the BPF program and exit */
    bpf_link__destroy(link);
    uei_print(&skel->bss->uei);
    scx_toy__destroy(skel);
    return 0;

destroy_skel:
    scx_toy__destroy(skel);
    exit_req = 1;
    return err;
}
</code>
</pre>
To test the “toy” scheduler, install the latest kernel from `ppa:arighi/sched-ext` with all the required build dependencies, following the steps documented at <a href="#five">[5]</a>.
Then you can load this “toy” scheduling class, replacing the default CPU scheduler in Linux, simply by running the following command:
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace;
color: #000000; background-color: #eee;
font-size: 12px; border: 1px dashed #999999;
line-height: 14px; padding: 5px;
overflow: auto; width: 100%">
<code style="color:#000000;word-wrap:normal;">
$ sudo ./scx_toy
</code>
</pre>
All the events affecting kernel threads, or the scheduler task itself, will be processed in kernel-space (for efficiency reasons), while all the other user-space tasks will be handled by the scheduler task (`scx_toy`).
The program will output some statistics about the “enqueue” and “dispatch” events performed in user-space and kernel-space.
To unregister the “toy” scheduler and restore the default CFS scheduler we can simply press CTRL+c, which will also stop the scheduler task:
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace;
color: #000000; background-color: #eee;
font-size: 12px; border: 1px dashed #999999;
line-height: 14px; padding: 5px;
overflow: auto; width: 100%">
<code style="color:#000000;word-wrap:normal;">
o-----------------------o
|  BPF SCHED ENQUEUES   |
|-----------------------|
| kern:           19028 |
| user:           15875 |
| failed:             0 |
| --------------------- |
| total:          34903 |
o-----------------------o
^CEXIT: BPF scheduler unregistered
</code>
</pre>
<h3>Credits</h3>
The sched-ext patch set has been written by Tejun Heo, David Vernet, Josh Don and Barret Rhoden (with multiple contributions from the kernel community).
<br />
<br />
<h3>See also</h3>
<ol>
<li><a id="one" href="https://lwn.net/Articles/922405/">The extensible scheduler class</a></li>
<li><a id="two" href="https://lwn.net/Articles/109458/">Schedulers, pluggable and realtime</a></li>
<li><a id="three" href="https://lore.kernel.org/bpf/ZVPJTc5ZNEnnYmei@slm.duckdns.org/T/">[PATCHSET v5] sched: Implement BPF extensible scheduler class</a></li>
<li><a id="four" href="https://git.launchpad.net/~arighi/+git/linux">Latest sched-ext enabled Ubuntu kernel (git repository)</a></li>
<li><a id="five" href="https://launchpad.net/~arighi/+archive/ubuntu/sched-ext">Ubuntu sched-ext experimental (ppa)</a></li>
<li><a id="six" href="https://blogs.igalia.com/changwoo/sched-ext-a-bpf-extensible-scheduler-class-part-1/">sched_ext: a BPF-extensible scheduler class (Part 1)</a></li>
</ol>arighihttp://www.blogger.com/profile/15223521151492879497noreply@blogger.com0tag:blogger.com,1999:blog-4397409626710913610.post-12449663159024613052019-08-11T15:45:00.000+02:002019-08-11T16:22:56.306+02:00Kernel debugging using QEMU/KVM, virtme and crash<h2>
<a id="user-content-introduction" class="anchor" href="#introduction" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Introduction</h2>
<p>Kernel development can be very time consuming, not only because of the time
required to compile the kernel itself (especially with a beefy .config), but
also because the tasks of deploying, testing and debugging represent a large
portion of the work.</p>
<p>In this article we will explore a slightly different approach that speeds up
kernel development, testing and debugging using QEMU/KVM.</p>
<h2>
<a id="user-content-virtualized-environment" class="anchor" href="#virtualized-environment" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Virtualized environment</h2>
<p>Sometimes we can speed up this process by modularizing the portion of code
that we need to analyze. However, this approach is not always doable: we may
not yet know which specific portion of code to debug, or it simply cannot be
modularized (e.g., core kernel components).</p>
<p>Moreover, working with a kernel module can potentially crash or seriously
compromise the system, causing loss of work, plus the extra time needed to
reboot and restore the previous session.</p>
<p>Using a virtualized environment is definitely better from this point of view:
the testing environment can be easily re-deployed, restarted and resumed in
case of failures. Moreover, the development environment is never
compromised.</p>
<p>The first tool that we are going to use is virtme
(<a href="https://github.com/amluto/virtme">https://github.com/amluto/virtme</a>).</p>
<p>This tool is a wrapper on top of QEMU/KVM that allows you to quickly spin up
an instance from a kernel build directory, creating a live sandbox of a freshly
compiled kernel that shares the $HOME directory (read-only) with the host.</p>
<p>This gives the big advantage of being able to easily deploy additional files
into the testing instance, simply by copying them into a folder inside your
$HOME.</p>
<h2>
<a id="user-content-testing-the-kernel" class="anchor" href="#testing-the-kernel" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Testing the kernel</h2>
<p>We are going to create a helper script on top of virtme to quickly start a
kernel to test:</p>
<pre><code>righiandr@xps-13:~$ cat bin/kernel-test
#!/bin/bash
virtme-run --kdir . $* -a "nokaslr" --qemu-opts -m 1024 -smp 4 -s -qmp tcp:localhost:4444,server,nowait
righiandr@xps-13:~$
</code></pre>
<p>Assuming we have just compiled the kernel in $HOME/linux, we can quickly start it using:</p>
<pre><code>righiandr@xps-13:~/linux$ kernel-test
[ 0.000000] Linux version 5.3.0-rc3+ (righiandr@xps-13) (gcc version 9.1.0 (Ubuntu 9.1.0-9ubuntu2)) #43 SMP Sun Aug 11 08:46:51 CEST 2019
[ 0.000000] Command line: earlyprintk=serial,ttyS0,115200 console=ttyS0 psmouse.proto=exps "virtme_stty_con=rows 28 cols 104 iutf8" TERM=xterm-256color rootfstype=9p rootflags=version=9p2000.L,trans=virtio,access=any raid=noautodetect ro nokaslr init=/bin/sh -- -c "mount -t tmpfs run /run;mkdir -p /run/virtme/guesttools;/bin/mount -n -t 9p -o ro,version=9p2000.L,trans=virtio,access=any virtme.guesttools /run/virtme/guesttools;exec /run/virtme/guesttools/virtme-init"
...
virtme-init: console is ttyS0
root@(none):/#
</code></pre>
<p>Now we can start running our tests using the new kernel ("CTRL+a x" to close
the session).</p>
<p>We can also assign a new block device to the virtual instance, for example a
1 GB raw disk:</p>
<pre><code>righiandr@xps-13:~/linux$ dd if=/dev/zero of=/tmp/disk.img bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.414057 s, 2.6 GB/s
righiandr@xps-13:~/linux$ kernel-test --disk "disk1=/tmp/disk.img"
[ 0.000000] Linux version 5.3.0-rc3+ (righiandr@xps-13) (gcc version 9.1.0 (Ubuntu 9.1.0-9ubuntu2)) #43 SMP Sun Aug 11 08:46:51 CEST 2019
...
virtme-init: console is ttyS0
root@(none):/# fdisk -l /dev/sda
Disk /dev/sda: 1 GiB, 1073741824 bytes, 2097152 sectors
Disk model: disk
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
root@(none):/#
</code></pre>
<p>Having a separate disk dedicated to the instance is really useful for testing
I/O and filesystem features.</p>
<h2>
<a id="user-content-debugging-kernel-modules" class="anchor" href="#debugging-kernel-modules" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Debugging kernel modules</h2>
<p>A downside of using "virtme --kdir" is that we don't get the kernel
module directory in the usual location /lib/modules/<code>uname -r</code>. For example, if
we try to load xfs (compiled as an external module), we get the following
error:</p>
<pre><code>root@(none):/# modprobe xfs
modprobe: ERROR: ../libkmod/libkmod.c:586 kmod_search_moddep() could not open moddep file '/lib/modules/5.3.0-rc3+/modules.dep.bin'
modprobe: FATAL: Module xfs not found in directory /lib/modules/5.3.0-rc3+
</code></pre>
<p>To resolve this limitation we can follow these simple steps:</p>
<p>Install the kernel modules in a temporary directory in $HOME (for example
/home/righiandr/tmp/kmod):</p>
<pre><code>righiandr@xps-13:~/linux$ make modules_install INSTALL_MOD_PATH=~/tmp/kmod
</code></pre>
<p>Start the instance:</p>
<pre><code>righiandr@xps-13:~/linux$ kernel-test --disk "disk1=/tmp/disk.img"
...
virtme-init: console is ttyS0
root@(none):/#
</code></pre>
<p>Mount (bind) the temporary kernel modules directory to the standard module path
(run this inside the instance):</p>
<pre><code>root@(none):/# mount --bind /home/righiandr/tmp/kmod/lib/modules /lib/modules
</code></pre>
<p>At this point the kernel is able to load external modules as usual, example:</p>
<pre><code>root@(none):/# mount --bind /home/righiandr/tmp/kmod/lib/modules/ /lib/modules
root@(none):/# modprobe xfs
[ 17.257358] SGI XFS with security attributes, no debug enabled
root@(none):/#
</code></pre>
<h2>
<a id="user-content-kernel-debugging-using-virtme-and-crash" class="anchor" href="#kernel-debugging-using-virtme-and-crash" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Kernel debugging using virtme and crash</h2>
<p>Now let's see how we can track down a soft lockup bug in xfs. Let's assume we
have explicitly introduced the bug in the xfs code and we have an easy way to
reproduce it:</p>
<pre><code>root@(none):/# mkfs.xfs /dev/sda
meta-data=/dev/sda isize=512 agcount=4, agsize=65536 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=0
data = bsize=4096 blocks=262144, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=2560, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
root@(none):/# mount /dev/sda /mnt
[ 293.523348] XFS (sda): Mounting V5 Filesystem
[ 293.529194] XFS (sda): Ending clean mount
root@(none):/# df /mnt/
[ 324.109386] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [df:328]
[ 324.110158] Modules linked in: xfs
[ 324.110522] irq event stamp: 64386
[ 324.110896] hardirqs last enabled at (64385): [<ffffffff810038fa>] trace_hardirqs_on_thunk+0x1a/0x20
[ 324.111843] hardirqs last disabled at (64386): [<ffffffff8100391a>] trace_hardirqs_off_thunk+0x1a/0x20
[ 324.112756] softirqs last enabled at (64384): [<ffffffff81e00338>] __do_softirq+0x338/0x435
[ 324.113592] softirqs last disabled at (64377): [<ffffffff810a03ae>] irq_exit+0xbe/0xd0
[ 324.114377] CPU: 0 PID: 328 Comm: df Not tainted 5.3.0-rc3+ #43
[ 324.114940] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 324.115820] RIP: 0010:xfs_fs_statfs+0x87/0x90 [xfs]
[ 324.116290] Code: 38 e8 7d b5 52 e1 48 8d bb d0 01 00 00 e8 71 b5 52 e1 48 8d bb 38 02 00 00 e8 65 b5 52 e1 48 8d bb 20 01 00 00 e8 49 fa af e1 <eb> fe 0f 1f 80 00 00 00 00 0f 1f 44 00 00 85 f6 75 03 31 c0 c3 55
[ 324.118115] RSP: 0018:ffffc90000217df0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
[ 324.118829] RAX: 0000000000000000 RBX: ffff88803ba4e000 RCX: 0000000000000000
[ 324.119506] RDX: ffff88803dc176e0 RSI: 8888888888888889 RDI: 0000000000000246
[ 324.120214] RBP: ffffc90000217df8 R08: 0000000000000000 R09: 0000000000000001
[ 324.120943] R10: 0000000000000001 R11: 0000000000000001 R12: ffff88803b3ec640
[ 324.121673] R13: ffffc90000217e90 R14: 0000000000000002 R15: 0000000000000000
[ 324.122409] FS: 00007fd7c6ccf580(0000) GS:ffff88803dc00000(0000) knlGS:0000000000000000
[ 324.123239] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 324.123832] CR2: 00007fd7c6bc20d0 CR3: 000000003a91a002 CR4: 0000000000360ef0
[ 324.124567] Call Trace:
[ 324.124841] statfs_by_dentry+0x73/0xa0
[ 324.125245] vfs_statfs+0x1b/0xc0
[ 324.125592] user_statfs+0x5b/0xb0
[ 324.125956] __do_sys_statfs+0x28/0x60
[ 324.126357] __x64_sys_statfs+0x16/0x20
[ 324.126766] do_syscall_64+0x65/0x1d0
[ 324.127153] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 324.127679] RIP: 0033:0x7fd7c6bee1fb
[ 324.128053] Code: c3 66 0f 1f 44 00 00 48 8b 05 91 8c 0d 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 f3 0f 1e fa b8 89 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 65 8c 0d 00 f7 d8 64 89 01 48
[ 324.129961] RSP: 002b:00007ffe140f5ee8 EFLAGS: 00000246 ORIG_RAX: 0000000000000089
[ 324.130739] RAX: ffffffffffffffda RBX: 00007ffe140f7f7f RCX: 00007fd7c6bee1fb
[ 324.131473] RDX: 00000000ffffffff RSI: 00007ffe140f5ef0 RDI: 00007ffe140f7f7f
[ 324.132206] RBP: 00007ffe140f5ef0 R08: 00007ffe140f6013 R09: 0000000000000032
[ 324.132940] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe140f5f90
[ 324.133672] R13: 0000000000000000 R14: 000056187bdb37c0 R15: 000056187bdb3780
</code></pre>
<p>At this point the system is unresponsive due to the soft lockup, but at least
we got a kernel oops that already tells us a lot about the bug. However, we are
unable to run any other command inside the instance.</p>
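<p>For reference, a soft lockup like this one can be introduced with something as
trivial as an endless loop added to xfs_fs_statfs() (a hypothetical sketch, not
necessarily the exact change used here; note that the "eb fe" bytes marked in
the Code line above decode to a jump-to-self instruction, which is consistent
with such a loop):</p>
<pre><code>STATIC int
xfs_fs_statfs(struct dentry *dentry, struct kstatfs *statp)
{
	...
	/* Spin forever in kernel context: after ~20s the soft lockup
	 * watchdog reports "BUG: soft lockup" on this CPU.
	 */
	for (;;)
		;
	...
}
</code></pre>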
<p>In cases like this it can be pretty useful to generate a memory dump of the
instance and use a separate tool (i.e., crash) to analyze the state of the
system after the crash.</p>
<p>In our environment we can use the QEMU Machine Protocol (QMP) to generate a
memory dump. If we pay attention to the options used in kernel-test to spin up
the QEMU/KVM instance (using virtme-run) we can see that we appended the
following: "-qmp tcp:localhost:4444,server,nowait". With these options QEMU/KVM
starts a QMP server listening on port 4444 that can accept QMP commands (see
also <a href="https://wiki.qemu.org/Documentation/QMP" rel="nofollow">https://wiki.qemu.org/Documentation/QMP</a>).</p>
<p>To easily generate a memory dump of our test instance we are going to create an
additional helper script called kernel-dump:</p>
<pre><code>righiandr@xps-13:~$ cat bin/kernel-dump
#!/usr/bin/expect -f
if { $argc < 1 } {
    send_user "usage: kernel-dump vmcore.img\n"
    exit
}

set out_file [lindex $argv 0]

spawn telnet localhost 4444

send "{ \"execute\": \"qmp_capabilities\" }\r"
expect "{\"return\": {}}"
send "{\"execute\":\"dump-guest-memory\",\"arguments\":{\"paging\":false,\"protocol\":\"file:$out_file\"}}\r"
expect "{\"return\": {}}"
</code></pre>
<p>Now we can simply use this helper script to generate the memory dump in
/tmp/vmcore.img:</p>
<pre><code>righiandr@xps-13:~/linux$ kernel-dump /tmp/vmcore.img
spawn telnet localhost 4444
{ "execute": "qmp_capabilities" }
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 0, "major": 4}, "package": "Debian 1:4.0+dfsg-0ubuntu5"}, "capabilities": ["oob"]}}
{"return": {}}
{"execute":"dump-guest-memory","arguments":{"paging":false,"protocol":"file:/tmp/vmcore.img"}}
{"timestamp": {"seconds": 1565520296, "microseconds": 278432}, "event": "STOP"}
{"timestamp": {"seconds": 1565520296, "microseconds": 963140}, "event": "DUMP_COMPLETED", "data": {"result": {"total": 1074003968, "status": "completed", "completed": 1074003968}}}
{"timestamp": {"seconds": 1565520296, "microseconds": 963240}, "event": "RESUME"}
{"return": {}}
</code></pre>
<p>At this point we have a regular memory dump (similar to "virsh dump
--memory-only", if you are more familiar with libvirt instances) and we can use
"crash" to analyze it:</p>
<pre><code>righiandr@xps-13:~/linux$ crash /tmp/vmcore.img vmlinux
...
This GDB was configured as "x86_64-unknown-linux-gnu"...
KERNEL: vmlinux
DUMPFILE: /tmp/vmcore.img
CPUS: 4
DATE: Sun Aug 11 12:44:55 2019
UPTIME: 00:05:10
LOAD AVERAGE: 1.55, 0.41, 0.14
TASKS: 86
NODENAME: (none)
RELEASE: 5.3.0-rc3+
VERSION: #43 SMP Sun Aug 11 08:46:51 CEST 2019
MACHINE: x86_64 (1992 Mhz)
MEMORY: 1 GB
PANIC: ""
PID: 0
COMMAND: "swapper/0"
TASK: ffffffff8261b800 (1 of 4) [THREAD_INFO: ffffffff8261b800]
CPU: 0
STATE: TASK_RUNNING
WARNING: panic task not found
crash>
</code></pre>
<p>Now we are able to dig into the system even after the crash; for example, we
can still look at the kernel log (dmesg), list the processes (ps) and even
get a backtrace of a target PID:</p>
<pre><code>crash> dmesg | tail -36
[ 324.109386] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [df:328]
[ 324.110158] Modules linked in: xfs
[ 324.110522] irq event stamp: 64386
[ 324.110896] hardirqs last enabled at (64385): [<ffffffff810038fa>] trace_hardirqs_on_thunk+0x1a/0x20
[ 324.111843] hardirqs last disabled at (64386): [<ffffffff8100391a>] trace_hardirqs_off_thunk+0x1a/0x20
[ 324.112756] softirqs last enabled at (64384): [<ffffffff81e00338>] __do_softirq+0x338/0x435
[ 324.113592] softirqs last disabled at (64377): [<ffffffff810a03ae>] irq_exit+0xbe/0xd0
[ 324.114377] CPU: 0 PID: 328 Comm: df Not tainted 5.3.0-rc3+ #43
[ 324.114940] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 324.115820] RIP: 0010:xfs_fs_statfs+0x87/0x90 [xfs]
[ 324.116290] Code: 38 e8 7d b5 52 e1 48 8d bb d0 01 00 00 e8 71 b5 52 e1 48 8d bb 38 02 00 00 e8 65 b5 52 e1 48 8d bb 20 01 00 00 e8 49 fa af e1 <eb> fe 0f 1f 80 00 00 00 00 0f 1f 44 00 00 85 f6 75 03 31 c0 c3 55
[ 324.118115] RSP: 0018:ffffc90000217df0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
[ 324.118829] RAX: 0000000000000000 RBX: ffff88803ba4e000 RCX: 0000000000000000
[ 324.119506] RDX: ffff88803dc176e0 RSI: 8888888888888889 RDI: 0000000000000246
[ 324.120214] RBP: ffffc90000217df8 R08: 0000000000000000 R09: 0000000000000001
[ 324.120943] R10: 0000000000000001 R11: 0000000000000001 R12: ffff88803b3ec640
[ 324.121673] R13: ffffc90000217e90 R14: 0000000000000002 R15: 0000000000000000
[ 324.122409] FS: 00007fd7c6ccf580(0000) GS:ffff88803dc00000(0000) knlGS:0000000000000000
[ 324.123239] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 324.123832] CR2: 00007fd7c6bc20d0 CR3: 000000003a91a002 CR4: 0000000000360ef0
[ 324.124567] Call Trace:
[ 324.124841] statfs_by_dentry+0x73/0xa0
[ 324.125245] vfs_statfs+0x1b/0xc0
[ 324.125592] user_statfs+0x5b/0xb0
[ 324.125956] __do_sys_statfs+0x28/0x60
[ 324.126357] __x64_sys_statfs+0x16/0x20
[ 324.126766] do_syscall_64+0x65/0x1d0
[ 324.127153] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 324.127679] RIP: 0033:0x7fd7c6bee1fb
[ 324.128053] Code: c3 66 0f 1f 44 00 00 48 8b 05 91 8c 0d 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 f3 0f 1e fa b8 89 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 65 8c 0d 00 f7 d8 64 89 01 48
[ 324.129961] RSP: 002b:00007ffe140f5ee8 EFLAGS: 00000246 ORIG_RAX: 0000000000000089
[ 324.130739] RAX: ffffffffffffffda RBX: 00007ffe140f7f7f RCX: 00007fd7c6bee1fb
[ 324.131473] RDX: 00000000ffffffff RSI: 00007ffe140f5ef0 RDI: 00007ffe140f7f7f
[ 324.132206] RBP: 00007ffe140f5ef0 R08: 00007ffe140f6013 R09: 0000000000000032
[ 324.132940] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe140f5f90
[ 324.133672] R13: 0000000000000000 R14: 000056187bdb37c0 R15: 000056187bdb3780
crash>
crash> ps
PID PPID CPU TASK ST %MEM VSZ RSS COMM
0 0 0 ffffffff8261b800 RU 0.0 0 0 [swapper/0]
> 0 0 1 ffff88803da48000 RU 0.0 0 0 [swapper/1]
> 0 0 2 ffff88803da4b2c0 RU 0.0 0 0 [swapper/2]
> 0 0 3 ffff88803da532c0 RU 0.0 0 0 [swapper/3]
1 0 0 ffff88803d9c8000 IN 0.2 4084 2028 virtme-init
2 0 3 ffff88803d9cb2c0 IN 0.0 0 0 [kthreadd]
3 2 0 ffff88803d9eb2c0 ID 0.0 0 0 [rcu_gp]
4 2 0 ffff88803d9e8000 ID 0.0 0 0 [rcu_par_gp]
5 2 0 ffff88803d9f0000 ID 0.0 0 0 [kworker/0:0]
6 2 0 ffff88803d9f32c0 ID 0.0 0 0 [kworker/0:0H]
8 2 0 ffff88803d9fb2c0 ID 0.0 0 0 [mm_percpu_wq]
9 2 0 ffff88803da00000 IN 0.0 0 0 [ksoftirqd/0]
10 2 2 ffff88803da032c0 ID 0.0 0 0 [rcu_sched]
11 2 0 ffff88803da08000 RU 0.0 0 0 [migration/0]
12 2 0 ffff88803da0b2c0 RU 0.0 0 0 [kworker/0:1]
13 2 0 ffff88803da50000 IN 0.0 0 0 [cpuhp/0]
14 2 1 ffff88803da632c0 IN 0.0 0 0 [cpuhp/1]
15 2 1 ffff88803da60000 IN 0.0 0 0 [migration/1]
16 2 1 ffff88803da68000 IN 0.0 0 0 [ksoftirqd/1]
18 2 1 ffff88803da732c0 ID 0.0 0 0 [kworker/1:0H]
19 2 2 ffff88803da70000 IN 0.0 0 0 [cpuhp/2]
20 2 2 ffff88803da80000 IN 0.0 0 0 [migration/2]
21 2 2 ffff88803da832c0 IN 0.0 0 0 [ksoftirqd/2]
22 2 2 ffff88803dad32c0 ID 0.0 0 0 [kworker/2:0]
23 2 2 ffff88803dad0000 ID 0.0 0 0 [kworker/2:0H]
24 2 3 ffff88803dad8000 IN 0.0 0 0 [cpuhp/3]
25 2 3 ffff88803dadb2c0 IN 0.0 0 0 [migration/3]
26 2 3 ffff88803dae32c0 IN 0.0 0 0 [ksoftirqd/3]
27 2 3 ffff88803dae0000 ID 0.0 0 0 [kworker/3:0]
28 2 3 ffff88803daf0000 ID 0.0 0 0 [kworker/3:0H]
29 2 1 ffff88803daf32c0 IN 0.0 0 0 [kdevtmpfs]
30 2 1 ffff88803dbab2c0 ID 0.0 0 0 [netns]
31 2 2 ffff88803dba8000 IN 0.0 0 0 [kauditd]
32 2 2 ffff88803d0732c0 IN 0.0 0 0 [khungtaskd]
33 2 2 ffff88803d070000 IN 0.0 0 0 [oom_reaper]
34 2 2 ffff88803d078000 ID 0.0 0 0 [writeback]
35 2 3 ffff88803d07b2c0 IN 0.0 0 0 [kcompactd0]
36 2 3 ffff88803d0c32c0 IN 0.0 0 0 [ksmd]
37 2 3 ffff88803d0c0000 IN 0.0 0 0 [khugepaged]
39 2 2 ffff88803d0cb2c0 ID 0.0 0 0 [kworker/u8:1]
43 2 3 ffff88803d138000 ID 0.0 0 0 [kworker/3:1]
48 2 1 ffff88803d150000 ID 0.0 0 0 [kworker/1:1]
131 2 1 ffff88803d370000 ID 0.0 0 0 [kintegrityd]
132 2 0 ffff88803d3932c0 ID 0.0 0 0 [kblockd]
133 2 2 ffff88803d390000 ID 0.0 0 0 [blkcg_punt_bio]
134 2 3 ffff88803d0c8000 ID 0.0 0 0 [tpm_dev_wq]
135 2 1 ffff88803d158000 ID 0.0 0 0 [ata_sff]
136 2 0 ffff88803d15b2c0 ID 0.0 0 0 [md]
137 2 2 ffff88803d128000 ID 0.0 0 0 [edac-poller]
138 2 3 ffff88803d12b2c0 ID 0.0 0 0 [devfreq_wq]
139 2 0 ffff88803d18b2c0 IN 0.0 0 0 [watchdogd]
140 2 2 ffff88803d188000 ID 0.0 0 0 [kworker/2:1]
143 2 1 ffff88803d17b2c0 IN 0.0 0 0 [kswapd0]
144 2 0 ffff88803d168000 ID 0.0 0 0 [kworker/u9:0]
145 2 1 ffff88803d16b2c0 IN 0.0 0 0 [ecryptfs-kthrea]
148 2 1 ffff88803d13b2c0 ID 0.0 0 0 [kthrotld]
149 2 2 ffff88803bcab2c0 ID 0.0 0 0 [acpi_thermal_pm]
150 2 3 ffff88803bca8000 IN 0.0 0 0 [scsi_eh_0]
151 2 2 ffff88803bf88000 ID 0.0 0 0 [scsi_tmf_0]
152 2 1 ffff88803bf8b2c0 ID 0.0 0 0 [nvme-wq]
153 2 3 ffff88803b8a0000 ID 0.0 0 0 [nvme-reset-wq]
154 2 3 ffff88803b8a32c0 ID 0.0 0 0 [nvme-delete-wq]
155 2 3 ffff88803b920000 IN 0.0 0 0 [scsi_eh_1]
156 2 3 ffff88803b9232c0 ID 0.0 0 0 [scsi_tmf_1]
157 2 3 ffff88803b9432c0 IN 0.0 0 0 [scsi_eh_2]
158 2 1 ffff88803b940000 ID 0.0 0 0 [scsi_tmf_2]
159 2 2 ffff88803b9932c0 ID 0.0 0 0 [kworker/u8:2]
161 2 0 ffff88803ba38000 ID 0.0 0 0 [ipv6_addrconf]
173 2 2 ffff88803ba532c0 ID 0.0 0 0 [kstrp]
192 2 2 ffff88803bb9b2c0 ID 0.0 0 0 [charger_manager]
193 2 2 ffff88803bb98000 ID 0.0 0 0 [kworker/2:1H]
213 1 1 ffff88803bb90000 IN 0.3 18312 2804 systemd-udevd
296 1 3 ffff88803bb4b2c0 IN 0.2 4216 2496 bash
299 2 0 ffff88803bb932c0 ID 0.0 0 0 [kworker/0:1H]
302 2 2 ffff88803b708000 ID 0.0 0 0 [xfsalloc]
303 2 0 ffff88803b70b2c0 ID 0.0 0 0 [xfs_mru_cache]
308 2 1 ffff88803bb732c0 ID 0.0 0 0 [kworker/1:2]
312 2 1 ffff88803d3332c0 ID 0.0 0 0 [kworker/1:1H]
321 2 1 ffff88803bb70000 ID 0.0 0 0 [xfs-buf/sda]
322 2 3 ffff88803d3732c0 ID 0.0 0 0 [xfs-conv/sda]
323 2 2 ffff88803d1532c0 ID 0.0 0 0 [xfs-cil/sda]
324 2 0 ffff88803b9db2c0 ID 0.0 0 0 [xfs-reclaim/sda]
325 2 3 ffff88803b9d8000 ID 0.0 0 0 [xfs-eofblocks/s]
326 2 0 ffff88803b9d0000 ID 0.0 0 0 [xfs-log/sda]
327 2 1 ffff88803b9d32c0 IN 0.0 0 0 [xfsaild/sda]
> 328 296 0 ffff88803d330000 RU 0.1 2556 1076 df
crash> bt 328
PID: 328 TASK: ffff88803d330000 CPU: 0 COMMAND: "df"
[exception RIP: xfs_fs_statfs+135]
RIP: ffffffffa008fd17 RSP: ffffc90000217df0 RFLAGS: 00000282
RAX: 0000000000000000 RBX: ffff88803ba4e000 RCX: 0000000000000000
RDX: ffff88803dc176e0 RSI: 8888888888888889 RDI: 0000000000000246
RBP: ffffc90000217df8 R8: 0000000000000000 R9: 0000000000000001
R10: 0000000000000001 R11: 0000000000000001 R12: ffff88803b3ec640
R13: ffffc90000217e90 R14: 0000000000000002 R15: 0000000000000000
CS: 0010 SS: 0018
#0 [ffffc90000217e00] statfs_by_dentry at ffffffff81352353
#1 [ffffc90000217e20] vfs_statfs at ffffffff8135265b
#2 [ffffc90000217e40] user_statfs at ffffffff8135275b
#3 [ffffc90000217e88] __do_sys_statfs at ffffffff813527d8
#4 [ffffc90000217f20] __x64_sys_statfs at ffffffff81352826
#5 [ffffc90000217f30] do_syscall_64 at ffffffff81004745
#6 [ffffc90000217f50] entry_SYSCALL_64_after_hwframe at ffffffff81c00091
RIP: 00007fd7c6bee1fb RSP: 00007ffe140f5ee8 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 00007ffe140f7f7f RCX: 00007fd7c6bee1fb
RDX: 00000000ffffffff RSI: 00007ffe140f5ef0 RDI: 00007ffe140f7f7f
RBP: 00007ffe140f5ef0 R8: 00007ffe140f6013 R9: 0000000000000032
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe140f5f90
R13: 0000000000000000 R14: 000056187bdb37c0 R15: 000056187bdb3780
ORIG_RAX: 0000000000000089 CS: 0033 SS: 002b
crash>
</code></pre>
<p>Looking at all these details we can see that the lockup happened at
xfs_fs_statfs+135. We can use gdb inside crash to list the source code where
the lockup happened; unfortunately, in our case the symbol cannot be
found:</p>
<pre><code>crash> gdb list *(xfs_fs_statfs+135)
No symbol "xfs_fs_statfs" in current context.
gdb: gdb request failed: list *(xfs_fs_statfs+135)
crash>
</code></pre>
<p>This is because xfs_fs_statfs is defined in an external module and crash
doesn't know where to find it, so we can help it find the missing symbols
using the "mod" command:</p>
<pre><code>crash> mod
MODULE NAME SIZE OBJECT FILE
ffffffffa00f5340 xfs 1327104 (not loaded) [CONFIG_KALLSYMS]
crash> mod -s xfs /home/righiandr/tmp/kmod/lib/modules/5.3.0-rc3+/kernel/fs/xfs/xfs.ko
MODULE NAME SIZE OBJECT FILE
ffffffffa00f5340 xfs 1327104 /home/righiandr/tmp/kmod/lib/modules/5.3.0-rc3+/kernel/fs/xfs/xfs.ko
crash>
</code></pre>
<p>Now we can use "gdb list" to look at the exact line of code that caused the
lockup:</p>
<pre><code>crash> gdb set listsize 3
crash> gdb list *(xfs_fs_statfs+135)
0xffffffffa008fd17 is in xfs_fs_statfs (fs/xfs/xfs_super.c:1102).
1101 spin_lock(&mp->m_sb_lock);
1102 while (1);
1103 statp->f_bsize = sbp->sb_blocksize;
crash>
</code></pre>
<p>There it is! In this case the endless while(1) loop after a spin_lock() is
pretty obvious to spot. Sometimes a lockup can be much harder to find, but the
virtualized environment in combination with crash really helps to speed up the
tests and, often, it provides additional information that would have been
impossible to obtain in a typical development/testing environment.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this article we have seen a possible workflow to test and debug the kernel
in a virtualized environment. This makes kernel development safer, faster and
easier, compared to the usual workflow of developing and testing a kernel on
the same development system.</p>
<p>This is all possible simply using QEMU/KVM, virtme and crash with a couple of
small helper scripts.</p>
<p>I hope anyone searching for kernel development tips find this useful. Let me
know your kernel development tips and tricks in the comments below!</p>arighihttp://www.blogger.com/profile/15223521151492879497noreply@blogger.com1tag:blogger.com,1999:blog-4397409626710913610.post-29810382459014174092018-12-12T15:24:00.000+01:002018-12-12T15:24:57.190+01:00Linux: easy keylogger with eBPFIn this article I'd like to show you how we can use eBPF as a tool to learn the kernel.
<br/>
<br/>
<h3>Introduction</h3>
<br/>
<a href="https://lwn.net/Articles/740157/">eBPF</a> is a virtual machine implemented inside the kernel. It gives user-space programs the possibility to inject kernel code that runs in a safe environment (sandbox). The injected code has access to kernel structures like any regular kernel code, but it can't harm or break the system: an in-kernel verifier performs a static analysis of the code and, if the check doesn't pass, the code is rejected with an error before any of it runs.
<br/>
<br/>
<a href="https://github.com/iovisor/bcc">BCC</a> (BPF Compiler Collection) is a front-end to eBPF: it provides a set of very useful tools on top of eBPF that allow you to do amazing things (tracing, profiling, code inspection, etc.).
<br/>
<br/>
<h3>The problem</h3>
<br/>
Let's pretend we don't know anything about the kernel and we want to write a keylogger.
<br/>
<br/>
The very first thing that we need to do is to figure out how the kernel receives keys from the keyboard. Knowing a little bit about how computers work, we may guess that keys are delivered through interrupts.
<br/>
<br/>
So, let's try to use a tool in BCC called <code>funccount.py</code>. This tool counts, at runtime, how many times the function(s) passed as arguments are called in the kernel. It accepts wildcards (similar to file globbing), so let's start by counting all functions whose name matches <code>"*interrupt*"</code> (because I guess the IRQ handler of a keyboard interrupt should be called "something...interrupt...something else"):
<br/>
<br/>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace;
color: #000000; background-color: #eee;
font-size: 12px; border: 1px dashed #999999;
line-height: 14px; padding: 5px;
overflow: auto; width: 100%">
<code style="color:#000000;word-wrap:normal;">
# ./funccount.py '*interrupt*'
</code>
</pre>
While funccount.py is running we press some random keys on the keyboard, then we stop it with CTRL+C and we get an output like the following:
<br/>
<br/>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace;
color: #000000; background-color: #eee;
font-size: 12px; border: 1px dashed #999999;
line-height: 14px; padding: 5px;
overflow: auto; width: 100%">
<code style="color:#000000;word-wrap:normal;">
FUNC COUNT
wait_for_completion_interruptible 1
arch_show_interrupts 1
ww_mutex_lock_interruptible.part.10 14
ahci_handle_port_interrupt 26
ww_mutex_lock_interruptible 31
atkbd_interrupt 70
i915_mutex_lock_interruptible 71
ath9k_btcoex_handle_interrupt 80
ath9k_hw_kill_interrupts 84
ath9k_hw_resume_interrupts 85
__ath9k_hw_enable_interrupts 109
show_interrupts 489
psmouse_interrupt 633
i8042_interrupt 774
mutex_lock_interruptible_nested 875
serio_interrupt 906
note_interrupt 930
add_interrupt_randomness 1030
hrtimer_interrupt 2459
get_next_timer_interrupt 14771
__next_timer_interrupt 55993
generic_smp_call_function_single_interrupt 63830
</code>
</pre>
We've got many interrupt handlers here... but, among those, <code>atkbd_interrupt</code> looks really interesting for our purpose. It's likely the interrupt handler that we were looking for. Now, inspecting the kernel source code we find that the prototype of this function is the <a href="https://elixir.bootlin.com/linux/latest/source/drivers/input/keyboard/atkbd.c#L372">following</a>:
<br/>
<br/>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace;
color: #000000; background-color: #eee;
font-size: 12px; border: 1px dashed #999999;
line-height: 14px; padding: 5px;
overflow: auto; width: 100%">
<code style="color:#000000;word-wrap:normal;">
static irqreturn_t atkbd_interrupt(struct serio *serio, unsigned char data, unsigned int flags)
</code>
</pre>
We got it! It looks like <code>data</code> is what we need to intercept all keys that are pressed on the keyboard.
<br/>
<br/>
To do so we use another tool provided by BCC: <code>trace.py</code>. This tool allows us to trace any function in the kernel and inspect its arguments:
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace;
color: #000000; background-color: #eee;
font-size: 12px; border: 1px dashed #999999;
line-height: 14px; padding: 5px;
overflow: auto; width: 100%">
<code style="color:#000000;word-wrap:normal;">
# ./trace.py 'atkbd_interrupt(struct serio *serio, unsigned char data, unsigned int flags) "data=0x%x" data'
...
PID TID COMM FUNC -
0 0 swapper/3 atkbd_interrupt data=0x1e
0 0 swapper/3 atkbd_interrupt data=0x9e
0 0 swapper/3 atkbd_interrupt data=0x31
0 0 swapper/3 atkbd_interrupt data=0xb1
0 0 swapper/3 atkbd_interrupt data=0x20
0 0 swapper/3 atkbd_interrupt data=0xa0
0 0 swapper/3 atkbd_interrupt data=0x13
0 0 swapper/3 atkbd_interrupt data=0x93
0 0 swapper/3 atkbd_interrupt data=0x12
0 0 swapper/3 atkbd_interrupt data=0x92
0 0 swapper/3 atkbd_interrupt data=0x1e
0 0 swapper/3 atkbd_interrupt data=0x9e
...
</code>
</pre>
And here's our keylogger: just two commands, knowing very little about the kernel and programming in general.
<br/>
<br/>
<b>NOTE</b>: if you test this simple example on your box you may notice that every time you press a key on the keyboard you get two interrupts: one when the key is pressed and another when it is released. The release event returns a value with the most significant bit set: this can be used to distinguish between a "<i>key pressed</i>" event and a "<i>key released</i>" event.
<br/>
<br/>
At this point the only thing left is to log this data somewhere and decode each number (the scancode) into the corresponding key... and you have your keylogger.
<br/>
<br/>
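To sketch that last decoding step in shell (assuming US-layout scancode set 1; only the handful of keys seen in the trace above are mapped here, a real decoder needs the full table):

```shell
#!/bin/bash
# Sketch: decode a few set-1 scancodes, using the most significant bit
# to tell "key pressed" from "key released". Mapping is a tiny subset.
decode_scancode() {
    local code=$(( $1 )) state=press key
    if (( code & 0x80 )); then
        state=release
        code=$(( code & 0x7f ))
    fi
    case $code in
        18) key=e ;;   # 0x12
        19) key=r ;;   # 0x13
        30) key=a ;;   # 0x1e
        32) key=d ;;   # 0x20
        49) key=n ;;   # 0x31
        *)  key='?' ;;
    esac
    echo "$key $state"
}
# The press events traced above (0x1e 0x31 0x20 0x13 0x12 0x1e)
# decode, per the set-1 table, to "a n d r e a":
for c in 0x1e 0x31 0x20 0x13 0x12 0x1e; do
    decode_scancode "$c"
done
```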
Happy hacking!arighihttp://www.blogger.com/profile/15223521151492879497noreply@blogger.com2tag:blogger.com,1999:blog-4397409626710913610.post-65564611928690029642013-08-13T12:11:00.002+02:002013-08-13T19:57:28.096+02:00How to extract a single function from an ELF fileSimple bash script to disassemble a single function from an ELF file:
<pre>
#!/bin/bash
SECTION=$1
IN=$2
i=`nm -S --size-sort $IN | grep "\<$SECTION\>" | \
awk '{print toupper($1),toupper($2)}'`
echo "$i" | while read line; do
start=${line%% *}
size=${line##* }
end=`echo "obase=16; ibase=16; $start + $size" | bc -l`
objdump -d --section=.text \
--start-address="0x$start" \
--stop-address="0x$end" $IN
done
</pre>
We may also want to generate a "binary" dump of the function (i.e., to do a binary copy of the function to a separate file); in this case the script becomes the following:
<pre>
#!/bin/bash
SECTION=$1
IN=$2
i=`nm -S --size-sort $IN | grep "\<$SECTION\>" |
awk '{print toupper($1),toupper($2)}'`
echo "$i" | while read line; do
start=${line%% *}
size=${line##* }
end=`echo "obase=16; ibase=16; $start + $size" | bc -l`
objdump -d --section=.text \
--start-address="0x$start" \
--stop-address="0x$end" $IN | \
grep '[0-9a-f]:' | \
cut -f2 -d: | \
cut -f1-7 -d' ' | \
tr -s ' ' | \
tr '\t' ' ' | \
sed 's/ $//g' | \
sed 's/ /\\x/g' | \
paste -d '' -s | \
sed 's/^/"/' | \
sed 's/$/"/g' | \
sed 's:.*:echo -ne &:' | /bin/bash
done
</pre>
Enjoy!arighihttp://www.blogger.com/profile/15223521151492879497noreply@blogger.com0tag:blogger.com,1999:blog-4397409626710913610.post-7237151229670432842013-05-15T18:54:00.000+02:002013-05-15T20:28:08.375+02:00Linux PERF_EVENTS root exploit - CVE-2013-2094 (quick way to fix it)Recently a critical flaw has been found in the perf events code in Linux.<br />
<br />
The problem is a failure to check a 64-bit variable (cast to a 32-bit int) passed from user space, resulting in an out-of-bounds access of an array in kernel space.<br />
<br />
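To illustrate the class of bug in plain shell arithmetic (this is just a model of the cast, not the kernel code): a huge 64-bit value truncated to a signed 32-bit int wraps around to a small, or even negative, number, slipping past a bounds check and turning into an out-of-range array index:

```shell
#!/bin/bash
# Illustration only: emulate a (u64 -> s32) cast with shell arithmetic.
to_int32() {
    local v=$(( $1 & 0xffffffff ))
    if (( v >= 0x80000000 )); then
        v=$(( v - 0x100000000 ))
    fi
    echo "$v"
}
# A value far beyond any sane array bound survives an "idx < MAX" check:
to_int32 $(( 0x100000000 + 5 ))   # prints 5
# ...and this one turns negative, indexing memory before the array:
to_int32 0xffffffff               # prints -1
```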
Here is the exploit: <a href="http://packetstormsecurity.com/files/121616/semtex.c">http://packetstormsecurity.com/files/121616/semtex.c</a><br />
<br />
The bug has been already fixed mainstream: <a href="http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8176cced706b5e5d15887584150764894e94e02f">http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8176cced706b5e5d15887584150764894e94e02f</a><br />
<br />
But lots of distro kernels are affected by this bug, so they're vulnerable to the root exploit (like all the RH / CentOS 2.6.32-based kernels).<br />
<br />
A temporary solution, while waiting for a new kernel to come out, is to create a kernel module that "injects" the fix into the running kernel, using the kprobes framework.<br />
<br />
Here is the source code of the module:<br />
<a href="https://www.develer.com/~arighi/linux/fix/CVE-2013-2094/perf-bug-fix.tar.gz">https://www.develer.com/~arighi/linux/fix/CVE-2013-2094/perf-bug-fix.tar.gz</a><br />
<br />
This module injects the fix by wrapping the buggy function perf_swevent_init() using the kprobe framework.<br />
<br />
This allows a fix to be deployed very quickly, and it works with any kernel affected by this bug.<br />
<div>
<br />
To compile it (on CentOS - instructions are almost identical in other distros):<br />
<br />
$ sudo yum install kernel-devel<br />
$ tar xvzf perf-bug-fix.tar.gz<br />
$ cd perf-bug-fix<br />
$ make<br />
<br />
How to load the module and apply the fix:<br />
<br />
$ sudo insmod wrapper.ko<br />
<br />
That's it. Now let's test the exploit.<br />
<br />
Without wrapper.ko loaded:<br />
<br />
cpuser1@testdomain1.com [/dev/shm]# id -u<br />
521<br />
cpuser1@testdomain1.com [/dev/shm]# ./a.out<br />
...<br />
-sh-4.1# id -u<br />
0<br />
<br />
With wrapper.ko:<br />
<br />
cpuser1@testdomain1.com [/dev/shm]# ./a.out<br />
a.out: test.c:51: sheep: Assertion `!close(fd)' failed.<br />
Aborted (core dumped)<br />
cpuser1@testdomain1.com [/dev/shm]# id -u<br />
521</div>
arighihttp://www.blogger.com/profile/15223521151492879497noreply@blogger.com0tag:blogger.com,1999:blog-4397409626710913610.post-13632255229655248232011-08-30T11:52:00.000+02:002011-08-30T11:52:56.341+02:00install busybox from source on Samsung GT-I9100If you want a complete set of unix tools on your phone, here are the steps to cross-compile and install busybox from source (<b>there's no need to root the phone and/or install everything using the busybox installer from the market</b>).<br />
<br />
Download the latest ARM gnueabi toolchain from <a href="http://www.codesourcery.com/sgpp/lite/arm/portal/release1803">codesourcery</a>.<br />
<br />
Get the busybox source code from the git repository:<br />
$ git clone git://busybox.net/busybox.git<br />
<br />
Copy my busybox config file (or use your own config if you prefer, this can be just a starting point):<br />
<br />
$ cd busybox<br />
$ wget -O .config <a href="http://www.develer.com/~arighi/android/busybox/config">http://www.develer.com/~arighi/android/busybox/config</a><br />
<br />
<i>[optional]</i> Change the busybox config if you want by running:<br />
$ make menuconfig<br />
<br />
Cross-compile busybox:<br />
$ make oldconfig && make<br />
<br />
At the end of the build the busybox binary should be available as a statically linked ELF for ARM:<br />
$ file busybox<br />
busybox: ELF 32-bit LSB executable, ARM, version 1 (SYSV), statically linked, for GNU/Linux 2.6.16, stripped<br />
<br />
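Before pushing anything to the device it may be worth automating that sanity check; here is a small sketch (check_static_arm is a made-up helper, assuming file(1) is installed):

```shell
#!/bin/bash
# Hypothetical sanity check: refuse to push anything that is not a
# statically linked ARM binary (matches against file(1) output).
check_static_arm() {
    local desc
    desc=$(file -b "$1")
    case $desc in
        *ARM*"statically linked"*) echo "ok: $desc" ;;
        *) echo "not a static ARM binary: $desc" >&2; return 1 ;;
    esac
}
```

For example: check_static_arm busybox && adb push busybox /data/local/tmp/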
Upload the busybox binary to the device (you don't need root access to your phone):<br />
$ adb push busybox /data/local/tmp/<br />
<br />
Now busybox is ready to be used:<br />
$ adb shell /data/local/tmp/busybox CMD<br />
<br />
Example:<br />
$ adb shell /data/local/tmp/busybox lsusb<br />
Bus 001 Device 001: ID 1d6b:0002<br />
Bus 001 Device 002: ID 1519:0020<br />
<br />
=== NOTE: all the following steps are optional ===<br />
<br />
If you have enabled adb root shell access to your phone (e.g., by rooting the phone or by installing my <a href="http://arighi.blogspot.com/2011/08/howto-custom-kernel-on-samsung-galaxy-s.html">custom kernel</a>), you can also install busybox in the /system partition and have all the commands available in the $PATH.<br />
<br />
Remount the /system partition in read-write mode on your phone:<br />
$ adb shell "mount -oremount,rw /dev/block/mmcblk0p9 /system"<br />
<br />
Upload the busybox binary to the /system partition:<br />
$ adb push busybox /system/xbin/busybox<br />
<br />
Install busybox by using busybox itself:<br />
$ adb shell "chmod 755 /system/xbin/busybox"<br />
$ adb shell "/system/xbin/busybox --install -s /system/xbin"<br />
<br />
Remount /system in read-only:<br />
$ adb shell "mount -oremount,ro /dev/block/mmcblk0p9 /system"<br />
<br />
Now all the busybox applets are in your $PATH:<br />
<br />
$ adb shell lsusb<br />
Bus 001 Device 001: ID 1d6b:0002<br />
Bus 001 Device 002: ID 1519:0020arighihttp://www.blogger.com/profile/15223521151492879497noreply@blogger.com0tag:blogger.com,1999:blog-4397409626710913610.post-8171945867649072312011-08-30T00:33:00.001+02:002011-10-05T00:28:07.621+02:00HOWTO: custom kernel on Samsung Galaxy S II I9100Recently I've replaced my Bravo HTC Desire with a new Android phone: a <a href="http://www.samsung.com/global/microsite/galaxys2/html/">Samsung Galaxy S II I-9100</a>. I couldn't stand the stock kernel for too long, so I finally found some spare time to cook a custom kernel, starting from the original Samsung kernel source code <a href="https://opensource.samsung.com/reception/reception_main.do?method=reception_search&searchValue=I9100">GT-I9100_OpenSource_Update2</a>.<br />
<br />
In this post I report (for me to remember later and for those who are interested) all the steps I followed to set up the build environment, cross-compile the custom kernel and flash it into the phone.<br />
<br />
<b>DISCLAIMER: </b><b>I take no responsibility for anything that may go wrong by you following these instructions. </b><b>Proceed at your own risk!</b><br />
<br />
=== Requirements ===<br />
<br />
- A Samsung Galaxy S II (not necessarily rooted! you'll get a root shell when you'll flash the new kernel)<br />
- The latest android <a href="http://developer.android.com/sdk/index.html">SDK</a><br />
- The arm-none-eabi cross-compile toolchain (you can get it from the <a href="https://sourcery.mentor.com/sgpp/lite/arm/portal/release1802">CodeSourcery website</a>)<br />
<br />
=== HOWTO ===<br />
<br />
Download and install the arm toolchain from the CodeSourcery website: be sure that arm-none-eabi-gcc is in your $PATH.<br />
<br />
Get the <a href="https://github.com/arighi/gt-i9100">autobuild script</a>:<br />
$ git clone git://github.com/arighi/gt-i9100.git<br />
<br />
Run the script:<br />
$ ./build-kernel.sh<br />
<br />
The script downloads the "<i>-arighi</i>" <a href="https://github.com/arighi/linux-gt-i9100">kernel source code</a>, an <a href="https://github.com/arighi/initramfs-gt-i9100">initramfs template</a> and builds a new kernel ready to be flashed into the device.<br />
<br />
At the end of the autobuild process the file <b>kernel-gt-i9100-arighi.tar</b> can be used to flash the new kernel to the phone using Odin (search on the web or in the xda-developers forum, there are tons of howtos/tutorials for this).<br />
<br />
=== Results ===<br />
<br />
The score with the Quadrant benchmark is not bad at all (I always got > 4000), but remember that we're cheating during the I/O test, due to the fsync-disable patch.<br />
<br />
Anyway, the overall result looks good enough.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-fkxDkFW-jpM/Tluroqy683I/AAAAAAAAHgM/dkjXZI2PE1k/s1600/device.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="http://4.bp.blogspot.com/-fkxDkFW-jpM/Tluroqy683I/AAAAAAAAHgM/dkjXZI2PE1k/s320/device.png" width="208" /></a></div>
<br />
== Additional notes ===<br />
<br />
- All the custom *.ko files are included in the initramfs to avoid potential errors/problems with the original kernel modules (so it is always possible to flash back the original kernel later; all the old kernel modules are still there, untouched).<br />
<br />
- After you've flashed the -arighi kernel the first time you will also have root access to your device. The initramfs template enables adb root shell (ro.secure == 0), so an adb shell will immediately drop you to a root shell. This means you can also re-flash your device from Linux directly using the <a href="https://github.com/arighi/gt-i9100/blob/master/flash-kernel.sh">flash-kernel.sh</a> script.<br />
<br />
- For the complete list of all the patches applied to this kernel have a look at the git log <a href="https://github.com/arighi/linux-gt-i9100/commits/master">here</a>.<br />
<br />
- IMPORTANT: the <a href="https://github.com/arighi/linux-gt-i9100/commit/924375d3872db6c16894e9d2e7c3f1b408df0e45">fsync-disable patch</a> (enabled in the kernel by default) can increase performance and battery life, but it is dangerous, because it might eat your data!! It makes the software no longer crash-safe, so if you start to randomly kill your apps you may lose some data.<br />
<b><br /></b><br />
<b>[UPDATE: the fsync-disable patch is no more enabled by default in the kernel, to enable it just set CONFIG_FILE_SYNC_DISABLE=y in the kernel .config)]</b>arighihttp://www.blogger.com/profile/15223521151492879497noreply@blogger.com14tag:blogger.com,1999:blog-4397409626710913610.post-9046512260574575632011-01-20T00:09:00.005+01:002011-01-20T00:21:15.680+01:00Android: automated per-uid task groupAndroid is a privilege-separated operating system, in which each application runs with a distinct system identity: the Linux user ID (uid).<br />
<br />
With this patch (<a href="http://www.develer.com/~arighi/android/cm-kernel/sched-automated-per-uid-task-groups.patch">sched-automated-per-uid-task-groups.patch</a>) the kernel automatically creates a distinct task group for each uid (when a process calls set_user()) and places all the tasks that belong to a single uid into the same task group. In this way each application can get a fair amount of the CPU bandwidth (guaranteed by the CFS scheduler), independently of the number of tasks/threads spawned.<br />
<br />
<div><div><div>The patch is against the CyanogenMod's 6.1.1 kernel (2.6.35.10) and I tested it successfully on my HTC Desire (Bravo).</div><div><br />
</div><div>I've used the following testcase:</div><div> - run 4 cpu hogs in background as user app_35 (com.android.email in my case):</div><div><br />
<div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> # su - app_35</span></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> $ for i in `seq 4`; do yes >/dev/null & done</span></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br />
</div></div> - run the <a href="http://www.aurorasoftworks.com/products/quadrant">Quadrant benchmark</a> in parallel and measure the result with and without the patch applied</div><div><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br />
</span></div><div>Without the patch (output of top):</div><div> <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> PID PPID USER STAT VSZ %MEM CPU %CPU COMMAND</span></div><div><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> 6533 123 10070 R 202m 48.6 0 <b>20.0</b> com.aurorasoftworks.quadrant.ui.st</span></div><div><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> 6506 1 10035 R 1128 0.2 0 20.0 yes</span></div><div><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> 6507 1 10035 R 1128 0.2 0 20.0 yes</span></div><div><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> 6508 1 10035 R 1128 0.2 0 20.0 yes</span></div><div><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> 6509 1 10035 R 1128 0.2 0 20.0 yes</span></div><div><br />
</div><div>Benchmark result: <b>676</b></div><div><br />
</div><div> <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">uid 10035 (cpu hog) : 60.0 % cpu quota</span></div><div><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> uid 10070 (benchmark): 20.0 % cpu quota</span></div><div><br />
</div><div>With automated per-uid task group (output of top):</div><div><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> PID PPID USER STAT VSZ %MEM CPU %CPU COMMAND</span></div><div><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> 6784 123 10070 S 209m 51.4 0 <b>50.0</b> com.aurorasoftworks.quadrant.ui.st</span></div><div><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> 6852 1 10035 R 1128 0.2 0 12.5 yes</span></div><div><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> 6853 1 10035 R 1128 0.2 0 12.5 yes</span></div><div><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> 6854 1 10035 R 1128 0.2 0 12.5 yes</span></div><div><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> 6855 1 10035 R 1128 0.2 0 12.5 yes</span></div><div><br />
</div><div>Benchmark result: <b>816</b></div><div><br />
</div><div><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> uid 10035 (cpu hog) : 50.0 % cpu quota</span></div><div><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> uid 10070 (benchmark): 50.0 % cpu quota</span></div><div><br />
</div><div>Total speedup: <b>~1.2</b> (the benchmark is about 20% faster in this case), because with the patch the benchmark gets ~50% of the CPU time, while without it it gets only ~20%, despite the fact that there are two peer applications that should be considered equal from the user's perspective.</div><div><br />
</div></div></div><div><i>This patch is based on the patch "<a href="http://marc.info/?l=linux-kernel&m=128978361700898&w=2">sched: automated per tty task groups</a>" by Mike Galbraith.</i></div>arighihttp://www.blogger.com/profile/15223521151492879497noreply@blogger.com0tag:blogger.com,1999:blog-4397409626710913610.post-46584220638851928692011-01-10T23:28:00.001+01:002011-01-10T23:30:35.961+01:00HOWTO: install a custom kernel on HTC Desire=== Disclaimer ===<br />
<br />
<b>I take no responsibility for anything that may go wrong by you following these instructions. Proceed at your own risk!</b><br />
<br />
I tested this howto with a Bravo HTC Desire, rooted with Unrevoked 3.22.<br />
<br />
=== Requirements ===<br />
<br />
- A rooted <a href="http://www.htc.com/www/product/desire/overview.html">HTC Desire (Bravo)</a><br />
<br />
- The android <a href="http://developer.android.com/sdk/index.html">SDK</a> + <a href="http://developer.android.com/sdk/ndk/index.html">NDK</a> (to get the cross-compile toolchain):<br />
<br />
- The latest <a href="http://developer.htc.com/">HTC Desire kernel</a> (choose "<i>HTC Desire - Froyo MR - 2.6.32 kernel source code</i>")<br />
<br />
- <a href="https://github.com/koush/AnyKernel">koush's AnyKernel</a> template (to generate the update.zip at the end of the build process)<br />
<br />
=== HOWTO ===<br />
<br />
- Prepare the cross-compiler environment (replace /opt/android with the path where you have installed the Android NDK):<br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ export PATH=/opt/android/android-ndk-r5/toolchains/arm-eabi-4.4.0/prebuilt/linux-x86/bin:$PATH</span><br />
<br />
At this point arm-eabi-gcc, as well as other binutils and compiler binaries, should be in your $PATH.<br />
<br />
- Untar the kernel<br />
<br />
- Save the kernel config (in case you want to restore the original kernel config later):<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ adb pull /proc/config.gz</span><br />
<br />
- <i>[optional]</i> Apply the following patches to the kernel:<br />
- <a href="http://www.develer.com/~arighi/android/linux/0001-sync-disable-fsync-fdatasync-sync_file_range-syscall.patch">0001-sync-disable-fsync-fdatasync-sync_file_range-syscall</a><br />
- <a href="http://www.develer.com/~arighi/android/linux/0002-writeback-change-default-dirty-memory-settings.patch">0002-writeback-change-default-dirty-memory-settings</a><br />
- <a href="http://www.develer.com/~arighi/android/linux/0003-sched-replace-CFS-with-the-BFS-scheduler.patch">0003-sched-replace-CFS-with-the-BFS-scheduler</a><br />
<br />
- <i>[optional]</i> Take my kernel configuration:<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ wget -O bravo-2.6.32-gd96f2c0/.config http://www.develer.com/~arighi/android/linux/config</span><br />
<br />
Or use the previously saved config.gz instead.<br />
<br />
- Build the kernel:<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ cd bravo-2.6.32-gd96f2c0/</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ make ARCH=arm CROSS_COMPILE=arm-eabi- oldconfig</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ make ARCH=arm CROSS_COMPILE=arm-eabi-</span><br />
<br />
Now you should find the fresh new kernel, ready to be flashed on your HTC Desire, in arch/arm/boot/zImage.<br />
<br />
- Apply the following patch to koush's AnyKernel source (to fix a syntax error when trying to flash update.zip from ClockworkMod):<br />
- <a href="http://www.develer.com/~arighi/android/anykernel/0001-updater-script-specify-the-mount-options-for-the-sys.patch">0001-updater-script-specify-the-mount-options-for-the-sys</a><br />
<br />
- Replace the zImage in the AnyKernel template with your zImage:<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ cp bravo-2.6.32-gd96f2c0/arch/arm/boot/zImage AnyKernel/kernel/zImage</span><br />
<br />
- Go back to the template directory (you will see three subdirectories: META-INF, kernel & system) and generate the update.zip:<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ zip -r ../update.zip *</span><br />
<br />
- Connect your phone via a USB cable (be sure to turn on USB debugging on your phone: Settings -> Applications -> Development -> USB debugging)<br />
<br />
- Push update.zip and the wireless module to the SD card of your phone:<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ adb push update.zip /sdcard/update.zip</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ adb push bravo-2.6.32-gd96f2c0/drivers/net/wireless/bcm4329_204/bcm4329.ko /sdcard/bcm4329.ko</span><br />
<br />
- Reboot your phone in ClockworkMod recovery (power-on while holding Volume down key and select RECOVERY)<br />
<br />
- In ClockworkMod select "apply sdcard:update.zip" (confirm: choose Yes)<br />
<br />
- Reboot your phone via "reboot system now"<br />
<br />
At this point your custom new kernel should boot.<br />
<br />
=== Fix the bcm4329 wireless module loading without S-OFF ===<br />
<br />
The bcm4329.ko module can't be properly overwritten in the /system partition without turning S-OFF on the device to gain read-write access to /system. However, we can enforce the usage of our new module by binding another writable directory (e.g., /data/local) over /system/lib/modules.<br />
<br />
Prerequisites:<br />
- the latest <a href="http://www.codesourcery.com/sgpp/lite/arm/portal/subscription?@template=lite">ARM toolchain</a> downloadable from the CodeSourcery site<br />
<br />
Download and install the ARM toolchain and be sure that arm-none-linux-gnueabi-gcc is in your $PATH.<br />
<br />
- get the latest version of busybox from git (or download a recent stable version if you prefer):<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ git clone git://busybox.net/busybox.git</span><br />
<br />
Howto:<br />
- use my busybox config file:<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ cd busybox</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ wget -O .config http://www.develer.com/~arighi/android/busybox/config</span><br />
<br />
- cross-compile busybox:<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ make ARCH=arm CROSS_COMPILE=arm-none-linux-gnueabi- oldconfig</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ make ARCH=arm CROSS_COMPILE=arm-none-linux-gnueabi-</span><br />
<br />
- copy the busybox binary into the /data/local directory on your phone:<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ adb push busybox /data/local/busybox</span><br />
<br />
- bind-mount /data/local over /system/lib/modules:<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ adb shell</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ su</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"># cat /sdcard/bcm4329.ko > /data/local/bcm4329.ko</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"># /data/local/busybox mount --bind /data/local /system/lib/modules</span><br />
<br />
After this trick, go to Settings -> Wireless & networks -> Wi-Fi on your phone: the wireless connection should start normally.<br />
<br />
=== Results ===<br />
<br />
Here is my score with this kernel using the Quadrant benchmark: <b>1370!</b><br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRxd-dsJcCHfIlVXjYEDdJn_MexaB4KxP9eZbvTUsJ7ztxUxNa8seqNDwfWapTMT7TlfAdMztV6rZWZNuaSqlMK3b6e8vAEIjk-ii2xD44neoMZ4R3TY2v1p27s_8KFY8Ce_bTpXBnVA/s1600/kernel-custom.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRxd-dsJcCHfIlVXjYEDdJn_MexaB4KxP9eZbvTUsJ7ztxUxNa8seqNDwfWapTMT7TlfAdMztV6rZWZNuaSqlMK3b6e8vAEIjk-ii2xD44neoMZ4R3TY2v1p27s_8KFY8Ce_bTpXBnVA/s320/kernel-custom.png" width="212" /></a></div>arighihttp://www.blogger.com/profile/15223521151492879497noreply@blogger.com6tag:blogger.com,1999:blog-4397409626710913610.post-5871730173799259172009-09-11T18:53:00.000+02:002009-09-11T18:53:20.661+02:00Linux kernel hacking: file notification system with kernel tracepointsAn example of how to use kernel tracepoints to create a simple real-time backup / file change notification system. The article (only in Italian, sorry) is available at <a href="http://stacktrace.it/2009/09/linux-kernel-hacking-real-time-backup-con-i-kernel-tracepoints/">stacktrace.it</a>.arighihttp://www.blogger.com/profile/15223521151492879497noreply@blogger.com0tag:blogger.com,1999:blog-4397409626710913610.post-90621667066229542482009-06-26T14:30:00.005+02:002009-06-26T16:26:52.495+02:00mutt + gmail notifierI really enjoy the power of mutt, and I have to say that I'm not too far from reaching the email Nirvana with it :). OK, it's not the email client for everybody; it's for people who prefer the keyboard to the mouse and love command line interfaces.<br /><br />There's only one missing feature in mutt: a nice way to be notified of new emails. The problem with mutt is that I need to periodically switch to the mutt shell to check for new emails. And I don't even like the crappy notification balloons that cover the useful part of the desktop (e.g., thunderbird). 
A small tray icon could be a solution (and I did it this way for a while, patching mutt), but with an icon I don't immediately see the message I received.<br /><br />This led me to notice a large unused area in the gnome panel at the top (recently I moved from Fluxbox to Gnome, yeah! :) now that I have an ultra-fast SSD I can also afford a fancy desktop environment). So, why not use the top panel to show the subject of the last email I received in my mailbox? Ta-da! The solution: a small python gnome applet that periodically fetches the last unread email from a generic IMAPS folder in gmail and prints the subject to the panel.<br /><br />Here's the code: <a href="http://download.systemimager.org/%7Earighi/gmail-check/">gmail-check</a>; it's very minimalist and designed just for my particular desktop environment, so it may not work in some cases...arighihttp://www.blogger.com/profile/15223521151492879497noreply@blogger.com1tag:blogger.com,1999:blog-4397409626710913610.post-91886450811064403642009-06-08T13:58:00.002+02:002009-06-08T14:16:26.413+02:00New SSD diskI just got a new MTRON MOBI 3000 SSD for my Dell Latitude D430 notebook. It's very small, only 32GB, but it definitely ROCKS!!! I can boot in about 12 seconds, without any deep tuning of the kernel and boot services, but the _responsiveness_ is the most relevant thing, apart from the 100MB/s read/write throughput (which is not so important for a desktop system). The impressive part is the ~5500 iops (IO operations per second) obtained using a workload of 4KB random reads/writes!arighihttp://www.blogger.com/profile/15223521151492879497noreply@blogger.com2tag:blogger.com,1999:blog-4397409626710913610.post-17628907984292684812009-05-14T23:39:00.003+02:002009-05-14T23:50:46.732+02:00Linux kernel hacking: process containersA basic overview of Linux cgroups. 
My article is available at <a href="http://stacktrace.it/2009/05/linux-kernel-hacking-contenitori-di-processi/">stacktrace.it</a> (in Italian).arighihttp://www.blogger.com/profile/15223521151492879497noreply@blogger.com0tag:blogger.com,1999:blog-4397409626710913610.post-45033610813145031992009-05-02T16:00:00.004+02:002009-05-02T16:11:48.075+02:00battery life in bash promptI've just reconfigured my .bashrc to run this bash script, which shows the percentage of battery life at the beginning of the command prompt. Geeze, really nice! :) At this point I can turn off the guidance-power-manager applet and enjoy a faster boot.<pre><br />#!/bin/bash<br />GRAY="1;30"<br />CYAN="0;36"<br />LIGHT_CYAN="1;36"<br />LIGHT_BLUE="1;34"<br />YELLOW="1;33"<br />WHITE="0;1"<br />NO_COLOR="0"<br />LIGHT_RED="1;31"<br />LIGHT_GREEN="1;32"<br />BROWN="0;33"<br /><br />function battery_info()<br />{<br /> BATT_INFO=$(acpi -b | awk -F', ' '{print $2}')<br /> AC_INFO=$(acpi -aB | awk -F': ' '{print $2}')<br /><br /> if [ $AC_INFO = "off-line" ]; then<br /> BATT_PERC=${BATT_INFO:0:${#BATT_INFO}-1}<br /><br /> if [ $BATT_PERC -ge 75 ]; then<br /> COLOR=$LIGHT_GREEN<br /> elif [ $BATT_PERC -le 25 ]; then<br /> COLOR=$LIGHT_RED<br /> else<br /> COLOR=$YELLOW<br /> fi<br /> else<br /> COLOR=$NO_COLOR<br /> fi<br />}<br /><br />PROMPT_COMMAND=battery_info<br />PS1="\[\033[\$(echo -n \$COLOR)m\]\$(echo -n \$BATT_INFO)\<br />\[\033[${NO_COLOR}m\] \u@\h:\[\033[${WHITE}m\]\w\[\033[${NO_COLOR}m\]\$ "<br /></pre>arighihttp://www.blogger.com/profile/15223521151492879497noreply@blogger.com0tag:blogger.com,1999:blog-4397409626710913610.post-78085756963538894082009-04-28T23:24:00.003+02:002009-04-28T23:37:54.799+02:00iozone: buffer overflow in ubuntu jauntyIn the latest Ubuntu Jaunty iozone immediately crashes with a nice *** buffer overflow detected *** message, which makes it practically unusable. 
Fortunately the cause of the bug is very simple: a wrong length used when copying the string returned by gethostname(). I posted a fix <a href="https://bugs.launchpad.net/ubuntu/+source/iozone3/+bug/320615">here</a>.arighihttp://www.blogger.com/profile/15223521151492879497noreply@blogger.com0tag:blogger.com,1999:blog-4397409626710913610.post-44620122402765046642009-04-16T17:17:00.002+02:002009-04-16T17:31:02.735+02:00cgroup: io-throttle controller (v13)A new version of my IO controller for Linux cgroups.<br /><br />LWN.net coverage at <a href="http://lwn.net/Articles/328484/">http://lwn.net/Articles/328484/</a>.arighihttp://www.blogger.com/profile/15223521151492879497noreply@blogger.com0tag:blogger.com,1999:blog-4397409626710913610.post-46213371641164525802009-03-10T16:22:00.020+01:002009-05-14T15:59:21.420+02:00Performance improvement of parallel applications with user-space spinlocksMutex locks provided by the operating system are expensive to create and acquire, and high contention can dramatically reduce or even eliminate the advantage of parallelism.<br /><br />GCC provides some atomic operations, directly mapped to the atomic instructions provided by the underlying hardware. These are the same instructions used to implement the traditional locking primitives (mutexes), but they can also be used to implement user-space locking primitives and save a significant amount of overhead.<br /><br />However, this is not always the best solution: user-space spinlocks have their own drawbacks and overheads. Spinlocks are good when another CPU holds the lock and is likely to release it soon. In general, when there are more threads than physical CPUs, a spinlock simply wastes CPU time until the OS decides to preempt the spinning thread. Moreover, with spinlocks, the CPU spins (really? :)) forever trying to acquire the lock. 
This leads to more power consumption and more heat to be dissipated, so this could be a really _bad_ solution for many embedded / ultra-portable devices (this is the same reason why asynchronous interrupt-driven handlers are better than polling on embedded devices). [ Actually, there are also some memory barrier and out-of-order execution issues, but we will not discuss this topic now... maybe I'll report some details in another post... ].<br /><br />Anyway, in the following example we will see a typical problem where the usage of user-space spinlocks can bring clear performance benefits (however, there is still the power consumption issue, but we don't care about it in this case).<br /><br />By setting the USER_SPINLOCK option in the Makefile it is possible to choose between (0) the traditional pthread_mutex primitives and (1) a custom user-space spinlock implementation.<br /><br />The problem is the classical <a href="http://en.wikipedia.org/wiki/Dining_philosophers_problem">dining philosophers problem</a>.<br /><pre class="code"><br />=== Makefile ===<br /><br />N_CPUS=$(shell getconf _NPROCESSORS_ONLN)<br />CACHELINE_SIZE=$(shell getconf LEVEL1_DCACHE_LINESIZE)<br /><br />USER_SPINLOCK=1<br /><br />TARGET=userspace-spinlock<br /><br />all:<br /> gcc -g -O3 -lpthread -o$(TARGET) \<br /> -DN_CPUS=$(N_CPUS) \<br /> -DCACHELINE_SIZE=$(CACHELINE_SIZE) \<br /> -DUSER_SPINLOCK=$(USER_SPINLOCK) \<br /> $(TARGET).c<br /><br />clean:<br /> rm -f $(TARGET)<br /><br />=== userspace-spinlock.c ===<br /><br />#define _GNU_SOURCE<br /><br />#include <stdio.h><br />#include <stdlib.h><br />#include <pthread.h><br />#include <sched.h><br />#include <errno.h><br />#include <time.h><br />#include <unistd.h><br />#include <sys/types.h><br />#include <sys/syscall.h><br /><br />static pthread_t threads[N_CPUS];<br />static int input[N_CPUS];<br /><br />#if USER_SPINLOCK<br />static int shared[N_CPUS * CACHELINE_SIZE];<br /><br />static inline void lock(int *l)<br />{<br
/> while (__sync_lock_test_and_set(l, 1));<br />}<br /><br />static inline void unlock(int *l)<br />{<br /> *l = 0;<br /><br />}<br />#else /* USER_SPINLOCK */<br />static pthread_mutex_t shared[N_CPUS * CACHELINE_SIZE];<br /><br />static inline void lock(pthread_mutex_t *l)<br />{<br /> pthread_mutex_lock(l);<br />}<br /><br />static inline void unlock(pthread_mutex_t *l)<br />{<br /> pthread_mutex_unlock(l);<br />}<br />#endif /* USER_SPINLOCK */<br /><br />/*<br /> * From GETTID(2):<br /> *<br /> * Glibc does not provide a wrapper for this system call; call it using<br /> * syscall(2).<br /> *<br /> */<br />static inline pid_t gettid(void)<br />{<br /> return syscall(SYS_gettid);<br />}<br /><br />static void *thread(void *arg)<br />{<br /> int n = *(int *)arg;<br /> int first, second;<br /> cpu_set_t cmask;<br /> pid_t pid = gettid();<br /> int i = 1E7;<br /><br /> /* Set CPU affinity */<br /> CPU_ZERO(&cmask);<br /> CPU_SET(n % N_CPUS, &cmask);<br /> if (sched_setaffinity(pid, sizeof(cmask), &cmask) < 0) {<br /> fprintf(stderr,<br /> "could not set cpu affinity to core %d.", n % N_CPUS);<br /> exit(1);<br /> }<br /> /*<br /> * Acquire two locks and avoid deadlock; see also the dining<br /> * philosopher problem.<br /> */<br /> if (n % 2) {<br /> first = (n + 1) % N_CPUS;<br /> second = n;<br /> } else {<br /> first = n;<br /> second = (n + 1) % N_CPUS;<br /> }<br /> while (i) {<br /> lock(&shared[first * CACHELINE_SIZE]);<br /> lock(&shared[second * CACHELINE_SIZE]);<br /> i--;<br /> unlock(&shared[second * CACHELINE_SIZE]);<br /> unlock(&shared[first * CACHELINE_SIZE]);<br /><br /> }<br /> return NULL;<br />}<br /><br />int main(int argc, char **argv)<br />{<br /> struct sched_param sched;<br /> int i;<br /><br /> fprintf(stdout, "running on %d cpus\n", N_CPUS);<br /> sched.sched_priority = sched_get_priority_max(SCHED_RR);<br /> if (sched_setscheduler(0, SCHED_RR, &sched) == -1) {<br /> perror("error setting SCHED_RR");<br /> return -1;<br /> }<br /> for (i 
= 0; i < N_CPUS; i++) {<br /> input[i] = i;<br /> if (pthread_create(&threads[i], NULL, thread, &input[i]) < 0) {<br /> perror("pthread_create failed");<br /> exit(1);<br /> }<br /> }<br /> for (i = 0; i < N_CPUS; i++)<br /> pthread_join(threads[i], NULL);<br /><br /> return 0;<br />}<br /><br /></pre><br />And here some results of a run in my laptop with Intel(R) Core(TM)2 CPU U7600 @ 1.20GHz:<br /><pre><br />=== USER_SPINLOCK=0 (use pthread_mutex) ===<br />$ /usr/bin/time -v sudo ./userspace-spinlock <br />running on 2 cpus<br /> Command being timed: "sudo ./userspace-spinlock"<br /> User time (seconds): 17.01<br /> System time (seconds): 17.09<br /> Percent of CPU this job got: 190%<br /> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:17.91<br /> Average shared text size (kbytes): 0<br /> Average unshared data size (kbytes): 0<br /> Average stack size (kbytes): 0<br /> Average total size (kbytes): 0<br /> Maximum resident set size (kbytes): 0<br /> Average resident set size (kbytes): 0<br /> Major (requiring I/O) page faults: 0<br /> Minor (reclaiming a frame) page faults: 658<br /> Voluntary context switches: 8426<br /> Involuntary context switches: 31<br /> Swaps: 0<br /> File system inputs: 0<br /> File system outputs: 0<br /> Socket messages sent: 0<br /> Socket messages received: 0<br /> Signals delivered: 0<br /> Page size (bytes): 4096<br /> Exit status: 0<br /><br />=== USER_SPINLOCK=1 (use user-space spinlocks) ===<br />$ /usr/bin/time -v sudo ./userspace-spinlock <br />running on 2 cpus<br /> Command being timed: "sudo ./userspace-spinlock"<br /> User time (seconds): 13.12<br /> System time (seconds): 0.04<br /> Percent of CPU this job got: 191%<br /> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.89<br /> Average shared text size (kbytes): 0<br /> Average unshared data size (kbytes): 0<br /> Average stack size (kbytes): 0<br /> Average total size (kbytes): 0<br /> Maximum resident set size (kbytes): 0<br /> Average resident set size (kbytes): 0<br 
/> Major (requiring I/O) page faults: 0<br /> Minor (reclaiming a frame) page faults: 657<br /> Voluntary context switches: 3<br /> Involuntary context switches: 17<br /> Swaps: 0<br /> File system inputs: 0<br /> File system outputs: 0<br /> Socket messages sent: 0<br /> Socket messages received: 0<br /> Signals delivered: 0<br /> Page size (bytes): 4096<br /> Exit status: 0<br /></pre><br /><br />The elapsed time with traditional pthread_mutex locking is 17.91 sec; with user-space spinlocks the execution needs only 6.89 sec! This means a speed-up of ~2.6!<br /><br />As said above, a disadvantage is the spinning of the CPUs, which leads to greater power consumption. This behaviour can also be seen by looking at the voluntary context switches: 8426 in the pthread_mutex case and only 3 with user-space spinlocks (note: if you look at the code you can see that we're running real-time threads, which is the reason for such a low number of context switches). This also means that the whole system is really more reactive if the applications use pthread_mutex primitives, but in cases where reactiveness is not our goal (or we only care about the reactiveness of a few specific applications) we know that we can achieve a good speed-up of our parallel applications using user-space spinlocks.arighihttp://www.blogger.com/profile/15223521151492879497noreply@blogger.com2tag:blogger.com,1999:blog-4397409626710913610.post-81326900765247295882009-01-21T09:58:00.005+01:002009-01-21T10:30:37.483+01:00potential framebuffer deadlockIn the latest kernel (2.6.29-rc2) there's a potential deadlock condition in the frame buffer between fb_ioctl() and fb_mmap().<br /><br />The cause is that fb_mmap() is called with mm->mmap_sem (A) held, and it also acquires fb_info->lock (B); fb_ioctl() takes fb_info->lock (B) and does copy_from/to_user(), which might acquire mm->mmap_sem (A) if a page fault occurs.<br /><br />So we have a classic deadlock condition: a process holds lock A and attempts to obtain lock B, but B is already held by a second process that attempts to lock A.<br /><br />A possible fix is to prevent the deadlock condition, that means "pushing down" the mutex fb_info->lock into the fb_ioctl() implementation and avoiding the occurrence of the page fault with fb_info->lock (B) held. But this also requires defining two basic primitives, i.e. lock/unlock_fb_info(), and using them appropriately inside *all* the framebuffer drivers. For now I've tried to fix at least the main fb_ioctl() function (with the common ioctl ops) that is shared by all the drivers.arighihttp://www.blogger.com/profile/15223521151492879497noreply@blogger.com0tag:blogger.com,1999:blog-4397409626710913610.post-61547868633978452032009-01-21T09:33:00.002+01:002009-01-21T09:55:37.697+01:00systemimager-light: opening a new branch?I wonder if it's worth opening a new SystemImager branch without shipping the standard BOEL kernel anymore. Providing the packages for all the supported architectures requires a *huge* amount of time and it's not possible for me to build everything... :( (also because I don't have all the required architectures). So, probably the best way to proceed is to just remove BOEL and make UYOK the default. UYOK allows creating a boot&install package (kernel+initrd.img) for SystemImager using the kernel shipped in any distribution (preferably the kernel shipped with the distribution we want to install) and an initrd_template (shipped with the SystemImager packages). And with the UYOK-only version the time to build the packages is 10 times faster!<br /><br />I've just opened the new branch locally in my PC using a git repository and pushed the initial release to download.systemimager.org (to get it: git-clone git://download.systemimager.org/local/git/systemimager-light), but the server is down again! :( grrr.... 
it seems we need a newer and more powerful server...arighihttp://www.blogger.com/profile/15223521151492879497noreply@blogger.com1tag:blogger.com,1999:blog-4397409626710913610.post-75204542754248407382008-12-15T21:50:00.013+01:002008-12-16T10:24:37.510+01:00cache line bouncingUsually we don't realize how expensive cacheline bouncing is in parallel systems. Here is a simple example to evaluate the cost of the bouncing.<br /><br />A multi-threaded application uses shared data in some of its threads:<br /><blockquote><pre>struct shared_data_struct {<br /> unsigned int data1;<br /> unsigned int data2;<br />};<br /></pre></blockquote>Suppose data1 is used only by thread1 and data2 is used by thread2. A natural way to optimize it is to pack the data together in order to reduce the size of the application, thus maximizing the amount of memory that fits into the cache.<br /><br />Unfortunately, this inevitably leads to poor performance: if both threads write to their assigned memory location, the cache line must keep bouncing in exclusive state between the L1 data caches (L1D) of the cores/processors, and this generates a big cache coherency overhead.<br /><br />For example, in the Intel Core 2 processor the cacheline size is 64 bytes (this can be retrieved using the command `getconf LEVEL1_DCACHE_LINESIZE` from the shell), so in the example above data1 and data2 share the same L1D cacheline, even though they are apparently using different, independent memory locations.<br /><br />We can measure the cost of the cacheline bounces using oprofile and a simple example. 
In cache-parallel.c (see the code below) we have the same `struct shared_data_struct` with an optional pad, depending on DISTINCT_CACHE_LINES symbol.<br /><br />Let's see what happens without the pad, commenting the #define DISTINCT_CACHE_LINES:<br /><blockquote><pre>Configure oprofile to account L1D cache misses:<br />$ sudo opcontrol --setup --event=L1D_PEND_MISS:500<br /><br />Start oprofile:<br />$ sudo opcontrol -s<br />Using 2.6+ OProfile kernel interface.<br />Reading module info.<br />Using log file /var/lib/oprofile/samples/oprofiled.log<br />Daemon started.<br />Profiler running.<br /><br />Run cache-parallel _without_ the pad in shared_data_struct:<br />$ time ./cache-parallel<br />...<br />real 0m29.274s<br />user 0m47.540s<br />sys 0m0.121s<br /><br />Stop oprofile:<br />$ sudo opcontrol -h<br />Stopping profiling.<br />Killing daemon.<br /><br />And see the results:<br />$ opannotate --source ./cache-parallel | grep data[12]++<br />1752 0.7326 : sd->data1++;<br />237090 99.1361 : sd->data2++;<br />^^^^^^<br />|<br />A lot of misses here!<br /></pre></blockquote>If we add the pad (defining DISTINCT_CACHE_LINES):<br /><blockquote><pre>$ time ./cache-parallel-with-pad<br />...<br />real 0m11.330s<br />user 0m20.686s<br />sys 0m0.037s<br /><br />$ opannotate --source ./cache-parallel-with-pad | grep data[12]++<br />49 43.7500 : sd->data1++;<br />24 21.4286 : sd->data2++;<br />^^^^^^<br />|<br />Cache misses are dramatically reduced now!<br /></pre></blockquote>A speedup of <span style="font-weight: bold;">29.274 / 11.330 = 2.583</span>, in other words the cacheline bouncing effect produced, in this case, a slowdown of <span style="font-weight: bold;">~260%</span>!!!<br /><br />Following the source code of the cache-parallel example:<br /><blockquote><pre>/*<br />* cache-parallel.c<br />*<br />* build with:<br />* gcc -DCACHELINE_SIZE=$(getconf LEVEL1_DCACHE_LINESIZE) -lpthread -Wall \<br />* -g -ocache-parallel cache-parallel.c<br />*/<br /><br />#define 
_GNU_SOURCE<br /><br />#include <stdio.h><br />#include <stdlib.h><br />#include <sched.h><br />#include <unistd.h><br />#include <pthread.h><br />#include <errno.h><br />#include <sys/types.h><br />#include <sys/syscall.h><br /><br />#define unlikely(expr) __builtin_expect(!!(expr), 0)<br />#define likely(expr) __builtin_expect(!!(expr), 1)<br /><br />#define __cacheline_aligned__ __attribute__((__aligned__(CACHELINE_SIZE)))<br /><br />#define LOOPS_MAX 2000000000<br />#define STACK_SIZE 4096<br /><br />/*<br />* From GETTID(2):<br />*<br />* Glibc does not provide a wrapper for this system call; call it using<br />* syscall(2).<br />*<br />*/<br />static inline pid_t gettid(void)<br />{<br />return syscall(SYS_gettid);<br />}<br /><br />/* XXX: comment this to see the effect of the cache line bouncing */<br />#define DISTINCT_CACHE_LINES<br />struct shared_data_struct {<br />unsigned int data1;<br />#ifdef DISTINCT_CACHE_LINES<br />unsigned char pad[CACHELINE_SIZE - sizeof(unsigned int)];<br />#endif<br />unsigned int data2;<br />};<br /><br />static struct shared_data_struct shared_data __cacheline_aligned__;<br /><br />static void dump_schedstats(pid_t pid, pid_t tid)<br />{<br />char buffer[256];<br />char filename[64];<br />FILE *f;<br /><br />snprintf(filename, sizeof(filename),<br />"/proc/%d/task/%d/status", pid, tid);<br />f = fopen(filename, "r");<br />if (unlikely(f == NULL)) {<br />perror("could not read scheduler statistics");<br />exit(1);<br />}<br />while (fgets(buffer, sizeof(buffer), f))<br />fprintf(stdout, "[%d:%d] %s", pid, tid, buffer);<br />fclose(f);<br />}<br /><br />static void *inc_first(void *arg)<br />{<br />struct shared_data_struct *sd = (struct shared_data_struct *)arg;<br />pid_t pid = getpid(), tid = gettid();<br />cpu_set_t cmask;<br />register long i;<br /><br />/* set affinity */<br />CPU_ZERO(&cmask);<br />CPU_SET(0, &cmask);<br />if (unlikely(sched_setaffinity(pid, sizeof(cmask), &cmask) < 0)) {<br />perror("could not set cpu 
affinity for the child.");<br />exit(1);<br />}<br />/* periodically increment first member of shared struct */<br />for (i = 0; i < LOOPS_MAX; i++)<br />sd->data1++;<br />dump_schedstats(pid, tid);<br /><br />return NULL;<br />}<br /><br />static void *inc_second(void *arg)<br />{<br />struct shared_data_struct *sd = (struct shared_data_struct *)arg;<br />pid_t pid = getpid(), tid = gettid();<br />cpu_set_t cmask;<br />register long i;<br /><br />/* set affinity */<br />CPU_ZERO(&cmask);<br />CPU_SET(1, &cmask);<br />if (unlikely(sched_setaffinity(0, sizeof(cmask), &cmask) < 0)) {<br />perror("could not set cpu affinity for current process.");<br />exit(1);<br />}<br />/* periodically increment second member of shared struct */<br />for (i = 0; i < LOOPS_MAX; i++)<br />sd->data2++;<br />dump_schedstats(pid, tid);<br /><br />return NULL;<br />}<br /><br />int main(int argc, char **argv)<br />{<br />void *child_stack;<br />pthread_t child_thr;<br /><br />/* allocate memory for other process to execute in */<br />if (unlikely((child_stack = malloc(STACK_SIZE)) == NULL)) {<br />perror("cannot allocate stack for child");<br />exit(1);<br />}<br /><br />/* create the child */<br />if (unlikely(pthread_create(&child_thr, NULL,<br />&inc_second, &shared_data) < 0)) {<br />perror("pthread_create failed");<br />exit(1);<br />}<br />inc_first((void *)&shared_data);<br />pthread_join(child_thr, NULL);<br /><br />return 0;<br />}<br /></pre></blockquote>arighihttp://www.blogger.com/profile/15223521151492879497noreply@blogger.com1tag:blogger.com,1999:blog-4397409626710913610.post-5510357969411157572008-10-27T14:45:00.002+01:002008-10-27T16:00:00.076+01:00SystemImager @ LinuxDay 2008 in FerraraI presented a quick overview of SystemImager, how it works and typical use cases, at the LinuxDay 2008 in Ferrara. The slides are available <a href="http://download.systemimager.org/%7Earighi/doc/SystemImager-LinuxDay-2008-Ferrara.pdf">here</a>. 
There are also some nice <a href="http://linuxday.ferrara.linux.it/2008/album">pictures</a> of the event.<br /><br />After my talk Andrea Arcangeli presented an interesting talk about recent core kernel features (especially mmu_notifier), which allows <a href="http://kvm.qumranet.com/kvmwiki">KVM</a> to reliably swap guest-mapped pages (without mmu_notifier, pages mapped in a secondary MMU are pinned and cannot be swapped), perform ballooning, and save host memory with KSM (mapping different virtual addresses to common pages shared between different VMs).arighihttp://www.blogger.com/profile/15223521151492879497noreply@blogger.com0tag:blogger.com,1999:blog-4397409626710913610.post-64952582137063914492008-10-15T22:16:00.002+02:002008-10-15T22:17:53.559+02:00a new websiteI've finished writing my new <a href="http://www.dii.unisi.it/%7Erighi/">homepage</a> hosted at dii.unisi.it... a plain and essential website (that is always the best choice IMHO) fully created with <a href="http://www.vim.org/">vim</a>. It has been a long time since I wrote a website from scratch, and vim is still my preferred HTML editor! :)arighihttp://www.blogger.com/profile/15223521151492879497noreply@blogger.com0tag:blogger.com,1999:blog-4397409626710913610.post-10730583082049377172008-10-08T10:35:00.004+02:002009-09-03T16:50:41.186+02:00fine-grained dirty_ratio and dirty_background_ratioA process that writes something to a file generates dirty pages in the page cache. Dirty pages must be kept in sync with their backing store (the file on the block device).<br /><br />In the Linux kernel the frequency of dirty page writeback is controlled by two parameters: vm.dirty_ratio and vm.dirty_background_ratio. 
Both are expressed as a percentage of dirtyable memory, that is, free memory + reclaimable memory (active and inactive pages in the LRU list).<br /><br />The first parameter controls when a process will itself start writing out dirty data; the second controls when the kernel thread [pdflush] must be woken up to start writing out dirty data globally on behalf of the processes (dirty_background_ratio is always less than dirty_ratio; if dirty_background_ratio >= dirty_ratio the kernel automatically sets it to dirty_ratio / 2).<br /><br />Unfortunately, both percentages are int and the kernel doesn't even allow setting them below 5%. This means that on large-memory machines those limits are too coarse. On a machine that has 1GB of dirtyable memory the kernel will start to write back dirty pages in chunks of 50MB (!!!) minimum (with dirty_ratio = 5).<br /><br />Even if this could be fine for batch or server machines, this behaviour could be unpleasant for desktop or latency-sensitive environments, where the large writeback can be perceived as a lack of responsiveness in the whole system.<br /><br />IMHO we really need an interface to define fine-grained limits (to write back small amounts of data, often) and the best way to do this without breaking compatibility with the old interface seems to be introducing a new interface that accepts <a href="http://en.wikipedia.org/wiki/Per_cent_mille">pcm</a> (milli-percent) values.<br /><br />At least this would resolve the problem for today's machines... until 1TB memory servers become popular...arighihttp://www.blogger.com/profile/15223521151492879497noreply@blogger.com2
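The arithmetic above can be sketched quickly. The figures below assume 1GB of dirtyable memory; the 100 pcm (= 0.1%) value is just a hypothetical setting for the proposed milli-percent interface, not an existing kernel knob:

```shell
# Writeback thresholds for 1 GB of dirtyable memory.
dirtyable_kb=$((1024 * 1024))               # 1 GB expressed in KB

# Today's interface: integer percent, 5 is the minimum the kernel accepts.
pct_kb=$((dirtyable_kb * 5 / 100))

# Hypothetical pcm interface: 100 pcm = 0.1%.
pcm_kb=$((dirtyable_kb * 100 / 100000))

echo "dirty_ratio = 5%:      writeback starts at $((pct_kb / 1024)) MB"
echo "dirty_ratio = 100 pcm: writeback starts at $((pcm_kb / 1024)) MB"
```

With a pcm-style knob the same machine could start writing back at roughly 1MB of dirty data instead of 50MB, which is exactly the finer granularity argued for above.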