I presented a quick overview of SystemImager, how it works and typical use cases, at LinuxDay 2008 in Ferrara. The slides are available here. There are also some nice pictures of the event.
After my talk, Andrea Arcangeli gave an interesting presentation about recent core kernel features (especially mmu_notifier), which allow KVM to reliably swap out guest-mapped pages (without mmu_notifier, pages mapped in a secondary MMU are pinned and cannot be swapped), to perform ballooning, and to save host memory with KSM (mapping different virtual addresses to common pages shared between different VMs).
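As a rough idea of how a secondary-MMU user hooks into this, here is a minimal sketch of the 2.6.27-era mmu_notifier registration; it only shows the shape of the API, not KVM's actual code:

    /*
     * Minimal sketch of the 2.6.27-era mmu_notifier API (kernel code, to be
     * built as part of a module). Not KVM's actual implementation: it only
     * shows how a secondary-MMU user gets notified when the primary page
     * tables change, so its own mappings can be torn down and the page can
     * be swapped out.
     */
    #include <linux/mm.h>
    #include <linux/mmu_notifier.h>

    static void my_invalidate_page(struct mmu_notifier *mn,
                                   struct mm_struct *mm,
                                   unsigned long address)
    {
            /* Drop the secondary-MMU mapping for 'address' here, so the page
             * is no longer pinned and the VM is free to unmap or swap it. */
    }

    static const struct mmu_notifier_ops my_ops = {
            .invalidate_page = my_invalidate_page,
    };

    static struct mmu_notifier my_mn = {
            .ops = &my_ops,
    };

    static int my_attach(struct mm_struct *mm)
    {
            /* From now on the callback above fires whenever the kernel
             * changes the primary page tables of 'mm'. */
            return mmu_notifier_register(&my_mn, mm);
    }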
Monday, October 27, 2008
Wednesday, October 15, 2008
a new website
Wednesday, October 8, 2008
fine-grained dirty_ratio and dirty_background_ratio
A process that writes something to a file generates dirty pages in the page cache. Dirty pages must be kept in sync with their backing store (the file on the block device).
In the Linux kernel the frequency of dirty page writeback is controlled by two parameters: vm.dirty_ratio and vm.dirty_background_ratio. Both are expressed as a percentage of dirtyable memory, that is, free memory plus reclaimable memory (the active and inactive pages in the LRU lists).
The first parameter controls when a process starts writing out dirty data itself; the second controls when the kernel thread [pdflush] is woken up to start writing out dirty data globally on behalf of all processes (dirty_background_ratio is always kept less than dirty_ratio: if dirty_background_ratio >= dirty_ratio, the kernel automatically sets it to dirty_ratio / 2).
Unfortunately, both percentages are plain integers and the kernel doesn't even allow setting them below 5%. This means that on large-memory machines those limits are far too coarse. On a machine with 1GB of dirtyable memory the kernel will start to write back dirty pages in chunks of 50MB (!!!) minimum (with dirty_ratio = 5).
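As a rough illustration, the following userspace sketch estimates the writeback threshold using the definition above, approximating dirtyable memory as MemFree + Active + Inactive from /proc/meminfo; it is only an approximation of what the kernel computes internally, not the kernel's own code:

    /*
     * Rough userspace approximation of the writeback threshold described
     * above: dirtyable ~= MemFree + Active + Inactive (from /proc/meminfo),
     * threshold = dirtyable * dirty_ratio / 100. A sketch, not the kernel's
     * internal computation.
     */
    #include <stdio.h>
    #include <string.h>

    static unsigned long meminfo_kb(const char *key)
    {
            FILE *f = fopen("/proc/meminfo", "r");
            char line[128];
            unsigned long val = 0;

            if (!f)
                    return 0;
            while (fgets(line, sizeof(line), f)) {
                    if (!strncmp(line, key, strlen(key))) {
                            sscanf(line + strlen(key), " %lu", &val);
                            break;
                    }
            }
            fclose(f);
            return val;
    }

    int main(void)
    {
            FILE *f = fopen("/proc/sys/vm/dirty_ratio", "r");
            unsigned int ratio = 0;
            unsigned long dirtyable_kb;

            if (f) {
                    fscanf(f, "%u", &ratio);
                    fclose(f);
            }
            dirtyable_kb = meminfo_kb("MemFree:") +
                           meminfo_kb("Active:") + meminfo_kb("Inactive:");

            /* With 1GB of dirtyable memory and dirty_ratio = 5 this is ~50MB. */
            printf("dirty_ratio = %u%%, writeback threshold ~= %lu MB\n",
                   ratio, dirtyable_kb * ratio / 100 / 1024);
            return 0;
    }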
This may be fine for batch or server machines, but it can be unpleasant on desktops or in latency-sensitive environments, where a large writeback can be perceived as a lack of responsiveness of the whole system.
IMHO we really need an interface to define fine-grained limits (to write back small amounts of data, often), and the best solution that doesn't break compatibility with the old interface seems to be a new interface that accepts pcm (milli-percent) values.
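For example (the dirty_ratio_pcm name below is purely hypothetical, just to illustrate the granularity gain; it is not an existing sysctl), with milli-percent values the minimum writeback chunk on the 1GB machine above drops from ~50MB to ~1MB:

    /*
     * Illustration of the granularity gain with a hypothetical milli-percent
     * (pcm) knob: 100 pcm = 0.1%. The dirty_ratio_pcm variable is made up
     * for this example, it is not a real sysctl.
     */
    #include <stdio.h>

    int main(void)
    {
            unsigned long dirtyable_kb = 1024 * 1024;  /* 1GB of dirtyable memory */
            unsigned int dirty_ratio = 5;              /* current minimum: 5% */
            unsigned int dirty_ratio_pcm = 100;        /* hypothetical: 100 pcm = 0.1% */

            printf("percent interface: %lu MB minimum chunk\n",
                   dirtyable_kb * dirty_ratio / 100 / 1024);         /* ~51 MB */
            printf("pcm interface:     %lu MB minimum chunk\n",
                   dirtyable_kb * dirty_ratio_pcm / 100000 / 1024);  /* ~1 MB */
            return 0;
    }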
At least this would solve the problem for today's machines... until 1TB memory servers become popular...
Tuesday, October 7, 2008
SystemImager @ CINECA
An article I wrote (in Italian only... sorry) for CINECA news about SystemImager and the advantages of the BitTorrent transport, documenting the installation of the whole BCX cluster (1290 nodes).
Saturday, October 4, 2008
cgroup I/O bandwidth controller results
Some experimental results of the tests I ran on my box with the latest version of my cgroup-io-throttle patchset against the -mm kernel (2.6.27-rc5-mm1).
The goal of this test is to demonstrate the effectiveness of applying a throttling controller to enhance the IO performance predictability in a shared system.
The graph highlights the bursty behaviour of the IO rate with plain CFQ IO scheduling (red line), and the smoother, more contained behaviour using cgroup-io-throttle. We can also see the differences between the leaky bucket (green line) and token bucket (blue line) policies: the first is even smoother and gives a better guarantee of respecting the IO limit (hard limit); the second allows a small degree of irregularity (soft limit), but is better in terms of efficiency, especially at high IO rates.
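To make the difference concrete, here is a minimal userspace sketch of the token bucket idea (illustrative only, not the cgroup-io-throttle kernel code): tokens accumulate at the configured rate up to a burst size, so short bursts above the average rate are allowed, while a leaky bucket policy would instead sleep just enough to keep the instantaneous rate at or below the limit.

    /*
     * Minimal token bucket sketch (userspace, illustrative only): IO is
     * allowed while tokens are available; tokens refill at 'rate' bytes/sec
     * and are capped at 'burst' bytes, which bounds the allowed burstiness.
     */
    #include <stdio.h>
    #include <time.h>

    struct token_bucket {
            double tokens;          /* currently available bytes */
            double rate;            /* refill rate in bytes/sec (the IO limit) */
            double burst;           /* bucket depth: maximum burst in bytes */
            struct timespec last;   /* time of the last refill */
    };

    /* Refill according to elapsed time, then try to consume 'bytes'.
     * Returns 1 if the IO may proceed now, 0 if the caller should wait. */
    static int tb_consume(struct token_bucket *tb, double bytes)
    {
            struct timespec now;
            double elapsed;

            clock_gettime(CLOCK_MONOTONIC, &now);
            elapsed = (now.tv_sec - tb->last.tv_sec) +
                      (now.tv_nsec - tb->last.tv_nsec) / 1e9;
            tb->last = now;

            tb->tokens += elapsed * tb->rate;
            if (tb->tokens > tb->burst)
                    tb->tokens = tb->burst;   /* soft limit: burstiness is bounded */

            if (tb->tokens < bytes)
                    return 0;                 /* over the limit: throttle */
            tb->tokens -= bytes;
            return 1;
    }

    int main(void)
    {
            struct token_bucket tb = {
                    .tokens = 4 << 20,        /* start with a full bucket */
                    .rate   = 1 << 20,        /* limit: 1 MB/s */
                    .burst  = 4 << 20,        /* allow bursts up to 4 MB */
            };
            clock_gettime(CLOCK_MONOTONIC, &tb.last);

            /* A leaky bucket would never allow 4 MB at once at 1 MB/s; a token
             * bucket does, because the accumulated tokens cover the burst. */
            printf("4MB burst allowed: %d\n", tb_consume(&tb, 4 << 20));
            return 0;
    }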
See my previous post for an overview of the advantages this controller could provide.