Wednesday, May 16, 2007

libnosync: are *sync really necessary?

I was asking myself why user applications should care about synchronizing their buffers. I would say it's a task best left to the operating system, which actually knows what is better for the system as a whole. Looking at the manpage of FSYNC(2) we can see that:


NAME
fsync, fdatasync - synchronize a file's complete in-core state with
that on disk

[snip]

DESCRIPTION
fsync copies all in-core parts of a file to disk, and waits until the
device reports that all parts are on stable storage. It also updates
metadata stat information. It does not necessarily ensure that the
entry in the directory containing the file has also reached disk. For
that an explicit fsync on the file descriptor of the directory is also
needed.

[snip]


OK, but... why? I wrote a simple glibc wrapper (see below) that provides "fake" fsync() and fdatasync() - not a fake sync(), so you can keep running the famous `sync; sync; sync` if you're paranoid enough ;-) - and I was impressed by how heavily user applications use these calls... and by the speed-up you get if you disable them.

In fact, if you have a journaled filesystem (hey! otherwise I think you should really consider moving to one!) all those metadata flushes cause a lot of writes to the journal (for example, on ext3 a single fsync() forces *everything* to be written out) and that means a lot of I/O for your PC. It's a disadvantage in terms of power consumption, too.

So, where is the trick?! After some thought I realized that the main reason must be to be *really* sure that the internal metadata of applications (a DBMS, for example), built on top of the filesystem, have been correctly written to the backing store. Everything that implements its own concept of "journal" should use the *sync() functions. Otherwise, if a crash occurs right in the middle of an "important" write, well... on resume the metadata of your filesystem will be fine, but the metadata of the application (mapped into the filesystem data) could end up corrupted. So, for a robust desktop, it's surely better to keep those syscalls enabled.

OK, but is this really important for *all* your applications? For example, I don't think it's important for amarok... try running a simple `strace -qfe trace=fdatasync,fsync amarok`. On my system I count 36 *sync syscalls! And that's too much... BTW I've got nothing against amarok, it's a great application & my favourite music player :-)

Below is the *sync() lib wrapper. Use it (as always, without any warranty) if you want to run your non-critical applications faster. [IDEA] It would be interesting to run your apps with the wrapper and execute a `sync; sync; sync` just before the screensaver kicks in... :-)


/*
* libnosync
*
* Copyright (C) 2007 Andrea Righi
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
*
* Compile:
* gcc -fPIC -Wall -O2 -g -shared -Wl,-soname,libnosync.so.0 \
* -o libnosync.so.0.1 libnosync.c -lc -ldl
*
* Use:
* export LD_PRELOAD=`pwd`/libnosync.so.0.1
*
* Remove:
* unset LD_PRELOAD
*/

#define _GNU_SOURCE
#define __USE_GNU

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <dlfcn.h>
#include <sys/types.h>

#ifdef DEBUG
#define DPRINTF(format, args...) fprintf(stderr, "debug: " format, ##args)
#else
#define DPRINTF(format, args...)
#endif

int wrap_fsync(int fd)
{
	DPRINTF("called fsync/fdatasync on fd = %d\n", fd);
	return 0;
}

int fdatasync(int) __attribute__ ((weak, alias("wrap_fsync")));
int fsync(int) __attribute__ ((weak, alias("wrap_fsync")));

Sunday, May 13, 2007

Thunderbird + Google Calendar

I realized that I really need a calendar integrated in my email client (Thunderbird), and unfortunately the *great* vim + some shell script in cron are not enough... :-)

For a couple of weeks I've been using Lightning with a cool extension called Provider for Google Calendar, which allows bi-directional (r/w) access to Google Calendar directly from Thunderbird. I've also enabled the (free) SMS notification and honestly I have to admit that it's simply great! Now I can read my events, tasks, TODOs, etc. anywhere, using a web browser or my email client, and receive alarms and notifications on my phone.

Saturday, May 12, 2007

LinuxTag 2007 in Berlin

I'll be at LinuxTag 2007 in Berlin. On Thursday (31/05/2007) I'll present a paper about the integration of the BitTorrent protocol into SystemImager, to quickly deploy operating systems in large installations, like HPC clusters, big render farms or complex grid-computing environments.

Friday, May 11, 2007

Linux VM: per-user overcommit policy

I wrote a simple patch that makes it possible to define a per-UID virtual memory overcommit policy.

The configuration is stored in a hash list in kernel space, reachable through /proc/overcommit_uid (there are surely better ways to do it, e.g. via configfs).

Since most of the accesses are reads, concurrent read/write access to the hash list is synchronized using RCU (Read-Copy-Update).

Hash elements are defined using a triple:

uid:overcommit_memory:overcommit_ratio

The overcommit_* values have the same semantics as their respective sysctl variables. If a user is not present in the hash, the default system policy is used (defined by /proc/sys/vm/overcommit_memory and /proc/sys/vm/overcommit_ratio).

Example:

- admin can allocate full memory + swap:

root@host # echo 0:2:100 > /proc/overcommit_uid

- processes belonging to the sshd (uid=100) and ntp (uid=102) users can be quite critical, so they use the same policy as the admin:

root@host # echo 100:2:100 > /proc/overcommit_uid
root@host # echo 102:2:100 > /proc/overcommit_uid

- Others can allocate up to the swap + 60% of the available RAM:

root@host # echo 2 > /proc/sys/vm/overcommit_memory && echo 60 > /proc/sys/vm/overcommit_ratio

The result in the example above is that memory is never overcommitted (due to the value 2 in overcommit_memory) and 40% of the RAM is kept as spare memory, reserved for root processes and critical services only. Normal users can use only 60% of the RAM. So, in conclusion, non-privileged users can never hog the machine.

You can play with per-user overcommit parameters to implement your own VM allocation rules.

This is only a very simple approach to user resource management. If you want a more flexible, complete and powerful approach look at the containers work, a very interesting project actively developed in Linux.