Thursday, January 17, 2008

Linux: I/O throttling (again)

I've improved my per-task I/O throttling patch to support per-uid/gid I/O throttling. As reported in the patch description:

Allow to limit the I/O bandwidth for specific uid(s) or gid(s) imposing
additional delays on those processes that exceed the limits defined in a
configfs tree.

A typical use of this patch could be on a shared Linux system under heavy load caused by I/O-intensive applications. In this scenario it's possible to assign a different share of the available I/O bandwidth to each group or user (see also fair sharing, I/O shaping, etc.): for example 5MB/s to group A (students), 20MB/s to group B (professors), unlimited bandwidth for user C (sysadmin), etc.

But a vastly more interesting approach would be to implement control group (cgroup) based I/O throttling... and I've just started to work on this! ;-)

Sunday, January 13, 2008

Linux: per-task I/O throttling

I've posted a patch on the LKML that allows limiting the I/O bandwidth per task via the /proc filesystem. Writing a value > 0 to /proc/PID/io_throttle sets the upper bound of the I/O bandwidth (in 512-byte sectors per second) usable by the process PID.
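Since the limit is expressed in 512-byte sectors per second, a small helper can make the unit conversion explicit. The following is only a sketch: the function names and the `proc_root` parameter are made up for illustration, and 1 MB is assumed to mean 1024 * 1024 bytes. The /proc/PID/io_throttle file exists only on a kernel with the patch applied.

```python
import os

SECTOR_SIZE = 512  # bytes per sector, the unit expected by io_throttle


def mb_per_sec_to_sectors(mb_per_sec):
    """Convert MB/s to sectors per second (assumes 1 MB = 1024 * 1024 bytes)."""
    return int(mb_per_sec * 1024 * 1024) // SECTOR_SIZE


def set_io_throttle(pid, mb_per_sec, proc_root="/proc"):
    """Write the limit for `pid`; return False if the file is not there.

    `proc_root` is a hypothetical parameter, only useful for testing.
    """
    path = os.path.join(proc_root, str(pid), "io_throttle")
    if not os.path.exists(path):
        # Kernel without the patch, or the process has already exited.
        return False
    with open(path, "w") as f:
        f.write("%d\n" % mb_per_sec_to_sectors(mb_per_sec))
    return True
```

For example, `set_io_throttle(os.getpid(), 5)` would try to limit the current process to 5 MB/s, and simply return False on an unpatched kernel.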

The patch itself is not really useful, since the same result can be obtained with ionice and a good I/O scheduler (like CFQ), but it's a very simple proof of concept showing that it's possible to implement a kind of UID/GID (or even process-container) based policy for I/O bandwidth shaping (like network bandwidth shaping).

Anyway, right now I'm running my new 2.6.24-rc7-io-throttle kernel and using the following script to throttle the I/O consumption of my backup, which can now run in the background with a very small impact on my other applications. ;-)

WARNING: obviously this script requires my kernel patch to work...

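$ cat ~/bin/iothrottle
#!/bin/sh
# usage: iothrottle RATE CMD [ARGS...]  (RATE in 512-byte sectors per second)

[ $# -lt 2 ] && echo "usage: $0 RATE CMD" >&2 && exit 1

rate=$1
shift
# start the command in background, keeping its arguments intact
"$@" &
pid=$!
# kill the command if this script gets interrupted
trap 'kill -9 $pid' INT TERM
# set the limit only if the kernel exposes the throttling interface
[ -e /proc/$pid/io_throttle ] && echo "$rate" >/proc/$pid/io_throttle
wait $pid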

Saturday, January 5, 2008

PyGFS: implementing a distributed filesystem in Python

In this post I try to explain how to implement a secure and robust distributed filesystem in user-space with Python.

The advantages of user-space are many: no kernel modification, no OS crashes due to buggy code, easy debugging, etc. Moreover, from the development point of view, in user-space it's possible to exploit all the nice features provided by the user-space libraries! It means that with a few lines of code we can provide a lot of interesting features.

So, let's see some potential requirements for our filesystem:
  • the filesystem must support a complete set of standard POSIX APIs,
  • as a distributed filesystem it must provide data accessibility to remote hosts,
  • it must be resilient to hardware or network failures,
  • it must be secure (it must provide authentication, authorization and encryption mechanisms to provide secure access over insecure networks).
Even if these requirements seem to call for a long-term project, it's possible to satisfy all of them with a few lines of code. Let's see how.

User-space accessibility is provided by FUSE, which allows implementing a full POSIX filesystem without any kernel changes (it provides all the kernel APIs required to register a filesystem without writing any kernel-space code). FUSE also provides a secure method for non-privileged users to mount their own filesystems.

A distributed filesystem also needs a communication mechanism (how to send data to the remote hosts). An interesting project that could help us here is Pyro. Pyro lets us skip the development of a new network communication protocol, since it provides an elegant and easy-to-use object-oriented form of RPC. It also optionally supports encryption based on x509 certificates, which covers our security requirement.

At this point the real filesystem implementation is quite easy: we can use a simple client-server approach like NFS.

The client implements all the POSIX filesystem operations defined by the FUSE interface and calls the equivalent OS routines on the remote server (using Pyro RPC); the server executes the OS procedures on the back-end filesystem and passes back to the client the same result returned by the OS syscall (executed on the server filesystem).
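The forwarding idea can be sketched without any FUSE or Pyro code. The example below is not the PyGFS source: `BackendServer` and `Client` are hypothetical names, only three operations are shown, and the `server` object is a plain local object where the real setup would use a remote proxy. The path check is a minimal guard added for the sketch.

```python
import os


class BackendServer:
    """Runs OS routines on a back-end directory (hypothetical sketch)."""

    def __init__(self, root):
        self.root = os.path.abspath(root)

    def _full(self, path):
        # Map a filesystem path like "/a/b" under the back-end root.
        full = os.path.normpath(os.path.join(self.root, path.lstrip("/")))
        # Refuse paths that would escape the exported directory.
        if full != self.root and not full.startswith(self.root + os.sep):
            raise PermissionError(path)
        return full

    def getattr(self, path):
        st = os.lstat(self._full(path))
        return {"st_mode": st.st_mode, "st_size": st.st_size}

    def readdir(self, path):
        return sorted(os.listdir(self._full(path)))

    def read(self, path, size, offset):
        with open(self._full(path), "rb") as f:
            f.seek(offset)
            return f.read(size)


class Client:
    """Forwards each filesystem operation to the server object."""

    def __init__(self, server):
        # In a distributed setup this would be a remote proxy, not a local object.
        self.server = server

    def getattr(self, path):
        return self.server.getattr(path)

    def readdir(self, path):
        return self.server.readdir(path)

    def read(self, path, size, offset):
        return self.server.read(path, size, offset)
```

Each client method simply returns whatever the server returned, which is what lets the client report the same result (or the same error) as the OS call executed on the server.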

Moreover, to provide reliability features it's possible to exploit the robust exception handling statements in Python. In this way we can detect all the communication failures and call an appropriate event handler to re-issue the operations when the server becomes reachable again. We can also increase reliability by using both client-side and server-side file handles; in this way each file handle on the client side can be mapped to a different file handle on the server side. If the server goes down, the mapping between the two file handles is simply re-initialized, and this allows the clients to transparently continue their operations as if the server had never stopped.
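A minimal sketch of this handle mapping follows. It is not taken from the PyGFS source: `ServerUnavailable`, `HandleTable` and the `server` interface (`open` and `read` methods) are assumptions made for the example, and a real client would catch whatever exception its RPC layer raises on a connection failure. Reads carry an explicit offset, so re-issuing one after a failure is harmless.

```python
import time


class ServerUnavailable(Exception):
    """Hypothetical error raised when the remote server cannot be reached."""


class HandleTable:
    """Maps stable client-side handles to server-side handles that may change."""

    def __init__(self, server, retry_delay=1.0, max_retries=5):
        self.server = server
        self.retry_delay = retry_delay
        self.max_retries = max_retries
        self._next = 0
        self._files = {}  # client handle -> [path, flags, server handle]

    def open(self, path, flags):
        server_fh = self._retry(lambda: self.server.open(path, flags))
        client_fh = self._next
        self._next += 1
        self._files[client_fh] = [path, flags, server_fh]
        return client_fh

    def read(self, client_fh, size, offset):
        def op():
            entry = self._files[client_fh]
            if entry[2] is None:
                # The mapping was reset: reopen the file on the server.
                entry[2] = self.server.open(entry[0], entry[1])
            return self.server.read(entry[2], size, offset)
        return self._retry(op)

    def _retry(self, op):
        for _ in range(self.max_retries):
            try:
                return op()
            except ServerUnavailable:
                # Server handles are meaningless after a restart: drop them all.
                for entry in self._files.values():
                    entry[2] = None
                time.sleep(self.retry_delay)
        raise ServerUnavailable("server still unreachable")
```

The application only ever sees the client-side handle, so a server restart in the middle of a read shows up as a short delay instead of an error, as long as the server comes back within the retry budget.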

So, I tried to implement a real example of this filesystem and I've called it PyGFS (it should stand for something like: Python grid file system... in the future I'd like to improve it with multiple servers to mirror or unify several filesystems on different hosts, just like a real grid filesystem...). The source code is available to all who are interested in it... If you have ideas for new features let me know... ;-)

SystemImager @ LinuxDay 2007 in Bologna

The slides of my talk "SystemImager and BitTorrent: a peer-to-peer approach to large scale OS deployment" presented at LinuxDay 2007 in Bologna are available here (the site is in Italian but the slides are in English).

They're pretty much the same slides as on the SystemImager website, with a few small changes.