Sunday, March 25, 2007

disk I/O per-process accounting

A common problem in Linux is finding the most I/O-intensive process when the system is under heavy disk activity. In some cases you may want to kill the crazy process that caused this condition.

A lot of Linux tools can deliver generic stats for your system: top, sar, dstat, iostat, vmstat, ... but unfortunately none of them is able to show the disk activity done by each single process.

The following kernel patch enables userspace tools to access per-process I/O statistics (WARNING: I tested it only against vanilla 2.6.18.3!!!):

--- include/linux/sched.h.orig 2007-03-25 21:42:50.000000000 +0200
+++ include/linux/sched.h 2007-03-25 21:42:56.000000000 +0200
@@ -990,6 +990,12 @@
struct rcu_head rcu;

/*
+ * disk I/O accounting informations
+ */
+ unsigned long long acct_disk_read;
+ unsigned long long acct_disk_write;
+
+ /*
* cache last used pipe for splice
*/
struct pipe_inode_info *splice_pipe;
--- block/ll_rw_blk.c.orig 2007-03-25 18:05:51.000000000 +0200
+++ block/ll_rw_blk.c 2007-03-25 18:12:51.000000000 +0200
@@ -2586,6 +2586,12 @@
disk_round_stats(rq->rq_disk);
rq->rq_disk->in_flight++;
}
+
+ if (rw == READ) {
+ current->acct_disk_read += nr_sectors;
+ } else {
+ current->acct_disk_write += nr_sectors;
+ }
}

/*
--- fs/proc/array.c.orig 2007-03-25 18:13:07.000000000 +0200
+++ fs/proc/array.c 2007-03-25 18:15:00.000000000 +0200
@@ -412,7 +412,7 @@

res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
%lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
-%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu %llu\n",
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu %llu %llu %llu\n",
task->pid,
tcomm,
state,
@@ -457,7 +457,9 @@
task_cpu(task),
task->rt_priority,
task->policy,
- (unsigned long long)delayacct_blkio_ticks(task));
+ (unsigned long long)delayacct_blkio_ticks(task),
+ task->acct_disk_read,
+ task->acct_disk_write);
if(mm)
mmput(mm);
return res;
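
To try it, apply the patch from the top of the kernel source tree, then rebuild and reboot the kernel as usual (the patch file name below is just an example):

$ cd /usr/src/linux-2.6.18.3
$ patch -p0 < disk-io-accounting.patch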

The patch adds two entries at the end of the process status array (see /usr/src/linux/fs/proc/array.c):
  1. the I/O read activity of the process
  2. the I/O write activity of the process
You can access them via the proc filesystem; the status array of each process is in /proc/[pid]/stat (see `man 5 proc`).

For example, the following command shows the "top 10" most I/O-intensive processes on my system:

$ cat /proc/[0-9]*/stat | awk '{print $2 ":" $43 + $44}' | sort -rn -t : -k 2 | head
(pdflush):275240
(reiserfs/0):179064
(thunderbird-bin):74376
(cupsd):18904
(firefox-bin):15640
(Xorg):13632
(netstat):13512
(gaim):9096
(kswapd0):6032
(syslog-ng):4568

As expected, in first place there's pdflush (the worker_thread that writes back filesystem data), followed by the reiserfs/0 worker_thread... but obviously you can't kill them, they're kernel threads... so in my case the most I/O-intensive userspace process is thunderbird! ;-)

You can also write your own custom top-like userspace tools to monitor the I/O rate of each process, or a program to check whether your processes are doing more reads or writes, etc...
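
For example, a rough top-like sketch in Python could look like this (it assumes the patched kernel above and uses the same field positions as the awk one-liner; error handling and output format are kept minimal on purpose):

#!/usr/bin/env python
#
# rough sketch: sum the two extra counters appended to /proc/[pid]/stat
# by the patch above and print the 10 most I/O intensive processes
# (like the awk one-liner, it assumes no spaces in the process name)

import glob

stats = []
for path in glob.glob('/proc/[0-9]*/stat'):
    try:
        fields = open(path).read().split()
    except IOError:
        continue    # the process exited while we were scanning
    comm = fields[1]                           # "(name)"
    io = int(fields[42]) + int(fields[43])     # acct_disk_read + acct_disk_write
    stats.append((io, comm))

stats.sort(reverse=True)
for io, comm in stats[:10]:
    print "%s:%d" % (comm, io)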

Wednesday, March 7, 2007

A quite old but very nice howto about basic kernel bug hunting techniques:

Tuesday, March 6, 2007

Excellent explanation of SVN merging. SVN merge is a good way to move code changes between different directories within a repository, and it's often needed when the same fixes have to be applied to different branches.
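
For instance, to port a single fix from trunk into a release branch you can do something like this (repository URL and revision numbers are made up, just to show the idea):

$ cd my-release-branch-working-copy
$ svn merge -r 1233:1234 http://svn.example.com/repo/trunk
$ svn commit -m "merged r1234 from trunk"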

Monday, March 5, 2007

weekend at Marmoraia


Yesterday, a visit to the beautiful church of Marmoraia, a Romanesque church of the XI century in the hills of the Montagnola Senese, with a wonderful walk in the nearby chestnut woods...

Saturday, March 3, 2007

multi-threaded "cat-like" command for web pages...

Here follows a very useful script that dumps the HTML of one or more web links, given as arguments, to standard output. If more than one link is passed it spawns a thread per link (serializing the dumps to stdout with a mutex). The threaded approach reduces the average wait for the connection replies and strongly improves performance when we need to download a lot of pages at the same time (e.g. I usually use this script to dump and grep the Linux kernel changelogs directly from the web...).

BTW: python is great! ;-)

#!/usr/bin/env python

import sys, urllib
from threading import Thread, Lock

class webThread(Thread):
    def __init__(self, url):
        self.url = url
        Thread.__init__(self)

    def run(self):
        # download the page
        remotefile = urllib.urlopen(self.url)
        data = remotefile.read()
        remotefile.close()

        # dump it to stdout in mutual exclusion
        stdout_mutex.acquire()
        print "=== %s ===" % self.url
        print data
        stdout_mutex.release()

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print "usage: %s <url> [<url>...]" % sys.argv[0]
        sys.exit(1)
    else:
        threads = []
        stdout_mutex = Lock()
        sys.stdout.flush()
        # spawn a downloader thread for each URL
        for url in sys.argv[1:]:
            t = webThread(url)
            t.start()
            threads.append(t)
        # wait for all the downloads to complete
        for t in threads:
            t.join()
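
For example, to grep a couple of kernel changelogs in one shot (the script name and URLs are just an example):

$ ./webcat.py \
    http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.18.3 \
    http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.18.4 | grep -i oops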

Thursday, March 1, 2007

interactive support for si_psh

Today I improved my favorite distributed shell, si_psh (part of SystemImager), by adding interactive support.

I need it on the BCX cluster when I have to run multiple commands on the same subset of nodes. Without interactive support I used to edit the command line of the previous command in the shell, but with long commands this is not practical enough...

I also discovered that ssh (maybe via glibc, dunno...) knows whether its stdin is opened on a terminal or not, and if it doesn't find a valid terminal it's not possible to catch its stderr inside a perl wrapper script! The problem is that when si_psh runs interactively stdin is not opened on the terminal, but inside the perl script, which reads the user commands (like a typical shell). So I wasn't able to get the stderr of the spawned ssh sessions...

It's possible to work around this by spawning the ssh process via exec. For example, in this case @out contains both stdout and stderr:

my @out = `exec 2>&1 ssh ...`;

While this doesn't work (@out doesn't contain stderr):

my @out = `ssh ... 2>&1`;