Wednesday, February 28, 2007

memory leak bug with IA32 emulation on x86_64

It seems that the kernels of some recent distributions (RHEL4 or SLES9, for example) are affected by a leak in the accounting of committed memory, i.e. the virtual memory that userland applications have requested, typically via *malloc(). The bug shows up only on 64-bit processors (x86_64, in my case) when you run IA32 applications. If you run a lot of IA32 applications you can watch the value of Committed_AS in /proc/meminfo grow forever... It occurs only in the kernels of some distributions, not in recent vanilla kernels.
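
To watch the counter while IA32 applications come and go, a small C helper along these lines is enough. It is only a sketch: the function names parse_committed_as_kb() and read_committed_as_kb() are my own, and the only thing it relies on is the Committed_AS field of /proc/meminfo mentioned above.

```c
#include <stdio.h>
#include <string.h>

/* Return the Committed_AS value (in kB) found in a /proc/meminfo-style
 * text buffer, or -1 if the field is missing or malformed.
 * (Hypothetical helper name, not a kernel or libc API.) */
long parse_committed_as_kb(const char *meminfo_text)
{
    static const char key[] = "Committed_AS:";
    const char *line = meminfo_text;

    while (line && *line) {
        if (strncmp(line, key, sizeof(key) - 1) == 0) {
            long kb;

            /* %ld skips the padding blanks before the number */
            if (sscanf(line + sizeof(key) - 1, "%ld", &kb) == 1)
                return kb;
            return -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++;     /* step past the newline to the next line */
    }
    return -1;
}

/* Read /proc/meminfo and return Committed_AS in kB, or -1 on error. */
long read_committed_as_kb(void)
{
    char buf[8192];
    size_t n;
    FILE *f = fopen("/proc/meminfo", "r");

    if (!f)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_committed_as_kb(buf);
}
```

Call read_committed_as_kb() before and after running a batch of IA32 binaries: on an affected kernel the value should not come back down once they have exited.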

But... is it a critical bug? It depends... Virtual memory is not physical memory: an application can always request a memory region, but if it never uses it the physical memory is never allocated. The point is: should the kernel give virtual memory to processes even when they request more than the physical memory? If it does, the system is overcommitting memory.

Linux supports 3 overcommit handling policies (see /usr/src/linux/Documentation/vm/overcommit-accounting):
  • "guess" policy
  • always overcommit
  • never overcommit
By default Linux uses the "guess" policy: the kernel uses a heuristic to decide whether a memory request can be committed. This heuristic does not depend on the value of the committed memory; it depends essentially on the free physical memory. With "always overcommit" the committed-memory counter does not matter either (except as accounting information). With the "never overcommit" policy, however, the value of the committed memory decides the outcome of every memory request (because memory can't be overcommitted), so in this case the counter is functionally important. For more implementation details see __vm_enough_memory() in /usr/src/linux/mm/mmap.c.
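
To see why a counter that only grows hurts under "never overcommit", here is a toy model of the check in C. It is a simplified sketch, not the code of __vm_enough_memory(): the struct and function names are hypothetical, and the limit is the one described in overcommit-accounting (swap plus a configurable percentage of RAM, 50 by default).

```c
#include <stdbool.h>

/* Toy model of the "never overcommit" policy. All names are hypothetical;
 * every size is expressed in pages. */
struct commit_state {
    unsigned long ram_pages;        /* physical RAM */
    unsigned long swap_pages;       /* total swap */
    unsigned int  overcommit_ratio; /* percentage of RAM, default 50 */
    unsigned long committed_pages;  /* what Committed_AS reports */
};

/* Commit limit: swap + overcommit_ratio percent of RAM. */
unsigned long commit_limit(const struct commit_state *s)
{
    unsigned long long ram_part =
        (unsigned long long)s->ram_pages * s->overcommit_ratio / 100;

    return s->swap_pages + (unsigned long)ram_part;
}

/* Charge the request if it fits under the limit, refuse it otherwise.
 * A refusal is what userland sees as ENOMEM / a NULL from malloc(). */
bool try_commit(struct commit_state *s, unsigned long pages)
{
    unsigned long limit = commit_limit(s);

    if (pages > limit || s->committed_pages > limit - pages)
        return false;
    s->committed_pages += pages;
    return true;
}
```

In this model, once committed_pages has drifted up to the limit every further request is refused, even if no process is actually using that memory.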

The memory leak bug occurs because during exec(), setup_arg_pages() calls vm_enough_memory() for a vma that does not have the VM_ACCOUNT flag set. When the process exits, exit_mmap() calls vm_unacct_memory() only for vmas that have the VM_ACCOUNT flag set, so the pages charged at exec() are never given back... hey! So we're really leaking committed memory here...
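
The imbalance is easy to reproduce in a toy model. The C sketch below is not kernel code: TOY_VM_ACCOUNT, toy_exec(), toy_exit() and toy_leak_after() are hypothetical stand-ins that I made up to mirror the charge at exec() and the conditional uncharge at exit described above.

```c
#define TOY_VM_ACCOUNT 0x1u     /* stands in for the kernel's VM_ACCOUNT flag */

struct toy_vma {
    unsigned long pages;
    unsigned int flags;
};

static long toy_committed;      /* stands in for the global committed counter */

/* exec(): the stack vma is charged whatever its flags are. */
void toy_exec(struct toy_vma *stack, unsigned long pages, unsigned int flags)
{
    stack->pages = pages;
    stack->flags = flags;
    toy_committed += (long)pages;
}

/* exit(): only vmas flagged as accounted are uncharged. */
void toy_exit(const struct toy_vma *stack)
{
    if (stack->flags & TOY_VM_ACCOUNT)
        toy_committed -= (long)stack->pages;
}

/* Run `runs` exec/exit cycles and return the pages left charged. */
long toy_leak_after(unsigned long runs, unsigned long stack_pages,
                    unsigned int stack_flags)
{
    struct toy_vma stack;
    unsigned long i;

    toy_committed = 0;
    for (i = 0; i < runs; i++) {
        toy_exec(&stack, stack_pages, stack_flags);
        toy_exit(&stack);
    }
    return toy_committed;
}
```

With the flag missing, every cycle leaves stack_pages behind in the counter; with TOY_VM_ACCOUNT set the counter returns to zero.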

The fix in this case is very simple:
--- include/asm-x86_64/page.h.orig 2007-02-27 17:31:05.000000000 +0100
+++ include/asm-x86_64/page.h 2007-02-27 17:24:26.000000000 +0100
@@ -134,7 +134,7 @@
 #define __VM_DATA_DEFAULT_FLAGS (VM_READ | VM_WRITE | VM_EXEC | \
                                  VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
 #define __VM_STACK_FLAGS (VM_GROWSDOWN | VM_READ | VM_WRITE | VM_EXEC | \
-                         VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
+                         VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC | VM_ACCOUNT)

 #define VM_DATA_DEFAULT_FLAGS \
         (test_thread_flag(TIF_IA32) ? vm_data_default_flags32 : \

Unfortunately there's another problem... In arch/x86_64/ia32/ia32_binfmt.c, the memory charged by security_vm_enough_memory() is not always given back with vm_unacct_memory() when a failure occurs (this problem is rarer, but it can happen). The patch for this problem is the following:
--- arch/x86_64/ia32/ia32_binfmt.c.orig 2007-02-27 17:26:47.000000000 +0100
+++ arch/x86_64/ia32/ia32_binfmt.c 2007-02-27 17:27:01.000000000 +0100
@@ -347,11 +347,6 @@
         if (!mpnt)
                 return -ENOMEM;

-        if (security_vm_enough_memory((IA32_STACK_TOP - (PAGE_MASK & (unsigned long) bprm->p))>>PAGE_SHIFT)) {
-                kmem_cache_free(vm_area_cachep, mpnt);
-                return -ENOMEM;
-        }
-
         memset(mpnt, 0, sizeof(*mpnt));

         down_write(&mm->mmap_sem);
--- fs/exec.c.orig 2007-02-27 17:27:39.000000000 +0100
+++ fs/exec.c 2007-02-27 17:28:08.000000000 +0100
@@ -413,11 +413,6 @@
         if (!mpnt)
                 return -ENOMEM;

-        if (security_vm_enough_memory(arg_size >> PAGE_SHIFT)) {
-                kmem_cache_free(vm_area_cachep, mpnt);
-                return -ENOMEM;
-        }
-
         memset(mpnt, 0, sizeof(*mpnt));

         down_write(&mm->mmap_sem);
--- mm/mmap.c.orig 2007-02-27 17:27:50.000000000 +0100
+++ mm/mmap.c 2007-02-27 17:28:58.000000000 +0100
@@ -2024,6 +2024,9 @@
         __vma = find_vma_prepare(mm,vma->vm_start,&prev,&rb_link,&rb_parent);
         if (__vma && __vma->vm_start < vma->vm_end)
                 return -ENOMEM;
+        if ((vma->vm_flags & VM_ACCOUNT) &&
+             security_vm_enough_memory(vma_pages(vma)))
+                return -ENOMEM;
         vma_link(mm, vma, prev, rb_link, rb_parent);
         return 0;
 }

With the 2 patches above applied my problems are solved. For critical servers I use the "never overcommit" policy, because I hate it when the OOM-killer decides to terminate the most important application... :-) If memory is never overcommitted the OOM-killer should practically never kick in and the applications can quit in a more graceful way; it's better to get a NULL from a *malloc() than a SIGKILL from the kernel... :-)
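
Of course this only pays off if the application really checks what *malloc() returns. A minimal sketch of what I mean, where alloc_buffer() is a hypothetical helper of mine and not something from libc:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Allocate and zero a buffer. On failure report the problem and return -1
 * so the caller can shut down cleanly instead of crashing later.
 * (Hypothetical helper, shown only to illustrate the NULL check.) */
int alloc_buffer(size_t size, char **out)
{
    char *p;

    if (size == 0)
        size = 1;       /* malloc(0) may legitimately return NULL */

    p = malloc(size);
    if (p == NULL) {
        fprintf(stderr, "cannot allocate %zu bytes, giving up cleanly\n",
                size);
        *out = NULL;
        return -1;
    }

    /* Touching the pages is what makes them use physical memory. */
    memset(p, 0, size);
    *out = p;
    return 0;
}
```

With "never overcommit" a request that does not fit is refused at allocation time, so the -1 branch is where the application gets its chance to save its state and exit.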
