But... is it a critical bug? It depends... Virtual memory is not physical memory: an application can always request a memory region, but if it never uses it, the physical memory is never allocated. The point is: should the kernel give virtual memory to processes even when they request more than the available physical memory? If it does, the system is overcommitting memory.
Linux supports 3 overcommit handling policies (see /usr/src/linux/Documentation/vm/overcommit-accounting):
- "guess" policy
- always overcommit
- never overcommit
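As a rough illustration, the active policy can be read from /proc/sys/vm/overcommit_memory, where 0 selects the "guess" heuristic, 1 always overcommits and 2 never overcommits. The sketch below is only a user-space helper; the function names overcommit_policy_name() and read_overcommit_mode() are made up for this example and are not kernel APIs.

```c
#include <stdio.h>

/* Map the number stored in /proc/sys/vm/overcommit_memory to a policy name.
 * (Hypothetical helper, named here only for illustration.) */
const char *overcommit_policy_name(int mode)
{
    switch (mode) {
    case 0:  return "guess";
    case 1:  return "always overcommit";
    case 2:  return "never overcommit";
    default: return "unknown";
    }
}

/* Read the current mode; returns -1 if the file is missing or unreadable
 * (for example, when not running on Linux). */
int read_overcommit_mode(void)
{
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    int mode = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);
    return mode;
}
```

Calling overcommit_policy_name(read_overcommit_mode()) from a small main() prints the policy currently in force, or "unknown" if the file could not be read.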
The memory leak occurs because, during exec(), setup_arg_pages() calls vm_enough_memory() for a vma that does not have the VM_ACCOUNT flag set. When the process exits, exit_mmap() calls vm_unacct_memory() only if the vma has the VM_ACCOUNT flag set, so the pages charged at exec time are never given back. Hey! So we really are leaking memory here...
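The asymmetry can be pictured with a toy user-space model. This is not kernel code: the struct, the counter, the flag value and the 32-page stack size are all invented for the illustration. It only mimics the pattern of "charge unconditionally at exec, uncharge at exit only if VM_ACCOUNT is set".

```c
/* Illustrative flag value; the real VM_ACCOUNT lives in the kernel headers. */
#define VM_ACCOUNT 0x1UL

/* Toy stand-in for a vma: just the flags and a size in pages. */
struct toy_vma {
    unsigned long flags;
    long pages;
};

/* Toy stand-in for the kernel's count of committed pages. */
long leak_committed_pages;

/* exec(): the stack vma is charged whether or not VM_ACCOUNT is set. */
void toy_exec(struct toy_vma *stack, unsigned long flags, long pages)
{
    stack->flags = flags;
    stack->pages = pages;
    leak_committed_pages += pages;
}

/* exit(): the charge is dropped only for vmas flagged VM_ACCOUNT. */
void toy_exit(const struct toy_vma *stack)
{
    if (stack->flags & VM_ACCOUNT)
        leak_committed_pages -= stack->pages;
}

/* Run `count` short-lived processes and return the pages still charged. */
long run_processes(int count, unsigned long stack_flags)
{
    leak_committed_pages = 0;
    for (int i = 0; i < count; i++) {
        struct toy_vma stack;
        toy_exec(&stack, stack_flags, 32); /* 32 pages: arbitrary example size */
        toy_exit(&stack);
    }
    return leak_committed_pages;
}
```

In this model run_processes(1000, 0) leaves 32000 pages charged even though every process has exited, while run_processes(1000, VM_ACCOUNT) returns to zero.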
The fix in this case is very simple:
--- include/asm-x86_64/page.h.orig 2007-02-27 17:31:05.000000000 +0100
+++ include/asm-x86_64/page.h 2007-02-27 17:24:26.000000000 +0100
@@ -134,7 +134,7 @@
#define __VM_DATA_DEFAULT_FLAGS (VM_READ | VM_WRITE | VM_EXEC | \
VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
#define __VM_STACK_FLAGS (VM_GROWSDOWN | VM_READ | VM_WRITE | VM_EXEC | \
- VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
+ VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC | VM_ACCOUNT)
#define VM_DATA_DEFAULT_FLAGS \
(test_thread_flag(TIF_IA32) ? vm_data_default_flags32 : \
Unfortunately there's another problem... in arch/x86_64/ia32/ia32_binfmt.c, the code that calls security_vm_enough_memory() tends to forget to call vm_unacct_memory() when a failure occurs (this problem is rarer, but it can happen). The patch for this problem is the following:
--- arch/x86_64/ia32/ia32_binfmt.c.orig 2007-02-27 17:26:47.000000000 +0100
+++ arch/x86_64/ia32/ia32_binfmt.c 2007-02-27 17:27:01.000000000 +0100
@@ -347,11 +347,6 @@
if (!mpnt)
return -ENOMEM;
- if (security_vm_enough_memory((IA32_STACK_TOP - (PAGE_MASK & (unsigned long) bprm->p))>>PAGE_SHIFT)) {
- kmem_cache_free(vm_area_cachep, mpnt);
- return -ENOMEM;
- }
-
memset(mpnt, 0, sizeof(*mpnt));
down_write(&mm->mmap_sem);
--- fs/exec.c.orig 2007-02-27 17:27:39.000000000 +0100
+++ fs/exec.c 2007-02-27 17:28:08.000000000 +0100
@@ -413,11 +413,6 @@
if (!mpnt)
return -ENOMEM;
- if (security_vm_enough_memory(arg_size >> PAGE_SHIFT)) {
- kmem_cache_free(vm_area_cachep, mpnt);
- return -ENOMEM;
- }
-
memset(mpnt, 0, sizeof(*mpnt));
down_write(&mm->mmap_sem);
--- mm/mmap.c.orig 2007-02-27 17:27:50.000000000 +0100
+++ mm/mmap.c 2007-02-27 17:28:58.000000000 +0100
@@ -2024,6 +2024,9 @@
__vma = find_vma_prepare(mm,vma->vm_start,&prev,&rb_link,&rb_parent);
if (__vma && __vma->vm_start < vma->vm_end)
return -ENOMEM;
+ if ((vma->vm_flags & VM_ACCOUNT) &&
+ security_vm_enough_memory(vma_pages(vma)))
+ return -ENOMEM;
vma_link(mm, vma, prev, rb_link, rb_parent);
return 0;
}
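As I read the second patch, the idea is to move the charge to the last step that can fail, so that no error path is left holding an unreleased charge. Here is a toy model of the two orderings; every name in it (toy_charge, setup_old, setup_new, the 100-page limit) is hypothetical and exists only for this sketch.

```c
#include <stdbool.h>

long toy_committed = 0;       /* pages currently charged */
const long toy_limit = 100;   /* hypothetical commit limit, in pages */

/* Charge pages against the limit; 0 on success, -1 if it would be exceeded. */
int toy_charge(long pages)
{
    if (toy_committed + pages > toy_limit)
        return -1;
    toy_committed += pages;
    return 0;
}

/* Old ordering: charge first, then attempt the insert. */
int setup_old(long pages, bool insert_fails)
{
    if (toy_charge(pages) != 0)
        return -1;
    if (insert_fails)
        return -1;   /* error path forgets to uncharge: pages stay committed */
    return 0;
}

/* Patched ordering: charging is the last step that can fail. */
int setup_new(long pages, bool insert_fails)
{
    if (insert_fails)
        return -1;   /* nothing charged yet, so nothing to undo */
    return toy_charge(pages);
}
```

With a failing insert, setup_old() returns an error but leaves the pages charged, while setup_new() returns the same error with the counter untouched.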
Applying the 2 patches above resolves my problems. For critical servers I use the "never overcommit" policy, because I hate it when the OOM-killer decides to terminate the most important application... :-) If memory is never overcommitted, the OOM-killer is effectively disabled and the applications can quit in a more graceful way; it's better to get a NULL from *malloc() than to get a SIGKILL from the kernel... :-)
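To make the last point concrete, here is a minimal sketch of what "getting a NULL" looks like from the application side. The helper name alloc_or_report() is made up for this example; the only real API used is malloc() itself.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: try to allocate, and report the failure to the caller
 * instead of assuming the allocation always succeeds. */
void *alloc_or_report(size_t size)
{
    void *p = malloc(size);

    if (p == NULL)
        fprintf(stderr, "allocation of %zu bytes failed\n", size);
    return p;
}
```

A caller that gets NULL back can flush its state, close its connections and exit cleanly, which is exactly the chance a SIGKILL never gives it.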