When it comes to frustrating parts of the Linux kernel, the OOM killer takes the cake. If it finds that applications are using too much memory on the server, it will abruptly kill processes to free up memory for the system to use. I spent much of this week wrestling with a server that was in the clutches of the OOM killer.
There are a few processes on the server that keep it fairly busy. Two of the processes are vital to the server's operation - if they are stopped, lots of work is required to get them running properly again. I found that a certain Java process was being killed by the OOM killer regularly, and a Perl process was being killed occasionally.
Naturally, my disdain for Java made me think that the Java process was the source of the issue. However, the process was configured to use only a small amount of RAM, so I ruled it out. The Perl process used even less memory, so I ruled it out as well. When I checked the sysstat data with sar, I found that the server was only using about 2-3GB out of 4GB of physical memory at the time the OOM killer was triggered. At this point, I was utterly perplexed.
I polled some folks around the office and gathered some ideas. After putting some ideas together, I found that the server was actually caching too much data in the ext3_inode_cache and dentry_cache. These caches hold recently accessed files and directories on the server, and they're purged as the files and directories become stale. Since the operations on the server read and write large amounts of data locally and via NFS, I knew these caches had to be gigantic. If you want to check your own caches, you can use the slabtop command. For those who like things more difficult, you can also cat the contents of /proc/slabinfo and grep for the caches that are important to you.
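As a rough sketch of the /proc/slabinfo approach, the helper below estimates how much memory a few caches are holding. The function name slab_cache_kb is made up for illustration, and the cache names in the default pattern are assumptions (dentry_cache often shows up as plain dentry on newer kernels). The size is approximated as num_objs * objsize using the slabinfo 2.x column layout, so treat the numbers as ballpark figures. Reading /proc/slabinfo usually requires root.

```shell
# Print "<cache name> <approx KB>" for selected slab caches.
# slabinfo 2.x columns: name active_objs num_objs objsize objperslab pagesperslab ...
# Approximation: num_objs * objsize, which ignores per-slab overhead.
slab_cache_kb() {
    # $1: slabinfo-format file (defaults to /proc/slabinfo, usually root-only)
    # $2: extended regex matched against the cache name (hypothetical default below)
    file="${1:-/proc/slabinfo}"
    pattern="$2"
    [ -n "$pattern" ] || pattern='^(ext3_inode_cache|dentry|dentry_cache|nfs_inode_cache)$'
    awk -v pat="$pattern" '$1 ~ pat { printf "%s %d\n", $1, ($3 * $4) / 1024 }' "$file"
}

# Usage (as root):
#   slab_cache_kb                       # default caches from /proc/slabinfo
#   slab_cache_kb /proc/slabinfo '^nfs' # any cache whose name starts with "nfs"
```

If the numbers here are a large share of physical memory while sar shows modest application usage, the caches are a likely suspect.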
An immense amount of Googling revealed very little, but I discovered a dirty hack to fix the issue (don't run this yet):
echo 1 > /proc/sys/vm/drop_caches # free pagecache
[OR]
echo 2 > /proc/sys/vm/drop_caches # free dentries and inodes
[OR]
echo 3 > /proc/sys/vm/drop_caches # free pagecache, dentries and inodes
sync # forces the dump to be destructive
There are huge consequences to dumping these caches and running sync. If you are writing data at the time you run these commands, you'll actually be dumping the data out of the filesystem cache before it reaches the disk, which could lead to very bad things.
While I was discussing the issue with a coworker, he found a different, much safer method for correcting it. You can echo values into /proc/sys/vm/vfs_cache_pressure to tell the kernel what priority it should take when clearing out the inode/dentry caches. LinuxInsight explains the range of values well:
At the default value of vfs_cache_pressure = 100 the kernel will attempt to reclaim dentries and inodes at a "fair" rate with respect to pagecache and swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer to retain dentry and inode caches. Increasing vfs_cache_pressure beyond 100 causes the kernel to prefer to reclaim dentries and inodes.
In short, values less than 100 won't reduce the caches very much at all. Values over 100 will signal to the kernel that you want to clear out the caches at a higher priority. I found that no matter what value you use, the kernel clears the caches at a slow rate. I've been using a value of 10000 on the server I talked about earlier in the article, and it has kept the caches down to a reasonable level.
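Here is a minimal sketch of applying that setting. The function name set_vfs_cache_pressure and its optional second argument are hypothetical (the second argument only exists so the function can be tried against a scratch file), and 10000 is simply the value that worked on this one server, not a recommendation.

```shell
# Write a new vfs_cache_pressure value and show the old and new settings.
# Assumption: run as root on a Linux host when targeting the real /proc file.
set_vfs_cache_pressure() {
    value="$1"
    target="${2:-/proc/sys/vm/vfs_cache_pressure}"
    # Reject anything that is not a non-negative integer.
    case "$value" in
        ''|*[!0-9]*) echo "value must be a non-negative integer" >&2; return 1 ;;
    esac
    printf 'before: %s\n' "$(cat "$target")"
    echo "$value" > "$target" || return 1
    printf 'after: %s\n' "$(cat "$target")"
}

# Usage (as root):
#   set_vfs_cache_pressure 10000
# Equivalent one-liner:
#   sysctl -w vm.vfs_cache_pressure=10000
# To keep the setting across reboots, add this line to /etc/sysctl.conf:
#   vm.vfs_cache_pressure = 10000
```

Values written through /proc or sysctl -w are lost on reboot, which makes it easy to experiment before committing anything to /etc/sysctl.conf.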

Thank you for this post, it was very very helpful.
I'm glad it was helpful! I've found some other optimizations as well, so I'll be adding those soon.
This should lead to another great post regarding the way the OOM killer works. I also normally find nfs_inode_cache another good one to look for. It is not a good idea to always perform a sync and drop the caches just because you think this may be the cause. You can find out whether it is by catting /proc/slabinfo or running slabtop.
Now, how do you keep the OOM killer at bay on a 2.6.x kernel without it killing services that are essential to the server? If you're sure that a given service is not the problem, you can set oom_adj in /proc/<pid>/oom_adj. This will raise or lower its chance of being killed, and you can then slowly build a way to control the OOM killer. After this is done, you can check your application's score in /proc/<pid>/oom_score. This gives you the current score, so you can compare your applications' scores at that time and find which may be killed first.
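A rough sketch of that score check, assuming a Linux /proc layout: the function below lists every process by oom_score, highest first, so the likeliest victim is at the top. The name rank_oom_scores and its optional argument are made up for illustration. On kernels from 2.6.36 onward, oom_adj is deprecated in favor of oom_score_adj (range -1000 to 1000), so the commented oom_adj write applies to older kernels like the one discussed here.

```shell
# Print "<oom_score> <pid>" for every process, highest score first.
rank_oom_scores() {
    # $1: proc root to scan (defaults to /proc; hypothetical parameter for testing)
    proc_root="${1:-/proc}"
    for f in "$proc_root"/[0-9]*/oom_score; do
        [ -r "$f" ] || continue
        pid="${f%/oom_score}"
        pid="${pid##*/}"
        # A process may exit between the glob and the read; skip it quietly.
        score=$(cat "$f" 2>/dev/null) || continue
        printf '%s %s\n' "$score" "$pid"
    done | sort -rn
}

# Usage:
#   rank_oom_scores | head -5
# To exempt a vital process on a 2.6.x kernel (as root, PID 1234 is a placeholder):
#   echo -17 > /proc/1234/oom_adj
```

Checking the ranking before and after adjusting a process is a quick way to confirm the change had the effect you expected.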
Kernel Documentation :
Documentation/filesystems/proc.txt
"
2.12 /proc/<pid>/oom_adj - Adjust the oom-killer score
------------------------------------------------------
This file can be used to adjust the score used to select which processes
should be killed in an out-of-memory situation. Giving it a high score will
increase the likelihood of this process being killed by the oom-killer. Valid
values are in the range -16 to +15, plus the special value -17, which disables
oom-killing altogether for this process.
2.13 /proc/<pid>/oom_score - Display current oom-killer score
-------------------------------------------------------------
------------------------------------------------------------------------------
This file can be used to check the current score used by the oom-killer is for
any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which
process should be killed in an out-of-memory situation.
"
Isn't it safe to do
sync; and then,
echo 3 > /proc/sys/vm/drop_caches
instead of the other way around?
From: Andrew Morton
That page says "If you are writing data at the time you run these
commands, you'll actually be dumping the data out of the filesystem
cache before it reaches the disk, which could lead to very bad things".
That had better not be true! That would be a bad bug. drop_caches
only drops stuff which has been written back.
http://lkml.indiana.edu/hypermail/linux/kernel/1005.1/00693.html
Thank you for this article. I was banging my head for a long time over VM web servers caching 60%+ of the available memory and becoming slow as a turtle!
I am referencing your article in my recent "research", setting up Cent OS 5 Minimalist
> That page says "If you are writing data at the time you run these
commands, you'll actually be dumping the data out of the filesystem
cache before it reaches the disk, which could lead to very bad things".
Is decidedly NOT TRUE. Andrew Morton himself (!!!) took the trouble to comment as much. Would you please fix your statement?
@linux_user:
Doing a sync first changes nothing, because other processes can (and will) be dirtying page cache in the meantime.