Earlier this year, I started a series of posts to encourage systems administrators to refine their troubleshooting abilities. This is the second post in that series.
Almost every system administrator has been confronted with a misbehaving server. However, if you're not the primary administrator for that server, you may not know what has changed recently, either on the server itself or in its environment. In these situations, if the fix isn't obvious, try going through these steps:
Localize the problem to a specific daemon or service
In the case of a problem where a website isn't loading properly, is it a problem with the web server itself? Could something other than the actual web server daemon be having an issue?
As an example, consider a Ruby on Rails application which runs through Apache's mod_proxy_balancer and queries data from MySQL. Each of those pieces fails in its own way, so the symptom you see points to the piece that broke. A downed MySQL instance could make the application throw errors or appear to be unresponsive. If the mongrel cluster had failed, Apache might be returning internal server errors. Your browser might return a connection refused if Apache was down. These are all relatively easy to determine.
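One quick way to narrow things down is to check whether each layer is even accepting connections. Here's a minimal sketch of that idea. The hosts and ports are assumptions for illustration (Apache on 80, a single mongrel on 8000, MySQL on 3306), so adjust them to match the real stack. An open port only tells you a daemon is listening, not that it's healthy.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


# Hypothetical layout: these hosts and ports are assumptions, not taken
# from any particular server.
SERVICES = {
    "apache": ("127.0.0.1", 80),
    "mongrel": ("127.0.0.1", 8000),
    "mysql": ("127.0.0.1", 3306),
}


def check_services(services=SERVICES):
    """Map each service name to True (listening) or False (not listening)."""
    return {
        name: port_is_open(host, port)
        for name, (host, port) in services.items()
    }


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'listening' if up else 'NOT listening'}")
```

If every port answers and the site is still broken, the problem is probably inside one of the daemons rather than a daemon being down, which is where the next step comes in.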
What if you are unable to determine which daemon is causing the problem?
If it's broken, break it a little more
Let's say that you've reviewed the process list and all of the appropriate daemons appear to be running. However, the website is still not loading properly. What do you do? Bring down a service and try again. Did something change? Did a new error appear? If not, bring that daemon back up and try taking down one of the other ones.
I've also had some good results by making small adjustments in the web server's configuration file. If you have a virtual host that isn't returning the correct data, try commenting it out temporarily. For rewrite rules, try removing them temporarily or stripping them down to a more basic form. Test again, and then begin adding lines back incrementally. As little as a single period or quotation mark can derail a perfectly good set of rewrite rules.
In short, try to think outside the box when you're troubleshooting a difficult issue on an unfamiliar system. Always remember to back up your configurations before making changes, and make sure your daemons will start properly again before you bring them down.
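Backing up a configuration file before touching it can be as simple as making a timestamped copy next to the original. This is a minimal sketch; the `backup_config` helper name and the `.bak` naming scheme are my own choices for illustration, not a standard.

```python
import shutil
import time
from pathlib import Path


def backup_config(path):
    """Copy a config file to <name>.<timestamp>.bak in the same directory.

    Returns the path of the backup so it can be restored later.
    """
    src = Path(path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = src.with_name(f"{src.name}.{stamp}.bak")
    # copy2 preserves the modification time and permission bits.
    shutil.copy2(src, dest)
    return dest
```

For example, `backup_config("/etc/httpd/conf/httpd.conf")` (a hypothetical path) would leave a dated copy beside the original. With Apache, running `apachectl configtest` after each edit should tell you whether the daemon will be able to start with the new configuration.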

Great post:
"However, if you’re not the primary administrator for the server, you may not always know what has changed recently or you may not be aware of changes in the server’s environment."
Monitoring
I've found that good monitoring and a few system notes can go a long way in this situation. For example, monitoring slave replication, NFS mounts or other interconnected services can often give you a good starting point.
False Correlation
Be cautious about false correlations between events.
For example, after a kernel update, a client complained that his web-based forms were no longer functioning. He was sure that the reboot had caused the problem. We checked what might have changed and found no system updates that would have affected the service.
We then dug into the FTP logs and found that his offshore dev team had uploaded files just prior to our scheduled reboot. They had introduced code changes that caused the problem.
Don't fall into a trap of false cause-effect scenarios. Coincidence can explain a large number of events, especially when changes are not well documented by all parties.
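When changes aren't documented, file modification times are one more place to look alongside the logs. The sketch below is a different technique from the FTP log review described above: it lists files under a directory that were modified within the last few hours. The `recently_modified` helper and its default 24-hour window are assumptions for illustration.

```python
import os
import time


def recently_modified(root, hours=24):
    """Return (mtime, path) pairs for files under root changed in the
    last `hours` hours, newest first."""
    cutoff = time.time() - hours * 3600
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                mtime = os.stat(full).st_mtime
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
            if mtime >= cutoff:
                hits.append((mtime, full))
    return sorted(hits, reverse=True)
```

Pointing this at a site's document root shortly after a complaint comes in may show whether application files changed around the same time as the system event being blamed. Modification times can be altered, so treat the result as a hint, not proof.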
Hey Jeff,
Those are great points! Good change management will definitely help future troubleshooting efforts by an administrator that is unfamiliar with the configuration.
I've worked at two companies so far where systems administrators deployed the code written by the developers, and this really helps to keep administrators in the loop. It also requires the developers to write code well enough that it can be deployed by a person who is not familiar with the contents of the application.
I've been on both sides of the fence as a sysadmin, and I much prefer it when the sysadmin is the deployer. Sure, it can be a pain in the arse, especially if you work in a large environment with lots of small apps you might not be familiar with, but the benefits far outweigh the hassles. At employer-2 we were on call 24x7, each doing one week in rotation. Whoever was the sysadmin on call for that week would be the one who deployed the application. That way, any errors from the app deployment would be handled out of hours by the member of staff most familiar with it.
My golden rule for odd errors on a server I'm unfamiliar with:
Check DNS.
It's mind-boggling just how often DNS is the problem.
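A quick DNS sanity check can be done with nothing but the standard library. This is a minimal sketch; the `resolve` helper is a name made up for illustration, and it only reports what the local resolver returns, which may differ from what other machines see.

```python
import socket


def resolve(hostname):
    """Return (sorted list of addresses, error message or None)."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror as exc:
        # Resolution failed: return no addresses and the resolver's error.
        return [], str(exc)
    # Each entry's sockaddr starts with the address; de-duplicate them.
    addresses = sorted({info[4][0] for info in infos})
    return addresses, None


if __name__ == "__main__":
    # "localhost" is only a placeholder; use the name the broken
    # service actually depends on.
    addrs, err = resolve("localhost")
    print(addrs if err is None else f"lookup failed: {err}")
```

Comparing the returned addresses with the ones you expect the server or its database host to have is often enough to confirm or rule out DNS.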