Happy New Year! I certainly hope it's a great one for you, your family, and your business. As the new year begins, I figured it would be a good time to sit down and answer a question that I hear very often:
How do I become a better systems administrator?
The best way to become a better systems administrator is to fully understand the theory of what's happening in your server's environment.
What do I mean by that? Learn why things aren't happening as you expected and think about all of the factors that could possibly be involved. Instead of thinking purely about cause and effect, you'll find it much easier and more rewarding to consider everything inside and outside your environment before you make any changes.
This still may be a little difficult to fully understand, so here's an example. Let's say you're handling an issue where a customer can't reach a website hosted on their server. When you ask them for more details, they might give you the dreaded reply: "It's not coming up." Start by making a mental list of the problems that are easiest to check:
- Is the web server daemon running?
- If a database server is being used, is it running and accessible?
- Is there a software/hardware firewall blocking port 80?
- Is a script stuck on the server tying up resources?
- Could there be a DNS resolution problem?
- Is the server up?
- Did a switch fail?
- Is the server's hard disk out of space?
- Can the customer reach other websites like Google or Yahoo?
- If SELinux is involved, have the appropriate contexts been set?
- Could the site be a target of a denial of service attack?
- Has the server reached its connection tracking limit?
Of course, this is a relatively short list, but these are all easy to check. If you're thinking about cause and effect, you might only consider the web server daemon and some basic network issues. By considering all of the other factors that may be related, you've ensured that all of the basics are covered before you consider more complex problems.
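A few of the checks on that list are easy to script. The following is only a minimal sketch in Python, not a complete diagnostic tool: the function names (resolves, port_open, disk_has_space, basic_checks), the 3-second timeout, and the 5% free-space threshold are all assumptions chosen for illustration. Note that the DNS and port checks can run from any machine, while the disk check only makes sense when run on the server itself.

```python
import shutil
import socket


def resolves(hostname):
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        return False


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def disk_has_space(path="/", min_free_fraction=0.05):
    """Return True if at least min_free_fraction of the disk at path is free.

    The 5% default is an arbitrary threshold chosen for this example.
    """
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return False
    return usage.free / usage.total >= min_free_fraction


def basic_checks(hostname, port=80):
    """Run the three checks and return the results keyed by check name."""
    return {
        "dns_resolves": resolves(hostname),
        "port_open": port_open(hostname, port),
        "disk_has_space": disk_has_space("/"),
    }


if __name__ == "__main__":
    # example.com is a placeholder; substitute the customer's hostname.
    for name, ok in basic_checks("example.com").items():
        print(f"{name}: {'ok' if ok else 'FAILED'}")
```

A failed check here only tells you where to look next. For example, a closed port 80 could mean a stopped web server daemon, a firewall rule, or a server that is down entirely.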
Most systems administrators have pasted an entire error message into Google at some point. Occasionally, the search returns no results. If you find yourself in this situation, try to understand the individual parts of the error message. Work outward from what you know already. You should know which daemon said it, and you may have an idea of what the application was doing when the error occurred. Take time to consider what the daemon is trying to tell you within the context of what it was doing at the time.
One of the easiest ways to immerse yourself in this way of thinking is to host applications for non-technical people. You'll find that many customers want things done differently, and they're all at different levels of technical aptitude. Some may find it a frustrating experience at first, but you'll thank yourself later. It will force you to consider all aspects of how a server operates since you might not always know what's happening within a customer's application.
As always, if you find yourself stumbling, remember to ask your peers and colleagues. Even if they haven't seen the particular issue, they will probably be able to guide you closer to the solution you seek.

Major,
This post was to the point and an excellent write-up. I think it showcases how just knowing a lot about computers is not enough. It shows that being a Systems Administrator is more than knowing how to manipulate the command line; I think it's a way of thinking, a way of problem solving.
I am fortunate enough to work with someone as experienced as you. Even though it's indirect, just knowing you're in the same building and able to answer any questions or offer advice is a privilege.
Looking forward to 2010, and the things it's going to bring RackSpace, The Cloud, and all of us personally.
-Kyle
It's long since reached a stage where the very first thing I think of is DNS. It's stunning just how many disparate problems are caused by it. Given it takes all of 30 seconds to run "dig" it's one of the first things I do when a problem gets escalated to my team.
It's so easy to get fixated on one cause (and I'm just as guilty of it), but a lot of the time I'm able to resolve customer problems that have been escalated to my department simply because I habitually drop back to first principles. It's not that I distrust the tech support teams (we've got some of the most skilled 'basic' support technicians I've ever had the pleasure of working with), but just that I don't like to take anything as certain unless I've proven it myself.
One thing I'd add: if you find an answer, document it. If you ask a question in an internet forum, then go back and answer your own question. If you learn something that isn't apparently on the internet anywhere, find someplace to document your answer.
One of the problems of Google is that everyone and their dogs are asking questions. The problem is, most of these questions never get answered, so for some esoteric stuff you end up paging through all these people asking for help with no response. So by following through on any questions you ask, you increase the signal-to-noise ratio (or maybe the help-to-helplessness ratio) of the internet at large.
If you don't have your own 'blog or wiki, a simple place to add questions (or even ask them) is serverfault.com or superuser.com. It is simple to sign up, ask your question, and then answer it directly. What's better usually is to just ask your question and wait to see if the rest of the site's user community can come up with an answer. Yes, it's a numerical competition, but it is one where quality answers are a nice side-effect of that competition.
Kyle - Thanks for the kind words!
Every System Admin must also have this bookmarked for his family: http://xkcd.com/627/
David - You definitely hit the nail on the head there. By not sharing the knowledge - intentionally or unintentionally - you slow down the community as a whole. End users, administrators and developers all get left out of the loop.
Well said!
Understanding the ecosystem of an IT service is key to assuring that problems are resolved quickly. I stress this approach to my staff. I find if you gain a good understanding of how an IT service is integrated from its components, such as firewalls, networks, file servers, databases, web servers, you can then design effective monitoring tools that allow you to pinpoint issues rather than chasing phantoms.
Thank you very much
Thanks, Very nice site.