What are some system administration best practices?

Question added by Vaiyapuri Gopalakrishnan , Manager - After Sales , M/s Saud Bahwan Automotive llc
Date Posted: 2016/08/25
Mahmoud Zaher Tarakji
by Mahmoud Zaher Tarakji , Manager , Awal Gallery

I don't know.

Maher Sadwan
by Maher Sadwan

Rule 1 - Be Good Citizens

• Visibility

– Ticketing system

– Updates must propagate outside your group

• Know your metrics

– User perception (quick response)

– CTO perception

– Partner perception

 

 

Rule 2 - Monitor your Systems

• Status

• Establish baselines

• Watch trends

• Use the right tools for the jobs

• Use the right tools for your team

• Start in a known state

Routers

• Link traffic

• Capacity

• CPU

• Memory

• Environmental

• ACL hits, BGP routes...

Networks

• Reachability

– Ping

– Traceroute

– Routing loops

• Latency

– Directly affects end-user perception

Systems

• Disk

• CPU

• Memory

• Environment

• Services
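As a rough illustration of "establish baselines" and "watch trends", here is a minimal Python sketch. It assumes you already collect numeric readings (CPU %, link traffic, and so on) into a list; the function names and the 3-sigma threshold are illustrative choices, not something Rule 2 prescribes.

```python
from statistics import mean, stdev


def baseline(samples):
    """Return (mean, standard deviation) of historical readings."""
    return mean(samples), stdev(samples)


def is_anomalous(value, samples, threshold=3.0):
    """Flag a reading that sits more than `threshold` standard
    deviations away from the baseline (threshold is an assumption)."""
    mu, sigma = baseline(samples)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


def trend_per_sample(samples):
    """Least-squares slope: average change per sampling interval."""
    n = len(samples)
    x_mean = (n - 1) / 2
    y_mean = mean(samples)
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den
```

A positive slope on disk or memory usage that holds for weeks is usually worth a look well before any single reading counts as anomalous.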

 

Rule 3 – Perform Disaster Recovery Planning

• Things break. All the time.

• Quis custodiet ipsos custodes?

– If your monitoring system breaks, who will notice?

Who will care?

• Timestamps are essential for correlation

– NTP is your friend
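To show why synchronized timestamps matter for correlation, here is a small Python sketch that pairs events from two hosts once both are normalized to UTC. It assumes each event is a `(timestamp, message)` tuple whose timestamp is ISO 8601 with an explicit UTC offset; the 5-second window is an arbitrary illustrative value.

```python
from datetime import datetime, timedelta, timezone


def to_utc(stamp):
    """Parse an ISO 8601 timestamp with an explicit offset and convert to UTC."""
    return datetime.fromisoformat(stamp).astimezone(timezone.utc)


def correlate(events_a, events_b, window_seconds=5):
    """Pair messages from two hosts whose UTC timestamps are within the window."""
    window = timedelta(seconds=window_seconds)
    pairs = []
    for stamp_a, msg_a in events_a:
        for stamp_b, msg_b in events_b:
            if abs(to_utc(stamp_a) - to_utc(stamp_b)) <= window:
                pairs.append((msg_a, msg_b))
    return pairs
```

If the clocks on the two hosts drift apart by more than the window, related events stop matching, which is exactly the problem NTP is there to prevent.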

Networks

• Redundant paths

• Dynamic routing

– minimal human intervention

• Spares (GBICs and cables, too…)

• Know your S.L.A.

Systems

• Load-balancing

– DNS round-robin

– F5/Cisco Director/Resonate Global Dispatch

• Redundancy of service

– MX backups

– Leaf nodes should cache

Backups

• Remember “no single points of failure”?

– This goes for backups, too!

• Media fails

• Media devices fail

• Networks fail

• Try restoring on a different system…
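One way to act on "try restoring on a different system" is to compare checksums of the restored tree against the original. The Python sketch below assumes both trees are reachable as local directories; `verify_restore` and its helpers are hypothetical names for illustration, not part of any backup product.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(original_root, restored_root):
    """Return relative paths that are missing or differ after the restore."""
    want = manifest(original_root)
    got = manifest(restored_root)
    return sorted(name for name in want if got.get(name) != want[name])
```

An empty list means every original file came back byte-for-byte identical; anything else is a restore you would rather find out about during a drill.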

 

Rule 4 - It’s not done until it’s documented

 

 

Rule 5 – Establish Procedures

• Consistency

• Reproducibility

• ISO 9001 is all about procedures

• Helps to implement Rule 4

• Peer review

 

Rule 6 - Defence in Depth

• Software updates

– OS and applications

All software is buggy. Get over it.

• Firewall

– Can give false sense of security

– Misconfigured? Worse than no firewall.

• Monitor your network, too (IDS, honeypot)

• Internal more likely than external

 

 

Rule 7 - It’s not done until it’s tested

• Software installation is a risk

– Yes, patches too!

• Test systems

– Must the software updates be applied right now?

• Automate your testing, if possible

 

Rule 8 - Learn from Others

• Don’t re-invent the wheel

– Save yourself time

– Save yourself money

• Mailing lists

– SAGE and local groups

– NANOG

• Conferences

Other sources

• Vendors

– Sometimes they hire smart people

• FAQs

• Search engines

• White papers

• Books

• Articles

Mustaf Shabeer
by Mustaf Shabeer , Accountant , Albaats Trading Pvt. Ltd.

Yes, you know what the swap usage is today because there’s a problem with the disks thrashing and it’s causing the server to go slow. But your users are complaining to management that it’s an ongoing issue and now management is asking you for data. What, you haven’t been documenting this, so it’s now your word against Sales and Marketing? Guess who wins that argument by default? You’re responsible for the system, so they will make this your problem. So get/build/buy a system to monitor, measure, and record that data so you can build pretty power-point slides for finance next time you need to ask for hardware upgrades, or to prove that the issues are caused by bad software rather than your perfectly functioning servers. Even if you are just running a single server for an employer, a client, or even yourself, it’s good data to have for some unforeseen reason someday.

A shortlist of things to start monitoring/recording/charting/graphing:

  • Load average
  • Memory usage
  • Disk I/O (transactions per second)
  • Network throughput (in Mbits/sec)
  • Network throughput per virtual host/site
  • Transfer (in GB/month)
  • Transfer per virtual host
  • Disk storage (monthly in GB, and also a daily rolling average if files are uploaded and deleted regularly)
  • Average response time of test URI under your control (in milliseconds)
  • Average response time of a PHP (or Ruby/Python/etc.) page under your control that does not change. Testing real web pages gives you a consistent baseline that you can use to narrow the problem to the server, the OS, or the web code itself.
  • SSH logins per day/month by user and IP address
  • Anything you feel is necessary, or will get questions on later

Once you have consistent information, you’ll start seeing patterns and can look for things out of the ordinary. It’s also good for correlating data to behaviors when you’re troubleshooting issues and aren’t sure where to start.
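If you have nothing in place yet, even a tiny script run from cron gets you started on recording. Here is a minimal Python sketch that samples load average and disk usage and appends them as CSV rows. It assumes a Unix host (`os.getloadavg` is not available on Windows); the column names are my own choice for illustration.

```python
import csv
import os
import shutil
import time

FIELDS = ["timestamp", "load1", "load5", "load15", "disk_used_gb", "disk_free_gb"]


def sample(path="/"):
    """Take one reading of load average and disk usage for `path`."""
    load1, load5, load15 = os.getloadavg()  # Unix only
    usage = shutil.disk_usage(path)
    return {
        "timestamp": int(time.time()),
        "load1": load1,
        "load5": load5,
        "load15": load15,
        "disk_used_gb": round(usage.used / 1e9, 2),
        "disk_free_gb": round(usage.free / 1e9, 2),
    }


def append_sample(out, row):
    """Append one row to an open text file, writing the header if it is empty."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    if out.tell() == 0:
        writer.writeheader()
    writer.writerow(row)
```

Typical use would be something like `with open("metrics.csv", "a", newline="") as f: append_sample(f, sample())` on a schedule; the file name here is just a placeholder.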

sameer abdul wahab alfaddagh
by sameer abdul wahab alfaddagh , Faculty Member , Delmon University

For successful system administration, you need more than just the required technical skills. Below is a list of five slightly non-technical abilities that should be developed in order to become the best system admin ever.

1. Monitor, measure, and record.

2. Develop project management habits.

3. Develop a system for day-to-day work.

4. Develop communications skills (sales, presentation, etc).

5. Start preparing for “what if” scenarios.

Ahmed Mohamed Ayesh Sarkhi
by Ahmed Mohamed Ayesh Sarkhi , Shared Services Supervisor , Saudi Musheera Co. Ltd.

Users & Admin Login

Manzoor Alam
by Manzoor Alam , Director , 7th Sky Travel & Tourism Services (Pvt.) Limited

Besides technical knowledge and solutions, a system administrator must have the following qualities to serve his/her clients/users:

1- Ready to help

2- Friendly behaviour

3- Accommodating

He/she must also keep up to date technically.

Wail Zayid
by Wail Zayid , Facilities Supervisor , Shade Corporation

 

1. Monitor, measure, and record. Yes, you know what the swap usage is today because there’s a problem with the disks thrashing and it’s causing the server to go slow. But your users are complaining to management that it’s an ongoing issue and now management is asking you for data. What, you haven’t been documenting this, so it’s now your word against Sales and Marketing? Guess who wins that argument by default? You’re responsible for the system, so they will make this your problem. So get/build/buy a system to monitor, measure, and record that data so you can build pretty power-point slides for finance next time you need to ask for hardware upgrades, or to prove that the issues are caused by bad software rather than your perfectly functioning servers. Even if you are just running a single server for an employer, a client, or even yourself, it’s good data to have for some unforeseen reason someday.

 

A shortlist of things to start monitoring/recording/charting/graphing:

 

  • Load average
  • Memory usage
  • Disk I/O (transactions per second)
  • Network throughput (in Mbits/sec)
  • Network throughput per virtual host/site
  • Transfer (in GB/month)
  • Transfer per virtual host
  • Disk storage (monthly in GB, and also a daily rolling average if files are uploaded and deleted regularly)
  • Average response time of test URI under your control (in milliseconds)
  • Average response time of a PHP (or Ruby/Python/etc.) page under your control that does not change. Testing real web pages gives you a consistent baseline that you can use to narrow the problem to the server, the OS, or the web code itself.
  • SSH logins per day/month by user and IP address
  • Anything you feel is necessary, or will get questions on later

 

Once you have consistent information, you’ll start seeing patterns and can look for things out of the ordinary. It’s also good for correlating data to behaviors when you’re troubleshooting issues and aren’t sure where to start.
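For the "SSH logins per day by user and IP address" item, a short Python sketch like the following can tally successful logins from the auth log. It assumes the common syslog layout with OpenSSH's "Accepted <method> for <user> from <ip> port <n>" message; log prefixes vary between distributions, so treat the regular expression as a starting point rather than a universal parser.

```python
import re
from collections import Counter

# Assumed line shape:
# "Aug 25 10:15:01 host sshd[1234]: Accepted publickey for alice from 192.0.2.10 port 51234 ssh2"
ACCEPTED = re.compile(
    r"^(?P<day>\w{3}\s+\d{1,2}) .*sshd\[\d+\]: "
    r"Accepted \S+ for (?P<user>\S+) from (?P<ip>\S+) port \d+"
)


def count_logins(lines):
    """Count successful SSH logins keyed by (day, user, ip)."""
    counts = Counter()
    for line in lines:
        match = ACCEPTED.search(line)
        if match:
            day = " ".join(match.group("day").split())  # collapse padding
            counts[(day, match.group("user"), match.group("ip"))] += 1
    return counts
```

Feeding it the lines of the auth log (often `/var/log/auth.log` or `/var/log/secure`, depending on the system) gives a counter you can chart or diff from day to day.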

 

2. Develop project management habits. Even for small, one-person projects. Write up a small scope of work, write requirements, get sign-off from stakeholders on their expectations, plan a schedule, and record your activities. Write up a postmortem document at the end. Even if it’s just for yourself. It doesn’t have to be fancy, and it certainly doesn’t have to be formal PMBoK activities. It may seem bureaucratic managing all that paper and it may seem like you’re spending more time on paperwork than sysadminning, but it helps keep you organized when your boss hands you a random high-priority assignment that pulls you away from your task. It’s also handy when you build a new system and users complain that it doesn’t do what they wanted it to do. See? You got their sign-off on the requirements document right there…

 

Even if it’s just for yourself, one day you’ll ask yourself, “now why on earth did I install Acme::Phlegethoth on this server? Oh yeah, it was for that weird commune who needs it for their application code…”

 

3. Develop a system for day-to-day work. Again, this may seem bureaucratic, but if you spend your days just “doing stuff” without a To-Do list, you may find it difficult to explain to your boss next week exactly what you’ve been doing with your time. I’ve become a fan of Kanban boards lately because it’s a visual device that your boss (or anyone who assigns you work) can interact with. Let’s say I’ve got three items I plan to work on today that should fill up my 8 hours. “Oh, you need me to work on this other item instead? Yes sir! Here is what I planned to work on today. Which one should I deprioritize in favor of this one? Oh, so it’s more important than this one, but not as important as these two? That’s fine, I can requeue that lower priority one and get to it later.” This helps set expectations. I know of one graphic designer who used it to coordinate her work between three competing project managers. If one asked her to prioritize something, she’d show him her board and send him to the other project managers to negotiate the conflict and coordinate their deadlines. Even if no one else looks at your board but you, it helps to keep you organized.

 

4. Develop communications skills (sales, presentation, etc). It took me a while to really understand why this is important. Yes, today you just want to sit in a server room, keep things running, and look at Lolcats. But tomorrow, you may have other people assisting (or working for) you. You need to be able to communicate expectations. You need to propose and advocate your ideas (great ideas never stand on their own merit unless and until they are properly communicated), to your peers or to management. Maybe you need to convince someone that they need to upgrade the web server. Maybe you need to explain your new server proposal that will fix all their problems. Maybe you need to convince the developer that his code is really causing those memory leaks, but you need to present it in a non-accusatory manner. I’m personally a big fan of Toastmasters for this, as it’s the cheapest and most effective way to improve your ability to communicate.

 

5. Start preparing for “what if” scenarios.  Your servers will crash. Your servers will be hax0r3d. Your backups will be corrupted. So start figuring out how to react when that happens. One of the unhappiest days of my life was when my personal server was r00t3d. I did all the right things, but the attackers were more dedicated to getting in than I was in keeping them out. How do you remove a rootkit after it’s discovered? I didn’t know then, because I never asked the question (remember? I thought I did all the right things to prevent it in the first place). You can bet I certainly know now! What happens when the server drops off the network because of a power outage, and now it’s saying “kernel not found”? What happens when your client or internal user asks for you to restore a backup, and the backup is corrupted? You may not get all the answers to these until you actually experience them first-hand, but it’s better to start asking the questions now and not when you have angry people yelling at you. Also, once you start asking the questions, you can start setting up “self-training” scenarios to test it. Set up a test box and remove the kernel. See if you can get it back to operational. Try and get someone to install a rootkit on it, or at least do a bunch of stuff that you have to troubleshoot and fix. By asking these questions now, you’ll be in a much better position to deal with them later.

 

Vaiyapuri Gopalakrishnan
by Vaiyapuri Gopalakrishnan , Manager - After Sales , M/s Saud Bahwan Automotive llc

1. Create a user account for yourself and use it to log in; use sudo to elevate your privileges to root only when you need them. Read up on both sudo and ssh capabilities, especially ssh if you are managing multiple servers.

2. Run as few services as possible; disable anything you do not use or need. Every service is a security risk and a resource consumer. Make sure you know and understand how the ones that you do run are configured and in which configuration files.

3. Monitor critical behaviour in a graphing tool, so you can look at values over time. Looking at top/free/iostat/netstat etc. is useful (see below) but does not replace or offer the same kind of insight as being able to look at the value of the same parameters over time. Graph anything that is important to you, like disk space, CPU usage, network usage etc., in as much detail as you can afford.

4. Learn debugging / troubleshooting and analysis tools for real-time insight. This involves getting to know top, iostat/dstat, sar, netstat, nmap, tcpdump etc., which are all analysis tools for a running system to figure out what is happening in case of a crisis/problem. Learning should take place before the actual crisis/problem situation. Once you need them it is too late to learn (or at least very painful).

5. Store logfiles and look in them; preferably store them off-host on a logserver as well. The latter will give you valuable information on why a server went down or became unreachable. Or worse, when you have a breach of security and need to make sure that the logs are unmodified.

6. Install an intrusion detection tool and a firewall. Use a non-permissive ruleset, aka deny everything and allow selectively what you know you have configured/need.

7. Make sure that you clean out history on logout, and make sure you log out as well. Command-line history is a goldmine for passwords. Avoid storing passwords in files in clear text.

8. Know your hardware; there is an immense difference between various hardware components and their capabilities.

And the most important practice on Unix / Linux is and always has been: Automate any non-trivial process or task!

New tools like puppet and chef are really nice, but this is a long-standing (best) practice on Unix IMHO. Before configuration management tools, good sysadmins would write scripts to automate any non-trivial task because a script works as documentation of the task and provides repeatability. It is also far less error-prone than doing the task manually in nearly all cases. And faster.

The usual stuff about software engineering applies just as well to scripts as it does to "real code". Learn how to program in your favorite scripting language instead of making scripts that are a glorified collection of command-line invocations. And yes, there is even a testing framework for bash and you can actually program in bash.
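To make the "deny everything and allow selectively" idea from point 6 concrete, here is a small Python model of a default-deny ruleset. It is only a sketch of the evaluation logic, not a real firewall configuration; the `Allow` class and `is_allowed` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Allow:
    """One explicit allow rule: source network (CIDR), port, protocol."""
    source: str
    port: int
    proto: str = "tcp"


def is_allowed(rules, src_ip, port, proto="tcp"):
    """Default deny: traffic passes only if some rule explicitly matches."""
    addr = ip_address(src_ip)
    return any(
        proto == rule.proto
        and port == rule.port
        and addr in ip_network(rule.source)
        for rule in rules
    )
```

Because an empty rule list allows nothing, every open port has to be a deliberate decision that someone wrote down, which also fits the point about scripts doubling as documentation.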

 
