Remember a few simple paradigms
1) The risk profile of a network or fabric is greater than the aggregate of the risk profiles of its connected endpoint/client nodes and services.
2) Never underestimate physical *and* logical separation. Ask yourself: what happens if the management/control plane goes down or gets stuck in 'flipmode'?
3) Protect your management and control planes above all else, and try not to have them in-path with the data plane. IT is change management: if you can't manage your resources, you may as well not have them.
4) Where are your policy enforcement points that facilitate auditability and visibility? AAA (authentication, authorization, and accounting) is a must!
5) Always use subnets and NETBLOCKs to separate traffic when you can. [i.e. practice good address management] QoS on subnets is easier than QoS on discrete flows.
6) Darkness is not good. Instrument and gather telemetry from your network. Poll inbound and trap outbound at a minimum. Baselining and trending help.
7) Always look at logs, sessions and empirical data rather than listening to conjecture and hearsay.
8) Abstraction layers are a good thing: they let logical resources and physical resources move without affecting one another. Loose coupling, not tight coupling, is the order of the day.
9) Always use loopbacks or virtual interfaces to manage devices where possible. [see 8]
10) In-path tests are the only things that represent what a client or endpoint sees. Up isn't always up; sometimes it's down.
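To make law 5 a little more concrete, here is a minimal sketch of an address plan that gives each traffic class its own subnet, using Python's standard `ipaddress` module. The parent block, the class names, and the helper functions `build_address_plan` and `classify` are all made up for illustration; substitute your own plan.

```python
import ipaddress

# Hypothetical parent block and traffic classes (assumptions, not from the laws).
SITE_BLOCK = ipaddress.ip_network("10.20.0.0/22")
TRAFFIC_CLASSES = ["management", "voice", "data", "guest"]


def build_address_plan(block, classes, new_prefix=24):
    """Give each traffic class its own equal-sized subnet of the block."""
    subnets = list(block.subnets(new_prefix=new_prefix))
    if len(classes) > len(subnets):
        raise ValueError("block is too small for the requested classes")
    return dict(zip(classes, subnets))


def classify(address, plan):
    """Return the traffic class whose subnet contains the address, else None."""
    ip = ipaddress.ip_address(address)
    for name, subnet in plan.items():
        if ip in subnet:
            return name
    return None


plan = build_address_plan(SITE_BLOCK, TRAFFIC_CLASSES)
for name, subnet in plan.items():
    print(f"{name:<12}{subnet}")
```

Once every class lives in its own subnet, a QoS or ACL policy can match one prefix per class instead of chasing individual flows, which is the point of the law.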
Note: This is evolving, please leave comments on adds, moves, and changes... including priorities!
4 Laws of Troubleshooting:
1) Get, define, and refine the PROBLEM STATEMENT, and ask the 5 Whys.
2) Always go back to basics and first principles.
3) Look for commonalities and deltas.
4) Document an end-to-end code/firmware matrix for your problem.
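Laws 3 and 4 work well together: once the code/firmware matrix is written down, the commonalities and deltas can be pulled out mechanically. The sketch below is one possible way to do that, not a standard tool. The device names, attribute names, and version strings are invented, and `suspect_attributes` is a hypothetical helper.

```python
# Hypothetical code/firmware matrix: one row of attributes per device.
MATRIX = {
    "edge-sw-01": {"model": "X100", "firmware": "4.2.1", "nic_driver": "1.8"},
    "edge-sw-02": {"model": "X100", "firmware": "4.2.3", "nic_driver": "1.8"},
    "edge-sw-03": {"model": "X200", "firmware": "4.2.3", "nic_driver": "1.8"},
    "edge-sw-04": {"model": "X200", "firmware": "4.2.1", "nic_driver": "1.8"},
}

# Devices that show the problem (assumed to come from the problem statement).
FAILING = {"edge-sw-02", "edge-sw-03"}


def suspect_attributes(matrix, failing):
    """Find attributes where all failing devices share one value (commonality)
    that no healthy device has (delta)."""
    failing_rows = [row for host, row in matrix.items() if host in failing]
    healthy_rows = [row for host, row in matrix.items() if host not in failing]
    suspects = {}
    if not failing_rows:
        return suspects
    for key in failing_rows[0]:
        failing_values = {row.get(key) for row in failing_rows}
        healthy_values = {row.get(key) for row in healthy_rows}
        if len(failing_values) == 1 and not (failing_values & healthy_values):
            suspects[key] = next(iter(failing_values))
    return suspects


print(suspect_attributes(MATRIX, FAILING))
```

A hit here is only a lead, not a root cause; it tells you where to go back to first principles.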
Hugo's take on things (not that I specifically disagree, but I do have a slightly varying point of view to the previously released laws):
1. Lack of visibility does not constitute lack of activity. While being unable to manage a device constitutes a significant risk, it does not constitute an outage.
2. We spend a great deal of time building highly available data paths in networks. They constitute one of the most reliable ways to get around the network. It is a valid consideration for the carriage of management traffic.
3. In a redundant, highly available network, a down device does not constitute a disaster; in fact, it doesn't even constitute an outage. Delaying its recovery constitutes a risk, not a problem.
4. The weakest part of your management is your people and processes. Think less technically and more simply. Sometimes an analogue phone is the best solution.
5. Focus your efforts on the areas where you have problems. Management like to see rapid improvement, so don't focus on what causes you 1 issue a month to the detriment of something causing you 10.
6. Before you ring for escalation support, type "show log", or look at the appropriate logs on the device or host.
7. History is important. Nothing changes radically overnight. If you can see what has happened before, you will know better whether you are looking at a one-off event or a recurring issue. Many other pointers come from history and trending information.
8. No matter how big a nuffer they are, the day-to-day or other incident staff may well have seen something important that they can tell you. Try to establish the information behind their assumptions.
9. Best practice is merely something that worked for others. Sometimes our differences necessitate divergence. The best German engineering software in the world is of little value to someone who only speaks English. The best network management software in the world adds little value if it does not gather call history and quality information on your VoIP network. Best practice is a great starting point, but usually not where you should end up.
10. Keep it simple. Networks have a way of complicating themselves, so your efforts should go towards keeping them simple and reliable.
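Points 5 and 7 above both come down to counting what has already happened. As a rough sketch, assuming you can get your event history into (timestamp, device, event) tuples, something like the following ranks the noisiest problems and separates one-off events from recurring ones. The sample history, device names, event strings, and the `rank_events` and `recurring` helpers are all invented for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical event history, for example parsed from syslog (invented data).
HISTORY = [
    (datetime(2024, 3, 1, 2, 15), "core-rtr-01", "LINK_FLAP"),
    (datetime(2024, 3, 8, 2, 14), "core-rtr-01", "LINK_FLAP"),
    (datetime(2024, 3, 15, 2, 16), "core-rtr-01", "LINK_FLAP"),
    (datetime(2024, 3, 15, 9, 40), "edge-sw-07", "PSU_FAIL"),
]


def rank_events(history, now, window):
    """Count (device, event) pairs seen inside the window, most frequent first."""
    cutoff = now - window
    counts = Counter(
        (device, event) for ts, device, event in history if ts >= cutoff
    )
    return counts.most_common()


def recurring(history, now, window, min_count=2):
    """Return the (device, event) pairs seen at least min_count times."""
    return [key for key, n in rank_events(history, now, window) if n >= min_count]


NOW = datetime(2024, 3, 16)
for (device, event), n in rank_events(HISTORY, NOW, timedelta(days=30)):
    print(f"{n:>3}  {device}  {event}")
print("recurring:", recurring(HISTORY, NOW, timedelta(days=30)))
```

The window matters: looking back only five days, the weekly link flap in this sample looks like a one-off, which is exactly why the history is worth keeping.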