Wednesday, November 04, 2009

Troubleshooting-101

Social:


* TRUST BUT VERIFY. Information Technology is supposed to be rational. Humans are not rational. Verify both.
* You may hear people talking, but don't listen to them; they will pollute your mind. Ask to see EVERYTHING for yourself.
* Only have the following on a call at any time (anything else wastes money and mindshare!):
** device/infrastructure administrators
** one infrastructure architect
** one application architect
* Don't let anyone try a scattergun or consensus approach. In fact, don't allow anyone on the call whose function is not technical or 100% required. More often than not, the Project Manager is not required once the call starts.
** Talk is most likely conjecture if it starts with "my understanding is", "I believe", "assume", "presume", etc.
* Explain you have to capture and share for audit purposes. Then capture and share.
* Always go back to first principles, including proving it's plugged in and switched on.
* Always ask to see the data/empirical evidence.
* Always get fresh data from the administrators, not stale logs.
* Never assume the admin knows how to use their tools.

Technical:


* Identify your application behaviour. If no one knows, end the call. (AppFlowNow)
* Ask for logs. If there are none, turn logging on sparingly.
* Separate your platform and application stacks
** the application stack is totally different from the platform/network stack
** the platform/network stack is totally different from the application stack

Warning:
All code contains bugs and every file can have configuration errors.
Humans write code, humans are fallible, code is fallible.


Testing:
* A failed application test proves absolutely nothing.
* Only a raw network test proves a data path exists.
* Application stacks use many modules and functions to create messages.
* Application stacks may implement their own protocols or use existing protocols.
* Application stacks can call on the TCP/IP stack of the host operating system or platform, which in turn uses device drivers to construct IP packets (or, in the case of Fibre Channel, FC frames, etc.).
* Network stacks have many tunable parameters, depending on the platform.
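
To make the difference between an application test and a raw network test concrete, here is a minimal Python sketch. It assumes "raw network test" can be read as "does a bare TCP handshake to the port complete?", which is only one interpretation; the function name `tcp_path_exists` is hypothetical and not from any tool mentioned in this post.

```python
import socket


def tcp_path_exists(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake to host:port completes.

    No application data is sent, so the application stack on the
    client side is not involved in the result.
    """
    try:
        # create_connection resolves the name and performs the handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable networks
        # and name resolution failures.
        return False
```

A `True` result shows a data path from this host to that port at this moment. A `False` result still needs dissecting: it does not say whether a firewall, a route, or a missing listener is responsible.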


Build a matrix and diagram and use it! Make stuff or source stuff!

* build a flow diagram to contextualise relationships
* collaborate on the matrix/diagram centrally
* allow ICMP echo_request and echo_reply (ICMPNow) on all project flows
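
One possible shape for such a matrix, sketched in Python. The `Flow` fields, the host names and the helper names are all hypothetical; the point is only that a matrix kept as data can be shared centrally as CSV and checked automatically, for example for flows that have no matching ICMP entry.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class Flow:
    name: str
    source: str
    destination: str
    protocol: str  # "tcp", "udp", "icmp", ...
    port: str      # "-" where a port does not apply (e.g. ICMP)
    pattern: str   # end-to-end, point-to-point, point-to-multipoint, mesh


def matrix_to_csv(flows):
    """Render the flow matrix as CSV text so it can be shared centrally."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=[f.name for f in fields(Flow)], lineterminator="\n"
    )
    writer.writeheader()
    for flow in flows:
        writer.writerow(asdict(flow))
    return buf.getvalue()


def missing_icmp(flows):
    """Return (source, destination) pairs that have flows but no ICMP row."""
    all_pairs = {(f.source, f.destination) for f in flows}
    icmp_pairs = {
        (f.source, f.destination) for f in flows if f.protocol == "icmp"
    }
    return sorted(all_pairs - icmp_pairs)


# Hypothetical example data.
flows = [
    Flow("web-to-app", "web01", "app01", "tcp", "8443", "point-to-point"),
    Flow("web-to-app-ping", "web01", "app01", "icmp", "-", "point-to-point"),
    Flow("app-to-db", "app01", "db01", "tcp", "5432", "point-to-point"),
]

print(matrix_to_csv(flows))
print("Pairs without an ICMP entry:", missing_icmp(flows))
```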

Identify your flows:
* end-to-end
* point-to-point
* point-to-multipoint
* mesh


Verify your endpoints and codebase(s):

* clients/servers
* Does ARP complete?
* default gateway
* interface IP AND Subnet Mask
* client route table(s)
* operating system and patch levels
* device driver versions
* check the bug lists for your versions; sometimes the bug is neither new nor unique (sometimes it is!)
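
The "interface IP AND Subnet Mask" and "default gateway" checks above can be combined into one quick sanity test. This is a minimal sketch using Python's standard `ipaddress` module; the function name `gateway_on_link` and the addresses are made up for illustration.

```python
import ipaddress


def gateway_on_link(interface_cidr: str, gateway: str) -> bool:
    """Return True if the gateway lies inside the interface's own subnet.

    interface_cidr is the interface address with its mask, e.g.
    "192.168.1.10/24". A gateway outside that subnet cannot be reached
    directly at layer 2, so ARP for it will never complete.
    """
    iface = ipaddress.ip_interface(interface_cidr)
    gw = ipaddress.ip_address(gateway)
    # The gateway must be in the subnet and must not be our own address.
    return gw in iface.network and gw != iface.ip


# Same IP and gateway, different masks: the wrong mask breaks the path.
print(gateway_on_link("192.168.1.10/24", "192.168.1.254"))
print(gateway_on_link("192.168.1.10/25", "192.168.1.254"))
```

The second call shows why the mask matters as much as the address: with a /25, the host believes its subnet ends at 192.168.1.127 and the gateway is off-link.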

Identify all your interim infrastructure nodes:
* local switch (layer 2, MAC/CAM table)
* default gateway (layer 3, FW/Router/LB)
* transit nodes (FW/Switch-Router/LB/Optimiser/IPS)
* operating system and patch levels

Verify the policies and configuration on all nodes:
* in-path
* pick one example flow and dissect it step by step
* check routes and routing on all devices
* go hop by hop
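
When checking routes hop by hop, it helps to predict which route each device should pick for the example flow before looking at what it actually does. The sketch below applies plain longest-prefix matching to a hypothetical route table. Real devices also weigh administrative distance, metrics and policy routing, so treat this as a first approximation only.

```python
import ipaddress


def select_route(routes, destination):
    """Pick the most specific route covering the destination.

    routes is a list of (prefix, next_hop) string pairs.
    Returns (prefix, next_hop), or None if nothing matches.
    """
    dest = ipaddress.ip_address(destination)
    matches = []
    for prefix, next_hop in routes:
        network = ipaddress.ip_network(prefix)
        if dest in network:
            matches.append((network, next_hop))
    if not matches:
        return None
    # Longest prefix (largest prefix length) wins.
    network, next_hop = max(matches, key=lambda m: m[0].prefixlen)
    return str(network), next_hop


# Hypothetical route table for one hop in the path.
routes = [
    ("0.0.0.0/0", "10.0.0.1"),
    ("10.1.0.0/16", "10.0.0.2"),
    ("10.1.2.0/24", "10.0.0.3"),
]

print(select_route(routes, "10.1.2.7"))
print(select_route(routes, "8.8.8.8"))
```

Repeat the lookup with each device's own table along the path, and compare the predicted next hop with what the administrator shows you on the device.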
