Sunday, February 21, 2016

On Blame

I am sick of the 'blameless culture' approach in technology. Blame is attribution, and it is often misused.

I think I have a different take. Blame has traditionally been misused in many orgs to scapegoat individuals or minorities (for political or egoic purposes), *however blame has utility; without it, attribution cannot exist*. Today we find ourselves part of a 'politically correct' mainstream, fearful of reprisals around the allocation of blame in complex scenarios. Blame can indeed be toxic when used unskillfully, and it can be used to persecute those who are least able to defend themselves or to initiate learning.

For Root Cause Analysis there has to be blame of a thing, process, person, agent, or group (or some mix thereof). In our conflict-avoidant culture, we tend to want to fix only the system or process, not always realising that humans make up a large part of that system. For humans to learn, they must not only know that they were wrong or made a mistake, but feel it deeply enough to trigger deep learning. That is a physiological response, and it must be allowed to happen.

There can be no responsibility, accountability, or learning without that attribution. The trick is not to extrinsically 'blame' or 'shame' individuals. Blame must be intrinsically attributable to the thing, the accountable team, or the responsible group (if that is indeed the true RCA), otherwise there is no organisational or individual learning. Without blame, the rest of the organisation has to evolve around these fixable/preventable failures via a form of avoidance or process overhead/tax.

Imagine a startup that couldn't fail fast and learn because the RCA is actually some of the people hired. Sure, you reset the training/hiring etc. (or fire them), but you also need to target the individuals for betterment if you want to keep them. If the RCA is that an individual or team needs more training, then this must be identified and dealt with at the management tier, right?

The challenge is not to explicitly blame/shame any *individuals* (their teams intrinsically know who was to blame for certain events) but for managers/leaders or groups to fall on their swords and accept attribution/blame for their team's actions, IMHO.

This topic of 'blameless' culture has a groundswell of support which I fundamentally disagree with, largely because it is itself a form of conflict avoidance and works against transparency.

This is indeed a nuanced approach, but attribution and accountability form the backbone of progress. I keep coming back to this seminal book (see the summary section).

Perhaps every Post Mortem should end with the question: "Do our teams need more training?"

Appendix A
Why Organizations Don't Learn

Saturday, December 12, 2015

Email Productivity Hacks

Might have just found the answer to work-email anxiety when HQ is in another timezone, i.e. "Delayed Messages": now running only Mon-Fri at 8am, 2pm, and 4pm GMT.

Amended code from the Musabi link above:

Note: I also use Zapier email parser to respond to meeting invites with some guidelines for meetings (just make sure your Google Mail filter excludes invitations with the words 'accepted' or 'updated' too).

Thursday, January 15, 2015

A note on netsec

Know your network and assets.
Gain situational awareness.
Quantify Value at Risk.
Risk is a function of dependency.
Map transitive trust.
Zone assets and services.
Partition failure domains.
Assume compromise.
Fail well.
Maintain ability to replay traffic to high value assets.
Drill incident response.
Minimise abstraction layers.
Advocate loose coupling.

Monday, November 10, 2014

WebSummit 2015 Re-Imagined: A More Evenly Distributed Future

WebSummit is trapped between a rock and a hard place yet it need not be so! The very thing that makes WebSummit special, its 'secret sauce' if you will, is that of the host country and its zeitgeist (and more specifically that of Dublin itself). One of the primary ingredients of this sauce is the Irish welcome, openness, and indeed the intimacy that occurs in and around the edges of the conference.

Like any good conference, festival, or gathering, it's as much about the serendipity engine of coming together in large groups, which then facilitates unexpected and novel interactions. More often than not the event's official content and schedule play second fiddle to the more intimate clusters of conversation before, during, and after sessions. WebSummit, like any human get-together, is about the people first and foremost, people whose interconnection is supported by the transport fabrics of the venue and host city. People come because of the promise of connection: connection to other people, to ideas, and to methods that facilitate their learning. Today, people expect to connect both digitally and physically, each a proxy to and serving the other.

Thus, event WiFi is one of today's crucial and ubiquitous service fabrics at technology events, and unfortunately it was sub-standard and woefully underprovisioned at WebSummit. Notwithstanding the underlying politics, event WiFi is a three-dimensional fabric that helps to distribute information to attendees and to connect them to the outside world during the proceedings (including connecting the outside world in), whilst catalysing human connections. The WebSummit WiFi did not seem to follow certain best-practice patterns for high density deployments (which are documented freely and openly on the web), but more on this later, including some key points and recipes for anyone else thinking about high density WiFi at 'webscale' events! First, let's look at how one might extricate WebSummit from the RDS (Royal Dublin Society) conference and exhibition centre without damaging the brand and buzz around the event itself.

WebSummit needs more leverage in this 'Mexican standoff' of sorts. It's trapped in the only event campus large enough to host its *current* numbers by an incumbent who has demonstrated they just don't get it. The irony is that WebSummit can neither write the network technology requirements itself yet (and/or bake them into contractual service level agreements, for a range of reasons) nor is it permitted to take advantage of entities who actually know how to provide this type of elastic wired and wireless network, due to the RDS's current stance. The RDS is unable to 'fail fast' and can only 'fail big', as there is no incentive nor room to rapidly iterate when your deployment cycle is once a year, involves actual physical hardware, and especially when you have a monopoly. WebSummit is unfortunately paying the RDS yearly to learn a little more about high density WiFi design and operation, yet the RDS is still falling short and thus damaging both WebSummit's brand and Ireland's national brand. The lack of quality and stability in this utility service is damaging the attendees' experience, damaging WebSummit's intrinsic and global marketing channels, and damaging the country's reputation by reinforcing negative Irish stereotypes rather than the positive ones which attracted many of the people in the first place. I could go on here about how the Web itself uses encapsulation and abstraction models, and how web startups only learn about 'web scale' (and thus the underlying OSI layers and network patterns) as they mature and gain traction, but I'd like to get back to the venue choice for a moment first...

The only leverage WebSummit has is to actually and fundamentally rethink using the RDS and find or create a local alternative for the event and 'festival' campus (so the RDS understands that moving location is not just a veiled threat, and that a WebSummit straight flush beats an RDS full house!). Ireland as a host country and city has many constraints indeed, but let's use them to get creative, to innovate, and to bootstrap the basics for a moment. WebSummit can *not* go abroad, as it would lose its special powers and become just another technology/startup conference, i.e. bland and over-commercialised. If it left the capital city, Dublin and indeed Ireland would lose so much more than just revenue; it would be an admission of national failure and incompetence. External parties would lose confidence in Ireland's startup scene, in the existing technology base, and most damagingly, in the potential and capability of the Irish to play at a global level whilst still at home.

WebSummit needs world class conference and trade show facilities within a stone's throw of the city centre's pubs, restaurants, hotels, and transport infrastructure, preferably all within walking distance of Grafton St. It needs all this with a nexus capable of hosting ~20,000 people at a keynote... but does it really? Intimacy is not a scale-free network, and it is scarcity that helps to determine perceived value. Let's suppose for a minute that WebSummit explicitly states that Dublin's nucleus is its true 24/7 campus. Although there is no rival to the RDS in terms of 'one (giant) throat to choke', perhaps a new campus could be imagined as an intertwined web of smaller, more intimate locations (just like the Night Summit itself!).

Consider, if you will, the docklands with a bit of vision.

Have a think about the above with a kind of SXSW feel. Sure, it would take a masterstroke of organisation and liaison with a range of parties, but the Convention Centre Dublin has a 2,000-seat auditorium, as does the Bord Gáis Energy Theatre, and the 3 Arena has a 14,500-seat capacity (combined with taking over the Odeon Cinema and anything else they could get their hands on nearby!). Just a thought to bootstrap your thinking! I'm sure many Irish people would give you 20 reasons why something like this could fail, all without any constructive criticism, ideas, or alternatives, but... what if...?

So, on to the WiFi... and known good patterns. Well, here is what WebSummit could do or have another entity do...

High Level Design / Basic Requirements

Client Requirements:
- One to three devices per attendee (all manner of smartphones and laptops)
- 2.4GHz and 5GHz support
- Minimum RSSI -67dBm / SNR 25dB in coverage areas
- Minimum 5/5Mbps throughput, up to a maximum of 20/20Mbps
- Application traffic types primarily miscellaneous web browsing
- HD video conferencing and voice should be available and prioritised
- Sub 5ms response from default gateway
- Sub 5ms response from cached DNS entries
- Multicast and local client to client connectivity not supported except in smaller spaces
- Limited wired connections of 100/100Mbps for all speakers and those wanting to do live demos.

WiFi Related:
- MCA (Multi-Channel Architecture), which rules out Meru!
- Distributed WiFi micro-cell architecture (rules out Xirrus!)
- Overhead directional 'patch' antennas
- 2.4GHz 'event-legacy' and 5GHz 'event' ESSIDs (throughout main hall)
- ESSIDs anchored to major spaces (named accordingly vs. full site roaming)
- Limited layer 2 roaming
- 20MHz-only channel widths for maximum spectrum re-use and clean air
- 802.11g/n only (802.11ac in some locations but not required!)
- Basic (i.e. mandatory) 18Mbps data rates and above only
- Predictive modelling/full survey but mandatory post-validation survey
- SNR to 25dB in all expected coverage areas
- Full WIPS+ Spectrum Analyzer capable and dedicated radios/APs.
- 802.11k (and/or proprietary load balancing in mini-radio clusters in super dense client areas)
- Careful use of RX-SOP (if available ;)
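As a sanity check, the minimum RSSI and SNR figures in the client requirements hang together if you assume a typical event noise floor; the -92dBm noise figure below is my assumption, not a measurement from the venue:

```shell
# Minimum radio figures from the requirements above.
MIN_RSSI=-67     # dBm, minimum signal in coverage areas
MIN_SNR=25       # dB, minimum signal-to-noise ratio

# Assumed noise floor for a busy hall (illustrative, not measured).
NOISE_FLOOR=-92  # dBm

# SNR is simply signal minus noise when both are in dBm.
snr=$(( MIN_RSSI - NOISE_FLOOR ))
echo "SNR at minimum RSSI over a ${NOISE_FLOOR}dBm noise floor: ${snr}dB"
```

With those numbers the two requirements are consistent: a client at exactly -67dBm still sees 25dB of SNR. As the noise floor rises during the event (thousands of radios!), the effective minimum RSSI needed to hold 25dB rises with it.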

Wired Backbone and Event Services:
- Minimum dual active/active 10Gbps ISP transit links via disparate vendors/metro rings (with ability for vendors/exhibitors to terminate their own feeds)
- Minimum 40Gbps+ capable edge routing/firewalling
- 10Gbps dual redundant access edge uplinks to distribution
- Full 20-40Gbps or more primary campus backbone/infrastructure to the CORE
- N+1 redundant architecture throughout as far as the access/edge and APs
- Well architected L3 domains (to partition and minimise L2 failure domains)
- Routers should route and firewalls should firewall, thus DNS and DHCP should be provided for via dedicated servers or appliances 
- Software caches and/or major CDN edges onsite
- Local redundant / Anycast DNS resolvers and/or caches (i.e. not performed on routers/FWs)
- Dedicated physical links and paths (where possible) for exhibitors/vendors and/or workshops or labs
- L7 Application Visibility / DPI (Deep Packet Inspection) and associated shaping/throttling or queuing to a scavenger class for known bandwidth hogs
- Optional per-client SRC IP bi-directional rate limiting

NetOps/SecOps and Customer Service:
- A full NOC (Network Operations Centre) that also engages attendees via locally hosted status pages and other social media channels, e.g. Twitter
- All digital signage giving informative and constant network info/updates
- Constant and distributed monitoring via humans/sensors/APs to adjust for a growing noise floor and to track down any 'evil twin' APs or strong rogues
- Full Network Management, Capacity Management, Alerting etc.
- Technically qualified roaming volunteers to help attendees get connected, including at event hubs and booths

Note: This is just a mixed flavour of some high- and low-level critical design elements (of course, more explicit requirements should be created and customised with respect to WebSummit's specific functional and non-functional requirements, including proper design documentation etc.). There is no escape, however, from doing simple maths with respect to the number of supported CAM table and ARP entries per infrastructure device, and from factoring in things like TCP setups per second and concurrent NAT sessions at layer 3 boundaries. Also: know your clients, i.e. do some capacity planning in advance... and wouldn't it be lovely to use all public IPs for clients at events ;)
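That "simple maths" can be sketched in a few lines of shell. All of the numbers below are illustrative assumptions of mine (not WebSummit's actual figures), but the shape of the calculation is the point:

```shell
# Back-of-envelope event network sizing; every figure here is an assumption.
ATTENDEES=20000
DEVICES_PER_ATTENDEE=2       # phones plus laptops
PEAK_CONCURRENCY_PCT=60      # % of devices associated at peak
NAT_SESSIONS_PER_CLIENT=100  # concurrent flows a busy browser can hold open

# Every associated client consumes one CAM (MAC) entry and one ARP entry.
clients=$(( ATTENDEES * DEVICES_PER_ATTENDEE * PEAK_CONCURRENCY_PCT / 100 ))

# Each client then multiplies out into NAT state at layer 3 boundaries.
nat_sessions=$(( clients * NAT_SESSIONS_PER_CLIENT ))

echo "Peak associated clients (CAM/ARP entries needed): $clients"
echo "Concurrent NAT sessions to size for: $nat_sessions"
```

Running those assumed numbers gives 24,000 clients and 2.4 million concurrent NAT sessions, which is exactly why the edge gear's published table limits need checking before the doors open, and why all-public client addressing is so appealing.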

Disclaimer: I was not at #WebSummit but was watching live from Berlin whilst talking to some people who were (and am now back home in Dublin for a short stint!).

If anyone wants to leave constructive comments, spot errors/omissions, or follow up, please do so or ping me on Twitter @irldexter. In case anyone is wondering, my background is here @podomere

Sunday, September 14, 2014

OSX Wifi

A quick and dirty script to put in your crontab to see what the hell is going on! MacBook Airs act funny with power management and their SNR. Given the RSSI and the noise floor, you can just subtract the noise from the RSSI to get the SNR. At the end of the day though it's the SINR that counts, and proper tools are required to diagnose non-802.11 interference, CCI, ACI, throughput, and retries...
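The original wifitest script isn't reproduced here, so the below is my own minimal sketch of what such a script might look like, assuming Apple's bundled `airport` CLI at its standard framework path; the log location and function name are my inventions:

```shell
#!/bin/sh
# Hypothetical sketch of a 'wifitest' cron script for OS X; only the airport
# CLI path is Apple's standard location, everything else is an assumption.
AIRPORT="/System/Library/PrivateFrameworks/Apple80211.framework/Versions/Current/Resources/airport"

# Parse `airport -I` style output on stdin and print SNR = RSSI - noise (dB).
snr_from_report() {
    awk '/agrCtlRSSI/ {rssi=$2} /agrCtlNoise/ {noise=$2} END {print rssi - noise}'
}

# Only query the radio when the CLI exists (i.e. when actually on OS X).
if [ -x "$AIRPORT" ]; then
    snr=$("$AIRPORT" -I | snr_from_report)
    echo "$(date '+%Y-%m-%d %H:%M:%S') SNR: ${snr}dB" >> "$HOME/wifitest.log"
fi
```

For example, an `airport -I` report showing `agrCtlRSSI: -60` and `agrCtlNoise: -90` logs an SNR of 30dB.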

## Then put the below in your crontab for every minute (with your own path of course)…
* * * * * /Users/useraccount/wifitest/ > /dev/null 2>&1

Friday, April 04, 2014

One liners

So I've always been a fan of the Unix philosophy, with a passion for trying to do most things in one line of Bash rather than a full script or program. I am by no means adept but battle through with sed, awk, paste, cut, tr, sort, etc. So when I can potentially use curl on a RESTful service and stay on the command line rather than logging in to a web app, I'll give it a go.

We use Saasu for our back-end invoicing and reconciliation and they provide access to their API via a secret key (tied to configurable users). I decided I wanted to see what we were owed and owing via the command line so with the help of some other simple tools I came up with the below. It's still a work in progress and I'm open to all the help and any suggestions I can get ;)

First you may need to ensure you have 'xmlstarlet', 'dialog', and 'cURL' installed on your *nix system via ports, apt, or otherwise, and that they are happily found in your $PATH. Then, in the commands below, replace 'XXXXXXXXXXXXXXXX' with your Saasu secret/access key (preferably one belonging to a read-only user account) and 'YYYYY' with the file ID of your desired Saasu account. You can find out how to enable the web services API from Saasu here.

Note: I actually ended up putting the one-liners below in respective files and calling those (though you could alias them just as easily; I just didn't want to keep re-sourcing while testing/editing)... and bingo, you have a command (which you can call whatever you like).

Owed (the following is one single line):
dialog --title "Company Accounts Receivable" --msgbox "`echo -e "\r\n\r\n" && curl -s ""  | xmlstarlet sel -t -m //invoiceListItem -o "Invoice #" -v invoiceNumber -o " is/was due by " -v dueDate -o " for " -v amountOwed -o " " -v ccy -o " by " -v contactOrganisationName -n | sort -n -k2 && echo -e "\r\n\r\n"`" 60 100 ; clear

Owing (the following is one single line):
dialog --title "Company Accounts Payable" --msgbox "`echo -e "\r\n\r\n" && curl -s ""  | xmlstarlet sel -t -m //invoiceListItem -o "Invoice #" -v invoiceNumber -o " is/was due by " -v dueDate -o " for " -v amountOwed -o " " -v ccy -o " to " -v contactOrganisationName -n | sort -n -k2 && echo -e "\r\n\r\n"`" 60 100 ; clear

So now I just type 'owed' or 'owing' on the command line to get:

I know there's a lot more that could be done, tweaked, improved, and extended... so let me know what you're thinking via @irldexter on twitter if you want to get in touch!

Thursday, March 27, 2014

Changing Masks

So a customer reckons a quick subnet mask change on their router will increase their available host range on their management network.. just like that... :(

Problem: Why enlarging a subnet by shortening its mask (even when you can keep the same gateway) breaks things if you don't update all the associated nodes' masks (and the additional assets that reference the new, larger subnet)... can you help with, add to, or validate the list of issues below?

Example / Details :
  • Router is a Cisco 5548 running NX-OS 5.0(3)N1(1c) 
  • Original network: 10.4.66/24 Router gateway: Router mask updated: to /23 i.e. (The 10.4.67/24 is free for use).
  • Updated network: 10.4.66/23 Router gateway remains: 
  • Existing endpoints not updated and remain with original /24 mask. 
  • Only new endpoints in the higher portion of the usable range, and the router itself, now have the /23 mask
  • VLAN ID stays the same. 
  • Network consists mainly of management servers and infrastructure devices management interfaces. 
  a) If host A, with its old /24 mask, tries to talk to a new host B in the higher portion of the router's updated /23 (i.e. 10.4.67.x), it cannot do so directly via ARP. After checking its own mask, host A assumes the other host is on a remote subnet. Host A (thinking it's still on a /24) then sends all traffic to the default gateway rather than locally to host B. The router has to process the frame/packet, do a lookup, and forward it on, essentially double-handling traffic in a conversation that could have remained entirely on the local switched fabric.
b) Any hosts with static routes configured for the initial /24 will follow their default gateways to reach the new /23 higher portion without having their static routes updated. This may affect multihomed hosts with multiple egress interfaces that require the non-default gateway to communicate to remote management networks for example.
c) The new network will have to be confirmed as being advertised in all required infrastructure routers routing tables, VRFs, and associated statics. (Equally this may affect VPN concentrators, layer 3 switches, or any devices that perform either dynamic or static routing).
d) Any NAT rules will have to be updated to allow for the new /23 and associated pool sizes and mappings.
e) Any firewall objects/network objects will have to be updated to reflect the new network size.
f) For any hosts that use the IP broadcast address to communicate (as opposed to the layer 2 all hosts broadcast address of ff:ff:ff:ff:ff:ff ), the /24 broadcast address is whereas the /23 broadcast address is… (albeit will always reach either) thus all endpoints/hosts should be updated.
g) Methods that use proxy ARP (or potentially gratuitous ARP) from the /24 range may fail to update the router and/or hosts if the IP is not deemed to be from the correct subnet.
h) Any infrastructure/router ACLs (Access Control Lists) that reference the /24 mask will now have to be updated to reflect the /23 mask, or connectivity/reachability may suffer.
i) Any infrastructure/router prefix lists, policy maps, or traffic engineering that references these subnets or utilises the ACLs above may fail without being correctly updated.
j) If one were to update the required endpoints/servers with the new /23 mask, many devices may cache the old mask and/or require a networking restart or route flush before the change takes effect.
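Item (a) can be demonstrated with a little shell arithmetic. This sketch is my own addition for illustration; the addresses come from the example above, while the helper names are hypothetical:

```shell
#!/bin/sh
# Illustration (not from the original post): why a host still on the old /24
# mask sends traffic for 10.4.67.x to its gateway, while a host updated to
# /23 ARPs for it directly.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    IFS=. read -r o1 o2 o3 o4 <<EOF
$1
EOF
    echo $(( (o1 << 24) + (o2 << 16) + (o3 << 8) + o4 ))
}

# Decide, as a host would, whether dst is on-link given src and a prefix length.
on_link() {
    src=$(ip_to_int "$1"); dst=$(ip_to_int "$2"); len=$3
    mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
    if [ $(( src & mask )) -eq $(( dst & mask )) ]; then
        echo "on-link (ARP directly)"
    else
        echo "via default gateway"
    fi
}

on_link 10.4.66.10 10.4.67.20 24   # old /24 host: via default gateway
on_link 10.4.66.10 10.4.67.20 23   # updated /23 host: on-link (ARP directly)
```

The same host pair gives two different answers purely because of the local mask, which is exactly the mismatch that doubles the router's workload in item (a).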
