'Risk in a Box': IP Bubblewrap
==============================

A technology risk framework and data visualisation methodology for an IP organisation, based upon empirical data, end-to-end flow / type, control / data planes, connectedness, trust levels, and *known* vulnerabilities and threats. This framework aims to assist individuals and organisations in quantifying risks to an IP infrastructure and its services by enumerating current traffic and threats, rather than working from a 'perception' of risk and threat in relation to the unknown, which is, incidentally, unbounded and unquantifiable as time elapses.
The author hopes, more so, to foster enumeration and enablement of the positive or 'valid' data / services under extreme or unforeseen circumstances, rather than attempting to enumerate the unknown or 'invalid' negative conditions. Enumerating the *currently* known negative will, however, help to quantify and frame technology risk overall. Owing to the challenges of complexity, a very generic breakdown is applied here, which will hopefully be improved upon and refined in the future. The focus is on the actual flows, relationships, dependencies and interfaces to IP services, rather than on specific valuation of data at rest.
Aside: the term 'IP' in the title could stand for either 'Internet Protocol' or 'Information Protection' ( but not 'Intellectual Property' ); it is used to mean 'Internet Protocol' throughout this paper.
Technology risk is viewed as a subset of overall business risk and may be attributed different weightings based upon an organisation's or individual's perception of their reliance upon certain technology areas.
Problem Statement
==================
The complexity and velocity of information technology in IP-enabled organisations is such that, irrespective of dimensioning and design, the majority of organisations face similar challenges relating to information-based asset classification and information protection; this is due mainly to the underlying intricacy of protocols, node relationships, operating systems and applications, and to the management, operation, integrity and stability of all of the above.
Unfortunately, from the macro-organisational view it can be hard to prioritise information flows and services, as different groups or individuals may not possess enough knowledge about the transports and technologies that enable their distinct applications or business processes. This in turn leads to a sub-optimal allocation of resources and / or a skewed view of the organisation's IP-enabled world. It is very easy to see the end result of a product or new application, but very hard to see the intrinsic use of the IP network, operating systems and dependencies which enable the actual operation and availability of said application or product.
From the micro view, a single entry in a logfile, a certain vulnerability, an attack or a potential loss in data integrity may have catastrophic consequences for the ability of an organisation to operate effectively, or in some cases at all. Some circumstances may result in systematic degradation of service over time, others in immediately quantifiable revenue loss ( albeit generally not with any form of guaranteed accuracy ).
There exists no common or easily applied methodology or framework to help contextualise the connectedness, dependencies, threats and risks to an organisation that finds itself heavily dependent upon IP-based services.
Overview
========
Certainly different individuals ( be they IT management, IT executive or otherwise ) may attribute different values to information assets; however, common or shared infrastructure, protocols, services and / or data have more far-reaching consequences for an organisation's function and ongoing stable operations than those individuals may initially appreciate. It is with this understanding that a baseline or sliding window must be established across an organisation to facilitate macro classification of services, data and nodes. It should be noted that this is only one way to interpret and view an IP-enabled organisation.
Nodes
=====
a) transit / infrastructure node
( Routers, firewalls, load balancers, VPN concentrators, GGSNs, MMSCs, IDPs, or anything that facilitates a flow between endpoints. Traffic may be generated by a 'transit node' for the purposes of management, reporting etc., however these flows should be attributed to (c) below. This node type also includes nodes that store related configuration and management data, e.g. Network Management Systems / MOMs / OSS. )
IT workers' laptops and desktops are defined as type (a) nodes, as they are viewed as supporting and adding value to the infrastructure and services; so are servers running services such as Active Directory, native LDAP, DNS etc.
b) endpoint / business node
( Customers, clients, servers or any standard end-to-end connectivity / flow that terminates or generates explicitly business- or revenue-related traffic. These nodes should not be related to the operation, management and reporting functions of the underlying infrastructure or network, but do include nodes that store related billing, company financial, customer, and ordering / provisioning data etc. )
Services
========
c) infrastructure control and protocols ( business infrastructure and process support )
( BGP/OSPF/EIGRP/MPLS suite/RIP/DNS/NTP/SYSLOG/LDAP/Radius/SSH(SFTP/SCP)*/TELNET/TFTP/FTP/SNMP/RPC/ICMP/SIP/H323... )
d) data / payload and protocols. ( business product / service / customer / transaction information related )
( SMTP, HTTP, HTTPS, FTP, Radius, SSH(SFTP/SCP)*, NFS, CIFS, SQL based, Mediation / Billing based, CRM based, ERP based, Financial / HR based )
With the above in mind, a node of type (a) may facilitate services or data of type (c) and / or (d) but should only generate (c) itself. A node of type (b) should only facilitate services and data of type (d), though it may generate its own infrastructure traffic (c) such as SNMP traps, Syslog, authentication traffic and standard name resolution protocols.
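A minimal sketch of these generation / facilitation rules in Python follows; the single-letter labels are the (a)-(d) types above, while the table and function names are illustrative only, not part of the framework.

    # Sketch of the generation / facilitation rules for node types (a)/(b)
    # and service types (c)/(d).
    MAY_GENERATE = {"a": {"c"}, "b": {"c", "d"}}   # (b) also sources its own (c), e.g. SNMP traps
    MAY_FACILITATE = {"a": {"c", "d"}, "b": {"d"}}

    def flow_is_valid(node_type: str, service_type: str, role: str) -> bool:
        """True if a flow fits the model. role: "generate" or "facilitate"."""
        table = MAY_GENERATE if role == "generate" else MAY_FACILITATE
        return service_type in table[node_type]

    # A transit node (a) sourcing payload traffic (d) breaks the model:
    assert flow_is_valid("a", "c", "generate")
    assert not flow_is_valid("a", "d", "generate")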
More granularity will be introduced at a later date for protocol types assigned to service types (c) and (d), which can be weighted to influence metrics. Tunnelled protocols may be viewed as separate flows during and after encapsulation, and would be attributed similar values unless transiting different 'trust' bases. ( See the 'Trust' section. )
* Note: the author would like to express the desire to be able to tag traffic with a 16-character word rather than just DSCP / MPLS labels, and hopes that with the rise of 'Service Oriented Architectures', protocols such as those used by Tibco may be 'peered' into by IPFIX-based flow reporting, for example. An application that could prefix or mark its actual payload in clear text with a 'subject', which could be matched via something like FPM ( Flexible Packet Matching ) and treated uniquely by the network, is believed to be beneficial.
Control (Infrastructure) plane / Payload (Data) plane
=====================================================
Having broken the organisation down into nodes and services, the question for the 'Risk in a Box' framework is 'where do they fit?'. From a business context, the IP-enabled organisation may be viewed here in two operational planes.
1. The "Payload" plane containing (b) - endpoint / business nodes and (d) data / payload services
2. The "Control" plane containing (a) transit / infrastructure nodes and (c) infrastructure services
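Expressed as a simple lookup ( a sketch; the naming is illustrative, not part of the framework ):

    # Sketch: which plane each node or service type belongs to.
    PLANE = {
        "a": "control",  # transit / infrastructure nodes
        "c": "control",  # infrastructure control services
        "b": "payload",  # endpoint / business nodes
        "d": "payload",  # data / payload services
    }

    print(PLANE["a"], PLANE["d"])  # control payload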
Trust
=====
The concept of 'trust' is applied to different IP segments and / or hosts based upon their overall IP reachability, posture, and user / systems access; it can be somewhat qualitative, though it needn't be.
There exist three trust domains, which should be assigned by those knowledgeable of the organisation's overall IT architecture:
1. Trusted ( e.g. internal-only services and networks upon which only trusted employees or systems operate, Intranet etc. )
2. Semi-Trusted
3. Un-Trusted ( e.g. Internet etc )
Many may disagree here and say that some, if not all, segments and hosts should be treated as 'un-trusted'; however, this is not the reality we find ourselves occupying, and with 'defense in depth' strategies and layered security models, certain factors including cost, resources, technology and expertise dictate trade-offs such that hosts or networks are viewed in these categories or treated as such.
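As a sketch, trust domains might be assigned per IP prefix and resolved per host using a longest-prefix match; the prefixes below are purely illustrative.

    import ipaddress

    # Sketch: illustrative prefix-to-trust assignments; a real organisation
    # would populate this from its own addressing plan.
    TRUST_BY_PREFIX = {
        ipaddress.ip_network("10.0.0.0/8"):   "trusted",       # internal / Intranet
        ipaddress.ip_network("192.0.2.0/24"): "semi-trusted",  # e.g. an Extranet segment
        ipaddress.ip_network("0.0.0.0/0"):    "un-trusted",    # everything else, e.g. Internet
    }

    def trust_of(host: str) -> str:
        """Return the trust domain of a host via the most specific matching prefix."""
        addr = ipaddress.ip_address(host)
        nets = [(net, label) for net, label in TRUST_BY_PREFIX.items() if addr in net]
        return max(nets, key=lambda pair: pair[0].prefixlen)[1]

    print(trust_of("10.1.2.3"))      # trusted
    print(trust_of("198.51.100.7"))  # un-trusted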
Connectedness and Importance (CI)
=================================
Value (CI) 0-1
Inherent in the degree of connectedness is indeed an intrinsic measure of value, though it resides more in the context and frequency of transiting or terminating flows. This quantification applies most accurately to controlled or trusted organisational segments, as only valid business traffic should exist on the aforementioned 'Control plane' and 'Payload plane'.
Entropy may be increased for nodes facing 'non-trusted' segments such as the Internet, or Extranet paths that do not invoke rate-limiting or QoS when required.
Example argument: a 'flash crowd' hitting some piece of static content on a web server may not actually increase the value of an access / border gateway router or firewall that carries no real service of type (d) business product / service. It would, however, highlight a delta in traffic and a possible DoS ( Denial of Service ) condition for any other services that utilise the shared connectivity path, such as external DNS resolution. Theoretically, at any time the maximum number of flows terminating at or transiting a device is constrained by a combination of factors such as total IP host reachability, upstream bandwidth, and the CPU, memory and hardware resources available to service such IP flows. This may temporarily raise the importance metric enough to warrant attention or to highlight the need for certain corrective measures.
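The traffic-delta idea can be sketched as a simple rolling-average comparison; the threshold factor and window size here are illustrative assumptions.

    from collections import deque

    def flow_delta_alarm(history: deque, current: int, factor: float = 3.0) -> bool:
        """Sketch: flag a possible flash crowd / DoS condition when the current
        per-period flow count exceeds `factor` times the rolling average."""
        spiking = bool(history) and current > factor * (sum(history) / len(history))
        history.append(current)  # keep the window moving
        return spiking

    window = deque(maxlen=12)    # e.g. the last 12 measurement periods
    for count in [100, 110, 95, 105, 900]:
        if flow_delta_alarm(window, count):
            print("traffic delta: possible DoS / flash crowd; raise CI temporarily")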
From the 'Control plane' it is possible to extract flow information regarding all endpoint IP-enabled devices, their unique and common relationships, flow types and frequencies. This facilitation of flows highlights the importance of the 'Control plane', and the degree of connectedness is most easily, accurately and economically drawn from 'flow'-enabled nodes of type (a). Flows may be garnered from nodes of type (b) with host agents such as Argus, but these can be platform specific and do not scale as easily. In future, off-box, auditable host flow records may be warranted / recommended. For more information about IPFIX and flows please see RFC 3917.
For the moment, the number of flows transiting a node of type (a) shall give it increased value over an assigned base value. The additional value of the type (a) node shall be calculated by virtue of the number and frequency of type (c) and (d) flows, with weightings attributed accordingly, as these will undoubtedly differ per organisation and its use of non-standard or arbitrary high ports. This will be discussed in detail at a later stage and also depends upon positioning and trust values.
Nodes of type (b), be they server or client, may also be attributed a base value and assigned additional value by virtue of the number and frequency of type (c) and (d) flows they entertain. Naturally a server should host more sessions than a client, be they client-to-server authentication, server-to-server traffic, or server-to-database etc. Should a client-side device experience high volumes of *valid* traffic, this may highlight the actual importance of the function of that client machine, whether its sessions are user-driven or automated. This may also help to highlight devices that should be deemed servers and treated as such, or some other anomalous or non-acceptable / invalid use or traffic. There will always be exceptions to this rule.
Aside: an 'End System Multicast' or legitimate 'Peer to Peer' application may break this concept, though multicast should occupy a distinct address range and legitimate Peer to Peer traffic may be re-classified into a less-weighted flow type. Auto-discovery and port-scanning nodes should be known in advance and treated as a special case of type (a) nodes; anything else would suggest 'invalid / negative' traffic, and should a workstation peak in terms of flow frequency, that would be deemed grounds for investigation.
k = total number of classifications of C flows, decided upon in terms of priority, where k is a whole number and x = 1 is the most important flow or flow group.
s = total number of classifications of D flows, decided upon in terms of priority, where s is a whole number and x = 1 is the most important flow or flow group.
Table 1 ( Partial ):
---------------------------------------
C flow weighting ( Classification / Priority x = {1..k} ):
c(x) = k / ( x² + k )

D flow weighting ( Classification / Priority x = {1..s} ):
d(x) = s / ( x² + s )
---------------------------------------
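A short sketch computing these weightings:

    def weight(x: int, total: int) -> float:
        """Table 1 weighting: total / (x^2 + total), for priority x in 1..total."""
        return total / (x ** 2 + total)

    # e.g. with k = 4 classifications of C flows, the most important class
    # ( x = 1 ) weighs 0.8 and the weights fall away quadratically:
    print([round(weight(x, 4), 2) for x in range(1, 5)])  # [0.8, 0.5, 0.31, 0.2]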
z(c1) = number of C1-priority flows per time period 't' in seconds, where z is a number between 0 and an as-yet-unresolved upper bound ( bounding this remains an open problem ).
z(d1) = number of D1-priority flows per time period 't' in seconds, subject to the same open upper-bound problem.
It is recommended that 't' be set low initially, ~ 1 week.
Payload plane node: Connectedness / Importance (CI) = still to be formalised; some form of integral or weighted sum over the set of flows is envisaged.
Control plane node: Connectedness / Importance (CI) = likewise still to be formalised.
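Since the CI formula is left open above, the following is purely a placeholder sketch of one bounded possibility: a base value plus the weighted z counts from Table 1, squashed into 0..1. The squashing step and the base value are assumptions of this sketch, not the framework's formula.

    def weight(x: int, total: int) -> float:
        """Table 1 weighting: total / (x^2 + total)."""
        return total / (x ** 2 + total)

    def ci(base: float, z_c: list, z_d: list, k: int, s: int) -> float:
        """Placeholder CI in [0, 1): z_c[i] / z_d[i] are the flow counts of
        priority i + 1 observed over period 't'. The r / (r + 1) squash that
        bounds the result is an assumption, not the framework's formula."""
        r = base
        r += sum(weight(i + 1, k) * n for i, n in enumerate(z_c))
        r += sum(weight(i + 1, s) * n for i, n in enumerate(z_d))
        return r / (r + 1.0)

    # A busy node with many priority-1 C flows lands close to 1:
    print(round(ci(0.5, [120, 40], [10], k=2, s=1), 2))  # ~0.99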
Vulnerabilities and Threats (VT)
================================
Value (VT) 0-1
Without re-inventing the wheel, the FIRST CVSS ( Common Vulnerability Scoring System, http://www.first.org/ ) shall be used as a metric to help enumerate known vulnerabilities in the organisation. The vulnerability / threat concept shall be a 0-1 value and may have other properties based upon the 'Trust' level as viewed by the organisation.
Actually relating these metrics to the organisation will require a 'Vulnerability Assessment' of sorts, which may take the form of an automated tool or a manual process. It is hoped that calculations may be done automatically in the future, based upon some form of 'risk' or correlation engine that can take feeds from CVSS-enabled vulnerability scanners. It is recommended that a vulnerability scanner be given ubiquitous access to segments, either locally or such that if future ACL or FW changes occur, all known / existing vulnerabilities remain capable of being enumerated. It should be noted that vulnerability scanning carries risks of its own regarding stability and availability. Such scanners would be treated as type (a) nodes.
IDS / IDP / IPS may also feed into these figures as confirmation and escalation of threat levels.
A known vulnerability, or multiple vulnerabilities, for a node generates a (V) value, which is multiplied should the trust profile and / or known flows, posture or IDS confirm the threat (T).
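A sketch of that combination, assuming the CVSS base score ( 0-10 ) is normalised to 0-1 and then multiplied up on confirmation; the multiplier values here are illustrative, not prescribed.

    def vt(cvss_base: float, threat_confirmed: bool, trust: str) -> float:
        """Sketch: VT in 0..1 from a CVSS base score in 0..10. The trust
        factors and the confirmation multiplier are illustrative assumptions."""
        TRUST_FACTOR = {"trusted": 0.8, "semi-trusted": 1.0, "un-trusted": 1.2}
        v = cvss_base / 10.0              # normalise the (V) value to 0..1
        if threat_confirmed:              # flows / posture / IDS confirm (T)
            v *= 1.5
        return min(v * TRUST_FACTOR[trust], 1.0)  # clamp VT to the 0-1 range

    print(vt(7.5, threat_confirmed=True, trust="un-trusted"))    # 1.0 ( clamped )
    print(vt(4.0, threat_confirmed=False, trust="semi-trusted")) # 0.4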
Data Visualisation 'IP Bubblewrap'
==================================
The concept is that of a 3D / isometric cube which plots two distinct planes made up of equi-sized bubbles ( a malleable 'bubblewrap' plane each, if you will ). Bubbles may be individual nodes, but more generally will be groups of nodes with similar connectedness / flows / posture / IP prefixes. Bubbles shall attempt to cling together ( i.e. have some form of stickiness ) to provide a single viewable plane, but when queried directly will represent exact figures.
The Control plane [type (a) nodes / groups] starts as one horizontal plane at the base of the cube, and the Payload plane [type (b) nodes / groups] is another horizontal plane starting at the middle; thus the cube is sub-divided horizontally into a Control space and a Payload space. All values vary between 0 and 1, and as such the volumes the bubbles may actually inhabit form two short, squat vertical cylinders.
The four sides of the cube represent 'Trust' levels, e.g. 2x Trusted ( opposite each other ), 1x Semi-Trusted and 1x Un-Trusted. This accommodates the majority of hosts, which should be 'trusted', while hosts skewed towards 'semi-trusted' or 'un-trusted' may be graphed along the diagonal border of two zones for visual purposes.
Horizontal distance from the cube's sides towards the centrepoint is the measure of connectedness and importance (CI): the closer to the centre a bubble sits, the more connected and important it is; towards the edges, less connected and less important. Note that a node may have thousands of flows per second that are of no major importance to the business, or a host may have few flows of major importance.
Risk is calculated as the inverse of the distance between the node and the apex of the cube / cylinder, scaled to a value between 1 and 100.
Simply put: the closer to the centre of the cube and the higher up, the greater the risk.
Multiple views may be taken and filtered upon, including a node or 'bubble' in the Payload plane that has relationships with, or reliance upon, nodes in the Control plane. This can be easily addressed via flows and / or SNMP with normal topological data, would correlate risk very quickly, and would give good visual interpretations thereof. Note also that most topological data is visualised as graphs, but as this is a risk map, two 'bubbles' [type (b) nodes or groups] may sit next to each other, even touch, yet have no direct connectivity; they may only speak down to the Control plane and back up to the Payload plane ( unless they are similar segments with different risk ratings ).
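A rough matplotlib sketch of such a rendering follows; the bubble data, the trust-to-side mapping and the exact placement geometry are simplifying assumptions for illustration only.

    import matplotlib.pyplot as plt

    # Sketch: node groups as bubbles in the risk cube. Horizontal position
    # approaches the centre as CI grows; z is the plane band plus VT height.
    bubbles = [
        # ( name, CI, VT, plane, trust )
        ("core-routers", 0.9, 0.3, "control", "trusted"),
        ("border-fw",    0.6, 0.7, "control", "un-trusted"),
        ("billing-db",   0.8, 0.5, "payload", "trusted"),
        ("web-dmz",      0.4, 0.9, "payload", "semi-trusted"),
    ]
    SIDE_X = {"trusted": 0.0, "semi-trusted": 0.5, "un-trusted": 1.0}

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    for name, ci, vt, plane, trust in bubbles:
        x = 0.5 + (SIDE_X[trust] - 0.5) * (1.0 - ci)         # centre = important
        z = (0.0 if plane == "control" else 0.5) + vt / 2.0  # two 0.5-high bands
        ax.scatter(x, 0.5, z, s=300 * ci, alpha=0.5)
        ax.text(x, 0.5, z, name)
    ax.set(xlim=(0, 1), ylim=(0, 1), zlim=(0, 1), zlabel="Control | Payload")
    plt.show()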
Risk
====
Risk = 1 ÷ √ ( ( 1 - VT )² + ( 1 - CI )² )

( i.e. the inverse of the straight-line distance from the point ( CI, VT ) to the apex at CI = 1, VT = 1, so that high vulnerability and high connectedness / importance together yield the greatest risk, consistent with the visualisation above. )
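As a sketch; the small epsilon keeps the division defined at the apex itself ( where it yields the maximum score of 100 ) and is an addition of this sketch, not part of the formula.

    import math

    def risk(vt: float, ci: float, eps: float = 0.01) -> float:
        """Risk = inverse distance from ( CI, VT ) to the apex at ( 1, 1 ).
        eps is a sketch-only guard so the apex scores exactly 1 / eps = 100;
        pinning the floor of the scale at 1 would need a further rescale."""
        distance = math.sqrt((1.0 - vt) ** 2 + (1.0 - ci) ** 2)
        return 1.0 / (distance + eps)

    print(round(risk(0.9, 0.8), 1))  # highly vulnerable and connected -> ~4.3
    print(round(risk(0.1, 0.2), 1))  # low vulnerability / importance  -> ~0.8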