Distributed Network Management Laboratory

Prof. Kaliappa Ravindran

Location: NAC.7/115

The main research direction is the distributed management and control of complex networked systems. Specific research areas are adaptive fault-tolerance & QoS, autonomic network resource management, declarative networks, QoS auditing in clouds, and cyber-physical network systems. A common thread in these activities is the design and management of trustable computer network systems. Specific research activities are listed below.

  1. Network Certification
  2. QoS Auditing in Cloud-based Distributed Services
  3. Software Cybernetics for Networked Embedded Systems
  4. Test-bed Facilities

1. Network certification

When designing dependable network systems, it is necessary to certify that a networked system S behaves the way it is supposed to. A certification process involves verifying how well S behaves relative to its stated objectives and assigning S a score accordingly. For instance, S may obtain a numerical score of 0.9 (on a scale of [0,1]) under the currently prevalent environment conditions, but a score of 0.8 under more hostile conditions. For an end-to-end connectivity service over a data network, a video application may be interested in how stable the sustained data transfer rate is in the presence of fluctuating packet loss. The candidate network system S as a whole is subjected to stress tests, whereupon an external management entity reasons about the ability of S to fight through the stressful conditions incident on it. An analogy to the network certification problem is assigning a course grade to a student: the grade tells how good the student is relative to the declared expectations (the exams are a means to assess the student's performance under particular criteria). We develop policy- and rule-based network management techniques to assess the para-functional attributes of the behavior of S, such as resilience, stability, and responsiveness. The certification mechanisms are based on machine intelligence tools (such as partially observable Markov decision processes, PO-MDP) and probabilistic reasoning methods. These mechanisms, however, do not require domain-specific knowledge.
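
As a minimal sketch of the scoring step, assume the stress test yields a series of sustained data-rate samples for S, and that the stated objective is a target rate with a tolerance band. The score below is simply the fraction of samples that meet the objective. This empirical ratio is a stand-in chosen for illustration, not the PO-MDP-based reasoning described above, and the function name, target, tolerance, and sample values are all hypothetical.

```python
# Hypothetical certification score for a system S: the fraction of stress-test
# samples in which S meets its stated objective. This is an illustrative
# stand-in, not the lab's PO-MDP-based certification method.


def certification_score(rates_mbps, target_mbps, tolerance=0.1):
    """Return a score in [0, 1]: the share of rate samples that stay within
    `tolerance` (a fraction of the target) of `target_mbps`."""
    if not rates_mbps:
        raise ValueError("at least one measurement is required")
    within = [abs(r - target_mbps) <= tolerance * target_mbps for r in rates_mbps]
    return sum(within) / len(within)


# Hypothetical samples (Mbps) under benign and more hostile packet-loss conditions.
benign = [10.2, 9.8, 9.5, 10.0, 9.1, 10.4, 9.9, 10.1, 9.7, 8.7]
hostile = [10.0, 9.2, 8.5, 9.6, 10.3, 8.8, 9.9, 9.4, 10.1, 9.3]

benign_score = certification_score(benign, target_mbps=10.0)
hostile_score = certification_score(hostile, target_mbps=10.0)
print(f"benign: {benign_score:.1f}, hostile: {hostile_score:.1f}")  # benign: 0.9, hostile: 0.8
```

The same function applied to samples from different environment conditions yields different scores for the same S, which is the kind of condition-dependent grading the certification process is meant to produce.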

Network certification is beneficial in two ways. First, it offers a means to measure the goodness of a complex network system in domain-specific metric spaces and to compare it with competing systems. For instance, a military commander deploying different network systems in a theater of operations can use the numerical scores to reason about the overall effectiveness of the combined system under various external conditions. Second, certification enables an autonomic controller to improve the workflow processes and algorithms embodied in the system so that it deals with the environment conditions better. The domain-independent nature of the certification methods developed in this research makes them applicable across diverse application domains, which lowers the software development costs of complex network systems. We are currently studying the certification methods in the domains of content distribution networks (CDN), replicated web services, and adaptive video transport.


2. QoS auditing in cloud-based distributed services

Given a cloud-based realization of a distributed system, QoS auditing enables risk analysis and accounting of SLA violations under the various security threats and resource depletions faced by the system. QoS failures and security infringements arise because the cloud resources and components used to realize the application-oriented services are under third-party control. The less-than-100% trust between the various sub-systems is a major issue that necessitates a probabilistic analysis of the application behavior relative to the SLA negotiated with the service provider. In this light, QoS auditing allows reasoning about how well the provider complies with the SLA in the face of hostile environment conditions.

An SLA (service-level agreement) refers to the contractual obligations between a cloud service provider and a service consumer. The SLA can document the QoS promised by a service provider and the para-functional requirements of service delivery to the client. In a cloud setting, SLA evaluation involves the following entities:

  1. The direct parties involved in a QoS specification and enforcement: the cloud service provider and the client;
  2. The third parties involved in QoS assessment and verification: the QoS monitor, audit provider, and certifier.

These parties are provided with a service description that specifies the QoS guaranteed to the client applications under the agreement. Domain-specific information is included to map the specification onto application-level objectives. These objectives, which define the service-level indicators, correspond to the QoS values promised by the service provider, such as response time, availability, and overhead. The SLA also prescribes a penalty in case the service provider under-performs or is unable to provide service at the promised level. Violations of different guaranteed service objectives may lead to different penalties. An SLA thus provides the needed transparency of operations between cloud-based service providers and consumers.
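
The SLA structure described above (service-level indicators with promised values, and a separate penalty per objective) can be sketched as a small data model. The class and field names, thresholds, and penalty amounts below are hypothetical illustrations, not a standard SLA schema; the sketch only shows how violations of different objectives map to different penalties within one audit period.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceObjective:
    """One hypothetical guaranteed objective in an SLA."""
    indicator: str           # service-level indicator, e.g. "response_time_ms"
    promised: float          # value promised by the provider
    higher_is_better: bool   # True for availability, False for latency/overhead
    penalty: float           # hypothetical penalty owed if this objective is violated

    def is_violated(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured < self.promised
        return measured > self.promised


def audit(objectives, measurements):
    """Return the violated indicators and the total penalty for one audit period."""
    violated = [o for o in objectives if o.is_violated(measurements[o.indicator])]
    return [o.indicator for o in violated], sum(o.penalty for o in violated)


sla = [
    ServiceObjective("response_time_ms", 200.0, False, 50.0),
    ServiceObjective("availability", 0.999, True, 200.0),
    ServiceObjective("overhead_pct", 5.0, False, 10.0),
]
measured = {"response_time_ms": 240.0, "availability": 0.9995, "overhead_pct": 6.5}
violations, owed = audit(sla, measured)
print(violations, owed)  # ['response_time_ms', 'overhead_pct'] 60.0
```

In a real audit, the measurements would come from the third-party QoS monitor rather than from a hard-coded dictionary.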

As a case study of QoS auditing methods, we work on estimating the available bandwidth on an end-to-end path set up from a client device (e.g., a smartphone) to a cloud data center over a series of routers. This estimate is then reasoned about in light of the SLA negotiated with the cloud provider at the time of path setup. In the future, we plan to employ OpenFlow switches and PlanetLab nodes as part of our experimental platform.
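
A common definition of the quantity being estimated is that the available bandwidth of a path equals the smallest spare capacity over its links, min_i C_i(1 - u_i), where C_i is the capacity and u_i the utilization of link i. The sketch below assumes that per-hop capacity and utilization are already known (for instance, from router counters). Practical estimators instead infer this value from active probing, so treat the code only as an illustration of the audit comparison against an SLA-promised rate. The hop values and promised rates are hypothetical.

```python
# Available bandwidth of an end-to-end path, assuming each hop's capacity and
# utilization are known (hypothetical values). Real estimators infer this from
# active probes instead of reading it directly.


def available_bandwidth(hops):
    """hops: list of (capacity_mbps, utilization in [0, 1]) for each link.
    The path's available bandwidth is the minimum spare capacity over its links."""
    return min(capacity * (1.0 - util) for capacity, util in hops)


def audit_path(hops, promised_mbps):
    """Return the estimate and whether it meets the SLA-promised bandwidth."""
    estimate = available_bandwidth(hops)
    return estimate, estimate >= promised_mbps


# Client (802.11 hop) -> access router -> core router -> data center.
path = [(50.0, 0.5), (100.0, 0.25), (1000.0, 0.5)]
print(audit_path(path, promised_mbps=20.0))  # (25.0, True)
print(audit_path(path, promised_mbps=30.0))  # (25.0, False)
```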

3. Software Cybernetics for networked embedded systems

Future network systems are expected to have adaptation capabilities at various levels: parametric, service, and application. These capabilities are often realized in multiple system layers, with the control logic needed for a specific capability residing in the application agents that interface with the underlying network system services. For instance, the formation of a first-responder vehicular network for an underwater rescue/repair mission may be based on the processing, communication, and sensing capabilities of various nodes, as managed by the software agents running on these nodes. We employ a software cybernetics approach, in which an intelligent physical system module (IPW) embodies the core adaptation functionality to respond to changing environment conditions and user inputs. The IPW exhibits intelligent behavior over a limited operating region of the system, as in the earlier example of the first-responder team. The IPW is augmented by a management-oriented computational module (ICW) housing supervisory feedback loops that deal with changing external environment conditions (e.g., adding more nodes to the rescue team when the terrain conditions become severe). The ICW patches the IPW with suitable control parameters and rules/procedures when the system operating conditions change. Our autonomic management of these hierarchical control loops comes under the ambit of Cyber-Physical Systems (CPS). Our focus in this research is on the software engineering aspects of designing networked embedded systems and on the construction of IPW-ICW modules in specific application domains.
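
The division of labor between the two modules can be illustrated with a toy two-level loop. In the sketch below, a hypothetical IPW runs a proportional controller that steers a sending rate toward the currently available bandwidth. A hypothetical ICW watches the IPW's tracking error and, when the error stays large for several consecutive steps, patches the IPW with a more aggressive gain (parameter switching). The controller law, gains, window, and error limit are illustrative assumptions, not the lab's design.

```python
from collections import deque


class IPW:
    """Hypothetical inner adaptation loop: proportional control of a sending
    rate toward the available bandwidth reported by the environment."""

    def __init__(self, gain, rate):
        self.gain = gain
        self.rate = rate

    def step(self, available_bw):
        error = available_bw - self.rate
        self.rate += self.gain * error
        return error


class ICW:
    """Hypothetical supervisory loop: if the IPW's tracking error exceeds
    `error_limit` for `window` consecutive steps, patch its gain."""

    def __init__(self, ipw, error_limit, window, patched_gain):
        self.ipw = ipw
        self.error_limit = error_limit
        self.patched_gain = patched_gain
        self.recent = deque(maxlen=window)
        self.patches = 0

    def observe(self, error):
        self.recent.append(abs(error))
        if len(self.recent) == self.recent.maxlen and min(self.recent) > self.error_limit:
            self.ipw.gain = self.patched_gain  # ICW patches an IPW control parameter
            self.patches += 1
            self.recent.clear()


ipw = IPW(gain=0.1, rate=10.0)
icw = ICW(ipw, error_limit=2.0, window=5, patched_gain=0.5)
# Environment: bandwidth is steady at 10 Mbps, then jumps to 30 Mbps.
trace = [10.0] * 20 + [30.0] * 40
for bw in trace:
    icw.observe(ipw.step(bw))
print(f"gain={ipw.gain}, rate={ipw.rate:.2f}, patches={icw.patches}")  # gain=0.5, rate=30.00, patches=1
```

Keeping the gain-switching policy inside the ICW leaves the IPW's control law untouched, which mirrors the separation of concerns the IPW-ICW decomposition aims for.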

Our modular decomposition of a networked embedded system into IPW and ICW has several advantages: it lowers the overall software complexity, simplifies system verification, and supports easier evolution of system features. Other application domains of our research include automotive control systems and vehicular networks. Existing complex network applications, such as bandwidth-adaptive video transport and latency-adaptive CDN server deployment, can also be structured in terms of our IPW-ICW approach. The attendant advantages arise from software and systems engineering angles, such as online model reconfiguration and parameter/algorithm switching (i.e., system morphing).

Facilities description
Distributed Network Management Test-bed
NAC.7/115, Dept. of Computer Science, CUNY - City College

Kaliappa Ravindran and his students pursue research in service-level management of distributed networks, system-level support for information assurance, distributed collaboration techniques, cyber-physical & embedded systems, and formal validation of assured distributed software systems. His research lab maintains a distributed network management test-bed that supports multi-pronged research in these areas. The test-bed is built using a range of network equipment:

  • 4 Cisco routers
  • 2 AT&T switches
  • 1 Fore ATM switch
  • 1 Spirent traffic analyzer
  • 8 Pocket PCs with IEEE 802.11 (Sharp Zaurus, HP iPAQ, Dell Axim)
  • 7 IBM ThinkPad laptops
  • 3 Sun Blade computers
  • 6 Sun UltraSPARC computers
  • 8 low-end and high-end Windows and Linux PCs
  • 802.11 wireless network cards on laptops and PCs
  • 4 T1/T4 line cards
  • Network management software: HP OpenView
  • Simulation software: OPNET, NS-2, etc.
  • Languages: Java, C++, MATLAB

The various computers are configured as a multi-hop network (maximum diameter of 6 hops) with a hybrid of low- and high-speed links. The test-bed allows simulating network attacks and resource outages in the study of traffic engineering techniques, distributed implementation of fault-tolerance algorithms, and robustness control under fuzzy network measurements. The traffic analyzer is used to inject controlled amounts of cross-traffic into network paths and to simulate denial-of-service attacks on network links. The HP OpenView software allows studying "managed QoS assurance" from networked systems (such as clouds). HP OpenView agents implement domain-specific QoS monitoring at faster time-scales (e.g., latency monitoring in a CDN), while recovery from application-level QoS failures occurs at slower time-scales.

The above test-bed is currently being augmented with Android and Windows smartphones to provide wireless communication capability and device mobility. With a subscription to PlanetLab, we plan to extend the test-bed to accommodate mobile clouds. This will allow us to pursue research on mobile-cloud SLAs and auditing. The augmented test-bed can then be employed in the study of system-level dependability measures for certification and management purposes.

Research support from external agencies

Air Force Research Lab, Naval Research Lab, General Motors.