
Table of Contents

IEEE Press

Title page

Copyright page

Figures

Tables and Equations

Tables

Equations

1: Introduction

1.1 Approach

1.2 Target Audience

1.3 Organization

Acknowledgments

I: Context

2: Application Service Quality

2.1 Simple Application Model

2.2 Service Boundaries

2.3 Key Quality and Performance Indicators

2.4 Key Application Characteristics

2.5 Application Service Quality Metrics

2.6 Technical Service versus Support Service

2.7 Security Considerations

3: Cloud Model

3.1 Roles in Cloud Computing

3.2 Cloud Service Models

3.3 Cloud Essential Characteristics

3.4 Simplified Cloud Architecture

3.5 Elasticity Measurements

3.6 Regions and Zones

3.7 Cloud Awareness

4: Virtualized Infrastructure Impairments

4.1 Service Latency, Virtualization, and the Cloud

4.2 VM Failure

4.3 Nondelivery of Configured VM Capacity

4.4 Delivery of Degraded VM Capacity

4.5 Tail Latency

4.6 Clock Event Jitter

4.7 Clock Drift

4.8 Failed or Slow Allocation and Startup of VM Instance

4.9 Outlook for Virtualized Infrastructure Impairments

II: Analysis

5: Application Redundancy and Cloud Computing

5.1 Failures, Availability, and Simplex Architectures

5.2 Improving Software Repair Times via Virtualization

5.3 Improving Infrastructure Repair Times via Virtualization

5.4 Redundancy and Recoverability

5.5 Sequential Redundancy and Concurrent Redundancy

5.6 Application Service Impact of Virtualization Impairments

5.7 Data Redundancy

5.8 Discussion

6: Load Distribution and Balancing

6.1 Load Distribution Mechanisms

6.2 Load Distribution Strategies

6.3 Proxy Load Balancers

6.4 Nonproxy Load Distribution

6.5 Hierarchy of Load Distribution

6.6 Cloud-Based Load Balancing Challenges

6.7 The Role of Load Balancing in Support of Redundancy

6.8 Load Balancing and Availability Zones

6.9 Workload Service Measurements

6.10 Operational Considerations

6.11 Load Balancing and Application Service Quality

7: Failure Containment

7.1 Failure Containment

7.2 Points of Failure

7.3 Extreme Solution Coresidency

7.4 Multitenancy and Solution Containers

8: Capacity Management

8.1 Workload Variations

8.2 Traditional Capacity Management

8.3 Traditional Overload Control

8.4 Capacity Management and Virtualization

8.5 Capacity Management in Cloud

8.6 Storage Elasticity Considerations

8.7 Elasticity and Overload

8.8 Operational Considerations

8.9 Workload Whipsaw

8.10 General Elasticity Risks

8.11 Elasticity Failure Scenarios

9: Release Management

9.1 Terminology

9.2 Traditional Software Upgrade Strategies

9.3 Cloud-Enabled Software Upgrade Strategies

9.4 Data Management

9.5 Role of Service Orchestration in Software Upgrade

9.6 Conclusion

10: End-to-End Considerations

10.1 End-to-End Service Context

10.2 Three-Layer End-to-End Service Model

10.3 Distributed and Centralized Cloud Data Centers

10.4 Multitiered Solution Architectures

10.5 Disaster Recovery and Geographic Redundancy

III: Recommendations

11: Accountabilities for Service Quality

11.1 Traditional Accountability

11.2 The Cloud Service Delivery Path

11.3 Cloud Accountability

11.4 Accountability Case Studies

11.5 Service Quality Gap Model

11.6 Service Level Agreements

12: Service Availability Measurement

12.1 Parsimonious Service Measurements

12.2 Traditional Service Availability Measurement

12.3 Evolving Service Availability Measurements

12.4 Evolving Hardware Reliability Measurement

12.5 Evolving Elasticity Service Availability Measurements

12.6 Evolving Release Management Service Availability Measurement

12.7 Service Measurement Outlook

13: Application Service Quality Requirements

13.1 Service Availability Requirements

13.2 Service Latency Requirements

13.3 Service Reliability Requirements

13.4 Service Accessibility Requirements

13.5 Service Retainability Requirements

13.6 Service Throughput Requirements

13.7 Timestamp Accuracy Requirements

13.8 Elasticity Requirements

13.9 Release Management Requirements

13.10 Disaster Recovery Requirements

14: Virtualized Infrastructure Measurement and Management

14.1 Business Context for Infrastructure Service Quality Measurements

14.2 Cloud Consumer Measurement Options

14.3 Impairment Measurement Strategies

14.4 Managing Virtualized Infrastructure Impairments

15: Analysis of Cloud-Based Applications

15.1 Reliability Block Diagrams and Side-by-Side Analysis

15.2 IaaS Impairment Effects Analysis

15.3 PaaS Failure Effects Analysis

15.4 Workload Distribution Analysis

15.5 Anti-Affinity Analysis

15.6 Elasticity Analysis

15.7 Release Management Impact Effects Analysis

15.8 Recovery Point Objective Analysis

15.9 Recovery Time Objective Analysis

16: Testing Considerations

16.1 Context for Testing

16.2 Test Strategy

16.3 Simulating Infrastructure Impairments

16.4 Test Planning

17: Connecting the Dots

17.1 The Application Service Quality Challenge

17.2 Redundancy and Robustness

17.3 Design for Scalability

17.4 Design for Extensibility

17.5 Design for Failure

17.6 Planning Considerations

17.7 Evolving Traditional Applications

17.8 Concluding Remarks

Abbreviations

References

About the Authors

Index


Figures

Figure 1.1. Sample Cloud-Based Application.
Figure 2.0. Organization of Part I: Context.
Figure 2.1. Simple Cloud-Based Application.
Figure 2.2. Simple Virtual Machine Service Model.
Figure 2.3. Application Service Boundaries.
Figure 2.4. KQIs and KPIs.
Figure 2.5. Application Consumer and Resource Facing Service Indicators.
Figure 2.6. Application Robustness.
Figure 2.7. Sample Application Robustness Scenario.
Figure 2.8. Interactivity Timeline.
Figure 2.9. Service Latency.
Figure 2.10. Small Sample Service Latency Distribution.
Figure 2.11. Sample Typical Latency Variation by Workload Density.
Figure 2.12. Sample Tail Latency Variation by Workload Density.
Figure 2.13. Understanding Complementary Cumulative Distribution Plots.
Figure 2.14. Service Latency Optimization Options.
Figure 3.1. Cloud Roles for Simple Application.
Figure 3.2. Elastic Growth Strategies.
Figure 3.3. Simple Model of Cloud Infrastructure.
Figure 3.4. Abstract Virtual Machine Server.
Figure 3.5. Provisioning Interval (TGrow).
Figure 3.6. Release Interval TShrink.
Figure 3.7. VM Scale In and Scale Out.
Figure 3.8. Horizontal Elasticity.
Figure 3.9. Scale Up and Scale Down of a VM Instance.
Figure 3.10. Idealized (Linear) Capacity Agility.
Figure 3.11. Slew Rate of Square Wave Amplification.
Figure 3.12. Elastic Growth Slew Rate and Linearity.
Figure 3.13. Regions and Availability Zones.
Figure 4.1. Virtualized Infrastructure Impairments Experienced by Cloud-Based Applications.
Figure 4.2. Transaction Latency for Riak Benchmark.
Figure 4.3. VM Failure Impairment Example.
Figure 4.4. Simplified Nondelivery of VM Capacity Model.
Figure 4.5. Characterizing Virtual Machine Nondelivery.
Figure 4.6. Nondelivery Impairment Example.
Figure 4.7. Simple Virtual Machine Degraded Delivery Model.
Figure 4.8. Degraded Resource Capacity Model.
Figure 4.9. Degraded Delivery Impairment Example.
Figure 4.10. CCDF for Riak Read Benchmark for Three Different Hosting Configurations.
Figure 4.11. Tail Latency Impairment Example.
Figure 4.12. Sample CCDF for Virtualized Clock Event Jitter.
Figure 4.13. Clock Event Jitter Impairment Example.
Figure 4.14. Clock Drift Impairment Example.
Figure 5.1. Simplex Distributed System.
Figure 5.2. Simplex Service Availability.
Figure 5.3. Sensitivity of Service Availability to MTRS (Log Scale).
Figure 5.4. Traditional versus Virtualized Software Repair Times.
Figure 5.5. Traditional Hardware Repair versus Virtualized Infrastructure Restoration Times.
Figure 5.6. Simplified VM Repair Logic.
Figure 5.7. Sample Automated Virtual Machine Repair-as-a-Service Logic.
Figure 5.8. Simple Redundancy Model.
Figure 5.9. Simplified High Availability Strategy.
Figure 5.10. Failure in a Traditional (Sequential) Redundant Architecture.
Figure 5.11. Sequential Redundancy Model.
Figure 5.12. Sequential Redundant Architecture Timeline with No Failures.
Figure 5.13. Sample Redundant Architecture Timeline with Implicit Failure.
Figure 5.14. Sample Redundant Architecture Timeline with Explicit Failure.
Figure 5.15. Recovery Times for Traditional Redundancy Architectures.
Figure 5.16. Concurrent Redundancy Processing Model.
Figure 5.17. Client Controlled Redundant Compute Strategy.
Figure 5.18. Client Controlled Redundant Operations.
Figure 5.19. Concurrent Redundancy Timeline with Fast but Erroneous Return.
Figure 5.20. Hybrid Concurrent with Slow Response.
Figure 5.21. Application Service Impact for Very Brief Nondelivery Events.
Figure 5.22. Application Service Impact for Brief Nondelivery Events.
Figure 5.23. Nondelivery Impact to Redundant Compute Architectures.
Figure 5.24. Nondelivery Impact to Hybrid Concurrent Architectures.
Figure 6.1. Proxy Load Balancer.
Figure 6.2. Proxy Load Balancing.
Figure 6.3. Load Balancing between Regions and Availability Zones.
Figure 7.1. Reliability Block Diagram of Simplex Sample System (with SPOF).
Figure 7.2. Reliability Block Diagram of Redundant Sample System (without SPOF).
Figure 7.3. No SPOF Distribution of Component Instances across Virtual Servers.
Figure 7.4. Example of No Single Point of Failure with Distributed Component Instances.
Figure 7.5. Example of Single Point of Failure with Poorly Distributed Component Instances.
Figure 7.6. Simplified VM Server Control.
Figure 8.1. Sample Daily Workload Variation (Logarithmic Scale).
Figure 8.2. Traditional Maintenance Window.
Figure 8.3. Traditional Congestion Control.
Figure 8.4. Simplified Elastic Growth of Cloud-Based Applications.
Figure 8.5. Simplified Elastic Degrowth of Cloud-Based Applications.
Figure 8.6. Sample of Erratic Workload Variation (Linear Scale).
Figure 8.7. Typical Elasticity Orchestration Process.
Figure 8.8. Example of Workload Whipsaw.
Figure 8.9. Elastic Growth Failure Scenarios.
Figure 9.1. Traditional Offline Software Upgrade.
Figure 9.2. Traditional Online Software Upgrade.
Figure 9.3. Type I, “Block Party” Upgrade Strategy.
Figure 9.4. Application Elastic Growth and Type I, “Block Party” Upgrade.
Figure 9.5. Type II, “One Driver per Bus” Upgrade Strategy.
Figure 10.1. Simple End-to-End Application Service Context.
Figure 10.2. Service Boundaries in End-to-End Application Service Context.
Figure 10.3. Measurement Points 0–4 for Simple End-to-End Context.
Figure 10.4. End-to-End Measurement Points for Simple Replicated Solution Context.
Figure 10.5. Service Probes across User Service Delivery Path.
Figure 10.6. Three Layer Factorization of Sample End to End Solution.
Figure 10.7. Estimating Service Impairments across the Three-Layer Model.
Figure 10.8. Decomposing a Service Impairment.
Figure 10.9. Centralized Cloud Data Center Scenario.
Figure 10.10. Distributed Cloud Data Center Scenario.
Figure 10.11. Sample Multitier Solution Architecture.
Figure 10.12. Disaster Recovery Time and Point Objectives.
Figure 10.13. Service Impairment Model of Georedundancy.
Figure 11.1. Traditional Three-Way Accountability Split: Suppliers, Customers, External.
Figure 11.2. Example Cloud Service Delivery Chain.
Figure 11.3. Service Boundaries across Cloud Delivery Chain.
Figure 11.4. Functional Responsibilities for Applications Deployed on IaaS.
Figure 11.5. Sample Application.
Figure 11.6. Service Outage Accountability of Sample Application.
Figure 11.7. Application Elasticity Configuration.
Figure 11.8. Service Gap Model.
Figure 11.9. Service Quality Zone of Tolerance.
Figure 11.10. Application's Resource Facing Service Boundary.
Figure 11.11. Application's Customer Facing Service Boundary.
Figure 12.1. Traditional Service Operation Timeline.
Figure 12.2. Sample Application Deployment on Cloud.
Figure 12.3. “Network Element” Boundary for Sample Application.
Figure 12.4. Logical Measurement Point for Application's Service Availability.
Figure 12.5. Reliability Block Diagram of Sample Application (Traditional Deployment).
Figure 12.6. Evolving Sample Application to Cloud.
Figure 12.7. Reliability Block Diagram of Sample Application on Cloud.
Figure 12.8. Side-by-Side Reliability Block Diagrams.
Figure 12.9. Accountability of Sample Cloud Based Application.
Figure 12.10. Connectivity-as-a-Service as a Nanoscale VPN.
Figure 12.11. Sample Application with Database-as-a-Service.
Figure 12.12. Accountability of Sample Application with Database-as-a-Service.
Figure 12.13. Sample Application with Outboard RAID Storage Array.
Figure 12.14. Sample Application with Storage-as-a-Service.
Figure 12.15. Accountability of Sample Application with Storage-as-a-Service.
Figure 12.16. Virtual Machine Failure Lifecycle.
Figure 12.17. Elastic Capacity Growth Timeline.
Figure 12.18. Outage Normalization for Type I “Block Party” Release Management.
Figure 12.19. Outage Normalization for Type II “One Driver per Bus” Release Management.
Figure 13.1. Maximum Acceptable Service Disruption.
Figure 14.1. Infrastructure Impairments and Application Impairments.
Figure 14.2. Loopback and Service Latency.
Figure 14.3. Simplified Measurement Architecture.
Figure 15.1. Sample Side-by-Side Reliability Block Diagrams.
Figure 15.2. Worst-Case Recovery Point Scenario.
Figure 15.3. Best-Case Recovery Point Scenario.
Figure 16.1. Measuring Service Disruption Latency.
Figure 16.2. Service Disruption Latency for Implicit Failure.
Figure 16.3. Sample Endurance Test Case for Cloud-Based Application.
Figure 17.1. Virtualized Infrastructure Impairments Experienced by Cloud-Based Applications.
Figure 17.2. Application Robustness Challenge.
Figure 17.3. Sequential (Traditional) Redundancy.
Figure 17.4. Concurrent Redundancy.
Figure 17.5. Hybrid Concurrent with Slow Response.
Figure 17.6. Type I, “Block Party” Upgrade Strategy.
Figure 17.7. Sample Phased Evolution of a Traditional Application.

Tables and Equations

Tables

TABLE 2.1. Mean Opinion Scores [P.800]
TABLE 13.1. Service Availability and Downtime Ratings

Equations

Equation 2.1. Availability Formula
Equation 5.1. Simplex Availability
Equation 5.2. Traditional Availability
Equation 10.1. Estimating General End-to-End Service Impairments
Equation 10.2. Estimating End-to-End Service Downtime
Equation 10.3. Estimating End-to-End Service Availability
Equation 10.4. Estimating End-to-End Typical Service Latency
Equation 10.5. Estimating End-to-End Service Defect Rate
Equation 10.6. Estimating End-to-End Service Accessibility
Equation 10.7. Estimating End-to-End Service Retainability (as DPM)
Equation 13.1. DPM via Operations Attempted and Operations Successful
Equation 13.2. DPM via Operations Attempted and Operations Failed
Equation 13.3. DPM via Operations Successful and Operations Failed
Equation 14.1. Computing VM FITs
Equation 14.2. Converting FITs to MTBF

1

Introduction

Customers expect that applications and services deployed on cloud computing infrastructure will deliver service quality, reliability, availability, and latency comparable to deployments on traditional, native hardware configurations. Cloud computing infrastructure introduces a new family of service impairment risks based on the virtualized compute, memory, storage, and networking resources that an Infrastructure-as-a-Service (IaaS) provider delivers to hosted application instances. As a result, application developers and cloud consumers must mitigate these impairments to assure that the application service delivered to end users is not unacceptably impacted. This book methodically analyzes the impacts of cloud infrastructure impairments on application service delivered to end users, as well as the opportunities for improvement afforded by cloud. The book also recommends architectures, policies, and other techniques to maximize the likelihood of delivering comparable or better service to end users when applications are deployed to cloud.

1.1 Approach

Cloud-based application software executes within a set of virtual machine instances, and each virtual machine instance relies on virtualized compute, memory, storage, and networking services delivered by the underlying cloud infrastructure. As shown in Figure 1.1, the application presents customer facing service toward end users across the dotted service boundary and consumes virtualized resources offered by the Infrastructure-as-a-Service provider across the dashed resource facing service boundary. The service quality experienced by end users is primarily a function of the application's architecture and software quality, the service quality of the virtualized infrastructure offered by the IaaS provider across the resource facing service boundary, and the access and wide area networking that connects the end user to the application instance. This book considers both the new impairments and the opportunities of virtualized resources offered to applications deployed on cloud, and how the service quality experienced by end users can be maximized. By setting aside service impairments of the end user's device and of the access and wide area networks, one can focus narrowly on how application service quality differs when a particular application is hosted on cloud infrastructure rather than deployed natively on traditional hardware.

Figure 1.1. Sample Cloud-Based Application.

c1-fig-0001

The key technical difference for application software between native deployment and cloud deployment is that native deployments offer the application's (guest) operating system direct access to the physical compute, memory, storage, and network resources, while cloud deployment inserts a layer of hypervisor or virtual machine management software between the guest operating system and the physical hardware. This layer of hypervisor or virtual machine management software enables sophisticated resource sharing, technical features, and operational policies. However, the hypervisor or virtual machine management layer does not deliver perfect hardware emulation to the guest operating system and application software, and these imperfections can adversely impact application service delivered to end users. While Figure 1.1 illustrates application deployment to a single data center, real-world applications are often deployed to multiple data centers to improve user service quality by shortening transport latency to end users, to support business continuity and disaster recovery, and for other business reasons. Application service quality for deployment across multiple data centers is also considered in this book.

This book considers how application architectures, configurations, validation, and operational policies should evolve so that acceptable application service quality can be delivered to end users even when application software is deployed on cloud infrastructure. This book approaches application service quality from the end users' perspective while considering standards and recommendations from NIST, TM Forum, QuEST Forum, ODCA, ISO, ITIL, and so on.

1.2 Target Audience

This book provides application architects, developers, and testers with guidance on architecting and engineering applications that meet their customers' and end users' service reliability, availability, quality, and latency expectations. Product managers, program managers, and project managers will also gain deeper insights into the service quality risks and mitigations that must be addressed to assure that an application deployed onto cloud infrastructure consistently meets or exceeds customers' expectations for user service quality.

1.3 Organization

The work is organized into three parts: context, analysis, and recommendations. Part I: Context frames the context for the service quality of cloud-based applications via the following chapters: Chapter 2, "Application Service Quality"; Chapter 3, "Cloud Model"; and Chapter 4, "Virtualized Infrastructure Impairments."

Part II: Analysis methodically considers how the application service quality defined in Chapter 2, "Application Service Quality," is impacted by the infrastructure impairments enumerated in Chapter 4, "Virtualized Infrastructure Impairments," across the following topics: application redundancy and cloud computing (Chapter 5); load distribution and balancing (Chapter 6); failure containment (Chapter 7); capacity management (Chapter 8); release management (Chapter 9); and end-to-end considerations (Chapter 10).

Part III: Recommendations covers the following: accountabilities for service quality (Chapter 11); service availability measurement (Chapter 12); application service quality requirements (Chapter 13); virtualized infrastructure measurement and management (Chapter 14); analysis of cloud-based applications (Chapter 15); testing considerations (Chapter 16); and connecting the dots (Chapter 17).

As many readers are likely to study sections based on the technical needs of their business and their professional interests rather than strictly following this work's running order, cross-references are included throughout the work so readers can, for example, dive directly into the detailed analysis sections of Part II, follow cross-references back into Part I for basic definitions, and follow references forward to Part III for recommendations. A detailed index is included to help readers quickly locate material.

Acknowledgments

The authors acknowledge the consistent support of Dan Johnson, Annie Lequesne, Sam Samuel, and Lawrence Cowsar that enabled us to complete this work. Expert technical feedback was provided by Mark Clougherty, Roger Maitland, Rich Sohn, John Haller, Dan Eustace, Geeta Chauhan, Karsten Oberle, Kristof Boeynaems, Tony Imperato, and Chuck Salisbury. Data and practical insights were shared by Karen Woest, Srujal Shah, Pete Fales, and many others. Bob Brownlie offered keen insights into service measurements and accountabilities. Expert review and insight on release management for virtualized applications was provided by Bruce Collier. The work benefited greatly from insightful review feedback from Mark Cameron. Iraj Saniee, Katherine Guo, Indra Widjaja, Davide Cherubini, and Karsten Oberle offered keen and substantial insights. The authors gratefully acknowledge the external reviewers who took time to provide thorough review and thoughtful feedback that materially improved this book: Tim Coote, Steve Woodward, Herbert Ristock, Kim Tracy, and Xuemei Zhang.

The authors welcome feedback on this book; readers may e-mail us at Eric.Bauer@alcatel-lucent.com and Randee.Adams@alcatel-lucent.com.

I

Context

Figure 2.0 frames the context of this book: cloud-based applications rely on virtualized compute, memory, storage, and networking resources to provide information services to end users via access and wide area networks. The application's primary quality focus is on the user service delivered across the application's customer facing service boundary (dotted line in Figure 2.0).

Figure 2.0. Organization of Part I: Context.

p1-fig-0001

2

Application Service Quality

This chapter considers the service offered by applications to end users and the metrics used to characterize the quality of that service. A handful of common metrics that characterize application service quality are detailed. These user service key quality indicators (KQIs) are considered in depth in Part II: Analysis.

2.1 Simple Application Model

Figure 2.1 illustrates a simple cloud-based application with a pool of frontend components distributing work across a pool of backend components. The suite of frontend and backend components is managed by a pair of control components that provide management visibility and control for the entire application instance. Each of the application's components, along with its supporting guest operating system, executes in a distinct virtual machine instance served by the cloud service provider. The Distributed Management Task Force (DMTF) defines a virtual machine as:

the complete environment that supports the execution of guest software. A virtual machine is a full encapsulation of the virtual hardware, virtual disks, and the metadata associated with it. Virtual machines allow multiplexing of the underlying physical machine through a software layer called a hypervisor. [DSP0243]

Figure 2.1. Simple Cloud-Based Application.

c2-fig-0001

For simplicity, this model ignores systems that directly support the application, such as security appliances that protect the application from external attack, domain name servers, and so on.

Figure 2.2 shows a single application component deployed in a virtual machine on cloud infrastructure. The application software and its underlying operating system (referred to as a guest OS) run within a virtual machine instance that emulates a dedicated physical server. The cloud service provider's infrastructure delivers virtualized compute, memory, storage, and networking resource services to the application's guest OS instance.

Figure 2.2. Simple Virtual Machine Service Model.

c2-fig-0002
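To make this resource facing view concrete, the following minimal sketch (an illustration only, not part of the reference model; it assumes a Linux guest with the third-party psutil package installed) reports the virtualized compute, memory, storage, and networking resources that an application component actually observes from inside its guest OS:

```python
# Minimal sketch: report the virtualized resources visible to a guest OS.
# Assumes the third-party "psutil" package is installed; the values reflect
# what the hypervisor presents to the VM, not the underlying physical server.
import psutil

def report_guest_resources():
    vcpus = psutil.cpu_count(logical=True)   # virtual CPUs presented to the guest
    mem = psutil.virtual_memory()            # virtual memory presented to the guest
    disk = psutil.disk_usage("/")            # virtual disk backing the root filesystem
    nics = psutil.net_if_addrs()             # virtual network interfaces

    print(f"vCPUs:  {vcpus}")
    print(f"Memory: {mem.total / 2**30:.1f} GiB total, {mem.available / 2**30:.1f} GiB available")
    print(f"Disk:   {disk.total / 2**30:.1f} GiB total on /")
    print(f"NICs:   {', '.join(nics.keys())}")

if __name__ == "__main__":
    report_guest_resources()
```

Whatever the guest observes through interfaces like these reflects the resources the hypervisor presents, which, as later chapters discuss, may differ from what is actually delivered at any given instant.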

2.2 Service Boundaries

It is useful to define boundaries that demarcate applications and service offerings to better understand the dependencies, interactions, roles, and responsibilities of each element in overall user service delivery. This work focuses on the two high-level application service boundaries shown in Figure 2.3: the customer facing service boundary, across which the application delivers service to end users, and the resource facing service boundary, across which the application consumes virtualized compute, memory, storage, and networking services from the cloud infrastructure.

Figure 2.3. Application Service Boundaries.

c2-fig-0003

Note that customer facing service and resource facing service boundaries are relative to a particular entity in the service delivery chain. Figure 2.3, and this book, consider these concepts from the perspective of a cloud-based application, but the same service boundary notions can be applied to an element of cloud Infrastructure-as-a-Service or to a technology component offered "as-a-Service," such as Database-as-a-Service.

2.3 Key Quality and Performance Indicators

Qualities such as latency and reliability of service delivered across a service boundary can be quantitatively measured. Technically useful service measurements are generally referred to as key performance indicators (KPIs). As shown in Figure 2.4, a subset of KPIs across the customer facing service boundary characterizes key aspects of the customer's experience and perception of quality; these are often referred to as key quality indicators (KQIs) [TMF_TR197]. Enterprises routinely track and manage these KQIs to assure that customers are delighted. Well-run enterprises will often tie staff bonus payments to achieving quantitative KQI targets, thereby aligning the financial interests of enterprise staff with the business need of delivering excellent service to customers.

Figure 2.4. KQIs and KPIs.

c2-fig-0004
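As a concrete illustration of rolling raw KPIs up into a KQI (a sketch anticipating the defects-per-million formulation used later in Equation 13.1), counts of attempted and successful operations measured at the customer facing service boundary can be expressed as a service reliability KQI in defects per million operations (DPM):

$$\mathrm{DPM} = 1{,}000{,}000 \times \frac{\mathrm{Operations_{Attempted}} - \mathrm{Operations_{Successful}}}{\mathrm{Operations_{Attempted}}}$$

For example, 9,998,000 successful operations out of 10,000,000 attempted corresponds to 200 DPM.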

In the context of applications, KQIs often cover high-level business considerations, including service qualities that impact user satisfaction and churn, such as:

Different applications with different business models will define KPIs somewhat differently and will select different KQIs from their suite of application KPIs.

A primary resource facing service risk experienced by cloud-based applications is the quality of the virtualized compute, memory, storage, and networking delivered by the cloud service provider to application components executing in virtual machine (VM) instances. Chapter 4, "Virtualized Infrastructure Impairments," considers the following impairments: VM failure; nondelivery of configured VM capacity; delivery of degraded VM capacity; tail latency; clock event jitter; clock drift; and failed or slow allocation and startup of VM instances.

Figure 2.5 overlays common customer facing service KQIs with typical resource facing service KPIs on the simple application of Section 2.1.

Figure 2.5. Application Consumer and Resource Facing Service Indicators.

c2-fig-0005

As shown in Figure 2.6, the robustness of an application's architecture characterizes how effectively the application can maintain quality across the application's customer facing service boundary despite impairments experienced across the resource facing service boundary and failures within the application itself.

Figure 2.6. Application Robustness.

c2-fig-0006

Figure 2.7 illustrates a concrete robustness example: if the cloud infrastructure stalls a VM that is hosting one of the application backend instances for hundreds of milliseconds (see Section 4.3, “Nondelivery of Configured VM Capacity”), then is the application's customer facing service impacted? Do some or all user operations take hundreds of milliseconds longer to complete, or do some (or all) operations fail due to timeout expiration? A robust application will mask the customer facing service impact of this service impairment so end users do not experience unacceptable service quality.

Figure 2.7. Sample Application Robustness Scenario.

c2-fig-0007
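As a minimal sketch of one common masking technique for the scenario above (illustrative only; the component names, the 0.5-second per-attempt timeout, and the redundant-backend strategy are assumptions, not a prescription from this book), a frontend can bound its wait on any single backend and retry the operation against an alternate backend instance so that a briefly stalled VM does not surface as a failed user operation:

```python
# Minimal sketch: bound the wait on any one backend and retry elsewhere so a
# briefly stalled VM (nondelivery of configured VM capacity) is masked from the
# end user. Timeout value, backend addresses, and payload format are illustrative.
import socket

BACKENDS = [("backend-1.example.net", 8080), ("backend-2.example.net", 8080)]
PER_TRY_TIMEOUT_SECONDS = 0.5   # shorter than the user's overall patience budget

def send_request(address, payload, timeout):
    """Send one request to a backend and return its response, or raise on timeout."""
    with socket.create_connection(address, timeout=timeout) as conn:
        conn.settimeout(timeout)
        conn.sendall(payload)
        return conn.recv(65536)

def robust_request(payload):
    """Try each backend in turn; fail the user operation only if all attempts fail."""
    last_error = None
    for address in BACKENDS:
        try:
            return send_request(address, payload, PER_TRY_TIMEOUT_SECONDS)
        except OSError as exc:
            last_error = exc   # stalled or failed backend: fall through and retry
    raise RuntimeError("all backend instances failed or timed out") from last_error
```

Chapter 5 analyzes such sequential and concurrent redundancy strategies, and their service quality implications, in detail.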

2.4 Key Application Characteristics

Customer facing service quality expectations are fundamentally driven by application characteristics such as service criticality (Section 2.4.1), application interactivity (Section 2.4.2), and tolerance to network traffic impairments (Section 2.4.3).

These characteristics influence both the quantitative targets for an application's service quality (e.g., critical applications have higher service availability expectations) and the specifics of those service quality measurements (e.g., maximum tolerable service downtime influences the minimum chargeable outage downtime threshold).

2.4.1 Service Criticality

Readers will recognize that different information services entail different levels of criticality to users and the enterprise. While these ratings will vary somewhat based on organizational needs and customer expectations, the criticality classification definitions from the U.S. Federal Aviation Administration's National Airspace System's reliability handbook are fairly typical:

There is also a "Safety Critical" category, with a service availability rating of seven 9s, for life-threatening risks and services where "loss would present an unacceptable safety hazard during the transition to reduced capacity operations" [FAA-HDBK-006A]. Few commercial enterprises offer services or applications that are safety critical, so seven 9s expectations are rare.
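For orientation, an availability rating expressed in 9s maps to maximum annual service downtime via the simple relationship (using 525,960 minutes in an average year; Table 13.1 tabulates availability and downtime ratings):

$$\text{Annual downtime} = (1 - \text{Availability}) \times 525{,}960\ \text{minutes}$$

For example, five 9s (99.999%) allows roughly 5.3 minutes of accumulated downtime per year, while seven 9s (99.99999%) allows only about 3.2 seconds per year.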

The higher the service criticality, the more the enterprise is willing to invest in architectures, policies, and procedures to assure that acceptable service quality is continuously available to users.

2.4.2 Application Interactivity

As shown in Figure 2.8, there are three broad classifications of application service interactivity:

Figure 2.8. Interactivity Timeline.

c2-fig-0008

2.4.3 Tolerance to Network Traffic Impairments

Data networks are subject to three fundamental types of service impairments: packet loss, packet delay (latency), and packet delay variation (jitter).

[RFC4594] characterizes tolerance to packet loss, delay, and jitter for common classes of applications.
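As a simplified, illustrative sketch of how these three impairment types can be quantified from per-packet measurements (real tooling, such as the smoothed interarrival jitter estimator of RFC 3550, is more elaborate):

```python
# Simplified sketch: derive packet loss, mean one-way delay, and jitter from a
# list of per-packet one-way delays in milliseconds, where None marks a lost packet.
def summarize_impairments(delays_ms):
    delivered = [d for d in delays_ms if d is not None]
    loss_pct = 100.0 * (len(delays_ms) - len(delivered)) / len(delays_ms)
    mean_delay = sum(delivered) / len(delivered) if delivered else float("nan")
    # Jitter approximated here as the mean absolute difference between the
    # delays of consecutive delivered packets.
    diffs = [abs(b - a) for a, b in zip(delivered, delivered[1:])]
    jitter = sum(diffs) / len(diffs) if diffs else 0.0
    return loss_pct, mean_delay, jitter

# Example: one lost packet out of five, delays in milliseconds.
print(summarize_impairments([20.1, 22.4, None, 21.0, 35.7]))
```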