1: Introduction
1.1 Approach
1.2 Target Audience
1.3 Organization
I: Context
2: Application Service Quality
2.1 Simple Application Model
2.2 Service Boundaries
2.3 Key Quality and Performance Indicators
2.4 Key Application Characteristics
2.5 Application Service Quality Metrics
2.6 Technical Service versus Support Service
2.7 Security Considerations
3: Cloud Model
3.1 Roles in Cloud Computing
3.2 Cloud Service Models
3.3 Cloud Essential Characteristics
3.4 Simplified Cloud Architecture
3.5 Elasticity Measurements
3.6 Regions and Zones
3.7 Cloud Awareness
4: Virtualized Infrastructure Impairments
4.1 Service Latency, Virtualization, and the Cloud
4.2 VM Failure
4.3 Nondelivery of Configured VM Capacity
4.4 Delivery of Degraded VM Capacity
4.5 Tail Latency
4.6 Clock Event Jitter
4.7 Clock Drift
4.8 Failed or Slow Allocation and Startup of VM Instance
4.9 Outlook for Virtualized Infrastructure Impairments
II: Analysis
5: Application Redundancy and Cloud Computing
5.1 Failures, Availability, and Simplex Architectures
5.2 Improving Software Repair Times via Virtualization
5.3 Improving Infrastructure Repair Times via Virtualization
5.4 Redundancy and Recoverability
5.5 Sequential Redundancy and Concurrent Redundancy
5.6 Application Service Impact of Virtualization Impairments
5.7 Data Redundancy
5.8 Discussion
6: Load Distribution and Balancing
6.1 Load Distribution Mechanisms
6.2 Load Distribution Strategies
6.3 Proxy Load Balancers
6.4 Nonproxy Load Distribution
6.5 Hierarchy of Load Distribution
6.6 Cloud-Based Load Balancing Challenges
6.7 The Role of Load Balancing in Support of Redundancy
6.8 Load Balancing and Availability Zones
6.9 Workload Service Measurements
6.10 Operational Considerations
6.11 Load Balancing and Application Service Quality
7: Failure Containment
7.1 Failure Containment
7.2 Points of Failure
7.3 Extreme Solution Coresidency
7.4 Multitenancy and Solution Containers
8: Capacity Management
8.1 Workload Variations
8.2 Traditional Capacity Management
8.3 Traditional Overload Control
8.4 Capacity Management and Virtualization
8.5 Capacity Management in Cloud
8.6 Storage Elasticity Considerations
8.7 Elasticity and Overload
8.8 Operational Considerations
8.9 Workload Whipsaw
8.10 General Elasticity Risks
8.11 Elasticity Failure Scenarios
9: Release Management
9.1 Terminology
9.2 Traditional Software Upgrade Strategies
9.3 Cloud-Enabled Software Upgrade Strategies
9.4 Data Management
9.5 Role of Service Orchestration in Software Upgrade
9.6 Conclusion
10: End-to-End Considerations
10.1 End-to-End Service Context
10.2 Three-Layer End-to-End Service Model
10.3 Distributed and Centralized Cloud Data Centers
10.4 Multitiered Solution Architectures
10.5 Disaster Recovery and Geographic Redundancy
III: Recommendations
11: Accountabilities for Service Quality
11.1 Traditional Accountability
11.2 The Cloud Service Delivery Path
11.3 Cloud Accountability
11.4 Accountability Case Studies
11.5 Service Quality Gap Model
11.6 Service Level Agreements
12: Service Availability Measurement
12.1 Parsimonious Service Measurements
12.2 Traditional Service Availability Measurement
12.3 Evolving Service Availability Measurements
12.4 Evolving Hardware Reliability Measurement
12.5 Evolving Elasticity Service Availability Measurements
12.6 Evolving Release Management Service Availability Measurement
12.7 Service Measurement Outlook
13: Application Service Quality Requirements
13.1 Service Availability Requirements
13.2 Service Latency Requirements
13.3 Service Reliability Requirements
13.4 Service Accessibility Requirements
13.5 Service Retainability Requirements
13.6 Service Throughput Requirements
13.7 Timestamp Accuracy Requirements
13.8 Elasticity Requirements
13.9 Release Management Requirements
13.10 Disaster Recovery Requirements
14: Virtualized Infrastructure Measurement and Management
14.1 Business Context for Infrastructure Service Quality Measurements
14.2 Cloud Consumer Measurement Options
14.3 Impairment Measurement Strategies
14.4 Managing Virtualized Infrastructure Impairments
15: Analysis of Cloud-Based Applications
15.1 Reliability Block Diagrams and Side-by-Side Analysis
15.2 IaaS Impairment Effects Analysis
15.3 PaaS Failure Effects Analysis
15.4 Workload Distribution Analysis
15.5 Anti-Affinity Analysis
15.6 Elasticity Analysis
15.7 Release Management Impact Effects Analysis
15.8 Recovery Point Objective Analysis
15.9 Recovery Time Objective Analysis
16: Testing Considerations
16.1 Context for Testing
16.2 Test Strategy
16.3 Simulating Infrastructure Impairments
16.4 Test Planning
17: Connecting the Dots
17.1 The Application Service Quality Challenge
17.2 Redundancy and Robustness
17.3 Design for Scalability
17.4 Design for Extensibility
17.5 Design for Failure
17.6 Planning Considerations
17.7 Evolving Traditional Applications
17.8 Concluding Remarks
About the Authors
Customers expect applications and services deployed on cloud computing infrastructure to deliver service quality, reliability, availability, and latency comparable to what they deliver when deployed on traditional, native hardware configurations. Cloud computing infrastructure introduces a new family of service impairment risks based on the virtualized compute, memory, storage, and networking resources that an Infrastructure-as-a-Service (IaaS) provider delivers to hosted application instances. As a result, application developers and cloud consumers must mitigate these impairments to assure that application service delivered to end users is not unacceptably impacted. This book methodically analyzes the impacts of cloud infrastructure impairments on application service delivered to end users, as well as the opportunities for improvement afforded by cloud. The book also recommends architectures, policies, and other techniques to maximize the likelihood of delivering comparable or better service to end users when applications are deployed to cloud.
Cloud-based application software executes within a set of virtual machine instances, and each individual virtual machine instance relies on virtualized compute, memory, storage, and networking service delivered by the underlying cloud infrastructure. As shown in Figure 1.1, the application presents customer facing service toward end users across the dotted service boundary, and consumes virtualized resources offered by the Infrastructure-as-a-Service provider across the dashed resource facing service boundary. The application's service quality experienced by the end users is primarily a function of the application's architecture and software quality, as well as the service quality of the virtualized infrastructure offered by the IaaS provider across the resource facing service boundary, and the access and wide area networking that connects the end user to the application instance. This book considers both the new impairments and opportunities of virtualized resources offered to applications deployed on cloud and how the service quality experienced by end users can be maximized. By ignoring service impairments of the end user's device, and access and wide area network, one can narrowly consider how application service quality differs when a particular application is hosted on cloud infrastructure compared with when it is natively deployed on traditional hardware.
Figure 1.1. Sample Cloud-Based Application.
The key technical difference for application software between native deployment and cloud deployment is that native deployments offer the application's (guest) operating system direct access to the physical compute, memory, storage, and network resources, while cloud deployment inserts a layer of hypervisor or virtual machine management software between the guest operating system and the physical hardware. This layer of hypervisor or virtual machine management software enables sophisticated resource sharing, technical features, and operational policies. However, the hypervisor or virtual machine management layer does not deliver perfect hardware emulation to the guest operating system and application software, and these imperfections can adversely impact application service delivered to end users. While Figure 1.1 illustrates application deployment to a single data center, real-world applications are often deployed to multiple data centers to improve user service quality by shortening transport latency to end users, to support business continuity and disaster recovery, and for other business reasons. Application service quality for deployment across multiple data centers is also considered in this book.
This book considers how application architectures, configurations, validation, and operational policies should evolve so that acceptable application service quality can be delivered to end users even when application software is deployed on cloud infrastructure. This book approaches application service quality from the end user's perspective while considering standards and recommendations from NIST, TM Forum, QuEST Forum, ODCA, ISO, ITIL, and so on.
This book provides application architects, developers, and testers with guidance on architecting and engineering applications that meet their customers' and end users' service reliability, availability, quality, and latency expectations. Product managers, program managers, and project managers will also gain deeper insights into the service quality risks and mitigations that must be addressed to assure that an application deployed onto cloud infrastructure consistently meets or exceeds customers' expectations for user service quality.
The work is organized into three parts: context, analysis, and recommendations. Part I: Context frames the context of service quality of cloud-based applications via the following:
Part II: Analysis methodically considers how application service defined in Chapter 2, “Application Service Quality,” is impacted by the infrastructure impairments enumerated in Chapter 4, “Virtualized Infrastructure Impairments,” across the following topics:
Part III: Recommendations covers the following:
As many readers are likely to study sections based on the technical needs of their business and their professional interest rather than strictly following this work's running order, cross-references are included throughout the work so readers can, say, dive into detailed Part II analysis sections, and follow cross-references back into Part I for basic definitions and follow references forward to Part III for recommendations. A detailed index is included to help readers quickly locate material.
The authors acknowledge the consistent support of Dan Johnson, Annie Lequesne, Sam Samuel, and Lawrence Cowsar that enabled us to complete this work. Expert technical feedback was provided by Mark Clougherty, Roger Maitland, Rich Sohn, John Haller, Dan Eustace, Geeta Chauhan, Karsten Oberle, Kristof Boeynaems, Tony Imperato, and Chuck Salisbury. Data and practical insights were shared by Karen Woest, Srujal Shah, Pete Fales, and many others. Bob Brownlie offered keen insights into service measurements and accountabilities. Expert review and insight on release management for virtualized applications was provided by Bruce Collier. The work benefited greatly from insightful review feedback from Mark Cameron. Iraj Saniee, Katherine Guo, Indra Widjaja, Davide Cherubini, and Karsten Oberle offered keen and substantial insights. The authors gratefully acknowledge the external reviewers who took time to provide thorough review and thoughtful feedback that materially improved this book: Tim Coote, Steve Woodward, Herbert Ristock, Kim Tracy, and Xuemei Zhang.
The authors welcome feedback on this book; readers may e-mail us at and
Figure 2.0 frames the context of this book: cloud-based applications rely on virtualized compute, memory, storage, and networking resources to provide information services to end users via access and wide area networks. The application's primary quality focus is on the user service delivered across the application's customer facing service boundary (dotted line in Figure 2.0).
Figure 2.0. Organization of Part I: Context.
Application Service Quality
This section considers the service offered by applications to end users and the metrics used to characterize the quality of that service. A handful of common service quality metrics that characterize application service quality are detailed. These user service key quality indicators (KQIs) are considered in depth in Part II: Analysis.
Figure 2.1 illustrates a simple cloud-based application with a pool of frontend components distributing work across a pool of backend components. The suite of frontend and backend components is managed by a pair of control components that provide management visibility and control for the entire application instance. Each of the application's components, along with its supporting guest operating system, executes in a distinct virtual machine instance served by the cloud service provider. The Distributed Management Task Force (DMTF) defines a virtual machine as:
the complete environment that supports the execution of guest software. A virtual machine is a full encapsulation of the virtual hardware, virtual disks, and the metadata associated with it. Virtual machines allow multiplexing of the underlying physical machine through a software layer called a hypervisor. [DSP0243]
Figure 2.1. Simple Cloud-Based Application.
For simplicity, this simple model ignores systems that directly support the application, such as security appliances that protect the application from external attack, domain name servers, and so on.
Figure 2.2 shows a single application component deployed in a virtual machine on cloud infrastructure. The application software and its underlying operating system—referred to as a guest OS—run within a virtual machine instance that emulates a dedicated physical server. The cloud service provider's infrastructure delivers the following resource services to the application's guest OS instance:
Figure 2.2. Simple Virtual Machine Service Model.
It is useful to define boundaries that demark applications and service offerings to better understand the dependencies, interactions, roles, and responsibilities of each element in overall user service delivery. This work will focus on the two high-level application service boundaries shown in Figure 2.3:
Figure 2.3. Application Service Boundaries.
Note that customer facing service and resource facing service boundaries are relative to a particular entity in the service delivery chain. Figure 2.3, and this book, consider these concepts from the perspective of a cloud-based application, but the same service boundary notions can be applied to an element of the cloud Infrastructure-as-a-Service or to a technology component offered “as-a-Service,” such as Database-as-a-Service.
Qualities such as latency and reliability of service delivered across a service boundary can be quantitatively measured. Technically useful service measurements are generally referred to as key performance indicators (KPIs). As shown in Figure 2.4, a subset of KPIs across the customer facing service boundary characterize key aspects of the customer's experience and perception of quality, and these are often referred to as key quality indicators (KQIs) [TMF_TR197]. Enterprises routinely track and manage these KQIs to assure that customers are delighted. Well-run enterprises will often tie staff bonus payments to achieving quantitative KQI targets to better align the financial interests of enterprise staff to the business need of delivering excellent service to customers.
Figure 2.4. KQIs and KPIs.
In the context of applications, KQIs often cover high-level business considerations, including service qualities that impact user satisfaction and churn, such as:
Different applications with different business models will define KPIs somewhat differently and will select different KQIs from their suite of application KPIs.
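As a minimal sketch of how such indicators might be computed, the following Python fragment derives a latency percentile and a failure ratio from a log of user operations observed at the customer facing service boundary. The record layout (`Operation`), the nearest-rank percentile method, and the choice of these two indicators are assumptions made for illustration; they are not definitions taken from [TMF_TR197] or from this book.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Operation:
    """One user operation observed at the service boundary (hypothetical record)."""
    latency_ms: float
    succeeded: bool


def latency_percentile(ops, pct):
    """Return the pct-th percentile latency using the nearest-rank method."""
    if not ops:
        raise ValueError("no operations to measure")
    ordered = sorted(op.latency_ms for op in ops)
    rank = max(1, ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def failure_ratio(ops):
    """Return the fraction of operations that did not succeed."""
    if not ops:
        raise ValueError("no operations to measure")
    failed = sum(1 for op in ops if not op.succeeded)
    return failed / len(ops)


# Ten sample operations with latencies 10, 20, ..., 100 ms; the slowest one fails.
sample = [Operation(latency_ms=10.0 * i, succeeded=(i != 10)) for i in range(1, 11)]
print(latency_percentile(sample, 90))  # 90th percentile latency in ms
print(failure_ratio(sample))           # fraction of failed operations
```

An enterprise would then pick which of these KPIs to promote to KQIs and attach quantitative targets to them.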
A primary resource facing service risk experienced by cloud-based applications is the quality of virtualized compute, memory, storage, and networking delivered by the cloud service provider to application components executing in virtual machine (VM) instances. Chapter 4, “Virtualized Infrastructure Impairments,” considers the following:
Figure 2.5 overlays common customer facing service KQIs with typical resource facing service KPIs on the simple application of Section 2.1.
Figure 2.5. Application Consumer and Resource Facing Service Indicators.
As shown in Figure 2.6, the robustness of an application's architecture characterizes how effectively the application can maintain quality across the application's customer facing service boundary despite impairments experienced across the resource facing service boundary and failures within the application itself.
Figure 2.6. Application Robustness.
Figure 2.7 illustrates a concrete robustness example: if the cloud infrastructure stalls a VM that is hosting one of the application backend instances for hundreds of milliseconds (see Section 4.3, “Nondelivery of Configured VM Capacity”), then is the application's customer facing service impacted? Do some or all user operations take hundreds of milliseconds longer to complete, or do some (or all) operations fail due to timeout expiration? A robust application will mask the customer facing service impact of this service impairment so end users do not experience unacceptable service quality.
Figure 2.7. Sample Application Robustness Scenario.
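One way a frontend might mask a stalled backend can be sketched as follows: give each backend a bounded time to answer and, if that deadline expires, retry the request on another backend instance. This is only an illustrative sketch; the function `call_with_failover`, the backend callables, and the timeout values are hypothetical, and the technique shown (timeout plus sequential retry) is one possible mitigation rather than the architecture prescribed by this book.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_failover(backends, request, timeout_s):
    """Try each backend in turn, abandoning any that exceeds timeout_s."""
    pool = ThreadPoolExecutor(max_workers=len(backends))
    try:
        for backend in backends:
            future = pool.submit(backend, request)
            try:
                return future.result(timeout=timeout_s)
            except FutureTimeout:
                continue  # backend appears stalled; try the next instance
        raise TimeoutError("all backends timed out")
    finally:
        # Do not block the caller waiting for stalled workers to finish.
        pool.shutdown(wait=False)


def stalled_backend(request):
    """Simulates a backend whose VM is stalled for hundreds of milliseconds."""
    time.sleep(0.3)
    return "slow:" + request


def healthy_backend(request):
    """Simulates a backend that answers promptly."""
    return "ok:" + request


print(call_with_failover([stalled_backend, healthy_backend], "r1", timeout_s=0.05))
```

With this approach the end user sees roughly one timeout interval of added latency rather than a failed operation. Whether that added latency is acceptable depends on the application's latency targets.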
Customer facing service quality expectations are fundamentally driven by application characteristics, such as:
These characteristics influence both the quantitative targets for an application's service quality (e.g., critical applications have higher service availability expectations) and specifics of those service quality measurements (e.g., maximum tolerable service downtime influences the minimum chargeable outage downtime threshold).
Readers will recognize that different information services entail different levels of criticality to users and the enterprise. While these ratings will vary somewhat based on organizational needs and customer expectations, the criticality classification definitions from the U.S. Federal Aviation Administration's National Airspace System's reliability handbook are fairly typical:
There is also a “Safety Critical” category, with a service availability rating of seven 9s for life-threatening risks and services where “loss would present an unacceptable safety hazard during the transition to reduced capacity operations” [FAA-HDBK-006A]. Few commercial enterprises offer services or applications that are safety critical, so seven 9s expectations are rare.
The higher the service criticality, the more the enterprise is willing to invest in architectures, policies, and procedures to assure that acceptable service quality is continuously available to users.
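To make an availability rating expressed as a number of 9s concrete, the following sketch converts it into the downtime it permits per year. It assumes a 365-day year and that all unavailability counts as downtime. The three- and five-9s values are printed only for comparison and are not the FAA class definitions.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def annual_downtime_seconds(nines):
    """Maximum downtime per year permitted by an availability of `nines` 9s."""
    unavailability = 10.0 ** -nines  # e.g., seven 9s -> 0.0000001
    return SECONDS_PER_YEAR * unavailability


for nines in (3, 5, 7):
    seconds = annual_downtime_seconds(nines)
    print(f"{nines} nines: about {seconds:.2f} seconds of downtime per year")
```

Under these assumptions, seven 9s permits only about 3 seconds of downtime per year, which illustrates why so few commercial services carry that expectation.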
As shown in Figure 2.8, there are three broad classifications of application service interactivity:
Figure 2.8. Interactivity Timeline.
Data networks are subject to three fundamental types of service impairments:
[RFC4594] characterizes tolerance to packet loss, delay, and jitter for common classes of applications.
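The three impairments named above can be estimated from simple packet observations, as in the sketch below. The function name, its inputs, and the jitter definition used here (mean absolute difference between the delays of consecutive received packets) are assumptions for illustration. Other jitter definitions exist, and this sketch is not taken from [RFC4594].

```python
from statistics import mean


def impairment_summary(packets_sent, delays_ms):
    """Summarize packet loss, delay, and jitter.

    packets_sent: number of packets transmitted.
    delays_ms: one-way delay of each packet that arrived, in arrival order.
    """
    if packets_sent <= 0 or len(delays_ms) > packets_sent:
        raise ValueError("inconsistent packet counts")
    received = len(delays_ms)
    loss_ratio = (packets_sent - received) / packets_sent
    mean_delay = mean(delays_ms) if delays_ms else None
    # Jitter here: average change in delay between consecutive received packets.
    if received > 1:
        jitter = mean(abs(b - a) for a, b in zip(delays_ms, delays_ms[1:]))
    else:
        jitter = 0.0
    return {"loss_ratio": loss_ratio, "mean_delay_ms": mean_delay, "jitter_ms": jitter}


# Five packets sent, four received with the one-way delays shown.
print(impairment_summary(5, [10, 12, 11, 15]))
```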