Performance, Availability and Power Analysis for IaaS Cloud Kishor Trivedi [email protected] duke. edu www. ee. duke. edu/~kst Dept. of ECE, Duke University, Durham, NC 27708 Universita Napoli September 23, 2011 1 Duke University Research Triangle Park (RTP) 2 Duke UNC-CH NC state USA North Carolina 2 Trivedi’s Research Triangle Theory Stochastic modeling methods & numerical solution methods: Large Fault trees, Stochastic Petri Nets, Large/stiff Markov & non-Markov models Fluid stochastic models Performability & Markov reward models Software aging and rejuvenation Attack countermeasure trees Software Packages

Books: Blue, Red, White Applications Reliability/availability/performance Avionics, Space, Power systems, Transportation systems, Automobile systems Computer systems (hardware/software) Telco systems Computer Networks Virtualized Data center Cloud computing HARP (NASA), SAVE (IBM), IRAP (Boeing) SHARPE, SPNP, SREPT 3 Talk outline Overview of Reliability and Availability Quantification Overview of Cloud Computing Performance Quantification for IaaS Cloud (PRDC 2010) Availability Quantification for IaaS Cloud (DSN 2011) Power Quantification for IaaS Cloud (DSN workshop 2011) Future Research

Copyright © 2011 by K. S. Trivedi 4 An Overview of Reliability and Availability Quantification Methods Software + hardware in operation Dynamic as opposed to static behavior Copyright © 2011 by K. S. Trivedi 5 Reliability and Availability Quantification Measurement-Based More Accurate Expensive due to configurations Model-Based Combined approach where measurements are made at the subsystem level and models are built to derive system-level measures Copyright © 2011 by K. S. Trivedi 6 many parameters and Not always possible during system design. Reliability and Availability Evaluation Methods

Measurement-based Quantitative Evaluation Discrete-event simulation Model-based Hybrid Analytic Models Numerical solution of analytic models not as well utilized; Unnecessarily excessive use of simulation Closed-form solution Numerical solution via a tool Copyright © 2011 by K. S. Trivedi 7 Analytic Modeling Taxonomy Measurement-based Quantitative Dependability Evaluation Model-based Discrete-event simulation Hybrid Analytic Models Non-state-space models Analytic models State-space models Hierarchical composition Fixed point iterative models Copyright © 2011 by K. S. Trivedi 8 Non-state space models

Modeling using reliability block diagrams (RBDs), reliability graphs (relgraphs) and fault trees (FTs) are easy to use and efficient to solve for system reliability, system availability and system mean time to failure (MTTF) Product-form queuing networks for performance analysis 9 Example: Reliability Analysis of Boeing 787 Current Return Network Modeled as a Reliability Graph (Relgraph) 10 Reliability Analysis of Boeing 787 (cont’d) This real problem has too many minpaths Non-state space models also face largeness problem Number of paths from source to target 11 Reliability Analysis of Boeing 787 (cont’d)

Our Approach : Developed a new efficient algorithm for (un)reliability bounds computation developed and incorporated in SHARPE SHARPE (Symbolic Hierarchical Automated Reliability and Performance Evaluator) 12 Non-State-Space Models Failure/Repair Dependencies are often present; RBDs, relgraphs, FTREEs cannot easily handle these (e. g. , shared repair, warm/cold spares, imperfect coverage, non-zero switching time, travel time of repair person, reliability with repair). Product-form does not often hold when modeling real-life aspects such as simultaneous resource possession, priorities, retries, etc. 3 State-space models : Markov chains To model complex interactions between components, use models such as Markov chains or more generally state space models. Many examples of dependencies among system components have been observed in practice and captured by continuous-time Markov chains (CTMCs) Extension to Markov reward models makes computation of measures of interest relatively easy. Copyright © 2011 by K. S. Trivedi 14 Markov Availability model of WebSphere AP Server Failure detection By WLM By Node Agent Manual detection Recovery Node Agent Auto process restart Manual recovery

Process restart Node reboot Repair Application server and proxy server (with escalated levels of recovery) • Delay and imperfect coverage in each step of recovery modeled 15 15 Analytic Modeling Taxonomy Non-state-space models Analytic models State-space (Markov) models 16 Should I Use Markov Models? + Model Fault-Tolerance and Recovery/Repair + Model Dependencies + Model Contention for Resources and concurrency + Generalize to Markov Reward Models for Degradable systems + Can relax exponential assumption + Performance, Availability and Performability Modeling Possible – Large State Space

Copyright © 2011 by K. S. Trivedi 17 State Space Explosion State space explosion can be avoided by using hierarchical model composition. Use state-space models for those parts of a system that require them, and use non-state-space models for the more “well-behaved” parts of the system. Copyright © 2011 by K. S. Trivedi 18 Analytic Modeling Taxonomy Analytic models Non-state-sapce models Efficiency, simplicity State-space models Dependency capture Hierarchical composition To avoid largeness We are composing sub-model solutions together

Copyright © 2011 by K. S. Trivedi 19 Example: Architecture of SIP on IBM WebSphere AS: WebSphere Appl. Server (WAS) Replication domain 1 2 3 4 5 6 Nodes A, D A, E B, F B, D C, E C, F 20 Hierarchical composition System Failure system App servers k of 12 proxy AS1 AS 5 AS 6 PX 1 PX 2 AS 1 AS 2 AS 3 AS 4 1A BS A CM 1 2A BS A CM 1 3B BS B CM 1 4B BS B CM 1 5C BS C CM 1 6C BS C CM 1 P1 BS G CM 1 P2 BS H CM 2 AS 7 AS 8 AS 9 AS 10 AS 11 AS 12 1A BSA CM1 1D BS D CM 2 4D BS D CM 2 2E BS E CM 2 5E BS E CM 2 3F BS F CM 2 6F BS F CM 2

This model was responsible for the actual sale of the system by IBM to their Telco customer 21 Fixed-Point Iteration Input parameters of sub-models can be functions of outputs of other models If the import graph is not acyclic then we solve using fixed-point iteration Non-state-space models Efficiency, simplicity Analytic models State-space models Dependency capture Hierarchical composition To avoid largeness Fixed-Point Iteration To deal with interdependent submodels Copyright © 2011 by K. S. Trivedi 22 An Overview of Cloud Computing

Copyright © 2011 by K. S. Trivedi 23 NIST definition of cloud computing Definition by National Institute of Standards and Technology (NIST): “Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e. g. , networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. ” Source: P. Mell and T. Grance, “The NIST Definition of Cloud Computing”, October 7, 2009 24 Key characteristics

On-demand self-service: Provisioning of computing capabilities without human intervention Resource pooling: Shared physical and virtualized environment Rapid elasticity: Through standardization and automation, quick scaling Metered Service: Pay-as-you-go model of computing Many of these characteristics are borrowed from Cloud’s predecessors! Source: P. Mell and T. Grance, “The NIST Definition of Cloud Computing”, October 7, 2009 25 Evolution of cloud computing Time line of evolution Around 2005-06 Around 2000 Early 90s Early 80s Grid computing Cluster computing Utility computing Cloud computing Source: http://seekingalpha. com/article/167764-tipping-point-gartner-annoints-cloud-computing-top-strategic-technology 26 Cloud Service models Infrastructure-as-a-Service (IaaS) Cloud: Examples: Amazon EC2, IBM Smart Business Development and Test Cloud Platform-as-a-Service (PaaS) Cloud: Examples: Microsoft Windows Azure, Google AppEngine Software-as-a-Service (SaaS) Cloud: Examples: Gmail, Google Docs Copyright © 2011 by K. S. Trivedi 27 Deployment models Private Cloud: – Cloud infrastructure solely for an organization – Managed by the organization or third party – May exist on premise or off-premise

Public Cloud: – Cloud infrastructure available for use for general users – Owned by an organization providing cloud services Hybrid Cloud: – Composition of two or more clouds (private or public) Copyright © 2011 by K. S. Trivedi 28 Key Challenges Three critical metrics for a cloud: – Service (un)availability – Performance (response time) unpredictability – Power consumption Large number of parameters can affect performance, availability and power – Workload parameters – Failure/recovery characteristics – Types of physical infrastructure – Characteristics of virtualization infrastructures – Large scale; thousands of servers

Performance, availability & power quantification are difficult! Copyright © 2011 by K. S. Trivedi 29 Our goals in the IBM Cloud project Develop a comprehensive analytic modeling approach High fidelity Scalable and tractable Apply these models to cloud capacity planning Copyright © 2011 by K. S. Trivedi 30 Our approach and motivations behind it Difficulty with measurement-based approach: expensive experimentation for each workload and system configuration

Monolithic analytic model will suffer largeness and hence is not scalable Our approach: overall system model consists of a set of sub-models sub-model solutions composed via an interacting Markov chain approach scalable and tractable low cost of model solution while covering large parameter space 31 Duke/IBM project on cloud computing Joint work with Rahul Ghosh and Dong Seong Kim (Duke), Francesco Longo (Univ. of Messina) Vijay Naik, Murthy Devarakonda and Daniel Dias (IBM T. J. Watson Research Center) 32 Performance Quantification for IaaS Cloud [paper in Proc.

IEEE PRDC 2010] Copyright © 2011 by K. S. Trivedi 33 System model Current Assumptions [will be relaxed soon] Homogenous requests All physical machines (PMs) are identical. To minimize power consumption, PMs divided into three pools: Hot pool– fast provisioning but high power usage Warm pool—slower provisioning but lower power usage Cold pool – slowest provisioning but lowest power usage 34 Life-cycle of a job inside a IaaS cloud Provisioning response delay Arrival Queuing Provisioning Decision Resource Provisioning Decision Engine Instantiation VM deployment Actual Service Out

Run-time Execution Job rejection due to buffer full Job rejection due to insufficient capacity Provisioning and servicing steps: (i) resource provisioning decision, (ii) VM provisioning and (iii) run-time execution 35 Resource provisioning decision engine (RPDE) Provisioning response delay Arrival Queuing Provisioning Decision Resource Provisioning Decision Engine Instantiation VM deployment Actual Service Out Run-time Execution Job rejection due to buffer full Job rejection due to insufficient capacity Copyright © 2011 by K. S. Trivedi 36

Resource provisioning decision engine (RPDE) Flow-chart: Copyright © 2011 by K. S. Trivedi 37 CTMC model for RPDE i,s i = number of jobs in queue, s = pool (hot, warm or cold) ? 0,0 0,h ? 1,h ? ? h Ph … ? N-1,h ? h Ph ? h (1 ? Ph ) ? w Pw 0,w ? h Ph ? h Ph ? h (1 ? Ph ) ? w Pw ? h (1 ? Ph ) ? P ? c (1 ? Pc ) c c ? w Pw ? w Pw … ? ? ? N-1,w 1,w ? c (1 ? Pc ) ? c Pc ? w (1 ? Pw ) ? c (1 ? Pc ) ? c Pc ? w (1 ? Pw ) ? c (1 ? Pc ) ? w (1 ? Pw ) ? c Pc ? 0,c 1,c ? … ? N-1,c 38 RPDE model: parameters & measures Input Parameters: ? –arrival rate: data collected from cloud 1 / ? ,1 / ? w ,1 / ? c – mean search delays for resource provisioning decision engine: from searching algorithms or measurements Ph , Pw , Pc – probability of being able to provision: computed from VM provisioning model N – maximum # jobs in RPDE: from system/server specification Output Measures: Job rejection probability due to buffer full (Pblock) Job rejection probability due to insufficient capacity (Pdrop) Mean decision delay for an accepted job (E[Tdecision]) Mean queuing delay for an accepted job (E[Tq_dec]) Copyright © 2011 by K. S. Trivedi 39

VM provisioning Provisioning response delay Arrival Queuing Provisioning Decision Resource Provisioning Decision Engine Instantiation VM deployment Actual Service Out Run-time Execution Job rejection due to buffer full Job rejection due to insufficient capacity Copyright © 2011 by K. S. Trivedi 40 VM provisioning model Hot PM Hot PM pool Resource Provisioning Decision Engine Accepted jobs Running VMs Idle resources on hot machine Idle resources on warm machine Idle resources on cold machine Service out Warm pool Cold pool 41 VM provisioning model for each hot PM ,0,0 ?h µ 0,1,0 ?h … ?h Lh,1,0 ?h 0,0,1 µ ?h 2µ … … µ ? h ?h (Lh-1),1,1 µ ?h 2µ (m ? 1)µ Lh,1,1 Lh is the buffer size and m is max. # VMs that can run simultaneously on a PM ?h … ?h 2µ … … (m ? 1)µ ?h (m ? 1)µ ?h 0,0,(m-1) ?h 0,1,(m-1) ?h mµ ?h (Lh-1),1,(m-1) … mµ i,j,k ?h 0,0,m ?h 1,0,m ?h ? h … ?h (m ? 1)µ Lh,1,(m1) ?h mµ ?h ?h ?h Lh,0,m i = number of jobs in the queue, j = number of VMs being provisioned, k = number of VMs running Copyright © 2011 by K. S. Trivedi 42 VM provisioning model (for each hot PM) Input Parameters: ?h = ? (1 ? Pblock ) nh 1 / ? can be measured experimentally 1 / µ obtained from the lower level run-time model Pblock obtained from the resource provisioning decision model Hot pool model is the set of nh independent hot PM models Output Measure: n ( h) ( h) Ph = prob. that a job is accepted in the hot pool = 1 ? (? ? ( Lh ,1,i ) + ? ( Lh , 0,m ) ) h m ? 1 i =0 where, ( ? ? ( L , 1 , i ) + ? ( L , 0 , m ) ) is the steady state probability that a PM can not i= 0 accept job for provisioning – from the solution of the Markov model of a hot PM on the previous slide Copyright © 2011 by K. S. Trivedi (h) h m ? (h) h 43 VM provisioning model for each warm PM 0,0,0 ?w 0,1*,0 ?w … ?w ? w ? w Lw,1*,0 ?w µ ?w 0,1**, 0 ?w Lw,1,0 0,1,0 ?w 0,0,1 … ? w ?h ? w µ ?w ?h … µ Lw, 1**,0 ?w (Lw-1),1,1 ?h 2µ µ Lw,1,1 0,1,1 ?w 2µ ?h … ?h 2µ … … (m ? 1)µ (m ? 1)µ ?h 0,0,(m-1) ?w 0,1,(m-1) ?w mµ … (m ? 1)µ ?h (Lw-1),1,(m1) (m ? 1)µ mµ ?h 0,0,m ?h ?w ?w … ?h ?w mµ ?w Lw,1,(m-1) ?h Lw,0,m 44 ?w 1,0,m VM provisioning model for each cold PM 0,0,0 ?c 0,1*,0 ?c … ?c Lc,1*, 0 ?c µ 0,1,0 ?c 0,1**, 0 ?c ?c … ?c Lc,1,0 ?c ?h µ µ ?c ? c (Lc-1),1,1 ?h Lc , 1**,0 µ ?h 2µ 0,0,1 ?c 0,1,1 ?c … ? c 2µ (m? 1)µ ?h …

Lc,1, 1 2µ … … ?h (m ? 1)µ ?h 0,0,(m-1) ?c 0,1,(m-1) ?c mµ … (m ? 1)µ ?h (Lc-1),1,(m-1) (m ? 1)µ mµ ?h 0,0, m ?h 1,0, m ?c ? c ?c … ?h ?c mµ ? c Lc,1,(m-1) ?h Lc,0,m 45 VM provisioning model: Summary Warm/cold PM model is similar to hot PM, except: (i) Effective job arrival rate (ii) For first job, warm/cold PM requires additional start-up time (iii) Mean provisioning delay for a VM for the first job is longer Outputs of hot, warm and cold pool models: Probabilities (Ph , Pw , Pc )that at least one PM in hot/warm/cold pool can accept a job 46 Import graph for performance models ob rejection probability and mean response delay Pblock RPDE model Pblock Ph Pw Hot pool model Pblock Pc Ph Ph Warm pool model Pw Cold pool model VM provisioning models 47 Fixed-point iteration To solve hot, warm and cold PM models, we need Pblock from resource provisioning decision model To solve provisioning decision model, we need Ph , Pw, Pc and cold pool model respectively from hot, warm This leads to a cyclic dependency among the resource provisioning decision model and VM provisioning models (hot, warm, cold) We resolve this dependency via fixed-point iteration Observe, our fixed-point variable is

Pblock equation is of the form: Pblock = f ( Pblock ) and corresponding fixed-point 48 Performance measures comparison with monolithic model 1 PM per pool and 1 VM per PM Jobs/hr Mean RPDE queue length Rejection probability ISP monolithic 9. 2321e-07 4. 3364e-05 2. 4225e-04 6. 4377e-04 1. 2655e-03 2. 1179e-03 3. 2091e-01 4. 5462e-03 ISP 9. 8899e-06 4. 2334e-02 2. 3496e-01 3. 9860e-01 5. 1069e-01 5. 8915e-01 6. 4648e-01 6. 8999e-01 Monolithic 1. 1221e-03 8. 0500e-02 2. 6587e-01 4. 1493e-01 5. 1969e-01 5. 9449e-01 6. 4985e-01 6. 223e-01 1 5 10 15 20 25 30 35 9. 0332e-07 4. 1622e-05 2. 3731e-04 6. 3539e-04 1. 2526e-03 2. 0990e-03 3. 1826e-03 4. 5106e-03 The error is between e-03 and e-07 for all the results. The number of states in monolithic model is 912 while in ISP model it is 21 49 Availability Quantification for IaaS Cloud [paper in Proc. IEEE/IFIP DSN 2011] Copyright © 2011 by K. S. Trivedi 50 Assumptions We consider the net effect of different failures and repairs of PMs MTTF of each hot PM is 1 / ? h and that of each warm PM is 1 / ? w with 1 / ? h