Cost-Effective Resource Configurations for Executing Data-Intensive Workloads in Public Clouds
Optimization, Data-Intensive Workloads, Resource Provisioning, Tabu Search, Cost Model, Performance Model, Inexpensive Deployment
The rate of data growth in many domains is straining our ability to manage and analyze it. Consequently, we see the emergence of computing systems that attempt to efficiently process data-intensive, or I/O-bound, applications with large data. Cloud computing offers “infinite” resources on demand and on a pay-as-you-go basis. As a result, it has gained interest for large-scale data processing. Given this supposedly infinite resource set, we need a provisioning process to determine appropriate resources for data processing or workload execution. We observe that the prevalent data processing architectures do not usually employ provisioning techniques available in a public cloud, and that existing provisioning techniques have largely ignored data-intensive applications in public clouds. In this thesis, we take a step towards bridging the gap between existing data processing approaches and the provisioning techniques available in a public cloud, such that the monetary cost of executing data-intensive workloads is minimized. We formulate the provisioning problem and include constructs that exploit a cloud’s elasticity to allocate, prior to execution, any number of resources to host a multi-tenant database system. Provisioning is modeled as a search problem, and we use standard search heuristics to solve it. We propose a novel framework for resource provisioning in a cloud environment. Our framework allows pluggable cost and performance models. We instantiate the framework by developing various search algorithms and cost and performance models to support the search for an effective resource configuration. For evaluation, we consider data-intensive workloads that are transactional, analytical, or mixed, and that access multiple database tenants. The workloads are based on standard TPC benchmarks. In addition, user preferences on response time or throughput are expressed as constraints.
We validate our proposals and their results in a real public cloud, namely the Amazon cloud. The evaluation supports our claim that the framework is an effective tool for provisioning database workloads in a public cloud with minimal dollar cost.
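
The idea of treating provisioning as a constrained search with pluggable cost and performance models can be illustrated with a minimal sketch. It is not the thesis's implementation: the instance prices, relative speeds, linear cost model, capacity-based performance model, and every function name below are assumptions made only for illustration. Tabu search is used because it appears in the keywords.

```python
from collections import deque
from typing import Callable, Iterator, Optional, Tuple

Config = Tuple[int, ...]  # number of VMs of each (hypothetical) instance type

# Hypothetical hourly prices and relative speeds; not taken from the thesis.
PRICES = (0.10, 0.40)
SPEEDS = (1.0, 5.0)


def cost_model(config: Config) -> float:
    """Dollar cost per hour (illustrative linear model)."""
    return sum(n * p for n, p in zip(config, PRICES))


def performance_model(config: Config, work: float = 100.0) -> float:
    """Predicted response time: work divided by aggregate speed."""
    capacity = sum(n * s for n, s in zip(config, SPEEDS))
    return float("inf") if capacity == 0 else work / capacity


def neighbours(config: Config, max_vms: int = 20) -> Iterator[Config]:
    """Configurations reachable by adding or removing one VM of one type."""
    for i in range(len(config)):
        for delta in (-1, 1):
            n = config[i] + delta
            if 0 <= n <= max_vms:
                yield config[:i] + (n,) + config[i + 1:]


def tabu_search(
    start: Config,
    cost: Callable[[Config], float],
    perf: Callable[[Config], float],
    max_response_time: float,
    iterations: int = 200,
    tabu_size: int = 10,
) -> Optional[Config]:
    """Return the cheapest feasible configuration visited, or None."""

    def feasible(c: Config) -> bool:
        return perf(c) <= max_response_time

    def rank(c: Config) -> Tuple[int, float]:
        # Feasible configurations are ordered by cost; infeasible ones come
        # last and are ordered by how slow they are predicted to be.
        return (0, cost(c)) if feasible(c) else (1, perf(c))

    current = start
    best = start if feasible(start) else None
    tabu = deque([start], maxlen=tabu_size)  # recently visited configurations
    for _ in range(iterations):
        candidates = [c for c in neighbours(current) if c not in tabu]
        if not candidates:
            break
        # Move to the best non-tabu neighbour, even if it is worse than the
        # current configuration; this is what lets the search leave local optima.
        current = min(candidates, key=rank)
        tabu.append(current)
        if feasible(current) and (best is None or cost(current) < cost(best)):
            best = current
    return best


best = tabu_search((10, 10), cost_model, performance_model, max_response_time=10.0)
print(best, cost_model(best) if best is not None else None)
```

Because the cost and performance models are passed in as plain functions, either can be swapped without touching the search loop, which mirrors the pluggable design described above. The response-time constraint is handled by ranking infeasible configurations last rather than by discarding them, a design choice of this sketch that lets the search pass through infeasible configurations when no feasible neighbour is available.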