Tuning Spark Performance
Abstract
In the big data era, big data frameworks play a vital role in storing and processing large amounts of data, providing significant improvements in performance and availability. Spark is one of the most popular of these frameworks, offering high scalability and fault tolerance with its in-memory engine. Spark's execution engine has approximately 200 configurable parameters, and default values are assigned to them to hide this complexity from users and provide initial ease of use. However, the default values are not the best setting for every workload. In this work, we propose a general tuning algorithm named QST (Queen’s Spark Tuning) to help users tune Spark and improve overall performance. First, we study Spark performance for a variety of workloads and identify 9 tunable parameters, among more than 200, that have a significant impact on performance. Then we propose QST, a general greedy iterative tuning algorithm for this set of 9 key parameters. QST classifies Spark workloads as memory-intensive, shuffle-intensive, or all-intensive, and configures the parameters for each type of workload. We evaluate QST experimentally using benchmark workloads and industry workloads. In our experiments, QST significantly improves Spark performance: it yields an average speedup of 65% on our benchmark evaluation workloads and 57% on our industry evaluation workloads.
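To make the phrase "greedy iterative tuning" concrete, the following is a minimal sketch of a generic greedy, one-parameter-at-a-time search. It is not QST itself: the abstract does not list the 9 parameters or describe the algorithm's steps, so the three Spark properties used here (which are real Spark settings but not necessarily among the 9), the candidate values, the function names `greedy_tune` and `toy_runtime`, and the synthetic cost model are all assumptions made for illustration. In practice, `measure` would run the workload and return its wall-clock time.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, str]


def greedy_tune(
    start: Config,
    candidates: Dict[str, List[str]],
    measure: Callable[[Config], float],
    max_rounds: int = 3,
) -> Tuple[Config, float]:
    """Generic greedy search: change one parameter at a time and keep
    a new value only if the measured runtime improves."""
    best = dict(start)
    best_time = measure(best)
    for _ in range(max_rounds):
        improved = False
        for name, values in candidates.items():
            for value in values:
                if value == best[name]:
                    continue  # already the current setting
                trial = {**best, name: value}
                trial_time = measure(trial)
                if trial_time < best_time:
                    best, best_time = trial, trial_time
                    improved = True
        if not improved:
            break  # a full pass found no better setting
    return best, best_time


def toy_runtime(cfg: Config) -> float:
    """Synthetic stand-in for running a Spark job (assumption, not real data)."""
    mem_gb = int(cfg["spark.executor.memory"].rstrip("g"))
    parallelism = int(cfg["spark.default.parallelism"])
    compress = cfg["spark.shuffle.compress"] == "true"
    return 100.0 / mem_gb + abs(parallelism - 64) * 0.5 + (0.0 if compress else 10.0)


# Hypothetical starting point and candidate values, chosen for the toy model.
start = {
    "spark.executor.memory": "1g",
    "spark.default.parallelism": "8",
    "spark.shuffle.compress": "false",
}
candidates = {
    "spark.executor.memory": ["1g", "2g", "4g"],
    "spark.default.parallelism": ["8", "32", "64", "128"],
    "spark.shuffle.compress": ["true", "false"],
}

best, best_time = greedy_tune(start, candidates, toy_runtime)
print(best, best_time)
```

A greedy search of this kind needs far fewer runs than an exhaustive grid over all combinations, at the cost of possibly missing settings that only help when changed together.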
URI for this record
http://hdl.handle.net/1974/24439