MEWSE - Multi Engine Workflow Submission and Execution on Apache YARN
Abstract
In this era of Big Data, designing a workflow to gain insights from vast amounts of data has become more complex. Several frameworks process batch and streaming data individually, but coordinating jobs between engines within a single workflow introduces performance penalties and other issues. Current workflow systems typically run on only one engine and do not offer the versatility required for today’s workflows. Submitting jobs to different engines manually is not only time consuming but also requires expertise in each of these engines. In this thesis, we address these issues by proposing MEWSE (Multi Engine Workflow Submission and Execution on Apache YARN). MEWSE is also designed with plug-and-play functionality to allow the inclusion of new engines. It has been tested on Amazon EC2 with a sample workflow that uses Hadoop, Mahout, Java, and several scripts to process the data.