(Image by author)

Introduction

Storage is now cheap, whereas compute remains expensive. A traditional MapReduce cluster that is kept "on" perpetually will therefore rack up enormous costs, especially if the MapReduce jobs are triggered only sporadically. This article explores how to automate a big data processing pipeline while keeping costs low and enabling alerts. This is achieved using AWS services such as Amazon EMR (Elastic MapReduce), AWS Step Functions, Amazon EventBridge, and AWS Lambda.

For example, if a data processing job takes an hour to run and is triggered once every day on a cluster that is always on, then ~96% of the compute…
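The sentence above is cut off in this excerpt, so the following is only a sketch of the arithmetic it appears to describe, assuming the ~96% figure refers to the share of the day an always-on cluster sits idle. The function name `idle_fraction` is hypothetical and used purely for illustration.

```python
HOURS_PER_DAY = 24


def idle_fraction(job_hours_per_day: float) -> float:
    """Return the fraction of each day an always-on cluster spends idle.

    Assumption: the cluster runs (and is billed) for all 24 hours,
    while the job only needs `job_hours_per_day` of them.
    """
    if not 0 <= job_hours_per_day <= HOURS_PER_DAY:
        raise ValueError("job_hours_per_day must be between 0 and 24")
    return (HOURS_PER_DAY - job_hours_per_day) / HOURS_PER_DAY


# A one-hour job triggered once a day leaves 23 of 24 hours unused.
fraction = idle_fraction(1.0)
print(f"Idle share of compute: {fraction:.1%}")  # prints "Idle share of compute: 95.8%"
```

Under these assumptions, 23 of every 24 billed hours do no useful work, which rounds to the ~96% quoted above.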

Jinam Shah

Sr. Software Engineer at Cactus Communications who is passionate about AWS and data science. | jinamshah.com
