Benefits of Using AWS EMR

Turbot supports many HPC and Big Data cluster environments (e.g. AWS EMR and CFNCluster), and the right choice depends on the use case. Many customers have benefited from AWS EMR, a powerful Hadoop PaaS from AWS that lets you run your scripts and jobs quickly and efficiently without having to manage infrastructure components.

We are often asked why a customer would use AWS EMR rather than run their own Hadoop cluster. Below are some of the benefits other customers have realized using EMR:

  • EMR is Hadoop as a managed service
    • Maintenance of systems is handled by AWS
    • No Hadoop versioning and upgrade complexity
    • EMR automatically replaces failed servers/nodes
    • Saves the operational overhead of managing tens, hundreds, or thousands of instances individually
    • AWS EMR does not require the user to manage capacity
    • AWS EMR is pay-for-what-you-use; there is no hardware to depreciate, and you always have access to the latest and greatest through AWS
  • EMR decouples storage from compute
    • EMR allows multiple clusters to use the same file system without having to go through HDFS
    • S3 can be used as the persistent storage; S3 is designed for eleven 9’s (99.999999999%) of durability and four 9’s (99.99%) of availability
    • The cluster can be turned off and data can still be uploaded to or downloaded from S3
  • EMR is resilient, real-time scalable, and cost-effective
    • EMR clusters can be resized up or down to save costs or increase performance
    • Spot Instances can be used to run large, time-insensitive computational jobs inexpensively
    • EMR clusters can be persistent/long running but scaled out according to demand
    • Different use cases can utilize different clusters without affecting each other
    • Different clusters can run different versions of software depending on what each business application requires
  • Support
    • AWS Enterprise support covers EMR
    • EMR does not require a central team to manage one cluster; clusters can be federated and operated by application teams
  • Integrations
    • Many Hadoop ecosystem tools are supported out of the box: Hive, Pig, HBase, Hue, Mahout, Presto, Spark, Oozie, Sqoop, Tez, Zeppelin, ZooKeeper
    • EMR integrates with many visualization tools such as Tableau, MicroStrategy, Datameer, etc.
    • EMR integrates with most AWS services: S3, RDS, DynamoDB, Redshift, Data Pipeline, Kinesis, etc.
  • Cloud Big Data Best Practices:
    • Industry best practice is to favor transient HPC environments over persistent ones. AWS EMR allows you to quickly spin instances up and down while keeping storage persistent in other highly available AWS services.
    • Persistent Hadoop/HDFS models can raise costs as much as tenfold, because it is difficult to keep compute fully utilized on a persistent cluster.
    • A persistent cluster runs on your network for weeks to months, so security requirements such as access management, vulnerability scanning, virus scanning, and OS controls become more important and add operational overhead.
    • EMR coupled with the other services mentioned above is highly scalable and durable. Replicating that level of availability with your own EC2 cluster would require significant investment and effort to manage state, uptime, and performance across the ecosystem.
  • Turbot AWS EMR Automations:
    • Turbot can be configured to automatically harden AWS EMR per CIS Level Benchmarks, manage users from AD, manage patching, and manage various environment variables.
    • Turbot automations let application teams start using AWS EMR quickly within a controlled framework, using their current Linux credentials.
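
To make the transient-cluster pattern described above more concrete, here is a minimal sketch in Python. It only builds the request parameters that boto3's EMR `run_job_flow` call accepts; it does not contact AWS. The function name, bucket paths, release label, instance type, and node counts are all illustrative assumptions, not values taken from this article or from Turbot.

```python
from typing import Any, Dict, List


def build_transient_cluster_request(
    name: str,
    job_script_uri: str,                 # hypothetical S3 path to a Spark job
    log_uri: str,                        # hypothetical S3 path for cluster logs
    release_label: str = "emr-6.15.0",   # assumption: any supported EMR release
    instance_type: str = "m5.xlarge",    # assumption: illustrative instance type
    core_nodes: int = 2,
    spot_task_nodes: int = 4,
) -> Dict[str, Any]:
    """Build keyword arguments for boto3's EMR run_job_flow call.

    The cluster is transient: it runs one step and then terminates,
    while input, output, and logs stay in S3.
    """
    groups: List[Dict[str, Any]] = [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_nodes},
    ]
    if spot_task_nodes > 0:
        # Task nodes hold no HDFS data, so they are a natural fit for Spot.
        groups.append(
            {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": instance_type, "InstanceCount": spot_task_nodes}
        )

    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": groups,
            # False means the cluster shuts down once its steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": [
            {
                "Name": "run-job",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", job_script_uri],
                },
            }
        ],
        # Default EMR role names; your account may use different roles.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    request = build_transient_cluster_request(
        name="nightly-report",
        job_script_uri="s3://example-bucket/jobs/report.py",
        log_uri="s3://example-bucket/logs/",
    )
    for group in request["Instances"]["InstanceGroups"]:
        print(group["InstanceRole"], group["Market"], group["InstanceCount"])
```

To actually launch a cluster, the resulting dictionary could be passed as `boto3.client("emr").run_job_flow(**request)`, which requires AWS credentials and the referenced roles to exist in the account. Keeping the request builder separate from the API call makes it easy to review or test the cluster definition without incurring any cost.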