Apache Spark is a flexible, general-purpose engine for large-scale data processing. It makes you productive because it:
- supports batch, streaming, machine learning, and graph workloads within a single framework, which also simplifies the overall architecture.
- performs in-memory processing whenever possible, resulting in fast execution on mid- to large-scale data.
- offers a higher level of abstraction than the Java MapReduce API, with a choice of languages for developers (currently Scala, Python, and Java).
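To give a feel for that higher-level, functional style, here is a hedged sketch of the classic word-count pipeline written with plain Python built-ins (no Spark installation assumed): in Spark this would be a chain of `flatMap`, `map`, and `reduceByKey` calls on a distributed dataset, while here an in-memory list stands in for the data.

```python
from collections import Counter

# Sample input standing in for a distributed dataset of text lines.
lines = [
    "to be or not to be",
    "to see or not to see",
]

# flatMap-style step: split every line into individual words.
words = [w for line in lines for w in line.split()]

# map + reduceByKey-style step: count occurrences per word.
counts = Counter(words)

print(counts["to"])  # → 4 (each line contributes two "to"s)
```

The point is the shape of the computation: a few chained transformations replace the boilerplate of a hand-written Java MapReduce job, and the Spark version of the same pipeline reads almost identically.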
From 10,000 feet, the full Apache Spark stack looks as follows:
… and here's the breakdown:
- Data platform (HDFS, HBase, Cassandra, S3)
- Execution environment
- Spark core engine
- Spark ecosystem
Further reading:
- Examples via Apache
- Running Spark jobs on EMR
- Why Spark Is the Next Top Compute Model
- tutorial: Apache Spark – a Fast Big Data Analytics Engine
- tutorial: Why Apache Spark is a Crossover Hit for Data Scientists
- tutorial: Run Apache Spark on Apache Mesos
- tutorial: Getting Started Running Apache Spark on Apache Mesos
- hands-on material/future stuff: Apache Spark on YouTube
- Spark: Open Source Superstar Rewrites Future of Big Data by Cade Metz, 2013
- How companies are using Spark by Ben Lorica, 2013
- In-Stream Big Data Processing by Ilya Katsov, 2013
- Spark Is Too Big To Fail by Thomas W. Dinsmore, 2015