Getting started

Apache Spark is a flexible, general-purpose engine for large-scale data processing. It keeps you productive by:

  • supporting batch, real-time streaming, machine learning, and graph workloads within a single framework, which also pays off architecturally;
  • processing data in memory whenever possible, resulting in fast execution on mid- to large-scale data;
  • offering a higher level of abstraction than the Java MapReduce API, with a choice of languages for developers (currently Scala, Python, and Java); see the sketch after this list.
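To make the abstraction point concrete, here is the canonical word count in Spark's Scala API; a minimal sketch, assuming a local run and an illustrative input path, that does in a handful of lines what a Java MapReduce job needs a mapper class, a reducer class, and driver boilerplate for:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Run locally on all cores; app name and input path are illustrative.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val counts = sc.textFile("input.txt")    // hypothetical input file
      .flatMap(line => line.split("\\s+"))   // split lines into words
      .map(word => (word, 1))                // pair each word with a count of 1
      .reduceByKey(_ + _)                    // sum the counts per word

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```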

From 10,000 feet the full Apache Spark stack looks as follows:

[Figure: the Apache Spark stack]

… and here's the breakdown:

  1. Data platform (HDFS, HBase, Cassandra, S3)
  2. Execution environment
  3. Spark core engine
  4. Spark ecosystem
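To see how the top two layers relate, here is a minimal sketch assuming Spark's GraphX library: the ecosystem (layer 4) builds directly on the core engine (layer 3), so a GraphX property graph is assembled from plain core RDDs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object StackDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StackDemo").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // GraphX (layer 4) consumes ordinary RDDs from the core engine (layer 3):
    // one RDD of (vertexId, attribute) pairs and one RDD of edges.
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob")))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows")))
    val graph    = Graph(vertices, edges)

    println(s"vertices: ${graph.vertices.count()}, edges: ${graph.edges.count()}")
    sc.stop()
  }
}
```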

Get Spark

You can get Spark directly from Apache; enterprise support is available for MapR and CDH via Databricks, and a tech preview is planned for HDP.
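If you would rather add Spark to an existing project than download a distribution, the artifacts are also published to Maven Central; a minimal sketch of an sbt dependency (the version number is illustrative):

```scala
// build.sbt: pull Spark core from Maven Central (version is illustrative)
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"
```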

Learn Spark

Books

  • Spark in Action
  • Spark GraphX in Action
  • Learning Spark
  • Fast Data Processing with Spark

