Getting Started
```bash
# di jupyter/pyspark-notebook bash
# spark-shell

# Not sure why I started looking into this one instead of the one above?
di p7hb/docker-spark
```
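Once one of these containers is up, a quick way to check that Spark itself works is to open `pyspark` (or a notebook) inside it and run a tiny job. This is only a sketch: the app name is arbitrary, and it assumes PySpark is already on the path, as it is in these images.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local SparkSession inside the container.
spark = SparkSession.builder.appName("smoke-test").getOrCreate()

# A trivial job: the numbers 0..9 as a single-column DataFrame, then count them.
print(spark.range(10).count())  # expected output: 10

spark.stop()
```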
Concepts
Operations
Spark's core operations are split into transformations (lazy) and actions (eager). A transformation only describes a computation; nothing actually runs until an action asks for a result.
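A small PySpark sketch of the difference (the numbers and app name are arbitrary): `filter` only extends the query plan, and nothing is computed until `count` is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-vs-eager").getOrCreate()

df = spark.range(1_000)             # transformation: builds a plan, runs nothing
evens = df.filter(df.id % 2 == 0)   # transformation: still nothing has executed
print(evens.count())                # action: triggers the job; prints 500

spark.stop()
```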
Spark Context
It acts like the orchestrator of the application and is responsible for:

- Task creation
- Scheduling
- Data locality (sending tasks to the data to reduce data movement)
- Fault tolerance
It is recommended to have just one SparkContext per application (though it is possible to have multiple contexts).
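A minimal sketch of getting at the SparkContext from PySpark. In recent versions you normally create a `SparkSession`, which owns the single underlying `SparkContext`; the app name and numbers here are arbitrary.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("context-demo").getOrCreate()
sc = spark.sparkContext  # the one SparkContext behind this session

# The context creates the tasks for this job, schedules them, and re-runs any
# that fail; with real datasets it also tries to send tasks to where the data lives.
rdd = sc.parallelize(range(100), numSlices=4)
print(rdd.sum())  # 4950

spark.stop()
```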