Spark 3.x course examples and exercises.
Detailed write-up coming in a later update.
Repo: SparkProgrammingInScala
Spark models data as immutable, partitioned collections; transformations build a lineage graph instead of mutating state, which is what makes recomputation and fault recovery possible.
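A minimal sketch of immutability and lineage, assuming a local Spark 3.x session. The object and method names are hypothetical and not taken from the course material.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical object name, for illustration only.
object LineageSketch {
  // Each transformation returns a new RDD; `numbers` itself is never modified.
  def evenSquares(sc: SparkContext): RDD[Int] = {
    val numbers = sc.parallelize(1 to 10, numSlices = 4)
    val squares = numbers.map(n => n * n)
    squares.filter(_ % 2 == 0)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lineage-sketch")
      .master("local[2]") // assumption: run locally with two threads
      .getOrCreate()
    val result = evenSquares(spark.sparkContext)
    // Prints the chain of parent RDDs that Spark would replay to rebuild a lost partition.
    println(result.toDebugString)
    println(result.collect().mkString(", "))
    spark.stop()
  }
}
```

The output of `toDebugString` is the lineage: if a partition of the filtered RDD is lost, Spark can recompute it from the parent RDDs listed there rather than restoring mutated state.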
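Transformations are recorded, not run, until an action fires. The scheduler then builds a DAG of stages, splitting it at shuffle boundaries and pipelining narrow dependencies within each stage so that data moves across the network only where a shuffle is required.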
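Lazy evaluation can be observed with an accumulator that counts how often the mapped function runs. This is a sketch under the assumption of a local run with no task retries; accumulator updates made inside transformations may be applied more than once when tasks are re-executed. The object and method names are hypothetical.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Hypothetical object name, for illustration only.
object LazySketch {
  // Returns (calls seen before the action, reduce result, calls seen after the action).
  def run(sc: SparkContext): (Long, Int, Long) = {
    val calls = sc.longAccumulator("mapCalls")
    // Only recorded in the lineage here; the function body has not executed yet.
    val doubled = sc.parallelize(1 to 5, numSlices = 2).map { n =>
      calls.add(1L)
      n * 2
    }
    val before = calls.sum
    val total = doubled.reduce(_ + _) // action: triggers the job
    val after = calls.sum
    (before, total, after)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lazy-sketch")
      .master("local[2]") // assumption: run locally with two threads
      .getOrCreate()
    println(run(spark.sparkContext))
    spark.stop()
  }
}
```

In a clean local run the counter stays at zero until `reduce` is called, which is the point at which the scheduler turns the recorded transformations into stages and tasks.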
The Catalyst optimizer rewrites query plans (predicate pushdown, column pruning), and Tungsten provides a compact binary memory format (optionally off-heap) plus whole-stage code generation for CPU-efficient execution.
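One way to see the optimizer's rewrites is to write a small Parquet dataset and inspect the plan of a filtered, projected query. This is a sketch only: the temporary path, column names, and object name are assumptions, and the exact plan text varies between Spark versions.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical object name, for illustration only.
object OptimizerSketch {
  // Returns the row count of the query and its physical plan as text.
  def prunedQuery(spark: SparkSession): (Long, String) = {
    // Write to a fresh sub-path so the target directory does not exist yet.
    val dir = Files.createTempDirectory("optimizer-sketch").resolve("data").toString
    spark.range(0, 100)
      .withColumn("label", col("id") * 2)
      .withColumn("unused", col("id") + 1)
      .write.parquet(dir)

    // Filter on `id` and keep only `label`; `unused` is never needed.
    val query = spark.read.parquet(dir).filter(col("id") > 90).select("label")
    (query.count(), query.queryExecution.executedPlan.toString)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("optimizer-sketch")
      .master("local[2]") // assumption: run locally with two threads
      .getOrCreate()
    val (rows, plan) = prunedQuery(spark)
    println(s"rows = $rows")
    println(plan)
    spark.stop()
  }
}
```

For a Parquet source the scan node in the printed plan typically lists the filters pushed to the reader and a read schema without the `unused` column, which are the visible traces of predicate pushdown and column pruning.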
A driver coordinates executors across nodes; partition count and shuffle behavior are the main levers for performance, and they are the course's recurring theme.
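The effect of these levers on partition counts can be sketched with the RDD API, where the numbers are explicit. The data, partition counts, and object name below are illustrative assumptions rather than tuning recommendations.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Hypothetical object name, for illustration only.
object PartitionSketch {
  // Returns the partition counts of (input, after the shuffle, after coalesce).
  def partitionCounts(sc: SparkContext): (Int, Int, Int) = {
    val pairs = sc.parallelize(1 to 1000, numSlices = 8).map(n => (n % 10, 1))
    // Wide dependency: records with the same key are shuffled into 3 partitions.
    val counts = pairs.reduceByKey(_ + _, 3)
    // Narrow dependency: coalesce merges existing partitions without a shuffle by default.
    val merged = counts.coalesce(1)
    (pairs.getNumPartitions, counts.getNumPartitions, merged.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-sketch")
      .master("local[2]") // assumption: run locally with two threads
      .getOrCreate()
    println(partitionCounts(spark.sparkContext))
    spark.stop()
  }
}
```

Each partition becomes one task per stage, so the first number bounds the parallelism of the map side and the second bounds the parallelism after the shuffle.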