Spark Programming in Scala — Kaustubh Mani Tripathi

Spark 3.x course examples and exercises.

Detailed write-up coming in a later update.


How it works — components

RDD / DataFrame API

Spark models data as immutable, partitioned collections. Transformations build a lineage graph instead of mutating state, so a lost partition can be recomputed from its parents; that is what makes fault recovery possible without replicating the data.
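The immutability and lineage described above can be seen in a few lines. This is a minimal sketch, not taken from the course material: the object name LineageSketch, the helper evensSquared, and the local[2] master are assumptions chosen for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object LineageSketch {
  // Hypothetical helper: derives a new RDD; `numbers` itself is never modified.
  def evensSquared(numbers: RDD[Int]): RDD[Int] =
    numbers.filter(_ % 2 == 0).map(n => n * n)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lineage-sketch")
      .master("local[2]") // assumption: local mode with two worker threads
      .getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(1 to 10, numSlices = 4)
    val derived = evensSquared(numbers)

    // The lineage: each RDD records the parent it was derived from.
    println(derived.toDebugString)

    // The parent is unchanged; the child is a separate, derived collection.
    println(numbers.count())                  // 10
    println(derived.collect().mkString(", ")) // 4, 16, 36, 64, 100

    spark.stop()
  }
}
```

The exact text printed by toDebugString varies by Spark version, but it lists the chain of RDDs from the derived one back to the parallelized source, which is the information Spark would use to rebuild a lost partition.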

Lazy evaluation & DAG

Transformations are recorded, not run, until an action fires. The scheduler then builds a DAG of stages: narrow dependencies are pipelined within a single stage, and each shuffle (wide dependency) marks a stage boundary.
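One way to observe laziness is to count how often a map function has run before and after an action. The sketch below uses a long accumulator for that purpose; the names LazySketch, run, and wordCountLineage are hypothetical, and local mode is assumed.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object LazySketch {
  // Returns (map calls before the action, map calls after it, the action's result).
  def run(sc: SparkContext): (Long, Long, Int) = {
    val calls = sc.longAccumulator("mapCalls")
    val doubled = sc.parallelize(1 to 5, numSlices = 2).map { n =>
      calls.add(1) // side effect used only to observe when the function runs
      n * 2
    }
    val before = calls.sum            // the map has only been recorded so far
    val total = doubled.reduce(_ + _) // action: this triggers the job
    val after = calls.sum
    (before, after, total)
  }

  // Word count: flatMap and map are narrow, reduceByKey requires a shuffle.
  def wordCountLineage(sc: SparkContext): String =
    sc.parallelize(Seq("a b", "b c", "c c"))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .toDebugString

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lazy-sketch")
      .master("local[2]") // assumption: local mode
      .getOrCreate()

    println(run(spark.sparkContext)) // (0,5,30)
    println(wordCountLineage(spark.sparkContext))

    spark.stop()
  }
}
```

In the second printout, the ShuffledRDD entry marks where the stage boundary falls: the flatMap and map steps sit together below it because they can be pipelined in one stage.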

Catalyst & Tungsten

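Catalyst, the SQL optimizer, rewrites query plans (for example predicate pushdown and column pruning). Tungsten handles execution: a compact binary row format, explicit memory management (optionally off-heap), and whole-stage code generation for CPU-efficient execution.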
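The optimizer's work can be inspected with Dataset.explain. The following is a small sketch under assumed names (CatalystSketch, buildQuery, and the columns squared and unused are made up for illustration): it defines a column that the final query never reads and applies the filter after the projection, then prints the plans.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object CatalystSketch {
  // Hypothetical query: "unused" is computed but never selected,
  // and the filter is written after the projection.
  def buildQuery(spark: SparkSession): DataFrame = {
    val base = spark.range(0, 100).toDF("id")
      .withColumn("squared", col("id") * col("id"))
      .withColumn("unused", col("id") + 1)
    base.select("id", "squared").filter(col("id") >= 95)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("catalyst-sketch")
      .master("local[2]") // assumption: local mode
      .getOrCreate()

    val query = buildQuery(spark)

    // Prints the parsed, analyzed, and optimized logical plans, then the physical plan.
    query.explain(true)

    query.show() // rows with id 95 through 99

    spark.stop()
  }
}
```

Comparing the analyzed and optimized plans would typically show the filter moved below the projection and the unused column dropped, though the exact plan text depends on the Spark version. Operators prefixed with an asterisk in the physical plan are the ones compiled together by whole-stage code generation.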

Cluster execution

A driver coordinates executors across nodes, scheduling one task per partition in each stage. Partition count and shuffle behavior are therefore the main levers for performance, and the course's recurring theme.
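The two levers named above map onto a handful of RDD calls. This sketch assumes local mode, where the driver and a single executor share one JVM, and uses the hypothetical names PartitionSketch and partitionCounts; the partition numbers are arbitrary.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object PartitionSketch {
  // Returns the partition counts of (source, coalesced, repartitioned, reduced).
  def partitionCounts(sc: SparkContext): (Int, Int, Int, Int) = {
    val numbers = sc.parallelize(1 to 1000, numSlices = 8)

    // coalesce merges existing partitions; by default it does not shuffle.
    val narrowed = numbers.coalesce(2)

    // repartition redistributes every record, which means a full shuffle.
    val widened = numbers.repartition(16)

    // Shuffle operations accept an explicit partition count for their output.
    val summed = numbers.map(n => (n % 10, 1)).reduceByKey(_ + _, 4)

    (numbers.getNumPartitions, narrowed.getNumPartitions,
      widened.getNumPartitions, summed.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-sketch")
      .master("local[4]") // assumption: four local threads stand in for executor cores
      .getOrCreate()

    println(partitionCounts(spark.sparkContext)) // (8,2,16,4)

    spark.stop()
  }
}
```

On a real cluster the same calls apply, but the cost differs: coalesce is cheap because data stays where it is, while repartition and reduceByKey move records between executors over the network.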