RDD
- In Memory
- Immutable
- Dependency
  - Narrow
  - Wide
- Properties
  - Partitions
    - blocks, splits
  - Parent dependencies
  - Function (File, Pair, Shuffled, …)
  - Partitioner (Opt.)
  - PreferredLocation (Opt.)
- Partitions
- Dependency
- Lazy Evaluation (Execution Plan)
  - Transformation
  - Action
- Memory Management
  - LRU Cache - per partition
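
A minimal sketch of the ideas above (immutability, parent dependencies, lazy evaluation), written as a toy model in plain Python. `ToyRDD` and `traced_double` are hypothetical names made up for illustration; this is not Spark's API, and it only models narrow dependencies, where partition i of the child depends on partition i of the parent.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitioned, immutable, lazily computed."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # list of partitions (source RDDs only)
        self.parent = parent  # parent dependency (lineage)
        self.fn = fn          # function applied to one parent partition

    @property
    def num_partitions(self):
        if self.parent is None:
            return len(self.source)
        return self.parent.num_partitions

    def map(self, f):
        # Transformation: returns a new ToyRDD and computes nothing yet.
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        # Transformation: also lazy, also a narrow dependency.
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def compute(self, i):
        # Compute partition i by walking the lineage back to the source.
        if self.parent is None:
            return list(self.source[i])
        return self.fn(self.parent.compute(i))

    def collect(self):
        # Action: this is the point where work actually happens.
        out = []
        for i in range(self.num_partitions):
            out.extend(self.compute(i))
        return out


calls = []  # records every element the map function actually sees


def traced_double(x):
    calls.append(x)
    return x * 2


base = ToyRDD(source=[[1, 2], [3, 4]])      # 2 partitions
doubled = base.map(traced_double)           # transformation, nothing runs
big = doubled.filter(lambda x: x > 4)       # transformation, nothing runs
calls_before_action = len(calls)            # still 0 at this point
result = big.collect()                      # action triggers the whole chain
```

Here `calls_before_action` is 0 and `result` is `[6, 8]`: building the chain only records the plan, and the source partitions in `base` are never modified.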
Spark Application
- Jobs (per Action)
- Stages
  - Set of tasks
  - Stage boundary (Input, Shuffle)
    - Wide dependency
  - Sequential
- Tasks
  - Parallel
  - 1 Task / Partition
  - N Tasks / Executor (Slots)
- Stages
- DAG Scheduler
  - Splits each job into Stages and Tasks
- Task Scheduler
  - Assigns Tasks, via the Cluster Manager, to Executors on the appropriate Workers
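
The stage and task rules above can be sketched roughly as follows. This is a toy planner in plain Python, not the real DAG Scheduler: `split_into_stages`, `plan_tasks`, and `waves` are hypothetical helpers, the lineage is assumed to be a simple linear chain, and the partition counts are invented for the example. The operation names are used only as labels.

```python
import math


def split_into_stages(ops):
    """Split a linear lineage into stages.

    ops: list of (name, kind) in lineage order, where kind is
    "input", "narrow", or "wide". A wide dependency needs a shuffle,
    so it starts a new stage.
    """
    stages = [[]]
    for name, kind in ops:
        if kind == "wide" and stages[-1]:
            stages.append([])  # shuffle boundary
        stages[-1].append(name)
    return stages


def plan_tasks(partitions_per_stage):
    """One task per partition: (stage_id, partition_id) pairs per stage."""
    return [[(sid, p) for p in range(n)]
            for sid, n in enumerate(partitions_per_stage)]


def waves(num_tasks, slots):
    """Rounds needed to run num_tasks when only `slots` run at once."""
    return math.ceil(num_tasks / slots)


lineage = [
    ("textFile", "input"),
    ("flatMap", "narrow"),
    ("map", "narrow"),
    ("reduceByKey", "wide"),   # shuffle: stage boundary
    ("mapValues", "narrow"),
]
stages = split_into_stages(lineage)
tasks = plan_tasks([4, 2])     # assumed: 4 input partitions, 2 after the shuffle
stage0_waves = waves(len(tasks[0]), slots=3)
```

With these assumptions the lineage splits into two stages, `["textFile", "flatMap", "map"]` and `["reduceByKey", "mapValues"]`. Stage 0 has 4 tasks, so with 3 slots it needs 2 waves before stage 1 can start.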
Driver and Executors
- Driver: runs the main of the Spark Application
- Executor: performs the actual data processing work
  - 1 JVM
  - 1 core
- Workers (Nodes)
  - A single Worker runs N executors (a Spark cluster runs multiple Applications concurrently)
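
As a rough analogy for the driver and executor split, the sketch below uses a thread pool in plain Python. Everything here is an assumption for illustration: real executors are separate JVM processes rather than threads, and `run_stage`, the worker names, and the executor counts are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stage(partitions, task_fn, total_slots):
    """Driver side: submit one task per partition, at most total_slots at once."""
    with ThreadPoolExecutor(max_workers=total_slots) as pool:
        # pool.map returns results in partition order.
        return list(pool.map(task_fn, partitions))


# Assumed cluster layout: executors per worker, 1 core (slot) per executor.
executors_per_worker = {"worker-1": 2, "worker-2": 2}
cores_per_executor = 1
total_slots = sum(executors_per_worker.values()) * cores_per_executor

partitions = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]

# "Executors" compute a partial sum per partition in parallel;
# the "driver" combines the partial results.
partial_sums = run_stage(partitions, sum, total_slots)
total = sum(partial_sums)
```

With 4 slots and 5 partitions, one task has to wait for a free slot. The driver still ends up with `partial_sums == [3, 7, 11, 15, 19]` and `total == 55`.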