cyberneticlibrary

Build PySpark data transformation pipelines

spark-architect skill setup L31
BygraveRyan/gcp-flights-analytics
What it does

Design scalable Spark data pipelines, with attention to partitioning and optimization.

Best for

Building ETL jobs that process terabytes without driver memory issues.
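
A minimal sketch of what "without driver memory issues" can look like in practice, assuming Parquet input and output. The function name `run_etl`, the paths, and the column `flight_date` in the usage note are hypothetical and not taken from the repository. The job keeps every step distributed and ends in a write, so nothing is pulled back to the driver with `collect()` or `toPandas()`.

```python
def run_etl(spark, source_path, output_path, partition_col):
    """Read, clean, and write a dataset without moving rows to the driver.

    `spark` is expected to be a SparkSession; all names here are
    illustrative assumptions, not the repository's actual job.
    """
    # Reading is lazy: no data moves until an action runs.
    df = spark.read.parquet(source_path)

    # A narrow transformation: rows are dropped in place on the
    # executors, so this step does not trigger a shuffle.
    cleaned = df.dropna(subset=[partition_col])

    # The write is the only action. Each executor writes its own
    # partitions, so the full dataset never lands in driver memory.
    (
        cleaned.write
        .mode("overwrite")
        .partitionBy(partition_col)
        .parquet(output_path)
    )
    return output_path
```

Usage might look like `run_etl(spark, "gs://bucket/raw", "gs://bucket/clean", "flight_date")`, where the bucket paths are placeholders.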

Inputs
  • Data source
  • Transformation spec
Outputs
  • Spark job code
  • Optimization recommendations
Requires
  • Apache Spark
  • PySpark
Preconditions
  • Cluster configured
  • Data schema known
Failure modes
  • Shuffle explosion
  • Out-of-memory (OOM) errors on joins
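
Both failure modes often meet at the join. One possible guard, sketched here under assumptions, is to broadcast the smaller side only when its estimated size is safely under a threshold, and otherwise fall back to a regular shuffle join. The helper names and the `small_side_bytes` estimate are hypothetical; the 10 MB constant mirrors the documented default of `spark.sql.autoBroadcastJoinThreshold`.

```python
# Documented default of spark.sql.autoBroadcastJoinThreshold (10 MB).
DEFAULT_BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def should_broadcast(small_side_bytes,
                     threshold_bytes=DEFAULT_BROADCAST_THRESHOLD_BYTES):
    """Return True when the smaller table is small enough to broadcast.

    Broadcasting a table that is too large risks OOM on the driver and
    executors, so anything above the threshold is rejected.
    """
    return 0 <= small_side_bytes <= threshold_bytes


def join_with_size_check(large_df, small_df, key, small_side_bytes):
    """Join two DataFrames, broadcasting the small side only when safe.

    `small_side_bytes` is a caller-supplied estimate (an assumption of
    this sketch); Spark is imported lazily so the sizing helper above
    can be used without a Spark installation.
    """
    from pyspark.sql.functions import broadcast

    if should_broadcast(small_side_bytes):
        # Broadcast join: the large side is not shuffled.
        return large_df.join(broadcast(small_df), on=key, how="inner")
    # Fallback: a regular join, which shuffles both sides by key.
    return large_df.join(small_df, on=key, how="inner")
```

The size check is deliberately conservative: a broadcast that is too large trades a shuffle problem for a memory problem.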
Trust signals
  • Partition key selection explained
  • Shuffle barriers minimized
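
The two signals above can be made concrete with a little arithmetic. The sketch below is one way to do it, not a prescribed method: it sizes partitions against a target of 128 MiB each (a common rule of thumb, assumed here), measures how skewed a candidate key is, and repartitions once by that key. All function names are hypothetical.

```python
import math

# Assumed target size per partition; a rule of thumb, not a Spark default.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024


def suggest_partition_count(input_bytes,
                            target_partition_bytes=TARGET_PARTITION_BYTES,
                            min_partitions=1):
    """Estimate a partition count so each partition is near the target size."""
    if input_bytes <= 0:
        return min_partitions
    return max(min_partitions, math.ceil(input_bytes / target_partition_bytes))


def skew_ratio(key_counts):
    """Ratio of the largest key's row count to the mean row count per key.

    A value near 1.0 means rows are spread evenly; a large value means
    one key dominates and would overload a single partition.
    """
    counts = list(key_counts.values())
    mean = sum(counts) / len(counts)
    return max(counts) / mean


def repartition_once(spark, df, key, input_bytes):
    """Repartition by `key` a single time, up front.

    Later shuffles use the same partition count via
    spark.sql.shuffle.partitions; whether Spark can reuse the
    partitioning for a given downstream step depends on the query plan.
    """
    n = suggest_partition_count(input_bytes)
    spark.conf.set("spark.sql.shuffle.partitions", str(n))
    return df.repartition(n, key)
```

For a 1 TiB input, `suggest_partition_count(1024**4)` gives 8192 partitions. A key whose `skew_ratio` is far above 1.0 is usually a poor partition key unless it is salted or combined with another column.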