Build PySpark data transformation pipelines
spark-architect · skill · setup · L3 · ★1
BygraveRyan/gcp-flights-analytics
What it does
Designs scalable Spark data pipelines, with attention to partitioning and performance optimization.
Best for
Building ETL jobs that process terabytes of data without exhausting driver memory.
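As a rough illustration of what such a job can look like, the sketch below sizes the shuffle partition count from an estimated input volume and writes results straight to storage, so nothing is collected to the driver. The column names (`flight_date`, `carrier`, `dep_delay`), the paths, the caller-supplied `input_bytes` estimate, and the 128 MiB target are all assumptions made for illustration; none of them come from the repository.

```python
import math

# Assumed target size per shuffle partition; tune for the actual cluster.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024


def target_partition_count(total_bytes, target_bytes=TARGET_PARTITION_BYTES):
    """Estimate how many partitions keep each one near the target size."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_bytes))


def run_daily_delay_job(spark, input_path, output_path, input_bytes):
    """Hypothetical ETL job: read, aggregate, write. Requires PySpark."""
    from pyspark.sql import functions as F  # imported lazily on purpose

    # Size the shuffle for the aggregation from the estimated input volume.
    spark.conf.set(
        "spark.sql.shuffle.partitions", str(target_partition_count(input_bytes))
    )

    flights = spark.read.parquet(input_path)
    summary = (
        flights
        .filter(F.col("dep_delay").isNotNull())
        .groupBy("flight_date", "carrier")
        .agg(
            F.avg("dep_delay").alias("avg_dep_delay"),
            F.count("*").alias("flights"),
        )
    )

    # Write from the executors, partitioned by date; no collect() or toPandas(),
    # so the driver never holds the result set.
    summary.write.mode("overwrite").partitionBy("flight_date").parquet(output_path)
```

The sizing helper is plain Python and can be checked without a cluster. On clusters with adaptive query execution enabled, Spark may coalesce shuffle partitions at runtime, so the computed count acts as a starting point rather than a fixed value.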
Inputs
- · Data source
- · Transformation spec
Outputs
- · Spark job code
- · Optimization recommendations
Requires
- · Apache Spark
- · PySpark
Preconditions
- · Cluster configured
- · Data schema known
Failure modes
- · Shuffle explosion
- · Out-of-memory (OOM) errors on joins
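One common way to reduce both failure modes is to broadcast the small side of a join so the large side is never shuffled. The sketch below is a minimal example under stated assumptions: the `carrier` join key, the function names, and the caller-supplied size estimates are hypothetical, and the 10 MiB limit simply mirrors the default value of Spark's `spark.sql.autoBroadcastJoinThreshold`.

```python
# Assumed limit; matches Spark's default autoBroadcastJoinThreshold (10 MiB).
BROADCAST_LIMIT_BYTES = 10 * 1024 * 1024


def pick_join_strategy(left_bytes, right_bytes, limit=BROADCAST_LIMIT_BYTES):
    """Decide which side, if any, is small enough to broadcast."""
    if min(left_bytes, right_bytes) > limit:
        return "shuffle"
    return "broadcast_left" if left_bytes <= right_bytes else "broadcast_right"


def join_on_carrier(left, right, left_bytes, right_bytes):
    """Hypothetical inner join on an assumed 'carrier' column. Requires PySpark."""
    from pyspark.sql import functions as F  # imported lazily on purpose

    strategy = pick_join_strategy(left_bytes, right_bytes)
    if strategy == "broadcast_right":
        # Hint Spark to ship the small right side to every executor.
        return left.join(F.broadcast(right), on="carrier", how="inner")
    if strategy == "broadcast_left":
        return F.broadcast(left).join(right, on="carrier", how="inner")
    # Both sides are large: fall back to Spark's default shuffle-based join.
    return left.join(right, on="carrier", how="inner")
```

A broadcast hint only helps when the small side really fits in executor memory. If both sides are large and one key dominates, the shuffle join can still run out of memory on the skewed partition, and a different remedy such as key salting would be needed.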
Trust signals
- · Partition key selection explained
- · Shuffle barriers minimized
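One way to check the shuffle claim for a given job is to count the `Exchange` operators in its physical plan. The sketch below is an assumption about how that check could be done, not the skill's own method: the helper names are hypothetical, and the plan text is captured from what `DataFrame.explain()` prints.

```python
import contextlib
import io
import re


def count_shuffles(plan_text):
    """Count shuffle Exchange operators in a physical plan string.

    The word-boundary match skips BroadcastExchange and ReusedExchange,
    which do not introduce a new shuffle of the large side.
    """
    return len(re.findall(r"\bExchange\b", plan_text))


def physical_plan_text(df):
    """Capture the text printed by DataFrame.explain(). Requires PySpark."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        df.explain()
    return buffer.getvalue()
```

Comparing `count_shuffles(physical_plan_text(df))` before and after a change gives a quick signal of whether a repartitioning or join rewrite removed a shuffle. The exact plan text varies between Spark versions, so the count is a heuristic.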