
PySpark in Action
Python data analysis at scale
Jonathan Rioux
About the book
PySpark in Action is a carefully engineered tutorial that helps you use PySpark to deliver your data-driven applications at any scale. This clear, hands-on guide shows you how to scale your processing across multiple machines, working with data from any source, from Hadoop-based clusters to Excel worksheets. You’ll learn how to break big analysis tasks down into manageable chunks and how to choose and use the best PySpark data abstraction for your needs. By the time you’re done, you’ll be able to write and run fast PySpark programs that are scalable, efficient to operate, and easy to debug.
What's inside
Packaging your PySpark code
Managing your data as it scales across multiple machines
Rewriting Pandas, R, and SAS jobs in PySpark (illustrated in the sketch below)
Troubleshooting common data pipeline problems
Creating reliable long-running jobs
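To give a flavour of the pandas-to-PySpark rewrites listed above, here is a minimal sketch, not taken from the book; the file name and column names are invented for illustration. It expresses a small pandas aggregation with the PySpark DataFrame API:

# Hypothetical example: average sale amount per region.
# "sales.csv", "region", and "amount" are made-up names for illustration.
import pandas as pd
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# pandas version: runs in memory on a single machine
pdf = pd.read_csv("sales.csv")
pandas_result = pdf.groupby("region")["amount"].mean()

# PySpark version: the same aggregation, expressed with the DataFrame API
# and executed by Spark, locally or across a cluster
spark = SparkSession.builder.appName("sales-summary").getOrCreate()
sdf = spark.read.csv("sales.csv", header=True, inferSchema=True)
spark_result = sdf.groupBy("region").agg(F.avg("amount").alias("avg_amount"))
spark_result.show()

The pandas call loads everything into memory on one machine, while the PySpark version describes the same computation as a plan that Spark can distribute across as many machines as the data requires.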