A Brief Introduction to PySpark by Ben Weber

PySpark, the Python API for Apache Spark, is a great tool for performing exploratory data analysis at scale, building machine learning pipelines, and creating ETLs for a data platform. If you’re already familiar with Python and libraries such as Pandas, then PySpark is a natural next step for creating more scalable analyses and pipelines. The goal of this post is to show how to get up and running with PySpark and to perform common tasks.

NHL Game Data

We’ll use Databricks for a Spark environment, and the NHL dataset from Kaggle as a data source for analysis. This post shows how to read data into Spark dataframes and write it back out, create transformations and aggregations of these frames, visualize results, and perform linear regression. I’ll also show how to mix regular Python code with PySpark in a scalable way, using Pandas UDFs. To keep things simple, we’ll focus on batch processing and avoid some of....

February 15, 2019

