A Brief Introduction to PySpark by Ben Weber


PySpark is a great language for performing exploratory data analysis at scale, building machine learning pipelines, and creating ETLs for a data platform. If you’re already familiar with Python and libraries such as Pandas, then PySpark is a great language to learn in order to create more scalable analyses and pipelines. The goal of this post is to show how to get up and running with PySpark and to perform common tasks. NHL Game Data We’ll use Databricks for a Spark environment, and the NHL dataset from Kaggle as a data source for analysis. This post shows how to read and write data into Spark dataframes, create transformations and aggregations of these frames, visualize results, and perform linear regression. I’ll also show how to mix regular Python code with PySpark in a scalable way, using Pandas UDFs. To keep things simple, we’ll focus on batch processing and avoid some of....

February 15, 2019
Notify of
1 Comment
Oldest Most Voted
Inline Feedbacks
View all comments
© HAKIN9 MEDIA SP. Z O.O. SP. K. 2023
What certifications or qualifications do you hold?
Max. file size: 150 MB.
What level of experience should the ideal candidate have?
What certifications or qualifications are preferred?

Download Free eBook

Step 1 of 4


We’re committed to your privacy. Hakin9 uses the information you provide to us to contact you about our relevant content, products, and services. You may unsubscribe from these communications at any time. For more information, check out our Privacy Policy.