How to Get Other Columns When Using Spark Dataframe Groupby
PySpark Groupby : We will see in this tutorial how to aggregate data with the Groupby function present in Spark.
Introduction
PySpark's groupBy() function is used to aggregate identical data from a dataframe and then combine with aggregation functions.
There are a multitude of aggregation functions that can be combined with a group by :
- count(): It returns the number of rows for each of the groups from group by.
- sum() : It returns the total number of values of each group.
- max() – Returns the maximum number of values for each group.
- min() – Returns the minimum value of values for each group.
- mean(): Returns the average of the values for each group.
- avg() – Returns the average of the values for each group.
- agg() – Allows you to calculate several aggregates at once (which can be very useful).
- pivot(): The pivot() function is used to rotate the data of a DataFrame column into several columns (which is used to transform rows into columns).
In this article we will review the different ways to make a group by on a Pyspark dataframe using the different aggregation functions.
If you have to install spark, I invite you to have a look at one of my previous articles which explains the installation step by step:
Creating Dataframe
To illustrate the various examples of aggregation functions, we will first create a Dataframe Pyspark :
from pyspark.sql import SparkSession from pyspark.sql import functions as f spark = SparkSession.builder.appName('pyspark - example join').getOrCreate() sc = spark.sparkContext datavengers = [ ("Carol","Data Scientist","USA",70000,5), ("Peter","Data Scientist","USA",90000,7), ("Clark","Data Scientist","UK",111000,10), ("Jean","Data Scientist","UK",220000,30), ("Bruce","Data Engineer","UK",80000,4), ("Thanos","Data Engineer","USA",115000,13), ("Scott","Data Engineer","UK",180000,15), ("T'challa","CEO","USA",300000,20), ("Xavier","Marketing","USA",100000,11), ("Wade","Marketing","UK",60000,2) ] schema = ["Name","Job","Country","salary","seniority"] df = spark.createDataFrame(data=datavengers, schema = schema) df.printSchema()
root |-- Name: string (nullable = true) |-- Job: string (nullable = true) |-- Country: string (nullable = true) |-- salary: long (nullable = true) |-- seniority: long (nullable = true)
Thanks to the printschema (), we can see that our dataframe contains 5 columns:
- Name: Name of the employee
- Job: The work the person does
- Country: The country where the employee works
- salary: annual salary
- seniority: seniority in the company's workforce
To display what the dataframe contains, you can use the show() function .
+--------+--------------+-------+------+---------+ |Name |Job |Country|salary|seniority| +--------+--------------+-------+------+---------+ |Carol |Data Scientist|USA |70000 |5 | |Peter |Data Scientist|USA |90000 |7 | |Clark |Data Scientist|UK |111000|10 | |Jean |Data Scientist|UK |220000|30 | |Bruce |Data Engineer |UK |80000 |4 | |Thanos |Data Engineer |USA |115000|13 | |Scott |Data Engineer |UK |180000|15 | |T'challa|CEO |USA |300000|20 | |Xavier |Marketing |USA |100000|11 | |Wade |Marketing |UK |60000 |2 | +--------+--------------+-------+------+---------+
PySpark groupBy and aggregation functions on DataFrame columns
We will use the groupby() function on the "Job" column of our previously created dataframe and test the different aggregations.
Pyspark groupBy using count() function
To count the number of employees per job type, you can proceed like this:
# count() function df.groupBy("Job").count().show(truncate=False)
+--------------+-----+ |Job |count| +--------------+-----+ |CEO |1 | |Data Scientist|4 | |Marketing |2 | |Data Engineer |3 | +--------------+-----+
Pyspark groupBy using sum() function
Here is the syntax for summing salaries by job type :
# sum() function df.groupBy("Job").sum("salary").show(truncate=False)
+--------------+-----------+ |Job |sum(salary)| +--------------+-----------+ |CEO |300000 | |Data Scientist|491000 | |Marketing |160000 | |Data Engineer |375000 | +--------------+-----------+
Pyspark groupBy using max() function
To retrieve the maximum salary for each type of job :
# sum() function df.groupBy("Job").max("salary").show(truncate=False)
+--------------+-----------+ |Job |max(salary)| +--------------+-----------+ |CEO |300000 | |Data Scientist|220000 | |Marketing |100000 | |Data Engineer |180000 | +--------------+-----------+
Pyspark groupBy using min() function
To retrieve the minimum salary for each type of job :
# sum() function df.groupBy("Job").min("salary").show(truncate=False)
+--------------+-----------+ |Job |min(salary)| +--------------+-----------+ |CEO |300000 | |Data Scientist|70000 | |Marketing |60000 | |Data Engineer |80000 | +--------------+-----------+
Pyspark groupBy using avg() function
To calculate the average salary for each type of job :
# sum() function df.groupBy("Job").avg("salary").show(truncate=False)
+--------------+-----------+ |Job |avg(salary)| +--------------+-----------+ |CEO |300000.0 | |Data Scientist|122750.0 | |Marketing |80000.0 | |Data Engineer |125000.0 | +--------------+-----------+
PySpark groupBy and aggregation functions on DataFrame multiple columns
For some calculations, you will need to aggregate your data on several columns of your dataframe. The example below shows you how to aggregate on more than one column:
# groupBy on multiple columns df.groupBy("Job","Country") \ .avg("salary","seniority") \ .show(truncate=False)
+--------------+-------+-----------+--------------+ |Job |Country|avg(salary)|avg(seniority)| +--------------+-------+-----------+--------------+ |Marketing |UK |60000.0 |2.0 | |Data Engineer |UK |130000.0 |9.5 | |Data Scientist|UK |165500.0 |20.0 | |Marketing |USA |100000.0 |11.0 | |Data Scientist|USA |80000.0 |6.0 | |CEO |USA |300000.0 |20.0 | |Data Engineer |USA |115000.0 |13.0 | +--------------+-------+-----------+--------------+
As you can see, we have filled in two columns in the groupBy function (Job and Country).
Execute several aggregation types simultaneously
Thanks to the agg() function, we can calculate several aggregates directly in a groupBy :
# agg() function df.groupBy("Job") \ .agg(f.sum("salary").alias("sum_salary"), \ f.avg("salary").alias("avg_salary"), \ f.min("salary").alias("min_salary"), \ f.max("salary").alias("max_salary"), \ f.mean("salary").alias("mean_salary") \ ) \ .show(truncate=False)
+--------------+----------+----------+----------+----------+-----------+ |Job |sum_salary|avg_salary|min_salary|max_salary|mean_salary| +--------------+----------+----------+----------+----------+-----------+ |CEO |300000 |300000.0 |300000 |300000 |300000.0 | |Data Scientist|491000 |122750.0 |70000 |220000 |122750.0 | |Marketing |160000 |80000.0 |60000 |100000 |80000.0 | |Data Engineer |375000 |125000.0 |80000 |180000 |125000.0 | +--------------+----------+----------+----------+----------+-----------+
Conclusion
In this tutorial we have learned how to use the groupBy() function and the aggregation functions on a Pyspark Dataframe. I hope this has helped you to understand how these aggregations work. Feel free to leave me a comment, I would be happy to answer it 🙂
Back to the python section
I'm a data scientist. Passionate about new technologies and programming I created this website mainly for people who want to learn more about data science and programming :)
View all of ayed_amira's posts.
How to Get Other Columns When Using Spark Dataframe Groupby
Source: https://amiradata.com/pyspark-groupby-aggregate-data-in-pyspark/
0 Response to "How to Get Other Columns When Using Spark Dataframe Groupby"
Post a Comment