In this tutorial, you will learn more about different operations with Pandas: how to handle missing data, how to combine different datasets, and finally how to use the groupby function together with computing functions like aggregate, apply, filter, and transform.
This tutorial is part two in our three-part series on the fundamentals of Pandas:
- Part #1: A simple walk-through with Pandas for Data Science, Part 1 (last week’s tutorial)
- Part #2: A simple walk-through with Pandas for Data Science, Part 2 (this week’s tutorial)
- Part #3: How to import existing files with Pandas (next week’s tutorial)
Operations with Pandas
Since Pandas was built on NumPy, any of NumPy’s universal functions will work on Pandas Series and DataFrame objects. These universal functions include basic arithmetic operations (addition, subtraction, division, etc.) and more complex operations (exponential and logarithmic functions, etc.). To learn more about these operations, visit this tutorial I made on NumPy.
Let’s define a DataFrame on which to illustrate:
```python
import pandas as pd
import numpy as np

number_list_1 = np.arange(5)
number_list_2 = np.arange(5, 10)

print(number_list_1)
print(number_list_2)

# ------ output -------
# [0 1 2 3 4]
# [5 6 7 8 9]

data = pd.DataFrame({'col 1': number_list_1,
                     'col 2': number_list_2})
print(data.values)

# ------ output -------
# [[0 5]
#  [1 6]
#  [2 7]
#  [3 8]
#  [4 9]]
```
Once you perform the calculation, you will notice that the index and column arrangement is maintained, similar to operations between two-dimensional NumPy arrays.
Note: This also applies to Series objects and one-dimensional NumPy arrays.
```python
data['col 3'] = data['col 1'] + data['col 2']
print(data)

# --------- output -----------
#    col 1  col 2  col 3
# 0      0      5      5
# 1      1      6      7
# 2      2      7      9
# 3      3      8     11
# 4      4      9     13
```
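As noted above, NumPy ufuncs applied to a Pandas object behave the same way. Here is a quick sketch with the exponential function, using the DataFrame we just built:

```python
# A NumPy ufunc applied to a DataFrame operates element-wise
# and preserves the index and column labels
print(np.exp(data[['col 1']]))

# ------ output -------
#        col 1
# 0   1.000000
# 1   2.718282
# 2   7.389056
# 3  20.085537
# 4  54.598150
```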
Handling Missing Data
Having missing values within your data is a total pain in the neck, especially when performing any numerical operations.
Let me show you a simple illustration of having missing values within your data and performing numerical operations:
```python
data = np.array([200, np.nan, 210, 215, None, 220, 225])
print(data)

# --------- output -----------
# [200 nan 210 215 None 220 225]
```
If you try to sum this array, you will get a TypeError. The same will happen if you look for the minimum value, while the maximum, in this example, comes back as nan.
```python
print(data.sum())

# --------- output -----------
# TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

print(data.min())

# --------- output -----------
# TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

print(data.max())

# --------- output -----------
# nan
```
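As an aside, if you replace None with np.nan so that the array has a float dtype, NumPy's nan-aware aggregations can skip the gaps. A small sketch:

```python
# With a float array (np.nan only, no None), the nan-aware
# variants ignore the missing entries instead of failing
float_data = np.array([200, np.nan, 210, 215, np.nan, 220, 225])
print(np.nansum(float_data))  # 1070.0
print(np.nanmin(float_data))  # 200.0
print(np.nanmax(float_data))  # 225.0
```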
To check for and manage these NaN and None values within your data, Pandas provides the predefined methods isnull() and notnull().
- isnull() – Returns True for missing values (NaN or None) and False otherwise.
- notnull() – Returns True for non-missing values and False otherwise.
```python
data = pd.Series(data)
print(data.isnull())

# --------- output -----------
# 0    False
# 1     True
# 2    False
# 3    False
# 4     True
# 5    False
# 6    False
# dtype: bool
```
You can also use these methods for boolean filtering, as in the example below:
```python
print(data[data.notnull()])

# --------- output -----------
# 0    200
# 2    210
# 3    215
# 5    220
# 6    225
# dtype: object
```
Luckily for us, Pandas provides several options for handling these missing values:
1) Remove rows that contain missing data – using the .dropna() method.
```python
print(data.dropna())

# --------- output -----------
# 0    200
# 2    210
# 3    215
# 5    220
# 6    225
# dtype: object
```
2) Mark missing data and replace it with other values – using .fillna() with, for example, the mean, the median, a forward fill, or a backward fill.
```python
data.fillna(data.mean())

# ------ output -------
# 0    200.0
# 1    214.0
# 2    210.0
# 3    215.0
# 4    214.0
# 5    220.0
# 6    225.0
# dtype: float64

data.fillna(method='ffill')

# ------ output -------
# 0    200.0
# 1    200.0
# 2    210.0
# 3    215.0
# 4    215.0
# 5    220.0
# 6    225.0
# dtype: float64

data.fillna(method='bfill')

# ------ output -------
# 0    200.0
# 1    210.0
# 2    210.0
# 3    215.0
# 4    220.0
# 5    220.0
# 6    225.0
# dtype: float64
```
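The same pattern works for the other statistics mentioned above. For example, here is a small sketch of a median fill, converting to float first so that None becomes NaN:

```python
# Converting to float64 turns None into NaN, after which
# we can fill with the median of the non-missing values
numeric = data.astype('float64')
print(numeric.fillna(numeric.median()))

# ------ output -------
# 0    200.0
# 1    215.0
# 2    210.0
# 3    215.0
# 4    215.0
# 5    220.0
# 6    225.0
# dtype: float64
```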
Another issue NaN values pose, which we won’t explore in this tutorial, is fitting your data into a machine learning model. Doing so will typically raise errors like "ValueError: Input contains NaN, infinity or a value too large for dtype('float64')", which is not good.
I should also mention that scikit-learn provides imputation utilities for handling these missing values, with strategies such as the mean, the median, a constant value, or the most frequent value.
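As a rough sketch of what that looks like (assuming scikit-learn is installed; the variable names are our own):

```python
import numpy as np
from sklearn.impute import SimpleImputer

values = np.array([[200], [np.nan], [210], [215], [np.nan], [220], [225]])

# strategy can be 'mean', 'median', 'most_frequent', or 'constant'
imputer = SimpleImputer(strategy='median')
print(imputer.fit_transform(values).ravel())

# ------ output -------
# [200. 215. 210. 215. 215. 220. 225.]
```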
Combining Datasets
Recall what we mentioned in the previous tutorial about concatenating NumPy arrays. Concatenating Series and DataFrame objects is almost indistinguishable from that. Combining data from different sources can lead to exciting insights because it provides the whole picture.
pd.concat
Let’s recap on how we concatenate two NumPy arrays:
```python
a_list = [[2, 4, 6]]
b_list = [[1, 3, 5]]

print(np.concatenate([a_list, b_list], axis=0))

# ------ output -------
# [[2 4 6]
#  [1 3 5]]

print(np.concatenate([a_list, b_list], axis=1))

# ------ output -------
# [[2 4 6 1 3 5]]
```
Performing row-wise or column-wise concatenation with Pandas is very similar; the difference is that we use pd.concat instead of NumPy’s np.concatenate.
```python
country = pd.Series(data=['Nigeria', 'Serbia', 'China', 'USA'])
population = pd.Series(data=[195900000, 6964000, 1393000000, 328200000])

print(pd.concat([country, population], axis=0))

# ------ output -------
# 0       Nigeria
# 1        Serbia
# 2         China
# 3           USA
# 0     195900000
# 1       6964000
# 2    1393000000
# 3     328200000
# dtype: object

print(pd.concat([country, population], axis=1))

# ------ output -------
#          0           1
# 0  Nigeria   195900000
# 1   Serbia     6964000
# 2    China  1393000000
# 3      USA   328200000
```
pd.merge
The predefined merge function works quite differently from concat. With Pandas’ concat function, we are free to specify the axis along which two Series are combined. With merge, we can’t specify an axis.
Instead, these Series or DataFrames are combined based only on the column(s) or index(es) they share.
Let’s see an example of how we can perform this:
```python
df1 = pd.DataFrame({
    'country': ['Nigeria', 'Serbia', 'China', 'USA'],
    'continent': ['Africa', 'Europe', 'Asia', 'North America']})

df2 = pd.DataFrame({
    'country': ['Nigeria', 'Serbia', 'China', 'USA'],
    'language': ['English', 'Serbian', 'Mandarin', 'English']})

df3 = pd.merge(df1, df2)
print(df3)

# ------------- output --------------
#    country      continent  language
# 0  Nigeria         Africa   English
# 1   Serbia         Europe   Serbian
# 2    China           Asia  Mandarin
# 3      USA  North America   English
```
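Here pd.merge detected the shared country column on its own. You can also name the key column explicitly with the on parameter, which gives the same result:

```python
# Explicitly naming the shared key column
df3 = pd.merge(df1, df2, on='country')
```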
Difference between pd.merge and pd.concat
You might conclude that, since both predefined functions achieve the same goal, they work the same way. I used to think so too. However, the way each performs the combination is different.
More specifically, .concat():
- Combines two or more DataFrames either horizontally or vertically.
- Provides you with the join parameter, with which you can specify how to perform the combination: either an inner or an outer join.
And .merge():
- Doesn’t give you the option to combine two DataFrames by specifying the axis.
- Combines DataFrames based on the column(s) or index(es) they both share.
- Provides you with the how parameter, whose options are inner (the default), outer, left, and right (see the sketch below).
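To make the contrast concrete, here is a minimal sketch (the key, x, and y columns are made up for illustration):

```python
left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'y': [3, 4]})

# merge aligns on the shared 'key' column; how='outer' keeps all keys
print(pd.merge(left, right, how='outer'))

#   key    x    y
# 0   a  1.0  NaN
# 1   b  2.0  3.0
# 2   c  NaN  4.0

# concat simply stacks along the chosen axis, aligned on the index
print(pd.concat([left, right], axis=1))

#   key  x key  y
# 0   a  1   b  3
# 1   b  2   c  4
```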
To learn more, be sure to check out the official Pandas documentation on merging, joining, and concatenating.
Grouping Function
The concept behind grouping functions tends to be a challenge for people who are beginning to explore the Pandas library.
Essentially, the groupby function splits the data into different groups depending on your choice of variable or column. Let’s create a simple example showing car brands, their models, and the cars’ prices.
```python
data = pd.DataFrame({
    'cars': ['BMW', 'BMW', 'Ferrari', 'Ferrari', 'Lamborghini', 'Lamborghini'],
    'model': ['M5', 'i8', '488 GTB', '488 Pista Spider', 'Huracan', 'Urus'],
    'price': [102700, 147500, 262647, 350000, 203674, 203995]
})
print(data)

# ------------- output --------------
#           cars             model   price
# 0          BMW                M5  102700
# 1          BMW                i8  147500
# 2      Ferrari           488 GTB  262647
# 3      Ferrari  488 Pista Spider  350000
# 4  Lamborghini           Huracan  203674
# 5  Lamborghini              Urus  203995
```
Now let’s use the data.groupby('cars') expression to split our current DataFrame by car brand.
```python
print(data.groupby('cars'))

# ------------- output --------------
# <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fb249814470>
```
Once you execute the statement, the groupby() function returns a GroupBy object, which describes how the rows of the original dataset have been split. To see which unique groups were computed and the row indexes belonging to each group, we can use the .groups attribute, which returns a Python dictionary.
```python
print(data.groupby('cars').groups)

# ------------- output --------------
# {'BMW': [0, 1], 'Ferrari': [2, 3], 'Lamborghini': [4, 5]}
```
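You can also iterate over the GroupBy object directly to inspect each group in turn; a quick sketch:

```python
# Each iteration yields the group label and its sub-DataFrame
for brand, group in data.groupby('cars'):
    print(brand, group.shape)

# ------------- output --------------
# BMW (2, 3)
# Ferrari (2, 3)
# Lamborghini (2, 3)
```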
Coupled with this grouping, we can apply computing functions within the individual groups, usually aggregate, transform, or filter functions. Let’s talk about these operations.
Aggregate
When applied to GroupBy objects, aggregate functions produce a statistical summary for each group. These operations include:
- .min() and .max(): Find the minimum and maximum values.
- .count(): Provides a total count of the data in that group.
- .sum(): Provides the total sum of the data in the group.
- .mean() and .median(): Find the mean and median values.
A combine step at the end then merges these per-group results into a single output array.
Let’s experiment by grouping our dataset based on the column cars and applying the mean function to calculate these cars’ average price.
```python
# numeric_only=True skips the non-numeric 'model' column
# (required in recent versions of Pandas)
print(data.groupby('cars').mean(numeric_only=True))

# ------------- output --------------
#                 price
# cars
# BMW          125100.0
# Ferrari      306323.5
# Lamborghini  203834.5
```
You can also specify which column you want to apply aggregate functions to.
```python
# apply groupby on one column
print(data.groupby('cars')['price'].mean())

# ------------- output --------------
# cars
# BMW            125100.0
# Ferrari        306323.5
# Lamborghini    203834.5
# Name: price, dtype: float64
```
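The other aggregates listed above work the same way on a single column; for instance, .sum():

```python
print(data.groupby('cars')['price'].sum())

# ------------- output --------------
# cars
# BMW            250200
# Ferrari        612647
# Lamborghini    407669
# Name: price, dtype: int64
```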
We can also apply multiple statistical summaries to our data:
```python
# select the numeric column first, then apply several aggregates at once
print(data.groupby('cars')[['price']].aggregate(['min', 'median', 'mean', 'max']))

# ---------------- output -------------------
#               price
#                 min    median      mean     max
# cars
# BMW          102700  125100.0  125100.0  147500
# Ferrari      262647  306323.5  306323.5  350000
# Lamborghini  203674  203834.5  203834.5  203995
```
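You can also pass .aggregate() a dictionary mapping column names to the functions you want applied to each; a quick sketch:

```python
# Only the listed functions are computed for the 'price' column
print(data.groupby('cars').aggregate({'price': ['min', 'max']}))

# ---------------- output -------------------
#               price
#                 min     max
# cars
# BMW          102700  147500
# Ferrari      262647  350000
# Lamborghini  203674  203995
```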
Filter
The filter operation gives us the freedom to drop samples based on group properties. For example, let’s say we want only the car brands whose mean price is strictly above $250,000.
```python
def filter_by_price(x):
    return x['price'].mean() > 250000

print(data.groupby('cars').filter(filter_by_price))

# ------------- output --------------
#       cars             model   price
# 2  Ferrari           488 GTB  262647
# 3  Ferrari  488 Pista Spider  350000
```
As you can see from the result, the BMW and Lamborghini groups don’t have a mean price greater than $250,000, so their rows are dropped.
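The same filter can also be written inline with a lambda, which produces the identical result:

```python
print(data.groupby('cars').filter(lambda x: x['price'].mean() > 250000))
```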
Transform
Now you understand how the aggregate functions work. A transformation, by contrast, returns a transformed version of the full data: as you can see from the result, the output has the same shape and index as the input. An example is subtracting $20,000 from the price of each car.
```python
# select the numeric column first so the subtraction applies only to it
print(data.groupby('cars')[['price']].transform(lambda x: x - 20000))

# ------------- output --------------
#     price
# 0   82700
# 1  127500
# 2  242647
# 3  330000
# 4  183674
# 5  183995
```
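A more typical use of transform is normalizing within each group; for example, subtracting each brand’s mean price from its cars (a small sketch):

```python
# Each price minus the mean price of its own brand
print(data.groupby('cars')[['price']].transform(lambda x: x - x.mean()))

# ------------- output --------------
#      price
# 0 -22400.0
# 1  22400.0
# 2 -43676.5
# 3  43676.5
# 4   -160.5
# 5    160.5
```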
Apply
I appreciate the .apply() method because it lets you apply an arbitrary function to the group results. It’s essential to know that your function should take a DataFrame as input and return either a Pandas object (Series or DataFrame) or a scalar value.
Let’s see how we can normalize the price column based on the sum of all the different cars’ prices.
```python
def norm_by_price(x):
    x['price'] /= x['price'].sum()
    return x

print(data.groupby('cars').apply(norm_by_price))

# ------------- output --------------
#           cars             model     price
# 0          BMW                M5  0.410472
# 1          BMW                i8  0.589528
# 2      Ferrari           488 GTB  0.428709
# 3      Ferrari  488 Pista Spider  0.571291
# 4  Lamborghini           Huracan  0.499606
# 5  Lamborghini              Urus  0.500394
```
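Because the function can also return a scalar, .apply() can compute things like the price spread within each brand; a quick sketch:

```python
# A scalar-returning function: max minus min price per brand
print(data.groupby('cars').apply(lambda x: x['price'].max() - x['price'].min()))

# ------------- output --------------
# cars
# BMW            44800
# Ferrari        87353
# Lamborghini      321
# dtype: int64
```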
Conclusion
In this post, you deepened your fundamental understanding of the Pandas library.
You learned how to perform different operations with Pandas (addition, subtraction, etc.), how to handle missing data, and how to combine different datasets using pandas.concat and pandas.merge. Finally, you learned about the groupby function and computing functions like aggregate, apply, filter, and transform.
In the next tutorial, you will learn how to import existing files using Pandas.
Do you have any questions about Pandas or this post? Leave a comment and ask your question. I’ll do my best to answer.
Further Reading
We have listed some useful resources below if you thirst for more reading.
Articles
- A Simple Walk-through with Pandas for Data Science – Part 1
- A Simple Walk-through with Python for Data Science
- How to import existing files with Pandas
- A Simple Walk-through with NumPy for Data Science
- Pandas Docs: Merge, join, concatenate and compare
- Why is Python the most popular language for Data Science
- What You Don’t Know About Machine Learning Could Hurt You