In this tutorial, you will learn more about different operations with Pandas: how to handle missing data, how to combine different datasets, and finally how to use the groupby function together with computing functions like aggregate, apply, filter, and transform.
This tutorial is part two in our three-part series on the fundamentals of Pandas:
- Part #1: A simple walk-through with Pandas for Data Science, Part 1 (last week’s tutorial)
- Part #2: A simple walk-through with Pandas for Data Science, Part 2 (this week’s tutorial)
- Part #3: How to import existing files with Pandas (next week’s tutorial)
Operations with Pandas
Since Pandas was built on NumPy, any of NumPy’s universal functions will work on Pandas Series and DataFrame objects. These universal functions include basic arithmetic operations (addition, subtraction, division, etc.) and more complex operations (exponential and logarithmic functions, etc.). To learn more about these operations, visit this tutorial I made on NumPy.
Let’s define a DataFrame on which to illustrate:
```python
import pandas as pd
import numpy as np

number_list_1 = np.arange(5)
number_list_2 = np.arange(5, 10)

print(number_list_1)
print(number_list_2)

# ------ output -------
# [0 1 2 3 4]
# [5 6 7 8 9]

data = pd.DataFrame({'col 1': number_list_1,
                     'col 2': number_list_2})
print(data.values)

# ------ output -------
# [[0 5]
#  [1 6]
#  [2 7]
#  [3 8]
#  [4 9]]
```
Once you perform the calculation, you will notice that the index and column arrangement is maintained, similar to operations between two-dimensional NumPy arrays.
Note: This also applies to Series objects and one-dimensional NumPy arrays.
```python
data['col 3'] = data['col 1'] + data['col 2']
print(data)

# --------- output -----------
#    col 1  col 2  col 3
# 0      0      5      5
# 1      1      6      7
# 2      2      7      9
# 3      3      8     11
# 4      4      9     13
```
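As noted above, NumPy ufuncs applied to a Pandas object behave the same way. Here is a quick sketch with the exponential function, using the DataFrame we just built:

```python
# A NumPy ufunc applied to a DataFrame operates element-wise
# and preserves the index and column labels
print(np.exp(data[['col 1']]))

# ------ output -------
#        col 1
# 0   1.000000
# 1   2.718282
# 2   7.389056
# 3  20.085537
# 4  54.598150
```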
Handling Missing Data
Having missing values within your data is a total pain in the neck, especially when performing any numerical operations.
Let me show you a simple illustration of having missing values within your data and performing numerical operations:
```python
data = np.array([200, np.nan, 210, 215, None, 220, 225])
print(data)

# --------- output -----------
# [200 nan 210 215 None 220 225]
```
If you try to sum this array, you will get a TypeError. The same will happen if you look for the minimum value, while the maximum, in this example, comes back as nan.
```python
print(data.sum())

# --------- output -----------
# TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

print(data.min())

# --------- output -----------
# TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

print(data.max())

# --------- output -----------
# nan
```
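As an aside, if you replace None with np.nan so that the array has a float dtype, NumPy's nan-aware aggregations can skip the gaps. A small sketch:

```python
# With a float array (np.nan only, no None), the nan-aware
# variants ignore the missing entries instead of failing
float_data = np.array([200, np.nan, 210, 215, np.nan, 220, 225])
print(np.nansum(float_data))  # 1070.0
print(np.nanmin(float_data))  # 200.0
print(np.nanmax(float_data))  # 225.0
```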
To check for and manage these NaN and None values within your data, Pandas provides the predefined methods isnull() and notnull().
- isnull() – Returns True for missing values (NaN or None) and False otherwise.
- notnull() – Returns True for non-missing values and False otherwise.
```python
data = pd.Series(data)
print(data.isnull())

# --------- output -----------
# 0    False
# 1     True
# 2    False
# 3    False
# 4     True
# 5    False
# 6    False
# dtype: bool
```
You can also use these methods for boolean filtering, as in the example below:
```python
print(data[data.notnull()])

# --------- output -----------
# 0    200
# 2    210
# 3    215
# 5    220
# 6    225
# dtype: object
```
Luckily for us, Pandas provides several options for handling these missing values:
1) Remove rows that contain missing data – using the .dropna() method.
```python
print(data.dropna())

# --------- output -----------
# 0    200
# 2    210
# 3    215
# 5    220
# 6    225
# dtype: object
```
2) Mark missing data and replace it with other values – using .fillna() with, for example, the mean, the median, a forward fill, or a backward fill.
```python
data.fillna(data.mean())

# ------ output -------
# 0    200.0
# 1    214.0
# 2    210.0
# 3    215.0
# 4    214.0
# 5    220.0
# 6    225.0
# dtype: float64

data.fillna(method='ffill')

# ------ output -------
# 0    200.0
# 1    200.0
# 2    210.0
# 3    215.0
# 4    215.0
# 5    220.0
# 6    225.0
# dtype: float64

data.fillna(method='bfill')

# ------ output -------
# 0    200.0
# 1    210.0
# 2    210.0
# 3    215.0
# 4    220.0
# 5    220.0
# 6    225.0
# dtype: float64
```
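The same pattern works for the other statistics mentioned above. For example, here is a small sketch of a median fill, converting to float first so that None becomes NaN:

```python
# Converting to float64 turns None into NaN, after which
# we can fill with the median of the non-missing values
numeric = data.astype('float64')
print(numeric.fillna(numeric.median()))

# ------ output -------
# 0    200.0
# 1    215.0
# 2    210.0
# 3    215.0
# 4    215.0
# 5    220.0
# 6    225.0
# dtype: float64
```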
Another issue NaN values pose, which we won’t explore in this tutorial, is fitting your data into a machine learning model. Doing so will typically raise errors like "ValueError: Input contains NaN, infinity or a value too large for dtype('float64')", which is not good.
I should also mention that scikit-learn provides imputation utilities for handling these missing values, with strategies such as the mean, the median, a constant value, or the most frequent value.
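As a rough sketch of what that looks like (assuming scikit-learn is installed; the variable names are our own):

```python
import numpy as np
from sklearn.impute import SimpleImputer

values = np.array([[200], [np.nan], [210], [215], [np.nan], [220], [225]])

# strategy can be 'mean', 'median', 'most_frequent', or 'constant'
imputer = SimpleImputer(strategy='median')
print(imputer.fit_transform(values).ravel())

# ------ output -------
# [200. 215. 210. 215. 215. 220. 225.]
```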
Combining Datasets
Recall what we mentioned in the previous tutorial about concatenating NumPy arrays. Concatenating Series and DataFrame objects is almost indistinguishable from that. Combining data from different sources can lead to exciting insights because it provides the whole picture.
pd.concat
Let’s recap on how we concatenate two NumPy arrays:
```python
a_list = [[2, 4, 6]]
b_list = [[1, 3, 5]]

print(np.concatenate([a_list, b_list], axis=0))

# ------ output -------
# [[2 4 6]
#  [1 3 5]]

print(np.concatenate([a_list, b_list], axis=1))

# ------ output -------
# [[2 4 6 1 3 5]]
```
Performing row-wise or column-wise concatenation with Pandas is very similar; the difference is that we use pd.concat instead of NumPy’s np.concatenate.
```python
country = pd.Series(data=['Nigeria', 'Serbia', 'China', 'USA'])
population = pd.Series(data=[195900000, 6964000, 1393000000, 328200000])

print(pd.concat([country, population], axis=0))

# ------ output -------
# 0       Nigeria
# 1        Serbia
# 2         China
# 3           USA
# 0     195900000
# 1       6964000
# 2    1393000000
# 3     328200000
# dtype: object

print(pd.concat([country, population], axis=1))

# ------ output -------
#          0           1
# 0  Nigeria   195900000
# 1   Serbia     6964000
# 2    China  1393000000
# 3      USA   328200000
```
pd.merge
The predefined merge function works quite differently from concat. With Pandas’ concat function, we are free to specify the axis along which two Series are combined. With merge, we can’t specify an axis.
Instead, these Series or DataFrames are combined based only on the column(s) or index(es) they share.
Let’s see an example of how we can perform this:
```python
df1 = pd.DataFrame({
    'country': ['Nigeria', 'Serbia', 'China', 'USA'],
    'continent': ['Africa', 'Europe', 'Asia', 'North America']})

df2 = pd.DataFrame({
    'country': ['Nigeria', 'Serbia', 'China', 'USA'],
    'language': ['English', 'Serbian', 'Mandarin', 'English']})

df3 = pd.merge(df1, df2)
print(df3)

# ------------- output --------------
#    country      continent  language
# 0  Nigeria         Africa   English
# 1   Serbia         Europe   Serbian
# 2    China           Asia  Mandarin
# 3      USA  North America   English
```
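Here pd.merge detected the shared country column on its own. You can also name the key column explicitly with the on parameter, which gives the same result:

```python
# Explicitly naming the shared key column
df3 = pd.merge(df1, df2, on='country')
```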
Difference between pd.merge and pd.concat
You might conclude that, since both predefined functions achieve the same goal, they work the same way. I used to think so too. However, the way each performs the combination is different.
More specifically, .concat():
- Combines two or more DataFrames either horizontally or vertically.
- Provides you with the join parameter, with which you can specify how to perform the combination: either an inner or an outer join.
And .merge():
- Doesn’t give you the option to combine two DataFrames by specifying the axis.
- Combines DataFrames based on the column(s) or index(es) they both share.
- Provides you with the how parameter, whose options are inner (the default), outer, left, and right (see the sketch below).
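To make the contrast concrete, here is a minimal sketch (the key, x, and y columns are made up for illustration):

```python
left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'y': [3, 4]})

# merge aligns on the shared 'key' column; how='outer' keeps all keys
print(pd.merge(left, right, how='outer'))

#   key    x    y
# 0   a  1.0  NaN
# 1   b  2.0  3.0
# 2   c  NaN  4.0

# concat simply stacks along the chosen axis, aligned on the index
print(pd.concat([left, right], axis=1))

#   key  x key  y
# 0   a  1   b  3
# 1   b  2   c  4
```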
To learn more, be sure to check out the official Pandas documentation on merging, joining, and concatenating.
Grouping Function
The concept behind grouping functions tends to be a challenge for people who are beginning to explore the Pandas library.
Essentially, the groupby function splits the data into different groups depending on your choice of variable or column. Let’s create a simple example showing car brands, their models, and the cars’ prices.
```python
data = pd.DataFrame({
    'cars': ['BMW', 'BMW', 'Ferrari', 'Ferrari', 'Lamborghini', 'Lamborghini'],
    'model': ['M5', 'i8', '488 GTB', '488 Pista Spider', 'Huracan', 'Urus'],
    'price': [102700, 147500, 262647, 350000, 203674, 203995]
})
print(data)

# ------------- output --------------
#           cars             model   price
# 0          BMW                M5  102700
# 1          BMW                i8  147500
# 2      Ferrari           488 GTB  262647
# 3      Ferrari  488 Pista Spider  350000
# 4  Lamborghini           Huracan  203674
# 5  Lamborghini              Urus  203995
```
Now let’s use the data.groupby('cars') expression to split our current DataFrame by car brand.
```python
print(data.groupby('cars'))

# ------------- output --------------
# <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fb249814470>
```
Once you execute the statement, the groupby() function returns a GroupBy object, which describes how the rows of the original dataset have been split. To see which unique groups were computed and the row indexes belonging to each group, we can use the .groups attribute, which returns a Python dictionary.
```python
print(data.groupby('cars').groups)

# ------------- output --------------
# {'BMW': [0, 1], 'Ferrari': [2, 3], 'Lamborghini': [4, 5]}
```
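You can also iterate over the GroupBy object directly to inspect each group in turn; a quick sketch:

```python
# Each iteration yields the group label and its sub-DataFrame
for brand, group in data.groupby('cars'):
    print(brand, group.shape)

# ------------- output --------------
# BMW (2, 3)
# Ferrari (2, 3)
# Lamborghini (2, 3)
```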
Coupled with this grouping, we can apply computing functions within the individual groups, usually aggregate, transform, or filter functions. Let’s talk about these operations.
Aggregate
When applied to GroupBy objects, aggregate functions produce a statistical summary for each group. These operations include:
- .min() and .max(): Find the minimum and maximum values.
- .count(): Provides a total count of the data in that group.
- .sum(): Provides the total sum of the data in the group.
- .mean() and .median(): Find the mean and median values.
A combine step at the end then merges these per-group results into a single output array.
Let’s experiment by grouping our dataset based on the column cars and applying the mean function to calculate these cars’ average price.
```python
# numeric_only=True skips the non-numeric 'model' column
# (required in recent versions of Pandas)
print(data.groupby('cars').mean(numeric_only=True))

# ------------- output --------------
#                 price
# cars
# BMW          125100.0
# Ferrari      306323.5
# Lamborghini  203834.5
```
You can also specify which column you want to apply aggregate functions to.
```python
# apply groupby on one column
print(data.groupby('cars')['price'].mean())

# ------------- output --------------
# cars
# BMW            125100.0
# Ferrari        306323.5
# Lamborghini    203834.5
# Name: price, dtype: float64
```
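The other aggregates listed above work the same way on a single column; for instance, .sum():

```python
print(data.groupby('cars')['price'].sum())

# ------------- output --------------
# cars
# BMW            250200
# Ferrari        612647
# Lamborghini    407669
# Name: price, dtype: int64
```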
We can also apply multiple statistical summaries to our data:
```python
# select the numeric column first, then apply several aggregates at once
print(data.groupby('cars')[['price']].aggregate(['min', 'median', 'mean', 'max']))

# ---------------- output -------------------
#               price
#                 min    median      mean     max
# cars
# BMW          102700  125100.0  125100.0  147500
# Ferrari      262647  306323.5  306323.5  350000
# Lamborghini  203674  203834.5  203834.5  203995
```
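You can also pass .aggregate() a dictionary mapping column names to the functions you want applied to each; a quick sketch:

```python
# Only the listed functions are computed for the 'price' column
print(data.groupby('cars').aggregate({'price': ['min', 'max']}))

# ---------------- output -------------------
#               price
#                 min     max
# cars
# BMW          102700  147500
# Ferrari      262647  350000
# Lamborghini  203674  203995
```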
Filter
The filter operation gives us the freedom to drop samples based on group properties. For example, let’s say we want only the car brands whose mean price is strictly above $250,000.
```python
def filter_by_price(x):
    return x['price'].mean() > 250000

print(data.groupby('cars').filter(filter_by_price))

# ------------- output --------------
#       cars             model   price
# 2  Ferrari           488 GTB  262647
# 3  Ferrari  488 Pista Spider  350000
```
As you can see from the result, the BMW and Lamborghini groups don’t have a mean price greater than $250,000, so their rows are dropped.
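The same filter can also be written inline with a lambda, which produces the identical result:

```python
print(data.groupby('cars').filter(lambda x: x['price'].mean() > 250000))
```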
Transform
Now you understand how the aggregate functions work. A transformation, by contrast, returns a transformed version of the full data: as you can see from the result, the output has the same shape and index as the input. An example is subtracting $20,000 from the price of each car.
```python
# select the numeric column first so the subtraction applies only to it
print(data.groupby('cars')[['price']].transform(lambda x: x - 20000))

# ------------- output --------------
#     price
# 0   82700
# 1  127500
# 2  242647
# 3  330000
# 4  183674
# 5  183995
```
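A more typical use of transform is normalizing within each group; for example, subtracting each brand’s mean price from its cars (a small sketch):

```python
# Each price minus the mean price of its own brand
print(data.groupby('cars')[['price']].transform(lambda x: x - x.mean()))

# ------------- output --------------
#      price
# 0 -22400.0
# 1  22400.0
# 2 -43676.5
# 3  43676.5
# 4   -160.5
# 5    160.5
```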
Apply
I appreciate the .apply() method because it lets you apply an arbitrary function to the group results. It’s essential to know that your function should take a DataFrame as input and return either a Pandas object (Series or DataFrame) or a scalar value.
Let’s see how we can normalize the price column based on the sum of all the different cars’ prices.
```python
def norm_by_price(x):
    x['price'] /= x['price'].sum()
    return x

print(data.groupby('cars').apply(norm_by_price))

# ------------- output --------------
#           cars             model     price
# 0          BMW                M5  0.410472
# 1          BMW                i8  0.589528
# 2      Ferrari           488 GTB  0.428709
# 3      Ferrari  488 Pista Spider  0.571291
# 4  Lamborghini           Huracan  0.499606
# 5  Lamborghini              Urus  0.500394
```
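Because the function can also return a scalar, .apply() can compute things like the price spread within each brand; a quick sketch:

```python
# A scalar-returning function: max minus min price per brand
print(data.groupby('cars').apply(lambda x: x['price'].max() - x['price'].min()))

# ------------- output --------------
# cars
# BMW            44800
# Ferrari        87353
# Lamborghini      321
# dtype: int64
```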
Conclusion
In this post, you deepened your fundamental understanding of the Pandas library.
You learned how to perform different operations with Pandas (addition, subtraction, etc.), how to handle missing data, and how to combine different datasets using pandas.concat and pandas.merge. Finally, you learned about the groupby function and computing functions like aggregate, apply, filter, and transform.
In the next tutorial, you will learn how to import existing files using Pandas.
Do you have any questions about Pandas or this post? Leave a comment and ask your question. I’ll do my best to answer.
Further Reading
We have listed some useful resources below if you thirst for more reading.
Articles
- A Simple Walk-through with Pandas for Data Science – Part 1
- A Simple Walk-through with Python for Data Science
- How to import existing files with Pandas
- A Simple Walk-through with NumPy for Data Science
- Pandas Docs: Merge, join, concatenate and compare
- Why is Python the most popular language for Data Science
- What You Don’t Know About Machine Learning Could Hurt You