Data Science Machine Learning Python

A Simple Walk-through with Pandas for Data Science – Part 2

Pinterest LinkedIn Tumblr

In this tutorial you will learn more about different operations with Pandas. You will see different techniques on how to handle missing data, and combine different datasets. Then finally, about groupby functions and computing function like aggregate, apply, filter and transform.

This tutorial is part two in our three-part series on the fundamentals of Pandas:

  1. Part #1: A simple walk-through with Pandas for Data Science, Part 1 (last week’s tutorial)
  2. Part #2: A simple walk-through with Pandas for Data Science, Part 2 (this week’s tutorial)
  3. Part #3: How to Import existing files with Pandas (next week’s tutorial)

Operations with Pandas

Since Pandas was built on NumPy, any of NumPy’s universal functions will work on Pandas Series and DataFrame objects. These universal functions include basic arithmetic operations (division, subtraction, addition, etc.) and more complex operations like (exponential and logarithmic function, etc.). To know more about these operations, visit this tutorial I made on NumPy.

Let’s define a DataFrame on which to illustrate how we can perform one of these operations:

Once you perform the calculation, you will notice that the index and column arrangement is maintained, similar to operations between two-dimensional Numpy arrays.

Note: This also applies to Series objects and one-dimensional NumPy arrays.

Handling Missing Data

Having missing values within your data is a total pain in the neck. Especially when performing any numerical operations.

Let me show you a simple illustration of having missing values within your data and performing numerical operations:

If you try to sum this array you will get a TypeError on your output command line. The same will happen if you are looking to find either the maximum or the minimum value.

To check and manage these NaN and None values within your data, Pandas provides you with isnull() and notnull() predefined methods.

  • isnull() – Returns True if a value is Null and False otherwise
  • notnull() – Returns False for NaN values and True otherwise.

You could also perform boolean filtering using these methods like the example below:

Luckily for us, Pandas provides multiple options on how you may freely handle these missing values by replacing these values.

1) Remove rows that contain missing data – using the .dropna() method.

2) Mark missing data and replacing them with other values – fillna, mean, sum, median, forward fill, or backward fill.

Another issue that nan values may pose that we won’t explore in this tutorial is fitting your data into any machine learning model. Doing such a task will certainly raise common issues like “ValueError: Input contains NaN, infinity or a value too large for dtype(‘float64’), which is not good.

I should also inform you that we can use other machine learning methods from scikit-learn to handle these missing values. Like [‘median’, ‘mean’, ‘constant value’, or ‘most frequent value’]

Combining Datasets

Recall from the previous tutorial what we mentioned about array concatenation of NumPy arrays. Well performing concatenation of Series and DataFrame objects is almost indistinguishable. When you are working with data from different sources and combining them leads to exciting insights because it provides the whole picture.

pd.Concat

Let’s recap on how we concatenate two NumPy arrays:

Performing row-wise or column-wise concatenation using Pandas is very similar. Here you can see the difference is that we are using the short form pd.concat instead of np.concatenate from NumPy.

pd.Merge

The way the merge predefined function works is quite different from using concat. Using the concat function from Pandas, we are given the freedom to specify the axis on how we choose to concat two Pandas Series together. However, with merge, it’s different. We can’t specify the axis here.
The only way these Series or DataFrames are combined is either based on their shared column(s) or index(es).
Let’s see an example of how we can perform this:

Difference between pd.Merge and pd.Concat

You might conclude that if both of the predefined functions are achieving the same goal that they work the same way. I used to think likewise. However, the procedure in which this combination happens is different.

More specifically, .concat():

  • Combines two or more DataFrames either horizontally or vertically.
  • Provides you with the join parameter, which you can specify how you want to perform the combination—either using an inner or an outer join.

And .merge():

  • Doesn’t give you the option to combine two DataFrames by specifying the axis
  • Combines DataFrames based on the column(s) or index(es) they both share.
  • Provides you with the “how” parameter whose default options are inner, outer, left, and right.

To learn more, be sure to check out the official website of the Pandas blog here.

Grouping Function

The concept behind grouping functions tends to be a challenge for people who are beginning to explore the library Pandas.

To understand what the groupby or grouping function does, it essentially splits the data into different groups depending on your choice of variable or column. Let’s create a simple example showing car brands, the models, and the cars’ prices.

Now let’s use the data.groupby(‘cars’) expression to split our current DataFrame by the brands of different cars.

Once you execute the statement, the groupby() function returns a GroupBy object, which describes how the original dataset’s rows have split. To access and see which unique groups are computed and the corresponding values, which are the indexes belonging to each group, we can use the .groups attributes, and it returns a Python dictionary.

Coupled with these grouping functions, we can apply some computing functions, usually aggregate, transform, or filter functions within these individual groups. Let’s talk about these operations.

Aggregate

These aggregate functions, which I have already mentioned, when applied to GroupBy objects, help procure a statistical summary for each group. These operations include:

  • .min() and .max(): Helps find the minimum and maximum value.
  • .count(): Provides a total count of the data in that group.
  • .sum():  Provides the total sum of the data in the group.
  • .mean() and .median(): Helps find the mean and median value.

Then, as an output, the combined step that happens at the end merges these operations’ results into an output array.

Let’s experiment by grouping our dataset based on the column cars and applying the mean function to calculate these cars’ average price.

You can also specify which column you want to apply aggregate functions to.

We can also apply multiple statistical summaries to our data:

Filter

The filter operations give us the freedom to drop any sample based on the group properties. For example, let’s say we want only the cars whose price is strictly above $250,000.

As you can examine the result, group BMW and Lamborghini don’t have a mean greater than $250,000.

Transform

Now you understand how the aggregate functions work. The transformation can return a transformed version of the data as you can see from the result,  with the output staying the same as the input. An example will be subtracting the price of each of the cars by $20,000.

Apply

I appreciate the .apply() method because it lets you pass in a function to the group results. Before calling the function and writing any control statements, it’s essential to know that your function should take as an input a DataFrame, and will return either a Pandas object (Series, DataFrame) or a scalar value.

Let’s see how we can normalize the price column based on the sum of all the different cars’ prices.

You should click on the “Click to Tweet Button” below to share on twitter.

Check out the post on a simple walk-through with Pandas. Click To Tweet

Conclusion

In this post, you discovered the fundamental understanding behind the Pandas Library.

You learned how to perform different operations with Pandas (addition, subtracting, etc.), how to handle missing data and combining different datasets using pandas.concat and pandas.merge. Then finally about the groupby function and some computing functions like aggregate, apply, filter and transform.

In the next tutorial, you will how to to import existing files using Pandas.

Do you have any questions about Pandas or this post? Leave a comment and ask your question. I’ll do my best to answer.

To get all access to the source code used in all tutorials, leave your email address in any page’s subscription forms.

Further Reading

We have listed some useful resources below if you thirst for more reading.

Articles

Books

To be notified when this next blog post goes live, be sure to enter your email address in the form !

Write A Comment