
A Simple Walk-through with Pandas for Data Science – Part 1


In this post, we will talk about the essential concepts of the Pandas library. It is one of the most important Python libraries for data processing, efficient storage, and manipulation of densely packed, typed arrays. This post also continues from the NumPy tutorial. Without further ado, let’s start!

This tutorial is part one in our three-part series on the fundamentals of Pandas:

  1. Part #1: A simple walk-through with Pandas for Data Science, Part 1 (today’s post)
  2. Part #2: A simple walk-through with Pandas for Data Science, Part 2 (next week’s tutorial)
  3. Part #3: How to import existing files with Pandas (blog post two weeks from now)

Prerequisites

Before starting with this tutorial, I strongly recommend you get familiar with Python programming terminology and NumPy. To understand this tutorial, you need to get your foundations appropriately built.

What is Pandas?

If you have carefully followed the tutorial on NumPy, you will have learned that NumPy lies at the core of a rich ecosystem of data science libraries. In other words, most data science libraries build on the power of NumPy. Let’s take a careful look at the image below:

The nature of numpy

Looking closely at the image, you will notice that the Pandas package is built on top of NumPy.

There is something you should keep in mind about NumPy. It provides the basic features you need when your data is well organized. However, it has several drawbacks when more functionality is required, such as:

  • Attaching row and column labels to data (values)
  • Working with heterogeneous types of data
  • Handling missing data, and
  • Performing powerful operations, such as grouping, to analyze less structured data in different forms.

Let’s see how we can install this package, then explore some of the basic functionality of Pandas.

How to Install and Use Pandas?

If you already have Anaconda installed on your computer, you may skip this step, since Pandas comes bundled with Anaconda, which includes data science packages for Linux, Windows, and macOS.

If you don’t have the library installed or you’re not using Anaconda, I strongly recommend installing Anaconda to avoid missing dependencies. Or you may choose to ignore me and install Pandas by entering these commands on your command line:
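As a rough sketch of this step: the usual install commands are shown as comments (assuming the standard package name on PyPI and conda), and the Python lines only check whether Pandas can already be found on your machine.

```python
import importlib.util

# Typical install commands, run in a terminal rather than in Python:
#   pip install pandas
#   conda install pandas

# find_spec returns None when the package cannot be located.
pandas_installed = importlib.util.find_spec("pandas") is not None
print("Pandas installed:", pandas_installed)
```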

Once you have followed the tutorial or installed the Anaconda stack, you should have Pandas installed. You can then import the library and check the current version with the following commands:
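A minimal version of that check might look like this; the version string you see depends on your own installation.

```python
import pandas as pd

# Print the installed Pandas version (the value varies by machine).
print(pd.__version__)
```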

Note that the alias pd is the standard convention for importing Pandas, and we will use it every time we import the library.

Auto-Completion

Before starting this tutorial, it’s good to know about using the tab-completion feature; this way, you can quickly explore all the contents of a package. For example, we can show all the contents of the pandas namespace by typing:
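In IPython or a Jupyter notebook, this means typing pd. and pressing the Tab key. Outside an interactive shell, a comparable listing can be sketched with Python's built-in dir(); the variable name public_names is only illustrative.

```python
import pandas as pd

# Interactive shells: type "pd." and press <TAB> to browse the namespace.
# In a plain script, dir() returns the same names as a list.
public_names = [name for name in dir(pd) if not name.startswith("_")]
print(public_names[:10])
```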

This auto-completion feature also applies to other libraries, including NumPy.

Creating Pandas Objects

There are two common types of data structures in Pandas:

  • Series – a one-dimensional array
  • DataFrame – a two-dimensional array

There is one more, called the Index.

Let’s look at how we can work with these data structures. We may begin by opening a script, pandas_tutorial.py, or a Jupyter notebook, pandas_tutorial.ipynb.

Start with importing both libraries (NumPy and Pandas) into our working script.
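Using the conventional aliases, the imports would typically be:

```python
import numpy as np
import pandas as pd
```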

pandas.Series()

pandas.Series can be thought of as a one-dimensional array with an index. There are multiple ways to create one, using any of the following:

  1. Python Lists
  2. Python Dictionaries
  3. NumPy arrays
  4. Scalar value or Constant

Python Lists

First, I will show you how you can do this with Python lists.
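Here is a small sketch; the numbers are arbitrary sample values, and data is just an illustrative name.

```python
import pandas as pd

# Build a Series from a plain Python list; Pandas adds a default 0..n-1 index.
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data)
```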

As you can see, the Series object holds both a sequence of values and a sequence of indexes. Luckily, Pandas lets us access them through the values and index attributes.

You can think of the Index as an array-like object of type pd.Index, which I will further explain below. For now, knowing that they exist is enough.

The values are similar to what we learned about NumPy.
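Both attributes can be inspected directly. This sketch reuses the same arbitrary sample values:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

print(data.index)   # a pd.Index (the default one is a RangeIndex)
print(data.values)  # a NumPy array holding the data
```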

As with NumPy arrays, data can be accessed by the associated index via square-bracket notation:
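For example, with the default integer index (sample values again; second and middle are illustrative names):

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

second = data[1]    # the element whose index label is 1
middle = data[1:3]  # a slice covering positions 1 and 2
print(second)
print(middle)
```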

So far, we have seen how data can be accessed by a NumPy-style, implicitly defined integer index. However, it can also be accessed by an explicitly defined index associated with the values. For example, we can use strings as an index:
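A short sketch with string labels (the labels a to d are arbitrary):

```python
import pandas as pd

# Pass index= to attach explicit labels to the values.
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=["a", "b", "c", "d"])
print(data["b"])
```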

Dictionaries

The second way to create a Pandas Series is by using a dictionary.

Let’s start by creating a Python dictionary. If you are not familiar with dictionaries, review them before continuing.

In order to create a Series object out of a Python dictionary, we can do it this way:
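One possible version, using company founding years as sample data (the names founded and series are illustrative):

```python
import pandas as pd

# Dictionary keys become the index labels; dictionary values become the data.
founded = {"Google": 1998, "Facebook": 2004, "Amazon": 1994}
series = pd.Series(founded)
print(series)
```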

NumPy Array

We can also create a Series from a NumPy array, where specifying the index is optional:
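A sketch of both variants, with arbitrary values and labels:

```python
import numpy as np
import pandas as pd

values = np.array([10, 20, 30])

default_index = pd.Series(values)                        # index 0, 1, 2
custom_index = pd.Series(values, index=["x", "y", "z"])  # explicit labels
print(default_index)
print(custom_index)
```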

Scalar Value or Constant

Finally, if the data needs to be a scalar or a constant, Pandas repeats that value to fill the index we specify.
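For instance, with an arbitrary constant and index:

```python
import pandas as pd

# The scalar 5 is repeated once for every label in the index.
constant = pd.Series(5, index=[100, 200, 300])
print(constant)
```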

pandas.DataFrame()

Let’s say we want to represent different top tech companies. Assuming you are familiar with Python dictionaries, we will use keys and values.

To represent a company like Google, we have the key ‘name’ with the value ‘Google’ and another key ‘year’ with the value 1998. This way we have represented the data for a single tech company. However, the downside of writing it this way is that representing more tech companies, like Facebook and thousands of others, quickly becomes complicated and inefficient.

The way I suggest you think about this is to represent all the values for each key as a list. Let’s see what this looks like below:
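The idea can be sketched as follows. Google and 1998 come from the example above; the other companies and the variable names are illustrative additions.

```python
import pandas as pd

# One company as a plain dictionary.
google = {"name": "Google", "year": 1998}

# Many companies: each key maps to a list with one entry per company.
companies = {
    "name": ["Google", "Facebook", "Amazon"],
    "year": [1998, 2004, 1994],
}
df = pd.DataFrame(companies)
print(df)
```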

We can also access the index labels of the pandas.DataFrame object through the index attribute, just as with a Series, as well as the columns attribute, which is an Index object holding the column labels:
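A quick sketch on the same sample companies:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Google", "Facebook", "Amazon"], "year": [1998, 2004, 1994]}
)

print(df.index)    # row labels (a default RangeIndex here)
print(df.columns)  # column labels, stored as an Index object
```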

To access the values of a single column, we can write the following:
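For example, selecting the year column from the sample companies returns a Series:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Google", "Facebook", "Amazon"], "year": [1998, 2004, 1994]}
)

years = df["year"]  # a single column comes back as a Series
print(years)
```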

You can also think about pandas.DataFrames as two-dimensional NumPy arrays, where both the rows and columns have a generalized index for accessing the data.

pandas.Index()

You might have noticed up until now that both the pandas.DataFrame and pandas.Series contain an explicit index that lets us quickly get or modify values within the existing data. According to the Pandas documentation, an Index is an immutable, array-like sequence that stores the axis labels for all Pandas objects.

Let’s run an experiment and construct an Index from a list of Integers:
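A minimal sketch, with arbitrary integers and the illustrative name ind:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])
print(ind)
```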

This Index object also supports many of the operations that NumPy arrays do.
Take Python indexing notation as an example, which retrieves values or slices:
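For instance, on the same sample Index:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

second = ind[1]         # a single value
every_other = ind[::2]  # a slice, returned as a new Index
print(second)
print(every_other)
```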

These immutable objects also have many attributes familiar from NumPy arrays, such as:
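A few of the NumPy-style attributes, shown on the sample Index (the exact integer dtype may differ between platforms and versions):

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

print(ind.size)   # number of elements
print(ind.shape)  # a one-element tuple, because an Index is one-dimensional
print(ind.ndim)   # number of dimensions
print(ind.dtype)  # data type of the labels
```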


They are called immutable objects for one reason: you can’t modify them.
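To see this in practice, here is a small sketch that tries to overwrite one element and catches the resulting error; error_message is an illustrative name.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[1] = 0  # attempt an in-place modification
    error_message = None
except TypeError as err:
    error_message = str(err)

print(error_message)
```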

Trying to assign a new value to an element of an Index raises a TypeError: “Index does not support mutable operations.”

Data Indexing and Selection

If you have read the previous tutorial on NumPy arrays, you will have learned how to access, set, and modify values within them. If so, there’s no need to stress out. There are only a few new features you need to know.

When you perform either data indexing or selection on a Pandas Object (Series or DataFrame), it’s very similar to the pattern used in NumPy arrays.

Let’s learn about performing such an operation on a Pandas Series object, then move on to working with a pandas.DataFrame object.

Data Indexing and selection in Series

Let’s create a pandas.Series and perform some basic selection and indexing techniques as we would with NumPy arrays:
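A sketch of such a Series, mapping a few countries to their continents. The specific countries are sample data chosen to match the continent examples used in this section.

```python
import pandas as pd

continent = pd.Series(
    ["Africa", "Asia", "Europe"],
    index=["Nigeria", "China", "Norway"],
)

print(continent["Nigeria"])    # look up a value by its label
print("China" in continent)    # membership test against the index
print(list(continent.keys()))  # the index labels
```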

We can even modify this pandas.Series object by adding a new index label and assigning a value to it.
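For example, adding Brazil (an illustrative new entry) to the sample Series:

```python
import pandas as pd

continent = pd.Series(
    ["Africa", "Asia", "Europe"],
    index=["Nigeria", "China", "Norway"],
)

# Assigning to a label that does not exist yet appends a new entry.
continent["Brazil"] = "South America"
print(continent)
```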

One difference from the slicing we covered in the NumPy tutorial is that slicing a Series behaves in two different ways:

  • The explicit index style (continent['Nigeria':'China']): the last index is included in the slice.
  • The implicit index style (continent[0:2]): the last index isn’t included in the slice.

You may confirm from the output below. Remember, we start counting from 0 instead of 1.
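Both styles can be compared side by side on the sample Series. Note that the label 'China' is included in the first slice, while position 2 (Norway) is excluded from the second.

```python
import pandas as pd

continent = pd.Series(
    ["Africa", "Asia", "Europe"],
    index=["Nigeria", "China", "Norway"],
)

explicit = continent["Nigeria":"China"]  # label slice: the end label is included
implicit = continent[0:2]                # position slice: position 2 is excluded
print(explicit)
print(implicit)
```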

Indexer: loc and iloc

Since these indexing types often cause confusion, special indexer attributes were created to resolve this issue.

The first one is the .loc attribute, which refers to the explicit Index with which we may get rows or columns with particular labels from the Index:
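A brief sketch of .loc on the sample Series:

```python
import pandas as pd

continent = pd.Series(
    ["Africa", "Asia", "Europe"],
    index=["Nigeria", "China", "Norway"],
)

one_label = continent.loc["Nigeria"]            # a single label
label_slice = continent.loc["Nigeria":"China"]  # the end label is included
print(one_label)
print(label_slice)
```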

The second one is the .iloc attribute, which refers to the implicit Index with which we can get rows or columns at a particular position in the Index (Note: it only takes integers):
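And the positional counterpart with .iloc, again on the sample Series:

```python
import pandas as pd

continent = pd.Series(
    ["Africa", "Asia", "Europe"],
    index=["Nigeria", "China", "Norway"],
)

first = continent.iloc[0]             # position 0
position_slice = continent.iloc[0:2]  # positions 0 and 1; position 2 is excluded
print(first)
print(position_slice)
```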

Data Indexing and selection in DataFrames

Moving on to a more complex structure, the DataFrame, you can think of it as a two-dimensional structured array.

Let’s create a pandas.DataFrame from multiple pandas.Series objects, such as continent, language, and population, and use the country names as the index.
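One way this could look. The countries are sample data, the languages are simplified to one per country, and the population figures are rounded, illustrative numbers rather than official statistics.

```python
import pandas as pd

countries = ["Nigeria", "China", "Norway"]

continent = pd.Series(["Africa", "Asia", "Europe"], index=countries)
language = pd.Series(["English", "Mandarin", "Norwegian"], index=countries)
# Rounded, illustrative population figures.
population = pd.Series([200_000_000, 1_400_000_000, 5_000_000], index=countries)

# Each Series becomes a column; rows are aligned on the shared country index.
df = pd.DataFrame(
    {"continent": continent, "language": language, "population": population}
)
print(df)
```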

To get all the values within a row, we can pass a single index to the underlying array, which is available through the values attribute:
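A sketch on the same illustrative DataFrame (rounded sample figures):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "continent": ["Africa", "Asia", "Europe"],
        "language": ["English", "Mandarin", "Norwegian"],
        "population": [200_000_000, 1_400_000_000, 5_000_000],
    },
    index=["Nigeria", "China", "Norway"],
)

first_row = df.values[0]  # every value in the first row, as a NumPy array
print(first_row)
```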

Passing a single column label to the pandas.DataFrame instead returns that column, for example all the languages:
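For instance, on the same illustrative DataFrame:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "continent": ["Africa", "Asia", "Europe"],
        "language": ["English", "Mandarin", "Norwegian"],
        "population": [200_000_000, 1_400_000_000, 5_000_000],
    },
    index=["Nigeria", "China", "Norway"],
)

languages = df["language"]  # one column, returned as a Series
print(languages)
```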

As mentioned earlier, we can use the Pandas indexers loc and iloc. The only difference is that we need to specify both the row and the column, by label or by position, to access a single value, a row, or multiple rows within our DataFrame:
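A sketch showing both indexers selecting the same data from the illustrative DataFrame:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "continent": ["Africa", "Asia", "Europe"],
        "language": ["English", "Mandarin", "Norwegian"],
        "population": [200_000_000, 1_400_000_000, 5_000_000],
    },
    index=["Nigeria", "China", "Norway"],
)

by_label = df.loc["Nigeria", "language"]  # row label, column label
by_position = df.iloc[0, 1]               # row position, column position

rows_by_label = df.loc["Nigeria":"China", ["continent", "language"]]
rows_by_position = df.iloc[0:2, 0:2]
print(by_label, by_position)
print(rows_by_label)
```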

Direct masking operations are interpreted row-wise rather than column-wise. For example, we can select only the rows with a population greater than ten million:
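Here is how that might look on the illustrative DataFrame; mask and large are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "continent": ["Africa", "Asia", "Europe"],
        "language": ["English", "Mandarin", "Norwegian"],
        "population": [200_000_000, 1_400_000_000, 5_000_000],
    },
    index=["Nigeria", "China", "Norway"],
)

mask = df["population"] > 10_000_000  # one True/False value per row
large = df[mask]                      # keep only the rows where mask is True
print(mask)
print(large)
```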

To break down how this works: the condition first produces a boolean array that is True or False for each row. Indexing the DataFrame with that boolean array, used as a mask, then selects only the rows where it is True. That’s why we get the result above.

Using these indexers, we can combine both masking (just explained) and fancy indexing to return only a specific column of interest.
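As a sketch, .loc accepts a boolean mask for the rows and a list of labels for the columns:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "continent": ["Africa", "Asia", "Europe"],
        "language": ["English", "Mandarin", "Norwegian"],
        "population": [200_000_000, 1_400_000_000, 5_000_000],
    },
    index=["Nigeria", "China", "Norway"],
)

# Rows: population above ten million. Columns: only "language".
languages = df.loc[df["population"] > 10_000_000, ["language"]]
print(languages)
```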

Then we can choose to either modify or set values into our DataFrame, the same way you have seen with NumPy.
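For example, a single cell can be overwritten through .loc. The new figure is, again, only an illustrative rounded value.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "continent": ["Africa", "Asia", "Europe"],
        "language": ["English", "Mandarin", "Norwegian"],
        "population": [200_000_000, 1_400_000_000, 5_000_000],
    },
    index=["Nigeria", "China", "Norway"],
)

# Overwrite one cell, addressed by row label and column label.
df.loc["Norway", "population"] = 5_400_000
print(df)
```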


Conclusion

In this post, you discovered the fundamentals behind the Pandas Library. You learned how to create Pandas Objects, and perform data indexing and selection on both Pandas Objects (Series and DataFrame).

In the next tutorial, we will learn how to perform different operations with Pandas, how to handle missing data, how to combine different datasets, and finally how to use groupby along with computing functions like aggregate, apply, filter, and transform.

Do you have any questions about Pandas or this post? Leave a comment and ask your question. I’ll do my best to answer.

