In this post, we will talk about the essential concepts of the Pandas library. You will see that it is one of the most important Python libraries for data processing, and for the efficient storage and manipulation of labeled data. This post is also a continuation of the NumPy tutorial. Without further ado, let's start!
This tutorial is part one in our three-part series on the fundamentals of Pandas:
- Part #1: A simple walk-through with Pandas for Data Science, Part 1 (today’s post)
- Part #2: A simple walk-through with Pandas for Data Science, Part 2 (next week's tutorial)
- Part #3: How to import existing files with Pandas (blog post two weeks from now)
Prerequisites
Before starting this tutorial, I strongly recommend you get familiar with Python programming terminology and NumPy. To follow along comfortably, you need to have those foundations appropriately built.
What is Pandas?
If you have carefully followed the tutorial on NumPy, you will have learned that NumPy lies at the core of a rich ecosystem of data science libraries; in other words, most data science libraries build on the power of NumPy. Let's have a careful look at the image below:
A close study of the image shows that the Pandas package is built on top of NumPy.
There is something you should keep in mind about NumPy: it provides the essential features when your data is clean and well organized. However, it falls short when more functionality is required, such as:
- Attaching row and column labels to data (values)
- Working with heterogeneous types of data
- Handling missing data, and
- Performing powerful operations such as grouping, to analyze less structured data in different forms.
Let's see how we can install this package, then explore some of its basic functionality.
How to Install and Use Pandas?
If you already have Anaconda installed on your computer, you may skip this step: Pandas ships with Anaconda, which bundles data science packages suitable for Linux, Windows, and macOS.
If you don't have the library installed and you're not using Anaconda, I strongly recommend installing Anaconda to avoid missing dependencies. Alternatively, you can install both packages directly by entering these commands on your command line:
```shell
$ pip install numpy
$ pip install pandas
```
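If you are on Anaconda instead, a quick way to confirm that Pandas is present, or to install or update it, is with conda. This is just a sketch of the equivalent commands (conda needs to be on your PATH):

```shell
$ conda list pandas      # show the installed pandas version, if any
$ conda install pandas   # install or update pandas
```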
Once you have followed the steps above or gone ahead and installed the Anaconda stack, you should have Pandas available. You can then import the library and check the current version with the following commands:
```python
import pandas as pd
print(pd.__version__)  # '1.1.4'
```
Note that pd is the standard alias convention for importing Pandas, and we will use it every time we import the library.
Auto-Completion
Before starting this tutorial, it's good to know about the tab-completion feature; with it, you can quickly explore all the contents of a package. For example, we can show all the contents of the pandas namespace by typing:
```python
pd.<TAB>
```
This auto-completion feature also applies to other libraries, including NumPy.
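As a quick illustration (not from the original post), the same trick works on the NumPy namespace, and you can narrow the suggestions by typing a prefix first; the exact list you see depends on your IPython or Jupyter version:

```python
np.<TAB>        # lists everything in the NumPy namespace
pd.read_<TAB>   # lists readers such as read_csv, read_excel, read_json, ...
```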
Creating Pandas Objects
There are two common data structures in Pandas:
- Series – a one-dimensional labeled array
- DataFrame – a two-dimensional labeled array
And one more called the Index.
Let's understand the underlying concepts of how we work with these data structures. We can begin by opening a script pandas_tutorial.py or a Jupyter notebook pandas_tutorial.ipynb.
Start by importing both libraries (NumPy and Pandas) into the working script.
```python
import numpy as np
import pandas as pd
```
pandas.Series()
pandas.Series can be thought of as a one-dimensional array with an index. If you want to create one, there are several ways to do so, using:
- Python Lists
- Python Dictionaries
- NumPy arrays
- Scalar value or Constant
Python Lists
First, I will show you how you can do this with Python lists.
As you can see below, the Series object we create wraps both a sequence of values and a sequence of index labels. Luckily, Pandas provides attributes that let us access both: values and index.
```python
data = pd.Series(data=[0, 1, 2, 3],
                 index=['first', 'second', 'third', 'fourth'])
print(data)
# ------ output -------
# first     0
# second    1
# third     2
# fourth    3
# dtype: int64
```
You can think of the Index as an array-like object of type pd.Index, which I will explain further below. For now, knowing that it exists is enough.
```python
# get the index
print(data.index)
# ------ output -------
# Index(['first', 'second', 'third', 'fourth'], dtype='object')
```
The values are simply a NumPy array, like the ones we learned about in the NumPy tutorial.
```python
# get the values
print(data.values)
# ------ output -------
# [0 1 2 3]
```
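In fact, you can confirm that values is nothing more than a NumPy array. A quick sketch:

```python
print(type(data.values))
# ------ output -------
# <class 'numpy.ndarray'>
```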
As with NumPy arrays, the data can be accessed by the corresponding index via square-bracket notation:
```python
# get the first value in the series
print(data[0])
# ------ output -------
# 0

# access the first 2 values stored in the series
print(data[0:2])
# ------ output -------
# first     0
# second    1
# dtype: int64
```
So far, we have accessed data by the implicitly defined integer index, just as in NumPy. However, we can also use the explicitly defined index associated with the values created above. For example, we can use the strings as an index:
```python
# get the first value in the series
print(data['first'])
# ------ output -------
# 0

# access the first 2 values stored in the series
print(data['first':'second'])
# ------ output -------
# first     0
# second    1
# dtype: int64
```
Dictionaries
The second way to create a Pandas Series is by using a dictionary.
Let's start by creating a Python dictionary. If you don't know what dictionaries are, visit this tutorial first:
```python
data_dict = {
    "first"  : 0,
    "second" : 1,
    "third"  : 2,
    "fourth" : 3
}
print(data_dict)
# ------ output -------
# {'first': 0, 'second': 1, 'third': 2, 'fourth': 3}
```
To create a Series object from a Python dictionary, we can do it this way:
```python
data = pd.Series(data_dict)
print(data)
# ------ output -------
# first     0
# second    1
# third     2
# fourth    3
# dtype: int64
```
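One extra detail worth knowing (an addition to the original example): when building a Series from a dictionary, you can also pass an explicit index to select, and reorder, only some of the keys:

```python
# keep only two keys, in a custom order
subset = pd.Series(data_dict, index=['fourth', 'first'])
print(subset)
# ------ output -------
# fourth    3
# first     0
# dtype: int64
```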
NumPy Array
We can also create a Series from a NumPy array, in which case specifying the index is optional:
```python
array = np.arange(4)
print(array)
# ------ output -------
# [0 1 2 3]

data = pd.Series(array)
print(data)
# ------ output -------
# 0    0
# 1    1
# 2    2
# 3    3
# dtype: int64
```
Scalar Value or Constant
Finally, if the data needs to be a scalar or a constant, the value is repeated to fill the index specified.
```python
print(pd.Series(10, index=[1, 2, 3, 4, 5]))
# ------ output -------
# 1    10
# 2    10
# 3    10
# 4    10
# 5    10
# dtype: int64
```
pandas.DataFrame()
Let's say we want to represent several top tech companies. Assuming you are familiar with Python dictionaries, each company will be described with keys and values.
To represent a company like Google, we have the key 'name' with the value 'Google' and another key 'year' with the value 1998. This represents the data for a single tech company. However, the downside of writing it this way is that representing more tech companies, like Facebook and thousands of others, quickly becomes complicated. Writing it this way simply isn't efficient.
```python
company_info1 = {
    'name' : 'Google',
    'year' : 1998
}
print(company_info1)
# ------ output -------
# {'name': 'Google', 'year': 1998}

company_info2 = {
    'name' : 'Facebook',
    'year' : 2004
}
print(company_info2)
# ------ output -------
# {'name': 'Facebook', 'year': 2004}
```
The way I suggest you think about this is to represent all the values as lists. Let's see how this looks below:
```python
tech_companies = ['Google', 'Facebook', 'Nvidia', 'Microsoft']
year_founded = [1998, 2004, 1993, 1975]

companies_info = pd.DataFrame({
    'name' : tech_companies,
    'year' : year_founded
})
print(companies_info)
# ------ output -------
#         name  year
# 0     Google  1998
# 1   Facebook  2004
# 2     Nvidia  1993
# 3  Microsoft  1975
```
Like a Series object, the pandas.DataFrame object lets us access its index labels. It also has a columns attribute, which is an Index object holding the column labels:
```python
print(companies_info.index)
# ------ output -------
# RangeIndex(start=0, stop=4, step=1)

print(companies_info.columns)
# ------ output -------
# Index(['name', 'year'], dtype='object')
```
To access the values of a single column, we can write it as follows:
```python
# get the values in the column 'name'
print(companies_info['name'])
# ------ output -------
# 0       Google
# 1     Facebook
# 2       Nvidia
# 3    Microsoft
# Name: name, dtype: object

# get the values in the column 'year'
print(companies_info['year'])
# ------ output -------
# 0    1998
# 1    2004
# 2    1993
# 3    1975
# Name: year, dtype: int64
```
You can also think about pandas.DataFrames as two-dimensional NumPy arrays, where both the rows and columns have a generalized index for accessing the data.
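To make that two-dimensional picture concrete, here is a small sketch (not in the original post) using the companies_info DataFrame from above; values exposes the underlying two-dimensional array, with dtype object since the columns hold mixed types:

```python
print(companies_info.values)
# ------ output -------
# [['Google' 1998]
#  ['Facebook' 2004]
#  ['Nvidia' 1993]
#  ['Microsoft' 1975]]

print(companies_info.values.shape)
# ------ output -------
# (4, 2)
```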
pandas.Index()
You might have noticed by now that both pandas.DataFrame and pandas.Series contain an explicit index that lets us quickly get or modify values within the existing data. According to the Pandas documentation, these indexes are nothing more than immutable arrays that store the axis labels for all Pandas objects.
Let's run an experiment and construct an Index from a sequence of integers:
```python
index = pd.Index(np.arange(6))
print(index)
# ------ output -------
# Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
```
This Index object supports many of the same operations as NumPy arrays. Take Python indexing notation as an example, used to retrieve values or slices:
```python
print(index[2])
# ------ output -------
# 2

print(index[2:4])
# ------ output -------
# Int64Index([2, 3], dtype='int64')
```
These immutable objects also share many familiar attributes, such as:
```python
print(index.shape)  # get the shape
print(index.size)   # get the size
print(index.ndim)   # get the number of dimensions
print(index.dtype)  # get the data type
# ------ output -------
# (6,)
# 6
# 1
# int64
```
They are called immutable objects for one simple reason: you can't modify them.
```python
index[0] = 2
# ------ output -------
# TypeError: Index does not support mutable operations
```
This operation will certainly raise a TypeError: “Index does not support mutable operations.”
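If you do need different labels, the usual pattern is to replace the index as a whole rather than mutate it in place. A minimal sketch, reusing the data Series created earlier from the NumPy array:

```python
# assign an entirely new Index of the same length
data.index = pd.Index(['a', 'b', 'c', 'd'])
print(data)
# ------ output -------
# a    0
# b    1
# c    2
# d    3
# dtype: int64
```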
Data Indexing and Selection
If you have read the previous tutorial on NumPy arrays, you will have learned how to access, set, and modify values within NumPy arrays. If you've learned that, there's no need to stress out: there are only a few new features you need to know.
When you perform either data indexing or selection on a Pandas Object (Series or DataFrame), it’s very similar to the pattern used in NumPy arrays.
Let’s learn about performing such an operation on a Pandas Series object, then move on to working with a pandas.DataFrame object.
Data Indexing and Selection in Series
Let’s create a pandas.Series and perform some basic selection and indexing techniques as we would with NumPy arrays:
```python
continent = pd.Series(data=['Africa', 'Europe', 'Asia', 'North America'],
                      index=['Nigeria', 'Serbia', 'China', 'USA'])
print(continent)
# ------ output -------
# Nigeria            Africa
# Serbia             Europe
# China                Asia
# USA         North America
# dtype: object
```
We can even modify this pandas.Series object by assigning a new value to a new index key.
```python
continent['Suriname'] = "South America"
print(continent)
# ------ output -------
# Nigeria            Africa
# Serbia             Europe
# China                Asia
# USA         North America
# Suriname    South America
# dtype: object
```
One difference when performing slicing, compared with what we covered in the NumPy tutorial, is that slicing a Series behaves in two different ways:
- The explicit index style (continent['Nigeria':'China']): the final index is included in the slice.
- The implicit index style (continent[0:2]): the final index isn't included in the slice.
You can confirm this from the output below. Remember, we start counting from 0 instead of 1.
```python
print(continent['Nigeria':'China'])
# ------ output -------
# Nigeria    Africa
# Serbia     Europe
# China        Asia
# dtype: object

print(continent[0:2])
# ------ output -------
# Nigeria    Africa
# Serbia     Europe
# dtype: object
```
Indexers: loc and iloc
Since these indexing types often cause confusion, special indexer attributes were created to resolve this issue.
The first one is the .loc attribute, which refers to the explicit index: it lets us select rows or columns by particular labels from the index:
```python
print(continent.loc['Serbia':'Suriname'])
# ------ output -------
# Serbia             Europe
# China                Asia
# USA         North America
# Suriname    South America
# dtype: object
```
The second one is the .iloc attribute, which refers to the implicit index: it lets us select rows or columns at a particular position in the index (note: it only accepts integers):
```python
print(continent.iloc[0])
# ------ output -------
# Africa

print(continent.iloc[0:3])
# ------ output -------
# Nigeria    Africa
# Serbia     Europe
# China        Asia
# dtype: object
```
Data Indexing and Selection in DataFrames
Moving on to the more complex structure, the DataFrame, you can think of it as a two-dimensional structured array.
Let's create a pandas.DataFrame from multiple pandas.Series objects (continent, language, and population), specifying the index as the countries they describe.
```python
continent = pd.Series(data=['Africa', 'Europe', 'Asia', 'North America'],
                      index=['Nigeria', 'Serbia', 'China', 'USA'])
language = pd.Series(data=['English', 'Serbian', 'Mandarin', 'English'],
                     index=['Nigeria', 'Serbia', 'China', 'USA'])
population = pd.Series(data=[195900000, 6964000, 1393000000, 328200000],
                       index=['Nigeria', 'Serbia', 'China', 'USA'])

data = pd.DataFrame({'continent': continent,
                     'language': language,
                     'population': population})
print(data)
# ------ output -------
#              continent  language  population
# Nigeria         Africa   English   195900000
# Serbia          Europe   Serbian     6964000
# China             Asia  Mandarin  1393000000
# USA      North America   English   328200000
```
To get all the values within a row, we can pass a single index to the underlying values array:
```python
print(data.values[3])
# ------ output -------
# ['North America' 'English' 328200000]
```
and by passing an available column label to the pandas.DataFrame, we can get all the languages within that column:
```python
print(data['language'])   # or equivalently: print(data.language)
# ------ output -------
# Nigeria     English
# Serbia      Serbian
# China      Mandarin
# USA         English
# Name: language, dtype: object
```
As mentioned earlier, we can use the Pandas indexers loc and iloc here as well. The only difference is that we need to specify both the index and column labels (or positions) to access a value, a single row, or multiple rows within our DataFrame:
```python
print(data.iloc[:2, :2])
# ------ output -------
#         continent language
# Nigeria    Africa  English
# Serbia     Europe  Serbian

print(data.loc[:'China', 'language'])
# ------ output -------
# Nigeria     English
# Serbia      Serbian
# China      Mandarin
# Name: language, dtype: object
```
Similarly, direct masking operations are interpreted row-wise rather than column-wise: for example, we can select only the rows with a population greater than ten million.
```python
ten_million = 10000000
print(data[data.population > ten_million])
# ------ output -------
#              continent language  population
# Nigeria         Africa  English   195900000
# China             Asia Mandarin  1393000000
# USA      North America  English   328200000
```
To break down how this works: the comparison returns a boolean array, with True or False for each row depending on the condition. Indexing the DataFrame with that boolean array then keeps only the rows where the mask is True. That's why we get the result above.
```python
print(data.population > ten_million)
# ------ output -------
# Nigeria     True
# Serbia     False
# China       True
# USA         True
# Name: population, dtype: bool
```
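As an extra sketch (not in the original walkthrough), boolean masks can also be combined with the & and | operators, as long as each condition is wrapped in parentheses:

```python
# rows with more than ten million people AND English as the language
mask = (data.population > ten_million) & (data.language == 'English')
print(data[mask])
# ------ output -------
#              continent language  population
# Nigeria         Africa  English   195900000
# USA      North America  English   328200000
```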
Using these indexers, we can combine both masking (just explained) and fancy indexing to return only a specific column of interest.
```python
ten_million = 10000000
print(data.loc[data.population > ten_million, ['continent']])
# ------ output -------
#              continent
# Nigeria         Africa
# China             Asia
# USA      North America
```
We can also modify or set values in our DataFrame, the same way you have seen with NumPy.
```python
print(data.iloc[0, 2])
# ------ output -------
# 195900000

data.iloc[0, 2] = 195900123
print(data.iloc[0, 2])
# ------ output -------
# 195900123
```
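Another common way to set values, shown here as an extra sketch, is to add a whole new column computed from an existing one:

```python
# add a derived column expressed in millions of people
data['population_millions'] = data['population'] / 1_000_000
print(data['population_millions'])
# ------ output -------
# Nigeria     195.900123
# Serbia        6.964000
# China      1393.000000
# USA         328.200000
# Name: population_millions, dtype: float64
```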
Conclusion
In this post, you discovered the fundamentals behind the Pandas Library. You learned how to create Pandas Objects, and perform data indexing and selection on both Pandas Objects (Series and DataFrame).
In the next tutorial, we will learn how to perform different operations with Pandas, handle missing data, combine different datasets, and finally use groupby along with computing functions like aggregate, apply, filter, and transform.
Do you have any questions about Pandas or this post? Leave a comment and ask your question. I’ll do my best to answer.
Further Reading
We have listed some useful resources below if you thirst for more reading.
Articles
- How to import existing files with Pandas
- Pandas Docs: Merge, join, concatenate and compare
- A Simple Walk-through with Python for Data Science
- A Simple Walk-through with NumPy for Data Science
- Why is Python the most popular language for Data Science
- What You Don’t Know About Machine Learning Could Hurt You