In this tutorial, we will learn how to import existing files into our current working project using Pandas. We will also learn how to work with file formats like CSV (Comma Separated Values), JSON (JavaScript Object Notation), and Excel.
This tutorial is part three in our three-part series on the fundamentals of Pandas:
- Part #1: A simple walk-through with Pandas for Data Science, Part 1 (tutorial from two weeks ago)
- Part #2: A simple walk-through with Pandas for Data Science, Part 2 (last week’s tutorial)
- Part #3: How to Import existing files with Pandas (today’s tutorial)
Overview
To recapitulate what we have covered in the last two tutorials, we have seen how to create and perform data selection of Pandas objects. We also further explained how to handle missing data and looked at two methods for combining datasets from different sources. Finally, working with groupby and computing functions like aggregate, apply, filter, and transform.
I recommend you have a quick read to further advance your skills in working with tabular data, if you already haven’t.
How to load and save CSV files in Python with Pandas
A CSV file (Comma-Separated Value) is a raw text file which any of your favorite text editors can display the content. The way the files are stored is in a tabular form, where commas separate columns and new lines separate rows. When looking at the structure, the first row contains the names of the columns for the data.
1 2 3 4 5 6 | import pandas as pd # loading to a csv file url_raw_link = "https://git.io/JL7o9" daily_female_birth = pd.read_csv(filepath_or_buffer=url_raw_link, sep=",") daily_female_birth.head() |
Now let’s see how will we load or save a .csv file. Let’s begin by opening a Python file or a Python script, name it pandas_tutorial.py or pandas_tutorial.ipynb (A Jupyter notebook), then type in this code snippet and execute your code.
First start by importing the needed Python library Pandas into our working script. Then read the .csv file from a web server by passing the URL link as a parameter into the pd.read_csv predefined function.
Note: As for the argument filepath_or_buffer, we can either pass in a URL link to the datasets or a path to the file on our local machine.
As for the separator (sep) argument, the comma is utilized since we know the output file’s delimiter. It’s also important to know other delimiters like a semi-colon (“;”) or a tab (“t”) can be substituted instead of a comma (“,”).
8 9 10 | # writing to a file filename = "./daily-female-birth.csv" daily_female_birth.to_csv(path_or_buf=filename, sep=",") |
To save the pandas.DataFrame object into a file, we need to call the predefined function .to_csv that writes the object to a comma-separated values (CSV) file. You should also pay close attention to the separator (sep) specified, which will be the field delimiter for the output file.
How to load and save Json Format files in Python with Pandas
If you haven’t already heard of JSON, it stands for JavaScript Object Notation. It’s a widely used lightweight format for data storage and transportation of data between a client and a server.
It’s also publicly adopted by developers because it’s readable compared to formats like XML (Extensible Markup Language). Yikes.
Due to its widespread presence and influence on programming, you’ll likely want to learn how to read JSON from a file, server response, or write JSON to a file at some point in your development.
Trust me, both of these jobs are a breeze with Python. You’ll see.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | nba_players = { "Name":{"0":"LeBron James", "1":"Stephen Curry", "2":"James Harden"}, "Gender":{"0":"Male", "1":"Male", "2":"Male"}, "Nationality":{"0":"USA", "1":"USA", "2":"USA"} } nba_players_df = pd.DataFrame(data=nba_players) print(nba_players_df) # ---------------- output ---------------- # Name Gender Nationality # 0 LeBron James Male USA # 1 Stephen Curry Male USA # 2 James Harden Male USA |
One of the easiest ways to write your data in the JSON format to a file using Python is to store your data in a dictionary object, then convert the dictionary object into a pandas object (DataFrame).
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 | filename = "nba_players.json" # saving the dataframe as a .json format file nba_players_df.to_json(path_or_buf=filename) # ---------------- output ---------------- #{ # "Name": { # "0": "LeBron James", # "1": "Stephen Curry", # "2": "James Harden" # }, # "Gender": { # "0": "Male", # "1": "Male", # "2": "Male" # }, # "Nationality": { # "0": "USA", # "1": "USA", # "2": "USA" # } #} |
To save the pandas object as type JSON, we need to call the .to_json predefined function, pass the name of the document in addition to the file type .json at the end.
48 49 50 51 | # reading a .json format file filename = "nba_players.json" nba_players_df = pd.read_json(path_or_buf=filename) nba_players_df.head() |
Above all, reading JSON data from a file is as easy as writing it to a file. Using the predefined function pandas.read_json from Pandas, we can extract and parse the JSON string directly from a file.
How to load and save Excel files in Python with Pandas
Excel is a popular spreadsheet developed by Microsoft for Windows, IOS, Android, and macOS. You must have used it when performing some calculations, make when making your monthly budget list, creating visualizations, etc.
Let’s see how we can save a pandas object in the form of an excel file.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 | football_players = { "Name":{"0":"Lionel Messi", "1":"Cristiano Ronaldo", "2":"Kylian Mbappé", "3":"Mohamed Salah", "4":"Neymar da Silva Santos Júnior"}, "Gender":{"0":"Male", "1":"Male", "2":"Male", "3":"Male", "4":"Male"}, "Nationality":{"0":"Argentine", "1":"Portugal", "2":"France", "3":"Egypt", "4":"Brazil"} } football_players_df = pd.DataFrame(data=football_players) print(football_players_df) # ------------------------ output ---------------------- # Name Gender Nationality # 0 Lionel Messi Male Argentina # 1 Cristiano Ronaldo Male Portugal # 2 Kylian Mbappé Male France # 3 Mohamed Salah Male Egypt # 4 Neymar da Silva Santos Júnior Male Brazil |
Let’s create a Python dictionary of my favorite football players, with three different pieces of information about the players: their name, gender, and nationality.
From the Python dictionary initialized, let’s convert the dictionary into a pandas object.
32 33 34 35 36 37 | file_path = './Football Players.xlsx' sheet_name = "Football Players" football_players_df.to_excel(excel_writer=file_path, sheet_name=sheet_name, index=False) |
When saving the DataFrame as an excel spreadsheet, it’s required to specify the location to save the file and the sheet-name if needed. Also, set the index to False if you don’t want to include the row names.
40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 | sport_sheets = { 'Football Players': football_players_df, 'Basketball Players': nba_players_df } filename_path = './Sport men.xlsx' writer = pd.ExcelWriter(path=filename_path) # iterate through the sport sheet keys 'Football Players' # and 'Basketball Players'. for sheet_name in sport_sheets.keys(): # save the two dictionaries with different sheet names sport_sheets[sheet_name].to_excel(excel_writer=writer, sheet_name=sheet_name, index=False) writer.save() |
Now we have seen how to write only to one sheet. The question now is, what if we want to write into multiple sheets. Like, save all my top 3 NBA players into one sheet and top 5 football players into another sheet within the same file? First, we will make a dictionary and create a writer object that will let us write the DataFrame object into excel sheets.
Once we iterate through the keys within the variable sport_sheet (‘Football Players’ and ‘Basketball Players’), we will save the distinct key and value pair into the save file, yet indifferent sheet_name. Then eventually, call writer.save that saves the workbook to disk.
How to load specific sheets in Excel
1 2 3 4 5 6 7 8 9 10 11 12 13 | file_path = './Sport men.xlsx' sheet_name = "Football Players" football_players_df = pd.read_excel(io=file_path, sheet_name=sheet_name) print(football_players_df) # ------------------------ output ---------------------- # Name Gender Nationality # 0 Lionel Messi Male Argentina # 1 Cristiano Ronaldo Male Portugal # 2 Kylian Mbappé Male France # 3 Mohamed Salah Male Egypt # 4 Neymar da Silva Santos Júnior Male Brazil |
What if, from the excel spreadsheet, we want to only load all the data in the sheet name, “Football Players”? We need to pass the name of the sheet to the argument “sheet_name”.
15 16 17 18 19 20 21 22 23 24 | sheet_name = "Basketball Players" nba_players_df = pd.read_excel(io=file_path, sheet_name=sheet_name) print(nba_players_df) # ---------------- output ---------------- # Name Gender Nationality # 0 LeBron James Male USA # 1 Stephen Curry Male USA # 2 James Harden Male USA |
Likewise, the process is possible for the Basketball players saved in the sheet_name, “Basketball Players”.
Reading Specific Columns from an Excel File
26 27 28 29 30 31 32 33 34 35 36 37 38 | file_path = './Sport men.xlsx' sheet_name = "Basketball Players" columns = [0, 2] # name & nationality nba_players_df = pd.read_excel(io=file_path, sheet_name=sheet_name, usecols=columns) print(nba_players_df) # ---------------- output ---------------- # Name Nationality # 0 LeBron James USA # 1 Stephen Curry USA # 2 James Harden USA |
In most cases, reading all the columns within your spreadsheet isn’t want you to want to achieve. Some columns aren’t of interest to you. You could only access specific columns of interest to you by passing the indexes as a parameter to the argument usecols. Let’s add the parameter to read the columns that correspond to the basketball player’s “Name” and “Nationality”.
You should click on the “Click to Tweet Button” below to share on twitter.
Check out the post on a simple walk-through with Pandas. Share on XConclusion
In this post, you discovered how to load and save existing file formats. These formats include CSV, JSON, and Excel. Be sure to explore more options on the type of files you may read and save here.
In the next tutorial, we will explore Matplotlib, a library for visualizing data into pie charts, bar charts, histograms, line plots and whatever kinds of graphs you can imagine.
Do you have any questions about Pandas or this post? Leave a comment and ask your question. I’ll do my best to answer.
Further Reading
We have listed some useful resources below if you thirst for more reading.
Articles
- A Simple Walk-through with Pandas for Data Science – Part 1
- A Simple Walk-through with Pandas for Data Science – Part 2
- A Simple Walk-through with Python for Data Science
- A Simple Walk-through with NumPy for Data Science
- pandas I/O API
Books
To be notified when this next blog post goes live, be sure to enter your email address in the form!