Topics covered in this section:
- Installation: pip install pandas
- Importing: import pandas as pd
- Viewing data: head(), tail(), info(), describe()
- Selecting data: df['column'], iloc, loc
- Cleaning and converting: dropna(), fillna(), astype()
- Sorting: sort_values()
- Grouping and aggregation: groupby() (mean(), sum(), count(), etc.)
- Combining data: merge(), concat()
- Reading and writing files: read_csv(), read_excel(), read_json(), to_csv(), to_excel()
- Quick plotting: df.plot()
- Summarizing: value_counts(), unique(), nunique(), pivot_table()
Definition: pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. [1]
Key Features:
- Labeled data structures: Series (one-dimensional) and DataFrame (two-dimensional)
- Automatic data alignment between objects with differing indexes
- Built-in handling of missing data (NaN)
- Rich I/O tools for CSV, Excel, JSON, SQL, and more
- Grouping, merging, and reshaping operations
For more details on data alignment, see Appendix B.
Installation: Install pandas with pip:
pip install pandas
Verify the installation by importing pandas and printing its version:
import pandas as pd
print(pd.__version__)
By convention, pandas is imported under the alias pd:
import pandas as pd
Pandas provides two types of classes for handling data: Series, a one-dimensional labeled array, and DataFrame, a two-dimensional labeled table with rows and columns.
Creating a Series by passing a list of values, letting pandas create a default RangeIndex. [2]
import pandas as pd
import numpy as np
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
np.nan stands for “Not a Number”, representing a missing or undefined numerical value (similar to NaN in other contexts).
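As a quick illustration (using the Series s from above):
# NaN can be detected with pd.isna(); note that NaN never equals itself
print(pd.isna(s))        # Boolean mask: True at the position holding np.nan
print(np.nan == np.nan)  # False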
Indexing: Both Series and DataFrames have index labels to identify data.
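For example, a minimal sketch contrasting label-based and position-based access:
import pandas as pd
# A Series with custom index labels
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])     # Access by label -> 20
print(s.iloc[1])  # Access by position -> 20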
import pandas as pd
# Creating a DataFrame from a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
You can load CSV files from your local system or directly from a URL into Pandas using pd.read_csv().
import pandas as pd
# Loading the Titanic dataset from a URL
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic = pd.read_csv(url)
print(titanic.head())
titanic = pd.read_csv("path_to_your_file/titanic.csv")
print(titanic.head())
Pandas can easily read Excel files using pd.read_excel().
import pandas as pd
tasks = pd.read_excel("tasks_ds.xlsx")
print(tasks.head())
When working with DataFrames in libraries like pandas, it’s essential to quickly inspect your data to understand its structure. Below are some key functions to view data:
head(n): Displays the first n rows (default is 5).
import pandas as pd
df = pd.read_csv('titanic.csv')
print(df.head())
tail(n): Displays the last n rows (default is 5).
df.tail()
info(): Provides a concise summary of the DataFrame, including column names, non-null counts, and data types.
df.info()
describe(): Generates summary statistics for numerical columns, such as mean, standard deviation, and quartiles.
df.describe()
Selecting specific parts of a DataFrame is a common operation. You can select columns and rows using different methods:
Selecting Columns
Using the column name in square brackets:
df['column_name']
Selecting multiple columns by providing a list of column names:
df[['col1', 'col2']]
Loading the Titanic dataset (assuming it’s available as a CSV file):
import pandas as pd
# Load the Titanic dataset
df = pd.read_csv('titanic.csv')
For example, to select the “Age” column:
df['Age']
This returns a Series representing the “Age” column.
To select “Name”, “Age”, and “Survived” columns:
df[['Name', 'Age', 'Survived']]
This returns a DataFrame with the specified columns.
Selecting Rows
Using iloc (integer-location based): Select rows by their index position.
df.iloc[0] # First row
df.iloc[1:4] # Rows 2 to 4
Using loc (label-based): Select rows by their index labels.
df.loc[0] # Row with index 0
df.loc[0:3] # Rows from index 0 to 3 (inclusive)
df.loc[df['col'] > 10] # Rows where 'col' > 10
Let’s load the dataset and inspect its structure:
import pandas as pd
# Load the Titanic dataset (assuming it's available as 'titanic.csv')
df = pd.read_csv('titanic.csv')
# Display the first few rows to understand the data
print(df.head())
This might display something like:
   PassengerId                          Name   Age  Survived  Pclass
0            1       Braund, Mr. Owen Harris  22.0         0       3
1            2    Cumings, Mrs. John Bradley  38.0         1       1
2            3        Heikkinen, Miss. Laina  26.0         1       3
3            4  Futrelle, Mrs. Jacques Heath  35.0         1       1
4            5      Allen, Mr. William Henry  35.0         0       3
Using iloc (Integer-location Based)
Select the First Row:
first_row = df.iloc[0]
print(first_row)
Output:
PassengerId 1
Name Braund, Mr. Owen Harris
Age 22.0
Survived 0
Pclass 3
Name: 0, dtype: object
Select Rows 2 to 4 (indices 1 to 3):
rows_2_to_4 = df.iloc[1:4]
print(rows_2_to_4)
Output:
PassengerId Name Age Survived Pclass
1 2 Cumings, Mrs. John Bradley 38.0 1 1
2 3 Heikkinen, Miss. Laina 26.0 1 3
3 4 Futrelle, Mrs. Jacques Heath 35.0 1 1
Using loc (Label-based)
Select the Row with Index 0:
row_index_0 = df.loc[0]
print(row_index_0)
Output:
PassengerId 1
Name Braund, Mr. Owen Harris
Age 22.0
Survived 0
Pclass 3
Name: 0, dtype: object
Select Rows with Index 0 to 3 (inclusive):
rows_0_to_3 = df.loc[0:3]
print(rows_0_to_3)
Output:
PassengerId Name Age Survived Pclass
0 1 Braund, Mr. Owen Harris 22.0 0 3
1 2 Cumings, Mrs. John Bradley 38.0 1 1
2 3 Heikkinen, Miss. Laina 26.0 1 3
3 4 Futrelle, Mrs. Jacques Heath 35.0 1 1
Select Rows Where Age > 30:
rows_age_above_30 = df.loc[df['Age'] > 30]
print(rows_age_above_30.head())
Output:
PassengerId Name Age Survived Pclass
1 2 Cumings, Mrs. John Bradley 38.0 1 1
3 4 Futrelle, Mrs. Jacques Heath 35.0 1 1
5 6 Moran, Mr. James 32.0 0 3
6 7 McCarthy, Mr. Timothy J 54.0 0 3
Filtering rows allows you to extract data that meets certain criteria. Conditions can be combined using logical operators:
Filter based on a single condition:
df[df['column_name'] > 50]
Filter based on multiple conditions:
Using AND (&
):
df[(df['col1'] > 50) & (df['col2'] == 'Value')]
Using OR (|):
df[(df['col1'] > 50) | (df['col2'] == 'Value')]
Filter based on text matching:
df[df['col'].str.contains('keyword')]
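Putting these together on the Titanic data, a sketch assuming the Age, Pclass, and Name columns seen earlier:
import pandas as pd
df = pd.read_csv('titanic.csv')
# Single condition: passengers older than 30
older = df[df['Age'] > 30]
# AND: first-class passengers older than 30
first_class_older = df[(df['Age'] > 30) & (df['Pclass'] == 1)]
# Text matching: names containing "Mrs"
mrs = df[df['Name'].str.contains('Mrs')]
print(mrs.head())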
Adding a New Column
You can create new columns by assigning values or calculations based on existing columns:
df['new_col'] = df['col1'] + df['col2']
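For instance, a sketch on the Titanic data (assuming it includes the standard SibSp and Parch columns, which count siblings/spouses and parents/children aboard):
import pandas as pd
df = pd.read_csv('titanic.csv')
# Derive a new column from two existing ones
df['FamilySize'] = df['SibSp'] + df['Parch']
print(df[['Name', 'SibSp', 'Parch', 'FamilySize']].head())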
Deleting a Column
Use the drop method to remove a column. Set axis=1 to specify columns:
df.drop('col_to_delete', axis=1, inplace=True)
Alternatively, use the del statement:
del df['col_to_delete']
Pandas doesn’t come with built-in datasets like some other libraries, but it offers many ways to load and handle external datasets. You can easily read data from CSV, Excel, SQL, JSON, and other formats using Pandas.
Here are common datasets you can load and work with in Pandas, along with some examples of reading them into your environment:
You can load CSV files from your local system or directly from a URL into Pandas using pd.read_csv().
import pandas as pd
# Loading the Titanic dataset from a URL
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic = pd.read_csv(url)
print(titanic.head())
titanic = pd.read_csv("path_to_your_file/titanic.csv")
print(titanic.head())
Pandas can easily read Excel files using pd.read_excel().
superstore = pd.read_excel("path_to_your_file/superstore_sales.xlsx")
print(superstore.head())
You can load JSON files using pd.read_json().
# Loading a JSON file
json_data = pd.read_json("path_to_your_file/data.json")
print(json_data.head())
If you’re working with databases, Pandas can directly query them using SQL queries.
import pandas as pd
import sqlite3
# Create connection to your SQLite database
conn = sqlite3.connect('database_name.db')
# Query the database
data = pd.read_sql_query("SELECT * FROM table_name", conn)
print(data.head())
Pandas can extract tables from HTML web pages using pd.read_html().
# Loading data from a webpage with HTML tables
url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
tables = pd.read_html(url)
print(tables[0].head()) # The first table on the page
You can load data from APIs that return JSON, CSV, or other formats. For example, using the Kaggle API, you can download datasets and load them into Pandas.
# After downloading a dataset from Kaggle
kaggle_data = pd.read_csv("path_to_downloaded_kaggle_file.csv")
print(kaggle_data.head())
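If an API returns JSON directly, pd.read_json() can often read it straight from the URL (the endpoint below is a placeholder, not a real API):
import pandas as pd
# Hypothetical API endpoint that returns a JSON array of records
api_url = "https://api.example.com/records.json"
api_data = pd.read_json(api_url)
print(api_data.head())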
You can also read data from Google Sheets by exporting them as CSV and reading into Pandas.
# Google Sheets shared link with export format as CSV
sheet_url = "https://docs.google.com/spreadsheets/d/your_sheet_id/export?format=csv"
google_sheets_data = pd.read_csv(sheet_url)
print(google_sheets_data.head())
Pandas can read CSVs from zipped files directly without unzipping them.
# Loading CSV from a zipped file
zip_url = "https://your_url/file.zip"
zipped_data = pd.read_csv(zip_url, compression='zip')
print(zipped_data.head())
You can even read data you have copied to the clipboard (for example, a table from a webpage or a document) using pd.read_clipboard().
# Assuming you've copied a table from a webpage or a document
clipboard_data = pd.read_clipboard()
print(clipboard_data.head())
While Pandas itself doesn’t provide built-in datasets, you can use datasets from libraries like Seaborn and Scikit-learn and load them into Pandas:
import seaborn as sns
# Load Titanic dataset from Seaborn and convert to Pandas DataFrame
titanic = sns.load_dataset('titanic')
print(titanic.head())
Example: Scikit-learn Iris Dataset into Pandas
import pandas as pd
from sklearn.datasets import load_iris
# Load Iris dataset and convert to Pandas DataFrame
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
print(iris_df.head())
There are two primary ways to store MySQL results in a pandas DataFrame using Python:
1. Using pandas.read_sql()
This is the recommended approach as it’s specifically designed for this purpose. Here’s how it works:
import pandas as pd
import mysql.connector
# Establish connection
mydb = mysql.connector.connect(
host="localhost",
user="yourusername",
password="yourpassword",
database="mydatabase"
)
# Define your SQL query
sql = "SELECT * FROM mytable" # Replace with your specific query
# Read the results into a DataFrame
df = pd.read_sql(sql, mydb) # mydb is the connection object
# Close the connection
mydb.close()
# Now you can work with the data in your DataFrame (df)
print(df.head()) # View the first few rows
Explanation:
- Import pandas and mysql.connector.
- Establish the connection and define your SQL query (sql).
- Call pd.read_sql(sql, mydb) to execute the query and store the results in a pandas DataFrame named df; the mydb argument provides the connection object.
- Work with the df DataFrame for further analysis or manipulation.
2. Using cursor.fetchall() and DataFrame constructor
This method involves fetching the results as a list of tuples and then constructing a DataFrame from it. Here’s an example:
import pandas as pd
import mysql.connector
# Establish connection
mydb = mysql.connector.connect(
host="localhost",
user="yourusername",
password="yourpassword",
database="mydatabase"
)
mycursor = mydb.cursor()
# Execute your SQL query
sql = "SELECT * FROM mytable"
mycursor.execute(sql)
# Fetch the results
data = mycursor.fetchall() # data is a list of tuples
# Define column names (optional, but recommended for clarity)
column_names = [i[0] for i in mycursor.description] # Get column names from cursor description
# Create the DataFrame
df = pd.DataFrame(data, columns=column_names)
# Close connection (same as previous method)
mycursor.close()
mydb.close()
# Now you can work with the DataFrame (df)
print(df.head())
Explanation:
- Execute the query and fetch the results with fetchall(), which returns a list of tuples.
- Create the DataFrame with pd.DataFrame(data, columns=column_names), taking the column names from the cursor description.
- Work with the df DataFrame for further analysis.
Choosing the right approach: pandas.read_sql() is generally preferred as it is more concise and efficient, especially for larger datasets.
Example: Using SQLAlchemy Engine
import pandas as pd
from sqlalchemy import create_engine
# Construct the connection URL (replace with your credentials)
engine = create_engine("mysql+mysqlconnector://yourusername:yourpassword@host/yourdatabase")
sql = "SELECT * FROM mytable"  # Replace with your specific query
df = pd.read_sql(sql, engine)
from sqlalchemy import create_engine
# Construct the connection URL (replace with your credentials)
engine = create_engine("mysql+mysqlconnector://root:abc1234@localhost/library")
# Define your SQL query
sql = "SELECT * FROM books" # Replace with your specific query
# Read the results into a DataFrame
df = pd.read_sql(sql, engine)  # engine provides the connection
# Dispose of the engine's connection pool when done
engine.dispose()
# Now you can work with the data in your DataFrame (df)
print(df.head())  # View the first few rows
Example: Using Database String URI
import pandas as pd
# Replace with your connection string details
connection_string = "mysql+mysqlconnector://yourusername:yourpassword@host/yourdatabase"
sql = "SELECT * FROM mytable"  # Replace with your specific query
df = pd.read_sql(sql, connection_string)
Data alignment refers to how Pandas handles operations between data structures (such as Series or DataFrames) with differing indexes. When performing operations like addition, subtraction, or merging, Pandas automatically aligns the data by their index labels to ensure that operations happen between corresponding elements.
This feature helps simplify data operations and avoid errors, especially when dealing with real-world datasets that may not always be perfectly aligned.
When performing operations between two Series with different indexes, Pandas aligns the data by the index labels and fills any missing values with NaN (Not a Number).
import pandas as pd
# First Series
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
# Second Series with different index
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
# Adding the two Series
result = s1 + s2
print(result)
Output:
a NaN
b 6.0
c 8.0
d NaN
dtype: float64
- Indexes present in both Series (b and c) are added together.
- For a and d, there are no corresponding values in the other Series, so the result is NaN.
When performing operations on DataFrames, Pandas aligns both rows and columns based on their respective indexes.
# First DataFrame
df1 = pd.DataFrame({
'A': [1, 2],
'B': [3, 4]
}, index=['row1', 'row2'])
# Second DataFrame with different columns and rows
df2 = pd.DataFrame({
'B': [5, 6],
'C': [7, 8]
}, index=['row2', 'row3'])
# Adding the two DataFrames
result = df1 + df2
print(result)
Output:
A B C
row1 NaN NaN NaN
row2 NaN 9.0 NaN
row3 NaN NaN NaN
- Only the overlapping cell (row row2, column B) is added.
- All other entries are NaN.
You can handle missing data resulting from alignment by using methods like:
- fillna(): Replace NaN with a specific value.
- add(), sub(), etc. with fill_value: Provide a default value for missing entries.
Example using fill_value:
result = s1.add(s2, fill_value=0)
print(result)
Output:
a 1.0
b 6.0
c 8.0
d 6.0
dtype: float64
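Note that fillna() applied after the operation behaves differently from fill_value, since the addition has already produced NaN:
# Replace the NaN entries after the addition (contrast with fill_value above)
result = (s1 + s2).fillna(0)
print(result)  # a and d become 0.0, not 1.0 and 6.0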
In short, any index labels that don't match during alignment produce NaN unless specified otherwise.