Learn how to use the Pandas module in Python for data analysis. Explore functions, examples, and best practices for efficient data manipulation.
pip install pandas
)import pandas as pd
)head()
, tail()
, info()
, describe()
df['column']
)iloc
, loc
)dropna()
fillna()
astype()
)sort_values()
)groupby()
)mean()
, sum()
, count()
, etc.)merge()
, concat()
)
read_csv()
)read_excel()
)read_json()
)to_csv()
)to_excel()
)df.plot()
)value_counts()
unique()
nunique()
pivot_table()
Definition: pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. 1
Key Features:
For more details on Data alignment, see Data Alignment in Pandas.
pip
:
pip install pandas
import pandas as pd
print(pd.__version__)
import pandas as pd
pd
?Pandas provides two types of classes for handling data:
Creating a Series by passing a list of values, letting pandas create a default RangeIndex. 2
import pandas as pd
import numpy as np
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
np.nan
stands for “Not a Number”, representing a missing or undefined numerical value (similar to NaN in other contexts).
Indexing: Both Series and DataFrames have index labels to identify data.
import pandas as pd
# Creating a DataFrame from a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
You can load CSV files from your local system or directly from a URL into Pandas using pd.read_csv().
import pandas as pd
# Loading the Titanic dataset from a URL
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic = pd.read_csv(url)
print(titanic.head())
for more deta sets, see Datasets for use in examples
titanic = pd.read_csv("path_to_your_file/titanic.csv")
print(titanic.head())
Pandas can easily read Excel files using pd.read_excel().
import pandas as pd
tasks = pd.read_excel("tasks_ds.xlsx")
print(tasks.head())
for more details, see Loading and Handling Datasets in Pandas
When working with DataFrames in libraries like pandas, it’s essential to quickly inspect your data to understand its structure. Below are some key functions to view data:
head(n)
: Displays the first n rows (default is 5).import pandas as pd
df = pd.read_csv('titanic.csv')
print(df.head())
tail(n)
: Displays the last n rows (default is 5).
df.tail()
info()
: Provides a concise summary of the DataFrame, including column names, non-null counts, and data types.
df.info()
describe()
: Generates summary statistics for numerical columns, such as mean, standard deviation, and quartiles.
df.describe()
Selecting specific parts of a DataFrame is a common operation. You can select columns and rows using different methods:
Selecting Columns
Using the column name in square brackets:
df['column_name']
Selecting multiple columns by providing a list of column names:
df[['col1', 'col2']]
Loading the Titanic dataset (assuming it’s available as a CSV file):
import pandas as pd
# Load the Titanic dataset
df = pd.read_csv('titanic.csv')
For example, to select the “Age” column:
df['Age']
This returns a Series representing the “Age” column.
To select “Name”, “Age”, and “Survived” columns:
df[['Name', 'Age', 'Survived']]
This returns a DataFrame with the specified columns.
Selecting Rows
Using iloc
(integer-location based): Select rows by their index position.
df.iloc[0] # First row
df.iloc[1:4] # Rows 2 to 4
Using loc
(label-based): Select rows by their index labels.
df.loc[0] # Row with index 0
df.loc[0:3] # Rows from index 0 to 3 (inclusive)
df.loc[df['col'] > 10] # Rows where 'col' > 10
Let’s load the dataset and inspect its structure:
import pandas as pd
# Load the Titanic dataset (assuming it's available as 'titanic.csv')
df = pd.read_csv('titanic.csv')
# Display the first few rows to understand the data
print(df.head())
This might display something like:
PassengerId | Name | Age | Survived | Pclass |
---|---|---|---|---|
1 | Braund, Mr. Owen Harris | 22.0 | 0 | 3 |
2 | Cumings, Mrs. John Bradley | 38.0 | 1 | 1 |
3 | Heikkinen, Miss. Laina | 26.0 | 1 | 3 |
4 | Futrelle, Mrs. Jacques Heath | 35.0 | 1 | 1 |
5 | Allen, Mr. William Henry | 35.0 | 0 | 3 |
iloc
(Integer-location Based)Select the First Row:
first_row = df.iloc[0]
print(first_row)
Output:
PassengerId 1
Name Braund, Mr. Owen Harris
Age 22.0
Survived 0
Pclass 3
Name: 0, dtype: object
Select Rows 2 to 4 (indices 1 to 3):
rows_2_to_4 = df.iloc[1:4]
print(rows_2_to_4)
Output:
PassengerId Name Age Survived Pclass
1 2 Cumings, Mrs. John Bradley 38.0 1 1
2 3 Heikkinen, Miss. Laina 26.0 1 3
3 4 Futrelle, Mrs. Jacques Heath 35.0 1 1
loc
(Label-based)Select the Row with Index 0:
row_index_0 = df.loc[0]
print(row_index_0)
Output:
PassengerId 1
Name Braund, Mr. Owen Harris
Age 22.0
Survived 0
Pclass 3
Name: 0, dtype: object
Select Rows with Index 0 to 3 (inclusive):
rows_0_to_3 = df.loc[0:3]
print(rows_0_to_3)
Output:
PassengerId Name Age Survived Pclass
0 1 Braund, Mr. Owen Harris 22.0 0 3
1 2 Cumings, Mrs. John Bradley 38.0 1 1
2 3 Heikkinen, Miss. Laina 26.0 1 3
3 4 Futrelle, Mrs. Jacques Heath 35.0 1 1
Select Rows with index 0 to 3 and columns name and age
df_name_age = df.loc[0:3, ["Name", "Age"]]
# The above line will raise an error because the loc method is not used correctly.
print(df_name_age)
Output:
Name Age
0 Braund, Mr. Owen Harris 22.0
1 Cumings, Mrs. John Bradley (Florence Briggs Th... 38.0
2 Heikkinen, Miss. Laina 26.0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0
Select Rows Where Age
> 30:
rows_age_above_30 = df.loc[df['Age'] > 30]
print(rows_age_above_30.head())
Output:
PassengerId Name Age Survived Pclass
1 2 Cumings, Mrs. John Bradley 38.0 1 1
3 4 Futrelle, Mrs. Jacques Heath 35.0 1 1
5 6 Moran, Mr. James 32.0 0 3
6 7 McCarthy, Mr. Timothy J 54.0 0 3
9 10 Nasser, Mrs. Nicholas 14.0 1 2
for more details, see Difference Between Pandas loc vs iloc – Key Comparisons
Filtering rows allows you to extract data that meets certain criteria. Conditions can be combined using logical operators:
Filter based on a single condition:
df[df['column_name'] > 50]
Filter based on multiple conditions:
Using AND (&
):
df[(df['col1'] > 50) & (df['col2'] == 'Value')]
Using OR (|
):
df[(df['col1'] > 50) | (df['col2'] == 'Value')]
Filter based on text matching:
df[df['col'].str.contains('keyword')]
Adding a New Column
You can create new columns by assigning values or calculations based on existing columns:
df['new_col'] = df['col1'] + df['col2']
Deleting a Column
Use the drop
method to remove a column. Set axis=1
to specify columns:
df.drop('col_to_delete', axis=1, inplace=True)
Alternatively, use the del
statement:
del df['col_to_delete']
Answer Key (True/False):
Watch this video for the answer:
Answer key (Mutiple Choice):
Answer Key (Fill in the Blanks):
Beginner: Basic concepts and syntax. Intermediate: More complex problems involving data structures and algorithms. Advanced: Challenging problems that require in-depth understanding and optimization.