Learn how to create box plots in Matplotlib using Python. This tutorial covers box plot components, customization, outlier detection, and side-by-side comparisons with violin plots.
A box plot is a graphical summary of a data distribution. It helps visualize the median, the first and third quartiles (Q1 and Q3), the overall spread, and any outliers.
It’s great for comparing distributions across multiple datasets (Atlassian).
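As a rough sketch of what the box encodes, the quartiles and extremes can be computed directly with NumPy. The `five_number_summary` helper and the `scores` list are hypothetical names used only for illustration, and NumPy's default linear interpolation is assumed for the percentiles.

```python
import numpy as np

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a 1-D sequence of numbers."""
    arr = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(arr, [25, 50, 75])
    return float(arr.min()), float(q1), float(median), float(q3), float(arr.max())

# Hypothetical sample data, chosen only for illustration
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = five_number_summary(scores)
print(summary)  # (1.0, 3.0, 5.0, 7.0, 9.0)
```

Note that the whiskers Matplotlib draws stop at the most extreme points within 1.5 * IQR of the box, so they reach the minimum and maximum only when the data contains no outliers.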
import matplotlib.pyplot as plt
import numpy as np
# Example datasets
np.random.seed(0)
model_A = np.random.normal(70, 10, 100)
model_B = np.random.normal(75, 15, 100)
model_C = np.random.normal(65, 20, 100)
data = [model_A, model_B, model_C]
# Box plot with tick_labels (Matplotlib 3.9+; older versions use labels= instead)
plt.figure(figsize=(8, 5))
plt.boxplot(data, tick_labels=['Model A', 'Model B', 'Model C'], patch_artist=True)
plt.title('Box Plot Example: Model Score Distributions')
plt.ylabel('Scores')
plt.grid(True)
plt.show()
You can quickly compare the median, spread, and outliers of each model.
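If you want numbers to go with the visual comparison, one possible approach is to compute each model's median and IQR with NumPy. This is only a sketch: it regenerates the same synthetic datasets as the plotting example, and the `summaries` dictionary is a hypothetical name introduced here.

```python
import numpy as np

# Same synthetic datasets as in the plotting example
np.random.seed(0)
model_A = np.random.normal(70, 10, 100)
model_B = np.random.normal(75, 15, 100)
model_C = np.random.normal(65, 20, 100)

# Hypothetical container for the per-model statistics
summaries = {}
for name, scores in [("Model A", model_A), ("Model B", model_B), ("Model C", model_C)]:
    q1, median, q3 = np.percentile(scores, [25, 50, 75])
    summaries[name] = {"median": float(median), "iqr": float(q3 - q1)}
    print(f"{name}: median={median:.1f}, IQR={q3 - q1:.1f}")
```

The median corresponds to the line inside each box, and the IQR corresponds to the height of the box.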
A box plot is specifically designed to help you identify outliers in your dataset.
Box plots use the Interquartile Range (IQR) to find outliers.
A data point is considered an outlier if it is less than Q1 - 1.5 * IQR or greater than Q3 + 1.5 * IQR, where IQR = Q3 - Q1.
These points will show up as individual dots outside the whiskers in a box plot.
Example: Let’s say you have a dataset where: Q1 = 20, Q3 = 80, and IQR = 80 - 20 = 60. Then:
Lower Fence (Q1 - 1.5 * IQR) = 20 - 1.5 * 60 = -70
Upper Fence (Q3 + 1.5 * IQR) = 80 + 1.5 * 60 = 170
Any data point less than -70 or greater than 170 would be flagged as an outlier.
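The fence arithmetic from this example can be checked with a few lines of Python. The `iqr_fences` function is a hypothetical helper written for this sketch, with the 1.5 multiplier exposed as a parameter `k`.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Values from the worked example: Q1 = 20, Q3 = 80
lower, upper = iqr_fences(20, 80)
print(lower, upper)  # -70.0 170.0
```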
import numpy as np
import matplotlib.pyplot as plt
# Sample data with outliers
data = np.array([12, 13, 14, 15, 16, 17, 18, 30])  # 30 is an outlier
# Box plot
plt.boxplot(data, vert=False, patch_artist=True)
plt.title("Box Plot with Outlier")
plt.xlabel("Value")
plt.grid(True)
plt.show()
# Calculate outlier thresholds
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = data[(data < lower_bound) | (data > upper_bound)]
print("Outliers:", outliers)
Console prints:
Outliers: [30]
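As a cross-check, Matplotlib can report the same statistics it uses to draw the plot. The sketch below assumes `matplotlib.cbook.boxplot_stats` is available in your installed version. It returns one dictionary per dataset, including the quartiles, whisker ends, and fliers (the points drawn as outliers).

```python
import numpy as np
from matplotlib.cbook import boxplot_stats

data = np.array([12, 13, 14, 15, 16, 17, 18, 30])

# whis=1.5 is the default multiplier applied to the IQR
stats = boxplot_stats(data, whis=1.5)[0]

print("Q1:", stats["q1"], "Q3:", stats["q3"])
print("Whiskers:", stats["whislo"], stats["whishi"])
print("Fliers:", stats["fliers"])
```

Here the upper whisker ends at 18, the largest value inside the upper fence, and 30 is reported as the only flier, which matches the manual calculation.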