SciPy in Python
SciPy is a powerful library in Python for scientific computing, which includes modules for optimization, linear algebra, integration, and importantly, statistics and probability. SciPy builds on top of NumPy and provides functions for a variety of statistical calculations and probability distributions.
Statistics and Probability with SciPy
1. Descriptive Statistics
Descriptive statistics summarize a dataset without making inferences about the wider population. NumPy computes the basic measures (mean, median, variance, and standard deviation), and the scipy.stats module adds shape measures such as skewness and kurtosis.
from scipy import stats
import numpy as np
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# Mean and median
mean = np.mean(data)
median = np.median(data)
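# Variance and standard deviation (population formulas, ddof=0; pass ddof=1 for sample estimates)
variance = np.var(data)
std_dev = np.std(data)
# Skewness and excess kurtosis (stats.kurtosis uses Fisher's definition by default, so a normal distribution scores 0)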
skewness = stats.skew(data)
kurtosis = stats.kurtosis(data)
- Mean: Average of the data values.
- Median: Middle value in a sorted dataset.
- Variance and Standard Deviation: Measure the spread of data.
- Skewness: Measure of asymmetry in the data distribution.
- Kurtosis: Indicates the “tailedness” of the data distribution.
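The measures listed above can also be obtained in one call. The following is a minimal sketch, assuming the same ten-value dataset; it also shows how the `ddof` argument changes the variance, since NumPy defaults to the population formula while `stats.describe` reports the sample variance.

```python
import numpy as np
from scipy import stats

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# np.var defaults to the population formula (ddof=0, divides by n).
population_variance = np.var(data)
# ddof=1 gives the sample variance (divides by n - 1).
sample_variance = np.var(data, ddof=1)

# stats.describe bundles count, min/max, mean, variance (ddof=1),
# skewness, and excess kurtosis into one result object.
summary = stats.describe(data)

print(population_variance, sample_variance)
print(summary.mean, summary.variance, summary.skewness, summary.kurtosis)
```

If the data are a sample rather than the full population, the `ddof=1` versions are usually the ones to report.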
2. Probability Distributions
SciPy provides functions for working with continuous and discrete probability distributions, including common ones like Normal, Binomial, Poisson, and Uniform distributions.
#### a. Normal Distribution
Models many natural phenomena and is defined by its mean (μ) and standard deviation (σ).
# Normal distribution with mean=0 and standard deviation=1
norm_dist = stats.norm(loc=0, scale=1)
# Probability density function (PDF) at x=1
pdf = norm_dist.pdf(1)
# Cumulative distribution function (CDF) at x=1
cdf = norm_dist.cdf(1)
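Beyond single PDF and CDF values, a frozen distribution can answer interval and percentile questions. This sketch, assuming the same standard normal distribution, uses the `cdf`, `sf`, and `ppf` methods.

```python
from scipy import stats

norm_dist = stats.norm(loc=0, scale=1)

# P(-1 <= X <= 1) as a difference of CDF values (about 0.6827).
prob_within_one_sd = norm_dist.cdf(1) - norm_dist.cdf(-1)

# Survival function: P(X > 1), equivalent to 1 - CDF(1).
upper_tail = norm_dist.sf(1)

# Percent point function (inverse CDF): the 97.5th percentile (about 1.96).
z_975 = norm_dist.ppf(0.975)

print(prob_within_one_sd, upper_tail, z_975)
```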
#### b. Binomial Distribution
Models the number of successes in a fixed number of independent trials with binary outcomes (e.g., success/failure).
# Binomial distribution with n=10 trials, p=0.5 probability of success
binom_dist = stats.binom(n=10, p=0.5)
# Probability of getting exactly 5 successes
pmf = binom_dist.pmf(5)
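As a check on what `pmf` returns, the value can be reproduced from the closed-form binomial formula. This is a small sketch with the same n=10 and p=0.5; the variable names are illustrative.

```python
from math import comb
from scipy import stats

n, p = 10, 0.5
binom_dist = stats.binom(n=n, p=p)

# Closed form: C(n, k) * p^k * (1 - p)^(n - k), which is 252 / 1024 here.
k = 5
manual_pmf = comb(n, k) * p**k * (1 - p) ** (n - k)
scipy_pmf = binom_dist.pmf(k)

# Cumulative probability of at most 5 successes, and the expected count n * p.
at_most_five = binom_dist.cdf(5)
expected_successes = binom_dist.mean()

print(manual_pmf, scipy_pmf, at_most_five, expected_successes)
```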
#### c. Poisson Distribution
Models the number of times an event occurs in a fixed interval of time or space.
# Poisson distribution with λ=3 (average rate of occurrence)
poisson_dist = stats.poisson(mu=3)
# Probability of getting exactly 2 events
pmf = poisson_dist.pmf(2)
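The Poisson probability can likewise be verified against its formula, e^(-λ) λ^k / k!. The sketch below assumes the same rate λ=3 and also derives the probability of at least one event.

```python
from math import exp, factorial
from scipy import stats

lam = 3
poisson_dist = stats.poisson(mu=lam)

# Closed form for exactly k events.
k = 2
manual_pmf = exp(-lam) * lam**k / factorial(k)
scipy_pmf = poisson_dist.pmf(k)

# P(at least one event) = 1 - P(zero events).
at_least_one = 1 - poisson_dist.pmf(0)

print(manual_pmf, scipy_pmf, at_least_one)
```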
#### d. Uniform Distribution
Every value in the range is equally likely, so the probability density is constant across it.
# Uniform distribution from 0 to 10 (the interval is [loc, loc + scale])
uniform_dist = stats.uniform(loc=0, scale=10)
# PDF and CDF for a given value
pdf = uniform_dist.pdf(5)
cdf = uniform_dist.cdf(5)
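Because `scale` is the width of the interval rather than its upper bound, an interval that does not start at 0 is easy to get wrong. A short sketch, using a hypothetical interval from 2 to 5:

```python
from scipy import stats

a, b = 2, 5  # hypothetical lower and upper bounds

# loc is the lower bound; scale is the width (b - a), not the upper bound.
uniform_dist = stats.uniform(loc=a, scale=b - a)

density_inside = uniform_dist.pdf(3)   # 1 / (b - a)
cdf_at_3 = uniform_dist.cdf(3)         # (3 - a) / (b - a)
density_outside = uniform_dist.pdf(6)  # 0 outside [a, b]

print(density_inside, cdf_at_3, density_outside)
```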
3. Sampling Techniques
Sampling means selecting a subset of a population. SciPy can simulate this by generating random values from a chosen distribution with the rvs method.
# Random sample of size 5 from a normal distribution
norm_sample = stats.norm.rvs(loc=0, scale=1, size=5)
# Random sample of size 5 from a binomial distribution
binom_sample = stats.binom.rvs(n=10, p=0.5, size=5)
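Random samples differ on every run unless the generator is seeded. The sketch below passes a NumPy `Generator` through the `random_state` argument of `rvs` to make results repeatable; the seed values are arbitrary.

```python
import numpy as np
from scipy import stats

# Two generators created with the same (arbitrary) seed yield the same draws.
rng = np.random.default_rng(seed=42)
sample_a = stats.norm.rvs(loc=0, scale=1, size=5, random_state=rng)

rng_again = np.random.default_rng(seed=42)
sample_b = stats.norm.rvs(loc=0, scale=1, size=5, random_state=rng_again)

# The same argument works for discrete distributions.
binom_sample = stats.binom.rvs(
    n=10, p=0.5, size=5, random_state=np.random.default_rng(seed=0)
)

print(np.array_equal(sample_a, sample_b))
print(binom_sample)
```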
4. Hypothesis Testing
Hypothesis testing is a statistical method for making inferences about populations using sample data. Common tests include the t-test, chi-squared test, and ANOVA.
#### a. T-tests
- One-sample T-test: Tests if the mean of a single group is equal to a known value.
- Two-sample T-test: Compares the means of two independent groups.
# One-sample t-test (testing if mean is equal to 5)
t_statistic, p_value = stats.ttest_1samp(data, 5)
# Two-sample t-test
data1 = np.random.normal(0, 1, 100)
data2 = np.random.normal(1, 1, 100)
t_stat, p_val = stats.ttest_ind(data1, data2)
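By default `ttest_ind` assumes both groups share the same variance. When that is doubtful, Welch's t-test (`equal_var=False`) is a common alternative. The following sketch uses hypothetical seeded data with unequal spreads and sizes, and reproduces the Welch statistic by hand.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)  # arbitrary seed for repeatability
data1 = rng.normal(loc=0, scale=1, size=100)
data2 = rng.normal(loc=1, scale=2, size=40)

# Welch's t-test does not pool the two variances.
welch = stats.ttest_ind(data1, data2, equal_var=False)

# Manual Welch statistic: difference in means over the unpooled standard error.
v1, v2 = data1.var(ddof=1), data2.var(ddof=1)
manual_t = (data1.mean() - data2.mean()) / np.sqrt(
    v1 / len(data1) + v2 / len(data2)
)

alpha = 0.05  # conventional significance level, chosen for illustration
reject_null = welch.pvalue < alpha

print(welch.statistic, manual_t, welch.pvalue, reject_null)
```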
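#### b. Chi-Squared Test
Tests whether an observed frequency distribution fits an expected one (goodness of fit) or whether two categorical variables are independent. The stats.chisquare function used here performs the goodness-of-fit test; independence tests use stats.chi2_contingency.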
observed = [10, 20, 30]
expected = [15, 15, 30]
chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
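For the independence case, `stats.chi2_contingency` takes a table of observed counts and derives the expected counts itself. A minimal sketch with a hypothetical 2x3 contingency table:

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows are two groups, columns are three categories.
table = np.array([
    [20, 30, 50],
    [30, 30, 40],
])

# Returns the statistic, p-value, degrees of freedom, and expected counts
# under the null hypothesis that rows and columns are independent.
chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)

print(chi2_stat, p_value, dof)
print(expected)
```

The degrees of freedom equal (rows - 1) * (columns - 1), which is 2 for this table.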
#### c. ANOVA (Analysis of Variance)
Used to compare means across multiple groups.
# ANOVA test for three groups
group1 = np.random.normal(5, 1, 100)
group2 = np.random.normal(5.5, 1, 100)
group3 = np.random.normal(6, 1, 100)
f_stat, p_val = stats.f_oneway(group1, group2, group3)
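The F statistic returned by `f_oneway` is the ratio of between-group to within-group mean squares. This sketch computes it by hand for three small hypothetical groups and compares the result with SciPy.

```python
import numpy as np
from scipy import stats

# Small hypothetical groups so the arithmetic is easy to follow.
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([5.0, 6.0, 7.0]),
    np.array([7.0, 8.0, 9.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group and within-group sums of squares.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

manual_f = (ss_between / df_between) / (ss_within / df_within)
f_stat, p_val = stats.f_oneway(*groups)

print(manual_f, f_stat, p_val)
```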
5. Confidence Intervals
Confidence intervals estimate a range within which a population parameter is likely to fall, with a specified level of confidence (e.g., 95%).
# 95% confidence interval for the mean (normal approximation; stats.sem gives the standard error of the mean)
confidence_interval = stats.norm.interval(0.95, loc=np.mean(data), scale=stats.sem(data))
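With only ten observations, an interval based on Student's t distribution is often preferred over the normal approximation because it accounts for the extra uncertainty in the estimated standard deviation. A sketch comparing the two, assuming the same ten-value dataset:

```python
import numpy as np
from scipy import stats

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

mean = np.mean(data)
sem = stats.sem(data)  # sample standard deviation (ddof=1) / sqrt(n)

# Normal-approximation interval.
normal_ci = stats.norm.interval(0.95, loc=mean, scale=sem)

# t-based interval with n - 1 degrees of freedom.
t_ci = stats.t.interval(0.95, len(data) - 1, loc=mean, scale=sem)

print(normal_ci)
print(t_ci)
```

The t-based interval is wider, and the gap shrinks as the sample size grows.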
6. Correlation and Covariance
- Correlation measures the strength and direction of the relationship between two variables.
- Covariance indicates the direction of the linear relationship between variables; unlike correlation, its magnitude depends on the units of the variables.
x = np.array([1, 2, 3, 4, 5])
y = np.array([10, 20, 30, 40, 50])
# Pearson correlation coefficient
correlation, p_value = stats.pearsonr(x, y)
# Covariance matrix
covariance_matrix = np.cov(x, y)
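The two measures are directly related: the Pearson coefficient is the covariance divided by the product of the standard deviations. The sketch below checks this on hypothetical, slightly noisy data and adds Spearman's rank correlation, which only depends on the ordering of the values.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5])
y = np.array([12, 19, 33, 38, 52])  # hypothetical, roughly linear values

# np.cov returns a 2x2 matrix: variances on the diagonal, covariance off it.
cov = np.cov(x, y)
r_manual = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

r, p_value = stats.pearsonr(x, y)

# y increases strictly with x, so the rank correlation is exactly 1.
rho, rho_p = stats.spearmanr(x, y)

print(r_manual, r, rho)
```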
7. Linear Regression
Linear regression models the relationship between two variables by fitting a linear equation. SciPy provides a simple implementation of linear regression.
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
# Predicting values using the regression line
predicted_y = intercept + slope * x
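The result of `linregress` can also be kept as a single object with named fields, which makes it easier to derive quantities such as R² and residuals. A sketch on hypothetical, slightly noisy data:

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5])
y = np.array([12, 19, 33, 38, 52])  # hypothetical observations

result = stats.linregress(x, y)

# Coefficient of determination: share of variance explained by the line.
r_squared = result.rvalue ** 2

# Fitted values and residuals (observed minus fitted).
fitted = result.intercept + result.slope * x
residuals = y - fitted

# Prediction for a new, hypothetical x value.
prediction_at_6 = result.intercept + result.slope * 6

print(result.slope, result.intercept, r_squared, prediction_at_6)
```

Least-squares residuals sum to zero, which is a quick sanity check on a fit.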
8. Non-Parametric Tests
Non-parametric tests do not assume a normal distribution and are useful for data that doesn’t meet parametric test assumptions.
- Mann-Whitney U Test: Compares two independent samples.
- Wilcoxon Signed-Rank Test: Compares two related samples.
# Mann-Whitney U Test
stat, p = stats.mannwhitneyu(data1, data2)
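# Wilcoxon Signed-Rank Test (expects paired samples of equal length; data1 and data2 are treated as pairs here for illustration only)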
stat, p = stats.wilcoxon(data1, data2)
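The Wilcoxon signed-rank test is meant for paired measurements, such as the same subjects before and after a treatment. This sketch uses hypothetical seeded data to show a paired setup next to a Mann-Whitney U test on independent groups.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)  # arbitrary seed for repeatability

# Paired data: each "after" value belongs to the same subject as "before".
before = rng.normal(loc=50, scale=5, size=30)
after = before + rng.normal(loc=1.0, scale=2, size=30)

# Passing two arrays is equivalent to testing their element-wise differences.
w_stat, w_p = stats.wilcoxon(before, after)
w_stat_diff, w_p_diff = stats.wilcoxon(before - after)

# Independent groups: Mann-Whitney U with an explicit two-sided alternative.
group_a = rng.normal(loc=0, scale=1, size=40)
group_b = rng.normal(loc=0.5, scale=1, size=35)
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(w_stat, w_p)
print(u_stat, u_p)
```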
Summary Table
| Statistical Concept | Function(s) | Description |
|---|---|---|
| Descriptive Statistics | np.mean(), np.var(), stats.skew() | Basic stats measures like mean, variance, etc. |
| Probability Distributions | stats.norm, stats.binom, etc. | Continuous and discrete probability distributions |
| Sampling | stats.norm.rvs(), stats.binom.rvs() | Generating random samples from distributions |
| Hypothesis Testing | stats.ttest_1samp(), stats.chisquare() | Tests for statistical significance |
| Confidence Intervals | stats.norm.interval() | Provides range estimates for population parameters |
| Correlation and Covariance | stats.pearsonr(), np.cov() | Measures relationships between variables |
| Linear Regression | stats.linregress() | Fits a linear model to data |
| Non-Parametric Tests | stats.mannwhitneyu(), stats.wilcoxon() | Tests for non-normal data distributions |
Together, scipy.stats and NumPy cover descriptive statistics, probability distributions, sampling, hypothesis tests, confidence intervals, correlation, and regression, which makes them widely applicable in data science, research, and analytics.