Share Your Feedback

Datasets in Python

Learn Python variables with this beginner-friendly guide. Understand variable naming rules, assignments, and operations with examples and exercises. Perfect for students and professionals starting their Python journey.

In Python, several built-in libraries provide datasets for practice, making it easy to access and work with various datasets directly in your code. Here are some popular options:

Scikit-learn Datase

Scikit-learn offers several well-known datasets that are built-in for practicing data analysis, machine learning, and classification tasks.

Iris Dataset: Perfect for classification tasks and basic exploratory data analysis.

from sklearn.datasets import load_iris iris = load_iris() print(iris.data)

Wine Dataset: Useful for classification problems in machine learning.

from sklearn.datasets import load_wine wine = load_wine() print(wine.data)

Diabetes Dataset: A small dataset for regression tasks related to diabetes progression.

from sklearn.datasets import load_diabetes diabetes = load_diabetes() print(diabetes.data)

Breast Cancer Dataset: Suitable for binary classification tasks.

from sklearn.datasets import load_breast_cancer cancer = load_breast_cancer() print(cancer.data)

Boston Housing Dataset (deprecated): Previously used for regression analysis, but now deprecated due to ethical concerns.

Seaborn Datasets

Seaborn provides a variety of datasets for visualization and analysis, which can be easily loaded.

Titanic Dataset: Useful for classification and visualization tasks.

import seaborn as sns titanic = sns.load_dataset(‘titanic’) print(titanic.head())

Tips Dataset: Ideal for regression analysis and exploratory data visualization.

tips = sns.load_dataset(‘tips’) print(tips.head())

Flights Dataset: Time series data showing monthly passengers.

flights = sns.load_dataset(‘flights’) print(flights.head())

Penguins Dataset: A dataset for classification and visualization tasks related to penguin species.

penguins = sns.load_dataset(‘penguins’) print(penguins.head())

Statsmodels Datasets

Statsmodels provides access to real-world datasets for statistical analysis.

Heart Disease Dataset: For binary classification and logistic regression tasks.

import statsmodels.api as sm heart = sm.datasets.heart.load_pandas().data print(heart.head())

Fair’s Affairs Dataset: Useful for logistic regression and social science analysis.

fair = sm.datasets.fair.load_pandas().data print(fair.head())

US States Dataset: Data on the population and area of US states for regression tasks.

states = sm.datasets.statecrime.load_pandas().data print(states.head())

Pandas Built-in Datasets

Pandas itself doesn’t have built-in datasets, but you can easily load datasets from CSV, Excel, or other formats.

Reading from a URL (e.g., Titanic Dataset):

import pandas as pd url = “https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv” titanic = pd.read_csv(url) print(titanic.head())

Kaggle Datasets via Python API

You can use the Kaggle API to download datasets from Kaggle directly into your environment.

Install the Kaggle package:

pip install kaggle

Download a dataset:

!kaggle datasets download -d shivamb/netflix-shows

TensorFlow Datasets (for deep learning)

TensorFlow provides datasets for machine learning and deep learning tasks.

MNIST Dataset: A collection of handwritten digits for image classification.

import tensorflow as tf mnist = tf.keras.datasets.mnist (x_train, y_train), (x_test, y_test) = mnist.load_data() print(x_train.shape)

CIFAR-10 Dataset: Image dataset for object recognition tasks.

cifar10 = tf.keras.datasets.cifar10 (x_train, y_train), (x_test, y_test) = cifar10.load_data() print(x_train.shape)

These built-in datasets in Python libraries provide excellent starting points for learning data analysis, machine learning, and visualization skills. You can access them without needing to download any external data.

There are several popular datasets that are excellent for learning data analysis. Here are some of the best options:

Iris Dataset: This classic dataset contains 150 observations of iris flowers, with features like petal length, petal width, sepal length, and sepal width. It’s widely used for beginners in data analysis and machine learning.

Source: UCI Machine Learning Repository

Titanic Dataset: This dataset from Kaggle contains information about passengers on the Titanic and whether they survived. It’s often used to practice data cleaning, exploratory data analysis, and classification tasks.

Source: Kaggle Titanic Dataset

California Housing Dataset: This dataset contains information about house prices in different districts of California and is great for learning regression techniques.

Source: Kaggle California Housing Prices

Retail Sales Dataset: This dataset provides transactional data for retail stores and is ideal for practicing time series analysis, forecasting, and sales trend analysis.

Source: Kaggle Retail Sales Data

NYC Airbnb Dataset: This dataset contains Airbnb listings in New York City and is perfect for learning about exploratory data analysis and visualization.

Source: Kaggle NYC Airbnb Open Data

World Happiness Report Dataset: This dataset contains survey results from people across the world about their happiness levels, life expectancy, and GDP. It is good for correlation analysis and visualization.

Source: Kaggle World Happiness Report

These datasets are freely available and provide diverse contexts for practicing essential data analysis skills such as data cleaning, visualization, and statistical analysis.

Here are more datasets to help you practice and enhance your data analysis skills:

Employee Attrition Dataset: This dataset contains employee data and is used to predict employee attrition (turnover) in an organization, making it ideal for HR analytics and classification tasks.

Source: Kaggle IBM HR Analytics

Netflix Movies and TV Shows Dataset: This dataset contains information on all movies and TV shows on Netflix, including ratings, genres, and release dates. It’s great for content recommendation system practice.

Source: Kaggle Netflix Movies and TV Shows

COVID-19 Dataset: This dataset tracks COVID-19 cases, recoveries, deaths, and more. It’s perfect for time-series analysis, predictive modeling, and visualization.

Source: Kaggle COVID-19 Dataset

Credit Card Fraud Detection Dataset: This dataset contains transactions made by European cardholders, with labeled fraud cases. It’s great for practicing classification and anomaly detection.

Source: Kaggle Credit Card Fraud Detection

Mall Customers Dataset: This dataset includes information about customer segmentation in a mall based on annual income, spending scores, and gender. It’s used for clustering and segmentation analysis.

Source: Kaggle Mall Customers

Global Terrorism Database: This comprehensive dataset records incidents of terrorism worldwide. It’s useful for geographical analysis and understanding global security trends.

Source: Global Terrorism Database

Superstore Sales Dataset: This dataset contains transactional data from a hypothetical store and is used for sales analysis, identifying trends, and forecasting.

Source: Kaggle Superstore Dataset

Heart Disease Dataset: This dataset contains medical data related to heart disease, and it’s often used for predicting heart conditions using classification algorithms.

Source: UCI Machine Learning Repository Heart Disease Dataset

Uber Pickup Data: This dataset contains information about Uber pickups in New York City. It’s great for geographical and time-series analysis.

Source: Kaggle Uber Pickup Data

Students Performance Dataset: This dataset contains student exam scores, gender, parental education, and more, making it perfect for educational analysis and statistical insights.

Source: Kaggle Students Performance

Amazon Product Reviews Dataset: This dataset includes product reviews and ratings from Amazon customers. It’s ideal for sentiment analysis and recommendation systems.

Source: Kaggle Amazon Product Reviews

These datasets cover a wide range of industries and analysis techniques, making them great resources for gaining hands-on experience in data analysis.