Data powers machine learning algorithms, and scikit-learn makes a number of well-known datasets available out of the box. Sklearn datasets are included as part of the scikit-learn (sklearn) library, so they come bundled with the library and need no separate download.
To use a specific dataset, you can simply import it from sklearn.datasets module and call the appropriate function to load the data into your program.
These datasets are usually pre-processed and ready to use, which saves time and effort for data practitioners who need to experiment with different machine learning models and algorithms.
The Iris dataset includes measurements of the sepal length, sepal width, petal length and petal width of 150 iris flowers, which belong to 3 different species: setosa, versicolor and virginica. The dataset has 150 rows and 5 columns (the four measurements plus the species of each flower) and can be loaded as a dataframe.
The variables include:
- sepal length (cm)
- sepal width (cm)
- petal length (cm)
- petal width (cm)
- species (setosa, versicolor or virginica)
You can load the iris dataset directly from sklearn using the load_iris function from the sklearn.datasets module.
# To install sklearn
pip install scikit-learn
# To import sklearn
from sklearn.datasets import load_iris
# Load the iris dataset
iris = load_iris()
# Print the dataset description
print(iris.DESCR)
Code for loading the Iris dataset using sklearn. Retrieved from
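Beyond the description, the returned Bunch object exposes the data directly. A quick sketch of the attributes you will typically use (the as_frame option assumes pandas is installed):

```python
from sklearn.datasets import load_iris

iris = load_iris()

# The returned Bunch exposes the features, targets and metadata
X = iris.data             # shape (150, 4): the four measurements
y = iris.target           # shape (150,): species encoded as 0, 1, 2
print(iris.feature_names)
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']

# Optionally, load it straight into a pandas DataFrame
df = load_iris(as_frame=True).frame
print(df.head())
```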
This sklearn dataset contains information on 442 patients with diabetes, including demographic and clinical measurements:
- age
- sex
- body mass index (bmi)
- average blood pressure (bp)
- s1 to s6: six blood serum measurements
The target variable is a quantitative measure of disease progression one year after baseline.
The Diabetes dataset can be loaded using the load_diabetes() function from the sklearn.datasets module.
from sklearn.datasets import load_diabetes
# Load the diabetes dataset
diabetes = load_diabetes()
# Print some information about the dataset
print(diabetes.DESCR)
Code for loading the Diabetes dataset using sklearn. Retrieved from
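Since the dataset is built for regression, a minimal sketch of a typical workflow might look like this (the train/test split and the LinearRegression choice are illustrative, not part of the dataset itself):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target  # 442 samples, 10 features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit a simple linear model and score it on the held-out data
model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # coefficient of determination
print(f"Test R^2: {r2:.3f}")
```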
This sklearn dataset is a collection of hand-written digits from 0 to 9, stored as grayscale images. It contains a total of 1,797 samples, each of which is a 2D array of shape (8, 8). There are 64 variables (or features) in the digits sklearn dataset, corresponding to the 64 pixels in each digit image.
The Digits dataset can be loaded using the load_digits() function from the sklearn.datasets module.
from sklearn.datasets import load_digits
# Load the digits dataset
digits = load_digits()
# Print the features and target data
print(digits.data)
print(digits.target)
Code for loading the Digits dataset using sklearn. Retrieved from
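Because each sample is stored both flattened (64 values) and as an 8x8 array, a short sketch can confirm the two views line up:

```python
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.data.shape)    # (1797, 64): each row is a flattened image
print(digits.images.shape)  # (1797, 8, 8): the same pixels as 2D arrays

# A flattened sample can be reshaped back into its 8x8 image
first_image = digits.data[0].reshape(8, 8)
print(digits.target[0])  # label of the first sample
```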
The Linnerud dataset contains physical exercise and physiological measurements of 20 middle-aged men in a fitness club.
The dataset includes the following variables:
- Exercise variables (features): Chins, Situps and Jumps
- Physiological variables (targets): Weight, Waist and Pulse
To load the Linnerud dataset in Python using sklearn:
from sklearn.datasets import load_linnerud
linnerud = load_linnerud()
Code for loading the linnerud dataset using sklearn. Retrieved from
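Linnerud is one of the few bundled multi-output regression datasets: three exercise variables predict three physiological variables. A quick sketch of its shape:

```python
from sklearn.datasets import load_linnerud

linnerud = load_linnerud()

# Both the features and the targets are (20, 3) arrays
print(linnerud.data.shape)
print(linnerud.target.shape)
print(linnerud.feature_names)  # ['Chins', 'Situps', 'Jumps']
print(linnerud.target_names)   # ['Weight', 'Waist', 'Pulse']
```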
This sklearn dataset contains the results of chemical analyses of wines grown in a specific area of Italy, to classify the wines into their correct varieties.
Some of the variables in the dataset:
- alcohol
- malic acid
- ash
- alcalinity of ash
- magnesium
- total phenols
- flavanoids
- color intensity
- hue
- proline
The Wine dataset can be loaded using the load_wine() function from the sklearn.datasets module.
from sklearn.datasets import load_wine
# Load the Wine dataset
wine_data = load_wine()
# Access the features and targets of the dataset
X = wine_data.data # Features
y = wine_data.target # Targets
# Access the feature names and target names of the dataset
feature_names = wine_data.feature_names
target_names = wine_data.target_names
Code for loading the Wine dataset using sklearn. Retrieved from
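As a sketch of how the features and targets plug into a classifier (the RandomForestClassifier here is just an illustrative choice):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

wine = load_wine()
X, y = wine.data, wine.target  # 178 samples, 13 features, 3 classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Fit a classifier and score it on the held-out data
clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```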
This sklearn dataset consists of information about breast cancer tumours and was initially created by Dr. William H. Wolberg to assist researchers and machine learning practitioners in classifying tumours as either malignant (cancerous) or benign (non-cancerous).
Some of the variables included in this dataset:
- mean radius
- mean texture
- mean perimeter
- mean area
- mean smoothness
In total there are 30 numeric features, recording the mean, standard error and worst value of 10 measurements of each tumour's cell nuclei.
You can load the Breast Cancer Wisconsin dataset directly from sklearn using the load_breast_cancer function from the sklearn.datasets module.
from sklearn.datasets import load_breast_cancer
# Load the Breast Cancer Wisconsin dataset
cancer = load_breast_cancer()
# Print the dataset description
print(cancer.DESCR)
Code for loading the Breast Cancer Wisconsin dataset using sklearn. Retrieved from
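A short sketch of the dataset's shape and class balance:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
print(cancer.data.shape)    # (569, 30)
print(cancer.target_names)  # ['malignant' 'benign']

# Class balance: counts of malignant (0) and benign (1) tumours
values, counts = np.unique(cancer.target, return_counts=True)
print(dict(zip(cancer.target_names, counts)))
```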
Real-world sklearn datasets are based on real-world problems and are commonly used to practice and experiment with machine learning algorithms and techniques using the sklearn library in Python.
The Boston Housing dataset consists of information on housing in the area of Boston, Massachusetts. It has 506 rows and 14 columns of data.
Some of the variables in the dataset include:
- CRIM: per-capita crime rate by town
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built before 1940
- TAX: property-tax rate
- LSTAT: percentage of lower-status population
- MEDV: median home value (the target variable)
You could load the Boston Housing dataset directly from scikit-learn using the load_boston function from the sklearn.datasets module. Note, however, that load_boston was deprecated in scikit-learn 1.0 and removed in version 1.2 over ethical concerns about one of its features, so the code below only runs on older versions:
from sklearn.datasets import load_boston
# Load the Boston Housing dataset (scikit-learn < 1.2 only)
boston = load_boston()
# Print the dataset description
print(boston.DESCR)
Code for loading the Boston Housing dataset using sklearn. Retrieved from
The Olivetti Faces dataset is a collection of grayscale images of human faces taken between April 1992 and April 1994 at AT&T Laboratories. It contains 400 images of 40 individuals, with 10 images of each person shot at different angles and under different lighting conditions.
You can load the Olivetti Faces dataset in sklearn by using the fetch_olivetti_faces function from the datasets module.
from sklearn.datasets import fetch_olivetti_faces
# Load the dataset
faces = fetch_olivetti_faces()
# Get the data and target labels
X = faces.data
y = faces.target
Code for loading the Olivetti Faces dataset using sklearn. Retrieved from
This sklearn dataset contains information on median house values, together with attributes of census block groups in California. It includes 20,640 instances and 8 features.
Some of the variables in the dataset:
- MedInc: median income in the block group
- HouseAge: median house age in the block group
- AveRooms: average number of rooms per household
- AveBedrms: average number of bedrooms per household
- Population: block group population
- Latitude and Longitude: block group location
The target variable is the median house value for the block group.
You can load the California Housing dataset using the fetch_california_housing function from sklearn.
from sklearn.datasets import fetch_california_housing
# Load the dataset
california_housing = fetch_california_housing()
# Get the features and target variable
X = california_housing.data
y = california_housing.target
Code for loading the California Housing dataset using sklearn. Retrieved from
The MNIST dataset is popular and widely used in the fields of machine learning and computer vision. It consists of 70,000 grayscale images of handwritten digits 0–9, with 60,000 images for training and 10,000 for testing. Each image is 28x28 pixels in size and has a corresponding label denoting which digit it represents.
You can load the MNIST dataset from sklearn using the following code:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
Note: the MNIST dataset is distinct from the smaller 8x8 Digits dataset bundled with sklearn; it is fetched from OpenML rather than shipped with the library.
Code for loading the MNIST dataset using sklearn. Retrieved from
The Fashion MNIST dataset was created by Zalando Research as a replacement for the original MNIST dataset. The Fashion MNIST dataset consists of 70,000 grayscale images (a training set of 60,000 and a test set of 10,000) of clothing items.
The images are 28x28 pixels in size and represent 10 different classes of clothing items, including T-shirts/tops, trousers, pullovers, dresses, coats, sandals, shirts, sneakers, bags, and ankle boots. It is similar to the original MNIST dataset, but with more challenging classification tasks due to the greater complexity and variety of the clothing items.
You can load this sklearn dataset using the fetch_openml function.
from sklearn.datasets import fetch_openml
fmnist = fetch_openml(name='Fashion-MNIST')
Code for loading the Fashion MNIST dataset using sklearn. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html#sklearn.datasets.fetch_openml on 30/3/2023.
Generated sklearn datasets are synthetic datasets, generated using the sklearn library in Python. They are used for testing, benchmarking and developing machine learning algorithms/models.
This function generates a random n-class classification dataset with a specified number of samples, features, and informative features.
Here's an example code to generate this sklearn dataset with 100 samples, 5 features, and 3 classes:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=5, n_informative=3, n_classes=3, random_state=42)
This code generates a dataset with 100 samples and 5 features, with 3 classes and 3 informative features. By default the remaining 2 features are redundant linear combinations of the informative ones.
Code for loading the make_classification dataset using sklearn. Retrieved from
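A quick sketch confirming the properties of the generated data:

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=5, n_informative=3,
                           n_classes=3, random_state=42)

print(X.shape)       # (100, 5)
print(np.unique(y))  # the 3 class labels: 0, 1, 2
# With the default weights the classes are roughly balanced
print(np.bincount(y))
```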
This function generates a random regression dataset with a specified number of samples, features, and noise.
Here's an example code to generate this sklearn dataset with 100 samples, 5 features, and noise level of 0.1:
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)
This code generates a dataset with 100 samples and 5 features, with a noise level of 0.1. The target variable y will be a continuous variable.
Code for loading the make_regression dataset using sklearn. Retrieved from
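Because make_regression draws the targets from an underlying linear model, passing coef=True also returns the ground-truth coefficients, which a fitted linear model should recover almost exactly. A sketch:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# coef=True additionally returns the generating coefficients
X, y, coef = make_regression(n_samples=100, n_features=5, noise=0.1,
                             coef=True, random_state=42)

# A linear model should recover those coefficients up to the noise
model = LinearRegression().fit(X, y)
print(np.abs(model.coef_ - coef).max())  # small estimation error
```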
This function generates a random dataset with a specified number of samples and clusters.
Here's an example code to generate this sklearn dataset with 100 samples and 3 clusters:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=3, random_state=42)
This code generates a dataset with 100 samples and 2 features (x and y coordinates), with 3 clusters centred at random locations; by default each cluster's points are spread with a standard deviation of 1.0.
Code for loading the make_blobs dataset using sklearn. Retrieved from
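Blob data is a natural fit for clustering algorithms; a minimal sketch with KMeans (an illustrative choice):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=100, centers=3, random_state=42)

# Three well-separated blobs are easy for KMeans to recover
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.cluster_centers_)  # one centre per blob
print(kmeans.labels_[:10])      # cluster assignments of the first samples
```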
These functions generate datasets with non-linear boundaries that are useful for testing non-linear classification algorithms.
Here's an example code for loading the make_moons dataset:
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
This code generates a dataset with 1000 samples and 2 features (x and y coordinates) with a non-linear boundary between the two classes, and with Gaussian noise of standard deviation 0.2 added to the data.
Code for loading the make_moons dataset using sklearn. Retrieved from
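To illustrate why such datasets are useful for testing non-linear classifiers, here is a sketch comparing a linear model with an RBF-kernel SVM (both are illustrative choices):

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# A linear model struggles with the curved boundary...
linear_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)
# ...while an RBF-kernel SVM can follow it
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)
print(linear_acc, rbf_acc)
```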
Here's an example code to generate and load the make_circles dataset:
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=1000, noise=0.05, random_state=42)
Code for loading the make_circles dataset using sklearn. Retrieved from
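make_circles places one class on an outer circle and the other on a concentric inner circle; the factor parameter controls the inner circle's radius relative to the outer one. A sketch:

```python
from sklearn.datasets import make_circles

# factor sets the inner circle's relative radius (default 0.8)
X, y = make_circles(n_samples=1000, noise=0.05, factor=0.5,
                    random_state=42)

print(X.shape)  # (1000, 2): x and y coordinates
print(set(y))   # {0, 1}: outer circle vs inner circle
```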
This function generates a sparse coded signal dataset that is useful for testing compressive sensing algorithms.
Here's an example code for loading this sklearn dataset:
from sklearn.datasets import make_sparse_coded_signal
data, dictionary, code = make_sparse_coded_signal(n_samples=100, n_components=10, n_features=50, n_nonzero_coefs=3, random_state=42)
This code generates a sparse coded signal dataset with 100 samples, 50 features, and a dictionary of 10 atoms; each signal is a combination of 3 atoms.
Code for loading the make_sparse_coded_signal dataset using sklearn. Retrieved from
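Note that the function returns three arrays, the signals, the dictionary and the sparse codes, and that the array orientation has changed across scikit-learn versions, so this sketch only checks version-independent properties:

```python
import numpy as np
from sklearn.datasets import make_sparse_coded_signal

# The signals are reconstructed by combining dictionary atoms
# according to the sparse codes
data, dictionary, code = make_sparse_coded_signal(
    n_samples=100, n_components=10, n_features=50,
    n_nonzero_coefs=3, random_state=42)

# Each of the 100 codes has exactly 3 non-zero coefficients
print(np.count_nonzero(code))  # 300 in total
```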
Sklearn datasets provide a convenient way for developers and researchers to test and evaluate machine learning models without having to manually collect and preprocess data.
They are also available for anyone to download and use freely.
The lead image of this article was generated via HackerNoon's AI Stable Diffusion model using the prompt 'iris dataset'.