An image dataset contains specially selected digital images intended to help train, test, and evaluate an artificial intelligence (AI) or machine learning (ML) algorithm, usually a computer vision algorithm.
A face dataset is a type of image dataset that includes images of curated human faces, typically for an ML project. There are several publicly available face datasets that you can leverage instead of collecting your own training data. Managing and optimizing datasets for machine learning is one of the crucial stages in a
Here are the most widely used face datasets.
The CelebFaces Attributes (CelebA) dataset is useful for training or testing models for several computer vision tasks, such as face detection, face attribute recognition, facial landmark localization, face synthesis, and face image editing.
The dataset is especially large, covering 10,177 celebrity identities and a total of 202,599 face images, each annotated with five landmark locations and 40 binary attributes.
Originally intended as a benchmark for generative adversarial networks (GANs), the Flickr-Faces-HQ (FFHQ) dataset includes approximately 70,000 PNG images. The images are high quality, with a resolution of 1024×1024 pixels.
The Labeled Faces in the Wild (LFW) image dataset contains curated face photographs intended for research on unconstrained face recognition.
It consists of four separate image sets: the original set and three aligned sets with different types of images, used for testing algorithms under different conditions. The aligned sets are LFW-a, funneled images (ICCV 2007), and deep-funneled images (NIPS 2012). For most face recognition algorithms, LFW-a and deep-funneled images produce better results than the original or funneled images.
This dataset has more than 13,000 face images collected from different online sources.
The official webpage of the CelebA dataset is on
PyTorch provides the dataset directly through its torchvision.datasets module. Users can import the dataset class and control which variant is loaded through its parameters. The class has the following signature:
torchvision.datasets.CelebA(root, split='train', target_type='attr', transform=None, target_transform=None, download=False)
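As a rough sketch of how this signature might be used, the snippet below checks the split and target_type values before building the dataset. The helper names celeba_kwargs and load_celeba are hypothetical and not part of torchvision; only torchvision.datasets.CelebA and its parameters come from the library. torchvision is imported lazily, so the validation helper runs without it.

```python
VALID_SPLITS = {"train", "valid", "test", "all"}
VALID_TARGETS = {"attr", "identity", "bbox", "landmarks"}


def celeba_kwargs(root, split="train", target_type="attr", download=False):
    """Check the arguments and return them as keyword arguments for CelebA.

    Hypothetical helper for illustration; not part of torchvision.
    """
    if split not in VALID_SPLITS:
        raise ValueError(f"split must be one of {sorted(VALID_SPLITS)}, got {split!r}")
    # target_type may be a single string or a list of strings.
    targets = [target_type] if isinstance(target_type, str) else list(target_type)
    unknown = [t for t in targets if t not in VALID_TARGETS]
    if unknown:
        raise ValueError(f"unknown target_type value(s): {unknown}")
    return {"root": root, "split": split, "target_type": targets, "download": download}


def load_celeba(**kwargs):
    """Build the dataset; torchvision is imported only when this is called."""
    from torchvision import datasets

    return datasets.CelebA(**celeba_kwargs(**kwargs))
```

For example, load_celeba(root='./data', split='valid', target_type=['attr', 'landmarks'], download=True) would request the validation split with both attribute and landmark labels for each image. The ./data path is only a placeholder.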
Here is how each parameter is used:
root – specifies where the dataset will be downloaded to
split – specifies which part of the dataset is downloaded; can be 'train', 'valid', 'test', or 'all'
transform – a function that transforms an image
target_type – specifies which kind of label is returned for each image; can be one or more of:
attr: labels the attributes with binary values
identity: labels each image with the person's identity
bbox: specifies the dimensions of each image's bounding box
landmarks: specifies each image's landmark features
TensorFlow lets users access the dataset directly through its tfds module. Users can download the dataset with the following command:
tfds.load('celeb_a', split='train', download=True)
Since the dataset is pre-split into three parts ('train', 'test', and 'validation'), the split parameter controls which part of the dataset is loaded. The dataset comes with a feature dictionary in which each attribute is a boolean, so the user can choose which attributes to keep for each downloaded picture.
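One way to keep only some attributes is sketched below. It assumes each example is a dictionary with an "image" entry and an "attributes" dictionary of booleans, and that attribute names such as "Smiling" and "Eyeglasses" are present. The helper names keep_attributes and load_celeb_a are hypothetical, and tensorflow_datasets is imported only when the loader is called.

```python
def keep_attributes(example, wanted):
    """Return the image plus only the requested boolean attributes."""
    return {
        "image": example["image"],
        "attributes": {name: example["attributes"][name] for name in wanted},
    }


def load_celeb_a(split="train", wanted=("Smiling", "Eyeglasses")):
    """Load one split and drop every attribute not listed in `wanted`.

    Hypothetical helper; requires tensorflow_datasets to be installed.
    """
    import tensorflow_datasets as tfds

    dataset = tfds.load("celeb_a", split=split, download=True)
    # Apply the filter lazily to every example in the split.
    return dataset.map(lambda example: keep_attributes(example, wanted))
```

Because keep_attributes only touches dictionary keys, it can be tried on a plain Python dictionary before it is applied to the real dataset.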
The FFHQ dataset came into use when researchers trained an architecture on it using an alternative generative modeling technique called MvM. The technique differs from a traditional GAN in that it models geometric quantities such as p-diameters and centroids.
The dataset comes with JSON metadata, a script for downloading it, and its documentation. There are two main ways to access the dataset:
The script accepts the following arguments to customize the download process:
--json: Downloads the dataset's metadata as a JSON file
--stats: Displays the dataset's statistics
--images: Downloads the images in PNG format at a resolution of 1024x1024 pixels (total download size: 89.1 GB)
--thumbs: Downloads the images in PNG format at a resolution of 128x128 pixels (total download size: 1.95 GB)
--wilds: Downloads the original in-the-wild images in PNG format (total download size: 955 GB)
--tfrecords: Downloads the multi-resolution TFRecords (total download size: 273 GB)
--align: Recreates the 1024x1024 images from the in-the-wild images
--num_threads: Sets the number of concurrent threads used to download the dataset
--num_attempts: Sets the number of times the script tries to download each image file in the dataset
--no-rotation: Keeps the original orientation of the images instead of rotating them during alignment
--no-padding: Skips the blur-padding otherwise applied around and near the image borders
--source-dir: Specifies the local directory that holds existing FFHQ source data
The LFW dataset comes with two loaders in scikit-learn: fetch_lfw_people for face identification and fetch_lfw_pairs for face verification. This tutorial uses the memmapped version stored in the ~/scikit_learn_data/lfw_home/ folder through the joblib utility.
fetch_lfw_people Loader
This loader supports face identification, a supervised task that classifies faces into multiple classes. This tutorial shows how to import the LFW dataset and print the names of the people it contains.
To use the fetch_lfw_people loader:
from sklearn.datasets import fetch_lfw_people
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
for name in lfw_people.target_names:
    print(name)
Each face in the dataset is assigned a single person id in the target array.
lfw_people.target.shape
list(lfw_people.target[:10])
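Since each entry in target is an index into target_names, the two arrays can be combined to count how many images each person has. The sketch below uses only the standard library, and the helper name images_per_person is hypothetical.

```python
from collections import Counter


def images_per_person(target, target_names):
    """Map each person's name to the number of images labeled with their id."""
    counts = Counter(int(label) for label in target)
    # Sort by person id so the result follows the order of target_names.
    return {str(target_names[label]): n for label, n in sorted(counts.items())}
```

With the loader call above, images_per_person(lfw_people.target, lfw_people.target_names) should report at least 70 images per person, because min_faces_per_person=70 filters out everyone with fewer.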
The fetch_lfw_pairs loader comes in handy for checking whether two pictures belong to the same person. When calling the loader, it is important to specify which subset of the dataset to load: 'train', 'test', or '10_folds'.
To use the fetch_lfw_pairs loader, import it, load a subset, and list the class names assigned to the face image pairs:
from sklearn.datasets import fetch_lfw_pairs
lfw_pairs_train_subset = fetch_lfw_pairs(subset='train')
list(lfw_pairs_train_subset.target_names)
The last command returns a list of two items:
['Different persons', 'Same person']
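These two names are the classes of the verification task: every pair carries a label in the target array that indexes into target_names. As a small sketch, the hypothetical helper below counts how many pairs fall into each class, which is a quick way to check whether a subset is balanced.

```python
def summarize_pairs(target, target_names):
    """Count the pairs in each class, keyed by the class name."""
    summary = {str(name): 0 for name in target_names}
    for label in target:
        # Each label is an index into target_names.
        summary[str(target_names[int(label)])] += 1
    return summary
```

It can be called as summarize_pairs(lfw_pairs_train_subset.target, lfw_pairs_train_subset.target_names) once the subset has been fetched.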
In this article, I covered three of the most popular face datasets you can use to build your own face recognition and face detection models—CelebFaces, FFHQ, and LFW. I showed technical details that can help you retrieve the datasets and use them in your model code. I hope this will give you a head start on your next computer vision project.