One of the most important steps of any data-related project is Exploratory Data Analysis (EDA). It is crucial to explore the distribution of the data to understand it and efficiently decide on the next steps. A straightforward way to explore data distribution is to study its central tendencies through measures of centrality. In this article, we’ll explore the three primary measures of centrality: Mean, Median, and Mode. We’ll discuss their strengths and weaknesses, along with practical examples using SQL and Python.
The mean, often referred to as the average, is calculated by summing all the values in a dataset and dividing by the number of values. It’s a straightforward way to find a central value.
Let’s say we have a table named Sales with a column Revenue. The following query will give us the mean revenue per sale:
SELECT AVG(Revenue) AS MeanRevenue
FROM Sales;
When using Python, we can obtain the same result by importing Pandas and running the following code:
import pandas as pd

# assuming your sales data is already loaded into a DataFrame named df
mean_revenue = df['Revenue'].mean()
The mean leaves no data point behind: every value in the dataset is taken into consideration. While this gives us a holistic view of the data, it also makes the mean highly susceptible to outliers.
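To see that sensitivity in action, here’s a small sketch using made-up revenue figures; note how a single extreme sale drags the mean far away from the typical value:

```python
import pandas as pd

# hypothetical revenue figures clustered around 100
revenues = pd.Series([100, 110, 90, 105, 95])
print(revenues.mean())  # 100.0

# append one outlier sale and recompute
with_outlier = pd.concat([revenues, pd.Series([10000])], ignore_index=True)
print(with_outlier.mean())  # 1750.0
```

A single value shifted the mean from 100 to 1750, even though five of the six sales are still near 100.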
The median is the middle value when the data is sorted, meaning there is an equal number of data points before and after it.
Note: If there’s an even number of observations, the median is obtained by averaging the two middle values.
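As a quick sketch with made-up numbers: for the sorted values 10, 20, 30, 40, the two middle values 20 and 30 average to 25:

```python
import pandas as pd

# even number of observations: the median averages the two middle values
values = pd.Series([40, 10, 30, 20])  # sorted: 10, 20, 30, 40
print(values.median())  # 25.0
```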
We can obtain the median revenue using the following SQL query (this uses SQL Server’s window-function syntax; other dialects such as PostgreSQL support PERCENTILE_CONT as an ordered-set aggregate instead):
SELECT
DISTINCT PERCENTILE_CONT(0.5)
WITHIN GROUP (ORDER BY Revenue) OVER() AS MedianRevenue
FROM Sales;
Using Python, however, obtaining the median is much more straightforward:
median_revenue = df['Revenue'].median()
Since the median only considers the order of data points, it is not affected by outliers, making it a great indicator of the center when the data is not symmetrically distributed. However, this strength is also its weakness—it fails to capture the whole dataset.
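As a quick illustration with hypothetical revenue figures, a large outlier that would wreck the mean barely moves the median:

```python
import pandas as pd

# hypothetical revenue figures, then the same data with one outlier added
revenues = pd.Series([100, 110, 90, 105, 95])
with_outlier = pd.concat([revenues, pd.Series([10000])], ignore_index=True)

print(revenues.median())      # 100.0
print(with_outlier.median())  # 102.5
```

The outlier only nudges the median from 100 to 102.5, because the median depends on the ordering of the data, not on the magnitude of extreme values.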
The mode is the value that appears most frequently in a dataset. It’s particularly useful for categorical data where we want to know the most common category.
To find the mode in SQL, we can execute the following query (note that if several values tie for the highest frequency, LIMIT 1 returns only one of them):
SELECT Revenue, COUNT(*) AS Frequency
FROM Sales
GROUP BY Revenue
ORDER BY Frequency DESC
LIMIT 1;
And for the Pythonistas out there, it’s even simpler:
# .mode() returns a Series, since a dataset can have more than one mode
mode_revenue = df['Revenue'].mode()
The mode is useful for providing categorical insights, making it ideal for identifying the most common category or categories (bimodal or multimodal) in non-numeric datasets. However, it has limitations: if all values are unique, there may be no mode at all, and it tends to be less relevant for continuous data compared to the mean or median.
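Here’s a small sketch with hypothetical category data showing the multimodal case: because two categories tie for the top frequency, `.mode()` returns both:

```python
import pandas as pd

# hypothetical product categories; 'A' and 'B' are tied for most frequent
categories = pd.Series(['A', 'B', 'A', 'B', 'C'])
print(categories.mode().tolist())  # ['A', 'B']
```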
The most efficient way to understand a dataset’s central tendencies is by considering all three measures of centrality. Each provides unique insights into the data distribution, allowing us to build a more comprehensive understanding of its shape.
By observing all three measures, we get a clearer understanding of the data. If there’s a large difference between the mean and median, that could be a red flag for outliers or a skewed distribution, which would require further analysis or potential data transformations.
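One simple sketch of such a check, using hypothetical right-skewed revenue data where a few very large sales inflate the mean well above the median:

```python
import pandas as pd

# hypothetical right-skewed revenue: two large sales inflate the mean
df = pd.DataFrame({'Revenue': [100, 120, 110, 95, 105, 5000, 8000]})

mean = df['Revenue'].mean()      # pulled up by the large sales
median = df['Revenue'].median()  # stays near the typical sale

# a large relative gap between mean and median hints at skew or outliers
if abs(mean - median) / median > 0.5:
    print('Large mean-median gap: check for outliers or skew')
```

The 0.5 threshold here is an arbitrary illustration; in practice you would pair a check like this with a histogram or box plot before deciding on transformations.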
Exploring central tendencies is a critical part of exploratory data analysis. The mean, median, and mode each serve different purposes, and understanding when and how to use them is critical for interpreting data correctly. Whether you’re working with continuous or categorical data, applying these measures helps uncover patterns and insights that guide the upcoming steps—whether it’s cleaning, transforming, or modeling.