paint-brush
A Beginner's Guide to Understanding SQL Window Functions and Their Capabilitiesby@yonatansali
6,128 reads
6,128 reads

A Beginner's Guide to Understanding SQL Window Functions and Their Capabilities

by Yonatan SaliJuly 23rd, 2023
Read on Terminal Reader
Read this story w/o Javascript

Too Long; Didn't Read

Key Takeaways: A window function performs a calculation across a set of rows that are somehow related to the current row, Main types of functions to which window functions are applied are aggregate, ranking and value functions, To use a window function, you need to apply the over() clause that defines a window (a set of rows) within a query result set. The window function then calculates a value for each row in the window, To specify the column for which you want to perform aggregation, you need to add the partition by clause to the over() clause. Partition by is somewhat similar to grouping but returns all the rows with the aggregate function applied, instead of one row per group.
featured image - A Beginner's Guide to Understanding SQL Window Functions and Their Capabilities
Yonatan Sali HackerNoon profile picture


Throughout the hiring process, I had the pleasure of meeting many talented candidates for my team. Since our work involves handling complex data sets, it was important for me to gauge each candidate's ability to find smart solutions. I asked about their experience with window functions in SQLto assess their proficiency. While most of them knew about these functions, few were able to use them effectively.

Even though window functions have been around for almost 20 years, lots of SQL developers still find them difficult to grasp. It's not uncommon for even experienced developers to just copy and paste code from StackOverflow without really understanding what it does. This article is here to help! I'll explain window functions in a way that's easy to understand and provide examples to show you how they work in the real world.


Have you heard about window functions? They're awesome analytical tools that can solve a lot of problems. For instance, let's say you need to calculate a set of rows that share a common attribute, like a client ID. This is where window functions come in handy! They work like aggregate functions but let you keep the uniqueness of each row instead of grouping them together. Plus, the results of window functions show up as an extra field in the output selection. This is super helpful when you're making analytical reports, computing moving averages and running totals, or figuring out different attribution models.


Welcome to the world of SQL and window functions! If you're just starting out, you're in the right place. This article is beginner-friendly, with clear explanations and no complicated terminology or advanced concepts. You'll be able to follow along with ease, even if you're completely new to the topic.


Content Overview

  • Types of Functions Used with Window Functions
    • Aggregate functions
    • Ranking functions
    • Value functions
  • Aggregate Window Functions
  • Key Takeaways



Types of Functions Used with Window Functions

There are three main types of functions to which window functions can be applied over a set of rows (a so-called window): these are aggregate, ranking, and value functions. In the image below, you can see the names of different functions that fall into each category.



Aggregate functions

These perform mathematical operations on a data group, resulting in a single cumulative value. They are used to calculate various aggregates, including the average, the total number of rows, the maximum or minimum values, or the total sum within each window or partition.


  • SUM: adds up all the values in the column

  • COUNT: calculates the number of values in the column, excluding NULL values

  • AVG: finds the average value in the column

  • MAX: identifies the highest value in the column

  • MIN: identifies the lowest value in the column


Ranking functions

These are used to give each row in a partition a rank or order. This is done by evaluating specific criteria, like assigning sequential numbers or basing the ranking on specific values.


  • ROW_NUMBER: assigns a sequential rank number to each new record in a partition
  • RANK: specifies the rank for each row in the result set. In this case, if the system detects identical values, it will assign the same rank to them and skip the next value.
  • DENSE_RANK: assigns a rank to each row within a partition of the result set. Unlike the RANK function, the function returns ranks for identical values without skipping any subsequent values.
  • NTILE: allows us to determine which group the current line belongs to. The number of groups is given in brackets.

Value functions

These make it easy to compare values between different rows in a group and also let you compare values with the first or last value in that group. This means you can easily move through different rows in a window and check out values at the beginning or at the end of the window.


  • LAG or LEAD: access data from the previous or the subsequent row without having to perform a self-join operation. These functions are particularly helpful when solving problems that require comparing a row with another row within the same result set or partition, such as calculating differences over time.
  • FIRST_VALUE or LAST_VALUE: retrieve the first or the last value from a defined window or partition. These functions are especially useful when you want to calculate differences within a specific time period.




To get started with the window functions, let's create a hypothetical 'salary' table and fill it with data.


Table creation:

create table salary
(
employee_id smallint,
employee_name varchar(10),
department varchar(20),
city varchar(20),
salary real
)


Filling the table:

insert into salary values ( 1 ,'Tony' ,'R&D', 'New York', 3000);
insert into salary values ( 2 ,'James' ,'Project management', 'London', 4000);
insert into salary values ( 3 ,'Dina' ,'Engineering', 'Tokyo', 5000);
insert into salary values ( 4 ,'Annie' ,'Security', 'London', 3000);
insert into salary values ( 5 ,'Tom' ,'R&D', 'New York', 3500);
insert into salary values ( 6 ,'Stan' ,'Project management', 'New York', 4200);
insert into salary values ( 7 ,'Jessa' ,'Sales', 'London', 5300);
insert into salary values ( 8 ,'Ronnie' ,'R&D', 'Tokyo', 2900);
insert into salary values ( 9 ,'Andrew' ,'Engineering', 'New York', 1500);
insert into salary values (10,'Dean' ,'Sales', 'Tokyo', 3700)


Let's check if we have filled the 'salary' table successfully:

select *
from salary




The next query will show the names and salaries of the employees from our table:

select employee_name, salary
from salary

Calculating the sum of salaries, average salary, maximum, minimum, and number of rows are some common use cases of aggregate functions:

When an aggregate function is applied, the salaries are aggregated and shown in one line.

But what if we want to display the names and salaries of the employees from the 'salary' table, and in the third column, the sum of all salaries? This value must be the same for all rows.


It’s a great opportunity to use a window function!

select
employee_name,
salary,
sum(salary) over() as sum_salary
from salary



Let's take a closer look at the window function that calculates the sum of the salaries in each row ofsum(salary) over().


The over() expression defines a window or a set of rows over which the function operates. In our example, the window is the entire table, meaning the function will apply across all the rows.

The over() expression only works when paired with functions that were requested before over().


For example, sum(salary) over(), where sum() is an aggregate function. And the whole expression sum(salary) over() is an aggregate window function.


As I said before, all functions to which window functions are applied can be divided into three groups: aggregate, ranking, and value functions.

Aggregate functions sum(), count(), avg(), min(), max() together with the over() expression make a group of aggregate window functions.


In this article, we'll concentrate on this specific type of window functions.



Aggregate Window Functions

Back to the examples!


Let's request the names of employees; their salaries; the sum of all salaries; average, maximum, and minimum salary; the number of employees.


select
employee_name,
salary,
sum(salary) over(),
avg(salary) over(),
max(salary) over(),
min(salary) over(),
count(*) over()
from salary


Now that it’s clearer what window functions are, let's explore some cases where they can be useful in your work.


select
employee_name,
salary,
sum(salary)over(),
salary/sum(salary)over() as share
from salary
order by salary/sum(salary)over() desc


We have calculated the percentage of the total salary budget for each salary in the fourth column. Jessa's salary amounts to nearly 15% of the whole budget for salaries.


Note that we also placed the formula that calculates percentages salary/sum(salary)over() in the sort after order by. A window function can be found not only in the output select, but also in sorting order by.



Another example: let's compare the salaries with the average salary for the company.

select
employee_name,
salary,
avg(salary)over(),
salary-avg(salary)over() as diff_salary
from salary
order by salary-avg(salary)over()


As we can see, Andrew's salary is 2110 less than average, and Jessa's is 1690 above the average.



Let's request three columns: employee name, department, and salary. Also, we'll sort them by department.

select
employee_name,
department,
salary
from salary
order by department


Now we'll request the same three columns, plus a column with the sum of the salaries of all employees. You already know that it can be done with a window function.


select
employee_name,
department,
salary,
sum(salary)over()
from salary
order by department


But what if we want to request not the sum of all salaries, but the sum of salaries for each department, as shown in the last column:

The Engineering Department's employees have a salary of 6500, the PM Department has a salary of 8200, R&D – 9400, Sales – 9000, and Security Department – 3000.



We can do it by adding the partition by parameter to the over() expression:

select
employee_name,
department,
salary,
sum(salary)over(),
sum(salary)over(partition by department)
from salary
order by department



Partition by allows us to apply the window function not to all rows (the entire window), but to column sections.


Doesn't it look like a simple grouping? To calculate the sum of salaries for each department, we would make a grouping by departments (sections in the slang of window functions) and calculate the amount:


select department, sum(salary)
from salary
group by department


In essence, the difference between grouping and partition by is that group by returns one row per group, while partition by, although the results of the function are identical to the results of an aggregate function with group by, provides all the rows with the aggregate function based on a group.


Let's go back to the window functions:

select
employee_name,
department,
salary,
sum(salary)over(partition by department),
salary/sum(salary)over(partition by department) as shape
from salary
order by department


Using the window function, particularly the partition by parameter, we can calculate the share of each employee's salary from the sum of the department's salaries. Or, for example, to compare salaries with the average salary in the department.


Key Takeaways


To recap:


  • A window function performs a calculation across a set of rows that are somehow related to the current row,

  • Main types of functions to which window functions are applied are aggregate, ranking and value functions,

  • To use a window function, you need to apply the over() clause that defines a window (a set of rows) within a query result set. The window function then calculates a value for each row in the window,

  • To specify the column for which you want to perform aggregation, you need to add the partition by clause to the over() clause. Partition by is somewhat similar to grouping but returns all the rows with the aggregate function applied, instead of one row per group.


This is it for now! In the next couple of articles, I’m going to explore more advanced SQL concepts with simple examples suitable for beginners, so stay tuned!