paint-brush
Amazon Data Science Interview: Window Functions and Aliasingby@boyzone87
507 reads
507 reads

Amazon Data Science Interview: Window Functions and Aliasing

by Nate at StrataScratchMarch 1st, 2021
Read on Terminal Reader
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

I have an advanced data science interview question from Amazon today. This question is going to test your date manipulation and formatting skills as well as our window function knowledge. 

Company Mentioned

Mention Thumbnail
featured image - Amazon Data Science Interview: Window Functions and Aliasing
Nate at StrataScratch HackerNoon profile picture

I have an advanced data science interview question from Amazon today. This question is going to test your date manipulation and formatting skills as well as our window function knowledge. 

What I like about this question is that it’s a common analysis that you’d be performing on the job and it’s the perfect question that you’d get on an interview because it would not only tests your advanced data manipulation skills but also your communication skills as you guide the interviewer through your solution. I’m always trying to keep my examples as relevant as possible.

Date manipulations and window functions can be considered advanced SQL skills, but the combination of these 2 technical concepts make this question even more difficult. So let’s break this question down -- I’m going to be really detailed as I guide you through building the solution so you know exactly what’s going on and how the data is being manipulated at each step. 

Let’s start coding. You can watch the video if you don’t feel like reading. 

Also here’s the question in case you want to code with me.

Approach

I like to use the same approach for every question I try to answer. It always involves 4 steps.

1. Explore the underlying data to study the columns and behavior of the data.

Here we can see that we have a date column, created_at, and a value column. The date column is not of date data type, which is common, as sometimes the data isn’t read in as a date, in which case we’ll need to cast or convert this column into a date column. 

The value column is what will be summed up and aggregated into year-month. But it’s important to know that the created_at column is on a day scale, so we’ll need to aggregate this date column to year-month.

2. Identify the required columns

Here we’ve identified that we need the created_at column and the value column. We don’t need the other two columns. 

3. Visualize the output and identify what columns to output and what aggregations are needed

According to the question, the output will require year-month and month-over-month percentage change calculated using ((this month's revenue - last month's revenue) / last month's revenue)*100. So there’s a few aggregations we’re going to need to implement to change the date to year-month and to implement the percentage change formula. But we only need 2 columns in the output -- year-month and percentage change. All the work will be to try to aggregate our data to fit this output.

4. Build solution step-by-step and test

Start coding but build your solution step-by-by and test out the query each time you add logic or business rule. 

Solution

Date format using to_char()

We’ll need to first convert the date column into a date data type because when you look at the table schema, you can see that it’s an object. If it already a date data type then you wouldn’t have to perform this operation. In postgres, we can use this double colon, but with other SQL engines like HIVE and MySQL, you will need to use the more traditional cast() function. The cast function would look like this --cast(created_at as date).

Once you’ve converted the created_at column to a date data type, you then want to format the date to year-month using the to_char() function. The to_char() function recognizes the date column and will format the date to your specification. In this case, we want year and month as ‘YYYY-MM’. This column, year_month, is no longer a date. It is now a character string.

SELECT 
  to_char(created_at::date, 'YYYY-MM') AS year_month,
       *
FROM sf_transactions

This operation is extremely common in analytics. It’s very common to aggregate data on a month level and when you’re dealing with data that spans multiple years, it makes the most sense to display year-month to separate out the years. 

Calculating month-over-month percentage change in revenue

Next we’re going to calculate the month-over-month percentage change in revenue for the year_month by implementing this formula: 

(this month's revenue - last month's revenue) / last month's
revenue

This might look like an easy formula to implement but it actually requires advanced window functions to be able to implement correctly. Let’s build this calculation step-by-step in SQL so you know exactly what’s going on under the hood.

Calculate the sum of the current month’s (or day’s value)

The query will calculate the value for that date in the created_at column.

SELECT to_char(created_at::date, 'YYYY-MM') AS year_month,
       created_at,
       sum(value),
FROM sf_transactions
GROUP BY created_at

Calculate last month’s revenue (or the last records revenue)

Next we’re going to calculate the last month’s revenue or in this case we’re really going to calculate the previous row’s revenue using the lag() function.

SELECT to_char(created_at::date, 'YYYY-MM') AS year_month,
       created_at,
       sum(value),
       lag(sum(value), 1) OVER
(ORDER BY created_at::date)
FROM sf_transactions
GROUP BY created_at

The lag() function will take the previous record because we have a 1 as a parameter. This parameter tells us which record to take so a value of 1 means to take the previous value while a value of 2 means to take the value 2 rows above the current row. Since we just want the previous row in this case, we just need a value of 1. 

What you’re seeing in the output is not an aggregation of year-month just yet. We’re keeping everything as day so that I can start with raw values before starting to aggregate things.

By calculating both the current days and previous days revenue, we have everything we need to implement the percentage change formula: (this month's revenue - last month's revenue) / last month's revenue

Aggregate the dates to year-month

Now that we have all the components we need to calculate the month over month difference in revenue -- year_month, sum of values of the current month, sum of values of the previous month -- it’s time to aggregate by year-month, rather than keeping the data to the specific date.

In this OVER() part of the query, we can reformat the create_at column so that it’s year_month. Then all we need to do is change the GROUP BY to year_month so that we’re aggregating to year_month rather than the day. The query looks like this.

SELECT to_char(created_at::date, 'YYYY-MM') AS year_month,
       sum(value),
       lag(sum(value), 1) OVER
(ORDER BY to_char(created_at::date, 'YYYY-MM'))
FROM sf_transactions
GROUP BY year_month

Implement the month-over-month difference formula

We have all our components separated, but the purpose of doing this is to make sure that everything we’re returning is what we’re expecting. I like to separate everything out into its own components and perform aggregations and apply logic one by one so that I know how the data is
behaving. 

Let’s implement the month-over-month formula now by combining the current and previous month’s sum of revenues.

Here’s what the query would look like if we separate the numerator with the denominator:

SELECT to_char(created_at::date, 'YYYY-MM') AS year_month,
       (sum(value) - lag(sum(value), 1) OVER (ORDER BY to_char
       (created_at::date, 'YYYY-MM'))) as numerator,
       lag(sum(value), 1) OVER (ORDER BY to_char(created_at::date, 
       'YYYY-MM')) as denominator
FROM sf_transactions
GROUP BY year_month

Here’s what the query would look like if we created the ratio:

SELECT to_char(created_at::date, 'YYYY-MM') AS year_month,
       (sum(value) - lag(sum(value), 1) OVER (ORDER BY to_char
       (created_at::date, 'YYYY-MM'))) /
       lag(sum(value), 1) OVER (ORDER BY to_char(created_at::date, 
       'YYYY-MM')) as revenue_diff
FROM sf_transactions
GROUP BY year_month

Use a window alias to for readability

The above query is hard to read and there are pieces of code that are duplicated. You don’t really want duplicated code because if you do decide to change that part of the code, you’d need to change it in all the locations where they’re duplicated. For example, if you decided to use ‘YY-MM’ format for whatever reason, you’d need to change the format in two different places. So we can use a window alias to clean up the code.

SELECT to_char(created_at::date, 'YYYY-MM') AS year_month,
       (sum(value) -
lag(sum(value), 1) OVER (w)) /
       lag(sum(value), 1) OVER
(w) as revenue_diff
FROM sf_transactions
GROUP BY year_month
WINDOW w AS (ORDER BY to_char(created_at::date, 'YYYY-MM'))

Now the code is easier to read and you’ve removed the duplicate code.

Lastly, let’s keep up the formatting according to the question. We’ll be explicit with our ordering even though the default behavior already has the date in ascending order. And then we’ll multiple the ratio by 100 to give us a percentage.

SELECT to_char(created_at::date, 'YYYY-MM') AS year_month,

       (sum(value) -
lag(sum(value), 1) OVER (w)) /

       lag(sum(value), 1) OVER
(w) * 100 as revenue_diff

FROM sf_transactions

GROUP BY year_month
WINDOW w AS (ORDER BY to_char(created_at::date, 'YYYY-MM'))

ORDER BY year_month ASC

Lastly, we’ll round the percentage to two decimal places using the round() function.

SELECT to_char(created_at::date, 'YYYY-MM') AS year_month,
       round((sum(value) -
lag(sum(value), 1) OVER (w)) /
       lag(sum(value), 1) OVER
(w) * 100, 2) as revenue_diff
FROM sf_transactions
GROUP BY year_month
WINDOW w AS (ORDER BY to_char(created_at::date, 'YYYY-MM'))
ORDER BY year_month ASC

Conclusion

So that wasn’t too hard was it? Sometimes it’s really nice to take your time to break down a problem and build it up one step at a time so that you can completely understand how the data is being processed at each step.

Again, our date formatting is common in analytics. You’d be asked to present your data in this format all the time so it’s important to know how to cast your data and how to reformat it. 

Window functions are also important but they can get messy as you saw in the query. A window alias is helpful to make the code more readable and to implement the DRY technique -- Don’t Repeat Yourself. Or don’t repeat code. 

Taken both concepts together, it makes the question much more difficult to solve, so it’s really important to approach the question slowly and build up the solution one step at a time, one logic at a time. This helps you troubleshoot and debug, and it brings the interviewer along the way so they know how and what you’re thinking as you solve the problem. 

Hope this is helpful. Watch the video

Previously published at https://www.stratascratch.com/blog/amazon-data-scientist-interview-questions/