Generating MoM, YoY and CMGR from SQL
Does anyone round here know SQL any more?
We recently hired someone to help us with some data analysis and reporting requirements. We're not really big enough to claim that what we needed was a Data Scientist, but that is effectively what we were looking for. And as is our wont, we created a simple test for applicants, designed to show some ability to extract meaning from data (by creating a histogram), and a working knowledge of SQL. We do most of our running reporting through Periscope, so being able to generate flat datasets from our relational database is an essential part of the job.
Testing candidates is always revealing, but in this case especially so - out of perhaps 20 candidates, only two nailed it - the rest either got the wrong answer, or the right answer the wrong way (for example, joining the two tables in Excel, using a vlookup
function.) This got me thinking - is raw SQL becoming a lost art?
In this post, I'm going to demonstrate how to generate some fairly simple, but important, business indicators directly in SQL, in an attempt to show how useful it is, and why Excel is not always the answer (All of the data, and SQL scripts are available as a Gist.)
The raw data in this case is monthly sales - sitting neatly in a table with two columns - period
and sales
(the sales data was generated using a random percentage change over the previous month of between 80% and 150%).
The stats I'll produce are as follows:
- Month-on-month growth (e.g. Jan-2014 to Feb-2014)
- Year-on-year monthly growth (e.g. Jan-2014 to Jan-2015)
- Annual Compound Monthly Growth Rate (CMGR) - the compound growth rate measured over the preceding 12-month period.
- The total sales in the calendar year.
All of these are trivial to produce in a spreadsheet, which is why people do it this way. But even if it's trivial, it still involves extracting the data, loading it into Excel, and adding the functions. If you want to produce charts straight off of the data, you're going to want to calculate it in one query.
The table looks like this:
The output we are looking for looks like this:
Extract monthly sales
This one I am presuming we can all agree on:
SELECT period, sales FROM sales
Month-on-month growth
In order to get the MoM growth, we need to compare the current row with the value of sales in the previous row. This is done using a Window function (the docs are from Postgresql, but Window functions are also available in MSSQL and Oracle, apparently not in MySQL though?).
I'm not going to explain how exactly this works, as the docs above do a far more comprehensive job than I could, but suffice to say, a Window function consists of two parts - the function itself, and the Window over which it operates. A Window is a set of rows related to the current row. The following expression will return the value of sales in the preceding row to the current row:
LAG(sales, 1) OVER (ORDER BY period)
In this case, the function is LAG(sales, 1)
and the Window is (ORDER BY period)
.
If you want the percentage change, it would be:
(sales - LAG(sales, 1) OVER (ORDER BY period))/LAG(sales, 1) OVER (ORDER BY period)
Overall, our query now looks like this:
SELECT
period
,sales
,(sales - LAG(sales, 1) OVER (ORDER BY period))/LAG(sales, 1) OVER (ORDER BY period) as "MoM"
FROM sales
Note: we can simplify this a little, by putting the Window definition ((ORDER BY period)
) at the end of the statement:
SELECT
period
,sales
,(sales - LAG(sales, 1) OVER w)/LAG(sales, 1) OVER w as "MoM"
FROM sales
WINDOW w as (ORDER BY period)
Year-on-year growth
Using the technique above, we do exactly the same, but using 12 as the LAG
value - because we want the 12th month preceding the current one. Adding this to the query we get:
SELECT
period
,sales
,(sales - LAG(sales, 1) OVER w)/LAG(sales, 1) OVER w as "MoM"
,(sales - LAG(sales, 12) OVER w)/LAG(sales, 12) OVER w as "YoY"
FROM sales
WINDOW w as (ORDER BY period)
CMGR
This one is a little more complicated, but only because the equation itself is complicated:
((last_period_sales / first_period_sales) ^ (1 / periods)) – 1
In our case, we are calculating a running 12-month period CMGR, so we can hard-code periods
as 12, and the first_period_sales
as the LAG(sales, 12)
value:
((sales / LAG(sales, 12) OVER w) ^ (1.0/12) - 1
Note: use 1.0/12
otherwise you'll get an integer/integer division, which will be rounded out to zero.
SELECT
period
,sales
,(sales - LAG(sales, 1) OVER w)/LAG(sales, 1) OVER w as "MoM"
,(sales - LAG(sales, 12) OVER w)/LAG(sales, 12) OVER w as "YoY"
,(sales / LAG(sales, 12) OVER w) ^ (1.0/12) - 1 as "CMGR"
FROM sales
WINDOW w as (ORDER BY period)
Calendar year total
For this column, we want to calculate the total sales within the same calendar year as the current row, and that involves a new keyword - PARTITION
.
If we were to use the sum
aggregate window function over our existing Window ((ORDER BY period)
), we would calculate the sum of all rows in the table, but this is not what we want - we want just those rows that match the year of the current row.
In order to extract the year, we use the date_trunc
function:
date_trunc('year', period)
Now we want to extend our Window to restrict the rows to this value - which we do using PARTITION
. This defines the Window set. The following expression creates a Window that contains only those rows that have the same year as the current row. This is what we want to sum:
WINDOW y as (PARTITION BY date_trunc('year', period))
The full expression now looks like this:
SELECT
period
,sales
,(sales - LAG(sales, 1) OVER w)/LAG(sales, 1) OVER w as "MoM"
,(sales - LAG(sales, 12) OVER w)/LAG(sales, 12) OVER w as "YoY"
,(sales / LAG(sales, 12) OVER w) ^ (1.0/12) - 1 as "CMGR"
,SUM(sales) over y
FROM sales
WINDOW
w as (ORDER BY period)
,y as (PARTITION BY date_trunc('year', period))
This has been a very rapid demonstration of Window functions in SQL - you'll have to Google the docs yourself if you want to dive in, but hopefully that has shown that it is possible to produce data directly from the database without resorting to PivotTables!
Making Freelance Work