Sep 19 2013
Using OLAP Functions to Extend Calendar Capabilities
I think it’s probably a safe bet to suggest that just about every data warehouse (or even transactional system) has some sort of calendar table. In many cases, the unique key for this table might be the natural key of the date itself, or perhaps it’s a system-generated surrogate key. That doesn’t really matter for this post. What I want to do is show one idea of how I used an OLAP aggregate function called row_number()
to extend my calendar functionality, and make it really easy to schedule reports for the “last three months” given an input date.
What are OLAP Functions?
I just got back from the SAP BusinessObjects User Conference (2013 edition). While at the conference I attended a session by Pierpaolo Vezzosi of SAP that covered OLAP function use within a universe. I am not going to try to reproduce his entire presentation here (if you attending the conference, you should be able to download it now). But I thought I would show an example of how I’m using an OLAP function to extend my standard calendar.
OLAP functions are somewhat similar to aggregate functions in that they operate on a series of rows, rather than on a single row at a time. What does this mean? Well, for example, the Sum() function takes zero or more rows as input and returns a single row of output. Let’s assume I have the following six rows of data in a table.
Country | Region | Division | Sales |
---|---|---|---|
USA | East | Food | 500 |
USA | East | Beverage | 475 |
USA | West | Food | 525 |
USA | West | Beverage | 538 |
Canada | Canada | Food | 217 |
Canada | Canada | Beverage | 200 |
A normal aggregate function like sum()
would operate on this set of six rows and return a singular value:
SELECT SUM(sales) FROM TABLE;
The answer would be a single row with a value of 2455. This is different from a non-aggregate function like upper()
which would return an output value for each input row. Something like this, perhaps:
SELECT UPPER(country) FROM TABLE;
Country | Upper(country) |
---|---|
USA | USA |
USA | USA |
USA | USA |
USA | USA |
Canada | CANADA |
Canada | CANADA |
Let me recap:
- An aggregate function operates on a group of rows and returns a single row.* Examples include
sum()
,min()
,max()
,count()
, and others. - A non-aggregate function operates on a single row and returns a value for that row. Examples include
upper()
,substring()
,position()
,cast()
, and others.
* This does, of course, ignore the capability of adding dimensions to my query and including a GROUP BY clause. Let’s ignore that for now.
Where do OLAP functions come in? They’re somewhat of a hybrid, coming in between aggregate and non-aggregate functions. Here’s what an update of my list would look like if it included OLAP functions.
- An aggregate function operates on a group of rows and returns a single row.
- A non-aggregate function operates on a single row and returns a value for each row.
- An OLAP function operates on a group of rows and returns a value for each row. Examples include
rank()
,csum()
,mavg()
, and so on.
In other words, an aggregate function can take six rows of input and generate one row of output. A non-aggregate function can only take one row at a time and will output a single value for each row. An OLAP function can calculate across six rows of input and still output six rows, with an appropriate calculation result being shown for each row. I will do one example using the csum()
(short for cumulative sum) function mentioned above. Here’s my SQL:
SELECT country , region , division , sales , csum(sales, country) FROM table;
The csum()
function requires two arguments. The first is the column to be summed, and the second is the column to be used as a sort key. My output might look like this:
Country | Region | Division | Sales | csum(sales) |
---|---|---|---|---|
USA | East | Food | 500 | 500 |
USA | East | Beverage | 475 | 975 |
USA | West | Food | 525 | 1500 |
USA | West | Beverage | 538 | 2038 |
Canada | Canada | Food | 217 | 2255 |
Canada | Canada | Beverage | 200 | 2455 |
That’s all well and good. But how does it help me extend my calendar?
Calculating Offsets Using row_number()
There is a very useful OLAP function in Teradata called row_number()
which has the following syntax:
ROW_NUMBER() OVER ( [ PARTITION BY <value-column-data> ] ORDER BY <column-list> ROWS { UNBOUNDED | <number-of-rows> } PRECEDING)
I haven’t talked about the extended portion of the syntax, and I don’t need to for this example. I am only going to use the ORDER BY portion of the syntax.
When I run basic SQL, the rows are returned un-ordered. I get them in whatever order the database opts to send them to me. For that reason I cannot rely on a “next” or “previous” sort of functionality because I can never guarantee that the rows will be in any sort of specific order. What about derived tables? Well, yes, I could put an ORDER BY clause in a derived table, but once I join that derived table to another table all bets are off as to how the rows might end up.
Remember how earlier in this post I said that OLAP functions can operate across multiple rows but will return an individual result for each row in the result set? The row_number()
function above has as an optional parameter an ORDER BY
clause that I can use to tell the database that I what to initially sort the rows in a specific order in order to process them. First I am going to show what my SQL looks like and then I will explain it.
select a.yr_id , a.mo_id , (row_number() over (order by a.yr_id desc, a.mo_id desc) - 1) * - 1 as mon_offset from CALENDAR_DT a where a.yr_id * 100 + a.mo_id <= ( select c.yr_id * 100 + c.mo_id as current_month from CALENDAR_DT c , RPT_VW d where c.cal_dt = d.mon_start_dt ) group by 1, 2
In this query yr_id
is a numeric year value like 2012 or 2013 and mo_id
is a numeric month number 1 through 12. There is a database table (a view, actually) called RPT_VW that contains the date that defines the "current month." Each time the warehouse is loaded, this table is updated so that it always contains the first day of the most recent "reporting month" which is what I am interested in. I do it this way instead of relying on a system date from the database.
The sub-query takes the year and multiplies by 100 so the year 2013 becomes 201300. The next step is to add the month, so for September of 2013 the value becomes 2013*100 + 9 or 201300 + 9 with the end result of 201309. This is done so that I can compare across year boundaries (so that December of 2012 is "less than" January of 2013). It works very well. The sub-query returns this year+month value for the current reporting month.
The outside query is going to return the year and month for every row in the calendar table where the year+month combination key is less than or equal to the year+month combination key for the reporting table. For example, if I am in September of 2013 then I will get every date from the beginning of time up to and including the end of September, 2013. Anything from October, 2013 and beyond is ignored. I do this because in this case I don't care about any sort of future date. It's not a requirement for this solution, but it would change my logic if I did allow future dates.
Finally, the row_number()
function kicks in. Once I have filtered my data set down to dates that occur in or before my reporting month, the OLAP function kicks in. The OVER
key word indicates the usage of an OLAP function. In the arguments for the function I see an ORDER BY
clause that requests that my data be sorted by year and then month in descending order. The row_number()
function is next; as each row is returned by the query in the specified order it (the row) is assigned a number. The first row returned will be one (for the year 2013 and the month 09 given my example above), the next row will be two (contains 2013 and 08) and so on from there. Something like this:
Year | Month | Row Number |
---|---|---|
2013 | 9 | 1 |
2013 | 8 | 2 |
2013 | 7 | 3 |
2013 | 6 | 4 |
2013 | 5 | 5 |
The final step in the calculation is to subtract one from the row number and multiply by negative one. This means September 2013 is tagged as row zero, August 2013 has an offset of -1 and so on as shown here. I call this column Month Offset. Row zero is defined as the current month, and every other month has a value that shows a value that indicates how far away that month is from the current month.
Year | Month | Month Offset |
---|---|---|
2013 | 9 | 0 |
2013 | 8 | -1 |
2013 | 7 | -2 |
2013 | 6 | -3 |
2013 | 5 | -4 |
What do I do with this? I use the code above to create a derived table called DT_MONTH_OFFSET
. I then join this table to my calendar table using the year and month as my keys, and create an object based on the result of the row number calculation. This means that every row from September of 2013 is marked as being from "month offset" of zero, and every row from August of 2013 is month one, and so on. Because the month offset value is determined by the "current month" as shown in my reporting table, the values will dynamically update as each new month is loaded into my database.
Using Month Offset
The cherry on top of this OLAP sunday is the ability to create the following pre-defined condition:
DT_MONTH_OFFSET.MONTH_OFFSET between -2 and 0
or perhaps this one:
DT_MONTH_OFFSET.MONTH_OFFSET between @Prompt('Beginning Month Offset','N',,,,) and @Prompt('Ending Month Offset','N',,,,)
The first pre-defined condition will return calendar rows that fall within the last three months including the "current" month. The second pre-defined condition will prompt the user for a month offset range, so the user can pick the "last two months" (using -1 and 0 as arguments) or "previous three months" (range of -3 to -1) or even the current month by entering zero for both prompt values. Because this is dynamic, it works for scheduled reports as well.
This solution even works for fiscal calendars (periods instead of months) or calendars that don't end in December, because I don't care which month is the first or last month of a year! I only care where the specific month (or period) falls in relation to a month (or period) that I have identified as the current month (or period). The example I have shown here works at the month level, but the same technique could be used to show Week Offset or even Day Offset, and in fact I have done both in various universes.
In this specific case I have shown how to use the OLAP row_number() over(order by)
structure to assign a dynamic offset value based on year and month, and only for months that occur in the past. In a future blog post I will show how I have done something similar without OLAP functions.
Hi Dave,
Your postings always have been helpful and insightful. With Tableau, Qlikview and other reporting tools, BO has been loosing ground in the reporting world, how do you see the future for SAP BOBJ (WEBI, Analysis designer), SAP lumira and other. SAP has been creating lot confusion in the user world. I really appreciate, if you could post an article in this regard.
Thanks,
Pavan
Hi Dave,
I have a question on how to create a subquery in WebI within select statement (i.e. in the result objects panel without using a derive table). I have a below query that I am trying to built. I would appreciate your help.
SELECT DISTINCT
A.ORDER_NUMBER,
A.REFERENCE_NUMBER,
A.ORDER_TIME_KEY,
(SELECT MAX(PRICE_TIME_KEY) FROM MASF_ORDERS B WHERE B.ORDER_NUMBER= A.REFERENCE_NUMBER)AS “Surgery Date”,
MASD_ORDER_TYPE.ORDER_TYPE_ID AS ORDER_TYPE
FROM MSDCOGN_MAIN.MASF_ORDERS A
INNER JOIN MSDCOGN_MAIN.MASD_ORDER_TYPE
ON A.ORDER_TYPE_KEY = MASD_ORDER_TYPE.ORDER_TYPE_KEY
WHERE
ORDER_NUMBER IN (‘6128666808’)
Thank you
I am using Dynamic Prompt to Schedule the reports. Where I have Date Filters made in Universe which gives a prompt with either User Selected Start and End Dates or Dynamic Dates like, Today, Yesterday, 2 Days Ago, WTD etc….
Everything is working fine, just I find an issue.
Let’s assume my report is scheduled for Yesterday.
But when I open the report in Launchpad and refresh the report for Dynamic Date ‘2 Days Ago’ and hit save.
The Dynamic Date Changes in the Schedule too.
My Question is if we can avoid this change of Dynamic Date in the Schedule.
If you refresh the report interactively, I don’t see how it should update the schedule. A scheduled instance is supposed to be separate from the “live” version of the report. You might need to get help from SAP on this one.
Just try uncheck the option “Keep last value selected” in the prompt properties for date field.