Jan 17 2008
JOIN_BY_SQL Revisited
For the past several years I have included notes about the JOIN_BY_SQL parameter now available for universe designers. The short definition: instead of documents having multiple SQL passes with a join or synchronize option, this parameter asks the database to do the work instead. I have stated in more than one presentation that we saw major performance improvements by using this feature. Now I have to back-track on that just a bit. 🙂
We recently started using the query load manager from Teradata. Any query that extends beyond a certain period of time (measured in seconds) is downgraded from high priority to medium, and potentially even lower depending on the amount of time required. During benchmark testing we were ramping up to 100 concurrent (active) query users and we noticed that our JOIN_BY_SQL queries were taking just a few seconds longer than the threshold set on the server.
So there are a couple of options to consider. One: we could extend the time definition. However, this was set after careful analysis and after input from some very smart Teradata folks, so that wasn’t really a choice for us. Another option was turning off JOIN_BY_SQL. When we did that and reran our queries, something interesting happened. The queries ran faster individually than they did as one larger query. What happened to our performance boost that I had been bragging about?
For the past two years we’ve been working on this system and tweaking it / tuning it for performance. We have added aggregate indexes. We have added more parameters (prompts) to our reports. We rewrote reports so they would return more summarized data (and therefore fewer rows). We have done all sorts of things, frankly, and one of the net benefits is that reports that used to return tens of thousands of rows are now returning under a thousand summary rows instead. It turns out that the performance benefits of JOIN_BY_SQL have been minimized by our other tuning efforts, and our reports actually return faster with three smaller queries than with one larger query.
The one large JOIN_BY_SQL query runs just long enough to drop into medium priority. The multiple smaller queries all run to completion at high priority on the database side. The net result: we have now turned off the JOIN_BY_SQL option in our universes on this project. The bottom line: like anything else, JOIN_BY_SQL does not appear to be a “magic bullet” solution. It, like anything else, should be carefully evaluated under the unique requirements for each project.
Actually, for people using Crystal it is a magical bullet as it allows queries in Crystal against a universe using multiple universe contexts and therefore SQL statements. Crystal at this time cannot handle queries against a universe that generates multiple SQL statements otherwise.
I used this setting against Teradata using Webi for up to 10 SQL statements (caused by 10 fact tables, multiple star schema), but I was not able to do a sufficient analysis regarding perfromance with real data volumes, because of resource restraints.
In any case the JOIN_BY_SQL will reduce the load on the BOE server, as only one microcube will be returned, instead of multiple ones. If this benefit outweighs the additional load put on the DBMS server needs to be evaluated on a case by case basis (just as you did, Dave) 🙂
The big change for us was our individual data provider count went from 6 down to 3 after we pivoted the data so CY and PY data were on the same row. We also reduced the row count for each data provider by providing more restrictive prompts. So the number of rows returned by the individual data providers has been reduced substantially.
I believe that if we were still running larger data providers (with tens of thousands of rows) that the report performance would still be better with JOIN_BY_SQL turned on. That despite the way Teradata would have treated them.
Your points about Crystal are perfectly valid, of course. In this project we’re using only Webi to generate reports so that was not a constraint for us.
When you use Join_By_SQL what happens when the queries do not have matching dimensions. Does this still push the work down to the database?
Hi, Chris, and thanks for your comment. Prior to JOIN_BY_SQL being available you might see two combination operations in the query panel: JOIN or SYNCHRONIZE. The JOIN_BY_SQL technique only works for JOIN operations. Anything that requires a SYNCHRONIZE operation is still done on the client.
At least that is what my memory is telling me; I did not try to retest that prior to responding to your question.
This seems to work well in Teradata but I saw Netezza query performance suffer. It was better to there to have the default behavior send the smaller chucks of sql to this engine.