Sunday, July 22, 2007

OLAP Paradise - WITH ROLLUP

Though this is an age old concept, but after seeing the ignorance factor of many regarding this, I thought of writing a bit about it. ROLLUP is a GROUP BY modifier that adds extra rows to the output which contain summary data. At each level of aggregation the aggregated column is filled with a NULL value.

Let's see a small example. I know the values are too small for the data presented, but just to keep it readable.

   1: mysql> CREATE TABLE `rollup_1` (
   2:   `url` varchar(255) DEFAULT NULL,
   3:   `year` int(11) DEFAULT NULL,
   4:   `country` char(2) DEFAULT NULL,
   5:   `visit` int(11) DEFAULT NULL
   6: ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
   7: Query OK, 0 rows affected (0.01 sec)
   8:  
   9: mysql> SELECT * FROM rollup_1;
  10: +----------------------+------+---------+-------+
  11: | url                  | year | country | visit |
  12: +----------------------+------+---------+-------+
  13: | http://www.yahoo.com | 2005 | US      | 12000 |
  14: | http://www.yahoo.com | 2005 | IN      | 10000 |
  15: | http://www.yahoo.com | 2006 | US      | 16000 |
  16: | http://www.yahoo.com | 2006 | IN      | 10700 |
  17: | http://www.yahoo.com | 2007 | US      | 21000 |
  18: | http://www.yahoo.com | 2007 | IN      | 13200 |
  19: +----------------------+------+---------+-------+
  20: 6 rows in set (0.00 sec)
  21:  
  22: mysql> SELECT IFNULL(url, 'TOTAL') AS url, SUM(visit) AS visit FROM rollup_1 GROUP BY url WITH ROLLUP;
  23: +----------------------+-------+
  24: | url                  | visit |
  25: +----------------------+-------+
  26: | http://www.yahoo.com | 82900 |
  27: | TOTAL                | 82900 |
  28: +----------------------+-------+
  29: 2 rows in set, 1 warning (0.00 sec)
  30:  
Complete analysis:
   1: mysql> SELECT IFNULL(url, 'ALL_URLS') AS url,
   2:     -> IFNULL(year, 'ALL_YEARS') AS year,
   3:     -> IFNULL(country, 'ALL_COUNTRIES') AS country,
   4:     -> SUM(visit) FROM rollup_1
   5:     -> GROUP BY url, year, country WITH ROLLUP;
   6: +----------------------+-----------+---------------+------------+
   7: | url                  | year      | country       | SUM(visit) |
   8: +----------------------+-----------+---------------+------------+
   9: | http://www.yahoo.com | 2005      | IN            |      10000 |
  10: | http://www.yahoo.com | 2005      | US            |      12000 |
  11: | http://www.yahoo.com | 2005      | ALL_COUNTRIES |      22000 |
  12: | http://www.yahoo.com | 2006      | IN            |      10700 |
  13: | http://www.yahoo.com | 2006      | US            |      16000 |
  14: | http://www.yahoo.com | 2006      | ALL_COUNTRIES |      26700 |
  15: | http://www.yahoo.com | 2007      | IN            |      13200 |
  16: | http://www.yahoo.com | 2007      | US            |      21000 |
  17: | http://www.yahoo.com | 2007      | ALL_COUNTRIES |      34200 |
  18: | http://www.yahoo.com | ALL_YEARS | ALL_COUNTRIES |      82900 |
  19: | ALL_URLS             | ALL_YEARS | ALL_COUNTRIES |      82900 |
  20: +----------------------+-----------+---------------+------------+
  21: 11 rows in set, 3 warnings (0.00 sec)
There is a small Gotcha while using ROLLUP. As you can see, my table definition does not put a NOT NULL constraint on any of the fields. Though you might have guessed what I'm pointing to, let's see this example.
   1: mysql> INSERT INTO rollup_1 values ('http://www.yahoo.com', 2007, null, 12000);
   2: Query OK, 1 row affected (0.00 sec)
   3:  
   4: mysql> SELECT IFNULL(url, 'ALL_URLS') AS url, 
   5:     -> IFNULL(year, 'ALL_YEARS') AS year, 
   6:     -> IFNULL(country, 'ALL_COUNTRIES') AS country, 
   7:     -> SUM(visit) FROM rollup_1 
   8:     -> GROUP BY url, year, country WITH ROLLUP;
   9: +----------------------+-----------+---------------+------------+
  10: | url                  | year      | country       | SUM(visit) |
  11: +----------------------+-----------+---------------+------------+
  12: | http://www.yahoo.com | 2005      | IN            |      10000 |
  13: | http://www.yahoo.com | 2005      | US            |      12000 |
  14: | http://www.yahoo.com | 2005      | ALL_COUNTRIES |      22000 |
  15: | http://www.yahoo.com | 2006      | IN            |      10700 |
  16: | http://www.yahoo.com | 2006      | US            |      16000 |
  17: | http://www.yahoo.com | 2006      | ALL_COUNTRIES |      26700 |
  18: | http://www.yahoo.com | 2007      | ALL_COUNTRIES |      12000 |
  19: | http://www.yahoo.com | 2007      | IN            |      13200 |
  20: | http://www.yahoo.com | 2007      | US            |      21000 |
  21: | http://www.yahoo.com | 2007      | ALL_COUNTRIES |      46200 |
  22: | http://www.yahoo.com | ALL_YEARS | ALL_COUNTRIES |      94900 |
  23: | ALL_URLS             | ALL_YEARS | ALL_COUNTRIES |      94900 |
  24: +----------------------+-----------+---------------+------------+
  25: 12 rows in set, 3 warnings (0.00 sec)
As you can see in the output, there are two rows claiming to provide data for url='http://www.yahoo.com' and year=2007, question is which one is which one? Few good answers are:
  1. One that appears late in the result set.
  2. One that has a higher value for SUM(visit) field. This is arguable in general as there might be negative values, for our example we can say that's impossible.
The best answer is of course this. For using ROLLUP with data like this, please make sure that you have fields marked as NOT NULL and instead specify another DEFAULT like 'not set' or something else.

While using ROLLUP, you may further like to consider few points.

  • ROLLUP and ORDER BY are mutually exclusive, they cannot be used in the same statement.
  • If you are using LIMIT clause along with ROLLUP, remember LIMIT clause is applied after ROLLUP, so it may include the aggregated rows.
  • In MySQL documentation, it is mentioned that "Because the NULL values in the super-aggregate rows are placed into the result set at such a late stage in query processing, you cannot test them as NULL values within the query itself.". You can see in the above queries, we are actually testing this value to be replaced by TOTAL_* string. Also if your data itself does not have NULL values in it, it works wonders. Look at the output below, I have removed the row inserted above.
   1: mysql> SELECT IFNULL(url, 'ALL_URLS') AS url, 
   2:     -> IFNULL(year, 'ALL_YEARS') AS year, 
   3:     -> IFNULL(country, 'ALL_COUNTRIES') AS country, 
   4:     -> SUM(visit) FROM rollup_1 
   5:     -> GROUP BY url, year, country WITH ROLLUP HAVING country is null;
   6: +----------------------+-----------+---------------+------------+
   7: | url                  | year      | country       | SUM(visit) |
   8: +----------------------+-----------+---------------+------------+
   9: | http://www.yahoo.com | 2005      | ALL_COUNTRIES |      22000 |
  10: | http://www.yahoo.com | 2006      | ALL_COUNTRIES |      26700 |
  11: | http://www.yahoo.com | 2007      | ALL_COUNTRIES |      34200 |
  12: | http://www.yahoo.com | ALL_YEARS | ALL_COUNTRIES |      82900 |
  13: | ALL_URLS             | ALL_YEARS | ALL_COUNTRIES |      82900 |
  14: +----------------------+-----------+---------------+------------+
  15: 5 rows in set, 4 warnings (0.00 sec)
Hope this blog was helpful to you. Keep posting your comments.

2 comments:

Anonymous said...

This was a very informative article for the neophyte like me. And the practical examples gave me just the hints I needed to get WITH ROLLUP working in my queries. Thanks!

Anonymous said...

IFNULL is a great way to improve readability of results, thanks for the tip!