Tag Archives: analytics

KEEP DENSE_RANK versus ROW_NUMBER – further details

I found this nice post from Laurent Schneider the other day and wanted  to comment, but my comments were a bit more of a digression and discussion, so I’ve blogged it and put a link on the comments of the post by Laurent.

I’d always used the ROW_NUMBER method myself until I read this and then figured I’d try the KEEP DENSE_RANK method, which works, as Laurent describes. One thing that didn’t sit well with me in the post from Laurent was that he said “the second one should be more performant” – I prefer hard facts, so I decided to test it a bit and my results are below.

In the simple example that Laurent gave, it’s difficult to tell which is quickest, since the table in question only has a handful of rows and therefore any benchmarking is more susceptible to other influences, clouding the results. I figured I’d build a larger table and try it on that.

Before I did that though, I did get the plans from the two statements Laurent ran on the EMP table and both show the same resource costings:

Firstly, for the ROW_NUMBER method:

select ename
,      deptno
,      sal
from   (select ename
 ,      deptno
 ,      sal
 ,      row_number() over (partition by deptno order by sal desc,empno) r
 from   emp
 )
where  r=1;

Plan hash value: 3291446077                                                                                                             

---------------------------------------------------------------------------------
| Id  | Operation                | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |      |    14 |   644 |     4  (25)| 00:00:01 |
|*  1 |  VIEW                    |      |    14 |   644 |     4  (25)| 00:00:01 |
|*  2 |   WINDOW SORT PUSHED RANK|      |    14 |   644 |     4  (25)| 00:00:01 |
|   3 |    TABLE ACCESS FULL     | EMP  |    14 |   644 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

 1 - filter("R"=1)
 2 - filter(ROW_NUMBER() OVER ( PARTITION BY "DEPTNO" ORDER BY
 INTERNAL_FUNCTION("SAL") DESC ,"EMPNO")<=1)

Note
-----
 - dynamic sampling used for this statement

Now, the KEEP DENSE_RANK method:

select max(ename) keep (dense_rank first order by sal desc,empno) ename
,      deptno
,      max(sal) sal
from   emp 
group by deptno;

Plan hash value: 15469362

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    14 |   644 |     4  (25)| 00:00:01 |
|   1 |  SORT GROUP BY     |      |    14 |   644 |     4  (25)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| EMP  |    14 |   644 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

Note
-----
   - dynamic sampling used for this statement

So, the plans are very similar, especially in terms of having the same resource usage…which means they should be similar in terms of performance…running them, as Laurent did, shows around 1s response times, which, as I say, doesn’t conclusively prove which method is quickest.

OK, on to a bigger example then…

I basically created a similar table to EMP, called JEFF_EMP and added a few more columns (for later) and then put ten million rows in it, taking around 1.3GB on my system…plenty to churn through.

DROP TABLE jeff_emp PURGE
/
CREATE TABLE jeff_emp(deptno     NUMBER
                     ,ename      VARCHAR2(100)
                     ,first_name VARCHAR2(50)
                     ,initials   VARCHAR2(30)
                     ,surname    VARCHAR2(50)
                     ,sal        NUMBER
                     ,empno      NUMBER
                     )
/
INSERT INTO jeff_emp(deptno,ename,first_name,initials,surname,sal,empno)
SELECT (MOD(ROWNUM,3) + 1) * 10
,      'FIRSTNAME_'||TO_CHAR(ROWNUM)||'_INITIALS_'||TO_CHAR(ROWNUM)||'_SURNAME_'||TO_CHAR(ROWNUM)
,      'FIRSTNAME_'||TO_CHAR(ROWNUM)
,      'INITIALS_'||TO_CHAR(ROWNUM)
,      'SURNAME_'||TO_CHAR(ROWNUM)
,      ROWNUM * 100
,      ROWNUM
FROM   (SELECT LEVEL l FROM dual CONNECT BY LEVEL < 10000001)
ORDER BY l
/
COMMIT
/
EXEC dbms_stats.gather_table_stats(ownname => USER, tabname => 'JEFF_EMP',estimate_percent=>10);

Now, here is the plan for the ROW_NUMBER method:

---------------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          |    10M|   868M|       |   204K  (1)| 00:40:49 |
|*  1 |  VIEW                    |          |    10M|   868M|       |   204K  (1)| 00:40:49 |
|*  2 |   WINDOW SORT PUSHED RANK|          |    10M|   629M|  1533M|   204K  (1)| 00:40:49 |
|   3 |    TABLE ACCESS FULL     | JEFF_EMP |    10M|   629M|       | 46605   (1)| 00:09:20 |
---------------------------------------------------------------------------------------------

…and the results:

ENAME                                                                                                    DEPTNO        SAL
---------------------------------------------------------------------------------------------------- ---------- ----------
FIRSTNAME_9999999_INITIALS_9999999_SURNAME_9999999                                                           10  999999900
FIRSTNAME_10000000_INITIALS_10000000_SURNAME_10000000                                                        20 1000000000
FIRSTNAME_9999998_INITIALS_9999998_SURNAME_9999998                                                           30  999999800

Elapsed: 00:00:24.47

…and the KEEP DENSE_RANK method plan:

-------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |     3 |   198 | 47109   (2)| 00:09:26 |
|   1 |  SORT GROUP BY     |          |     3 |   198 | 47109   (2)| 00:09:26 |
|   2 |   TABLE ACCESS FULL| JEFF_EMP |    10M|   629M| 46605   (1)| 00:09:20 |
-------------------------------------------------------------------------------

…and it’s results:

ENAME                                                                                                    DEPTNO        SAL
---------------------------------------------------------------------------------------------------- ---------- ----------
FIRSTNAME_9999999_INITIALS_9999999_SURNAME_9999999                                                           10  999999900
FIRSTNAME_10000000_INITIALS_10000000_SURNAME_10000000                                                        20 1000000000
FIRSTNAME_9999998_INITIALS_9999998_SURNAME_9999998                                                           30  999999800

Elapsed: 00:00:07.76

So, reasonably clear results, indicating that the KEEP DENSE_RANK is about a third of the time to run, compared to the ROW_NUMBER method. You can also see from the plans that the ROW_NUMBER method involves use of TEMP, whereas the KEEP DENSE_RANK doesn’t, hence the slowdown.

So, Laurent was correct in his assertion that it should be more performant…but it’s nice to see the results based on a more meaningful set of data.

Now, there was one other thing that concerned me, and that was whether if you added more columns into the SQL, would it change the performance fo either method to any significant degree, so I started using the extra name columns like this:

SELECT ename
,      first_name
,      initials
,      surname
,      deptno
,      sal
FROM   (SELECT ename
        ,      first_name
        ,      initials
        ,      surname
        ,      deptno
        ,      sal
        ,      ROW_NUMBER() OVER (PARTITION BY deptno ORDER BY sal DESC,empno) r 
        FROM   jeff_emp
       )
WHERE  r = 1
/

…which has a plan of:

---------------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          |    10M|  1546M|       |   307K  (1)| 01:01:36 |
|*  1 |  VIEW                    |          |    10M|  1546M|       |   307K  (1)| 01:01:36 |
|*  2 |   WINDOW SORT PUSHED RANK|          |    10M|  1107M|  2606M|   307K  (1)| 01:01:36 |
|   3 |    TABLE ACCESS FULL     | JEFF_EMP |    10M|  1107M|       | 46605   (1)| 00:09:20 |
---------------------------------------------------------------------------------------------

…and results:

ENAME                                                                                                FIRST_NAME                                         INITIALS                       SURNAME                      DEPTNO         SAL
---------------------------------------------------------------------------------------------------- -------------------------------------------------- ------------------------------ -------------------------------------------------- ---------- ----------
FIRSTNAME_9999999_INITIALS_9999999_SURNAME_9999999                                                   FIRSTNAME_9999999                                  INITIALS_9999999               SURNAME_9999999          10  999999900
FIRSTNAME_10000000_INITIALS_10000000_SURNAME_10000000                                                FIRSTNAME_10000000                                 INITIALS_10000000              SURNAME_10000000         20 1000000000
FIRSTNAME_9999998_INITIALS_9999998_SURNAME_9999998                                                   FIRSTNAME_9999998                                  INITIALS_9999998               SURNAME_9999998          30  999999800

Elapsed: 00:00:25.76

For the KEEP DENSE_RANK I get:

SELECT MAX(ename) KEEP (DENSE_RANK FIRST ORDER BY sal DESC,empno) ename
,      MAX(first_name) KEEP (DENSE_RANK FIRST ORDER BY sal DESC,empno) first_name
,      MAX(initials) KEEP (DENSE_RANK FIRST ORDER BY sal DESC,empno) initials
,      MAX(surname) KEEP (DENSE_RANK FIRST ORDER BY sal DESC,empno) surname
,      deptno
,      MAX(sal) sal
FROM   jeff_emp 
GROUP BY deptno
/

Which has the following plan:

-------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |     3 |   348 | 47109   (2)| 00:09:26 |
|   1 |  SORT GROUP BY     |          |     3 |   348 | 47109   (2)| 00:09:26 |
|   2 |   TABLE ACCESS FULL| JEFF_EMP |    10M|  1107M| 46605   (1)| 00:09:20 |
-------------------------------------------------------------------------------

…and results:

ENAME                                                                                                FIRST_NAME                                         INITIALS                       SURNAME                      DEPTNO         SAL
---------------------------------------------------------------------------------------------------- -------------------------------------------------- ------------------------------ -------------------------------------------------- ---------- ----------
FIRSTNAME_9999999_INITIALS_9999999_SURNAME_9999999                                                   FIRSTNAME_9999999                                  INITIALS_9999999               SURNAME_9999999          10  999999900
FIRSTNAME_10000000_INITIALS_10000000_SURNAME_10000000                                                FIRSTNAME_10000000                                 INITIALS_10000000              SURNAME_10000000         20 1000000000
FIRSTNAME_9999998_INITIALS_9999998_SURNAME_9999998                                                   FIRSTNAME_9999998                                  INITIALS_9999998               SURNAME_9999998          30  999999800

Elapsed: 00:00:14.56

So, the differential in performance has reduced significantly, with the KEEP DENSE_RANK around double it’s original time, whilst the ROW_NUMBER method has only increased marginally. I’ve not tested with adding additional columns, but I’m guessing (I know…I could and should test it!) it will get worse, to the extent that, eventually, the KEEP DENSE_RANK will become the worse performer. If that’s the case, then essentially, these two methods have different scalability dynamics and one should bear this in mind when considering which to choose, depending on how many DENSE_RANK’d columns you’d need to deliver your results.

Hope this helps.

RANK v DENSE_RANK v ROW_NUMBER

I was asked a question by one of my users about the difference between the RANK and ROW_NUMBER analytics yesterday, so here is a post on it…

RANK, ROW_NUMBER and also DENSE_RANK are very useful for taking a set of rows and ordering them in a defined manner, whilst giving each row a “position value”. They differ based on the approach taken to define the value of their position in the set of output rows. In some circumstances, they may all give the same value however, dependent on the data, they may differ.

An example based on the SCOTT.EMP table, helps to illustrate…

SELECT empno
,      sal
,      RANK() OVER(ORDER BY sal) rank_position
,      DENSE_RANK() OVER(ORDER BY sal) dense_rank_position
,      ROW_NUMBER() OVER(ORDER BY sal) row_number_position
FROM   emp
/

which returns, on my 11gR1 database…

EMPNO        SAL RANK_POSITION DENSE_RANK_POSITION ROW_NUMBER_POSITION
---------- ---------- ------------- ------------------- -------------------
7369        800             1                   1                   1
7900        950             2                   2                   2
7876       1100             3                   3                   3
7521       1250             4                   4                   4
7654       1250             4                   4                   5
7934       1300             6                   5                   6
7844       1500             7                   6                   7
7499       1600             8                   7                   8
7782       2450             9                   8                   9
7698       2850            10                   9                  10
7566       2975            11                  10                  11
7788       3000            12                  11                  12
7902       3000            12                  11                  13
7839       5000            14                  12                  14
14 rows selected.

Notice that RANK has given the two employees with SAL = 1250, the same position value of 4 and the two employees with SAL=3000, the same position value of 12. Notice also that RANK skips position values 5 and 13 as it has two entries for 4 and 12 respectively. RANK uses all numbers between 1 and 14, except 5 and 13. RANK has both repeats and gaps in it’s ordering.

DENSE_RANK is similar to RANK, in that it gives the two employees with SAL=1250, the same position value of 4, but then it does not skip over position value 5 – it simply carries on at position 5 for the next values. DENSE_RANK uses, for the position values, all numbers between 1 and 12, without leaving any out, and using 4 and 11 twice. DENSE_RANK has no gaps in it’s ordering, only repeats.

ROW_NUMBER gives each row a unique position value and consequently uses all the numbers between 1 and 14. ROW_NUMBER has no gaps or repeats in it’s ordering. Note that the position value on ROW_NUMBER is not deterministic, since the ORDER BY clause only has SAL in it. If you want to ensure the order is the same each time, you need to add further columns to the ORDER BY clause.