Try it here
Subscribe
Oracle Analytic functions

All about Oracle Analytic functions

all_about_oracle_analytic_functions

The general syntax of analytic function is:

Function(arg1,..., argn) OVER ( [PARTITION BY <...>] [ORDER BY <....>] [<window_clause>] )

<window_clause> is like "ROW <?>" or "RANK <?>"

How are analytic functions different from group or aggregate functions?

Query-1
SELECT deptno,
COUNT(*) DEPT_COUNT
FROM emp
WHERE deptno IN (20, 30)
GROUP BY deptno;

DEPTNO                 DEPT_COUNT             
---------------------- ---------------------- 
20                     5                      
30                     6                      

2 rows selected

Consider the Query-1 and its result. Query-1 returns departments and their employee count. Most importantly it groups the records into departments in accordance with the GROUP BY clause. As such any non-"group by" column is not allowed in the select clause.

Query-2
SELECT empno, deptno, 
COUNT(*) OVER (PARTITION BY 
deptno) DEPT_COUNT
FROM emp
WHERE deptno IN (20, 30);

     EMPNO     DEPTNO DEPT_COUNT
---------- ---------- ----------
      7369         20          5
      7566         20          5
      7788         20          5
      7902         20          5
      7876         20          5
      7499         30          6
      7900         30          6
      7844         30          6
      7698         30          6
      7654         30          6
      7521         30          6

11 rows selected.

Now consider the analytic function query (Query-2) and its result. Note the repeating values of DEPT_COUNT column.

This brings out the main difference between aggregate and analytic functions. Though analytic functions give aggregate result they do not group the result set. They return the group value multiple times with each record. As such any other non-"group by" column or expression can be present in the select clause, for example, the column EMPNO in Query-2.

Analytic functions are computed after all joins, WHERE clause, GROUP BY and HAVING are computed on the query. The main ORDER BY clause of the query operates after the analytic functions. So analytic functions can only appear in the select list and in the main ORDER BY clause of the query.

In absence of any PARTITION or <window_clause> inside the OVER( ) portion, the function acts on entire record set returned by the where clause. Note the results of Query-3 and compare it with the result of aggregate function query Query-4.

Query-3
SELECT empno, deptno, 
COUNT(*) OVER ( ) CNT
FROM emp
WHERE deptno IN (10, 20)
ORDER BY 2, 1;

     EMPNO     DEPTNO        CNT
---------- ---------- ----------
      7782         10          8
      7839         10          8
      7934         10          8
      7369         20          8
      7566         20          8
      7788         20          8
      7876         20          8
      7902         20          8
Query-4
SELECT COUNT(*) FROM emp
WHERE deptno IN (10, 20);

  COUNT(*)
----------
         8

How to break the result set in groups or partitions?

It might be obvious from the previous example that the clause PARTITION BY is used to break the result set into groups. PARTITION BY can take any non-analytic SQL expression.

Some functions support the inside the partition to further limit the records they act on. In the absence of any analytic functions are computed on all the records of the partition clause.

The functions SUM, COUNT, AVG, MIN, MAX are the common analytic functions the result of which does not depend on the order of the records.

Functions like LEAD, LAG, RANK, DENSE_RANK, ROW_NUMBER, FIRST, FIRST VALUE, LAST, LAST VALUE depends on order of records.

How to specify the order of the records in the partition?

The answer is simple, by the "ORDER BY" clause inside the OVER( ) clause. This is different from the ORDER BY clause of the main query which comes after WHERE. In this section we go ahead and introduce each of the very useful functions LEAD, LAG, RANK, DENSE_RANK, ROW_NUMBER, FIRST, FIRST VALUE, LAST, LAST VALUE and show how each depend on the order of the record.

The general syntax of specifying the ORDER BY clause in analytic function is:

ORDER BY <sql_expr> [ASC or DESC] NULLS [FIRST or LAST]

The syntax is self-explanatory.

ROW_NUMBER

All the above three functions assign integer values to the rows depending on their order. That is the reason of clubbing them together.

ROW_NUMBER( ) gives a running serial number to a partition of records. It is very useful in reporting, especially in places where different partitions have their own serial numbers. In Query-5, the function ROW_NUMBER( ) is used to give separate sets of running serial to employees of departments 10 and 20 based on their HIREDATE.

Query-5 (ROW_NUMBER example)
SELECT empno, deptno, hiredate,
ROW_NUMBER( ) OVER (PARTITION BY
deptno ORDER BY hiredate
NULLS LAST) SRLNO
FROM emp
WHERE deptno IN (10, 20)
ORDER BY deptno, SRLNO;

EMPNO  DEPTNO HIREDATE       SRLNO
------ ------- --------- ----------
  7782      10 09-JUN-81          1
  7839      10 17-NOV-81          2
  7934      10 23-JAN-82          3
  7369      20 17-DEC-80          1
  7566      20 02-APR-81          2
  7902      20 03-DEC-81          3
  7788      20 09-DEC-82          4
  7876      20 12-JAN-83          5

8 rows selected.

RANK and DENSE_RANK

RANK and DENSE_RANK both provide rank to the records based on some column value or expression. In case of a tie of 2 records at position N, RANK declares 2 positions N and skips position N+1 and gives position N+2 to the next record. While DENSE_RANK declares 2 positions N but does not skip position N+1.

Query-6 shows the usage of both RANK and DENSE_RANK. For DEPTNO 20 there are two contenders for the first position (EMPNO 7788 and 7902). Both RANK and DENSE_RANK declares them as joint toppers. RANK skips the next value that is 2 and next employee EMPNO 7566 is given the position 3. For DENSE_RANK there are no such gaps.

Query-6 (RANK and DENSE_RANK example)
SELECT empno, deptno, sal,
RANK() OVER (PARTITION BY deptno
ORDER BY sal DESC NULLS LAST) RANK,
DENSE_RANK() OVER (PARTITION BY
deptno ORDER BY sal DESC NULLS
LAST) DENSE_RANK
FROM emp
WHERE deptno IN (10, 20)
ORDER BY 2, RANK;

EMPNO  DEPTNO   SAL  RANK DENSE_RANK
------ ------- ----- ----- ----------
  7839      10  5000     1          1
  7782      10  2450     2          2
  7934      10  1300     3          3
  7788      20  3000     1          1
  7902      20  3000     1          1
  7566      20  2975     3          2
  7876      20  1100     4          3
  7369      20   800     5          4

8 rows selected.

LEAD and LAG

LEAD has the ability to compute an expression on the next rows (rows which are going to come after the current row) and return the value to the current row. The general syntax of LEAD is shown below:

LEAD (<sql_expr>, <offset>, <default>) OVER (<analytic_clause>)

<sql_expr> is the expression to compute from the leading row.
<offset> is the index of the leading row relative to the current row.
<offset> is a positive integer with default 1.
<default> is the value to return if the points to a row outside the partition range.

The syntax of LAG is similar except that the offset for LAG goes into the previous rows. Query-7 and its result show simple usage of LAG and LEAD function.

Query-7 (LEAD and LAG)
SELECT deptno, empno, sal,
LEAD(sal, 1, 0) OVER (PARTITION BY dept ORDER BY sal DESC NULLS LAST) NEXT_LOWER_SAL,
LAG(sal, 1, 0) OVER (PARTITION BY dept ORDER BY sal DESC NULLS LAST) PREV_HIGHER_SAL
FROM emp
WHERE deptno IN (10, 20)
ORDER BY deptno, sal DESC;

 DEPTNO  EMPNO   SAL NEXT_LOWER_SAL PREV_HIGHER_SAL
------- ------ ----- -------------- ---------------
     10   7839  5000           2450               0
     10   7782  2450           1300            5000
     10   7934  1300              0            2450
     20   7788  3000           3000               0
     20   7902  3000           2975            3000
     20   7566  2975           1100            3000
     20   7876  1100            800            2975
     20   7369   800              0            1100

8 rows selected.

Handling NULL values

In a LAG or LEAD function expression, the optional RESPECT NULLS or IGNORE NULLS keywords can be specified in either of two locations:

  • In the argument list
  • Immediately following the closing parenthesis that delimits the argument list

The RESPECT NULLS and IGNORE NULLS keywords have these effects:

  • If you specify the RESPECT NULLS keywords, rows whose expression evaluates to NULL are included when offset rows are counted.
  • If you specify the IGNORE NULLS keywords, any row whose expression evaluates to NULL is not included when offset rows are counted.

If you specify IGNORE NULLS and all of the expression values for the rows in the window partition are NULL, the LAG or LEAD function returns the default value for each row. The function returns the NULL value if no default argument is specified.

Example:

 SELECT empno,comm,
        LEAD(comm) OVER (ORDER BY empno) next_comm,
	LAG(comm) OVER (ORDER BY empno) pre_comm
 FROM emp
 ORDER BY empno;

   EMPNO	COMM	 NEXT_COMM	PRE_COMM
---------       ------   ----------  ------------ 
    7369		300	
    7499	300.00	500	
    7521	500.00		        300
    7566		1400	        500
    7654	1400.00		
    7698	  	                1400
    7782			
    7788			
    7839		0	
    7844	0.00		
    7876			        0
    7900			
    7902			
    7934			

Now using IGNORE NULLS

 SELECT empno,comm,
        LEAD(comm IGNORE NULLS) OVER (ORDER BY empno) next_comm,
	LAG(comm IGNORE NULLS) OVER (ORDER BY empno) last_comm
 FROM emp
 ORDER BY empno;

   	EMPNO	COMM	NEXT_COMM	LAST_COMM
    .....   .....   .........   .........
	7369		300	
	7499	300.00	500	
	7521	500.00	1400	300
	7566		1400	500
	7654	1400.00	0	500
	7698		0	1400
	7782		0	1400
	7788		0	1400
	7839		0	1400
	7844	0.00		1400
	7876			0
	7900			0
	7902			0
	7934			0

As you can see the LEAD function is now looking for the next non null value, instead of just the next value.

FIRST VALUE and LAST VALUE function

The general syntax is:

FIRST_VALUE(<sql_expr>) OVER (<analytic_clause>)

The FIRST_VALUE analytic function picks the first record from the partition after doing the ORDER BY. The <sql_expr> is computed on the columns of this first record and results are returned. The LAST_VALUE function is used in similar context except that it acts on the last record of the partition.

Query-8 (FIRST_VALUE)
-- How many days after the first hire of each department were the next
-- employees hired?

SELECT empno, deptno, hiredate - FIRST_VALUE(hiredate)
OVER (PARTITION BY deptno ORDER BY hiredate) DAY_GAP
FROM emp
WHERE deptno IN (20, 30)
ORDER BY deptno, DAY_GAP;

     EMPNO     DEPTNO    DAY_GAP
---------- ---------- ----------
      7369         20          0
      7566         20        106
      7902         20        351
      7788         20        722
      7876         20        756
      7499         30          0
      7521         30          2
      7698         30         70
      7844         30        200
      7654         30        220
      7900         30        286

11 rows selected.

FIRST and LAST function

The FIRST function (or more properly KEEP FIRST function) is used in a very special situation. Suppose we rank a group of record and found several records in the first rank. Now we want to apply an aggregate function on the records of the first rank. KEEP FIRST enables that.

The general syntax is:

Function( ) KEEP (DENSE_RANK FIRST ORDER BY <expr>) OVER (<partitioning_clause>)

Please note that FIRST and LAST are the only functions that deviate from the general syntax of analytic functions. They do not have the ORDER BY inside the OVER clause. Neither do they support any clause. The ranking done in FIRST and LAST is always DENSE_RANK. The query below shows the usage of FIRST function. The LAST function is used in similar context to perform computations on last ranked records.

Query-9 (KEEP FIRST)
-- How each employee's salary compare with the average salary of the first
-- year hires of their department?

SELECT empno, deptno, TO_CHAR(hiredate,'YYYY') HIRE_YR, sal,
TRUNC(
AVG(sal) KEEP (DENSE_RANK FIRST
ORDER BY TO_CHAR(hiredate,'YYYY') )
OVER (PARTITION BY deptno)
     ) AVG_SAL_YR1_HIRE
FROM emp
WHERE deptno IN (20, 10)
ORDER BY deptno, empno, HIRE_YR;

     EMPNO     DEPTNO HIRE        SAL AVG_SAL_YR1_HIRE
---------- ---------- ---- ---------- ----------------
      7782         10 1981       2450             3725
      7839         10 1981       5000             3725
      7934         10 1982       1300             3725
      7369         20 1980        800              800
      7566         20 1981       2975              800
      7788         20 1982       3000              800
      7876         20 1983       1100              800
      7902         20 1981       3000              800

8 rows selected.

PERCENT_RANK

The PERCENT_RANK function is an OLAP ranking function that calculates a ranking value for each row in an OLAP window, normalized to a range from 0 to 1.

Each PERCENT_RANK value is computed as the row's RANK minus 1, divided by the number of rows in the partition minus 1. Values closer to 1 generally represent higher rankings and values closer to 0 generally represent lower rankings.

This function takes no argument, but the empty parentheses must be specified. If the optional window PARTITION clause is also specified, the rankings are calculated within the subset of rows that each partition defines. If there is a single row in the partition, its PERCENT_RANK value is 0.

SELECT empno, sal,RANK() OVER (ORDER BY sal) rnk,
        PERCENT_RANK() OVER (ORDER BY sal) AS per_rank
FROM emp;

   	EMPNO	SAL	RNK	PER_RANK
	7369	800.00	1	0
	7900	950.00	2	0.0769230769230769
	7876	1100.00	3	0.153846153846154
	7654	1250.00	4	0.230769230769231
	7521	1250.00	4	0.230769230769231
	7934	1300.00	6	0.384615384615385
	7844	1500.00	7	0.461538461538462
	7499	1600.00	8	0.538461538461538
	7782	2450.00	9	0.615384615384615
	7698	2850.00	10	0.692307692307692
	7566	2975.00	11	0.769230769230769
	7902	3000.00	12	0.846153846153846
	7788	3000.00	12	0.846153846153846
	7839	5000.00	14	1

CUME_DIST

The CUME_DIST function calculates the number of rows that are ranked lower than or equal to the current row, including the current row, which is divided by the total number of rows in the partition. Values closer to 1 represent higher rankings and values closer to 0 represent lower rankings.

This function takes no argument, but the empty parentheses must be specified. If the optional window PARTITION clause is also specified, the rankings are calculated within the subset of rows that each partition defines. If there is a single row in the partition, its CUME_DIST value is 1.

 SELECT empno, sal,RANK()OVER(ORDER BY sal DESC) rnk,
	  PERCENT_RANK() OVER (ORDER BY sal) AS per_rank,
	  CUME_DIST() OVER (ORDER BY sal DESC) AS cume_dist
	FROM emp;
    
   	EMPNO	SAL	RNK	PER_RANK	        CUME_DIST
      ......    ......  ...     ..................      ..................
	7369	800.00	14	0	                1
	7900	950.00	13	0.0769230769230769	0.928571428571428
	7876	1100.00	12	0.153846153846154	0.857142857142857
	7521	1250.00	10	0.230769230769231	0.785714285714286
	7654	1250.00	10	0.230769230769231	0.785714285714286
	7934	1300.00	9	0.384615384615385	0.642857142857143
	7844	1500.00	8	0.461538461538462	0.571428571428571
	7499	1600.00	7	0.538461538461538	0.5
	7782	2450.00	6	0.615384615384615	0.428571428571429
	7698	2850.00	5	0.692307692307692	0.357142857142857
	7566	2975.00	4	0.769230769230769	0.285714285714286
	7902	3000.00	2	0.846153846153846	0.214285714285714
	7788	3000.00	2	0.846153846153846	0.214285714285714
	7839	5000.00	1	1	                0.0714285714285714

Order of computation and performance tips

Defining the PARTITOIN BY and ORDER BY clauses on indexed columns (ordered in accordance with the PARTITION CLAUSE and then the ORDER BY clause in analytic function) will provide optimum performance. For Query-5, for example, a composite index on (deptno, hiredate) columns will prove effective.

It is advisable to always use CBO for queries using analytic functions. The tables and indexes should be analyzed and optimizer mode should be CHOOSE.

Even in absence of indexes analytic functions provide acceptable performance but need to do sorting for computing partition and order by clause. If the query contains multiple analytic functions, sorting and partitioning on two different columns should be avoided if they are both not indexed.

Writer profile pic

Uk01 on Nov 03, 2015 at 12:12 am


If you like dEexams.com and would like to contribute, you can write your article here or mail your article to admin@deexams.com . See your article appearing on the dEexams.com main page and help others to learn.



Post Comment

Comments( 0)

×

Forgot Password

Please enter your email address below and we will send you information to change your password.