SQL Server Top N per Group


The top n per group is a classic task in SQL. One example of this task is “give me the three most recent orders for each customer”, which is the same as saying “give me the top three orders per customer when the orders are sorted with the most recent first”. There are different ways to accomplish this task and some are more efficient than others.

There are three solutions to this task. The third solution is a special case. It provides a more efficient solution than the other two only when N equals 1 in the top N per group task and a POC index is absent. This third solution is the concatenation (carry-along sort) solution. The three solutions are:

  • ROW_NUMBER
  • TOP and APPLY
  • Concatenation

As Itzik Ben-Gan says in the book T-SQL Querying by Microsoft Press on page 363, “two main factors determine which solution is most efficient: the availability of a supporting index and the density of the partitioning (group) column.

POC Index

POC is an acronym for Partitioning, Ordering and Covering. The partitioning element defines the groups. The ordering element defines the order, based on which you filter the first N rows in each group. The covering element represents the rest of the columns you need to return. We are not using group by here because we are not aggregating anything as we are interested in returning actual data rows based on sorting and filtering. The PO elements of POC should form the index key list and the C element should form the index include list.

If our example here is to return the three most recent orders for each customer from a Sales.Orders table in Itzik Ben-Gan’s database then the code for creating that index is as follows:

create unique index idx_poc on Sales.Orders(custid, orderdate desc, orderid desc) include(empid);

Density

The other important factor in determining which solution is most efficient is the density of the partitioning element. That is custid in our case. What does density mean? A low density example would be having 1,000,000 customers with about 10 orders per customer on average. That is low on the many side of the one-to-many relationship. A high density example would be having 10 different customers each with 1,000,000 orders per customer on average. That is high on the many side of the one-to-many relationship. If the table is very small, then it will run quickly and performance considerations are not that important.

Which solution should you choose, based on the density?

To understand why this is the case requires an understanding of pages, seeks and scans, which are not discussed in this post.

Learn with YouTube

Here’s a video called Ace SQL Interviews: A Data-driven Approach for Data Scientists.