一种高性能的每组最大值SQL查询方法

本教程将介绍一种高性能的每组最大值SQL查询方法的处理方法,这篇教程是从别的地方看到的,然后加了一些国外程序员的疑问与解答,希望能对你有所帮助,好了,下面开始学习吧。

一种高性能的每组最大值SQL查询方法 教程 第1张

问题描述

我正在尝试构建一个基础结构,用于按需快速运行回归,从包含我们的Web服务器上所有历史活动的数据库中提取Apache请求。为了通过确保我们仍然递减来自较小客户端的请求来提高覆盖率,我希望通过为每个客户端检索至多n个(对于这个问题,假设10个)请求来确保请求的分布。

我在这里找到了许多类似问题的答案,其中最接近的似乎是SQL query to return top N rows per ID across a range of IDs,但答案大多是与性能无关的解决方案
我已经试过了。例如,ROW_NUMBER()分析函数为我们提供了所需的数据:

SELECT
 *
FROM
 (
 SELECT
  dailylogdata.*,
  row_number() over (partition by dailylogdata.contextid order by occurrencedate) rn
 FROM
  dailylogdata
 WHERE
  shorturl in (?)
 )
WHERE
 rn <= 10;

但是,假设该表包含给定一天的数百万个条目,并且这种方法需要从索引中读取与我们的选择标准匹配的所有行,以便应用ROW_NUMBER分析函数,则性能非常糟糕。我们最终选择了近一百万行,却因为它们的row_number超过10而丢弃了其中的绝大多数。

|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|| Id  | Operation | Name  | Starts | E-Rows | A-Rows |A-Time| Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp||
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
||0 | SELECT STATEMENT| |1 |  |  12222 |00:09:08.94 |  895K| 584K| 301 | | | |||
||*  1 |  VIEW  | |1 |4427K|  12222 |00:09:08.94 |  895K| 584K| 301 | | | |||
||*  2 |WINDOW SORT PUSHED RANK| |1 |4427K|  13536 |00:09:08.94 |  895K| 584K| 301 |  2709K|743K|97M (1)| 4096 ||
||3 | PARTITION RANGE SINGLE| |1 |4427K| 932K|00:22:27.90 |  895K| 584K|0 | | | |||
||4 |  TABLE ACCESS BY LOCAL INDEX ROWID| DAILYLOGDATA|1 |4427K| 932K|00:22:27.61 |  895K| 584K|0 | | | |||
||*  5 |INDEX RANGE SCAN | DAILYLOGDATA_URLCONTEXT |1 |  17345 | 932K|00:00:00.75 | 1448 |0 |0 | | | |||
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
||
|Predicate Information (identified by operation id):|
|---------------------------------------------------|
||
|1 - filter("RN"<=:SYS_B_2) |
|2 - filter(ROW_NUMBER() OVER ( PARTITION BY "DAILYLOGDATA"."CONTEXTID" ORDER BY "OCCURRENCEDATE")<=:SYS_B_2)|
|5 - access("SHORTURL"=:P1) |
||
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

但是,如果我们只查询特定上下文ID的前10个结果,则可以大大加快执行速度:

SELECT
 *
FROM
 (
 SELECT
  dailylogdata.*
 FROM
  dailylogdata
 WHERE
  shorturl in (?)
  and contextid = ?
 )
WHERE
 rownum <= 10;

运行此查询的统计信息:

|-------------------------------------------------------------------------------------------------------------------------|
|| Id  | Operation| Name  | Starts | E-Rows | A-Rows |A-Time| Buffers ||
|-------------------------------------------------------------------------------------------------------------------------|
||0 | SELECT STATEMENT  | |1 |  |  10 |00:00:00.01 |14 ||
||*  1 |  COUNT STOPKEY | |1 |  |  10 |00:00:00.01 |14 ||
||2 |PARTITION RANGE SINGLE| |1 |  10 |  10 |00:00:00.01 |14 ||
||3 | TABLE ACCESS BY LOCAL INDEX ROWID| DAILYLOGDATA|1 |  10 |  10 |00:00:00.01 |14 ||
||*  4 |  INDEX RANGE SCAN | DAILYLOGDATA_URLCONTEXT |1 |1 |  10 |00:00:00.01 | 5 ||
|-------------------------------------------------------------------------------------------------------------------------|
| |
|Predicate Information (identified by operation id): |
|--------------------------------------------------- |
| |
|1 - filter(ROWNUM<=10)|
|4 - access("SHORTURL"=:P1 AND "CONTEXTID"=TO_NUMBER(:P2)) |
| |
+-------------------------------------------------------------------------------------------------------------------------+

在本例中,Oracle足够聪明,可以在获得10个结果后停止检索数据。我收集一组完整的上下文ID并以编程方式生成一个查询,该查询由每个上下文ID的一个查询实例和union all整个查询组成,但是考虑到上下文ID的绝对数量,我们可能会遇到Oracle内部的限制,即使不是这样,这种方法也充满了笨拙的味道。

有什么人知道有一种方法可以保持第一个查询的简单性,同时保持与第二个查询相称的性能?还要注意,我实际上并不关心检索一组稳定的行;只要它们满足我的标准,它们就可以用于回归。

编辑:Adam Musch的建议奏效了。我在这里附加了他的更改的性能结果,因为我无法将它们放在对他的答案的评论回应中。这次我还使用了一个更大的数据集进行测试,以下是来自我最初的ROW_NUMBER方法的(缓存)统计数据以供比较:

|-------------------------------------------------------------------------------------------------------------------------------------------------|
|| Id  | Operation| Name  | Starts | E-Rows | A-Rows |A-Time| Buffers | Reads  |  OMem |  1Mem | Used-Mem ||
|-------------------------------------------------------------------------------------------------------------------------------------------------|
||0 | SELECT STATEMENT  | |1 |  |  12624 |00:00:22.34 | 1186K| 931K| | | ||
||*  1 |  VIEW | |1 |1163K|  12624 |00:00:22.34 | 1186K| 931K| | | ||
||*  2 |WINDOW NOSORT| |1 |1163K|1213K|00:00:21.82 | 1186K| 931K|  3036M| 17M| ||
||3 | TABLE ACCESS BY INDEX ROWID| TWTEST|1 |1163K|1213K|00:00:20.41 | 1186K| 931K| | | ||
||*  4 |  INDEX RANGE SCAN | TWTEST_URLCONTEXT |1 |1163K|1213K|00:00:00.81 | 8568 |0 | | | ||
|-------------------------------------------------------------------------------------------------------------------------------------------------|
| |
|Predicate Information (identified by operation id): |
|--------------------------------------------------- |
| |
|1 - filter("RN"<=10)  |
|2 - filter(ROW_NUMBER() OVER ( PARTITION BY "CONTEXTID" ORDER BY  NULL )<=10)  |
|4 - access("SHORTURL"=:P1)  |
+-------------------------------------------------------------------------------------------------------------------------------------------------+

我冒昧地略微简化了Adam的建议;以下是修改后的查询...

select
 *
from
 twtest
where
 rowid in (
 select
rowid
 from (
select
  rowid,
  shorturl,
  row_number() over (partition by shorturl, contextid
order by null) rn
from
  twtest
 )
 where rn <= 10
 and shorturl in (?)
);

...及其(缓存)评估的统计信息:

|--------------------------------------------------------------------------------------------------------------------------------------|
|| Id  | Operation | Name  | Starts | E-Rows | A-Rows |A-Time| Buffers |  OMem |  1Mem | Used-Mem ||
|--------------------------------------------------------------------------------------------------------------------------------------|
||0 | SELECT STATEMENT| |1 |  |  12624 |00:00:01.33 |19391 | | | ||
||1 |  NESTED LOOPS| |1 |1 |  12624 |00:00:01.33 |19391 | | | ||
||2 |VIEW | VW_NSO_1 |1 |1163K|  12624 |00:00:01.27 | 6770 | | | ||
||3 | HASH UNIQUE  | |1 |1 |  12624 |00:00:01.27 | 6770 |  1377K|  1377K| 5065K (0)||
||*  4 |  VIEW  | |1 |1163K|  12624 |00:00:01.25 | 6770 | | | ||
||*  5 |WINDOW NOSORT | |1 |1163K|1213K|00:00:01.09 | 6770 |283M|  5598K| ||
||*  6 | INDEX RANGE SCAN| TWTEST_URLCONTEXT |1 |1163K|1213K|00:00:00.40 | 6770 | | | ||
||7 |TABLE ACCESS BY USER ROWID| TWTEST|  12624 |1 |  12624 |00:00:00.04 |12621 | | | ||
|--------------------------------------------------------------------------------------------------------------------------------------|
|  |
|Predicate Information (identified by operation id):  |
|---------------------------------------------------  |
|  |
|4 - filter("RN"<=10)|
|5 - filter(ROW_NUMBER() OVER ( PARTITION BY "SHORTURL","CONTEXTID" ORDER BY NULL NULL )<=10)|
|6 - access("SHORTURL"=:P1)|
|  |
|Note |
|-----|
|- dynamic sampling used for this statement (level=2) |
|  |
+--------------------------------------------------------------------------------------------------------------------------------------+

正如所宣传的,我们只访问经过完全筛选的行的dailylogdata表。我担心似乎仍在根据它声称正在选择的行数(1213K)对urlContext索引进行完全扫描,但考虑到它只使用6770个缓冲区(即使我增加了上下文特定结果的数量,这个数字也保持不变),这可能具有误导性。

推荐答案

这是一个不太好的解决方案,但似乎做了您想做的事情:尽快缩短索引扫描,并在通过筛选条件和top-n查询条件限定数据之前不读取数据。

请注意,测试时使用的是shorturl =条件,而不是shorturl IN条件。

with rowid_list as
(select rowid
from (select *
  from (select rowid,
row_number() over (partition by shorturl, contextid
 order by null) rn
 from dailylogdata
 )
 where rn <= 10
  )
  where shorturl = ? 
)
select * 
  from dailylogdata
 where rowid in (select rowid from rowid_list)

with子句获取前10个ROID,为满足您的条件的shorturlcontextid的每个唯一组合筛选一个窗口NOSORT。然后,它循环遍历这组roid,按rowid获取每个roid。

----------------------------------------------------------------------------------------------------
| Id  | Operation | Name  | Rows  | Bytes | Cost (%CPU)| Time  |
----------------------------------------------------------------------------------------------------
|0 | SELECT STATEMENT| |  1 |286 |  1536(1)| 00:00:19 |
|1 |  NESTED LOOPS| |  1 |286 |  1536(1)| 00:00:19 |
|2 |VIEW | VW_NSO_1 |136K|  1596K|910(1)| 00:00:11 |
|3 | HASH UNIQUE  | |  1 |  3326K|| |
|*  4 |  VIEW  | |136K|  3326K|910(1)| 00:00:11 |
|*  5 |WINDOW NOSORT | |136K|  2794K|910(1)| 00:00:11 |
|*  6 | INDEX RANGE SCAN| TABLE_REDACTED_INDEX |136K|  2794K|910(1)| 00:00:11 |
|7 |TABLE ACCESS BY USER ROWID| TABLE_REDACTED |  1 |274 |  1(0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

4 - filter("RN"<=10)
5 - filter(ROW_NUMBER() OVER ( PARTITION BY "CLIENT_ID","SCE_ID" ORDER BY NULL NULL
  )<=10)
6 - access("TABLE_REDACTED"."SHORTURL"=:b1)

好了关于一种高性能的每组最大值SQL查询方法的教程就到这里就结束了,希望趣模板源码网找到的这篇技术文章能帮助到大家,更多技术教程可以在站内搜索。