高度警惕：这个操作会导致Oracle 夯死

数据运维 2024-01-29 Escape 手机阅读

首先请大家转发，让很多朋友知道这个坑！

前几天一个老客户的核心系统出现了异常，整个故障持续了几分钟时间，虽然不长，但是仍然影响较大。从用户提供的ASH和AWR我也进行了简单分析，最后到现场排查了一下，发现了一些蛛丝马迹。

从AWR来看，top event出现了大量的shared pool latch以及cursor pin相关等待事件，看上去数据库基本上处于夯死的状态了。进一步看简单看了一下ASH，发现故障持续时间并不长，就几分钟。

对于shared pool latch，cursor:pin S wait in X我们都知道跟SQL解析有关，而其中latch:shared pool又跟硬解析有关。当时猜测是不是有突发大量的硬解析或者说DDL等操作，根据用户反馈该时间点并非业务高峰期，更不会有DDL等操作。

此时如果如果要进一步诊断的话，我们通常应该去看具体是哪些用户导致的这么高的latch等待，比如这个用户在当时在干什么？这部分内容ASH是有一些数据的，不过看得不够彻底。

实际上我们要去看这个432会话的执行SQL情况，发现部分关键SQL死活查不到，v$sqlv$sqlareadba_hist_sqltext等。询问用户有哪些监控的时候，提到只有zabbix，其中zabbix监控的内容极为简单和粗略，基本上没用。所以我在想，此时如果用户有一套针对数据库的zCloud 细粒度监控多好，最新的6.1版本功能很强，支持10多种商业开源国产数据库，支持各种细粒度下钻，非常适合排查疑难杂症。比如要分析过去某个时间段，看负载情况，直接拖拽就行，简直不要太方便：

实际上23年我几次出差，我每次到现场的时候，客户都说李总你很久都没来了，我们刚好有几个小问题，请帮我们看看。我基本上没用CRT/xShell之类工具登录过客户生产环境，都是让用户打开zCloud直接看，期间帮用户排除到了好几个隐患。

继续回到案例上来，这里我们可以从ASH看到实际上等待最高session的Program是sqlplus（且user是SYS);很明显这是运维人员的操作，并非应用程序。此外还有些latch:row cache objects等待，那么有没有可能是row cache objects影响了latch:shared pool呢？我分析ash裸数据发现，session 432和604 是相互阻塞的，而其中时间点最早的432执行的一个SQLID 找不到sqltext文本。

后面用户反馈他们运维人员在故障时间点之前执行了一个巡检脚本，后面脚本停止后就恢复了，我也看了一下他们的巡检脚本，实际上并不复杂，也就100来个SQL查询而已，其中有个查询shared pool碎片的脚本，这引起了我的注意。我们都知道实际上查询x$ksmsp 风险是比较高的。我在我们交付体系内部问了下，至少之前有3个客户在Oracle 11.2.0.3/11.2.0.4 版本中通过查询x$ksmsp时候，导致出现大量shared pool latch，进而导致实例hung的情况。这里我show一个类似的脚本：

SET lines 120
col sga_heap FOR a30
SELECT KSMCHIDX "SubPool",
'sga heap(' || KSMCHIDX || ',0)' sga_heap,
ksmchcom ChunkComment,
decode(round(ksmchsiz 1000), 0, '0-1K', 1, '1-2K', 2, '2-3K', 3, '3-4K', 4, '4-5K', 5, '5-6k', 6, '6-7k', 7, '7-8k', 8, '8-9k', 9, '9-10k', '10K') "size",
count(*),
ksmchcls Status,
sum(ksmchsiz) Bytes
FROM x$ksmsp
WHERE KSMCHCOM = 'free memory'
GROUP BY ksmchidx,
ksmchcls,
'sga heap(' || KSMCHIDX || ',0)',
ksmchcom,
ksmchcls,
decode(round(ksmchsiz 1000), 0, '0-1K', 1, '1-2K', 2, '2-3K', 3, '3-4K', 4, '4-5K', 5, '5-6k', 6, '6-7k', 7, '7-8k', 8, '8-9k', 9, '9-10k', '10K')
ORDER BY "size";

这里我再说强调，这个脚本不要在生产库上执行。

重要的事情说三遍！

到这里就很清晰了，我的分析就是脚本中查询x$ksmsp导致的此次问题。那么多很多人或许会问，为什么这个查询会导致latch:shared pool呢？实际上Oracle MOS也有很多文章或者相关Bug的描述中都提到了，不建议查询这个x$视图。实际上在我们技术体系内部，以前是强调过很多次的，大家应该都刻在脑海中了。先分享下Oracle的官方文档：

Diagnosing and Resolving Error ORA-04031 on the Shared Pool or Other Memory Pools (Doc ID 146599.1)

Be aware of Bug:4715420 stating that selecting from X$KSMSP is asking one session to hold the shared pool latches in turn for a LONG period of time and should be avoided on live systems. Selecting from X$KSMSP on a production system is a very bad idea.

ORA-4031 on RAC Database With 'ges big msg p' Consuming Memory (Doc ID 1433164.1)

NOTE: It is NOT recommended to run queries on X$KSMSP when the database instance is under load. Performance of the database will be impacted, especially nowadays with very large SGAs. Bug 14020215 was filed for ORA-600 errors and unplanned outages running queries directly on X$KSMSP. There is a view, X$KSMSP_NWEX, in later versions of 11g that is safer to use for investigation of memory usage. However, we STRONGLY recommend you not run these queries unless given specific instructions from Oracle Support to do so.

我想一定有很多人跟我一样喜欢探究根源的，那么就是这个查询SQL是否会真的持有share pool latch？实际上要分析这一点，Oracle提供了很多手段，大家可以使用event，oradebug poke或者 gdb。这里不得不说，Oracle 在这方面的诊断手段非常的强大，不过我测试evevnt 10005 在10g+版本已经被废弃了，因此我这里简单用gdb试试。

我们知道latch的相关操作主要是调用kslgetl,kslfre等，那么这里直接开干！

+++Session 1

SQL> select sid from v$mystat where rownum=1;

SID
----------
191

SQL> select spid from v$process where addr=(select paddr from v$session where sid=191);

SPID
------------------------------------------------
27392

SQL> SET lines 120
SQL> col sga_heap FOR a30
SQL> SELECT KSMCHIDX "SubPool",
'sga heap(' || KSMCHIDX || ',0)' sga_heap,
ksmchcom ChunkComment,
decode(round(ksmchsiz 1000), 0, '0-1K', 1, '1-2K', 2, '2-3K', 3, '3-4K', 4, '4-5K', 5, '5-6k', 6, '6-7k', 7, '7-8k', 8, '8-9k', 9, '9-10k', '10K') "size",
count(*),
2 3 4 5 6 ksmchcls Status,
sum(ksmchsiz) Bytes
7 8 FROM x$ksmsp
9 WHERE KSMCHCOM = 'free memory'
GROUP BY ksmchidx,
ksmchcls,
'sga heap(' || KSMCHIDX || ',0)',
ksmchcom,
ksmchcls,
10 11 12 13 14 15 decode(round(ksmchsiz 1000), 0, '0-1K', 1, '1-2K', 2, '2-3K', 3, '3-4K', 4, '4-5K', 5, '5-6k', 6, '6-7k', 7, '7-8k', 8, '8-9k', 9, '9-10k', '10K')
ORDER BY "size"; 16

SubPool SGA_HEAP CHUNKCOMMENT size COUNT(*) STATUS BYTES
---------- ------------------------------ -------------------------------- ---------- ---------- ---------------- ----------
1 sga heap(1,0) free memory 0-1K 82 free 13240
1 sga heap(1,0) free memory 1-2K 3 free 2528
1 sga heap(1,0) free memory 10K 12 R-free 10073632
1 sga heap(1,0) free memory 10K 8 free 18130496
1 sga heap(1,0) free memory 4-5K 2 free 7192

+++Session 2

oracle@Ora11g-ogg-Target:/home/oracle $gdb -p 27392
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
......
......
(gdb) set pagination off
(gdb) rbreak ^ksl.*
Breakpoint 1 at 0xa6f93c
kslhngcbk1;
Breakpoint 2 at 0xa6f9b6
kslhngcbk0;
Breakpoint 3 at 0xa706f6
kslgtexholder;
Breakpoint 4 at 0xa7081e
kslown;
Breakpoint 5 at 0xa71460
kslfpl;
Breakpoint 6 at 0xa71612
kslfpl0;
Breakpoint 7 at 0xa71948
kslgpl;
Breakpoint 8 at 0xa7195c
。。。。。。
。。。。。。
。。。。。。
(gdb) c
Continuing.
。。。。。

Breakpoint 195, 0x0000000009401cc4 in kslgetl ()
(gdb) printf "kslgetl laddr:%x, willing:%d, where:%d, why:%dn", $rdi, $rsi, $rdx, $rcx
kslgetl laddr:6010e088, willing:1, where:594916352, why:3991
(gdb) c
Continuing.

Breakpoint 193, 0x000000000940108e in kslfre ()
(gdb) printf "kslgetl laddr:%x, willing:%d, where:%d, why:%dn", $rdi, $rsi, $rdx, $rcx
kslgetl laddr:6010e088, willing:1611718792, where:161554944, why:1
(gdb) c
Continuing.
。。。。。

Breakpoint 195, 0x0000000009401cc4 in kslgetl ()
(gdb) printf "kslgetl laddr:%x, willing:%d, where:%d, why:%dn", $rdi, $rsi, $rdx, $rcx
kslgetl laddr:6010e088, willing:1, where:593658880, why:3991
(gdb) c
Continuing.

Breakpoint 193, 0x000000000940108e in kslfre ()
(gdb) c
Continuing.

Breakpoint 194, 0x0000000009401544 in kslwlscan ()
(gdb) c
Continuing.

Breakpoint 195, 0x0000000009401cc4 in kslgetl ()
(gdb) printf "kslgetl laddr:%x, willing:%d, where:%d, why:%dn", $rdi, $rsi, $rdx, $rcx
kslgetl laddr:6010e088, willing:1, where:0, why:4039
(gdb) c
Continuing.

Breakpoint 193, 0x000000000940108e in kslfre ()
(gdb) c
Continuing.

Breakpoint 195, 0x0000000009401cc4 in kslgetl ()
(gdb) printf "kslgetl laddr:%x, willing:%d, where:%d, why:%dn", $rdi, $rsi, $rdx, $rcx
kslgetl laddr:6010e088, willing:1, where:0, why:3980
(gdb) c
Continuing.

Breakpoint 193, 0x000000000940108e in kslfre ()
(gdb) c
Continuing.

Breakpoint 195, 0x0000000009401cc4 in kslgetl ()
(gdb) printf "kslgetl laddr:%x, willing:%d, where:%d, why:%dn", $rdi, $rsi, $rdx, $rcx
kslgetl laddr:6010e088, willing:1, where:0, why:4039
(gdb) c
Continuing.

Breakpoint 193, 0x000000000940108e in kslfre ()
(gdb) c
Continuing.

这里就完全手敲了，就几个命令，如果要把日志输出等更好一些，可以写个shell更ok一些。从上面的结果我们可以看到，执行这个x$ksmsp 查询，这里这里调用了至少超过5次kslgetl;通过addr地址，我们确认就是share pool latch的地址(6010e088,查询v$latch.addr,需要补全）：

SQL> select addr, latch#, child#, level#, name from v$latch_children where name in ('shared pool');

ADDR LATCH# CHILD# LEVEL# NAME
---------------- ---------- ---------- ---------- ---------------------------------------------------------
000000006010E448 336 7 7 shared pool
000000006010E3A8 336 6 7 shared pool
000000006010E308 336 5 7 shared pool
000000006010E268 336 4 7 shared pool
000000006010E1C8 336 3 7 shared pool
000000006010E128 336 2 7 shared pool
000000006010E088 336 1 7 shared pool

7 rows selected.

SQL>