Oracle ASM磁盘组被误删除!他跑,她追,他插翅难飞!

经常删库的小伙伴们,经常听说一句话:"删库跑路,三年起步!"
说的就是变更不规范,亲人两行泪的故事。
那么究竟是什么导致故事变成了事故呢,这一切的背后究竟是人性的扭曲、良心的泯灭还是道德的沦丧呢?让我们回顾一下案发现场:
注意:
本故事纯属虚构,如有雷同,纯属巧合,请勿模仿。
又快到下班时间了,突然,DBA小C接到了客户的电话,反馈数据库服务器A系统磁盘空间快不足了,之前给ASM加的两块盘sdh,sdi还没开始用,先踢出ASM磁盘组,临时扩容下系统盘。
接到任务,粗心大意的小C只想快速完成工作,避免加班,所以他匆忙检查了sdh,sdi盘已经被加到新建的+CJC磁盘组里,既然客户说没在用,那就不检查了,直接删除+CJC磁盘组。
说干就干,直接执行drop diskgroup:
发现active diskgroup磁盘删除不了:
[grid@cjc-db-02 ~]$ sqlplus as sysasmSQL> drop diskgroup CJC including contents;ORA-15039: diskgroup not droppedORA-15027: active use of diskgroup "CJC" precludes its dismount
一不做二不休,小C想到,别耽误我下班,直接停库操作:
[oracle@cjc-db-02 ~]$ srvctl status database -d cjcDatabase is running.[oracle@cjc-db-02 ~]$ srvctl stop database -d cjc[oracle@cjc-db-02 ~]$ srvctl status database -d cjcDatabase is not running.
再次执行,成功删除掉了+CJC磁盘组:
[grid@cjc-db-02 ~]$ sqlplus as sysasmSQL> drop diskgroup CJC including contents;Diskgroup dropped.
启动数据库,有报错:
[oracle@cjc-db-02 ~]$ srvctl start database -d cjcPRCR-1079 : Failed to start resource ora.cjc.dbCRS-2640: Required resource 'ora.CJC.dg' is missing.SQL> startupORA-39511: Start of CRS resource for instance '222' failed with error:[CRS-2640: Required resource 'ora.CJC.dg' is missing.CRS-0222: Resource 'ora.cjc.db' has dependency error.clsr_start_resource:260 status:222clsrapi_start_db:start_asmdbs status:222
尝试重新启动has,发现问题仍然没有解决:
[root@cjc-db-02 bin]# ./crsctl stop has[root@cjc-db-02 bin]# ./crsctl start has
原来是因为踢盘后,CRS信息还没有更新,需要通过 stvctl modify ...命令更新后才能启动数据库。
正当小C准备更新CRS信息时,客户电话又打来了,先别踢盘了,业务同事反馈,sdh,sdi昨天晚上已经开始写业务数据了,不能删除!
啊,这,,,新上线的库还没有来得及备份啊!

小C抽了根烟冷静了一下,默默打开了BOSS直聘,过了一会又打开了12306,最后又抱着试一试的心态在浏览器上搜索了ASM磁盘组误操作后如何恢复?
终于看到了希望,原来drop diskgroup只是逻辑删除,数据被覆盖之前还可以挽救回来,最终小C通过kfed工具成功的挽回了数据...
本次案例小C深有感触,明白了对生产环境要时刻抱有敬畏之心,并更新了自己的微信签名,从原来的"乾坤未定,你我皆是黑马!"改成了"乾坤已定,你我皆是牛马!"

下面,让我们看下如何使用kfed工具恢复误删除的磁盘组:
环境说明:
DB:Oracle 19.22单机+ASMOS:Oracle Linux 7.6
启动has:
[root@cjc-db-02 bin]# mount -o remount,size=5G dev/shm[root@cjc-db-02 bin]# ./crsctl start has
查看资源状态:
[root@cjc-db-02 bin]# ./crsctl stat res -t--------------------------------------------------------------------------------Name Target State Server State details --------------------------------------------------------------------------------Local Resources--------------------------------------------------------------------------------ora.DATA.dg ONLINE ONLINE cjc-db-02 STABLEora.LISTENER.lsnr ONLINE ONLINE cjc-db-02 STABLEora.asm ONLINE ONLINE cjc-db-02 Started,STABLEora.ons OFFLINE OFFLINE cjc-db-02 STABLE--------------------------------------------------------------------------------Cluster Resources--------------------------------------------------------------------------------ora.cjc.db 1 OFFLINE OFFLINE Instance Shutdown,ST ABLEora.cssd 1 ONLINE ONLINE cjc-db-02 STABLEora.diskmon 1 OFFLINE OFFLINE STABLEora.evmd 1 ONLINE ONLINE cjc-db-02 STABLE--------------------------------------------------------------------------------
查看磁盘组信息:
[grid@cjc-db-02 ~]$ sqlplus as sysasmset line 300col name for a15select GROUP_NUMBER,NAME,STATE,TYPE,TOTAL_MB,FREE_MB from v$asm_diskgroup;GROUP_NUMBER NAME STATE TYPE TOTAL_MB FREE_MB------------ --------------- ----------- ------ ---------- ---------- 1 DATA MOUNTED EXTERN 12288 9740
查看磁盘信息:
col path for a30select GROUP_NUMBER,DISK_NUMBER,STATE,REDUNDANCY,OS_MB,TOTAL_MB,FREE_MB,NAME,PATH from v$asm_disk order by 1,2;GROUP_NUMBER DISK_NUMBER STATE REDUNDA OS_MB TOTAL_MB FREE_MB NAME PATH------------ ----------- -------- ------- ---------- ---------- ---------- --------------- ------------------------------ 0 0 NORMAL UNKNOWN2048 0 0 dev/sdk 0 1 NORMAL UNKNOWN2048 0 0 dev/sdh 0 2 NORMAL UNKNOWN2048 0 0 dev/sdi 0 3 NORMAL UNKNOWN2048 0 0 dev/sdj 1 0 NORMAL UNKNOWN2048 2048 1624 DATA_0000 dev/sdb 1 1 NORMAL UNKNOWN2048 2048 1628 DATA_0001 dev/sdc 1 2 NORMAL UNKNOWN2048 2048 1620 DATA_0002 dev/sdd 1 3 NORMAL UNKNOWN2048 2048 1624 DATA_0003 dev/sde 1 4 NORMAL UNKNOWN2048 2048 1632 DATA_0004 dev/sdf 1 5 NORMAL UNKNOWN2048 2048 1612 DATA_0005 dev/sdg10 rows selected.
新建磁盘组 CJC:
[grid@cjc-db-02 ~]$ asmca




查看磁盘组信息:
set line 300col name for a15select GROUP_NUMBER,NAME,STATE,TYPE,TOTAL_MB,FREE_MB from v$asm_diskgroup;GROUP_NUMBER NAME STATE TYPE TOTAL_MB FREE_MB------------ --------------- ----------- ------ ---------- ---------- 1 DATA MOUNTED EXTERN 12288 9740 2 CJC MOUNTED EXTERN 4096 3988
col path for a30select GROUP_NUMBER,DISK_NUMBER,STATE,REDUNDANCY,OS_MB,TOTAL_MB,FREE_MB,NAME,PATH from v$asm_disk order by 1,2;GROUP_NUMBER DISK_NUMBER STATE REDUNDA OS_MB TOTAL_MB FREE_MB NAME PATH------------ ----------- -------- ------- ---------- ---------- ---------- ------------------------------ ------------------------------ 0 0 NORMAL UNKNOWN2048 0 0 /dev/sdk 0 1 NORMAL UNKNOWN2048 0 0 /dev/sdj 1 0 NORMAL UNKNOWN2048 2048 1620 DATA_0000 /dev/sdb 1 1 NORMAL UNKNOWN2048 2048 1624 DATA_0001 /dev/sdc 1 2 NORMAL UNKNOWN2048 2048 1612 DATA_0002 /dev/sdd 1 3 NORMAL UNKNOWN2048 2048 1624 DATA_0003 /dev/sde 1 4 NORMAL UNKNOWN2048 2048 1624 DATA_0004 /dev/sdf 1 5 NORMAL UNKNOWN2048 2048 1608 DATA_0005 /dev/sdg 2 0 NORMAL UNKNOWN2048 2048 1980 CJC_0000 /dev/sdh 2 1 NORMAL UNKNOWN2048 2048 1996 CJC_0001 /dev/sdi10 rows selected.
新增测试数据:
[oracle@cjc-db-02 ~]$ sqlplus / as sysdbaselect name from v$dbfile;NAME--------------------------------------------------------------------------------+DATA/CJC/DATAFILE/users.260.1165422711+DATA/CJC/DATAFILE/undotbs1.259.1165422693+DATA/CJC/DATAFILE/system.257.1165422411+DATA/CJC/DATAFILE/sysaux.258.1165422581
create tablespace CJC datafile '+CJC' size 10M;create user CJC identified by "a" default tablespace CJC;grant dba to CJC;conn CJC/acreate table t1 as select level as id from dual connect by level/tmp/sdh01
块号1备份
kfed read /dev/sdh blkn=1 >/tmp/sdh02
aun1备份
kfed read /dev/sdh aun=1 >/tmp/sdh03
磁盘头信息备份
kfed read /dev/sdi >/tmp/sdi01
块号1备份
kfed read /dev/sdi blkn=1 >/tmp/sdi02
aun1备份
kfed read /dev/sdi aun=1 >/tmp/sdi03
删除磁盘组,模拟误操作:
[grid@cjc-db-02 ~]$ sqlplus / as sysasmSQL> drop diskgroup CJC including contents;ORA-15039: diskgroup not droppedORA-15027: active use of diskgroup "CJC" precludes its dismount
停库后继续删除:
[oracle@cjc-db-02 ~]$ srvctl status database -d cjcDatabase is running.[oracle@cjc-db-02 ~]$ srvctl stop database -d cjc[oracle@cjc-db-02 ~]$ srvctl status database -d cjcDatabase is not running.
成功删除掉CJC磁盘组:
[grid@cjc-db-02 ~]$ sqlplus / as sysasmSQL> drop diskgroup CJC including contents;Diskgroup dropped.
启动数据库报错:
[oracle@cjc-db-02 ~]$ srvctl start database -d cjcPRCR-1079 : Failed to start resource ora.cjc.dbCRS-2640: Required resource 'ora.CJC.dg' is missing.SQL> startupORA-39511: Start of CRS resource for instance '222' failed with error:[CRS-2640: Required resource 'ora.CJC.dg' is missing.CRS-0222: Resource 'ora.cjc.db' has dependency error.clsr_start_resource:260 status:222clsrapi_start_db:start_asmdbs status:222
恢复误删除的磁盘组:
查看磁盘组信息:
set line 300col name for a15select GROUP_NUMBER,NAME,STATE,TYPE,TOTAL_MB,FREE_MB from v$asm_diskgroup;GROUP_NUMBER NAME STATE TYPE TOTAL_MB FREE_MB------------ --------------- ----------- ------ ---------- ---------- 1 DATA MOUNTED EXTERN 12288 9704
col path for a30select GROUP_NUMBER,DISK_NUMBER,STATE,REDUNDANCY,OS_MB,TOTAL_MB,FREE_MB,NAME,PATH from v$asm_disk order by 1,2;SQL> GROUP_NUMBER DISK_NUMBER STATE REDUNDA OS_MB TOTAL_MB FREE_MB NAME PATH------------ ----------- -------- ------- ---------- ---------- ---------- --------------- ------------------------------ 0 0 NORMAL UNKNOWN2048 0 0 /dev/sdk 0 1 NORMAL UNKNOWN2048 0 0 /dev/sdh 0 2 NORMAL UNKNOWN2048 0 0 /dev/sdi 0 3 NORMAL UNKNOWN2048 0 0 /dev/sdj 1 0 NORMAL UNKNOWN2048 2048 1620 DATA_0000 /dev/sdb 1 1 NORMAL UNKNOWN2048 2048 1624 DATA_0001 /dev/sdc 1 2 NORMAL UNKNOWN2048 2048 1612 DATA_0002 /dev/sdd 1 3 NORMAL UNKNOWN2048 2048 1620 DATA_0003 /dev/sde 1 4 NORMAL UNKNOWN2048 2048 1620 DATA_0004 /dev/sdf 1 5 NORMAL UNKNOWN2048 2048 1608 DATA_0005 /dev/sdg10 rows selected.
恢复
--备份磁盘头信息
kfed read /dev/sdh >/tmp/sdh01xxx
块号1备份
kfed read /dev/sdh blkn=1 >/tmp/sdh02xxx
aun1备份
kfed read /dev/sdh aun=1 >/tmp/sdh03xxx
磁盘头信息备份
kfed read /dev/sdi >/tmp/sdi01xxx
块号1备份
kfed read /dev/sdi blkn=1 >/tmp/sdi02xxx
aun1备份
kfed read /dev/sdi aun=1 >/tmp/sdi03xxx
通过diff命令进行对比:
[grid@cjc-db-02 ~]$ diff -C 1 /tmp/sdh01 /tmp/sdh01xxx|grep "kfdhdb.hdrsts"! kfdhdb.hdrsts: 3 ; 0x027: KFDHDR_MEMBER! kfdhdb.hdrsts: 4 ; 0x027: KFDHDR_FORMER
[grid@cjc-db-02 ~]$ diff -C 1 /tmp/sdi01 /tmp/sdi01xxx|grep "kfdhdb.hdrsts"! kfdhdb.hdrsts: 3 ; 0x027: KFDHDR_MEMBER! kfdhdb.hdrsts: 4 ; 0x027: KFDHDR_FORMER
通过对比删除前和删除后磁盘头数据的不同部分,可以看到kfdhdb.hdrsts行有差异。
尝试改回删除前磁盘头的信息:
--修改sdh01xxx文件
kfdhdb.hdrsts:状态值从4改为3,状态名称从KFDHDR_FORMER改为KFDHDR_MEMBER
--修改sdi01xxx文件
kfdhdb.hdrsts:状态值从4改为3,状态名称从KFDHDR_FORMER改为KFDHDR_MEMBER
[grid@cjc-db-02 ~]$ vi /tmp/sdh01xxx
将
kfdhdb.hdrsts: 4 ; 0x027: KFDHDR_FORMER
改成
kfdhdb.hdrsts: 3 ; 0x027: KFDHDR_MEMBER
[grid@cjc-db-02 ~]$ vi /tmp/sdi01xxx
将
kfdhdb.hdrsts: 4 ; 0x027: KFDHDR_FORMER
改成
kfdhdb.hdrsts: 3 ; 0x027: KFDHDR_MEMBER
合并修改:
[grid@cjc-db-02 ~]$ kfed merge /dev/sdh text=/tmp/sdh01xxx[grid@cjc-db-02 ~]$ kfed merge /dev/sdi text=/tmp/sdi01xxx
检查磁盘组:可以看到CJC磁盘组了
[grid@cjc-db-02 ~]$ sqlplus / as sysasmset line 300col name for a15select GROUP_NUMBER,NAME,STATE,TYPE,TOTAL_MB,FREE_MB from v$asm_diskgroup;GROUP_NUMBER NAME STATE TYPE TOTAL_MB FREE_MB------------ --------------- ----------- ------ ---------- ---------- 0 CJC DISMOUNTED 0 0 1 DATA MOUNTED EXTERN 12288 9704
挂载磁盘组
SQL> alter diskgroup CJC mount;
[grid@cjc-db-02 trace]$ tail -100f alert_+ASM.log .....2024-04-06T14:43:30.834474+08:00SUCCESS: alter diskgroup CJC mountWARNING: unknown state for diskgroup resource ora.CJC.dg, Return Value: 3
查看
set line 300col name for a15select GROUP_NUMBER,NAME,STATE,TYPE,TOTAL_MB,FREE_MB from v$asm_diskgroup;SQL> SQL> GROUP_NUMBER NAME STATE TYPE TOTAL_MB FREE_MB------------ --------------- ----------- ------ ---------- ---------- 1 DATA MOUNTED EXTERN 12288 9704 2 CJC MOUNTED EXTERN 4096 3976
可以正常启动数据库:
[oracle@cjc-db-02 ~]$ sqlplus / as sysdbaSQL*Plus: Release 19.0.0.0.0 - Production on Sat Apr 6 14:44:53 2024Version 19.22.0.0.0Copyright (c) 1982, 2023, Oracle. All rights reserved.Connected to an idle instance.SQL> startupORACLE instance started.Total System Global Area 1174402440 bytesFixed Size 8938888 bytesVariable Size 369098752 bytesDatabase Buffers 788529152 bytesRedo Buffers 7835648 bytesDatabase mounted.Database opened.
可以看到CJC磁盘组的数据
SQL> select * from cjc.t1;ID---------- 1 2 3 4 5 6 7 8 91010 rows selected.
参考:微信公众号"数据库运维之道"文章:【应知应会】使用kfed运维兵器修复ASM磁盘和磁盘组
【应知应会】使用kfed运维兵器修复ASM磁盘和磁盘组
数据库运维之道,公众号:数据库运维之道【应知应会】使用kfed运维兵器修复ASM磁盘和磁盘组
###chenjuchao 20240406###