分类目录归档:Oracle RAC

一次 CRS-1013: ASM 磁盘组中的 OCR 位置不可访问 故障分析

有朋友告知我集群突然异常,让我给看看什么原因,集群alert日志

2021-03-14 21:02:15.517 [OHASD(31771)]CRS-8500: Oracle Clusterware OHASD 进程以操作系统进程 ID 31771 开头
2021-03-14 21:02:15.561 [OHASD(31771)]CRS-0714: Oracle Clusterware 发行版 12.1.0.2.0。
2021-03-14 21:02:15.619 [OHASD(31771)]CRS-2112: 已在节点 rac1 上启动 OLR 服务。
2021-03-14 21:02:15.791 [OHASD(31771)]CRS-1301: 已在节点 rac1 上启动 Oracle 高可用性服务。
2021-03-14 21:02:15.910 [OHASD(31771)]CRS-8017: 位置:/etc/oracle/lastgasp具有2个重新启动指导日志文件,0个已发布,0个出现错误
2021-03-14 21:02:16.789 [CSSDAGENT(32015)]CRS-8500: Oracle Clusterware CSSDAGENT 进程以操作系统进程 ID 32015 开头
2021-03-14 21:02:16.868 [CSSDMONITOR(32017)]CRS-8500: Oracle Clusterware CSSDMONITOR 进程以操作系统进程 ID 32017 开头
2021-03-14 21:02:17.751 [ORAROOTAGENT(32008)]CRS-8500: Oracle Clusterware ORAROOTAGENT 进程以操作系统进程 ID 32008 开头
2021-03-14 21:02:17.916 [ORAAGENT(32012)]CRS-8500: Oracle Clusterware ORAAGENT 进程以操作系统进程 ID 32012 开头
2021-03-14 21:02:18.604 [ORAAGENT(32012)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" 
(位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:02:18.969 [ORAAGENT(32117)]CRS-8500: Oracle Clusterware ORAAGENT 进程以操作系统进程 ID 32117 开头
2021-03-14 21:02:19.050 [MDNSD(32130)]CRS-8500: Oracle Clusterware MDNSD 进程以操作系统进程 ID 32130 开头
2021-03-14 21:02:19.117 [EVMD(32132)]CRS-8500: Oracle Clusterware EVMD 进程以操作系统进程 ID 32132 开头
2021-03-14 21:02:20.078 [GPNPD(32151)]CRS-8500: Oracle Clusterware GPNPD 进程以操作系统进程 ID 32151 开头
2021-03-14 21:02:21.145 [GIPCD(32172)]CRS-8500: Oracle Clusterware GIPCD 进程以操作系统进程 ID 32172 开头
2021-03-14 21:02:21.163 [GPNPD(32151)]CRS-2328: 已在节点 rac1 上启动 GPNPD。
2021-03-14 21:02:22.172 [ORAROOTAGENT(32181)]CRS-8500: Oracle Clusterware ORAROOTAGENT 进程以操作系统进程 ID 32181 开头
2021-03-14 21:02:22.339 [CLSECHO(32204)]CRS-10001: 14-Mar-21 21:02 ACFS-9391: 正在检查现有 ADVM/ACFS 安装。
2021-03-14 21:02:22.580 [CLSECHO(32209)]CRS-10001: 14-Mar-21 21:02 ACFS-9392: 正在验证操作系统的 ADVM/ACFS 安装文件。
2021-03-14 21:02:22.598 [CLSECHO(32211)]CRS-10001: 14-Mar-21 21:02 ACFS-9393: 正在验证 ASM 管理员设置。
2021-03-14 21:02:22.646 [CLSECHO(32216)]CRS-10001: 14-Mar-21 21:02 ACFS-9308: 正在加载已安装的 ADVM/ACFS 驱动程序。
2021-03-14 21:02:22.678 [CLSECHO(32219)]CRS-10001: 14-Mar-21 21:02 ACFS-9154: 正在加载 'oracleoks.ko' 驱动程序。
2021-03-14 21:02:22.809 [CLSECHO(32234)]CRS-10001: 14-Mar-21 21:02 ACFS-9154: 正在加载 'oracleadvm.ko' 驱动程序。
2021-03-14 21:02:22.892 [CLSECHO(32290)]CRS-10001: 14-Mar-21 21:02 ACFS-9154: 正在加载 'oracleacfs.ko' 驱动程序。
2021-03-14 21:02:23.054 [CLSECHO(32334)]CRS-10001: 14-Mar-21 21:02 ACFS-9327: 正在验证 ADVM/ACFS 设备。
2021-03-14 21:02:23.079 [CLSECHO(32336)]CRS-10001: 14-Mar-21 21:02 ACFS-9156: 正在检测控制设备 '/dev/asm/.asm_ctl_spec'。
2021-03-14 21:02:23.108 [CLSECHO(32340)]CRS-10001: 14-Mar-21 21:02 ACFS-9156: 正在检测控制设备 '/dev/ofsctl'。
2021-03-14 21:02:23.263 [CLSECHO(32346)]CRS-10001: 14-Mar-21 21:02 ACFS-9322: 已完成
2021-03-14 21:02:28.571 [CSSDMONITOR(32409)]CRS-8500: Oracle Clusterware CSSDMONITOR 进程以操作系统进程 ID 32409 开头
2021-03-14 21:02:28.756 [CSSDAGENT(32425)]CRS-8500: Oracle Clusterware CSSDAGENT 进程以操作系统进程 ID 32425 开头
2021-03-14 21:02:28.975 [OCSSD(32436)]CRS-8500: Oracle Clusterware OCSSD 进程以操作系统进程 ID 32436 开头
2021-03-14 21:02:30.072 [OCSSD(32436)]CRS-1713: CSSD 守护程序已在 hub 模式下启动
2021-03-14 21:02:46.185 [OCSSD(32436)]CRS-1707: 节点 rac1 (编号为 1) 的租约获取已完成
2021-03-14 21:02:47.337 [OCSSD(32436)]CRS-1605: CSSD 表决文件联机: ORCL:OCR3; 详细资料见
 /u01/app/gridbase/diag/crs/rac1/crs/trace/ocssd.trc。
2021-03-14 21:02:47.357 [OCSSD(32436)]CRS-1605: CSSD 表决文件联机: ORCL:OCR2; 详细资料见 
/u01/app/gridbase/diag/crs/rac1/crs/trace/ocssd.trc。
2021-03-14 21:02:47.365 [OCSSD(32436)]CRS-1605: CSSD 表决文件联机: ORCL:OCR1; 详细资料见 
/u01/app/gridbase/diag/crs/rac1/crs/trace/ocssd.trc。
2021-03-14 21:02:48.781 [OCSSD(32436)]CRS-1601: CSSD 重新配置完毕。活动节点为 rac1 rac2 。
2021-03-14 21:02:50.971 [OCTSSD(32591)]CRS-8500: Oracle Clusterware OCTSSD 进程以操作系统进程 ID 32591 开头
2021-03-14 21:02:51.938 [OCTSSD(32591)]CRS-2403: 主机 rac1 上的集群时间同步服务处于观察程序模式。
2021-03-14 21:02:52.140 [OCTSSD(32591)]CRS-2407: 新的集群时间同步服务引用节点为主机 rac2。
2021-03-14 21:02:52.140 [OCTSSD(32591)]CRS-2401: 已在主机 rac1 上启动了集群时间同步服务。
2021-03-14 21:02:52.167 [OCTSSD(32591)]CRS-2409: 主机 rac1 上的时钟与集群标准时间不同步。
由于集群时间同步服务正在以观察程序模式运行, 所以未采取任何操作。
2021-03-14 21:02:59.284 [ORAAGENT(32117)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" (
位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:03:01.486 [ORAAGENT(32117)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" (
位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:03:01.514 [ORAAGENT(32117)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" (
位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:03:18.163 [OCTSSD(32591)]CRS-2407: 新的集群时间同步服务引用节点为主机 rac1。
2021-03-14 21:03:19.406 [OCSSD(32436)]CRS-1625: 节点 rac2 (编号为 2) 已关闭
2021-03-14 21:03:19.419 [OCSSD(32436)]CRS-1601: CSSD 重新配置完毕。活动节点为 rac1 。
2021-03-14 21:03:24.916 [OSYSMOND(318)]CRS-8500: Oracle Clusterware OSYSMOND 进程以操作系统进程 ID 318 开头
2021-03-14 21:03:26.558 [CRSD(325)]CRS-8500: Oracle Clusterware CRSD 进程以操作系统进程 ID 325 开头
2021-03-14 21:03:27.750 [CRSD(325)]CRS-1012: 已在节点 rac1 上启动 OCR 服务。
2021-03-14 21:03:27.807 [CRSD(325)]CRS-1201: 已在节点 rac1 上启动 CRSD。
2021-03-14 21:03:28.470 [ORAAGENT(1027)]CRS-8500: Oracle Clusterware ORAAGENT 进程以操作系统进程 ID 1027 开头
2021-03-14 21:03:28.499 [ORAROOTAGENT(1031)]CRS-8500: Oracle Clusterware ORAROOTAGENT 进程以操作系统进程 ID 1031 开头
2021-03-14 21:03:28.515 [ORAAGENT(1036)]CRS-8500: Oracle Clusterware ORAAGENT 进程以操作系统进程 ID 1036 开头
2021-03-14 21:03:28.666 [ORAAGENT(1036)]CRS-5011: 检查资源 "oracledb" 失败: 详细资料见 "(:CLSN00007:)"
 (位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/crsd_oraagent_oracle.trc")
2021-03-14 21:03:30.649 [ORAAGENT(32117)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" 
(位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:03:30.718 [ORAAGENT(32117)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" 
(位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:03:30.722 [CRSD(325)]CRS-1024: 由于此节点上的 ASM 实例未处于活动状态, 此节点上的集群就绪服务终止。详细信息见 
(:PROCR00009:) (位于 /u01/app/gridbase/diag/crs/rac1/crs/trace/crsd.trc)。
2021-03-14 21:03:30.736 [ORAROOTAGENT(1031)]CRS-5822: 代理 '/u01/app/grid/12.1.0/bin/orarootagent_root' 已从服务器断开连接。
详细资料见 (:CRSAGF00117:) {0:3:3} (位于 /u01/app/gridbase/diag/crs/rac1/crs/trace/crsd_orarootagent_root.trc)。
2021-03-14 21:03:30.736 [ORAAGENT(1027)]CRS-5822: 代理 '/u01/app/grid/12.1.0/bin/oraagent_grid' 已从服务器断开连接。
详细资料见 (:CRSAGF00117:) {0:1:3} (位于 /u01/app/gridbase/diag/crs/rac1/crs/trace/crsd_oraagent_grid.trc)。
2021-03-14 21:03:30.793 [CRSD(1157)]CRS-8500: Oracle Clusterware CRSD 进程以操作系统进程 ID 1157 开头
2021-03-14 21:03:31.457 [OLOGGERD(1162)]CRS-8500: Oracle Clusterware OLOGGERD 进程以操作系统进程 ID 1162 开头
2021-03-14 21:03:31.798 [ORAAGENT(32117)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" 
(位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:03:31.823 [ORAAGENT(32117)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" 
(位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:03:40.234 [CRSD(1157)]CRS-1013: ASM 磁盘组中的 OCR 位置不可访问。
详细资料见 /u01/app/gridbase/diag/crs/rac1/crs/trace/crsd.trc。
2021-03-14 21:03:40.238 [CRSD(1157)]CRS-0804: 由于 Oracle 集群注册表错误 [PROC-26: 访问物理存储时出错
ORA-15077: 找不到提供所需磁盘组的 ASM 实例
], 集群就绪服务中止。详细资料见 (:CRSD00111:) (位于 /u01/app/gridbase/diag/crs/rac1/crs/trace/crsd.trc)。

从整个集群的启动过程看cssd,crs都起来了,然后等一会由于crs无法访问ocr磁盘组,导致异常.开始crs起来了,证明ocr磁盘组应该是mount成功过.后面看错误提示又无法访问了.根据经验以及ora.asm失败的提示,怀疑很可能是asm实例出现问题了.对于这样的情况,分析asm的alert日志是最好的方法.通过分析日志发现

Sun Mar 14 21:03:24 2021
NOTE: Instance updated compatible.asm to 12.1.0.0.0 for grp 1
Sun Mar 14 21:03:24 2021
SUCCESS: diskgroup ARCHLOG was mounted
Sun Mar 14 21:03:24 2021
NOTE: Instance updated compatible.asm to 12.1.0.0.0 for grp 2
Sun Mar 14 21:03:24 2021
SUCCESS: diskgroup DATA was mounted
Sun Mar 14 21:03:24 2021
NOTE: Instance updated compatible.asm to 12.1.0.0.0 for grp 3
Sun Mar 14 21:03:24 2021
SUCCESS: diskgroup OCR was mounted
Sun Mar 14 21:03:24 2021
NOTE: Instance updated compatible.asm to 12.1.0.0.0 for grp 5
Sun Mar 14 21:03:24 2021
SUCCESS: ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:9:3} */
Sun Mar 14 21:03:24 2021
WARNING: failed to online diskgroup resource ora.ARCHLOG.dg (unable to communicate with CRSD/OHASD)
WARNING: failed to online diskgroup resource ora.DATA.dg (unable to communicate with CRSD/OHASD)
WARNING: failed to online diskgroup resource ora.OCR.dg (unable to communicate with CRSD/OHASD)
Errors in file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_32721.trc  (incident=123423):
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], []
Incident details in: /u01/app/gridbase/diag/asm/+asm/+ASM1/incident/incdir_123423/+ASM1_rbal_32721_i123423.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Sun Mar 14 21:03:25 2021
ERROR: An unrecoverable error has been identified in ASM metadata.
Sun Mar 14 21:03:27 2021
NOTE: [crsd.bin@rac1.schic.org (TNS V1-V3) 325] opening OCR file +OCR.255.4294967295
Starting background process ASMB
Sun Mar 14 21:03:27 2021
ASMB started with pid=28, OS id=932 
Sun Mar 14 21:03:27 2021
NOTE: ASMB registering with ASM instance as Standard client 0xffffffffffffffff (reg:3401595347) (new connection)
Sun Mar 14 21:03:27 2021
NOTE: Standard client +ASM1:+ASM:racscan registered, osid 934, mbr 0x0, asmb 932 (reg:3401595347)
Sun Mar 14 21:03:27 2021
NOTE: ASMB connected to ASM instance +ASM1 osid: 934 (Flex mode; client id 0xffffffffffffffff)
Sun Mar 14 21:03:28 2021
NOTE: AMDU dump of disk group ARCHLOG initiated at /u01/app/gridbase/diag/asm/+asm/+ASM1/incident/incdir_123423
ERROR: ORA-600 in COD recovery for diskgroup 1/0x730955f2 (ARCHLOG)
ERROR: ORA-600 thrown in RBAL for group number 1
Sun Mar 14 21:03:30 2021
Errors in file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_32721.trc:
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], []
Sun Mar 14 21:03:30 2021
Errors in file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_32721.trc:
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], []
USER (ospid: 32721): terminating the instance due to error 488
Sun Mar 14 21:03:30 2021
System state dump requested by (instance=1, osid=32721 (RBAL)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_diag_32691_20210314210330.trc
Sun Mar 14 21:03:30 2021
Dumping diagnostic data in directory=[cdmp_20210314210330], requested by (instance=1, osid=32721 (RBAL)), s
ummary=[abnormal instance termination].
Sun Mar 14 21:03:30 2021
Instance terminated by USER, pid = 32721

通过上述日志,果然发现ocr磁盘组先mount成功,然后asm实例由于ARCHLOG磁盘组的ORA-00600 kfdAuDealloc2错误而导致整个实例crash,从而使得ocr磁盘组无法被crs访问,从而出现了”CRS-0804: 由于 Oracle 集群注册表错误 [PROC-26: 访问物理存储时出错 ORA-15077: 找不到提供所需磁盘组的 ASM 实例], 集群就绪服务中止”这样的错误提示.进一步分析为什么archlog进程会报这个错误.

SQL> /* ASMCMD */alter diskgroup /*ASMCMD*/ "DATA" drop file '+DATA/xff/XIFENFEI.270.1040985885' 
Sun Mar 14 20:46:46 2021
SUCCESS: /* ASMCMD */alter diskgroup /*ASMCMD*/ "DATA" drop file '+DATA/xff/XIFENFEI.270.1040985885'
Sun Mar 14 20:49:24 2021
NOTE: Dropping directory '+archlog/oracledb/archivelog/2021_03_11' recursively
Sun Mar 14 20:49:24 2021
Errors in file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_ora_15281.trc  (incident=114015):
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], []
Incident details in: /u01/app/gridbase/diag/asm/+asm/+ASM1/incident/incdir_114015/+ASM1_ora_15281_i114015.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Sun Mar 14 20:49:24 2021
ERROR: An unrecoverable error has been identified in ASM metadata.
NOTE:AMDU dump of disk group ARCHLOG initiated at/u01/app/gridbase/diag/asm/+asm/+ASM1/incident/incdir_114015
Sun Mar 14 20:49:28 2021
Errors in file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_ora_15281.trc  (incident=114016):
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], []

因为这个库有一个历史背景:几天前由于存储cache导致,数据库使用备份还原(还原到一个新磁盘组中,老磁盘组没有使用),今天估计是运维人员在清理老磁盘组中不要的文件,然后archlog中的归档日志的时候,清空了+archlog/oracledb/archivelog/2021_03_11中的文件,然后触发asm删除该目录的异常(异常原因估计和上次清理存储cache引起了该磁盘组的元数据异常有关).该故障的基本思路原因已经清楚:由于archlog磁盘组本身元数据库有问题,清理该磁盘组文件之后,引起该磁盘组删除空目录出发问题,从而使得整个asm 实例crash.进而引起crs异常.解决方法比较简单,因为archlog磁盘组本身已经不需要,直接dd掉磁盘头,让其启动的时候不再mount,故障解决

[grid@rac1 ~]$ crsctl status res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ARCHLOG.dg
               ONLINE  OFFLINE      rac1                     STABLE
               ONLINE  OFFLINE      rac2                     STABLE
ora.DATA.dg
               ONLINE  ONLINE       rac1                     STABLE
               ONLINE  ONLINE       rac2                     STABLE
ora.LISTENER.lsnr
               ONLINE  ONLINE       rac1                     STABLE
               ONLINE  ONLINE       rac2                     STABLE
ora.LISTENER1.lsnr
               ONLINE  ONLINE       rac1                     STABLE
               ONLINE  ONLINE       rac2                     STABLE
ora.NEWDATA.dg
               ONLINE  ONLINE       rac1                     STABLE
               ONLINE  ONLINE       rac2                     STABLE
ora.OCR.dg
               ONLINE  ONLINE       rac1                     STABLE
               ONLINE  ONLINE       rac2                     STABLE
ora.asm
               ONLINE  ONLINE       rac1                     Started,STABLE
               ONLINE  ONLINE       rac2                     Started,STABLE
ora.net1.network
               ONLINE  ONLINE       rac1                     STABLE
               ONLINE  ONLINE       rac2                     STABLE
ora.ons
               ONLINE  ONLINE       rac1                     STABLE
               ONLINE  ONLINE       rac2                     STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       rac2                     STABLE
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       rac1                     STABLE
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       rac1                     STABLE
ora.MGMTLSNR
      1        ONLINE  OFFLINE      rac2                     169.254.86.142 7.7.7
                                                             .1,STARTING
ora.cvu
      1        ONLINE  ONLINE       rac1                     STABLE
ora.mgmtdb
      1        ONLINE  OFFLINE                               Instance Shutdown,ST
                                                             ABLE
ora.oc4j
      1        ONLINE  OFFLINE      rac1                     STARTING
ora.xff.db
      1        ONLINE  OFFLINE      rac1                     STARTING
      2        ONLINE  OFFLINE      rac2                     STARTING
ora.rac1.vip
      1        ONLINE  ONLINE       rac1                     STABLE
ora.rac2.vip
      1        ONLINE  ONLINE       rac2                     STABLE
ora.scan1.vip
      1        ONLINE  ONLINE       rac2                     STABLE
ora.scan2.vip
      1        ONLINE  ONLINE       rac1                     STABLE
ora.scan3.vip
      1        ONLINE  ONLINE       rac1                     STABLE
--------------------------------------------------------------------------------
[grid@rac1 ~]$ 
发表在 ORACLE 12C, Oracle RAC | 标签为 , , | 评论关闭

has a disk HB, but no network HB—-traceroute不通导致

有客户反馈集群有一个节点异常,检查发现crs进程异常
3


重启crs发现cssd进程无法正常启动
no-network-hb

明显私网异常,进一步分析发现私网相互可以ping,但是无法traceroute其他节点
traceroute-not-work
traceroute-not-work2

客户反馈近期安装了安全软件,客户停掉安全软件之后,traceroute恢复正常
1
2

集群也正常启动

[root@his01 cssd]# crsctl status res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       his01                                        
               ONLINE  ONLINE       his02                                        
ora.FRA.dg
               ONLINE  ONLINE       his01                                        
               ONLINE  ONLINE       his02                                        
ora.LISTENER.lsnr
               ONLINE  ONLINE       his01                                        
               ONLINE  ONLINE       his02                                        
ora.OCRVOTE.dg
               ONLINE  ONLINE       his01                                        
               ONLINE  ONLINE       his02                                        
ora.asm
               ONLINE  ONLINE       his01                    Started             
               ONLINE  ONLINE       his02                    Started             
ora.gsd
               OFFLINE OFFLINE      his01                                        
               OFFLINE OFFLINE      his02                                        
ora.net1.network
               ONLINE  ONLINE       his01                                        
               ONLINE  ONLINE       his02                                        
ora.ons
               ONLINE  ONLINE       his01                                        
               ONLINE  ONLINE       his02                                        
ora.registry.acfs
               ONLINE  ONLINE       his01                                        
               ONLINE  ONLINE       his02                                        
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       his02                                        
ora.cvu
      1        ONLINE  ONLINE       his02                                        
ora.his01.vip
      1        ONLINE  ONLINE       his01                                        
ora.his02.vip
      1        ONLINE  ONLINE       his02                                        
ora.oc4j
      1        ONLINE  ONLINE       his02                                        
ora.orcl.db
      1        ONLINE  ONLINE       his01                    Open                
      2        ONLINE  ONLINE       his02                    Open                
ora.scan1.vip
      1        ONLINE  ONLINE       his02                                        
发表在 Oracle RAC | 标签为 , , | 评论关闭

12.1人工修改操作系统时间导致数据库异常

有客户数据库版本为12.1.0.1 版本RAC,突发发生重启,让协助分析原因
数据库alert日志报ORA-15064错误

Mon Apr 15 15:06:26 2019
WARNING: inbound connection timed out (ORA-3136)
Mon Apr 15 15:41:26 2019
NOTE: ASMB terminating
Mon Apr 15 15:41:26 2019
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_asmb_61426.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 1892 Serial number: 29
Mon Apr 15 15:41:26 2019
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_asmb_61426.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 1892 Serial number: 29
Mon Apr 15 15:41:26 2019
System state dump requested by (instance=1, osid=61426 (ASMB)), summary=[abnormal instance termination].
Mon Apr 15 15:41:26 2019
USER (ospid: 61426): terminating the instance due to error 15064
Mon Apr 15 15:41:26 2019
System State dumped to trace file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_diag_61287.trc
Mon Apr 15 15:41:27 2019
opiodr aborting process unknown ospid (1171) as a result of ORA-1092
Mon Apr 15 15:41:27 2019
ORA-1092 : opitsk aborting process

这里看,明显asmb异常导致数据库无法正常访问asm从而出现数据库crash的问题.

分析asm日志

Mon Apr 15 15:41:26 2019
WARNING: client [+ASM1:+ASM] not responsive for 2069s; state=0x1. pid 23155
NOTE: umbilicus traces dumped to /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_23050.trc
WARNING: client [orcl1:orcl] not responsive for 2069s; state=0x1. killing pid 61436
NOTE: umbilicus traces dumped to /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_23050.trc
WARNING: fencing client [orcl1:orcl] after 2069 seconds (mbr 2)
WARNING: client [-MGMTDB:_mgmtdb] not responsive for 2070s; state=0x1. killing pid 24026
NOTE: umbilicus traces dumped to /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_23050.trc
WARNING: fencing client [-MGMTDB:_mgmtdb] after 2070 seconds (mbr 1)
Mon Apr 15 15:41:26 2019
NOTE: cleaned up ASM client -MGMTDB:_mgmtdb
NOTE: cleaned up ASM client orcl1:orcl
Mon Apr 15 15:41:43 2019
NOTE: Standard client -MGMTDB:_mgmtdb registered, osid 183707, mbr 0x1 (reg:1371965153)
Mon Apr 15 15:42:16 2019
NOTE: Standard client orcl1:orcl registered, osid 184063, mbr 0x2 (reg:2088418628)
Mon Apr 15 15:44:30 2019
Warning: VKTM detected a time drift.
Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more details.

asm日志中和mos中的GEN0 terminating the ASM instance due to error 15082 (文档 ID 2096988.1)描述比较匹配.根据客户反馈,他们使用ntp进行修改了时间,基本上可以确定是由于oracle的Bug 19032250(在12.1.0.2中修复)在ntp修改时间跨度过大触发的相关问题(人工直接修改时间也可能出现类似问题)

对于rac修改时间建议
1. 如果时间慢了,关闭数据库和集群直接把时间向前调整,启动集群和数据库
2. 如果时间快了,关闭数据库和集群等实际时间过关闭集群和库的时间之后,再往回调整时间,启动集群和数据库

发表在 Oracle RAC | 评论关闭