使用oradebug修改数据库scn

闲着无事看到几篇文章介绍了使用oradebug修改数据库scn的案例,这里也做了两个测试,发现该功能确实很巧妙,通过修改内存中的scn值,然后写入控制文件和数据文件,实现修改scn的方法,不过同样该方法的危害性极大,这里仅供测试使用,生产环境切不可乱使用,可能引起很严重后果
数据库版本信息

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - Prod
PL/SQL Release 10.2.0.4.0 - Production
CORE    10.2.0.4.0      Production
TNS for Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production

SQL> select '惜分飞' XIFENFEI FROM DUAL;

XIFENF
------
惜分飞

在open库中修改scn

SQL> oradebug setmypid
Statement processed.

--查看当前scn
SQL>  oradebug DUMPvar SGA kcsgscn_
kcslf kcsgscn_ [20009228, 20009248) = 00000000 0007A09F 00000019 00000000 00000000 00000000 00000000 20009034
SQL>  select CHECKPOINT_CHANGE# a from v$datafile;

         A
----------
    499314
    499314
    499314
    499314

SQL> select dbms_flashback.get_system_change_number a from dual;

         A
----------
    499877

SQL> select to_number('7A09F','xxxxxxxxx') from dual;

TO_NUMBER('7A09F','XXXXXXXXX')
------------------------------
                        499871

--修改内存中scn值(十进制)
SQL>  oradebug poke 0x20009228 4 8
BEFORE: [20009228, 2000922C) = 00000000
AFTER:  [20009228, 2000922C) = 00000008
SQL> oradebug DUMPvar SGA kcsgscn_
kcslf kcsgscn_ [20009228, 20009248) = 00000008 0007A0D8 00000052 00000000 00000000 00000000 00000000 20009034

SQL> col a for 999999999999999
SQL> select dbms_flashback.get_system_change_number a from dual;

         A
------------
34360238301

SQL> select to_number('8','xx')*4294967296+to_number('0007A0D8','xxxxxxxx') a from dual;

               A
----------------
     34360238296

--做一个checkpoint为了内存中的scn值写入控制文件和数据文件
SQL>  alter system checkpoint;

System altered.

SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.

Total System Global Area  318767104 bytes
Fixed Size                  1267236 bytes
Variable Size              96471516 bytes
Database Buffers          213909504 bytes
Redo Buffers                7118848 bytes
Database mounted.
Database opened.

SQL> col a for 999999999999999
SQL> select CHECKPOINT_CHANGE# a from v$datafile;

               A
----------------
     34360238496
     34360238496
     34360238496
     34360238496

SQL> select CHECKPOINT_CHANGE# a from v$datafile_header;

               A
----------------
     34360238496
     34360238496
     34360238496
     34360238496

在mount库中修改scn

SQL> startup mount
ORACLE instance started.

Total System Global Area  318767104 bytes
Fixed Size                  1267236 bytes
Variable Size              96471516 bytes
Database Buffers          213909504 bytes
Redo Buffers                7118848 bytes
Database mounted.
SQL> oradebug setmypid
Statement processed.
--因为数据库是mount状态不能看到scn值
SQL>  oradebug DUMPvar SGA kcsgscn_
kcslf kcsgscn_ [20009228, 20009248) = 00000000 00000000 00000000 00000000 00000000 00000000 00000000 20009034
SQL>  col a for 999999999999999
SQL> select CHECKPOINT_CHANGE# a from v$datafile_header;

               A
----------------
     34360240739
     34360240739
     34360240739
     34360240739

--求出WRAP SCN值
SQL> select 34360240739/4294967296 from dual;

34360240739/4294967296
----------------------
            8.00011697

--修改内存中scn值(十六进制)
SQL> oradebug poke 0x20009228 4 0x0000000a
BEFORE: [20009228, 2000922C) = 00000000
AFTER:  [20009228, 2000922C) = 0000000A
SQL> oradebug DUMPvar SGA kcsgscn_
kcslf kcsgscn_ [20009228, 20009248) = 0000000A 00000000 00000000 00000000 00000000 00000000 00000000 20009034

SQL> alter database open;

Database altered.

SQL> select dbms_flashback.get_system_change_number a from dual
  2  ;

               A
----------------
     42949673074

--注意:使用此种方法修改BASE SCN如果不指定,会从0开始计数
SQL>  oradebug DUMPvar SGA kcsgscn_
kcslf kcsgscn_ [20009228, 20009248) = 0000000A 00000077 0000001C 00000000 00000000 00000000 00000000 20009034
SQL> select to_number('A','xx')*4294967296+to_number('00000077','xxxxxxxx') a from dual;

               A
----------------
     42949673079

SQL> alter system checkpoint;

System altered.

SQL> select CHECKPOINT_CHANGE# a from v$datafile_header;

               A
----------------
     42949673095
     42949673095
     42949673095
     42949673095

SQL> select CHECKPOINT_CHANGE# a from v$datafile;

               A
----------------
     42949673095
     42949673095
     42949673095
     42949673095

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.

Total System Global Area  318767104 bytes
Fixed Size                  1267236 bytes
Variable Size              96471516 bytes
Database Buffers          213909504 bytes
Redo Buffers                7118848 bytes
Database mounted.
Database opened.
SQL> select CHECKPOINT_CHANGE# a from v$datafile_header;

               A
----------------
     42949673231
     42949673231
     42949673231
     42949673231

在oradebug推进scn的过程中,需要注意不同平台,不同位数的ORACLE数据库可能推进方式有一定的区别,操作前最好在系统平台位数上进行测试,否则有可能导致恢复后果更加麻烦

发表在 非常规恢复 | 标签为 , | 3 条评论

DATAGUARD中MAXIMUM AVAILABILITY+LGWR SYNC导致主库不能启动

qq群里面的一朋友的的DG因为备库已经下架,主库重启的时候不能正常启动,帮忙处理结果如下
版本相关信息

Oracle9i Enterprise Edition Release 9.2.0.1.0 - 64bit Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.1.0 - Production
ORACLE_HOME = /export/home/oracle/product/9.2.0
System name:    SunOS
Node name:      eTermSrv2
Release:        5.10
Version:        Generic
Machine:        sun4u
Instance name: abcd

数据库不能启动日志

Fri Aug 10 12:37:56 2012
ALTER DATABASE OPEN
Fri Aug 10 12:37:56 2012
LGWR: Primary database is in CLUSTER CONSISTENT mode
LGWR: Primary database is in MAXIMUM AVAILABILITY mode
LGWR: Destination LOG_ARCHIVE_DEST_1 is not serviced by LGWR
LNS0 started with pid=16
Fri Aug 10 12:37:59 2012
ORA-1013 signalled during: ALTER DATABASE OPEN...
Fri Aug 10 12:41:45 2012
LGWR: Error 12535 verifying archivelog destination LOG_ARCHIVE_DEST_2
LGWR: Continuing...
Fri Aug 10 12:41:45 2012
Errors in file /export/home/oracle/admin/abcd/bdump/abcd_lgwr_11504.trc:
ORA-12535: TNS:operation timed out
******************************************************************
LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_2
******************************************************************
Creating archive destination LOG_ARCHIVE_DEST_2: 'ora9i'
LGWR: Error 12535 creating archivelog file 'ora9i'
Fri Aug 10 12:45:32 2012
Errors in file /export/home/oracle/admin/abcd/bdump/abcd_lgwr_11504.trc:
ORA-12535: TNS:operation timed out
LGWR: Completed archiving  log 1 thread 1 sequence 6808
Thread 1 advanced to log sequence 6808
Fri Aug 10 12:45:32 2012
Errors in file /export/home/oracle/admin/abcd/bdump/abcd_lgwr_11504.trc:
ORA-01157: cannot identify/lock data file 1 - see DBWR trace file
ORA-01110: data file 1: '/export/home/oracle/oradata/abcd/system01.dbf'
LGWR: terminating instance due to error 1157
Instance terminated by LGWR, pid = 11504

通过这里大概看出数据库原dg配置是MAXIMUM AVAILABILITY,因为备机下架,导致ora9i的tns不能访问,从而出现一些列错误,其中使得lgwr异常,因为oracle的某种内部机制,导致dbwr不能访问数据文件(这里体现出来是system01.dbf不能访问,但实际上应该是所有所有数据文件均不能访问,因为system01.dbf位于第一,所以报出该错误.)

*** SESSION ID:(3.1) 2012-08-10 12:37:56.847
Destination LOG_ARCHIVE_DEST_1 is not serviced by LGWR
*** 2012-08-10 12:41:45.614
Error 12535 connecting to destination LOG_ARCHIVE_DEST_2 standby host 'ora9i'
*** 2012-08-10 12:41:45.615
LGWR: Error 12535 verifying archivelog destination LOG_ARCHIVE_DEST_2
Continuing...
ORA-12535: TNS:operation timed out
*** 2012-08-10 12:45:32.514
Error 12535 connecting to destination LOG_ARCHIVE_DEST_2 standby host 'ora9i'
Error 12535 attaching to destination LOG_ARCHIVE_DEST_2 standby host 'ora9i'
*** 2012-08-10 12:45:32.514
LGWR: Error 12535 creating archivelog file 'ora9i'
*** 2012-08-10 12:45:32.514
kcrrfail: dest:2 err:12535 force:0
ORA-12535: TNS:operation timed out
error 1157 detected in background process
ORA-01157: cannot identify/lock data file 1 - see DBWR trace file
ORA-01110: data file 1: '/export/home/oracle/oradata/abcd/system01.dbf'

通过trace文件,更加清楚的说明,可能是因为lgwr异常导致dbwr访问数据文件出现问题.

问题分析/解决汇总

SQL> show parameter log_archive_dest_state_1;

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
log_archive_dest_state_1             string      enable

SQL> show parameter log_archive_dest_state_2;

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
log_archive_dest_state_2             string      enable

SQL> show parameter log_archive_dest_1;

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
log_archive_dest_1                   string      location=/export/home/oracle/o
                                                 radata/abcd/archive

SQL> show parameter log_archive_dest_2;

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
log_archive_dest_2                   string      SERVICE=ora9i LGWR SYNC AFFIRM

SQL>select  protection_mode,database_role from v$database;

PROTECTION_MODE      DATABASE_ROLE
-------------------- ----------------
MAXIMUM AVAILABILITY PRIMARY

SQL>   show parameter succ;

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
log_archive_min_succeed_dest         integer     1

通过上面的sql查询结果和alert日志与trace文件结合分析,大体结论是:
数据库的dg的保护模式为:MAXIMUM AVAILABILITY
数据库的日志传输方式是:LGWR SYNC AFFIRM
现在因为备机下架,主库LGWR不能通过tns访问备库,从而导致数据库的dbwr进程访问异常,是的数据库不能正常被open
这里的故障有一些巧合:MAXIMUM AVAILABILITY+LGWR SYNC AFFIRM+9.2.0.1+SunOS
我不清楚是不是ORACLE的bug导致,但是这个问题解决起来比较简单,只需要修改log_archive_dest_state_2=defer,使得log_archive_dest_2参数不生效,让lgwr不再访问备机

发表在 Data Guard | 2 条评论

处理smon清理临时段导致数据库异常案例

一个朋友的数据库在经过自己的千辛万苦终于open成功,但是几分钟就down掉,使得他想导出数据重建库的目标不能实现.让我帮忙处理
alert日志报ORA-00600[kafspa:columnBuffer1]

Wed Aug  8 10:55:31 2012
Completed: ALTER DATABASE OPEN
Wed Aug  8 10:55:41 2012
Errors in file /oracle/ora10/admin/ora10g/udump/ora10g_ora_12160.trc:
ORA-00600: internal error code, arguments: [kafspa:columnBuffer1], [28], [22], [], [], [], [], []
Wed Aug  8 10:55:47 2012
Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 2 out of maximum 100 non-fatal internal errors.
Wed Aug  8 10:55:47 2012
Errors in file /oracle/ora10/admin/ora10g/bdump/ora10g_smon_8941.trc:
ORA-00600: internal error code, arguments: [kafspa:columnBuffer1], [28], [22], [], [], [], [], []
Wed Aug  8 10:55:58 2012
Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 3 out of maximum 100 non-fatal internal errors.
Wed Aug  8 10:55:59 2012
Errors in file /oracle/ora10/admin/ora10g/bdump/ora10g_smon_8941.trc:
ORA-00600: internal error code, arguments: [kafspa:columnBuffer1], [28], [22], [], [], [], [], []
Wed Aug  8 10:56:10 2012
Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 4 out of maximum 100 non-fatal internal errors.
Wed Aug  8 10:56:11 2012
Errors in file /oracle/ora10/admin/ora10g/bdump/ora10g_smon_8941.trc:
ORA-00600: internal error code, arguments: [kafspa:columnBuffer1], [28], [22], [], [], [], [], []
Wed Aug  8 10:56:22 2012
Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 5 out of maximum 100 non-fatal internal errors.
Wed Aug  8 10:56:32 2012
Errors in file /oracle/ora10/admin/ora10g/bdump/ora10g_smon_8941.trc:
ORA-00600: internal error code, arguments: [kafspa:columnBuffer1], [28], [22], [], [], [], [], []
Wed Aug  8 10:56:43 2012
Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 6 out of maximum 100 non-fatal internal errors.
Wed Aug  8 10:56:53 2012
Errors in file /oracle/ora10/admin/ora10g/bdump/ora10g_smon_8941.trc:
ORA-00600: internal error code, arguments: [kafspa:columnBuffer1], [28], [22], [], [], [], [], []
Wed Aug  8 10:57:04 2012
Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 7 out of maximum 100 non-fatal internal errors.
Wed Aug  8 10:57:14 2012
Errors in file /oracle/ora10/admin/ora10g/bdump/ora10g_smon_8941.trc:
ORA-00600: internal error code, arguments: [kafspa:columnBuffer1], [28], [22], [], [], [], [], []
Wed Aug  8 10:57:25 2012
Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 8 out of maximum 100 non-fatal internal errors.
Wed Aug  8 10:57:35 2012
Errors in file /oracle/ora10/admin/ora10g/bdump/ora10g_smon_8941.trc:
ORA-00600: internal error code, arguments: [kafspa:columnBuffer1], [28], [22], [], [], [], [], []
Wed Aug  8 10:57:38 2012
Errors in file /oracle/ora10/admin/ora10g/bdump/ora10g_pmon_8856.trc:
ORA-00474: SMON process terminated with error
Wed Aug  8 10:57:38 2012
PMON: terminating instance due to error 474
Instance terminated by PMON, pid = 8856

这里出现ORA-00600[kafspa:columnBuffer1],一个未知的错误,但是根据相关的提示,可以大概猜出来是什么原因导致数据库异常

Non-fatal internal error happenned while SMON was doing temporary segment drop.

出现这个错误,使得我们想到一个smon的功能,清理临时段.该数据库down掉很可能和smon清理临时段的过程发生失败有关系

SMON encountered 8 out of maximum 100 non-fatal internal errors.

这个错误提示是因为smon内部最多允许发生100次错误,记录错误发生了8次,当然这次数据库down掉是smon还没有达到100次就直接abort掉

SQL> col name for a32
SQL> col value for a24
SQL> col description for a70
SQL> set linesize 150
SQL> select a.ksppinm name,b.ksppstvl value,a.ksppdesc description
  2    from x$ksppi a,x$ksppcv b
  3   where a.inst_id = USERENV ('Instance')
  4     and b.inst_id = USERENV ('Instance')
  5     and a.indx = b.indx
  6     and upper(a.ksppinm) LIKE upper('%&param%')
  7  order by name
  8  
SQL> /
Enter value for param: smon_internal_errlimit
old   6:    and upper(a.ksppinm) LIKE upper('%&param%')
new   6:    and upper(a.ksppinm) LIKE upper('%smon_internal_errlimit%')

NAME                             VALUE                    DESCRIPTION
-------------------------------- ------------------------ -----------------------------------
_smon_internal_errlimit          100                      limit of SMON internal errors

分析trace文件

Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /oracle/ora10/product
System name:	Linux
Node name:	DBN-HLD-155-2
Release:	2.6.18-92.el5PAE
Version:	#1 SMP Tue Apr 29 13:31:02 EDT 2008
Machine:	i686
Instance name: ora10g
Redo thread mounted by this instance: 1
Oracle process number: 8
Unix process pid: 8941, image: oracle@DBN-HLD-155-2 (SMON)

*** SERVICE NAME:() 2012-08-08 10:55:20.208
*** SESSION ID:(274.1) 2012-08-08 10:55:20.208
*** 2012-08-08 10:55:20.208
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [25027], [0], [0], [], [], [], [], []
Current SQL statement for this session:
select file#, block#, ts# from seg$ where type# = 3

果然是smon在查询type#=3的时候发现异常,出现ORA-00600[25027]错误.通过对seg$相关视图分析,可以知道type#=3表示临时段,也就是说数据库smon在查询哪些segment是临时段的时候发生意外,而结合alert日志,完整的错误应该就是:数据库启动后,smon进程为了清理临时段,需要通过select file#, block#, ts# from seg$ where type# = 3查询临时段,但是在查询的时候发生错误.而这个错误累积几次导致数据库异常关闭.

解决问题

--修改两个参数
event='10061 trace name context forever, level 10'
_smon_internal_errlimit=1000000

--启动数据库
SQL> startup pfile='/tmp/pfile'
ORACLE instance started.

Total System Global Area 1577058304 bytes
Fixed Size                  1267716 bytes
Variable Size             385878012 bytes
Database Buffers         1174405120 bytes
Redo Buffers               15507456 bytes
Database mounted.
Database opened.

因为屏蔽了smon回收临时段,数据库未出现开始时错误,观察数据库几个小时,运行正常,到此基本上解决了此次异常,通过exp可以顺利导出数据然后导入到新库中.

导致该异常sql分析

SQL> select file#, block#, ts# from seg$ where type# = 3;
select file#, block#, ts# from seg$ where type# = 3
                               *
ERROR at line 1:
ORA-00600: internal error code, arguments: [25027], [0], [0], [], [], [], [],
[]


SQL> set autot trace exp
SQL> select file#, block#, ts# from seg$ where type# = 3;

Execution Plan
----------------------------------------------------------
Plan hash value: 1605285479

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |    10 |   150 |   389   (1)| 00:00:05 |
|*  1 |  TABLE ACCESS FULL| SEG$ |    10 |   150 |   389   (1)| 00:00:05 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("TYPE#"=3)

SQL> select count(rowid) from seg$;

Execution Plan
----------------------------------------------------------
Plan hash value: 763549841

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    12 |   389   (1)| 00:00:05 |
|   1 |  SORT AGGREGATE    |      |     1 |    12 |            |          |
|   2 |   TABLE ACCESS FULL| SEG$ | 45682 |   535K|   389   (1)| 00:00:05 |
---------------------------------------------------------------------------

SQL> select /*+ full(t) */ count(*) from seg$;

Execution Plan
----------------------------------------------------------
Plan hash value: 763549841

-------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Cost (%CPU)| Time     |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    29   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |            |          |
|   2 |   TABLE ACCESS FULL| SEG$ |  3223 |    29   (0)| 00:00:01 |
-------------------------------------------------------------------

SQL> select ts# from seg$;

Execution Plan
----------------------------------------------------------
Plan hash value: 1605285479

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  3223 |  9669 |    29   (0)| 00:00:01 |
|   1 |  TABLE ACCESS FULL| SEG$ |  3223 |  9669 |    29   (0)| 00:00:01 |
--------------------------------------------------------------------------

SQL> set autot off
SQL> select count(rowid) from seg$;                                                                                 
SQL> /

COUNT(ROWID)
------------
       45727

SQL> select /*+ full(t) */ count(*) from seg$;

  COUNT(*)
----------
     45727

SQL> select  ts# from seg$;
0
0
……
7
7

ORA-00600: internal error code, arguments: [25027], [0], [0], [], [], [], [],
[]

SQL> !dbv file='/oracle/ora10/oradata/ora10g/system01.dbf'

DBVERIFY: Release 10.2.0.4.0 - Production on Thu Aug 9 14:05:09 2012

Copyright (c) 1982, 2007, Oracle.  All rights reserved.

DBVERIFY - Verification starting : FILE = /oracle/ora10/oradata/ora10g/system01.dbf
Block Checking: DBA = 4225673, Block Type = KTB-managed data block
data header at 0xb6d8225c
kdbchk: bad row offset slot 6 offs 3030 fseo 3752 dtl 8168 bhs 72
Page 31369 failed with check code 6135
Block Checking: DBA = 4236289, Block Type = KTB-managed data block
data header at 0xb6d7225c
kdbchk: tosp bad (-13399)
Page 41985 failed with check code 6127


DBVERIFY - Verification complete

Total Pages Examined         : 192000
Total Pages Processed (Data) : 47588
Total Pages Failing   (Data) : 2
Total Pages Processed (Index): 40929
Total Pages Failing   (Index): 0
Total Pages Processed (Other): 1784
Total Pages Processed (Seg)  : 0
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 101699
Total Pages Marked Corrupt   : 0
Total Pages Influx           : 0
Highest block SCN            : 1775671440 (0.1775671440)

SQL> ANALYZE TABLE sys.SEG$ VALIDATE STRUCTURE CASCADE;
ANALYZE TABLE sys.SEG$ VALIDATE STRUCTURE CASCADE
*
ERROR at line 1:
ORA-01498: block check failure - see trace file

通过这里的分析大概可以确定是由于两块KTB-managed data block数据块异常,导致直接对seg$进行TABLE ACCESS FULL操作的时候发生异常.因为这个库已经破坏了数据一致性,先导出来数据,至于出现该错误的原因,后续继续关注分析

发表在 Oracle备份恢复 | 标签为 , | 一条评论