惜分飞

Oracle断电故障处理

发表于 2022 年 5 月 31 日由惜分飞

异常断电导致数据库异常恢复文件报ORA-00283 ORA-00742 ORA-00312

 
D:\check_db>sqlplus / as sysdba

SQL*Plus: Release 11.2.0.4.0 Production on 星期二 5月 31 00:38:42 2022

Copyright (c) 1982, 2013, Oracle.  All rights reserved.


连接到:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> recover datafile 1;
ORA-00283: 恢复会话因错误而取消
ORA-00742: 日志读取在线程 %d 序列 %d 块 %d 中检测到写入丢失情况
ORA-00312: 联机日志 3 线程 1:
'D:\APP\ADMINISTRATOR\FAST_RECOVERY_AREA\ORCL\ONLINELOG\O1_MF_3_HJ32KJD5_.LOG'

这个错误比较明显是由于异常断电引起的写丢失导致.而且这种故障在没有备份的情况下，没有什么好处理方法,只能屏蔽一致性强制拉库,尝试强制拉库报错如下

SQL> startup mount pfile='d:/pfile.txt'
ORACLE 例程已经启动。

Total System Global Area 2.0310E+10 bytes
Fixed Size                  2290000 bytes
Variable Size            3690991280 bytes
Database Buffers         1.6576E+10 bytes
Redo Buffers               40837120 bytes
数据库装载完毕。
SQL> recover database until cancel;
ORA-00279: 更改 18755939194213 (在  生成) 对于线程 1 是必需的


指定日志: {<RET>=suggested | filename | AUTO | CANCEL}
D:\APP\ADMINISTRATOR\FAST_RECOVERY_AREA\ORCL\ONLINELOG\O1_MF_3_HJ32KJD5_.LOG
ORA-00600: internal error code, arguments: [3020], [2], [78824], [8467432], [],
[], [], [], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 2, block# 78824, file
offset is 645726208 bytes)
ORA-10564: tablespace SYSAUX
ORA-01110: data file 2: 'D:\ORADATA\ORCL\SYSAUX01.DBF'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 80834


ORA-01112: 未启动介质恢复


SQL> alter database open resetlogs;
alter database open resetlogs
*
第 1 行出现错误:
ORA-00600: 内部错误代码, 参数: [krsi_al_hdr_update.15], [4294967295], [], [],[], [], [], [], [], [], [], []

ORA-600 krsi_al_hdr_update.15错误,主要是由于redo异常导致无法resetlogs成功,具体参考：Alter Database Open Resetlogs returns error ORA-00600: [krsi_al_hdr_update.15], (Doc ID 2026541.1)描述,处理这个问题之后,再次resetlogs库，报ORA-600 2662错误

SQL> alter database open resetlogs;
alter database open resetlogs
*
第 1 行出现错误:
ORA-00603: ORACLE server session terminated by fatal error
ORA-00600: internal error code, arguments: [2662], [4366], [4112122046],
[4366], [4112228996], [12583040], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [2662], [4366], [4112122045],
[4366], [4112228996], [12583040], [], [], [], [], [], []
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-00600: internal error code, arguments: [2662], [4366], [4112122040],
[4366], [4112228996], [12583040], [], [], [], [], [], []
进程 ID: 4644
会话 ID: 1701 序列号: 3

这个问题比较简单,通过修改scn即可绕过去,之后数据库open报ORA-600 4194等错误

SQL> alter database open ;
alter database open 
*
第 1 行出现错误:
ORA-00600: 内部错误代码, 参数: [4194], [

SMON: enabling tx recovery
Database Characterset is ZHS16GBK
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl\trace\orcl_smon_5112.trc  (incident=322982):
ORA-00600: internal error code, arguments: [4137], [10.33.3070116], [0], [0], [], [], [], [], [], [], [], []
Incident details in: D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl\incident\incdir_322982\orcl_smon_5112_i322982.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
ARC3: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
replication_dependency_tracking turned off (no async multimaster replication found)
LOGSTDBY: Validating controlfile with logical metadata
LOGSTDBY: Validation complete
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl\trace\orcl_ora_3340.trc  (incident=323030):
ORA-00600: 内部错误代码, 参数: [4194], [
Incident details in: D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl\incident\incdir_323030\orcl_ora_3340_i323030.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Tue May 31 09:05:04 2022
Sweep [inc][322982]: completed
ORACLE Instance orcl (pid = 13) - Error 600 encountered while recovering transaction (10, 33).
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl\trace\orcl_smon_5112.trc:
ORA-00600: internal error code, arguments: [4137], [10.33.3070116], [0], [0], [], [], [], [], [], [], [], []
Checker run found 1 new persistent data failures
Tue May 31 09:05:05 2022
Sweep [inc][323030]: completed
Sweep [inc2][322982]: completed
Tue May 31 09:05:14 2022
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl\trace\orcl_smon_5112.trc  (incident=322983):
ORA-00600: internal error code, arguments: [4193], [10.33.3070116], [0], [], [], [], [], [], [], [], [], []
Incident details in: D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl\incident\incdir_322983\orcl_smon_5112_i322983.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Tue May 31 09:05:14 2022
ORA-600 signalled during: alter database open...
Block recovery stopped at EOT rba 2.61.16
Block recovery completed at rba 2.61.16, scn 4366.4112429058
Block recovery from logseq 2, block 60 to scn 18755939643393
Recovery of Online Redo Log: Thread 1 Group 2 Seq 2 Reading mem 0
  Mem# 0: D:\APP\ADMINISTRATOR\FAST_RECOVERY_AREA\ORCL\ONLINELOG\O1_MF_2_K9BSVC11_.LOG
Block recovery completed at rba 2.61.16, scn 4366.4112429058
Dumping diagnostic data in directory=[cdmp_2022053],requested by(instance=1,osid=5112(SMON)),summary=[incident=322983].
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl\trace\orcl_smon_5112.trc:
ORA-01595: error freeing extent (3) of rollback segment (1))
ORA-00600: internal error code, arguments: [4193], [10.33.3070116], [3], [], [], [], [], [], [], [], [], []

对异常undo进行处理,数据库正常open成功

SQL> shutdown immediate;
ORA-00600: 内部错误代码, 参数: [4193], [


SQL> shutdown abort;
ORACLE 例程已经关闭。
SQL> startup mount
ORACLE 例程已经启动。

Total System Global Area 2.0310E+10 bytes
Fixed Size                  2290000 bytes
Variable Size            3690991280 bytes
Database Buffers         1.6576E+10 bytes
Redo Buffers               40837120 bytes
数据库装载完毕。
SQL> alter database open;

数据库已更改。

hcheck检测有一些字典不一致,建议客户逻辑导出,然后导入到新库中

HCheck Version 07MAY18 on 31-5月 -2022 09:12:22
----------------------------------------------
Catalog Version 11.2.0.4.0 (1102000400)
db_name: ORCL

                                   Catalog       Fixed
Procedure Name                     Version    Vs Release    Timestamp      Resul
t
------------------------------ ... ---------- -- ---------- -------------- -----
-
.- LobNotInObj                 ... 1102000400 <=  *All Rel* 05/31 09:12:22 PASS
.- MissingOIDOnObjCol          ... 1102000400 <=  *All Rel* 05/31 09:12:22 PASS
.- SourceNotInObj              ... 1102000400 <=  *All Rel* 05/31 09:12:22 PASS
.- OversizedFiles              ... 1102000400 <=  *All Rel* 05/31 09:12:22 PASS
.- PoorDefaultStorage          ... 1102000400 <=  *All Rel* 05/31 09:12:22 PASS
.- PoorStorage                 ... 1102000400 <=  *All Rel* 05/31 09:12:22 PASS
.- TabPartCountMismatch        ... 1102000400 <=  *All Rel* 05/31 09:12:22 PASS
.- OrphanedTabComPart          ... 1102000400 <=  *All Rel* 05/31 09:12:22 PASS
.- MissingSum$                 ... 1102000400 <=  *All Rel* 05/31 09:12:22 PASS
.- MissingDir$                 ... 1102000400 <=  *All Rel* 05/31 09:12:22 PASS
.- DuplicateDataobj            ... 1102000400 <=  *All Rel* 05/31 09:12:22 PASS
.- ObjSynMissing               ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- ObjSeqMissing               ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- OrphanedUndo                ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- OrphanedIndex               ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- OrphanedIndexPartition      ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- OrphanedIndexSubPartition   ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- OrphanedTable               ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- OrphanedTablePartition      ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- OrphanedTableSubPartition   ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- MissingPartCol              ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- OrphanedSeg$                ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- OrphanedIndPartObj#         ... 1102000400 <=  *All Rel* 05/31 09:12:23 FAIL

HCKE-0024: Orphaned Index Partition Obj# (no OBJ$) (Doc ID 1360935.1)
ORPHAN INDPART$: OBJ#=149167 BO#=6378 - no OBJ$ row
ORPHAN INDPART$: OBJ#=149168 BO#=6378 - no OBJ$ row

.- DuplicateBlockUse           ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- FetUet                      ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- Uet0Check                   ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- SeglessUET                  ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- BadInd$                     ... 1102000400 <=  *All Rel* 05/31 09:12:23 FAIL

HCKE-0030: OBJ$ INDEX entry has no IND$ or INDPART$/INDSUBPART$ entry (Doc ID 13
60528.1)
OBJ$ INDEX PARTITION has no INDPART$ entry: Obj#=148278 SYS Name=WRH$_FILESTATXS
_PK PARTITION=WRH$_FILEST_1572571104_16462
OBJ$ INDEX PARTITION has no INDPART$ entry: Obj#=148920 SYS Name=WRH$_FILESTATXS
_PK PARTITION=WRH$_FILEST_1572571104_16678

.- BadTab$                     ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- BadIcolDepCnt               ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- ObjIndDobj                  ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- TrgAfterUpgrade             ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- ObjType0                    ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- BadOwner                    ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- StmtAuditOnCommit           ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- BadPublicObjects            ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- BadSegFreelist              ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- BadDepends                  ... 1102000400 <=  *All Rel* 05/31 09:12:23 WARN

HCKW-0016: Dependency$ p_timestamp mismatch for VALID objects (Doc ID 1361045.1)

[E] - P_OBJ#=6376 D_OBJ#=6765

.- CheckDual                   ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- ObjectNames                 ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- BadCboHiLo                  ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
.- ChkIotTs                    ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
.- NoSegmentIndex              ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
.- BadNextObject               ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
.- DroppedROTS                 ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
.- FilBlkZero                  ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
.- DbmsSchemaCopy              ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
.- OrphanedObjError            ... 1102000400 >  1102000000 05/31 09:12:24 PASS
.- ObjNotLob                   ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
.- MaxControlfSeq              ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
.- SegNotInDeferredStg         ... 1102000400 >  1102000000 05/31 09:12:24 PASS
.- SystemNotRfile1             ... 1102000400 >   902000000 05/31 09:12:24 PASS
.- DictOwnNonDefaultSYSTEM     ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
.- OrphanTrigger               ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
.- ObjNotTrigger               ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
---------------------------------------
31-5月 -2022 09:12:24  Elapsed: 2 secs
---------------------------------------
Found 4 potential problem(s) and 1 warning(s)
Contact Oracle Support with the output and trace file
to check if the above needs attention or not

PL/SQL 过程已成功完成。

发表在 Oracle备份恢复 | 标签为 krsi_al_hdr_update.15, ORA-00742, ORA-600 2662, ORA-600 4194, oracle异常恢复, oracle断电恢复 | 评论关闭

PostgreSQL恢复系列:pg_control异常恢复

发表于 2022 年 5 月 30 日由惜分飞

在PG中pg_control文件类似oracle数据库的control文件(控制文件),在Oracle中如果该文件丢失/损坏,可以通过alter database create controlfile命令进行创建,对于PG数据库来说也可以通过pg_resetwal命令来实现创建,由于pg_control文件损坏,需要人工指定一些参数完成pg_resetwal相关操作
pg_resetwal 使用说明

-bash-4.2$ pg_resetwal --help
pg_resetwal resets the PostgreSQL write-ahead log.

Usage:
  pg_resetwal [OPTION]... DATADIR

Options:
  -c, --commit-timestamp-ids=XID,XID
                                   set oldest and newest transactions bearing
                                   commit timestamp (zero means no change)
 [-D, --pgdata=]DATADIR            data directory
  -e, --epoch=XIDEPOCH             set next transaction ID epoch
  -f, --force                      force update to be done
  -l, --next-wal-file=WALFILE      set minimum starting location for new WAL
  -m, --multixact-ids=MXID,MXID    set next and oldest multitransaction ID
  -n, --dry-run                    no update, just show what would be done
  -o, --next-oid=OID               set next OID
  -O, --multixact-offset=OFFSET    set next multitransaction offset
  -u, --oldest-transaction-id=XID  set oldest transaction ID
  -V, --version                    output version information, then exit
  -x, --next-transaction-id=XID    set next transaction ID
      --wal-segsize=SIZE           size of WAL segments, in megabytes
  -?, --help                       show this help, then exit

Report bugs to <pgsql-bugs@lists.postgresql.org>.
PostgreSQL home page: <https://www.postgresql.org/>

确认现在业务表记录情况

-bash-4.2$ psql
psql (14.3)
Type "help" for help.

postgres=# select count(1) from ac_event;
 count  
--------
 246266
(1 row)

模拟pg_control文件异常

-bash-4.2$ ps -ef|grep postgres
postgres  37178      1  0 09:58 ?        00:00:00 /usr/pgsql-14/bin/postgres -D /var/lib/pgsql/14/data
postgres  37179  37178  0 09:58 ?        00:00:00 postgres: logger 
postgres  37181  37178  0 09:58 ?        00:00:00 postgres: checkpointer 
postgres  37182  37178  0 09:58 ?        00:00:00 postgres: background writer 
postgres  37183  37178  0 09:58 ?        00:00:00 postgres: walwriter 
postgres  37184  37178  0 09:58 ?        00:00:00 postgres: autovacuum launcher 
postgres  37185  37178  0 09:58 ?        00:00:00 postgres: stats collector 
postgres  37186  37178  0 09:58 ?        00:00:00 postgres: logical replication launcher 
root      41368  41314  0 11:06 pts/1    00:00:00 su - postgres
postgres  41369  41368  0 11:06 pts/1    00:00:00 -bash
postgres  45071  41369  0 12:07 pts/1    00:00:00 ps -ef
postgres  45072  41369  0 12:07 pts/1    00:00:00 grep --color=auto postgres
-bash-4.2$ kill -9 37178
-bash-4.2$ ps -ef|grep postgres
root      41368  41314  0 11:06 pts/1    00:00:00 su - postgres
postgres  41369  41368  0 11:06 pts/1    00:00:00 -bash
postgres  45095  41369  0 12:08 pts/1    00:00:00 ps -ef
postgres  45096  41369  0 12:08 pts/1    00:00:00 grep --color=auto postgres
-bash-4.2$ pwd
/var/lib/pgsql/14/data/global
-bash-4.2$ ls -l pg_control 
-rw-------. 1 postgres postgres 8192 May 30 12:04 pg_control
-bash-4.2$ rm -rf pg_control 
-bash-4.2$ ls -l pg_control 
ls: cannot access pg_control: No such file or directory

PG启动失败

-bash-4.2$ pg_ctl start 
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....postgres: could not find the database system
Expected to find it in the directory "/var/lib/pgsql/14/data",
but could not open file "/var/lib/pgsql/14/data/global/pg_control": No such file or directory
 stopped waiting
pg_ctl: could not start server
Examine the log output.

创建空pg_control文件启动依旧失败

-bash-4.2$ touch /var/lib/pgsql/14/data/global/pg_control
-bash-4.2$ pg_ctl start 
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2022-05-30 12:09:43.953 CST [45215] PANIC:  
   could not read file "global/pg_control": read 0 of 296
 stopped waiting
pg_ctl: could not start server
Examine the log output.

设置next-wal-file
-l, –next-wal-file=WALFILE,这个参数设置下一个新的WAL文件的最小值，这个值可以从$PGDATA/pg_wal目录下去看最后一个WAL 文件，这个文件的id+1即可

-bash-4.2$ pwd
/var/lib/pgsql/14/data/pg_wal
-bash-4.2$ ls -l
total 16384
-rw-------. 1 postgres postgres 16777216 May 30 12:04 000000010000000000000014
drwx------. 2 postgres postgres        6 May 24 02:20 archive_status
-bash-4.2$

这个文件+1，-l 000000010000000000000015
设置next-transaction
-x, –next-transaction-id=XID,这个参数设置pg_control中的下一个XID的值，这个值可以从pg_xact目录下的文件中查询

-bash-4.2$ pwd
/var/lib/pgsql/14/data/pg_xact
-bash-4.2$ ls -ltr
total 8
-rw-------. 1 postgres postgres 8192 May 30 12:03 0000

最后一个是0000，那么下一个XID就是0001，然后乘以 1048576 (0×100000)，实际上后面直接加5个0就行了。注意，这个值是16进制的。-x 0×000100000
multixact-ids设置
-m, –multixact-ids=MXID1,MXID2，这个参数包含两个部分，MXID1和MXID2，都可以从$PGDATA/pg_multixact/offsets目录下获得。MXID1的值，首先找到最大值，+1，再乘以 65536 (0×10000，相当于后面加4个0)作为这个参数的前半部分。找到最小的值，后面加4个0，作为MXID2的值

-bash-4.2$ pwd
/var/lib/pgsql/14/data/pg_multixact/offsets
-bash-4.2$ ls -ltr
total 8
-rw-------. 1 postgres postgres 8192 May 29 22:06 0000
-bash-4.2$

-m 0×00010000, 0×00000000（由于oldest multitransaction ID不能为0,因此后续这个值需要适当调整）
multixact-offset设置
-O, –multixact-offset=OFFSET,这个参数可以从$PGDATA/pg_multixact/members目录下获得。找到最大值，+1，乘以 52352 (0xCC80)

-bash-4.2$ pwd
/var/lib/pgsql/14/data/pg_multixact/members
-bash-4.2$ ls -ltr
total 8
-rw-------. 1 postgres postgres 8192 May 24 02:20 0000

-O 0xCC80
尝试执行pg_resetwal

-bash-4.2$ pg_resetwal -l 000000010000000000000015 -x 0x000100000 -m 0x00010000,0x00000000 -O 0xCC80 $PGDATA
pg_resetwal: error: oldest multitransaction ID (-m) must not be 0

multixact-ids值不对,进行调整后处理
postmaster.pid文件需要清理
由于PG库异常关闭,需要人工清理掉该文件

-bash-4.2$ pg_resetwal -l 000000010000000000000015 -x 0x000100000 -m 0x00020000,0x00010000 -O 0xCC80 $PGDATA
pg_resetwal: error: lock file "postmaster.pid" exists
-bash-4.2$ rm -rf postmaster.pid

pg_resetwal结果预览

-bash-4.2$ pg_resetwal -l 000000010000000000000015 -x 0x000100000 -m 0x00020000,0x00010000 -O 0xCC80 $PGDATA
pg_resetwal: warning: pg_control exists but is broken or wrong version; ignoring it
Guessed pg_control values:

pg_control version number:            1300
Catalog version number:               202107181
Database system identifier:           7103392535324046312
Latest checkpoint's TimeLineID:       1
Latest checkpoint's full_page_writes: off
Latest checkpoint's NextXID:          0:3
Latest checkpoint's NextOID:          12000
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        3
Latest checkpoint's oldestXID's DB:   0
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 0
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float8 argument passing:              by value
Data page checksum version:           0


Values to be changed:

First log segment after reset:        000000010000000000000015
NextMultiXactId:                      131072
OldestMultiXid:                       65536
OldestMulti's DB:                     0
NextMultiOffset:                      52352
NextXID:                              1048576
OldestXID:                            3
OldestXID's DB:                       0

If these values seem acceptable, use -f to force reset.

pg_resetwal进行创建pg_control并启动PG

-bash-4.2$ pg_resetwal -l 000000010000000000000015 -x 0x000100000 -m 0x00020000,0x00010000 -O 0xCC80 -f $PGDATA
pg_resetwal: warning: pg_control exists but is broken or wrong version; ignoring it
Write-ahead log reset
-bash-4.2$ pg_ctl start
waiting for server to start....2022-05-30 13:33:28.266 CST [51437] LOG:  
redirecting log output to logging collector process
2022-05-30 13:33:28.266 CST [51437] HINT:  Future log output will appear in directory "log".
 done
server started

验证数据

-bash-4.2$ psql
psql (14.3)
Type "help" for help.

postgres=#  select count(1) from ac_event;
 count  
--------
 245275
(1 row)

这种方法恢复之后,建议理解dump数据,然后导入到新库中

发表在 PostgreSQL恢复 | 标签为 pg_control丢失恢复, pg_control异常恢复, pg_control强制启动, pg_resetwal使用 | 评论关闭

PostgreSQL恢复系列:wal日志丢失恢复

发表于 2022 年 5 月 29 日由惜分飞

WAL是Write Ahead Log的简写，和oracle的redo日志类似，存放在$PGDATA/pg_xlog中，10版本以后在$PGDATA/pg_wal目录.在oracle数据库中,如果redo丢失,分为active/current和inactive的redo,分别有不同的处理方式,对于oracle需要实例恢复的redo丢失，需要屏蔽数据库一致性，强制打开数据库,对于PG数据库这部分日志丢失该如何恢复,主要是通过pg_resetwal/pg_resetxlog（10以前版本）命令来实现，这里通过一个测试来验证
创建测试表并强制kill数据库

-bash-4.2$ psql
psql (14.3)
Type "help" for help.

postgres=# create table t_xifenfei as select * from pg_database;
SELECT 4
postgres=# select count(1) from t_xifenfei;
 count 
-------
     4
(1 row)

postgres=# \q
-bash-4.2$ ps -ef|grep post
root       1819      1  0 May28 ?        00:00:00 /usr/libexec/postfix/master -w
postfix    1838   1819  0 May28 ?        00:00:00 qmgr -l -t unix -u
postgres  11102      1  0 05:49 ?        00:00:00 /usr/pgsql-14/bin/postgres -D /var/lib/pgsql/14/data
postgres  11103  11102  0 05:49 ?        00:00:00 postgres: logger 
postgres  11105  11102  0 05:49 ?        00:00:00 postgres: checkpointer 
postgres  11106  11102  0 05:49 ?        00:00:00 postgres: background writer 
postgres  11107  11102  0 05:49 ?        00:00:00 postgres: walwriter 
postgres  11108  11102  0 05:49 ?        00:00:00 postgres: autovacuum launcher 
postgres  11109  11102  0 05:49 ?        00:00:01 postgres: stats collector 
postgres  11110  11102  0 05:49 ?        00:00:00 postgres: logical replication launcher 
root      22743  22300  0 18:26 pts/3    00:00:00 su - postgres
postgres  22744  22743  0 18:26 pts/3    00:00:00 -bash
postgres  22937  22744  0 18:28 pts/3    00:00:00 psql
postgres  22938  11102  0 18:28 ?        00:00:00 postgres: postgres postgres [local] idle
postfix   32623   1819  0 21:10 ?        00:00:00 pickup -l -t unix -u
root      33032  32912  0 21:15 pts/2    00:00:00 su - postgres
postgres  33033  33032  0 21:15 pts/2    00:00:00 -bash
postgres  35210  33033  0 21:51 pts/2    00:00:00 ps -ef
postgres  35211  33033  0 21:51 pts/2    00:00:00 grep --color=auto post
-bash-4.2$ kill -9 11102

删除wal日志

-bash-4.2$ pwd
/var/lib/pgsql/14/data/pg_wal
-bash-4.2$ ls -ltr
total 311296
drwx------. 2 postgres postgres        6 May 24 02:20 archive_status
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000014
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000015
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000016
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000017
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000018
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000019
-rw-------. 1 postgres postgres 16777216 May 28 21:29 00000001000000000000001A
-rw-------. 1 postgres postgres 16777216 May 28 21:29 00000001000000000000001B
-rw-------. 1 postgres postgres 16777216 May 28 21:29 00000001000000000000001C
-rw-------. 1 postgres postgres 16777216 May 28 21:29 00000001000000000000001D
-rw-------. 1 postgres postgres 16777216 May 28 21:29 00000001000000000000001E
-rw-------. 1 postgres postgres 16777216 May 28 21:29 00000001000000000000001F
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000020
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000021
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000022
-rw-------. 1 postgres postgres 16777216 May 28 21:30 000000010000000000000023
-rw-------. 1 postgres postgres 16777216 May 28 21:30 000000010000000000000024
-rw-------. 1 postgres postgres 16777216 May 28 21:30 000000010000000000000025
-rw-------. 1 postgres postgres 16777216 May 29 21:51 000000010000000000000013
-bash-4.2$ rm -rf 0000000100000000000000*
-bash-4.2$ ls
archive_status

查询当时数据库需要的最小wal记录

-bash-4.2$ pg_controldata 
pg_control version number:            1300
Catalog version number:               202107181
Database system identifier:           7100998319216817119
Database cluster state:               in production
pg_control last modified:             Sat 28 May 2022 09:36:11 PM CST
Latest checkpoint location:           0/13692F80
Latest checkpoint's REDO location:    0/13692F48
Latest checkpoint's REDO WAL file:    000000010000000000000013   <===需要的记录
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0:17824
Latest checkpoint's NextOID:          32769
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        727
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  17824
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint:            Sat 28 May 2022 09:31:41 PM CST

尝试启动PG

-bash-4.2$ pg_ctl start
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2022-05-29 21:52:22.926 CST [35270] LOG:  
redirecting log output to logging collector process
2022-05-29 21:52:22.926 CST [35270] HINT:  Future log output will appear in directory "log".
. stopped waiting
pg_ctl: could not start server
Examine the log output.

启动pg失败,查看日志记录

2022-05-29 21:52:22.926 CST [35270] LOG:  starting PostgreSQL 14.3 on x86_64-pc-linux-gnu, 
            compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
2022-05-29 21:52:22.927 CST [35270] LOG:  listening on IPv6 address "::1", port 5432
2022-05-29 21:52:22.927 CST [35270] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2022-05-29 21:52:22.929 CST [35270] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-05-29 21:52:22.931 CST [35270] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2022-05-29 21:52:22.936 CST [35272] LOG:  database system was interrupted; last known up at 2022-05-28 21:36:11 CST
2022-05-29 21:52:23.049 CST [35272] LOG:  invalid primary checkpoint record
2022-05-29 21:52:23.049 CST [35272] PANIC:  could not locate a valid checkpoint record
2022-05-29 21:52:24.211 CST [35270] LOG:  startup process (PID 35272) was terminated by signal 6: Aborted
2022-05-29 21:52:24.211 CST [35270] LOG:  aborting startup due to startup process failure
2022-05-29 21:52:24.218 CST [35270] LOG:  database system is shut down

错误比较明显,无法定位到有效的checkpoint记录,在oracle里面的意思可以理解为无法进行实例恢复,pg启动失败

重设wal
由于数据库为不一致状态,需要使用-f进行强制重设

-bash-4.2$ pg_resetwal $PGDATA
The database server was not shut down cleanly.
Resetting the write-ahead log might cause data to be lost.
If you want to proceed anyway, use -f to force reset.
-bash-4.2$ pg_resetwal -f $PGDATA
Write-ahead log reset

启动PG成功

-bash-4.2$ pg_ctl start -D $PGDATA
waiting for server to start....2022-05-29 22:01:02.647 CST [37178] LOG:  
redirecting log output to logging collector process
2022-05-29 22:01:02.647 CST [37178] HINT:  Future log output will appear in directory "log".
 done
server started

日志记录

2022-05-29 22:01:02.647 CST [37178] LOG:  starting PostgreSQL 14.3 on x86_64-pc-linux-gnu, 
                    compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
2022-05-29 22:01:02.648 CST [37178] LOG:  listening on IPv6 address "::1", port 5432
2022-05-29 22:01:02.648 CST [37178] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2022-05-29 22:01:02.649 CST [37178] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-05-29 22:01:02.651 CST [37178] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2022-05-29 22:01:02.653 CST [37180] LOG:  database system was shut down at 2022-05-29 22:00:47 CST
2022-05-29 22:01:02.661 CST [37178] LOG:  database system is ready to accept connections

查看wal日志,产生新记录

-bash-4.2$ pwd
/var/lib/pgsql/14/data/pg_wal
-bash-4.2$ ls -ltr
total 16384
drwx------. 2 postgres postgres        6 May 24 02:20 archive_status
-rw-------. 1 postgres postgres 16777216 May 29 22:01 000000010000000000000014

验证刚刚创建测试表

-bash-4.2$ psql
psql (14.3)
Type "help" for help.

postgres=# select count(1) from t_xifenfei;
ERROR:  relation "t_xifenfei" does not exist
LINE 1: select count(1) from t_xifenfei;
                             ^

由于需要进行实例恢复的wal日志丢失导致这表记录也丢失.由此可见这类操作可能导致数据丢失风险,对于生产环境,需要慎重,

发表在 PostgreSQL, PostgreSQL恢复 | 标签为 PG wal异常恢复, pg_ctl: could not start server, pg异常恢复 | 评论关闭

17813235971

QQ 咨询

Oracle断电故障处理

PostgreSQL恢复系列:pg_control异常恢复

PostgreSQL恢复系列:wal日志丢失恢复

文章分类

最近发表

Oracle断电故障处理

PostgreSQL恢复系列:pg_control异常恢复

PostgreSQL恢复系列:wal日志丢失恢复

标签云

文章分类

最近发表