TIP #93: GoldenGate lag

I have a client running Oracle databases with bi-directional GoldenGate replication in their production and UAT environments. I noticed that the databases were out of sync. Checking GoldenGate lag with "info all" did not show any major lag, and nothing was reported in the GoldenGate log or the replication discard file. All processes were up and running on the source and target boxes, as the following snapshots show. (On node A, EXT01SA is the primary extract, PPSA01SB is the pump that pushes changes to node B, and RESB01SA is the replicat that receives changes from node B. The same set of processes exists on node B.)

------- Node A --------
GGSCI (########) 1> info all

Program     Status      Group       Lag         Time Since Chkpt

MANAGER     RUNNING
EXTRACT     RUNNING     EXT01SA     00:00:00    00:00:10
EXTRACT     RUNNING     PPSA01SB    00:00:00    00:00:01
REPLICAT    RUNNING     RESB01SA    00:00:00    00:00:02

------ Node B ---------

Program     Status      Group       Lag         Time Since Chkpt

MANAGER     RUNNING
EXTRACT     RUNNING     EXT01SB     00:00:00    00:00:03
EXTRACT     RUNNING     PPSB01SA    00:00:00    00:00:02
REPLICAT    RUNNING     RESA01SB    00:00:00    00:00:01


Checking the GG processes further, I found that the pump is behind the primary extract, even though "info all" does not report it. In more detail, the output below shows that the primary extract is writing to sequence 349 while the pump is still reading sequence 340! When there is no lag, the pump should be reading almost the same sequence that the primary extract is writing.

GGSCI > send extract EXT01SA, status
Sending STATUS request to EXTRACT EXT01SA ...

EXTRACT EXT01SA (PID 21599)
Current status: Recovery complete: At EOF

Current read position:
  Sequence #: 5297
  RBA: 17509888
  Timestamp: ############

Current write position:
  Sequence #: 349
  RBA: 921878
  Timestamp: ##########
  Extract Trail: ############

GGSCI > send extract PPSA01SB status
Sending STATUS request to EXTRACT PPSA01SB ...

EXTRACT PPSA01SB (PID 22968)
Current status: Recovery complete: Processing data

Current read position:
  Sequence #: 340
  RBA: 472796720
  Timestamp: #########
  Extract Trail: ############

Current write position:
  Sequence #: 737
  RBA: 2248444
  Timestamp: ###########
  Extract Trail: ##########



So now we know why the databases are out of sync, but what is the cause? Checking with logdump, the position where the pump stopped is the end of file of trail 340. (Interesting!)
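As an illustration, verifying that with logdump looks roughly like this (the trail file name is hypothetical; the RBA is the pump's read position from the status output above):

logdump
Logdump 1 > open ./dirdat/sa000340
Logdump 2 > ghdr on
Logdump 3 > pos 472796720
Logdump 4 > next

If logdump reports end of file at this position, the pump has consumed everything there is in trail 340.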
Further checking in the GG log showed that the primary extract trail had been rolled over manually with "alter extract EXT01SA, etrollover". In GoldenGate, when a trail file is rolled over manually, the pump and replicat do not pick up the new trail file on their own: they stop at the end of the old one and have to be repositioned (rolled over) as well. Interestingly enough, "info all" does not show any lag, but the moment the pump or replicat is repositioned, the command starts showing the real lag.
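For reference, the repositioning usually looks roughly like the following sketch (group names are the ones from this post; the sequence numbers are illustrative and must be taken from the actual "send ... status" output on your system):

GGSCI > stop extract PPSA01SB
GGSCI > alter extract PPSA01SB, extseqno 349, extrba 0
GGSCI > alter extract PPSA01SB, etrollover
GGSCI > start extract PPSA01SB

GGSCI > stop replicat RESA01SB
GGSCI > alter replicat RESA01SB, extseqno 738, extrba 0
GGSCI > start replicat RESA01SB

The first block, run on node A, points the pump at the new local trail and rolls the pump's own output trail; the second block, run on node B, points the replicat at the new remote trail written by the pump.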

Lesson : "info all" command in GoldenGate does not cover all circumstance for reporting lag. The best way is to find out read and write sequence of each GG process and determine the lag.

3 comments:

Vladimir Grigorian said...

I have a question about Flashback Data Archive tables in GoldenGate replication. Are these tables usually excluded in an extract, like "TABLEEXCLUDE .SYS_FBA_*"?
What is the approach for replication and initial load of Flashback Data Archive tables?
Here is my problem. I exported a user schema using Oracle Data Pump and imported it into the destination database. GoldenGate abended, saying some tables don't exist on the destination. I checked, and there were about 200 tables that were not exported because they are FBDA tables and Data Pump just ignores them. So I recreated them on source with scripts.
So, how is this done? Source and destination have their own FBDA, so these tables should be neither recreated with scripts on the destination nor replicated? Or do they have to be replicated along with their contents?

Shervin said...

Hi Vladimir,

Flashback Data Archive tables need to be excluded; you can do it either in the data pump extract or in the replicat, using TABLEEXCLUDE.
However, to make the destination DB the same as the source DB, you need to enable Flashback Data Archive for those tables on the destination DB.
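For illustration only, such an exclusion might look like this in a parameter file (the schema name and wildcard pattern are hypothetical and would need to be adapted):

TABLEEXCLUDE *.SYS_FBA_*
TABLE appuser.*;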

Hope this helps.

projob66 said...

The "info all" and even "send ... status" commands will screw you eventually in GG. You have to roll your own monitor that keeps track of the explicit RBA on each synchronous check; when the RBA stays static for too long, it alarms with whatever flavor of alert/alarm mechanism you have. I use Unix email kicked off by Perl.