TIP 107# : Virtualization on Oracle Database Appliance (ODA)


 As you know, earlier this year Oracle started supporting virtualization on the Oracle Database Appliance (ODA), which enhances the appliance's functionality. One of the main uses of this virtualization is to run WebLogic on the ODA alongside the database.
I had a chance to test various WebLogic offerings on the ODA, and I thought I would share my findings with readers; hopefully it helps with planning and architecting environments that use the ODA and virtualization. The two ODA software versions that run on ODA V1 and ODA X3-2 and support virtualization are 2.5 and 2.6. The following are my findings on WebLogic installation for each version:
ODA 2.5
  • The Oracle WebLogic template for the ODA is available on E-Delivery for WebLogic 10.3.6 and 12c.
  • The WebLogic template includes the WebLogic configurator, a new tool (specific to the ODA) that automates WebLogic installation on the ODA, similar to what OAM does for the database.
  • The WebLogic configurator installs WebLogic only in cluster mode; a non-clustered WebLogic installation is not supported with the configurator.
  • The WebLogic configurator installs OTD (Oracle Traffic Director), and as of now there is no way to skip it during installation. However, after installation it can be shut down if it is not needed.
  • The minimum installation with the WebLogic configurator includes two WebLogic cluster nodes and creates 6 VMs in total: 1 VM for the WebLogic admin server, 2 VMs (one per compute node) for the WebLogic managed servers, 1 VM for the OTD admin server, and 2 VMs (one per compute node) for OTD.
  • The WebLogic configurator also creates a database for internal use on ODA_BASE.
  • Due to a bug, the WebLogic configurator installs all components as root.
  • Since 6 VMs is the minimum possible configuration and each VM needs at least 2 vCPUs, a minimum installation consumes 12 vCPUs (6 cores) in total.
ODA 2.6
  • No WebLogic template has been released on the ODA for 2.6.
  • No WebLogic configurator for automating WebLogic installation has been released.
  • As of now, creating a Red Hat Linux VM and installing WebLogic manually is the only approach in this release.
Bottom line: 

Although it is recommended to use the automation on the ODA, so that the appliance stays on the Oracle standard template and future changes such as patching and upgrades have less impact, the WebLogic configurator does not look mature enough to rely on yet. First, to use it you have to stay on 2.5 (not the latest ODA release); second, the configurator in 2.5 is buggy (it installs WebLogic as root), has the shortcomings listed above, and is not very flexible. I recommend waiting (if you can), since an updated WebLogic configurator should be out soon, or using the manual process, with the risk that in the long run you may not be able to use the patch bundle (if Oracle includes WebLogic in the ODA bundle) or any further automation Oracle may plan for WebLogic. Also consider that installing OTD adds complexity to WebLogic licensing on the ODA.

In the end, I found the FAQ PDF on the OTN website helpful:

http://www.oracle.com/technetwork/middleware/weblogic-oda/overview/faq-weblogicserver-on-oda-1927929.pdf

If you have any questions or you want to see more on this topic, please drop me a line.

TIP 106# : Quick tips on Cloning Oracle

Oracle cloning is one of the quickest ways to create new software binaries with all the patches from a source home, and it comes in handy especially when the Oracle software installation media is not available. In this post, I am collecting some tips about Oracle cloning that are worth knowing, as a reminder for myself and for readers:

  • Cloning is not cross-platform.
  • A clone is usually done in two phases: prepare_clone.pl on the source and clone.pl on the target.
  • Clone takes care of the inventory and relinking, so further patches can be applied to the newly cloned home.
  • For Oracle Database and CRS homes, prepare_clone.pl is not required.
  • ORACLE_HOME and ORACLE_HOME_NAME must be specified when clone.pl is run; in 11g, ORACLE_BASE must be specified as well.
  • If the Oracle inventory is not in the default location, edit cs.properties before running clone.pl and add -ignoreSysPrereqs -invPtrLoc <oraInst_path>/oraInst.loc.
  • Make sure to run root.sh after clone.pl completes.
  • You may need to create the oraInst.loc file manually on the target if it does not exist; to do so, just create the file in the default location containing inventory_loc=<oraInventory_path>.
  • An alternative to running clone.pl is to use runInstaller for cloning; the format is: ./runInstaller -clone -silent -ignorePreReq ORACLE_HOME="" ORACLE_HOME_NAME="" ORACLE_BASE="" (on Windows it is setup.exe).
  • If the server has more than one Perl installed, it is recommended to set PERL5LIB.
  • If the clone is done on a target server that already has an inventory and the cloned ORACLE_HOME has been used at some point, the ORACLE_HOME first needs to be detached with: ./runInstaller -detachHome ORACLE_HOME=<home_path>. This is why it is recommended to clone into a brand-new directory, to avoid any issues.
  • The clone log on the target is created in $ORACLE_HOME/clone/logs.
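Putting the bullet points together, the runInstaller-based clone command can be assembled like this. This is a minimal sketch; the paths and the home name (`DbHomeClone`) are placeholders for your own environment, and the command is printed rather than executed here:

```shell
# build_clone_cmd assembles the runInstaller clone invocation described above.
# Arguments: new ORACLE_HOME path, ORACLE_HOME_NAME, ORACLE_BASE (all illustrative).
build_clone_cmd() {
  new_home=$1; home_name=$2; oracle_base=$3
  printf '%s -clone -silent -ignorePreReq ORACLE_HOME="%s" ORACLE_HOME_NAME="%s" ORACLE_BASE="%s"\n' \
    "$new_home/oui/bin/runInstaller" "$new_home" "$home_name" "$oracle_base"
}

# Example: clone into a brand-new directory (recommended above).
build_clone_cmd /u01/app/oracle/product/11.2.0/dbhome_clone DbHomeClone /u01/app/oracle
```

After running the printed command, remember to finish with root.sh as noted above.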
Enjoy cloning!

TIP 105# : ACFS and bugs



In Oracle Grid Infrastructure 11g Release 2, Oracle extends the capability of ASM by introducing the ASM
Cluster File System, or ACFS. In other words, Oracle ASM 11g Release 2 can manage all data: Oracle database files, Oracle Clusterware files, and non-structured, general-purpose data such as log files and text files. This sounds amazing; however, in my recent experience with ACFS on 11.2.0.3.4 on Linux 64-bit, I ran into some pretty nasty bugs that I think are worth sharing.
Bug                                         Workaround                                  Bug No.
----------------------------------------------------------------------------------------------
Copy (cp) performance on ACFS               Do not use ls while cp is running,          12626187
                                            or wait for 11.2.0.4
'ls -l' and 'find' commands slow on ACFS    No workaround; wait for 11.2.0.4            10418517

TIP 104#: Node Eviction in RAC 11gR2 due to temporary network hiccups on heartbeat communication

Over the next couple of months, I will examine different eviction scenarios in RAC 11gR2, and when I see some strange behavior I will do my best to report it on this blog. So, RAC lovers, please stay tuned. Here is scenario number 1:

As you know, in 11gR2 Oracle uses the UDP protocol for heartbeats between the nodes.
In this post, I present a node eviction scenario in which UDP communication is blocked between the nodes; you will see that, depending on where and how UDP is blocked, different situations can occur.
The test was done on a two-node RAC on version 11.2.0.3 PSU3 on Linux.

Scenario 1: When UDP communication is blocked on the second node


In this scenario, outgoing UDP traffic for the ocssd process on node2 is blocked.
To do so, we find the UDP port on which ocssd is listening and then disable any outgoing traffic on it:

netstat -a --inet |grep -i udp | grep -i racnode2

udp        0      0 racnode2-priv:14081         *:*                                     
udp        0      0 racnode2-priv:52358         *:*                                     
udp        0      0 racnode2-priv:52242         *:*                                     
udp        0      0 racnode2-priv:42517         *:*  --> ocssd
udp        0      0 racnode2-priv:31126         *:*                                     
udp        0      0 racnode2-priv:60741         *:*  

[root@racnode2 ~]# lsof -i :42517
COMMAND    PID USER   FD   TYPE DEVICE SIZE NODE NAME
ocssd.bin 3005 grid   55u  IPv4  22340       UDP racnode2-priv:42517 


Port 42517 is the port on which ocssd on racnode2 sends its heartbeat.
To break heartbeat communication between node2 and node1, all outgoing traffic on racnode2 for the ocssd process (port 42517) is blocked with this command:

iptables -A OUTPUT -s 192.168.2.152 -p udp --sport 42517 -j DROP
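The port lookup above can be scripted. A small sketch that pulls the ocssd UDP port out of lsof output; a captured sample line is embedded here so the pipeline is self-contained, but on a live node you would pipe `lsof -nP -i UDP` into it instead:

```shell
# Extract the heartbeat UDP port from an lsof line for ocssd.bin.
# The sample line below is the lsof output captured earlier in this post.
lsof_sample='ocssd.bin 3005 grid   55u  IPv4  22340       UDP racnode2-priv:42517'
port=$(printf '%s\n' "$lsof_sample" | awk '/ocssd\.bin/ {n=split($NF,a,":"); print a[n]}')
echo "ocssd heartbeat port: $port"
# With the port in hand, the blocking rule becomes (run as root):
# iptables -A OUTPUT -s 192.168.2.152 -p udp --sport "$port" -j DROP
```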

The alert log on node1 shows that node2 is evicted (rebootless eviction), then reconfigured, and finally rejoins the cluster:

[cssd(3015)]CRS-1612:Network communication with node racnode2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.840 seconds
2012-12-31 10:50:54.213
[cssd(3015)]CRS-1611:Network communication with node racnode2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 6.820 seconds
2012-12-31 10:50:58.232
[cssd(3015)]CRS-1610:Network communication with node racnode2 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.810 seconds
2012-12-31 10:51:01.056
[cssd(3015)]CRS-1607:Node racnode2 is being evicted in cluster incarnation 249572820; details at (:CSSNM00007:) in /u01/app/11.2.0/grid/log/racnode1/cssd/ocssd.log.
2012-12-31 10:51:02.584
[cssd(3015)]CRS-1625:Node racnode2, number 2, was manually shut down
2012-12-31 10:51:02.590
[cssd(3015)]CRS-1601:CSSD Reconfiguration complete. Active nodes are racnode1 .
2012-12-31 10:51:02.630
[crsd(3393)]CRS-5504:Node down event reported for node 'racnode2'.
2012-12-31 10:51:05.827
[crsd(3393)]CRS-2773:Server 'racnode2' has been removed from pool 'Generic'.
2012-12-31 10:51:05.829
[crsd(3393)]CRS-2773:Server 'racnode2' has been removed from pool 'ora.orcl'.
2012-12-31 10:51:37.987
[cssd(3015)]CRS-1601:CSSD Reconfiguration complete. Active nodes are racnode1 racnode2 .
2012-12-31 10:52:13.720
[crsd(3393)]CRS-2772:Server 'racnode2' has been assigned to pool 'Generic'.
2012-12-31 10:52:13.720
[crsd(3393)]CRS-2772:Server 'racnode2' has been assigned to pool 'Generic'.
2012-12-31 10:52:13.720
[crsd(3393)]CRS-2772:Server 'racnode2' has been assigned to pool 'ora.orcl'.

ocssd.log on racnode2 has more details; here are a couple of key lines:

2012-12-31 10:51:01.129: [    CSSD][3019058064]###################################
2012-12-31 10:51:01.129: [    CSSD][3019058064]clssscExit: CSSD aborting from thread clssnmvKillBlockThread
2012-12-31 10:51:01.129: [    CSSD][3019058064]###################################
.
.
2012-12-31 10:51:02.559: [    CSSD][3029027728]clssgmClientShutdown: total iocapables 0
2012-12-31 10:51:02.559: [    CSSD][3029027728]clssgmClientShutdown: graceful shutdown completed.
2012-12-31 10:51:02.559: [    CSSD][3029027728]clssnmSendManualShut: Notifying all nodes that this node has been manually shut down
.
.
2012-12-31 10:51:25.352: [    CSSD][3040868032]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (clustered) mode with uniqueness value 1356979885
2012-12-31 10:51:25.353: [    CSSD][3040868032]clssscmain: Environment is production
.
.
2012-12-31 10:51:26.167: [GIPCHTHR][3024477072] gipchaWorkerCreateInterface: created local interface for node 'racnode2', haName 'CSS_racnode-cluster', inf 'udp://192.168.2.152:29788'

A key point here is that UDP is reconfigured to run on a different port; after that, node2 is able to join the cluster and start up all its resources.
The following netstat output also confirms that ocssd.bin now listens on the new port:

[root@racnode2 ~]# netstat -a --inet |grep -i udp | grep -i racnode2
udp        0      0 racnode2-priv:31126         *:*                                     
udp        0      0 racnode2-priv:35489         *:*                                     
udp        0      0 racnode2-priv:38321         *:*                                     
udp        0      0 racnode2-priv:60741         *:*                                     
udp        0      0 racnode2-priv:10321         *:*                                     
udp        0      0 racnode2-priv:29788         *:*   --> new port is created.....
   
[root@racnode2 working]# lsof -i :29788
COMMAND    PID USER   FD   TYPE DEVICE SIZE NODE NAME
ocssd.bin 5919 grid   52u  IPv4 764918       UDP racnode2-priv:29788 

To sum up, the following sequence of events occurs:

  •  UDP communication for the heartbeat is blocked (outgoing UDP on the ocssd port).
  •  Node1 evicts node2.
  •  Node2 is able to stop all I/O-capable resources; as a result, there is no need to reboot the node (an 11g feature, rebootless restart).
  •  Node2 restarts CSSD and reconfigures the UDP port.
  •  Node2 is able to rejoin the cluster.

This sounds perfect, as node2 is able to recover by itself; the recovery is transparent and straightforward.
Let's see how this failure is recovered if the UDP hiccup occurs on node1 (the master node in a two-node RAC).


Scenario 2: When UDP communication is blocked on the first node

Following the same steps, the UDP port for the heartbeat is found and then blocked, as shown below:

bash-3.2$ netstat -a --inet |grep -i udp | grep -i racnode1
udp        0      0 racnode1-priv:36613         *:*   
udp        0      0 racnode1-priv:36892         *:*   
udp        0      0 racnode1-priv:26055         *:*   
udp        0      0 racnode1-priv:13167         *:*   
udp        0      0 racnode1-priv:17914         *:*   
udp        0      0 racnode1-priv:51067         *:*   

[root@racnode1 ~]# lsof -i :36613
COMMAND    PID USER   FD   TYPE DEVICE SIZE NODE NAME
ocssd.bin 3010 grid   55u  IPv4  19676       UDP racnode1-priv:36613 

To block the heartbeat, all outgoing traffic on port 36613 is blocked:
iptables -A OUTPUT -s 192.168.2.151 -p udp --sport 36613 -j DROP

Based on scenario 1, I expected to see the same sequence of events; in other words, I expected node2 to be evicted, reconfigured, and rejoined to the cluster.
However, in this case node2 is evicted and then, as shown below, CSSD hangs while starting up and trying to join the cluster.

[root@racnode1 ~]# crsctl check cluster -all
**************************************************************
racnode1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
racnode2:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
**************************************************************
[root@racnode2 ~]# crsctl stat res -init -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  OFFLINE                               Abnormal Termination
ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE                                                   
ora.crf
      1        ONLINE  ONLINE       racnode2                                     
ora.crsd
      1        ONLINE  OFFLINE                                                   
ora.cssd
      1        ONLINE  OFFLINE                               STARTING      
ora.cssdmonitor
      1        ONLINE  ONLINE       racnode2                                     
ora.ctssd
      1        ONLINE  OFFLINE                                                   
ora.diskmon
      1        OFFLINE OFFLINE                                                   
ora.drivers.acfs
      1        ONLINE  ONLINE       racnode2                                     
ora.evmd
      1        ONLINE  OFFLINE                                                   
ora.gipcd
      1        ONLINE  ONLINE       racnode2                                     
ora.gpnpd
      1        ONLINE  ONLINE       racnode2                                     
ora.mdnsd
      1        ONLINE  ONLINE       racnode2          

Even unblocking the same port by dropping the rule from iptables does not help; CSS on node2 is still not able to join the cluster.

iptables -L      
      
iptables -D OUTPUT -s 192.168.2.151 -p udp --sport 36613 -j DROP


[root@racnode1 ~]# iptables -L      
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination  


After reviewing all the logs (quite lengthy, so I will not copy them here!), it turns out that ocssd on node2 complains about the network heartbeat and no reconfiguration is attempted; on node1, after UDP was blocked, the interface was disabled and no attempt is made to set up communication differently.
As mentioned earlier, although the UDP port has been unblocked, the following errors are still reported on node2 and node1 repeatedly:

Node 2
==========
 [    CSSD][3013077904]clssnmvDHBValidateNcopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 249572810, wrtcnt, 181183, LATS 1471304, lastSeqNo 181182, uniqueness 1356788464, timestamp 1356790643/1482244
 
Node1 
=========
[GIPCHALO][3023862672] gipchaLowerProcessNode: no valid interfaces found to node for 25790 ms, node 0xa062a88 { host 'racnode2', haName 'CSS_racnode-cluster', srcLuid be28e076-9f3aafb1, dstLuid 61a1f895-ba260945 numInf 0, contigSeq 2639, lastAck 2626, lastValidAck 2638, sendSeq [2627 : 2683], createTime 4294328280, sentRegister 1, localMonitor 1, flags 0x2408 }

It turned out that the issue is reported as Bug 14281269: "NODE CAN'T REJOIN THE CLUSTER AFTER A TEMPORARY INTERCONNECT FAILURE - PROBLEM: after an interconnect failure on the first node, the second node restarts the clusterware (rebootless restart) as expected, but can't join the cluster again till the interconnect interface of node1 is shutdown/startup manually."
At the time of this posting, there is no patch available, and the suggested workaround is to bounce the interconnect interface.
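The workaround can be sketched in shell. This assumes the private interconnect NIC on node1 is eth1 (an assumption; check yours with `oifcfg getif`), and the actual bounce commands are left commented out since they must be run as root:

```shell
# IFACE is the private interconnect NIC on node1; eth1 is an assumed name.
IFACE=${IFACE:-eth1}
echo "bouncing interconnect interface $IFACE on node1"
# Run as root:
# ifdown "$IFACE" && ifup "$IFACE"
# or equivalently: ip link set "$IFACE" down && ip link set "$IFACE" up
```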

In my test, even rebooting node2 (the evicted node) did not help; I ended up killing the gipc daemon on node1 (the master/surviving node), which did help: the whole cluster recovered and node2 was able to join.

[root@racnode1 working]# ps -ef |grep -i gipc
grid      2961     1  0 05:40 ?        00:00:16 /u01/app/11.2.0/grid/bin/gipcd.bin
root      7709  4792  0 06:44 pts/1    00:00:00 grep -i gipc
[root@racnode1 working]# kill -9 2961
[root@racnode1 working]# ps -ef |grep -i gipc
grid      7717     1 15 06:44 ?        00:00:00 /u01/app/11.2.0/grid/bin/gipcd.bin
root      7755  4792  0 06:44 pts/1    00:00:00 grep -i gipc


[/u01/app/11.2.0/grid/bin/oraagent.bin(3528)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:1:5} in /u01/app/11.2.0/grid/log/racnode1/agent/crsd/oraagent_grid/oraagent_grid.log.
2012-12-29 06:44:56.972
[/u01/app/11.2.0/grid/bin/orarootagent.bin(3535)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:2:23} in /u01/app/11.2.0/grid/log/racnode1/agent/crsd/orarootagent_root/orarootagent_root.log.
2012-12-29 06:44:56.974
[/u01/app/11.2.0/grid/bin/oraagent.bin(3741)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:5:63} in /u01/app/11.2.0/grid/log/racnode1/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2012-12-29 06:44:57.098
[ohasd(2414)]CRS-2765:Resource 'ora.ctssd' has failed on server 'racnode1'.
2012-12-29 06:44:59.141
[ctssd(7732)]CRS-2401:The Cluster Time Synchronization Service started on host racnode1.
2012-12-29 06:44:59.141
[ctssd(7732)]CRS-2407:The new Cluster Time Synchronization Service reference node is host racnode1.
2012-12-29 06:45:01.164
[cssd(3010)]CRS-1601:CSSD Reconfiguration complete. Active nodes are racnode1 racnode2 .
2012-12-29 06:45:02.363
[crsd(7759)]CRS-1012:The OCR service started on node racnode1.
2012-12-29 06:45:03.155
[evmd(7762)]CRS-1401:EVMD started on node racnode1.
2012-12-29 06:45:05.147
[crsd(7759)]CRS-1201:CRSD started on node racnode1.
2012-12-29 06:45:38.798
[crsd(7759)]CRS-2772:Server 'racnode2' has been assigned to pool 'Generic'.
2012-12-29 06:45:38.800
[crsd(7759)]CRS-2772:Server 'racnode2' has been assigned to pool 'ora.orcl'.


===== alert for node2 =========

2012-12-29 06:39:38.132
[cssd(7700)]CRS-1605:CSSD voting file is online: /dev/sda1; details in /u01/app/11.2.0/grid/log/racnode2/cssd/ocssd.log.
2012-12-29 06:45:01.165
[cssd(7700)]CRS-1601:CSSD Reconfiguration complete. Active nodes are racnode1 racnode2 .
2012-12-29 06:45:03.641
[ctssd(8061)]CRS-2401:The Cluster Time Synchronization Service started on host racnode2.
2012-12-29 06:45:03.641
[ctssd(8061)]CRS-2407:The new Cluster Time Synchronization Service reference node is host racnode1.
2012-12-29 06:45:05.257
[ohasd(2405)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2012-12-29 06:45:16.836
[ctssd(8061)]CRS-2408:The clock on host racnode2 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
2012-12-29 06:45:25.475
[crsd(8199)]CRS-1012:The OCR service started on node racnode2.
2012-12-29 06:45:25.541
[evmd(8079)]CRS-1401:EVMD started on node racnode2.
2012-12-29 06:45:27.331
[crsd(8199)]CRS-1201:CRSD started on node racnode2.
2012-12-29 06:45:35.642
[/u01/app/11.2.0/grid/bin/oraagent.bin(8321)]CRS-5016:Process "/u01/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/racnode2/agent/crsd/oraagent_grid/oraagent_grid.log"
2012-12-29 06:45:36.181
[/u01/app/11.2.0/grid/bin/oraagent.bin(8347)]CRS-5011:Check of resource "orcl" failed: details at "(:CLSN00007:)" in "/u01/app/11.2.0/grid/log/racnode2/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2012-12-29 06:45:37.301
[/u01/app/11.2.0/grid/bin/oraagent.bin(8321)]CRS-5016:Process "/u01/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/racnode2/agent/crsd/oraagent_grid/oraagent_grid.log"

To conclude, in a two-node RAC:

  1. A network hiccup on the heartbeat port of node2 is recovered automatically.
  2. A network hiccup on the heartbeat port of node1 requires manual intervention due to bug 14281269.
  3. Due to several reported bugs, it is recommended to be on at least 11.2.0.3 PSU3; check the My Oracle Support note "List of gipc defects that prevent GI from starting/joining after network hiccups" (ID 1488378.1) for other bugs.

 




TIP 103# : GoldenGate and statistics

After setting up GoldenGate in your environment, the main question you will most likely face is whether GoldenGate is working and, if so, how much work it has done.
GoldenGate provides the stats command to report the work done; the following shows how to use it properly and effectively.

To get stats from now onwards, follow the steps below:

1. Reset the numbers : stats extract ext_int, reset
2. Apply the changes
3. Retrieve stats    :
                     stats extract ext_int, table xxxx.yyyy, latest
                     stats extract ext_int, totalsonly xxxx.yyyy, latest


Please be aware that if you do not reset the numbers, all reported statistics are counted from the startup of the GoldenGate process. In other words, if the numbers are not reset, the stats commands behave as follows:

stats extract ext_int, totalsonly xxxx.yyyyy   --> since startup of ext_int
stats extract ext_int, daily, table xxxx.yyyy  --> since the start of the current day
stats extract ext_int, daily, table owner.*    --> for all tables since the start of the current day
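The reset-then-measure cycle in the steps above can be wrapped in a small helper. This sketch only prints the GGSCI command sequence for a given extract group; the group and table names are the same placeholders used above, and on a real system you would pipe the output into ggsci (e.g. `... | $OGG_HOME/ggsci`, where $OGG_HOME is an assumed install location):

```shell
# stats_window prints the GGSCI commands for a clean measurement window:
# reset first, run the workload, then report only the delta with "latest".
stats_window() {
  group=$1; table=$2
  echo "stats extract $group, reset"
  echo "-- ... apply the workload, then:"
  echo "stats extract $group, table $table, latest"
  echo "stats extract $group, totalsonly $table, latest"
}

stats_window ext_int xxxx.yyyy
```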


Hope this helps those who may otherwise find the numbers misleading or incorrect.