Exadata DB Connections Timeout - Solution
Issue #:
14:28:55.237 ERROR [de4071bb-1008-4f44-b8db-e4225bb529af] Producty - Method: Exception found: {}
org.sql2o.Sql2oException: Could not acquire a connection from DataSource - spring - Connection is not available, request timed out after 5000ms.
at org.sql2o.Connection.createConnection(Connection.java:301)
at org.sql2o.Connection.<init>(Connection.java:63)
... 34 lines omitted ...
Caused by: java.sql.SQLTransientConnectionException: spring - Connection is not available, request timed out after 5000ms.
Impacted:
· Application servers went down.
· Scheduled jobs were affected by the DB connectivity issue.
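The 5000 ms in the stack trace is the connection pool's checkout timeout (the pool is named "spring"; the message format matches HikariCP's connectionTimeout). When the DB stops answering, every checkout waits the full 5 s and then fails. On the DB side, one quick cross-check is whether the instances were also hitting their own process/session ceilings during the window; a sketch, assuming SELECT access on GV$RESOURCE_LIMIT:

```sql
-- Were any instances near their PROCESSES/SESSIONS ceilings
-- when connections started timing out?
SELECT inst_id, resource_name, current_utilization,
       max_utilization, limit_value
FROM   gv$resource_limit
WHERE  resource_name IN ('processes', 'sessions');
```

If max_utilization is at or near limit_value, the pool timeouts may be a symptom of connection-storm exhaustion rather than (or in addition to) the cluster hang.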
Core DBA:
Further to the discussion, we could see high I/O from 2:27 PM till 2:32 PM today. After validation, the below SQL IDs were found to be running long with heavy I/O:
d81zg34jjjj8q
c1zvd53ss5z7q
aarf661nukynv
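The SQL IDs above can be cross-checked against AWR history to quantify the elapsed time and I/O wait per statement; a sketch, assuming access to the DBA_HIST views and that the :begin_snap/:end_snap bind placeholders are set to snapshots covering 2:27–2:32 PM:

```sql
-- Hedged sketch: resource consumption of the suspect SQL IDs
-- over the incident snapshot range (snapshot bounds are placeholders).
SELECT s.sql_id,
       SUM(s.elapsed_time_delta) / 1e6 AS elapsed_s,
       SUM(s.iowait_delta)       / 1e6 AS io_wait_s,
       SUM(s.executions_delta)         AS execs
FROM   dba_hist_sqlstat s
WHERE  s.sql_id IN ('d81zg34jjjj8q', 'c1zvd53ss5z7q', 'aarf661nukynv')
AND    s.snap_id BETWEEN :begin_snap AND :end_snap
GROUP  BY s.sql_id
ORDER  BY elapsed_s DESC;
```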
Analysis #1:
We see the below error in the DB alert log, coming from a different IP address. Can you please check whether any trace files were generated, or whether there were any errors in the listener (local or SCAN) logs during that time?
3 03/JAN/20 02:31:00.459000000 PM +11:00 WARNING: inbound connection timed out (ORA-3136)
3 03/JAN/20 02:31:00.459000000 PM +11:00 address: (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=))
Core DBA: No, no trace files were generated.
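ORA-3136 usually means a client opened a TCP connection but did not complete authentication within SQLNET.INBOUND_CONNECT_TIMEOUT; during a cluster hang, even healthy clients can trip it. An illustrative fragment of the settings worth reviewing (the 60-second values below are assumptions for the sketch, not this system's actual configuration):

```
# sqlnet.ora on the DB server -- illustrative values only
SQLNET.INBOUND_CONNECT_TIMEOUT = 60

# listener.ora (suffix is the listener name; LISTENER here is a placeholder)
INBOUND_CONNECT_TIMEOUT_LISTENER = 60
```

If these are set very low, slow authentications during the hang window would be logged as ORA-3136 even though the root cause lies elsewhere.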
Analysis #2:
After analysing the AWR report, we found the below information in the alert log files from nodes 3 and 4. A DRM hang appears to have caused the reconfiguration issue (reconfiguration started at the same time on nodes 3 and 4).
Please can you provide the LMON trace file for 14:27 to 14:32?
I will provide the full AWR report analysis shortly.
https://support.oracle.com/knowledge/Oracle%20Database%20Products/1528362_1.html
Analysis #3:
Summary Analysis:
1. At 2:28 PM, DB cluster reconfiguration started at the same time on nodes 3 and 4.
2. From 2:30 to 2:32 PM, DB connections timed out.
3. Cluster background processes timed out.
4. The locking mechanism also timed out.
Workaround / Solution:
1. The DB is currently running 11.2.0.4. After checking the LMON trace file and the patchset guidance in Doc ID 1528362.1, we can raise an Oracle SR to request the required patch.
2. DRM hang causes frequent RAC instance reconfiguration (Doc ID 1528362.1).
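The interim workaround commonly discussed for Doc ID 1528362.1 is to disable DRM until the patch is applied. A hedged sketch follows; these are hidden ("underscore") parameters, so they must only be set with Oracle Support's explicit sign-off for this exact version, and they require an instance restart to take effect:

```sql
-- Disable Dynamic Resource Mastering (DRM) cluster-wide.
-- Hidden parameters: apply only under Oracle Support guidance.
ALTER SYSTEM SET "_gc_policy_time" = 0 SCOPE = SPFILE SID = '*';
ALTER SYSTEM SET "_gc_undo_affinity" = FALSE SCOPE = SPFILE SID = '*';
-- A rolling restart of each instance is required for these to apply.
```

Disabling DRM trades off the locality benefits of remastering against the reconfiguration hangs seen here, which is why it is positioned as a stopgap rather than a permanent fix.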
Detail Analysis:
1. Server load was almost the same on 2nd and 3rd Jan.
3rd Jan:
I# | Num CPUs | CPU Cores | CPU Sckts | Load Begin | Load End | % Busy | % Usr | % Sys | % WIO | % Idl | Busy Time (s) | Idle Time (s) | Total Time (s) | Memory (M) | End CPUs | End Cores | End Sckts | End Memory (M)
3 | 72 | 36 | 2 | 11.69 | 8.88 | 6.82 | 4.90 | 1.62 | 0.01 | 93.18 | 17,339.40 | 236,948.67 | 254,288.07 | 773,472.32 | | | |
4 | 72 | 36 | 2 | 8.23 | 6.37 | 4.62 | 3.41 | 1.04 | 0.01 | 95.38 | 11,780.39 | 243,201.43 | 254,981.82 | 773,472.32 | | | |
Sum | | | | | | | | | | | 29,119.79 | 480,150.10 | 509,269.89 | | | | |
2nd Jan:
I# | Num CPUs | CPU Cores | CPU Sckts | Load Begin | Load End | % Busy | % Usr | % Sys | % WIO | % Idl | Busy Time (s) | Idle Time (s) | Total Time (s) | Memory (M) | End CPUs | End Cores | End Sckts | End Memory (M)
3 | 72 | 36 | 2 | 8.18 | 9.96 | 13.60 | 9.81 | 3.23 | 0.01 | 86.40 | 34,923.49 | 221,884.27 | 256,807.76 | 773,472.32 | | | |
4 | 72 | 36 | 2 | 6.43 | 6.64 | 7.18 | 5.44 | 1.50 | 0.01 | 92.82 | 18,507.58 | 239,110.78 | 257,618.36 | 773,472.32 | | | |
Sum | | | | | | | | | | | 53,431.07 | 460,995.05 | 514,426.12 | | | | |
The 'Other' wait class, along with Network, was high on 3rd Jan. Time model by wait class (% of DB time):
3rd Jan:
User I/O | Sys I/O | Other | Application | Commit | Network | Concurrency | Config | Cluster | DB CPU
21.10 | 0.02 | 18.76 | 1.40 | 0.08 | 13.15 | 0.09 | 2.46 | 17.35 | 20.30
2nd Jan:
User I/O | Sys I/O | Other | Application | Commit | Network | Concurrency | Config | Cluster | DB CPU
39.09 | 0.01 | 0.46 | 0.07 | 0.03 | 0.59 | 0.08 | 3.07 | 5.03 | 34.51
The dominant 'Other' wait event is a cluster resource wait (Oracle internal activity) that occurred only on 3rd Jan; more details on the event are below.
3rd Jan:
I# | Class | Event | Waits | %Timeouts | Total(s) | Avg(ms) | %DB time | Sum Avg(ms) | Sum Min(ms) | Sum Max(ms) | Sum Std Dev | Inst Cnt
* | | DB CPU | | | 14,846.06 | | 20.30 | | | | | 2
* | Other | gcs resource directory to be unfrozen | 6,775 | 95.11 | 13,349.67 | 1970.43 | 18.25 | 1961.97 | 1949.20 | 1974.73 | 18.05 | 2
* | User I/O | cell single block physical read | 30,760,043 | 0.00 | 12,614.08 | 0.41 | 17.25 | 0.44 | 0.40 | 0.47 | 0.05 | 2
* | System I/O | RMAN backup & recovery I/O | 857,759 | 0.00 | 12,467.22 | 14.53 | 17.05 | 14.53 | 14.43 | 14.64 | 0.14 | 2
* | Cluster | gc freelist | 60,131 | 82.01 | 10,038.55 | 166.94 | 31.80 | 166.94 | 166.94 | 166.94 | | 1
* | Concurrency | row cache lock | 13,680 | 5.21 | 3,567.35 | 260.77 | 4.88 | 612.07 | 0.18 | 1223.96 | 865.35 | 2
* | Concurrency | library cache lock | 10,558 | 5.54 | 3,363.50 | 318.57 | 4.60 | 1875.60 | 107.96 | 3643.25 | 2499.83 | 2
* | User I/O | cell smart table scan | 17,331,056 | 51.23 | 2,536.76 | 0.15 | 3.47 | 0.15 | 0.15 | 0.15 | 0.00 | 2
* | Concurrency | library cache pin | 14,121 | 0.00 | 2,121.93 | 150.27 | 2.90 | 158.28 | 148.95 | 167.60 | 13.19 | 2
* | Network | SQL*Net more data from client | 527,196 | 0.00 | 1,723.15 | 3.27 | 2.36 | 3.27 | 3.27 | 3.27 | | 2
2nd Jan:
I# | Class | Event | Waits | %Timeouts | Total(s) | Avg(ms) | %DB time | Sum Avg(ms) | Sum Min(ms) | Sum Max(ms) | Sum Std Dev | Inst Cnt
* | | DB CPU | | | 30,793.51 | | 34.51 | | | | | 2
* | User I/O | cell single block physical read | 67,029,701 | 0.00 | 24,153.60 | 0.36 | 27.07 | 0.37 | 0.35 | 0.39 | 0.03 | 2
* | User I/O | cell multiblock physical read | 3,357,939 | 0.00 | 5,863.56 | 1.75 | 6.57 | 1.73 | 1.66 | 1.80 | 0.10 | 2
* | User I/O | cell smart table scan | 24,176,351 | 50.98 | 4,641.71 | 0.19 | 5.20 | 0.19 | 0.19 | 0.19 | 0.00 | 2
* | Network | SQL*Net more data from client | 664,649 | 0.00 | 2,603.27 | 3.92 | 2.92 | 3.92 | 3.92 | 3.92 | | 2
* | User I/O | cell list of blocks physical read | 4,298,928 | 0.00 | 2,545.01 | 0.59 | 2.85 | 0.59 | 0.58 | 0.60 | 0.01 | 2
* | Cluster | gc cr grant 2-way | 15,503,329 | 0.00 | 1,838.16 | 0.12 | 2.06 | 0.12 | 0.12 | 0.12 | 0.00 | 2
* | Cluster | gc cr multi block request | 3,849,691 | 0.00 | 1,775.51 | 0.46 | 1.99 | 0.46 | 0.43 | 0.49 | 0.04 | 2
* | User I/O | direct path read temp | 240,951 | 0.00 | 1,220.15 | 5.06 | 1.37 | 5.03 | 4.49 | 5.57 | 0.77 | 2
* | Scheduler | resmgr:cpu quantum | 179,407 | 0.00 | 894.56 | 4.99 | 1.00 | 5.00 | 4.95 | 5.04 | 0.06 | 2
The locking mechanism issue occurred only on node 4, and only on 3rd Jan.
3rd Jan:
I# | Class | Event | Waits | %Timeouts | Total(s) | Avg(ms) | %DB time | Sum Avg(ms) | Sum Min(ms) | Sum Max(ms) | Sum Std Dev | Inst Cnt
* | | DB CPU | | | 14,846.06 | | 20.30 | | | | | 2
* | Cluster | gc freelist | 60,007 | 82.06 | 10,022.62 | 167.02 | 31.75 | 167.02 | 167.02 | 167.02 | | 1
* | Other | gcs resource directory to be unfrozen | 5,038 | 94.58 | 9,912.54 | 1967.55 | 13.55 | 1966.12 | 1963.72 | 1968.51 | 3.39 | 2
* | User I/O | cell single block physical read | 19,363,802 | 0.00 | 8,386.46 | 0.43 | 11.47 | 0.45 | 0.42 | 0.47 | 0.04 | 2
* | Concurrency | row cache lock | 8,527 | 8.36 | 3,566.50 | 418.26 | 4.88 | 647.87 | 0.20 | 1295.55 | 915.95 | 2
* | Concurrency | library cache lock | 3,888 | 7.97 | 2,868.64 | 737.82 | 3.92 | 1966.38 | 176.03 | 3756.72 | 2531.93 | 2
* | User I/O | cell smart table scan | 17,331,056 | 51.23 | 2,536.76 | 0.15 | 3.47 | 0.15 | 0.15 | 0.15 | 0.00 | 2
* | Concurrency | library cache pin | 6,249 | 0.00 | 2,029.24 | 324.73 | 2.77 | 224.79 | 85.30 | 364.28 | 197.27 | 2
* | Network | SQL*Net more data from client | 527,196 | 0.00 | 1,723.15 | 3.27 | 4.15 | 3.27 | 3.27 | 3.27 | | 1
* | User I/O | cell multiblock physical read | 802,957 | 0.00 | 1,309.16 | 1.63 | 1.79 | 1.63 | 1.42 | 1.84 | 0.29 | 2
2nd Jan:
I# | Class | Event | Waits | %Timeouts | Total(s) | Avg(ms) | %DB time | Sum Avg(ms) | Sum Min(ms) | Sum Max(ms) | Sum Std Dev | Inst Cnt
* | | DB CPU | | | 30,793.51 | | 34.51 | | | | | 2
* | User I/O | cell single block physical read | 54,299,295 | 0.00 | 19,740.72 | 0.36 | 22.12 | 0.37 | 0.35 | 0.39 | 0.03 | 2
* | User I/O | cell multiblock physical read | 3,355,153 | 0.00 | 5,845.42 | 1.74 | 6.55 | 1.72 | 1.65 | 1.79 | 0.10 | 2
* | User I/O | cell smart table scan | 24,176,351 | 50.98 | 4,641.71 | 0.19 | 5.20 | 0.19 | 0.19 | 0.19 | 0.00 | 2
* | Network | SQL*Net more data from client | 664,649 | 0.00 | 2,603.27 | 3.92 | 4.66 | 3.92 | 3.92 | 3.92 | | 1
* | User I/O | cell list of blocks physical read | 4,298,627 | 0.00 | 2,543.38 | 0.59 | 2.85 | 0.59 | 0.58 | 0.60 | 0.01 | 2
* | Cluster | gc cr grant 2-way | 15,469,713 | 0.00 | 1,833.78 | 0.12 | 2.05 | 0.12 | 0.12 | 0.12 | 0.00 | 2
* | Cluster | gc cr multi block request | 3,846,794 | 0.00 | 1,765.56 | 0.46 | 1.98 | 0.46 | 0.43 | 0.49 | 0.04 | 2
* | User I/O | direct path read temp | 240,951 | 0.00 | 1,220.15 | 5.06 | 1.37 | 5.03 | 4.49 | 5.57 | 0.77 | 2
* | Scheduler | resmgr:cpu quantum | 178,964 | 0.00 | 885.25 | 4.95 | 0.99 | 4.96 | 4.91 | 5.02 | 0.08 | 2
More details on the 'Other' wait events that caused problems only on node 4.
3rd Jan:
I# | Class | Event | Waits | %Timeouts | Total(s) | Avg(ms) | %DB time
4 | System I/O | RMAN backup & recovery I/O | 427,929 | 0.00 | 6,263.64 | 14.64 | 74.04
4 | | background cpu time | | | 1,516.74 | | 17.93
4 | Other | gcs resource directory to be unfrozen | 132 | 90.91 | 242.64 | 1838.18 | 2.87
4 | Other | ges lms sync during dynamic remastering and reconfig | 202 | 46.53 | 134.44 | 665.54 | 1.59
4 | Other | ges LMON to get to FTDONE | 5,950 | 99.01 | 130.12 | 21.87 | 1.54
4 | Other | enq: CO - master slave det | 1,143 | 100.00 | 129.66 | 113.44 | 1.53
4 | Application | enq: TM - contention | 21 | 0.00 | 127.77 | 6084.19 | 1.51
4 | Concurrency | library cache pin | 110 | 0.00 | 91.35 | 830.46 | 1.08
4 | System I/O | log file sequential read | 15,195 | 0.00 | 84.48 | 5.56 | 1.00
4 | System I/O | log file parallel write | 94,181 | 0.00 | 47.35 | 0.50 | 0.56
Only one SQL query showed high CPU, I/O, and elapsed time, and only on 3rd Jan.
3rd Jan:
After checking the ASH report, sql_id 5fazzyb6hs9g6 is a DB scheduler background job, but it was not executed between 2:28 PM and 2:32 PM.
Sid, Serial# | % Activity | Event | % Event | User | Program | # Samples Active | XIDs
32,30861 | 50.00 | CPU + Wait for CPU | 50.00 | username | (J010) | 1/360 [ 0%] | 0
538,32305 | 50.00 | CPU + Wait for CPU | 50.00 | username | (J006) | 1/360 [ 0%] | 1
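The claim above (the job's sql_id showing no executions in the window) can be double-checked directly against ASH history; a sketch, assuming access to DBA_HIST_ACTIVE_SESS_HISTORY and using the incident window's local (UTC+11) timestamps:

```sql
-- Count ASH samples for the suspect scheduler-job SQL
-- during the 2:28-2:32 PM incident window on 3rd Jan.
SELECT sql_id, COUNT(*) AS ash_samples
FROM   dba_hist_active_sess_history
WHERE  sql_id = '5fazzyb6hs9g6'
AND    sample_time BETWEEN TIMESTAMP '2020-01-03 14:28:00'
                       AND TIMESTAMP '2020-01-03 14:32:00'
GROUP  BY sql_id;
```

Zero rows returned would support ruling the job out as a contributor to the hang.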
Analysis #4:
Issue: From 07:00 PM to 07:08 PM a DB connection issue occurred, and because of it all services went down.
Could you please verify on priority? This is the third occurrence in the last two days.
Core DBA:
Node 3 Alert log error:
FAL[server, ARC3]: FAL archive failed, see trace file.
ARCH: FAL archive failed. Archiver continuing
ORACLE Instance - Archival Error. Archiver continuing.
Node 4 Alert log error:
Instance terminated by LMS0, pid = 113256
USER (ospid: 206853): terminating the instance
Instance terminated by USER, pid = 206853
Raised an Oracle SR, as requested by the manager.
Analysis #5:
Please find below a detailed account of how node 4 was evicted (instance restarted) from the cluster.
@Core DBA: Between 7:02:52 and 7:08:04 PM not much information was generated in the DB alert log, and in the trace file I found waits on the 'KSV master wait' event but no further information. If you can provide the cluster logs, we can go in-depth at the cluster level and find the reason for the DB instance hang/restart. You can follow the link below for all cluster log locations; please also provide the ASM alert log file.
http://db.geeksinsight.com/2012/09/29/11gr2-rac-important-trace-directories-log-locations/
Reconfiguration started at 6:37:37 PM, at the same time on instances 3 and 4, and completed successfully.
Instance 3
Instance 4
Reconfiguration started again at 7:02:51 PM, at the same time on instances 3 and 4, but this time node 3 got an error communicating with node 4 at 7:02:52 PM.
Instance 3:
Node 3 decided to restart node 4 at 7:08:02 PM with reason 0x2. After researching this code, it relates to Exadata memory.
Note: a node restart can be triggered by memory, network, storage (I/O), or CPU (process) communication issues between nodes.
Instance 4:
Node 3 once again started reconfiguration at 7:08:02 PM, but this time node 4 was absent (because the node had restarted).
Instance 3:
Before node 4 restarted, all of its DB connections timed out at 7:08:02 PM.
Node 3 started seeing DB connection timeouts at 7:08:03 PM (after node 4).
Node 3 started recovery at 7:08:04 PM.
Instance 3:
During recovery on node 3, archive logs for node 4 were rejected at the standby, followed by an archiver error (node 4 went down at 7:08:08 PM).
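The archiver errors seen during recovery can be followed up on node 3 with a quick destination status check; a sketch, run as a privileged user once the cluster has settled:

```sql
-- Show archive destinations still reporting a problem
-- after the FAL/archiver failure.
SELECT dest_id, dest_name, status, error
FROM   v$archive_dest
WHERE  status NOT IN ('INACTIVE', 'VALID');
```

Destinations left in an error state would need to be cleared (or re-enabled) so the standby gap can be resolved.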
Instance 4:
Node 4 started at 7:08:10 PM.
Instance 4:
Reconfiguration started at 7:08:13 PM, and this time nodes 3 and 4 completed it successfully.
Instance 3:
Instance 4:
Trace file -- 'KSV master wait'
0: waited for 'KSV master wait'
=0x0, =0x0, =0x0
wait_id=9229685 seq_num=54786 snap_id=1
wait times: snap=0.000002 sec, exc=0.000002 sec, total=0.000002 sec
wait times: max=0.000000 sec
wait counts: calls=1 os=1
occurred after 0.000010 sec of elapsed time
1: waited for 'KSV master wait'
=0x0, =0x0, =0x0
wait_id=9229684 seq_num=54785 snap_id=1
wait times: snap=0.000003 sec, exc=0.000003 sec, total=0.000003 sec
wait times: max=0.000000 sec
wait counts: calls=1 os=1
occurred after 0.000021 sec of elapsed time
2: waited for 'rdbms ipc message'
timeout=0x12c, =0x0, =0x0
wait_id=9229683 seq_num=54784 snap_id=1
wait times: snap=3.076189 sec, exc=3.076189 sec, total=3.076189 sec
wait times: max=3.000000 sec
wait counts: calls=1 os=1
occurred after 0.000002 sec of elapsed time
3: waited for 'KSV master wait'
=0x0, =0x0, =0x0
wait_id=9229682 seq_num=54783 snap_id=1
wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec
wait times: max=0.000000 sec
wait counts: calls=1 os=1
occurred after 0.000026 sec of elapsed time
Final Solution from Oracle SR:
After changing the parameters recommended in the SR, the issue was resolved.