Exadata DB Connections Timeout - Solution
Issue #:
14:28:55.237 ERROR [de4071bb-1008-4f44-b8db-e4225bb529af] Producty - Method: Exception found: {}
org.sql2o.Sql2oException: Could not acquire a connection from DataSource - spring - Connection is not available, request timed out after 5000ms.
at org.sql2o.Connection.createConnection(Connection.java:301)
at org.sql2o.Connection.<init>(Connection.java:63)
... 34 lines omitted ...
Caused by: java.sql.SQLTransientConnectionException: spring - Connection is not available, request timed out after 5000ms.
Impacted:
· Application servers went down.
· Scheduled jobs were affected by the DB connectivity issue.
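The 5000 ms in the stack trace is the connection pool's checkout timeout (the pool is named "spring"; the message format matches HikariCP's connectionTimeout). When the DB stops answering, every checkout waits the full 5 s and then fails. On the DB side, one quick cross-check is whether the instances were also hitting their own process/session ceilings during the window; a sketch, assuming SELECT access on GV$RESOURCE_LIMIT:

```sql
-- Were any instances near their PROCESSES/SESSIONS ceilings
-- when connections started timing out?
SELECT inst_id, resource_name, current_utilization,
       max_utilization, limit_value
FROM   gv$resource_limit
WHERE  resource_name IN ('processes', 'sessions');
```

If max_utilization is at or near limit_value, the pool timeouts may be a symptom of connection-storm exhaustion rather than (or in addition to) the cluster hang.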
Core DBA:
Further to the discussion, we could see high I/O from 2:27 PM till 2:32 PM today. After validation, the below SQL IDs were found to be running long with heavy I/O:
d81zg34jjjj8q
c1zvd53ss5z7q
aarf661nukynv
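The SQL IDs above can be cross-checked against AWR history to quantify the elapsed time and I/O wait per statement; a sketch, assuming access to the DBA_HIST views and that the :begin_snap/:end_snap bind placeholders are set to snapshots covering 2:27–2:32 PM:

```sql
-- Hedged sketch: resource consumption of the suspect SQL IDs
-- over the incident snapshot range (snapshot bounds are placeholders).
SELECT s.sql_id,
       SUM(s.elapsed_time_delta) / 1e6 AS elapsed_s,
       SUM(s.iowait_delta)       / 1e6 AS io_wait_s,
       SUM(s.executions_delta)         AS execs
FROM   dba_hist_sqlstat s
WHERE  s.sql_id IN ('d81zg34jjjj8q', 'c1zvd53ss5z7q', 'aarf661nukynv')
AND    s.snap_id BETWEEN :begin_snap AND :end_snap
GROUP  BY s.sql_id
ORDER  BY elapsed_s DESC;
```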
Analysis #1:
We see the below error in the DB alert log, coming from a different IP address. Can you please check whether any trace files were generated, or whether there were any errors in the listener (local or SCAN) logs during that time?
3 03/JAN/20 02:31:00.459000000 PM +11:00 WARNING: inbound connection timed out (ORA-3136)
3 03/JAN/20 02:31:00.459000000 PM +11:00 address: (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=))
Core DBA: No, no trace files were generated.
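ORA-3136 usually means a client opened a TCP connection but did not complete authentication within SQLNET.INBOUND_CONNECT_TIMEOUT; during a cluster hang, even healthy clients can trip it. An illustrative fragment of the settings worth reviewing (the 60-second values below are assumptions for the sketch, not this system's actual configuration):

```
# sqlnet.ora on the DB server -- illustrative values only
SQLNET.INBOUND_CONNECT_TIMEOUT = 60

# listener.ora (suffix is the listener name; LISTENER here is a placeholder)
INBOUND_CONNECT_TIMEOUT_LISTENER = 60
```

If these are set very low, slow authentications during the hang window would be logged as ORA-3136 even though the root cause lies elsewhere.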
Analysis #2:
After analysing the AWR report, we found the below information in the alert log files from nodes 3 and 4. A DRM hang appears to have caused the reconfiguration issue (reconfiguration started at the same time on nodes 3 and 4).
Please can you provide the LMON trace file for 14:27 to 14:32?
I will provide the full AWR report analysis shortly.
https://support.oracle.com/knowledge/Oracle%20Database%20Products/1528362_1.html
Analysis #3:
Summary Analysis:
1. At 2:28 PM, DB cluster reconfiguration started at the same time on nodes 3 and 4.
2. From 2:30 to 2:32 PM, DB connections timed out.
3. Cluster background processes timed out.
4. The locking mechanism also timed out.
Workaround / Solution:
1. The DB is currently running 11.2.0.4. After checking the LMON trace file and the patchset guidance in Doc ID 1528362.1, we can raise an Oracle SR to request the required patch.
2. DRM hang causes frequent RAC instance reconfiguration (Doc ID 1528362.1).
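The interim workaround commonly discussed for Doc ID 1528362.1 is to disable DRM until the patch is applied. A hedged sketch follows; these are hidden ("underscore") parameters, so they must only be set with Oracle Support's explicit sign-off for this exact version, and they require an instance restart to take effect:

```sql
-- Disable Dynamic Resource Mastering (DRM) cluster-wide.
-- Hidden parameters: apply only under Oracle Support guidance.
ALTER SYSTEM SET "_gc_policy_time" = 0 SCOPE = SPFILE SID = '*';
ALTER SYSTEM SET "_gc_undo_affinity" = FALSE SCOPE = SPFILE SID = '*';
-- A rolling restart of each instance is required for these to apply.
```

Disabling DRM trades off the locality benefits of remastering against the reconfiguration hangs seen here, which is why it is positioned as a stopgap rather than a permanent fix.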
Detail Analysis:
1. Server load was almost the same on 2nd and 3rd Jan.
3rd Jan:
I# | Num CPUs | CPU Cores | CPU Sckts | Load Begin | Load End | % Busy | % Usr | % Sys | % WIO | % Idl | Busy Time (s) | Idle Time (s) | Total Time (s) | Memory (M) | End CPUs | End Cores | End Sckts | End Memory (M)
3 | 72 | 36 | 2 | 11.69 | 8.88 | 6.82 | 4.90 | 1.62 | 0.01 | 93.18 | 17,339.40 | 236,948.67 | 254,288.07 | 773,472.32 | | | |
4 | 72 | 36 | 2 | 8.23 | 6.37 | 4.62 | 3.41 | 1.04 | 0.01 | 95.38 | 11,780.39 | 243,201.43 | 254,981.82 | 773,472.32 | | | |
Sum | | | | | | | | | | | 29,119.79 | 480,150.10 | 509,269.89 | | | | |
2nd Jan:
I# | Num CPUs | CPU Cores | CPU Sckts | Load Begin | Load End | % Busy | % Usr | % Sys | % WIO | % Idl | Busy Time (s) | Idle Time (s) | Total Time (s) | Memory (M) | End CPUs | End Cores | End Sckts | End Memory (M)
3 | 72 | 36 | 2 | 8.18 | 9.96 | 13.60 | 9.81 | 3.23 | 0.01 | 86.40 | 34,923.49 | 221,884.27 | 256,807.76 | 773,472.32 | | | |
4 | 72 | 36 | 2 | 6.43 | 6.64 | 7.18 | 5.44 | 1.50 | 0.01 | 92.82 | 18,507.58 | 239,110.78 | 257,618.36 | 773,472.32 | | | |
Sum | | | | | | | | | | | 53,431.07 | 460,995.05 | 514,426.12 | | | | |
The 'Other' wait class, along with Network, was high on 3rd Jan. Time model by wait class (% of DB time):
3rd Jan:
User I/O | Sys I/O | Other | Application | Commit | Network | Concurrency | Config | Cluster | DB CPU
21.10 | 0.02 | 18.76 | 1.40 | 0.08 | 13.15 | 0.09 | 2.46 | 17.35 | 20.30
2nd Jan:
User I/O | Sys I/O | Other | Application | Commit | Network | Concurrency | Config | Cluster | DB CPU
39.09 | 0.01 | 0.46 | 0.07 | 0.03 | 0.59 | 0.08 | 3.07 | 5.03 | 34.51
The dominant 'Other' wait event is a cluster resource wait (Oracle internal activity) that occurred only on 3rd Jan; more details on the event are below.
3rd Jan:
I# | Class | Event | Waits | %Timeouts | Total(s) | Avg(ms) | %DB time | Sum Avg(ms) | Sum Min(ms) | Sum Max(ms) | Sum Std Dev | Inst Cnt
* | | DB CPU | | | 14,846.06 | | 20.30 | | | | | 2
* | Other | gcs resource directory to be unfrozen | 6,775 | 95.11 | 13,349.67 | 1970.43 | 18.25 | 1961.97 | 1949.20 | 1974.73 | 18.05 | 2
* | User I/O | cell single block physical read | 30,760,043 | 0.00 | 12,614.08 | 0.41 | 17.25 | 0.44 | 0.40 | 0.47 | 0.05 | 2
* | System I/O | RMAN backup & recovery I/O | 857,759 | 0.00 | 12,467.22 | 14.53 | 17.05 | 14.53 | 14.43 | 14.64 | 0.14 | 2
* | Cluster | gc freelist | 60,131 | 82.01 | 10,038.55 | 166.94 | 31.80 | 166.94 | 166.94 | 166.94 | | 1
* | Concurrency | row cache lock | 13,680 | 5.21 | 3,567.35 | 260.77 | 4.88 | 612.07 | 0.18 | 1223.96 | 865.35 | 2
* | Concurrency | library cache lock | 10,558 | 5.54 | 3,363.50 | 318.57 | 4.60 | 1875.60 | 107.96 | 3643.25 | 2499.83 | 2
* | User I/O | cell smart table scan | 17,331,056 | 51.23 | 2,536.76 | 0.15 | 3.47 | 0.15 | 0.15 | 0.15 | 0.00 | 2
* | Concurrency | library cache pin | 14,121 | 0.00 | 2,121.93 | 150.27 | 2.90 | 158.28 | 148.95 | 167.60 | 13.19 | 2
* | Network | SQL*Net more data from client | 527,196 | 0.00 | 1,723.15 | 3.27 | 2.36 | 3.27 | 3.27 | 3.27 | | 2
2nd Jan:
I# | Class | Event | Waits | %Timeouts | Total(s) | Avg(ms) | %DB time | Sum Avg(ms) | Sum Min(ms) | Sum Max(ms) | Sum Std Dev | Inst Cnt
* | | DB CPU | | | 30,793.51 | | 34.51 | | | | | 2
* | User I/O | cell single block physical read | 67,029,701 | 0.00 | 24,153.60 | 0.36 | 27.07 | 0.37 | 0.35 | 0.39 | 0.03 | 2
* | User I/O | cell multiblock physical read | 3,357,939 | 0.00 | 5,863.56 | 1.75 | 6.57 | 1.73 | 1.66 | 1.80 | 0.10 | 2
* | User I/O | cell smart table scan | 24,176,351 | 50.98 | 4,641.71 | 0.19 | 5.20 | 0.19 | 0.19 | 0.19 | 0.00 | 2
* | Network | SQL*Net more data from client | 664,649 | 0.00 | 2,603.27 | 3.92 | 2.92 | 3.92 | 3.92 | 3.92 | | 2
* | User I/O | cell list of blocks physical read | 4,298,928 | 0.00 | 2,545.01 | 0.59 | 2.85 | 0.59 | 0.58 | 0.60 | 0.01 | 2
* | Cluster | gc cr grant 2-way | 15,503,329 | 0.00 | 1,838.16 | 0.12 | 2.06 | 0.12 | 0.12 | 0.12 | 0.00 | 2
* | Cluster | gc cr multi block request | 3,849,691 | 0.00 | 1,775.51 | 0.46 | 1.99 | 0.46 | 0.43 | 0.49 | 0.04 | 2
* | User I/O | direct path read temp | 240,951 | 0.00 | 1,220.15 | 5.06 | 1.37 | 5.03 | 4.49 | 5.57 | 0.77 | 2
* | Scheduler | resmgr:cpu quantum | 179,407 | 0.00 | 894.56 | 4.99 | 1.00 | 5.00 | 4.95 | 5.04 | 0.06 | 2
The locking mechanism issue occurred only on node 4, and only on 3rd Jan.
3rd Jan:
I# | Class | Event | Waits | %Timeouts | Total(s) | Avg(ms) | %DB time | Sum Avg(ms) | Sum Min(ms) | Sum Max(ms) | Sum Std Dev | Inst Cnt
* | | DB CPU | | | 14,846.06 | | 20.30 | | | | | 2
* | Cluster | gc freelist | 60,007 | 82.06 | 10,022.62 | 167.02 | 31.75 | 167.02 | 167.02 | 167.02 | | 1
* | Other | gcs resource directory to be unfrozen | 5,038 | 94.58 | 9,912.54 | 1967.55 | 13.55 | 1966.12 | 1963.72 | 1968.51 | 3.39 | 2
* | User I/O | cell single block physical read | 19,363,802 | 0.00 | 8,386.46 | 0.43 | 11.47 | 0.45 | 0.42 | 0.47 | 0.04 | 2
* | Concurrency | row cache lock | 8,527 | 8.36 | 3,566.50 | 418.26 | 4.88 | 647.87 | 0.20 | 1295.55 | 915.95 | 2
* | Concurrency | library cache lock | 3,888 | 7.97 | 2,868.64 | 737.82 | 3.92 | 1966.38 | 176.03 | 3756.72 | 2531.93 | 2
* | User I/O | cell smart table scan | 17,331,056 | 51.23 | 2,536.76 | 0.15 | 3.47 | 0.15 | 0.15 | 0.15 | 0.00 | 2
* | Concurrency | library cache pin | 6,249 | 0.00 | 2,029.24 | 324.73 | 2.77 | 224.79 | 85.30 | 364.28 | 197.27 | 2
* | Network | SQL*Net more data from client | 527,196 | 0.00 | 1,723.15 | 3.27 | 4.15 | 3.27 | 3.27 | 3.27 | | 1
* | User I/O | cell multiblock physical read | 802,957 | 0.00 | 1,309.16 | 1.63 | 1.79 | 1.63 | 1.42 | 1.84 | 0.29 | 2
2nd Jan:
I# | Class | Event | Waits | %Timeouts | Total(s) | Avg(ms) | %DB time | Sum Avg(ms) | Sum Min(ms) | Sum Max(ms) | Sum Std Dev | Inst Cnt
* | | DB CPU | | | 30,793.51 | | 34.51 | | | | | 2
* | User I/O | cell single block physical read | 54,299,295 | 0.00 | 19,740.72 | 0.36 | 22.12 | 0.37 | 0.35 | 0.39 | 0.03 | 2
* | User I/O | cell multiblock physical read | 3,355,153 | 0.00 | 5,845.42 | 1.74 | 6.55 | 1.72 | 1.65 | 1.79 | 0.10 | 2
* | User I/O | cell smart table scan | 24,176,351 | 50.98 | 4,641.71 | 0.19 | 5.20 | 0.19 | 0.19 | 0.19 | 0.00 | 2
* | Network | SQL*Net more data from client | 664,649 | 0.00 | 2,603.27 | 3.92 | 4.66 | 3.92 | 3.92 | 3.92 | | 1
* | User I/O | cell list of blocks physical read | 4,298,627 | 0.00 | 2,543.38 | 0.59 | 2.85 | 0.59 | 0.58 | 0.60 | 0.01 | 2
* | Cluster | gc cr grant 2-way | 15,469,713 | 0.00 | 1,833.78 | 0.12 | 2.05 | 0.12 | 0.12 | 0.12 | 0.00 | 2
* | Cluster | gc cr multi block request | 3,846,794 | 0.00 | 1,765.56 | 0.46 | 1.98 | 0.46 | 0.43 | 0.49 | 0.04 | 2
* | User I/O | direct path read temp | 240,951 | 0.00 | 1,220.15 | 5.06 | 1.37 | 5.03 | 4.49 | 5.57 | 0.77 | 2
* | Scheduler | resmgr:cpu quantum | 178,964 | 0.00 | 885.25 | 4.95 | 0.99 | 4.96 | 4.91 | 5.02 | 0.08 | 2
More details on the 'Other' wait events that caused problems only on node 4.
3rd Jan:
I# | Class | Event | Waits | %Timeouts | Total(s) | Avg(ms) | %DB time
4 | System I/O | RMAN backup & recovery I/O | 427,929 | 0.00 | 6,263.64 | 14.64 | 74.04
4 | | background cpu time | | | 1,516.74 | | 17.93
4 | Other | gcs resource directory to be unfrozen | 132 | 90.91 | 242.64 | 1838.18 | 2.87
4 | Other | ges lms sync during dynamic remastering and reconfig | 202 | 46.53 | 134.44 | 665.54 | 1.59
4 | Other | ges LMON to get to FTDONE | 5,950 | 99.01 | 130.12 | 21.87 | 1.54
4 | Other | enq: CO - master slave det | 1,143 | 100.00 | 129.66 | 113.44 | 1.53
4 | Application | enq: TM - contention | 21 | 0.00 | 127.77 | 6084.19 | 1.51
4 | Concurrency | library cache pin | 110 | 0.00 | 91.35 | 830.46 | 1.08
4 | System I/O | log file sequential read | 15,195 | 0.00 | 84.48 | 5.56 | 1.00
4 | System I/O | log file parallel write | 94,181 | 0.00 | 47.35 | 0.50 | 0.56
Only one SQL query showed high CPU, I/O, and elapsed time, and only on 3rd Jan.
3rd Jan:
After checking the ASH report, sql_id 5fazzyb6hs9g6 is a DB scheduler background job, but it was not executed between 2:28 PM and 2:32 PM.
Sid, Serial# | % Activity | Event | % Event | User | Program | # Samples Active | XIDs
32,30861 | 50.00 | CPU + Wait for CPU | 50.00 | username | (J010) | 1/360 [ 0%] | 0
538,32305 | 50.00 | CPU + Wait for CPU | 50.00 | username | (J006) | 1/360 [ 0%] | 1
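The claim above (the job's sql_id showing no executions in the window) can be double-checked directly against ASH history; a sketch, assuming access to DBA_HIST_ACTIVE_SESS_HISTORY and using the incident window's local (UTC+11) timestamps:

```sql
-- Count ASH samples for the suspect scheduler-job SQL
-- during the 2:28-2:32 PM incident window on 3rd Jan.
SELECT sql_id, COUNT(*) AS ash_samples
FROM   dba_hist_active_sess_history
WHERE  sql_id = '5fazzyb6hs9g6'
AND    sample_time BETWEEN TIMESTAMP '2020-01-03 14:28:00'
                       AND TIMESTAMP '2020-01-03 14:32:00'
GROUP  BY sql_id;
```

Zero rows returned would support ruling the job out as a contributor to the hang.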
Analysis #4:
Issue: From 07:00 PM to 07:08 PM a DB connection issue occurred, and because of it all services went down.
Could you please verify on priority? This is the third occurrence in the last two days.
Core DBA:
Node 3 Alert log error:
FAL[server, ARC3]: FAL archive failed, see trace file.
ARCH: FAL archive failed. Archiver continuing
ORACLE Instance - Archival Error. Archiver continuing.
Node 4 Alert log error:
Instance terminated by LMS0, pid = 113256
USER (ospid: 206853): terminating the instance
Instance terminated by USER, pid = 206853
Raised an Oracle SR, as requested by the manager.
Analysis #5:
Please find below a detailed account of how node 4 was evicted (instance restarted) from the cluster.
@Core DBA: Between 7:02:52 and 7:08:04 PM not much information was generated in the DB alert log, and in the trace file I found waits on the 'KSV master wait' event but no further information. If you can provide the cluster logs, we can go in-depth at the cluster level and find the reason for the DB instance hang/restart. You can follow the link below for all cluster log locations; please also provide the ASM alert log file.
http://db.geeksinsight.com/2012/09/29/11gr2-rac-important-trace-directories-log-locations/
Reconfiguration started at 6:37:37 PM, at the same time on instances 3 and 4, and completed successfully.
Instance 3
Instance 4
Reconfiguration started again at 7:02:51 PM, at the same time on instances 3 and 4, but this time node 3 got an error communicating with node 4 at 7:02:52 PM.
Instance 3:
Node 3 decided to restart node 4 at 7:08:02 PM with reason 0x2. After researching this code, it relates to Exadata memory.
Note: a node restart can be triggered by memory, network, storage (I/O), or CPU (process) communication issues between nodes.
Instance 4:
Node 3 once again started reconfiguration at 7:08:02 PM, but this time node 4 was absent (because the node had restarted).
Instance 3:
Before node 4 restarted, all of its DB connections timed out at 7:08:02 PM.
Node 3 started seeing DB connection timeouts at 7:08:03 PM (after node 4).
Node 3 started recovery at 7:08:04 PM.
Instance 3:
During recovery on node 3, archive logs for node 4 were rejected at the standby, followed by an archiver error (node 4 went down at 7:08:08 PM).
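The archiver errors seen during recovery can be followed up on node 3 with a quick destination status check; a sketch, run as a privileged user once the cluster has settled:

```sql
-- Show archive destinations still reporting a problem
-- after the FAL/archiver failure.
SELECT dest_id, dest_name, status, error
FROM   v$archive_dest
WHERE  status NOT IN ('INACTIVE', 'VALID');
```

Destinations left in an error state would need to be cleared (or re-enabled) so the standby gap can be resolved.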
Instance 4:
Node 4 started at 7:08:10 PM.
Instance 4:
Reconfiguration started at 7:08:13 PM, and this time nodes 3 and 4 completed it successfully.
Instance 3:
Instance 4:
Trace file -- 'KSV master wait'
0: waited for 'KSV master wait'
=0x0, =0x0, =0x0
wait_id=9229685 seq_num=54786 snap_id=1
wait times: snap=0.000002 sec, exc=0.000002 sec, total=0.000002 sec
wait times: max=0.000000 sec
wait counts: calls=1 os=1
occurred after 0.000010 sec of elapsed time
1: waited for 'KSV master wait'
=0x0, =0x0, =0x0
wait_id=9229684 seq_num=54785 snap_id=1
wait times: snap=0.000003 sec, exc=0.000003 sec, total=0.000003 sec
wait times: max=0.000000 sec
wait counts: calls=1 os=1
occurred after 0.000021 sec of elapsed time
2: waited for 'rdbms ipc message'
timeout=0x12c, =0x0, =0x0
wait_id=9229683 seq_num=54784 snap_id=1
wait times: snap=3.076189 sec, exc=3.076189 sec, total=3.076189 sec
wait times: max=3.000000 sec
wait counts: calls=1 os=1
occurred after 0.000002 sec of elapsed time
3: waited for 'KSV master wait'
=0x0, =0x0, =0x0
wait_id=9229682 seq_num=54783 snap_id=1
wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec
wait times: max=0.000000 sec
wait counts: calls=1 os=1
occurred after 0.000026 sec of elapsed time
Final Solution from Oracle SR:
After changing the parameters recommended in the SR, the issue was resolved.