Exadata DB Connection Timeouts - Solution

Issue #:


14:28:55.237 ERROR [de4071bb-1008-4f44-b8db-e4225bb529af] Producty - Method: Exception found: {}

org.sql2o.Sql2oException: Could not acquire a connection from DataSource - spring - Connection is not available, request timed out after 5000ms.

              at org.sql2o.Connection.createConnection(Connection.java:301)

              at org.sql2o.Connection.<init>(Connection.java:63)

... 34 lines omitted ...

Caused by: java.sql.SQLTransientConnectionException: spring - Connection is not available, request timed out after 5000ms.
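The 5000 ms timeout above is the application's connection pool giving up while waiting for a free connection. From the database side, a first check (a sketch, assuming DBA access to GV$RESOURCE_LIMIT) is whether the sessions or processes limits were close to exhaustion on any instance:

-- Sketch: check whether the sessions/processes limits were near exhaustion
-- on any RAC instance (run as a DBA user).
SELECT inst_id,
       resource_name,
       current_utilization,
       max_utilization,
       limit_value
FROM   gv$resource_limit
WHERE  resource_name IN ('sessions', 'processes')
ORDER  BY inst_id, resource_name;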

Impact:

·         Servers went down.

·         Scheduled jobs were affected due to the DB connectivity issue.

   

Core DBA:

Further to the discussion, we could see high I/O for about 10 minutes, from 2:27 PM till 2:32 PM today. Post validation, we could see the below SQLs running for a long time with high I/O.

d81zg34jjjj8q

c1zvd53ss5z7q

aarf661nukynv
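As a sketch, these SQL IDs can be cross-checked against AWR history (assuming the window is still within AWR retention):

-- Sketch: elapsed time and physical reads per AWR snapshot for the reported SQL IDs.
SELECT s.begin_interval_time,
       t.instance_number,
       t.sql_id,
       t.executions_delta,
       ROUND(t.elapsed_time_delta / 1e6, 1) AS elapsed_sec,
       t.disk_reads_delta
FROM   dba_hist_sqlstat t
       JOIN dba_hist_snapshot s
         ON  s.snap_id         = t.snap_id
         AND s.dbid            = t.dbid
         AND s.instance_number = t.instance_number
WHERE  t.sql_id IN ('d81zg34jjjj8q', 'c1zvd53ss5z7q', 'aarf661nukynv')
ORDER  BY s.begin_interval_time, t.sql_id, t.instance_number;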


Analysis #1:

We see the below error in the DB alert log, coming from a different IP address. Can you please check whether any trace files were generated, or whether there were any errors in the listener (local or SCAN) logs, during that time?

 

3    03/JAN/20 02:31:00.459000000 PM +11:00    WARNING: inbound connection timed out (ORA-3136)

3    03/JAN/20 02:31:00.459000000 PM +11:00      address: (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=))
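Whether or not listener trace files exist, the ORA-3136 entries themselves can be pulled from the alert log with SQL (a sketch; X$DBGALERTEXT is visible to SYS only, 11g and later):

-- Sketch: pull ORA-3136 (inbound connection timed out) entries from the alert log.
SELECT originating_timestamp,
       message_text
FROM   x$dbgalertext
WHERE  message_text LIKE '%ORA-3136%'
AND    originating_timestamp > SYSTIMESTAMP - INTERVAL '1' DAY
ORDER  BY originating_timestamp;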

 

Core DBA: No trace files were generated.


Analysis #2:

After analysing the AWR report, we found the below information in the alert log files from nodes 3 and 4. It seems a DRM (Dynamic Resource Mastering) hang caused the reconfiguration issue (reconfiguration started at the same time on nodes 3 and 4).

Can you please provide the LMON trace files covering 14:27 to 14:32?

A full AWR report analysis will follow shortly.

 https://support.oracle.com/knowledge/Oracle%20Database%20Products/1528362_1.html
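As a quick check of how much dynamic remastering has been going on, the cumulative DRM statistics can be compared across instances (a sketch; the view below is available in 11gR2):

-- Sketch: cumulative dynamic remastering (DRM) activity per instance.
SELECT *
FROM   gv$dynamic_remaster_stats
ORDER  BY inst_id;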

Analysis #3:

Summary Analysis:

1.       At 2:28 PM, DB cluster reconfiguration started at the same time on both nodes 3 and 4.

2.       From 2:30 to 2:32 PM, DB connections timed out.

3.       Cluster background processes timed out.

4.       The locking mechanism also timed out.
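To confirm what sessions were actually waiting on during the 2:28-2:32 PM window, ASH history can be summarised per instance and event (a sketch; assumes AWR retention still covers 03-Jan-2020):

-- Sketch: top waits per instance during the 14:28-14:32 window on 03-Jan-2020.
SELECT instance_number,
       NVL(event, 'ON CPU') AS event,
       COUNT(*)             AS ash_samples
FROM   dba_hist_active_sess_history
WHERE  sample_time BETWEEN TO_TIMESTAMP('03-JAN-2020 14:28', 'DD-MON-YYYY HH24:MI')
                       AND TO_TIMESTAMP('03-JAN-2020 14:32', 'DD-MON-YYYY HH24:MI')
GROUP  BY instance_number, NVL(event, 'ON CPU')
ORDER  BY instance_number, ash_samples DESC;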

Workaround / Solution:

1.       The DB is currently running 11.2.0.4. After checking the LMON trace file and the patchset level against Doc ID 1528362.1, we can raise an Oracle SR to obtain the required patch.

2.       Reference: DRM hang causes frequent RAC Instances Reconfiguration (Doc ID 1528362.1).

Detailed Analysis:

Server load was almost similar on 2nd and 3rd Jan.
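The load figures in the two tables below come from the AWR OS statistics. As a sketch, the raw numbers can be pulled as follows (values are cumulative since instance startup, so consecutive snapshots need to be diffed for per-interval load):

-- Sketch: raw OS statistics per snapshot and instance (cumulative values).
SELECT s.begin_interval_time,
       o.instance_number,
       o.stat_name,
       o.value
FROM   dba_hist_osstat o
       JOIN dba_hist_snapshot s
         ON  s.snap_id         = o.snap_id
         AND s.dbid            = o.dbid
         AND s.instance_number = o.instance_number
WHERE  o.stat_name IN ('BUSY_TIME', 'IDLE_TIME', 'LOAD', 'NUM_CPUS')
ORDER  BY s.begin_interval_time, o.instance_number, o.stat_name;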

3rd Jan:


I#  | Num CPUs | CPU Cores | CPU Sckts | Load Begin | Load End | % Busy | % Usr | % Sys | % WIO | % Idl | Busy Time (s) | Idle Time (s) | Total Time (s) | Memory (M)
3   | 72       | 36        | 2         | 11.69      | 8.88     | 6.82   | 4.90  | 1.62  | 0.01  | 93.18 | 17,339.40     | 236,948.67    | 254,288.07     | 773,472.32
4   | 72       | 36        | 2         | 8.23       | 6.37     | 4.62   | 3.41  | 1.04  | 0.01  | 95.38 | 11,780.39     | 243,201.43    | 254,981.82     | 773,472.32
Sum |          |           |           |            |          |        |       |       |       |       | 29,119.79     | 480,150.10    | 509,269.89     |

2nd Jan:

I#  | Num CPUs | CPU Cores | CPU Sckts | Load Begin | Load End | % Busy | % Usr | % Sys | % WIO | % Idl | Busy Time (s) | Idle Time (s) | Total Time (s) | Memory (M)
3   | 72       | 36        | 2         | 8.18       | 9.96     | 13.60  | 9.81  | 3.23  | 0.01  | 86.40 | 34,923.49     | 221,884.27    | 256,807.76     | 773,472.32
4   | 72       | 36        | 2         | 6.43       | 6.64     | 7.18   | 5.44  | 1.50  | 0.01  | 92.82 | 18,507.58     | 239,110.78    | 257,618.36     | 773,472.32
Sum |          |           |           |            |          |        |       |       |       |       | 53,431.07     | 460,995.05    | 514,426.12     |



The 'Other' wait class was high on 3rd Jan, along with Network waits. The figures below are the percentage of DB time per wait class.

 3rd Jan:


User I/O | Sys I/O | Other | Applic | Commit | Network | Concurrency | Config | Cluster | DB CPU
21.10    | 0.02    | 18.76 | 1.40   | 0.08   | 13.15   | 0.09        | 2.46   | 17.35   | 20.30

 2nd Jan:


User I/O | Sys I/O | Other | Applic | Commit | Network | Concurrency | Config | Cluster | DB CPU
39.09    | 0.01    | 0.46  | 0.07   | 0.03   | 0.59    | 0.08        | 3.07   | 5.03    | 34.51
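The wait-class breakdown above can be reproduced from AWR by diffing two snapshots (a sketch; replace the bind variables with the snapshot IDs and instance of interest):

-- Sketch: wait time per wait class between two AWR snapshots for one instance.
-- Counters are cumulative, so this assumes no restart between the snapshots.
SELECT wait_class,
       ROUND((SUM(CASE WHEN snap_id = :end_snap   THEN time_waited_micro END)
            - SUM(CASE WHEN snap_id = :begin_snap THEN time_waited_micro END)) / 1e6) AS wait_sec
FROM   dba_hist_system_event
WHERE  snap_id IN (:begin_snap, :end_snap)
AND    instance_number = :inst
GROUP  BY wait_class
ORDER  BY wait_sec DESC;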

 

The 'Other' event below is a cluster resource wait (Oracle internal activity) that occurred only on 3rd Jan. More details on this event are given below.

3rd Jan:

 

I# | Wait Class | Event | Waits | %Timeouts | Total Time (s) | Avg Wait (ms) | %DB time | Summary Avg (ms) | Summary Min (ms) | Summary Max (ms) | Summary Std Dev | Cnt
* |             | DB CPU                                |            |       | 14,846.06 |         | 20.30 |         |         |         |         | 2
* | Other       | gcs resource directory to be unfrozen | 6,775      | 95.11 | 13,349.67 | 1970.43 | 18.25 | 1961.97 | 1949.20 | 1974.73 | 18.05   | 2
* | User I/O    | cell single block physical read       | 30,760,043 | 0.00  | 12,614.08 | 0.41    | 17.25 | 0.44    | 0.40    | 0.47    | 0.05    | 2
* | System I/O  | RMAN backup & recovery I/O            | 857,759    | 0.00  | 12,467.22 | 14.53   | 17.05 | 14.53   | 14.43   | 14.64   | 0.14    | 2
* | Cluster     | gc freelist                           | 60,131     | 82.01 | 10,038.55 | 166.94  | 31.80 | 166.94  | 166.94  | 166.94  |         | 1
* | Concurrency | row cache lock                        | 13,680     | 5.21  | 3,567.35  | 260.77  | 4.88  | 612.07  | 0.18    | 1223.96 | 865.35  | 2
* | Concurrency | library cache lock                    | 10,558     | 5.54  | 3,363.50  | 318.57  | 4.60  | 1875.60 | 107.96  | 3643.25 | 2499.83 | 2
* | User I/O    | cell smart table scan                 | 17,331,056 | 51.23 | 2,536.76  | 0.15    | 3.47  | 0.15    | 0.15    | 0.15    | 0.00    | 2
* | Concurrency | library cache pin                     | 14,121     | 0.00  | 2,121.93  | 150.27  | 2.90  | 158.28  | 148.95  | 167.60  | 13.19   | 2
* | Network     | SQL*Net more data from client         | 527,196    | 0.00  | 1,723.15  | 3.27    | 2.36  | 3.27    | 3.27    | 3.27    |         | 2


2nd Jan:


I# | Wait Class | Event | Waits | %Timeouts | Total Time (s) | Avg Wait (ms) | %DB time | Summary Avg (ms) | Summary Min (ms) | Summary Max (ms) | Summary Std Dev | Cnt
* |             | DB CPU                            |            |       | 30,793.51 |      | 34.51 |      |      |      |      | 2
* | User I/O    | cell single block physical read   | 67,029,701 | 0.00  | 24,153.60 | 0.36 | 27.07 | 0.37 | 0.35 | 0.39 | 0.03 | 2
* | User I/O    | cell multiblock physical read     | 3,357,939  | 0.00  | 5,863.56  | 1.75 | 6.57  | 1.73 | 1.66 | 1.80 | 0.10 | 2
* | User I/O    | cell smart table scan             | 24,176,351 | 50.98 | 4,641.71  | 0.19 | 5.20  | 0.19 | 0.19 | 0.19 | 0.00 | 2
* | Network     | SQL*Net more data from client     | 664,649    | 0.00  | 2,603.27  | 3.92 | 2.92  | 3.92 | 3.92 | 3.92 |      | 2
* | User I/O    | cell list of blocks physical read | 4,298,928  | 0.00  | 2,545.01  | 0.59 | 2.85  | 0.59 | 0.58 | 0.60 | 0.01 | 2
* | Cluster     | gc cr grant 2-way                 | 15,503,329 | 0.00  | 1,838.16  | 0.12 | 2.06  | 0.12 | 0.12 | 0.12 | 0.00 | 2
* | Cluster     | gc cr multi block request         | 3,849,691  | 0.00  | 1,775.51  | 0.46 | 1.99  | 0.46 | 0.43 | 0.49 | 0.04 | 2
* | User I/O    | direct path read temp             | 240,951    | 0.00  | 1,220.15  | 5.06 | 1.37  | 5.03 | 4.49 | 5.57 | 0.77 | 2
* | Scheduler   | resmgr:cpu quantum                | 179,407    | 0.00  | 894.56    | 4.99 | 1.00  | 5.00 | 4.95 | 5.04 | 0.06 | 2

The locking mechanism issue occurred only on node 4, and only on 3rd Jan.

 3rd Jan:

 

I# | Wait Class | Event | Waits | %Timeouts | Total Time (s) | Avg Wait (ms) | %DB time | Summary Avg (ms) | Summary Min (ms) | Summary Max (ms) | Summary Std Dev | Cnt
* |             | DB CPU                                |            |       | 14,846.06 |         | 20.30 |         |         |         |         | 2
* | Cluster     | gc freelist                           | 60,007     | 82.06 | 10,022.62 | 167.02  | 31.75 | 167.02  | 167.02  | 167.02  |         | 1
* | Other       | gcs resource directory to be unfrozen | 5,038      | 94.58 | 9,912.54  | 1967.55 | 13.55 | 1966.12 | 1963.72 | 1968.51 | 3.39    | 2
* | User I/O    | cell single block physical read       | 19,363,802 | 0.00  | 8,386.46  | 0.43    | 11.47 | 0.45    | 0.42    | 0.47    | 0.04    | 2
* | Concurrency | row cache lock                        | 8,527      | 8.36  | 3,566.50  | 418.26  | 4.88  | 647.87  | 0.20    | 1295.55 | 915.95  | 2
* | Concurrency | library cache lock                    | 3,888      | 7.97  | 2,868.64  | 737.82  | 3.92  | 1966.38 | 176.03  | 3756.72 | 2531.93 | 2
* | User I/O    | cell smart table scan                 | 17,331,056 | 51.23 | 2,536.76  | 0.15    | 3.47  | 0.15    | 0.15    | 0.15    | 0.00    | 2
* | Concurrency | library cache pin                     | 6,249      | 0.00  | 2,029.24  | 324.73  | 2.77  | 224.79  | 85.30   | 364.28  | 197.27  | 2
* | Network     | SQL*Net more data from client         | 527,196    | 0.00  | 1,723.15  | 3.27    | 4.15  | 3.27    | 3.27    | 3.27    |         | 1
* | User I/O    | cell multiblock physical read         | 802,957    | 0.00  | 1,309.16  | 1.63    | 1.79  | 1.63    | 1.42    | 1.84    | 0.29    | 2


2nd Jan:


I# | Wait Class | Event | Waits | %Timeouts | Total Time (s) | Avg Wait (ms) | %DB time | Summary Avg (ms) | Summary Min (ms) | Summary Max (ms) | Summary Std Dev | Cnt
* |             | DB CPU                            |            |       | 30,793.51 |      | 34.51 |      |      |      |      | 2
* | User I/O    | cell single block physical read   | 54,299,295 | 0.00  | 19,740.72 | 0.36 | 22.12 | 0.37 | 0.35 | 0.39 | 0.03 | 2
* | User I/O    | cell multiblock physical read     | 3,355,153  | 0.00  | 5,845.42  | 1.74 | 6.55  | 1.72 | 1.65 | 1.79 | 0.10 | 2
* | User I/O    | cell smart table scan             | 24,176,351 | 50.98 | 4,641.71  | 0.19 | 5.20  | 0.19 | 0.19 | 0.19 | 0.00 | 2
* | Network     | SQL*Net more data from client     | 664,649    | 0.00  | 2,603.27  | 3.92 | 4.66  | 3.92 | 3.92 | 3.92 |      | 1
* | User I/O    | cell list of blocks physical read | 4,298,627  | 0.00  | 2,543.38  | 0.59 | 2.85  | 0.59 | 0.58 | 0.60 | 0.01 | 2
* | Cluster     | gc cr grant 2-way                 | 15,469,713 | 0.00  | 1,833.78  | 0.12 | 2.05  | 0.12 | 0.12 | 0.12 | 0.00 | 2
* | Cluster     | gc cr multi block request         | 3,846,794  | 0.00  | 1,765.56  | 0.46 | 1.98  | 0.46 | 0.43 | 0.49 | 0.04 | 2
* | User I/O    | direct path read temp             | 240,951    | 0.00  | 1,220.15  | 5.06 | 1.37  | 5.03 | 4.49 | 5.57 | 0.77 | 2
* | Scheduler   | resmgr:cpu quantum                | 178,964    | 0.00  | 885.25    | 4.95 | 0.99  | 4.96 | 4.91 | 5.02 | 0.08 | 2
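To see what was behind the library cache / row cache waits on node 4, the ASH history for that instance can be grouped by blocking session (a sketch over a wider window around the incident):

-- Sketch: who was blocking the library cache / row cache waits on instance 4.
SELECT event,
       blocking_session,
       blocking_session_serial#,
       COUNT(*) AS ash_samples
FROM   dba_hist_active_sess_history
WHERE  instance_number = 4
AND    event IN ('library cache lock', 'library cache pin', 'row cache lock')
AND    sample_time BETWEEN TO_TIMESTAMP('03-JAN-2020 14:00', 'DD-MON-YYYY HH24:MI')
                       AND TO_TIMESTAMP('03-JAN-2020 15:00', 'DD-MON-YYYY HH24:MI')
GROUP  BY event, blocking_session, blocking_session_serial#
ORDER  BY ash_samples DESC;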


More details on the 'Other' events that caused problems only on node 4:

3rd Jan:

 

I# | Wait Class  | Event                                                | Waits   | %Timeouts | Total Time (s) | Avg Wait (ms) | %DB time
4  | System I/O  | RMAN backup & recovery I/O                           | 427,929 | 0.00      | 6,263.64       | 14.64         | 74.04
4  |             | background cpu time                                  |         |           | 1,516.74       |               | 17.93
4  | Other       | gcs resource directory to be unfrozen                | 132     | 90.91     | 242.64         | 1838.18       | 2.87
4  | Other       | ges lms sync during dynamic remastering and reconfig | 202     | 46.53     | 134.44         | 665.54        | 1.59
4  | Other       | ges LMON to get to FTDONE                            | 5,950   | 99.01     | 130.12         | 21.87         | 1.54
4  | Other       | enq: CO - master slave det                           | 1,143   | 100.00    | 129.66         | 113.44        | 1.53
4  | Application | enq: TM - contention                                 | 21      | 0.00      | 127.77         | 6084.19       | 1.51
4  | Concurrency | library cache pin                                    | 110     | 0.00      | 91.35          | 830.46        | 1.08
4  | System I/O  | log file sequential read                             | 15,195  | 0.00      | 84.48          | 5.56          | 1.00
4  | System I/O  | log file parallel write                              | 94,181  | 0.00      | 47.35          | 0.50          | 0.56

Only one SQL query showed high CPU, I/O and elapsed time, and only on 3rd Jan.

3rd Jan:

 

After checking the ASH report: sql_id 5fazzyb6hs9g6 is a DB scheduler background job, but it was not executed between 2:28 PM and 2:32 PM.

Sid, Serial# | % Activity | Event              | % Event | User     | Program | # Samples Active | XIDs
32,30861     | 50.00      | CPU + Wait for CPU | 50.00   | username | (J010)  | 1/360 [ 0%]      | 0
538,32305    | 50.00      | CPU + Wait for CPU | 50.00   | username | (J006)  | 1/360 [ 0%]      | 1
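The claim that sql_id 5fazzyb6hs9g6 was not running in the incident window can be double-checked from ASH history (a sketch; no rows returned supports the claim):

-- Sketch: ASH samples (if any) for sql_id 5fazzyb6hs9g6 during the incident window.
SELECT instance_number,
       COUNT(*)         AS ash_samples,
       MIN(sample_time) AS first_sample,
       MAX(sample_time) AS last_sample
FROM   dba_hist_active_sess_history
WHERE  sql_id = '5fazzyb6hs9g6'
AND    sample_time BETWEEN TO_TIMESTAMP('03-JAN-2020 14:28', 'DD-MON-YYYY HH24:MI')
                       AND TO_TIMESTAMP('03-JAN-2020 14:32', 'DD-MON-YYYY HH24:MI')
GROUP  BY instance_number;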


Analysis #4:

Issue: From 07:00 PM to 07:08 PM we saw a DB connection issue, due to which all services went down.

Could you please verify this on priority? It has happened three times in the last two days.

 Core DBA:

Node 3 Alert log error:

FAL[server, ARC3]: FAL archive failed, see trace file.

ARCH: FAL archive failed. Archiver continuing

ORACLE Instance - Archival Error. Archiver continuing.

 

Node 4 Alert log error:

Instance terminated by LMS0, pid = 113256

USER (ospid: 206853): terminating the instance

Instance terminated by USER, pid = 206853

 

An Oracle SR was raised, as requested by the manager.


Analysis #5:

Please find below, in detail, how node 4 was evicted (instance restarted) from the cluster.

@Core DBA: Between 7:02:52 and 7:08:04 PM, not much information was generated in the DB alert log. In the trace file I found waits on the 'KSV master wait' event, but no further information. If you can provide the cluster logs, we can go deeper at the cluster level and find the reason for the DB instance hang/restart. You can follow the link below for all cluster log locations; please provide the ASM alert log file as well.

 http://db.geeksinsight.com/2012/09/29/11gr2-rac-important-trace-directories-log-locations/

Reconfiguration started at 6:37:37 PM at the same time on instance 3 and instance 4, and completed successfully.

Instance 3 / Instance 4: (alert log excerpts not reproduced here)

Reconfiguration started again at 7:02:51 PM, at the same time on instance 3 and instance 4, but this time node 3 got an error communicating with node 4 at 7:02:52 PM.

 Instance 3:

Node 3 decided to restart node 4 at 7:08:02 PM with reason 0x2. After searching for this error code, it appears to relate to Exadata memory.

Note: a node restart can be driven by memory, network, storage (I/O) or CPU (process) communication issues between the nodes.

 Instance 4:

Node 3 once again started reconfiguration at 7:08:02 PM, but this time node 4 was not present (because the node had restarted).

Instance 3:

Before node 4 restarted, all of its DB connections timed out at 7:08:02 PM.


Node 3 started seeing DB connection timeouts at 7:08:03 PM (after node 4).



Node 3 started recovery at 7:08:04 PM

Instance 3:

During recovery on node 3, archive logs for node 4 were rejected at the standby, followed by an archiver error (node 4 was down at 7:08:08 PM).

Instance 4:

Node 4 started at 7:08:10 PM.

 Instance 4:

Reconfiguration started at 7:08:13 PM, and this time it completed successfully on both node 3 and node 4.

Instance 3 / Instance 4: (alert log excerpts not reproduced here)

Trace file -- 'KSV master wait'

 

     0: waited for 'KSV master wait'

        =0x0, =0x0, =0x0

        wait_id=9229685 seq_num=54786 snap_id=1

        wait times: snap=0.000002 sec, exc=0.000002 sec, total=0.000002 sec

        wait times: max=0.000000 sec

        wait counts: calls=1 os=1

        occurred after 0.000010 sec of elapsed time

     1: waited for 'KSV master wait'

        =0x0, =0x0, =0x0

        wait_id=9229684 seq_num=54785 snap_id=1

        wait times: snap=0.000003 sec, exc=0.000003 sec, total=0.000003 sec

        wait times: max=0.000000 sec

        wait counts: calls=1 os=1

        occurred after 0.000021 sec of elapsed time

     2: waited for 'rdbms ipc message'

        timeout=0x12c, =0x0, =0x0

        wait_id=9229683 seq_num=54784 snap_id=1

        wait times: snap=3.076189 sec, exc=3.076189 sec, total=3.076189 sec

        wait times: max=3.000000 sec

        wait counts: calls=1 os=1

        occurred after 0.000002 sec of elapsed time

     3: waited for 'KSV master wait'

        =0x0, =0x0, =0x0

        wait_id=9229682 seq_num=54783 snap_id=1

        wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec

        wait times: max=0.000000 sec

        wait counts: calls=1 os=1

        occurred after 0.000026 sec of elapsed time

Final Solution from Oracle SR:

After changing the parameters recommended in the Oracle SR, the issue was resolved.
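The specific parameter values are not recorded above. For illustration only (this is not necessarily the exact change applied here), the workaround commonly associated with Doc ID 1528362.1 is to disable DRM via hidden parameters, which should only ever be done on Oracle Support's advice and requires an instance restart to take effect:

-- Illustration only: disabling DRM (hidden parameters; change them only on
-- Oracle Support's advice, then restart the instances for the change to apply).
ALTER SYSTEM SET "_gc_policy_time"=0 SCOPE=SPFILE SID='*';
ALTER SYSTEM SET "_gc_undo_affinity"=FALSE SCOPE=SPFILE SID='*';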


  

