Exadata - Oracle DB Performing Slowly due to High I/O

Issue #:

People are complaining that the DB is performing slowly, and we are observing high I/O. Earlier today, we could see a few sets of functions related to the application module load causing the slowness. Can you please look into this further and provide your guidance?

 Analysis #1:

From the AWR report below, we can see the cells getting exhausted by full table scans. Can you please provide the cell statistics (load) between 1 PM and 2 PM (today and yesterday)?
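The cell-side load itself has to come from the storage servers (CellCLI / ExaWatcher), but a rough DB-side view of how much time went into smart scans in that window can be pulled from the ASH history; something along these lines (a sketch only; the time window needs adjusting, and each sampled row represents roughly 10 seconds of session time):

-- Approximate time spent on "cell smart table scan" between 1 PM and 2 PM today
-- (shift the window by one day to compare with yesterday)
SELECT instance_number,
       COUNT(*)      AS ash_samples,
       COUNT(*) * 10 AS approx_seconds_in_wait
  FROM dba_hist_active_sess_history
 WHERE event = 'cell smart table scan'
   AND sample_time >= TRUNC(SYSDATE) + 13/24
   AND sample_time <  TRUNC(SYSDATE) + 14/24
 GROUP BY instance_number
 ORDER BY instance_number;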

 @App lead:

We cannot blame job load every time just because more data comes in; we need to go into detail on where exactly the problem was caused.

Over the last week I have seen a maximum load of 12, but today it went up to 15, and both instances are above 10.


This time User I/O is high and the timeout percentage is 47% (not a good sign).



Wait events from the AWR report (Total(s)/Avg(ms) are the wait time; Avg/Min/Max/Std Dev/Cnt are the summary average wait time in ms):

I#  Wait Class  Event                              Waits       %Timeouts  Total(s)    Avg(ms)  %DB time  Avg    Min    Max    Std Dev  Cnt
--  ----------  ---------------------------------  ----------  ---------  ----------  -------  --------  -----  -----  -----  -------  ---
*   User I/O    cell smart table scan              85,678,559  47.82      251,734.83  2.94     65.15     2.94   2.93   2.95   0.01     2
*   User I/O    cell single block physical read    75,291,249  0.00       45,166.16   0.60     11.69     0.60   0.60   0.60   0.00     2
*               DB CPU                                                    33,067.17            8.56                                    2
*   User I/O    cell multiblock physical read      2,605,214   0.00       8,380.55    3.22     2.17      3.12   2.47   3.77   0.92     2
*   User I/O    cell list of blocks physical read  5,474,088   0.00       5,052.89    0.92     1.31      0.94   0.89   0.99   0.07     2
*   User I/O    direct path read temp              262,112     0.00       4,348.86    16.59    1.13      16.65  15.94  17.36  1.00     2
*   User I/O    direct path read                   1,017,655   0.00       3,059.73    3.01     0.79      2.49   1.92   3.05   0.80     2
*   Network     SQL*Net more data from client      578,089     0.00       2,593.47    4.49     1.26      4.49   4.49   4.49            1
*   Scheduler   resmgr:cpu quantum                 203,446     0.00       1,426.67    7.01     0.37      7.20   6.11   8.29   1.54     2
*   Cluster     gc cr grant 2-way                  10,534,866  0.00       1,402.35    0.13     0.36      0.17   0.12   0.22   0.08     2


For the highlighted SQL_IDs below, CPU and I/O wait are very high. We need to talk to the development team to tune this PL/SQL.


SQL Id         Execs  SQL Text
-------------  -----  --------------------------------
3x4h2q7rhhq30  1      DECLARE job BINARY_INTEGER := ...
fdp0brc3xgcpy  1      DECLARE job BINARY_INTEGER := ...

Total          CPU (s)   Elapsed (s)  IOWait (s)  Gets        Reads       Rows  Cluster (s)
-------------  --------  -----------  ----------  ----------  ----------  ----  -----------
3x4h2q7rhhq30  1,147.15  3,381.13     2,315.04    28,337,764  12,787,826  1     156.77
fdp0brc3xgcpy  1,146.47  2,641.43     1,483.05    53,540,177  42,737,097  1     87.61

Per Execution  CPU (s)   Elapsed (s)  IOWait (s)  Gets           Reads          Rows  Cluster (s)
-------------  --------  -----------  ----------  -------------  -------------  ----  -----------
3x4h2q7rhhq30  1,147.15  3,381.13     2,315.04    28,337,764.00  12,787,826.00  1.00  156.77
fdp0brc3xgcpy  1,146.47  2,641.43     1,483.05    53,540,177.00  42,737,097.00  1.00  87.61

Percentage of Total  DB CPU  DB time  IO Wait  Gets  Reads  Cluster  Execs
-------------------  ------  -------  -------  ----  -----  -------  -----
3x4h2q7rhhq30        3.47    0.88     0.72     0.20  0.09   4.82     0.00
fdp0brc3xgcpy        3.47    0.68     0.46     0.37  0.31   2.70     0.00

Below we can see the highlighted table being read with a full table scan.
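The historical plan that shows the full scan can also be pulled straight from AWR, for example for one of the SQL_IDs above (a minimal sketch; on Exadata the offloaded scan shows up as TABLE ACCESS STORAGE FULL):

-- Pull the AWR plan history for the highlighted statement
SELECT *
  FROM TABLE(DBMS_XPLAN.DISPLAY_AWR('3x4h2q7rhhq30', NULL, NULL, 'ALL'));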


@core DBA:

It seems there is no metric created in OEM for the Exadata machine. If we can enable the IOPS metric, it will be easy for us to analyse Exadata cell I/O whenever we see "cell smart table scan" waits.

Please find the attached Oracle document for the metric details along with the creation steps.

After creating the metric, the expected output will be:

Analysis #2:

 

Requested by app team:

 

Please find the current SGA and CPU values below. The request is to increase the SGA to 60 GB and the CPU count to 24.

NAME                                               TYPE        VALUE
-------------------------------------------------- ----------- -----
sga_max_size                                       big integer 34G
sga_target                                         big integer 28G
cpu_count                                          integer     12
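For reference, the same values can be queried directly and, once the client approves, changed along these lines (a sketch using the requested targets of 60 GB SGA and 24 CPUs; sga_max_size is not dynamic, so the instances need a restart after the SPFILE change):

-- Current values
SELECT name, value
  FROM v$parameter
 WHERE name IN ('sga_max_size', 'sga_target', 'cpu_count');

-- Proposed changes (do not run until approved)
-- ALTER SYSTEM SET sga_max_size = 60G SCOPE = SPFILE SID = '*';
-- ALTER SYSTEM SET sga_target   = 60G SCOPE = SPFILE SID = '*';
-- ALTER SYSTEM SET cpu_count    = 24  SCOPE = BOTH   SID = '*';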

                                                                                                  

You can also change large pages based on the SGA size. This will help us with the LOB segments.

NAME                                               SID   VALUE
-------------------------------------------------- ----- -----
use_large_pages                                    *     TRUE


@core DBA

The app team accidentally confirmed that the DB is facing an issue because of a memory/CPU crunch. The client then asked how much extra memory and CPU are needed so that he can approve it, so I suggested the values above.

The DB is using ASMM only; no value is set for memory_target.

NAME                                               TYPE        VALUE
-------------------------------------------------- ----------- -----
memory_max_target                                  big integer 0
memory_target                                      big integer 0 

The DB parameter use_large_pages is set to TRUE, so I suggest checking the HugePages configuration at the server level and changing it accordingly, so that we benefit from large pages whenever memory is requested in chunks bigger than the default 4 KB page size. Can you please provide the kernel parameter values from the server and check the current HugePages value set at the server level?

 https://docs.oracle.com/database/121/UNXAR/appi_vlm.htm#UNXAR403

https://www.carajandb.com/en/blog/2016/7-easy-steps-to-configure-hugepages-for-your-oracle-database-server/
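As a rough starting point for sizing, the number of 2 MB HugePages needed to back this instance's SGA can be estimated from inside the database; a sketch (assumes a 2 MB Hugepagesize, and the result should be validated with Oracle's hugepages_settings.sh script and /proc/meminfo on the server, adding up all instances on the host):

-- Estimated 2 MB HugePages required for the full SGA of this instance
SELECT name,
       ROUND(bytes / 1024 / 1024 / 1024, 1) AS size_gb,
       CEIL(bytes / (2 * 1024 * 1024))      AS est_hugepages_2mb
  FROM v$sgainfo
 WHERE name = 'Maximum SGA Size';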

Analysis #3:

Please find below my analysis of the right-hand graph.

1. There is a spike in IORM and IO after 12 PM on 7th Jan, but not on 8th Jan.

2. IO is fairly constant for the whole period, and after 12 PM on 8th Jan it also stays constant.

3. The IORM wait time is 12 ms on 7th Jan, whereas on 8th Jan it is below 2 ms, and small IO also increased during that time.

Please have a look at the Oracle doc below; it is not easy to identify an I/O bottleneck using the AWR report alone. We also need an Exadata health check.

http://www.aioug.org/ODevCYatra/2018/Karan_Oracle_DISK_IO.pdf

If they configure Exadata monitoring for us (read-only access), then we can easily analyse the cells whenever we have an issue in the DB.

https://docs.oracle.com/cd/E24628_01/doc.121/e27442/ch7_metrics.htm#EMXIG356

 Core DBA: Has anyone done a TFA analysis? I would like to see the OSWatcher logs between 12 and 3PM that day. If we have raised an SR for the elevated cell smart scan event, we should get a TFA diagcollect and send that to Oracle for analysis.

Venu: We already requested this from the SR team, but they refused to look at the TFA report and told us to first test the given hidden parameters.

Core DBA:

It seems the cell server's bad health started on Nov 19th, and it is still RED. For deeper troubleshooting/analysis we may need to initiate a discussion with the L3 DBA team.

Final Solution:

We collected the SQLs that were performing full table scans and driving the high I/O. We ran SQL Tuning Advisor, collected historical SQL plan statistics, and manually checked for missing column indexes. Finally, we created new indexes for the impacted SQLs to reduce I/O.
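For reference, the advisor runs were along these lines for each captured statement (a sketch against one of the SQL_IDs above; the task name is illustrative):

-- Create and run a SQL Tuning Advisor task for one of the high-IO statements
DECLARE
  l_task VARCHAR2(64);
BEGIN
  l_task := DBMS_SQLTUNE.CREATE_TUNING_TASK(
              sql_id     => '3x4h2q7rhhq30',
              scope      => DBMS_SQLTUNE.SCOPE_COMPREHENSIVE,
              time_limit => 3600,
              task_name  => 'tune_3x4h2q7rhhq30');   -- illustrative task name
  DBMS_SQLTUNE.EXECUTE_TUNING_TASK(task_name => l_task);
END;
/

-- Review the findings (index, SQL profile and restructuring recommendations)
SELECT DBMS_SQLTUNE.REPORT_TUNING_TASK('tune_3x4h2q7rhhq30') AS report FROM dual;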


