Exadata - Oracle DB Performing Slow due to High IO
Issue #:
Users are complaining that the DB is performing slowly, and we are observing high I/O. Earlier today we could see a set of functions related to the application module load causing the slowness. Can you please look into this further and provide your guidance?
Analysis #1:
From the AWR report below, we can see the cells getting exhausted by full table scans. Can you please provide cell statistics (load) between 1 PM and 2 PM (today and yesterday)?
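The requested cell load statistics can be pulled with CellCLI; a minimal sketch, assuming `dcli` access from the compute node, a `cell_group` file listing the storage cells, and the `celladmin` user (the timestamps are placeholders for the actual 1-2 PM window):

```shell
# Sketch: list cell-level metric history for a one-hour window on every
# storage cell. cell_group, celladmin, and the timestamps are placeholders.
dcli -g cell_group -l celladmin "cellcli -e \"LIST METRICHISTORY WHERE objectType = 'CELL' AND collectionTime > '2019-01-08T13:00:00+05:30' AND collectionTime < '2019-01-08T14:00:00+05:30'\""
```

The same window can be queried for yesterday by shifting the `collectionTime` bounds back one day.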
@App lead:
We cannot always call it a job-load issue just because more data comes in; we need to dig into exactly what caused the problem.
Over the last week I have seen a maximum load of 12, but today it went to 15, and both instances were above 10.
This time USER I/O is high and the timeout rate is 47%, which is not a good sign.
Top timed events (I# = instance; the last five columns are the summary of average wait time in ms across instances):

| I# | Wait Class | Event | Waits | %Timeouts | Total(s) | Avg(ms) | %DB time | Avg | Min | Max | Std Dev | Cnt |
|----|------------|-------|-------|-----------|----------|---------|----------|-----|-----|-----|---------|-----|
| * | User I/O | cell smart table scan | 85,678,559 | 47.82 | 251,734.83 | 2.94 | 65.15 | 2.94 | 2.93 | 2.95 | 0.01 | 2 |
| * | User I/O | cell single block physical read | 75,291,249 | 0.00 | 45,166.16 | 0.60 | 11.69 | 0.60 | 0.60 | 0.60 | 0.00 | 2 |
| * |  | DB CPU |  |  | 33,067.17 |  | 8.56 |  |  |  |  | 2 |
| * | User I/O | cell multiblock physical read | 2,605,214 | 0.00 | 8,380.55 | 3.22 | 2.17 | 3.12 | 2.47 | 3.77 | 0.92 | 2 |
| * | User I/O | cell list of blocks physical read | 5,474,088 | 0.00 | 5,052.89 | 0.92 | 1.31 | 0.94 | 0.89 | 0.99 | 0.07 | 2 |
| * | User I/O | direct path read temp | 262,112 | 0.00 | 4,348.86 | 16.59 | 1.13 | 16.65 | 15.94 | 17.36 | 1.00 | 2 |
| * | User I/O | direct path read | 1,017,655 | 0.00 | 3,059.73 | 3.01 | 0.79 | 2.49 | 1.92 | 3.05 | 0.80 | 2 |
| * | Network | SQL*Net more data from client | 578,089 | 0.00 | 2,593.47 | 4.49 | 1.26 | 4.49 | 4.49 | 4.49 |  | 1 |
| * | Scheduler | resmgr:cpu quantum | 203,446 | 0.00 | 1,426.67 | 7.01 | 0.37 | 7.20 | 6.11 | 8.29 | 1.54 | 2 |
| * | Cluster | gc cr grant 2-way | 10,534,866 | 0.00 | 1,402.35 | 0.13 | 0.36 | 0.17 | 0.12 | 0.22 | 0.08 | 2 |
For the highlighted SQL_IDs below, CPU and I/O wait are very high. We need to work with the developer team to tune this PL/SQL.
Top SQL (Total, Per Execution, and Percentage of Total; the SQL_IDs were highlighted in the original report):

| SQL Id | CPU (s) | Elapsed (s) | IOWait (s) | Gets | Reads | Rows | Cluster (s) | Execs | CPU/Exec (s) | Elapsed/Exec (s) | IOWait/Exec (s) | Gets/Exec | Reads/Exec | Rows/Exec | Cluster/Exec (s) | %DB CPU | %DB time | %IO Wait | %Gets | %Reads | %Cluster | %Execs | SQL Text |
|--------|---------|-------------|------------|------|-------|------|-------------|-------|--------------|------------------|-----------------|-----------|------------|-----------|------------------|---------|----------|----------|-------|--------|----------|--------|----------|
|  | 1,147.15 | 3,381.13 | 2,315.04 | 28,337,764 | 12,787,826 | 1 | 156.77 | 1 | 1,147.15 | 3,381.13 | 2,315.04 | 28,337,764.00 | 12,787,826.00 | 1.00 | 156.77 | 3.47 | 0.88 | 0.72 | 0.20 | 0.09 | 4.82 | 0.00 | DECLARE job BINARY_INTEGER := ... |
|  | 1,146.47 | 2,641.43 | 1,483.05 | 53,540,177 | 42,737,097 | 1 | 87.61 | 1 | 1,146.47 | 2,641.43 | 1,483.05 | 53,540,177.00 | 42,737,097.00 | 1.00 | 87.61 | 3.47 | 0.68 | 0.46 | 0.37 | 0.31 | 2.70 | 0.00 | DECLARE job BINARY_INTEGER := ... |
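The per-SQL figures above can be regenerated straight from the AWR history tables; a minimal sketch, assuming you know the snapshot IDs that bracket the slow window (the snap IDs below are placeholders):

```shell
# Sketch: top 10 SQL by I/O wait from AWR history.
# Replace the snap_id range with the snapshots covering the slow window.
sqlplus -s / as sysdba <<'EOF'
SELECT * FROM (
  SELECT sql_id,
         SUM(cpu_time_delta)     / 1e6 AS cpu_s,
         SUM(elapsed_time_delta) / 1e6 AS elapsed_s,
         SUM(iowait_delta)       / 1e6 AS iowait_s,
         SUM(buffer_gets_delta)        AS gets,
         SUM(disk_reads_delta)         AS reads,
         SUM(executions_delta)         AS execs
  FROM   dba_hist_sqlstat
  WHERE  snap_id BETWEEN 1000 AND 1002   -- placeholder snap IDs
  GROUP  BY sql_id
  ORDER  BY iowait_s DESC
)
WHERE ROWNUM <= 10;
EOF
```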
Below we can see the highlighted table undergoing full table scans.
@core DBA:
It seems no metric has been created in OEM for the Exadata machine. If we enable the IOPS metric, it will be easy for us to analyse Exadata cell I/O whenever we see "cell smart table scan".
Please find attached the Oracle document with the metric details and creation steps.
After creating the metric, the expected output will be:
Analysis #2:
Requested by app team:
Please find the current SGA and CPU settings below. Can we increase the SGA to 60 GB and the CPU count to 24?
NAME TYPE VALUE
-------------------------------------------------- ----------- -
sga_max_size big integer 34G
sga_target big integer 28G
cpu_count integer 12
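If the request is approved, the change itself would look roughly like the sketch below (the 60G/24 values are the ones proposed above; note that `sga_max_size` is not dynamic, so the instances need a restart for it to take effect):

```shell
# Sketch: apply the proposed SGA/CPU values on a RAC database.
# sga_max_size only takes effect after the instances are restarted.
sqlplus -s / as sysdba <<'EOF'
ALTER SYSTEM SET sga_max_size = 60G SCOPE=SPFILE SID='*';
ALTER SYSTEM SET sga_target   = 60G SCOPE=SPFILE SID='*';
ALTER SYSTEM SET cpu_count    = 24  SCOPE=BOTH   SID='*';
EOF
```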
You can also adjust large pages based on the SGA size; this will help with the LOB segments.
| NAME | INST | VALUE |
|------|------|-------|
| use_large_pages | * | TRUE |
@core DBA
The app team has confirmed that the DB is facing an issue because of a memory/CPU crunch. The client asked how much extra memory and CPU are needed so that he can approve the request, so I suggested the values above.
The DB is using ASMM only; there is no value set for memory_target:
NAME TYPE VALUE
-------------------------------------------------- ----------- -----
memory_max_target big integer 0
memory_target big integer 0
The DB parameter use_large_pages is set to TRUE, so I suggest checking HugePages at the server level and adjusting it accordingly, so that we benefit from pages larger than the default 4 KB whenever large memory allocations are requested. Can you please provide the kernel parameter values from the server and check the current HugePages setting?
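The OS-level check is straightforward; with a 60 GB SGA and the usual 2 MB page size, roughly 30,720 HugePages (plus a small margin) would need to be reserved:

```shell
# Show the kernel's HugePages counters; Hugepagesize is typically 2048 kB.
grep -i huge /proc/meminfo
# A persistent value is set via vm.nr_hugepages in /etc/sysctl.conf, e.g.:
#   vm.nr_hugepages = 30770
```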
https://docs.oracle.com/database/121/UNXAR/appi_vlm.htm#UNXAR403
Analysis #3:
Please find below my analysis of the right-side graph.
1. There is a spike in IORM I/O after 12 PM on 7th Jan, but not on 8th Jan.
2. I/O is constant for the whole period, and it remains constant after 12 PM on the 8th as well.
3. The IORM wait time is 12 ms on 7th Jan, whereas on 8th Jan it is below 2 ms; small I/O also increased during that time.
Please have a look at the Oracle doc below; it is not easy to identify an I/O bottleneck using the AWR report alone. We also need an Exadata health check.
http://www.aioug.org/ODevCYatra/2018/Karan_Oracle_DISK_IO.pdf
If they configure Exadata in OEM (read-only access for us), then we can easily analyse the cells whenever we have an issue in the DB.
https://docs.oracle.com/cd/E24628_01/doc.121/e27442/ch7_metrics.htm#EMXIG356
Core DBA: Has anyone done a TFA analysis? I would like to see the OSWatcher logs between 12 and 3PM that day. If we have raised an SR for the elevated cell smart scan event, we should get a TFA diagcollect and send that to Oracle for analysis.
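The collection for that window would look something like the sketch below (the date is a placeholder for the actual incident day):

```shell
# Sketch: collect TFA diagnostics (trace files, OSWatcher data, etc.)
# for the 12-3 PM problem window; the date is a placeholder.
tfactl diagcollect -from "Jan/08/2019 12:00:00" -to "Jan/08/2019 15:00:00"
```

The resulting zip can be attached to the SR for Oracle to analyse.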
Venu: We already raised this with the SR team, but they refused to look at the TFA report and told us to first test the hidden parameters they provided.
Core DBA:
It seems the cell server's health has been bad since Nov 19th, and it is still RED. For deeper troubleshooting/analysis we may need to initiate a discussion with the L3 DBA team.
Final Solution:
We collected the SQLs that were performing full table scans with high I/O, ran the SQL Tuning Advisor, collected the SQL plan history, and manually checked for missing column indexes. Finally, we created new indexes for the impacted SQLs to reduce I/O.
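The advisor step can be sketched as follows for one of the high-I/O statements (`abcd1234efgh5` is a placeholder sql_id, not one from this incident):

```shell
# Sketch: run the SQL Tuning Advisor for one high-I/O statement and
# print its recommendations; 'abcd1234efgh5' is a placeholder sql_id.
sqlplus -s / as sysdba <<'EOF'
VARIABLE task VARCHAR2(64)
EXEC :task := DBMS_SQLTUNE.create_tuning_task(sql_id => 'abcd1234efgh5');
EXEC DBMS_SQLTUNE.execute_tuning_task(task_name => :task);
SET LONG 1000000 LONGCHUNKSIZE 1000000 LINESIZE 200
SELECT DBMS_SQLTUNE.report_tuning_task(:task) FROM dual;
EOF
```

Any index recommendations from the report were then validated manually before the new indexes were created.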