Question

IFS10 - CPU 100% and very slow system

  • 11 November 2019
  • 34 replies
  • 3871 views

Userlevel 3
Badge +8

After we upgraded to IFS10 we are struggling with slowness in our Oracle DB. We haven’t been able to pinpoint the cause of our problems. But very often (5-10 times each week), we see that the CPU goes to 100% and doesn’t let go. Every user in our company are then struggling with slowness when this happens. As an example, saving a time registration can take 3-4 minutes.

Today, a restart of the DB server was necessary to get it back to  normal.  

We have used the query below to find CPU consuming operations, and have also checked background jobs and DMBS jobs. Nothing special has been running when the problems occur, or at least nothing that helps us to conclude on a specific problem.

select s.username,
s.inst_id,
s.sid,
s.serial#,
s.status,
(select value from gv$sesstat natural join v$statname where name = 'physical reads' and sid=s.sid)disk_reads,
(select value from gv$sesstat natural join v$statname where name = 'session logical reads' and sid=s.sid) logical_reads,
(select value/100 cpu_sec from gv$sesstat natural join v$statname where name = 'CPU used by this session' and sid=s.sid)cpu_sec,
s.logon_time,
s.program,
s.osuser,
s.machine,
p.pid,
p.spid,
s.client_identifier,
s.client_info,
s.module,
s.action,
s.event,
s.wait_class,
case
when s.status = 'ACTIVE' then
s.last_call_et
else
null
end wait_sec,
s.blocking_session_status,
s.blocking_instance,
s.blocking_session,
s.sql_id,
s.sql_hash_value,
s.sql_address,
s.prev_sql_id,
s.prev_hash_value,
s.prev_sql_addr,
s.sql_child_number,
s.prev_child_number,
s.server
from gv$session s join gv$process p on (s.inst_id = p.inst_id and s.paddr = p.addr)
where s.type = 'USER' and s.status='ACTIVE'
order by disk_reads desc

We have integrations and scheduled tasks running, all which have been running fine in IFS 8.1 for years.

After googling, we some people says that Oracle 12.2 is not as effective as 11, but we can’t verify this.

Does anyone have some tips to solve these things, or at least find the root cause when the problems occur? Appreciate all help we can get. 

Some facts:

  • Around 5-700 concurrent users
  • Using Azure AD for authentication
  • Using around major 10 Modules in IFS (DocMan, Finance, Engineering, SCM, etc)
  • DB size around 600GB
  • Running reports with Crystal Reports (set up via web, running queries towards IFS)

 


34 replies

Userlevel 6
Badge +9

Presuming most of the CPU is being consumed by the Oracle service,

Have you checked the memory usage of the server whilst CPU hits 100%? also does CPU reach 100% only when certain batch jobs execute? (you can check this by querying the executing background jobs when CPU hits 100%). This should shed some light on the issue.


If there are no background jobs running and CPU hits 100% and there is no significant errors found in the Oracle alert log, I believe an Oracle Stats pack report or AWR report has to be analyzed in order for any issues.

Userlevel 3
Badge +8

@NadeeshHerath , correct. All CPU is consumed by Oracle Service. The memory usage (tot 96GB) is constantly at around 90%, I think this is normal behavior. 

We checked the background jobs when the CPU hits 100% Nothing time consuming or big things was executing, so no findings there. 

 

We discovered that the SWAP drive was not configured according to increased RAM, and setup as SSD. This has been fixed now, waiting to check if this has an effect. 

 

We have also played with the parameter OPTIMIZER_FEATURES_ENABLE, setting this to 11.2.0.4. And turned off parallel execution. Not 100% if these settings has any effect yet. Does anyone have experience with this?

 

We haven’t checked the Oracle Alert log, but I would assume we would’ve heard from our DB partner company if anything is wrong there.  Can check with them. I also thing we should look at the Oracle Stats pack or AWR reports. 

Userlevel 5
Badge +7

Pinpointing these issues may take several queries and a bit of monitoring to pin down. Can you please provide the following information.

  1. The server has 96 GB of memory installed. How much memory is being used by the “oracle.exe” process(es)? Please confirm that this server only hosts one database.
  2. What are the values of the following Oracle parameters?
    • compatible
    • cpu_count
    • db_cache_size
    • job_queue_processes
    • optimizer_deatures_enable
    • optimizer_index_cost_adj
    • pga_aggregate_target
    • processes
    • sessions
    • sga_max_size
    • sga_target
    • shared_pool_size
    • sort_area_size
  3. How many cores are assigned to this server?
  4. Is the server physical or virtual?
  5. Are the table statistics current (withing the past week)?
  6. Have the indexes been validated lately?
  7. What is the operating system version?
  8. Do you have a utility installed to view the threads under the “oracle.exe” process to see what type of process (worker, execution, job, etc) is consuming the CPU? All sessions run as a separate thread under the “oracle.exe” process.
  9. If the issue went away when the database was restarted it was either a background job (IFS or database) or an interactive process launched by a user. Please comment.
  10. What is your schedule for updating the IFS search indexes on objects? Do you have all of them enabled?
  11. When the excessive CPU was found was the database creating large numbers of archive logs? This will let us know if the process was performing any inserts, updates, or deletions as part of the processing.
  12. Do you have the in-memory option turned on?
Userlevel 3
Badge +8

@alanbclark , thanks for looking at this

  1. 81.6 GB. Yes, it only serves one database
    • compatible: 12.2.0
    • cpu_count: 16
    • db_cache_size: 40G
    • job_queue_processes: 100
    • optimizer_features_enable: 11.2.0.4
    • optimizer_index_cost_adj: 10
    • pga_aggregate_target: 2000M
    • processes: 800
    • sessions: 1232
    • sga_max_size: 80G
    • sga_target: 80G
    • shared_pool_size: 4G
    • sort_area_size=65536
  2. 4 sockets / 16 vCPU
  3. Virtual
  4. Yes
  5. Yes
  6. Windows Server 2016 (1607)
  7. Yes
  8. The issue went away after the restart, and we didn’t notice any particular time consuming jobs or processes running before the restart. 
  9. Rebuild indexes are scheduled to run every 2. month. It doesn’t seems to have run after the upgrade.
  10. No
  11. Not in use
Userlevel 7
Badge +15

Intelligent tuning is only possible with good measurements. If you’re not comfortable with STATSPACK (included with every version of Oracle) or AWR (included if you’ve separately licensed the Oracle Tuning Pack on top of Enterprise Edition), I highly recommend Database Performance Analyzer.

https://www.solarwinds.com/database-performance-analyzer

Userlevel 5
Badge +7

As an alternative to a $2,000 utility, here is a query that will show sessions ranked by CPU usage and shows helpful information about the session so you can track down the process/user. The query displays helpful information like username, machine, program, type, event, and the sql text/command being referenced.

 

select sess_cpu.con_id, 
sess_cpu.sid, 
NVL(sess_cpu.username, 'Oracle Process') username, 
sess_cpu.status, 
sess_cpu.logon_time, 
sess_cpu.machine, sess_cpu.program, sess_cpu.type, sess_cpu.event,
round((sysdate - sess_cpu.logon_time)*1440*60) logon_SECS, 
sess_cpu.value/100 SESS_CPU_SECS, (sess_cpu.value - call_cpu.value)/100 LAST_CPU_SECS,
round((sess_cpu.value/100)/round((sysdate - sess_cpu.logon_time)*1440*60)*100,2) Percent, 
sess_cpu.sql_id,
(select sql_text from v$sql sql where sql.sql_id = sess_cpu.sql_id) sql_text
from
(select se.con_id, se.sql_id, ss.statistic#, se.sid, se.username, se.status, se.program, se.logon_time, sn.name, ss.value,
se.machine, se.type, se.event
from v$session se, v$sesstat ss, v$statname sn
where se.sid = ss.sid
and sn.statistic# = ss.statistic#
and sn.name in ('CPU used by this session') ) sess_cpu,
(select se.con_id, ss.statistic#, se.sid, ss.value, value/100 seconds 
from v$session se, v$sesstat ss, v$statname sn
where se.sid = ss.sid
and sn.statistic# = ss.statistic#
and sn.name in ('CPU used when call started') ) call_cpu
where sess_cpu.sid = call_cpu.sid
and sess_cpu.con_id = call_cpu.con_id
and round((sysdate - sess_cpu.logon_time)*1440*60) > 0
order by percent desc;

 

 

Userlevel 6
Badge +15

Hello,

It is interesting to note that your Apps 10 database is facing performance issues, especially CPU spiking at 100%. In our experience, Apps 10 database is designed to perform fast.

  1. Please check to see if there is any Oracle maintenance jobs running as per the Oracle alert log around the same time when the CPU spikes up. We had a client who ran into a similar issue running Oracle 12.1.0.2 and it turned out to be a maintenance job. The issue went away once the job was disabled.  
  2. Check for blocking locks in the database using the query below. This also causes high CPU usage.
  3. ------- REM IFS SQL to find blocking locks ----------------
    select logon_time, sid, serial#, blocking_session, program, seconds_in_wait, machine, client_info, username "ORACLE USERNAME", wait_class, audsid, process
    from  v$session 
    where blocking_session is not NULL 
    order by logon_time;

Please check these two first and keep me posted.

Userlevel 3
Badge +8

As an alternative to a $2,000 utility, here is a query that will show sessions ranked by CPU usage and shows helpful information about the session so you can track down the process/user. The query displays helpful information like username, machine, program, type, event, and the sql text/command being referenced.

 

select sess_cpu.con_id, 
sess_cpu.sid, 
NVL(sess_cpu.username, 'Oracle Process') username, 
sess_cpu.status, 
sess_cpu.logon_time, 
sess_cpu.machine, sess_cpu.program, sess_cpu.type, sess_cpu.event,
round((sysdate - sess_cpu.logon_time)*1440*60) logon_SECS, 
sess_cpu.value/100 SESS_CPU_SECS, (sess_cpu.value - call_cpu.value)/100 LAST_CPU_SECS,
round((sess_cpu.value/100)/round((sysdate - sess_cpu.logon_time)*1440*60)*100,2) Percent, 
sess_cpu.sql_id,
(select sql_text from v$sql sql where sql.sql_id = sess_cpu.sql_id) sql_text
from
(select se.con_id, se.sql_id, ss.statistic#, se.sid, se.username, se.status, se.program, se.logon_time, sn.name, ss.value,
se.machine, se.type, se.event
from v$session se, v$sesstat ss, v$statname sn
where se.sid = ss.sid
and sn.statistic# = ss.statistic#
and sn.name in ('CPU used by this session') ) sess_cpu,
(select se.con_id, ss.statistic#, se.sid, ss.value, value/100 seconds 
from v$session se, v$sesstat ss, v$statname sn
where se.sid = ss.sid
and sn.statistic# = ss.statistic#
and sn.name in ('CPU used when call started') ) call_cpu
where sess_cpu.sid = call_cpu.sid
and sess_cpu.con_id = call_cpu.con_id
and round((sysdate - sess_cpu.logon_time)*1440*60) > 0
order by percent desc;

 

 

I’m getting an error running this

 

Userlevel 5
Badge +7

Try this one. I removed an inner select statement that displays the SQL_TEXT of the command being run. You can find the SQL commands being run by querying the v$sql VIEW and matching the SQL_ID being returned form the query shown below.

select sess_cpu.con_id, 
sess_cpu.sid, 
NVL(sess_cpu.username, 'Oracle Process') username, 
sess_cpu.status, 
sess_cpu.logon_time, 
sess_cpu.machine, sess_cpu.program, sess_cpu.type, sess_cpu.event,
round((sysdate - sess_cpu.logon_time)*1440*60) logon_SECS, 
sess_cpu.value/100 SESS_CPU_SECS, (sess_cpu.value - call_cpu.value)/100 LAST_CPU_SECS,
round((sess_cpu.value/100)/round((sysdate - sess_cpu.logon_time)*1440*60)*100,2) Percent, 
sess_cpu.sql_id
from
(select se.con_id, se.sql_id, ss.statistic#, se.sid, se.username, se.status, se.program, se.logon_time, sn.name, ss.value,
se.machine, se.type, se.event
from v$session se, v$sesstat ss, v$statname sn
where se.sid = ss.sid
and sn.statistic# = ss.statistic#
and sn.name in ('CPU used by this session') ) sess_cpu,
(select se.con_id, ss.statistic#, se.sid, ss.value, value/100 seconds 
from v$session se, v$sesstat ss, v$statname sn
where se.sid = ss.sid
and sn.statistic# = ss.statistic#
and sn.name in ('CPU used when call started') ) call_cpu
where sess_cpu.sid = call_cpu.sid
and sess_cpu.con_id = call_cpu.con_id
and round((sysdate - sess_cpu.logon_time)*1440*60) > 0
order by percent desc;
 

 

Userlevel 3
Badge +8

Thanks @alanbclark . Is there any information to get from showing the INACTIVE sessions? I see that the query I posted in my first post shows almost the same information, but only showing the ACTIVE sessions. My also shows the SQL running. 

Userlevel 7
Badge +15

You'll need to also join by the SQL child ID. If it's null on the session, you'll want 0 on the library cache. This happens because the same query can be executed with multiple plans depending on table statistics, available indexes, and the parsing schema.

Badge +1

Hi,

Could you if possible get a statpack report for that system during the time period when the problem occur. 

 

Could you also describe the typical user behavior during this time frame?

 

Send the statpack report to me for further investigations in IFS Benchmarking lab.


 

Userlevel 6
Badge +14

Hi,

Is this a Multi-Tenant database or a standalone database? Are you running the base version of Oracle 12.2. or have you applied the later patch releases? There were significant performance issues related to Multi-tenant databases in the base release which were fixed in later PSU’s. 

Userlevel 7
Badge +15

@alanbclark , this is definitely going in my toolkit. 

I’d like to suggest a few edits:

∙ You can usually look at the CLIENT_INFO to get username information for IFSSYS sessions.
∙ As I said, V$SQL may include multiple copies of a given query for things like index changes or for a different parsing user. I added the child number to make a unique join.
∙ It might be helpful to separate the session CPU from the call CPU. Otherwise, a session that starts out idle and then later does a lot of work might get underreported.
∙ I joined to active background jobs.
 

 

Userlevel 4
Badge +9

Thanks for the SQL; this is really interesting and something i will use for analysis on any performance issues.

I notice a difference between @durette  and @alanbclark  SQL scripts in that the percentage columns return very different results and this is the field used for ordering the highest consuming job. @alanbclark percentage column seems to have values for most lines and the value seems sensible, whereas @durette only a small number have a value returned in percentage and the rest are empty.

Are you able to explain why different?

Which of the two SQL statements should be more accurate to use?

Regards

Shaun

Userlevel 7
Badge +15

I reported the average session CPU separately from the average call CPU. Most sessions will tend to be currently idle.

Userlevel 3
Badge +8

Hi,

Is this a Multi-Tenant database or a standalone database? Are you running the base version of Oracle 12.2. or have you applied the later patch releases? There were significant performance issues related to Multi-tenant databases in the base release which were fixed in later PSU’s. 

It’s a standalone database. We are running latest PSU’s

Userlevel 6
Badge +11

From a Crystal Reports framework perspective make sure you have installed the new version of the Crystal Web Service and the Crystal .net runtime that comes with Apps 10.

Also you can enable the CWS logs and see how much time it takes to process and generate the layout.

Userlevel 3
Badge +8

Just wanted to give an update on this. 

We did three changes and haven’t seen the CPU spike to 100% after this:

  • Changed Parallel Execution from 640 to 0 (alter system set parallel_max_servers=0 scope=both;)
  • Changed optimizer_features_enable to 11.2.0.4
  • Changed disks handling for swapfile, changed to SSD and size according to RAM

We are still struggling with slower query results than in IFS8, much more waiting for our users. If anyone have tips on how what to check, any help will be appreciated! Especially the HR checks (User_access_api) and Language/Translation checks (DOMAIN_SYS) seems to take much longer time in IFS10. We are only using one language in our IFS version. Could maybe be an configuration error in our system?

Badge +1

If your Database server is swapping, you really need to add more memory.

 

Another reason for poor performance can be backup jobs, either RMAN or Server backup

Userlevel 3
Badge +8

@terje , we have 96GB of RAM and the SSD swap drive is set to 200GB. Would think this is enough?

 

We'll check our RMAN and Server backup. 

Badge +1

SSD swap area is a safety net, that only should be used in rare situations.

If you OS or DB is swapping, during daily operations, your performance will be downgraded.

81,6 GB SGA should be enough, if the db parameters are optimized.

If you have swapping on OS level, you need give more memory to the OS, reducing the DB SGA.

Userlevel 5
Badge +7

It would be interesting to know what and how many objects are available for parallel execution. Most IFS objects have a parallel processing degree of 1, and therefore won’t start up parallel processes.

 

SELECT owner, table_name, degree
FROM dba_tables
where trim(degree) <> '1'
order by owner, table_name;
 

Here’s q query where you can view parallel statistics should it still be enabled on other databases. Needs privilege to run this query.

select * from v$pq_sysstat;

 

Badge +1

If you look in your top SQL you might see:

select toid from type$ where…

It is related to metadata loading in Aurena and consumes quite some resources during metadata load. This is due to a bug in Oracle which can be resolved by applying a patch. Basically the patch solves a missing sys index.

 

Search Bug 25154887 : RECURSIVE QUERIES ON TYPE$ RUN DURING DESC SLOW IN 12C in Oracle Metalink.

 

//Fredrik

Userlevel 3
Badge +8

@alanbclark, the first query return 0 records. Does this mean that disabling parallel execution is not the reason why it’s better now?

 

@terje , our Oracle DB partner doubt that RMAN is the cause of slowness. Daily incremental backup at 19:00, runs for 20min, log backup each half hour, but this is finished in 1 minute. Full backup runs for a while, but this is only running in weekends. 

We’ll check the swapping settings. 

Reply