Health Checks & KPI Monitoring

The Right Expertise At The Right Time

Case Study:

Health checks investigate logs and KPIs on the server, database and application to spot problems before they become serious. Monitoring provides historical and real-time information, along with alerts for important KPIs. Together they help ensure that service uptime is maintained and that any future downtime is planned.

About the client

The client in this case study is a logistics company providing services to the retail sector. They operate 24/6 and half day on Saturday. Cyberdan Ltd support their Linux and Oracle environment, which runs a third-party warehouse system.

Initial Findings

An initial health check was performed on their system and several key issues were identified. These included:

• low buffer cache hit ratio in the Oracle database (43%) and low PGA hit ratio (56%)
• slow-running queries in the Oracle database
• lack of RAM in the OS and high usage of swap space
• high CPU usage
• no specific Oracle backup in place
• printing issues

These initial findings point to problems with performance, day-to-day operations and disaster recovery.

Remedial Work

The platform runs on VMware and spare resources were available on the VM host, so RAM was increased from 4 GB to 12 GB and CPU from 2 cores to 4 cores.

Based on estimates of Oracle memory usage, the Oracle memory allocation was raised from 1 GB to 6 GB. This left headroom to increase it further should production use require more.

Oracle RMAN backup scripts were written to take nightly off-server backups of the database and hourly backups of the archive logs. Oracle logs were also multiplexed to local and off-server storage for real-time resilience.

Several configured printers were no longer available on the network, so they were removed from the configuration.

Results

Changes requiring downtime were scheduled for non-operational periods (Saturday evenings in this case) so that operations were not disrupted.

The Oracle buffer cache hit ratio increased from 43% to 91% because the extra memory allowed more data blocks to be retained in the buffer cache instead of being re-read from disk. The PGA hit ratio rose from 56% to 97%, indicating that session work areas (for example, sorts and hash joins) were now mostly processed in RAM rather than spilling to temporary disk space. Long-running queries sped up and the general responsiveness of the system improved.
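
For readers who want to see how hit ratios like these are usually derived, the following is a minimal sketch and not the script used in this engagement. It assumes the commonly used formulas based on the Oracle statistics "db block gets", "consistent gets" and "physical reads" (from V$SYSSTAT) and "bytes processed" and "extra bytes read/written" (from V$PGASTAT). The function names and the sample counter values are hypothetical and chosen only to reproduce the percentages quoted above.

```python
def buffer_cache_hit_ratio(db_block_gets, consistent_gets, physical_reads):
    """Return the buffer cache hit ratio as a percentage.

    Logical reads are requests satisfied via the cache; physical reads
    are the subset that had to go to disk.
    """
    logical_reads = db_block_gets + consistent_gets
    if logical_reads == 0:
        return 0.0  # no activity yet, so nothing meaningful to report
    return 100.0 * (1 - physical_reads / logical_reads)


def pga_cache_hit_ratio(bytes_processed, extra_bytes_read_written):
    """Return the PGA cache hit ratio as a percentage.

    Extra bytes are those read or written in additional passes because
    a work area did not fit in memory.
    """
    total = bytes_processed + extra_bytes_read_written
    if total == 0:
        return 0.0
    return 100.0 * bytes_processed / total


# Hypothetical counter values, for illustration only.
before = buffer_cache_hit_ratio(200_000, 800_000, 570_000)
after = buffer_cache_hit_ratio(200_000, 800_000, 90_000)
print(f"buffer cache hit ratio: {before:.0f}% -> {after:.0f}%")
print(f"PGA hit ratio: {pga_cache_hit_ratio(97, 3):.0f}%")
```

Because these counters are cumulative since instance start-up, a monitoring job would normally compare two samples taken some time apart rather than rely on the lifetime totals.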

With more RAM available, swap usage dropped dramatically to almost none (Linux often keeps a small amount of swap in use). CPU load also dropped, not only because of the extra cores but also because the system spent less time managing swap and Oracle queries finished sooner, occupying the CPU for shorter periods.
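
Swap usage of this kind can be tracked as a KPI. The sketch below is illustrative only and is not taken from the client's monitoring. It assumes the SwapTotal and SwapFree fields that Linux exposes in /proc/meminfo; the helper names and the sample figures are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_percent(meminfo):
    """Return the percentage of swap space currently in use."""
    total = meminfo.get("SwapTotal", 0)
    free = meminfo.get("SwapFree", 0)
    if total == 0:
        return 0.0  # no swap configured
    return 100.0 * (total - free) / total


# Hypothetical sample; on a Linux host the text would come from /proc/meminfo.
SAMPLE = """MemTotal:       12288000 kB
MemFree:         1024000 kB
SwapTotal:       4096000 kB
SwapFree:        4055040 kB
"""

info = parse_meminfo(SAMPLE)
print(f"swap used: {swap_used_percent(info):.1f}%")
```

Recording this value at regular intervals gives the historical view, and comparing it with a chosen threshold gives the alert.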

The Oracle database is now backed up off the server and alerts have been set up to monitor the backup status.
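
The case study does not describe how these alerts were implemented. As a rough sketch of one possible approach, a check can compare the time of the last successful backup with a maximum acceptable age. The function name and the thresholds (26 hours for the nightly backup, 2 hours for the hourly archive log backup) are assumptions made for illustration, not values from the client's system.

```python
from datetime import datetime, timedelta

# Assumed thresholds: each allows some slack beyond the backup schedule.
NIGHTLY_MAX_AGE = timedelta(hours=26)
ARCHIVE_LOG_MAX_AGE = timedelta(hours=2)


def backup_status(last_success, now, max_age):
    """Return "OK" if the last successful backup is recent enough, else "ALERT".

    last_success may be None when no successful backup has been recorded.
    """
    if last_success is None:
        return "ALERT"
    return "OK" if now - last_success <= max_age else "ALERT"


# Hypothetical timestamps, for illustration only.
now = datetime(2024, 1, 10, 9, 0)
print(backup_status(datetime(2024, 1, 10, 1, 0), now, NIGHTLY_MAX_AGE))
print(backup_status(datetime(2024, 1, 10, 5, 30), now, ARCHIVE_LOG_MAX_AGE))
```

In this example the nightly backup is 8 hours old and within its limit, while the archive log backup is 3.5 hours old and would raise an alert.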

Real-time Oracle logs are duplicated off the server so that, should the system go down, the most current log is still available elsewhere.

Printers are working as expected.

Overall, a successful set of changes.