What are health checks?
Health checks investigate logs and KPIs on the server, database and application to spot problems before they affect the service. Monitoring provides historical and real-time information, along with alerts for important KPIs. The combination ensures that service uptime is maintained and that any future downtime is planned. Every client we take on has an initial health check during the onboarding phase. Once actions have been taken to remediate any problems detected, a regular health check is performed each month and a report is produced for feedback.
About the client
The client in this case study is a logistics company providing services to the retail sector. They operate 24/6, with a half day on Saturday. As their IT partner, we support their Linux and Oracle environment, which runs the RedPrairie/JDA WMS system.
Initial findings
An initial health check was performed on their system, and several key issues were identified as major concerns to be addressed. These included:
- Low buffer cache hit ratio in Oracle database (43%) and low PGA hit ratio (56%)
- Slow running queries in Oracle database
- Lack of RAM in OS and high usage of swap space
- High CPU usage
- No specific Oracle backup in place
- Printing issues
These initial findings showed issues with the system's performance, operations and disaster contingency. Because this client operates in the retail and warehouse industry, a system under-resourced for its workload would eventually have a serious negative impact on service and availability if these problems were not addressed.
Remedial work
The platform runs on VMware, and there were spare resources available on the VM host. RAM was increased from 4GB to 12GB and CPU from 2 cores to 4 cores. Using estimates of Oracle memory usage, the Oracle memory allocation was raised from 1GB to 6GB. This still left some scope for tweaking should Oracle require more in production.
Oracle RMAN backup scripts were written to make off-server backups nightly, with hourly backups of the archive logs. Oracle logs were multiplexed to local and off-server storage for real-time resilience. Several printers were no longer available on the network, so these were removed. Changes requiring downtime were scheduled during non-operational periods (Saturday evenings in this case) so that there was no disruption to operations.
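The original backup scripts are not reproduced in this case study. As an illustration only, the nightly and hourly RMAN runs described above could be driven by something like the following sketch. The backup directory, the function names and the use of OS authentication are all assumptions made for the example, and scheduling (for instance via cron) is left out.

```python
import subprocess
import tempfile


def build_rman_script(backup_dir, archive_logs_only=False):
    """Build RMAN command text for a nightly or an hourly run.

    backup_dir is a hypothetical off-server mount point.
    """
    commands = []
    if not archive_logs_only:
        # Nightly run: back up the whole database.
        commands.append(f"  BACKUP DATABASE FORMAT '{backup_dir}/db_%U';")
    # Both runs: back up the archived redo logs.
    commands.append(f"  BACKUP ARCHIVELOG ALL FORMAT '{backup_dir}/arch_%U';")
    return "RUN {\n" + "\n".join(commands) + "\n}\n"


def run_rman(script_text):
    """Write the script to a temporary command file and pass it to rman."""
    with tempfile.NamedTemporaryFile("w", suffix=".rman", delete=False) as f:
        f.write(script_text)
    # Assumes rman is on PATH and OS authentication is configured.
    return subprocess.run(["rman", "target", "/", f"cmdfile={f.name}"], check=True)


# Print the hourly (archive log only) script without running it.
print(build_rman_script("/mnt/offserver/oracle", archive_logs_only=True))
```

Keeping the script generation separate from the call to rman means the command text can be reviewed or tested on a machine that has no Oracle installation.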
Results
The Oracle buffer cache hit ratio increased from 43% to 91% because the extra memory allowed more data to be retained in the buffer cache. The PGA hit ratio rose from 56% to 97%, indicating that session work areas, such as sorts and hash joins, were mostly processed in memory rather than spilling to disk.
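For reference, the buffer cache hit ratio quoted above is conventionally derived from three cumulative counters in Oracle's V$SYSSTAT view. The following is a minimal sketch of that arithmetic, assuming the counter values have already been queried. The function name and the sample numbers are illustrative and are not measurements from the client's system.

```python
def buffer_cache_hit_ratio(physical_reads, db_block_gets, consistent_gets):
    """Return the buffer cache hit ratio as a percentage.

    The arguments correspond to the V$SYSSTAT statistics
    'physical reads cache', 'db block gets from cache' and
    'consistent gets from cache'.
    """
    logical_reads = db_block_gets + consistent_gets
    if logical_reads == 0:
        # No logical reads recorded yet, so there is nothing to measure.
        return 0.0
    return 100.0 * (1.0 - physical_reads / logical_reads)


# Illustrative values only (not taken from the case study).
print(round(buffer_cache_hit_ratio(9_000, 20_000, 80_000), 1))
```

Because the counters are cumulative since instance startup, comparing the ratio over a fixed interval (the difference between two samples) tends to say more than a single reading.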
Long-running queries sped up and the general responsiveness of the system improved. With more RAM available, swap usage dropped dramatically to almost none (Linux often reserves a little swap space).
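Swap usage of this kind can be read from /proc/meminfo on Linux, where SwapTotal and SwapFree are reported in kB. The sketch below shows one way to compute the amount in use; the function name and the sample text are hypothetical and do not come from the client's server.

```python
def swap_used_kb(meminfo_text):
    """Return the swap in use (kB) given the contents of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields["SwapTotal"] - fields["SwapFree"]


# Hypothetical sample in the /proc/meminfo format (values in kB).
sample = """MemTotal:       12288000 kB
SwapTotal:       2097148 kB
SwapFree:        2095100 kB
"""
print(swap_used_kb(sample))
```

On a live Linux host the same function could be called with the text read from /proc/meminfo, and the result tracked over time as one of the monitored KPIs.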
CPU load dropped not only because of the extra cores but also because less time was spent managing swap space and Oracle queries took less time to run, so they used the CPU over a shorter period.
Conclusion
Undertaking a system health check allowed us to spot system misconfigurations quite easily and to see that the poor performance stemmed from the system being heavily under-resourced for its intended workload.
The Oracle database now has onsite backups in place, and alerts have been set up to monitor their status. Real-time Oracle logs are duplicated off the server so that, should the system go down, the most current log is still available offline.