Data grows year-on-year, and the expectation is that marketing and services should be tailored to the individual. The Marketing Services Team at this Global Risk and Analytics Company supports some of the largest US banks and financial institutions, providing highly targeted data sets for a range of purposes including marketing, risk, portfolio and credit analyses. The existing infrastructure supporting this team was significant and well optimised, using a well-configured, industry-leading scheduler to manage their data centres.
The team were looking for new solutions to maximise the efficiency of the data centres and provide internal teams with accurate predictions on the completion time of batch workloads. This would continue to ensure customer SLAs are met and expectations managed to help drive more revenue and lower support costs. Providing these capabilities would help retain valuable customers in a competitive market and reduce capital expenditure by removing the need to invest in more hardware to meet the rapidly growing business demands.
Each month thousands of reports are run on behalf of their customers, each with its own deadline. Confidence that they would finish on time was based on historic run times and experience. This is a common challenge when sharing infrastructure across a large business. To tackle it, the team had internally developed several systems over the past decade to provide that visibility; however, none had ever achieved a level of confidence that could be relied upon entirely.
To keep pace with growing business needs, the infrastructure was upgraded, but it still encountered demand spikes that could not be forecast. This meant manual monitoring and intervention were common, to reprioritise workloads and meet customer requirements.
The team were aware that job queue optimisation in their existing scheduler was good, but believed it could still deliver greater business-process efficiency. However, they had never identified a solution that met their requirements. The workload queue strategy was traditional, requiring sales teams to request that workloads for key clients be moved up the queue.
The infrastructure, while significant and with a huge number of grid slots, was controlled by the existing scheduler. By adding more intelligence and predictability to workload scheduling, increased production levels and shorter queue times could be delivered. This would enable more workloads or iterations to be run on the same hardware and reduce the need to burst to the cloud or purchase additional hardware.
The team provided 13 months of usage data from their existing scheduler logs to YellowDog's Professional Services team to train the advanced machine learning models. The team also developed an application to automate the extraction of this data so the models can be retrained with new data on a regular basis, enabling the prediction engine to maintain its accuracy.
The YellowDog Professional Services team trained the existing YellowDog machine learning models over an 8-week period, combining three models to predict both the end-to-end run time of workloads (Wall Time) and the time they were actively running on hardware (CPU Time). The evaluation, while limited to 13 months of data, looked at seasonal trends and at how the infrastructure operated on a weekly and monthly basis, as many of the workloads are repeated at regular intervals.
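The case study does not describe how the three models were combined, so the following is only an illustrative sketch: it blends three simple, hypothetical signals (a per-job-type historical baseline, a seasonal adjustment, and an input-size scaling) with fixed weights to produce a Wall Time estimate. The model functions, weights, and numbers are all assumptions, not YellowDog's actual models.

```python
from dataclasses import dataclass

@dataclass
class WorkloadFeatures:
    job_type: str
    month: int           # 1-12, for seasonal trends
    weekday: int         # 0-6, for weekly patterns
    input_size_gb: float

def baseline_model(f: WorkloadFeatures) -> float:
    # Hypothetical per-job-type historical mean run time, in minutes.
    means = {"report": 40.0, "risk": 90.0}
    return means.get(f.job_type, 60.0)

def seasonal_model(f: WorkloadFeatures) -> float:
    # Hypothetical multiplicative seasonal adjustment (busier at year-end).
    monthly_factor = 1.0 + 0.05 * (f.month in (1, 12))
    return baseline_model(f) * monthly_factor

def size_model(f: WorkloadFeatures) -> float:
    # Hypothetical linear scaling with input data volume.
    return 2.0 * f.input_size_gb

def predict_wall_time_minutes(f: WorkloadFeatures) -> float:
    # Blend the three signals with fixed weights (assumed for illustration).
    return (0.5 * baseline_model(f)
            + 0.3 * seasonal_model(f)
            + 0.2 * size_model(f))
```

In practice each component would be a trained model rather than a hand-written rule, and the blending weights would themselves be learned from the 13 months of scheduler logs.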
Leveraging the data from YellowDog’s Forward Prediction Analytics, the Workload Scheduling functionality can be integrated with the existing scheduler to issue commands to manage the workload queue via an API to reduce queue times. Adding a new level of intelligence to workload scheduling removes the need for manual intervention to reprioritise workloads and increases the efficiency and quantity of batch processes that can be run on the existing infrastructure.
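The actual scheduler commands and YellowDog API are not shown in the case study, so the sketch below only illustrates the core idea of prediction-driven reprioritisation: order the queue least-slack-first, so workloads whose predicted run time leaves the least margin before their customer deadline run first. The class names and ordering rule are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class QueuedJob:
    job_id: str
    deadline: datetime
    predicted_wall_time: timedelta  # from the prediction engine

def slack(job: QueuedJob, now: datetime) -> timedelta:
    # Time to spare if the job started right now.
    return job.deadline - now - job.predicted_wall_time

def reprioritise(queue: list[QueuedJob], now: datetime) -> list[QueuedJob]:
    # Least slack first; ties broken by job_id for determinism.
    return sorted(queue, key=lambda j: (slack(j, now), j.job_id))
```

A real integration would re-run this ordering whenever predictions or the queue change, then issue the corresponding reordering commands to the scheduler via its API, replacing the manual escalation requests described above.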
Information on predicted run time and ongoing performance is exposed via the API, which can be integrated into third-party and proprietary sales dashboards. Leveraging existing interfaces reduced development time and accelerated time to market.
"YellowDog have created functionality that we have wanted for over a decade. This can transform how we run our infrastructure."
VP and Global Head of Solution Architecture, Leading Global Risk & Analytics Business
Retraining the existing YellowDog prediction models on the supplied data delivered results that exceeded all expectations. The models were able to predict workload run times with 96% confidence for CPU Time and 73% confidence for Wall Time. This level of confidence enables more intelligent prioritisation within the existing scheduler, helping to keep clients happy and hardware utilisation optimised.
The prediction data generated by YellowDog is used to intelligently manage the existing scheduler's queue, with the aim of significantly reducing queue times. Furthermore, automated workload reprioritisation removes the need for manual intervention, allowing staff to concentrate on higher-value tasks. The service works alongside the existing scheduler, making it more intelligent without replacing any of the existing setup.
Improving the efficiency of the compute grid, while delivering more performance from the existing investment, also allows thousands of extra workloads to be run, helping to drive more revenue from a fixed cost base and improving the profitability of each workload.