A Hybrid Cloud Kubernetes Scheduler for Machine Learning Workloads

Kieley, James

Demand for processing machine learning workloads has grown rapidly over the past few years. Kubernetes, an open-source container orchestrator, is widely used by public and private cloud providers to build scalable systems that meet this demand. The data used to train machine learning models can be sensitive, and organizations may prefer to retain responsibility for data security and governance by keeping it on on-premises systems. Hybrid cloud gives organizations the flexibility to use on-premises and cloud infrastructure together, leveraging the advantages of both. Despite its many benefits, Kubernetes has design limitations that restrict what a user can do in a hybrid cloud environment. The Kubernetes control plane does not manage worker nodes across cloud providers. This boundary places new responsibilities on the end user deploying a hybrid cloud workload: the end user must create the clusters and specify ahead of time which cluster the workload will be scheduled to, and the Kubernetes scheduler will not take the capacity of another cluster into account. To address these limitations, this thesis presents a new hybrid cloud Kubernetes scheduler that can create new clusters on demand and burst machine learning workloads to a public cloud when on-premises resources are insufficient. Workloads are first scheduled on an on-premises Kubernetes cluster. When the on-premises cluster’s capacity is exhausted, a new Kubernetes cluster is created on demand in a public cloud provider, and machine learning tasks waiting in the Kubernetes scheduling queue are dynamically migrated to it. The public cluster is dynamically sized and autoscaled based on the demand of the pending tasks. When tasks are migrated, the data dependencies among them are considered, and a region is chosen dynamically to reduce migration time and cost.
The scheduler is experimentally evaluated in a real hybrid cloud environment with real-world machine learning workloads and simulated real-world job arrival behavior. The workloads include predicting whether a subscriber will stay with a subscription service, predicting the discount needed to retain a subscription customer, and predicting whether a credit card transaction is fraudulent. Results show that the scheduler can substantially reduce workload execution time by dynamically migrating tasks from on-premises to the public cloud, and can minimize cost by dynamically sizing and scaling the public cluster.
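
The bursting behavior described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the `Task` fields, the CPU-only capacity model, the first-fit placement rule, and the `transfer_cost` table are all assumptions made for illustration, and the task names only echo the evaluated workloads. It shows three steps: deciding which pending tasks stay on-premises, estimating how many public cloud nodes the remaining tasks need, and choosing a region by summed data transfer cost.

```python
import math
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    cpu: float        # CPU cores requested (hypothetical resource model)
    data_region: str  # region holding the task's input data (hypothetical field)


def split_tasks(pending, free_cpu):
    """Greedy first-fit in queue order: a task stays on-premises if it fits
    in the remaining capacity, otherwise it is marked for bursting."""
    on_prem, burst = [], []
    for task in pending:
        if task.cpu <= free_cpu:
            on_prem.append(task)
            free_cpu -= task.cpu
        else:
            burst.append(task)
    return on_prem, burst


def nodes_needed(burst, cpu_per_node):
    """Lower bound on cloud nodes from total CPU demand of the burst tasks.
    Ignores bin-packing fragmentation, so a real cluster may need more."""
    total = sum(t.cpu for t in burst)
    return math.ceil(total / cpu_per_node)


def pick_region(burst, transfer_cost):
    """Pick the candidate region with the lowest summed transfer cost.
    transfer_cost maps (source_region, target_region) to an assumed cost."""
    regions = sorted({dst for (_, dst) in transfer_cost})
    return min(
        regions,
        key=lambda r: sum(transfer_cost[(t.data_region, r)] for t in burst),
    )


# Hypothetical queue and costs, for illustration only.
pending = [
    Task("churn", 4, "us-east"),
    Task("discount", 4, "us-east"),
    Task("fraud", 8, "eu-west"),
]
transfer_cost = {
    ("us-east", "us-east"): 0,
    ("us-east", "eu-west"): 5,
    ("eu-west", "us-east"): 9,
    ("eu-west", "eu-west"): 0,
}

on_prem, burst = split_tasks(pending, free_cpu=6)
node_count = nodes_needed(burst, cpu_per_node=8)
region = pick_region(burst, transfer_cost)
print([t.name for t in on_prem], [t.name for t in burst], node_count, region)
```

A production scheduler would also account for memory, GPUs, and dependencies between tasks; the sketch keeps a single resource dimension so the three decisions stay easy to follow.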

Copyright Statement