Dynamic virtual private clusters with OpenNebula and SGE

Son of Grid Engine (SGE) is the scheduler of choice for many of our users in bioinformatics. SGE allows for optimal hardware utilization while keeping the work required of systems administrators simple.
Most of the SGE installations run on hardware that will go out of support within the next year. The users now face a choice: buy new hardware, or run externally. Replacing hardware to meet an expected peak capacity demand is expensive, and the hardware may sit idle most of the time. The cloud has the obvious benefit of flexibility: extra resources can be provisioned as necessary, which also reduces the upfront cost.
SURFsara is committed to delivering high-performance compute facilities to the scientific community in The Netherlands and, as such, would like to offer virtual private clusters. Ideally, these clusters scale flexibly, so that multiple user communities can pool hardware resources and improve hardware utilization by spreading peak capacity demands over time. At the same time, the workflow systems offered on the virtual private clusters should match those used by our scientific community, as this will ease the transition from running on their own hardware to running in the cloud and will make integration with existing pipelines possible. Therefore, SURFsara is developing virtual private clusters running on our OpenNebula-managed cloud, where the clusters themselves are managed by a custom module. This module interfaces with both SGE and OpenNebula.

This presentation discusses the design and implementation of, and our experiences with, dynamic SGE-driven virtual clusters on the HPC Cloud offered by SURFsara. Topics discussed are:

– queue inspection: how to determine the current and future workload;
– criteria to add or remove worker nodes from queues and from the SGE compute grid;
– how OCA (the OpenNebula Cloud API) is used to add or remove nodes;
– how new nodes are automatically configured;
– how new nodes are automatically added to SGE queues, and how nodes are removed from them;
– integration of the different parts into a working system;
– conclusion: do the dynamic virtual clusters improve hardware resource utilization, and are the dynamic virtual clusters valued by our users?
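
As an illustration of the kind of criteria the second topic refers to, the following is a minimal sketch of a scaling decision based on a queue snapshot. It is not the SURFsara module: the `QueueState` fields, the `scaling_decision` function, and the parameters `slots_per_node`, `min_nodes` and `max_nodes` are all hypothetical names and values chosen for this example, and the rule itself (grow when jobs are waiting, shrink when nodes are idle) is only one plausible policy.

```python
import math
from dataclasses import dataclass


@dataclass
class QueueState:
    """Hypothetical snapshot of one SGE queue (all fields are assumptions)."""
    pending_slots: int   # slots requested by jobs that are still waiting
    running_nodes: int   # worker nodes currently part of the queue
    idle_nodes: int      # worker nodes with no running jobs


def scaling_decision(state, slots_per_node=8, min_nodes=1, max_nodes=16):
    """Return how many nodes to add (positive) or remove (negative).

    Assumed policy: if jobs are waiting, request enough nodes to cover the
    backlog, capped at max_nodes; otherwise release idle nodes while
    keeping at least min_nodes in the cluster.
    """
    if state.pending_slots > 0:
        # Nodes needed to serve the whole backlog.
        wanted = math.ceil(state.pending_slots / slots_per_node)
        # Idle nodes can absorb part of the backlog before we grow.
        wanted = max(0, wanted - state.idle_nodes)
        # Never grow beyond the configured cluster size.
        headroom = max_nodes - state.running_nodes
        return max(0, min(wanted, headroom))

    # No backlog: shrink, but never below the minimum cluster size.
    removable = state.running_nodes - min_nodes
    return -max(0, min(state.idle_nodes, removable))


if __name__ == "__main__":
    print(scaling_decision(QueueState(pending_slots=20, running_nodes=4, idle_nodes=0)))
    print(scaling_decision(QueueState(pending_slots=0, running_nodes=4, idle_nodes=3)))
```

In a real deployment the snapshot would have to be filled from the scheduler (for example by parsing queue status output) and the returned number would be passed on to the cloud API; both steps are left out here because they depend on the local setup.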