SOL - Deep Learning GPU Cluster

The Diagnostic Image Analysis Group (DIAG) aims to improve image analysis tools for clinicians. Since 2015, deep learning has become an essential tool for artificial intelligence, substantially improving the accuracy and generalizability of image classification techniques. To create high-quality deep learning based algorithms for clinical practice, DIAG's AI researchers need high-performance general-purpose graphics processing units (GPGPUs).

SOL is DIAG's deep learning infrastructure. It offers researchers access to a centralized pool of GPGPUs. It is used to create state-of-the-art tools for detecting and classifying illnesses, quantifying their extent, and identifying possible treatment options.

Capacity/Activity

Resource              Value
Compute nodes         30
GPUs                  80
CPUs                  788
Total memory          3.5 TB
Active users/month    40
Jobs run/month        7000

Architecture

SOL is a compute cluster of 30 compute nodes that give access to a pool of 80 GPUs (mainly NVIDIA GTX 1080 Ti and RTX 2080 Ti cards). A dedicated 500 TB storage server serves data to the compute nodes over 20 Gbit/s networking, and a Prometheus and Grafana based monitoring solution tracks the activity and health of the cluster.

Figure: SOL cluster architecture
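As an illustration of how the health of the cluster could be checked through the Prometheus side of the monitoring stack, here is a minimal Python sketch. It relies on two things Prometheus itself provides: the instant-query endpoint /api/v1/query and the built-in `up` metric, which is 1 when a scrape target is reachable and 0 when it is not. The server address, the instance names, and the sample response are made up for the example; the text above does not describe how SOL's monitoring is actually configured.

```python
import json
from urllib.parse import urlencode

# Hypothetical address of the Prometheus server; the real one is not given in the text.
PROMETHEUS_URL = "http://prometheus.example:9090"


def build_query_url(expr, base_url=PROMETHEUS_URL):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def down_targets(response_body):
    """List the instances whose `up` sample is 0 in an instant-query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed")
    return sorted(
        sample["metric"].get("instance", "<unknown>")
        for sample in payload["data"]["result"]
        if float(sample["value"][1]) == 0.0
    )


# Made-up response in the instant-query JSON format, for illustration only.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "instance": "node01:9100"},
             "value": [1546300800, "1"]},
            {"metric": {"__name__": "up", "instance": "node02:9100"},
             "value": [1546300800, "0"]},
        ],
    },
})

if __name__ == "__main__":
    print(build_query_url("up"))
    print(down_targets(SAMPLE_RESPONSE))
```

In practice the URL would be fetched with an HTTP client and the response body passed to down_targets; a Grafana dashboard would typically show the same information graphically.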

Users log in to job nodes directly, either to schedule an experiment or to interact with a running experiment through Jupyter Notebooks. We use Slurm as an automated job queue. Experiments are encapsulated in Docker containers to isolate their software stacks from the base systems of our compute nodes. This gives researchers maximum flexibility to try out new, experimental libraries and software without us needing to install them on the base systems ourselves.
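To make the combination of a Slurm queue and containerized experiments concrete, the following Python sketch assembles a job submission with the standard sbatch command and wraps a docker run invocation in it. This is only one plausible way to connect the two and is not a description of SOL's actual submission tooling: the image name, the resource defaults, and the helper function are hypothetical, and the options needed to expose GPUs inside the container are left out because they depend on the local Docker setup.

```python
import shlex


def build_sbatch_command(image, command, gpus=1, cpus=4, memory="16G",
                         job_name="experiment"):
    """Build an sbatch argument list that runs `command` inside a Docker image.

    Hypothetical helper: all resource defaults are placeholders.
    """
    # The containerized experiment; --rm removes the container when it exits.
    docker_cmd = ["docker", "run", "--rm", image] + list(command)
    return [
        "sbatch",
        f"--job-name={job_name}",
        f"--gres=gpu:{gpus}",          # request GPUs as a generic resource
        f"--cpus-per-task={cpus}",
        f"--mem={memory}",
        # --wrap makes sbatch generate a small shell script around the command.
        f"--wrap={shlex.join(docker_cmd)}",
    ]


if __name__ == "__main__":
    cmd = build_sbatch_command(
        "example/train:latest",        # hypothetical image name
        ["python", "train.py", "--epochs", "10"],
        gpus=2,
    )
    print(shlex.join(cmd))
```

On a job node, the resulting list could be handed to subprocess.run(cmd, check=True) to queue the experiment. Building the command as a list rather than a single string keeps the quoting of the wrapped docker command in one place.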

Future

The SOL compute cluster has been running reliably since mid-2017. The cluster is planned to grow steadily until the end of 2019, when it will reach its maximum design capacity of 100 GPUs. After that, the plan is to consolidate and upgrade old hardware and to remove non-standard components and compute nodes from the cluster.

In addition, the compute cluster will be rolled out as a service to other departments of the hospital, and in 2019 we will start offering compute capabilities built around SOL to other departments of Radboud University.

People

Paul Konstantin Gerke

Research Software Engineer

 Sil van de Leemput

 Thomas de Bel

 Erdi Calli

 Matin Hosseinzadeh

 Xie Weiyi

 Patrick Brand
