Large Model Support

The amount of memory available on a GPU can be a limiting factor when training deep learning models. Learning from large datasets typically calls for models with larger capacity, and hence more learnable parameters, so a data scientist or researcher is often forced to make compromises to stay within the limits of the available GPU memory.

Large Model Support (LMS) is provided by Watson Machine Learning Community Edition across its full stack of deep learning frameworks. By caching the model and swapping tensors between GPU memory and system memory, LMS avoids the usual compromises in dataset size, model capacity, or training batch size. The connection between GPU memory and system memory is carried over NVIDIA® NVLink™ technology, which minimizes the data transfer overhead.

This capability is unique to Watson Machine Learning Community Edition and opens the opportunity to tackle larger problems and get much more work done within a single server running Watson Machine Learning CE.
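
As a rough sketch of what this looks like in practice, recent WML CE builds of TensorFlow expose LMS as a runtime switch. The call below, tf.config.experimental.set_lms_enabled, is specific to IBM's LMS-enabled TensorFlow distribution (it is not part of upstream TensorFlow) and its exact name may vary between WML CE releases, so treat this as an illustration rather than a definitive recipe; the surrounding Keras code is standard.

```python
import tensorflow as tf

# Enable Large Model Support before building the model.
# NOTE: this call exists only in the LMS-enabled TensorFlow that ships with
# WML CE; verify the API against the documentation for your release.
tf.config.experimental.set_lms_enabled(True)

# A deliberately memory-hungry Keras model (high-resolution inputs) that
# might otherwise exceed GPU memory during training.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, activation="relu", input_shape=(2048, 2048, 3)),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Training proceeds as usual; with LMS enabled, tensors that do not fit in
# GPU memory can be swapped out to system memory over NVLink.
# model.fit(train_dataset, epochs=5)
```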

A basic illustration of how Large Model Support, together with NVLink, can help overcome out-of-memory errors when training deep learning models on GPUs.
Image Source: AC922 Technical Overview and Introduction IBM Redbook.

In short, LMS enables successful training of deep learning models that would otherwise exhaust GPU memory and abort with "out-of-memory" errors. Several elements of a deep learning workload can lead to GPU memory exhaustion, including model complexity, input data size (e.g., high-resolution images), and training batch size.
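
A back-of-envelope calculation shows how quickly these factors add up; the layer dimensions below are purely illustrative and not taken from any real model.

```python
# Rough activation-memory estimate for a single convolutional feature map.
batch_size = 32
height, width, channels = 2048, 2048, 64   # high-resolution feature map
bytes_per_value = 4                         # float32

activation_bytes = batch_size * height * width * channels * bytes_per_value
print(f"Activations for one layer: {activation_bytes / 2**30:.0f} GiB")
# -> 32 GiB for a single layer's activations, already more than a 16 GiB GPU,
#    before counting weights, gradients, and optimizer state.
```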

For a getting-started tutorial on how to utilize LMS in a deep learning development pipeline, see IBM's guides for TensorFlow/Keras, PyTorch, and IBM Caffe.
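
As one illustration of what those guides cover, the LMS-enabled PyTorch build shipped with WML CE turns LMS on through a call on torch.cuda. The function name below, torch.cuda.set_enabled_lms, is an IBM addition that is not available in upstream PyTorch, so verify it against the guide for your release; the rest of the training loop is unmodified, standard PyTorch.

```python
import torch
import torch.nn as nn

# Enable Large Model Support before allocating tensors on the GPU.
# NOTE: set_enabled_lms() is an IBM addition available only in the
# LMS-enabled PyTorch shipped with WML CE, not in upstream PyTorch.
torch.cuda.set_enabled_lms(True)

device = torch.device("cuda")
model = nn.Sequential(
    nn.Linear(16384, 16384),
    nn.ReLU(),
    nn.Linear(16384, 10),
).to(device)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# With LMS enabled, inactive tensors can be swapped to system memory,
# allowing batch sizes and models that would otherwise raise CUDA
# out-of-memory errors.
inputs = torch.randn(256, 16384, device=device)
targets = torch.randint(0, 10, (256,), device=device)
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()
```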