Deep Learning with PyTorch Lightning

Training often needs an audit, balance, and control mechanism. Imagine you are training a model for 1,000 epochs and a network failure interrupts the run after 500 epochs. How do you resume training from the point where it stopped, without losing the progress made so far? And how do you save a model checkpoint from a cloud environment? Let's see how to deal with these practical challenges, which are often part and parcel of an engineer's life.
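To make the interrupted-run scenario concrete, here is a minimal, framework-agnostic sketch of the checkpoint-and-resume idea. It is not Lightning code: the `train` function, the JSON checkpoint file, the `fail_at` parameter, and the `weight` value are all hypothetical stand-ins chosen for illustration, and the "training" is just arithmetic. The point is only the pattern: save the state after every epoch, and on startup continue from the last saved epoch if a checkpoint exists.

```python
import json
import os
import tempfile


def train(total_epochs, ckpt_path, fail_at=None):
    """Toy loop: checkpoint after every epoch, resume if a checkpoint exists.

    `fail_at` is a hypothetical knob that simulates a network failure.
    """
    # Resume from the saved state if there is one; otherwise start fresh.
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    else:
        state = {"epoch": 0, "weight": 0.0}

    start_epoch = state["epoch"]
    for epoch in range(start_epoch, total_epochs):
        if fail_at is not None and epoch == fail_at:
            raise ConnectionError(f"simulated interruption at epoch {epoch}")

        state["weight"] += 0.5       # stand-in for one epoch of real training
        state["epoch"] = epoch + 1   # number of completed epochs

        # Write to a temporary file and rename it, so a crash in the middle
        # of a write cannot leave a corrupted checkpoint behind.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, ckpt_path)

    return state, start_epoch


# First run is interrupted after 500 completed epochs.
ckpt = os.path.join(tempfile.mkdtemp(), "last.json")
try:
    train(1000, ckpt, fail_at=500)
except ConnectionError:
    pass

# Second run picks up from the checkpoint instead of starting over.
final_state, resumed_from = train(1000, ckpt)
print(resumed_from, final_state["epoch"])
```

In a real project the saved state would hold model weights and optimizer state rather than a single number, and Lightning's `ModelCheckpoint` callback takes care of the saving step for you.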
Notebooks hosted in cloud environments such as Google Colab have resource limits and idle timeout periods. If these limits are exceeded while you are developing a model, the notebook is deactivated. Owing to the inherently elastic nature of the cloud environment (which is one of the value propositions of the cloud), the underlying compute and storage resources are decommissioned when a notebook is deactivated. If you refresh...