9 Tips for Training Models on Your University’s HPC Cluster
by Conor O’Sullivan | Mar 2023

How to effectively run and debug code in a resource-constrained environment

Photo by Martijn Baudoin on Unsplash

Queue job, wait 24 hours, cuda runtime error: out of memory

Queue job, wait 24 hours, FileNotFoundError: No such file or directory

Queue job, wait 24 hours, RuntimeError: stack expects each tensor…

AHHHGH!!!

Debugging code on a high-performance computing (HPC) cluster can be incredibly frustrating. To make matters worse, at university you will be sharing resources with other students. Jobs will be added to a queue. You can wait hours before you even know if your code has errors.

I’ve recently been training models on my university’s HPC. I’ve learned some things (the hard way). I want to share these tips and tricks to hopefully make your experience a bit smoother. I’ll keep things general so you can apply them to any system.

Do as much development on your personal machine as possible

An HPC cluster has one objective — crunch the numbers. No fancy IDE, no debugger and no Copilot. Do you really want to code your entire project using vim?

To avoid going insane (save that for your thesis), you should develop the code on your own machine. Train the model for at least 1 epoch using a small sample. Include tests to make sure the data is loaded correctly and that all results are saved. It is no fun training a model for 50 epochs to find you forgot to save the best one (yes, I did this).
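
The pre-flight check can be as small as this sketch: a stand-in training loop, a tiny sample, and assertions that the outputs actually exist. All the names here are hypothetical, not from my real project.

```python
import os
import tempfile

def train(data, epochs, save_path):
    """Stand-in for the real training loop: saves a 'checkpoint' on each improvement."""
    best = float("inf")
    for _ in range(epochs):
        loss = sum(data) / len(data)          # placeholder for a real training step
        if loss < best:
            best = loss
            with open(save_path, "w") as f:   # stand-in for torch.save(model, path)
                f.write(f"{best}\n")
    return best

# Pre-flight check: tiny sample, one epoch, then assert the outputs exist
sample = [0.3, 0.1, 0.2]
assert len(sample) > 0, "data failed to load"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "best_model.txt")
    train(sample, epochs=1, save_path=path)
    assert os.path.exists(path), "best model was never saved!"
```

A run like this finishes in seconds on a laptop, which is exactly where you want to discover a missing save call.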

Keep the code simple

Any additional steps you include in the code will increase the chance of a failed run. You should only run the processes that need heavy computing power. For example, after the model has been trained you will want to evaluate it. This can be done on your local machine. If your test set is small enough even a CPU will do.

Do not hardcode any paths or variables

When moving code from your local machine to the HPC, some things will inevitably have to change. For example, the file paths for loading data and saving results. I was developing on an Apple-silicon Mac, while the HPC runs Linux with NVIDIA GPUs. This meant I needed to change the PyTorch device from “mps” to “cuda”.

Figure 1: examples of variables to be updated (source: author)

Do not hardcode anything that will need to change. Use variables and define them right at the top of your script. This makes them easy to change and you will not forget anything. Trust me! You do not want to scroll through lines of code using vim.
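
One way to structure that top-of-script block is sketched below. The paths and names are illustrative, not my actual setup; reading them from environment variables means the file itself never changes between machines. With PyTorch you could instead pick the device automatically via torch.cuda.is_available().

```python
import os

# Everything environment-specific lives here, right at the top of the script.
DATA_DIR = os.environ.get("DATA_DIR", "data/train")       # differs per machine
SAVE_DIR = os.environ.get("SAVE_DIR", "results/models")   # where checkpoints land
DEVICE = os.environ.get("DEVICE", "cpu")                  # "mps" locally, "cuda" on the HPC
BATCH_SIZE = 32
EPOCHS = 50
```

On the cluster you would then launch with something like `DEVICE=cuda python train.py` and never open the file in vim at all.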

Increase batch size

In Figure 1 above, you can see one of the variables I included is batch_size. This is the number of samples that are loaded and used to update model parameters at each iteration. A larger batch size means you can process more samples in parallel on a GPU leading to faster training times.

Increasing it too far on your local machine will quickly lead to a “cuda runtime error: out of memory”. The HPC’s GPUs can handle a larger batch size. However, it is important not to increase it too much, as very large batches can hurt model accuracy. I simply doubled the batch size from 32 to 64.

Use system arguments for any experiments

As far as possible, you want to avoid editing code on the HPC. At the same time, to train the best model you will need to do experiments. To get around this, use system arguments. As seen in Figure 2, these allow you to update variables in your script at the command line.

Figure 2: examples of system arguments (source: author)

The first one allows me to update the save path of the final model (model_name). The others allow me to sample, scale and clean the data in various ways (see Figure 3). You could even update the model architecture. For example, by passing in a list [x,y,z] that defines the number of nodes in the hidden layers of a neural network.

Figure 3: training a U-Net with 4 data cleaning experiments (source: author)
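
A sketch of how such arguments might be wired up with argparse. The flag names below mirror the idea in Figure 2 but are my own guesses, not the exact ones from the figure.

```python
import argparse

def parse_experiment_args(argv=None):
    """Read experiment settings from the command line, so no code needs editing."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", default="model.pt",
                        help="save path for the final model")
    parser.add_argument("--sample", action="store_true",
                        help="run on a small subset of the data")
    parser.add_argument("--scale", choices=["standard", "minmax", "none"],
                        default="standard", help="data scaling method")
    parser.add_argument("--hidden", type=int, nargs="+", default=[64, 32],
                        help="nodes per hidden layer, e.g. --hidden 128 64 32")
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_experiment_args()  # reads sys.argv on the cluster
    print(args)
```

A job script can then vary the experiment without touching the code, e.g. `python train.py --model_name unet_v2.pt --sample --scale minmax`.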

Include a sample argument

The sample system argument is particularly useful. The one I’ve included above is a binary flag. If set to true, the modelling code will be run using a subset of 1000 samples. You could also pass in the actual number of samples as an integer.

Often I schedule 2 jobs right after each other — the sampled and full dataset runs. The sampled run will usually complete within a few minutes (unless someone is hogging all the GPUs!!). This helps point out any pesky bugs. If something goes wrong, I have the opportunity to stop the full dataset run, correct it and kick it off again before the end of the day.
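
Applying the flag can look like this hypothetical helper, which subsets the data before training so the same script handles both the debug job and the full run:

```python
import random

def maybe_sample(dataset, sample=False, n=1000, seed=42):
    """Return a random n-item subset when the sample flag is set, else everything."""
    if not sample or len(dataset) <= n:
        return list(dataset)
    rng = random.Random(seed)   # fixed seed, so debug runs are repeatable
    return rng.sample(list(dataset), n)
```

The debug job passes the flag and trains on 1000 examples; the full job simply omits it.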

Be verbose

No, I’m not talking about your friend majoring in finance. Your code should tell you as much as possible. This will help fix any bugs that do occur. Use those print statements! Some things I find useful to print on each run:

  • The system device (i.e. cuda or cpu)
  • Lengths of the full dataset and the training and validation sets
  • A sample of the data before and after any transformations
  • Training and validation loss for each epoch and batch
  • The model architecture
  • The value for any system arguments
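
Two tiny helpers in that spirit (names and format are my own, not from the original script): one prints a run header covering the points above, the other prints one unambiguous line per epoch that is easy to grep later.

```python
def log_run_config(device, args, train_len, val_len):
    """Print everything needed to reconstruct the run from the log alone."""
    print(f"device:        {device}")
    print(f"arguments:     {args}")
    print(f"train samples: {train_len}")
    print(f"val samples:   {val_len}")

def log_epoch(epoch, train_loss, val_loss):
    """One grep-friendly line per epoch."""
    print(f"epoch {epoch:03d} | train {train_loss:.4f} | val {val_loss:.4f}")
```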

You never know what is going to go wrong. For machine learning, logical errors can be difficult to identify. Your code can run perfectly but produce a model that is absolute garbage. Having a record of the process will help you trace back to where an error is coming from. This is especially important if you want to run multiple experiments.

Use a diff checker

Things will go wrong. Very wrong. I broke my own rules and made a few changes to the script on the HPC. Okay, I made many changes. I lost track of what I had done and the code would not run properly. This is where a diff checker came in handy.

It will compare, line-by-line, the text in one document to another. I used it to compare the script on the HPC to one of my local machines. This pointed out what I had changed and I could immediately identify the problem.

Figure 4: pointing out the difference in HPC and local script (source: diff checker)
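
The diff checker needn’t be a website. Python’s standard difflib does the same line-by-line comparison; a minimal sketch (file names illustrative):

```python
import difflib

def diff_scripts(local_path, hpc_path):
    """Line-by-line comparison of two files, in unified diff format."""
    with open(local_path) as f:
        local = f.readlines()
    with open(hpc_path) as f:
        hpc = f.readlines()
    # Lines starting with '-' exist only locally; '+' lines only in the HPC copy
    return list(difflib.unified_diff(local, hpc,
                                     fromfile=local_path, tofile=hpc_path))
```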

Be realistic about your research

The tips so far have lacked detail. There is a lot you can do to speed up the training for specific model packages (e.g. XGBoost). Keeping better track of your data and models can also help avoid errors and confusion. But, really, there is only so much you can do. My final tip is to be realistic about what you can achieve with your available resources.

In the era of LLMs, this may be disheartening. It seems like all the major breakthroughs are achieved with increasing computational power. You must consider that a model the size of GPT-3 is estimated to cost $450,000 to train. This rivals some departments’ entire budgets! You simply cannot compete.

Yet, you can still make a valuable contribution. Fine-tuning a model does not need as many resources as training one from scratch. Collecting and labelling data is often a more significant contribution than chasing SOTA on benchmarks. Every sub-domain to which you apply machine learning comes with its own challenges, and a lack of resources often leads to more innovative solutions to them. So be thankful for those long job queues…

Oh, who am I kidding!! Anyone want to buy me a GPU?


