9 Tips for Training Models on your University’s HPC Cluster | by Conor O’Sullivan | Mar, 2023
How to effectively run and debug code in a resource-constrained environment

Photo by Martijn Baudoin on Unsplash

Queue job, wait 24 hours, cuda runtime error: out of memory

Queue job, wait 24 hours, FileNotFoundError: No such file or directory

Queue job, wait 24 hours, RuntimeError: stack expects each tensor…

AHHHGH!!!

Debugging code on a high-performance computing (HPC) cluster can be incredibly frustrating. To make matters worse, at university you will be sharing resources with other students. Jobs will be added to a queue.…