Data Analysis Software

To run the various software packages available on the cluster, a tool called Environment Modules is used. It lets you see all the programs already installed on the cluster, along with their versions, then load the one you need and use it accordingly. On this page you can find more information on how to check and load software on the cluster. If you want a more hands-on approach, work through this set of exercises before diving into your own work.
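As a quick illustration, these are the typical Environment Modules commands; the module name below is an example, so run `module avail` first to see what is actually installed:

```bash
# Inspect, load, and unload software with Environment Modules.
# "R/4.1.0" is an example module name; your cluster's list will differ.
module avail            # list all installed software and available versions
module load R/4.1.0     # load a specific version of a program
module list             # show which modules are currently loaded
module unload R/4.1.0   # unload a module when you are done with it
```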

Certain widely used applications are available without loading anything: MATLAB, R, and Jupyter Notebooks. Utility scripts are provided that integrate them with job submission to the Torque cluster. For more information on what these are and how to use them, see here.
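As a minimal sketch of what such a submission can look like, assuming a placeholder script path, job name, and resource requests (the provided utility scripts wrap steps like this for you):

```bash
# Submit an R script as a Torque batch job straight from the command line.
# The script path, job name, and resource limits are placeholder assumptions.
echo "Rscript ~/analysis/myscript.R" | qsub -N r_analysis -l walltime=01:00:00,mem=4gb
```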

The HPC Wiki also contains more information on how to use MATLAB, R, and Python effectively on the cluster. Those pages explain how to create environments, use specific versions, and take advantage of the features that the TG has hand-crafted to make your analysis pipelines as smooth as possible.
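For example, a minimal sketch of setting up an isolated Python environment on the cluster might look like the following; the module name and paths are assumptions, and the wiki pages above describe the recommended setup:

```bash
# Load a Python module, then create and activate a project-specific environment.
# "python/3.9" and the paths are placeholder assumptions.
module load python/3.9
python -m venv ~/envs/myproject        # create an isolated environment
source ~/envs/myproject/bin/activate   # activate it for this shell session
pip install numpy pandas               # install the packages your analysis needs
```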

Using Integrated Development Environments on the Cluster

There are two integrated development environments (IDEs) that you can use for coding, each with its own way of connecting to the cluster. In some cases this means you can run the program on your laptop and use the app's integrated terminal to connect to the cluster.

The two options currently available are:

  1. PyCharm – an IDE for Python projects. You can find more information about the IDE and how to use it on the cluster here. Our lab member Maartje Koot created a small tutorial on setting up PyCharm Pro to run Jupyter Notebooks on the cluster. You can find it here.

  2. VSCode – a cross-platform source-code editor made by Microsoft. It is a lightweight application, which makes it a good choice for laptops that do not have a lot of memory/RAM. This page contains information on connecting your local VSCode to the cluster server, and the developers of VSCode also provide official documentation on how to do this here. A small SSH setup sketch that serves both IDEs follows this list.
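Both IDEs connect to the cluster over SSH, so it is worth adding a host alias to your SSH configuration once and reusing it everywhere. A minimal sketch, assuming a placeholder hostname and username (substitute your own):

```bash
# Append a host alias to ~/.ssh/config so PyCharm, VSCode, and plain ssh can
# all reach the cluster as "cluster". The hostname and username below are
# placeholder assumptions, not the cluster's real address.
cat >> ~/.ssh/config <<'EOF'
Host cluster
    HostName cluster.example.org
    User yourusername
EOF
```

After this, `ssh cluster` works from a terminal, and both IDEs' remote-connection dialogs can use the same alias.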

Running Jupyter/iPython Notebooks Remotely

If you use iPython/Jupyter Notebooks for your data analysis, it is convenient to be able to edit and run them just as you would on your own computer, even when your data are stored on the cluster. The main advantage of this approach is that you can work with data on the cluster without rendering an entire virtual desktop through a VNC viewer (which might be too demanding).

The process may look involved at first, but after doing it a couple of times it becomes quick and easy, much like connecting to the cluster itself. To make things even better, you can use bash scripts to automate some of these steps; a minimal sketch follows the guide link below.

You can find a step-by-step guide here.
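As a minimal sketch of the idea behind that guide, assuming a placeholder port number and the `cluster` host alias from above:

```bash
# Step 1, on the cluster: start a notebook server without opening a browser.
jupyter notebook --no-browser --port=8888

# Step 2, on your own computer: forward a local port to the notebook server.
ssh -N -L 8888:localhost:8888 cluster

# Step 3: open http://localhost:8888 in your local browser and paste the
# token that the jupyter command printed in step 1.
```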

Data Analysis in Parallel

This section can be thought of as tips and tricks for a specific situation: you have collected data from multiple subjects and you want to run the same analysis on each subject's data independently. Using some of the resources above, especially a bit of bash, you can run all of these analyses in parallel instead of launching each script one by one.
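For instance, a minimal bash sketch of the idea, assuming a placeholder job script that reads a SUBJECT environment variable and placeholder subject labels (the exercises linked below cover the details per language):

```bash
# Submit one independent Torque job per subject so the analyses run in parallel.
# "analyze_subject.sh", the subject labels, and the walltime are placeholder
# assumptions; the job script is expected to read the SUBJECT variable.
for subj in sub-01 sub-02 sub-03; do
    qsub -N "analysis_${subj}" -v SUBJECT="${subj}" -l walltime=02:00:00 analyze_subject.sh
done
```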

This section of the HPC Wiki has a small exercise in Python, R, and MATLAB showing how you can do this in each of these software options.