
Computing tools help unlock new medicine

Effective utilisation of computing tools is critical to both AI and HPC research, with recent projects in both fields highlighting the scientific benefit of carefully managing resources. 

HPC and AI researchers are still finding surprising ways to innovate on existing research tools and practices through the use of computing technology. Research computing teams are making efficient use of HPC cycles to accelerate research into Covid-19, while new approaches to artificial intelligence (AI) are enabling emerging fields of research that would not otherwise be possible.

Recently, a distributed computing project, Folding@home (F@h), has been gaining a huge amount of support and publicity for its research into Covid-19. A tweet from the project in April highlighted that the combined computing power of the project totalled more than one exaflop – larger than the most powerful supercomputer in the world. Chris Coates, technological innovation lead at OCF, has been working with his colleagues to develop a set of instructions for the Slurm Workload Manager. These instructions allow HPC centre operators to donate unused computing power to the F@h project.

Coates explained the thought process behind setting up the F@h Slurm instructions for OCF customers. ‘This started out as a standard stress test to commission a system. At the time we were doing a performance run for HPL. Once we do our performance runs you would normally leave that running at 100 per cent utilisation for 72 hours.’

While these tests must be done to verify the stability of a newly installed system, there is a wasteful element to using these computing cycles for little scientific benefit.

Coates and his colleagues were exploring ways to use these resources more effectively. ‘We took that action upon ourselves to use this time and resources for good. Even if it is not part of a real project, what can we use those cycles for? F@h came up because a lot of us have been involved with the project in one way or another.’

F@h is a distributed computing project that simulates protein dynamics, including the process of protein folding and the movements of proteins implicated in a variety of diseases. The project enables people to contribute their time on their own personal computers to help run simulations. Insights from this data are helping scientists to better understand biology, and providing new opportunities for developing therapeutics.

The timing was fortunate, as the team behind F@h began switching their focus to Covid-19 around the same time that OCF were looking at how best to use these computing cycles. ‘It was an easy choice at that point,’ noted Coates.

OCF set out to write instructions for OCF customers, or anyone with an x86 Slurm cluster – although OCF were also using a Kubernetes instance. The instructions allow these research centres to donate any spare capacity to assist the F@h project in its efforts to simulate the protein dynamics of Covid-19.

In a blog post by OCF, Coates stated that ‘Spare capacity can be utilised when users are not using all HPC resources and any donation of clock cycles doesn’t need to impact on any current workloads. GPU capacity is the most sought after at this time, but all donated resources help.’
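OCF’s full recipe is not reproduced here, but the general idea can be sketched: run the Folding@home client as a very low-priority Slurm job, so it only soaks up cycles that real workloads are not using. The Python sketch below illustrates that approach; the partition name, FAHClient path and client flags are illustrative assumptions rather than OCF’s published instructions.

"""Minimal sketch of donating idle Slurm cycles to Folding@home.

This is not OCF's published recipe, just an illustration of the idea:
run the Folding@home client as a low-priority job so it only consumes
cycles that real workloads are not using. The partition name, FAHClient
path and client flags are assumptions; check your site's configuration
and the official F@h documentation before use.
"""
import subprocess
import textwrap

BATCH_SCRIPT = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=fah-donate
    #SBATCH --partition=gpu          # hypothetical partition name
    #SBATCH --nodes=1
    #SBATCH --time=24:00:00
    #SBATCH --nice=10000             # very low priority: real jobs go first

    # Run the Folding@home client for the lifetime of the job.
    # The user/team values below are placeholders, not real accounts.
    /opt/fah/FAHClient --user=Anonymous --team=0 --gpu=true
    """)

def submit_donation_job() -> str:
    """Write the batch script and submit it with sbatch, returning Slurm's reply."""
    with open("fah_donate.sbatch", "w") as handle:
        handle.write(BATCH_SCRIPT)
    result = subprocess.run(
        ["sbatch", "fah_donate.sbatch"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()  # e.g. 'Submitted batch job 12345'

if __name__ == "__main__":
    print(submit_donation_job())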

Maximising utilisation

What started out as a technical exercise for OCF staff looking to better utilise computing cycles has enabled researchers to donate their spare computing resources to help combat Covid-19. With many universities closed or only partially operational, it could be a good time to use spare computing resources in this manner.

Even if there are only a small number of critical applications running on a cluster, it may be impossible to cease all computing operations. While some centres may have smaller clusters or partitions that can be switched off, many modern systems rely on dual-socket servers which share power supplies. Equally, there may be integrated cooling solutions which make it difficult to run the cluster efficiently at less than full utilisation.

In many cases, putting those spare compute cycles towards scientific output is the most effective way of utilising the left-over capacity of a large cluster.

Coates noted that the work did not take long, because the team had already been looking at using the scheduler for idle tasks; it was just a case of switching that over to focus on F@h. ‘Off the back of that we decided that it would be a technical exercise for us to see if we could use all of the spare cycles on a cluster for our pre-defined workload.’

Proteins are molecular machines made of a linear chain of chemicals called amino acids that, in many cases, spontaneously ‘fold’ into compact, functional structures. Much like any other machine, it is how a protein’s components are arranged and moved that determines the protein’s function.

Viruses also have proteins that they use to suppress the immune system and reproduce themselves. To help tackle coronavirus, the F@h project is trying to determine how these viral proteins work and how therapeutics can be designed to stop them.

Coates also noted that the team chose to develop these instructions for the Slurm Workload Manager because of its popularity with research centres and universities. ‘A lot of our educational establishments use Slurm as a workload manager anyway and we have deployed quite a few clusters out there into the world with Slurm. It is quickly becoming somewhat of a standard for educational centres; they can expand it to their needs quickly because it is open source,’ added Coates. ‘Because we know that a lot of those educational sector customers have an underutilised cluster resource, it was another easy choice. We wanted to make sure that people could get the most out of this if they decided to jump onto it. We realised that a lot of universities are in semi-shutdown at the moment, so there was no point letting those cycles go to waste.’

Learning at a distance

Deep learning has long been known to be a powerful tool for research computing. However, one of the stumbling blocks has often been the availability of good-quality data. Without the requisite amount of high-quality data, a model will struggle to develop the required accuracy or insight.

Federated learning is one attempt to combat this challenge. Mona Flores, global head of medical AI at Nvidia, explains how federated learning enables AI research using decentralised data. ‘It means that if we have three hospitals, A, B and C, each one of us has our own dataset. We do not share these datasets, but the eventual model is learning from all of our data,’ said Flores. ‘Now people can keep their data and their intellectual property and they can have all of their privacy concerns addressed. You do not need to send any of the data back and forth.’
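Flores’s three-hospital description can be illustrated with a small federated-averaging round, in which each site trains on its own data and only the resulting model weights travel to a central server. The sketch below is a toy Python illustration under those assumptions; the hospitals, datasets and linear model are invented, and real platforms add secure communication, many more rounds and far richer models.

"""Toy illustration of the federated learning idea Flores describes.

Three 'hospitals' (A, B and C) each hold a private dataset. Only model
weights leave each site; the raw data never does. The data, the linear
model and the averaging rule are invented for illustration only.
"""
import numpy as np

rng = np.random.default_rng(0)

def make_private_dataset(n_patients):
    """Synthetic stand-in for one hospital's records (never shared)."""
    x = rng.normal(size=(n_patients, 5))
    true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
    y = x @ true_w + rng.normal(scale=0.1, size=n_patients)
    return x, y

def local_update(weights, x, y, lr=0.05, steps=50):
    """Each hospital refines the shared weights on its own private data."""
    w = weights.copy()
    for _ in range(steps):
        grad = 2 * x.T @ (x @ w - y) / len(y)   # gradient of mean squared error
        w -= lr * grad
    return w

# Hospitals 'A', 'B' and 'C' each keep their dataset locally.
hospitals = {name: make_private_dataset(n) for name, n in [("A", 200), ("B", 80), ("C", 50)]}

global_weights = np.zeros(5)
for _ in range(10):
    # Only the locally updated weights travel back to the server ...
    local_weights = [local_update(global_weights, x, y) for x, y in hospitals.values()]
    # ... which combines them into a new global model (a plain mean here).
    global_weights = np.mean(local_weights, axis=0)

print("Weights learned without pooling any raw data:", np.round(global_weights, 2))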

This is particularly useful for the sensitive data often found in healthcare, such as medical image scans. Because only model updates, and not patient data, need to be exchanged, hospitals, research centres and academic institutions can collaborate quickly through a centralised server hosted by Nvidia.

Federated learning could add real benefit to research, particularly in areas where AI is just emerging and a single organisation may not have enough data to train a model on its own. In this case, organisations can collaborate rather than being unable to complete the research, or having to share large-scale datasets containing private information.

‘Today, if I want to run deep learning on computerised tomography (CT) scans and I am studying a certain problem, I may only have 50 patient scans with this condition. You can imagine if I tried to train a model just using my 50 CT scans – depending on what the model is – I may not be able to get a very accurate result. But if I am learning not just on my own cases but on yours and on all the other available cases, now I have so much more data to train the model,’ stated Flores. ‘As you know, deep learning needs a lot of data, so now researchers can have a model that works better, at least at their own institution, through federated learning, as opposed to what they could have had just training on their own dataset.’

Social engineering or standardisation? 

While federated learning can solve key challenges in data availability for AI, it can also create problems of its own. Training a shared model in this way requires collaborators to think not only about which key parameters must be collected, but also about how the data will be collected and stored.

‘Federated learning is really just coming up and there is not just a single way of doing this. Even once you choose the specific way of doing federated learning there are many things that can change,’ stressed Flores.

‘What we are noticing now is that initially, when you start doing federated learning, there is a lot of social engineering that needs to happen. Being able to collect the data in the same manner in each place, for example. You may need to annotate the data, so you have to make sure all of the places can agree on what certain parameters mean.’

‘All of this stuff is happening today by what I call social engineering. Over time a lot of that will become automated, and this makes the experimentation and iteration much faster,’ added Flores.
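One common-sense way for collaborating sites to pin down this kind of agreement is to express it as a shared, machine-checkable schema that each record is validated against before training. The sketch below is a hypothetical Python illustration; every field name and allowed value is invented, and it is not a description of Nvidia’s tooling.

"""Hypothetical sketch of capturing the 'social engineering' as a schema.

Collaborating sites agree a minimal description of how each case is
recorded and annotated, then validate incoming records against it.
Every field name and allowed value here is invented for illustration.
"""
AGREED_SCHEMA = {
    "modality": {"CT"},                       # every site submits the same scan type
    "haemoglobin_unit": {"g/dL"},             # one agreed unit for blood counts
    "label": {"condition_present", "condition_absent"},  # shared annotation vocabulary
}

def validate_record(record: dict) -> list:
    """Return a list of schema violations for one record (empty if it conforms)."""
    problems = []
    for field, allowed in AGREED_SCHEMA.items():
        if record.get(field) not in allowed:
            problems.append(f"{field}={record.get(field)!r} not in {sorted(allowed)}")
    return problems

# A record that reports blood count in the wrong unit is flagged before training.
print(validate_record({"modality": "CT", "haemoglobin_unit": "hct_%", "label": "condition_present"}))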

There are many questions that are either developing over time or that must be addressed on a case-by-case basis by the organisations collecting the data. For example, how do you decide on and exchange the weights underlying the model?

‘There are many different variables and also the methodology to come up with a better model at the end,’ stated Flores. ‘How do you aggregate this data? There are many questions still left to answer.’
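One widely used answer to the aggregation question is federated averaging, in which the central server takes a dataset-size-weighted mean of each site’s locally updated weights. The short Python sketch below shows that rule in isolation; it is one possible choice, not necessarily the method used by any particular platform.

"""Sketch of one common aggregation rule: federated averaging.

The server combines each site's locally updated weights into a new
global model, weighting each contribution by the number of samples
that site trained on. This is one possible answer to the aggregation
question, not a description of any specific product.
"""
import numpy as np

def federated_average(site_weights, site_sizes):
    """Dataset-size-weighted mean of the sites' model weights."""
    sizes = np.asarray(site_sizes, dtype=float)
    stacked = np.stack(site_weights)                    # shape: (num_sites, num_params)
    return (stacked * (sizes / sizes.sum())[:, None]).sum(axis=0)

# Three sites holding 200, 80 and 50 cases respectively.
sites = [np.array([1.0, 2.0]), np.array([0.8, 2.2]), np.array([1.2, 1.8])]
print(federated_average(sites, [200, 80, 50]))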

‘This is a field that is just starting and it is going to continue to emerge and get better and more efficient and will become much easier for people to do. It really depends on what you are studying,’ noted Flores.

While certain parameters must be agreed upon to ensure the datasets can be combined into a single shared model, some heterogeneity in the data can be beneficial.

Small differences, such as different scanners providing the same type of image, or two populations that vary in age or other demographics, can help the model generalise and provide more varied insight for the researchers creating it.

‘If you are doing something where you need to collect blood count in terms of haemoglobin and someone reports it in terms of haematocrit, then it is not going to make sense,’ said Flores. ‘The old adage of garbage in, garbage out still exists. You definitely need to have some sort of standardisation. Having said that, you can use instruments from multiple vendors as long as the characteristics of the scan and the setup parameters can be matched up.’

‘The deep learning model can correct for a certain amount of noise in the image relating to differences in the make and model of a scanner for instance,’ added Flores. ‘To the extent that you have lots of heterogeneous data, that can actually make the model more robust.’ Using this approach it would theoretically be possible to ‘generalise the model to someone in Spain even though we have different scanners and we do things slightly differently in the US,’ Flores noted.
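A common preprocessing step that helps in this situation is to normalise each scan’s intensities before training, so that vendor- and scanner-specific differences contribute less noise than the anatomy itself. The sketch below is a generic Python illustration of that idea, not Nvidia’s pipeline; the clipping window values are illustrative only.

"""Generic sketch of per-scan intensity normalisation before training.

Clipping each scan to a fixed intensity window and then z-scoring it is
one common way to reduce vendor- and scanner-specific differences before
images from many sites feed a shared model. The window values below are
illustrative, not a clinical recommendation.
"""
import numpy as np

def normalise_scan(scan, low=-1000.0, high=400.0):
    """Clip intensities to [low, high], then z-score the result."""
    clipped = np.clip(scan.astype(np.float32), low, high)
    return (clipped - clipped.mean()) / (clipped.std() + 1e-6)

# Two fake 'scans' with different offsets and spreads end up on a comparable scale.
scan_a = np.random.default_rng(1).normal(loc=40.0, scale=300.0, size=(64, 64))
scan_b = np.random.default_rng(2).normal(loc=10.0, scale=250.0, size=(64, 64))
print(normalise_scan(scan_a).mean(), normalise_scan(scan_b).mean())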

While deep learning and AI are still in their infancy in many research areas, it is important that data sources can be made available. Even more important is that standards and methods exist to ensure that data is usable and accessible, so that it is not siloed away but available for researchers to use effectively.

‘Deep learning needs a large dataset and this is really what is making AI, and specifically deep learning, more common these days,’ stated Flores. ‘We have had DL algorithms since the 1950s, but only now are you hearing about them being used in clinical medicine, and that is because we have an abundance of data that was not available before.’

 

