Building a 3.2M vCPU Supercomputer in the Public Cloud

In our latest interview, Niall discusses how we recently built a 3.2M vCPU supercomputer in the public cloud using Amazon Web Services (AWS). Watch the video to learn more.

What Was the Challenge We Were Trying to Overcome With This Run?

We built this very large cluster to perform virtual screening for a life sciences company.

Drug companies use virtual screening as an initial stage in developing new medicines or therapeutics. The aim is to avoid spending multiple years, at considerable cost, on the wrong compound. By taking the candidate protein and seeing how well it docks in a virtual space, companies can tell whether they are on the right track before they proceed with physical testing. The idea is to search as widely as you can in publicly available databases, to see how many hits you get.

The more hits you get at this stage, the more likely you are to succeed as you move from the virtual world to the physical world, where chemical analyses and testing begin.

What Did We Do to Perform This Run?

We launched 46,733 Spot Instances, using 24 Amazon EC2 Fleets and 8 different instance types. We spread the machines across two AWS regions, in North America and Europe. We used Amazon S3, AWS’s object storage service, both to seed the compounds and to store the results.
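To put those figures in proportion, here is a quick back-of-the-envelope calculation in Python. It uses only the numbers quoted above; because the 3.2M vCPU headline is rounded, the results are approximate averages rather than a breakdown of the actual fleet (the variable names are ours, for illustration).

```python
# Figures quoted in the interview; the 3.2M headline is a rounded number.
TOTAL_VCPUS = 3_200_000
SPOT_INSTANCES = 46_733
EC2_FLEETS = 24

# Simple averages only: the run mixed 8 instance types, so real
# instances will have been both larger and smaller than this.
avg_vcpus_per_instance = TOTAL_VCPUS / SPOT_INSTANCES
avg_instances_per_fleet = SPOT_INSTANCES / EC2_FLEETS

print(f"~{avg_vcpus_per_instance:.1f} vCPUs per instance on average")
print(f"~{avg_instances_per_fleet:.0f} instances per fleet on average")
```

In other words, the average machine in the run had somewhere around 68 vCPUs, and each EC2 Fleet was responsible for a little under two thousand instances.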

What Was Unique About What We Did?

I think there were multiple facets to the uniqueness of the run. A lot of it boils down to the way the YellowDog Platform is uniquely architected.

Our workload manager, or scheduler, is in constant communication with our worker threads. Each thread reports its capabilities and status to the scheduler, answering questions such as “Am I up and running?” and “What am I doing at this moment?”

Alongside our workload management capabilities, we have our compute provisioning. We can use a really wide range of compute, across many different data centres, geographical regions, and instance lifecycle types (e.g. on-demand, Spot, reserved and on-premise instances).

What is unique is the abstraction between those worker threads and the compute. Despite the different varieties of instances we can provision, our scheduler just sees workers, or worker threads. In this particular case, it sees over 46,000 worker threads and goes “OK, these are available to me to execute this virtual screening process.” All we need to do is dispatch workload to these threads, as they register with the Platform.
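As a rough illustration of that abstraction, the following Python sketch shows a toy scheduler that hands out work as workers register, without ever looking at what kind of machine a worker runs on. This is not the YellowDog Platform's API; the class names, method names and task labels are all hypothetical and chosen only to show the idea.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Worker:
    """One worker thread; the scheduler never inspects where it runs."""
    worker_id: str
    busy: bool = False


@dataclass
class Scheduler:
    """Toy scheduler (hypothetical) that dispatches tasks on registration."""
    queue: deque = field(default_factory=deque)
    assignments: dict = field(default_factory=dict)  # worker_id -> task

    def register(self, worker: Worker):
        # Registration is the only signal used here: a worker on a Spot
        # Instance looks exactly like one on an on-premise host.
        if self.queue and not worker.busy:
            task = self.queue.popleft()
            worker.busy = True
            self.assignments[worker.worker_id] = task
            return task
        return None  # nothing queued, so the worker stays idle


scheduler = Scheduler(queue=deque(["compound-batch-1", "compound-batch-2"]))
first = scheduler.register(Worker("spot-us-0"))
second = scheduler.register(Worker("on-prem-eu-0"))
third = scheduler.register(Worker("spot-eu-0"))  # queue is empty by now
```

The point of the sketch is that the dispatch logic has no branch for instance type, region or lifecycle: anything that registers is simply a worker.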

So, from my perspective, that is what’s unique: the ability to effectively use volunteer or commodity compute to get a huge amount of workload done in a very short period of time.

Did We Have Any Cost Mitigations in Place for This Run?

There are several aspects to cost. The first mitigation is to use Spot Instances. In this particular run, we only used Spot Instances.

The second mitigation is to make sure that when you’re scaling up machines, they are working and fully utilised as soon as possible. So, for this run, as soon as we spun up machines we deployed the workers and gave them work. That way, they were working at maximum efficiency from the moment the cloud provider made them available.

The same efficiency must be applied when deprovisioning instances. What we do at YellowDog is bring machines down when there is no further work for them, so that the number of running machines matches the outstanding workload queue quite precisely.

What you’re after is a nice slope up and slope down of instances coming online and offline, as work is completed.
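One simple way to picture that matching is a function that works out how many instances the outstanding queue still justifies, and releases the rest. The Python below is a minimal sketch under our own assumptions (one task per worker at a time, a fixed number of workers per instance, and a cap on fleet size); the function and parameter names are hypothetical and this is not how the YellowDog scheduler is implemented.

```python
import math


def target_instances(outstanding_tasks, workers_per_instance, max_instances):
    """Instances needed so that worker capacity matches the queue."""
    if outstanding_tasks <= 0:
        return 0
    needed = math.ceil(outstanding_tasks / workers_per_instance)
    return min(needed, max_instances)


def instances_to_release(running, outstanding_tasks,
                         workers_per_instance, max_instances):
    """How many running instances have no remaining work to justify them."""
    target = target_instances(outstanding_tasks, workers_per_instance,
                              max_instances)
    return max(0, running - target)


# Near the end of a run: 1,000 tasks left, 64 workers per instance.
needed_now = target_instances(1_000, 64, max_instances=100)
surplus = instances_to_release(20, 1_000, 64, max_instances=100)
```

Calling this repeatedly as the queue drains gives the slope down: each pass releases only the machines the remaining work no longer needs.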

Are There Risks Associated With Using Spot Instances?

I think using Spot Instances is a case of ‘horses for courses’. For some workloads you do need machines that are available for the entire time you require them. With Spot, you always have to be aware that the cloud provider may take those machines back.

What we do at YellowDog, based on the way our scheduler works, is keep an inventory of all our workers. So, we monitor the workload queue and if we detect workers are lost, we re-queue that work and ask for more machines if necessary.

The key thing when using preemptible or Spot Instances is being able to recover from failures automatically. Not only does your workload need to be recoverable, but the system distributing tasks to these instances also needs to detect failures, so that it can redistribute the tasks without intervention. At this scale, doing that manually would be a lot of work, and it would increase your operational risk and ultimately your costs.
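The recovery loop described above can be sketched in a few lines of Python. How lost workers are actually detected is not covered in the interview, so this example assumes a heartbeat timeout; the function name, the data structures and the 30-second timeout are all hypothetical.

```python
from collections import deque


def requeue_lost_work(queue, assignments, last_seen, now, timeout_s):
    """Put tasks from workers with stale heartbeats back on the queue.

    Assumption: a worker is treated as lost once its last heartbeat is
    older than timeout_s seconds (for example, after a Spot reclaim).
    """
    lost = [w for w, seen in last_seen.items() if now - seen > timeout_s]
    for worker_id in lost:
        task = assignments.pop(worker_id, None)
        if task is not None:
            queue.appendleft(task)  # retry reclaimed work first
        del last_seen[worker_id]
    return lost


queue = deque(["t3"])
assignments = {"w1": "t1", "w2": "t2"}
last_seen = {"w1": 100.0, "w2": 40.0}  # seconds; w2 has gone quiet

lost = requeue_lost_work(queue, assignments, last_seen,
                         now=105.0, timeout_s=30.0)
```

Here the task held by the silent worker goes back to the front of the queue with no operator involved, and the length of the queue can then drive a request for replacement machines.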

What’s the Significance for Drug Discovery Companies?

As mentioned above, the key thing with virtual screening is the number of hits you can get. The wider you can search the compound space, the better your chances of success as you move through the process, because you’ve already done a lot of the work virtually.

In this run, we virtually screened hundreds of millions of compounds in an hour. By comparison, with on-premise infrastructure you’re talking about a million compounds in a year, which means you have to be very precise with your search space. In the cloud, you can search a lot more widely and pick up compounds that you might not otherwise have looked at.
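To get a feel for the gap between those two rates, here is a rough calculation in Python. It takes 100 million compounds per hour as a conservative floor for “hundreds of millions” and assumes the on-premise grid runs every hour of the year; both are our assumptions for illustration, not measured figures.

```python
# Round figures from the interview; the cloud rate is a deliberate floor.
CLOUD_COMPOUNDS_PER_HOUR = 100_000_000   # assumed lower bound
ONPREM_COMPOUNDS_PER_YEAR = 1_000_000
HOURS_PER_YEAR = 365 * 24                # assumes round-the-clock use

# How many times faster the cloud run screens compounds, hour for hour.
speedup = (CLOUD_COMPOUNDS_PER_HOUR * HOURS_PER_YEAR
           // ONPREM_COMPOUNDS_PER_YEAR)

print(f"At least {speedup:,}x the on-premise screening rate")
```

Even with the conservative floor, the difference is several hundred thousand times, which is why the search space no longer has to be narrowed so tightly up front.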

Is This Technology Beneficial to Smaller Companies?

This technology can be used by companies of all sizes. The benefit for smaller companies is that if you’re doing one, two or three of these virtual screening runs a year, you do not have to go out and purchase a massive grid to do so. The cloud makes this very easy. With YellowDog, you can bring up the compute you need at scale and at speed, to get the job done and move on with the rest of the process.

A blog post about this run is also available on the AWS website.
