
Getting the most out of your supercomputer

Speed read
  • It’s difficult to ensure that all components of a supercomputer operate efficiently
  • XDMoD tool measures quality of service, system, and user job performance
  • Future improvements will also track cloud-based metrics

As the name implies, supercomputers are pretty special machines. Researchers from every field seek out their high-performance capabilities, but time spent using such a device is expensive. As recently as 2015, it took the same amount of energy to run Tianhe-2, the world’s second-fastest supercomputer, for a year as it did to power a 13,501-person town in Mississippi.

And that’s not to mention the initial purchase cost, as well as salaries for the staff who run and support the machine. Supercomputers are kept incredibly busy by their users and are often oversubscribed, with thousands of jobs in the queue waiting for others to finish.

<strong>Supercomputing time</strong> is incredibly valuable for today’s researchers, who use high-performance computing for everything from climate science to investigating the origins of autism to predicting fire intensity and growth, shown here. Courtesy University at Buffalo.

With computing time so valuable, managers of supercomputing centers are always looking for ways to improve performance and speed throughput for users. This is where Tom Furlani and his team at the University at Buffalo’s Center for Computational Research come in.

Thanks to a grant from the National Science Foundation (NSF) in 2010, Furlani and his colleagues have developed the XD Metrics on Demand (XDMoD) tool to help organizations improve production on their supercomputers and better understand how those systems are being used to enable science and engineering.

"XDMoD is an incredibly useful tool that allows us not only to monitor and report on the resources we allocate, but also provides new insight into the behaviors of our researcher community," says John Towns, PI and Project Director for the Extreme Science and Engineering Discovery Environment (XSEDE).

Canary in the coal mine

Modern supercomputers are complex combinations of compute servers, high-speed networks, and high-performance storage systems. Each of these areas is a potential point of underperformance or even outright failure. Add system software and the complexity only increases.

<strong>Canary in a coal mine.</strong> Without an effective way to monitor HPC systems, users often end up being the first to alert administrators to a problem, much like the birds carried into 19th-century mines to detect the presence of dangerous gases. Courtesy US Department of Mines.

With so much that can go wrong, a tool that can identify problems or poor performance as well as monitor overall usage is vital. XDMoD aims to fulfill that role by performing three functions:

1. Job accounting – XDMoD provides metrics about utilization, including who is using the system and how much, what types of jobs are running, plus length of wait times, and more. 

2. Quality of service – The complex mechanisms behind HPC often mean that managers and support personnel don’t always know if everything is working correctly—or they lack the means to ensure that it is. All too often this results in users serving as “canaries in the coal mine” who identify and alert admins only after they’ve discovered an issue. 

To solve this, XDMoD launches application kernels daily that provide baseline performances for the cluster in question. If these kernels show that something that should take 30 seconds is now taking 120, support personnel know they need to investigate. XDMoD’s monitoring of the Meltdown and Spectre patches is a perfect example—the application kernels allowed system personnel to quantify the effects of the patches put in place to mitigate the chip vulnerabilities.

3. Job-level performance – Much like job accounting, job-level performance zeroes in on usage metrics. However, this task focuses more on how well users' codes are performing. XDMoD can measure the performance of every single job, helping users to improve the efficiency of their job or even figure out why it failed.
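The quality-of-service check described above, where a kernel that should take 30 seconds suddenly takes 120, can be illustrated with a minimal sketch. This is not XDMoD’s actual implementation: the kernel names, the runtimes, the `flag_regressions` helper, and the 2x slowdown threshold are all assumptions made up for illustration.

```python
from statistics import median


def flag_regressions(baseline_runs, latest_run, slowdown_factor=2.0):
    """Return kernels whose latest runtime is at least slowdown_factor
    times the median of their historical runtimes (hypothetical rule)."""
    flagged = {}
    for kernel, history in baseline_runs.items():
        # Skip kernels with no history or no fresh measurement.
        if kernel not in latest_run or not history:
            continue
        baseline = median(history)
        ratio = latest_run[kernel] / baseline
        if ratio >= slowdown_factor:
            flagged[kernel] = round(ratio, 2)
    return flagged


# Hypothetical daily runtimes in seconds for two made-up kernels.
baseline_runs = {
    "io_benchmark": [29.0, 30.0, 31.0],
    "mpi_latency": [12.0, 12.5, 11.5],
}
latest_run = {"io_benchmark": 120.0, "mpi_latency": 12.2}

flagged = flag_regressions(baseline_runs, latest_run)
print(flagged)  # {'io_benchmark': 4.0}
```

Comparing against a median of past runs, rather than a single earlier run, is one simple way to keep an occasional noisy measurement from triggering a false alarm.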

Furlani also expects that XDMoD will soon include a module to help quantify the return on investment (ROI) for these expensive systems by tying users’ supercomputer usage to their external research funding.

Thanks to its open-source code, XDMoD’s reach extends to commercial, governmental, and academic supercomputing centers worldwide, including centers in England, Spain, Belgium, Germany, and many other countries.

Future features

In 2015, the NSF awarded the University at Buffalo a follow-on grant to continue work on XDMoD. Among other improvements, the project will add cloud computing metrics. Cloud use is growing all the time, and jobs run there call for quite different metrics.

<strong>Who’s that user?</strong> XDMoD’s customizable reports help organizations better understand how their computing resources are being used to enable science and engineering. This graph depicts the allocation of resources delivered by supporting funding agency. Courtesy University at Buffalo.

For the average HPC job, Furlani explains that the process starts with a researcher requesting resources, such as how many processors and how much memory they need. But in the cloud, a virtual machine may stop running and then start again. What’s more, a cloud-based supercomputer can increase and decrease cores and memory. This makes tracking performance more challenging. 

“Cloud computing has a beginning, but it doesn’t necessarily have a specific end,” Furlani says. “We have to restructure XDMoD’s entire backend data warehouse to accommodate that.”
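To see why this changes the accounting, consider a rough sketch in which a virtual machine’s usage is a series of intervals, each with its own core count, and the last interval may still be open. The `VmInterval` record, the `core_hours` helper, and the sample figures are hypothetical and say nothing about how XDMoD’s data warehouse is actually structured.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VmInterval:
    """One stretch of time during which a VM ran at a fixed size."""
    start_hour: float
    end_hour: Optional[float]  # None means the VM is still running
    cores: int


def core_hours(intervals: List[VmInterval], now_hour: float) -> float:
    """Sum core-hours over all intervals, closing any open one at now_hour."""
    total = 0.0
    for iv in intervals:
        end = now_hour if iv.end_hour is None else iv.end_hour
        total += max(0.0, end - iv.start_hour) * iv.cores
    return total


# Hypothetical VM: runs on 2 cores, stops, restarts on 8 cores,
# stops again, then restarts on 4 cores and is still running.
vm_history = [
    VmInterval(start_hour=0.0, end_hour=4.0, cores=2),
    VmInterval(start_hour=10.0, end_hour=12.0, cores=8),
    VmInterval(start_hour=20.0, end_hour=None, cores=4),
]

usage = core_hours(vm_history, now_hour=24.0)
print(usage)  # 40.0
```

A traditional batch job would need only one such record with a known end time. Here the total depends on when the question is asked, which hints at why an open-ended workload is harder to warehouse.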

Regardless of where XDMoD goes next, tools like this will continue to shape and redefine what supercomputers can accomplish. 

 


Copyright © 2018 Science Node ™

Disclaimer: While Science Node ™ does its best to provide complete and up-to-date information, it does not warrant that the information is error-free and disclaims all liability with respect to results from the use of the information.
