Working with researchers to reduce load on scheduler #500

wwarriner · 2023-03-11T01:15:14Z

What would you like to see added?

Case study draft about collaborating with a research group to improve system stability. We used sdiag to manage scheduler stability, and snuck in some education as well. Written with a lay audience in mind.

Ensuring a consistent, low-friction researcher experience is one of our goals. Recently we learned about features of our cluster management software that allowed us to prevent a potential outage.

One of the features of Cheaha is a scheduler which allocates resources for researchers' data analysis requests in an orderly and fair manner. The scheduler is a type of server software called a daemon, which runs in the background on our login node, silently processing requests from researchers and staff. If too many requests arrive at one time, the scheduler can become overwhelmed and stop working. This causes an outage, impacting the ability of researchers to perform data analysis.

The scheduler has a diagnostic tool called, appropriately, sdiag. This tool reports how many requests are being made, what types of requests they are, and which researchers are making them. The first time we used sdiag, we found that one researcher had made about 27.5 million requests in the span of 18 hours, or close to 425 requests per second. The researcher with the second-highest count had made about 200,000 requests in the same time span. Looking further, we found that the requests were for information on the status of the nodes in our cluster.
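To make the scale of those numbers concrete, here is a small Python sketch of the arithmetic. The function name is made up for illustration; only the request counts and the 18-hour window come from the figures above.

```python
def requests_per_second(total_requests, hours):
    """Average request rate over a window given in hours."""
    return total_requests / (hours * 3600)


# Figures reported above: 27.5 million and 200,000 requests over 18 hours.
heaviest = requests_per_second(27_500_000, 18)
second_highest = requests_per_second(200_000, 18)

print(f"heaviest: {heaviest:.0f} requests/s")
print(f"second highest: {second_highest:.1f} requests/s")
```

The heaviest rate works out to a little over 424 requests per second, while the second-highest researcher averaged only about 3 per second, a gap of more than two orders of magnitude.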

We reached out to the researcher and their graduate student to investigate the cause. Together we determined that the cause was a loop in a script. The loop called another scheduler tool, squeue, which reports the status of jobs in the scheduler. The script checked the scheduler to decide when to submit additional jobs out of a very large batch. However, there was no pause between loop iterations, so squeue was being called as fast as the code could execute. This was almost certainly the source of the 425 requests per second. The graduate student inserted a pause into the loop following the check.
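The pattern described above, checking the scheduler and then pausing before checking again, might look something like the following Python sketch. This is not the research group's script: we do not say here what language it was written in, and the function names, the job limit, and the 60-second pause are all assumptions chosen for illustration. The only real command used is squeue, with its `-h` (no header) and `-u` (filter by user) options.

```python
import subprocess
import time


def count_queued_jobs(user):
    """Ask the scheduler once how many job entries squeue lists for a user."""
    # -h drops the header line and -u filters by user, so each remaining
    # non-empty line is one entry in the queue listing.
    result = subprocess.run(
        ["squeue", "-h", "-u", user],
        capture_output=True,
        text=True,
        check=True,
    )
    return sum(1 for line in result.stdout.splitlines() if line.strip())


def wait_for_capacity(count_jobs, max_jobs, pause_seconds=60, sleep=time.sleep):
    """Block until fewer than max_jobs are queued.

    Returns the number of scheduler queries that were made.
    """
    queries = 0
    while True:
        queries += 1
        if count_jobs() < max_jobs:
            return queries
        # The pause is the fix: without it, this loop would query the
        # scheduler as fast as the code can run.
        sleep(pause_seconds)
```

A submission script could call `wait_for_capacity(lambda: count_queued_jobs("someuser"), max_jobs=100)` before submitting each new job, where `someuser` and `100` are placeholders. With a 60-second pause, a loop like this makes at most about 1,080 queries in 18 hours, instead of millions.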

Ten days later, when the existing batch of jobs had all run through, sdiag no longer showed any researchers with more than 200,000 requests over an 18-hour span.

Research Computing takes pride in educating our users on best practices, and we take every opportunity to leave things better than we found them. This case was no exception. After a brief discussion of the purpose of the script, we determined that their data analysis would be easier to manage, and more readily reproducible, if they migrated to workflow management software. That would enable the research team to focus on the higher-level structure of their data flow, rather than on lower-level details like checking the scheduler status.

Everyone benefited from the experience. Research Computing found utility in a new-to-us diagnostic tool for ensuring system stability and uptime, allowing us to better serve the UAB research community. We also brought the scheduler request frequency down, decreasing the risk of an outage. The research team we worked with learned more about interacting with our systems and about the benefits of workflow management software moving forward.

wwarriner added feat: case study Case studies and success stories fabric: cheaha Docs related to Cheaha platform labels Mar 11, 2023