By Dava Stewart
At one time, being in IT and running a mainframe meant being a jack of all trades, and that included understanding capacity planning and performance management, says Harry Batten, executive solution architect, IBM Expert Labs. In the modern environment, though, specialization has made such a breadth of knowledge less common. He describes the change like this:
“If I go back in the past, MIPS were incredibly important—when you were running a box that had 50 MIPS on it, as opposed to the z17s nowadays that basically have hundreds of thousands of MIPS on it.”
Having what amounts to unlimited capacity naturally leads to paying less attention to usage and performance. And as capacity has grown, the tools for managing performance have improved, so there’s less training and less routine monitoring in general. Those two factors have all but eliminated the need for a dedicated capacity planning and performance management role in most companies, but that gap can be problematic. To understand why it matters, consider a familiar scenario.

A fair comparison can be made between managing a household’s electric bill and an organization’s capacity planning and performance management. Your electricity usage is essentially unlimited, but it comes at a cost, just as computing capacity does.
Your electric meter measures how much power your household consumes, and your bill is based (in part) on that usage. Similarly, mainframe users pay software costs based on a four-hour rolling average (4HRA). In both cases, you pay for what you consume rather than buying a set amount.
Another commonality is that both your software and your electrical usage have peak usage times. In some places, electricity is billed at a higher rate during overall peak times, and in business, peak times affect the 4HRA dramatically, even if the peak only lasts a few minutes.
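To make that arithmetic concrete, the short sketch below uses hypothetical MSU readings sampled every 15 minutes. It is only an illustration of the rolling-average idea, not an actual billing calculation, which is driven by SMF data.

```python
# Minimal sketch of a four-hour rolling average (4HRA) over hypothetical
# MSU readings sampled every 15 minutes (16 samples = 4 hours).
# Illustration only -- real sub-capacity billing is driven by SMF data.

def rolling_4hra(msu_samples, window=16):
    """Average of every full 4-hour span of 15-minute samples."""
    return [sum(msu_samples[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(msu_samples))]

flat_day = [200] * 96                 # a steady 200-MSU day, one sample per 15 minutes
spiky_day = list(flat_day)
spiky_day[40] = spiky_day[41] = 600   # one 30-minute spike to 600 MSU

print(max(rolling_4hra(flat_day)))    # 200.0
print(max(rolling_4hra(spiky_day)))   # 250.0 -- a half-hour spike raises the peak 4HRA by 25%
```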


A family that decides to work on lowering their electricity bill would first need to understand how much electricity they were currently using. Then, they could begin to narrow down when and why it’s being consumed. In a place where prices vary depending on peak usage, the family might be able to spread out electricity use by doing things like running the dishwasher overnight.
This kind of analysis is analogous to capacity planning and results in a baseline and a goal. At that point, performance management can begin. Organizations need this same process of establishing baselines, understanding usage patterns and planning for future capacity needs. The benefits of capacity planning include predictable costs, improved resiliency and better planning for future upgrades.
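As a rough sketch of what establishing that baseline can look like, the example below groups hypothetical interval data by hour of day to show when the peaks land and which quiet hours could absorb deferred work. It is not the output of any particular monitoring product.

```python
from collections import defaultdict

# Hypothetical interval records of (hour_of_day, msu_consumed), the kind of
# data a shop would pull from its monitoring history to build a baseline.
samples = [(2, 120), (2, 105), (9, 310), (9, 295), (14, 480), (14, 510), (23, 140)]

by_hour = defaultdict(list)
for hour, msu in samples:
    by_hour[hour].append(msu)

# Average consumption per hour of day: the usage profile, i.e. the baseline.
profile = {hour: sum(vals) / len(vals) for hour, vals in sorted(by_hour.items())}
peak_hour = max(profile, key=profile.get)
quiet_hour = min(profile, key=profile.get)

print(profile)
print(f"Peak at {peak_hour}:00, quietest at {quiet_hour}:00 -- "
      "a candidate window for deferred work, like running the dishwasher overnight")
```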
However, without someone dedicated to monitoring capacity and usage in an organization, Batten says that performance management often becomes a “problem determination type of exercise.” In other words, instead of paying attention to how the box is performing over time and planning for upgrades, teams are discovering problems only after they occur.




In addition to the nearly unlimited capacity of modern mainframes and the fact that better tools lead to less attention, Batten notes that modern programming languages come with a trade-off—capacity usage.
“Everyone raves about these modern languages,” says Batten, “but they use a lot more CPU cycles. The simpler a language becomes to write, the more CPU cycles you need to [run] it.” He says that the further away you get from machine code, the more instructions it takes to do something.
Understanding the workload is the key to managing performance. Although it’s possible to offload work from general-purpose processors and lower software costs, without understanding the workload it’s impossible to predict the impact of changes. If you don’t have a baseline measurement of usage, trying to understand how new programs will affect performance is an exercise in frustration.
Once you have that baseline, Batten says, “you can go back and do a deeper dive. All the tools are there to do really deep dives into capacity and come up with a solution.”
A bedrock feature of mainframe systems that can help is the z/OS Workload Manager (WLM), which has effectively functioned like AI without a large language model beneath it for a long time, according to Batten. “Workload Manager is the glue that holds mainframes together. It distributes the work based on priority, availability of capacity and so on, and it’s had machine learning in it from day one,” he says. The system learns as it goes along.
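A toy illustration of that general idea, and emphatically not of how WLM is actually implemented or configured, might distribute competing work by importance within whatever capacity happens to be available:

```python
from dataclasses import dataclass

@dataclass
class WorkUnit:
    name: str
    importance: int   # 1 = most important; lower numbers win, loosely echoing importance levels
    msu_needed: int   # hypothetical capacity the work would consume

def dispatch(queue, available_msu):
    """Toy dispatcher: serve the most important work first, within the
    capacity that is actually available right now. Not WLM internals."""
    scheduled, deferred = [], []
    for unit in sorted(queue, key=lambda u: u.importance):
        if unit.msu_needed <= available_msu:
            scheduled.append(unit.name)
            available_msu -= unit.msu_needed
        else:
            deferred.append(unit.name)
    return scheduled, deferred

queue = [WorkUnit("online-transactions", 1, 120),
         WorkUnit("nightly-batch", 3, 200),
         WorkUnit("ad-hoc-reporting", 5, 150)]
print(dispatch(queue, available_msu=350))
# (['online-transactions', 'nightly-batch'], ['ad-hoc-reporting'])
```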
“[Workload Manager] effectively says, ‘Hey, I know that when this happens at this particular time, I get a spike in my CPU and I handle it accordingly and I learn how to do that,’” says Batten, noting that it’s very similar to the way generative AI works. He expects that the future will bring more real-time performance monitoring.
“Right now you have products that have dashboards, and you can set up parameters so that when performance goes above a certain level, the dashboard turns yellow, and if it goes higher, it turns red, then you can draw down [capacity usage],” says Batten. He thinks AI will be able to handle the problem automatically, and in some organizations it already does.
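A bare-bones version of the threshold logic Batten describes might look like the following; the percentages and the cap_capacity hook are hypothetical placeholders, not any product’s actual API.

```python
# Sketch of threshold-based monitoring in the spirit Batten describes: yellow
# warns a person, red triggers an automated response such as soft capping.
# The thresholds and the cap_capacity hook are hypothetical placeholders.
YELLOW_PCT = 80
RED_PCT = 95

def check_utilization(used_msu, entitled_msu, cap_capacity):
    pct = 100 * used_msu / entitled_msu
    if pct >= RED_PCT:
        cap_capacity(pct)     # act automatically instead of waiting on a person
        return "red"
    if pct >= YELLOW_PCT:
        return "yellow"       # surface a warning on the dashboard
    return "green"

# 970 of 1,000 entitled MSUs in use: status goes red and the cap hook fires.
status = check_utilization(970, 1000, cap_capacity=lambda pct: print(f"soft cap at {pct:.0f}% busy"))
print(status)
```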
The more that systems can handle without human interaction, the smoother things will go, particularly during times of crisis. “I always look at capacity as a component of resilience,” says Batten. “You’ve got to have enough capacity for recovery if something goes wrong.”
The more of capacity planning and performance management AI handles, the less likely mistakes are to happen and the better disaster recovery plans become, notes Batten. “In a crisis, the first thing you lose are your people,” he says. People are more interested in looking after their families than in keeping computing systems up and running. Plus, people with spreadsheets are more prone to mistakes than automated systems.
However, even with the improvements modern technology is bringing around capacity and performance management, Batten says there are still ways to better project CPU usage. When a programmer has a stellar new program that’s been tested and works well, they often can’t say how much capacity it will use. “We’ll just stick it into production and see how it goes,” says Batten, “and then usage jumps 20% when the program is running. All that work could have been done proactively.”
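That proactive work can start as a back-of-the-envelope projection: scale the CPU cost measured in test by the expected production volume and compare it against the baseline peak. All of the numbers in this sketch are hypothetical.

```python
# Hypothetical proactive sizing: project production CPU for a new program from
# test measurements before it ever reaches production. All numbers are made up.
cpu_seconds_per_txn_in_test = 0.004          # measured while testing the program
expected_txns_per_hour = 720_000             # forecast production volume
baseline_peak_cpu_seconds_per_hour = 14_400  # today's peak-hour consumption

projected_new_load = cpu_seconds_per_txn_in_test * expected_txns_per_hour  # 2,880 CPU-sec/hour
increase_pct = 100 * projected_new_load / baseline_peak_cpu_seconds_per_hour

print(f"Projected increase at peak: {increase_pct:.0f}%")  # 20% -- known before go-live, not after
```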
The mainframe is reliable because of the policies in place, and those policies are being eroded. Taking a proactive stance on capacity planning and performance management, adopting new tools to automate bringing on capacity or soft capping, and using Workload Manager and automation to analyze and predict usage can all help maintain the reliability of critical systems.