When Performance is Everything

What was so unusual and special about Cray Research? We accomplished, repeatedly, what nobody else on the planet was able to match. We worked in a high-stakes environment, and decades of operational success verified our methods.

High-Performance Computing (HPC) is not merely a branch of computer science characterized by parallelism and large-scale systems. That is the public-facing decoy version. HPC characterizes the problem space where wrong or insufficient decisions mean large numbers of people might die.

My role is to transmit our tradecraft as practiced within Cray Research, in the context of our high-stakes operational environment. As part of that transmission, we will demonstrate this tradecraft by treating modern AI models as HPC systems, showing that our methods remain important today.

Now that I have (perhaps shockingly) told you what HPC is not, I can show you what it is.

Three Questions

High-Performance Computing is organized around three questions:

  1. What, or how much, improvement in computing system performance is needed to enable new capabilities currently impossible to consider?
  2. Given a machine with dramatically improved performance, what can we get this machine to do that has been impossible to accomplish, even in part?
  3. What problem could save many lives if we could solve it, but whose solution requires a capability we do not have? What must that enabling capability look like? (We always treat this as a two-step question because the customer asks the first part and we as the vendor answer the second.)

At Cray Research we focused on the first question: creating computing system capabilities that enable solutions not currently possible, even in part. Our customers, both classified and non-classified, came to us because of question two. We knew the third question was our reason for existence, and knew that both the question and its answer were outside our reach.

Handling that third question sounds contradictory. That is the nature of working alongside classified environments. Hypothetically, we might propose a specific performance capability and ask if that is sufficient. The customer can answer yes or no without disclosing the reason for needing that capability. Collaboration then becomes possible in terms of that stated computing system performance, with the need for that performance never entering the conversation.

We were thus in the business of building the world’s fastest supercomputers without telling our customers what the computing systems were good for. We demonstrated system capability and let the customer decide if they wanted one or not (they wanted one).

The purest example is CRAY-1 serial number 1, the first Cray Research mainframe computer. It had no software, just five tons of bare metal on a loading pallet. The customer knew how to write software; they needed a more powerful computer to run it.

All high-performance computing systems have a bounding constraint. It might be memory size or speed. It might be heat generated. It might be electricity consumption. Liebig’s Law of the Minimum is a good analogy: plant growth is limited by the relatively scarcest ingredient. If you overcome or alleviate the bounding constraint, you can achieve greater performance. But you will still hit the next constraint. There will always be a bounding constraint; otherwise, performance would be infinite.
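
To make the idea concrete, here is a minimal sketch in Python (purely hypothetical numbers, not data from any actual Cray design): each candidate constraint is a capacity paired with the workload's demand, the bounding constraint is the one with the smallest capacity-to-demand ratio, and relieving it simply promotes the next-scarcest resource to the limiting role.

    # Illustrative sketch only: hypothetical numbers, not Cray Research data.
    # Liebig's Law of the Minimum: achievable performance is set by the
    # scarcest resource relative to what the workload demands.

    def bounding_constraint(capacity: dict, demand: dict) -> tuple[str, float]:
        """Return (name, ratio) of the resource that limits performance."""
        name = min(demand, key=lambda r: capacity[r] / demand[r])
        return name, capacity[name] / demand[name]

    # Hypothetical system: memory bandwidth is the scarcest resource.
    capacity = {"memory_bandwidth": 80.0, "flops": 160.0, "power": 250.0}
    demand = {"memory_bandwidth": 1.0, "flops": 1.0, "power": 2.0}

    print(bounding_constraint(capacity, demand))   # memory_bandwidth limits first

    # Relieve that constraint and the next one takes over; there is always one.
    capacity["memory_bandwidth"] *= 4
    print(bounding_constraint(capacity, demand))   # now power is the limiter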

With the second CRAY-1 system, serial number 3 (serial 2 was scrapped for a memory redesign), the constraint was lack of software. The National Center for Atmospheric Research (NCAR) could use a CRAY-1 for improved weather forecasting, but without an operating system and FORTRAN compiler, it was useless to them. The bounding constraint is not always physical or obvious.

Once we have a capability, what do we do with it? The NSA (National Security Agency) explains.

Improved Efficiency

Appendix B is a declassified NSA document describing the HPC process. It is the same situation as modern AI, but 70 years earlier and in a Top Secret setting. The project name was BOOTSTRAPS. This was a cryptanalytic (code-breaking) problem.

The NSA's initial analysis determined that this problem needed several hundred man-hours, with the result far from certain. They estimated the cost at nearly $1,000 (in 1953 dollars). That was impractical. They prioritized other projects and set this one aside. It was not the best use of available expert analysts.

The NSA had a specific method, which they called a “pass,” for solving the BOOTSTRAPS problem. But the manpower requirement was exorbitant given the resources then available.

Then someone found a way to automate the pass using card equipment. This is the value of a high-speed computing system. Now the cost for a pass was approximately $32.50. That 30X improvement crossed the feasibility barrier, and the NSA made several hundred passes. But that caused a problem downstream.

The result of a sequence of passes was not the final answer. It was material from which the cryptanalysts could proceed to a solution. The new labor-saving method therefore led to more work for the cryptanalysts. Each solution also opened up new jobs to do. The new technique (automating the pass) using computing capability (card equipment) created an opportunity not previously feasible. But the result was so valuable that it created more work, rather than less, for the expert analysts.

The 1950s direct predecessor to the 1970s CRAY-1 supercomputer was a computer named ATLAS. When ATLAS became operational, someone programmed it to process the BOOTSTRAPS pass. The cost of a pass dropped to $1.25. That was now such a bargain that all available data was run through the new process, creating a tremendous amount of work for the NSA cryptanalysts, and ultimately a tremendous amount of actionable plain text.
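
Laid out as a short calculation, the cost figures from the declassified document tell the whole story; the speedup factors below are simply ratios of those costs, not numbers quoted in the document.

    # Cost per BOOTSTRAPS pass at each stage, in 1953 dollars, taken from the
    # declassified NSA document. The ratios are derived here, not quoted.
    cost_per_pass = {
        "hand analysis": 1000.00,
        "card equipment": 32.50,
        "ATLAS": 1.25,
    }

    baseline = cost_per_pass["hand analysis"]
    for stage, cost in cost_per_pass.items():
        print(f"{stage:15s} ${cost:8.2f}  {baseline / cost:6.1f}x cheaper than hand analysis")

    # card equipment:  ~30x  -- the pass crosses the feasibility barrier
    # ATLAS:          ~800x  -- every available data set gets run through it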

With BOOTSTRAPS, the NSA concluded that when analytic machinery enables capabilities not previously feasible, it makes more work for the analyst rather than less. (They undoubtedly rated this a good outcome, but the declassified document does not say.)

New Capability

Consider a second scenario. Suppose the NSA had the necessary dozens of analysts to solve the BOOTSTRAPS problem by hand. The card equipment then allowed that job to be performed about 30 times faster or more efficiently. The (hypothetical) 30 people assigned to that part of the project could be reduced to a single person, freeing up the other 29 for some other task, or enabling a significant workforce reduction.

When ATLAS became operational and available for this project, that represented another 25X reduction in manpower requirement.

This is exactly the approach being promoted by the Big Tech giants: junior developers can be replaced with AI. Greater efficiency means tremendous manpower savings.

But this is not what High-Performance Computing is for. The BOOTSTRAPS analysis shows the difference.

BOOTSTRAPS was a known problem to solve, but impractical to attempt. Automation made it practical, even if just barely. Note that “just barely” meant a 30X speedup, not an incremental improvement such as 10% or 25%.

This is the distinction that we stopped teaching around 1995 (as explained in “Constraint-Based Design” below). HPC tradecraft provides those timeless skills that AI cannot touch:

  1. The Cray Research (and its predecessors) engineering tradition that produces unmatched potential.
  2. The creativity, more an attitude and orientation than a skill, of devising uses never before considered.
  3. Observing, characterizing, and using existing HPC systems (particularly including AI) in ways assumed to be impossible.

The NSA (and predecessor agency AFSA) made a sharp distinction between using computing systems as:

  • Labor savers, reducing manpower cost of existing projects and tasks
  • Revolutionizers, devising new capabilities previously impossible to consider, even in part

Appendix A is a declassified AFSA document explaining this distinction, placing ATLAS in the “revolutionizer” category based on its performance capability.

Now I can explain what “HPC tradecraft” means in practice.

Constraint-Based Design

High-Performance Computing (HPC) tradecraft consists of constraint-based design. Constraints shape solutions. Similar constraints tend to shape similar solutions.

HPC design is never building a computer and then seeking a product/market fit. Design always addresses a known problem or need. Design always aims to solve a problem that cannot currently be solved.

Abstractions Enable Advancement

These tradecraft facts present a difficulty: around 1995, we began to hide physical constraints behind abstractions, libraries, and infrastructure. That freed normal commercial and scientific computing to advance rapidly. But it shunted HPC tradecraft into the shadows, where it became tacit knowledge passed from person to person and never written down in the open literature.

Yet down on the bare metal, HPC tradecraft remains constraint-based design. AI Large Language Models (LLMs) are in fact designed as HPC systems.

Constraints Became Hidden

Do you see the barrier? We do not normally reason about constraint-based design because we hid the constraints. The barrier is that simple. And, therefore, easily surmounted, but only if you allow me to show you the way.

HPC tradecraft always begins with the constraints. After finding the true bounding constraint, we measure our actual capabilities, or name the capability that must come into existence. Capabilities bound our solution space.
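
As a sketch of that sequence (the metric name and numbers are invented for illustration, not drawn from any real procurement), the gap between what the problem demands and what the current machine delivers names the capability that must come into existence.

    # Hypothetical illustration of the constraint-first sequence; the metric
    # name and the numbers are invented for this sketch.
    problem_demand = {"sustained_gflops": 50.0}      # what the problem requires
    current_capability = {"sustained_gflops": 2.0}   # what today's machine delivers

    for metric, needed in problem_demand.items():
        have = current_capability[metric]
        if have >= needed:
            print(f"{metric}: solvable with what we already have")
        else:
            # The gap is the capability that must come into existence.
            print(f"{metric}: needs a {needed / have:.0f}x improvement "
                  f"before the problem is worth attempting")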

The next step is simple: accomplish what has never been done before. At Cray Research and our predecessors, we accomplished what nobody else on the planet could, repeatedly.

How did we do that? We spotted or used patterns and made connections that others missed. Once you see it, it seems obvious and you cannot un-see it. But seeing is difficult until you know where to look.

Here is a somewhat humorous hypothesis for you to consider: the primary difference between me and actual for-real AI experts is that I did it on bare metal in octal for 20+ years.

But if that hypothesis happens to be true, it has an implication: 20+ years of close observation, preceded by 40+ years of accumulated industry tradecraft, shows how to use AI in revolutionary ways to solve problems not considered solvable today. At Cray Research, “it cannot be done” was never a barrier. It was the necessary starting point.

Tradecraft Transmission

Tacit knowledge has traditionally been transmitted from person to person through demonstration, collaboration, and mentorship. But the people with the knowledge, at least in non-classified environments, are retiring out of the workforce.

Our knowledge and engineering tradition can be passed in written form, with the mentor/author not present. The next chapter shows how.