AI Trades Space for Time

Recently, I was on a walk with my friend and frequent collaborator Hayden Helm. He asked me if I could sum up the fundamental innovation behind the current wave of AI in one sentence. After thinking for a while, I replied:

AI makes it easy for programmers to trade space for time.

In this post, I am going to justify this answer by answering two sub-questions:

  1. In what way does AI make it easy to trade space for time?
  2. Why should this innovation be considered fundamental?

In What Way Does AI Make it Easy to Trade Space for Time?

Trading Space for Time in Traditional Algorithms

In traditional algorithms, the cost of a computation is measured using its time complexity and its space complexity. Here, time complexity refers to how long the program runs, and space complexity refers to the amount of data storage the program requires. Programmers often make use of the space-time tradeoff to make their programs faster by relying more heavily on data storage. A simple example of this is memoization, where a program stores the result of an expensive operation that is repeatedly called.
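As a concrete illustration, here is a toy Python example of my own (not from any particular source): memoizing a recursive Fibonacci function spends extra memory on a cache in exchange for a dramatic reduction in running time.

```python
from functools import lru_cache

# Naive version: recomputes the same subproblems exponentially many times.
def fib_slow(n):
    return n if n < 2 else fib_slow(n - 1) + fib_slow(n - 2)

# Memoized version: spend O(n) extra space on a cache of previous results
# in exchange for an exponential-to-linear drop in running time.
@lru_cache(maxsize=None)
def fib_fast(n):
    return n if n < 2 else fib_fast(n - 1) + fib_fast(n - 2)

print(fib_fast(200))  # returns instantly; fib_slow(200) would effectively never finish
```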

Why Time and Space Complexity?

The practice of measuring a program using its run time and memory consumption is a result of the emphasis placed on the RAM model of computing. CLRS, the most popular textbook for college-level algorithms, makes this emphasis explicit:

"For most of this book, we shall assume a generic one-processor, random-access machine (RAM) model of computation... In the RAM model, instructions are executed one after another, with no concurrent operations."

However, it seems reasonable that a program could be made faster by distributing it over multiple processors on a parallel computer. This is desirable, since parallel computing would provide an alternative to the space-time tradeoff as a mechanism for speeding up programs. CLRS does eventually mention parallel computing, but only after 772 pages:

"Although the computing community settled on the random-access machine model for serial computing early on in the history of computer science, no single model for parallel computing has gained a wide acceptance... Unfortunately, programming a shared-memory parallel computer directly using static threads is difficult and error prone. One reason is that dynamically partitioning the work among the threads so that each thread receives approximately the same load turns out to be a complicated undertaking. For any but the simplest of applications, the programmer must use complex communication protocols to implement a scheduler to load-balance the work."

Because of the difficulty involved in managing and coordinating a large number of threads, CLRS largely assumes that a program will only ever be distributed over a small number of threads. As such, an asymptotic measure of "thread complexity," akin to the traditional measures of time and space complexity, is never mentioned.
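To get a feel for the burden CLRS is describing, here is a minimal Python sketch (a toy example of my own) that statically partitions a summation across a handful of threads. The programmer has to choose the partition up front, and if the chunks are uneven, some threads sit idle while others lag; in CPython, the global interpreter lock means this particular example yields no real speedup anyway, which only underscores how much machinery genuine parallel speedups require.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000_000))
num_threads = 4                      # a "small number of threads," as CLRS assumes
chunk = len(data) // num_threads

def partial_sum(i):
    # Static partition: thread i handles one fixed, hand-chosen slice of the data.
    lo = i * chunk
    hi = (i + 1) * chunk if i < num_threads - 1 else len(data)
    return sum(data[lo:hi])

with ThreadPoolExecutor(max_workers=num_threads) as pool:
    total = sum(pool.map(partial_sum, range(num_threads)))

print(total)  # 499999500000
```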

What Would Thread Complexity Look Like?

Let's assume for a moment that we live in a world where we can effectively coordinate a massive number of threads. Instead of the RAM model, this world uses the GPU model, where we assume we have access to a computer with a large number of processors that can execute instructions concurrently. Under the GPU model, it makes sense to start measuring programs using three different kinds of complexity:

  1. Time Complexity: the amount of time it takes to run a program
  2. Memory Complexity: the amount of memory the program requires
  3. Thread Complexity: the number of processors the program requires

To get a feel for working with thread complexity, let's consider the task of multiplying N numbers together, which we will call MULTIPLY-N. Under the RAM model, MULTIPLY-N takes O(N) memory, O(1) threads, and achieves O(N) time.

Under the GPU model, we can use a parallel divide-and-conquer algorithm to compute the products of several pairs of numbers at once. For example, to compute a*b*c*d using the GPU model, we would first compute (a*b)=y and (c*d)=z in parallel, and then finish by computing y*z. The GPU implementation of MULTIPLY-N takes O(N) memory, O(N) threads, and achieves O(log(N)) time. In other words, there seems to be a thread-memory-time tradeoff in the GPU model, just like the space-time tradeoff in the RAM model. Moreover, the GPU implementation of MULTIPLY-N achieves an exponential speedup when compared to its RAM counterpart. This leaves us with the following observation:

If we have a computer with sufficiently many threads and an algorithm that can coordinate them, we may be able to leverage a thread-memory-time tradeoff to achieve an exponential program speedup.
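To make the divide-and-conquer structure concrete, here is a small Python sketch (purely illustrative, and executed sequentially here; on a GPU, every pair product within a round would run concurrently). Reducing 16 numbers takes log2(16) = 4 rounds instead of 15 sequential multiplications.

```python
def multiply_n(xs):
    # Tree reduction for MULTIPLY-N: halve the list each round by multiplying adjacent pairs.
    xs = list(xs)
    rounds = 0
    while len(xs) > 1:
        pairs = [xs[i:i + 2] for i in range(0, len(xs), 2)]
        # With O(N) threads, all of the pair products in this round execute simultaneously.
        xs = [p[0] * p[1] if len(p) == 2 else p[0] for p in pairs]
        rounds += 1
    return xs[0], rounds

product, rounds = multiply_n(range(1, 17))
print(product, rounds)  # 20922789888000 4 -- i.e., 16! computed in log2(16) parallel rounds
```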

AI is a Method for Coordinating Massively Parallel Computers

AlexNet

Before the advent of modern AI, a programmer trying to solve a task by coordinating a large number of threads would effectively need to invent a one-off algorithm for doing so. This changed with the introduction of AlexNet in 2012. AlexNet is a Convolutional Neural Network (CNN) that achieved an unprecedented improvement on ImageNet, the world's foremost image classification dataset (at the time). The key insight of AlexNet was that GPUs could be used to train neural networks in a highly parallel fashion:

"Despite the attractive qualities of CNNs, and despite the relative efficiency of their local architecture, they have still been prohibitively expensive to apply in large scale to high-resolution images. Luckily, current GPUs, paired with a highly-optimized implementation of 2D convolution, are powerful enough to facilitate the training of interestingly-large CNNs and recent datasets such as ImageNet contain enough labeled examples to train such models"

The contribution of this paper goes beyond an image recognition model. AlexNet validated a recipe for effectively coordinating a large number of threads towards a particular task. This single implementation of a neural network provided a highly parallel solver for any task that could be defined with a dataset. In other words, you no longer needed to know parallel programming to recruit a lot of compute towards your task; you only needed to collect some data, which is comparatively much easier. This enabled a very broad class of programmers to utilize the thread-memory-time tradeoff, limited only by the hardware of the GPUs themselves:

"All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available."
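Today, that recipe amounts to a few dozen lines of framework code. Below is a minimal sketch in PyTorch (a toy stand-in of my own, with random images and a tiny CNN rather than AlexNet itself); the only nod to parallel hardware is the .to(device) call, while the thousands of GPU threads behind each convolution are coordinated entirely by the framework.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Step 1 of the recipe: define the task with a dataset (random tensors stand in here).
images = torch.randn(256, 3, 32, 32)
labels = torch.randint(0, 10, (256,))

# Step 2: define a network. The framework maps each layer onto the GPU's threads.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 10),
).to(device)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Step 3: train. No explicit thread management anywhere in sight.
for step in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(images.to(device)), labels.to(device))
    loss.backward()
    optimizer.step()
```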

Though Krizhevsky was correct that faster GPUs could improve his results, that observation misses the forest for the trees (which is easy for me to say in retrospect). AlexNet worked because it was able to parallelize 2D convolution. What if the model could be improved, not by making the GPUs faster, but by parallelizing over even more of them?

Transformers

The transformer architecture, which powers mainstream AI applications like ChatGPT and GPT-4, is what happens when you push GPU parallelism to its logical extreme. Aidan Gomez, one of the authors of the seminal transformer paper, retrospectively remarks on this scalability in an ML Street Talk interview.

"What I didn't appreciate was that such a simple method could achieve such insane performance. At that time, we were training on 8 accelerators... but the architecture was so stripped down, so refined, it was so easy to scale... it was hard to see what would happen, which was this massive scaling project."

While there are no public figures regarding the scale of GPT-4, currently the most capable language model, some intelligent estimates exist for its predecessor. In a blog post, Lambda Labs estimates the size of GPT-3 and the extent of parallelism involved in training it:

"The GPT-3 175B model required 3.14E23 FLOPS of computing for training. Even at theoretical 28 TFLOPS for V100 and lowest 3 year reserved cloud pricing we could find, this will take 355 GPU-years... The 175 Billion parameters needs 175×4=700GB memory to store in FP32 (each parameter needs 4 Bytes). This is one order of magnitude larger than the maximum memory in a single GPU (48 GB of Quadro RTX 8000). To train the larger models without running out of memory, the OpenAI team uses a mixture of model parallelism within each matrix multiply and model parallelism across the layers of the network."

According to Lambda Labs' calculations, OpenAI effectively coordinated 355 GPU-years of compute towards a single task. By parallelizing the model over many GPUs, this massive computation could be accomplished in several months of wall time. In other words, GPT-3 leveraged a thread-memory-time tradeoff, reducing the time cost of a massive computation by spending more threads and more memory.
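As a back-of-the-envelope check on those figures (the 1,000-GPU cluster size below is my own assumption, not a reported number):

```python
total_flops = 3.14e23            # Lambda Labs' estimate of GPT-3's training compute
v100_flops_per_sec = 28e12       # theoretical 28 TFLOPS for a single V100

seconds_on_one_gpu = total_flops / v100_flops_per_sec
gpu_years = seconds_on_one_gpu / (365 * 24 * 3600)
print(f"{gpu_years:.0f} GPU-years")               # ~356 GPU-years on a single GPU

# Spread the same computation over, say, 1,000 GPUs running in parallel:
wall_months = gpu_years / 1000 * 12
print(f"~{wall_months:.1f} months of wall time")  # a few months instead of centuries
```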

Memory and Thread Complexity are Both Physical Space Complexity

GPT-3 makes it clear that modern AI leverages a tradeoff that pays increased thread and memory costs to decrease the time cost of massive computations. However, I want to go one step further, and claim that thread and memory consumption are both instances of increased physical space consumption. The argument here is straightforward: if a programmer wants to use more threads and memory for their program, they need to use more physical space for the physical machines that will run the computation. In this sense, modern AI is utilizing a physical space-time tradeoff to enable massive computations to be run in a short amount of wall time.

Why Should This Innovation Be Considered Fundamental?

What Does it Mean to be Fundamental?

A question about whether something is fundamental is, at its core, a question about the meaning of "fundamental." In the tradition of Wittgenstein, I take the meaning of a word to be roughly its use in discourse. As such, finding a context in which the trading of physical space for time is already described as fundamental will tell us a great deal about the sense in which the tradeoff is fundamental. In particular, we will look at a situation where trading physical space for time is considered fundamental in the sense of fundamental physics. To fully appreciate this situation, though, we will first need to build some intuition about space and time in physics.

Relativity & the Physical Space-Time Tradeoff

The fact that physical space and time can be traded off with each other should not be surprising. Since Einstein, we have known that space and time together make up spacetime, the singular entity required to describe physical phenomena:

"Space is a three-dimensional continuum. By this, we mean that it is possible to describe the position of a point by means of three numbers (coordinates): x, y, and z... Similarly, the world of physical phenomena is four-dimensional in the space-time sense. It is composed of individual events, each of which is described by four numbers: three space coordinates x, y, z, and a time value t."

Moreover, Einstein's spacetime comes equipped with an intrinsic tradeoff between space and time. Concretely, the spacetime volume between two points is defined as:

v = \sqrt{-g}\,(\delta x)(\delta y)(\delta z)(\delta t)

where t is time, x, y, and z are spatial coordinates, and g is the determinant of the spacetime metric tensor. The critical thing to notice about this formula is that, for a fixed spacetime volume, we can decrease the time component by increasing the space components. But how exactly does this physical space-time tradeoff relate to our computational thread-memory-time tradeoff?
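Before answering that, a quick numerical illustration of the tradeoff itself (the factor of two is my own arbitrary choice): holding v and g fixed while doubling each spatial extent forces the time extent to shrink by a factor of eight.

\delta t' = \frac{v}{\sqrt{-g}\,(2\delta x)(2\delta y)(2\delta z)} = \frac{\delta t}{8}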

Computation as Physical State Transformation

To understand the relationship between computation and spacetime volumes, we need a slightly more precise definition of physical computation. Luckily, the Stanford Encyclopedia of Philosophy readily provides one. In short, we can think of a physical computation as a function that transforms one physical state into another. From Einstein, we know that the physical start and end states of the computation are points in spacetime, and thus span a spacetime volume. In this sense, any computation can be said to have a physical spacetime cost, which is exactly the measure of the volume required to bridge its start and end states. Before AI, there was no general-purpose way to shift the extent of a computation's spacetime cost away from time and into space. Richard Hamming writes about this challenge in his 1997 book The Art of Doing Science and Engineering:

"The evidence from the two relativity theories, special and general, gives a maximum speed of useful signaling. There are definite limits to what can be done on a single processor. The trend to highly parallel processors is the indication we are feeling the upper saturation limit of the S curve for single-processor computers."

AI provides programmers with a general-purpose way of shifting the spacetime cost of any computation away from time and into physical space, radically increasing the number of functions that are physically realizable in reasonable amounts of time.

Constructor Theory & Fundamental Physics

This style of thinking is reminiscent of a burgeoning subfield of physics called constructor theory. In an interview, Chiara Marletto, a theoretical physicist at Oxford and one of the founders of constructor theory, outlines its distinction from existing theories of physics:

"Constructor theory, instead of using dynamical laws and trajectories and initial conditions, takes statements about what tasks or transformations are possible and what are impossible and why as fundamental ones."

Using constructor-theoretic language, we can describe the contribution of modern AI as enabling a new class of transformations between physical states. Namely, AI enables programmers to trade physical space for time and perform computations whose time cost was previously prohibitive. In this sense, the statement "AI trades space for time" is a statement about what transformations are physically possible, and thus should be taken as fundamental.