Transformers are a neural network architecture. They are behind some of the most successful large language models (LLMs) we see today, like GPT-3, PaLM, BARD, and GPT-4. I have seen several papers claiming that transformers are Turing-complete, meaning that they can be used to simulate any computer program.

But transformer architectures are not Turing-complete. They cannot simulate computer programs. The papers that claim otherwise are making a conceptual error. Transformers have been impressive and extraordinary as tools but **we need to be honest about what they can do and about the challenges that lie ahead of us on the path toward true artificial general intelligence**.

Here are a list of the papers I have seen so far.

- Universal Transformers (2018)
- On the Turing-Completeness of Modern Neural Network Architectures (2019)
- On the Computational Power of Transformers and its Implications in Sequence Modeling (2020)
- Attention is Turing-Complete (2021)
- Looped Transformers as Programmable Computers (2023)
- GPT is becoming a Turing machine: Here are some ways to program it (2023).

These papers all make the incorrect claim that transformers are [or can become] Turing-complete. Each one is a little different. Let us go through them in detail.

Let us first look at paper [5] Looped Transformers as Programmable Computers. In this model the input/output is split into three segments that they call scratchpad, memory, and instructions. They then show how to determine the set of connections and weights that essentially runs the instructions while keeping track of where the program is in the sequence of instructions. It can load data from the memory segment into the scratchpad, execute operations on the scratchpad, and then save data back to memory. They showed that you can simulate a computer program using this architecture. And they did this without assuming unbounded precision for any of the variables.

So how big is their transformer network? It depends! Their network had a fixed depth but an arbitrary width. The width is an adjustable parameter and they describe how to build transformers of a variable width. **So, really, what they show is that if you know how much memory your program needs for execution, you can build a transformer network that runs that program to it conclusion.** But that doesn’t mean transformers are Turing-complete!** By that same logic one can argue that finite state machines (FSMs) are Turing-complete** because for any program, given the amount of memory it needs to run, I can construct an FSM that runs it. (FSMs are not Turing-complete. That is a well accepted statement, still undisputed I hope).

Papers [2], [3], and [4] commit a similar but deeper error. They assume an arbitrary width that corresponds to, not the memory, but rather the number of steps needed in the computation. They store ** the entire history** of the program as tokens that are always fed to the model. When they want to retrieve what was written in the memory tape at a specific address, the network “attends” to the the part of the input history that represents the last time the Turing machine tape head was at that address. (See page 6 of On the Turing-Completeness of Modern Neural Network Architectures where it says “the symbol at index c

^{(t+1)}in time t + 1 coincides with the last symbol written by M at index c

^{(t+1)}“. See pages 11 & 12 of Attention is Turing-Complete, where y

_{i}vector stores the history – similar variable naming to their 2019 paper. See Theorems 4.1 and 4.2 of On the Computational Power of Transformers and its Implications in Sequence Modeling, e.g. where it says “Now, using the last but one coordinate in y

_{t}representing the time t+1, the attention mechanism Att(p

_{t}, K

^{e}, V

^{e}) can retrieve the embedding of the t-th input symbol x

_{t}).

So in papers [2], [3], and [4] (two of them from the same authors describing the same model) what they are saying is that if you know how many time steps your program needs for execution you can build a transformer wide enough to contain the entire history that runs that program to its conclusion. This argument is worse than the other one above because with this logic you can argue that lookup tables (a system even weaker than FSMs) are Turing-complete! Given any Turing machine I can simulate up to n steps by creating a finite list of all possible program sequences given different possible inputs.

In [1] (Universal Transformers) they attempt to provide an informal proof that their model can simulate Turing machines. The authors say that “*Given sufficient memory the Universal Transformer is computationally universal – i.e. it belongs to the class of models that can be used to simulate any Turing machine,… To show this, we can reduce a Neural GPU to a Universal Transformer…. If we now set the total number of recurrent steps T to be equal to the input length, we obtain exactly a Neural GPU*“. So they are committing the same conceptual error of the other papers. Their transformer can only simulate a Turing machine if they know the number of steps. Also to support their claim that neural GPUs are Turing complete, they refer to Neural GPUs Learn Algorithms (2018). This paper also makes the same error. On page 4, it informally draws an analogy between neural GPUs and 2d cellular, with the implicit assumption that their network starts off with enough memory units for the particular computation it is supposed to run.

Finally, in [6], they do not present a proof for Turing-completeness. But the paper implicitly assumes that a transformer can operate like a Turing-machine, i.e. simulate a program.

To show that transformers are as powerful as Turing-machines, **one must show that given any Turing machine (or computer program), a single transformer exists that simulates it**. What people are doing instead is showing that given a Turing machine and given the resource constraints on that machine there is a transformer that simulates it. This does not demonstrate Turing-completeness. **Otherwise the notion of computation power becomes completely meaningless**. Because by that standard, finite state machines are Turing-complete. You can simulate any Turing machine with a specified memory constraint using a corresponding finite state machine. Does anyone claim finite state machines are Turing-complete?

**It is actually pretty easy to prove that transformers are not Turing-complete.** Given any transformer, I can simulate that transformer by constructing a single finite state machine. The number of possible states of a transformer is limited and easily calculable. The only way for a transformer to have memory states is to have information sent through the input/output channel as discrete tokens (usually words in LLMs). And the tokens must be from a predefined vocabulary of finite size. So if transformers can simulate Turing machines, then finite state machines can also simulate Turing-machines and that leads to a contradiction. (RNNs on the other hand are Turing-complete because they can store unbounded memory as digits after the decimal point of a variable. That result is mathematically solid but it has its own practical problems. Those kinds of networks are structurally unstable and as far as I understand they cannot be approximated through learning. See part 3 of this post for more.).

Someone might object by saying that physical computers work with constraints too and that this is an unfair critique of transformers. A physical computing device has a fixed amount of memory and we try not to run programs that require more than the available space. My response to that objection is that **we shouldn’t confuse computing resources with computer programs**. In AI we seek computer

*programs*that are capable of general problem-solving. If the computing device that executes our AI program runs out of memory or time or energy, we can add more resources to that device or even take that program and continue running it on another device with more resources. A transformer network cannot be such a program, because its memory is fixed and determined by its description and there is no sense in which a transformer network runs out of memory.

Whether Turing-completeness matters for reaching general intelligence can be debated. (See this post on transformers and general intelligence). But there is no debate on whether transformers are Turing-complete or not. They aren’t, not with how they are defined today.

The feature image of this post was created using DreamStudio.

The title of this post is a pun on the paper that first introduced the transformer architecture: Attention Is All You Need (2017)