DisCO is structured around Karp’s 21 NP-complete combinatorial optimization problems and their variants, presented as visual and textual puzzles that obfuscate the underlying mathematical structure. The corpus uses small instances of these problems, where the optimal solution value is known and the corresponding puzzle can be mathematically modeled and solved by a human.

# The Novelty-Utility Dilemma

On the one hand, we need novel problems to ensure that AI systems are not simply recalling memorized solutions from their training data. On the other hand, we want to test useful skills that reflect intelligence (i.e. problem-solving skills that help with survival and/or are economically valuable).

While creating entirely new, arbitrary puzzles would ensure novelty, it would not guarantee practical utility, and such puzzles can have ambiguous solutions, as is often the case with IQ tests or the ARC-AGI problems. Conversely, if a problem helps with survival and/or is economically valuable, someone has most likely tried to solve it and left a trace of that attempt on the internet.

DisCO resolves this dilemma by designing visually novel puzzles that encode fundamental computational problems with objectively optimal solutions.

As we will see, the visual novelty of the puzzles is enough to render state-of-the-art LLMs unable to recognize the underlying problems or suggest tools to solve them (our preliminary tests used Claude 3.5 Sonnet, ChatGPT 4o, and Llama 3 70B). This suggests that the difficulty current models have with arbitrary puzzles such as ARC-AGI is not due to the novelty of their problem structure or the type of reasoning required to solve them (i.e. the algorithm used to generate the output from the input), but largely to their visual novelty.

Lastly, because DisCO puzzles are unequivocally generated from well-defined combinatorial optimization problems, it is easy to generate an unlimited amount of similar training data, and even if multiple solutions to a problem exist, they can be verified without ambiguity.
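To illustrate the idea (a sketch of an assumed workflow, not the authors’ actual generator), small instances can be sampled at random and solved by brute force, so the optimum is known in advance and any candidate answer can be checked unambiguously:

```python
import random
from itertools import permutations

def random_instance(n, seed):
    """Sample a random symmetric weight matrix for a small
    shortest-Hamiltonian-path instance (hypothetical generator)."""
    rng = random.Random(seed)
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = rng.randint(1, 9)
    return w

def path_cost(w, path):
    # total weight of the edges along a visiting order
    return sum(w[a][b] for a, b in zip(path, path[1:]))

def optimum(w):
    # brute force is fine for the small instances used in the corpus
    return min(path_cost(w, p) for p in permutations(range(len(w))))

def verify(w, path, claimed_cost):
    # a proposed solution is optimal iff its cost matches the brute-force optimum
    return path_cost(w, path) == claimed_cost == optimum(w)
```

Because the optimum is computed exactly, verification stays unambiguous even when several distinct paths achieve the same minimal cost.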

# A Shortest Hamiltonian Path Example

The following (easy) example encodes an instance of the shortest Hamiltonian path problem on four nodes (the machines), where the cost of an edge is determined by the tools the two machines require.

We have 4 machines: A, B, C, and D.

We have 5 different tools: Wrench, Screwdriver, Pliers, Hammer, and Multimeter.

Each machine requires a specific set of tools for repair.

Machine requirements:

Machine A: Wrench, Screwdriver

Machine B: Screwdriver, Pliers, Hammer

Machine C: Pliers, Multimeter

Machine D: Wrench, Hammer, Multimeter

The repairman starts with an empty toolbox and can carry any number of tools, but wants to minimize the number of times he needs to change his toolset.
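Under one natural reading (an assumption on our part) where the cost of moving between two machines is the number of tools swapped, i.e. the size of the symmetric difference of their tool sets, the instance is small enough to brute-force:

```python
from itertools import permutations

# Tool sets from the puzzle statement
machines = {
    "A": {"Wrench", "Screwdriver"},
    "B": {"Screwdriver", "Pliers", "Hammer"},
    "C": {"Pliers", "Multimeter"},
    "D": {"Wrench", "Hammer", "Multimeter"},
}

def route_cost(order):
    # tools swapped between consecutive repairs = symmetric difference
    return sum(len(machines[a] ^ machines[b]) for a, b in zip(order, order[1:]))

# try every repair order and keep the cheapest
best_order = min(permutations(machines), key=route_cost)
```

Under this (assumed) cost model, an order such as A, B, C, D incurs nine tool swaps, and the task is exactly a shortest Hamiltonian path over the four machines.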

# A 3-SAT Visual Example

The following example encodes an instance of 3-SAT with four literals (each encoded as a color) and ten clauses (encoded as rows). The direction of an arrow indicates a positive or negative literal.

A human can eventually figure out that the output consists of the shapes that appear at least once in each row of the input, and an undergraduate Computer Science student can mathematically model the problem and solve it.

On the other hand, state-of-the-art multimodal LLMs get thrown off by the visuals. Importantly, this shows that the difficulty LLMs have with this kind of puzzle is not due to the inherent novelty of the problem to be solved (e.g. Claude 3.5 Sonnet knows about 3-SAT and is able to recognize other disguised instances of it), but only to its novel visual presentation.

Another important point is that besides Reasoning, this corpus can also be used to test Tool Use skills. Even when given the hint that this puzzle encodes an NP-hard problem, Claude fails to identify it and suggest an appropriate tool.
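For reference, once the disguise is stripped away, checking a candidate answer to a 3-SAT instance is straightforward. The clause set below is a hypothetical stand-in (the real puzzle’s clauses are encoded in the image); the sign of an integer gives the literal’s polarity, mirroring the arrow directions:

```python
from itertools import product

# Hypothetical clause set: each clause is a triple of literals over
# variables 1..4; a positive integer is a positive literal, negative is negated.
clauses = [(1, -2, 3), (-1, 2, 4), (2, -3, -4)]

def satisfies(assignment, clauses):
    """assignment maps variable index -> bool; a clause is satisfied
    if at least one of its literals evaluates to true."""
    return all(any(assignment[abs(l)] == (l > 0) for l in c) for c in clauses)

# Instances this small can simply be brute-forced over all 2^4 assignments.
n_vars = 4
solutions = [
    dict(zip(range(1, n_vars + 1), bits))
    for bits in product([False, True], repeat=n_vars)
    if satisfies(dict(zip(range(1, n_vars + 1), bits)), clauses)
]
```

Even when several satisfying assignments exist, each one is verified by the same unambiguous check, which is what makes these puzzles suitable as training and evaluation data.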

# The Problems with Arbitrary Puzzles

We are exploring an alternative to the ARC-AGI dataset, for the following reasons:

## No Objective Concept of Better

Because they are not rooted in any objective concept of “better”, ARC-AGI problems often admit several solutions and it is unclear what makes one more correct than another. Consider the example below:

Both “Change the color of the larger component according to the shape of the smaller component” and “Change the color of the larger component to light orange if it touches a lateral edge, dark orange if it is one cell off, and green if it is more than one cell off” are valid solutions.

The first solution is of course the “correct” one, but if we were to explain what makes it more correct, we could only appeal to a vague notion of “elegance”. For some other ARC-AGI problems, it is “conciseness” or “aesthetics”. These fleeting notions do not necessarily relate to intelligence; in nature, ugly, redundant, messy solutions are sometimes favored.

This kind of ambiguity is often found in IQ problems. According to Nassim Nicholas Taleb, the late Benoit Mandelbrot’s main difficulty with IQ tests was figuring out which, of all the possible solutions to a problem, the test creator wanted.

Because they are not rooted in any objective concept of “better”, arbitrary puzzles risk throwing a bone for AI to chew on without substantial improvement in capabilities towards AGI.

In contrast, NP-complete problems represent fundamental computational challenges, have multiple survival/economic applications, and numerous other problems can be efficiently reduced to them.

## Humans Don’t Learn New Tasks from Scratch when Solving the ARC-AGI Challenge

When solving ARC problems, we relate them to our previous experience with puzzles and videogames, and to what we know about the context (e.g. that the test creator is an elegant person who values elegant solutions, as illustrated above). This is assumed to be base knowledge for people of the socio-economic class that would read or write a blog post like this one, but it is far from universal. Ask a person who has never used a computer or solved puzzles (they are hard to find, but they exist in this vast world where people live long lives): they won’t be able to learn the new task of solving an ARC problem unless you first teach them about puzzles, videogames, and the context. Therefore, we shouldn’t expect LLMs to learn a new task from scratch, but instead to relate a problem with a novel configuration or visual presentation to problems they know about.

As our preliminary tests show, this is enough novelty to challenge current state-of-the-art models, while still ensuring that the problems we give them are real.