At the end of Chapter 9, we built the ultimate Kasteleyn engine. By moving our math into a Finite Field modulo a 64-bit prime ($p$), we compressed the innermost loop of our algorithm down to a few ultra-fast integer operations.

It calculated the $4096 \times 4096$ board in milliseconds. But the output wasn’t the 2.1-million-digit answer we wanted. It was just the remainder of that massive answer divided by $p$. We only captured about 60 bits of information. We are staring at a massive universe through a tiny keyhole.

But what if we drill more keyholes?

The Multiverse Approach

If we run our lightning-fast Kasteleyn engine again, but this time we choose a different 64-bit prime ($p_2$), we get a completely different remainder. We get another 60 bits of information.

If we run it with $p_3$, we get another 60 bits.

The exact mathematical answer for a $4096 \times 4096$ board is over 2.1 million decimal digits long (which translates to roughly 7 million bits). If each prime gives us roughly 60 bits of data, we just need to run our algorithm inside exactly 112,733 different prime “universes.”

Is running an algorithm 112,733 times slow? Not if it only takes a fraction of a millisecond per run. And more importantly, not if we run them all at the exact same time.

Embarrassingly Parallel

In computer science, when you have a massive task that can be broken down into completely independent sub-tasks that don’t need to communicate with each other, it is called Embarrassingly Parallel.

Calculating the board modulo $p_1$ has absolutely nothing to do with calculating it modulo $p_2$. They are entirely isolated universes.

This means we can unleash the full physical power of modern hardware. Using OpenMP in C or C++, we simply drop a #pragma omp parallel for directive above our loop.

Instantly, the operating system wakes up every single core on your CPU. If you are running an M-series Mac or a massive AMD Threadripper, all 16, 32, or 64 cores will pin to 100% utilization. Every single core will grab a different prime number, calculate the Kasteleyn remainder in a millisecond, save the 64-bit result to an array, and instantly grab the next prime.

Within a matter of seconds, your CPU will generate all 112,733 perfectly accurate remainders.

The Chinese Remainder Theorem (CRT)

Now we have an array of 112,733 different 64-bit remainders. How do we turn this pile of puzzle pieces back into a single 2.1-million-digit integer?

We use a mathematical miracle discovered in the 3rd century AD by the Chinese mathematician Sunzi: the Chinese Remainder Theorem (CRT).

The theorem states that if you know the remainders of an unknown number $X$ when divided by several coprime numbers (like our list of distinct primes), there is exactly one unique solution for $X$ that satisfies all of those remainders simultaneously (up to the product of all the primes).

The math to stitch two remainders together looks like this:

$$X = (R_1 \cdot p_2 \cdot M_1 + R_2 \cdot p_1 \cdot M_2) \pmod{p_1 p_2}$$

(Where $M_1$ and $M_2$ are modular inverses).

When you stitch $R_1$ and $R_2$ together, you get a new remainder modulo $(p_1 \cdot p_2)$. You just combined two 64-bit puzzle pieces into a 128-bit puzzle piece!

The Final Bottleneck: Divide and Conquer

If we just write a simple for loop to stitch all 112,733 remainders together one by one, we will hit one final, fatal wall.

As the accumulated number $X$ grows into thousands and then millions of digits, multiplying it by the next little 64-bit prime becomes computationally heavy. If you stitch them sequentially, the time complexity is $O(K^2)$ where $K$ is the number of primes. The math will drag to a crawl.

To bypass this, we use a Divide and Conquer binary tree.

Instead of adding to a single massive pile, we pair the 112,733 remainders up and stitch them together in parallel. Now we have roughly 56,000 remainders of 128-bit size. We pair them up again. Now we have 28,000 remainders of 256-bit size.

We keep folding the array in half. Because we are using the GNU Multiple Precision (GMP) library to handle the massive integer arithmetic, and GMP uses ultra-fast FFT (Fast Fourier Transform) multiplication for huge numbers, this binary tree approach reduces the stitching time from $O(K^2)$ down to $O(K \log^2 K)$.

The Climax

We are finally ready. We have combined 1961 statistical physics, abstract cyclotomic algebra, prime number theory, and raw multi-threaded CPU architecture into a single program.

Let’s start with a “smaller” test: the $2048 \times 2048$ board. You hit run. OpenMP spins up your CPU cores. The engine burns through 28,177 prime universes, generating the array. The GMP binary tree folds the remainders together. Under 23 seconds later, the terminal prints the exact answer: a flawless integer roughly 530,000 digits long.

We scaled a mountain. But let’s go for the summit. Let’s solve the $4096 \times 4096$ board. You hit run. The CPU fans scream. The MacBook Pro’s aluminum chassis gets hot. The engine calculates 112,733 distinct 64-bit prime universes. The binary tree takes over, stitching and multiplying numbers that are hundreds of thousands of digits long using FFT.

6 minutes later, the tree reaches the root node.

The terminal prints the result. It is a mathematically exact, perfect integer over 2.1 million digits long.

We didn’t just solve the puzzle. We conquered the grid.


The Code: Generation 4

Here is the multi-core supercomputing engine. This C code evaluates the 2D Kasteleyn formula inside native 64-bit Finite Fields, parallelizes the universes using OpenMP, and stitches the final massive integer together using a GMP Divide-and-Conquer Chinese Remainder Theorem tree. It will crush a $4096 \times 4096$ board.

💻 View the code on GitHub: domino-2d-ff.c