Using 12GB memory to fold the full Ryujin in 44 minutes certainly was an achievement, but I had this strong feeling that I could still do better. On one hand, both time and space consumption can still be improved. On the other hand, the CP file I used for folding has this weird mountain fold across the upper half of the body for flat foldability, and the legs are positioned in a way that the scales on them cannot be seen. I can certainly modify the file to fix those, but once I changed either of them, the new CP file completely defeated my optimizations and the runtime is again forever. As a result, I was on the second stage of my Ryujin quest for another 2 weeks, and finally, I made the program capable of folding the version depicted above in 16 minutes, using only 6GB of memory. I shall again document the journey here.
One thing that troubled me was that the memory consumption was a lot more than what I estimated. So I started to look for what software engineers call "memory leaks", which means the program fails to release the memory allocated for data that are no longer in use. I discovered that, in the stages where the Orihime algorithm prepares the data for the solution searching stage, there was this matrix recording the adjacency relationship of all faces, which uses a huge amount of memory spaces (in the case for Ryujin it's about 5GB) but it is not needed anymore after the preparation stages. So just by releasing those memory spaces, I immediately got the memory requirement for Ryujin down to 7GB. 1
However, 7GB still didn't sound good enough, and I was more hoping for 6; not for any particular reason, it's just that an even number sounds better (plus the memory consumption is pretty close to 6). That motivated me to come up with another idea called PseudoListMatrix (probably not a new invention). In one part of the algorithm, there's a pretty large matrix
M where each position
M[i][j] is a list of numbers, which consumes a lot of memory if we store it as a 3D array or as a 2D array of lists. The idea of PseudoListMatrix is to store, for each
i, the list of numbers that appear in
M[i][x] for some
x, and similarly for each
j, the list of numbers that appear in
M[x][j] for some
x. Then when the list of
M[i][j] is required, we give the intersection list of the
M[x][j]. In general, this will result in a larger list than the actual
M[i][j] list, but in that particular use case, the result is exactly
M[i][j] due to the nature of the lists. With this data structure, I finally get the memory consumption down to 6GB, and almost without sacrificing the speed at all.
The power of quad tree
Up to this point, all my speed optimization efforts were focused on the searching stage only. By the time I successfully folded Ryujin for the first time, the searching stage had become so fast that most of the 44 minutes are spent on the preparation stages. As I look into those preparation steps, I realized that many steps could all be speeded up dramatically by a classical data structure known as the quad tree.
Basically, computer programs can't really "see" the objects on the 2D plane. All it knows is a list of objects represented by numbers. If we need to ask the question, "is there any two of them colliding", the naive way of answering this question is to examine all possible pairs of objects and calculate if there's a collision for each pair. So for
n objects, this will take
n(n-1)/2 comparisons, which will be a huge number if
n becomes larger.
The idea of the quad tree is to divide the 2D plane into four quadrants and store the objects into the proper quadrant. If a quadrant contains too many objects, we divide it again into four quadrants within itself and repeat the process. Then when we ask the same question again, for each object, we only need to compare it against those objects that are contained in the same quadrant. This greatly reduces the number of comparisons to about
n\log n (which is much smaller especially for larger
n) and allows answering the question much faster.
Quad trees are so versatile that there're many places in the preparations stage where they can be applied, and each application saved about 5 minutes in total runtime. Combined with other optimization techniques (especially the indexing mentioned in Part 1), I eventually got the total runtime of the same Ryujin CP from 44 minutes down to just about 5 minutes.
Folding the unfoldable
However, even as the old Ryujin CP now folded so fast, a few CPs (such as the version above) still wouldn't fold; the runtime is still like forever. Two major obstacles happen in the searching process for those CPs: Too many potential permutations, and swapping looping.
First, in Part 1, I compared the searching process of the Orihime algorithm to solving Sudoku. For Sudoku, each square has only 9 possible numbers to fill, but here for each subface, the number of possible permutations is measured in factorials, which are enormous numbers beyond imagination. Although I already applied the idea of fixing the order of the longest chain in the given constraints, in some cases that still leave billions of billions of possible permutations that need to be checked, and that's only for one subface, while in Ryujin there are thousands of them!
Secondly, I mentioned the swapping algorithm in Part 1 as well. Although it's one of the most powerful speed boosts so far, it does have a small chance of falling into "looping", in which two or more subfaces repeatedly get swapped to the front of the others, and even reset the progress of them. In that case, the searching will never end in theory, which is even worse than the first problem I just mentioned, since that will only cause the search to run for a very very long time, but will still finish eventually in theory.
For two weeks I've been struggling with these two obstacles, trying various ideas without success. Finally, I realized that the solution to both of these problems lies in introducing new data into the search process. The permutation generator I designed has the ability to filter out invalid permutations based on the given constraints, so if there are still too many possible permutations left, it can only mean that I haven't offered it enough constraints. Similarly, if the swapping falls into a loop, one way to immediately break the loop is to swap a subface that we haven't seen before in front of the loop, as it creates a new sequence of subfaces that we haven't considered yet.
However, merely introducing new subfaces into the sequence doesn't necessarily create enough new constraints to narrow down the permutation generator. In order to quickly create more constraints, I used the Italiano algorithm mentioned in Part 1 to infer more stacking relations after each permutation decision of subfaces. This turned out to be a giant boost to the searching performance, comparable to the swapping algorithm. With these ideas implemented, I succeeded in folding the version above, as well as all the sample CPs that were previously unfoldable.
Ryujin final version
To me, the second version above is satisfying enough. But at the same moment, Avinash Karnik and Bodo Haag made another version of the Ryujin CP which folds the entire body in half, folds down the legs, and adds the head twist structure that is used in the real-world folding process. The resulting CP is even more complicated in layer stacking than all previous versions, and this CP again defeated my optimized algorithm. Fortunately, it wasn't anything major, but just a bug that was previously unnoticed as it only occurs to highly complex CPs. After fixing the bug, their CP folds just fine in 35 minutes. A lot longer than other versions of Ryujin CP, but at least it is foldable.
Of course, I won't easily stop there. Even this is no doubt a more complicated CP, I was hoping to get its runtime to less than 20 minutes. By observing the searching process up-close, I noticed that this CP in particular has several subface that is in fact critical but hidden at the very end of the initial subface ordering. This will cause the algorithm to hit dead-ends at a very deep depth and have to return several times, wasting dozens of minutes. To cope with this, I implement a new strategy saying that "if it hits a dead-end at a deep depth, also bring a few nearby subfaces with it during the swapping". This is quite effective in solving that problem, and eventually, brings the runtime to just 16 minutes. The result is the first picture above.
These optimizations are now part of the v0.0.10 release of the Oriedita.
The latest record for the first figure is 65 seconds using 3GB memory.
Later, with more understanding of the algorithm, I found out that there's no need to create the face adjacency matrix at all by using the quad tree. ↩