Recently I've been spending quite some effort improving the algorithm of Orihime, a program developed by Toshiyuki Meguro. I was hoping to make it capable of folding the complete CP of Ryujin 3.5 (with shaped scales 1) by Satoshi Kamiya, arguably one of the most complicated models ever created. It has been a quest of persuing both speed and space with endless challenges. After two weeks of work, I finally manage to fold the CP in its entirety on my laptop PC in 44 minutes using 12G of memory. In this article, I will share the story behind this quest.
For those of you who would like to try this optimized version of Orihime, check out the latest release of Oriedita. (Folding Ryujin will require v0.0.7 or above, and don't forget to launch it with large memory settings.)
How it began
The quest began when I was about to finalize the CP of Angelfish, my latest original design. The model being only 32-grid BP, I was shocked by the fact that Orihime took 75 minutes to complete the folding of it. I deeply wondered if there's anything about Orihime's algorithm that can be optimized to speed things up. The problem is, the source code of the original Orihime used mostly Japanese names for variables and functions, etc., making the source code pretty difficult to read and understand.
Luckily, only about two months ago, Mr. Gerben Oolbekkink (a.k.a "qurben") has made a fork of Orihime called "Oriedita" and converted most of the source code to use English nomenclature instead. By looking at the source code he translated, it took me only a few minutes to spot the first place that can be made a lot more efficiently:
public int FaceId2PermutationDigit(int im) {
for (int i = 1; i <= faceIdCount; i++) {
if (faceIdList[getPermutation(i)] == im) {
return i;
}
}
return 0;
}
This was a function that is called repeatedly in the process of searching for a valid layer stacking order. Each time it is called, it performs what computer scientists call a linear search to locate the digit location of a given Id. This is way too slow. A much more efficient way is to store the correspondence of Id and location in an array, and then just look up that array directly.
int[] FaceId2PermutationDigitMap; // Initialized to be large enough
public void reset_map() {
for (int i = 1; i <= faceIdCount; i++) {
FaceId2PermutationDigitMap[faceIdList[getPermutation(i)]] = i;
}
}
public int FaceId2PermutationDigit(int im) {
return FaceId2PermutationDigitMap[im];
}
By making this change, the runtime of my Angelfish immediately dropped from 75 minutes to just 500 seconds. That's 9 times faster by just changing one function.
More optimizations
After I announced my result, several people soon sent me more large CP samples that previously took a very long time (or practically impossible) for Orihime to fold. Indeed, even with the optimizations I made, some of them still won't finish overnight. In a few days that follow, I found a few places in the code that can be optimized, but they all only led to slightly faster runtime, and none of them would give a significant improvement.
A few days later, I discovered the most significant boost in the entire quest: the swapping algorithm. Basically, what Orihime does is kind of like solving Sudoku, where the "subfaces" in the model are like the squares in Sudoku, and the permutations within the subfaces are like the available numbers to fill in the squares. Orihime performs an exhaustive search on the subfaces in a certain order and tries to find a permutation for each of them that is consistent. Once the algorithm finds a subface without a consistent permutation with the previous subfaces, it backtracks to the previous subface and chooses a different permutation, and continues.
But just like Sudoku, sometimes it is a lot faster to solve the puzzle by first fill in some of the "critical" squares, as they will quickly help to narrow down the possible numbers for the other squares. The problem is, it is not very obvious which subface is more critical from the beginning. We can, however, conclude that a subface is more critical than those few before it by observing that the search repeatedly hits a dead-end at that very subface. In that case, we dynamically swap the search order and work on that subface first.
This technique is so powerful that it suddenly brought the runtime of my Angelfish to just 3 or 4 seconds, and it also makes all other samples people sent to me folded in a feasible amount of time, with only one exception, Ryujin 3.5 by Satoshi Kamiya.
Taming the divine dragon
It was hard for me to say "OK we're done here; forget about that one model that won't fold," since I was quite far from running out of optimization ideas yet. I used to say "I'll never fold Ryujin in my lifetime" (the reason is, if I'm spending that much effort folding one model, it better be my original design), but at that moment I ironically found myself seriously considering folding one, only in the computer instead.
The biggest challenge with fully shaped Ryujin is not about speed, but about space. It has way too many creases and faces, and it takes a crazy amount of memory to fold by the original data structure (estimated to require at least 30GB of memory). By using bitmap arrays instead of 32-bit integer arrays, I managed to control the total memory usage down to less than 12GB (maybe even less), which is feasible on my laptop.
Of course, that doesn't mean that speed is not an issue at all. Even with the swapping algorithm, it still took about a whole day to fold only half of the Ryujin CP. To this end, I improve the permutation generator by fixing the order of the longest known chain in the constraints provided, and this led to a factorial speed up for iterating over possible permutations, making the half Ryujin fold in less than half an hour.
I also replaced the Warshall algorithm used by Meguro for finding transitive closure with a much faster Italiano algorithm 2, but again that algorithm requires quite some memory to run. The last piece of the puzzle I put in to make it work, is by optimizing the memory used by the Italiano algorithm. The idea is pretty simple but took me a while to figure out: all I had to do is to flush out the changes more frequently, and not to let them piled up in the memory.
So finally, I succeed in taming the divine dragon. For more details about the individual optimizations I made, check out each pull requests I made on qurben's repository.
Update:
Check out also Part 2 of my Ryujin journey!
-
The CP file used is a fan-made file that is based on the CP published by Kamiya in his book, with additional creases to make it flat-foldable. Since the CP is a copyright material, I will not be sharing the file here. ↩
-
It is an online (i.e. dynamic) transitive closure algorithm described by G. F. Italiano. See http://dx.doi.org/10.1016/0304-3975%2886%2990098-8. ↩
Recent Comments