From 041983d168322f2ff0127fbaad18f86f8e69c2fe Mon Sep 17 00:00:00 2001
From: inga-lovinde <52715130+inga-lovinde@users.noreply.github.com>
Date: Thu, 6 Apr 2017 13:40:02 +0300
Subject: [PATCH] Updated README

---
 README.md | 78 ++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 68 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index 1bc9c82..6e95d33 100644
--- a/README.md
+++ b/README.md
@@ -41,20 +41,36 @@ That's why the given hashes are solved much sooner than it takes to check all an
 
 Anagrams generation is not parallelized, as even single-threaded performance for 4-word anagrams is high enough; and 5-word (or larger) anagrams are frequent enough for most of the time being spent on computing hashes, with full CPU load.
 
-Multi-threaded performance with RyuJIT (.NET 4.6, 64-bit system) on quad-core Sandy Bridge @2.8GHz is as follows (excluding initialization time of 0.2 seconds):
-
-* If only phrases of at most 4 words are allowed, then it takes **0.9 seconds** to find and check all 7,433,016 anagrams; **all hashes are solved in first 0.15 seconds**.
-
-* If phrases of 5 words are allowed as well, then it takes around 100 seconds to find and check all 1,348,876,896 anagrams; all hashes are solved in first 2.5 seconds.
-
-* If phrases of 6 words are allowed as well, then it takes around 75 minutes to find and check all 58,837,302,096 anagrams; "more difficult" hash is solved in 2.5 seconds, "easiest" in 14 seconds, and "hard" in 35 seconds.
-
-* If phrases of 7 words are allowed as well, then it takes 75 seconds to count all 1,108,328,708,976 anagrams, and around 40 hours (speculatively) to find and check all these anagrams; "more difficult" hash is solved in 13 seconds, "easiest" in 1.5 minutes, and "hard" in 4.5 minutes.
+Multi-threaded performance with RyuJIT (.NET 4.6, 64-bit system) on quad-core Sandy Bridge @2.8GHz (without AVX2 support) is as follows (excluding initialization time of 0.2 seconds), for different maximum allowed words in an anagram:
+
+Number of words|Time to check all anagrams no longer than that|Time to solve "easy" hash|Time to solve "more difficult" hash|Time to solve "hard" hash|Number of anagrams no longer than that (see note below)
+---------------|----------------------------------------------|-------------------------|-----------------------------------|-------------------------|-------------------------------------------------------
+3|Fractions of a second||||4560
+4|0.6s|||0.1s|7,433,016
+5|60s|||1.5s|1,348,876,896
+6|45 minutes|||21s|58,837,302,096
+7|10 hours (?)|1.5 minutes|8s|4.5 minutes|1,108,328,708,976
+8|||||12,089,249,231,856
+9|||||88,977,349,731,696
+10|||||482,627,715,786,096
+11|||||2,030,917,440,675,696
+12|||||6,813,402,098,518,896
+13|||||18,437,325,782,691,696
+14|||||40,367,286,468,925,296
+15|||||71,561,858,517,565,296
+16|||||103,280,807,987,773,296
+17|||||123,910,678,817,341,296
+18|||||130,313,052,523,069,296
 
 Note that all measurements were done on a Release build; Debug build is significantly slower.
 
 For comparison, certain other solutions available on GitHub seem to require 3 hours to find all 3-word anagrams. This solution is faster by 6-7 orders of magnitude (it finds and checks all 4-word anagrams in 1/10000th fraction of time required for other solution just to find all 3-word anagrams, with no MD5 calculations).
 
+Also, note that anagram counts are inflated for the sake of code simplicity.
+E.g. for phrase "aabbc" and dictionary [ab, ba, c] there are four possible set of words adding up to the source phrase: [ab, ab, c], [ab, ba, c], [ba, ab, c], [ba, ba, c].
+My implementation regards these sets as sets of different words, and applies all possible permutations to the every set, even if it will result in the same set.
+For the example above, my application would produce 24 anagrams (with six permutations for every of the four sets), although actually there are only 12 different anagrams.
+
 Conditional compilation symbols
 ===============================
 
@@ -111,4 +127,46 @@ There is no need in processing all the words that are too large to be useful at
 11. Filtering the original dictionary (e.g. throwing away all single-letter words) does not really improve the performance, thanks to the optimizations mentioned in notes 7-9.
 This solution finds all anagrams, including those with single-letter words.
 
-12. MD5 computation could be further optimized by leveraging CPU extensions (which would reduce runtime by 5x to 10x); however, it could not be done with current .NET (see readme for https://github.com/penartur/TrustPilotChallenge/tree/simd-md5)
+12. Computing the entire MD5, and then comparing it to the target MD5s, makes little sense. Each of MD5 components is `uint`, which means that the chances of first component match for different hashes are one in 4 billions.
+It's more efficient to compute only the first component (which is 5% faster since we don't need to perform rounds 62-64 of MD5), and use only the first component for a lookup (which makes the lookup 4x faster).
+To prevent false positives, we could compute the entire MD5 again if there is a match.
+As that will only happen once in 4 billion hashes, the efficiency of this computation does not matter at all.
+Right now, this additional checking is not implemented, which means that once in a minute (if there are 3 target hashes) the program will produce a false positive, which allows one to monitor progress.
+
+13. MD5 computation is further optimized by leveraging CPU extensions.
+For example, one could compute MD5 more effectively by using `rotl` instruction to rotate numbers (which is currently done with two bitshifts and one `or` / `xor`).
+What's more important, one could compute 4 hashes at once (on a single core) using SSE, 8 hashes at once using AVX2, or 16 hashes at once using AVX512 (AVX lacks enough instructions to make computing hashes feasible).
+.NET/RyuJit does not support some of the required intrinsics (`rotl` for plain MD5 implementation, `psrld` and `pslld` for SSE, and similar intrinsics for AVX2).
+Although `rotl` support is expected in next release of RyuJIT (see https://github.com/dotnet/coreclr/pull/1830), no support for bitshift SIMD/AVX2 instructions is currently expected (see https://github.com/dotnet/coreclr/issues/3226).
+However, one can move MD5 computations to the unmanaged C++ code, where all the intrinsics are available.
+To make this work efficiently, I had to store anagrams in chunks of 8 anagrams (so that unmanaged code will receive the chunk and produce 8 hashes).
+And to make this efficient, I had to make all permutation counts to divide by 8 by filling in some additional permutation copies.
+It slows down processing anagrams of 1, 2, and 3 words (as for every set of word, number of anagrams is increased to 8 from 1, 2 and 6, respectively); however, these are relatively rare for a given phrase and dictionary.
+
+Implementation details
+======================
+
+Given all the above, the implementation is as follows:
+
+1. Words from the dictionary are converted into arrays of bytes with a trailing space.
+
+2. The dictionary is filtered from words that could not be a part of anagram (e.g. "b" or "aa"), and from duplicates.
+
+3. Words are converted into vectors, and grouped by vector.
+
+4. Vectors are ordered by their norm, in a descending order.
+
+5. All sequences of non-decreasing vector indices adding up to a target vector are found.
+
+6. For every sequence, a sequence of word arrays corresponging to these vectors is generated.
+
+7. For every sequence of word arrays, all sequences of word combinations are generated (e.g. for [[ab, ba], [cd, dc]], we generate [ab, cd], [ab, dc], [ba, cd], [ba, dc]).
+
+8. For every sequence of words, all permutations are generated (in chunks of 8).
+
+9. For every 8 permuted sequences of words, `uint[64]` message is generated (8 uints = 28 bytes with a trailing `128` byte, plus a length in bits for every sequence).
+
+10. For every `uint[64]` message, 8 `uint`s corresponding to the first components of MD5 hashes for `uint[8]` messages are generated.
+
+11. Every resulting `uint` is checked against the targets; if match is found, both sequence of word and full MD5 hash are printed to the output.
+