Inga 🏳‍🌈 57f3877378 Refactoring; all anagrams are printed		4 years ago

TrustPilotChallengeRust

TrustPilot ran this challenge several years ago (http://followthewhiterabbit.trustpilot.com/): given a dictionary and three MD5 hashes, find the three-word anagrams of the phrase "poultry outwits ants" that produce these hashes.

My original solution was a mixture of C# and plain C (with a bit of Visual C++ as a bridge), and made heavy use of AVX2 intrinsics for optimization.

Rust now has a decent API frontend for AVX2 intrinsics (https://rust-lang.github.io/packed_simd/packed_simd_2/, and soon-to-be std::simd), so it makes perfect sense to try and reimplement the same ideas with Rust.

The problem is stated slightly differently here: given a dictionary and a string, find all anagrams of at most N words and at most 27 bytes that produce the given MD5 hashes.

(The limit on the number of words is necessary because the dictionary contains single-letter words, which make the total number of anagrams astronomically large.)

This is a working draft. So far the code is extremely dirty (this is my first Rust project); it only lists all anagrams (not counting word reorderings) and does not yet do the actual MD5 calculation.

Algorithm description

Notably, this solution does not involve string concatenation; strings are only concatenated for debugging purposes.

We can split the problem into three parts: finding all anagrams (up to word reordering and replacing some of the words with their single-word anagrams), finding all anagrams taking word order into account, and checking their hashes against the supplied list.

Finding all anagrams, pt. 1

For every string (ignoring spaces) we can define a vector in Z^N (here N is the number of distinct characters), whose i-th coordinate is the number of occurrences of character i in the string.

Two strings are anagrams of each other if and only if their vectors are the same.

The vector for a concatenation of two strings is the sum of the vectors for these two strings.
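A minimal sketch of these three properties, using plain arrays rather than SIMD types. It assumes lowercase ASCII input; `letter_counts`, `add` and `ALPHABET` are hypothetical names for illustration, not identifiers from this repository.

```rust
/// Number of coordinates: one per lowercase ASCII letter (an assumption of this sketch).
const ALPHABET: usize = 26;

/// Maps a string to its letter-count vector, ignoring spaces and any byte outside a-z.
fn letter_counts(s: &str) -> [u32; ALPHABET] {
    let mut counts = [0u32; ALPHABET];
    for b in s.bytes() {
        if b.is_ascii_lowercase() {
            counts[(b - b'a') as usize] += 1;
        }
    }
    counts
}

/// Coordinate-wise sum: the vector of the concatenation of two strings.
fn add(a: &[u32; ALPHABET], b: &[u32; ALPHABET]) -> [u32; ALPHABET] {
    let mut sum = [0u32; ALPHABET];
    for i in 0..ALPHABET {
        sum[i] = a[i] + b[i];
    }
    sum
}
```

With these helpers, two strings are anagrams exactly when `letter_counts(a) == letter_counts(b)`, and `add(&letter_counts("poultry"), &letter_counts("ants"))` equals `letter_counts("poultryants")` without building the concatenated string.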

This means that finding anagrams of a phrase reduces to finding subsets of vectors (out of the set of vectors for all dictionary words) that add up to the vector of the original phrase. Since all coordinates are non-negative, only vectors contained in the hyperrectangle defined by the target vector can belong to such subsets: that is, vectors whose every coordinate is not larger than the corresponding coordinate of the target vector, or equivalently words whose letters are a sub-multiset of the letters of the target phrase.

Additionally, if the source phrase contains no more than 32 different characters, each occurring no more than 255 times, we can use u8x32 vectors instead of vectors in Z^N. That way we can "concatenate" two strings or "compare" them for being anagrams with one or two AVX2 instructions on a single 256-bit register, instead of a loop over characters.
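The compact representation can be sketched as follows. To stay on stable Rust this uses a plain `[u8; 32]` array where the real code would use a SIMD type, so the loops here stand in for lane-wise operations. `build_alphabet`, `to_vector` and `fits` are hypothetical helpers, not functions from this repository.

```rust
/// Number of lanes, matching a 256-bit register of u8 values.
const LANES: usize = 32;

/// Assigns a lane to each distinct non-space byte of the phrase.
/// Returns None if the phrase has more than 32 distinct bytes.
fn build_alphabet(phrase: &str) -> Option<Vec<u8>> {
    let mut alphabet = Vec::new();
    for b in phrase.bytes().filter(|&b| b != b' ') {
        if !alphabet.contains(&b) {
            alphabet.push(b);
        }
    }
    if alphabet.len() <= LANES {
        Some(alphabet)
    } else {
        None
    }
}

/// Converts a word to its count vector over the given alphabet.
/// Returns None if the word uses a byte outside the alphabet
/// or if some count does not fit into u8.
fn to_vector(word: &str, alphabet: &[u8]) -> Option<[u8; LANES]> {
    let mut v = [0u8; LANES];
    for b in word.bytes().filter(|&b| b != b' ') {
        let lane = alphabet.iter().take(LANES).position(|&a| a == b)?;
        v[lane] = v[lane].checked_add(1)?;
    }
    Some(v)
}

/// True if every coordinate of `w` is not larger than that of `p`,
/// i.e. `w` lies inside the hyperrectangle defined by `p`.
fn fits(w: &[u8; LANES], p: &[u8; LANES]) -> bool {
    w.iter().zip(p.iter()).all(|(a, b)| a <= b)
}
```

Dictionary words for which `to_vector` returns `None`, or for which `fits` is false against the phrase vector, can be discarded before the search starts.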

The naive search for fixed-length subsets of vectors that add up to a given vector can be optimized further, resulting in the following algorithm:

1. Sort all vectors by their norm (the length of the original word), largest first.
2. Find all target subsets such that the order of items in the subset is compatible with their order in the sorted list of vectors. For the number of words N, the requested phrase P, and the offset K (initially 0), check:
   - If N is 0 and P is non-zero, there are no solutions;
   - If N is 0 and P is zero, there is a trivial solution (the empty subset);
   - If N is larger than 0, find the first vector of a target subset:
     - For every vector W starting at offset K (while its norm times N is not less than the norm of P, since otherwise N vectors of that norm or smaller cannot add up to P):
       - If the norm of W is not larger than the norm of P and all coordinates of W are not larger than those of P:
         - W may be an element of a target subset, and the remaining elements can be found by solving step 2 for N-1, P-W, and the position of W in the list of vectors as the offset.
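The recursion above can be sketched roughly as follows. This is an illustration under assumptions, not the code from `src`: vectors are plain `[u8; 32]` arrays, results are returned as lists of indices into the sorted list, and `norm`, `checked_sub` and `search` are hypothetical names. It reads the loop bound as "stop once the norm of W times N falls below the norm of P", because the list is sorted largest first and the remaining picks cannot be longer than W.

```rust
const LANES: usize = 32;
type CountVec = [u8; LANES];

/// Norm of a vector: the total letter count, i.e. the length of the word.
fn norm(v: &CountVec) -> usize {
    v.iter().map(|&c| c as usize).sum()
}

/// Coordinate-wise `p - w`, or None if some coordinate of `w` exceeds that of `p`.
fn checked_sub(p: &CountVec, w: &CountVec) -> Option<CountVec> {
    let mut out = [0u8; LANES];
    for i in 0..LANES {
        out[i] = p[i].checked_sub(w[i])?;
    }
    Some(out)
}

/// Collects every list of exactly `n` indices (non-decreasing positions in `words`)
/// whose vectors add up to `p`. `words` must be sorted by norm, largest first.
fn search(
    words: &[CountVec],
    n: usize,
    p: &CountVec,
    k: usize,
    prefix: &mut Vec<usize>,
    found: &mut Vec<Vec<usize>>,
) {
    if n == 0 {
        // The empty subset is a solution only for the zero vector.
        if norm(p) == 0 {
            found.push(prefix.clone());
        }
        return;
    }
    let target = norm(p);
    for (i, w) in words.iter().enumerate().skip(k) {
        // Every remaining pick is no longer than `w`, so `n` of them cannot reach `target`.
        if norm(w) * n < target {
            break;
        }
        // `checked_sub` succeeds only if all coordinates of `w` fit into `p`.
        if let Some(rest) = checked_sub(p, w) {
            prefix.push(i);
            // Recursing with offset `i` (the position of `w`) allows the same word to repeat.
            search(words, n - 1, &rest, i, prefix, found);
            prefix.pop();
        }
    }
}
```

To cover "no longer than N words", a caller would run `search` for each word count from 1 to N with `k = 0` and an empty `prefix`.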

How to run

To solve the original task for three-word anagrams:

cargo run data\words.txt data\hashes.txt 3 "poultry outwits ants"

(Note that a CPU with AVX2 support is required: Intel Haswell (2013) or newer, or AMD Excavator (2015) or newer.)
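If you are unsure whether your CPU qualifies, a small standalone check along these lines may help. It uses the standard library's `is_x86_feature_detected!` macro; the `avx2_available` wrapper is a hypothetical helper and not part of this repository.

```rust
/// Reports at runtime whether the current CPU supports AVX2 (x86 and x86_64 only).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn avx2_available() -> bool {
    is_x86_feature_detected!("avx2")
}

/// On other architectures AVX2 does not exist, so the answer is always false.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn avx2_available() -> bool {
    false
}
```

Calling `avx2_available()` from a scratch `main` and printing the result is enough to tell whether the requirement above is met.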