Improved performance (dictionary => array)

feature-optimized-md5
Inga 🏳‍🌈 8 years ago
parent 1a45eece0f
commit a3a426f023
  1. README.md (17 changes)
  2. WhiteRabbit/StringsProcessor.cs (19 changes)
  3. WhiteRabbit/VectorsProcessor.cs (25 changes)

--- README.md
+++ README.md
@@ -13,7 +13,7 @@ WhiteRabbit.exe < wordlist
 Performance
 ===========
-Memory usage is minimal (for that kind of task), around 10-20MB.
+Memory usage is minimal (for that kind of task), less than 10MB.
 It is also somewhat optimized for likely intended phrases, as anagrams consisting of longer words are generated first.
 That's why the given hashes are solved much sooner than it takes to check all anagrams.
@@ -22,13 +22,13 @@ Anagrams generation is not parallelized, as even single-threaded performance for
 Multi-threaded performance with RyuJIT (.NET 4.6, 64-bit system) on quad-core Sandy Bridge @2.8GHz is as follows:
-* If only phrases of at most 4 words are allowed, then it takes less than 4.5 seconds to find and check all 7433016 anagrams; all hashes are solved in first 0.6 seconds.
+* If only phrases of at most 4 words are allowed, then it takes around 4 seconds to find and check all 7433016 anagrams; all hashes are solved in first 0.5 seconds.
-* If phrases of 5 words are allowed as well, then it takes around 13 minutes to find and check all 1348876896 anagrams; all hashes are solved in first 20 seconds. Most of time is spent on MD5 computations for correct anagrams, so there is not a lot to optimize further.
+* If phrases of 5 words are allowed as well, then it takes around 12 minutes to find and check all 1348876896 anagrams; all hashes are solved in first 18 seconds. Most of time is spent on MD5 computations for correct anagrams, so there is not a lot to optimize further.
-* If phrases of 6 words are allowed as well, then "more difficult" hash is solved in 20 seconds, "easiest" in 2.5 minutes, and "hard" in 6 minutes.
+* If phrases of 6 words are allowed as well, then "more difficult" hash is solved in 19 seconds, "easiest" in 2 minutes, and "hard" in less than 5 minutes.
-* If phrases of 7 words are allowed as well, then "more difficult" hash is solved in 2.5 minutes.
+* If phrases of 7 words are allowed as well, then "more difficult" hash is solved in ~2 minutes.
 Note that all measurements were done on a Release build; Debug build is significantly slower.
@@ -38,6 +38,7 @@ Implementation notes
 ====================
 1. We need to limit the number of words in an anagram by some reasonable number, as there are single-letter words in dictionary, and computing MD5 hashes for all anagrams consisting of single-letter words is computationally infeasible and could not have been intended by the challenge authors.
+In particular, as there are single-letter words for every letter in the original phrase, there are obvious anagrams consisting exclusively of the single-letter words; and the number of such anagrams equals to the number of all letter permutations of the original phrase, which is too high.
 2. Every word or phrase could be thought of as a vector in 26-dimensional space, with every component equal to the number of corresponding letters in the original word.
 That way, vector corresponding to some phrase equals to the sum of vectors of its words.
@@ -75,4 +76,8 @@ As we have ordered the words by weight, when we're looping over the dictionary,
 9. Another possible optimization with such an ordering is employing binary search.
 There is no need in processing all the words that are too large to be useful at this moment; we can start with a first word with a weight not exceeding distance between current partial sum and the target.
-10. And then, all that remains are implementation optimizations: precomputing weights, optimizing memory usage and loops, etc.
+10. And then, all that remains are implementation optimizations: precomputing weights, optimizing memory usage and loops, using byte arrays instead of strings, etc.
+11. Filtering the original dictionary (e.g. throwing away all single-letter words) does not really improve the performance, thanks to the optimizations mentioned in notes 7-9.
+This solution finds all anagrams, including those with single-letter words.
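
Implementation note 2 in the README describes words as letter-count vectors whose sum gives the vector of a phrase. The following is a minimal sketch of that idea, not code from this commit: `LetterVectors`, `ToVector`, `Add` and `IsAnagram` are hypothetical names, and plain `int[]` arrays of length 26 stand in for the `Vector<byte>` values the project itself uses.

```csharp
using System;
using System.Linq;

public static class LetterVectors
{
    // Hypothetical helper: counts occurrences of each letter a-z,
    // ignoring spaces and any other characters.
    public static int[] ToVector(string text)
    {
        var counts = new int[26];
        foreach (var ch in text.ToLowerInvariant())
        {
            if (ch >= 'a' && ch <= 'z')
            {
                counts[ch - 'a']++;
            }
        }

        return counts;
    }

    // Component-wise sum of two letter-count vectors.
    public static int[] Add(int[] left, int[] right)
    {
        return left.Zip(right, (a, b) => a + b).ToArray();
    }

    // A set of words is an anagram of the phrase exactly when
    // the sum of the word vectors equals the phrase vector.
    public static bool IsAnagram(string phrase, params string[] words)
    {
        var sum = words
            .Select(ToVector)
            .Aggregate(new int[26], (acc, vector) => Add(acc, vector));

        return sum.SequenceEqual(ToVector(phrase));
    }
}
```

For instance, `LetterVectors.IsAnagram("dormitory", "dirty", "room")` compares the sum of two word vectors against the phrase vector, which is the check the search performs at scale.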

--- WhiteRabbit/StringsProcessor.cs
+++ WhiteRabbit/StringsProcessor.cs
@@ -4,7 +4,6 @@
     using System.Collections.Generic;
     using System.Collections.Immutable;
     using System.Linq;
-    using System.Numerics;
     internal sealed class StringsProcessor
     {
@@ -15,22 +14,28 @@
             this.VectorsConverter = new VectorsConverter(filteredSource);
             // Dictionary of vectors to array of words represented by this vector
-            this.VectorsToWords = words
+            var vectorsToWords = words
                 .Select(word => new { word, vector = this.VectorsConverter.GetVector(word) })
                 .Where(tuple => tuple.vector != null)
                 .Select(tuple => new { tuple.word, vector = tuple.vector.Value })
                 .GroupBy(tuple => tuple.vector)
-                .ToDictionary(group => group.Key, group => group.Select(tuple => tuple.word).Distinct(new ByteArrayEqualityComparer()).ToArray());
+                .Select(group => new { vector = group.Key, words = group.Select(tuple => tuple.word).Distinct(new ByteArrayEqualityComparer()).ToArray() })
+                .ToList();
+            this.WordsDictionary = vectorsToWords.Select(tuple => tuple.words).ToArray();
             this.VectorsProcessor = new VectorsProcessor(
                 this.VectorsConverter.GetVector(filteredSource).Value,
                 maxWordsCount,
-                this.VectorsToWords.Keys);
+                vectorsToWords.Select(tuple => tuple.vector).ToArray());
         }
         private VectorsConverter VectorsConverter { get; }
-        private Dictionary<Vector<byte>, byte[][]> VectorsToWords { get; }
+        /// <summary>
+        /// WordsDictionary[vectorIndex] = [word1, word2, ...]
+        /// </summary>
+        private byte[][][] WordsDictionary { get; }
         private VectorsProcessor VectorsProcessor { get; }
@@ -67,13 +72,13 @@
             return Flatten(wordVariants.Item2).Select(words => Tuple.Create(item1, words));
         }
-        private Tuple<int, ImmutableStack<byte[][]>> ConvertVectorsToWords(Vector<byte>[] vectors)
+        private Tuple<int, ImmutableStack<byte[][]>> ConvertVectorsToWords(int[] vectors)
         {
             var length = vectors.Length;
             var words = new byte[length][][];
             for (var i = 0; i < length; i++)
             {
-                words[i] = this.VectorsToWords[vectors[i]];
+                words[i] = this.WordsDictionary[vectors[i]];
             }
             return Tuple.Create(length, ImmutableStack.Create(words));
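
The change in this file replaces a dictionary lookup keyed by vector with plain array indexing. A simplified sketch of that pattern follows, under stated assumptions: `IndexedGroups`, `Key`, `Build` and `Resolve` are hypothetical names, words are `string` values rather than `byte[]`, and a sorted-letters string stands in for the project's `Vector<byte>` key.

```csharp
using System;
using System.Linq;

public static class IndexedGroups
{
    // Hypothetical stand-in for a word's vector: its letters in sorted order.
    public static string Key(string word)
    {
        return new string(word.OrderBy(ch => ch).ToArray());
    }

    // Groups words by key and returns two parallel arrays:
    // Keys[i] is the i-th distinct key, Groups[i] holds the distinct words sharing it.
    public static (string[] Keys, string[][] Groups) Build(string[] words)
    {
        var grouped = words
            .GroupBy(Key)
            .Select(group => new { key = group.Key, words = group.Distinct().ToArray() })
            .ToList();

        return (
            grouped.Select(item => item.key).ToArray(),
            grouped.Select(item => item.words).ToArray());
    }

    // Resolves a sequence of group indices to word groups by array indexing,
    // so no hashing or key comparison happens per lookup.
    public static string[][] Resolve(string[][] groups, int[] indices)
    {
        var result = new string[indices.Length][];
        for (var i = 0; i < indices.Length; i++)
        {
            result[i] = groups[indices[i]];
        }

        return result;
    }
}
```

The search then only has to pass `int` indices around; the keys array is needed once to build the search dictionary, and the groups array once at the end to turn indices back into words.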

--- WhiteRabbit/VectorsProcessor.cs
+++ WhiteRabbit/VectorsProcessor.cs
@@ -17,7 +17,7 @@
             PrecomputedPermutationsGenerator.HamiltonianPermutations(0);
         }
-        public VectorsProcessor(Vector<byte> target, int maxVectorsCount, IEnumerable<Vector<byte>> dictionary)
+        public VectorsProcessor(Vector<byte> target, int maxVectorsCount, Vector<byte>[] dictionary)
         {
             if (Enumerable.Range(0, Vector<byte>.Count).Any(i => target[i] > MaxComponentValue))
             {
@@ -37,7 +37,7 @@
         private ImmutableArray<VectorInfo> Dictionary { get; }
         // Produces all sequences of vectors with the target sum
-        public ParallelQuery<Vector<byte>[]> GenerateSequences()
+        public ParallelQuery<int[]> GenerateSequences()
         {
             return GenerateUnorderedSequences(this.Target, GetVectorNorm(this.Target, this.Target), this.MaxVectorsCount, this.Dictionary, 0)
                 .AsParallel()
@@ -62,11 +62,11 @@
             return norm;
         }
-        private static VectorInfo[] FilterVectors(IEnumerable<Vector<byte>> vectors, Vector<byte> target)
+        private static VectorInfo[] FilterVectors(Vector<byte>[] vectors, Vector<byte> target)
         {
-            return vectors
+            return Enumerable.Range(0, vectors.Length)
-                .Where(vector => Vector.GreaterThanOrEqualAll(target, vector))
+                .Where(i => Vector.GreaterThanOrEqualAll(target, vectors[i]))
-                .Select(vector => new VectorInfo(vector, GetVectorNorm(vector, target)))
+                .Select(i => new VectorInfo(vectors[i], GetVectorNorm(vectors[i], target), i))
                 .OrderByDescending(vectorInfo => vectorInfo.Norm)
                 .ToArray();
         }
@@ -75,7 +75,7 @@
         // In every sequence, next vector always goes after the previous one from dictionary.
         // E.g. if dictionary is [x, y, z], then only [x, y] sequence could be generated, and [y, x] will never be generated.
         // That way, the complexity of search goes down by a factor of MaxVectorsCount! (as if [x, y] does not add up to a required target, there is no point in checking [y, x])
-        private static IEnumerable<ImmutableStack<Vector<byte>>> GenerateUnorderedSequences(Vector<byte> remainder, int remainderNorm, int allowedRemainingWords, ImmutableArray<VectorInfo> dictionary, int currentDictionaryPosition)
+        private static IEnumerable<ImmutableStack<int>> GenerateUnorderedSequences(Vector<byte> remainder, int remainderNorm, int allowedRemainingWords, ImmutableArray<VectorInfo> dictionary, int currentDictionaryPosition)
         {
             if (allowedRemainingWords > 1)
             {
@@ -90,7 +90,7 @@
                     var currentVectorInfo = dictionary[i];
                     if (currentVectorInfo.Vector == remainder)
                     {
-                        yield return ImmutableStack.Create(currentVectorInfo.Vector);
+                        yield return ImmutableStack.Create(currentVectorInfo.Index);
                     }
                     else if (currentVectorInfo.Norm < requiredRemainderPerWord)
                     {
@@ -102,7 +102,7 @@
                         var newRemainderNorm = remainderNorm - currentVectorInfo.Norm;
                         foreach (var result in GenerateUnorderedSequences(newRemainder, newRemainderNorm, newAllowedRemainingWords, dictionary, i))
                         {
-                            yield return result.Push(currentVectorInfo.Vector);
+                            yield return result.Push(currentVectorInfo.Index);
                         }
                     }
                 }
@@ -114,7 +114,7 @@
                     var currentVectorInfo = dictionary[i];
                     if (currentVectorInfo.Vector == remainder)
                     {
-                        yield return ImmutableStack.Create(currentVectorInfo.Vector);
+                        yield return ImmutableStack.Create(currentVectorInfo.Index);
                     }
                     else if (currentVectorInfo.Norm < remainderNorm)
                     {
@@ -176,15 +176,18 @@
         private struct VectorInfo
         {
-            public VectorInfo(Vector<byte> vector, int norm)
+            public VectorInfo(Vector<byte> vector, int norm, int index)
             {
                 this.Vector = vector;
                 this.Norm = norm;
+                this.Index = index;
             }
             public Vector<byte> Vector { get; }
             public int Norm { get; }
+            public int Index { get; }
         }
     }
 }
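
The `FilterVectors` change above attaches each vector's original array position before filtering and sorting, so that position survives the reordering. The sketch below shows the same technique on plain integers; it is an illustration rather than the project's code, and `IndexPreservingSort`, `Entry` and `FilterAndSort` are hypothetical names, with an `int` weight standing in for the vector and its norm.

```csharp
using System;
using System.Linq;

public static class IndexPreservingSort
{
    // Hypothetical counterpart of VectorInfo: a value plus its original position.
    public struct Entry
    {
        public Entry(int weight, int index)
        {
            this.Weight = weight;
            this.Index = index;
        }

        public int Weight { get; }

        public int Index { get; }
    }

    // Enumerates positions instead of values, so that each surviving entry
    // still knows where it came from after filtering and sorting.
    public static Entry[] FilterAndSort(int[] weights, int maxWeight)
    {
        return Enumerable.Range(0, weights.Length)
            .Where(i => weights[i] <= maxWeight)
            .Select(i => new Entry(weights[i], i))
            .OrderByDescending(entry => entry.Weight)
            .ToArray();
    }
}
```

With input weights `{ 3, 9, 5, 1 }` and a limit of 5, the entry for weight 5 comes first and still carries index 2, so a caller can map it back to the unsorted source array.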
