The problem's phase space consists of pairs ("current step (modulo number of directions)", "current node"),
with a total of ~200 thousand states for the puzzle input.

For every state, there is a well-defined transition to a single next state.

There are six starting states (all pairs with "current step" equal to zero and "current node" ending with A),
and 263*6 ending states (all pairs with any "current step" and a "current node" ending with Z).
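A minimal Python sketch of this phase space (not the author's code): states are `(step, node)` pairs, and `successor` returns the single next state. For brevity it uses the part 2 example from the puzzle statement, which has two ghosts instead of six; `DIRECTIONS`, `NODES` and `successor` are hypothetical names.

```python
# Part 2 example from the puzzle statement: node -> (left, right).
DIRECTIONS = "LR"
NODES = {
    "11A": ("11B", "XXX"),
    "11B": ("XXX", "11Z"),
    "11Z": ("11B", "XXX"),
    "22A": ("22B", "XXX"),
    "22B": ("22C", "22C"),
    "22C": ("22Z", "22Z"),
    "22Z": ("22B", "22B"),
    "XXX": ("XXX", "XXX"),
}

def successor(state):
    """A state is (step modulo len(DIRECTIONS), node); return its unique successor."""
    step, node = state
    left, right = NODES[node]
    nxt = left if DIRECTIONS[step] == "L" else right
    return (step + 1) % len(DIRECTIONS), nxt

all_states = [(s, n) for s in range(len(DIRECTIONS)) for n in NODES]
starting = [(0, n) for n in NODES if n.endswith("A")]
ending = [(s, n) for s, n in all_states if n.endswith("Z")]

print(len(all_states), len(starting), len(ending))  # 16 2 4
```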

Every trajectory is eventually periodic: since every state has exactly one successor and the number of states is finite,
no matter which state we start from, we will eventually end up in a loop no longer than the total number of states (~200k).
There might be several disjoint loops.
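One way to find the loop, sketched in Python under the same assumptions (the puzzle's part 2 example, hypothetical names): record the step at which each state is first visited; the first state seen twice is where the walk enters its loop. This trades memory for simplicity compared with Floyd's cycle detection, which is affordable with ~200k states.

```python
# Part 2 example from the puzzle statement: node -> (left, right).
DIRECTIONS = "LR"
NODES = {
    "11A": ("11B", "XXX"), "11B": ("XXX", "11Z"), "11Z": ("11B", "XXX"),
    "22A": ("22B", "XXX"), "22B": ("22C", "22C"), "22C": ("22Z", "22Z"),
    "22Z": ("22B", "22B"), "XXX": ("XXX", "XXX"),
}

def successor(state):
    step, node = state
    left, right = NODES[node]
    return (step + 1) % len(DIRECTIONS), left if DIRECTIONS[step] == "L" else right

def find_cycle(start):
    """Walk from `start` until a state repeats.

    Returns (tail_length, cycle_length): the walk enters its loop after
    tail_length steps and then repeats every cycle_length steps.
    """
    first_seen = {}
    state, t = start, 0
    while state not in first_seen:
        first_seen[state] = t
        state = successor(state)
        t += 1
    tail = first_seen[state]
    return tail, t - tail

print(find_cycle((0, "11A")))  # (1, 2)
print(find_cycle((0, "22A")))  # (1, 6)
```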

One way to solve the problem would be to use some complicated math to compute the result.
Another is to brute-force it naively, doing exactly what the puzzle describes:
running several "ghosts", one from each starting state, and checking on every step whether all the current states are "ending".
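The naive approach as a short Python sketch on the puzzle's part 2 example (`brute_force` is a hypothetical name). It favors readability and applies none of the optimizations discussed in this README.

```python
# Part 2 example from the puzzle statement: node -> (left, right).
DIRECTIONS = "LR"
NODES = {
    "11A": ("11B", "XXX"), "11B": ("XXX", "11Z"), "11Z": ("11B", "XXX"),
    "22A": ("22B", "XXX"), "22B": ("22C", "22C"), "22C": ("22Z", "22Z"),
    "22Z": ("22B", "22B"), "XXX": ("XXX", "XXX"),
}

def brute_force(directions, nodes):
    """Advance every ghost one step at a time until all stand on a node ending with 'Z'."""
    ghosts = [n for n in nodes if n.endswith("A")]
    steps = 0
    while not all(g.endswith("Z") for g in ghosts):
        side = 0 if directions[steps % len(directions)] == "L" else 1
        ghosts = [nodes[g][side] for g in ghosts]
        steps += 1
    return steps

print(brute_force(DIRECTIONS, NODES))  # 6
```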

For the brute force to be as fast as possible,
we need to minimize the number of conditions, dereferences and computations inside the loop.

There is only so much that we can do about storage
(with ~200k states, storing the next state takes at least 18 bits per state; times 200k, that's 450KB,
way larger than any L1 cache).
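The storage estimate as a quick Python calculation, using the rounded figure of 200k states:

```python
n_states = 200_000                            # rounded state count from the text
bits_per_state = (n_states - 1).bit_length()  # bits needed to address any state
total_bytes = bits_per_state * n_states // 8
print(bits_per_state, total_bytes)  # 18 450000
```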

For simplicity, I store the states in an array of 270*1024 u32 (i.e. about one megabyte),
still just a bit more than a modern L2 cache per core;
the array layout is optimized for access: the index is "current step" * 1024 + "current node",
so on every step we stay more or less in the same region of the array
(we move forward by ~1k entries, or 4KB of memory, on every step).
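A Python sketch of the flat layout, assuming each row holds 1024 node slots so that the flat index is `step * 1024 + node` (an assumption consistent with a 270*1024 array of 10-bit node indexes). `array("I")` stands in for a `u32` array (its item size is platform-dependent, typically 4 bytes); `build_table` and `ROW` are hypothetical names, and the input is the puzzle's part 2 example.

```python
from array import array

# Part 2 example from the puzzle statement: node -> (left, right).
DIRECTIONS = "LR"
NODES = {
    "11A": ("11B", "XXX"), "11B": ("XXX", "11Z"), "11Z": ("11B", "XXX"),
    "22A": ("22B", "XXX"), "22B": ("22C", "22C"), "22C": ("22Z", "22Z"),
    "22Z": ("22B", "22B"), "XXX": ("XXX", "XXX"),
}

ROW = 1024  # node slots per row (assumed: node indexes fit in 10 bits)

def build_table(directions, nodes):
    """Flatten the transition function into a single array of unsigned ints.

    Entry [step * ROW + node] holds the flat index of the successor state, so a
    walk moves forward by one row (ROW entries) per step, wrapping at the end.
    """
    index = {name: i for i, name in enumerate(nodes)}  # plain numbering, no renumbering
    n_dirs = len(directions)
    table = array("I", [0]) * (n_dirs * ROW)
    for step, d in enumerate(directions):
        next_row = ((step + 1) % n_dirs) * ROW
        for name, (left, right) in nodes.items():
            target = left if d == "L" else right
            table[step * ROW + index[name]] = next_row + index[target]
    return table, index

table, index = build_table(DIRECTIONS, NODES)
state = 0 * ROW + index["11A"]  # step 0, node 11A
for _ in range(2):
    state = table[state]
print(state == 0 * ROW + index["11Z"])  # True
```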

To make checking whether a state is "final" cheap, I slightly renumber the list of nodes:
nodes that end with Z get the high three bits of their 10-bit index set to 1
(the puzzle input has 770 nodes, so no ordinary index reaches `0b1110000000` = 896 and has all three bits set).
Unfortunately, this caused collisions on the puzzle input
(there are "final" nodes on lines 320 and 694 with the same last seven bits),
so I had to reorder the puzzle input manually;
the easiest fix was to move all nodes ending with Z to the end of the file,
since six consecutive indexes always differ in their last seven bits.
This way, a state is final iff its eighth, ninth and tenth bits are set.
It's also easy enough to check all six current states at once
(just bitwise-and them all, bitwise-and the result with a `0b1110000000` mask, and check that the result matches the mask).
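A Python sketch of the renumbering and the combined check, assuming fewer than 896 nodes so that no ordinary index has all three high bits set. `renumber`, `all_final` and `MASK` are hypothetical names; the Z nodes are listed last, mirroring the manual reordering of the input.

```python
from functools import reduce
import operator

MASK = 0b1110000000  # the eighth, ninth and tenth bits (bits 7-9, counting from 0)

def renumber(names):
    """Assign 10-bit indexes; nodes ending with 'Z' get the three high bits set.

    Z nodes are placed last, so their low seven bits are consecutive and
    therefore distinct (as long as there are at most 128 of them).
    """
    ordered = [n for n in names if not n.endswith("Z")] + [n for n in names if n.endswith("Z")]
    index = {n: (i | MASK) if n.endswith("Z") else i for i, n in enumerate(ordered)}
    assert len(set(index.values())) == len(index), "two nodes share an index"
    return index

def all_final(states):
    """True iff every state has bits 7-9 set: one AND-reduction, one mask, one compare."""
    return (reduce(operator.and_, states) & MASK) == MASK

names = ["11A", "11B", "11Z", "22A", "22B", "22C", "22Z", "XXX"]
index = renumber(names)
print(index["11Z"], index["22Z"])               # 902 903
print(all_final([index["11Z"], index["22Z"]]))  # True
print(all_final([index["11Z"], index["22C"]]))  # False
```

Because only the low ten bits carry the node, the same check also works on flat indexes that include a step offset in the higher bits.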

So ultimately, every step is just six bitwise-ands, one comparison
(which is true only once we have found the result, so the branch is practically never mispredicted),
and six dereferences and assignments.
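A Python sketch of the whole brute-force loop, assuming a flat index of `step * 1024 + node`, Z nodes renumbered with `0b1110000000` set, and the puzzle's part 2 example as input (`prepare` and `run` are hypothetical names). Python is nowhere near 100 million steps per second; the point is only the shape of each step.

```python
from array import array
from functools import reduce
import operator

# Part 2 example from the puzzle statement: node -> (left, right).
DIRECTIONS = "LR"
NODES = {
    "11A": ("11B", "XXX"), "11B": ("XXX", "11Z"), "11Z": ("11B", "XXX"),
    "22A": ("22B", "XXX"), "22B": ("22C", "22C"), "22C": ("22Z", "22Z"),
    "22Z": ("22B", "22B"), "XXX": ("XXX", "XXX"),
}

ROW = 1024           # node slots per row (10-bit node indexes)
MASK = 0b1110000000  # set in the index of every node ending with 'Z'

def prepare(directions, nodes):
    """Renumber nodes (Z nodes last, high bits set) and build the flat successor table."""
    ordered = [n for n in nodes if not n.endswith("Z")] + [n for n in nodes if n.endswith("Z")]
    index = {n: (i | MASK) if n.endswith("Z") else i for i, n in enumerate(ordered)}
    n_dirs = len(directions)
    table = array("I", [0]) * (n_dirs * ROW)
    for step, d in enumerate(directions):
        next_row = ((step + 1) % n_dirs) * ROW
        for name, (left, right) in nodes.items():
            target = left if d == "L" else right
            table[step * ROW + index[name]] = next_row + index[target]
    starts = [index[n] for n in nodes if n.endswith("A")]  # step 0 is row 0
    return table, starts

def run(table, states):
    """Hot loop: an AND-reduction, a mask compare and one table lookup per ghost per step."""
    steps = 0
    while (reduce(operator.and_, states) & MASK) != MASK:
        states = [table[s] for s in states]
        steps += 1
    return steps

table, starts = prepare(DIRECTIONS, NODES)
print(run(table, starts))  # 6
```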

The resulting performance is over 100 million steps per second (single-threaded),
meaning that we get to ~250 billion steps in just half an hour.

Unfortunately, the result it produces (around 250 billion) is apparently incorrect:
the AoC website does not accept it.
There must be a bug somewhere, even though it works correctly on the (modified) sample input.

Another option, with math, would be to iterate over all possible direction numbers:
for every direction number (out of 270) and every assignment of final nodes to the six ghosts (6^6 ~= 47k), compute,
for each of the six starting states, how many steps it takes to get to this node, and how many more to get to it again.
(Answering that question by brute force would take on the order of 200k operations for every starting state and final state,
plus another 200k for every final state, so that's about 200k*(270*6 + 270*6*6) ~= 2 billion operations
to precompute all ~10k values;
but this could be optimized by identifying the shape of the transitions,
untangling the transition graph into a set of loops and of paths leading into these loops.)
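A Python sketch of answering that question for one starting state and one target state (the puzzle's part 2 example; `hit_times` is a hypothetical name). Since one walk records the first visit of every reachable state, a single walk per starting state is in fact enough to read off the answers for all target states at once.

```python
# Part 2 example from the puzzle statement: node -> (left, right).
DIRECTIONS = "LR"
NODES = {
    "11A": ("11B", "XXX"), "11B": ("XXX", "11Z"), "11Z": ("11B", "XXX"),
    "22A": ("22B", "XXX"), "22B": ("22C", "22C"), "22C": ("22Z", "22Z"),
    "22Z": ("22B", "22B"), "XXX": ("XXX", "XXX"),
}

def successor(state):
    step, node = state
    left, right = NODES[node]
    return (step + 1) % len(DIRECTIONS), left if DIRECTIONS[step] == "L" else right

def hit_times(start, target):
    """Return (a, b): the walk from `start` is at `target` exactly at steps a + b*k, k >= 0.

    b is None if `target` lies on the path leading into the loop (visited once);
    the result is None if `target` is never reached.
    """
    first_seen = {}
    state, t = start, 0
    while state not in first_seen:  # at most ~200k iterations for the real input
        first_seen[state] = t
        state = successor(state)
        t += 1
    tail = first_seen[state]        # steps before entering the loop
    period = t - tail               # loop length
    if target not in first_seen:
        return None
    a = first_seen[target]
    return (a, period) if a >= tail else (a, None)

print(hit_times((0, "11A"), (0, "11Z")))  # (2, 2)
print(hit_times((0, "22A"), (0, "22Z")))  # (6, 6)
print(hit_times((0, "22A"), (1, "22Z")))  # (3, 6)
print(hit_times((0, "11A"), (0, "22Z")))  # None
```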

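The answer to each such question would have the form a_i+b_i*k, for some a_i and b_i, for every integer k>=0
(or just a_i, if the node only lies on the path leading into a loop).
Knowing a_i and b_i for each of the six questions, we could use modular arithmetic
(the Chinese remainder theorem, generalized to moduli that need not be coprime) to find A and B such that
for every k>=0, A+B*k steps from the starting states produce exactly this configuration,
with A being the first step at which we reach this configuration and B the least common multiple of the b_i.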
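A Python sketch of that combination step (hypothetical names `combine` and `combine_all`; requires Python 3.8+ for `pow(x, -1, m)`), assuming every b_i is positive, i.e. each target lies on its ghost's loop:

```python
from math import gcd

def combine(a1, b1, a2, b2):
    """Merge step sets {a1 + b1*k} and {a2 + b2*k} (k >= 0, with b1, b2 > 0).

    Returns (A, B) such that the common steps are exactly A + B*k, or None if there are none.
    """
    g = gcd(b1, b2)
    if (a2 - a1) % g:
        return None  # the congruences are incompatible
    m = b2 // g
    # Solve a1 + b1*t = a2 (mod b2), i.e. (b1/g)*t = (a2 - a1)/g (mod m).
    t = 0 if m == 1 else (a2 - a1) // g * pow(b1 // g, -1, m) % m
    B = b1 * m                       # lcm(b1, b2)
    lo = max(a1, a2)                 # neither sequence starts before its first hit
    A = lo + (a1 + b1 * t - lo) % B  # smallest common step >= lo
    return A, B

def combine_all(pairs):
    """Fold combine() over the (a_i, b_i) pairs of all ghosts."""
    A, B = pairs[0]
    for a, b in pairs[1:]:
        merged = combine(A, B, a, b)
        if merged is None:
            return None
        A, B = merged
    return A, B

print(combine_all([(2, 2), (6, 6)]))  # (6, 6)
print(combine_all([(2, 3), (3, 5)]))  # (8, 15)
print(combine_all([(2, 2), (3, 6)]))  # None
```

Folding pairwise is valid because, after each merge, the common steps are again exactly of the form A + B*k with A no smaller than any earlier a_i.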

And then we would just need to find the smallest A among all ~10 million configurations.
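An end-to-end Python sketch of the whole search on the puzzle's part 2 example (hypothetical names throughout). It assumes every reachable target lies on its ghost's loop and simply skips targets that are only on the path leading into the loop; for the real input (about 270 * 6^6 configurations) it would be slow in Python, so it only illustrates the structure.

```python
from itertools import product
from math import gcd

# Part 2 example from the puzzle statement: node -> (left, right).
DIRECTIONS = "LR"
NODES = {
    "11A": ("11B", "XXX"), "11B": ("XXX", "11Z"), "11Z": ("11B", "XXX"),
    "22A": ("22B", "XXX"), "22B": ("22C", "22C"), "22C": ("22Z", "22Z"),
    "22Z": ("22B", "22B"), "XXX": ("XXX", "XXX"),
}

def successor(state):
    step, node = state
    left, right = NODES[node]
    return (step + 1) % len(DIRECTIONS), left if DIRECTIONS[step] == "L" else right

def visits(start):
    """First-visit step of every state reachable from `start`, plus tail length and loop period."""
    first_seen, state, t = {}, start, 0
    while state not in first_seen:
        first_seen[state] = t
        state, t = successor(state), t + 1
    return first_seen, first_seen[state], t - first_seen[state]

def combine(a1, b1, a2, b2):
    """Generalized CRT: steps common to {a1 + b1*k} and {a2 + b2*k}, as (A, B) or None."""
    g = gcd(b1, b2)
    if (a2 - a1) % g:
        return None
    m = b2 // g
    t = 0 if m == 1 else (a2 - a1) // g * pow(b1 // g, -1, m) % m
    lo = max(a1, a2)
    return lo + (a1 + b1 * t - lo) % (b1 * m), b1 * m

def smallest_common_step():
    ghosts = [visits((0, n)) for n in NODES if n.endswith("A")]
    finals = [n for n in NODES if n.endswith("Z")]
    best = None
    for step in range(len(DIRECTIONS)):                     # every direction number
        for choice in product(finals, repeat=len(ghosts)):  # every assignment of final nodes
            acc = (0, 1)                                    # "any step >= 0"
            for (seen, tail, period), node in zip(ghosts, choice):
                a = seen.get((step, node))
                if a is None or a < tail:  # unreachable, or only on the tail (not handled here)
                    acc = None
                    break
                acc = combine(*acc, a, period)
                if acc is None:
                    break
            if acc is not None and (best is None or acc[0] < best):
                best = acc[0]
    return best

print(smallest_common_step())  # 6
```

On this example the result matches the expected answer of 6 steps given in the puzzle statement.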

But I can't be bothered to do this now.