added solution description to README

1 year ago · b9d1e67e23
parent 26b23d879b
commit b9d1e67e23
1 changed files with 60 additions and 1 deletions
--- a/README.md
+++ b/README.md
@ -27,4 +27,63 @@ The basic word list:
 https://simple.wikipedia.org/wiki/Wikipedia:BASIC_English_alphabetical_wordlist

 Combined word list (includes some compound words):
-https://simple.wikipedia.org/wiki/Wikipedia:Basic_English_combined_wordlist
+https://simple.wikipedia.org/wiki/Wikipedia:Basic_English_combined_wordlist
+
+## Solution
+
+Preview is available on https://inga-lovinde.github.io/static/overleaf-demo/
+
+Unfortunately, 1 hour is not nearly enough time to work on this task.
+
+Wikipedia wordlist link only contains base words, and does not categorize words into parts of speech.
+However, according to the Wikipedia link on BASIC English, words can also be constructed from base words and some suffixes, depending on the part of speech.
+For example, Wikipedia link contains the words "ant" and "any" but not "ants".
+In order to determine that "ants" is correct but "anys" is not, we would need to know that "ant" is a noun but "any" is not,
+and that would require tagging all ~850 words in this list with their parts of speech, which is not something to be done in an hour.
+
+So instead of relying on worlists from Wikipedia, I found and decided to use spellchecking data from BASIC English website
+(it's no longer online, but it's archived at https://web.archive.org/web/20230326131210/http://basic-english.org/down/download.html ).
+
+Spellchecker files for OpenOffice.org available there are not suitable to be used in their raw forms,
+as they contain stems and suffixes separately, and I didn't have enough time to figure out what are the rules for joining them together.
+But spellchecker installer for OOo 3.0 also contains thesaurus, and thesaurus seems to contain (almost?) all of the words in BASIC English.
+
+Extracting wordlist from spellchecker and fixing bugs took enough time bringing me to the edge of the time budget.
+
+Since I didn't have any time to think of the optimal way to find spelling errors,
+I decided to rely on state machines that browsers have out-of-the-box: that is, on regexes.
+It is not very difficult to create a regex that matches all words not present in the wordlist, but it still takes some time.
+
+Since this task is for a fullstack Node/React role, I decided to make this a react-compatible component in a preact-based app.
+Creating an empty preact app took some more time.
+The next question was how to bring the word list to the frontend.
+I made a mistake here: instead of simply adding another build step to npm `build` command, I decided to try to integrate into `vite` build process
+(despite not having any prior experience with `vite` worth speaking of), and that mistake cost me too much time.
+But ultimately I somehow (probably very incorrectly) got it to output a regex string into a separate file in `dist/assets`,
+given raw OOo thesaurus in `src`.
+
+By that point I was significantly overtime, so I just hacked together a very primitive textarea + validation function,
+simply displaying in a separate area the entered text with incorrect words highlighted.
+
+Very user-unfriendly (much friendlier way would be to use `contenteditable` element instead of `textarea`, and highlight incorrect words in-place in real time),
+but on the other hand it only took me half an hour to create a component that would fetch regex from the server, compile it,
+accept input from the user, validate it, and display errors.
+
+Time spent (approximately):
+* Creating empty preact-based project with my favorite tsconfig and eslint configs, by copying another test assignment and removing everything unneeded: **10 minutes**;
+* Finding BASIC English thesaurus, parsing it, and then fixing bugs (related to the repeated usage of `exec` on the same regex): **40 minutes**;
+* Writing a regex generator that would, for a word list, return a regex matching all words missing from that word list: **10 minutes**;
+* Trying to create a `vite` plugin that would compile source thesaurus to the regex string, and do it in a socially acceptable way
+  (and still doing it in a very bad way, but at least it works): **1.5 hours**;
+* Writing a react-compatible spellchecking frontend component: **3o minutes**;
+* Writing this text, focusing on completeness only: **20 minutes**.
+
+Besides design, I'm unhappy with the performance of the final solution.
+While parsing regex only takes around 1ms on my laptop,
+validating XKCD text takes 200ms (and it seems to scale linearly with the text length), which is too long.
+Probably using regexes for everything was a bad idea after all,
+and performance would be better if I would only use regexes to extract all words from the input (`O(input_length)`),
+and then searched all these words in a dictionary `Map` (`O(input_word_count * log(dictionary_word_count))`, presumably).
+
+And another issue is that, probably because of the regex file extension, it's transferred as `application/octet-stream` and not gzipped,
+which means that validation only becomes available after all 211KB are transferred, even though it would probably be much smaller gzipped.