diff --git a/README.md b/README.md
index f423a60..809f6bf 100644
--- a/README.md
+++ b/README.md
@@ -27,4 +27,63 @@ The basic word list:
 https://simple.wikipedia.org/wiki/Wikipedia:BASIC_English_alphabetical_wordlist
 
 Combined word list (includes some compound words):
-https://simple.wikipedia.org/wiki/Wikipedia:Basic_English_combined_wordlist
\ No newline at end of file
+https://simple.wikipedia.org/wiki/Wikipedia:Basic_English_combined_wordlist
+
+## Solution
+
+A preview is available at https://inga-lovinde.github.io/static/overleaf-demo/
+
+Unfortunately, 1 hour is not nearly enough time for this task.
+
+The Wikipedia wordlist only contains base words and does not categorize them by part of speech.
+However, according to the Wikipedia article on BASIC English, words can also be constructed from base words and certain suffixes, depending on the part of speech.
+For example, the wordlist contains the words "ant" and "any" but not "ants".
+In order to determine that "ants" is correct but "anys" is not, we would need to know that "ant" is a noun and "any" is not,
+and that would require tagging all ~850 words in the list with their parts of speech, which is not something that can be done in an hour.
+
+So instead of relying on wordlists from Wikipedia, I decided to use spellchecking data from the BASIC English website
+(it is no longer online, but it is archived at https://web.archive.org/web/20230326131210/http://basic-english.org/down/download.html ).
+
+The OpenOffice.org spellchecker files available there are not usable in their raw form,
+as they store stems and suffixes separately, and I did not have enough time to figure out the rules for joining them together.
+However, the spellchecker installer for OOo 3.0 also contains a thesaurus, and that thesaurus seems to contain (almost?) all of the words in BASIC English.
+
+Extracting the wordlist and fixing bugs took enough time to bring me to the edge of the time budget.
+
+Since I did not have any time left to think about the optimal way to find spelling errors,
+I decided to rely on the state machines that browsers provide out of the box: that is, on regexes.
+It is not very difficult to create a regex that matches all words not present in the wordlist, but it still takes some time.
+
+Since this task is for a full-stack Node/React role, I decided to make this a React-compatible component in a Preact-based app.
+Creating an empty Preact app took some more time.
+The next question was how to bring the wordlist to the frontend.
+I made a mistake here: instead of simply adding another build step to the npm `build` command, I decided to try to integrate into the `vite` build process
+(despite not having any prior experience with `vite` worth speaking of), and that mistake cost me too much time.
+But ultimately I somehow (probably very incorrectly) got it to output a regex string into a separate file in `dist/assets`,
+given the raw OOo thesaurus in `src`.
+
+By that point I was significantly over time, so I just hacked together a very primitive textarea plus a validation function,
+which simply displays the entered text in a separate area with the incorrect words highlighted.
+
+This is very user-unfriendly (a much friendlier way would be to use a `contenteditable` element instead of a `textarea` and highlight incorrect words in place, in real time),
+but on the other hand it only took me half an hour to create a component that fetches the regex from the server, compiles it,
+accepts input from the user, validates it, and displays errors.
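+
+For illustration, a minimal sketch of that component is below. This is not the exact code from the repository; the fetch path (`/assets/misspelled-words.regex`) and component name are made up, and it assumes the generated file contains the source of a regex whose matches are exactly the words that are *not* in the wordlist.
+
+```tsx
+import { useEffect, useMemo, useState } from 'preact/hooks';
+
+type Chunk = { value: string; invalid: boolean };
+
+export function SpellcheckDemo() {
+  const [pattern, setPattern] = useState<string | null>(null);
+  const [text, setText] = useState('');
+
+  // Fetch the pre-generated regex source once; it matches words that are
+  // NOT in the BASIC English wordlist.
+  useEffect(() => {
+    fetch('/assets/misspelled-words.regex')
+      .then((response) => response.text())
+      .then(setPattern);
+  }, []);
+
+  // Split the entered text into chunks, marking the substrings matched by the
+  // "incorrect word" regex. matchAll avoids the stateful-`exec` pitfalls
+  // mentioned above.
+  const chunks = useMemo<Chunk[]>(() => {
+    if (pattern === null) {
+      return [{ value: text, invalid: false }];
+    }
+    const incorrectWords = new RegExp(pattern, 'gi');
+    const result: Chunk[] = [];
+    let lastIndex = 0;
+    for (const match of text.matchAll(incorrectWords)) {
+      const index = match.index ?? 0;
+      if (index > lastIndex) {
+        result.push({ value: text.slice(lastIndex, index), invalid: false });
+      }
+      result.push({ value: match[0], invalid: true });
+      lastIndex = index + match[0].length;
+    }
+    result.push({ value: text.slice(lastIndex), invalid: false });
+    return result;
+  }, [pattern, text]);
+
+  return (
+    <div>
+      <textarea
+        value={text}
+        onInput={(event) => setText((event.target as HTMLTextAreaElement).value)}
+      />
+      <p>
+        {chunks.map((chunk) => (chunk.invalid ? <mark>{chunk.value}</mark> : chunk.value))}
+      </p>
+    </div>
+  );
+}
+```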
+
+Time spent (approximately):
+* Creating an empty Preact-based project with my favorite tsconfig and eslint configs, by copying another test assignment and removing everything unneeded: **10 minutes**;
+* Finding the BASIC English thesaurus, parsing it, and then fixing bugs (related to repeated usage of `exec` on the same regex): **40 minutes**;
+* Writing a regex generator that, given a wordlist, returns a regex matching all words missing from that wordlist: **10 minutes**;
+* Trying to create a `vite` plugin that compiles the source thesaurus to the regex string, and to do it in a socially acceptable way
+  (it still does it in a rather bad way, but at least it works): **1.5 hours**;
+* Writing a React-compatible spellchecking frontend component: **30 minutes**;
+* Writing this text, focusing on completeness only: **20 minutes**.
+
+Besides the design, I am unhappy with the performance of the final solution.
+While compiling the regex only takes around 1 ms on my laptop,
+validating the XKCD text takes 200 ms (and it seems to scale linearly with the text length), which is too long.
+Using regexes for everything was probably a bad idea after all;
+performance would likely be better if I used a regex only to extract all words from the input (`O(input_length)`)
+and then looked up each of those words in a dictionary `Map` (`O(input_word_count * log(dictionary_word_count))`, presumably); see the sketch at the end of this section.
+
+Another issue is that, probably because of the regex file's extension, it is transferred as `application/octet-stream` and is not gzipped,
+which means that validation only becomes available after all 211 KB are transferred, even though the file would probably be much smaller gzipped.
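+
+Returning to the performance point above: that alternative is not implemented here, but a rough sketch of what the extract-then-look-up validation could look like is below (`findMisspelledWords` and `wordlist` are illustrative names; building the wordlist from the thesaurus is out of scope for the sketch).
+
+```ts
+// Extract words with a single O(input_length) regex pass, then check each one
+// against a Set built from the wordlist.
+export function findMisspelledWords(text: string, wordlist: readonly string[]): string[] {
+  const dictionary = new Set(wordlist.map((word) => word.toLowerCase()));
+  const misspelled: string[] = [];
+  for (const match of text.matchAll(/[a-z']+/gi)) {
+    const word = match[0].toLowerCase();
+    if (!dictionary.has(word)) {
+      misspelled.push(match[0]);
+    }
+  }
+  return misspelled;
+}
+```
+
+Since JS `Set`/`Map` are hash-based, each lookup is effectively constant time on average, so the whole validation pass should be roughly `O(input_length)` instead of being dominated by one huge regex.
\ No newline at end of file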