added solution description to README

main
Inga 🏳‍🌈 4 months ago
parent 26b23d879b
commit b9d1e67e23
  1. 61
      README.md

@ -27,4 +27,63 @@ The basic word list:
https://simple.wikipedia.org/wiki/Wikipedia:BASIC_English_alphabetical_wordlist
Combined word list (includes some compound words):
https://simple.wikipedia.org/wiki/Wikipedia:Basic_English_combined_wordlist
## Solution
Preview is available on https://inga-lovinde.github.io/static/overleaf-demo/
Unfortunately, 1 hour is not nearly enough time to work on this task.
The Wikipedia word list contains only base words and does not categorize them by part of speech.
However, according to the Wikipedia article on BASIC English, words can also be constructed from base words and some suffixes, depending on the part of speech.
For example, the Wikipedia list contains the words "ant" and "any" but not "ants".
In order to determine that "ants" is correct but "anys" is not, we would need to know that "ant" is a noun but "any" is not,
and that would require tagging all ~850 words in this list with their parts of speech, which is not something to be done in an hour.
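
A minimal sketch of why the part-of-speech tags matter, assuming a hypothetical hand-tagged sample of two words (the real list has no such tags, and `PartOfSpeech`, `taggedWords` and `isAcceptedWord` are names invented here). Only the plural `-s` suffix is handled; other suffixes would need similar rules.

```typescript
// Hypothetical tags: the Wikipedia list does not provide parts of speech.
type PartOfSpeech = "noun" | "other";

// A tiny hand-tagged sample, not the real ~850-word list.
const taggedWords = new Map<string, PartOfSpeech>([
  ["ant", "noun"],
  ["any", "other"],
]);

// Accepts a base word as-is, and "base + s" only when the base is tagged as a noun.
function isAcceptedWord(word: string): boolean {
  const lower = word.toLowerCase();
  if (taggedWords.has(lower)) {
    return true;
  }
  if (lower.endsWith("s")) {
    return taggedWords.get(lower.slice(0, -1)) === "noun";
  }
  return false;
}
```

With this sample, `isAcceptedWord("ants")` is `true` and `isAcceptedWord("anys")` is `false`, which is exactly the distinction that an untagged list cannot make.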
So instead of relying on the word lists from Wikipedia, I found and decided to use spellchecking data from the BASIC English website
(it's no longer online, but it's archived at https://web.archive.org/web/20230326131210/http://basic-english.org/down/download.html ).
The OpenOffice.org spellchecker files available there cannot be used in their raw form,
as they contain stems and suffixes separately, and I didn't have enough time to figure out what the rules for joining them are.
But the spellchecker installer for OOo 3.0 also contains a thesaurus, and the thesaurus seems to contain (almost?) all of the words in BASIC English.
Extracting the word list from the spellchecker installer and fixing bugs took enough time to bring me to the edge of the time budget.
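
The extraction step could look roughly like the sketch below. It assumes the thesaurus uses the usual MyThes-style `.dat` layout (first line is the encoding, then each entry is a `word|N` header followed by `N` lines of `(pos)|synonym|...`); that layout is an assumption and not something verified against the archived file. `extractHeadwords` is a hypothetical name, and only the headwords are collected, not the synonyms.

```typescript
// Assumed layout (MyThes-style .dat), not verified against the archived file:
//   line 1:      encoding, e.g. "UTF-8"
//   per entry:   "word|N", followed by N lines like "(noun)|synonym|synonym"
function extractHeadwords(thesaurus: string): string[] {
  const lines = thesaurus.split(/\r?\n/);
  const words: string[] = [];
  let index = 1; // skip the encoding line
  while (index < lines.length) {
    const line = lines[index] || "";
    const separator = line.indexOf("|");
    if (separator <= 0) {
      index += 1; // blank or unexpected line: skip it
      continue;
    }
    const meaningCount = parseInt(line.slice(separator + 1), 10);
    words.push(line.slice(0, separator));
    // Jump over the meaning lines that belong to this entry.
    index += 1 + (isNaN(meaningCount) ? 0 : meaningCount);
  }
  return words;
}
```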
Since I didn't have any time to think of the optimal way to find spelling errors,
I decided to rely on state machines that browsers have out-of-the-box: that is, on regexes.
It is not very difficult to create a regex that matches all words not present in the wordlist, but it still takes some time.
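
One possible shape for such a regex is a negative lookahead over the whole word list; this is a sketch and not necessarily what the repository generates. `escapeRegExp` and `buildMisspellingRegexSource` are names invented here, and words are assumed to consist of ASCII letters only.

```typescript
// Escapes characters that have a special meaning inside a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Returns the source of a regex that matches every run of letters
// which is NOT exactly one of the given words.
function buildMisspellingRegexSource(words: readonly string[]): string {
  const alternatives = words.map(escapeRegExp).join("|");
  // \b(?!(?:w1|w2|...)\b) rejects positions where a whole listed word starts;
  // [a-z]+\b then consumes the unlisted word.
  return `\\b(?!(?:${alternatives})\\b)[a-z]+\\b`;
}

const misspellingRegex = new RegExp(
  buildMisspellingRegexSource(["the", "ant", "is", "small"]),
  "gi", // global + case-insensitive
);
const unknownWords = "The ant is smol".match(misspellingRegex);
// unknownWords is ["smol"]
```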
Since this task is for a full-stack Node/React role, I decided to make this a React-compatible component in a Preact-based app.
Creating an empty Preact app took some more time.
The next question was how to bring the word list to the frontend.
I made a mistake here: instead of simply adding another build step to the npm `build` command, I decided to try to integrate it into the `vite` build process
(despite not having any prior experience with `vite` worth speaking of), and that mistake cost me too much time.
But ultimately I somehow (probably very incorrectly) got it to output a regex string into a separate file in `dist/assets`,
given the raw OOo thesaurus in `src`.
By that point I was significantly over time, so I just hacked together a very primitive textarea + validation function,
simply displaying the entered text in a separate area with incorrect words highlighted.
This is very user-unfriendly (a much friendlier way would be to use a `contenteditable` element instead of a `textarea`, and highlight incorrect words in place in real time),
but on the other hand it only took me half an hour to create a component that would fetch the regex from the server, compile it,
accept input from the user, validate it, and display errors.
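
The validation part of such a component can be sketched as a pure function that splits the text into plain and incorrect segments, which the component could then render (for example, wrapping incorrect segments in `<mark>`). `Segment` and `splitIntoSegments` are hypothetical names, and fetching the regex and rendering are left out.

```typescript
interface Segment {
  text: string;
  isError: boolean;
}

// Splits `text` into alternating correct / incorrect segments,
// where `misspelling` is a regex matching words missing from the word list.
function splitIntoSegments(text: string, misspelling: RegExp): Segment[] {
  // Work on a fresh global copy so the caller's regex state is never touched.
  const pattern = new RegExp(misspelling.source, misspelling.ignoreCase ? "gi" : "g");
  const segments: Segment[] = [];
  let cursor = 0;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    if (match[0].length === 0) {
      pattern.lastIndex += 1; // guard against zero-length matches looping forever
      continue;
    }
    if (match.index > cursor) {
      segments.push({ text: text.slice(cursor, match.index), isError: false });
    }
    segments.push({ text: match[0], isError: true });
    cursor = match.index + match[0].length;
  }
  if (cursor < text.length) {
    segments.push({ text: text.slice(cursor), isError: false });
  }
  return segments;
}
```

For the input `"The ant is smol"` and a regex that treats only "the", "ant" and "is" as known, this yields `"The ant is "` as a correct segment followed by `"smol"` as an incorrect one.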
Time spent (approximately):
* Creating an empty Preact-based project with my favorite tsconfig and eslint configs, by copying another test assignment and removing everything unneeded: **10 minutes**;
* Finding the BASIC English thesaurus, parsing it, and then fixing bugs (related to the repeated usage of `exec` on the same regex): **40 minutes**;
* Writing a regex generator that would, for a word list, return a regex matching all words missing from that word list: **10 minutes**;
* Trying to create a `vite` plugin that would compile the source thesaurus to the regex string, and do it in a socially acceptable way
(and still doing it in a very bad way, but at least it works): **1.5 hours**;
* Writing a React-compatible spellchecking frontend component: **30 minutes**;
* Writing this text, focusing on completeness only: **20 minutes**.
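
The `exec` bugs mentioned in the list above are not described in detail. A common pitfall of this kind, shown here as an assumption about what may have happened, is that a regex with the `g` flag keeps its `lastIndex` between `exec` calls, even when the next call is on a different string. `extractWords` is a hypothetical helper that resets that state first.

```typescript
const sharedPattern = /[a-z]+/g;

// Pitfall: the second call starts at the lastIndex left by the first one.
const firstMatch = sharedPattern.exec("ant bee"); // matches "ant", lastIndex is now 3
const secondMatch = sharedPattern.exec("cat"); // search starts at index 3: null

// Collects all matches, resetting the shared state before starting.
function extractWords(pattern: RegExp, text: string): string[] {
  if (!pattern.global) {
    throw new Error("extractWords expects a regex with the g flag");
  }
  pattern.lastIndex = 0; // start from the beginning regardless of earlier calls
  const words: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    words.push(match[0]);
  }
  return words;
}

const safeWords = extractWords(sharedPattern, "cat"); // ["cat"]
```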
Besides the design, I'm unhappy with the performance of the final solution.
While parsing the regex only takes around 1ms on my laptop,
validating the XKCD text takes 200ms (and it seems to scale linearly with the text length), which is too long.
Using regexes for everything was probably a bad idea after all,
and performance would be better if I only used regexes to extract all words from the input (`O(input_length)`),
and then looked up each of these words in a dictionary (`O(input_word_count)` on average with a hash-based `Set` or `Map`, or `O(input_word_count * log(dictionary_word_count))` with a tree-based one).
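
A minimal sketch of that alternative, assuming words are runs of ASCII letters and the dictionary is stored in lowercase; `findUnknownWords` is a name invented here. A `Set` is used instead of a `Map` because only membership matters.

```typescript
// Extracts words with a simple regex, then checks each one against a Set.
function findUnknownWords(text: string, dictionary: ReadonlySet<string>): string[] {
  // One linear pass over the input to extract lowercase words.
  const words: string[] = text.toLowerCase().match(/[a-z]+/g) || [];
  // One lookup per word, independent of the dictionary size on average.
  return words.filter((word) => !dictionary.has(word));
}

const sampleDictionary = new Set(["the", "ant", "is", "small"]);
const unknownInSample = findUnknownWords("The ant is smol", sampleDictionary);
// unknownInSample is ["smol"]
```

Unlike the regex approach, this returns the unknown words without their positions, so highlighting them in place would need the match indices as well.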
Another issue is that, probably because of the regex file's extension, it is served as `application/octet-stream` and not gzipped,
which means that validation only becomes available after all 211 KB are transferred, even though the file would probably be much smaller gzipped.