Inga 🏳🌈
## Assignment
Brief: Make a spell checker for BASIC English
Time allowed: 1 hour (outside of the interview time)
Background: British American Scientific International Commercial (BASIC) English is a constructed language. It has 850 English words, carefully chosen to make the language easy to learn but still powerful enough to communicate everyday ideas and actions. The rules of usage are identical to full English, so the speaker communicates in perfectly good, but simple, English. It is an interesting exercise to try to write out a complicated idea in simple English (e.g. https://xkcd.com/1133), and a spell checker that tells you when you are using BASIC English would be a useful tool.
Objectives: The main objective is to make something that works in the time you have (~1h pre-interview). In the interview, we’ll then discuss your approach, in terms of both UX and implementation, and ask you to extend your prototype in some way. There are no right or wrong answers. Here are some examples of discussion questions:
- What should the user experience be like? What are people expecting or used to? How did you scope your solution given the time constraint?
- You don’t have to worry too much about efficiency when there are only a few thousand words, but efficiency is more important for real languages. How would you make it faster and/or require less memory?
Resources:

- More about BE: https://simple.wikipedia.org/wiki/Basic_English
- The basic word list: https://simple.wikipedia.org/wiki/Wikipedia:BASIC_English_alphabetical_wordlist
- Combined word list (includes some compound words): https://simple.wikipedia.org/wiki/Wikipedia:Basic_English_combined_wordlist
## Solution
A preview is available at https://inga-lovinde.github.io/static/overleaf-demo/
Unfortunately, 1 hour is not nearly enough time to work on this task.
The Wikipedia wordlist only contains base words, and does not categorize them by part of speech. However, according to the Wikipedia article on BASIC English, words can also be constructed from base words and certain suffixes, depending on the part of speech. For example, the wordlist contains the words "ant" and "any" but not "ants". To determine that "ants" is correct but "anys" is not, we would need to know that "ant" is a noun and "any" is not, and that would require tagging all ~850 words in the list with their parts of speech, which is not something to be done in an hour.
So instead of relying on wordlists from Wikipedia, I found and decided to use spellchecking data from the BASIC English website (it's no longer online, but it's archived at https://web.archive.org/web/20230326131210/http://basic-english.org/down/download.html ).
The OpenOffice.org spellchecker files available there are not usable in their raw form, as they contain stems and suffixes separately, and I didn't have enough time to figure out the rules for joining them together. But the spellchecker installer for OOo 3.0 also contains a thesaurus, and that thesaurus seems to contain (almost?) all of the BASIC English words.
Extracting the wordlist from the thesaurus and fixing bugs took long enough to bring me to the edge of the time budget.
Since I didn't have any time to think of the optimal way to find spelling errors, I decided to rely on the state machines that browsers provide out of the box: that is, on regexes. It is not very difficult to create a regex that matches all words not present in the wordlist, but it still takes some time.
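A regex of that shape can be sketched as follows (my reconstruction for illustration, not necessarily the repo's actual generator): a negative lookahead over the alternation of all known words, anchored at a word boundary.

```typescript
// Hypothetical reconstruction: build one regex that matches any word NOT
// present in the word list. Assumes the words contain no regex
// metacharacters (true for the BASIC English list).
const buildMissingWordRegex = (words: string[]): RegExp =>
  // \b(?!(?:ant|any|...)\b)[a-z]+ : a word boundary at which none of the
  // listed words matches in full, followed by a run of letters.
  new RegExp(`\\b(?!(?:${words.join("|")})\\b)[a-z]+`, "gi");

const re = buildMissingWordRegex(["ant", "any", "go"]);
console.log("ants go anywhere".match(re)); // → [ "ants", "anywhere" ]
```

Note that the lookahead must re-check the word boundary after the candidate word, so that "ants" is not accepted just because it starts with "ant".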
Since this task is for a fullstack Node/React role, I decided to make this a react-compatible component in a preact-based app.
Creating an empty preact app took some more time.
The next question was how to bring the word list to the frontend.
I made a mistake here: instead of simply adding another build step to the `npm build` command, I decided to try to integrate into the `vite` build process (despite not having any prior experience with `vite` worth speaking of), and that mistake cost me too much time. But ultimately I somehow (probably very incorrectly) got it to output a regex string into a separate file in `dist/assets`, given the raw OOo thesaurus in `src`.
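For illustration, a minimal plugin along these lines might look like the sketch below. The plugin name, the output file name, and the `compileWordlist` helper are all assumptions of mine, not the repo's actual code; the testable part is the pure word-list-to-regex step.

```typescript
import { readFile } from "node:fs/promises";

// Pure step: turn a newline-separated word list into the regex source string.
const compileWordlist = (raw: string): string => {
  const words = raw.split(/\r?\n/).filter(Boolean);
  return `\\b(?!(?:${words.join("|")})\\b)[a-z]+`;
};

// Hypothetical plugin wiring: Rollup's emitFile() is available inside Vite
// build hooks, so the compiled regex can be emitted as a regular asset.
const wordlistPlugin = (sourcePath: string) => ({
  name: "wordlist-to-regex",
  async buildStart(this: any) {
    this.emitFile({
      type: "asset",
      fileName: "assets/wordlist.regex", // assumed output path
      source: compileWordlist(await readFile(sourcePath, "utf8")),
    });
  },
});
```

Keeping the thesaurus-to-regex transformation pure, with the plugin as a thin wrapper, also makes it trivial to move to a plain `npm build` step later.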
By that point I was significantly over time, so I just hacked together a very primitive textarea plus validation function, simply displaying the entered text in a separate area with incorrect words highlighted. Very user-unfriendly (a much friendlier way would be to use a `contenteditable` element instead of a `textarea`, and highlight incorrect words in place in real time), but on the other hand it only took me half an hour to create a component that would fetch the regex from the server, compile it, accept input from the user, validate it, and display errors.
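The validation step can be sketched roughly like this (an assumed shape, not the actual component): split the input into plain and misspelled segments using the compiled "missing word" regex, so the render function can wrap the misspelled segments in a highlighted element.

```typescript
type Segment = { text: string; misspelled: boolean };

// Split input into alternating plain/misspelled segments, preserving all
// characters, so the segments can be rendered back in order.
const segment = (input: string, missingWordRe: RegExp): Segment[] => {
  const out: Segment[] = [];
  let last = 0;
  for (const m of input.matchAll(missingWordRe)) {
    if (m.index! > last) {
      out.push({ text: input.slice(last, m.index), misspelled: false });
    }
    out.push({ text: m[0], misspelled: true });
    last = m.index! + m[0].length;
  }
  if (last < input.length) {
    out.push({ text: input.slice(last), misspelled: false });
  }
  return out;
};

// With a wordlist of ["ant", "go"], "ants go" splits into a misspelled
// "ants" segment and a plain " go" segment.
const re = new RegExp("\\b(?!(?:ant|go)\\b)[a-z]+", "gi");
console.log(segment("ants go", re));
```

The regex passed in must have the `g` flag, since `matchAll` requires a global regex.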
Time spent (approximately):

- Creating an empty preact-based project with my favorite tsconfig and eslint configs, by copying another test assignment and removing everything unneeded: 10 minutes;
- Finding the BASIC English thesaurus, parsing it, and then fixing bugs (related to the repeated usage of `exec` on the same regex): 40 minutes;
- Writing a regex generator that would, for a word list, return a regex matching all words missing from that word list: 10 minutes;
- Trying to create a `vite` plugin that would compile the source thesaurus to the regex string, and do it in a socially acceptable way (and still doing it in a very bad way, but at least it works): 1.5 hours;
- Writing a react-compatible spellchecking frontend component: 30 minutes;
- Writing this text, focusing on completeness only: 20 minutes.
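The repeated-`exec` bug mentioned in the breakdown above is a classic pitfall worth showing in isolation: a regex created with the `g` flag is stateful, and each `exec()` call resumes from its `lastIndex`, even when scanning a completely different string.

```typescript
const re = /[a-z]+/g;
console.log(re.exec("ants")?.[0]); // "ants"; re.lastIndex is now 4

// A second scan of another string silently resumes at index 4:
console.log(re.exec("anything")?.[0]); // "hing", not "anything"

// Fix: reset lastIndex (or construct a fresh regex) before each scan.
re.lastIndex = 0;
console.log(re.exec("anything")?.[0]); // "anything"
```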
Besides design, I'm unhappy with the performance of the final solution.
While parsing the regex only takes around 1 ms on my laptop, validating the XKCD text takes 200 ms (and it seems to scale linearly with the text length), which is too long.
Probably using regexes for everything was a bad idea after all, and performance would be better if I only used regexes to extract all words from the input (`O(input_length)`), and then looked each of these words up in a dictionary `Map` (`O(input_word_count * log(dictionary_word_count))`, presumably).
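That alternative can be sketched like so; note that a JavaScript `Map` or `Set` is hash-based, so the per-word lookup is effectively O(1) rather than the O(log n) estimated above.

```typescript
// One linear regex pass to extract words, then a constant-time dictionary
// lookup per word (case-insensitive via toLowerCase).
const findMisspelled = (input: string, dictionary: Set<string>): string[] =>
  Array.from(input.matchAll(/[a-z]+/gi), (m) => m[0])
    .filter((word) => !dictionary.has(word.toLowerCase()));

const dict = new Set(["ant", "go"]);
console.log(findMisspelled("Ants go", dict)); // → [ "Ants" ]
```

This keeps the word-extraction regex trivial (and independent of the dictionary size), so it no longer needs to be generated at build time at all; the dictionary could ship as a plain word list.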
Another issue is that, probably because of the regex file's extension, it is transferred as `application/octet-stream` and is not gzipped, which means that validation only becomes available after all 211 KB are transferred, even though the file would probably be much smaller gzipped.