solution.md (probably still a draft)

12 months ago · 97d4d55381
parent e078010afe
commit 97d4d55381
1 changed files with 202 additions and 0 deletions
--- a/solution.md
+++ b/solution.md
@ -0,0 +1,202 @@
 # Search feature
 To address the elephant in the room: search is supposed to be "performed on the merged json object of movies", i.e. including the data from OMDB.
 For example, `?director=Frank Miller` should return movie with internal ID 3532674 (imdb ID tt0401792),
 even though our "database" does not have any information about Frank Miller having directed this movie,
 only OMDB has this data.
 And OMDB does not support searching by director name (the only kind of search it seems to support is by title, optionally with year,
 and it only returns a single entry at a time).
 So the only way to figure out which of the movies in the "database" are directed by Frank Miller is
 to fetch data for _every_ movie present in our "database" from OMDB, and then
 either do a full scan or build some kind of search index.
 **Were this a real-world task, I would say that, depending on what problem are we trying to solve,
 the solution proposed in README is most likely a wrong solution.
 And that enriching local data with OMDB data should be a responsibility of the service that populates the DB and not reads from it;
 that the right approach would be for DB to contain all the data used in search,
 not for API returning data from that DB to fetch all the entries from DB and then fetch all the corresponding entries from OMDB one-by-one
 (whether that happens on startup or on every search).**
 But as I understand, this is supposed to be a coding challenge, not a requirements analysis challenge.
 So let's get to the topic of implementing that potentially incorrect solution.
 Our "database" here only contains four movies, but README implies that this is just for (extreme) simplicity;
 an actual database would presumably contain thousands of titles, and even simple test database would probably contain a hundred.
 (If we could rely on the "database" only containing these four movies, then we wouldn't even need to call it a "database",
 we could just hardcode enriched data inside the application and use it in searches, without making any external requests,
 but I guess that's not what this test assignment is about.)
 So fetching data from OMDB on every search request is out of question; we must cache OMDB responses somehow,
 not as an optimization strategy but as a requirement.
 Additionally, since "Constraints" section mentions to not introduce any system dependencies, we need to cache it in-memory.
 Building an index on the cached data (and maintaining it) is clearly outside of the scope of this assignment,
 so we'll have to do a full scan on every search request, and at this point we might as well
 only do full scans in "database" as well.
 It is enough for DB interface to only have three methods:
 returning "our own data" for a specific movie by "our internal ID", by IMDB ID, and returning data for all movies.
 At this point we already made so many assumptions that it won't hurt to make another one, for simplicity of implementation:
 that data in local DB doesn't change while the service is running
 (it physically cannot, with "local DB" actually being JSON files inside the project directory),
 and that data in OMDB for the movies in the local DB doesn't change while the service is running.
 Otherwise that would open an entirely new can of worms regarding "cache" invalidation:
 e.g. what if IMDB ID of an existing movie changes between requests?
 So the final implementation idea is:
 1. Preload all entries from the local "database" on startup;
 2. For each of them, fetch the relevant data from OMDB;
 3. Create "merged movie objects", keep them in memory;
 4. For every search request, go through all objects, find matching ones, return them
    (with zero calls to OMDB API or to the local "database").
 # Technical decisions made
 ## "Additional tasks"
 ### AWS Lambda Function
 Considering everything that was said above regarding the search, I don't think this service
 (or any other reasonable implementation of the assignment, as reasonable as it might be),
 is suitable for deployment to AWS as Lambda.
 In order to have the search working, we should fetch data from OMDB for every movie in the local "database", one by one.
 Either we do it on every search request (which is unreasonable, and can easily get us above 1000 requests/day quota,
 even for the sample database with only four movies in it), or we have to store the fetched data in some kind of somewhat persistent state.
 Since the assignment said to not introduce any system dependencies, the only somewhat persistent state we can use in Lambda is in its memory.
 However, we should not rely on the way Lambda instances are cold started and shut down.
 And even if we would rely on that: if our Lambda will get too many requests at the same time, AWS will spin up several instances,
 and each one will have to fetch all data from OMDB, which is not really a good idea.
 On the other hand, if our Lambda will only get infrequent requests, AWS will shut it down and cold-start it again on the next request,
 meaning that we would have to refetch everything from OMDB again.
 Reasons why this service cannot reasonably / should not be packaged as a serverless function are the same as
 the reasons why the assignment as described in README seems to be a really wrong solution to an (unknown) problem.
 ### Caching
 Preloading everything into local memory on application startup can reasonably be called "caching".
 In this implementation TTL is infinite, because of the assumptions described in "Search feature" section.
 ## Efficiency, application scaling
 In this implementation, search and get request handlers just do a full scan of preloaded data.
 This is not very efficient (the time scales as O(number of movies)), but good enough for low enough numbers of movies
 (e.g. even in hundreds).
 And if local database contains over a thousand movies, we have a bigger problem anyway because we won't be able to
 fetch all the needed data from OMDB before hitting OMDB API usage limits.
 I didn't do performance measurements because I didn't have enough time for that.
 But at least this implementation can be scaled horizontally, improving performance (in terms of numbers of requests handled per second) linearly,
 at the expense of extra requests to OMDB API from each instance.
 ## framework
 I decided to implement this API with Nest.js, because I already used it on test assignments before,
 and it's good and convenient enough for API prototyping.
 ## OMDB format
 Unfortunately, I couldn't find any clear documentation on OMDB response format.
 Instead, I just picked a few actual OMDB responses (mostly the ones described in [`omdb/converters.spec.ts`](src/integration/movies/omdb/converters.spec.ts)),
 and adapted my type definitions and converters to them.
 As a result, I assume that all responses from OMDB satisfy
 `OmdbResponse` type defined in [`omdb/types.ts`](src/integration/movies/omdb/types.ts) and do not contain any extra fields;
 and additionally that `'N/A'` values are only allowed in string fields mentioned in [`omdb/converters.ts`](src/integration/movies/omdb/converters.ts)
 in `createOptionalField` context (i.e. `Awards`, `BoxOffice`, `Rated`, `DVD`, `Episode`, `Season`, `seriesID`, `totalSeasons`, and `Website`).
 Additionally, I couldn't find any example of OMDB entry with `Production` value different from `'N/A'`,
 so I have no idea what does this field mean, and I decided to (explicitly) ignore it.
 ## Merging and mapping
 I decided to have my own `MovieData` structure (defined in [`types.ts`](src/types.ts)),
 and to convert data from both OMDB and DB format to this structure.
 Instead of using any pre-existing libraries, I decided to do it manually (in two `converters.ts` files),
 because for this small one-off assignment it was easier to do it by hand than integrate another library,
 and because I implemented the conversion in a type-safe way with compile-time checks that I didn't forget about anything,
 and with runtime checks that there were no extra fields in the input I should have handled.
 Thi has a drawback that adding any new field to the source data will result in error;
 on the other hand this means that adding any new field to the source data will require one to explicitly describe
 how it should be mapped and merged.
 The assignment says that OMDB `Title` overrides our own `title`, and that OMDB `Plot` overrides our own `description`.
 But from the sample data, it seems that OMDB always has both in English, and our "database" always has both in German,
 and that these fields are always present and non-empty, so I thought that it might make sense to preserve both
 (with `title` and `description` in merged result containing the English data from OMDB,
 and `localTitle` and `localDescription` containing the German data from our "database").
 Besides title, description and runtime, there is another field present both in OMDB response and in our "database": language.
 There was no clear way to reconcile the data from OMDB with the data from sample JSONs:
 sample JSONs only have one language per movie, as a two-letter code,
 while IMDB data contains a list of languages, not all of them even having two-letter codes
 (because there are more languages in the world than two letter-codes):
 for example, tt7394674 has `Language: 'English, Micmac'` in OMDB response;
 a blockbuster tt11866324 (shot mostly in Komanche) has `Language: 'English, North American Indian, French'`.
 So I decided to ignore the language field from OMDB, and only use the language field from the internal database.
 After all these adjustments, the merging algorithm was as simple as doing `const mergedData = { ...normalizedOmdbData, ...normalizedInternalData }`,
 with a single exception of having to manually merge `ratings` fields (just by concatenating the corresponding arrays).
 One more thing to note is that OMDB is somewhat inconsistent in its naming;
 it has `Director`, `Writer`, `Genre` and `Country` fields (singular) but `Actors` (plural); while our "database" has `studios` (plural),
 even though all six fields can (and typically do) contain several values, and are ultimately represented as `string[]` in the merged object.
 I decided to use plural names for all six.
 ## API
 The original assignment didn't mention anything about how should one handle requests for movies
 that are present in our "database" but not OMDB, or vice versa.
 On practice, all four entries present in sample JSONs have corresponding entries in OMDB.
 For simplicity (and in order to guarantee that `/api/movied/:id` endpoint always returns full merged `MovieData` object),
 I decided to return 404 when at least one of the sources does not have any data.
 Changing the implementation so that API returns partial data if only one of the sources knows about the requested movie would be trivial.
 The assignment seems to be a bit contradictory regarding the way how search should work;
 it implies that search field name is the same as merged object field name, but then it provides an example as `?director=Frank Miller`
 (as I mentioned above, I decided to go with plural forms in the merged object),
 and also the original OMDB object has `Actors` and our "database" has `studios`.
 It feels weird to have `actor: ['Mark Hamill', 'Harrison Ford', 'Carrie Fisher']` (in singular),
 and it also feels weird to have `?actors=Mark Hamill` in the search query,
 so ideally API should probably have plural forms in the merged object but singular forms in search query.
 But also this is very much outside of the scope / timebox for this assignment, so I decided to go with plural forms both in merged object and in search.
 Another thing regarding search is that it seems that for most fields, it does not make any sense to search by them
 (for example, who would search (and why) by `?awards=Won 6 Oscars. 65 wins & 31 nominations total` or by `?boxOffice=$460,998,507` or by full plot description?)
 So I limited the search feature to a subset of fields, as described in `SearchFilters` type in [`types.ts`](src/types.ts)
 and in `SearchDto` in [`movies.controller.ts`](src/movies.controller.ts).
 ## Testing
 I did not have any time to come up with a good test cases, so most tests work on samples of data from OMDB and on provided sample JSONs.
 ## Git
 The assignment has two contradicting messages:
 that I should use git to document my work ("We will look at the git history of your pull request to determine the way you approached this"),
 without squashing or bundling;
 but also that you will evaluate the proper use of git.
 I decided to go with the first message, and use git to document work history, not to showcase git use that would be proper for PR in a professional environment.
 # How to use
 To start: `npm run start`; it is available on `localhost:3000` by default.
 To run tests: `npm test`.
 To run tests that connect to actual OMDB (these are not included in `npm test`): `npm run test:e2e`.
 To lint: `npm run lint`.