merged solution.md into readme

main
Inga 🏳‍🌈 4 months ago
parent 6e38a9dec9
commit 52e25916de
  1. 230
      README.md
  2. 218
      solution.md

@ -1,6 +1,11 @@
(This is the original assignment test; information about solution is in [solution.md](solution.md))
---
gitea: none
include_toc: true
---
# Movie Metadata Search
# Assignment
## Movie Metadata Search
## Preface
@ -96,3 +101,224 @@ You will be evaluated by a number of criteria, among others:
If you have any further questions, please open an issue in this GitHub repository and we'll try to give you an answer as quickly as possible.
### Good luck
# Solution
## Search feature
To address the elephant in the room: search is supposed to be "performed on the merged json object of movies", i.e. including the data from OMDB.
For example, `?director=Frank Miller` should return the movie with internal ID 3532674 (IMDB ID tt0401792),
even though our "database" does not have any information about Frank Miller having directed this movie,
only OMDB has this data.
And OMDB does not support searching by director name (the only kind of search it seems to support is by title, optionally with year,
and it only returns a single entry at a time).
So the only way to figure out which of the movies in the "database" are directed by Frank Miller is
to fetch data for _every_ movie present in our "database" from OMDB, and then
either do a full scan or build some kind of search index.
**Were this a real-world task, I would say that, depending on what problem we are trying to solve,
the solution proposed in the README is most likely the wrong one.
Enriching local data with OMDB data should be the responsibility of the service that populates the DB, not of the one that reads from it;
the right approach would be for the DB to contain all the data used in search,
rather than for the API that serves data from that DB to fetch all entries from the DB and then fetch every corresponding entry from OMDB one by one
(regardless of whether that happens on startup or on every search).**
But as I understand it, this is supposed to be a coding challenge, not a requirements-analysis challenge.
So let's get to implementing that potentially incorrect solution.
Our "database" here only contains four movies, but README implies that this is just for (extreme) simplicity;
an actual database would presumably contain thousands of titles, and even simple test database would probably contain a hundred.
(If we could rely on the "database" only containing these four movies, then we wouldn't even need to call it a "database",
we could just hardcode enriched data in the application and use it in searches, without making any external requests,
but I guess that's not what this test assignment is about.)
So fetching data from OMDB on every search request is out of the question; we must cache OMDB responses somehow,
not as an optimization strategy but as a requirement.
Additionally, since the "Constraints" section says not to introduce any system dependencies, we need to cache them in memory.
Building an index on the cached data (and maintaining it) is clearly outside the scope of this assignment,
so we'll have to do a full scan on every search request, and at this point we might as well
do full scans over the "database" as well.
This means it is enough for the DB interface to have only three methods:
returning "our own data" for a specific movie by "our internal ID", the same by IMDB ID, and returning data for all movies.
At this point we have already made so many assumptions that it won't hurt to make a couple more, for simplicity of implementation:
1. That data in local DB doesn't change while the service is running
(it physically cannot, with "local DB" actually being JSON files inside the project directory).
Otherwise that would open an entirely new can of worms regarding "cache" invalidation:
e.g. what if IMDB ID of an existing movie changes between requests?
2. That data in OMDB for the movies in the local DB doesn't change while the service is running.
Otherwise we would have to constantly refetch all data for all movies just to update the searchable fields,
so that search provides the correct and relevant results.
So the final implementation idea is:
1. Preload all entries from the local "database" on startup;
2. For each of them, fetch the relevant data from OMDB on startup;
3. Create "merged movie objects", keep them in memory;
4. For every search request: go through all objects, find matching ones, return them
(with zero calls to OMDB API or to the local "database" in the process).
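A minimal sketch of that startup flow (the type and helper names here are placeholders, not this repository's actual API):
```ts
// Sketch of the preload-on-startup idea; type and helper names are assumptions.
type LocalMovie = { id: number; imdbId: string };
type MergedMovie = Record<string, unknown>;

async function preloadMovies(
  getAllLocalMovies: () => Promise<LocalMovie[]>,
  fetchFromOmdb: (imdbId: string) => Promise<Record<string, unknown>>,
  merge: (local: LocalMovie, omdb: Record<string, unknown>) => MergedMovie,
): Promise<MergedMovie[]> {
  const localMovies = await getAllLocalMovies();    // 1. preload all entries from the local "database"
  const merged: MergedMovie[] = [];
  for (const local of localMovies) {
    const omdb = await fetchFromOmdb(local.imdbId); // 2. one OMDB request per movie, on startup only
    merged.push(merge(local, omdb));                // 3. build the merged object, keep it in memory
  }
  return merged;                                    // 4. search requests scan this array, with zero external calls
}
```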
## Technical decisions made
### "Additional tasks"
#### AWS Lambda Function
Considering everything said above regarding search, I don't think this service
(or any other implementation of the assignment, however reasonable it might be)
is suitable for serverless deployment.
In order for search to work, we have to fetch data from OMDB for every movie in the local "database", one by one.
Either we do that on every search request (which is unreasonable, and can easily push us above the 1000 requests/day quota,
even for the sample database with only four movies in it), or we have to store the fetched data in some kind of somewhat persistent state.
Since the assignment says not to introduce any system dependencies, the only somewhat persistent state we can use in Lambda is its memory.
However, we should not rely on the way Lambda instances are cold-started and shut down; that is basically what "serverless" means.
And even if we did rely on that: if our Lambda gets too many requests at the same time, AWS will spin up several instances,
and each one will have to fetch all data from OMDB, which is not a good idea.
On the other hand, if our Lambda only gets infrequent requests, AWS will shut it down and cold-start it again on the next request,
meaning that we would have to refetch everything from OMDB.
The reasons why this service cannot reasonably (and should not) be packaged as a serverless function are the same as
the reasons why the assignment as described in the README seems to be the wrong solution to an (unknown) problem.
#### Caching
Preloading everything into local memory on application startup can reasonably be called "caching".
In this implementation the TTL is infinite, because of the assumptions described near the end of the "Search feature" section.
### Efficiency, application scaling
In this implementation, the search and get request handlers just do a full scan of the preloaded data.
This is not very efficient (the time scales as O(number of movies)), but it is good enough for small enough numbers of movies
(definitely below a thousand); latency on a small enough local database should be low.
And if the local database contains over a thousand movies, we have a bigger problem anyway, because we won't be able to
fetch all the needed data from OMDB before hitting the OMDB API usage limits.
I didn't do performance measurements because I didn't have enough time for that.
But at least this implementation can be scaled horizontally, improving throughput (requests handled per second)
linearly with the number of instances, at the expense of extra OMDB API requests from each instance.
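For illustration, the full scan boils down to something like this (the field names and matching rules are assumptions, not the exact `SearchFilters` semantics):
```ts
// A sketch of the full scan, not the repository's actual search code.
type SearchableMovie = { title: string; directors: string[] };
type Filters = { title?: string; directors?: string };

function searchMovies(movies: SearchableMovie[], filters: Filters): SearchableMovie[] {
  // O(number of movies): every request walks the whole preloaded array.
  return movies.filter((movie) => {
    if (filters.title !== undefined && !movie.title.toLowerCase().includes(filters.title.toLowerCase())) {
      return false;
    }
    if (filters.directors !== undefined) {
      const wanted = filters.directors.toLowerCase();
      if (!movie.directors.some((director) => director.toLowerCase() === wanted)) {
        return false;
      }
    }
    return true;
  });
}
```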
### Framework
I decided to implement this API with Nest.js, because I have already used it on test assignments before,
and it is good and convenient enough for API prototyping.
I'm also using TypeScript with "strictest" config, along with `@typescript-eslint` with "strict-type-checked" config.
### OMDB format
Unfortunately, I couldn't find any clear documentation on the OMDB response format.
Instead, I just picked a few actual OMDB responses (mostly the ones listed in [`omdb/converters.spec.ts`](src/integration/movies/omdb/converters.spec.ts)),
and adapted my type definitions and converters to them.
As a result, I assume that all responses from OMDB satisfy
`OmdbResponse` type defined in [`omdb/types.ts`](src/integration/movies/omdb/types.ts) and do not contain any extra fields;
and additionally that `'N/A'` values are only allowed in string fields mentioned in [`omdb/converters.ts`](src/integration/movies/omdb/converters.ts)
in `createOptionalField` context (i.e. `Awards`, `BoxOffice`, `Rated`, `DVD`, `Episode`, `Season`, `seriesID`, `totalSeasons`, and `Website`).
Additionally, I couldn't find any example of an OMDB entry with a `Production` value different from `'N/A'`,
so I have no idea what this field means, and I decided to (explicitly) ignore it.
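The gist of that `'N/A'` handling is roughly the following (a simplified sketch, not the actual `createOptionalField` implementation):
```ts
// Rough idea of the 'N/A' handling; the real createOptionalField in omdb/converters.ts may look different.
function toOptional(value: string): string | undefined {
  // OMDB uses the literal string 'N/A' for missing values; treat it as "no value".
  return value === 'N/A' ? undefined : value;
}
```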
### Merging and mapping
I decided to have my own `MovieData` structure (defined in [`types.ts`](src/types.ts)),
and to convert data from both OMDB and DB format to this structure.
Instead of using any pre-existing libraries, I decided to do it manually (in two `converters.ts` files),
because for this small one-off assignment it was easier to do it by hand than to integrate another library,
and because I implemented the conversion in a type-safe way, with compile-time checks that I didn't forget anything
and runtime checks that there were no extra fields in the input that I should have handled.
This has the drawback that adding any new field to the source data will result in an error;
on the other hand, it means that adding any new field to the source data will require one to explicitly describe
how it should be mapped and merged.
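The runtime part of that idea boils down to something like the sketch below (illustrative only, not the actual converter code); the compile-time part can be expressed by typing the field mapping against `keyof OmdbResponse`, so that a forgotten field fails to compile.
```ts
// Illustration of the "no unexpected fields" check; not the actual converter from this repository.
const KNOWN_OMDB_KEYS = ['Title', 'Year', 'Director', 'Actors'] as const; // abridged list for the example

function assertNoExtraFields(raw: Record<string, unknown>): void {
  const extra = Object.keys(raw).filter(
    (key) => !(KNOWN_OMDB_KEYS as readonly string[]).includes(key),
  );
  if (extra.length > 0) {
    // A new field in OMDB responses should fail loudly instead of being silently dropped.
    throw new Error(`Unexpected OMDB fields: ${extra.join(', ')}`);
  }
}
```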
The assignment says that OMDB `Title` overrides our own `title`, and that OMDB `Plot` overrides our own `description`.
But from the sample data, it seems that OMDB always has both in English, and our "database" always has both in German,
and that these fields are always present and non-empty, so I thought that it might make sense to preserve both
(with `title` and `description` in merged result containing the English data from OMDB,
and `localTitle` and `localDescription` containing the German data from our "database").
Besides title, description, runtime and ratings, there is only one other field present both in the OMDB response and in our "database": language.
There was no clear way to reconcile the data from OMDB with the data from the sample JSONs:
the sample JSONs only have one language per movie, as a two-letter code,
while the OMDB data contains a list of languages, not all of which even have two-letter codes
(because there are more languages in the world than two-letter codes):
for example, tt7394674 has `Language: 'English, Micmac'` in its OMDB response,
and the blockbuster tt11866324 (shot mostly in Comanche) has `Language: 'English, North American Indian, French'`.
So I decided to ignore the language field from OMDB and only use the language field from the internal database.
After all these adjustments, the merging algorithm was as simple as doing `const mergedData = { ...normalizedOmdbData, ...normalizedInternalData }`,
with the single exception of having to merge the `ratings` fields manually (just by concatenating the corresponding arrays).
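In other words, roughly (a simplified sketch with placeholder types):
```ts
// Simplified sketch of the merge; the real normalized shapes have concrete fields instead of an index signature.
type Rating = { source: string; value: string };
type Normalized = { ratings: Rating[]; [key: string]: unknown };

function mergeMovie(normalizedOmdbData: Normalized, normalizedInternalData: Normalized) {
  return {
    ...normalizedOmdbData,
    ...normalizedInternalData,
    // After normalization the two sides share (almost) no keys, so plain spreading works;
    // ratings is the one field present on both sides and has to be merged by hand.
    ratings: [...normalizedOmdbData.ratings, ...normalizedInternalData.ratings],
  };
}
```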
One more thing to note is that OMDB is somewhat inconsistent in its naming:
it has `Director`, `Writer`, `Genre` and `Country` fields (singular) but `Actors` (plural), while our "database" has `studios` (plural),
even though all six fields can (and typically do) contain several values and are ultimately represented as `string[]` in the merged object.
I decided to use plural names for all six.
### API
The original assignment didn't mention anything about how one should handle requests for movies
that are present in our "database" but not in OMDB, or vice versa.
In practice, all four entries present in the sample JSONs have corresponding entries in OMDB.
For simplicity (and in order to guarantee that the `/api/movies/:id` endpoint always returns a full merged `MovieData` object),
I decided to return 404 when at least one of the sources does not have any data.
Changing the implementation so that the API returns partial data when only one of the sources knows about the requested movie would be trivial.
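Roughly (an illustrative helper, not the actual controller code):
```ts
import { NotFoundException } from '@nestjs/common';

// Sketch of the "404 if either source is missing" rule; the generic helper is an illustration only.
function toFullMovie<L, O, M>(
  local: L | undefined,
  omdb: O | undefined,
  merge: (local: L, omdb: O) => M,
): M {
  if (local === undefined || omdb === undefined) {
    // Either source missing means /api/movies/:id cannot return a complete merged object,
    // so the movie is treated as not found.
    throw new NotFoundException();
  }
  return merge(local, omdb);
}
```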
The assignment seems to be a bit contradictory regarding how search should work:
it implies that the search field name is the same as the merged object field name, but then it gives `?director=Frank Miller` as an example
(as I mentioned above, I decided to go with plural forms in the merged object);
also, the original OMDB object has `Actors` while our "database" has `studios`.
It feels weird to have `actor: ['Mark Hamill', 'Harrison Ford', 'Carrie Fisher']` (in singular),
and it also feels weird to have `?actors=Mark Hamill` in the search query,
so ideally the API should probably have plural forms in the merged object but singular forms in the search query.
But this is very much outside of the scope and timebox of this assignment, so I decided to go with plural forms both in the merged object and in search.
Another thing regarding search: for most fields it does not make any sense to search by them
(for example, who would search, and why, by `?awards=Won 6 Oscars. 65 wins & 31 nominations total`, by `?boxOffice=$460,998,507`, or by the full plot description?).
So I limited the search feature to a subset of fields, as described by the `SearchFilters` type in [`types.ts`](src/types.ts)
and by `SearchDto` in [`movies.controller.ts`](src/movies.controller.ts).
And finally, it does not make much sense to search with an empty query (we probably don't want to list our entire database to anybody
who did not enter any search filters), so I return a 400 Bad Request error for requests to the search endpoint with an empty query.
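As an illustration of both decisions (a sketch only; the real `SearchDto` may expose a different set of fields and validate them differently):
```ts
import { BadRequestException } from '@nestjs/common';
import { IsOptional, IsString } from 'class-validator';

// Illustrative only: not the actual SearchDto from movies.controller.ts.
class SearchQuerySketch {
  @IsOptional() @IsString() title?: string;
  @IsOptional() @IsString() directors?: string;
  @IsOptional() @IsString() actors?: string;
}

function assertNonEmptyQuery(query: SearchQuerySketch): void {
  const hasAnyFilter = Object.values(query).some((value) => value !== undefined);
  if (!hasAnyFilter) {
    // An empty query would effectively dump the whole database, so reject it.
    throw new BadRequestException('At least one search filter must be provided');
  }
}
```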
### Testing
The code is mostly covered by tests (according to `npm run test:cov`, plus the e2e tests).
I did not have time to come up with good test cases, so most tests work on samples of data from OMDB and on the provided sample JSONs.
Regular tests (run with `npm test`) do not have any external dependencies.
Additionally, there are tests that connect to the remote OMDB API; these are run with `npm run test:e2e`.
### Git
The assignment sends two contradicting messages:
that I should use git to document my work ("_We will look at the git history of your pull request to determine the way you approached this_"),
without squashing or bundling;
but also that proper use of git will be evaluated.
I decided to go with the first message and use git to document the work history, not to showcase the kind of git usage that would be proper for PRs in a professional environment.
### Configuration
The OMDB API token is passed to the app and the e2e tests as an environment variable (hardcoded in package.json, but it can easily be extracted from there).
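In other words, the app only needs something along these lines (the variable name here is an assumption, not necessarily the one used in package.json):
```ts
// Minimal sketch of reading the token; the actual environment variable name may differ.
const omdbApiKey = process.env['OMDB_API_KEY'];
if (omdbApiKey === undefined || omdbApiKey === '') {
  throw new Error('OMDB API key is not configured');
}
```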
## How to use
To start the service: `npm run start`; it is then available on `localhost:3000` by default.
To run tests: `npm test`, `npm run test:e2e` (see "Testing" section for more details).
To lint: `npm run lint` (see "Framework" section for more details).
