---
gitea: none
include_toc: true
---

# Assignment

## Movie Metadata Search

## Preface

[…]

You will be evaluated by a number of criteria, among others:

[…]

If you have any further questions, please open an issue in this GitHub repository and we'll try to give you an answer as quickly as possible.

### Good luck

# Solution

## Search feature

To address the elephant in the room: search is supposed to be "performed on the merged json object of movies", i.e. including the data from OMDB.

For example, `?director=Frank Miller` should return the movie with internal ID 3532674 (IMDB ID tt0401792),
even though our "database" does not have any information about Frank Miller having directed this movie;
only OMDB has this data.

And OMDB does not support searching by director name (the only kind of search it seems to support is by title, optionally with a year,
and it only returns a single entry at a time).

So the only way to figure out which of the movies in the "database" are directed by Frank Miller is
to fetch data for _every_ movie present in our "database" from OMDB, and then
either do a full scan or build some kind of search index.

**Were this a real-world task, I would say that, depending on what problem we are trying to solve,
the solution proposed in the README is most likely the wrong one.
Enriching local data with OMDB data should be the responsibility of the service that populates the DB, not of the one that reads from it;
the right approach would be for the DB to contain all the data used in search,
rather than for the API serving data from that DB to fetch all the entries from the DB and then fetch all the corresponding entries from OMDB one by one
(regardless of whether that happens on startup or on every search).**

But as I understand it, this is supposed to be a coding challenge, not a requirements-analysis challenge,
so let's get to implementing that potentially incorrect solution.

Our "database" here contains only four movies, but the README implies that this is just for (extreme) simplicity;
an actual database would presumably contain thousands of titles, and even a simple test database would probably contain a hundred.
(If we could rely on the "database" containing only these four movies, then we wouldn't even need to call it a "database":
we could just hardcode the enriched data in the application and use it in searches, without making any external requests.
But I guess that's not what this test assignment is about.)

So fetching data from OMDB on every search request is out of the question; we must cache OMDB responses somehow,
not as an optimization strategy but as a requirement.

Additionally, since the "Constraints" section says not to introduce any system dependencies, we need to cache the responses in memory.
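Concretely, the in-memory "cache" can be as simple as a map from our internal ID to the merged movie object, built once on startup. A minimal sketch of that idea (the names here are illustrative, not the actual code from this repository):

```typescript
// Illustrative sketch only; the real types and services in this repository are named differently.
interface MergedMovie {
  id: string; // our internal ID
  imdbId: string;
  title: string;
  directors: string[];
  // ...the rest of the merged fields
}

// Built once on startup from the local "database" plus OMDB; never invalidated
// (infinite TTL), per the assumptions listed below.
const moviesById = new Map<string, MergedMovie>();
```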
Building an index on the cached data (and maintaining it) is clearly outside the scope of this assignment,
so we'll have to do a full scan on every search request, and at that point we might as well
do only full scans in the "database" as well.
This means that it is enough for the DB interface to have only three methods:
returning "our own data" for a specific movie by our internal ID, doing the same by IMDB ID, and returning data for all movies.

At this point we have already made so many assumptions that it won't hurt to make another couple, for simplicity of implementation:

1. That the data in the local DB doesn't change while the service is running
(it physically cannot, since the "local DB" is actually a set of JSON files inside the project directory).
Otherwise we would open an entirely new can of worms regarding "cache" invalidation:
e.g. what if the IMDB ID of an existing movie changes between requests?

2. That the data in OMDB for the movies in the local DB doesn't change while the service is running.
Otherwise we would have to constantly refetch all data for all movies just to keep the searchable fields up to date,
so that search returns correct and relevant results.

So the final implementation idea is:

1. Preload all entries from the local "database" on startup;
2. For each of them, fetch the relevant data from OMDB, also on startup;
3. Create "merged movie objects" and keep them in memory;
4. For every search request: go through all objects, find the matching ones, and return them
   (with zero calls to the OMDB API or to the local "database" in the process).

## Technical decisions made

### "Additional tasks"

#### AWS Lambda Function

Considering everything said above regarding the search, I don't think this service
(or any other implementation of the assignment, however reasonable it might be)
is suitable for serverless deployment.

In order for the search to work, we have to fetch data from OMDB for every movie in the local "database", one by one.
Either we do that on every search request (which is unreasonable, and can easily push us above the 1000 requests/day quota,
even for the sample database with only four movies in it), or we have to store the fetched data in some kind of more or less persistent state.

Since the assignment says not to introduce any system dependencies, the only more or less persistent state we can use in Lambda is its memory.

However, we should not rely on when Lambda instances are cold-started and shut down; that is essentially what "serverless" means.

And even if we did rely on that: if our Lambda gets too many requests at the same time, AWS will spin up several instances,
and each one will have to fetch all the data from OMDB, which is not a good idea.
On the other hand, if our Lambda only gets infrequent requests, AWS will shut it down and cold-start it again on the next request,
meaning that we would have to refetch everything from OMDB all over again.

The reasons why this service cannot reasonably (and should not) be packaged as a serverless function are the same as
the reasons why the assignment as described in the README seems to be the wrong solution to an (unknown) problem.

#### Caching

Preloading everything into local memory on application startup can reasonably be called "caching".

In this implementation the TTL is infinite, because of the assumptions described near the end of the "Search feature" section.

### Efficiency, application scaling

In this implementation, the search and get request handlers just do a full scan of the preloaded data; a minimal sketch of that idea is shown below.
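For illustration, a full-scan search over the preloaded merged objects might look roughly like this (the names and the exact-match semantics are assumptions, not the actual code from this repository):

```typescript
// Illustrative sketch of the full-scan search; names are not the actual ones used in the project.
interface MergedMovie {
  title: string;
  directors: string[];
  actors: string[];
  genres: string[];
  // ...the rest of the merged fields
}

interface Filters {
  title?: string;
  directors?: string;
  actors?: string;
  genres?: string;
}

// Every provided filter must match; array-valued fields match if they contain the value.
function searchMovies(movies: readonly MergedMovie[], filters: Filters): MergedMovie[] {
  return movies.filter(
    (movie) =>
      (filters.title === undefined || movie.title === filters.title) &&
      (filters.directors === undefined || movie.directors.includes(filters.directors)) &&
      (filters.actors === undefined || movie.actors.includes(filters.actors)) &&
      (filters.genres === undefined || movie.genres.includes(filters.genres)),
  );
}
```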
This is not very efficient (the time scales as O(number of movies)), but it is good enough for small numbers of movies
(certainly below a thousand); latency on a small enough local database should be low.
And if the local database contains over a thousand movies, we have a bigger problem anyway, because we won't be able to
fetch all the needed data from OMDB before hitting the OMDB API usage limits.

I didn't do performance measurements because I didn't have enough time for that.
But at least this implementation can be scaled horizontally, improving performance (in terms of the number of requests handled per second)
linearly with the number of instances, at the expense of extra requests to the OMDB API from each instance.

### Framework

I decided to implement this API with Nest.js, because I have already used it on test assignments before,
and it is good and convenient enough for API prototyping.

I'm also using TypeScript with the "strictest" config, along with `@typescript-eslint` with the "strict-type-checked" config.

### OMDB format

Unfortunately, I couldn't find any clear documentation on the OMDB response format.
Instead, I picked a few actual OMDB responses (mostly the ones listed in [`omdb/converters.spec.ts`](src/integration/movies/omdb/converters.spec.ts))
and adapted my type definitions and converters to them.

As a result, I assume that all responses from OMDB satisfy
the `OmdbResponse` type defined in [`omdb/types.ts`](src/integration/movies/omdb/types.ts) and do not contain any extra fields;
and additionally that `'N/A'` values are only allowed in the string fields mentioned in [`omdb/converters.ts`](src/integration/movies/omdb/converters.ts)
in a `createOptionalField` context (i.e. `Awards`, `BoxOffice`, `Rated`, `DVD`, `Episode`, `Season`, `seriesID`, `totalSeasons`, and `Website`).

Additionally, I couldn't find any example of an OMDB entry with a `Production` value different from `'N/A'`,
so I have no idea what this field means, and I decided to (explicitly) ignore it.

### Merging and mapping

I decided to have my own `MovieData` structure (defined in [`types.ts`](src/types.ts)),
and to convert data from both the OMDB and the DB formats into this structure.

Instead of using any pre-existing libraries, I decided to do the conversion manually (in the two `converters.ts` files),
because for this small one-off assignment it was easier to do it by hand than to integrate another library,
and because I implemented the conversion in a type-safe way, with compile-time checks that I didn't forget about anything
and with runtime checks that there were no extra fields in the input that I should have handled.
This has the drawback that adding any new field to the source data will result in an error;
on the other hand, it means that adding any new field to the source data requires one to explicitly describe
how it should be mapped and merged.

The assignment says that OMDB `Title` overrides our own `title`, and that OMDB `Plot` overrides our own `description`.
But from the sample data, it seems that OMDB always has both in English, our "database" always has both in German,
and these fields are always present and non-empty, so I thought that it might make sense to preserve both
(with `title` and `description` in the merged result containing the English data from OMDB,
and `localTitle` and `localDescription` containing the German data from our "database").
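For illustration, the relevant slice of the merged structure looks roughly like this (a simplified sketch, not the full `MovieData` definition from [`types.ts`](src/types.ts)):

```typescript
// Simplified sketch; the real MovieData in src/types.ts has many more fields.
interface MovieDataTitleSlice {
  title: string; // English title, taken from OMDB `Title`
  description: string; // English plot, taken from OMDB `Plot`
  localTitle: string; // German title, taken from our "database"
  localDescription: string; // German description, taken from our "database"
}
```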
Besides title, description, runtime, and ratings, there is only one other field present both in the OMDB response and in our "database": language.
There was no clear way to reconcile the data from OMDB with the data from the sample JSONs:
the sample JSONs have only one language per movie, as a two-letter code,
while the OMDB data contains a list of languages, not all of which even have two-letter codes
(because there are more languages in the world than two-letter codes):
for example, tt7394674 has `Language: 'English, Micmac'` in its OMDB response,
and the blockbuster tt11866324 (shot mostly in Comanche) has `Language: 'English, North American Indian, French'`.
So I decided to ignore the language field from OMDB and only use the language field from the internal database.

After all these adjustments, the merging algorithm was as simple as doing `const mergedData = { ...normalizedOmdbData, ...normalizedInternalData }`,
with the single exception of having to manually merge the `ratings` fields (just by concatenating the corresponding arrays).

One more thing to note is that OMDB is somewhat inconsistent in its naming:
it has `Director`, `Writer`, `Genre` and `Country` fields (singular) but `Actors` (plural), while our "database" has `studios` (plural),
even though all six fields can (and typically do) contain several values and are ultimately represented as `string[]` in the merged object.

I decided to use plural names for all six.

### API

The original assignment didn't mention anything about how one should handle requests for movies
that are present in our "database" but not in OMDB, or vice versa.
In practice, all four entries present in the sample JSONs have corresponding entries in OMDB.
For simplicity (and in order to guarantee that the `/api/movies/:id` endpoint always returns a full merged `MovieData` object),
I decided to return 404 when at least one of the sources does not have any data.
Changing the implementation so that the API returns partial data when only one of the sources knows about the requested movie would be trivial.

The assignment seems to be a bit contradictory regarding the way search should work:
it implies that a search field name is the same as the corresponding merged object field name, but then it gives `?director=Frank Miller` as an example
(and, as mentioned above, I decided to go with plural forms in the merged object);
also, the original OMDB object has `Actors` while our "database" has `studios`.
It feels weird to have `actor: ['Mark Hamill', 'Harrison Ford', 'Carrie Fisher']` (singular),
and it also feels weird to have `?actors=Mark Hamill` in the search query,
so ideally the API should probably use plural forms in the merged object but singular forms in the search query.
But that is very much outside the scope / timebox of this assignment, so I decided to go with plural forms both in the merged object and in search.

Another thing regarding search is that for most fields it does not make any sense to search by them
(for example, who would search, and why, by `?awards=Won 6 Oscars. 65 wins & 31 nominations total`, by `?boxOffice=$460,998,507`, or by the full plot description?).
So I limited the search feature to a subset of fields, as described by the `SearchFilters` type in [`types.ts`](src/types.ts)
and by the `SearchDto` in [`movies.controller.ts`](src/movies.controller.ts); a rough sketch of what that subset might look like is shown below.
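For illustration only; the authoritative list is the `SearchFilters` type itself, and the exact fields below are assumptions:

```typescript
// Hypothetical sketch of the searchable subset; see SearchFilters in src/types.ts for the real definition.
interface SearchFiltersSketch {
  title?: string;
  directors?: string;
  writers?: string;
  actors?: string;
  genres?: string;
  countries?: string;
  studios?: string;
}
```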
And finally, it does not make a lot of sense to search with an empty query (we probably don't want to just list our entire database to anybody
who did not enter any search filters), so I return a 400 Bad Request error for requests to the search endpoint with an empty query.

### Testing

The code is mostly covered by tests (according to `npm run test:cov`, plus the e2e tests).

I did not have time to come up with good test cases, so most tests work on samples of data from OMDB and on the provided sample JSONs.

Regular tests (run with `npm test`) do not have any external dependencies.
Additionally, there are tests that connect to the remote OMDB API; these are run with `npm run test:e2e`.

### Git

The assignment sends two contradicting messages:
that I should use git to document my work ("_We will look at the git history of your pull request to determine the way you approached this_"),
without squashing or bundling,
but also that the proper use of git will be evaluated.

I decided to go with the first message and use git to document the work history,
rather than to showcase the kind of git usage that would be proper for PRs in a professional environment.

### Configuration

The OMDB API token is passed to the app and to the e2e tests as an environment variable (hardcoded in package.json, but it can easily be extracted from there).

## How to use

To start: `npm run start`; the service is available on `localhost:3000` by default.

To run tests: `npm test` and `npm run test:e2e` (see the "Testing" section for more details).

To lint: `npm run lint` (see the "Framework" section for more details).
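As a quick smoke test once the service is running (the exact search route and parameter name here are assumptions; the real ones are defined in [`movies.controller.ts`](src/movies.controller.ts)):

```typescript
// Hypothetical usage sketch; adjust the route and query parameter to match movies.controller.ts.
const response = await fetch('http://localhost:3000/api/movies?directors=Frank%20Miller');
console.log(response.status, await response.json());
```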