merged solution.md into readme

main
Inga 🏳‍🌈 1 year ago
parent 6e38a9dec9
commit 52e25916de
  1. 230
      README.md
  2. 218
      solution.md

@ -1,6 +1,11 @@
(This is the original assignment test; information about solution is in [solution.md](solution.md))
---
gitea: none
include_toc: true
---
# Movie Metadata Search
# Assignment
## Movie Metadata Search
## Preface
@ -96,3 +101,224 @@ You will be evaluated by a number of criteria, among others:
If you have any further questions, please open an issue in this GitHub repository and we'll try to give you an answer as quickly as possible.
### Good luck
# Solution
## Search feature
To address the elephant in the room: search is supposed to be "performed on the merged json object of movies", i.e. including the data from OMDB.
For example, `?director=Frank Miller` should return movie with internal ID 3532674 (imdb ID tt0401792),
even though our "database" does not have any information about Frank Miller having directed this movie,
only OMDB has this data.
And OMDB does not support searching by director name (the only kind of search it seems to support is by title, optionally with year,
and it only returns a single entry at a time).
So the only way to figure out which of the movies in the "database" are directed by Frank Miller is
to fetch data for _every_ movie present in our "database" from OMDB, and then
either do a full scan or build some kind of search index.
**Were this a real-world task, I would say that, depending on what problem are we trying to solve,
the solution proposed in README is most likely a wrong solution.
And that enriching local data with OMDB data should be a responsibility of the service that populates the DB, not of the one that reads from it;
that the right approach would be for DB to contain all the data used in search,
not for API returning data from that DB to fetch all the entries from DB and then fetch all the corresponding entries from OMDB one-by-one
(regardless of whether that happens on startup or on every search).**
But as I understand, this is supposed to be a coding challenge, not a requirements analysis challenge.
So let's get to the topic of implementing that potentially incorrect solution.
Our "database" here only contains four movies, but README implies that this is just for (extreme) simplicity;
an actual database would presumably contain thousands of titles, and even simple test database would probably contain a hundred.
(If we could rely on the "database" only containing these four movies, then we wouldn't even need to call it a "database",
we could just hardcode enriched data in the application and use it in searches, without making any external requests,
but I guess that's not what this test assignment is about.)
So fetching data from OMDB on every search request is out of question; we must cache OMDB responses somehow,
not as an optimization strategy but as a requirement.
Additionally, since "Constraints" section mentions to not introduce any system dependencies, we need to cache it in-memory.
Building an index on the cached data (and maintaining it) is clearly outside of the scope of this assignment,
so we'll have to do a full scan on every search request, and at this point we might as well
only do full scans in "database" as well.
This means that it is enough for DB interface to only have three methods:
returning "our own data" for a specific movie by "our internal ID", by IMDB ID, and returning data for all movies.
At this point we already made so many assumptions that it won't hurt to make another couple, for simplicity of implementation:
1. That data in local DB doesn't change while the service is running
(it physically cannot, with "local DB" actually being JSON files inside the project directory).
Otherwise that would open an entirely new can of worms regarding "cache" invalidation:
e.g. what if IMDB ID of an existing movie changes between requests?
2. That data in OMDB for the movies in the local DB doesn't change while the service is running.
Otherwise we would have to constantly refetch all data for all movies just to update the searchable fields,
so that search provides the correct and relevant results.
So the final implementation idea is:
1. Preload all entries from the local "database" on startup;
2. For each of them, fetch the relevant data from OMDB on startup;
3. Create "merged movie objects", keep them in memory;
4. For every search request: go through all objects, find matching ones, return them
(with zero calls to OMDB API or to the local "database" in the process).
## Technical decisions made
### "Additional tasks"
#### AWS Lambda Function
Considering everything that was said above regarding the search, I don't think this service
(or any other reasonable implementation of the assignment, as reasonable as it might be)
is suitable for serverless deployment.
In order to have the search working, we should fetch data from OMDB for every movie in the local "database", one by one.
Either we do it on every search request (which is unreasonable, and can easily get us above 1000 requests/day quota,
even for the sample database with only four movies in it), or we have to store the fetched data in some kind of somewhat persistent state.
Since the assignment said to not introduce any system dependencies, the only somewhat persistent state we can use in Lambda is in its memory.
However, we should not rely on the way Lambda instances are cold started and shut down; this is what "serverless" basically means.
And even if we would rely on that: if our Lambda will get too many requests at the same time, AWS will spin up several instances,
and each one will have to fetch all data from OMDB, which is not really a good idea.
On the other hand, if our Lambda will only get infrequent requests, AWS will shut it down and cold-start it again on the next request,
meaning that we would have to refetch everything from OMDB again.
Reasons why this service cannot reasonably / should not be packaged as a serverless function are the same as
the reasons why the assignment as described in README seems to be a really wrong solution to an (unknown) problem.
#### Caching
Preloading everything into local memory on application startup can reasonably be called "caching".
In this implementation TTL is infinite, because of the assumptions described near the end of "Search feature" section.
### Efficiency, application scaling
In this implementation, search and get request handlers just do a full scan of preloaded data.
This is not very efficient (the time scales as O(number of movies)), but good enough for low enough numbers of movies
(definitely if it is below a thousand); latency on small enough local database should be low.
And if local database contains over a thousand movies, we have a bigger problem anyway because we won't be able to
fetch all the needed data from OMDB before hitting OMDB API usage limits.
I didn't do performance measurements because I didn't have enough time for that.
But at least this implementation can be scaled horizontally, improving performance (in terms of numbers of requests handled per second)
linearly with the number of instances, at the expense of extra requests to OMDB API from each instance.
### Framework
I decided to implement this API with Nest.js, because I already used it on test assignments before,
and it's good and convenient enough for API prototyping.
I'm also using TypeScript with "strictest" config, along with `@typescript-eslint` with "strict-type-checked" config.
### OMDB format
Unfortunately, I couldn't find any clear documentation on OMDB response format.
Instead, I just picked a few actual OMDB responses (mostly the ones listed in [`omdb/converters.spec.ts`](src/integration/movies/omdb/converters.spec.ts)),
and adapted my type definitions and converters to them.
As a result, I assume that all responses from OMDB satisfy
`OmdbResponse` type defined in [`omdb/types.ts`](src/integration/movies/omdb/types.ts) and do not contain any extra fields;
and additionally that `'N/A'` values are only allowed in string fields mentioned in [`omdb/converters.ts`](src/integration/movies/omdb/converters.ts)
in `createOptionalField` context (i.e. `Awards`, `BoxOffice`, `Rated`, `DVD`, `Episode`, `Season`, `seriesID`, `totalSeasons`, and `Website`).
Additionally, I couldn't find any example of OMDB entry with `Production` value different from `'N/A'`,
so I have no idea what does this field mean, and I decided to (explicitly) ignore it.
### Merging and mapping
I decided to have my own `MovieData` structure (defined in [`types.ts`](src/types.ts)),
and to convert data from both OMDB and DB format to this structure.
Instead of using any pre-existing libraries, I decided to do it manually (in two `converters.ts` files),
because for this small one-off assignment it was easier to do it by hand than integrate another library,
and because I implemented the conversion in a type-safe way with compile-time checks that I didn't forget about anything,
and with runtime checks that there were no extra fields in the input I should have handled.
Thi has a drawback that adding any new field to the source data will result in error;
on the other hand this means that adding any new field to the source data will require one to explicitly describe
how it should be mapped and merged.
The assignment says that OMDB `Title` overrides our own `title`, and that OMDB `Plot` overrides our own `description`.
But from the sample data, it seems that OMDB always has both in English, and our "database" always has both in German,
and that these fields are always present and non-empty, so I thought that it might make sense to preserve both
(with `title` and `description` in merged result containing the English data from OMDB,
and `localTitle` and `localDescription` containing the German data from our "database").
Besides title, description, runtime and ratings, there is only one other field present both in OMDB response and in our "database": language.
There was no clear way to reconcile the data from OMDB with the data from sample JSONs:
sample JSONs only have one language per movie, as a two-letter code,
while IMDB data contains a list of languages, not all of them even having two-letter codes
(because there are more languages in the world than two letter-codes):
for example, tt7394674 has `Language: 'English, Micmac'` in OMDB response;
a blockbuster tt11866324 (shot mostly in Comanche) has `Language: 'English, North American Indian, French'`.
So I decided to ignore the language field from OMDB, and only use the language field from the internal database.
After all these adjustments, the merging algorithm was as simple as doing `const mergedData = { ...normalizedOmdbData, ...normalizedInternalData }`,
with a single exception of having to manually merge `ratings` fields (just by concatenating the corresponding arrays).
One more thing to note is that OMDB is somewhat inconsistent in its naming;
it has `Director`, `Writer`, `Genre` and `Country` fields (singular) but `Actors` (plural); while our "database" has `studios` (plural),
even though all six fields can (and typically do) contain several values, and are ultimately represented as `string[]` in the merged object.
I decided to use plural names for all six.
### API
The original assignment didn't mention anything about how should one handle requests for movies
that are present in our "database" but not OMDB, or vice versa.
On practice, all four entries present in sample JSONs have corresponding entries in OMDB.
For simplicity (and in order to guarantee that `/api/movies/:id` endpoint always returns full merged `MovieData` object),
I decided to return 404 when at least one of the sources does not have any data.
Changing the implementation so that API returns partial data when only one of the sources knows about the requested movie would be trivial.
The assignment seems to be a bit contradictory regarding the way how search should work;
it implies that search field name is the same as merged object field name, but then it provides an example as `?director=Frank Miller`
(as I mentioned above, I decided to go with plural forms in the merged object),
and also the original OMDB object has `Actors` and our "database" has `studios`.
It feels weird to have `actor: ['Mark Hamill', 'Harrison Ford', 'Carrie Fisher']` (in singular),
and it also feels weird to have `?actors=Mark Hamill` in the search query,
so ideally API should probably have plural forms in the merged object but singular forms in search query.
But also this is very much outside of the scope / timebox for this assignment, so I decided to go with plural forms both in merged object and in search.
Another thing regarding search is that it seems that for most fields, it does not make any sense to search by them
(for example, who would search (and why) by `?awards=Won 6 Oscars. 65 wins & 31 nominations total` or by `?boxOffice=$460,998,507` or by full plot description?)
So I limited the search feature to a subset of fields, as described in `SearchFilters` type in [`types.ts`](src/types.ts)
and in `SearchDto` in [`movies.controller.ts`](src/movies.controller.ts).
And finally, it does not make a lot of sense to search by an empty query (we probably don't want to just list our entire database to anybody
who did not enter any search filters), so I return a 400 Bad Request errors for request to search endpoint with empty query.
### Testing
The code is mostly (according to `npm run test:cov`, plus e2e tests) covered by tests.
I did not have any time to come up with good test cases, so most tests work on samples of data from OMDB and on provided sample JSONs.
Regular tests (ran with `npm test`) do not have any external dependencies.
Additionally there are tests that connect to remote OMDB API; these are ran with `npm run test:e2e`.
### Git
The assignment has two contradicting messages:
that I should use git to document my work ("_We will look at the git history of your pull request to determine the way you approached this_"),
without squashing or bundling;
but also that you will evaluate the proper use of git.
I decided to go with the first message, and use git to document work history, not to showcase git use that would be proper for PRs in a professional environment.
### Configuration
OMDB API token is passed to the app and e2e tests as an environment variable (hardcoded in package.json, but can easily be extracted from there).
## How to use
To start: `npm run start`; it is available on `localhost:3000` by default.
To run tests: `npm test`, `npm run test:e2e` (see "Testing" section for more details).
To lint: `npm run lint` (see "Framework" section for more details).

@ -1,218 +0,0 @@
# Search feature
To address the elephant in the room: search is supposed to be "performed on the merged json object of movies", i.e. including the data from OMDB.
For example, `?director=Frank Miller` should return movie with internal ID 3532674 (imdb ID tt0401792),
even though our "database" does not have any information about Frank Miller having directed this movie,
only OMDB has this data.
And OMDB does not support searching by director name (the only kind of search it seems to support is by title, optionally with year,
and it only returns a single entry at a time).
So the only way to figure out which of the movies in the "database" are directed by Frank Miller is
to fetch data for _every_ movie present in our "database" from OMDB, and then
either do a full scan or build some kind of search index.
**Were this a real-world task, I would say that, depending on what problem are we trying to solve,
the solution proposed in README is most likely a wrong solution.
And that enriching local data with OMDB data should be a responsibility of the service that populates the DB, not of the one that reads from it;
that the right approach would be for DB to contain all the data used in search,
not for API returning data from that DB to fetch all the entries from DB and then fetch all the corresponding entries from OMDB one-by-one
(regardless of whether that happens on startup or on every search).**
But as I understand, this is supposed to be a coding challenge, not a requirements analysis challenge.
So let's get to the topic of implementing that potentially incorrect solution.
Our "database" here only contains four movies, but README implies that this is just for (extreme) simplicity;
an actual database would presumably contain thousands of titles, and even simple test database would probably contain a hundred.
(If we could rely on the "database" only containing these four movies, then we wouldn't even need to call it a "database",
we could just hardcode enriched data in the application and use it in searches, without making any external requests,
but I guess that's not what this test assignment is about.)
So fetching data from OMDB on every search request is out of question; we must cache OMDB responses somehow,
not as an optimization strategy but as a requirement.
Additionally, since "Constraints" section mentions to not introduce any system dependencies, we need to cache it in-memory.
Building an index on the cached data (and maintaining it) is clearly outside of the scope of this assignment,
so we'll have to do a full scan on every search request, and at this point we might as well
only do full scans in "database" as well.
This means that it is enough for DB interface to only have three methods:
returning "our own data" for a specific movie by "our internal ID", by IMDB ID, and returning data for all movies.
At this point we already made so many assumptions that it won't hurt to make another couple, for simplicity of implementation:
1. That data in local DB doesn't change while the service is running
(it physically cannot, with "local DB" actually being JSON files inside the project directory).
Otherwise that would open an entirely new can of worms regarding "cache" invalidation:
e.g. what if IMDB ID of an existing movie changes between requests?
2. That data in OMDB for the movies in the local DB doesn't change while the service is running.
Otherwise we would have to constantly refetch all data for all movies just to update the searchable fields,
so that search provides the correct and relevant results.
So the final implementation idea is:
1. Preload all entries from the local "database" on startup;
2. For each of them, fetch the relevant data from OMDB on startup;
3. Create "merged movie objects", keep them in memory;
4. For every search request: go through all objects, find matching ones, return them
(with zero calls to OMDB API or to the local "database" in the process).
# Technical decisions made
## "Additional tasks"
### AWS Lambda Function
Considering everything that was said above regarding the search, I don't think this service
(or any other reasonable implementation of the assignment, as reasonable as it might be)
is suitable for serverless deployment.
In order to have the search working, we should fetch data from OMDB for every movie in the local "database", one by one.
Either we do it on every search request (which is unreasonable, and can easily get us above 1000 requests/day quota,
even for the sample database with only four movies in it), or we have to store the fetched data in some kind of somewhat persistent state.
Since the assignment said to not introduce any system dependencies, the only somewhat persistent state we can use in Lambda is in its memory.
However, we should not rely on the way Lambda instances are cold started and shut down; this is what "serverless" basically means.
And even if we would rely on that: if our Lambda will get too many requests at the same time, AWS will spin up several instances,
and each one will have to fetch all data from OMDB, which is not really a good idea.
On the other hand, if our Lambda will only get infrequent requests, AWS will shut it down and cold-start it again on the next request,
meaning that we would have to refetch everything from OMDB again.
Reasons why this service cannot reasonably / should not be packaged as a serverless function are the same as
the reasons why the assignment as described in README seems to be a really wrong solution to an (unknown) problem.
### Caching
Preloading everything into local memory on application startup can reasonably be called "caching".
In this implementation TTL is infinite, because of the assumptions described near the end of "Search feature" section.
## Efficiency, application scaling
In this implementation, search and get request handlers just do a full scan of preloaded data.
This is not very efficient (the time scales as O(number of movies)), but good enough for low enough numbers of movies
(definitely if it is below a thousand); latency on small enough local database should be low.
And if local database contains over a thousand movies, we have a bigger problem anyway because we won't be able to
fetch all the needed data from OMDB before hitting OMDB API usage limits.
I didn't do performance measurements because I didn't have enough time for that.
But at least this implementation can be scaled horizontally, improving performance (in terms of numbers of requests handled per second)
linearly with the number of instances, at the expense of extra requests to OMDB API from each instance.
## Framework
I decided to implement this API with Nest.js, because I already used it on test assignments before,
and it's good and convenient enough for API prototyping.
I'm also using TypeScript with "strictest" config, along with `@typescript-eslint` with "strict-type-checked" config.
## OMDB format
Unfortunately, I couldn't find any clear documentation on OMDB response format.
Instead, I just picked a few actual OMDB responses (mostly the ones listed in [`omdb/converters.spec.ts`](src/integration/movies/omdb/converters.spec.ts)),
and adapted my type definitions and converters to them.
As a result, I assume that all responses from OMDB satisfy
`OmdbResponse` type defined in [`omdb/types.ts`](src/integration/movies/omdb/types.ts) and do not contain any extra fields;
and additionally that `'N/A'` values are only allowed in string fields mentioned in [`omdb/converters.ts`](src/integration/movies/omdb/converters.ts)
in `createOptionalField` context (i.e. `Awards`, `BoxOffice`, `Rated`, `DVD`, `Episode`, `Season`, `seriesID`, `totalSeasons`, and `Website`).
Additionally, I couldn't find any example of OMDB entry with `Production` value different from `'N/A'`,
so I have no idea what does this field mean, and I decided to (explicitly) ignore it.
## Merging and mapping
I decided to have my own `MovieData` structure (defined in [`types.ts`](src/types.ts)),
and to convert data from both OMDB and DB format to this structure.
Instead of using any pre-existing libraries, I decided to do it manually (in two `converters.ts` files),
because for this small one-off assignment it was easier to do it by hand than integrate another library,
and because I implemented the conversion in a type-safe way with compile-time checks that I didn't forget about anything,
and with runtime checks that there were no extra fields in the input I should have handled.
Thi has a drawback that adding any new field to the source data will result in error;
on the other hand this means that adding any new field to the source data will require one to explicitly describe
how it should be mapped and merged.
The assignment says that OMDB `Title` overrides our own `title`, and that OMDB `Plot` overrides our own `description`.
But from the sample data, it seems that OMDB always has both in English, and our "database" always has both in German,
and that these fields are always present and non-empty, so I thought that it might make sense to preserve both
(with `title` and `description` in merged result containing the English data from OMDB,
and `localTitle` and `localDescription` containing the German data from our "database").
Besides title, description, runtime and ratings, there is only one other field present both in OMDB response and in our "database": language.
There was no clear way to reconcile the data from OMDB with the data from sample JSONs:
sample JSONs only have one language per movie, as a two-letter code,
while IMDB data contains a list of languages, not all of them even having two-letter codes
(because there are more languages in the world than two letter-codes):
for example, tt7394674 has `Language: 'English, Micmac'` in OMDB response;
a blockbuster tt11866324 (shot mostly in Comanche) has `Language: 'English, North American Indian, French'`.
So I decided to ignore the language field from OMDB, and only use the language field from the internal database.
After all these adjustments, the merging algorithm was as simple as doing `const mergedData = { ...normalizedOmdbData, ...normalizedInternalData }`,
with a single exception of having to manually merge `ratings` fields (just by concatenating the corresponding arrays).
One more thing to note is that OMDB is somewhat inconsistent in its naming;
it has `Director`, `Writer`, `Genre` and `Country` fields (singular) but `Actors` (plural); while our "database" has `studios` (plural),
even though all six fields can (and typically do) contain several values, and are ultimately represented as `string[]` in the merged object.
I decided to use plural names for all six.
## API
The original assignment didn't mention anything about how should one handle requests for movies
that are present in our "database" but not OMDB, or vice versa.
On practice, all four entries present in sample JSONs have corresponding entries in OMDB.
For simplicity (and in order to guarantee that `/api/movies/:id` endpoint always returns full merged `MovieData` object),
I decided to return 404 when at least one of the sources does not have any data.
Changing the implementation so that API returns partial data when only one of the sources knows about the requested movie would be trivial.
The assignment seems to be a bit contradictory regarding the way how search should work;
it implies that search field name is the same as merged object field name, but then it provides an example as `?director=Frank Miller`
(as I mentioned above, I decided to go with plural forms in the merged object),
and also the original OMDB object has `Actors` and our "database" has `studios`.
It feels weird to have `actor: ['Mark Hamill', 'Harrison Ford', 'Carrie Fisher']` (in singular),
and it also feels weird to have `?actors=Mark Hamill` in the search query,
so ideally API should probably have plural forms in the merged object but singular forms in search query.
But also this is very much outside of the scope / timebox for this assignment, so I decided to go with plural forms both in merged object and in search.
Another thing regarding search is that it seems that for most fields, it does not make any sense to search by them
(for example, who would search (and why) by `?awards=Won 6 Oscars. 65 wins & 31 nominations total` or by `?boxOffice=$460,998,507` or by full plot description?)
So I limited the search feature to a subset of fields, as described in `SearchFilters` type in [`types.ts`](src/types.ts)
and in `SearchDto` in [`movies.controller.ts`](src/movies.controller.ts).
And finally, it does not make a lot of sense to search by an empty query (we probably don't want to just list our entire database to anybody
who did not enter any search filters), so I return a 400 Bad Request errors for request to search endpoint with empty query.
## Testing
The code is mostly (according to `npm run test:cov`, plus e2e tests) covered by tests.
I did not have any time to come up with good test cases, so most tests work on samples of data from OMDB and on provided sample JSONs.
Regular tests (ran with `npm test`) do not have any external dependencies.
Additionally there are tests that connect to remote OMDB API; these are ran with `npm run test:e2e`.
## Git
The assignment has two contradicting messages:
that I should use git to document my work ("_We will look at the git history of your pull request to determine the way you approached this_"),
without squashing or bundling;
but also that you will evaluate the proper use of git.
I decided to go with the first message, and use git to document work history, not to showcase git use that would be proper for PRs in a professional environment.
## Configuration
OMDB API token is passed to the app and e2e tests as an environment variable (hardcoded in package.json, but can easily be extracted from there).
# How to use
To start: `npm run start`; it is available on `localhost:3000` by default.
To run tests: `npm test`, `npm run test:e2e` (see "Testing" section for more details).
To lint: `npm run lint` (see "Framework" section for more details).
Loading…
Cancel
Save