parent
e078010afe
commit
97d4d55381
@ -0,0 +1,202 @@ |
|||||||
|
# Search feature |
||||||
|
|
||||||
|
To address the elephant in the room: search is supposed to be "performed on the merged json object of movies", i.e. including the data from OMDB. |
||||||
|
|
||||||
|
For example, `?director=Frank Miller` should return movie with internal ID 3532674 (imdb ID tt0401792), |
||||||
|
even though our "database" does not have any information about Frank Miller having directed this movie, |
||||||
|
only OMDB has this data. |
||||||
|
|
||||||
|
And OMDB does not support searching by director name (the only kind of search it seems to support is by title, optionally with year, |
||||||
|
and it only returns a single entry at a time). |
||||||
|
|
||||||
|
So the only way to figure out which of the movies in the "database" are directed by Frank Miller is |
||||||
|
to fetch data for _every_ movie present in our "database" from OMDB, and then |
||||||
|
either do a full scan or build some kind of search index. |
||||||
|
|
||||||
|
**Were this a real-world task, I would say that, depending on what problem are we trying to solve, |
||||||
|
the solution proposed in README is most likely a wrong solution. |
||||||
|
And that enriching local data with OMDB data should be a responsibility of the service that populates the DB and not reads from it; |
||||||
|
that the right approach would be for DB to contain all the data used in search, |
||||||
|
not for API returning data from that DB to fetch all the entries from DB and then fetch all the corresponding entries from OMDB one-by-one |
||||||
|
(whether that happens on startup or on every search).** |
||||||
|
|
||||||
|
But as I understand, this is supposed to be a coding challenge, not a requirements analysis challenge. |
||||||
|
So let's get to the topic of implementing that potentially incorrect solution. |
||||||
|
|
||||||
|
Our "database" here only contains four movies, but README implies that this is just for (extreme) simplicity; |
||||||
|
an actual database would presumably contain thousands of titles, and even simple test database would probably contain a hundred. |
||||||
|
(If we could rely on the "database" only containing these four movies, then we wouldn't even need to call it a "database", |
||||||
|
we could just hardcode enriched data inside the application and use it in searches, without making any external requests, |
||||||
|
but I guess that's not what this test assignment is about.) |
||||||
|
|
||||||
|
So fetching data from OMDB on every search request is out of question; we must cache OMDB responses somehow, |
||||||
|
not as an optimization strategy but as a requirement. |
||||||
|
|
||||||
|
Additionally, since "Constraints" section mentions to not introduce any system dependencies, we need to cache it in-memory. |
||||||
|
Building an index on the cached data (and maintaining it) is clearly outside of the scope of this assignment, |
||||||
|
so we'll have to do a full scan on every search request, and at this point we might as well |
||||||
|
only do full scans in "database" as well. |
||||||
|
It is enough for DB interface to only have three methods: |
||||||
|
returning "our own data" for a specific movie by "our internal ID", by IMDB ID, and returning data for all movies. |
||||||
|
|
||||||
|
At this point we already made so many assumptions that it won't hurt to make another one, for simplicity of implementation: |
||||||
|
that data in local DB doesn't change while the service is running |
||||||
|
(it physically cannot, with "local DB" actually being JSON files inside the project directory), |
||||||
|
and that data in OMDB for the movies in the local DB doesn't change while the service is running. |
||||||
|
Otherwise that would open an entirely new can of worms regarding "cache" invalidation: |
||||||
|
e.g. what if IMDB ID of an existing movie changes between requests? |
||||||
|
|
||||||
|
So the final implementation idea is: |
||||||
|
|
||||||
|
1. Preload all entries from the local "database" on startup; |
||||||
|
2. For each of them, fetch the relevant data from OMDB; |
||||||
|
3. Create "merged movie objects", keep them in memory; |
||||||
|
4. For every search request, go through all objects, find matching ones, return them |
||||||
|
(with zero calls to OMDB API or to the local "database"). |
||||||
|
|
||||||
|
# Technical decisions made |
||||||
|
|
||||||
|
## "Additional tasks" |
||||||
|
|
||||||
|
### AWS Lambda Function |
||||||
|
|
||||||
|
Considering everything that was said above regarding the search, I don't think this service |
||||||
|
(or any other reasonable implementation of the assignment, as reasonable as it might be), |
||||||
|
is suitable for deployment to AWS as Lambda. |
||||||
|
|
||||||
|
In order to have the search working, we should fetch data from OMDB for every movie in the local "database", one by one. |
||||||
|
Either we do it on every search request (which is unreasonable, and can easily get us above 1000 requests/day quota, |
||||||
|
even for the sample database with only four movies in it), or we have to store the fetched data in some kind of somewhat persistent state. |
||||||
|
|
||||||
|
Since the assignment said to not introduce any system dependencies, the only somewhat persistent state we can use in Lambda is in its memory. |
||||||
|
|
||||||
|
However, we should not rely on the way Lambda instances are cold started and shut down. |
||||||
|
|
||||||
|
And even if we would rely on that: if our Lambda will get too many requests at the same time, AWS will spin up several instances, |
||||||
|
and each one will have to fetch all data from OMDB, which is not really a good idea. |
||||||
|
On the other hand, if our Lambda will only get infrequent requests, AWS will shut it down and cold-start it again on the next request, |
||||||
|
meaning that we would have to refetch everything from OMDB again. |
||||||
|
|
||||||
|
Reasons why this service cannot reasonably / should not be packaged as a serverless function are the same as |
||||||
|
the reasons why the assignment as described in README seems to be a really wrong solution to an (unknown) problem. |
||||||
|
|
||||||
|
### Caching |
||||||
|
|
||||||
|
Preloading everything into local memory on application startup can reasonably be called "caching". |
||||||
|
|
||||||
|
In this implementation TTL is infinite, because of the assumptions described in "Search feature" section. |
||||||
|
|
||||||
|
## Efficiency, application scaling |
||||||
|
|
||||||
|
In this implementation, search and get request handlers just do a full scan of preloaded data. |
||||||
|
|
||||||
|
This is not very efficient (the time scales as O(number of movies)), but good enough for low enough numbers of movies |
||||||
|
(e.g. even in hundreds). |
||||||
|
And if local database contains over a thousand movies, we have a bigger problem anyway because we won't be able to |
||||||
|
fetch all the needed data from OMDB before hitting OMDB API usage limits. |
||||||
|
|
||||||
|
I didn't do performance measurements because I didn't have enough time for that. |
||||||
|
But at least this implementation can be scaled horizontally, improving performance (in terms of numbers of requests handled per second) linearly, |
||||||
|
at the expense of extra requests to OMDB API from each instance. |
||||||
|
|
||||||
|
## framework |
||||||
|
|
||||||
|
I decided to implement this API with Nest.js, because I already used it on test assignments before, |
||||||
|
and it's good and convenient enough for API prototyping. |
||||||
|
|
||||||
|
## OMDB format |
||||||
|
|
||||||
|
Unfortunately, I couldn't find any clear documentation on OMDB response format. |
||||||
|
Instead, I just picked a few actual OMDB responses (mostly the ones described in [`omdb/converters.spec.ts`](src/integration/movies/omdb/converters.spec.ts)), |
||||||
|
and adapted my type definitions and converters to them. |
||||||
|
|
||||||
|
As a result, I assume that all responses from OMDB satisfy |
||||||
|
`OmdbResponse` type defined in [`omdb/types.ts`](src/integration/movies/omdb/types.ts) and do not contain any extra fields; |
||||||
|
and additionally that `'N/A'` values are only allowed in string fields mentioned in [`omdb/converters.ts`](src/integration/movies/omdb/converters.ts) |
||||||
|
in `createOptionalField` context (i.e. `Awards`, `BoxOffice`, `Rated`, `DVD`, `Episode`, `Season`, `seriesID`, `totalSeasons`, and `Website`). |
||||||
|
|
||||||
|
Additionally, I couldn't find any example of OMDB entry with `Production` value different from `'N/A'`, |
||||||
|
so I have no idea what does this field mean, and I decided to (explicitly) ignore it. |
||||||
|
|
||||||
|
## Merging and mapping |
||||||
|
|
||||||
|
I decided to have my own `MovieData` structure (defined in [`types.ts`](src/types.ts)), |
||||||
|
and to convert data from both OMDB and DB format to this structure. |
||||||
|
|
||||||
|
Instead of using any pre-existing libraries, I decided to do it manually (in two `converters.ts` files), |
||||||
|
because for this small one-off assignment it was easier to do it by hand than integrate another library, |
||||||
|
and because I implemented the conversion in a type-safe way with compile-time checks that I didn't forget about anything, |
||||||
|
and with runtime checks that there were no extra fields in the input I should have handled. |
||||||
|
Thi has a drawback that adding any new field to the source data will result in error; |
||||||
|
on the other hand this means that adding any new field to the source data will require one to explicitly describe |
||||||
|
how it should be mapped and merged. |
||||||
|
|
||||||
|
The assignment says that OMDB `Title` overrides our own `title`, and that OMDB `Plot` overrides our own `description`. |
||||||
|
But from the sample data, it seems that OMDB always has both in English, and our "database" always has both in German, |
||||||
|
and that these fields are always present and non-empty, so I thought that it might make sense to preserve both |
||||||
|
(with `title` and `description` in merged result containing the English data from OMDB, |
||||||
|
and `localTitle` and `localDescription` containing the German data from our "database"). |
||||||
|
|
||||||
|
Besides title, description and runtime, there is another field present both in OMDB response and in our "database": language. |
||||||
|
There was no clear way to reconcile the data from OMDB with the data from sample JSONs: |
||||||
|
sample JSONs only have one language per movie, as a two-letter code, |
||||||
|
while IMDB data contains a list of languages, not all of them even having two-letter codes |
||||||
|
(because there are more languages in the world than two letter-codes): |
||||||
|
for example, tt7394674 has `Language: 'English, Micmac'` in OMDB response; |
||||||
|
a blockbuster tt11866324 (shot mostly in Komanche) has `Language: 'English, North American Indian, French'`. |
||||||
|
So I decided to ignore the language field from OMDB, and only use the language field from the internal database. |
||||||
|
|
||||||
|
After all these adjustments, the merging algorithm was as simple as doing `const mergedData = { ...normalizedOmdbData, ...normalizedInternalData }`, |
||||||
|
with a single exception of having to manually merge `ratings` fields (just by concatenating the corresponding arrays). |
||||||
|
|
||||||
|
One more thing to note is that OMDB is somewhat inconsistent in its naming; |
||||||
|
it has `Director`, `Writer`, `Genre` and `Country` fields (singular) but `Actors` (plural); while our "database" has `studios` (plural), |
||||||
|
even though all six fields can (and typically do) contain several values, and are ultimately represented as `string[]` in the merged object. |
||||||
|
|
||||||
|
I decided to use plural names for all six. |
||||||
|
|
||||||
|
## API |
||||||
|
|
||||||
|
The original assignment didn't mention anything about how should one handle requests for movies |
||||||
|
that are present in our "database" but not OMDB, or vice versa. |
||||||
|
On practice, all four entries present in sample JSONs have corresponding entries in OMDB. |
||||||
|
For simplicity (and in order to guarantee that `/api/movied/:id` endpoint always returns full merged `MovieData` object), |
||||||
|
I decided to return 404 when at least one of the sources does not have any data. |
||||||
|
Changing the implementation so that API returns partial data if only one of the sources knows about the requested movie would be trivial. |
||||||
|
|
||||||
|
The assignment seems to be a bit contradictory regarding the way how search should work; |
||||||
|
it implies that search field name is the same as merged object field name, but then it provides an example as `?director=Frank Miller` |
||||||
|
(as I mentioned above, I decided to go with plural forms in the merged object), |
||||||
|
and also the original OMDB object has `Actors` and our "database" has `studios`. |
||||||
|
It feels weird to have `actor: ['Mark Hamill', 'Harrison Ford', 'Carrie Fisher']` (in singular), |
||||||
|
and it also feels weird to have `?actors=Mark Hamill` in the search query, |
||||||
|
so ideally API should probably have plural forms in the merged object but singular forms in search query. |
||||||
|
But also this is very much outside of the scope / timebox for this assignment, so I decided to go with plural forms both in merged object and in search. |
||||||
|
|
||||||
|
Another thing regarding search is that it seems that for most fields, it does not make any sense to search by them |
||||||
|
(for example, who would search (and why) by `?awards=Won 6 Oscars. 65 wins & 31 nominations total` or by `?boxOffice=$460,998,507` or by full plot description?) |
||||||
|
So I limited the search feature to a subset of fields, as described in `SearchFilters` type in [`types.ts`](src/types.ts) |
||||||
|
and in `SearchDto` in [`movies.controller.ts`](src/movies.controller.ts). |
||||||
|
|
||||||
|
## Testing |
||||||
|
|
||||||
|
I did not have any time to come up with a good test cases, so most tests work on samples of data from OMDB and on provided sample JSONs. |
||||||
|
|
||||||
|
## Git |
||||||
|
|
||||||
|
The assignment has two contradicting messages: |
||||||
|
that I should use git to document my work ("We will look at the git history of your pull request to determine the way you approached this"), |
||||||
|
without squashing or bundling; |
||||||
|
but also that you will evaluate the proper use of git. |
||||||
|
|
||||||
|
I decided to go with the first message, and use git to document work history, not to showcase git use that would be proper for PR in a professional environment. |
||||||
|
|
||||||
|
# How to use |
||||||
|
|
||||||
|
To start: `npm run start`; it is available on `localhost:3000` by default. |
||||||
|
|
||||||
|
To run tests: `npm test`. |
||||||
|
|
||||||
|
To run tests that connect to actual OMDB (these are not included in `npm test`): `npm run test:e2e`. |
||||||
|
|
||||||
|
To lint: `npm run lint`. |
Loading…
Reference in new issue