Inga 🏳🌈
52e25916de
|
1 year ago | |
---|---|---|
src | 1 year ago | |
test | 1 year ago | |
.eslintrc.js | 1 year ago | |
.gitignore | 1 year ago | |
.prettierrc | 1 year ago | |
README.md | 1 year ago | |
nest-cli.json | 1 year ago | |
package-lock.json | 1 year ago | |
package.json | 1 year ago | |
tsconfig.build.json | 1 year ago | |
tsconfig.json | 1 year ago |
README.md
Table of Contents
Assignment
Movie Metadata Search
Preface
Hello,
this is part of your interview with Joyn that aims to test your ability to write services that efficiently solve problems we have in our business.
Please read this document carefully before starting, as it outlines your task, the constraints and by what criteria you will be evaluated.
Use git to document your work. When finished, please create a pull request on the master
branch, according to the following naming convention: <your name>-solution
We will look at the git history of your pull request to determine the way you approached this. Please do not squash commits or bundle many unrelated changes into one large commit.
The service
You are tasked to implement a RESTful API that provides the following two basic functionalities to retrieve movie metadata from a content catalogue.
The data used for this comes from two sources:
- Typically, our own movie data would come from a database, but to simplify this, we use the static json files in
./movies
as our content catalogue. - OMDb movie metadata can be retrieved as follows:
https://www.omdbapi.com/?i=<imdb movie id>&apikey=<apikey>&plot=full
- You can use the following API key:
68fd98ab
(Limited to 1000 requests per day) - Please see https://www.omdbapi.com for details
Getting enriched movie metadata (title, description, ..)
The first task is to merge movie metadata from our systems with movie metadata from the Open Movie Database (OMDb).
- Calling
GET /api/movies/:id
should return a JSON object representing the merged movie object. :id
is an alphanumeric value that can either refer to OMDb movie ids or our internal ids.- When merging the two objects with the same fields (i.e. both JSON objects have a
title
/Title
), it depends on the name of the field, which metadata should be used. - The following rules apply, with capitalized field names (i.e.
Title
vstitle
) always referring to OMDb dataTitle
overwritestitle
Plot
overwritesdescription
duration
overwritesRuntime
userrating
will become part ofRatings
, applying a similar logic thanRatings
currently hasDirector
,Writer
andActors
should be transformed fromString
to anString[]
- Fields not covered by any of these rules should be merged into the resulting JSON in a normalized way
- If fields are unclear, make reasonable assumptions and choose your implementation accordingly
Search movies in our catalogue
We want to be able to search movies in our catalogue. To that end, we implement a simple search that returns a movie object if all search terms are true. A search term is a query param in your REST call in the form of <search_field>=<search value>
- If no search term is provided, return all movies
- Search terms are case-insensitive
- Search is performed on the merged json objects of movies
- If
<search_field>
is of typeNumber
orString
in the movie metadata, the search matches if the values are equal, i.e.?title=Sin City
matches3532674.json
- If
<search_field>
is of typeArray
in the movie metadata, the search matches if the<search value>
is contained in the array, i.e.?director=Frank Miller
matches3532674.json
/ the corresponding OMBd object - Calling
GET /api/movies?<search_field>=<search value>
should return a JSON array representing all movies that match the search criteria
Additional Tasks
The following tasks are optional but it would be nice to have at least one, in case you have additional time.
AWS Lambda Function
Use the code you have written and create an AWS Lambda function. For setting up the Lambda function and the deployment please use IaC. You are free to use any framework or tool for this task.
Caching
To improve response times it might be beneficial to add caching to the service. Go through your implementation and check where adding caching would make sense. Then pick an appropriate caching mechanism and implement it in your code. Please also take TTLs and application scaling into account.
Constraints
- Use TypeScript
- Use node.js in version 14+
- Do not introduce any system dependencies (databases, caches, search engines, docker, ..) to solve this task. This task is about your problem solving skills and not about creating a production ready system. It should not require more than
npm install
andnpm start
to have a running service. - 72h after being added to the project, your pushing rights will be revoked and the latest commit of your implementation will be the basis for your evaluation.
- We respect your time and encourage you to keep it simple: You are not expected to spend days on this - just proof that you know how to write great software in node.js
Evaluation criteria
In general you can think of the evaluation being a thorough peer review of your code. You will be evaluated by a number of criteria, among others:
- How well did you apply engineering best practices (general & node.js specific)?
- Is the service working as intended?
- How readable is your code?
- Does the service solve the problem
- correctly?
- efficiently?
- Is your code consistent in itself (styling, language constructs, ..)?
- Appropriate use of 3rd party modules
- We do not expect you to have a high test coverage, BUT it is important that you demonstrate that you know how to write testable code and provide a few tests that showcase this.
- Proper use of git
- Making good assumptions and documenting them
Questions
If you have any further questions, please open an issue in this GitHub repository and we'll try to give you an answer as quickly as possible.
Good luck
Solution
Search feature
To address the elephant in the room: search is supposed to be "performed on the merged json object of movies", i.e. including the data from OMDB.
For example, ?director=Frank Miller
should return movie with internal ID 3532674 (imdb ID tt0401792),
even though our "database" does not have any information about Frank Miller having directed this movie,
only OMDB has this data.
And OMDB does not support searching by director name (the only kind of search it seems to support is by title, optionally with year, and it only returns a single entry at a time).
So the only way to figure out which of the movies in the "database" are directed by Frank Miller is to fetch data for every movie present in our "database" from OMDB, and then either do a full scan or build some kind of search index.
Were this a real-world task, I would say that, depending on what problem are we trying to solve, the solution proposed in README is most likely a wrong solution. And that enriching local data with OMDB data should be a responsibility of the service that populates the DB, not of the one that reads from it; that the right approach would be for DB to contain all the data used in search, not for API returning data from that DB to fetch all the entries from DB and then fetch all the corresponding entries from OMDB one-by-one (regardless of whether that happens on startup or on every search).
But as I understand, this is supposed to be a coding challenge, not a requirements analysis challenge. So let's get to the topic of implementing that potentially incorrect solution.
Our "database" here only contains four movies, but README implies that this is just for (extreme) simplicity; an actual database would presumably contain thousands of titles, and even simple test database would probably contain a hundred. (If we could rely on the "database" only containing these four movies, then we wouldn't even need to call it a "database", we could just hardcode enriched data in the application and use it in searches, without making any external requests, but I guess that's not what this test assignment is about.)
So fetching data from OMDB on every search request is out of question; we must cache OMDB responses somehow, not as an optimization strategy but as a requirement.
Additionally, since "Constraints" section mentions to not introduce any system dependencies, we need to cache it in-memory. Building an index on the cached data (and maintaining it) is clearly outside of the scope of this assignment, so we'll have to do a full scan on every search request, and at this point we might as well only do full scans in "database" as well. This means that it is enough for DB interface to only have three methods: returning "our own data" for a specific movie by "our internal ID", by IMDB ID, and returning data for all movies.
At this point we already made so many assumptions that it won't hurt to make another couple, for simplicity of implementation:
-
That data in local DB doesn't change while the service is running (it physically cannot, with "local DB" actually being JSON files inside the project directory). Otherwise that would open an entirely new can of worms regarding "cache" invalidation: e.g. what if IMDB ID of an existing movie changes between requests?
-
That data in OMDB for the movies in the local DB doesn't change while the service is running. Otherwise we would have to constantly refetch all data for all movies just to update the searchable fields, so that search provides the correct and relevant results.
So the final implementation idea is:
- Preload all entries from the local "database" on startup;
- For each of them, fetch the relevant data from OMDB on startup;
- Create "merged movie objects", keep them in memory;
- For every search request: go through all objects, find matching ones, return them (with zero calls to OMDB API or to the local "database" in the process).
Technical decisions made
"Additional tasks"
AWS Lambda Function
Considering everything that was said above regarding the search, I don't think this service (or any other reasonable implementation of the assignment, as reasonable as it might be) is suitable for serverless deployment.
In order to have the search working, we should fetch data from OMDB for every movie in the local "database", one by one. Either we do it on every search request (which is unreasonable, and can easily get us above 1000 requests/day quota, even for the sample database with only four movies in it), or we have to store the fetched data in some kind of somewhat persistent state.
Since the assignment said to not introduce any system dependencies, the only somewhat persistent state we can use in Lambda is in its memory.
However, we should not rely on the way Lambda instances are cold started and shut down; this is what "serverless" basically means.
And even if we would rely on that: if our Lambda will get too many requests at the same time, AWS will spin up several instances, and each one will have to fetch all data from OMDB, which is not really a good idea. On the other hand, if our Lambda will only get infrequent requests, AWS will shut it down and cold-start it again on the next request, meaning that we would have to refetch everything from OMDB again.
Reasons why this service cannot reasonably / should not be packaged as a serverless function are the same as the reasons why the assignment as described in README seems to be a really wrong solution to an (unknown) problem.
Caching
Preloading everything into local memory on application startup can reasonably be called "caching".
In this implementation TTL is infinite, because of the assumptions described near the end of "Search feature" section.
Efficiency, application scaling
In this implementation, search and get request handlers just do a full scan of preloaded data.
This is not very efficient (the time scales as O(number of movies)), but good enough for low enough numbers of movies (definitely if it is below a thousand); latency on small enough local database should be low. And if local database contains over a thousand movies, we have a bigger problem anyway because we won't be able to fetch all the needed data from OMDB before hitting OMDB API usage limits.
I didn't do performance measurements because I didn't have enough time for that. But at least this implementation can be scaled horizontally, improving performance (in terms of numbers of requests handled per second) linearly with the number of instances, at the expense of extra requests to OMDB API from each instance.
Framework
I decided to implement this API with Nest.js, because I already used it on test assignments before, and it's good and convenient enough for API prototyping.
I'm also using TypeScript with "strictest" config, along with @typescript-eslint
with "strict-type-checked" config.
OMDB format
Unfortunately, I couldn't find any clear documentation on OMDB response format.
Instead, I just picked a few actual OMDB responses (mostly the ones listed in omdb/converters.spec.ts
),
and adapted my type definitions and converters to them.
As a result, I assume that all responses from OMDB satisfy
OmdbResponse
type defined in omdb/types.ts
and do not contain any extra fields;
and additionally that 'N/A'
values are only allowed in string fields mentioned in omdb/converters.ts
in createOptionalField
context (i.e. Awards
, BoxOffice
, Rated
, DVD
, Episode
, Season
, seriesID
, totalSeasons
, and Website
).
Additionally, I couldn't find any example of OMDB entry with Production
value different from 'N/A'
,
so I have no idea what does this field mean, and I decided to (explicitly) ignore it.
Merging and mapping
I decided to have my own MovieData
structure (defined in types.ts
),
and to convert data from both OMDB and DB format to this structure.
Instead of using any pre-existing libraries, I decided to do it manually (in two converters.ts
files),
because for this small one-off assignment it was easier to do it by hand than integrate another library,
and because I implemented the conversion in a type-safe way with compile-time checks that I didn't forget about anything,
and with runtime checks that there were no extra fields in the input I should have handled.
Thi has a drawback that adding any new field to the source data will result in error;
on the other hand this means that adding any new field to the source data will require one to explicitly describe
how it should be mapped and merged.
The assignment says that OMDB Title
overrides our own title
, and that OMDB Plot
overrides our own description
.
But from the sample data, it seems that OMDB always has both in English, and our "database" always has both in German,
and that these fields are always present and non-empty, so I thought that it might make sense to preserve both
(with title
and description
in merged result containing the English data from OMDB,
and localTitle
and localDescription
containing the German data from our "database").
Besides title, description, runtime and ratings, there is only one other field present both in OMDB response and in our "database": language.
There was no clear way to reconcile the data from OMDB with the data from sample JSONs:
sample JSONs only have one language per movie, as a two-letter code,
while IMDB data contains a list of languages, not all of them even having two-letter codes
(because there are more languages in the world than two letter-codes):
for example, tt7394674 has Language: 'English, Micmac'
in OMDB response;
a blockbuster tt11866324 (shot mostly in Comanche) has Language: 'English, North American Indian, French'
.
So I decided to ignore the language field from OMDB, and only use the language field from the internal database.
After all these adjustments, the merging algorithm was as simple as doing const mergedData = { ...normalizedOmdbData, ...normalizedInternalData }
,
with a single exception of having to manually merge ratings
fields (just by concatenating the corresponding arrays).
One more thing to note is that OMDB is somewhat inconsistent in its naming;
it has Director
, Writer
, Genre
and Country
fields (singular) but Actors
(plural); while our "database" has studios
(plural),
even though all six fields can (and typically do) contain several values, and are ultimately represented as string[]
in the merged object.
I decided to use plural names for all six.
API
The original assignment didn't mention anything about how should one handle requests for movies
that are present in our "database" but not OMDB, or vice versa.
On practice, all four entries present in sample JSONs have corresponding entries in OMDB.
For simplicity (and in order to guarantee that /api/movies/:id
endpoint always returns full merged MovieData
object),
I decided to return 404 when at least one of the sources does not have any data.
Changing the implementation so that API returns partial data when only one of the sources knows about the requested movie would be trivial.
The assignment seems to be a bit contradictory regarding the way how search should work;
it implies that search field name is the same as merged object field name, but then it provides an example as ?director=Frank Miller
(as I mentioned above, I decided to go with plural forms in the merged object),
and also the original OMDB object has Actors
and our "database" has studios
.
It feels weird to have actor: ['Mark Hamill', 'Harrison Ford', 'Carrie Fisher']
(in singular),
and it also feels weird to have ?actors=Mark Hamill
in the search query,
so ideally API should probably have plural forms in the merged object but singular forms in search query.
But also this is very much outside of the scope / timebox for this assignment, so I decided to go with plural forms both in merged object and in search.
Another thing regarding search is that it seems that for most fields, it does not make any sense to search by them
(for example, who would search (and why) by ?awards=Won 6 Oscars. 65 wins & 31 nominations total
or by ?boxOffice=$460,998,507
or by full plot description?)
So I limited the search feature to a subset of fields, as described in SearchFilters
type in types.ts
and in SearchDto
in movies.controller.ts
.
And finally, it does not make a lot of sense to search by an empty query (we probably don't want to just list our entire database to anybody who did not enter any search filters), so I return a 400 Bad Request errors for request to search endpoint with empty query.
Testing
The code is mostly (according to npm run test:cov
, plus e2e tests) covered by tests.
I did not have any time to come up with good test cases, so most tests work on samples of data from OMDB and on provided sample JSONs.
Regular tests (ran with npm test
) do not have any external dependencies.
Additionally there are tests that connect to remote OMDB API; these are ran with npm run test:e2e
.
Git
The assignment has two contradicting messages: that I should use git to document my work ("We will look at the git history of your pull request to determine the way you approached this"), without squashing or bundling; but also that you will evaluate the proper use of git.
I decided to go with the first message, and use git to document work history, not to showcase git use that would be proper for PRs in a professional environment.
Configuration
OMDB API token is passed to the app and e2e tests as an environment variable (hardcoded in package.json, but can easily be extracted from there).
How to use
To start: npm run start
; it is available on localhost:3000
by default.
To run tests: npm test
, npm run test:e2e
(see "Testing" section for more details).
To lint: npm run lint
(see "Framework" section for more details).