
# Assignment

## 👉 Our Expectations

We greatly appreciate your commitment to this process.

The main goal of this task is to:

- Get to know how you work
- Assess your general understanding and expertise in relevant aspects of backend development
- Ultimately, have a conversation with you to discuss the actions you performed, results, and any issues encountered.

We do not want to create a stressful situation for you or interfere with your daily activities. Therefore, you should not invest more than a few hours in executing this task.

Please feel free to contact us at any time if you have any questions.

We are confident that you will do your best, and we are looking forward to reviewing the output with you. 😊

## 💡 **Track and Trace API**

The Track and Trace page enables end users (receiver of a shipment) to monitor the status of their shipment.

Your task is to create a backend API that serves shipment data and provides users with current weather conditions at their location.

### **Guideline**

You have been provided with a CSV file (below) containing sample shipment and article data to seed your data structures. Your task is to create an API application that performs the following:

- Provides a RESTful or GraphQL Endpoint that exposes shipment and article information along with corresponding weather information.
- Allows consumers to look up shipments via tracking number and carrier.

*Hints*:

- Integrate a suitable weather API: Choose a weather API (e.g., OpenWeatherMap, Weatherbit, or any other free API) and fetch the weather information for the respective locations.
- Limit weather data retrieval: Ensure that weather information for the same location (zip code) is fetched at most every 2 hours to minimize API calls.
- You can use any framework or library that you feel comfortable with.

*Nice to have:*

- Provide unit tests and/or integration tests for the application.
- OpenAPI docs
- Implement the API using a well-known backend framework.

### **Solution**

**Deliverables**:

- Application code, tests, and documentation needed to run the code.

Evaluation Criteria:

- Code quality and organization
- Functional correctness
- Efficiency, robustness, and adequate design choices

Discussion Points:

- What were the important design choices and trade-offs you made?
- What would be required to deploy this application to production?
- What would be required to scale this application to handle 1000 requests per second?

### Seed Data

NB: You don't need to write an import if that is too time-consuming; you can just set up one or two sample items in the DB via any mechanism you like.

```csv
tracking_number,carrier,sender_address,receiver_address,article_name,article_quantity,article_price,SKU,status
TN12345678,DHL,"Street 1, 10115 Berlin, Germany","Street 10, 75001 Paris, France",Laptop,1,800,LP123,in-transit
TN12345678,DHL,"Street 1, 10115 Berlin, Germany","Street 10, 75001 Paris, France",Mouse,1,25,MO456,in-transit
TN12345679,UPS,"Street 2, 20144 Hamburg, Germany","Street 20, 1000 Brussels, Belgium",Monitor,2,200,MT789,inbound-scan
TN12345680,DPD,"Street 3, 80331 Munich, Germany","Street 5, 28013 Madrid, Spain",Keyboard,1,50,KB012,delivery
TN12345680,DPD,"Street 3, 80331 Munich, Germany","Street 5, 28013 Madrid, Spain",Mouse,1,25,MO456,delivery
TN12345681,FedEx,"Street 4, 50667 Cologne, Germany","Street 9, 1016 Amsterdam, Netherlands",Laptop,1,900,LP345,transit
TN12345681,FedEx,"Street 4, 50667 Cologne, Germany","Street 9, 1016 Amsterdam, Netherlands",Headphones,1,100,HP678,transit
TN12345682,GLS,"Street 5, 70173 Stuttgart, Germany","Street 15, 1050 Copenhagen, Denmark",Smartphone,1,500,SP901,scanned
TN12345682,GLS,"Street 5, 70173 Stuttgart, Germany","Street 15, 1050 Copenhagen, Denmark",Charger,1,20,CH234,scanned
```

## That is it - everything else is up to you! Happy coding!

# Solution

I spent around 6 hours implementing this.

## Domain model

The original assignment leaves some questions unanswered, so I tried to answer them myself:

* **Are tracking numbers unique across the entire system, or only within a single carrier?**

    The sample data makes it seem as if these were not carriers' tracking numbers but parcellab's own tracking numbers
    (serving as an abstraction over actual carriers).

    On the other hand, the assignment says that customers should be able to look shipments up by tracking number and carrier.

    In the end, I decided to use carrier + tracking number as an identifier (meaning that carrier is required in the request,
    and that requests with the wrong carrier will result in 404).
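
    A minimal sketch of such a lookup, assuming hypothetical names (`Shipment`, `ShipmentIndex`, `shipmentKey`)
    that are not necessarily the ones used in this codebase:

    ```typescript
    // Hypothetical shapes and names, for illustration only.
    interface Shipment {
      readonly carrier: string;
      readonly trackingNumber: string;
      readonly status: string;
    }

    // JSON-encoding the pair avoids the ambiguity a plain separator could cause
    // if a carrier name or tracking number ever contained that separator.
    const shipmentKey = (carrier: string, trackingNumber: string): string =>
      JSON.stringify([carrier, trackingNumber]);

    class ShipmentIndex {
      private readonly byKey = new Map<string, Shipment>();

      add(shipment: Shipment): void {
        this.byKey.set(shipmentKey(shipment.carrier, shipment.trackingNumber), shipment);
      }

      // Returns undefined for an unknown pair; an HTTP layer would map that to 404.
      find(carrier: string, trackingNumber: string): Shipment | undefined {
        return this.byKey.get(shipmentKey(carrier, trackingNumber));
      }
    }
    ```

    With this shape, the same tracking number under a different carrier is simply a different key, so it misses.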

* **For what point exactly should I retrieve the weather?**

    The assignment mentions the ZIP code in passing in an unrelated section.
    But ZIP codes are not a great way to determine the weather in the receiver's area, because the areas they cover can be very large
    (e.g. 0872 in Australia is over 1500 kilometers across, with an area larger than Germany and France combined).

    On the other hand, addresses in the seed data are all fake ("Street 1" etc), and one cannot get weather for them.

    It is not clear why anybody would even need this data, or how they are going to use it.
    (Using the location of a pickup point would probably make more sense than using the recipient's address, but we don't have this data.)

    In the end, I decided to use addresses, and to replace some of the fake addresses in the seed data with real ones.

    Which brings me to the next question...

* **What is the supposed scenario in which caching weather data will be beneficial?**

    Do we assume that the application should return data for a lot of different packages to different addresses,
    with only infrequent requests (e.g. once a day) for the same package?
    Then caching weather data by location will not be of any benefit, and will only increase our cache size
    (because requests will never hit the cache, since every request is for a different location).
    Or, alternatively, we could use coarser coordinates for caching, e.g. rounding them to 0.1 degree (roughly 11km or less),
    so that we could still reuse the weather data even for different (but close enough) addresses.
    This can be done in `src/packages.service.ts`.

    Or do we assume that the application is going to receive a lot of requests for a small number of packages?

    For simplicity, in this implementation I went with the latter assumption.
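
    For comparison, the coordinate-rounding alternative mentioned above could be sketched as follows.
    This is not what the current implementation does; `Coordinates`, `toGridIndex` and `coarseCacheKey` are hypothetical names.

    ```typescript
    // Hypothetical helper, not part of the current implementation.
    interface Coordinates {
      readonly latitude: number;
      readonly longitude: number;
    }

    // Snap a coordinate to a grid with the given step (in degrees).
    // Using integer grid indices avoids floating-point noise in the key.
    const toGridIndex = (value: number, stepDegrees: number): number =>
      Math.round(value / stepDegrees);

    // Points that fall into the same grid cell share one cache key.
    const coarseCacheKey = (point: Coordinates, stepDegrees = 0.1): string =>
      `${toGridIndex(point.latitude, stepDegrees)}:${toGridIndex(point.longitude, stepDegrees)}`;
    ```

    Two nearby points can still land on opposite sides of a cell boundary and get different keys,
    so this reduces the number of weather requests without guaranteeing a hit for every close pair.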

* **How to get weather, knowing only the recipient address (including the zip code)?**

    Most, if not all, weather APIs accept location as a parameter, not an address.
    So in addition to integrating with a weather API, I also had to integrate with a map (geocoder) API
    in order to resolve addresses to locations (latitude + longitude).

* **What do tracking numbers even mean? Why are they not unique?**

    I only noticed this at the last moment: in the seed data, there are multiple rows
    with the same carrier + tracking number, same addresses, same statuses, but different products.
    I guess that this is because it is supposed to come from some kind of SQL join query, returning one row per product.

    I don't think this is a good way to get the data for such an API endpoint, but that's what the source data is.

    Had I noticed it earlier, I would have written my code accordingly.
    But I only noticed it at the very end.
    So I just did a quick workaround to demonstrate that everything works, and changed tracking numbers to be unique.
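
    For reference, handling the original rows properly would mostly mean grouping them by carrier + tracking number.
    A minimal sketch, assuming hypothetical `SeedRow`, `Article` and `Shipment` shapes modelled on the CSV columns
    (this is not what the current code does):

    ```typescript
    // Hypothetical shapes, modelled on the columns of the seed CSV.
    interface SeedRow {
      readonly trackingNumber: string;
      readonly carrier: string;
      readonly senderAddress: string;
      readonly receiverAddress: string;
      readonly articleName: string;
      readonly articleQuantity: number;
      readonly articlePrice: number;
      readonly sku: string;
      readonly status: string;
    }

    interface Article {
      readonly name: string;
      readonly quantity: number;
      readonly price: number;
      readonly sku: string;
    }

    interface Shipment {
      readonly trackingNumber: string;
      readonly carrier: string;
      readonly senderAddress: string;
      readonly receiverAddress: string;
      readonly status: string;
      readonly articles: Article[];
    }

    // Collapse one-row-per-article data into one shipment per carrier + tracking number.
    const groupRowsIntoShipments = (rows: readonly SeedRow[]): Shipment[] => {
      const shipments = new Map<string, Shipment>();
      for (const row of rows) {
        const key = JSON.stringify([row.carrier, row.trackingNumber]);
        let shipment = shipments.get(key);
        if (shipment === undefined) {
          // Shipment-level fields are taken from the first row seen for this key.
          shipment = {
            trackingNumber: row.trackingNumber,
            carrier: row.carrier,
            senderAddress: row.senderAddress,
            receiverAddress: row.receiverAddress,
            status: row.status,
            articles: [],
          };
          shipments.set(key, shipment);
        }
        shipment.articles.push({
          name: row.articleName,
          quantity: row.articleQuantity,
          price: row.articlePrice,
          sku: row.sku,
        });
      }
      return Array.from(shipments.values());
    };
    ```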

* **What weather data should we use?**

    It is not clear why anybody would even need this data, or how they are going to use it.

    I decided to use the current weather data instead of forecasts.
    Since we're refetching the weather data if it's more than 2 hours old,
    the "current weather" data will never be more than 2 hours out of date, and will still stay somewhat relevant.

* **What data to return? What if location or weather APIs are unavailable or return an error?**

    I decided to return package data always, and weather data only if both location and weather APIs resolved successfully.
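
    A minimal sketch of this fallback, assuming hypothetical `Geocode` and `FetchWeather` function types
    and a hypothetical response shape with `weather: null` on failure (the actual response format may differ):

    ```typescript
    // Hypothetical types, for illustration only.
    interface Coordinates {
      readonly latitude: number;
      readonly longitude: number;
    }

    interface Weather {
      readonly temperatureCelsius: number;
    }

    interface PackageInfo {
      readonly trackingNumber: string;
      readonly receiverAddress: string;
    }

    interface PackageResponse {
      readonly packageInfo: PackageInfo;
      readonly weather: Weather | null;
    }

    type Geocode = (address: string) => Promise<Coordinates>;
    type FetchWeather = (location: Coordinates) => Promise<Weather>;

    const buildResponse = async (
      packageInfo: PackageInfo,
      geocode: Geocode,
      fetchWeather: FetchWeather,
    ): Promise<PackageResponse> => {
      try {
        const location = await geocode(packageInfo.receiverAddress);
        const weather = await fetchWeather(location);
        return { packageInfo, weather };
      } catch {
        // Either upstream call failed: still return the package, just without weather.
        return { packageInfo, weather: null };
      }
    };
    ```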

## Decisions made, implementation details

* **Which framework to use**

    Nest.js provides a good starting point, and I have already built services with Nest.js in the past,
    so I decided to use it here as well, to save time on learning new boilerplate.

    Additionally, I switched the TS config to the `@tsconfig/strictest` options,
    and the eslint config to the `@typescript-eslint/strict` ruleset.

* **Dependency injection**

    All integrations, storages, etc. have proper generalized interfaces;
    it is easy enough to switch to another weather / location provider or to another database,
    simply by implementing another integration and switching to it as a drop-in replacement
    in `src/app.module.ts`.

    Additionally, since Nest.js unfortunately does not support typechecking for dependencies yet,
    I implemented additional types as a drop-in replacement for some of the Nest.js ones;
    they're quite limited but they have all the features I used from Nest.js,
    and they do support typechecking for dependencies
    (`src/dependencies.ts`).

* **Which APIs to use**

    I decided to use Open-Meteo for weather, and OSM Nominatim for resolving locations,
    because they are free / open-source, don't require any sign-ups / API tokens,
    and are easy enough to use.

    However, they have strict request limits for the free tier, so in my integration
    I limited interaction with them to one request per second.
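
    One way to space out calls like this is sketched below. `createRateLimiter` and `sleep` are hypothetical names,
    and the actual integration code may enforce the limit differently.

    ```typescript
    // Hypothetical helper: consecutive calls start at least `minIntervalMs` apart.
    const sleep = (ms: number): Promise<void> =>
      new Promise((resolve) => {
        setTimeout(resolve, ms);
      });

    const createRateLimiter = (minIntervalMs: number) => {
      // Earliest timestamp at which the next call is allowed to start.
      let nextFreeSlot = 0;

      return async <T>(task: () => Promise<T>): Promise<T> => {
        const current = Date.now();
        const startAt = Math.max(current, nextFreeSlot);
        // Reserve the slot synchronously, before any await, so concurrent callers queue up.
        nextFreeSlot = startAt + minIntervalMs;
        if (startAt > current) {
          await sleep(startAt - current);
        }
        return task();
      };
    };

    // One request per second, as described above.
    const oncePerSecond = createRateLimiter(1000);
    ```

    Every outgoing request would then be wrapped as `oncePerSecond(() => someRequest())`, where `someRequest` stands for the actual API call.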

* **Resolving addresses to locations**

    One problem with OSM Nominatim is that it returns all objects that match the query.
    And very often, there are several different objects matching the same address
    (e.g. the building itself, plus all the organizations in it, or other buildings with address supplements).

    So I take the mean of all the locations returned by OSM Nominatim, and check if all returned locations lie
    within 0.01 degree (1.1km or less) from the mean.

    If they do, this means that all the results are close enough together to probably actually refer
    to more or less the same location, so I return the mean.
    If they don't, this means that results probably refer to different locations, and we cannot determine
    which of these locations is the one we're supposed to return (for example, imagine the address:
    "Hauptbahnhof, Germany"), so I throw an error.

* **Caching**

    The assignment says that the weather data should not be fetched more frequently than every two hours for the same location.
    (It also says "zip code", but then again, 0872 Australia.)
    So I'm caching the weather data for two hours (TTL in `src/clients/weather.ts`).

    But I'm also caching the location data for a day, because it's unlikely to change very often
    (it certainly doesn't change as often as the weather),
    and because I don't want to send too many requests to OSM.
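
    The TTL idea itself can be sketched in a few lines. This is a simplified, synchronous illustration with a hypothetical `TtlCache` class;
    the actual cache in this project is asynchronous and structured differently.

    ```typescript
    // Hypothetical minimal TTL cache; the clock is injectable to make expiry testable.
    class TtlCache<V> {
      private readonly entries = new Map<string, { value: V; expiresAt: number }>();
      private readonly ttlMs: number;
      private readonly now: () => number;

      constructor(ttlMs: number, now: () => number = () => Date.now()) {
        this.ttlMs = ttlMs;
        this.now = now;
      }

      get(key: string): V | undefined {
        const entry = this.entries.get(key);
        if (entry === undefined) {
          return undefined;
        }
        if (entry.expiresAt <= this.now()) {
          // Expired: drop the entry so the caller refetches.
          this.entries.delete(key);
          return undefined;
        }
        return entry.value;
      }

      set(key: string, value: V): void {
        this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
      }
    }

    const TWO_HOURS_MS = 2 * 60 * 60 * 1000;
    const ONE_DAY_MS = 24 * 60 * 60 * 1000;

    // Weather goes stale quickly; resolved locations rarely change.
    const weatherCache = new TtlCache<{ temperatureCelsius: number }>(TWO_HOURS_MS);
    const locationCache = new TtlCache<{ latitude: number; longitude: number }>(ONE_DAY_MS);
    ```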

* **Database implementation**

    For simplicity, and to make this solution self-contained, I decided to use an in-memory database
    for the packages, with simulated 50ms latency (to make it feel more like a remote database).

    If needed, it can be replaced by any other database, by creating a new implementation of `PackagesRepository`.

* **Cache implementation**

    The same in-memory database (just a JS `Map`) is used, with simulated 20ms latency.

    If needed, it can be replaced by any other caching solution,
    by creating a new implementation of `ClearableKeyValueStorage`.

    (Also I should have created an intermediate caching layer handling the TTLs / expirations,
    but I didn't have time for that, so this logic _and_ the logic of loading stuff and storing it in cache
    both currently live in `storage/cache.ts`.)

* **Throttling**

    One problem with complicated things like caching is that naive implementations are not really concurrency-safe:
    the code assumes that the related cache state will not change while we're working with that cache.

    Of course it doesn't really work like that, but as long as there is only one copy of our application running,
    and no other applications work with the same data, we can simulate this easily enough by making sure
    that concurrency-unsafe code is only executed once at a time for every set of data it operates on.

    Since all that code in this application is supposed to be more or less idempotent, there is no need to wait
    for it to resolve before calling it again; if some data is requested again for the same key, we can simply
    return the previous (not-yet-resolved) promise.

    This is implemented in `src/utils/throttle.ts`.

    Basically this means that e.g. if the location for the same address is requested twice at the same time,
    the underlying function (checking the cache, querying the API and storing data to the cache if there is a cache miss)
    will only be called once, and both callers will get the same promise.
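
    The core of the idea can be sketched as follows. `dedupeByKey` is a hypothetical name,
    and the actual code in `src/utils/throttle.ts` may differ in its details.

    ```typescript
    // Hypothetical sketch: share one in-flight promise per key.
    const dedupeByKey = <T>(load: (key: string) => Promise<T>): ((key: string) => Promise<T>) => {
      const inFlight = new Map<string, Promise<T>>();

      return (key) => {
        const existing = inFlight.get(key);
        if (existing !== undefined) {
          // A call for this key is still pending: reuse its promise.
          return existing;
        }
        const promise = load(key);
        inFlight.set(key, promise);
        // Forget the promise once it settles, so that later calls run `load` again.
        const forget = (): void => {
          inFlight.delete(key);
        };
        void promise.then(forget, forget);
        return promise;
      };
    };
    ```

    Like the approach described above, this only helps within a single process; several application instances would need shared coordination.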

## How to use the app

To lint: `npm run lint`.

To test: `npm run test` and `npm run test:e2e`.

To start: `npm run start`.

This is a RESTful API. To get package info, send a GET request to e.g. `/packages/UPS/TN12345679`.

## Discussion points

> What were the important design choices and trade-offs you made?

See above ("Domain model", "Decisions made").

> What would be required to deploy this application to production?

First of all, we would need to identify what problem we are solving,
because this application does not seem to solve any actual problem.
Why would anybody need such an endpoint? How are they going to use it?
Without answers to these questions, there is no point in deploying this application anywhere,
and there cannot be any clear understanding of the non-functional requirements.

Beyond that, we'd need production-ready, reliable (and probably paid) APIs for geocoding and weather,
an actual database (or API) for retrieving package data (instead of an in-memory key-value record),
and an actual caching solution (e.g. Redis), which would also mean rethinking how `throttle` is used here.

> What would be required to scale this application to handle 1000 requests per second?

This application already handles over 3000 requests per second with ease,
even with an artificial 20ms cache access latency (and artificial 50ms DB access latency).

```
❯ ab -c 500 -n 100000 http://127.0.0.1:3000/packages/UPS/TN12345679
This is ApacheBench, Version 2.3 <$Revision: 1903618 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 10000 requests
Completed 20000 requests
Completed 30000 requests
Completed 40000 requests
Completed 50000 requests
Completed 60000 requests
Completed 70000 requests
Completed 80000 requests
Completed 90000 requests
Completed 100000 requests
Finished 100000 requests


Server Software:
Server Hostname:        127.0.0.1
Server Port:            3000

Document Path:          /packages/UPS/TN12345679
Document Length:        394 bytes

Concurrency Level:      500
Time taken for tests:   29.422 seconds
Complete requests:      100000
Failed requests:        0
Total transferred:      60300000 bytes
HTML transferred:       39400000 bytes
Requests per second:    3398.84 [#/sec] (mean)
Time per request:       147.109 [ms] (mean)
Time per request:       0.294 [ms] (mean, across all concurrent requests)
Transfer rate:          2001.47 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    7   3.5      7      19
Processing:    98  139  11.0    137     251
Waiting:       89  115  12.7    114     222
Total:        103  146  10.7    144     259

Percentage of the requests served within a certain time (ms)
  50%    144
  66%    146
  75%    149
  80%    151
  90%    159
  95%    166
  98%    170
  99%    172
 100%    259 (longest request)
 ```
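
As a cross-check of these figures, the reported throughput is simply the concurrency level divided by the mean time per request.
The snippet below redoes that arithmetic with the numbers from the output above; the function names are made up for this example,
and the second figure assumes that latency stays the same at lower load, which is only an approximation.

```typescript
// Throughput = requests in flight / mean time per request (Little's law).
const throughputPerSecond = (concurrency: number, meanLatencyMs: number): number =>
  concurrency / (meanLatencyMs / 1000);

// Requests that must be in flight to sustain a target throughput at a given latency.
const requiredConcurrency = (targetPerSecond: number, meanLatencyMs: number): number =>
  targetPerSecond * (meanLatencyMs / 1000);

// Figures taken from the ApacheBench output above: 500 concurrent requests, 147.109 ms mean.
const measuredRps = throughputPerSecond(500, 147.109);
const inFlightFor1000Rps = requiredConcurrency(1000, 147.109);

console.log(measuredRps.toFixed(2), inFlightFor1000Rps.toFixed(1));
```

So at this latency, 1000 requests per second corresponds to roughly 147 requests in flight at any moment.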

Granted, this is all for the same URL, and it may be slower when different URLs are requested.
But the only reasons why it would be slower (in terms of rps) for different URLs are:

* Sending more requests to remote APIs
    (will only affect local resources because more bandwidth is used,
    and more requests are created and responses parsed, and more sockets are open at the same time, etc);
* Using larger caches
    (the performance of a JS `Map` naturally depends on how many entries it holds,
    and there are also memory constraints).

But, depending on the usage profile (how many different packages are requested how often?),
1000 requests per second should be doable.