Design a system where an admin can trigger a machine learning model training job from a desktop application and check on its status later. Because training can take several minutes, this does not need to be real-time.
To give you some architectural context, the system is split into two parts:
- A Core Service: This is our main backend API. It handles user requests and receives the initial 'Train' request from the desktop app.
- An ML Service: This is a separate service responsible for training the ML model.
- Explain how these two services communicate asynchronously so the Core Service doesn't get blocked.
- Design the database models required to support this workflow. What tables do we need, what columns should they have, and how do the services interact with this data?
- Explain how the desktop app safely checks the status (pending → running → completed / failed).
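
As a starting point for discussion, the status lifecycle above can be sketched as a single jobs table with guarded transitions. This is a minimal sketch under assumptions, not a reference answer: the table name `training_jobs`, its columns, and the helpers `create_job`, `claim_next_job`, `set_status`, and `get_status` are all hypothetical, an in-memory SQLite database stands in for whatever database the two services share, and the ML Service is assumed to poll the table (a message queue would be an equally valid choice).

```python
import sqlite3
import uuid

# Hypothetical transition map: terminal states (completed, failed) never change.
ALLOWED = {
    "pending": {"running", "failed"},
    "running": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}


def init_db(conn):
    """Create the (hypothetical) jobs table shared by both services."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS training_jobs (
            id TEXT PRIMARY KEY,
            status TEXT NOT NULL DEFAULT 'pending',
            error TEXT,
            created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )
        """
    )


def create_job(conn):
    """Core Service: record the 'Train' request and return a job id at once."""
    job_id = str(uuid.uuid4())
    conn.execute("INSERT INTO training_jobs (id) VALUES (?)", (job_id,))
    conn.commit()
    return job_id


def set_status(conn, job_id, new_status, error=None):
    """Move a job to new_status only if its current status allows it."""
    sources = [s for s, targets in ALLOWED.items() if new_status in targets]
    if not sources:
        return False
    placeholders = ",".join("?" for _ in sources)
    # The WHERE clause makes the check and the update a single statement,
    # so a stale or duplicate worker cannot overwrite a finished job.
    cur = conn.execute(
        "UPDATE training_jobs "
        "SET status = ?, error = ?, updated_at = CURRENT_TIMESTAMP "
        f"WHERE id = ? AND status IN ({placeholders})",
        (new_status, error, job_id, *sources),
    )
    conn.commit()
    return cur.rowcount == 1


def claim_next_job(conn):
    """ML Service: take the oldest pending job, or return None."""
    row = conn.execute(
        "SELECT id FROM training_jobs WHERE status = 'pending' "
        "ORDER BY created_at, rowid LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # If another worker claimed it first, the guarded update fails.
    return row[0] if set_status(conn, row[0], "running") else None


def get_status(conn, job_id):
    """Core Service: read-only lookup behind the desktop app's status check."""
    row = conn.execute(
        "SELECT status FROM training_jobs WHERE id = ?", (job_id,)
    ).fetchone()
    return row[0] if row else None
```

In this sketch the desktop app would never touch the database directly. It would call a Core Service endpoint that wraps `create_job` to start training, keep the returned job id, and later call an endpoint that wraps `get_status` with that id. Only the ML Service calls `claim_next_job` and `set_status`, so status changes come from one place and always follow the allowed transitions.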