Purpose:
Returns a list of available models. The server first checks an in-memory cache (refreshed every 20 minutes) and returns the cached models if still valid; otherwise, it refreshes the cache by executing the ilab model list command.
Request:
- Method: GET
- URL:
/models
Response:
- Status 200:
[
  {
    "name": "model-name-1",
    "last_modified": "2025-02-18T12:34:56Z",
    "size": "123 MB"
  },
  {
    "name": "model-name-2",
    "last_modified": "2025-02-17T09:21:30Z",
    "size": "456 MB"
  }
  // ... more models
]
Error Responses:
- 500 Internal Server Error: If encoding or cache refresh fails.
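This and the other GET list endpoints (/data, /jobs, /checkpoints) all return plain JSON, so one small helper covers them. A minimal Python sketch; the base URL http://localhost:8080 and the helper name get_json are assumptions, not part of the API:

```python
import json
from urllib import request

BASE_URL = "http://localhost:8080"  # assumed server address; adjust to your deployment

def get_json(path, base_url=BASE_URL):
    """GET a JSON endpoint (e.g. /models, /data, /jobs, /checkpoints) and decode the body."""
    with request.urlopen(base_url + path) as resp:
        return json.load(resp)

# Example: names of all available models
# names = [m["name"] for m in get_json("/models")]
```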
Purpose:
Retrieves a list of data records by running the ilab data list command and parsing its output.
Request:
- Method: GET
- URL:
/data
Response:
- Status 200:
[
  {
    "dataset": "dataset-1",
    "created_at": "2025-02-18T11:00:00Z",
    "file_size": "10 MB"
  },
  {
    "dataset": "dataset-2",
    "created_at": "2025-02-17T15:30:00Z",
    "file_size": "20 MB"
  }
  // ... more datasets
]
Error Responses:
- 500 Internal Server Error: If there’s an error running the command or parsing its output.
Purpose:
Starts a background job to generate new data (by running the ilab data generate command). In mock mode, it simulates a job.
Request:
- Method: POST
- URL:
/data/generate
- Body: No request body is required.
Response:
- Status 200:
{ "job_id": "g-<timestamp-nano>" }
The returned job_id uniquely identifies the generation job.
Error Responses:
- 500 Internal Server Error: If starting the job or creating a log file fails.
Purpose:
Starts a model training job. The endpoint performs a Git checkout for the provided branch and then initiates a training job via the ilab model train command. It supports both real and mock modes.
Request:
- Method: POST
- URL:
/model/train
- Headers:
Content-Type: application/json
- Body:
{
  "modelName": "models/instructlab/my-model",
  "branchName": "feature/train-improvements",
  "epochs": 10  // optional; must be a positive integer if provided
}
Response:
- Status 200:
{ "job_id": "t-<timestamp-nano>" }
The returned job_id identifies the training job.
Error Responses:
- 400 Bad Request: If required fields (modelName or branchName) are missing or if epochs is invalid.
- 500 Internal Server Error: If Git checkout or job creation fails.
Purpose:
Retrieves the current status of a job (data generation, training, pipeline, serving, etc.) given its job ID.
Request:
- Method: GET
- URL:
/jobs/{job_id}/status
Example: /jobs/g-123456789/status
Response:
- Status 200:
{
  "job_id": "g-123456789",
  "status": "running",  // possible values: "running", "finished", "failed"
  "branch": "feature/train-improvements",
  "command": "/path/to/ilab"
}
Error Responses:
- 404 Not Found: If no job exists with the provided job_id.
- 500 Internal Server Error: On database query errors.
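Because jobs are asynchronous, callers typically poll this endpoint until the status becomes "finished" or "failed". A sketch of such a polling loop; the base URL and the injectable fetch parameter (handy for testing) are assumptions:

```python
import json
import time
from urllib import request

BASE_URL = "http://localhost:8080"  # assumed server address; adjust to your deployment

def wait_for_job(job_id, poll_seconds=5.0, fetch=None):
    """Poll /jobs/{job_id}/status until the job reaches a terminal state
    ("finished" or "failed") and return the final status document.

    `fetch` can be injected (e.g. for tests); by default it GETs the endpoint.
    """
    def default_fetch():
        with request.urlopen(f"{BASE_URL}/jobs/{job_id}/status") as resp:
            return json.load(resp)

    fetch = fetch or default_fetch
    while True:
        info = fetch()
        if info["status"] in ("finished", "failed"):
            return info
        time.sleep(poll_seconds)
```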
Purpose:
Retrieves the log file contents for the specified job ID.
Request:
- Method: GET
- URL:
/jobs/{job_id}/logs
Example: /jobs/t-123456789/logs
Response:
- Status 200: Plain text log content.
Error Responses:
- 404 Not Found: If either the job or its log file is not found.
- 500 Internal Server Error: If reading the log file fails.
Purpose:
Lists all jobs stored in the database.
Request:
- Method: GET
- URL:
/jobs
Response:
- Status 200:
[
  {
    "job_id": "g-123456789",
    "cmd": "path/to/ilab",
    "args": ["data", "generate", "--pipeline", "full"],
    "status": "finished",
    "pid": 12345,
    "log_file": "logs/g-123456789.log",
    "start_time": "2025-02-18T12:00:00Z",
    "end_time": "2025-02-18T12:05:00Z",
    "branch": "",
    "served_model_name": ""
  }
  // ... other jobs
]
Error Responses:
- 500 Internal Server Error: If there is a database error.
Purpose:
Orchestrates a full pipeline that first generates data and then starts a model training job. This endpoint creates a pipeline job that monitors both steps sequentially.
Request:
- Method: POST
- URL:
/pipeline/generate-train
- Headers:
Content-Type: application/json
- Body:
{
  "modelName": "models/instructlab/my-model",
  "branchName": "feature/train-improvements",
  "epochs": 10  // optional
}
Response:
- Status 200:
{ "pipeline_job_id": "p-<timestamp-nano>" }
Error Responses:
- 400 Bad Request: If required fields are missing.
- 500 Internal Server Error: If any of the steps (data generation or training) fails to start.
Purpose:
Serves the latest checkpoint of a model on port 8001. The endpoint checks for an optional checkpoint parameter in the request body. If not provided, it selects the latest checkpoint directory (matching a prefix such as "samples_").
Request:
- Method: POST
- URL:
/model/serve-latest
- Headers:
Content-Type: application/json
- Body (optional):
{
  "checkpoint": "samples_12345"  // optional; if omitted, the latest checkpoint is used
}
Response:
- Status 200:
If using vLLM mode, a container is launched and the response includes the corresponding job ID.
{ "status": "model process started", "job_id": "ml-<timestamp-nano>" }
Error Responses:
- 404 Not Found: If the checkpoints directory or specified checkpoint does not exist.
- 500 Internal Server Error: If starting the model serving process fails.
Purpose:
Serves the base model on port 8000. Depending on configuration (vLLM enabled or not), it either spawns a container or launches a local serving process.
Request:
- Method: POST
- URL:
/model/serve-base
- Body: No JSON payload required.
Response:
- Status 200:
{ "status": "model process started", "job_id": "ml-<timestamp-nano>" }
Error Responses:
- 500 Internal Server Error: If there is an error launching the serving process or container.
Purpose:
Runs a QnA evaluation by launching a Podman container with a specified model and YAML configuration. It validates the existence of both the model path and YAML file before execution.
Request:
- Method: POST
- URL:
/qna-eval
- Headers:
Content-Type: application/json
- Body:
{ "model_path": "/path/to/model", "yaml_file": "/path/to/config.yaml" }
Response:
- Status 200:
{ "result": "Evaluation output from QnA-eval container..." }
Error Responses:
- 400 Bad Request: If the specified model_path or yaml_file does not exist or if the request body is malformed.
- 500 Internal Server Error: If the Podman command fails.
Purpose:
Lists all checkpoint directories available in the expected checkpoints folder (typically under the user's home directory).
Request:
- Method: GET
- URL:
/checkpoints
Response:
- Status 200:
[ "checkpoint_1", "checkpoint_2", "samples_12345" ]
Error Responses:
- 404 Not Found: If the checkpoints directory does not exist.
- 500 Internal Server Error: If there is an error reading the directory.
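The /checkpoints listing can be combined with /model/serve-latest: picking the newest "samples_" entry client-side lets you pass an explicit checkpoint rather than relying on the server's default selection. A sketch; the numeric-suffix ordering is an assumption about how checkpoint names sort:

```python
def latest_samples_checkpoint(names):
    """From a /checkpoints listing, return the 'samples_*' entry with the
    highest numeric suffix, or None if there are no sample checkpoints."""
    samples = [n for n in names if n.startswith("samples_")]
    if not samples:
        return None
    return max(samples, key=lambda n: int(n.split("_", 1)[1]))
```

The chosen name can then be sent as the optional "checkpoint" field in the /model/serve-latest request body.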
Purpose:
Lists all running vLLM containers. The endpoint calls Podman to list containers that match the vLLM image and then inspects each container to extract arguments such as the served model name and model path.
Request:
- Method: GET
- URL:
/vllm-containers
Response:
- Status 200:
{
  "containers": [
    {
      "container_id": "abcdef123456",
      "image": "vllm/vllm-openai:latest",
      "command": "[...command...]",
      "created_at": "2025-02-18T12:00:00Z",
      "status": "Up 5 minutes",
      "ports": "8001/tcp",
      "names": "vllm_container_1",
      "served_model_name": "post-train",
      "model_path": "/path/to/model"
    }
    // ... additional containers
  ]
}
Error Responses:
- 500 Internal Server Error: If listing or inspecting containers fails.
Purpose:
Stops (unloads) a running vLLM container based on the served model name. Only the names "pre-train" or "post-train" are valid.
Request:
- Method: POST
- URL:
/vllm-unload
- Headers:
Content-Type: application/json
- Body:
{ "model_name": "pre-train" }
Response:
- Status 200:
{
  "status": "success",
  "message": "Model 'pre-train' unloaded successfully",
  "modelName": "pre-train"
}
Error Responses:
- 400 Bad Request: If the provided model_name is not "pre-train" or "post-train".
- 500 Internal Server Error: If stopping the container fails.
Purpose:
Retrieves the status of a vLLM container for a specified served model. The request must include a query parameter specifying the model name (either "pre-train" or "post-train"). The endpoint checks container status and reads log files to determine if the container is still loading or has finished.
Request:
- Method: GET
- URL:
/vllm-status?model_name=post-train
Response:
- Status 200:
{
  "status": "running"  // possible values: "running", "loading", or "stopped"
}
Error Responses:
- 400 Bad Request: If model_name is invalid.
- 500 Internal Server Error: If querying the container status fails.
Purpose:
Checks the availability of GPUs by running the nvidia-smi command and returns the number of free GPUs along with the total GPU count.
Request:
- Method: GET
- URL:
/gpu-free
Response:
- Status 200:
{ "free_gpus": 1, "total_gpus": 4 }
Error Responses:
- 500 Internal Server Error: If executing nvidia-smi fails.
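A client that wants to queue work (e.g. a training or serving request) can gate on this endpoint until a GPU frees up. A sketch of such a loop; the base URL and the injectable fetch parameter (for testability) are assumptions:

```python
import json
import time
from urllib import request

BASE_URL = "http://localhost:8080"  # assumed server address; adjust to your deployment

def wait_for_free_gpu(poll_seconds=10.0, fetch=None):
    """Block until /gpu-free reports at least one free GPU, then return
    the status document (free_gpus / total_gpus).

    `fetch` can be injected for tests; by default it GETs the endpoint.
    """
    def default_fetch():
        with request.urlopen(BASE_URL + "/gpu-free") as resp:
            return json.load(resp)

    fetch = fetch or default_fetch
    while True:
        info = fetch()
        if info["free_gpus"] > 0:
            return info
        time.sleep(poll_seconds)
```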
Purpose:
A debug endpoint that returns the current mapping of served model names (e.g., "pre-train" or "post-train") to their associated running job IDs (for vLLM container processes).
Request:
- Method: GET
- URL:
/served-model-jobids
Response:
- Status 200:
{ "pre-train": "v-123456789", "post-train": "v-987654321" }
Purpose:
(Implementation details not fully provided in the snippet.)
Typically, this endpoint is expected to convert a model from one format to another (e.g., from one file format to GGUF). The request body likely contains parameters such as source model path, target format, and additional options.
Request:
- Method: POST
- URL:
/model/convert
- Headers:
Content-Type: application/json
- Body:
{
  "source_model_path": "/path/to/source/model",
  "target_format": "gguf"
  // ... additional conversion options
}
Response:
- Status 200:
{ "status": "conversion completed", "converted_model_path": "/path/to/converted/model" }
Error Responses:
- 400 Bad Request: For invalid or missing parameters.
- 500 Internal Server Error: If the conversion process fails.