HF inference API howto

To run the TheBloke/Mistral-7B-OpenOrca-AWQ model on Hugging Face infrastructure using a Clojure clj-http client, you’ll need to leverage Hugging Face's Inference API, which allows you to make HTTP requests to run models hosted on their platform. Since you have a Hugging Face Pro account, you can use this API with your authentication token for enhanced access (e.g., higher rate limits or access to private models, though this specific model is public).

Here’s a step-by-step guide to achieve this:


Step 1: Prerequisites

  1. Hugging Face API Token:

    • Log in to your Hugging Face account.
    • Go to your profile settings (click your avatar > "Settings" > "Access Tokens").
    • Create a new API token with "Read" access (or "Write" if you plan to fine-tune or upload models later) and copy it. A sketch for loading the token from an environment variable, instead of hardcoding it, follows this list.
  2. Clojure Environment:

    • Ensure you have a Clojure project set up with Leiningen or deps.edn.
    • Add the clj-http and org.clojure/data.json dependencies to your project (the example code below uses data.json to write the request body, and clj-http's :as :json response coercion also needs cheshire on the classpath). For a project.clj file, include:
      :dependencies [[clj-http "3.12.3"]
                     [org.clojure/data.json "2.4.0"]
                     [cheshire "5.11.0"]]
      Or for deps.edn:
      {:deps {clj-http/clj-http {:mvn/version "3.12.3"}
              org.clojure/data.json {:mvn/version "2.4.0"}
              cheshire/cheshire {:mvn/version "5.11.0"}}}
  3. Model Availability:

    • The TheBloke/Mistral-7B-OpenOrca-AWQ model is a quantized version optimized for GPU inference.
    • Hugging Face’s Inference API supports many models, but not all quantized formats (like AWQ) are supported out of the box on their hosted inference endpoints. We’ll proceed assuming it’s available via the API; if it’s not, you may need to deploy it yourself (Step 5).
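As mentioned above, rather than hardcoding the token in source (as the Step 3 example does for simplicity), you can read it from an environment variable. A minimal sketch, assuming you export a variable named HF_API_TOKEN yourself (the name is just a convention, not something Hugging Face requires):

;; Read the token from the environment instead of hardcoding it.
;; Set it in your shell first, e.g.: export HF_API_TOKEN=<your token>
(def api-token
  (or (System/getenv "HF_API_TOKEN")
      (throw (ex-info "HF_API_TOKEN is not set" {}))))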

Step 2: Understand the Inference API

The Hugging Face Inference API allows you to send a POST request to a model endpoint with input data and receive generated output. The endpoint for a model is typically:

https://api-inference.huggingface.co/models/<model-id>

For your model, this would be:

https://api-inference.huggingface.co/models/TheBloke/Mistral-7B-OpenOrca-AWQ

The API expects a JSON payload with the input prompt and optional parameters (e.g., temperature, max tokens). You’ll need to include your API token in the Authorization header.
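For concreteness, the serialized request body and a typical text-generation response look roughly like this (the exact fields vary by model and task, so treat it as an illustration rather than a guaranteed schema):

Request body (JSON):
{"inputs": "Tell me a short story about a robot.",
 "parameters": {"max_new_tokens": 100, "temperature": 0.7, "top_p": 0.95}}

Typical response (a JSON array, one object per returned sequence):
[{"generated_text": "..."}]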


Step 3: Write Clojure Code with clj-http

Here’s an example Clojure script to call the Inference API using clj-http:

(ns huggingface-inference
  (:require [clj-http.client :as http]
            [clojure.data.json :as json]))

(def api-token "your-hf-api-token-here") ;; Replace with your token, or use the env-var approach from Step 1

(def model-endpoint "https://api-inference.huggingface.co/models/TheBloke/Mistral-7B-OpenOrca-AWQ")

(defn query-hf-model [prompt]
  (let [payload {:inputs prompt
                 :parameters {:max_new_tokens 100
                              :temperature 0.7
                              :top_p 0.95}}
        response (http/post model-endpoint
                            {:headers {"Authorization" (str "Bearer " api-token)
                                       "Content-Type" "application/json"}
                             :body (json/write-str payload)
                             :as :json})]
    ;; The text-generation task returns a JSON array of result objects,
    ;; so take the first one and pull out its :generated_text.
    (-> response :body first :generated_text)))

;; Example usage
(defn -main []
  (let [prompt "Tell me a short story about a robot."
        result (query-hf-model prompt)]
    (println "Generated text:" result)))

;; Run the main function
(-main)

Explanation:

  • API Token: Replace "your-hf-api-token-here" with your actual token.
  • Payload: The :inputs field is the prompt you send to the model. The :parameters map customizes generation (e.g., max_new_tokens, temperature). Adjust these as needed (see the Hugging Face docs for supported options).
  • HTTP Request: clj-http.client/post sends a POST request with the JSON payload and your token in the header.
  • Response: The response is parsed as JSON, and we extract the generated text. The exact structure depends on the task; for text generation it is typically a vector of maps, each with a :generated_text key, which is why the code takes first before :generated_text.

Step 4: Test the Code

  1. Run your Clojure script (e.g., lein run if using Leiningen, or the clj invocation shown after this list).
  2. Check the output. If successful, you’ll see the generated text.
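For reference, one way to run it with the Clojure CLI is to execute the file as a script; this assumes you saved it as src/huggingface_inference.clj (the trailing (-main) call at the bottom of the file is what kicks off the request):

clj -M src/huggingface_inference.clj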

Possible Issues:

  • Model Not Supported: If the API returns an error like "Model not found" or "Unsupported format," the AWQ-quantized version isn’t available on the hosted Inference API. Proceed to Step 5.
  • Rate Limits: With a Pro account you have higher limits, but still watch for 429 Too Many Requests errors; the hosted API can also return 503 while a model is being loaded. A simple retry sketch follows this list.
  • JSON Response: If the response structure differs (e.g., no :generated_text), inspect the raw response with (println response) and adjust the parsing.
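If you want to handle these cases programmatically, here is a minimal sketch (reusing api-token and model-endpoint from Step 3) that inspects the status code and retries a few times on 429 or 503; the retry count and sleep interval are arbitrary choices, not Hugging Face recommendations:

(defn query-with-retry
  "Calls the endpoint, retrying on 429 (rate limited) or 503 (model still loading)."
  ([prompt] (query-with-retry prompt 3))
  ([prompt retries]
   (let [payload {:inputs prompt}
         response (http/post model-endpoint
                             {:headers {"Authorization" (str "Bearer " api-token)
                                        "Content-Type" "application/json"}
                              :body (json/write-str payload)
                              :as :json
                              ;; Don't throw on 4xx/5xx; inspect the status ourselves.
                              :throw-exceptions false})
         status (:status response)]
     (cond
       (= 200 status)
       (-> response :body first :generated_text)

       (and (pos? retries) (contains? #{429 503} status))
       (do (Thread/sleep 5000) ;; crude fixed back-off; tune as needed
           (query-with-retry prompt (dec retries)))

       :else
       (throw (ex-info "Hugging Face API request failed"
                       {:status status :body (:body response)}))))))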

Step 5: Deploy the Model (If Needed)

If the AWQ model isn’t available via the Inference API, you can deploy it yourself on Hugging Face infrastructure using an Inference Endpoint (a Pro feature):

  1. Create an Inference Endpoint:

    • Go to Hugging Face > Inference Endpoints > "Create new endpoint."
    • Select the model: TheBloke/Mistral-7B-OpenOrca-AWQ.
    • Choose a GPU instance (e.g., NVIDIA A100) and configure settings (e.g., quantization: AWQ).
    • Deploy the endpoint. Once ready, you’ll get a custom URL (e.g., https://<endpoint-id>.endpoints.huggingface.cloud).
  2. Update the Code: Replace model-endpoint in the Clojure code with your custom endpoint URL (see the sketch after this list). The rest of the code remains the same.

  3. Test Again: Run the script with the new endpoint. Since you’re using a Pro account, this should work seamlessly with the AWQ model.
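For example, assuming the hypothetical URL below stands in for the one shown on your endpoint’s page, only the endpoint definition changes; dedicated endpoints running a text-generation backend typically accept the same {:inputs ... :parameters ...} payload:

;; Hypothetical URL -- copy the real one from your endpoint's dashboard page.
(def model-endpoint "https://<endpoint-id>.endpoints.huggingface.cloud")

;; query-hf-model from Step 3 works unchanged against the new URL.
(query-hf-model "Tell me a short story about a robot.")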


Step 6: Verify and Optimize

  • Prompt Template: The Mistral-7B-OpenOrca-AWQ model uses the ChatML format, so wrap your text in the user turn markers and open the assistant turn so the model knows to respond:
    (defn format-prompt [text]
      (str "<|im_start|>user\n" text "<|im_end|>\n<|im_start|>assistant\n"))
    
    ;; Wrap the prompt before passing it to query-hf-model from Step 3:
    (query-hf-model (format-prompt "Tell me a short story about a robot."))
  • Error Handling: Add try-catch to handle network or API errors (a combined sketch follows this list):
    (try
      (query-hf-model prompt)
      (catch Exception e
        (println "Error:" (.getMessage e))))
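Putting both together, a revised -main (a sketch that reuses query-hf-model from Step 3 and format-prompt from above) could look like:

(defn -main []
  (let [prompt (format-prompt "Tell me a short story about a robot.")]
    (try
      (println "Generated text:" (query-hf-model prompt))
      (catch Exception e
        (println "Error:" (.getMessage e))))))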

Final Notes

  • AWQ Support: The AWQ quantization is optimized for GPU inference with tools like vLLM. If the standard Inference API doesn’t support it, the custom Inference Endpoint is your best bet.
  • Documentation: Check the Hugging Face Inference API docs and the model’s page (TheBloke/Mistral-7B-OpenOrca-AWQ) for specific requirements.
  • Pro Benefits: Your Pro account ensures priority access and the ability to deploy custom endpoints, so use that if the default API fails.

Let me know if you encounter issues or need help debugging!
