To run the TheBloke/Mistral-7B-OpenOrca-AWQ model on Hugging Face infrastructure from a Clojure clj-http client, you'll use Hugging Face's Inference API, which lets you make HTTP requests to models hosted on their platform. Since you have a Hugging Face Pro account, you can use this API with your authentication token for enhanced access (e.g., higher rate limits or access to private models, though this specific model is public).
Here's a step-by-step guide:
Step 1: Hugging Face API Token
- Log in to your Hugging Face account.
- Go to your profile settings (click your avatar > "Settings" > "Access Tokens").
- Create a new API token with "Read" access (or "Write" if you plan to fine-tune or upload models later). Copy this token.
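Rather than hard-coding the token in your source, you may prefer to read it from an environment variable. A minimal sketch; the HF_API_TOKEN variable name is just an example for this guide, not something Hugging Face requires:

  ;; Read the token from the environment so it never lands in version control.
  ;; HF_API_TOKEN is an arbitrary name chosen for this example.
  (def api-token
    (or (System/getenv "HF_API_TOKEN")
        (throw (ex-info "Set the HF_API_TOKEN environment variable" {}))))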
Step 2: Clojure Environment
- Ensure you have a Clojure project set up with Leiningen or deps.edn.
- Add the clj-http dependency to your project, plus org.clojure/data.json, which the example script below uses for JSON encoding and decoding. For a project.clj file, include:
  :dependencies [[clj-http "3.12.3"]
                 [org.clojure/data.json "2.4.0"]]
  Or for deps.edn:
  {:deps {clj-http/clj-http    {:mvn/version "3.12.3"}
          org.clojure/data.json {:mvn/version "2.4.0"}}}
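For reference, a minimal project.clj that pulls in everything the example script below needs might look like the sketch here; the project name and :main namespace are placeholders, so adjust them to your own project:

  (defproject huggingface-inference "0.1.0-SNAPSHOT"
    :description "Example client for the Hugging Face Inference API"
    :dependencies [[org.clojure/clojure "1.11.1"]
                   [clj-http "3.12.3"]
                   [org.clojure/data.json "2.4.0"]]
    ;; Lets `lein run` find the -main function in the script below.
    :main huggingface-inference)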
Step 3: Model Availability
- The TheBloke/Mistral-7B-OpenOrca-AWQ model is a quantized version optimized for GPU inference.
- Hugging Face's Inference API supports many models, but not all quantized formats (like AWQ) are natively supported out of the box on their hosted inference endpoints. We'll proceed assuming it's available via the API.
- If it's not, you may need to deploy it yourself (Step 5).
Step 4: Call the Inference API from Clojure
The Hugging Face Inference API allows you to send a POST request to a model endpoint with input data and receive generated output. The endpoint for a model is typically:
https://api-inference.huggingface.co/models/<model-id>
For your model, this would be:
https://api-inference.huggingface.co/models/TheBloke/Mistral-7B-OpenOrca-AWQ
The API expects a JSON payload with the input prompt and optional parameters (e.g., temperature, max tokens). You’ll need to include your API token in the Authorization
header.
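To make the shapes concrete before looking at the full script, here is roughly what the request payload and a typical text-generation response look like when expressed as Clojure data; the response text is illustrative only, not real model output:

  (def example-payload
    {:inputs "Tell me a short story about a robot."
     :parameters {:max_new_tokens 100
                  :temperature 0.7
                  :top_p 0.95}})

  ;; A successful text-generation response decodes to a vector of maps,
  ;; one per returned sequence. The text below is made up for illustration.
  (def example-response
    [{:generated_text "Tell me a short story about a robot. Once upon a time..."}])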
Here's an example Clojure script to call the Inference API using clj-http:
(ns huggingface-inference
  (:require [clj-http.client :as http]
            [clojure.data.json :as json]))

(def api-token "your-hf-api-token-here") ;; Replace with your Hugging Face API token
(def model-endpoint "https://api-inference.huggingface.co/models/TheBloke/Mistral-7B-OpenOrca-AWQ")

(defn query-hf-model [prompt]
  (let [payload {:inputs prompt
                 :parameters {:max_new_tokens 100
                              :temperature 0.7
                              :top_p 0.95}}
        response (http/post model-endpoint
                            {:headers {"Authorization" (str "Bearer " api-token)
                                       "Content-Type" "application/json"}
                             :body (json/write-str payload)})
        ;; Text-generation responses arrive as a JSON array of maps,
        ;; e.g. [{"generated_text": "..."}], so parse and take the first entry.
        body (json/read-str (:body response) :key-fn keyword)]
    (-> body first :generated_text)))

;; Example usage
(defn -main []
  (let [prompt "Tell me a short story about a robot."
        result (query-hf-model prompt)]
    (println "Generated text:" result)))

;; Run the main function
(-main)
- API Token: Replace "your-hf-api-token-here" with your actual token.
- Payload: The :inputs field is the prompt you send to the model. The :parameters map customizes generation (e.g., max_new_tokens, temperature). Adjust these as needed (see the Hugging Face docs for supported options).
- HTTP Request: clj-http.client/post sends a POST request with the JSON payload and your token in the Authorization header.
- Response: The response body is parsed with clojure.data.json. The exact structure depends on the model; for text generation it's typically a JSON array of maps, each with a generated_text key, which is why the code takes the first element.
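If you want to tweak generation settings per call, you can make the parameters an argument. The sketch below reuses the ns, api-token, and model-endpoint definitions from the script above, and also passes the API's options map with wait_for_model set, which asks the hosted API to hold the request while the model loads instead of erroring immediately; drop it if you don't need that behaviour:

  ;; Variant of query-hf-model that accepts caller-supplied generation parameters.
  (defn query-hf-model*
    [prompt params]
    (let [payload {:inputs prompt
                   :parameters params
                   ;; Ask the hosted API to wait for the model to finish loading
                   ;; rather than failing while it warms up.
                   :options {:wait_for_model true}}
          response (http/post model-endpoint
                              {:headers {"Authorization" (str "Bearer " api-token)
                                         "Content-Type" "application/json"}
                               :body (json/write-str payload)})]
      (-> (json/read-str (:body response) :key-fn keyword)
          first
          :generated_text)))

  ;; Usage:
  ;; (query-hf-model* "Write a haiku about Clojure."
  ;;                  {:max_new_tokens 60 :temperature 0.9 :top_p 0.95})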
- Run your Clojure script (e.g., lein run if using Leiningen).
- Check the output. If successful, you'll see the generated text.
- Model Not Supported: If the API returns an error like "Model not found" or "Unsupported format," it means the AWQ-quantized version isn't natively supported on the hosted Inference API. Proceed to Step 5.
- Rate Limits: With a Pro account you have higher limits, but still watch for 429 Too Many Requests errors.
- JSON Response: If the response structure differs (e.g., no generated_text key), inspect the raw response with (println response) and adjust the parsing.
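For troubleshooting, it can help to inspect the HTTP status yourself instead of letting clj-http throw. A rough sketch, reusing the definitions from the script above and assuming the behaviour documented for the hosted API (503 while the model is still loading, 429 when rate-limited):

  ;; Returns the parsed body on success; prints a diagnostic and returns nil otherwise.
  (defn query-hf-model-debug [prompt]
    (let [response (http/post model-endpoint
                              {:headers {"Authorization" (str "Bearer " api-token)
                                         "Content-Type" "application/json"}
                               :body (json/write-str {:inputs prompt})
                               ;; Don't throw on 4xx/5xx so we can inspect the status.
                               :throw-exceptions false})
          status (:status response)]
      (case status
        200 (json/read-str (:body response) :key-fn keyword)
        503 (println "Model is still loading; retry shortly. Body:" (:body response))
        429 (println "Rate limited (429 Too Many Requests); back off and retry.")
        (println "Unexpected status" status "- body:" (:body response)))))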
Step 5: Deploy Your Own Inference Endpoint
If the AWQ model isn't available via the Inference API, you can deploy it yourself on Hugging Face infrastructure using an Inference Endpoint (a Pro feature):
- Create an Inference Endpoint:
  - Go to Hugging Face > Inference Endpoints > "Create new endpoint."
  - Select the model: TheBloke/Mistral-7B-OpenOrca-AWQ.
  - Choose a GPU instance (e.g., NVIDIA A100) and configure settings (e.g., quantization: AWQ).
  - Deploy the endpoint. Once ready, you'll get a custom URL (e.g., https://<endpoint-id>.endpoints.huggingface.cloud).
- Update the Code: Replace model-endpoint in the Clojure code with your custom endpoint URL (see the sketch after this list). The rest of the code remains the same.
- Test Again: Run the script with the new endpoint. Since you're using a Pro account, this should work seamlessly with the AWQ model.
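One way to switch between the hosted API and your own endpoint without editing code is to read the URL from an environment variable, falling back to the public model URL. A small sketch that would replace the model-endpoint def in the script above; HF_ENDPOINT_URL is an arbitrary variable name for this example:

  ;; Use a dedicated Inference Endpoint URL when provided, otherwise fall back to
  ;; the hosted Inference API URL for the public model.
  (def model-endpoint
    (or (System/getenv "HF_ENDPOINT_URL")
        "https://api-inference.huggingface.co/models/TheBloke/Mistral-7B-OpenOrca-AWQ"))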
- Prompt Template: The Mistral-7B-OpenOrca-AWQ model uses the ChatML format (<|im_start|>user\n{prompt}<|im_end|>). Include this in your prompt if required (a fuller ChatML sketch follows after these notes):
  (defn format-prompt [text]
    (str "<|im_start|>user\n" text "<|im_end|>"))

  (defn query-hf-model [prompt]
    (let [formatted-prompt (format-prompt prompt)
          payload {:inputs formatted-prompt
                   :parameters {:max_new_tokens 100
                                :temperature 0.7
                                :top_p 0.95}}]
      ;; Rest of the function...
      ))
- Error Handling: Add a try/catch to handle network or API errors:
  (try
    (query-hf-model prompt)
    (catch Exception e
      (println "Error:" (.getMessage e))))
- AWQ Support: The AWQ quantization is optimized for GPU inference with tools like vLLM. If the standard Inference API doesn't support it, the custom Inference Endpoint is your best bet.
- Documentation: Check the Hugging Face Inference API docs and the model's page (TheBloke/Mistral-7B-OpenOrca-AWQ) for specific requirements.
- Pro Benefits: Your Pro account gives you priority access and the ability to deploy custom endpoints, so use that if the default API fails.
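As a follow-up to the Prompt Template note above, here is a slightly fuller ChatML helper that also includes a system message and ends with the assistant turn opener so the model knows to respond. The exact template (including the recommended system prompt) is documented on the model card, so treat the strings below as an approximation and double-check there:

  ;; Build a ChatML-style prompt with a system message and an open assistant turn.
  ;; The system text here is a placeholder; see the model card for the recommended one.
  (defn format-chatml [system-text user-text]
    (str "<|im_start|>system\n" system-text "<|im_end|>\n"
         "<|im_start|>user\n" user-text "<|im_end|>\n"
         "<|im_start|>assistant\n"))

  ;; Usage:
  ;; (query-hf-model (format-chatml "You are a helpful assistant."
  ;;                                "Tell me a short story about a robot."))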
Let me know if you encounter issues or need help debugging!