Introducing the Lambda Inference API: Lowest-Cost Inference Anywhere
Nick Harvey
December 10, 2024 • 19 min read

Think Big, Spend Small - Introducing the Lambda Inference API
Today, we’re excited to announce the general availability (GA) release of the Lambda Inference API, the lowest-cost inference anywhere. For just a fraction of a cent, you can access the latest LLMs through a serverless API.
Generate your own API key and see it for yourself:
curl https://api.lambdalabs.com/v1/completions \
  -H "Authorization: Bearer $LAMBDA_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$(curl -sL https://tinyurl.com/5n8pbpkm | jq -Rs --arg prompt "Summarize this Inference API blog, emphasizing new features and benefits in three concise bullet points. Here is the post:" \
  '{
    model: "llama3.3-70b-instruct-fp8",
    prompt: ($prompt + "\n\n" + .),
    temperature: 0
  }')" | jq .
Our new Lambda Inference API offers low-cost, scalable AI inference with some of the latest models, such as the recently released Llama 3.3 70B Instruct (FP8), at just $0.20 per million input and output tokens. That’s half the cost of Deepinfra and the lowest-priced serverless AI inference available anywhere!
Choose from “Core” models, which are selected for stability and long-term support, or “Sandbox” models, which provide access to the latest innovations with more frequent updates. The API scales effortlessly to handle workloads of any size and integrates seamlessly with OpenAI-style endpoints, making implementation quick and easy.
Lambda Inference API Pricing

Model                                  Context  Price per 1M input/output tokens*

Core
Llama-3.1-8B-Instruct (BF16)           131K     $0.03
Llama-3.1-70B-Instruct (FP8)           131K     $0.20
Llama-3.1-405B-Instruct (FP8)          131K     $0.90

Sandbox
Llama-3.2-3B-Instruct (FP8)            131K     $0.02
Hermes-3-Llama-3.1-8B (BF16)           131K     $0.03
Hermes-3-Llama-3.1-70B (FP8)           131K     $0.20
Hermes-3-Llama-3.1-405B (FP8)          131K     $0.90
LFM-40b (BF16)                         66K      $0.15
Llama-3.1-Nemotron-70B-Instruct (FP8)  131K     $0.20
Qwen2.5-Coder-32B (BF16)               33K      $0.09

* plus applicable sales tax
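At per-token prices like these, the cost of a request follows directly from its token count. A minimal sketch, with a few rows of the table above encoded as a dict (the dictionary keys here are illustrative shorthand, not necessarily the exact model identifiers the API expects):

```python
# Price per 1M input/output tokens, in USD, taken from the pricing table above.
# Keys are illustrative shorthand, not necessarily the API's model IDs.
PRICE_PER_MILLION = {
    "llama3.1-8b-instruct": 0.03,
    "llama3.1-70b-instruct-fp8": 0.20,
    "llama3.1-405b-instruct-fp8": 0.90,
    "qwen2.5-coder-32b": 0.09,
}

def estimate_cost(model: str, total_tokens: int) -> float:
    """Estimated charge for a request, given its total token count."""
    return total_tokens * PRICE_PER_MILLION[model] / 1_000_000

# A ~120k-token summarization job on the 70B model costs about 2.4 cents:
print(f"${estimate_cost('llama3.1-70b-instruct-fp8', 120_000):.4f}")  # → $0.0240
```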
AI without the complexity
Inference is where trained models prove their worth. It’s where the AI model takes in new data (aka prompts) such as text, images, and embeddings, and generates actionable predictions, insights, or even videos of fire-fighting kittens in near real-time.
From powering conversational agents to generating images, inference is at the heart of every AI-driven application.
But let’s face it: deploying AI at scale is no easy feat. It requires massive amounts of compute, significant MLOps expertise to set everything up and tune its performance, and a hefty budget to keep it all running smoothly. If you’ve ever tried deploying an AI model before, you know.
That’s why we built the Lambda Inference API: to make inference simple, scalable, and accessible. For over a decade, Lambda has been engineering every layer of our stack, from hardware and networking to software, for AI performance and efficiency.
We’ve taken everything we’ve learned over that time and built an Inference API, underpinned by an industry-leading inference stack, that’s purpose-built for AI.
Cut costs. Not corners.
Lambda Inference API provides Meta’s Llama 3.1 405B at just $0.90 per million tokens.
Pay-per-token
You’re only charged for the tokens you use, ensuring zero waste and complete transparency. No hidden fees or long-term commitments.
Scalability? Handled.
Designed to dynamically meet the demands of workloads of any size, so you can scale without worrying about infrastructure bottlenecks.
No rate limits
Run inference unconstrained by any limits on your API calls. Whether you're supporting a handful of users or millions, our API dynamically scales to meet demand.
Getting Started with Lambda Inference API
It’s easy to get started with the Lambda Inference API, and for the VS Code lovers out there, we have a quickstart guide on integrating the Lambda Inference API into VS Code. Here’s how:
Generate an API key
Choose your model
Pick your endpoint:
/completions - takes a single text string (a prompt) as input, then outputs a response
/chat/completions - takes a list of messages that make up a conversation
Then, using your language of choice, start leveraging the latest and greatest models.
# Lambda_Inference_API_test.py
from openai import OpenAI

openai_api_key = "<API-KEY>"
openai_api_base = "https://api.lambdalabs.com/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "<MODEL>"

response = client.completions.create(
    prompt="Computers are",
    temperature=0,
    model=model,
)
print(response)
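The /chat/completions endpoint differs only in its input shape: it takes a list of role-tagged messages rather than a single prompt string. A minimal sketch of building that request body (the system and user contents are made up for illustration); POST it to https://api.lambdalabs.com/v1/chat/completions with the same Authorization header as above:

```python
import json

def build_chat_body(model: str, user_message: str, temperature: float = 0) -> dict:
    """Request body for /chat/completions: a conversation expressed as a
    list of role-tagged messages instead of a single "prompt" string."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": user_message},
        ],
        "temperature": temperature,
    }

body = build_chat_body("<MODEL>", "Computers are")
print(json.dumps(body, indent=2))
```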
Check out our Documentation for more information on how to integrate the API and our /completions and /chat/completions endpoints in your application.
It’s a great time to be a builder
With so many new models coming out, it can be tough to keep up and figure out which ones are worth integrating, especially as their VRAM requirements continue to grow. Until now, developers had to worry about managing the infrastructure to run these models. But with the Lambda Inference API, you can simply make an API call to the latest models for fractions of a cent.
Conversational Agents:
Power chatbots and virtual assistants for a fraction of the cost of anywhere else.
curl https://api.lambdalabs.com/v1/completions \
  -H "Authorization: Bearer $LAMBDA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hermes3-70b",
    "prompt": "Create a concise relevant reply to the following message from a customer: I cant log into the app, Im getting a 500 error",
    "temperature": 0
  }' | jq .
Output:
Sorry to hear that you're experiencing issues logging in to the app. A 500 error usually indicates a server-side issue on our end. Can you please try closing and reopening the app, or clearing your cache and trying again? If the issue persists, our support team will be happy to assist you further.

Total Tokens: 132
Total Cost: $0.00003 (rounded)
Content Generation and Summarization:
Use the Lambda Inference API to summarize large amounts of text, with most models supporting context windows of up to 131K tokens. For example, this command summarizes 80% of The Odyssey by Homer for just a few cents.
curl https://api.lambdalabs.com/v1/completions \
  -H "Authorization: Bearer $LAMBDA_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$(curl -sL https://tinyurl.com/hthdw2nn | jq -Rs --arg prompt "Summarize the following passage from The Odyssey in three concise paragraphs. Here is the text:" \
  '{
    model: "llama3.3-70b-instruct-fp8",
    prompt: ($prompt + "\n\n" + .),
    temperature: 0
  }')" | jq .
Output:
Penelope's prayer to Diana shows her desperation and longing for her husband, as well as her cleverness in drawing parallels to mythology. The story of the daughters of Pandareus highlights her fear of being forced to marry one of the suitors against her will.

The fact that Penelope is still haunted by her misery even in her dreams emphasizes the depth of her love for Ulysses and the toll his long absence has taken on her. Her dream of Ulysses by her side also foreshadows his imminent return, building anticipation for the reunion to come.

This passage underscores Penelope's fidelity, intelligence, and emotional resilience in the face of overwhelming challenges, reinforcing her status as one of the most significant and admirable characters in the epic.

Total tokens: 120,724
Cost: $0.02 (rounded)
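For texts that exceed even a 131K-token window, a common pattern is to split the input, summarize each piece, then summarize the summaries. A rough chunking sketch, assuming the common ~4-characters-per-token heuristic for English text (a rule of thumb, not a Lambda parameter):

```python
# ~4 characters per token is a rough heuristic for English text.
CHARS_PER_TOKEN = 4

def chunk_text(text: str, max_tokens: int = 131_000, reserve: int = 4_000) -> list:
    """Split text into pieces that fit a context window, reserving
    room for the summarization prompt and the generated summary."""
    max_chars = (max_tokens - reserve) * CHARS_PER_TOKEN
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

# Each chunk can then be sent to /completions with a summarization prompt,
# and the per-chunk summaries concatenated and summarized one final time.
chunks = chunk_text("x" * 2_000_000)
print(len(chunks))  # → 4
```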
A Glimpse Into the Future
We’re just getting started. Here’s what’s next for the Lambda Inference API:
More models: Expect support for additional state-of-the-art models, enabling even more use cases across industries.
More formats: Support for multimodal models, reasoning models, image generation, video generation, and more.
Batch inference: Lower-cost, batched inference for non-realtime, offline, or overnight processing tasks.
Start Building Today
With the Lambda Inference API, you can leverage cutting-edge AI models without the high costs, infrastructure headaches, or operational complexity.
Sign up today and see how the Lambda Inference API can transform your next project, one API call at a time.