Introducing the Lambda Inference API: Lowest-Cost Inference Anywhere
Nick Harvey
December 10, 2024 • 19 min read

Think Big, Spend Small - Introducing the Lambda Inference API
Today, we’re excited to announce the general availability (GA) release of the Lambda Inference API, the lowest-cost inference anywhere. For just a fraction of a cent, you can access the latest LLMs through a serverless API.
Generate your own API key and see it for yourself:
curl https://api.lambdalabs.com/v1/completions \
  -H "Authorization: Bearer $LAMBDA_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$(curl -sL https://tinyurl.com/5n8pbpkm | jq -Rs --arg prompt "Summarize this Inference API blog, emphasizing new features and benefits in three concise bullet points. Here is the post:" \
  '{
    model: "llama3.3-70b-instruct-fp8",
    prompt: ($prompt + "\n\n" + .),
    temperature: 0
  }')" | jq .
Our new Lambda Inference API offers low-cost, scalable AI inference with some of the latest models, such as the recently released Llama 3.3 70B Instruct (FP8), at just $0.20 per million input and output tokens. That’s half the cost of Deepinfra and the lowest-priced serverless AI inference available anywhere!
Choose from “Core” models, which are selected for stability and long-term support, or “Sandbox” models, which provide access to the latest innovations with more frequent updates. The API scales effortlessly to handle workloads of any size and integrates seamlessly with OpenAI-style endpoints, making implementation quick and easy.
Lambda Inference API Pricing

Model                                  Context  Price per 1M input/output tokens*

Core
Llama-3.1-8B-Instruct (BF16)           131K     $0.03
Llama-3.1-70B-Instruct (FP8)           131K     $0.20
Llama-3.1-405B-Instruct (FP8)          131K     $0.90

Sandbox
Llama-3.2-3B-Instruct (FP8)            131K     $0.02
Hermes-3-Llama-3.1-8B (BF16)           131K     $0.03
Hermes-3-Llama-3.1-70B (FP8)           131K     $0.20
Hermes-3-Llama-3.1-405B (FP8)          131K     $0.90
LFM-40b (BF16)                         66K      $0.15
Llama-3.1-Nemotron-70B-Instruct (FP8)  131K     $0.20
Qwen2.5-Coder-32B (BF16)               33K      $0.09

* plus applicable sales tax
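At per-token prices like these, the cost of a request follows directly from its token count. A minimal sketch, with a few rows of the table above encoded as a dict (the dictionary keys here are illustrative shorthand, not necessarily the exact model identifiers the API expects):

```python
# Price per 1M input/output tokens, in USD, taken from the pricing table above.
# Keys are illustrative shorthand, not necessarily the API's model IDs.
PRICE_PER_MILLION = {
    "llama3.1-8b-instruct": 0.03,
    "llama3.1-70b-instruct-fp8": 0.20,
    "llama3.1-405b-instruct-fp8": 0.90,
    "qwen2.5-coder-32b": 0.09,
}

def estimate_cost(model: str, total_tokens: int) -> float:
    """Estimated charge for a request, given its total token count."""
    return total_tokens * PRICE_PER_MILLION[model] / 1_000_000

# A ~120k-token summarization job on the 70B model costs about 2.4 cents:
print(f"${estimate_cost('llama3.1-70b-instruct-fp8', 120_000):.4f}")  # → $0.0240
```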
AI without the complexity
Inference is where trained models prove their worth. It’s where the AI model takes in new data (aka prompts) such as text, images, and embeddings, and generates actionable predictions, insights, or even videos of fire-fighting kittens in near real-time.
From powering conversational agents to generating images, inference is at the heart of every AI-driven application.
But let’s face it: deploying AI at scale is no easy feat. It requires massive amounts of compute, significant MLOps expertise to set everything up and tune its performance, and a hefty budget to keep it all running smoothly. If you’ve ever tried deploying an AI model before, you know.
That’s why we built the Lambda Inference API: to make inference simple, scalable, and accessible. For over a decade, Lambda has been engineering every layer of our stack, from hardware and networking to software, for AI performance and efficiency.
We’ve taken everything we’ve learned over that time and built an Inference API, underpinned by an industry-leading inference stack, that’s purpose-built for AI.
Cut costs. Not corners.
Lambda Inference API provides Meta’s Llama 3.1 405B at just $0.90 per million tokens.
Pay-per-token
You’re only charged for the tokens you use, ensuring zero waste and complete transparency. No hidden fees or long-term commitments.
Scalability? Handled.
Designed to dynamically meet the demands of workloads of any size, so you can scale without worrying about infrastructure bottlenecks.
No rate limits
Run inference unconstrained by any limits on your API calls. Whether you're supporting a handful of users or millions, our API dynamically scales to meet demand.
Getting Started with Lambda Inference API
It’s easy to get started with the Lambda Inference API, and for the VS Code lovers out there, we have a quickstart guide on integrating the Lambda Inference API into VS Code. Here’s how:
Generate an API key
Choose your model
Pick your endpoint:
/completions - takes a single text string (a prompt) as input, then outputs a response
/chat/completions - takes a list of messages that make up a conversation
Then, using your language of choice, start leveraging the latest and greatest models.
# Lambda_Inference_API_test.py
from openai import OpenAI

openai_api_key = "<API-KEY>"
openai_api_base = "https://api.lambdalabs.com/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "<MODEL>"

response = client.completions.create(
    prompt="Computers are",
    temperature=0,
    model=model,
)
print(response)
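The /chat/completions endpoint differs only in its input shape: it takes a list of role-tagged messages rather than a single prompt string. A minimal sketch of building that request body (the system and user contents are made up for illustration); POST it to https://api.lambdalabs.com/v1/chat/completions with the same Authorization header as above:

```python
import json

def build_chat_body(model: str, user_message: str, temperature: float = 0) -> dict:
    """Request body for /chat/completions: a conversation expressed as a
    list of role-tagged messages instead of a single "prompt" string."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": user_message},
        ],
        "temperature": temperature,
    }

body = build_chat_body("<MODEL>", "Computers are")
print(json.dumps(body, indent=2))
```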
Check out our Documentation for more information on how to integrate the API and our /completions and /chat/completions endpoints in your application.
It’s a great time to be a builder
With so many new models coming out, it can be tough to keep up and figure out which ones are worth integrating, especially as their VRAM requirements continue to grow. Until now, developers had to worry about managing the infrastructure to run these models. But with the Lambda Inference API, you can simply make an API call to the latest models for fractions of a cent.
Conversational Agents:
Power chatbots and virtual assistants for a fraction of the cost of anywhere else.
curl https://api.lambdalabs.com/v1/completions \
  -H "Authorization: Bearer $LAMBDA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hermes3-70b",
    "prompt": "Create a concise relevant reply to the following message from a customer: I cant log into the app, Im getting a 500 error",
    "temperature": 0
  }' | jq .
Output:
Sorry to hear that you're experiencing issues logging in to the app. A 500 error usually indicates a server-side issue on our end. Can you please try closing and reopening the app, or clearing your cache and trying again? If the issue persists, our support team will be happy to assist you further.

Total Tokens: 132
Total Cost: $0.00003 (rounded)
Content Generation and Summarization:
Use the Lambda Inference API to summarize large amounts of text, with most models supporting context windows of up to 131K tokens. For example, this command summarizes 80% of The Odyssey by Homer for just a few cents.
curl https://api.lambdalabs.com/v1/completions \
  -H "Authorization: Bearer $LAMBDA_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$(curl -sL https://tinyurl.com/hthdw2nn | jq -Rs --arg prompt "Summarize the following passage from The Odyssey in three concise paragraphs. Here is the text:" \
  '{
    model: "llama3.3-70b-instruct-fp8",
    prompt: ($prompt + "\n\n" + .),
    temperature: 0
  }')" | jq .
Output:
Penelope's prayer to Diana shows her desperation and longing for her husband, as well as her cleverness in drawing parallels to mythology. The story of the daughters of Pandareus highlights her fear of being forced to marry one of the suitors against her will.

The fact that Penelope is still haunted by her misery even in her dreams emphasizes the depth of her love for Ulysses and the toll his long absence has taken on her. Her dream of Ulysses by her side also foreshadows his imminent return, building anticipation for the reunion to come.

This passage underscores Penelope's fidelity, intelligence, and emotional resilience in the face of overwhelming challenges, reinforcing her status as one of the most significant and admirable characters in the epic.

Total tokens: 120,724
Cost: $0.02 (rounded)
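For texts that exceed even a 131K-token window, a common pattern is to split the input, summarize each piece, then summarize the summaries. A rough chunking sketch, assuming the common ~4-characters-per-token heuristic for English text (a rule of thumb, not a Lambda parameter):

```python
# ~4 characters per token is a rough heuristic for English text.
CHARS_PER_TOKEN = 4

def chunk_text(text: str, max_tokens: int = 131_000, reserve: int = 4_000) -> list:
    """Split text into pieces that fit a context window, reserving
    room for the summarization prompt and the generated summary."""
    max_chars = (max_tokens - reserve) * CHARS_PER_TOKEN
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

# Each chunk can then be sent to /completions with a summarization prompt,
# and the per-chunk summaries concatenated and summarized one final time.
chunks = chunk_text("x" * 2_000_000)
print(len(chunks))  # → 4
```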
A Glimpse Into the Future
We’re just getting started. Here’s what’s next for the Lambda Inference API:
More models: Expect support for additional state-of-the-art models, enabling even more use cases across industries.
More formats: Support for multimodal models, reasoning models, image generation, video generation, and more.
Batch inference: Lower-cost, batched inference for non-realtime, offline, or overnight processing tasks.
Start Building Today
With the Lambda Inference API, you can leverage cutting-edge AI models without the high costs, infrastructure headaches, or operational complexity.
Sign up today and see how the Lambda Inference API can transform your next project, one API call at a time.