Inexpensive and Fast LLM Inference Using the Groq Service

Dear reader, are you excited about integrating LLMs into your applications but want to minimize costs?

Groq is rapidly making a name for itself in the AI community as a cloud-based large language model (LLM) inference provider, distinguished by its revolutionary hardware and remarkably low-cost, high-speed performance. At the heart of Groq’s impressive capabilities lies its custom-designed Language Processing Unit (LPU), a departure from the conventional GPUs that have long dominated the AI hardware landscape. Unlike GPUs, which are general-purpose processors, the LPU is an application-specific integrated circuit (ASIC) meticulously engineered for the singular task of executing LLM inference. This specialization allows for a deterministic and streamlined computational process, eliminating many of the bottlenecks inherent in more versatile hardware. The LPU’s architecture prioritizes memory bandwidth and minimizes latency, enabling it to process and generate text at a blistering pace, often an order of magnitude faster than its GPU counterparts. This focus on inference, the process of using a trained model to make predictions, positions Groq as a compelling solution for real-time applications where speed is paramount.

The practical implications of Groq’s technological innovation are multifaceted, offering a potent combination of affordability, speed, and a diverse selection of open-source models. The efficiency of the LPU translates directly into a more cost-effective pricing structure for users, with a pay-as-you-go model based on the number of tokens processed. This transparent and often significantly cheaper pricing democratizes access to powerful AI, enabling developers and businesses of all sizes to leverage state-of-the-art models without prohibitive upfront costs. The platform’s raw speed is a game-changer, facilitating near-instantaneous responses that are crucial for interactive applications like chatbots, content generation tools, and real-time data analysis. Furthermore, Groq’s commitment to the open-source community is evident in its extensive library of available models, including popular choices like Meta’s Llama series, Mistral’s Mixtral, and Google’s Gemma. This wide array of options provides users with the flexibility to select the model that best suits their specific needs, all while benefiting from the unparalleled inference speeds and economic advantages offered by Groq’s unique hardware.
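As a concrete illustration of the pay-as-you-go model, the arithmetic below estimates the cost of a single request from its token counts. This is a minimal Python sketch; the per-million-token dollar rates are hypothetical placeholders, so substitute current numbers from Groq's pricing page.

```python
# Estimate the cost of one request under per-token, pay-as-you-go pricing.
# NOTE: the dollar rates below are HYPOTHETICAL placeholders, not Groq's
# actual prices; look up the current rates for the model you use.
def request_cost(prompt_tokens, completion_tokens,
                 usd_per_m_input, usd_per_m_output):
    return ((prompt_tokens / 1e6) * usd_per_m_input
            + (completion_tokens / 1e6) * usd_per_m_output)

cost = request_cost(1200, 400, usd_per_m_input=0.50, usd_per_m_output=0.80)
print(f"${cost:.6f}")  # a 1600-token round trip costs a fraction of a cent
```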

Here you will learn how to send prompts to the Groq LLM inference service. For information on creating effective prompts, please read my blog article Notes on effectively using AI.

Structure of Project and Build Instructions

This project is stored in the directory gerbil_scheme_book/source_code/groq_llm_inference. There is one common utility file groq_inference.ss and currently two very short example scripts that use this utility:

  • kimi2.ss - Uses Moonshot AI’s Kimi2 model (MoE: 1 trillion parameters, with 32B active).
  • gpt-oss-120b.ss - Uses OpenAI’s open source model gpt-oss-120b.

Both models are practical choices, excellent for data manipulation, coding, and general-purpose use.

It’s important to note that both models leverage a Mixture of Experts (MoE) architecture. This is a significant departure from traditional “dense” transformer models where every parameter is activated for every input token. In an MoE model, a “router” network selectively activates a small subset of “expert” sub-networks for each token, allowing for a massive total parameter count while keeping the computational cost for inference relatively low. The comparison, therefore, is between two different implementations and philosophies of the MoE approach.
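To make the router-plus-experts idea concrete, here is a toy, framework-free Python sketch of top-k MoE routing. The scoring rule and the experts are deliberately trivial stand-ins; in a real model both are learned neural networks, but the control flow is the same: score all experts, run only the top k.

```python
# Toy sketch of Mixture-of-Experts routing: a router scores every expert
# for each token, but only the top-k experts are actually executed.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moe_layer(token_vec, experts, router_weights, k=2):
    """Route one token through the top-k experts only.

    experts        : list of callables (stand-ins for expert sub-networks)
    router_weights : one weight vector per expert (toy dot-product scoring)
    """
    scores = [sum(t * w for t, w in zip(token_vec, wv)) for wv in router_weights]
    gates = softmax(scores)
    # Keep only the k highest-gated experts; the rest never run.
    top = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:k]
    norm = sum(gates[i] for i in top)
    out = [0.0] * len(token_vec)
    for i in top:
        y = experts[i](token_vec)  # only k expert forward passes per token
        out = [o + (gates[i] / norm) * yj for o, yj in zip(out, y)]
    return out, top

# Toy usage: 4 experts, each a simple elementwise scaling; only 2 run per token.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1, 0], [0, 1], [1, 1], [-1, 0]]
out, active = moe_layer([0.5, 0.25], experts, router_weights, k=2)
print(len(active))  # 2 experts active out of 4
```

The compute saving is exactly this ratio: with 4 experts and k=2, half the expert parameters sit idle for any given token; Kimi2 and gpt-oss-120b apply the same trick at vastly larger scale.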

Here is the project Makefile:

1 compile: groq_inference.ss
2     gxc groq_inference.ss 
3 
4 kimi2: compile
5     gxi -l kimi2.ss -
6 
7 gpt-oss-120b: compile
8     gxi -l gpt-oss-120b.ss -

Kimi2 (Moonshot AI)

Features:

  • Architecture: A very large-scale Mixture of Experts (MoE) model.
  • Parameters: It has a staggering 1 trillion total parameters. For any given token during inference, it activates approximately 32 billion of these parameters. This represents a very sparse activation (around 3.2%).
  • Specialization: Kimi2 is highly optimized for agentic capabilities, meaning it excels at using tools, reasoning through multi-step problems, and advanced code synthesis.
  • Training Innovation: It was trained using a novel optimizer called MuonClip, designed to ensure stability during large-scale MoE training runs, which have historically been prone to instability.
  • Context Window: It supports a large context window of up to 128,000 tokens, making it suitable for tasks involving long documents or extensive codebases.
  • Licensing: While the model weights are publicly available (“open-weight”), its specific licensing and training data details are proprietary to Moonshot AI.

gpt-oss-120b (OpenAI)

Features:

  • Architecture: Also a Mixture of Experts (MoE) model, but at a smaller scale than Kimi2.
  • Parameters: It has a total of 117 billion parameters, with a much smaller active set of around 5.1 billion parameters per token. This results in a similarly sparse activation (around 4.4%).
  • Efficiency and Accessibility: A primary feature is its optimization for efficient deployment. It’s designed to run on a single 80 GB GPU (like an H100), making it significantly more accessible for researchers and smaller organizations.
  • Focus: Like Kimi2, it is designed for high-reasoning, agentic tasks, and general-purpose use.
  • Licensing: It is a true open-source model, released under the permissive Apache 2.0 license. This allows for broad use, modification, and redistribution.
  • Training: It was trained using a combination of reinforcement learning and distillation techniques from OpenAI’s more advanced proprietary models.

Comparison and Use Cases

  • Architecture: Kimi2 is a massive-scale Mixture of Experts (MoE); gpt-oss-120b is an efficient, smaller-scale MoE.
  • Total parameters: ~1 trillion (Kimi2) vs. ~117 billion (gpt-oss-120b).
  • Active parameters per token: ~32 billion vs. ~5.1 billion.
  • Primary goal: Kimi2 pushes the upper limits of performance and scale; gpt-oss-120b balances high performance with deployment efficiency.
  • Hardware target: large-scale, high-end compute clusters vs. a single high-end GPU (e.g., an H100).
  • Licensing: open-weight with a proprietary license vs. open source under Apache 2.0.
  • Key differentiator: sheer scale and the novel MuonClip optimizer vs. accessibility, efficiency, and a permissive open license.
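The sparse-activation percentages quoted earlier follow directly from these parameter counts, as this quick Python check shows:

```python
# Active-parameter ratios implied by the published parameter counts.
kimi2_active_pct = 32e9 / 1e12 * 100       # 32B active out of ~1T total
gpt_oss_active_pct = 5.1e9 / 117e9 * 100   # 5.1B active out of 117B total
print(round(kimi2_active_pct, 1))    # 3.2
print(round(gpt_oss_active_pct, 1))  # 4.4
```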

groq_inference.ss Utility

Here we construct a practical, reusable Gerbil Scheme function for interacting with the Groq API, a service renowned for its high-speed large language model inference. The function, named groq_inference, encapsulates the entire process of making a call to Groq’s OpenAI-compatible chat completions endpoint. It demonstrates essential real-world programming patterns, such as making authenticated HTTP POST requests, dynamically building a complex JSON payload from Scheme data structures, and securely managing credentials using environment variables. This example not only provides a useful utility for integrating AI into your applications but also serves as an excellent case study in using Gerbil’s standard libraries for networking (:std/net/request) and data interchange (:std/text/json), complete with robust error handling for both network issues and malformed API responses.

 1 (import :std/net/request
 2         :std/text/json)
 3 
 4 (export groq_inference)
 5 
 6 ;; Generic Groq chat completion helper
 7 ;; Usage: (groq_inference model prompt [system-prompt: "..."])
 8 (def (groq_inference
 9       model prompt
10       system-prompt: (system-prompt "You are a helpful assistant."))
11   (let ((api-key (get-environment-variable "GROQ_API_KEY")))
12     (unless api-key
13       (error "GROQ_API_KEY environment variable not set."))
14 
15     (let* ((headers `(("Content-Type" . "application/json")
16                       ("Authorization" . ,(string-append "Bearer " api-key))))
17            (body-data
18             (list->hash-table
19              `(("model" . ,model)
20                ("messages"
21                 .
22                 ,(list
23                   (list->hash-table `(("role" . "system") ("content" . ,system-prompt)))
24                   (list->hash-table `(("role" . "user") ("content" . ,prompt))))))))
25            (body-string (json-object->string body-data))
26            (endpoint "https://api.groq.com/openai/v1/chat/completions"))
27       
28       (let ((response (http-post endpoint headers: headers data: body-string)))
29         (if (= (request-status response) 200)
30           (let* ((response-json (request-json response))
31                  (choices (hash-ref response-json 'choices #f))
32                  (first-choice (and (pair? choices) (car choices)))
33                  (message (and first-choice (hash-ref first-choice 'message #f)))
34                  (content (and message (hash-ref message 'content #f))))
35             (or content (error "Groq response missing content")))
36           (error "Groq API request failed"
37             status: (request-status response)
38             body: (request-text response)))))))

The implementation begins by defining the groq_inference function, which accepts a model and a prompt, along with an optional keyword argument for a system message. Its first action is a crucial security and configuration check: it attempts to fetch the GROQ_API_KEY from the environment variables, raising an immediate error if it’s not found. The core of the function then uses a let* block to sequentially build the components of the HTTP request. It constructs the authorization headers and then assembles the JSON body using a combination of quasiquotation and the list->hash-table procedure to create the nested structure required by the API. This body is then serialized into a JSON string, and finally, the http-post function is called with the endpoint, headers, and data to execute the network request.

Upon receiving a response, the function demonstrates robust result processing and error handling. It first checks if the HTTP status code is 200 (OK), indicating a successful request. If it is, a series of let* bindings are used to safely parse the JSON response and navigate the nested data structure to extract the final content string from response["choices"][0]["message"]["content"], with checks at each step to prevent errors on an unexpected response format. If the content is successfully extracted, it is returned as the result of the function. However, if the HTTP status is anything other than 200, the function enters its error-handling branch, raising a descriptive error that includes the failing status code and the raw text body of the response, providing valuable debugging information to the caller.
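For readers more familiar with Python than Scheme, here is the same request-body shape and the same extraction path, sketched with a mocked response rather than a live API call (the model id is just an example; any Groq-hosted model works):

```python
import json

# Request body with the same shape groq_inference builds in Scheme.
body = {
    "model": "openai/gpt-oss-120b",   # example model id
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "why is the sky blue? be very concise"},
    ],
}
body_string = json.dumps(body)  # serialized payload for the POST request

# A mocked successful response, trimmed to the fields the Scheme code reads.
response_json = {
    "choices": [
        {"message": {"role": "assistant", "content": "Rayleigh scattering."}}
    ]
}

# Same navigation as the let* chain: choices -> first choice -> message -> content.
choices = response_json.get("choices") or []
content = choices[0]["message"]["content"] if choices else None
print(content)  # Rayleigh scattering.
```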

Example scripts: kimi2.ss and gpt-oss-120b.ss

These two scripts are simple enough to just list without comment:

kimi2.ss

 1 (import :groq/groq_inference)
 2 
 3 ;; Use Moonshot AI's best Kimi2 model (MoE: 1 trillion parameters, 32B active).
 4 
 5 ; Export the `kimi2` procedure from this module
 6 (export kimi2)
 7 
 8 (def (kimi2 prompt
 9             model: (model "moonshotai/kimi-k2-instruct")
10             system-prompt: (system-prompt "You are a helpful assistant."))
11   (groq_inference model prompt system-prompt: system-prompt))
12 
13 ;; (kimi2 "why is the sky blue? be very concise")

gpt-oss-120b.ss

 1 (import :groq/groq_inference)
 2 
 3 ;; Use OpenAI's open source model gpt-oss-120b
 4 
 5 ; Export the `gpt-oss-120b` procedure from this module
 6 (export gpt-oss-120b)
 7 
 8 (def (gpt-oss-120b
 9       prompt
10       model: (model "openai/gpt-oss-120b")
11       system-prompt: (system-prompt "You are a helpful assistant."))
12   (groq_inference model prompt system-prompt: system-prompt))
13 
14 ;; (gpt-oss-120b "why is the sky blue? be very concise")

Running the kimi2 example:

Note, the utility must be compiled one time: gxc groq_inference.ss. The compiled library will by default be placed in the directory ~/.gerbil/lib/groq/ because we set this project’s module name to groq in the file gerbil.pkg.

 1  $ gxi -l kimi2.ss                   
 2 > (displayln (kimi2 "explain concisely what evidence there is for 'dark matter' in the universe, and counter arguments. Be concise!"))
 3 Evidence for dark matter  
 4 • Galaxy rotation curves: outer stars orbit too fast for visible mass alone.  
 5 • Gravitational lensing: mass maps exceed baryonic matter.  
 6 • Cosmic Microwave Background: tiny temperature ripples fit models with ~5× more dark than baryonic matter.  
 7 • Structure formation: simulations need unseen matter to match today’s galaxy distribution.  
 8 • Bullet Cluster: collision separated hot gas (baryons) from dominant mass peak, consistent with collisionless dark matter.
 9 
10 Counter-arguments / alternatives  
11 • Modified Newtonian Dynamics (MOND): tweaks gravity law to explain rotation curves without extra mass.  
12 • Modified gravity theories (TeVeS, f(R), emergent gravity) reproduce lensing and CMB with no dark particles.  
13 • Claims of inconsistent lensing signals or tidal dwarf galaxies without dark matter challenge universality.
14 > 

Running the gpt-oss-120b example:

1 $ gxi -l gpt-oss-120b.ss
2 > (displayln (gpt-oss-120b "write a recursive Haskell function 'factorial'. Only show the code."))
3 ```haskell
4 factorial :: Integer -> Integer
5 factorial 0 = 1
6 factorial n = n * factorial (n - 1)
7 ```
8 >