Asynchronous LLM API calls in Python: A comprehensive guide

September 13, 2024

As developers and data scientists, we often find ourselves needing to interact with large language models (LLMs) through APIs. However, as applications grow in complexity and scale, efficient and performant API interactions become critical. This is where asynchronous programming comes into play, allowing you to maximize throughput and minimize latency when working with LLM APIs.

In this comprehensive guide, we’ll explore the world of asynchronous LLM API calls in Python. We’ll cover everything from the basics of asynchronous programming to advanced techniques for handling complex workflows. By the end of this article, you’ll have a solid understanding of how to leverage asynchronous programming to enhance your LLM-powered applications.

Before we dive into the details of asynchronous LLM API calls, let's establish a solid foundation in asynchronous programming concepts.

Asynchronous programming allows multiple operations to run concurrently without blocking the main thread of execution. In Python, this is achieved primarily through the asyncio module, which provides a framework for writing concurrent code using coroutines, event loops, and futures.

Key Concepts:

  • Coroutines: Functions defined with async def that can be paused and resumed.
  • Event Loop: The central execution mechanism that manages and executes asynchronous tasks.
  • Awaitables: Objects that can be used with the await keyword (coroutines, tasks, futures).

Below is a simple example to illustrate these concepts.

import asyncio
async def greet(name):
    await asyncio.sleep(1)  # Simulate an I/O operation
    print(f"Hello, {name}!")
async def main():
    await asyncio.gather(
        greet("Alice"),
        greet("Bob"),
        greet("Charlie")
    )
asyncio.run(main())

In this example we define an asynchronous function greet() that simulates an I/O operation with asyncio.sleep(). The main() function uses asyncio.gather() to run multiple greetings concurrently. Despite each call sleeping for one second, all three greetings print after roughly one second in total rather than three, demonstrating the power of asynchronous execution.

The need for asynchronous LLM API calls

When using an LLM API, you will often encounter scenarios where many API calls need to be made. Executing them one after another with traditional synchronous code creates significant performance bottlenecks, especially for high-latency operations such as network requests to an LLM service.

Consider a scenario where you need to generate summaries for 100 articles using the LLM API. With a synchronous approach, each API call would block until it receives a response, potentially taking minutes to complete all the requests. On the other hand, with an asynchronous approach, multiple API calls can be initiated simultaneously, significantly reducing the overall execution time.
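
To see why this matters, here is a minimal sketch contrasting the two approaches. It uses asyncio.sleep() as a stand-in for LLM API latency rather than a real endpoint; the one-second delay and the fake_summarize helper are illustrative assumptions, not part of any actual API.

import asyncio
import time

SIMULATED_LATENCY = 1.0  # seconds; stands in for one LLM API round trip

async def fake_summarize(article_id: int) -> str:
    await asyncio.sleep(SIMULATED_LATENCY)  # placeholder for a real API call
    return f"Summary of article {article_id}"

async def run_sequentially(n: int):
    # Each call blocks the next one, so total time is roughly n * latency
    return [await fake_summarize(i) for i in range(n)]

async def run_concurrently(n: int):
    # All calls are started together, so total time is roughly one latency
    return await asyncio.gather(*(fake_summarize(i) for i in range(n)))

async def main():
    for label, runner in (("Sequential", run_sequentially), ("Concurrent", run_concurrently)):
        start = time.perf_counter()
        await runner(10)
        print(f"{label}: {time.perf_counter() - start:.1f}s")

asyncio.run(main())

With ten simulated calls, the sequential version takes roughly ten seconds while the concurrent version finishes in about one.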

Setting up the environment

To start making asynchronous LLM API calls, you need to configure your Python environment with the required libraries.

  • Python 3.7 or higher (for native asyncio support)
  • aiohttp: An asynchronous HTTP client library
  • openai: The official OpenAI Python client (if you are using OpenAI's GPT models)
  • langchain: A framework for building LLM-powered applications (optional, but recommended for complex workflows)

You can install these dependencies using pip:

pip install aiohttp openai langchain

Basic Async LLM API Calls with asyncio

Let's start by making a simple asynchronous call to an LLM API. We'll use OpenAI's GPT-3.5 API through the official AsyncOpenAI client as an example, but the concepts apply to other LLM APIs as well.

import asyncio
from openai import AsyncOpenAI

async def generate_text(prompt, client):
    response = await client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    prompts = [
        "Explain quantum computing in simple terms.",
        "Write a haiku about artificial intelligence.",
        "Describe the process of photosynthesis."
    ]

    async with AsyncOpenAI() as client:
        # One task per prompt, all awaited concurrently
        tasks = [generate_text(prompt, client) for prompt in prompts]
        results = await asyncio.gather(*tasks)

    for prompt, result in zip(prompts, results):
        print(f"Prompt: {prompt}\nResponse: {result}\n")

asyncio.run(main())

In this example we define an asynchronous function generate_text() that calls the OpenAI API using the AsyncOpenAI client. The main() function creates a task for each prompt and runs them all concurrently with asyncio.gather().

This approach allows multiple requests to be sent simultaneously to the LLM API, significantly reducing the total time required to process all prompts.

Advanced Techniques: Batching and Concurrency Control

The previous examples showed the basics of asynchronous LLM API calls, but real-world applications often require a more sophisticated approach. Let's look at two important techniques: request batching and concurrency control.

Batching requests: When processing a large number of prompts, it is often more efficient to work through them in batches rather than launching every request at once. Batching keeps the number of simultaneous API calls manageable and improves overall throughput.

import asyncio
from openai import AsyncOpenAI

async def process_batch(batch, client):
    # Fire off all requests in this batch concurrently
    responses = await asyncio.gather(*[
        client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        ) for prompt in batch
    ])
    return [response.choices[0].message.content for response in responses]

async def main():
    prompts = [f"Tell me a fact about number {i}" for i in range(100)]
    batch_size = 10

    async with AsyncOpenAI() as client:
        results = []
        for i in range(0, len(prompts), batch_size):
            batch = prompts[i:i + batch_size]
            batch_results = await process_batch(batch, client)
            results.extend(batch_results)

    for prompt, result in zip(prompts, results):
        print(f"Prompt: {prompt}\nResponse: {result}\n")

asyncio.run(main())

Concurrency control: Asynchronous programming allows for concurrency, but it is important to control the level of concurrency to avoid overloading the API server or exceeding rate limits. You can use asyncio.Semaphore for this purpose.

import asyncio
from openai import AsyncOpenAI

async def generate_text(prompt, client, semaphore):
    # The semaphore caps how many requests are in flight at once
    async with semaphore:
        response = await client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

async def main():
    prompts = [f"Tell me a fact about number {i}" for i in range(100)]
    max_concurrent_requests = 5
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async with AsyncOpenAI() as client:
        tasks = [generate_text(prompt, client, semaphore) for prompt in prompts]
        results = await asyncio.gather(*tasks)

    for prompt, result in zip(prompts, results):
        print(f"Prompt: {prompt}\nResponse: {result}\n")

asyncio.run(main())

In this example, we use a semaphore to limit the number of concurrent requests to 5 to avoid overloading the API server.

Error handling and retries for asynchronous LLM calls

When using external APIs, it is important to implement robust error handling and retry mechanisms. Let's extend our code to handle common errors and implement exponential backoff for retries.

import asyncio
from openai import AsyncOpenAI
from tenacity import retry, stop_after_attempt, wait_exponential

class APIError(Exception):
    pass

# reraise=True re-raises the final APIError so process_prompt can catch it
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10), reraise=True)
async def generate_text_with_retry(prompt, client):
    try:
        response = await client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error occurred: {e}")
        raise APIError("Failed to generate text")

async def process_prompt(prompt, client, semaphore):
    async with semaphore:
        try:
            result = await generate_text_with_retry(prompt, client)
            return prompt, result
        except APIError:
            return prompt, "Failed to generate response after multiple attempts."

async def main():
    prompts = [f"Tell me a fact about number {i}" for i in range(20)]
    max_concurrent_requests = 5
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async with AsyncOpenAI() as client:
        tasks = [process_prompt(prompt, client, semaphore) for prompt in prompts]
        results = await asyncio.gather(*tasks)

    for prompt, result in results:
        print(f"Prompt: {prompt}\nResponse: {result}\n")

asyncio.run(main())

This extended version includes:

  • A custom APIError exception for API-related errors.
  • A generate_text_with_retry function decorated with @retry from the tenacity library, which implements exponential backoff.
  • Error handling in process_prompt that catches failures and reports a fallback message.

Performance optimization: Streaming responses

When generating long-form content, streaming responses can significantly improve the perceived performance of your application: instead of waiting for the entire response, chunks of text can be processed and displayed as they become available.

import asyncio
from openai import AsyncOpenAI

async def stream_text(prompt, client):
    stream = await client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )

    full_response = ""
    async for chunk in stream:
        # Each chunk carries a small delta of the generated text
        if chunk.choices[0].delta.content is not None:
            content = chunk.choices[0].delta.content
            full_response += content
            print(content, end='', flush=True)

    print("\n")
    return full_response

async def main():
    prompt = "Write a short story about a time-traveling scientist."

    async with AsyncOpenAI() as client:
        result = await stream_text(prompt, client)

    print(f"Full response:\n{result}")

asyncio.run(main())

This example shows how to stream the response from the API, outputting each chunk as it comes in. This approach is particularly useful in chat applications or scenarios where you want to provide real-time feedback to the user.

Building Asynchronous Workflows with LangChain

For more complex LLM-powered applications, the LangChain framework provides a higher-level abstraction that simplifies chaining multiple LLM calls and integrating other tools. Let's look at an example that uses LangChain's asynchronous capabilities.

In the example below, we use LangChain to create a more complex workflow with streaming and asynchronous execution. The AsyncCallbackManager and StreamingStdOutCallbackHandler enable real-time streaming of the generated content.

import asyncio
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks.manager import AsyncCallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

async def generate_story(topic):
    # Streaming LLM whose tokens are printed to stdout as they arrive
    llm = OpenAI(
        temperature=0.7,
        streaming=True,
        callback_manager=AsyncCallbackManager([StreamingStdOutCallbackHandler()])
    )
    prompt = PromptTemplate(
        input_variables=["topic"],
        template="Write a short story about {topic}."
    )
    chain = LLMChain(llm=llm, prompt=prompt)
    return await chain.arun(topic=topic)

async def main():
    topics = ["a magical forest", "a futuristic city", "an underwater civilization"]
    tasks = [generate_story(topic) for topic in topics]
    stories = await asyncio.gather(*tasks)

    for topic, story in zip(topics, stories):
        print(f"\nTopic: {topic}\nStory: {story}\n{'='*50}\n")

asyncio.run(main())

Serving asynchronous LLM applications with FastAPI

To make your asynchronous LLM application available as a web service, FastAPI is a great choice because it natively supports asynchronous operations. Here is an example of how to create a simple API endpoint for text generation:

import asyncio

from fastapi import FastAPI, BackgroundTasks
from pydantic import BaseModel
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI()

class GenerationRequest(BaseModel):
    prompt: str

class GenerationResponse(BaseModel):
    generated_text: str

@app.post("/generate", response_model=GenerationResponse)
async def generate_text(request: GenerationRequest, background_tasks: BackgroundTasks):
    response = await client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": request.prompt}]
    )
    generated_text = response.choices[0].message.content

    # Schedule post-processing in the background so it doesn't block the response
    background_tasks.add_task(log_generation, request.prompt, generated_text)

    return GenerationResponse(generated_text=generated_text)

async def log_generation(prompt: str, generated_text: str):
    # Simulate logging or additional processing
    await asyncio.sleep(2)
    print(f"Logged: Prompt '{prompt}' generated text of length {len(generated_text)}")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

This FastAPI application creates a /generate endpoint that accepts a prompt and returns the generated text. It also demonstrates how to use a background task for additional processing without blocking the response.
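
As a quick way to exercise the endpoint, assuming the app above is running locally on port 8000 and an OpenAI API key is available in the environment, a minimal async client sketch using httpx could look like this:

import asyncio
import httpx

async def main():
    async with httpx.AsyncClient(timeout=60.0) as http_client:
        # Call the /generate endpoint defined in the FastAPI app above
        response = await http_client.post(
            "http://localhost:8000/generate",
            json={"prompt": "Explain event loops in one sentence."},
        )
        response.raise_for_status()
        print(response.json()["generated_text"])

asyncio.run(main())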

Best practices and common pitfalls

When using the asynchronous LLM API, keep in mind the following best practices:

  1. Use a connection pool: Reuse connections to reduce overhead when making multiple requests (see the sketch after this list).
  2. Implement proper error handling: Always consider network issues, API errors, and unexpected responses.
  3. Respect rate limits: Use semaphores or other concurrency control mechanisms to avoid overloading the API.
  4. Monitoring and logging: Implement comprehensive logging to track performance and identify problems.
  5. Use streaming for long-form content: Improves user experience and allows for early processing of partial results.
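
To illustrate the first point, here is a minimal sketch of connection reuse with aiohttp, assuming a hypothetical HTTP-based LLM endpoint; the URL and payload below are placeholders, not a specific provider's API. When using the AsyncOpenAI client instead, sharing a single client instance across requests achieves a similar effect.

import asyncio
import aiohttp

API_URL = "https://api.example.com/v1/completions"  # placeholder endpoint

async def call_llm(session: aiohttp.ClientSession, prompt: str) -> dict:
    # Each request reuses the session's pooled connections instead of
    # opening a new TCP/TLS connection per call.
    async with session.post(API_URL, json={"prompt": prompt}) as resp:
        resp.raise_for_status()
        return await resp.json()

async def main():
    prompts = [f"Prompt {i}" for i in range(10)]
    # One ClientSession (and its connection pool) shared across all requests;
    # the connector limit also acts as a simple concurrency cap.
    connector = aiohttp.TCPConnector(limit=5)
    async with aiohttp.ClientSession(connector=connector) as session:
        results = await asyncio.gather(*(call_llm(session, p) for p in prompts))
    print(len(results), "responses received")

asyncio.run(main())

Sharing one ClientSession means the underlying connections are pooled and reused across calls, which avoids repeated connection setup and plays well with the rate-limiting advice above.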
