
Async/Await Fundamentals for LLM APIs

Async/await is a foundational pattern for building scalable LLM applications. It lets a single thread keep thousands of I/O-bound operations in flight (such as API calls to OpenAI or Anthropic's Claude) by yielding control whenever a task is waiting on a network response instead of blocking. Without it, each LLM API call blocks its thread until the response arrives, so requests run one after another and the thread spends nearly all of its time idle. With async/await, the application starts concurrent tasks that suspend and resume, multiplexing many pending requests across one thread.

What Is Async/Await and Why It Matters for LLMs

Async/await is a language-level abstraction for non-blocking concurrency. Calling an async function returns a coroutine (Python), a Promise (JavaScript), or a Future (Rust): an object representing work that can be paused and resumed. When the awaited operation cannot complete yet (a network call, for example), await suspends the coroutine and hands control back to the event loop, freeing the thread to run other work. When the I/O completes, the event loop resumes the paused coroutine.

For LLM applications, this matters enormously. A single API call commonly takes 200–2000 ms, most of it spent waiting on the network and the provider. Synchronous code blocks the calling thread for that whole time. Async code suspends instead, so the event loop can service hundreds of other pending requests. One asyncio event loop can keep thousands of requests in flight on a modest machine because a suspended task costs on the order of kilobytes of memory. Each OS thread, by contrast, reserves its own stack (typically megabytes of address space), which is why thread-per-request designs usually top out at hundreds to a few thousand threads.
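
The suspend-and-resume behaviour described above can be observed without any network access. The sketch below is a minimal illustration, assuming a hypothetical fake_llm_call() that stands in for a real API call by sleeping; the names and delays are invented for the example.

```python
import asyncio

async def fake_llm_call(name: str, delay: float, log: list) -> str:
    """Hypothetical stand-in for an LLM request: sleeps instead of using the network."""
    log.append(f"{name} start")
    await asyncio.sleep(delay)  # suspends here; the loop runs other coroutines
    log.append(f"{name} done")
    return name

async def main() -> list:
    log = []
    # A is started first but waits longer, so B finishes first.
    await asyncio.gather(
        fake_llm_call("A", 0.2, log),
        fake_llm_call("B", 0.1, log),
    )
    return log

log = asyncio.run(main())
print(log)  # ['A start', 'B start', 'B done', 'A done']
```

Both calls start before either finishes, which shows that A gave up the thread while it was waiting.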

How the Event Loop Works Under the Hood

The event loop is the heart of async systems. It maintains a queue of runnable coroutines and a queue of pending I/O operations (sockets, timers). In each iteration, the event loop:

  1. Runs each runnable coroutine until it suspends at an await that cannot complete immediately.
  2. Calls the OS via select()/epoll()/kqueue() to ask, "Which I/O operations completed?"
  3. Resumes any coroutines whose I/O is ready.
  4. Repeats.

This means the CPU does work only while code is actually running; while one coroutine waits on I/O, the loop runs others. By contrast, a blocking synchronous call parks its OS thread for the whole network wait: the thread burns almost no CPU, but it cannot do anything else either.
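
The four steps above can be sketched as a toy loop. This is a teaching sketch under stated assumptions, not how asyncio is implemented: it uses plain generators as coroutines and timers in place of sockets, and the names pause(), worker(), and run() are hypothetical.

```python
import heapq
import time
from collections import deque

def pause(seconds):
    """Hypothetical awaitable: hands a wake-up deadline to the loop."""
    yield time.monotonic() + seconds

def worker(name, delay, log):
    log.append(f"{name} start")
    yield from pause(delay)  # plays the role of `await`
    log.append(f"{name} done")

def run(coroutines):
    ready = deque(coroutines)  # runnable coroutines
    timers = []                # heap of (deadline, seq, coroutine) still waiting
    seq = 0                    # tie-breaker so coroutines are never compared
    while ready or timers:
        # Step 1: run each runnable coroutine until it suspends or finishes.
        while ready:
            coro = ready.popleft()
            try:
                deadline = next(coro)
            except StopIteration:
                continue  # coroutine finished
            heapq.heappush(timers, (deadline, seq, coro))
            seq += 1
        # Step 2: wait for the earliest pending operation.
        # A real loop would block in select()/epoll()/kqueue() here.
        if timers:
            deadline, _, coro = heapq.heappop(timers)
            time.sleep(max(0.0, deadline - time.monotonic()))
            # Step 3: the operation is ready, so the coroutine is runnable again.
            ready.append(coro)
        # Step 4: repeat.

log = []
run([worker("A", 0.02, log), worker("B", 0.01, log)])
print(log)  # ['A start', 'B start', 'B done', 'A done']
```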

Core Concepts: Coroutines, Futures, and Tasks

Coroutines (Python) are functions defined with async def that can be paused at await points. They are not immediately executed; you must run them on an event loop. A coroutine is lightweight—just a state machine with no OS resources.

import asyncio
import os

import aiohttp

# Assumption: the API key is supplied via the OPENAI_API_KEY environment variable.
YOUR_API_KEY = os.environ["OPENAI_API_KEY"]

# Define a coroutine that fetches an LLM response
async def fetch_llm_response(prompt: str) -> str:
    """Coroutine: must be awaited to run."""
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "https://api.openai.com/v1/chat/completions",
            json={
                "model": "gpt-4",
                "messages": [{"role": "user", "content": prompt}],
            },
            headers={"Authorization": f"Bearer {YOUR_API_KEY}"},
        ) as resp:
            resp.raise_for_status()  # fail loudly on HTTP errors (e.g. 401, 429)
            data = await resp.json()
            return data["choices"][0]["message"]["content"]

# Calling the function returns a coroutine object; it doesn't run yet.
coro = fetch_llm_response("What is scaling?")
print(type(coro))  # <class 'coroutine'>

# To run it, use asyncio.run() (Python 3.7+)
result = asyncio.run(coro)
print(result)

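Futures (in Rust) and Promises (in JavaScript) are similar: objects representing a value that will eventually be available. In Rust, Future is a trait; in JavaScript, Promise is the built-in equivalent. In both languages the awaiting function is suspended (at .await in Rust, at await in JavaScript) until the underlying operation completes. One difference: a Rust future does nothing until an executor polls it, while a JavaScript promise starts running as soon as it is created.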
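
Python's asyncio has a comparable low-level object, asyncio.Future, which most application code never creates directly. The sketch below is only an illustration of "a value that will eventually be available"; the 10 ms timer is an assumption standing in for a real I/O completion.

```python
import asyncio

async def main() -> str:
    loop = asyncio.get_running_loop()
    # A bare Future: a placeholder that has no result yet.
    future = loop.create_future()
    # Simulate an I/O completion that fills in the result 10 ms later.
    loop.call_later(0.01, future.set_result, "response ready")
    assert not future.done()
    # Awaiting suspends main() until set_result() has been called.
    return await future

value = asyncio.run(main())
print(value)  # response ready
```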

Tasks are coroutines scheduled on an event loop. Calling asyncio.create_task(coro) wraps the coroutine in a Task and schedules it to start the next time the current coroutine yields to the loop. Multiple tasks can run concurrently on the same event loop.

import asyncio

async def main():
    """Spawn 100 concurrent LLM requests."""
    tasks = [
        asyncio.create_task(fetch_llm_response(f"Question {i}"))
        for i in range(100)
    ]
    # All 100 tasks are now scheduled; they start once main() awaits.
    results = await asyncio.gather(*tasks)
    return results

# Run 100 concurrent API calls.
results = asyncio.run(main())
print(f"Fetched {len(results)} responses.")

In this example, all 100 fetch_llm_response() tasks spend most of their time suspended, waiting on the network inside session.post() and resp.json(). The event loop multiplexes them, resuming each as its response arrives. If each call takes 200 ms, sequential calls would take 100 × 200 ms = 20 seconds, while concurrent calls complete in roughly the time of the slowest request (~200 ms), provided the provider's rate limits allow that many simultaneous requests.
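
The arithmetic can be checked locally without an API key. The following is a minimal sketch, assuming a hypothetical fake_llm_call() that replaces the network round trip with a 50 ms sleep; real timings depend on the provider and on rate limits.

```python
import asyncio
import time

async def fake_llm_call(i: int, latency: float = 0.05) -> str:
    """Hypothetical stand-in for a network call with fixed latency."""
    await asyncio.sleep(latency)
    return f"answer {i}"

async def run_sequential(n: int) -> list:
    # Each call finishes before the next one starts.
    return [await fake_llm_call(i) for i in range(n)]

async def run_concurrent(n: int) -> list:
    # All calls wait at the same time; gather() keeps results in input order.
    return await asyncio.gather(*(fake_llm_call(i) for i in range(n)))

def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

seq_results, seq_seconds = timed(run_sequential(10))
con_results, con_seconds = timed(run_concurrent(10))
print(f"sequential: {seq_seconds:.2f} s, concurrent: {con_seconds:.2f} s")
```

With ten calls at 50 ms each, the sequential run should take about ten times as long as the concurrent one.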

Event Loop Rules and Common Pitfalls

To write correct async code, follow these rules:

  1. Never block the event loop. Don't call time.sleep() or run CPU-intensive code inside async functions; either one stalls every other coroutine. Use asyncio.sleep() for delays.

  2. Await only in async contexts. You cannot await outside an async def function. If you need to run a coroutine, use asyncio.run() at the top level.

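  3. One event loop per thread. An event loop runs in a single OS thread, and asyncio objects (tasks, futures, the loop itself) are not thread-safe. To hand work to a loop from another thread, use loop.call_soon_threadsafe() or asyncio.run_coroutine_threadsafe() rather than touching those objects directly (see later articles on worker pools).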

import asyncio
import time

async def bad_example():
    """This blocks the entire event loop."""
    time.sleep(1)  # WRONG: blocks other coroutines.
    print("This message waits, delaying all concurrent tasks.")

async def good_example():
    """This yields to other coroutines."""
    await asyncio.sleep(1)  # CORRECT: suspends, allows other work.
    print("Other tasks ran while this slept.")

async def demo():
    # Run both concurrently. bad_example() still stalls good_example():
    # time.sleep() never yields, so the pair takes ~2 s instead of ~1 s.
    await asyncio.gather(bad_example(), good_example())

asyncio.run(demo())

  4. Create tasks early. If you await each call immediately, you lose concurrency. Instead, create all tasks first, then gather them:

# SLOW: sequential
async def sequential():
    await fetch_llm_response("Q1")
    await fetch_llm_response("Q2")
    await fetch_llm_response("Q3")
    # Total time: ~3 * network_latency

# FAST: concurrent
async def concurrent():
    tasks = [
        asyncio.create_task(fetch_llm_response("Q1")),
        asyncio.create_task(fetch_llm_response("Q2")),
        asyncio.create_task(fetch_llm_response("Q3")),
    ]
    results = await asyncio.gather(*tasks)
    # Total time: ~1 * network_latency
    return results
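
For rule 3, one safe way to submit work to an event loop that lives in another thread is asyncio.run_coroutine_threadsafe(). The sketch below assumes a hypothetical background loop started by a helper named start_background_loop() and a fake_llm_call() that sleeps instead of calling an API; it is one possible arrangement, not a recommended architecture.

```python
import asyncio
import threading

async def fake_llm_call(prompt: str) -> str:
    """Hypothetical stand-in for an API call."""
    await asyncio.sleep(0.01)
    return f"echo: {prompt}"

def start_background_loop():
    """Run a dedicated event loop in a daemon thread."""
    loop = asyncio.new_event_loop()
    thread = threading.Thread(target=loop.run_forever, daemon=True)
    thread.start()
    return loop, thread

loop, thread = start_background_loop()

# Called from the main thread: schedules the coroutine on the other thread's
# loop and returns a concurrent.futures.Future that is safe to wait on here.
future = asyncio.run_coroutine_threadsafe(fake_llm_call("hi"), loop)
result = future.result(timeout=5)
print(result)  # echo: hi

# Stop the loop from outside its thread, then clean up.
loop.call_soon_threadsafe(loop.stop)
thread.join()
loop.close()
```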

Async Context Managers and Resource Cleanup

LLM applications often need to manage resources like HTTP sessions, database connections, and file handles. Async context managers (using async with) ensure proper cleanup even if errors occur.

import asyncio
import os

import aiohttp

# Assumption: the API key is supplied via the OPENAI_API_KEY environment variable.
YOUR_API_KEY = os.environ["OPENAI_API_KEY"]

async def fetch_multiple_with_session():
    """Reuse a single session for multiple requests."""
    async with aiohttp.ClientSession() as session:
        # Session stays open for all requests in this block.
        tasks = []
        for i in range(10):
            tasks.append(fetch_llm_response_with_session(session, f"Q{i}"))
        results = await asyncio.gather(*tasks)
    # Session is automatically closed here.
    return results

async def fetch_llm_response_with_session(
    session: aiohttp.ClientSession, prompt: str
) -> str:
    """Fetch using a shared session (more efficient than opening per request)."""
    async with session.post(
        "https://api.openai.com/v1/chat/completions",
        json={
            "model": "gpt-4",
            "messages": [{"role": "user", "content": prompt}],
        },
        headers={"Authorization": f"Bearer {YOUR_API_KEY}"},
    ) as resp:
        resp.raise_for_status()  # fail loudly on HTTP errors
        data = await resp.json()
        return data["choices"][0]["message"]["content"]

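Opening a new HTTP session per request is expensive because every request pays for a DNS lookup, a TCP connection, and a TLS handshake. Reusing one session lets aiohttp pool and reuse connections across many concurrent requests, which avoids repeating that setup. The size of the saving (figures of 40–60% lower latency are sometimes quoted) depends on the workload, so measure it for yours.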

Key Takeaways

  • Async/await is non-blocking concurrency: Code suspends at await points, allowing the event loop to run other tasks, multiplexing thousands of API calls on a single thread.
  • Event loops are single-threaded: Each thread has one event loop. Concurrent execution is achieved through task switching, not true parallelism.
  • Tasks are the unit of concurrency: Use asyncio.create_task() and asyncio.gather() to spawn and manage concurrent work.
  • Never block the loop: Use asyncio.sleep(), not time.sleep(). Avoid CPU-heavy operations in async functions.
  • Reuse resources: Share HTTP sessions, database connections, and thread pools across concurrent tasks to avoid paying connection setup costs on every request.

Frequently Asked Questions

Can async/await handle CPU-bound work like local LLM inference?

Async/await is designed for I/O-bound work (network, disk, database). CPU-bound work must run in a separate thread or process pool so it does not block the event loop. asyncio provides loop.run_in_executor() for this (and asyncio.to_thread() since Python 3.9). For pure-Python computation, pass a ProcessPoolExecutor, since threads still contend for the GIL.
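
A minimal sketch of offloading, assuming a hypothetical cpu_heavy() function as a stand-in for local inference. It uses the default thread pool (executor=None) to stay self-contained; a concurrent.futures.ProcessPoolExecutor could be passed as the first argument instead. The heartbeat() coroutine is only there to show that the loop keeps running while the work is in progress.

```python
import asyncio

def cpu_heavy(n: int) -> int:
    """Hypothetical blocking computation (stand-in for local inference)."""
    total = 0
    for i in range(n):
        total += i * i
    return total

async def heartbeat(ticks: list) -> None:
    """Ticks three times; it can only do so if the loop is not blocked."""
    for _ in range(3):
        await asyncio.sleep(0.01)
        ticks.append("tick")

async def main():
    loop = asyncio.get_running_loop()
    ticks = []
    # run_in_executor() returns an awaitable future; None selects the
    # loop's default ThreadPoolExecutor.
    result, _ = await asyncio.gather(
        loop.run_in_executor(None, cpu_heavy, 200_000),
        heartbeat(ticks),
    )
    return result, len(ticks)

result, tick_count = asyncio.run(main())
print(result, tick_count)
```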

What's the difference between asyncio.run() and loop.run_until_complete()?

asyncio.run() (Python 3.7+) creates a fresh event loop, runs the coroutine, and closes the loop—clean and simple. run_until_complete() lets you manage the loop manually, useful for long-lived applications that need multiple asyncio calls. Prefer asyncio.run() for scripts and simple applications.
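
The two styles side by side, as a small sketch; the answer() coroutine is a hypothetical placeholder for real work.

```python
import asyncio

async def answer(x: int) -> int:
    await asyncio.sleep(0)  # yield to the loop once
    return x * 2

# Style 1: asyncio.run() creates a loop, runs the coroutine, and closes the loop.
first = asyncio.run(answer(1))

# Style 2: manage the loop yourself and reuse it for several calls.
loop = asyncio.new_event_loop()
try:
    second = loop.run_until_complete(answer(2))
    third = loop.run_until_complete(answer(3))
finally:
    loop.close()  # with manual management, closing is your responsibility

print(first, second, third)  # 2 4 6
```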

How many concurrent requests can one event loop handle?

It depends on memory and system resources. The async machinery is rarely the bottleneck: a suspended task costs on the order of kilobytes, so tens of thousands of tasks per loop are feasible. The limit is usually the OS file descriptor limit (check it with ulimit -n), network bandwidth, or the provider's rate limits.
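
One common way to stay under such limits is to cap the number of requests in flight with asyncio.Semaphore. The sketch below is illustrative only: MAX_IN_FLIGHT is a hypothetical value, fake_llm_call() sleeps instead of opening a socket, and the resource module used to read the descriptor limit exists only on Unix-like systems.

```python
import asyncio

try:
    import resource  # Unix-only standard-library module
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
except ImportError:
    soft_limit = None  # e.g. on Windows

MAX_IN_FLIGHT = 5  # hypothetical cap; keep it well below the descriptor limit

async def fake_llm_call(i: int, semaphore: asyncio.Semaphore, stats: dict) -> int:
    async with semaphore:  # waits here while MAX_IN_FLIGHT calls are active
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # stand-in for the network round trip
        stats["in_flight"] -= 1
    return i

async def main():
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)
    stats = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(
        *(fake_llm_call(i, semaphore, stats) for i in range(20))
    )
    return results, stats["peak"]

results, peak = asyncio.run(main())
print(f"open-file soft limit: {soft_limit}, peak in flight: {peak}")
```

All 20 tasks exist at once, but no more than MAX_IN_FLIGHT of them hold a (simulated) connection at any moment.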

Should I use asyncio or an alternative library like Trio or AnyIO?

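asyncio is part of the standard library and works everywhere Python does. Trio offers structured concurrency with stricter error-handling and cancellation semantics, and AnyIO provides a similar API that runs on top of either asyncio or Trio. For LLM applications, asyncio + aiohttp is the most common and well-documented choice. Consider Trio or AnyIO if you need structured concurrency throughout; asyncio itself gained TaskGroup in Python 3.11.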

How do I test async functions in unit tests?

Use pytest with the pytest-asyncio plugin, or call asyncio.run() inside an ordinary test function. Each asyncio.run() call creates and closes its own event loop, so avoid sharing loop-bound objects (sessions, locks, tasks) between tests.
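
Two dependency-free options, sketched with a hypothetical summarize() coroutine as the code under test. The first is the asyncio.run() approach mentioned above; the second uses unittest.IsolatedAsyncioTestCase from the standard library (Python 3.8+), an alternative not covered in this article.

```python
import asyncio
import unittest

async def summarize(text: str) -> str:
    """Hypothetical coroutine under test."""
    await asyncio.sleep(0)
    return text.upper()

# Option 1: a plain test function (pytest would collect this) that drives
# the coroutine with asyncio.run().
def test_summarize_with_asyncio_run():
    assert asyncio.run(summarize("ok")) == "OK"

# Option 2: the standard library runs each async test method on its own loop.
class SummarizeTests(unittest.IsolatedAsyncioTestCase):
    async def test_summarize(self):
        self.assertEqual(await summarize("ok"), "OK")
```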

Further Reading