Every FastAPI tutorial ends the same way: a single-file application with a few routes, an in-memory list for data storage, and a triumphant "run it with uvicorn main:app --reload." It works. It is clean. And it has almost nothing in common with what a production FastAPI application actually looks like.
The gap between tutorial and production is where most Python API projects struggle. How do you organize 50+ endpoints across multiple domains? How do you handle database connections without leaking them under load? What does authentication look like when you have to support JWT tokens, API keys, and role-based access in the same application? How do you deploy without the single-threaded development server?
This guide covers the decisions and patterns that matter once you move past the tutorial. It is based on patterns our Python engineering teams use across production FastAPI applications serving real traffic for clients in the US and Europe.
Why FastAPI for Production Python APIs
FastAPI has earned its position as the default choice for new Python API projects, and the reasons go beyond developer experience. The framework is built on Starlette for the async HTTP layer and Pydantic for data validation, both of which are battle-tested in production environments. But the real production advantages are architectural.
Async-first design means your API does not block on I/O operations. When an endpoint waits for a database query, an external HTTP call, or a file read, the event loop handles other requests instead of sitting idle. For I/O-bound workloads, which describe the majority of API servers, this translates to 3x to 10x more concurrent requests per server instance compared to synchronous frameworks like Flask.
Automatic data validation through Pydantic eliminates an entire class of bugs. Request bodies, query parameters, path parameters, and headers are all validated before your business logic runs. In production, this means fewer 500 errors from malformed input and better error messages for API consumers. Pydantic v2, now the default, runs validation 5x to 50x faster than v1 thanks to its Rust core.
The built-in OpenAPI documentation is not just a convenience feature. It becomes the contract between your API and every team that consumes it. Frontend developers, mobile teams, and third-party integrators all work from the same auto-generated specification, which eliminates the documentation drift that plagues manually maintained API docs.
Project Structure for Large Applications
The single-file structure from tutorials breaks down quickly. Once you have more than 10 endpoints, you need a structure that lets multiple developers work simultaneously without constant merge conflicts, and that makes it possible to find code without searching the entire codebase.
Domain-driven package structure
The pattern that scales best for large FastAPI applications organizes code by business domain rather than by technical layer. Instead of putting all models in one directory and all routes in another, each domain gets its own self-contained package:
- app/users/ -- router.py, models.py, schemas.py, service.py, dependencies.py
- app/orders/ -- router.py, models.py, schemas.py, service.py, dependencies.py
- app/products/ -- router.py, models.py, schemas.py, service.py, dependencies.py
- app/core/ -- config.py, database.py, security.py, middleware.py
- app/main.py -- application factory, router inclusion, lifespan events
Each domain package contains everything related to that feature: the SQLAlchemy models, the Pydantic request/response schemas, the service layer with business logic, the FastAPI router with endpoint definitions, and the domain-specific dependencies. The core package holds shared infrastructure -- database session management, configuration, authentication utilities, and middleware.
This structure has a practical benefit beyond organization. When a domain grows large enough to warrant extraction into a separate microservice, the boundary is already defined. You move the package, update the imports for shared dependencies, and the separation is clean.
The application factory pattern
Rather than creating the FastAPI instance at module level, use a factory function that builds and configures the application. This makes testing significantly easier because you can create fresh application instances with different configurations for different test scenarios:
- Register all domain routers with appropriate prefixes and tags
- Configure CORS, trusted hosts, and other middleware
- Set up lifespan events for database connection pools and background task infrastructure
- Apply rate limiting and request logging middleware
- Mount static files or sub-applications if needed
The factory pattern also keeps your main.py clean. It becomes a composition root that wires together all the pieces rather than a dumping ground for configuration and route definitions.
Dependency Injection That Scales
FastAPI's dependency injection system is one of its strongest production features, but most tutorials only scratch the surface. In a production application, dependencies handle database sessions, authentication, authorization, feature flags, rate limiting, and request-scoped caching.
Layered dependencies
The key pattern is layering dependencies so that complex operations compose from simple building blocks:
- Infrastructure dependencies provide database sessions, Redis connections, and external service clients. These are typically async generators that handle resource lifecycle -- acquiring a connection from the pool and releasing it after the request completes.
- Authentication dependencies consume infrastructure dependencies (a database session to look up users) and produce the current user or raise an HTTP 401.
- Authorization dependencies consume authentication dependencies (the current user) and verify permissions for the specific operation. A dependency like require_role("admin") returns a callable that checks the user's roles.
- Business dependencies combine everything above. An endpoint that requires an authenticated admin with a database session declares all three, and FastAPI resolves the entire dependency graph automatically.
This approach eliminates duplicated checks across endpoints. Authentication logic lives in one place. Database session management lives in one place. When you need to change how sessions are scoped or how tokens are validated, you change one dependency and every endpoint that uses it gets the update.
Dependency overrides for testing
FastAPI's app.dependency_overrides dictionary lets you replace any dependency during testing. This is critical for production applications -- you can swap the real database session for a test database, replace external API clients with mocks, and override authentication to inject test users without touching endpoint code. The pattern makes integration testing practical even for complex dependency graphs.
Async Database Integration with SQLAlchemy
The database layer is where most FastAPI production issues originate: connection leaks under load, blocking queries that stall the event loop, and improper session scoping that leads to stale data or transaction conflicts. Getting this right matters more than any other architectural decision.
Async engine and session setup
SQLAlchemy 2.0+ provides first-class async support through create_async_engine and AsyncSession. Use asyncpg as the PostgreSQL driver -- it is written in Cython and significantly outperforms psycopg2 for async workloads. The engine should be created once at application startup using FastAPI's lifespan context manager, and disposed on shutdown.
Connection pool sizing is critical. The default pool size of 5 connections is too low for any meaningful traffic. Configure the pool based on your expected concurrency:
- pool_size: The number of persistent connections to maintain. Start with 20 for moderate traffic.
- max_overflow: Additional connections allowed beyond pool_size during traffic spikes. Set this to 10-20 to handle bursts without rejecting requests.
- pool_timeout: How long to wait for a connection from the pool before raising an error. 30 seconds is a reasonable default.
- pool_recycle: Maximum age of a connection before it is recycled. Set to 1800 (30 minutes) to avoid issues with database server connection limits and stale connections.
- pool_pre_ping: Enable this to test connections before using them. It adds minimal overhead and prevents errors from connections that the database server has closed.
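Putting those settings together, an engine setup might look like the following sketch, assuming SQLAlchemy 2.0+ with asyncpg; the connection URL is a placeholder:

```python
from contextlib import asynccontextmanager

from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine

DATABASE_URL = "postgresql+asyncpg://app:secret@db:5432/app"  # placeholder

engine = create_async_engine(
    DATABASE_URL,
    pool_size=20,         # persistent connections held open
    max_overflow=10,      # burst headroom beyond pool_size
    pool_timeout=30,      # seconds to wait for a free connection
    pool_recycle=1800,    # recycle connections after 30 minutes
    pool_pre_ping=True,   # validate connections before handing them out
)
SessionLocal = async_sessionmaker(engine, expire_on_commit=False)


@asynccontextmanager
async def lifespan(app):
    yield                      # application serves traffic here
    await engine.dispose()     # drain and close the pool on shutdown
```

The lifespan function is passed to FastAPI(lifespan=lifespan) so the pool is disposed on shutdown.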
Session-per-request pattern
Each request should get its own database session through a dependency that yields an AsyncSession and ensures it is closed after the request completes, regardless of whether the request succeeded or raised an exception. This is non-negotiable for production. Shared sessions across requests lead to subtle bugs under concurrent load -- stale reads, transaction conflicts, and connection exhaustion.
The most common production database issue in FastAPI applications is not slow queries -- it is connection leaks from sessions that are not properly closed when exceptions occur. Always use async generators for session dependencies and ensure the finally block releases the session back to the pool.
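The lifecycle guarantee can be seen in isolation with a stub standing in for SQLAlchemy's AsyncSession (in real code, the session would come from async_sessionmaker); the point is that the finally block runs on both the success and the exception path:

```python
import asyncio


class StubSession:
    """Stand-in for AsyncSession, used only to observe the lifecycle."""
    def __init__(self) -> None:
        self.closed = False
        self.rolled_back = False

    async def rollback(self) -> None:
        self.rolled_back = True

    async def close(self) -> None:
        self.closed = True


async def get_db():
    """Session-per-request dependency: always releases the session."""
    session = StubSession()
    try:
        yield session
    except Exception:
        await session.rollback()   # undo partial work on endpoint errors
        raise
    finally:
        await session.close()      # return the connection to the pool


async def exercise() -> tuple[bool, bool, bool]:
    # Happy path: FastAPI closes the generator after the response.
    gen = get_db()
    ok = await gen.__anext__()
    await gen.aclose()

    # Error path: an exception in the endpoint still closes the session.
    gen = get_db()
    bad = await gen.__anext__()
    try:
        await gen.athrow(ValueError("endpoint blew up"))
    except ValueError:
        pass
    return ok.closed, bad.closed, bad.rolled_back


print(asyncio.run(exercise()))  # (True, True, True)
```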
Authentication and Authorization
Production APIs rarely have a single authentication method. You need JWT bearer tokens for user-facing clients, API keys for server-to-server communication, and sometimes OAuth2 flows for third-party integrations. FastAPI's security utilities provide the building blocks, but you need to compose them into a coherent system.
JWT authentication
Use python-jose or PyJWT for token generation and validation. The critical decisions are token lifetime (short-lived access tokens of 15-30 minutes with longer-lived refresh tokens), the claims you include (minimize to user ID and roles -- do not embed data that changes frequently), and the signing algorithm (RS256 for systems where multiple services verify tokens, HS256 for simpler setups where only your API creates and verifies tokens).
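To make the claim and signing decisions concrete, here is a stdlib-only sketch of the HS256 mechanics. In production you should use PyJWT or python-jose rather than hand-rolling this; the secret, claim names, and TTL below are illustrative:

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"change-me"  # load from a secrets manager in real deployments


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def create_access_token(user_id: str, roles: list[str], ttl: int = 15 * 60) -> str:
    """HS256 JWT with minimal claims: subject, roles, expiry."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(
        {"sub": user_id, "roles": roles, "exp": int(time.time()) + ttl}
    ).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = _b64url(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"


def decode_access_token(token: str) -> dict:
    """Verify signature and expiry; raise ValueError on any failure."""
    try:
        header, payload, sig = token.split(".")
    except ValueError:
        raise ValueError("malformed token")
    signing_input = f"{header}.{payload}".encode()
    expected = _b64url(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    claims = json.loads(
        base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4))
    )
    if claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims
```

Note the minimal payload: user ID and roles only, consistent with the advice above about not embedding frequently changing data.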
Role-based access control
Implement RBAC as a dependency factory -- a function that takes a list of required roles and returns a dependency that checks the current user's roles against the requirement. This lets you protect endpoints declaratively:
- Public endpoints: no authentication dependency
- Authenticated endpoints: Depends(get_current_user)
- Admin endpoints: Depends(require_role("admin"))
- Resource-owner endpoints: custom dependency that checks if the current user owns the requested resource
Rate limiting
For production APIs, rate limiting is not optional. Use slowapi (built on limits) or implement custom rate limiting with Redis. The approach depends on your deployment topology -- if you run multiple API instances behind a load balancer, you need a shared rate limit store (Redis), not in-memory counters that reset per instance. Apply different limits to different endpoint groups: generous limits for read operations, stricter limits for writes, and very strict limits for authentication endpoints to prevent brute-force attacks.
Middleware and Background Tasks
Essential middleware stack
A production FastAPI application needs several middleware layers. Order matters -- middleware executes in the order it is added, and the request passes through each layer before reaching your endpoint:
- CORS middleware: Required for browser-based API consumers. Be specific about allowed origins in production -- never use allow_origins=["*"] outside of development.
- Trusted host middleware: Prevents host header attacks by validating the Host header against a whitelist.
- Request ID middleware: Generates a unique ID for each request and attaches it to the response headers. Essential for tracing requests across services and correlating logs.
- Logging middleware: Records request method, path, status code, and response time for every request. This is your primary observability layer for API traffic patterns.
- Exception handling middleware: Catches unhandled exceptions, logs the full traceback, and returns a consistent error response to the client instead of exposing internal details.
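The request ID layer can be written as plain ASGI middleware with no framework dependencies; this sketch stores the ID in the connection scope and echoes it in an X-Request-ID response header (the header name is a common convention, not a standard):

```python
import uuid


class RequestIDMiddleware:
    """Pure-ASGI middleware: attach a unique X-Request-ID to every response."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        request_id = uuid.uuid4().hex
        scope.setdefault("state", {})["request_id"] = request_id

        async def send_with_id(message):
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                headers.append((b"x-request-id", request_id.encode()))
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_id)
```

In FastAPI this is registered with app.add_middleware(RequestIDMiddleware), and the logging middleware can read the ID back from the scope to stamp every log line.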
Background tasks
FastAPI provides a BackgroundTasks class for work that should happen after the response is sent -- sending emails, updating analytics, processing webhooks. This works well for lightweight tasks, but it has a limitation: background tasks run in the same process as your API, so a CPU-intensive background task will degrade API response times.
For anything beyond trivial tasks, use a proper task queue. Celery with Redis or RabbitMQ remains the standard for Python background processing. The pattern is straightforward: your FastAPI endpoint publishes a task to the queue and returns immediately, and Celery workers process tasks in separate processes (or separate machines entirely). This separation is critical for production systems where background processing volume can spike independently of API traffic. For teams managing these deployment pipelines, a mature DevOps practice makes the difference between smooth operations and constant firefighting.
Deployment: Gunicorn, Uvicorn, and Docker
The development server (uvicorn main:app --reload) is single-process and single-threaded. Production deployment requires a process manager that runs multiple worker processes to utilize all available CPU cores and handle worker crashes gracefully.
Gunicorn with Uvicorn workers
The standard production deployment uses Gunicorn as the process manager with Uvicorn workers. Gunicorn handles process lifecycle -- spawning workers, restarting crashed workers, and graceful shutdown. Each Uvicorn worker runs its own async event loop and handles requests independently.
| Configuration | Recommended Value | Notes |
|---|---|---|
| Workers | 2 * CPU cores + 1 | Start here, adjust based on load testing. More workers use more memory. |
| Worker class | uvicorn.workers.UvicornWorker | Required for async support; do not use the default sync workers. Recent Uvicorn releases deprecate this class in favor of the separate uvicorn-worker package. |
| Timeout | 120 seconds | Maximum time for a worker to handle a request before Gunicorn kills it. |
| Graceful timeout | 30 seconds | Time allowed for in-flight requests to complete during shutdown. |
| Keep-alive | 5 seconds | Increase to 65 seconds if behind an AWS ALB to avoid 502 errors. |
| Max requests | 1000-5000 | Recycles workers after N requests to prevent memory leaks. Add jitter with max-requests-jitter. |
| Bind | 0.0.0.0:8000 | Bind to all interfaces inside Docker. Use Unix sockets for same-host reverse proxy setups. |
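The table translates into a gunicorn.conf.py along these lines; treat the numbers as a starting point to adjust under load testing, per the notes above:

```python
# gunicorn.conf.py -- starting point matching the table above
import multiprocessing

bind = "0.0.0.0:8000"
workers = multiprocessing.cpu_count() * 2 + 1
worker_class = "uvicorn.workers.UvicornWorker"
timeout = 120
graceful_timeout = 30
keepalive = 5                # raise to 65 behind an AWS ALB
max_requests = 2000
max_requests_jitter = 200    # desynchronize worker recycling
```

Run it with gunicorn -c gunicorn.conf.py main:app (or point at your application factory).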
Docker deployment
A production Dockerfile for FastAPI should use multi-stage builds to keep the image size small, run as a non-root user for security, and handle signals correctly so Gunicorn can perform graceful shutdowns during deployments:
- Base image: Use python:3.12-slim instead of the full image. The slim variant is 100MB+ smaller and excludes development tools you do not need in production.
- Dependencies first: Copy requirements.txt and install dependencies before copying application code. Docker layer caching means dependencies are only rebuilt when requirements change, not on every code change.
- Non-root user: Create a dedicated user and switch to it before the CMD instruction. Running as root inside a container is a security risk that most container orchestration platforms flag.
- Health check: Add a HEALTHCHECK instruction that hits a lightweight /health endpoint. Container orchestrators use this to determine when the application is ready to receive traffic and when to restart unhealthy instances.
- Signal handling: Use exec form for CMD (CMD ["gunicorn", ...] not CMD gunicorn ...) so that Gunicorn receives SIGTERM directly from Docker and can shut down gracefully.
Managing this deployment infrastructure well requires more than getting the Dockerfile right. The accumulated shortcuts in your deployment pipeline -- hardcoded configurations, missing health checks, absent monitoring -- are a form of technical debt that compounds over time and becomes expensive to fix under production pressure.
Monitoring and Observability
A production API without monitoring is a liability. You need to know when things break before your users tell you, and you need enough data to diagnose why.
Structured logging
Replace Python's default logging with structured JSON logs using structlog or python-json-logger. Structured logs are parseable by log aggregation systems (ELK stack, Datadog, CloudWatch) and let you filter and search by any field -- request ID, user ID, endpoint, status code, response time. Include the request ID from your middleware in every log entry so you can trace a single request across all log messages it generates.
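For illustration, the core of a JSON formatter can be built on the stdlib alone (structlog or python-json-logger give you far more out of the box; the field names here are illustrative):

```python
import json
import logging


class JSONFormatter(logging.Formatter):
    """Minimal structured formatter: one JSON object per log line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Populated by middleware via `extra={"request_id": ...}`.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)


def configure_logging() -> logging.Logger:
    handler = logging.StreamHandler()
    handler.setFormatter(JSONFormatter())
    logger = logging.getLogger("app")
    logger.setLevel(logging.INFO)
    logger.handlers = [handler]
    return logger


# Usage: logger.info("user created", extra={"request_id": "abc123"})
```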
Metrics
Expose Prometheus metrics from your FastAPI application using prometheus-fastapi-instrumentator or a custom middleware. The essential metrics are:
- Request rate: Requests per second by endpoint and status code
- Request duration: Histogram of response times by endpoint (track p50, p95, p99)
- Active requests: Current number of in-flight requests
- Error rate: 4xx and 5xx responses per endpoint
- Database pool: Active connections, available connections, overflow connections, and pool timeouts
- Background task queue: Queue depth, task processing time, failure rate
Health checks
Implement two health check endpoints: a lightweight /health that returns 200 if the process is running (used by load balancers for traffic routing), and a detailed /health/ready that verifies database connectivity, Redis availability, and external service reachability (used by deployment orchestrators to determine when a new instance is ready to receive traffic).
Separating liveness from readiness is important during deployments. A new instance might be running (liveness passes) but still establishing database connections (readiness fails). Without this separation, load balancers route traffic to instances that will return 500 errors on every database query.
Performance Optimization Patterns
Once your FastAPI application is deployed and monitored, the next step is optimizing for the traffic patterns you actually observe. Premature optimization is wasteful, but there are several patterns that almost every production FastAPI application benefits from.
Response caching
Use Redis-based caching for expensive queries and external API calls. The pattern is a dependency or decorator that checks Redis for a cached response before executing the endpoint logic. Cache keys should include all parameters that affect the response -- user role, query parameters, pagination. Set appropriate TTLs based on how stale the data can be. For truly static responses, use HTTP cache headers (Cache-Control, ETag) to let clients and CDNs cache at the edge.
Connection pooling for external services
If your API calls external HTTP services, use a shared httpx.AsyncClient with connection pooling instead of creating a new client per request. Creating a new TCP connection for every outgoing request adds 20-50ms of latency and wastes resources. A pooled client reuses connections and resolves DNS once. Create the client in the application lifespan and inject it as a dependency.
Pagination and query optimization
Never return unbounded result sets from list endpoints. Implement cursor-based pagination for large datasets (it performs better than offset-based pagination at scale) and always select only the columns your response schema needs. The .options(load_only(...)) method in SQLAlchemy prevents loading entire ORM objects when you only need three fields from a 30-column table.
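Cursor-based pagination can be sketched with an opaque base64 cursor over id-sorted rows; in SQL the page query becomes WHERE id > :last_id ORDER BY id LIMIT :page_size. The list filtering here simulates that query for illustration:

```python
import base64
import json


def encode_cursor(last_id: int) -> str:
    """Opaque cursor wrapping the last-seen primary key."""
    return base64.urlsafe_b64encode(
        json.dumps({"last_id": last_id}).encode()
    ).decode()


def decode_cursor(cursor):
    if cursor is None:
        return 0
    return json.loads(base64.urlsafe_b64decode(cursor))["last_id"]


def paginate(rows: list[dict], cursor, page_size: int = 2):
    """One page of id-sorted rows plus the cursor for the next page."""
    last_id = decode_cursor(cursor)
    page = [r for r in rows if r["id"] > last_id][:page_size]
    # No next cursor on a short (final) page.
    next_cursor = encode_cursor(page[-1]["id"]) if len(page) == page_size else None
    return page, next_cursor
```

Because the cursor pins a position rather than an offset, inserts and deletes between page requests do not shift or duplicate results, which is why it outperforms offset pagination at scale.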
Comparison: When to Choose FastAPI
FastAPI is not the right choice for every Python API project. Understanding where it excels and where alternatives make more sense helps you avoid fighting the framework.
| Scenario | Best Choice | Why |
|---|---|---|
| New API with async I/O needs | FastAPI | Async-first design, Pydantic validation, automatic docs |
| Existing Django project adding APIs | Django REST Framework | Leverage existing models, admin, and ORM. Migrating to FastAPI rarely justifies the cost. |
| Simple internal microservice | FastAPI or Flask | Both work well. FastAPI if async matters, Flask if the team knows it already. |
| High-throughput WebSocket server | FastAPI | Native WebSocket support built on Starlette's async handling |
| Full-stack web application with server-rendered pages | Django | Django's template engine, admin, forms, and middleware ecosystem are purpose-built for this. |
| CPU-bound data processing API | FastAPI (with care) | Async does not help with CPU-bound work. Offload to background workers or use process pool executor. |
Common Production Mistakes
After building and reviewing dozens of FastAPI applications in full-cycle development engagements, these are the mistakes that cause the most production incidents.
Blocking the event loop
The most damaging production mistake is calling synchronous, blocking code from async endpoints. A single time.sleep(5), a synchronous database query using psycopg2, or a blocking file read will freeze the entire event loop for that worker -- meaning every other request handled by that worker waits. Use asyncio.to_thread() or run_in_executor() to offload unavoidable blocking operations, and prefer async libraries (asyncpg, aiofiles, httpx) for all I/O.
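The offload pattern in miniature, using a CPU-heavy password hash as the stand-in for unavoidable blocking work (a real application would use bcrypt or argon2; pbkdf2 here keeps the sketch stdlib-only):

```python
import asyncio
import hashlib


def hash_password(password: str) -> str:
    """CPU-heavy, blocking work -- must not run on the event loop."""
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode(), b"salt", 100_000
    ).hex()


async def login(password: str) -> str:
    # Wrong: hash_password(password) called directly would freeze the
    # loop for every other request on this worker while it computes.
    # Right: push it to the default thread pool and await the result.
    return await asyncio.to_thread(hash_password, password)
```

The same asyncio.to_thread wrapper works for blocking file reads or legacy synchronous client libraries until you can replace them with async equivalents.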
Missing database session cleanup
If your database session dependency does not properly close sessions in a finally block, exceptions in endpoint logic will leak connections. Under sustained traffic, this exhausts the connection pool and your API starts returning 500 errors on every request that needs a database query. Test this explicitly by throwing exceptions in endpoints during load tests and monitoring the connection pool metrics.
Oversharing in error responses
FastAPI's default error responses in debug mode include stack traces and internal details. In production, unhandled exceptions should return a generic error message with the request ID for correlation, while the full details are logged server-side. Leaking database schema information, file paths, or internal service URLs in error responses is a security vulnerability.
Ignoring graceful shutdown
When deploying new versions, your container orchestrator sends SIGTERM to the running process. If Gunicorn does not handle this gracefully, in-flight requests are terminated mid-response. Configure the graceful timeout, ensure your health check endpoint starts returning unhealthy during shutdown so load balancers stop routing new traffic, and verify this works during deployment testing.
The difference between a FastAPI tutorial project and a production system is not the framework features you use -- it is the failure modes you handle. Connection exhaustion, event loop blocking, ungraceful shutdowns, and leaked error details are invisible in development and devastating in production.
Scaling FastAPI: From Single Instance to Distributed
A well-configured single FastAPI instance on a 4-core machine can handle 2,000 to 10,000 requests per second depending on response complexity. When you need to go beyond that, horizontal scaling is straightforward because FastAPI applications are stateless by design (assuming you have externalized sessions and caching to Redis).
- Horizontal scaling: Run multiple instances behind a load balancer (NGINX, AWS ALB, or Kubernetes Ingress). Each instance runs its own Gunicorn + Uvicorn stack. No code changes required if your application is stateless.
- Database scaling: Read replicas for read-heavy workloads, connection pooling with PgBouncer for connection-heavy workloads, and write sharding for write-heavy workloads. Your SQLAlchemy session dependency can route reads to replicas using a custom session factory.
- Caching layer: Redis for application-level caching, CDN for static and semi-static API responses. Caching is the highest-leverage performance improvement for most APIs -- a cache hit is orders of magnitude faster than a database query.
- Background processing: Scale Celery workers independently of API instances. During business hours you might need 10 API instances and 2 workers; during batch processing windows you might need 2 API instances and 20 workers.
The operational complexity of distributed systems is significant. If your team is evaluating whether to build and maintain this infrastructure or bring in specialized backend engineers, understanding the trade-offs between staff augmentation and outsourcing is a good starting point.
Conclusion
FastAPI gives you an excellent foundation for production Python APIs, but the framework is the easy part. The decisions that determine whether your API survives real-world traffic happen in the layers around it: how you structure the project so a team of developers can work on it simultaneously, how you manage database connections so they do not leak under load, how you deploy so that zero-downtime releases are routine rather than stressful, and how you monitor so that you find problems before your users do.
Start with the domain-driven project structure, get async database integration right from the beginning, layer your dependencies so that cross-cutting concerns like authentication and authorization are defined once and applied consistently, and deploy with Gunicorn managing Uvicorn workers behind a reverse proxy. Add structured logging and Prometheus metrics before you need them -- retroactively instrumenting a production application under pressure is no one's idea of a good time.
The patterns in this guide are not theoretical. They come from building and operating FastAPI applications that handle production traffic for clients across industries. Every recommendation addresses a specific failure mode that we have encountered, diagnosed, and resolved.
At DSi, our full-cycle development teams build production Python APIs with FastAPI, Django, and async architectures -- from initial design through deployment and ongoing operations. If you need backend engineers who have solved these problems before, talk to our engineering team.