FAQ: How much concurrency can the API handle?

Achievable concurrency depends on request rate, token rate, model latency, upstream availability, and your client timeout. A single successful request does not prove that production concurrency is safe.

Current public defaults

Limit	Default
Requests per minute	`120`
Estimated tokens per minute	`120000`

Deployment configuration can change these values. For production capacity planning, run your own load test and contact support with your target RPM, TPM, model IDs, and latency requirements.
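
As a rough planning aid, the defaults above can be turned into an estimate of how many requests can be in flight at once. The sketch below is illustrative only: the function name `estimate_capacity`, its parameters, and the example token size and latency are assumptions, not part of the API. It also assumes that both limits apply per minute exactly as listed and that `tokens_per_request` matches how the service estimates tokens. It takes the tighter of the two limits and applies Little's law (average in-flight requests = request rate × average latency).

```python
def estimate_capacity(rpm_limit, tpm_limit, tokens_per_request, latency_seconds):
    """Estimate the sustainable request rate and average in-flight requests.

    All inputs are planning assumptions; measure real values with a load test.
    """
    # Request rate that the token budget alone would allow.
    rpm_from_tokens = tpm_limit / tokens_per_request
    # The tighter of the two limits determines the sustainable rate.
    sustainable_rpm = min(rpm_limit, rpm_from_tokens)
    binding_limit = "requests" if rpm_limit <= rpm_from_tokens else "tokens"
    # Little's law: average in-flight requests = arrival rate * average latency.
    in_flight = (sustainable_rpm / 60.0) * latency_seconds
    return {
        "sustainable_rpm": sustainable_rpm,
        "binding_limit": binding_limit,
        "in_flight": in_flight,
    }


# Example with the public defaults and a hypothetical workload of
# 2,500 tokens per request at 4 seconds average latency.
estimate = estimate_capacity(
    rpm_limit=120,
    tpm_limit=120_000,
    tokens_per_request=2_500,
    latency_seconds=4.0,
)
print(estimate)
```

With the defaults, the token limit becomes the tighter constraint once requests average more than 1,000 tokens (120000 / 120). Treat the result as a starting point for a load test, not as a guarantee.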

How to test safely

Start with one API key and one model.
Increase request rate gradually.
Watch for `rate_limit_exceeded` and `token_rate_limit_exceeded` errors, timeouts, and cost.
Keep prompts representative of production token size.
Check Usage Logs for status and latency distribution.
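
The steps above can be sketched as a small ramp harness. This is a minimal sketch, not an official tool: `run_stage`, `ramp`, `make_simulated_sender`, the `"ok"` and `"timeout"` status labels, and the 5% error threshold are all hypothetical. The simulated sender stands in for a real API call so the example runs without network access. In a real test, replace it with a function that sends one representative prompt with one API key and one model, and returns a status string such as `rate_limit_exceeded`.

```python
import threading
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_stage(send_request, requests_per_second, num_requests, max_workers=16):
    """Send num_requests at a steady rate; return (status counts, latencies)."""

    def timed_call():
        start = time.monotonic()
        try:
            status = send_request()
        except TimeoutError:
            # Real HTTP clients raise their own timeout types; map them here.
            status = "timeout"
        return status, time.monotonic() - start

    interval = 1.0 / requests_per_second
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = []
        for _ in range(num_requests):
            futures.append(pool.submit(timed_call))
            time.sleep(interval)  # pace submissions to the target rate
        results = [future.result() for future in futures]
    outcomes = Counter(status for status, _ in results)
    latencies = [latency for _, latency in results]
    return outcomes, latencies


def ramp(send_request, rates, num_requests, max_error_ratio=0.05):
    """Step through increasing rates; stop after the first stage with too many errors."""
    report = []
    for rate in rates:
        outcomes, latencies = run_stage(send_request, rate, num_requests)
        errors = num_requests - outcomes["ok"]
        report.append({
            "rate": rate,
            "outcomes": dict(outcomes),
            "max_latency": max(latencies),
        })
        if errors / num_requests > max_error_ratio:
            break  # do not keep pushing once errors appear
    return report


def make_simulated_sender(ok_budget):
    """Stand-in for a real API call: returns "ok" for the first ok_budget calls."""
    lock = threading.Lock()
    calls = 0

    def send():
        nonlocal calls
        with lock:
            calls += 1
            return "ok" if calls <= ok_budget else "rate_limit_exceeded"

    return send


# Ramp from 50 to 200 requests per second, 4 requests per stage (simulated).
report = ramp(make_simulated_sender(ok_budget=6), rates=[50, 100, 200], num_requests=4)
for stage in report:
    print(stage)
```

Stopping at the first stage that exceeds the error threshold keeps the test from adding load after rate limiting starts, which also limits cost. The harness only sees client-side results, so compare them with the status and latency distribution in Usage Logs.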

Related pages

Rate limits
Timeouts
