
Unlocking Creativity with Azure OpenAI

Azure OpenAI (AOAI) limits the volume of calls it will accept. These limits take the form of token quotas (TPM, Tokens Per Minute) and request quotas (RPM, Requests Per Minute). The quotas apply per subscription, per region, and per model. As a result, many customers deploy multiple AOAI resources across several regions to maximize throughput. This configuration, in a pay-as-you-go (PAYG) setup, does not address latency issues; the subsequent section covers resolving latency problems with PTU (Provisioned Throughput Units).
In pay-as-you-go (PAYG) mode, when the capacity limits are reached, AOAI returns a 429 (TooManyRequests) HTTP status code, along with a Retry-After response header specifying the number of seconds you should wait before attempting the next request. These errors are typically handled on the client side by SDKs, which is effective when dealing with a single API...
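To make the 429 behavior concrete, here is a minimal sketch of the kind of client-side retry loop an SDK might run. It is an illustration under stated assumptions, not the logic of any actual Azure SDK: the names `parse_retry_after` and `call_with_retry` are hypothetical, and `send` stands in for whatever function performs the HTTP request and returns a `(status, headers, body)` tuple.

```python
import time
from typing import Callable, Mapping, Tuple

# Hypothetical response shape for this sketch: (status code, headers, body).
Response = Tuple[int, Mapping[str, str], str]


def parse_retry_after(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Return the wait time in seconds from a Retry-After header.

    Only the numeric (seconds) form is handled here; a missing or
    non-numeric value falls back to `default`.
    """
    for name, value in headers.items():
        if name.lower() == "retry-after":  # header names are case-insensitive
            try:
                return max(0.0, float(value))
            except ValueError:
                return default
    return default


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call `send` until it returns a non-429 status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        response = send()
        status, headers, _body = response
        if status != 429:
            return response
        if attempt < max_attempts:
            # Wait for the interval the service asked for before retrying.
            sleep(parse_retry_after(headers))
    # Every attempt was throttled; hand the last 429 back to the caller.
    return response
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real delays. A loop like this only retries against one endpoint, which is the single-API case described above.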