API Management in the AI Era - Azure Singapore.pdf

Nilesh Gule @nileshgule
API Management in the AI Era

$whoami
{
  "name": "Nilesh Gule",
  "website": "https://www.HandsOnArchitect.com",
  "github": "https://GitHub.com/NileshGule",
  "twitter": "@nileshgule",
  "linkedin": "https://www.linkedin.com/in/nileshgule",
  "YouTube": "https://www.YouTube.com/@nilesh-gule",
  "likes": "Technical Evangelism, Cricket"
}

API Management - GenAI Gateway
Azure-Samples/AI-Gateway: APIM

AI Gateway capabilities of Azure API Management
AI Gateway
Security & safety
• Keyless managed identities
• AI Apps & Agents Authorizations (New)
• Content Safety (GA)
• Credential Manager
Resiliency
• Weighted load balancing
• Priority routing to provisioned capacity models
• Backend pools with circuit breaker
• Session-aware load balancing (GA)
Scalability
• Token rate limits and token quotas
• Semantic Caching (GA)
• Model load balancing
• Multi-regional deployments
Traffic mediation & control
• Azure AI Foundry & Azure OpenAI
• OpenAI-compatible models (GA)
• Responses API (GA)
• WebSockets for Realtime APIs
• MCP server pass-through (Soon)
• Expose APIs as built-in MCP server (Preview)
Developer velocity
• Wizard policy configuration experience
• Self-service with the Developer Portal
• API Center Copilot Studio connector (Preview)
• Policy Toolkit
Observability
• Token counting per consumer
• Prompts and completions logging (GA)
• Built-in reporting dashboard (GA)
Governance
• Policy engine with custom expressions
• API Center MCP server registry (Preview)
• Federated API Management

Challenges in managing the GenAI APIs
• Track token usage: ensure tokens are used properly across multiple applications.
• Manage TPM quota: ensure a single app doesn't consume the whole tokens-per-minute (TPM) quota.
• Secure API keys: keep API keys secure across multiple applications.
• Distribute load across multiple endpoints: ensure committed PTU capacity is exhausted before falling back to the pay-as-you-go (PAYG) instance.

Provisioned Throughput Units (PTU)
• Let you specify the amount of throughput required for a model deployment
• Granted to a subscription as quota
• Quota is region-specific and defines the maximum number of PTUs that can be assigned to deployments in that subscription and region
• PTU provides:
  • Predictable performance
  • Allocated processing capacity
  • Cost savings
Understanding costs associated with provisioned throughput units (PTU)

Token Metrics Emitting
• Sends token usage metrics to Application Insights
• Provides an overview of Azure OpenAI model utilization across multiple applications or API consumers
GenAI Gateway Capabilities in Azure API Management
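
To make the idea of per-consumer token metrics concrete, here is a minimal Python sketch of the aggregation a dashboard would perform over emitted records. It is not the APIM policy or the Application Insights API; the `TokenUsage` record, the `consumer` dimension, and `aggregate_by_consumer` are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """Token counts reported for one completed model call (hypothetical record)."""
    consumer: str            # assumed dimension, e.g. an app or subscription name
    prompt_tokens: int
    completion_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


def aggregate_by_consumer(records):
    """Sum prompt, completion, and total tokens per consumer."""
    totals = defaultdict(lambda: {"prompt": 0, "completion": 0, "total": 0})
    for r in records:
        t = totals[r.consumer]
        t["prompt"] += r.prompt_tokens
        t["completion"] += r.completion_tokens
        t["total"] += r.total_tokens
    return dict(totals)


usage = [
    TokenUsage("app-a", 120, 80),
    TokenUsage("app-b", 40, 10),
    TokenUsage("app-a", 30, 70),
]
report = aggregate_by_consumer(usage)
```

Grouping by a consumer dimension is what lets one shared model deployment be reported on per application.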

Token Rate Limiting
• Manages and enforces limits per API consumer based on token usage
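
The limit described above can be sketched as a fixed-window tokens-per-minute counter kept per consumer. This is a conceptual Python illustration only, not how API Management implements its token limit policy; `TokenRateLimiter`, `try_consume`, and the explicit `now` parameter are assumptions made so the example is deterministic.

```python
class TokenRateLimiter:
    """Fixed-window tokens-per-minute limiter with one counter per consumer."""

    def __init__(self, tokens_per_minute: int):
        self.tokens_per_minute = tokens_per_minute
        self._windows = {}  # consumer -> (window index, tokens used in that window)

    def try_consume(self, consumer: str, tokens: int, now: float) -> bool:
        """Return True and record the usage if the request fits in this minute."""
        window = int(now // 60)
        start, used = self._windows.get(consumer, (window, 0))
        if start != window:
            # A new minute has begun, so the counter starts again from zero.
            start, used = window, 0
        if used + tokens > self.tokens_per_minute:
            self._windows[consumer] = (start, used)
            return False  # a gateway would typically reject the call here
        self._windows[consumer] = (start, used + tokens)
        return True


limiter = TokenRateLimiter(tokens_per_minute=1000)
a1 = limiter.try_consume("app-a", 600, now=0.0)    # fits in the window
a2 = limiter.try_consume("app-a", 500, now=10.0)   # would exceed 1000
b1 = limiter.try_consume("app-b", 500, now=10.0)   # separate counter
a3 = limiter.try_consume("app-a", 500, now=61.0)   # next minute, counter reset
```

Because each consumer has its own counter, one busy application cannot use up the quota available to the others.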

/demo
Load Balanced Pool & Circuit Breaker

Load Balanced Pool and Circuit Breaker
• Helps spread load across multiple Azure OpenAI endpoints
• Round-robin, weighted, or priority-based load distribution strategies
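
As a rough illustration of priority-based distribution combined with a circuit breaker, the Python sketch below always prefers the lowest-priority-number backend and temporarily skips a backend after repeated failures. It is a simplified model under stated assumptions, not API Management's backend pool implementation; `Backend`, `pick_backend`, the failure threshold, and the cooldown values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    priority: int                  # lower number is tried first
    failure_threshold: int = 3     # failures before the circuit opens (assumed value)
    cooldown_seconds: float = 30.0 # how long the circuit stays open (assumed value)
    failures: int = 0
    open_until: float = 0.0        # backend is skipped until this time

    def available(self, now: float) -> bool:
        return now >= self.open_until

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Open the circuit: stop sending traffic for the cooldown period.
            self.open_until = now + self.cooldown_seconds
            self.failures = 0

    def record_success(self) -> None:
        self.failures = 0


def pick_backend(pool, now: float):
    """Return the available backend with the lowest priority number, or None."""
    candidates = [b for b in pool if b.available(now)]
    return min(candidates, key=lambda b: b.priority) if candidates else None


ptu = Backend("ptu", priority=1)
payg = Backend("payg", priority=2)
pool = [ptu, payg]

first = pick_backend(pool, now=0.0).name       # PTU is preferred
for _ in range(3):
    ptu.record_failure(now=0.0)                # e.g. repeated throttling responses
fallback = pick_backend(pool, now=1.0).name    # circuit open, fall back to PAYG
recovered = pick_backend(pool, now=31.0).name  # cooldown elapsed, PTU again
```

This mirrors the goal stated earlier in the deck: use committed PTU capacity first and fall back to the PAYG instance only when the PTU backend is unavailable.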

Semantic Caching
• Optimize Token usage by leveraging semantic caching
• Stores completions for prompts with similar meanings
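
The lookup behind semantic caching can be sketched as a nearest-neighbour search over prompt embeddings with a similarity threshold. The Python example below is a toy model, not the API Management caching policy: `toy_embed` is a bag-of-words stand-in for a real embedding model, and `SemanticCache` and the 0.8 threshold are hypothetical choices for illustration.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector (NOT a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Returns a stored completion when a new prompt is similar enough."""

    def __init__(self, embed, threshold: float):
        self.embed = embed
        self.threshold = threshold
        self._entries = []  # list of (embedding, completion) pairs

    def store(self, prompt: str, completion: str) -> None:
        self._entries.append((self.embed(prompt), completion))

    def lookup(self, prompt: str):
        vec = self.embed(prompt)
        best = max(self._entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]  # cache hit: no model call, no tokens spent
        return None         # cache miss: caller forwards the prompt to the model


cache = SemanticCache(toy_embed, threshold=0.8)
cache.store("what is the capital of france", "Paris")
hit = cache.lookup("what is the capital city of france")
miss = cache.lookup("how do I reset my password")
```

The threshold is the main tuning knob: a higher value reduces the risk of returning a completion for a prompt that only looks similar, at the cost of fewer cache hits.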

Summary
• Track token usage across multiple applications
  • Emit Token Metrics policy
• Ensure a single app doesn't consume the whole TPM quota
  • Token Limit policy
• Secure API keys across multiple applications
  • Subscription keys
• Distribute load across multiple endpoints
  • Backend pool load balancing and circuit breaker

Resources
• Azure OpenAI Gateway topologies
• Azure OpenAI Token Limit Policy
• LLM Token Limit Policy
• Azure OpenAI Emit Token Metric Policy
• LLM Emit Token Metric Policy
• Houssem Dellai YouTube videos
• GenAI Labs
• Designing and implementing GenAI gateway solution

Nilesh Gule
ARCHITECT | MICROSOFT MVP
“Code with Passion and
Strive for Excellence”
nileshgule @nileshgule Nilesh Gule
NileshGule
www.handsonarchitect.com
https://www.youtube.com/@nilesh-gule

Source Code & slide deck
Nilesh Gule fork - GenAI Labs
https://github.com/NileshGule/AI-Gateway
GenAI Labs
https://aka.ms/apim/genai/labs
https://speakerdeck.com/nileshgule/
https://www.slideshare.net/nileshgule/
