A technical reference for sending images, audio, video, and documents to multimodal models through TheRouter. Covers all five delivery methods, provider support matrices, size limits, and practical code examples.
LLM providers support different ways to reference a resource (an image, audio clip, video, or document) within a chat message. Understanding which methods each provider supports is critical for building robust multimodal applications.
The resource is base64-encoded and embedded directly in the request body as a data URI or structured field. This is the most universally supported method — every provider that accepts multimodal input supports base64.
"image_url": {
"url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
}Pros: Works everywhere, no external dependency, no authentication needed for the resource itself.
Cons: Inflates request payload by ~33% (base64 encoding overhead), adds latency, hits body size limits faster.
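The ~33% figure comes from base64 emitting 4 output characters for every 3 input bytes. A minimal helper for building the data URI (the function name and default MIME type are illustrative):

```python
import base64

def to_data_uri(raw: bytes, mime: str = "image/jpeg") -> str:
    # base64 produces 4 chars per 3 input bytes, so the payload grows by ~33%
    return f"data:{mime};base64," + base64.b64encode(raw).decode("ascii")
```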
The resource URL is passed directly and the provider fetches it server-side. The URL must be publicly accessible: no authentication, and no short-lived signed URLs (e.g. private S3) that expire before the provider fetches them.
"image_url": {
"url": "https://cdn.example.com/photo.jpg"
}Pros: Minimal request size, no encoding overhead, easy for images already hosted publicly.
Cons: Not supported by Amazon Bedrock (hard requirement: base64 or S3 only). Provider fetches add latency. URL must remain accessible until the request is processed.
Upload a file once via a provider's Files API, receive a file_id, then reference that ID in subsequent requests. Avoids re-transmitting the same resource in multi-turn conversations.
```json
// After uploading: POST /v1/files
"image_url": {
  "url": "file-abc123"
}
```

Pros: Efficient for repeated use of the same resource across multiple turns. Anthropic Files API files persist indefinitely (no expiry). OpenAI files persist until deleted.
Cons: Requires a separate upload step. Anthropic Files API is currently in beta (requires anthropic-beta: files-api-2025-04-14 header). Not available on Bedrock or Vertex AI channels. Gemini Files API files expire after 48 hours.
Reference a file in a cloud storage bucket by its native URI. Each provider uses their own storage ecosystem: Amazon S3 for Bedrock, Google Cloud Storage for Gemini.
```json
// Amazon S3 (Bedrock)
"source": { "s3Location": { "uri": "s3://my-bucket/image.jpg" } }

// Google Cloud Storage (Gemini)
"fileData": { "mimeType": "image/jpeg", "fileUri": "gs://my-bucket/image.jpg" }
```

Pros: Eliminates base64 encoding overhead. Ideal for large files (video, multi-page PDFs) where inline base64 would exceed payload limits. No expiry (unlike the Gemini Files API).
Cons: Requires pre-provisioned cloud storage in the provider's ecosystem with appropriate IAM permissions. Not portable across providers. Adds operational complexity.
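For the Bedrock path, the S3 URI slots into the `source` field shown above, wrapped in a Converse-style `image` block. A sketch; the helper name is illustrative and the surrounding field shapes follow the snippet above:

```python
def s3_image_block(bucket: str, key: str, fmt: str = "jpeg") -> dict:
    # Builds a Bedrock-style image block with an s3Location source,
    # matching the "source" snippet above.
    return {
        "image": {
            "format": fmt,
            "source": {"s3Location": {"uri": f"s3://{bucket}/{key}"}},
        }
    }
```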
Only supported by Google Gemini for video input. Pass a public YouTube video URL directly; Gemini processes the video without requiring download or upload.
"fileData": {
"mimeType": "video/youtube",
"fileUri": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
}Pros: Zero-cost delivery for publicly available videos. No storage required. Up to 10 videos per request (Gemini 2.5+).
Cons: Gemini-only. Public videos only — private or unlisted videos are rejected. Cannot be used in batch mode.
Which delivery methods each provider accepts, by resource type:
Images:

| Provider | Base64 | HTTP URL | File Upload (file_id) | Cloud Storage URI | Max Size | Formats |
|---|---|---|---|---|---|---|
| OpenAI | YES | YES | YES (Files API) | NO | 20 MB | JPEG, PNG, WebP, GIF |
| Anthropic | YES | YES | YES (beta) | NO | 5 MB | JPEG, PNG, WebP, GIF |
| Google Gemini | YES | YES | YES (Files API) | YES (GCS) | 20 MB inline / 2 GB Files API | JPEG, PNG, WebP, HEIC, HEIF |
| Amazon Bedrock | YES | NO | NO | YES (S3) | 3.75 MB | JPEG, PNG, WebP, GIF |
| xAI Grok | YES | YES | NO | NO | 20 MiB | JPEG, PNG |
| Mistral (Pixtral) | YES | YES | NO | NO | ~50 MB | JPEG, PNG, WebP, GIF |
| Cohere (Command A) | YES | YES | NO | NO | 20 MB total | JPEG, PNG, WebP, GIF |

Audio:

| Provider | Base64 | HTTP URL | File Upload | Max Size | Formats |
|---|---|---|---|---|---|
| OpenAI (Chat Completions) | YES | NO | NO | ~20 MB | WAV, MP3 |
| OpenAI (Whisper API) | NO | NO | YES (multipart) | 25 MB | MP3, MP4, MPEG, MPGA, WAV, WebM, M4A |
| Google Gemini | YES | YES | YES (Files API) | 20 MB inline / 2 GB Files API | WAV, MP3, AIFF, AAC, OGG, FLAC |
| Anthropic | NO | NO | NO | — | None |
| Amazon Bedrock | Varies | NO | NO | — | Model-dependent |
| xAI, Mistral, Cohere | NO | NO | NO | — | None |

Video:

| Provider | Base64 | HTTP URL | Cloud Storage | YouTube URL | Max Size | Formats |
|---|---|---|---|---|---|---|
| Google Gemini | YES (<100 MB) | YES (Files API) | YES (GCS) | YES | 100 MB inline / 20 GB Files API | MP4, MOV, WebM, AVI, FLV, MPEG, WMV, 3GPP |
| Amazon Bedrock | YES | NO | YES (S3) | NO | S3 recommended for large files | MKV, MOV, MP4, WebM, FLV, MPEG, MPG, WMV, 3GP |
| OpenAI, Anthropic, xAI, Mistral, Cohere | NO | NO | NO | NO | — | None |

Documents:

| Provider | Base64 | HTTP URL | File Upload (file_id) | Max Size | Formats |
|---|---|---|---|---|---|
| OpenAI | NO (file_id only) | NO | YES (Files API) | 512 MB per file | PDF, DOCX, PPTX, XLSX, TXT, CSV |
| Anthropic | YES | YES | YES (beta) | 32 MB, 100 pages | PDF, TXT |
| Google Gemini | YES | YES | YES (Files API) | 50 MB inline / 2 GB Files API | PDF, TXT, CSV |
| Amazon Bedrock | YES | NO | NO (S3 only) | 4.5 MB default | PDF, CSV, DOC, DOCX, XLS, XLSX, HTML, TXT, MD |
| Mistral (OCR API) | NO (separate endpoint) | NO | NO | 50 MB, 1000 pages | PDF (separate OCR API only) |
| xAI, Cohere | NO | NO | NO | — | None |
OpenAI supports the widest range of resource types and delivery methods. The same format works in both Chat Completions and the Responses API.
- Images: base64 `data:image/...;base64,` URI, HTTP URL, or `file_id` from the Files API.
- Detail levels: `low` (85 tokens flat), `high` (tiled at 512 px, 170 tokens/tile + 85 base), `auto` (default). Use `detail: "low"` to cut image cost to a fixed 85 tokens regardless of resolution.
- Audio: base64 via the `input_audio` content block. HTTP URL is not supported.
- Documents: upload via the Files API and reference by `file_id`. Supports PDF, DOCX, PPTX, XLSX, TXT, CSV (text extraction).
- Very large files can be uploaded in chunks via `/v1/uploads`.

Key quirk: GIF support is limited to the first frame only. The `detail` parameter defaults to `auto`, which selects `low` for images under 512×512 px and `high` for larger images.
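A rough cost estimator for the tiling scheme above. This simplifies the provider's actual algorithm, which first rescales large images, so treat it as an upper-bound sketch:

```python
import math

def openai_image_tokens(width: int, height: int, detail: str = "auto") -> int:
    # auto selects low for images under 512x512, high otherwise
    if detail == "auto":
        detail = "low" if width < 512 and height < 512 else "high"
    if detail == "low":
        return 85  # flat cost regardless of resolution
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles  # base cost plus per-tile cost
```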
Anthropic has strict validation rules but supports both base64 and HTTP URLs for images and PDFs natively, without requiring a separate upload step.
- PDFs: base64, HTTP URL, or `file_id` (beta) using the `document` content block. Up to 32 MB and 100 pages. Claude 3.5+ only.
- Files API (beta): requires the header `anthropic-beta: files-api-2025-04-14`. Supports PDF, plain text, and images, up to 500 MB per file. Files persist until deleted (no expiry). Not available via the Bedrock or Vertex AI channels.
- Image token cost: approximately (width × height) / 750; a 1000×1000 image costs ~1,334 tokens.

Key quirk: The declared `media_type` must exactly match the actual file bytes. Declaring `image/jpeg` while sending PNG bytes causes a hard 400 error. Always detect the MIME type from file content, not the file extension.
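The token estimate above can be expressed directly (rounding up; a sketch of the published approximation, not the provider's exact tokenizer):

```python
import math

def anthropic_image_tokens(width: int, height: int) -> int:
    # tokens ≈ (width × height) / 750, per the estimate above
    return math.ceil(width * height / 750)
```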
Gemini has the most flexible resource delivery across all types. It is the only provider that natively supports video input and YouTube URLs.
- GCS delivery: the bucket must grant the Storage Object Viewer IAM role to `service-{project}@gcp-sa-aiplatform.iam.gserviceaccount.com`.
- Video cost: use the `media_resolution` parameter (`low`, `medium`, `high`) to trade quality for cost. A 10-minute video at default resolution costs ~$0.63 at Gemini 1.5 Pro pricing.

Key quirk: The 20 MB inline limit applies to the total request payload after base64 encoding, which adds ~33% overhead. A 15 MB raw file becomes ~20 MB base64-encoded. Use the Files API for files over ~14 MB raw.
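That size arithmetic can be captured in a small chooser (function name and return labels are illustrative):

```python
def gemini_delivery(raw_bytes: int, inline_limit: int = 20 * 1024 ** 2) -> str:
    # The 20 MB inline cap applies AFTER base64 encoding (4/3 expansion),
    # so only ~14-15 MB of raw bytes fit inline.
    encoded = -(-raw_bytes * 4 // 3)  # ceil(raw * 4 / 3), ignoring padding
    return "inline-base64" if encoded <= inline_limit else "files-api"
```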
Bedrock is the most restrictive provider: HTTP URLs are completely unsupported. All media must be base64-encoded inline or referenced via S3 URI.
- Media delivery: base64 bytes in the `source.bytes` field, or an S3 URI via `source.s3Location.uri`.
- A `text` block is required in the same `content` array; omitting it causes a 400 error.

Key quirk: HTTP URL delivery is not implemented and there is no announced timeline for support. TheRouter automatically proxies HTTP image URLs to base64 before forwarding requests to Bedrock; this is handled transparently.
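A sketch of a Converse-style message that satisfies the mandatory-text rule; the field shapes follow the snippets above (note that boto3's Converse API typically accepts raw bytes in `source.bytes`, while the raw HTTP form expects base64):

```python
def bedrock_message(prompt: str, image_bytes: bytes, fmt: str = "jpeg") -> dict:
    # The text block is mandatory: media without text returns a 400.
    return {
        "role": "user",
        "content": [
            {"text": prompt},
            {"image": {"format": fmt, "source": {"bytes": image_bytes}}},
        ],
    }
```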
Grok supports only images, and only via base64 or HTTP URL. No audio, video, document, or file upload support.
- Images use the same `image_url` content block as OpenAI. The `detail` parameter may be ignored.

Key quirk: No Files API. Every request must include the full image data inline or via URL; there is no pre-upload caching.
Vision support is limited to Pixtral models (pixtral-12b, pixtral-large-2411). PDF handling is a separate OCR API product, not integrated into the chat completions endpoint.
- Images: base64 or HTTP URL via the `image_url` field. Pixtral Large handles up to 30 high-res images (128k context window).
- PDFs: handled by a dedicated OCR model (`mistral-ocr-2503`), not via chat completions. Priced at $2 per 1,000 pages. Not routed through TheRouter's chat completions endpoint.

Key quirk: RGBA PNG files cause a decode error. Convert to RGB before encoding (remove the alpha channel). This is a common issue when screenshots or image-editing tools output PNG with transparency.
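A helper pair for the RGBA quirk: detect the alpha channel from the PNG header, then flatten with Pillow. The Pillow dependency and the helper names are assumptions; the header offsets follow the PNG specification:

```python
def png_has_alpha(data: bytes) -> bool:
    # IHDR color type sits at byte offset 25: 6 = RGBA, 4 = grayscale + alpha
    return data[:8] == b"\x89PNG\r\n\x1a\n" and len(data) > 25 and data[25] in (4, 6)

def flatten_to_rgb(png_bytes: bytes) -> bytes:
    # Pillow assumed installed; drops the alpha channel that trips the decoder
    from io import BytesIO
    from PIL import Image
    img = Image.open(BytesIO(png_bytes)).convert("RGB")
    out = BytesIO()
    img.save(out, format="PNG")
    return out.getvalue()
```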
Vision support was added in command-a-03-2025. Earlier Command R/R+ models do not support image input.
- Images: OpenAI-compatible `image_url` format. Up to 20 MB total per request, max 20 images.
- Token cost: `low` detail = 256 tokens per image; `high` detail = 256 tokens per 512×512 tile + 256 base.
- Documents: text-only RAG via the `documents` parameter; not multimodal, no file upload.

Sending an image by public HTTP URL:

```shell
# cURL
curl https://api.therouter.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${THEROUTER_API_KEY}" \
  -d '{
    "model": "openai/gpt-4.1",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "What is in this image?" },
          {
            "type": "image_url",
            "image_url": { "url": "https://example.com/photo.jpg" }
          }
        ]
      }
    ],
    "max_tokens": 512
  }'
```
The same request in Python:

```python
# Python (httpx)
import os

import httpx

api_key = os.environ["THEROUTER_API_KEY"]

response = httpx.post(
    "https://api.therouter.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "model": "anthropic/claude-sonnet-4",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is in this image?"},
                    {
                        "type": "image_url",
                        "image_url": {"url": "https://example.com/photo.jpg"},
                    },
                ],
            }
        ],
        "max_tokens": 512,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```

Sending a local image as base64:

```python
import base64

import httpx

with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = httpx.post(
    "https://api.therouter.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "model": "google/gemini-2.5-flash",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in detail."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                    },
                ],
            }
        ],
        "max_tokens": 512,
    },
)
```

Sending a PDF as a base64 data URI:

```python
import base64

import httpx

with open("report.pdf", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = httpx.post(
    "https://api.therouter.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "model": "anthropic/claude-sonnet-4",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Summarize the key findings in this report.",
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:application/pdf;base64,{b64}"
                        },
                    },
                ],
            }
        ],
        "max_tokens": 1024,
    },
)
```

Note: TheRouter normalizes the PDF into the native Anthropic `document` content block format transparently. You send the same `image_url` shape as with images; the gateway handles the conversion.
Sending audio input to an audio-capable model:

```python
import base64

import httpx

with open("question.mp3", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = httpx.post(
    "https://api.therouter.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "model": "openai/gpt-audio-1.5",
        "modalities": ["text", "audio"],
        "audio": {"voice": "alloy", "format": "mp3"},
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {"data": b64, "format": "mp3"},
                    }
                ],
            }
        ],
    },
)
print(response.json()["choices"][0]["message"]["content"])
```

Sending multiple images in a single request:

```python
import base64

import httpx

def to_b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

response = httpx.post(
    "https://api.therouter.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "model": "anthropic/claude-opus-4.6",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Compare these two product designs."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{to_b64('design_v1.png')}"
                        },
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{to_b64('design_v2.png')}"
                        },
                    },
                ],
            }
        ],
        "max_tokens": 1024,
    },
)
```

TheRouter normalizes resource delivery across providers transparently. You write your request once using the standard `image_url` content block format, and the gateway handles provider-specific conversions:
- PDFs to Anthropic: a `data:application/pdf;base64,...` `image_url` block is converted to Anthropic's native `document` content block format with `source.type: "base64"`.
- Images to Anthropic: the `image_url` block is translated to Anthropic's `image` content block with the appropriate `source.type` (`"url"` or `"base64"`).
- Bedrock formats: a `jpg` extension is normalized to `jpeg` for Bedrock's format enum. Non-allowed formats are rejected with a clear error.
- Vertex AI: base64 payloads are converted to `inlineData` blocks for Vertex AI requests.
- Capability validation: requests are checked against each model's `modality.input` capabilities. Mismatches return a 400 error with a suggestion of supported models.

Check model capabilities at the `/v1/models` endpoint: each model includes an `architecture.features` array with values like `vision`, `audio_input`, and `pdf`.
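A convenience filter over the `/v1/models` listing described above; the response shape is assumed from the `architecture.features` description:

```python
def models_with_feature(models: list[dict], feature: str) -> list[str]:
    # Keep model IDs whose architecture.features contains the capability flag.
    return [
        m["id"]
        for m in models
        if feature in m.get("architecture", {}).get("features", [])
    ]
```

For example, `models_with_feature(listing["data"], "vision")` would select vision-capable models (the top-level `"data"` key is an assumption).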
detail: "low" for images that only need rough visual understanding (logos, charts, diagrams). This reduces token cost from hundreds to a flat 85 tokens regardless of image size.media_resolution: "low" to reduce token cost from ~300 to ~100 tokens per second.image/jpeg for JPEG files, image/png for PNG — never derive the MIME type from the file extension alone. Use a library like python-magic or the filetype package to detect MIME type from file bytes.file_id values for long-lived references with Gemini.content array must include at least one text block. Sending a document without an accompanying text prompt returns a 400 error.