A technical reference for sending images, audio, video, and documents to multimodal models through TheRouter. Covers all five delivery methods, provider support matrices, size limits, and practical code examples.
LLM providers support different ways to reference a resource (an image, audio clip, video, or document) within a chat message. Understanding which methods each provider supports is critical for building robust multimodal applications.
The resource is base64-encoded and embedded directly in the request body as a data URI or structured field. This is the most universally supported method — every provider that accepts multimodal input supports base64.
"image_url": {
"url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
}Pros: Works everywhere, no external dependency, no authentication needed for the resource itself.
Cons: Inflates request payload by ~33% (base64 encoding overhead), adds latency, hits body size limits faster.
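The ~33% figure comes from base64 emitting 4 output characters for every 3 input bytes. A minimal helper for building the data URI (the function name and default MIME type are illustrative):

```python
import base64

def to_data_uri(raw: bytes, mime: str = "image/jpeg") -> str:
    # base64 produces 4 chars per 3 input bytes, so the payload grows by ~33%
    return f"data:{mime};base64," + base64.b64encode(raw).decode("ascii")
```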
The resource URL is passed directly and the provider fetches it server-side. The URL must be publicly accessible: no authentication, and no short-lived signed URLs (e.g. private S3) that expire before the provider fetches them.
"image_url": {
"url": "https://cdn.example.com/photo.jpg"
}Pros: Minimal request size, no encoding overhead, easy for images already hosted publicly.
Cons: Not supported by Amazon Bedrock (hard requirement: base64 or S3 only). Provider fetches add latency. URL must remain accessible until the request is processed.
Upload a file once via a provider's Files API, receive a file_id, then reference that ID in subsequent requests. Avoids re-transmitting the same resource in multi-turn conversations.
```json
// After uploading: POST /v1/files
"image_url": {
  "url": "file-abc123"
}
```

Pros: Efficient for repeated use of the same resource across multiple turns. Anthropic Files API files persist indefinitely (no expiry). OpenAI files persist until deleted.
Cons: Requires a separate upload step. Anthropic Files API is currently in beta (requires anthropic-beta: files-api-2025-04-14 header). Not available on Bedrock or Vertex AI channels. Gemini Files API files expire after 48 hours.
Reference a file in a cloud storage bucket by its native URI. Each provider uses their own storage ecosystem: Amazon S3 for Bedrock, Google Cloud Storage for Gemini.
```json
// Amazon S3 (Bedrock)
"source": { "s3Location": { "uri": "s3://my-bucket/image.jpg" } }

// Google Cloud Storage (Gemini)
"fileData": { "mimeType": "image/jpeg", "fileUri": "gs://my-bucket/image.jpg" }
```

Pros: Eliminates base64 encoding overhead. Ideal for large files (video, multi-page PDFs) where inline base64 would exceed payload limits. No expiry (unlike the Gemini Files API).
Cons: Requires pre-provisioned cloud storage in the provider's ecosystem with appropriate IAM permissions. Not portable across providers. Adds operational complexity.
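For the Bedrock path, the S3 URI slots into the `source` field shown above, wrapped in a Converse-style `image` block. A sketch; the helper name is illustrative and the surrounding field shapes follow the snippet above:

```python
def s3_image_block(bucket: str, key: str, fmt: str = "jpeg") -> dict:
    # Builds a Bedrock-style image block with an s3Location source,
    # matching the "source" snippet above.
    return {
        "image": {
            "format": fmt,
            "source": {"s3Location": {"uri": f"s3://{bucket}/{key}"}},
        }
    }
```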
Only supported by Google Gemini for video input. Pass a public YouTube video URL directly; Gemini processes the video without requiring download or upload.
"fileData": {
"mimeType": "video/youtube",
"fileUri": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
}Pros: Zero-cost delivery for publicly available videos. No storage required. Up to 10 videos per request (Gemini 2.5+).
Cons: Gemini-only. Public videos only — private or unlisted videos are rejected. Cannot be used in batch mode.
Which delivery methods each provider accepts, by resource type:
Images:

| Provider | Base64 | HTTP URL | File Upload (file_id) | Cloud Storage URI | Max Size | Formats |
|---|---|---|---|---|---|---|
| OpenAI | YES | YES | YES (Files API) | NO | 20 MB | JPEG, PNG, WebP, GIF |
| Anthropic | YES | YES | YES (beta) | NO | 5 MB | JPEG, PNG, WebP, GIF |
| Google Gemini | YES | YES | YES (Files API) | YES (GCS) | 20 MB inline / 2 GB Files API | JPEG, PNG, WebP, HEIC, HEIF |
| Amazon Bedrock | YES | NO | NO | YES (S3) | 3.75 MB | JPEG, PNG, WebP, GIF |
| xAI Grok | YES | YES | NO | NO | 20 MiB | JPEG, PNG |
| Mistral (Pixtral) | YES | YES | NO | NO | ~50 MB | JPEG, PNG, WebP, GIF |
| Cohere (Command A) | YES | YES | NO | NO | 20 MB total | JPEG, PNG, WebP, GIF |

Audio:

| Provider | Base64 | HTTP URL | File Upload | Max Size | Formats |
|---|---|---|---|---|---|
| OpenAI (Chat Completions) | YES | NO | NO | ~20 MB | WAV, MP3 |
| OpenAI (Whisper API) | NO | NO | YES (multipart) | 25 MB | MP3, MP4, MPEG, MPGA, WAV, WebM, M4A |
| Google Gemini | YES | YES | YES (Files API) | 20 MB inline / 2 GB Files API | WAV, MP3, AIFF, AAC, OGG, FLAC |
| Anthropic | NO | NO | NO | — | None |
| Amazon Bedrock | Varies | NO | NO | — | Model-dependent |
| xAI, Mistral, Cohere | NO | NO | NO | — | None |

Video:

| Provider | Base64 | HTTP URL | Cloud Storage | YouTube URL | Max Size | Formats |
|---|---|---|---|---|---|---|
| Google Gemini | YES (<100 MB) | YES (Files API) | YES (GCS) | YES | 100 MB inline / 20 GB Files API | MP4, MOV, WebM, AVI, FLV, MPEG, WMV, 3GPP |
| Amazon Bedrock | YES | NO | YES (S3) | NO | S3 recommended for large files | MKV, MOV, MP4, WebM, FLV, MPEG, MPG, WMV, 3GP |
| OpenAI, Anthropic, xAI, Mistral, Cohere | NO | NO | NO | NO | — | None |

Documents:

| Provider | Base64 | HTTP URL | File Upload (file_id) | Max Size | Formats |
|---|---|---|---|---|---|
| OpenAI | NO (file_id only) | NO | YES (Files API) | 512 MB per file | PDF, DOCX, PPTX, XLSX, TXT, CSV |
| Anthropic | YES | YES | YES (beta) | 32 MB, 100 pages | PDF, TXT |
| Google Gemini | YES | YES | YES (Files API) | 50 MB inline / 2 GB Files API | PDF, TXT, CSV |
| Amazon Bedrock | YES | NO | NO (S3 only) | 4.5 MB default | PDF, CSV, DOC, DOCX, XLS, XLSX, HTML, TXT, MD |
| Mistral (OCR API) | NO (separate endpoint) | NO | NO | 50 MB, 1000 pages | PDF (separate OCR API only) |
| xAI, Cohere | NO | NO | NO | — | None |
OpenAI supports the widest range of resource types and delivery methods. The same format works in both Chat Completions and the Responses API.
- Images: base64 `data:image/...;base64,` URI, HTTP URL, or `file_id` from the Files API.
- Detail levels: `low` (85 tokens flat), `high` (tiled at 512 px, 170 tokens/tile + 85 base), `auto` (default). Use `detail: "low"` to cut image cost to a fixed 85 tokens regardless of resolution.
- Audio: base64 via the `input_audio` content block. HTTP URL is not supported.
- Documents: upload via the Files API and reference by `file_id`. Supports PDF, DOCX, PPTX, XLSX, TXT, CSV (text extraction).
- Very large files can be uploaded in chunks via `/v1/uploads`.

Key quirk: GIF support is limited to the first frame only. The `detail` parameter defaults to `auto`, which selects `low` for images under 512×512 px and `high` for larger images.
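A rough cost estimator for the tiling scheme above. This simplifies the provider's actual algorithm, which first rescales large images, so treat it as an upper-bound sketch:

```python
import math

def openai_image_tokens(width: int, height: int, detail: str = "auto") -> int:
    # auto selects low for images under 512x512, high otherwise
    if detail == "auto":
        detail = "low" if width < 512 and height < 512 else "high"
    if detail == "low":
        return 85  # flat cost regardless of resolution
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles  # base cost plus per-tile cost
```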
Anthropic has strict validation rules but supports both base64 and HTTP URLs for images and PDFs natively, without requiring a separate upload step.
- PDFs: base64, HTTP URL, or `file_id` (beta) using the `document` content block. Up to 32 MB and 100 pages. Claude 3.5+ only.
- Files API (beta): requires the header `anthropic-beta: files-api-2025-04-14`. Supports PDF, plain text, and images, up to 500 MB per file. Files persist until deleted (no expiry). Not available via the Bedrock or Vertex AI channels.
- Image token cost: approximately (width × height) / 750; a 1000×1000 image costs ~1,334 tokens.

Key quirk: The declared `media_type` must exactly match the actual file bytes. Declaring `image/jpeg` while sending PNG bytes causes a hard 400 error. Always detect the MIME type from file content, not the file extension.
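The token estimate above can be expressed directly (rounding up; a sketch of the published approximation, not the provider's exact tokenizer):

```python
import math

def anthropic_image_tokens(width: int, height: int) -> int:
    # tokens ≈ (width × height) / 750, per the estimate above
    return math.ceil(width * height / 750)
```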
Gemini has the most flexible resource delivery across all types. It is the only provider that natively supports video input and YouTube URLs.
- GCS delivery: the bucket must grant the Storage Object Viewer IAM role to `service-{project}@gcp-sa-aiplatform.iam.gserviceaccount.com`.
- Video cost: use the `media_resolution` parameter (`low`, `medium`, `high`) to trade quality for cost. A 10-minute video at default resolution costs ~$0.63 at Gemini 1.5 Pro pricing.

Key quirk: The 20 MB inline limit applies to the total request payload after base64 encoding, which adds ~33% overhead. A 15 MB raw file becomes ~20 MB base64-encoded. Use the Files API for files over ~14 MB raw.
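That size arithmetic can be captured in a small chooser (function name and return labels are illustrative):

```python
def gemini_delivery(raw_bytes: int, inline_limit: int = 20 * 1024 ** 2) -> str:
    # The 20 MB inline cap applies AFTER base64 encoding (4/3 expansion),
    # so only ~14-15 MB of raw bytes fit inline.
    encoded = -(-raw_bytes * 4 // 3)  # ceil(raw * 4 / 3), ignoring padding
    return "inline-base64" if encoded <= inline_limit else "files-api"
```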
Bedrock is the most restrictive provider: HTTP URLs are completely unsupported. All media must be base64-encoded inline or referenced via S3 URI.
- Media delivery: base64 bytes in the `source.bytes` field, or an S3 URI via `source.s3Location.uri`.
- A `text` block is required in the same `content` array; omitting it causes a 400 error.

Key quirk: HTTP URL delivery is not implemented and there is no announced timeline for support. TheRouter automatically proxies HTTP image URLs to base64 before forwarding requests to Bedrock; this is handled transparently.
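A sketch of a Converse-style message that satisfies the mandatory-text rule; the field shapes follow the snippets above (note that boto3's Converse API typically accepts raw bytes in `source.bytes`, while the raw HTTP form expects base64):

```python
def bedrock_message(prompt: str, image_bytes: bytes, fmt: str = "jpeg") -> dict:
    # The text block is mandatory: media without text returns a 400.
    return {
        "role": "user",
        "content": [
            {"text": prompt},
            {"image": {"format": fmt, "source": {"bytes": image_bytes}}},
        ],
    }
```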
Grok supports only images, and only via base64 or HTTP URL. No audio, video, document, or file upload support.
- Images use the same `image_url` content block as OpenAI. The `detail` parameter may be ignored.

Key quirk: No Files API. Every request must include the full image data inline or via URL; there is no pre-upload caching.
Vision support is limited to Pixtral models (pixtral-12b, pixtral-large-2411). PDF handling is a separate OCR API product, not integrated into the chat completions endpoint.
- Images: base64 or HTTP URL via the `image_url` field. Pixtral Large handles up to 30 high-res images (128k context window).
- PDFs: handled by a dedicated OCR model (`mistral-ocr-2503`), not via chat completions. Priced at $2 per 1,000 pages. Not routed through TheRouter's chat completions endpoint.

Key quirk: RGBA PNG files cause a decode error. Convert to RGB before encoding (remove the alpha channel). This is a common issue when screenshots or image-editing tools output PNG with transparency.
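A helper pair for the RGBA quirk: detect the alpha channel from the PNG header, then flatten with Pillow. The Pillow dependency and the helper names are assumptions; the header offsets follow the PNG specification:

```python
def png_has_alpha(data: bytes) -> bool:
    # IHDR color type sits at byte offset 25: 6 = RGBA, 4 = grayscale + alpha
    return data[:8] == b"\x89PNG\r\n\x1a\n" and len(data) > 25 and data[25] in (4, 6)

def flatten_to_rgb(png_bytes: bytes) -> bytes:
    # Pillow assumed installed; drops the alpha channel that trips the decoder
    from io import BytesIO
    from PIL import Image
    img = Image.open(BytesIO(png_bytes)).convert("RGB")
    out = BytesIO()
    img.save(out, format="PNG")
    return out.getvalue()
```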
Vision support was added in command-a-03-2025. Earlier Command R/R+ models do not support image input.
- Images: OpenAI-compatible `image_url` format. Up to 20 MB total per request, max 20 images.
- Token cost: `low` detail = 256 tokens per image; `high` detail = 256 tokens per 512×512 tile + 256 base.
- Documents: text-only RAG via the `documents` parameter; not multimodal, no file upload.

Sending an image by public HTTP URL:

```shell
# cURL
curl https://api.therouter.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${THEROUTER_API_KEY}" \
  -d '{
    "model": "openai/gpt-4.1",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "What is in this image?" },
          {
            "type": "image_url",
            "image_url": { "url": "https://example.com/photo.jpg" }
          }
        ]
      }
    ],
    "max_tokens": 512
  }'
```
The same request in Python:

```python
# Python (httpx)
import os

import httpx

api_key = os.environ["THEROUTER_API_KEY"]

response = httpx.post(
    "https://api.therouter.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "model": "anthropic/claude-sonnet-4",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is in this image?"},
                    {
                        "type": "image_url",
                        "image_url": {"url": "https://example.com/photo.jpg"},
                    },
                ],
            }
        ],
        "max_tokens": 512,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```

Sending a local image as base64:

```python
import base64

import httpx

with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = httpx.post(
    "https://api.therouter.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "model": "google/gemini-2.5-flash",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in detail."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                    },
                ],
            }
        ],
        "max_tokens": 512,
    },
)
```

Sending a PDF as a base64 data URI:

```python
import base64

import httpx

with open("report.pdf", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = httpx.post(
    "https://api.therouter.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "model": "anthropic/claude-sonnet-4",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Summarize the key findings in this report.",
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:application/pdf;base64,{b64}"
                        },
                    },
                ],
            }
        ],
        "max_tokens": 1024,
    },
)
```

Note: TheRouter normalizes the PDF into the native Anthropic `document` content block format transparently. You send the same `image_url` shape as with images; the gateway handles the conversion.
Sending audio input to an audio-capable model:

```python
import base64

import httpx

with open("question.mp3", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = httpx.post(
    "https://api.therouter.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "model": "openai/gpt-audio-1.5",
        "modalities": ["text", "audio"],
        "audio": {"voice": "alloy", "format": "mp3"},
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {"data": b64, "format": "mp3"},
                    }
                ],
            }
        ],
    },
)
print(response.json()["choices"][0]["message"]["content"])
```

Sending multiple images in a single request:

```python
import base64

import httpx

def to_b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

response = httpx.post(
    "https://api.therouter.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "model": "anthropic/claude-opus-4.6",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Compare these two product designs."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{to_b64('design_v1.png')}"
                        },
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{to_b64('design_v2.png')}"
                        },
                    },
                ],
            }
        ],
        "max_tokens": 1024,
    },
)
```

TheRouter normalizes resource delivery across providers transparently. You write your request once using the standard `image_url` content block format, and the gateway handles provider-specific conversions:
- PDFs to Anthropic: a `data:application/pdf;base64,...` `image_url` block is converted to Anthropic's native `document` content block format with `source.type: "base64"`.
- Images to Anthropic: the `image_url` block is translated to Anthropic's `image` content block with the appropriate `source.type` (`"url"` or `"base64"`).
- Bedrock formats: a `jpg` extension is normalized to `jpeg` for Bedrock's format enum. Non-allowed formats are rejected with a clear error.
- Vertex AI: base64 payloads are converted to `inlineData` blocks for Vertex AI requests.
- Capability validation: requests are checked against each model's `modality.input` capabilities. Mismatches return a 400 error with a suggestion of supported models.

Check model capabilities at the `/v1/models` endpoint: each model includes an `architecture.features` array with values like `vision`, `audio_input`, and `pdf`.
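A convenience filter over the `/v1/models` listing described above; the response shape is assumed from the `architecture.features` description:

```python
def models_with_feature(models: list[dict], feature: str) -> list[str]:
    # Keep model IDs whose architecture.features contains the capability flag.
    return [
        m["id"]
        for m in models
        if feature in m.get("architecture", {}).get("features", [])
    ]
```

For example, `models_with_feature(listing["data"], "vision")` would select vision-capable models (the top-level `"data"` key is an assumption).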
detail: "low" for images that only need rough visual understanding (logos, charts, diagrams). This reduces token cost from hundreds to a flat 85 tokens regardless of image size.media_resolution: "low" to reduce token cost from ~300 to ~100 tokens per second.image/jpeg for JPEG files, image/png for PNG — never derive the MIME type from the file extension alone. Use a library like python-magic or the filetype package to detect MIME type from file bytes.file_id values for long-lived references with Gemini.content array must include at least one text block. Sending a document without an accompanying text prompt returns a 400 error.