Multimodal AI
Also Known As
Vision-language models (VLMs), multimodal LLMs, multimodal foundation models.
TL;DR
Multimodal models accept more than one input type (typically text plus images, sometimes audio or video) in a single prompt and reason across them. The trade-offs are higher token costs, added latency, and visual hallucinations that need the same validation discipline as text output.
Explanation
Traditional AI models are unimodal: an image classifier takes images as input and returns labels; a language model takes text and returns text. Multimodal models unify multiple modalities within a single model, allowing them to reason across modality boundaries. Vision-language models (VLMs) such as GPT-4o, Claude 3, and Gemini accept a combination of text and images in a single prompt and respond in text; more advanced systems add audio input/output or video understanding.
Technically, multimodal models embed each modality into a shared latent space so the model can attend across them: an image patch and a text token become comparable vectors.
For software engineers, multimodal APIs introduce new concerns: larger prompt costs (images are billed by a token count derived from their resolution), higher latency (image encoding adds processing time), handling of non-text content in context windows, and validating that the use case tolerates errors (vision models hallucinate image details just as LLMs hallucinate text).
Common use cases: document understanding, visual question answering, screenshot-to-code, accessibility description generation, and product image analysis.
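Because image tokens scale with resolution, it helps to estimate cost before sending an image. The sketch below assumes a tiling scheme like the one some providers document (a fixed base cost plus a per-512-pixel-tile cost); the constants and the estimateImageTokens helper are illustrative, so check your provider's current pricing rules.

// Rough estimate of image token cost, assuming a provider that bills a base
// amount plus a fixed amount per 512x512 tile (the constants are assumptions;
// substitute your provider's documented values).
function estimateImageTokens(int $width, int $height, int $baseTokens = 85, int $tokensPerTile = 170): int
{
    $tilesAcross = (int) ceil($width / 512);
    $tilesDown   = (int) ceil($height / 512);

    return $baseTokens + $tokensPerTile * $tilesAcross * $tilesDown;
}

[$width, $height] = getimagesize('/uploads/photo.jpg'); // indexes 0 and 1 are width and height

echo estimateImageTokens($width, $height);
// Under these assumptions a 3840x2160 photo spans 8 x 5 = 40 tiles (~6,885 tokens);
// the same image downscaled to 800x600 spans 4 tiles (~765 tokens).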
Diagram
flowchart LR
subgraph Inputs
TEXT[Text prompt]
IMG[Image]
AUD[Audio]
end
subgraph Encoder
TE[Text encoder]
IE[Image patch encoder]
AE[Audio encoder]
end
subgraph SharedLatentSpace
ATTN[Cross-modal attention<br/>transformer layers]
end
subgraph Outputs
OTEXT[Text response]
OIMG[Generated image]
end
TEXT --> TE --> ATTN
IMG --> IE --> ATTN
AUD --> AE --> ATTN
ATTN --> OTEXT & OIMG
style ATTN fill:#0d419d,color:#fff
Common Misconception
That a vision-capable model "sees" an image the way a person does. It encodes the image into tokens and can hallucinate details, miscount objects, and misread small text, so visual answers deserve the same skepticism as any other LLM output.
Why It Matters
Adding an image to a prompt changes a feature's cost, latency, and validation profile: a single high-resolution image can add thousands of tokens and hundreds of milliseconds, and visual claims are harder to verify than text. Engineers need to resize inputs, write specific prompts, and validate structured output to keep multimodal features reliable and affordable.
Common Mistakes
- Sending high-resolution images without downscaling — a 4K image can cost thousands of tokens and add hundreds of milliseconds of latency.
- Assuming vision models count objects accurately — they frequently miscount items in dense or overlapping scenes; a mitigation sketch follows this list.
- Not specifying the image context in the text prompt — the model cannot reliably infer what aspect of the image is relevant without guidance.
- Using multimodal models for tasks that require pixel-perfect accuracy (e.g. reading small text from low-res screenshots) without validating output against a structured parser.
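For the counting mistake specifically, one mitigation sketch (not a provider feature; the file path and $llm client are illustrative) is to ask for an enumeration instead of a number and count the entries client-side:

// Sketch: instead of asking "How many bottles are in this image?", request an
// enumeration and derive the count yourself.
$imageData = base64_encode(file_get_contents('/uploads/shelf.jpg'));

$response = $llm->complete([
    ['type' => 'image', 'data' => $imageData, 'media_type' => 'image/jpeg'],
    [
        'type' => 'text',
        'text' => 'List every bottle visible in this image as JSON: '
            . '[{"label": string, "approx_position": string}]. '
            . 'Include one entry per bottle; do not summarise.',
    ],
]);

$items = json_decode($response->text, true);
$count = is_array($items) ? count($items) : null; // null means "could not verify"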
Avoid When
- Tasks that require pixel-perfect accuracy (e.g. reading barcodes, extracting precise table data from low-quality scans) — use dedicated OCR or structured parsers instead; a routing sketch follows this list.
- Sending full-resolution images without understanding the token cost and latency implications.
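One way to enforce this boundary in code is to route by task type before any model call, so exact extraction never reaches the vision model. The $barcodeReader and $ocr services below are placeholders for whatever dedicated tools you already use; only $llm follows the illustrative client from the Code Examples section.

// Sketch: send pixel-perfect jobs to deterministic decoders, open-ended
// questions to the multimodal model.
function answerImageTask(string $task, string $imagePath, $barcodeReader, $ocr, $llm): string
{
    return match ($task) {
        // Deterministic decoders for exact extraction.
        'barcode' => $barcodeReader->decode($imagePath),
        'table'   => $ocr->extractTable($imagePath)->toCsv(),
        // Everything else goes to the vision model.
        default   => $llm->complete([
            ['type' => 'image', 'data' => base64_encode(file_get_contents($imagePath))],
            ['type' => 'text', 'text' => $task],
        ])->text,
    };
}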
When To Use
- Use multimodal models to combine document text and layout understanding in a single call instead of a separate OCR-then-NLP pipeline.
- Generate accessibility alt text for product images automatically at upload time (a sketch follows this list).
- Let users ask natural-language questions about charts, screenshots, or diagrams in your application.
- Always resize images and write specific, structured prompts to control cost, latency, and output quality.
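As a sketch of the alt-text case, using the same illustrative Image helper and $llm client as the Code Examples section, an upload hook might look like this:

// Sketch: generate accessibility alt text when a product image is uploaded.
function generateAltText(string $path, $llm): string
{
    // Downscale first; alt text does not need full resolution.
    $image = Image::load($path)
        ->resize(512, 512)
        ->toBase64Jpeg(quality: 75);

    $response = $llm->complete([
        ['type' => 'image', 'data' => $image, 'media_type' => 'image/jpeg'],
        [
            'type' => 'text',
            'text' => 'Write one sentence of alt text for this product image. '
                . 'Describe the product, its colour, and any visible text. '
                . 'Do not start with "Image of" or "Picture of".',
        ],
    ]);

    // Keep alt text short; truncate rather than fail if the model is verbose.
    return mb_substr(trim($response->text), 0, 250);
}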
Code Examples
// Anti-pattern: sending a full-resolution image without constraints
$imageData = base64_encode(file_get_contents('/uploads/photo.jpg')); // may be 4MB+
$response = $llm->complete([
    ['type' => 'image', 'data' => $imageData],
    ['type' => 'text', 'text' => 'What is in this image?'], // vague prompt
]);

// Better: resize before sending and be specific in the prompt.
// Image:: stands in for whichever image-manipulation library you use.
$image = Image::load('/uploads/photo.jpg')
    ->resize(800, 600)           // keep under ~1000px on the longest side
    ->toBase64Jpeg(quality: 80);

$response = $llm->complete([
    [
        'type' => 'image',
        'data' => $image,
        'media_type' => 'image/jpeg',
    ],
    [
        'type' => 'text',
        'text' => 'List the product names and prices visible in this invoice image. '
            . 'Return as JSON: [{"name": string, "price": number}]. '
            . 'If a value is unclear, return null for that field.',
    ],
]);

// Validate the structured output before using it
$items = json_decode($response->text, true);
if (!is_array($items)) {
    throw new UnexpectedLlmOutputException($response->text);
}
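The is_array check only rejects completely malformed output. Because the prompt allows null for unclear values, a per-item check (again a sketch against the same hypothetical response shape) keeps non-conforming records out of downstream code:

// Sketch: validate each item against the shape requested in the prompt.
$validated = [];
foreach ($items as $item) {
    $name  = $item['name']  ?? null;
    $price = $item['price'] ?? null;

    if (!is_string($name) || !(is_numeric($price) || $price === null)) {
        continue; // skip entries that do not match the requested schema
    }

    $validated[] = ['name' => $name, 'price' => $price === null ? null : (float) $price];
}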