Multimodal AI
debt(d9/e3/b5/t7)
Closest to 'silent in production until users hit it' (d9). The detection_hints field explicitly states automated detection is 'no', and the code patterns (unresized images, vague prompts) produce no compiler or linter errors. Misuse only surfaces when users encounter incorrect counts, missed spatial relationships, or unexpected latency/cost spikes in production — invisible until real workloads expose it.
Closest to 'simple parameterised fix (replace pattern with safer alternative)' (e3). The quick_fix summary specifies resize images to under 1000px, add structured prompts, and validate output — a small, repeatable pattern fix that touches the API call site and prompt template, not a cross-cutting refactor. Slightly worse than e1 because it involves three coordinated changes (resize, prompt rewrite, output validation), but all localised to the integration layer.
Closest to 'persistent productivity tax' (b5). Multimodal AI applies across web, CLI, and queue-worker contexts per applies_to, meaning any team feature touching images must understand the resizing, prompting, and validation constraints. The non-deterministic failure modes (miscounting, spatial misreading) impose an ongoing review and validation burden on every image-handling feature, slowing multiple work streams without fully defining the system's architecture.
Closest to 'serious trap (contradicts how a similar concept works elsewhere)' (t7). The misconception field directly captures the trap: developers expect models to 'see' images the way humans do — a natural analogy that contradicts the actual statistical token-patch mechanism. This causes confident misuse (assuming accurate object counts, spatial precision) that is directly contradicted by how similar perception-focused tools (e.g. dedicated object-detection models) actually behave, making it a serious and recurring cognitive pitfall.
Also Known As
TL;DR
Explanation
Traditional AI models are unimodal: an image classifier takes images as input and returns labels; a language model takes text and returns text. Multimodal models unify multiple modalities within a single model, allowing them to reason across modality boundaries. Vision-language models (VLMs) such as GPT-4o, Claude 3, and Gemini accept a combination of text and images in a single prompt and respond in text. More advanced systems add audio input/output or video understanding.

Technically, multimodal models embed each modality into a shared latent space so the model can attend across them: an image patch and a text token become comparable vectors.

For software engineers, multimodal APIs introduce new concerns: higher prompt costs (images are billed as tokens, with the count derived from resolution), added latency (image encoding takes processing time), handling of non-text content in context windows, and use-case validation (vision models hallucinate image details just as LLMs hallucinate text).

Common use cases: document understanding, visual question answering, screenshot-to-code, accessibility description generation, and product image analysis.
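To make the resolution-to-cost relationship concrete, here is a minimal sketch of a pre-flight token estimate. The function name `estimateImageTokens` is hypothetical, and the default divisor of 750 pixels per token is an assumption based on the rule of thumb Anthropic documents for Claude; other providers use different (often tile-based) formulas, so treat the result as an order-of-magnitude guide rather than a billing figure.

```php
<?php
declare(strict_types=1);

/**
 * Rough token estimate for a single image (hypothetical helper).
 *
 * ASSUMPTION: cost scales with pixel area. The default of 750 pixels per
 * token mirrors Anthropic's published rule of thumb for Claude; check your
 * own provider's documentation before relying on it.
 */
function estimateImageTokens(int $width, int $height, int $pixelsPerToken = 750): int
{
    if ($width <= 0 || $height <= 0 || $pixelsPerToken <= 0) {
        throw new InvalidArgumentException('Dimensions and divisor must be positive.');
    }

    // Round up: a partially filled token is still billed as a whole token.
    return (int) ceil(($width * $height) / $pixelsPerToken);
}
```

Under this assumption a 3840×2160 frame estimates to about 11,060 tokens, while the same frame downscaled to 1000×563 estimates to about 751. Some providers downscale oversized uploads before billing, so the first figure is best read as an upper bound.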
Diagram
flowchart LR
    subgraph Inputs
        TEXT[Text prompt]
        IMG[Image]
        AUD[Audio]
    end
    subgraph Encoder
        TE[Text encoder]
        IE[Image patch encoder]
        AE[Audio encoder]
    end
    subgraph SharedLatentSpace
        ATTN[Cross-modal attention<br/>transformer layers]
    end
    subgraph Outputs
        OTEXT[Text response]
        OIMG[Generated image]
    end
    TEXT --> TE --> ATTN
    IMG --> IE --> ATTN
    AUD --> AE --> ATTN
    ATTN --> OTEXT & OIMG
    style ATTN fill:#0d419d,color:#fff
Common Misconception
Why It Matters
Common Mistakes
- Sending high-resolution images without downscaling — a 4K image can cost thousands of tokens and add hundreds of milliseconds of latency.
- Assuming vision models count objects accurately — they frequently miscount items in dense or overlapping scenes.
- Not specifying the image context in the text prompt — the model cannot reliably infer what aspect of the image is relevant without guidance.
- Using multimodal models for tasks that require pixel-perfect accuracy (e.g. reading small text from low-res screenshots) without validating output against a structured parser.
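The downscaling advice in the first item can be sketched as a small dimension calculator that runs before any image library is involved. The function name `fitWithin` and the default limit of 1000 pixels are illustrative assumptions; the point is only that the longest side is capped while the aspect ratio is preserved, so the image is not distorted.

```php
<?php
declare(strict_types=1);

/**
 * Returns [width, height] scaled so the longest side is at most $maxSide.
 * Aspect ratio is preserved; images already within the limit are unchanged.
 * Hypothetical helper: pass the result to whatever image library you use.
 */
function fitWithin(int $width, int $height, int $maxSide = 1000): array
{
    if ($width <= 0 || $height <= 0 || $maxSide <= 0) {
        throw new InvalidArgumentException('Dimensions and limit must be positive.');
    }

    $longest = max($width, $height);
    if ($longest <= $maxSide) {
        return [$width, $height]; // never upscale
    }

    $scale = $maxSide / $longest;

    // max(1, ...) guards against extreme aspect ratios rounding a side to 0.
    return [
        max(1, (int) round($width * $scale)),
        max(1, (int) round($height * $scale)),
    ];
}
```

For example, `fitWithin(4000, 3000)` yields `[1000, 750]`, and an 800×600 image is returned unchanged.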
Avoid When
- Tasks that require pixel-perfect accuracy (e.g. reading barcodes, extracting precise table data from low-quality scans) — use dedicated OCR or structured parsers instead.
- Sending full-resolution images without understanding the token cost and latency implications.
When To Use
- Use multimodal models to combine document text and layout understanding in a single call instead of a separate OCR-then-NLP pipeline.
- Generate accessibility alt text for product images automatically at upload time.
- Let users ask natural-language questions about charts, screenshots, or diagrams in your application.
- Always resize images and write specific, structured prompts to control cost, latency, and output quality.
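As one illustration of the alt-text use case, the sketch below builds a specific, structured request instead of a vague "describe this" prompt. Everything here is an assumption for illustration: `buildAltTextRequest` is a hypothetical helper, the message shape simply mirrors the generic `$llm->complete()` example in this entry rather than any particular vendor's API, and the 125-character limit is an arbitrary default.

```php
<?php
declare(strict_types=1);

/**
 * Builds a two-part message (image + instruction) for alt-text generation.
 * Hypothetical helper: the array shape follows this entry's own example
 * client, not a specific provider SDK.
 */
function buildAltTextRequest(string $base64Jpeg, string $productName, int $maxChars = 125): array
{
    $instruction = sprintf(
        'Write alt text for this product photo of "%s". '
        . 'Describe only what is clearly visible, in at most %d characters. '
        . 'Do not guess brand names, prices, or quantities. '
        . 'Return the alt text only, with no preamble.',
        $productName,
        $maxChars
    );

    return [
        ['type' => 'image', 'data' => $base64Jpeg, 'media_type' => 'image/jpeg'],
        ['type' => 'text', 'text' => $instruction],
    ];
}
```

Naming the product and forbidding guesses narrows what the model attends to, and the explicit length limit gives the caller something simple to validate before the text is stored.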
Code Examples
// Avoid: sending a full-resolution image with a vague prompt
$imageData = base64_encode(file_get_contents('/uploads/photo.jpg')); // may be 4MB+
$response = $llm->complete([
    ['type' => 'image', 'data' => $imageData],
    ['type' => 'text', 'text' => 'What is in this image?'], // vague prompt
]);

// Prefer: resize before sending, and be specific in the prompt.
// Image and $llm are app-level helpers (not shown); resize() is assumed to fit
// the image inside the given box while preserving aspect ratio.
$image = Image::load('/uploads/photo.jpg')
    ->resize(800, 600) // keeps the longest side under ~1000px
    ->toBase64Jpeg(quality: 80);
$response = $llm->complete([
    [
        'type' => 'image',
        'data' => $image,
        'media_type' => 'image/jpeg',
    ],
    [
        'type' => 'text',
        'text' => 'List the product names and prices visible in this invoice image. '
            . 'Return as JSON: [{"name": string, "price": number}]. '
            . 'If a value is unclear, return null for that field.',
    ],
]);

// Validate structured output: json_decode() returns null on malformed JSON,
// which fails the is_array() check along with scalar results.
$items = json_decode($response->text, true);
if (!is_array($items)) {
    throw new UnexpectedLlmOutputException($response->text);
}
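The `is_array()` check above only confirms that the response decoded to some array. A slightly stricter sketch, shown below, also checks that every item matches the shape the prompt asked for. The function name `parseInvoiceItems` is hypothetical, and it throws PHP's built-in `UnexpectedValueException` (and `JsonException` on malformed JSON) rather than the app-specific exception used above.

```php
<?php
declare(strict_types=1);

/**
 * Decodes and shape-checks the invoice JSON requested in the prompt:
 * a list of {"name": string|null, "price": number|null}.
 * Hypothetical helper; requires PHP 8.1+ for array_is_list().
 */
function parseInvoiceItems(string $raw): array
{
    // JSON_THROW_ON_ERROR raises JsonException instead of returning null.
    $items = json_decode(trim($raw), true, 512, JSON_THROW_ON_ERROR);

    if (!is_array($items) || !array_is_list($items)) {
        throw new UnexpectedValueException('Expected a JSON array of items.');
    }

    foreach ($items as $i => $item) {
        if (!is_array($item)
            || !array_key_exists('name', $item)
            || !array_key_exists('price', $item)) {
            throw new UnexpectedValueException("Item $i is missing name or price.");
        }
        if ($item['name'] !== null && !is_string($item['name'])) {
            throw new UnexpectedValueException("Item $i: name must be a string or null.");
        }
        if ($item['price'] !== null && !is_int($item['price']) && !is_float($item['price'])) {
            throw new UnexpectedValueException("Item $i: price must be a number or null.");
        }
    }

    return $items;
}
```

Rejecting a price such as `"9.50"` (a string) may look strict, but it catches the case where the model copied text from the image verbatim instead of returning a number, which is exactly the kind of silent drift that otherwise surfaces only in production.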