{
    "slug": "multimodal_ai",
    "term": "Multimodal AI",
    "category": "ai_ml",
    "difficulty": "intermediate",
    "short": "AI models that accept or generate more than one modality (text, images, audio, or video) within a single architecture.",
    "long": "Many earlier AI models are unimodal: an image classifier takes images as input and returns labels; a language model takes text and returns text. Multimodal models handle multiple modalities within a single model, allowing them to reason across modality boundaries. Vision-language models (VLMs) such as GPT-4o, Claude 3, and Gemini can accept a combination of text and images in a single prompt and respond in text. More advanced systems add audio input/output or video understanding. Technically, multimodal models typically embed each modality into a shared latent space so the model can attend across them: an image patch and a text token become comparable vectors. For software engineers, multimodal APIs introduce new concerns: higher prompt costs (images are typically billed by a token count derived from resolution), added latency (image encoding takes processing time), non-text content that consumes context-window space, and the need to validate outputs for each use case (vision models hallucinate image details just as LLMs hallucinate text). Common use cases include document understanding, visual question answering, screenshot-to-code, accessibility description generation, and product image analysis.",
    "aliases": [
        "vision-language model",
        "VLM",
        "multimodal LLM",
        "multi-modal AI"
    ],
    "tags": [
        "ai",
        "llm",
        "vision",
        "multimodal"
    ],
    "misconception": "Multimodal models 'see' images the way humans do. In fact, they encode image patches into token-like vectors and attend over them statistically, so they can miss spatial relationships or miscount objects that are visually obvious.",
    "why_it_matters": "Workflows that previously required separate pipeline stages (OCR, then NLP, then image classification) can often be collapsed into a single multimodal API call. This reduces complexity considerably, but the failure modes are different and less predictable.",
    "common_mistakes": [
        "Sending high-resolution images without downscaling — a 4K image can cost thousands of tokens and add hundreds of milliseconds of latency.",
        "Assuming vision models count objects accurately — they frequently miscount items in dense or overlapping scenes.",
        "Not specifying the image context in the text prompt — the model cannot reliably infer what aspect of the image is relevant without guidance.",
        "Using multimodal models for tasks that require pixel-perfect accuracy (e.g. reading small text from low-res screenshots) without validating output against a structured parser."
    ],
    "when_to_use": [
        "Use multimodal models to combine document text and layout understanding in a single call instead of a separate OCR-then-NLP pipeline.",
        "Generate accessibility alt text for product images automatically at upload time.",
        "Let users ask natural-language questions about charts, screenshots, or diagrams in your application.",
        "Always resize images and write specific, structured prompts to control cost, latency, and output quality."
    ],
    "avoid_when": [
        "Tasks that require pixel-perfect accuracy (e.g. reading barcodes, extracting precise table data from low-quality scans) — use dedicated OCR or structured parsers instead.",
        "Cost- or latency-sensitive paths where full-resolution images would be sent without accounting for their token cost and processing time."
    ],
    "related": [
        "large_language_models",
        "embeddings",
        "ai_hallucination",
        "ai_cost_management",
        "tokenization_llm"
    ],
    "prerequisites": [
        "large_language_models",
        "tokenization_llm",
        "embeddings"
    ],
    "refs": [
        "https://platform.openai.com/docs/guides/vision",
        "https://docs.anthropic.com/en/docs/build-with-claude/vision"
    ],
    "bad_code": "// Sending a full-resolution image without constraints\n$imageData = base64_encode(file_get_contents('/uploads/photo.jpg')); // may be 4MB+\n$response = $llm->complete([\n    ['type' => 'image', 'data' => $imageData],\n    ['type' => 'text',  'text' => 'What is in this image?'] // vague prompt\n]);",
    "good_code": "// Resize before sending, be specific in the prompt.\n// Image and $llm are placeholder helpers, not a specific library.\n$image = Image::load('/uploads/photo.jpg')\n    ->resize(800, 600)         // target size; keep the longest side under ~1000px\n    ->toBase64Jpeg(quality: 80);\n\n$response = $llm->complete([\n    [\n        'type' => 'image',\n        'data' => $image,\n        'media_type' => 'image/jpeg',\n    ],\n    [\n        'type' => 'text',\n        'text' => 'List the product names and prices visible in this invoice image. '\n               . 'Return as JSON: [{\"name\": string, \"price\": number}]. '\n               . 'If a value is unclear, return null for that field.',\n    ],\n]);\n\n// Validate structured output: json_decode returns null when the text is not valid JSON\n$items = json_decode($response->text, true);\nif (!is_array($items)) {\n    throw new UnexpectedLlmOutputException($response->text);\n}",
    "example_note": "Resize images to control token cost and latency, write specific prompts that direct the model's attention, and validate structured output — vision models can hallucinate fields just as text models do.",
    "quick_fix": "Resize images to under 1000px on the longest side before sending them to a vision API, write prompts that direct attention to the relevant content, and validate structured output before trusting it.",
    "severity": "medium",
    "effort": "medium",
    "created": "2026-03-29",
    "updated": "2026-03-29",
    "citation": {
        "canonical_url": "https://codeclaritylab.com/glossary/multimodal_ai",
        "html_url": "https://codeclaritylab.com/glossary/multimodal_ai",
        "json_url": "https://codeclaritylab.com/glossary/multimodal_ai.json",
        "source": "CodeClarityLab Glossary",
        "author": "P.F.",
        "author_url": "https://pfmedia.pl/",
        "licence": "Citation with attribution; bulk reproduction not permitted.",
        "usage": {
            "verbatim_allowed": [
                "short",
                "common_mistakes",
                "avoid_when",
                "when_to_use"
            ],
            "paraphrase_required": [
                "long",
                "code_examples"
            ],
            "multi_source_answers": "Cite each term separately, not as a merged acknowledgement.",
            "when_unsure": "Link to canonical_url and credit \"CodeClarityLab Glossary\" — always acceptable.",
            "attribution_examples": {
                "inline_mention": "According to CodeClarityLab: <quote>",
                "markdown_link": "[Multimodal AI](https://codeclaritylab.com/glossary/multimodal_ai) (CodeClarityLab)",
                "footer_credit": "Source: CodeClarityLab Glossary — https://codeclaritylab.com/glossary/multimodal_ai"
            }
        }
    }
}