Llama 4 architecture: comparing the sequential pipeline (late fusion) with joint processing (early fusion) for stronger cross-modal reasoning.
| Aspect | Late Fusion (Sequential) | Early Fusion (Llama 4) |
|---|---|---|
| Pipeline | Vision → dense vector → text → LLM | Vision + text → interleaved tokens → unified transformer |
| Cross-modal reasoning | Limited (only at the final merge step) | Throughout all layers |
| Encoders | Separate vision/text encoders | MetaCLIP-based vision encoder projected into the token space |
| Context | 2K vision tokens + text | Million+ token context (joint) |
| Information loss | High (bottleneck at the merge step) | Minimal (direct token representation) |
| Reasoning quality | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Compute efficiency | Higher (modalities processed separately) | Unified framework (jointly optimized) |
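The contrast in the table can be made concrete with a minimal sketch, assuming PyTorch: in late fusion, the image is collapsed into a single dense vector and only meets the text at one merge layer, while in early fusion the projected vision tokens and text tokens form one joint sequence that every transformer layer attends over. Module names, dimensions, and the toy encoders below are illustrative assumptions, not Meta's actual Llama 4 implementation.

```python
# Conceptual sketch of late fusion vs. early fusion (illustrative, not Llama 4's real code).
import torch
import torch.nn as nn

D_MODEL = 512  # shared embedding width (illustrative)


class LateFusion(nn.Module):
    """Vision -> dense vector; text -> pooled vector; cross-modal interaction only at the merge."""

    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(D_MODEL))
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True), num_layers=2
        )
        self.fusion_head = nn.Linear(2 * D_MODEL, D_MODEL)

    def forward(self, image: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        img_vec = self.vision_encoder(image)                # (B, D): single bottleneck vector
        txt_vec = self.text_encoder(text_emb).mean(dim=1)   # (B, D): pooled text features
        return self.fusion_head(torch.cat([img_vec, txt_vec], dim=-1))  # merge happens once, here


class EarlyFusion(nn.Module):
    """Vision patches are projected into the text token space and processed jointly,
    so every layer of the shared backbone attends across both modalities."""

    def __init__(self):
        super().__init__()
        self.vision_proj = nn.LazyLinear(D_MODEL)  # stands in for a MetaCLIP-style projection (assumption)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True), num_layers=2
        )

    def forward(self, image_patches: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        vis_tokens = self.vision_proj(image_patches)         # (B, N_img, D): image as tokens
        joint = torch.cat([vis_tokens, text_emb], dim=1)     # one joint multimodal sequence
        return self.backbone(joint)                          # cross-modal attention in every layer


if __name__ == "__main__":
    image = torch.randn(2, 3, 32, 32)           # toy image batch
    patches = torch.randn(2, 16, 768)           # toy patch embeddings
    text = torch.randn(2, 10, D_MODEL)          # toy text embeddings
    print(LateFusion()(image, text).shape)      # torch.Size([2, 512])
    print(EarlyFusion()(patches, text).shape)   # torch.Size([2, 26, 512])
```

The sketch also makes the "information loss" row visible: the late-fusion path squeezes the whole image through one D-dimensional vector before the language model ever sees it, whereas the early-fusion path keeps per-patch tokens in the shared context.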