AI Inference Engineering Handbook

From Models to Silicon: A Complete Guide

This handbook explains how AI inference actually works — step by step — starting from what a model is, all the way down to how silicon executes math.

Written specifically for IT, Platform, and Infrastructure Engineers who need to deploy and optimize AI models in production.

🎯 What You'll Learn

  • Model Fundamentals: Weights, activations, computation graphs
  • Inference Runtimes: PyTorch, OpenVINO, TensorRT, ONNX, llama.cpp
  • Hardware Architecture: CPU vs GPU, ISAs, CUDA, specialized accelerators
  • Optimization Techniques: Quantization, batching, memory management
  • Production Deployment: Real-world ASR and LLM deployment strategies

👥 Who This Is For

  • Infrastructure Engineers managing AI deployments
  • Platform Engineers building ML infrastructure
  • DevOps/MLOps professionals
  • IT Engineers supporting AI applications

No machine learning background required — just systems/IT experience.

📚 Learning Approach

Each topic follows the same pattern:

  1. Plain Explanation — concept in simple terms
  2. Mental Model — how to remember it
  3. Visual Diagrams — see how it works
  4. Real Examples — ASR (Whisper) and LLM deployments
  5. Operational Impact — why it matters for your job

đŸ—ēī¸ Handbook Structure

31 topics across 7 parts:

  • Part I: Fundamentals (Models, Weights, Activations)
  • Part II: Computation Graphs (Static vs Dynamic, Optimization)
  • Part III: Runtimes (PyTorch, OpenVINO, TensorRT, ONNX, llama.cpp)
  • Part IV: Hardware (CPU/GPU, ISA, CUDA)
  • Part V: Optimization (Quantization, Batching, Memory)
  • Part VI: Deployment (ASR, LLM, Benchmarking)
  • Part VII: Advanced (Hyperparameters, Troubleshooting)

1. What Is an AI Model?

Plain Explanation

An AI model is a large mathematical function. It takes input data (like audio or text) and produces output data (like transcribed text or probability scores).

During training, the model learns its weight values. During inference, it only performs calculations with those fixed values. No learning happens at inference time.

💡 Mental Model

A model is a compiled mathematical machine

You don't change it while it runs — you feed data in and read results out.

📊 Model Flow Diagram

Input Data → Model (Math Function) → Output Data

⚡ Key Takeaway

For inference engineers, a model is static. It is executed, not trained. Your job is to run it efficiently.
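A toy sketch of this idea: the weights below are hypothetical constants, and "inference" is just evaluating a fixed function on new input.

```python
# A model is a fixed mathematical function: the weights are constants,
# and inference just evaluates the function on new input. Toy
# single-layer "model" with hypothetical, hard-coded weights.

WEIGHTS = [[0.5, -0.2], [0.1, 0.8]]  # learned during training, frozen now
BIAS = [0.1, -0.1]

def relu(x):
    return x if x > 0 else 0.0

def infer(inputs):
    """Run inference: matrix-vector multiply + bias + ReLU. No learning."""
    return [
        relu(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(WEIGHTS, BIAS)
    ]

print(infer([1.0, 2.0]))
```

No matter how many times you call `infer`, `WEIGHTS` never changes; that is the whole job of an inference engineer's workload.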

2. Weights and Parameters

Plain Explanation

Weights are the learned numerical parameters of the model. They represent what the model has learned from data during training.

  • Stored in RAM (CPU) or VRAM (GPU)
  • Read constantly during inference
  • Never change during inference
  • A large model may have billions of weights

💡 Mental Model

Weights = Model's knowledge stored as numbers

📏 Weight Size Examples

| Model | Parameters | Size (FP32) |
|---|---|---|
| Whisper Large-v3 | ~1.5B | ~6 GB |
| LLaMA-7B | 7B | 28 GB |
| GPT-4 (estimated) | ~1.7T | ~6,800 GB |

🧮 Memory Calculation

If weights are stored in FP32 (Float32) format:

Each parameter = 4 bytes

7B × 4 bytes ≈ 28 GB
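The same arithmetic generalizes to a small helper (a sketch; the bytes-per-parameter values are the standard dtype sizes):

```python
# Estimate weight memory for a model at different numeric precisions:
# parameter count × bytes per parameter.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(n_params, dtype):
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("fp32", "fp16", "int8", "int4"):
    print(f"7B model @ {dtype}: {weight_memory_gb(7e9, dtype):.1f} GB")
```

This is the first question to answer for any deployment: does the model, at the chosen precision, fit in the memory you have?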

âš ī¸ Operational Impact

  • Weight size determines if a model fits in memory
  • Larger models require more VRAM (GPU) or RAM (CPU)
  • Quantization (reducing precision) reduces weight size
  • This is why INT8 and INT4 models are popular for deployment

3. Activations and Memory

Plain Explanation

Activations are temporary values created as the model processes input data. Unlike weights, activations only exist during inference.

  • Exist only during inference
  • Not saved after execution
  • Scale with input length and batch size
  • Often use MORE memory than weights

💡 Mental Model

Weights

Knowledge

Static, learned values

Activations

Thinking

Dynamic, temporary values

âš ī¸ Why Activations Matter

  • Memory bottleneck: Activations often require MORE memory than weights
  • Batch size impact: 2× batch size = 2× activation memory
  • Sequence length: Longer inputs = more activations
  • OOM errors are usually caused by activations, not weights
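A rough sketch of how activation memory scales with batch size; the `live_tensors_per_layer` factor is a simplifying assumption (real runtimes vary), but the linear scaling is the point:

```python
# Rough activation-memory estimate for a transformer, showing how
# activations scale with batch size and sequence length. The
# live_tensors_per_layer constant is a simplifying assumption.

def activation_gb(batch, seq_len, hidden, n_layers, bytes_per_val=2,
                  live_tensors_per_layer=4):
    vals = batch * seq_len * hidden * n_layers * live_tensors_per_layer
    return vals * bytes_per_val / 1e9

base = activation_gb(batch=1, seq_len=2048, hidden=4096, n_layers=32)
doubled = activation_gb(batch=2, seq_len=2048, hidden=4096, n_layers=32)
print(f"batch=1: {base:.1f} GB, batch=2: {doubled:.1f} GB")
```

Doubling batch size or sequence length doubles the estimate, which is why an OOM often appears only after traffic grows, with the weights unchanged.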

4. Training vs Inference

Plain Explanation

Training and Inference are two completely different phases in the AI lifecycle. Understanding this distinction is crucial for infrastructure engineering.

📊 Comparison

| Aspect | Training | Inference |
|---|---|---|
| Purpose | Learn from data | Make predictions |
| Changes | Weights are updated | Weights never change |
| Duration | Hours to weeks | Milliseconds to seconds |
| Hardware | Multiple GPUs | CPU or single GPU |
| Team | Data scientists, ML engineers | Platform/infra engineers (you!) |
| Focus | Accuracy, convergence | Latency, throughput, cost |

✅ Part I Complete!

You now understand the fundamentals: models, weights, activations, and the difference between training and inference. Ready to learn about computation graphs!

5. Understanding Computation Graphs

Plain Explanation

A computation graph represents a model as a graph of mathematical operations. It's like a blueprint that shows exactly what calculations need to happen and in what order.

💡 Mental Model

A computation graph = Wiring diagram for math

🔧 Graph Components

Nodes

Represent operations:

  • Matrix multiplication
  • Addition
  • Activation functions (ReLU, Softmax)
  • Convolution
  • Attention

Edges

Represent data flow:

  • Tensors (multi-dimensional arrays)
  • Flow between operations
  • Define dependencies
  • Show execution order

📊 Simple Graph Example

Input X, Input Y → Multiply (X × Y) → Add Bias (+b) → ReLU Activation → Output

🎤 ASR Example (Whisper)

Audio → Encoder → Attention → Decoder → Text

💬 LLM Example (GPT)

Tokens → Embedding → Transformer → Softmax → Next Token

🎯 Why Graphs Matter

  • Optimization: Graphs can be analyzed and optimized
  • Parallelization: Independent operations can run simultaneously
  • Memory planning: Know memory needs ahead of time
  • Debugging: Visualize what the model does

6. Static vs Dynamic Graphs

Plain Explanation

Computation graphs can be built in two ways: dynamically (as the program runs) or statically (built once ahead of time). This choice has major implications for inference performance.

🔄 Dynamic Graphs

How It Works:

Graph is constructed as operations execute

✓ Advantages:

  • Very flexible
  • Easy to debug
  • Supports control flow (if/else, loops)
  • Great for research

✗ Disadvantages:

  • Slower execution
  • Limited optimization
  • Higher overhead

Used by:

PyTorch Eager Mode

⚡ Static Graphs

How It Works:

Graph is built once, then optimized and executed repeatedly

✓ Advantages:

  • Much faster execution
  • Aggressive optimization
  • Memory planning
  • Perfect for production

✗ Disadvantages:

  • Less flexible
  • Harder to debug
  • Requires conversion step

Used by:

TorchScript, ONNX, OpenVINO, TensorRT

💡 Mental Model

Dynamic Graph = Interpreted Code (like running Python)

Static Graph = Compiled Code (like running C++)

📊 Performance Comparison

| Metric | Dynamic | Static |
|---|---|---|
| Inference Speed | 1x (baseline) | 2-5x faster |
| Debugging | Easy | Harder |
| Flexibility | High | Limited |
| Memory Usage | Higher | Optimized |
| Production Use | Not recommended | Strongly recommended |

💻 Code Example: PyTorch

Dynamic (Eager Mode):

# Dynamic execution
import torch

model = WhisperModel()
output = model(audio)  # Graph built on-the-fly

Static (TorchScript):

# Static graph - compile once
import torch

model = WhisperModel()
scripted = torch.jit.script(model)  # Build graph
output = scripted(audio)  # Fast execution

🎯 When to Use Each

Use Dynamic Graphs for:

  • Research and experimentation
  • Prototyping
  • Models with complex control flow

Use Static Graphs for:

  • Production deployment
  • Performance-critical applications
  • ASR and LLM inference
  • This is what you want 99% of the time!

7. Graph Optimization

Plain Explanation

Once you have a static computation graph, inference runtimes can optimize it automatically. These optimizations make inference faster and more memory-efficient without changing the model's accuracy.

💡 Mental Model

Graph optimization = Compiler optimization for AI

Like how C++ compilers optimize your code automatically

🔧 Common Optimization Techniques

1. Operator Fusion (Kernel Fusion)

Combine multiple operations into a single kernel

Before:

MatMul → Add Bias → ReLU

After:

Fused MatMul+Bias+ReLU

✓ Fewer memory reads/writes ✓ Faster execution
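The before/after can be mimicked in plain Python: the unfused version makes three passes over the data and builds two intermediate lists, the fused version a single pass, with identical results.

```python
# Sketch of why operator fusion helps: the unfused version makes three
# passes (matmul, bias add, ReLU) with intermediate buffers; the fused
# version computes the same result in one pass with no intermediates.

def unfused(x, w, b):
    y = [sum(xi * wi for xi, wi in zip(x, col)) for col in w]  # MatMul
    y = [yi + bi for yi, bi in zip(y, b)]                      # Add bias
    return [max(0.0, yi) for yi in y]                          # ReLU

def fused(x, w, b):
    # One pass: multiply, add bias, clamp, without materializing
    # the intermediate results.
    return [max(0.0, sum(xi * wi for xi, wi in zip(x, col)) + bi)
            for col, bi in zip(w, b)]

x, w, b = [1.0, 2.0], [[3.0, 4.0], [-1.0, 0.5]], [0.5, -2.0]
assert fused(x, w, b) == unfused(x, w, b)
print(fused(x, w, b))
```

On real hardware the win comes from the skipped memory traffic, not the skipped Python lists, but the shape of the transformation is the same.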

2. Constant Folding

Pre-compute operations that don't depend on input

Before:

output = input * (2 + 3)  # Computed every time

After:

output = input * 5  # Pre-computed
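The same folding idea, as a minimal pass over a toy expression tree (a sketch of the technique, not any runtime's actual implementation):

```python
# A minimal constant-folding pass over a toy expression tree — the
# same idea graph compilers apply to whole models before inference.

def fold(node):
    """node is either a leaf (number or input name) or (op, left, right)."""
    if not isinstance(node, tuple):
        return node
    op, l, r = node
    l, r = fold(l), fold(r)
    if isinstance(l, (int, float)) and isinstance(r, (int, float)):
        return l + r if op == "+" else l * r  # evaluate at compile time
    return (op, l, r)

# output = input * (2 + 3)  becomes  output = input * 5
graph = ("*", "input", ("+", 2, 3))
print(fold(graph))
```

The subtree that doesn't depend on `input` is computed once at build time instead of on every request.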

3. Dead Code Elimination

Remove operations that don't affect the output

Example: Unused outputs, redundant calculations

4. Layout Optimization

Reorganize data for better memory access patterns

Change tensor formats (NCHW ↔ NHWC) for hardware efficiency

5. Memory Reuse

Reuse memory buffers instead of allocating new ones

Reduces total memory footprint significantly

6. Quantization-Aware Optimization

Optimize for lower precision (INT8, INT4)

Replace FP32 operations with INT8 equivalents

📊 Impact of Graph Optimization

| Optimization Level | Latency | Speedup |
|---|---|---|
| Unoptimized | 100 ms | 1x (baseline) |
| Basic optimization | 50 ms | 2x faster |
| Aggressive optimization | 25 ms | 4x faster |

Typical speedup from graph optimization on production models

đŸ› ī¸ Which Runtimes Do This?

TensorRT (NVIDIA)

Most aggressive optimization

Excellent

OpenVINO (Intel)

Strong CPU optimization

Excellent

ONNX Runtime

Good cross-platform optimization

Very Good

PyTorch JIT

Basic optimization

Good

⚡ Key Takeaway

Graph optimization is why specialized inference runtimes are so much faster than running models in PyTorch eager mode. The same model, same weights, but 2-5x faster execution through automatic optimization.

✅ Part II Complete!

You now understand computation graphs, why static graphs matter, and how they're optimized. Ready to learn about the runtimes that execute these graphs!

8. What Is an Inference Runtime?

Plain Explanation

An inference runtime is the software layer that executes your computation graph on hardware. It's the engine that takes your model and actually runs it on CPUs or GPUs.

💡 Mental Model

Runtime = Execution Engine for AI Models

Like JVM for Java or V8 for JavaScript

🎯 What Runtimes Do

Core Responsibilities:

  • Load model weights into memory
  • Parse computation graph
  • Select optimal kernels
  • Execute operations in order
  • Manage memory allocation
  • Schedule threads/cores

Optimizations:

  • Graph optimization (fusion, etc.)
  • Hardware-specific kernels
  • Memory planning
  • Parallel execution
  • Quantization support
  • Batch processing

📊 The Inference Stack

Your Application
↓
Inference Runtime

(PyTorch, OpenVINO, TensorRT, etc.)

↓
Kernels / Operators

(Optimized math functions)

↓
Hardware (CPU / GPU)

🔧 Popular Inference Runtimes

| Runtime | Focus | Notes |
|---|---|---|
| PyTorch Runtime | General purpose | Default runtime, good for prototyping |
| OpenVINO | CPU optimized | Intel CPUs, excellent INT8 performance |
| TensorRT | GPU optimized | NVIDIA GPUs, ultra-low latency |
| ONNX Runtime | Cross-platform | Works everywhere, good portability |
| llama.cpp | LLM specialized | Optimized for language models on CPU |

🎯 Choosing a Runtime

Your choice depends on:

  • Hardware: CPU vs GPU
  • Model type: ASR, LLM, vision
  • Latency requirements: Real-time vs batch
  • Vendor: Intel, NVIDIA, AMD, Apple
  • Ecosystem: Python, C++, mobile

9. PyTorch Runtime

Plain Explanation

PyTorch is the most popular deep learning framework. While primarily designed for training, it also has inference capabilities through Eager Mode and TorchScript.

🔄 PyTorch Execution Modes

Eager Mode

  • ✓ Dynamic graph execution
  • ✓ Easy debugging
  • ✓ Flexible and Pythonic
  • ✗ Slower inference
  • ✗ Limited optimization

model = WhisperModel()
output = model(audio)

TorchScript

  • ✓ Static graph (optimized)
  • ✓ 2-3x faster inference
  • ✓ Can run without Python
  • ✗ Requires conversion step
  • ✗ Some features unsupported

model = WhisperModel()
scripted = torch.jit.script(model)
output = scripted(audio)

⚡ Performance Comparison

| Mode | Speed | Ease of Use | Production Ready |
|---|---|---|---|
| Eager Mode | 1x (baseline) | ⭐⭐⭐⭐⭐ | ❌ |
| TorchScript | 2-3x | ⭐⭐⭐ | ⚠️ OK |
| OpenVINO | 4-5x | ⭐⭐⭐ | ✅ |

💻 Code Example: Converting to TorchScript

import torch
from transformers import WhisperForConditionalGeneration

# Load model
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
model.eval()

# Example input: a dummy log-mel spectrogram (batch, mel bins, frames)
dummy_input = torch.randn(1, 80, 3000)

# Convert to TorchScript by tracing with the example input
with torch.no_grad():
    scripted_model = torch.jit.trace(
        model,
        example_inputs=(dummy_input,)
    )

# Save for deployment
scripted_model.save("whisper_scripted.pt")

# Load and use
loaded = torch.jit.load("whisper_scripted.pt")
output = loaded(audio_input)

âš™ī¸ Backend Options (ATen)

PyTorch uses ATen (A Tensor Library) for operations:

CPU Backend

  • Uses MKL (Intel Math Kernel Library)
  • OpenBLAS or Eigen as fallback
  • Good threading support

GPU Backend

  • CUDA kernels (NVIDIA)
  • cuDNN for convolutions
  • cuBLAS for matrix operations

âš ī¸ When to Use PyTorch Runtime

✓ Good for:

  • â€ĸ Prototyping and experimentation
  • â€ĸ Research deployments
  • â€ĸ When you need maximum flexibility
  • â€ĸ Quick proof-of-concept

✗ Not ideal for:

  • â€ĸ High-throughput production (use OpenVINO/TensorRT)
  • â€ĸ Latency-critical applications
  • â€ĸ Resource-constrained environments

💡 Key Insight

PyTorch is excellent for development, but for production inference, you typically want to export to a specialized runtime like OpenVINO (CPU) or TensorRT (GPU) for 2-5x better performance.

10. OpenVINO (Open Visual Inference and Neural Network Optimization)

Plain Explanation

OpenVINO is Intel's inference optimization toolkit. It's specifically designed to run AI models blazingly fast on Intel CPUs, with excellent support for quantization and various model types.

💡 Mental Model

OpenVINO = Intel's turbocharger for CPU inference

✨ Key Features

Strengths:

  • ✓ Excellent CPU performance (Intel)
  • ✓ Outstanding INT8 quantization
  • ✓ Static graph optimization
  • ✓ Auto-tuning for your CPU
  • ✓ Cross-platform (Windows, Linux)
  • ✓ Supports many frameworks

Limitations:

  • Best on Intel hardware
  • Requires model conversion (IR format)
  • Learning curve for optimization
  • Limited GPU support (vs TensorRT)

🔧 OpenVINO Workflow

1. PyTorch Model → 2. Convert to IR (Intermediate Representation) → 3. Optimize (Graph + Quantization) → 4. Run Inference

💻 Code Example: Converting and Running

# Step 1: Install

pip install openvino openvino-dev

# Step 2: Convert to OpenVINO IR

from openvino.tools import mo
from openvino.runtime import serialize

# Convert an ONNX export of the model. convert_model returns an
# in-memory OpenVINO model; API details vary by OpenVINO version.
ov_model = mo.convert_model(
    "whisper.onnx",
    compress_to_fp16=True  # Reduce size
)
serialize(ov_model, "openvino_model/model.xml")

# Step 3: Run Inference

from openvino.runtime import Core

# Initialize runtime
core = Core()
model = core.read_model("openvino_model/model.xml")
compiled = core.compile_model(model, "CPU")

# Run inference
output = compiled([audio_input])[0]

📊 Performance: OpenVINO vs PyTorch

| Configuration | Latency |
|---|---|
| PyTorch CPU (FP32) | 100 ms |
| OpenVINO FP32 | 50 ms (2x faster) |
| OpenVINO INT8 | 20 ms (5x faster) |

Typical speedups for ASR models on Intel Xeon CPUs

🎯 INT8 Quantization with OpenVINO

OpenVINO excels at INT8 quantization:

# Illustrative configuration for the Post-training Optimization Tool
# (POT). The exact schema varies by OpenVINO version; it is typically
# written as JSON and passed to the pot CLI (pot -c config.json).

config = {
    "model": {"model": "whisper.xml", "weights": "whisper.bin"},
    "engine": {"type": "accuracy_checker"},
    "compression": {
        "algorithms": [{
            "name": "DefaultQuantization",
            "preset": "performance",
            "stat_subset_size": 300
        }]
    }
}

Result: 4x smaller model, 3-5x faster inference, minimal accuracy loss (<1%)

🎯 When to Use OpenVINO

✓ Perfect for:

  • CPU-only production deployments
  • Intel hardware (Xeon, Core processors)
  • ASR models (Whisper, Conformer)
  • Real-time applications on CPU
  • Cost-sensitive deployments (no GPU needed)

⚠️ Consider alternatives if:

  • You have NVIDIA GPUs (use TensorRT)
  • You need maximum GPU performance
  • You're on non-Intel hardware

✅ Real-World Use Case

Call Center ASR: Many companies use OpenVINO to run Whisper models on CPU servers, achieving real-time transcription at 1/10th the cost of GPU deployments.

11. TensorRT (NVIDIA)

Plain Explanation

TensorRT is NVIDIA's high-performance inference optimizer and runtime. It's designed to squeeze maximum performance out of NVIDIA GPUs through aggressive graph optimization and kernel fusion.

💡 Mental Model

TensorRT = Ultimate GPU performance optimizer

⚡ What Makes TensorRT Fast

1. Aggressive Kernel Fusion

Combines dozens of operations into single GPU kernels

2. Precision Calibration

Automatic INT8 quantization with minimal accuracy loss

3. Layer and Tensor Fusion

Optimizes memory access patterns for GPU architecture

4. Dynamic Tensor Memory

Reuses memory buffers to reduce VRAM usage

5. Multi-Stream Execution

Parallel processing of multiple batches

📊 Performance Comparison

| Configuration | Latency |
|---|---|
| PyTorch GPU (FP32) | 15 ms |
| TensorRT FP16 | 6 ms (2.5x faster) |
| TensorRT INT8 | 3 ms (5x faster) |

Typical speedups for LLMs on NVIDIA A100 GPUs

💻 Code Example: PyTorch to TensorRT

# Step 1: Export to ONNX

import torch
model = WhisperModel()
dummy_input = torch.randn(1, 80, 3000).cuda()

torch.onnx.export(
    model, dummy_input, "whisper.onnx",
    opset_version=17,
    input_names=["audio"],
    output_names=["text"]
)

# Step 2: Convert to TensorRT

import tensorrt as trt

# Illustrative build flow; API details vary by TensorRT version
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
parser.parse(open("whisper.onnx", "rb").read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
engine = builder.build_serialized_network(network, config)

🎯 When to Use TensorRT

  • Ultra-low latency requirements (real-time ASR, LLM serving)
  • NVIDIA GPUs available (A100, H100, V100, T4)
  • Production LLM deployments
  • When GPU cost is a concern (better utilization)

12. ONNX Runtime (Open Neural Network Exchange)

Plain Explanation

ONNX Runtime is a cross-platform, hardware-agnostic inference engine. It's designed to run models from any framework (PyTorch, TensorFlow, etc.) on any hardware (CPU, GPU, mobile).

💡 Mental Model

ONNX = "Write once, run anywhere" for AI

Like Java's JVM, but for neural networks

🌐 ONNX Ecosystem

PyTorch / TensorFlow / JAX (source frameworks) → ONNX Format → CPU / GPU / Mobile (target hardware)

✨ Key Features

Strengths:

  • ✓ True cross-platform portability
  • ✓ Hardware flexibility (CPU/GPU/NPU)
  • ✓ Framework agnostic
  • ✓ Good performance (2-3x vs PyTorch)
  • ✓ Active community support
  • ✓ Microsoft backing

Trade-offs:

  • Not as fast as TensorRT (GPU)
  • Not as fast as OpenVINO (Intel CPU)
  • Conversion can be tricky
  • Operator coverage gaps

💻 Code Example: Converting and Running

# Step 1: Export PyTorch to ONNX

import torch

model = WhisperModel()
dummy_input = torch.randn(1, 80, 3000)

torch.onnx.export(
    model,
    dummy_input,
    "whisper.onnx",
    export_params=True,
    opset_version=14,
    input_names=["audio"],
    output_names=["text"],
    dynamic_axes={
        "audio": {0: "batch", 2: "time"},
        "text": {0: "batch"}
    }
)

# Step 2: Run with ONNX Runtime

import onnxruntime as ort

# Create session
session = ort.InferenceSession(
    "whisper.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

# Run inference
outputs = session.run(
    None,
    {"audio": audio_input}
)

text_output = outputs[0]

âš™ī¸ Execution Providers (Hardware Backends)

| Provider | Target |
|---|---|
| CPUExecutionProvider | Default, works everywhere |
| CUDAExecutionProvider | NVIDIA GPUs with CUDA |
| TensorrtExecutionProvider | Uses TensorRT under the hood |
| OpenVINOExecutionProvider | Uses OpenVINO for Intel hardware |
| CoreMLExecutionProvider | Apple Silicon (M1/M2/M3) |
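ONNX Runtime tries execution providers in the order you pass them and falls back to the next one that is available. That ordered-fallback logic can be sketched like this; the `available` set here is a stand-in for what `onnxruntime.get_available_providers()` would report on your machine:

```python
def pick_provider(preferred, available):
    """Return the first preferred execution provider that is available,
    mirroring ONNX Runtime's ordered-fallback behavior."""
    for p in preferred:
        if p in available:
            return p
    raise RuntimeError("no execution provider available")

# On a CPU-only box, CUDA is skipped and CPU is chosen.
available = {"CPUExecutionProvider"}
print(pick_provider(["CUDAExecutionProvider", "CPUExecutionProvider"],
                    available))
```

This is why listing `CPUExecutionProvider` last is a common pattern: the session still starts on machines without a GPU.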

📊 Performance Positioning

| Scenario | Best Choice | ONNX Position |
|---|---|---|
| Intel CPU | OpenVINO | 2nd (good) |
| NVIDIA GPU | TensorRT | 2nd (good) |
| Cross-platform | ONNX Runtime | 1st (best) |
| Mobile / edge | ONNX Runtime | 1st (best) |

🎯 When to Use ONNX Runtime

  • Multi-hardware deployments (CPU + GPU + mobile)
  • Platform flexibility (Windows, Linux, macOS, mobile)
  • Framework agnostic (PyTorch, TensorFlow, etc.)
  • Good "middle ground" performance
  • When you need portability more than peak performance

13. llama.cpp and Variants

Plain Explanation

llama.cpp is a lightweight, CPU-optimized inference engine specifically designed for Large Language Models (LLMs). It's written in pure C/C++ with no dependencies, making it incredibly portable and efficient.

💡 Mental Model

llama.cpp = SQLite for LLMs

Minimal, fast, runs anywhere, zero dependencies

✨ Why llama.cpp Is Special

Unique Strengths:

  • ✓ Pure C/C++ (no Python overhead)
  • ✓ Extreme portability (Linux, Mac, Windows, mobile)
  • ✓ Tiny binary (~few MB)
  • ✓ CPU-first design (no GPU required)
  • ✓ Quantization mastery (4-bit, 3-bit, 2-bit)
  • ✓ Memory mapped files (efficient loading)

Perfect For:

  • Running LLMs on laptops
  • Edge deployments
  • CPU-only servers
  • Local AI applications
  • Raspberry Pi / embedded
  • Cost-sensitive deployments

đŸ—‚ī¸ GGUF Format (GPT-Generated Unified Format)

llama.cpp uses its own model format called GGUF (previously GGML). This format is optimized for:

Memory Mapping

Load instantly without copying

Quantization

Built-in 2/3/4/5/6/8-bit

Portability

Single file, works everywhere

💻 Usage Example

# Step 1: Build llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# Step 2: Convert model to GGUF

python convert.py /path/to/llama-model --outtype f16

# Step 3: Quantize (optional but recommended)

./quantize model-f16.gguf model-q4_0.gguf q4_0

# Step 4: Run inference

./main -m model-q4_0.gguf -p "Hello, my name is" -n 128 -t 8

# -m: model file
# -p: prompt
# -n: number of tokens to generate
# -t: number of CPU threads

📊 Quantization Levels

| Format | Bits | Size (7B) | Quality |
|---|---|---|---|
| F16 | 16 | 14 GB | ⭐⭐⭐⭐⭐ |
| Q8_0 | 8 | 7 GB | ⭐⭐⭐⭐⭐ |
| Q6_K | 6 | 5.5 GB | ⭐⭐⭐⭐⭐ |
| Q5_K_M | 5 | 4.8 GB | ⭐⭐⭐⭐ |
| Q4_K_M | 4 | 4.1 GB | ⭐⭐⭐⭐ |
| Q3_K_M | 3 | 3.3 GB | ⭐⭐⭐ |
| Q2_K | 2 | 2.7 GB | ⭐⭐ |

* Recommended: Q4_K_M or Q5_K_M for best quality/size trade-off
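A back-of-envelope size estimate behind these numbers (pure bits-per-parameter arithmetic; real GGUF files run somewhat larger because the K-quants mix precisions and store scaling factors):

```python
# Back-of-envelope GGUF file size: parameters × bits / 8.
# Actual K-quant files are a bit larger (mixed precision + scales).

def quantized_gb(n_params, bits):
    return n_params * bits / 8 / 1e9

for name, bits in [("F16", 16), ("Q8_0", 8), ("Q4_K_M", 4)]:
    print(f"{name}: ~{quantized_gb(7e9, bits):.1f} GB for a 7B model")
```

Useful for a quick feasibility check: will this quantization level fit in the RAM of the target box, with headroom for the KV cache?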

🚀 Popular Variants & Wrappers

llama-cpp-python

Python bindings for llama.cpp

Ollama

User-friendly wrapper with model library

LM Studio

GUI for running GGUF models (uses llama.cpp)

text-generation-webui

Web interface for LLMs (llama.cpp backend)

🎯 When to Use llama.cpp

  • CPU-only deployments (no GPU budget)
  • Edge devices (Raspberry Pi, embedded systems)
  • Local AI applications (privacy-focused)
  • Development & prototyping LLM apps
  • Memory-constrained environments (with quantization)
  • Cross-platform deployment needs

✅ Part III Complete!

You now understand the major inference runtimes: PyTorch, OpenVINO (CPU), TensorRT (GPU), ONNX (cross-platform), and llama.cpp (LLM-specialized). Ready to dive into hardware and kernels!

14. Kernels and Operations

Plain Explanation

A kernel is a hardware-specific implementation of a mathematical operation. The same operation (like matrix multiplication) has different kernels for CPU, GPU, and other accelerators.

💡 Mental Model

Kernel = How math actually runs on silicon

🔧 Operation vs Kernel

Operation

Abstract mathematical function

Examples:

  • Matrix Multiplication
  • Convolution
  • Softmax
  • ReLU Activation

Platform-independent concept

Kernel

Hardware-specific implementation

For MatMul:

  • CPU kernel (uses AVX-512)
  • GPU kernel (uses Tensor Cores)
  • ARM kernel (uses NEON)
  • TPU kernel (custom silicon)

Hardware-specific code

🎯 Example: Matrix Multiplication Kernels

CPU Kernel (Intel)

Uses MKL (Math Kernel Library) with AVX-512 instructions

// Optimized for Intel CPUs
void matmul_cpu(float* A, float* B, float* C) {
    cblas_sgemm(...);  // MKL function
    // Uses AVX-512 SIMD instructions
}

GPU Kernel (NVIDIA)

Uses CUDA with Tensor Cores

// CUDA kernel
__global__ void matmul_gpu(float* A, float* B, float* C) {
    // Parallel execution across thousands of cores
    // Uses Tensor Cores for FP16/INT8
}

ARM Kernel

Uses NEON SIMD instructions

// ARM NEON optimized
void matmul_arm(float* A, float* B, float* C) {
    // Uses NEON vector instructions
}

📚 Common Deep Learning Operations

Matrix Operations

  • GEMM (General Matrix Multiply)
  • BatchMatMul

Convolution

  • Conv2D (2D Convolution)
  • DepthwiseConv

Activation Functions

  • ReLU, GELU, Swish
  • Softmax, Sigmoid

Normalization

  • LayerNorm, BatchNorm
  • GroupNorm

Attention

  • Multi-Head Attention
  • Scaled Dot-Product

Pooling

  • MaxPool, AvgPool
  • AdaptivePool

⚡ Kernel Libraries

CPU: Intel MKL

Math Kernel Library - highly optimized for Intel CPUs

GPU: cuDNN

CUDA Deep Neural Network library - NVIDIA's DL primitives

GPU: cuBLAS

CUDA Basic Linear Algebra Subprograms - matrix operations

ARM: Compute Library

Optimized kernels for ARM CPUs (NEON) and Mali GPUs

🎯 Why This Matters

Runtimes like OpenVINO and TensorRT are fast because they:

  • Select the best kernel for your hardware
  • Fuse multiple kernels into one
  • Use hardware-specific optimizations
  • Minimize kernel launch overhead

15. CPU vs GPU Architecture

Plain Explanation

CPUs and GPUs are designed for fundamentally different workloads. Understanding their architectures helps you choose the right hardware for your inference needs.

đŸ—ī¸ Architecture Comparison

CPU (Central Processing Unit)

Design Philosophy:

Few powerful cores optimized for sequential tasks

Cores:

4-64 powerful cores

Cache:

Large (32-256 MB)

Clock Speed:

High (2-5 GHz)

Memory:

RAM (DDR4/DDR5)

Bandwidth:

~50-100 GB/s

Best For:

Control flow, branching, general computing

GPU (Graphics Processing Unit)

Design Philosophy:

Thousands of simple cores optimized for parallel tasks

Cores:

1,000-10,000+ CUDA cores

Cache:

Small per core (~KB)

Clock Speed:

Lower (1-2 GHz)

Memory:

VRAM (HBM2/GDDR6)

Bandwidth:

~500-2,000 GB/s

Best For:

Parallel math, matrix operations

💡 Mental Models

CPU = Ferrari: few fast cores, sequential excellence

GPU = Bus Fleet: many slow cores, parallel powerhouse

📊 Performance Characteristics

| Task | CPU | GPU |
|---|---|---|
| Matrix multiply (large) | Slow | Very fast |
| Single-threaded code | Very fast | Slow |
| Branching / if-else | Excellent | Poor |
| Parallel operations | Limited | Excellent |
| Memory bandwidth | 50-100 GB/s | 500-2,000 GB/s |
| Power efficiency | Better (50-150 W) | Hungry (250-700 W) |
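The bandwidth gap matters more than the core-count gap for LLM serving: token-by-token decoding is usually memory-bound, because generating each token reads roughly the full weight set once. A rough rule of thumb (a sketch, ignoring caches, KV-cache traffic, and compute limits) is tokens/s ≈ memory bandwidth / model size:

```python
# Rule-of-thumb decode speed for a memory-bound LLM: each generated
# token streams roughly the whole weight set from memory once, so
# tokens/s ≈ bandwidth / model size. Ignores caches and KV-cache reads.

def tokens_per_sec(bandwidth_gbs, model_size_gb):
    return bandwidth_gbs / model_size_gb

model_gb = 7.0  # e.g. a 7B-parameter model at INT8
print(f"CPU @ 100 GB/s:  ~{tokens_per_sec(100, model_gb):.0f} tok/s")
print(f"GPU @ 2000 GB/s: ~{tokens_per_sec(2000, model_gb):.0f} tok/s")
```

This is also why quantization speeds up CPU decoding almost linearly: halving the bytes per weight roughly halves the memory traffic per token.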

🎯 When to Use Each for AI Inference

Choose CPU When:

  • ✓ Low latency, small batch (batch=1)
  • ✓ Cost-sensitive deployments
  • ✓ Already have CPU infrastructure
  • ✓ Models fit in RAM with quantization
  • ✓ ASR models (with OpenVINO)

Choose GPU When:

  • ✓ High throughput needed
  • ✓ Large batch sizes
  • ✓ Very large models (70B+ LLMs)
  • ✓ Ultra-low latency critical
  • ✓ Budget allows ($$)

💡 Real-World Example

ASR Call Center (Whisper Large-v3):

  • CPU (Xeon + OpenVINO INT8): ~200 ms latency, $0.50/hour
  • GPU (T4 + TensorRT FP16): ~50 ms latency, $2.50/hour

Decision: CPU wins for call centers (200ms is acceptable, 5x cost savings)

16. Instruction Set Architectures (ISA)

Plain Explanation

An Instruction Set Architecture (ISA) is the language that your CPU speaks. It defines what operations the processor can perform and how software communicates with hardware.

💡 Mental Model

ISA = CPU's native language

Like English vs Spanish vs Mandarin for humans

đŸ–Ĩī¸ Major CPU ISAs

x86-64 (AMD64)

Dominant

Used by: Intel (Core, Xeon), AMD (Ryzen, EPYC)

Market: Servers, desktops, laptops

SIMD Extensions:

  • SSE (Streaming SIMD Extensions)
  • AVX (Advanced Vector Extensions)
  • AVX-512 (512-bit vectors for AI/HPC)
  • AMX (Advanced Matrix Extensions - new!)

AI Performance: Excellent with AVX-512/AMX

ARM64 (AArch64)

Growing

Used by: Apple (M1/M2/M3), AWS Graviton, NVIDIA Grace

Market: Mobile, edge, emerging servers

SIMD Extensions:

  • NEON (Advanced SIMD)
  • SVE (Scalable Vector Extension)
  • SVE2 (Enhanced for AI/ML)

AI Performance: Good, improving rapidly

Advantage: Power efficiency

RISC-V

Emerging

Used by: SiFive, StarFive, various startups

Market: Edge devices, research, future servers

Key Feature: Open-source ISA (no licensing fees!)

Extensions:

  • V extension (Vector operations)
  • Zve (Embedded vector)

Status: Early but promising for AI

🎮 GPU ISAs

NVIDIA: PTX → SASS

PTX (Parallel Thread Execution): Virtual ISA (like Java bytecode)

SASS: Native GPU machine code (hardware-specific)

CUDA → PTX → SASS → Hardware

AMD: GCN / RDNA / CDNA

GCN: Graphics Core Next (older)

RDNA: Gaming GPUs

CDNA: Compute/AI GPUs (MI series)

Intel: Gen Graphics

Used in Intel Xe GPUs (Arc, Flex, Max series)

📊 SIMD: Single Instruction Multiple Data

SIMD extensions allow CPUs to perform the same operation on multiple data points simultaneously - crucial for AI inference.

Evolution of Intel SIMD:

| Extension | Width | Introduced |
|---|---|---|
| SSE | 128-bit (4 floats) | ~1999 |
| AVX | 256-bit (8 floats) | ~2011 |
| AVX2 | 256-bit + FMA | ~2013 |
| AVX-512 | 512-bit (16 floats) | ~2016 |
| AMX | Matrix tiles (INT8/BF16) | ~2021 |

Impact: AVX-512 + AMX make modern Intel CPUs competitive with GPUs for INT8 inference

🔄 Why ISA Matters for Inference

For Deployment:

  • Kernels are ISA-specific
  • x86 binaries won't run on ARM
  • Need the right compiler/runtime

For Performance:

  • AVX-512 = 2-3x faster than AVX2
  • AMX = dedicated matrix ops
  • NEON optimizes ARM inference

âš ī¸ Practical Implications

  • â€ĸ Docker images must match architecture (amd64 vs arm64)
  • â€ĸ OpenVINO automatically detects and uses best ISA features
  • â€ĸ ARM Macs (M1/M2) need ARM-specific builds
  • â€ĸ Check CPU flags: lscpu | grep Flags
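A small helper for checking a flags line (as printed by `lscpu` or found in `/proc/cpuinfo`) for the SIMD features above; the sample string here is a hypothetical abbreviated Xeon flags line:

```python
# Check a CPU flags string for SIMD features relevant to inference.
# Feed it the Flags line from `lscpu` or /proc/cpuinfo.

def simd_features(flags_line):
    flags = set(flags_line.split())
    return {f: f in flags for f in ("avx2", "avx512f", "amx_tile")}

# Hypothetical, abbreviated flags line from a modern Xeon:
sample = "fpu sse sse2 avx avx2 avx512f avx512bw amx_tile"
print(simd_features(sample))
```

If `avx512f` or `amx_tile` show up, runtimes like OpenVINO will pick the corresponding fast kernels automatically.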

17. CUDA and Alternatives

Plain Explanation

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model. It allows developers to use GPUs for general-purpose computing, not just graphics.

💡 Mental Model

CUDA = The "C++" of GPU programming

Dominant, powerful, but NVIDIA-only

đŸ—ī¸ CUDA Ecosystem

Your Application / Framework

(PyTorch, TensorFlow)

↓
High-Level Libraries

(cuDNN, cuBLAS, TensorRT)

↓
CUDA Runtime API

(Memory management, kernel launch)

↓
NVIDIA GPU Hardware

📚 Key CUDA Libraries for AI

cuDNN (CUDA Deep Neural Network library)

GPU-accelerated primitives for deep learning (convolution, pooling, normalization)

Used by: All major frameworks

cuBLAS (CUDA Basic Linear Algebra Subprograms)

GPU-accelerated matrix operations (GEMM, GEMV)

Critical for: Transformer models

TensorRT

High-performance inference optimizer (covered earlier)

Best for: Production inference

cuSPARSE

Sparse matrix operations

Useful for: Pruned models

NCCL (NVIDIA Collective Communications Library)

Multi-GPU communication

For: Multi-GPU training/inference

🔄 CUDA Alternatives

ROCm (AMD)

AMD GPUs

Radeon Open Compute platform - AMD's answer to CUDA

Advantages:

  • Open source
  • AMD MI series (CDNA)
  • HIP (CUDA compatibility layer)

Challenges:

  • Smaller ecosystem
  • Fewer optimized libraries
  • Limited framework support

oneAPI / SYCL (Intel)

Intel GPUs

Unified programming model for CPUs, GPUs, FPGAs

Advantages:

  • Cross-architecture
  • Standards-based (SYCL)
  • Intel Xe GPUs (Arc, Flex, Max)

Status:

  • Growing adoption
  • Good for Intel stack
  • Still maturing

OpenCL

Vendor Neutral

Open standard for parallel programming across platforms

Advantages:

  • True cross-vendor
  • CPU, GPU, FPGA support
  • Open standard

Reality:

  • Slower than CUDA on NVIDIA
  • Less AI library support
  • Declining adoption

Metal (Apple)

Apple Silicon

Apple's GPU programming framework for M1/M2/M3 chips

Advantages:

  • Excellent on Apple Silicon
  • Unified memory architecture
  • Growing ML support (MLX)

Limitation:

  • Apple devices only
  • Not for datacenter

📊 Market Reality Check

Platform AI Market Share Ecosystem Maturity
CUDA (NVIDIA) ~95% ⭐⭐⭐⭐⭐
ROCm (AMD) ~3% ⭐⭐⭐
oneAPI (Intel) ~1% ⭐⭐
Metal (Apple) ~1% ⭐⭐⭐
OpenCL <1% ⭐⭐

💡 Practical Advice

  • For production AI: CUDA (NVIDIA GPUs) is still the safest bet
  • For cost optimization: Consider AMD MI series with ROCm
  • For Apple Silicon: Use Metal/MLX for local development
  • For maximum portability: Use high-level frameworks (PyTorch, ONNX)

18. Specialized AI Hardware

Plain Explanation

Beyond CPUs and GPUs, there are specialized accelerators designed specifically for AI workloads. These chips sacrifice flexibility for extreme performance and efficiency.

🚀 Major AI Accelerators

TPU (Tensor Processing Unit) - Google

Custom ASIC for TensorFlow, excellent for training and inference

Apple Neural Engine

Built into M-series chips, optimized for CoreML

AWS Inferentia / Trainium

Amazon's custom chips for cloud inference

Intel Gaudi

Deep learning accelerator for training and inference

✅ Part IV Complete!

You now understand hardware from kernels to silicon, ISAs, CUDA, and specialized accelerators. Ready for memory optimization!

19. RAM vs VRAM

Plain Explanation

RAM (system memory) and VRAM (video memory) serve the same purpose of storing data, but they're optimized for different processors. Understanding the difference is critical for inference deployment decisions.

💡 Mental Model

RAM = CPU's storage
VRAM = GPU's storage

Data must live where it's processed

📊 Key Differences

Characteristic RAM VRAM
Type DDR4/DDR5 GDDR6/HBM2
Bandwidth 50-100 GB/s 500-2,000 GB/s
Capacity 64-512 GB 8-80 GB
Cost per GB $1-3 $20-100
Latency ~60ns ~200ns

🔄 Memory Transfer Bottleneck

Moving data between RAM and VRAM is expensive:

Model in RAM
→ PCIe (~16 GB/s) →
VRAM for GPU

Transfer time for 7B model (14GB): ~1 second
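As a back-of-envelope check, the estimate above is just size divided by bandwidth. A minimal sketch (the 16 GB/s figure assumes PCIe Gen3 x16 and ignores protocol overhead):

```python
def transfer_seconds(model_gb: float, pcie_gb_per_s: float = 16.0) -> float:
    """Estimate host-to-device copy time: size / bandwidth."""
    return model_gb / pcie_gb_per_s

# LLaMA-7B in FP16 is ~14 GB
print(f"{transfer_seconds(14.0):.2f} s")  # ~0.88 s, i.e. roughly 1 second
```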

đŸ’ģ RAM: CPU Inference

✓ Advantages:

  • Much larger capacity (512GB possible)
  • Lower cost per GB
  • Easier to upgrade

✗ Limitations:

  • Lower bandwidth
  • CPU is slower for parallel ops

🎮 VRAM: GPU Inference

✓ Advantages:

  • Massive bandwidth (20x faster)
  • GPU optimized for parallel ops
  • Lower latency for inference

✗ Limitations:

  • Limited capacity (24-80GB typical)
  • Very expensive
  • Cannot upgrade

đŸŽ¯ Memory Requirements by Model

Whisper Large-v3 (1.5B params)

  • FP32: 6 GB
  • FP16: 3 GB
  • INT8: 1.5 GB

✓ Fits in most GPUs (even T4 with 16GB)

LLaMA-7B

  • FP32: 28 GB
  • FP16: 14 GB
  • INT8: 7 GB
  • INT4: 3.5 GB

✓ INT8 fits in 16GB GPU, INT4 fits in 8GB

LLaMA-70B

  • FP16: 140 GB
  • INT8: 70 GB
  • INT4: 35 GB

⚠ Requires multiple GPUs or extreme quantization

💡 Practical Decision Guide

  • Model fits in VRAM: Use GPU (much faster)
  • Model too large for VRAM: Quantize to INT8/INT4 or use CPU
  • Budget constrained: Use CPU with quantization
  • Very large models (70B+): Multi-GPU or use llama.cpp on CPU with INT4
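The per-model figures above follow directly from parameter count times bytes per parameter. A small helper to reproduce them (weights only, approximating 1 GB as 10^9 bytes; KV-cache and activation memory come on top):

```python
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

def weights_gb(params_billions: float, precision: str) -> float:
    """Approximate weight memory in GB: parameters x bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]

# LLaMA-7B at each precision
for p in ("FP32", "FP16", "INT8", "INT4"):
    print(p, weights_gb(7, p), "GB")  # 28.0, 14.0, 7.0, 3.5
```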

20. Quantization Techniques

Plain Explanation

Quantization means reducing the precision of model weights and activations. Instead of 32-bit floats, use 16-bit, 8-bit, or even 4-bit integers. This makes models smaller and faster with minimal accuracy loss.

💡 Mental Model

Quantization = Compression with controlled quality loss

Like JPEG for images, but for AI models

📊 Precision Levels

FP32 (Float32)

4 bytes

Full precision, baseline accuracy

Range: Âą3.4 × 10^38

FP16 (Float16)

2 bytes

Half precision, minimal loss

2× speedup, 2× memory savings, <0.1% accuracy loss

INT8 (8-bit Integer)

1 byte

Most popular for inference

4× speedup, 4× memory savings, <1% accuracy loss

INT4 (4-bit Integer)

0.5 bytes

Aggressive compression

8× memory savings, 1-3% accuracy loss, great for LLMs
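To make "controlled quality loss" concrete, here is a toy symmetric INT8 round-trip in plain Python (per-tensor scaling; real runtimes typically use per-channel scales and calibrated ranges):

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8: q = round(x / scale), scale = max|x| / 127."""
    scale = max(abs(v) for v in values) / 127
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.51, -1.23, 2.87, -0.02, 1.05]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
# Round-trip error is bounded by half a quantization step (scale / 2)
```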

🔧 Quantization Approaches

Post-Training Quantization (PTQ)

Quantize after training is complete

✓ Advantages:

  • No retraining needed
  • Fast (minutes)
  • Easy to use

✗ Limitations:

  • Can lose 1-3% accuracy
  • Needs calibration data

Quantization-Aware Training (QAT)

Train with quantization in mind

✓ Advantages:

  • Better accuracy
  • Can handle INT4 better
  • Model adapts to low precision

✗ Limitations:

  • Requires retraining
  • Time-consuming (days/weeks)

📈 Impact on Model Size

LLaMA-7B Model Size by Precision:

FP32:
28 GB
FP16:
14 GB
INT8:
7 GB
INT4:
3.5 GB

🎤 ASR Quantization

Whisper models handle quantization well:

  • FP16: Recommended for GPU, no loss
  • INT8: Perfect for CPU (OpenVINO), <0.5% WER increase
  • INT4: Use with caution, test accuracy

đŸ’Ŧ LLM Quantization

LLMs are quantization-friendly:

  • INT8: Minimal perplexity increase (<1%)
  • INT4: Popular for llama.cpp, 1-3% loss
  • GPTQ/AWQ: Advanced INT4 methods

⚡ Quick Recommendations

  • GPU inference: Use FP16 (native support, no accuracy loss)
  • CPU inference: Use INT8 (4x faster, <1% loss)
  • Memory constrained: Use INT4 (llama.cpp, GPTQ)
  • Always calibrate: Test on real data before deploying

21. Calibration for Quantization

Plain Explanation

Calibration is the process of finding the right scale factors when converting from high precision (FP32) to low precision (INT8). Without calibration, quantization causes significant accuracy loss.

💡 Mental Model

Calibration = Measuring before compressing

Like setting the right exposure for a photo

🔍 Why Calibration Matters

When quantizing, we need to map floating-point ranges to integer ranges:

FP32 Range

-5.2 to +8.7

→

INT8 Range

-128 to +127

Calibration finds the optimal mapping to minimize accuracy loss

📊 Calibration Methods

1. Min-Max Calibration

Uses observed min/max values from calibration data

scale = (max - min) / 255

✓ Pros:

  • Simple
  • Fast

✗ Cons:

  • Sensitive to outliers
  • Less accurate

2. Entropy Calibration (KL Divergence)

Minimizes information loss between FP32 and INT8

Finds threshold that minimizes KL(P||Q)

✓ Pros:

  • More accurate
  • Robust to outliers

✗ Cons:

  • Slower
  • More complex

3. Percentile Calibration

Uses percentiles (e.g., 99.9%) to clip outliers

Ignores extreme values that hurt quantization

✓ Pros:

  • Good balance
  • Handles outliers

✗ Cons:

  • Requires tuning
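
The difference between these methods is easiest to see on data with an outlier. A toy comparison in plain Python (min-max follows the scale formula above; the percentile variant uses a symmetric scale clipped at the 99.9th percentile; both are illustrative, not a production calibrator):

```python
def minmax_scale(activations):
    """Min-max calibration: scale = (max - min) / 255 (sensitive to outliers)."""
    return (max(activations) - min(activations)) / 255

def percentile_scale(activations, pct=99.9):
    """Percentile calibration: clip |x| at the pct-th percentile, symmetric INT8."""
    s = sorted(abs(a) for a in activations)
    idx = min(len(s) - 1, int(len(s) * pct / 100))
    return s[idx] / 127

acts = [0.01 * i for i in range(1000)] + [50.0]  # typical range 0-10, one outlier
# The outlier inflates the min-max scale; the percentile scale ignores it,
# preserving resolution for the values that actually occur.
print(minmax_scale(acts), percentile_scale(acts))
```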

đŸ“Ļ Calibration Dataset

You need representative data to calibrate:

Size:

100-1,000 samples typical (more is not always better)

Diversity:

Cover all types of inputs (different accents, languages, topics)

Source:

Use validation set or production samples

đŸ’ģ Calibration Example (TensorRT)

import tensorrt as trt

# 1. Create calibrator (the cache methods are required by the TensorRT API)
class Calibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, data_loader):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.data_loader = data_loader
        self.batch_size = 32

    def get_batch(self, names):
        # Return device pointers to the next calibration batch
        # (None signals that the data is exhausted)
        return next(self.data_loader, None)

    def get_batch_size(self):
        return self.batch_size

    def read_calibration_cache(self):
        # Return a cached calibration table to skip recalibration, if available
        return None

    def write_calibration_cache(self, cache):
        # Persist the calibration table here to reuse on later builds
        pass

# 2. Load calibration data
calibration_data = load_samples(count=500)

# 3. Build engine with calibration
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = Calibrator(calibration_data)

# 4. Build quantized engine
engine = builder.build_serialized_network(network, config)

TensorRT INT8 calibration using entropy method

📈 Calibration Best Practices

✓ Do:

  • Use 100-500 diverse samples
  • Match production data distribution
  • Test multiple calibration methods
  • Validate accuracy after calibration
  • Cache calibration results

✗ Don't:

  • Use training data for calibration
  • Calibrate with <50 samples
  • Skip accuracy validation
  • Use non-representative data
  • Forget to version calibration data

💡 Key Takeaways

  • Calibration is essential for INT8 quantization quality
  • Use entropy (KL divergence) for best accuracy
  • 500 diverse samples is a good target
  • Always validate accuracy on real data after calibration
  • Cache calibration results to avoid recomputation

22. KV-Cache Optimization

Plain Explanation

The KV-cache (Key-Value cache) stores attention intermediate results in transformer models. It avoids recomputing past tokens, making generation much faster but consuming significant memory.

💡 Mental Model

KV-cache = Memory for what the model has "seen"

Speed vs memory tradeoff

🔍 How KV-Cache Works

Without cache, generating each token requires recomputing attention for all previous tokens:

Token 1

Compute attention (1 token)

Token 2

Recompute attention (2 tokens) ❌

Token 3

Recompute attention (3 tokens) ❌

Total: O(n²) complexity - very slow!

With KV-cache, we store and reuse past attention:

Token 1

Compute & cache ✓

Token 2

Use cache + compute new ✓

Token 3

Use cache + compute new ✓

Total: O(n) complexity - much faster!

📊 KV-Cache Memory Requirements

Memory formula per token:

KV_bytes_per_token = 2 × num_layers × hidden_size × precision_bytes

(2 = key + value; multiply by sequence length and batch size for the full cache)

LLaMA-7B Example

  • 32 layers × 4096 hidden × 2 bytes (FP16) × 2 (K+V)
  • = 0.5 MB per token
  • For 2048 context: 1 GB just for KV-cache!

GPT-3 (175B) Example

  • 96 layers × 12288 hidden × 2 bytes (FP16) × 2 (K+V)
  • = 4.5 MB per token
  • For 2048 context: 9 GB for KV-cache!

⚡ KV-Cache Optimizations

1. PagedAttention (vLLM)

Manage KV-cache like virtual memory pages

  • ✓ Reduces memory waste by ~40%
  • ✓ Enables dynamic memory allocation
  • ✓ Better batching efficiency

2. KV-Cache Quantization

Quantize KV-cache to INT8 or INT4

  • ✓ 2-4× memory savings
  • ✓ Minimal accuracy loss (<1%)
  • ✓ Allows longer contexts

3. Multi-Query Attention (MQA)

Share one key/value head across all attention heads

  • ✓ Up to num_heads× less KV-cache memory
  • ✓ Faster inference
  • ✗ Requires model architecture change

4. Grouped-Query Attention (GQA)

Middle ground: group heads to share KV

  • ✓ 2-4× less memory than MHA
  • ✓ Better accuracy than MQA
  • ✓ Used in LLaMA-2, Mistral

đŸ’ģ Memory Calculation Tool

def calculate_kv_cache_memory(
    num_layers: int,
    hidden_size: int,
    sequence_length: int,
    batch_size: int = 1,
    precision_bytes: int = 2  # FP16
):
    """Calculate KV-cache memory in GB"""
    # Key + Value = 2, per layer, per token
    bytes_per_token = 2 * num_layers * hidden_size * precision_bytes

    # Total for the full sequence and batch
    total_bytes = bytes_per_token * sequence_length * batch_size

    return total_bytes / (1024**3)  # Convert to GB

# LLaMA-7B example
memory_gb = calculate_kv_cache_memory(
    num_layers=32,
    hidden_size=4096,
    sequence_length=2048,
    batch_size=1,
    precision_bytes=2
)
print(f"KV-cache memory: {memory_gb:.2f} GB")
# Output: KV-cache memory: 1.00 GB

💡 Practical Recommendations

  • For single-user chatbots: Standard KV-cache is fine
  • For high-throughput serving: Use vLLM with PagedAttention
  • For very long contexts (8K+): Enable KV-cache quantization
  • For new models: Consider GQA architecture for better efficiency
  • Always monitor VRAM usage - KV-cache can exceed weights memory!

23. Batching Strategies

Plain Explanation

Batching means processing multiple inputs together instead of one at a time. It dramatically increases throughput but adds latency. The key is choosing the right batching strategy for your use case.

💡 Mental Model

Batching = Loading a bus vs sending taxis

Higher throughput, but people wait for the bus to fill

📊 Batch Size Impact

Batch = 1
10 req/sec
Batch = 8
60 req/sec (6x)
Batch = 32
100 req/sec (10x)

Typical throughput gains from batching (GPU inference)

âš–ī¸ The Latency-Throughput Tradeoff

Batch Size Latency Throughput Use Case
1 Lowest (10ms) Low Real-time apps
8-16 Medium (50ms) Good Interactive services
32-64 High (200ms) Excellent Batch processing
128+ Very High (500ms+) Maximum Offline workloads

🔧 Batching Strategies

1. Static Batching

Wait for N requests, then process together

✓ Pros:

  • Simple to implement
  • Predictable throughput
  • Easy to reason about

✗ Cons:

  • High latency (wait time)
  • Wasted capacity at low load
  • Fixed batch size inefficient

Example:

Wait for 32 requests → Process batch → Wait again

2. Dynamic Batching

Wait for timeout OR max batch size, whichever comes first

✓ Pros:

  • Better latency vs throughput balance
  • Adapts to load
  • More efficient than static

✗ Cons:

  • Still wastes capacity
  • Padding overhead
  • Timeout tuning needed

Example:

Batch size 32 OR 50ms timeout → Process batch

3. Continuous Batching (vLLM)

Add/remove requests from batch as they arrive/complete

✓ Pros:

  • Best throughput
  • Best latency
  • No wasted capacity
  • Adapts to variable lengths

✗ Cons:

  • Complex implementation
  • Requires scheduler
  • Framework-specific (vLLM, TGI)

How it works:

Continuously iterate batches, add new requests, remove finished ones

âš ī¸ Padding Overhead

When batching variable-length inputs, you must pad to the longest in the batch:

Seq 1: 50 tokens
padding
Seq 2: 200 tokens
Seq 3: 80 tokens
padding

Wasted compute: ~45% (padding for sequences 1 and 3)

Solution: Use continuous batching or group similar-length sequences together
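For the batch above, the waste is easy to compute: padding fraction = 1 - (sum of lengths) / (batch size x longest length). A quick sketch:

```python
def padding_waste(lengths):
    """Fraction of batched compute spent on padding (pad to longest sequence)."""
    padded = max(lengths) * len(lengths)
    return (padded - sum(lengths)) / padded

print(f"{padding_waste([50, 200, 80]):.0%}")  # 45% of the batch is padding
```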

🎤 ASR Batching

Real-time:

Batch size = 1 (low latency required)

Batch processing:

Batch size = 16-32 (group similar-length audio)

Tip:

Sort by audio length before batching to minimize padding

đŸ’Ŧ LLM Batching

Interactive chatbots:

Dynamic batching (16-32) or continuous

API serving:

Continuous batching (vLLM) for best efficiency

Recommendation:

Use vLLM for production LLM serving

đŸ’ģ Dynamic Batching Example

import asyncio
from collections import deque

class DynamicBatcher:
    def __init__(self, model, max_batch_size=32, timeout_ms=50):
        self.model = model
        self.max_batch_size = max_batch_size
        self.timeout_ms = timeout_ms
        self.queue = deque()

    async def add_request(self, request):
        """Add request to batch queue"""
        self.queue.append(request)

        # Process immediately when the batch is full
        if len(self.queue) >= self.max_batch_size:
            return await self.process_batch()

        # Otherwise wait for the timeout, then flush whatever accumulated
        await asyncio.sleep(self.timeout_ms / 1000)
        if self.queue:
            return await self.process_batch()

    async def process_batch(self):
        """Process accumulated batch (up to max_batch_size requests)"""
        batch = [self.queue.popleft()
                 for _ in range(min(len(self.queue),
                                    self.max_batch_size))]

        # Run inference on batch
        return self.model.inference(batch)

💡 Best Practices

  • Real-time apps: Use batch size 1 or small dynamic batches
  • High-throughput serving: Use continuous batching (vLLM)
  • Batch processing: Use large static batches (32-64)
  • Monitor P95/P99 latency - batching impacts tail latency!
  • Group similar lengths together to minimize padding waste

24. ASR Deployment Guide

Plain Explanation

This is a practical, copy-paste guide for deploying Automatic Speech Recognition models in production. We'll cover Whisper deployment on both CPU and GPU.

đŸŽ¯ Deployment Decision Tree

Real-time streaming (<200ms latency)?

→ GPU (TensorRT) or CPU (OpenVINO INT8)

Batch processing (latency flexible)?

→ CPU (OpenVINO) for cost savings

High throughput (>100 req/sec)?

→ GPU (TensorRT) with batching

đŸ’ģ Option 1: CPU Deployment (OpenVINO)

# Step 1: Install dependencies

pip install openvino openvino-dev
pip install transformers torch torchaudio

# Step 2: Convert Whisper to OpenVINO

from transformers import WhisperForConditionalGeneration
from openvino.tools import mo
from openvino.runtime import serialize

model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-base"
)

# Convert to OpenVINO IR and save to disk
ov_model = mo.convert_model(model, compress_to_fp16=True)
serialize(ov_model, "whisper_openvino/model.xml")

# Step 3: Run inference

from openvino.runtime import Core
import numpy as np

core = Core()
model = core.read_model("whisper_openvino/model.xml")
compiled = core.compile_model(model, "CPU")

# Inference
output = compiled([audio_features])[0]

🚀 Option 2: GPU Deployment (TensorRT)

# Step 1: Export to ONNX

import torch
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-base"
).cuda()
model.eval()

# Dummy mel-spectrogram input: (batch, n_mels, frames)
dummy_input = torch.randn(1, 80, 3000).cuda()

torch.onnx.export(
    model,
    dummy_input,
    "whisper.onnx",
    opset_version=17
)

# Step 2: Build TensorRT engine

trtexec --onnx=whisper.onnx \
        --saveEngine=whisper.trt \
        --fp16 \
        --workspace=4096

📊 Production Configuration

Parameter Real-time Batch
Batch Size 1-4 16-32
Beam Size 1-3 5
Precision INT8/FP16 INT8
Chunk Size 5-10s 20-30s

⚡ Quick Start: Fastest Path to Production

# Use faster-whisper (CTranslate2 backend)
pip install faster-whisper

from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.mp3")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

25. LLM Deployment Guide

Plain Explanation

Large Language Models require specialized serving infrastructure. This guide covers the most popular deployment options from local CPU to production GPU serving.

đŸŽ¯ LLM Deployment Options

llama.cpp / Ollama

CPU/Local

Best for: Development, edge deployment, no GPU

vLLM

GPU/Production

Best for: High-throughput GPU serving, PagedAttention

Text Generation Inference (TGI)

GPU/HuggingFace

Best for: HuggingFace ecosystem, production serving

TensorRT-LLM

GPU/Ultra-Fast

Best for: Lowest latency on NVIDIA GPUs

đŸ’ģ Option 1: llama.cpp (CPU)

# Build llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# Download quantized model (GGUF format)
wget https://huggingface.co/.../model-q4_k_m.gguf

# Run inference
./main -m model-q4_k_m.gguf \
       -p "Write a Python function to" \
       -n 128 \
       -t 8

Or use Ollama (easier):

curl https://ollama.ai/install.sh | sh
ollama run llama2
ollama run mistral

🚀 Option 2: vLLM (GPU Production)

# Install vLLM

pip install vllm

# Python API

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
sampling_params = SamplingParams(temperature=0.7, top_p=0.9)

prompts = ["Hello, my name is", "The future of AI is"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

# Or use OpenAI-compatible server

python -m vllm.entrypoints.openai.api_server \
       --model meta-llama/Llama-2-7b-hf \
       --tensor-parallel-size 1

# Then use with OpenAI client
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-2-7b-hf",
       "prompt": "San Francisco is",
       "max_tokens": 50}'

📊 Performance Comparison

Method Hardware Throughput Latency
llama.cpp (INT4) CPU (16 core) ~10 tok/sec Medium
vLLM (FP16) A100-40GB ~100 tok/sec Low
TensorRT-LLM A100-40GB ~150 tok/sec Very Low

âš™ī¸ Production Configuration Best Practices

Memory Management

  • â€ĸ Use INT8/INT4 quantization to fit larger models
  • â€ĸ Enable KV-cache quantization (vLLM supports this)
  • â€ĸ Monitor GPU memory usage continuously

Batching

  • â€ĸ Start with batch size 16-32
  • â€ĸ Use continuous batching (vLLM does this automatically)
  • â€ĸ Monitor P95/P99 latency, not just average

Sampling Parameters

  • â€ĸ temperature: 0.7-0.9 (lower = more deterministic)
  • â€ĸ top_p: 0.9 (nucleus sampling)
  • â€ĸ max_tokens: Set based on use case (limit costs)

💡 Cost Optimization Tips

  • CPU (llama.cpp): $0.05-0.20/hr, good for <100 req/day
  • GPU (T4): $0.50/hr, good for moderate traffic
  • GPU (A100): $3-5/hr, for high throughput
  • Consider spot instances for 60-80% cost savings

26. Benchmarking and Metrics

Plain Explanation

Benchmarking inference systems requires understanding multiple metrics beyond just latency. This guide teaches you how to measure and interpret performance correctly.

📊 Key Metrics Explained

Latency

Time to process a single request

Metrics to track:

  • P50 (median): Typical user experience
  • P95: 95% of requests faster than this
  • P99: Tail latency (important!)
  • Max: Worst-case scenario

Throughput

Requests processed per second

Measured as:

  • QPS (Queries Per Second)
  • Tokens/second (for LLMs)
  • Audio hours/hour (for ASR)

Accuracy

Quality of model predictions

ASR Metrics:

  • WER (Word Error Rate): Lower is better
  • CER (Character Error Rate)

LLM Metrics:

  • Perplexity, BLEU, ROUGE
  • Human evaluation

Resource Utilization

Hardware efficiency

  • CPU/GPU utilization (%)
  • Memory usage (RAM/VRAM)
  • Power consumption (Watts)

âš ī¸ Common Benchmarking Mistakes

❌ Only measuring P50 latency

Problem: Ignores tail latency. Some users get 10x worse experience.

✓ Fix: Always report P95 and P99

❌ Not warming up the model

Problem: First inference is slow (loading weights, JIT compilation)

✓ Fix: Run 10-100 warmup iterations before measuring

❌ Testing with unrealistic data

Problem: Production has noise, accents, variable length

✓ Fix: Use production-representative test set

❌ Single-threaded benchmarks

Problem: Doesn't test concurrent load

✓ Fix: Load test with multiple concurrent requests

đŸ’ģ Benchmarking Code Example

import time
import numpy as np

def benchmark_inference(model, test_data, warmup=10):
    # Warmup: first runs are slow (weight loading, JIT compilation)
    for _ in range(warmup):
        model(test_data[0])

    # Measure per-request latency
    latencies = []
    for data in test_data:
        start = time.perf_counter()
        output = model(data)
        latencies.append((time.perf_counter() - start) * 1000)  # ms

    # Report percentiles and throughput
    print(f"P50: {np.percentile(latencies, 50):.2f}ms")
    print(f"P95: {np.percentile(latencies, 95):.2f}ms")
    print(f"P99: {np.percentile(latencies, 99):.2f}ms")
    print(f"Max: {np.max(latencies):.2f}ms")
    print(f"Throughput: {len(test_data) / np.sum(latencies) * 1000:.2f} req/sec")

27. Production Monitoring

Plain Explanation

Production monitoring means tracking key metrics to ensure your AI inference system is healthy, fast, and accurate. Without monitoring, you fly blind.

💡 Mental Model

Monitoring = Health dashboard for your inference system

See problems before users complain

📊 Critical Metrics to Track

1. Latency Metrics

P50 (Median):

Typical user experience. Target: <100ms for interactive

P95:

95% of requests faster. Good SLA metric. Target: <200ms

P99:

Tail latency - critical! Target: <500ms

Max:

Worst case. Should not exceed 2× P99

2. Throughput Metrics

Requests Per Second (RPS/QPS):

Total load on system

Tokens Per Second (LLMs):

Generation speed

Batch Utilization:

% of max batch size used

3. Resource Metrics

GPU Utilization:

Should be 70-90% for good efficiency

GPU Memory Usage:

Watch for OOM! Alert at 90%

CPU Utilization:

For CPU inference or preprocessing

GPU Temperature:

Alert if >85°C (thermal throttling risk)

4. Quality Metrics

WER (ASR):

Word Error Rate on production data

Perplexity (LLMs):

Model confidence metric

Error Rate:

% of requests that fail/timeout

5. Cost Metrics

Cost Per Request:

Total infrastructure cost / requests

Cost Per Token (LLMs):

Important for usage-based pricing

🔔 Alerting Strategy

🚨 Critical Alerts (Page On-Call)

  • Error rate > 5%
  • P99 latency > 2× baseline
  • GPU memory > 95%
  • Service down / no responses

⚠ī¸ Warning Alerts (Review Next Day)

  • P95 latency > 1.5× baseline
  • GPU memory > 85%
  • GPU temperature > 80°C
  • Throughput dropped > 30%

â„šī¸ Info Alerts (Monitor Trends)

  • WER increased > 10%
  • Cost per request trending up
  • Traffic patterns changing
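These thresholds are simple enough to encode as a rule check. A sketch mapping the critical alerts above onto a metrics snapshot (the field names are illustrative, not from any particular monitoring library):

```python
def critical_alerts(metrics, baseline_p99_ms):
    """Return the critical-alert conditions a metrics snapshot violates."""
    alerts = []
    if metrics["error_rate"] > 0.05:
        alerts.append("error rate > 5%")
    if metrics["p99_ms"] > 2 * baseline_p99_ms:
        alerts.append("P99 latency > 2x baseline")
    if metrics["gpu_mem_frac"] > 0.95:
        alerts.append("GPU memory > 95%")
    return alerts

snapshot = {"error_rate": 0.08, "p99_ms": 450, "gpu_mem_frac": 0.91}
print(critical_alerts(snapshot, baseline_p99_ms=200))
# ['error rate > 5%', 'P99 latency > 2x baseline']
```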

đŸ› ī¸ Monitoring Stack

Metrics Collection:

  • â€ĸ Prometheus: Time-series metrics
  • â€ĸ StatsD: Application metrics
  • â€ĸ CloudWatch: AWS metrics

Visualization:

  • â€ĸ Grafana: Dashboards
  • â€ĸ Datadog: All-in-one (paid)
  • â€ĸ Kibana: Logs + metrics

GPU Monitoring:

  • â€ĸ nvidia-smi: Basic GPU stats
  • â€ĸ DCGM: NVIDIA Data Center GPU Manager
  • â€ĸ nvtop: Real-time GPU monitor

Alerting:

  • â€ĸ PagerDuty: On-call management
  • â€ĸ AlertManager: Prometheus alerts
  • â€ĸ Opsgenie: Incident response

đŸ’ģ Monitoring Example (Prometheus + Python)

from prometheus_client import Counter, Histogram, Gauge
import time

# Define metrics
request_count = Counter('inference_requests_total', 
                        'Total inference requests')
latency = Histogram('inference_latency_seconds', 
                   'Inference latency')
gpu_memory = Gauge('gpu_memory_used_bytes', 
                  'GPU memory usage')

def inference_with_monitoring(model, input_data):
    # Track request
    request_count.inc()
    
    # Measure latency
    start = time.time()
    
    # Run inference
    result = model(input_data)
    
    # Record latency
    latency.observe(time.time() - start)
    
    # Track GPU memory; get_gpu_memory_usage() is a placeholder
    # (e.g., read from nvidia-smi or DCGM)
    gpu_memory.set(get_gpu_memory_usage())
    
    return result
                

📈 Sample Grafana Dashboard

Latency Panel

P50/P95/P99 over time

Throughput Panel

Requests/sec graph

GPU Usage Panel

Utilization % + memory

Error Rate Panel

Failed requests %

💡 Monitoring Best Practices

  • Always track P95 and P99 - median alone hides problems
  • Set up alerts BEFORE going to production
  • Monitor GPU memory proactively - OOM crashes are sudden
  • Track quality metrics (WER, perplexity) to catch model degradation
  • Review dashboards weekly to spot trends
  • Keep historical data (90+ days) for capacity planning

28. Hyperparameter Tuning

Plain Explanation

Hyperparameters control how inference behaves without changing the model weights. Tuning them properly is critical for balancing quality, speed, and cost.

💡 Mental Model

Hyperparameters = Knobs to tune performance vs quality

🎤 ASR Hyperparameters

beam_size (1-10)

Number of candidate transcriptions explored

Lower (1-3):

Faster, less accurate

Higher (5-10):

Slower, more accurate

✓ Recommended: 5 for production, 1-3 for real-time

temperature (0.0-1.0)

Controls randomness in output selection

0.0:

Deterministic (always same output)

0.8-1.0:

More creative/random

✓ Recommended: 0.0-0.3 for ASR (want consistency)

no_speech_threshold (0.0-1.0)

Probability threshold for detecting silence

✓ Recommended: 0.6 (prevents hallucinations on silence)

compression_ratio_threshold (1.0-3.0)

Detects repetitive/gibberish output

✓ Recommended: 2.4 (reject if compression ratio too high)

đŸ’Ŧ LLM Hyperparameters

temperature (0.0-2.0)

Controls creativity vs consistency

0.0-0.3:

Factual tasks

0.7-0.9:

Conversational

1.0-2.0:

Creative writing

top_p (0.0-1.0)

Nucleus sampling: consider top tokens until cumulative probability reaches p

✓ Recommended: 0.9-0.95 for most tasks

max_tokens (1-4096+)

Maximum output length

âš ī¸ Critical: Controls costs! Set based on use case

presence_penalty (-2.0 to 2.0)

Penalizes tokens that have appeared

✓ Use 0.5-1.0 to reduce repetition

âš–ī¸ Common Trade-offs

Parameter Increase → Effect
beam_size ↑ Better quality, slower speed
temperature ↑ More creative, less predictable
max_tokens ↑ Longer output, higher cost
batch_size ↑ Higher throughput, more latency
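Temperature is just a divisor applied to logits before softmax, which is why low values sharpen the output distribution and high values flatten it. A minimal illustration in plain Python:

```python
import math

def softmax_t(logits, temperature):
    """Softmax over logits / temperature (numerically stabilized)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_t(logits, 0.2))  # sharply peaked: top token dominates
print(softmax_t(logits, 1.5))  # flatter: more randomness when sampling
```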

đŸŽ¯ Quick Start Configs

Real-time ASR:

beam_size=1, temperature=0.0, batch_size=1

Batch ASR:

beam_size=5, temperature=0.0, batch_size=32

Chatbot LLM:

temperature=0.7, top_p=0.9, max_tokens=512

Code Generation:

temperature=0.2, top_p=0.95, max_tokens=2048

29. Call Center ASR Optimization

Plain Explanation

Call center ASR is one of the most challenging real-world deployments: long audio, background noise, multiple speakers, accents, and regulatory requirements.

đŸŽ¯ Unique Challenges

Audio Quality Issues

  • Phone line compression (8kHz)
  • Background noise
  • Poor microphones
  • Echo and feedback

Content Challenges

  • Multiple speakers
  • Overlapping speech
  • Accents and dialects
  • Domain-specific jargon

Scale Requirements

  • Long recordings (30-60 min)
  • High volume (1000s/day)
  • Real-time + batch
  • Cost sensitivity

Compliance

  • PCI-DSS (credit cards)
  • HIPAA (healthcare)
  • Data retention policies
  • Audit trails

🔧 Optimization Strategy

1. Audio Pre-processing

  • â€ĸ Resampling: Upsample 8kHz phone audio to 16kHz
  • â€ĸ Noise Reduction: Apply spectral subtraction or Wiener filtering
  • â€ĸ Normalization: Standardize volume levels
  • â€ĸ VAD (Voice Activity Detection): Remove silence/hold music

2. Model Selection

Recommended: Whisper Large-v3

  • ✓ Excellent at noisy audio
  • ✓ Multilingual support
  • ✓ Good with accents
  • ✓ Can handle phone quality

3. Chunking Strategy

Problem: 60-minute calls exceed model context

Solution: Sliding window with overlap

  • â€ĸ Chunk size: 30 seconds
  • â€ĸ Overlap: 3 seconds
  • â€ĸ Merge chunks with overlap deduplication

4. Hyperparameter Tuning

# Production config for call centers
config = {
    "beam_size": 5,              # Quality over speed
    "temperature": 0.0,          # Deterministic
    "no_speech_threshold": 0.6,  # Detect silence
    "compression_ratio_threshold": 2.4,
    "condition_on_previous_text": True,  # Context
    "language": "en",            # Avoid auto-detect errors
    "vad_filter": True,          # Remove silence
    "vad_parameters": {
        "threshold": 0.5,
        "min_speech_duration_ms": 250,
        "min_silence_duration_ms": 2000
    }
}

💰 Cost Optimization

| Approach | Cost/Hour | Throughput | Best For |
| --- | --- | --- | --- |
| CPU (OpenVINO INT8) | $0.20-0.50 | 5-10x realtime | Batch processing |
| GPU T4 (TensorRT) | $0.50-1.00 | 20-30x realtime | Mixed workload |
| GPU A100 | $3-5 | 50-100x realtime | Real-time only |

📊 Quality Metrics for Call Centers

Word Error Rate (WER)

Target: <10% for good quality, <15% acceptable

Speaker Diarization Accuracy

Target: >85% correct speaker attribution

Processing Time

Target: <0.1x realtime (60 min call in 6 min)
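The realtime-factor target is easy to compute and worth wiring into monitoring:

```python
def realtime_factor(processing_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration. Lower is faster;
    RTF 0.1 means a 60-minute call is transcribed in 6 minutes."""
    return processing_s / audio_s

rtf = realtime_factor(processing_s=6 * 60, audio_s=60 * 60)
print(rtf, rtf <= 0.1)  # 0.1 True — meets the call-center target
```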

💡 Production Checklist

  • ✓ Implement VAD to remove silence (saves 30-50% compute)
  • ✓ Use INT8 quantization for 4x cost reduction
  • ✓ Enable batching for non-realtime workloads
  • ✓ Monitor WER on production data monthly
  • ✓ Implement PII redaction (credit cards, SSN)
  • ✓ Store only transcripts, delete audio per policy
  • ✓ Set up alerts for quality degradation

30. Common Failure Modes

Plain Explanation

AI inference systems fail in predictable ways. Knowing these patterns helps you prevent issues before they hit production.

🔴 Memory Failures

OOM: Out of Memory

Symptom: Process crashes with "CUDA out of memory" or killed by OOM killer

Common Causes:

  • â€ĸ Activations exceed VRAM (not weights!)
  • â€ĸ Batch size too large
  • â€ĸ Input sequence too long
  • â€ĸ Memory leak in application code

✓ Fix: Reduce batch size, quantize KV-cache, cap input sequence length (gradient checkpointing only helps during training)

Memory Fragmentation

Symptom: OOM despite memory appearing available

✓ Fix: Restart service periodically, use memory pools, enable PagedAttention

⚡ Performance Degradation

CPU Oversubscription

Symptom: Slow inference despite low GPU usage

Cause: Too many threads fighting for CPU cores

✓ Fix: Set OMP_NUM_THREADS=physical_cores, avoid hyperthreading
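In Python services these thread-pool variables must be set before the numerical libraries are imported. A sketch; note the physical-core count here is a heuristic guess, since `cpu_count()` reports logical cores and hyperthreading typically doubles that:

```python
import multiprocessing
import os

# Heuristic: assume hyperthreading, so physical cores = logical cores / 2.
physical = max(1, multiprocessing.cpu_count() // 2)

# Must happen *before* importing numpy/torch, or the pools are already sized.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = str(physical)

print(os.environ["OMP_NUM_THREADS"])
```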

NUMA Issues

Symptom: Inconsistent CPU performance

Cause: Non-Uniform Memory Access on multi-socket servers

✓ Fix: Use numactl to bind process to single NUMA node

Thermal Throttling

Symptom: Performance degrades over time

Cause: GPU/CPU overheating, clock speed reduced

✓ Fix: Improve cooling, reduce power limit, monitor temperature

🐛 Quality Issues

Hallucinations (ASR)

Symptom: Model generates text on silence or music

✓ Fix: Use VAD filter, increase no_speech_threshold to 0.6

Repetitive Output (LLM)

Symptom: Model repeats same phrases

✓ Fix: Increase presence_penalty, use repetition_penalty parameter
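OpenAI-style presence penalty is a flat subtraction from the logit of every token that has already appeared. A toy sketch over a tiny vocabulary:

```python
def apply_presence_penalty(logits: dict, seen: set, penalty: float = 0.7) -> dict:
    """Subtract a flat presence penalty from the logit of every token
    that has already appeared in the output (OpenAI-style semantics)."""
    return {tok: (v - penalty if tok in seen else v)
            for tok, v in logits.items()}

logits = {"the": 2.0, "again": 1.8, "new": 1.5}
out = apply_presence_penalty(logits, seen={"again"})
print(out["again"], out["new"])  # "again" is now less likely to repeat
```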

Quantization Degradation

Symptom: Accuracy drops after quantization

✓ Fix: Use calibration dataset, try QAT instead of PTQ

🔧 System-Level Failures

CUDA Version Mismatch

Symptom: Import errors or runtime failures

✓ Fix: Match CUDA version with PyTorch/TensorRT build

Driver Issues

Symptom: GPU not detected or slow performance

✓ Fix: Update NVIDIA drivers, verify with nvidia-smi

🔍 Debugging Checklist

  1. Check GPU memory: nvidia-smi
  2. Monitor CPU usage: htop
  3. Check system logs: dmesg | grep -i error
  4. Verify CUDA: python -c "import torch; print(torch.cuda.is_available())"
  5. Profile memory: Use torch.cuda.memory_summary()

31. Troubleshooting Guide

Plain Explanation

A systematic troubleshooting guide for diagnosing and fixing common inference issues in production.

🚨 Issue: Slow Inference

Step 1: Identify Bottleneck

  • â€ĸ Check GPU utilization (should be >80%)
  • â€ĸ Check CPU utilization
  • â€ĸ Profile with PyTorch Profiler

Step 2: Common Fixes

  • ✓ Increase batch size (if memory allows)
  • ✓ Use FP16/INT8 precision
  • ✓ Enable TensorRT/OpenVINO optimizations
  • ✓ Check data loading isn't the bottleneck

🚨 Issue: Out of Memory

Step 1: Identify What's Using Memory

import torch
print(torch.cuda.memory_summary())

Step 2: Reduce Memory Usage

  • ✓ Reduce batch size (most effective)
  • ✓ Use quantization (INT8/INT4)
  • ✓ Enable gradient checkpointing (training)
  • ✓ Clear cache: torch.cuda.empty_cache()

🚨 Issue: Poor Accuracy

Step 1: Isolate the Problem

  • â€ĸ Compare with baseline model
  • â€ĸ Test on known-good samples
  • â€ĸ Check if quantization caused it

Step 2: Common Fixes

  • ✓ Recalibrate quantization
  • ✓ Increase beam size
  • ✓ Adjust temperature
  • ✓ Enable VAD filter (ASR)

📋 Quick Reference Commands

# Check GPU status

nvidia-smi

# Monitor GPU continuously

watch -n 1 nvidia-smi

# Check CUDA version

nvcc --version

# Test PyTorch CUDA

python -c "import torch; print(torch.cuda.is_available())"

# Profile inference

from torch.profiler import profile
with profile() as prof:
    model(input)
print(prof.key_averages().table())

🎉 Congratulations! Handbook Complete!

You've completed all 31 topics across 7 parts of the AI Inference Engineering Handbook!

✅ What You've Learned:

  • â€ĸ Model fundamentals
  • â€ĸ Hardware architectures
  • â€ĸ Inference runtimes
  • â€ĸ Optimization techniques
  • â€ĸ Production deployment
  • â€ĸ Troubleshooting

🚀 Next Steps:

  • â€ĸ Deploy your first model
  • â€ĸ Benchmark performance
  • â€ĸ Share this with your team
  • â€ĸ Contribute improvements