AI Inference Engineering Handbook
From Models to Silicon: A Complete Guide
This handbook explains how AI inference actually works, step by step, starting from what a model is, all the way down to how silicon executes math.
Written specifically for IT, Platform, and Infrastructure Engineers who need to deploy and optimize AI models in production.
🎯 What You'll Learn
- Model Fundamentals: Weights, activations, computation graphs
- Inference Runtimes: PyTorch, OpenVINO, TensorRT, ONNX, llama.cpp
- Hardware Architecture: CPU vs GPU, ISAs, CUDA, specialized accelerators
- Optimization Techniques: Quantization, batching, memory management
- Production Deployment: Real-world ASR and LLM deployment strategies
👥 Who This Is For
- Infrastructure Engineers managing AI deployments
- Platform Engineers building ML infrastructure
- DevOps/MLOps professionals
- IT Engineers supporting AI applications
No machine learning background required, just systems/IT experience.
📚 Learning Approach
Each topic follows the same pattern:
- Plain Explanation → concept in simple terms
- Mental Model → how to remember it
- Visual Diagrams → see how it works
- Real Examples → ASR (Whisper) and LLM deployments
- Operational Impact → why it matters for your job
🗺️ Handbook Structure
31 topics across 7 parts:
- Part I: Fundamentals (Models, Weights, Activations)
- Part II: Computation Graphs (Static vs Dynamic, Optimization)
- Part III: Runtimes (PyTorch, OpenVINO, TensorRT, ONNX, llama.cpp)
- Part IV: Hardware (CPU/GPU, ISA, CUDA)
- Part V: Optimization (Quantization, Batching, Memory)
- Part VI: Deployment (ASR, LLM, Benchmarking)
- Part VII: Advanced (Hyperparameters, Troubleshooting)
1. What Is an AI Model?
Plain Explanation
An AI model is a large mathematical function. It takes input data (like audio or text) and produces output data (like transcribed text or probability scores).
During training, the model learns values. During inference, it only performs calculations. No learning happens at inference time.
💡 Mental Model
A model is a compiled mathematical machine
You don't change it while it runs: you feed data in and read results out.
📊 Model Flow Diagram
Input Data → Model (Math Function) → Output
⚡ Key Takeaway
For inference engineers, a model is static. It is executed, not trained. Your job is to run it efficiently.
2. Weights and Parameters
Plain Explanation
Weights are the learned numerical parameters of the model. They represent what the model has learned from data during training.
- Stored in RAM (CPU) or VRAM (GPU)
- Read constantly during inference
- Never change during inference
- A large model may have billions of weights
💡 Mental Model
Weights = Model's knowledge stored as numbers
📊 Weight Size Examples
| Model | Parameters | Size (FP32) |
|---|---|---|
| Whisper Large-v3 | ~1.5B | ~6 GB |
| LLaMA-7B | 7B | 28 GB |
| GPT-4 (estimated) | ~1.7T | ~6,800 GB |
🧮 Memory Calculation
If weights are stored in FP32 (Float32) format:
Each parameter = 4 bytes
7B × 4 bytes ≈ 28 GB
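The arithmetic above can be wrapped in a tiny helper (a hypothetical utility for illustration, not a library function):

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Estimated weight memory in GB (using 1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

print(weight_memory_gb(7e9, 4))      # LLaMA-7B in FP32: 28.0 GB
print(weight_memory_gb(1.5e9, 4))    # Whisper Large-v3 in FP32: 6.0 GB
```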
⚠️ Operational Impact
- Weight size determines if a model fits in memory
- Larger models require more VRAM (GPU) or RAM (CPU)
- Quantization (reducing precision) reduces weight size
- This is why INT8 and INT4 models are popular for deployment
3. Activations and Memory
Plain Explanation
Activations are temporary values created as the model processes input data. Unlike weights, activations only exist during inference.
- Exist only during inference
- Not saved after execution
- Scale with input length and batch size
- Often use MORE memory than weights
💡 Mental Model
- Weights = Knowledge: static, learned values
- Activations = Thinking: dynamic, temporary values
⚠️ Why Activations Matter
- Memory bottleneck: Activations often require MORE memory than weights
- Batch size impact: 2× batch size = 2× activation memory
- Sequence length: Longer inputs = more activations
- OOM errors are usually caused by activations, not weights
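The scaling behaviour can be sketched with rough numbers (hypothetical dimensions; real models keep several intermediate tensors per layer, so treat this as a lower bound, not an exact figure):

```python
def activation_memory_mb(batch_size, seq_len, hidden_dim, num_layers,
                         bytes_per_value=4):
    """Lower-bound activation footprint: one hidden-state tensor per layer."""
    values = batch_size * seq_len * hidden_dim * num_layers
    return values * bytes_per_value / 1e6

base = activation_memory_mb(1, 512, 4096, 32)     # batch=1, 512 tokens
doubled = activation_memory_mb(2, 512, 4096, 32)  # batch=2, same tokens
print(base)            # ~268 MB for this toy configuration
print(doubled / base)  # 2.0 -> doubling batch size doubles activation memory
```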
4. Training vs Inference
Plain Explanation
Training and Inference are two completely different phases in the AI lifecycle. Understanding this distinction is crucial for infrastructure engineering.
📊 Comparison
| Aspect | Training | Inference |
|---|---|---|
| Purpose | Learn from data | Make predictions |
| Changes | Weights are updated | Weights never change |
| Duration | Hours to weeks | Milliseconds to seconds |
| Hardware | Multiple GPUs | CPU or single GPU |
| Team | Data scientists, ML engineers | Platform/infra engineers (you!) |
| Focus | Accuracy, convergence | Latency, throughput, cost |
✅ Part I Complete!
You now understand the fundamentals: models, weights, activations, and the difference between training and inference. Ready to learn about computation graphs!
5. Understanding Computation Graphs
Plain Explanation
A computation graph represents a model as a graph of mathematical operations. It's like a blueprint that shows exactly what calculations need to happen and in what order.
💡 Mental Model
A computation graph = Wiring diagram for math
🔧 Graph Components
Nodes
Represent operations:
- Matrix multiplication
- Addition
- Activation functions (ReLU, Softmax)
- Convolution
- Attention
Edges
Represent data flow:
- Tensors (multi-dimensional arrays)
- Flow between operations
- Define dependencies
- Show execution order
📈 Simple Graph Example
Input → MatMul → Add Bias → ReLU → Output
🎤 ASR Example (Whisper)
Audio → Mel Spectrogram → Encoder → Decoder → Text
💬 LLM Example (GPT)
Tokens → Embeddings → Transformer Layers → Next-Token Probabilities
🎯 Why Graphs Matter
- Optimization: Graphs can be analyzed and optimized
- Parallelization: Independent operations can run simultaneously
- Memory planning: Know memory needs ahead of time
- Debugging: Visualize what the model does
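As a toy sketch (not any runtime's real API), a graph can be stored as a table of nodes and input edges, and evaluated by resolving dependencies first:

```python
# A toy computation graph: each node names an operation and its input edges.
graph = {
    "x":      ("input", []),
    "one":    ("const", []),
    "square": ("mul",   ["x", "x"]),
    "plus1":  ("add",   ["square", "one"]),
}

ops = {
    "mul": lambda a, b: a * b,
    "add": lambda a, b: a + b,
}

def evaluate(node, feeds, cache=None):
    """Recursively evaluate a node, computing its input edges first."""
    cache = {} if cache is None else cache
    if node in cache:
        return cache[node]
    op, edges = graph[node]
    if op == "input":
        result = feeds[node]
    elif op == "const":
        result = 1.0
    else:
        result = ops[op](*(evaluate(e, feeds, cache) for e in edges))
    cache[node] = result
    return result

print(evaluate("plus1", {"x": 3.0}))  # 10.0  (3*3 + 1)
```

Real runtimes do the same thing at scale: the edges tell them which operations can run in parallel and when intermediate buffers can be freed.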
6. Static vs Dynamic Graphs
Plain Explanation
Computation graphs can be built in two ways: dynamically (as the program runs) or statically (built once ahead of time). This choice has major implications for inference performance.
🔄 Dynamic Graphs
How It Works:
Graph is constructed as operations execute
✅ Advantages:
- Very flexible
- Easy to debug
- Supports control flow (if/else, loops)
- Great for research
❌ Disadvantages:
- Slower execution
- Limited optimization
- Higher overhead
Used by:
PyTorch Eager Mode
⚡ Static Graphs
How It Works:
Graph is built once, then optimized and executed repeatedly
✅ Advantages:
- Much faster execution
- Aggressive optimization
- Memory planning
- Perfect for production
❌ Disadvantages:
- Less flexible
- Harder to debug
- Requires conversion step
Used by:
TorchScript, ONNX, OpenVINO, TensorRT
💡 Mental Model
- Dynamic Graph = Interpreted Code (like running Python)
- Static Graph = Compiled Code (like running C++)
📊 Performance Comparison
| Metric | Dynamic | Static |
|---|---|---|
| Inference Speed | 1x (baseline) | 2-5x faster |
| Debugging | Easy | Harder |
| Flexibility | High | Limited |
| Memory Usage | Higher | Optimized |
| Production Use | Not Recommended | Strongly Recommended |
💻 Code Example: PyTorch
Dynamic (Eager Mode):

```python
# Dynamic execution
import torch

model = WhisperModel()
output = model(audio)  # Graph built on-the-fly
```

Static (TorchScript):

```python
# Static graph - compile once
import torch

model = WhisperModel()
scripted = torch.jit.script(model)  # Build graph
output = scripted(audio)            # Fast execution
```
🎯 When to Use Each
Use Dynamic Graphs for:
- Research and experimentation
- Prototyping
- Models with complex control flow
Use Static Graphs for:
- Production deployment
- Performance-critical applications
- ASR and LLM inference
- This is what you want 99% of the time!
7. Graph Optimization
Plain Explanation
Once you have a static computation graph, inference runtimes can optimize it automatically. These optimizations make inference faster and more memory-efficient without changing the model's accuracy.
💡 Mental Model
Graph optimization = Compiler optimization for AI
Like how C++ compilers optimize your code automatically
🔧 Common Optimization Techniques
1. Operator Fusion (Kernel Fusion)
Combine multiple operations into a single kernel
Before:
MatMul → Add Bias → ReLU (3 separate kernels, 3 round-trips through memory)
After:
Fused MatMul+Bias+ReLU (1 kernel, 1 round-trip through memory)
✅ Fewer memory reads/writes → faster execution
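A plain-Python sketch of why fusion helps, with list passes standing in for kernel launches (illustrative only; real kernels are compiled SIMD/GPU code):

```python
def matmul_bias_relu_unfused(x, w, b):
    """Three 'kernels': each pass writes a full intermediate result."""
    y = [sum(xi * wi for xi, wi in zip(x, col)) for col in w]  # kernel 1: matmul
    y = [yi + bi for yi, bi in zip(y, b)]                      # kernel 2: bias add
    return [max(0.0, yi) for yi in y]                          # kernel 3: ReLU

def matmul_bias_relu_fused(x, w, b):
    """One 'kernel': each output element is computed in a single pass."""
    return [max(0.0, sum(xi * wi for xi, wi in zip(x, col)) + bi)
            for col, bi in zip(w, b)]

x = [1.0, 2.0]
w = [[1.0, 0.0], [0.0, -1.0]]  # weight matrix stored as output columns
b = [0.5, 0.5]
print(matmul_bias_relu_fused(x, w, b))  # [1.5, 0.0] - same as unfused
```

Both versions produce identical output; the fused one simply never materializes the intermediate tensors, which is where the memory-traffic savings come from.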
2. Constant Folding
Pre-compute operations that don't depend on input
Before:

```python
output = input * (2 + 3)  # Computed every time
```

After:

```python
output = input * 5  # Pre-computed
```
3. Dead Code Elimination
Remove operations that don't affect the output
Example: Unused outputs, redundant calculations
4. Layout Optimization
Reorganize data for better memory access patterns
Change tensor formats (NCHW → NHWC) for hardware efficiency
5. Memory Reuse
Reuse memory buffers instead of allocating new ones
Reduces total memory footprint significantly
6. Quantization-Aware Optimization
Optimize for lower precision (INT8, INT4)
Replace FP32 operations with INT8 equivalents
📊 Impact of Graph Optimization
| Level | Latency | Speedup |
|---|---|---|
| Unoptimized | 100 ms | 1× (baseline) |
| Basic Optimization | 50 ms | 2× faster |
| Aggressive Optimization | 25 ms | 4× faster |
Typical speedup from graph optimization on production models
🛠️ Which Runtimes Do This?
TensorRT (NVIDIA)
Most aggressive optimization
OpenVINO (Intel)
Strong CPU optimization
ONNX Runtime
Good cross-platform optimization
PyTorch JIT
Basic optimization
⚡ Key Takeaway
Graph optimization is why specialized inference runtimes are so much faster than running models in PyTorch eager mode. The same model, same weights, but 2-5x faster execution through automatic optimization.
✅ Part II Complete!
You now understand computation graphs, why static graphs matter, and how they're optimized. Ready to learn about the runtimes that execute these graphs!
8. What Is an Inference Runtime?
Plain Explanation
An inference runtime is the software layer that executes your computation graph on hardware. It's the engine that takes your model and actually runs it on CPUs or GPUs.
💡 Mental Model
Runtime = Execution Engine for AI Models
Like JVM for Java or V8 for JavaScript
🎯 What Runtimes Do
Core Responsibilities:
- Load model weights into memory
- Parse computation graph
- Select optimal kernels
- Execute operations in order
- Manage memory allocation
- Schedule threads/cores
Optimizations:
- Graph optimization (fusion, etc.)
- Hardware-specific kernels
- Memory planning
- Parallel execution
- Quantization support
- Batch processing
📚 The Inference Stack
Application → Runtime (PyTorch, OpenVINO, TensorRT, etc.) → Kernels (Optimized math functions) → Hardware (CPU/GPU)
🔧 Popular Inference Runtimes
PyTorch Runtime
General purpose: default runtime, good for prototyping
OpenVINO
CPU optimized: Intel CPUs, excellent INT8 performance
TensorRT
GPU optimized: NVIDIA GPUs, ultra-low latency
ONNX Runtime
Cross-platform: works everywhere, good portability
llama.cpp
LLM specialized: optimized for language models on CPU
🎯 Choosing a Runtime
Your choice depends on:
- Hardware: CPU vs GPU
- Model type: ASR, LLM, vision
- Latency requirements: Real-time vs batch
- Vendor: Intel, NVIDIA, AMD, Apple
- Ecosystem: Python, C++, mobile
9. PyTorch Runtime
Plain Explanation
PyTorch is the most popular deep learning framework. While primarily designed for training, it also has inference capabilities through Eager Mode and TorchScript.
🔄 PyTorch Execution Modes
Eager Mode
- ✅ Dynamic graph execution
- ✅ Easy debugging
- ✅ Flexible and Pythonic
- ❌ Slower inference
- ❌ Limited optimization

```python
model = WhisperModel()
output = model(audio)
```

TorchScript
- ✅ Static graph (optimized)
- ✅ 2-3x faster inference
- ✅ Can run without Python
- ❌ Requires conversion step
- ❌ Some features unsupported

```python
model = WhisperModel()
scripted = torch.jit.script(model)
output = scripted(audio)
```
⚡ Performance Comparison
| Mode | Speed | Ease of Use | Production Ready |
|---|---|---|---|
| Eager Mode | 1x (baseline) | ★★★★★ | ❌ |
| TorchScript | 2-3x | ★★★ | ⚠️ OK |
| OpenVINO | 4-5x | ★★★ | ✅ |
💻 Code Example: Converting to TorchScript

```python
import torch
from transformers import WhisperForConditionalGeneration

# Load model
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
model.eval()

# Example input: batch of 1, 80 mel bins, 3000 frames
dummy_input = torch.randn(1, 80, 3000)

# Convert to TorchScript
with torch.no_grad():
    scripted_model = torch.jit.trace(
        model,
        example_inputs=(dummy_input,)
    )

# Save for deployment
scripted_model.save("whisper_scripted.pt")

# Load and use
loaded = torch.jit.load("whisper_scripted.pt")
output = loaded(audio_input)
```
⚙️ Backend Options (ATen)
PyTorch uses ATen (A Tensor Library) for operations:
CPU Backend
- Uses MKL (Intel Math Kernel Library)
- OpenBLAS or Eigen as fallback
- Good threading support
GPU Backend
- CUDA kernels (NVIDIA)
- cuDNN for convolutions
- cuBLAS for matrix operations
⚠️ When to Use PyTorch Runtime
✅ Good for:
- Prototyping and experimentation
- Research deployments
- When you need maximum flexibility
- Quick proof-of-concept
❌ Not ideal for:
- High-throughput production (use OpenVINO/TensorRT)
- Latency-critical applications
- Resource-constrained environments
💡 Key Insight
PyTorch is excellent for development, but for production inference, you typically want to export to a specialized runtime like OpenVINO (CPU) or TensorRT (GPU) for 2-5x better performance.
10. OpenVINO (Open Visual Inference and Neural Network Optimization)
Plain Explanation
OpenVINO is Intel's inference optimization toolkit. It's specifically designed to run AI models blazingly fast on Intel CPUs, with excellent support for quantization and various model types.
💡 Mental Model
OpenVINO = Intel's turbocharger for CPU inference
✨ Key Features
Strengths:
- ✅ Excellent CPU performance (Intel)
- ✅ Outstanding INT8 quantization
- ✅ Static graph optimization
- ✅ Auto-tuning for your CPU
- ✅ Cross-platform (Windows, Linux)
- ✅ Supports many frameworks
Limitations:
- Best on Intel hardware
- Requires model conversion (IR format)
- Learning curve for optimization
- Limited GPU support (vs TensorRT)
🔧 OpenVINO Workflow
Source model (PyTorch/TensorFlow/ONNX) → Convert to IR (Intermediate Representation) → Optimize (Graph + Quantization) → Execute on Intel hardware
💻 Code Example: Converting and Running

```bash
# Step 1: Install
pip install openvino
```

```python
# Step 2: Convert a model to OpenVINO IR
import openvino as ov

# convert_model accepts ONNX files (and in-memory PyTorch models)
ov_model = ov.convert_model("whisper_model.onnx")
ov.save_model(ov_model, "openvino_model/model.xml")  # compresses weights to FP16 by default

# Step 3: Run Inference
core = ov.Core()
model = core.read_model("openvino_model/model.xml")
compiled = core.compile_model(model, "CPU")

# Run inference
output = compiled([audio_input])[0]
```
📊 Performance: OpenVINO vs PyTorch
Typical speedups for ASR models on Intel Xeon CPUs
🎯 INT8 Quantization with OpenVINO
OpenVINO excels at INT8 quantization. Its Post-training Optimization Tool (POT) is driven by a JSON config:

```json
{
  "model": {
    "model_name": "whisper",
    "model": "whisper.xml",
    "weights": "whisper.bin"
  },
  "engine": {
    "type": "accuracy_checker"
  },
  "compression": {
    "algorithms": [{
      "name": "DefaultQuantization",
      "params": {
        "preset": "performance",
        "stat_subset_size": 300
      }
    }]
  }
}
```

Run it with `pot -c quantization_config.json`.
Result: 4x smaller model, 3-5x faster inference, minimal accuracy loss (<1%)
🎯 When to Use OpenVINO
✅ Perfect for:
- CPU-only production deployments
- Intel hardware (Xeon, Core processors)
- ASR models (Whisper, Conformer)
- Real-time applications on CPU
- Cost-sensitive deployments (no GPU needed)
⚠️ Consider alternatives if:
- You have NVIDIA GPUs (use TensorRT)
- You need maximum GPU performance
- You're on non-Intel hardware
✅ Real-World Use Case
Call Center ASR: Many companies use OpenVINO to run Whisper models on CPU servers, achieving real-time transcription at 1/10th the cost of GPU deployments.
11. TensorRT (NVIDIA Tensor Runtime)
Plain Explanation
TensorRT is NVIDIA's high-performance inference optimizer and runtime. It's designed to squeeze maximum performance out of NVIDIA GPUs through aggressive graph optimization and kernel fusion.
💡 Mental Model
TensorRT = Ultimate GPU performance optimizer
⚡ What Makes TensorRT Fast
1. Aggressive Kernel Fusion
Combines dozens of operations into single GPU kernels
2. Precision Calibration
Automatic INT8 quantization with minimal accuracy loss
3. Layer and Tensor Fusion
Optimizes memory access patterns for GPU architecture
4. Dynamic Tensor Memory
Reuses memory buffers to reduce VRAM usage
5. Multi-Stream Execution
Parallel processing of multiple batches
📊 Performance Comparison
Typical speedups for LLMs on NVIDIA A100 GPUs
💻 Code Example: PyTorch to TensorRT

```python
# Step 1: Export to ONNX
import torch

model = WhisperModel()
dummy_input = torch.randn(1, 80, 3000).cuda()

torch.onnx.export(
    model, dummy_input, "whisper.onnx",
    opset_version=17,
    input_names=["audio"],
    output_names=["text"]
)

# Step 2: Convert to TensorRT
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
with open("whisper.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
engine = builder.build_serialized_network(network, config)
```
🎯 When to Use TensorRT
- Ultra-low latency requirements (real-time ASR, LLM serving)
- NVIDIA GPUs available (A100, H100, V100, T4)
- Production LLM deployments
- When GPU cost is a concern (better utilization)
12. ONNX Runtime (Open Neural Network Exchange)
Plain Explanation
ONNX Runtime is a cross-platform, hardware-agnostic inference engine. It's designed to run models from any framework (PyTorch, TensorFlow, etc.) on any hardware (CPU, GPU, mobile).
💡 Mental Model
ONNX = "Write once, run anywhere" for AI
Like Java's JVM, but for neural networks
🔄 ONNX Ecosystem
Source frameworks (PyTorch, TensorFlow, etc.) → ONNX format → Target hardware (CPU, GPU, NPU, mobile)
✨ Key Features
Strengths:
- ✅ True cross-platform portability
- ✅ Hardware flexibility (CPU/GPU/NPU)
- ✅ Framework agnostic
- ✅ Good performance (2-3x vs PyTorch)
- ✅ Active community support
- ✅ Microsoft backing
Trade-offs:
- Not as fast as TensorRT (GPU)
- Not as fast as OpenVINO (Intel CPU)
- Conversion can be tricky
- Operator coverage gaps
💻 Code Example: Converting and Running

```python
# Step 1: Export PyTorch to ONNX
import torch

model = WhisperModel()
dummy_input = torch.randn(1, 80, 3000)

torch.onnx.export(
    model,
    dummy_input,
    "whisper.onnx",
    export_params=True,
    opset_version=14,
    input_names=["audio"],
    output_names=["text"],
    dynamic_axes={
        "audio": {0: "batch", 2: "time"},
        "text": {0: "batch"}
    }
)

# Step 2: Run with ONNX Runtime
import onnxruntime as ort

# Create session
session = ort.InferenceSession(
    "whisper.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

# Run inference
outputs = session.run(
    None,
    {"audio": audio_input}
)
text_output = outputs[0]
```
⚙️ Execution Providers (Hardware Backends)
CPUExecutionProvider
Default, works everywhere
CUDAExecutionProvider
NVIDIA GPUs with CUDA
TensorrtExecutionProvider
Uses TensorRT under the hood
OpenVINOExecutionProvider
Uses OpenVINO for Intel hardware
CoreMLExecutionProvider
Apple Silicon (M1/M2/M3)
📊 Performance Positioning
| Scenario | Best Choice | ONNX Position |
|---|---|---|
| Intel CPU | OpenVINO | 2nd (Good) |
| NVIDIA GPU | TensorRT | 2nd (Good) |
| Cross-platform | ONNX Runtime | 1st (Best) |
| Mobile / Edge | ONNX Runtime | 1st (Best) |
🎯 When to Use ONNX Runtime
- Multi-hardware deployments (CPU + GPU + mobile)
- Platform flexibility (Windows, Linux, macOS, mobile)
- Framework agnostic (PyTorch, TensorFlow, etc.)
- Good "middle ground" performance
- When you need portability more than peak performance
13. llama.cpp and Variants
Plain Explanation
llama.cpp is a lightweight, CPU-optimized inference engine specifically designed for Large Language Models (LLMs). It's written in pure C/C++ with no dependencies, making it incredibly portable and efficient.
💡 Mental Model
llama.cpp = SQLite for LLMs
Minimal, fast, runs anywhere, zero dependencies
✨ Why llama.cpp Is Special
Unique Strengths:
- ✅ Pure C/C++ (no Python overhead)
- ✅ Extreme portability (Linux, Mac, Windows, mobile)
- ✅ Tiny binary (~few MB)
- ✅ CPU-first design (no GPU required)
- ✅ Quantization mastery (4-bit, 3-bit, 2-bit)
- ✅ Memory-mapped files (efficient loading)
Perfect For:
- Running LLMs on laptops
- Edge deployments
- CPU-only servers
- Local AI applications
- Raspberry Pi / embedded
- Cost-sensitive deployments
🗃️ GGUF Format (GPT-Generated Unified Format)
llama.cpp uses its own model format called GGUF (previously GGML). This format is optimized for:
Memory Mapping
Load instantly without copying
Quantization
Built-in 2/3/4/5/6/8-bit
Portability
Single file, works everywhere
💻 Usage Example

```bash
# Step 1: Build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# Step 2: Convert model to GGUF
python convert.py /path/to/llama-model --outtype f16

# Step 3: Quantize (optional but recommended)
./quantize model-f16.gguf model-q4_0.gguf q4_0

# Step 4: Run inference
./main -m model-q4_0.gguf -p "Hello, my name is" -n 128 -t 8
# -m: model file
# -p: prompt
# -n: number of tokens to generate
# -t: number of CPU threads
```
📊 Quantization Levels
| Format | Bits | Size (7B) | Quality |
|---|---|---|---|
| F16 | 16 | 14 GB | ★★★★★ |
| Q8_0 | 8 | 7 GB | ★★★★★ |
| Q6_K | 6 | 5.5 GB | ★★★★★ |
| Q5_K_M | 5 | 4.8 GB | ★★★★ |
| Q4_K_M | 4 | 4.1 GB | ★★★★ |
| Q3_K_M | 3 | 3.3 GB | ★★★ |
| Q2_K | 2 | 2.7 GB | ★★ |
* Recommended: Q4_K_M or Q5_K_M for best quality/size trade-off
🚀 Popular Variants & Wrappers
llama-cpp-python
Python bindings for llama.cpp
Ollama
User-friendly wrapper with model library
LM Studio
GUI for running GGUF models (uses llama.cpp)
text-generation-webui
Web interface for LLMs (llama.cpp backend)
🎯 When to Use llama.cpp
- CPU-only deployments (no GPU budget)
- Edge devices (Raspberry Pi, embedded systems)
- Local AI applications (privacy-focused)
- Development & prototyping LLM apps
- Memory-constrained environments (with quantization)
- Cross-platform deployment needs
✅ Part III Complete!
You now understand the major inference runtimes: PyTorch, OpenVINO (CPU), TensorRT (GPU), ONNX (cross-platform), and llama.cpp (LLM-specialized). Ready to dive into hardware and kernels!
14. Kernels and Operations
Plain Explanation
A kernel is a hardware-specific implementation of a mathematical operation. The same operation (like matrix multiplication) has different kernels for CPU, GPU, and other accelerators.
💡 Mental Model
Kernel = How math actually runs on silicon
🔧 Operation vs Kernel
Operation
Abstract mathematical function
Examples:
- Matrix Multiplication
- Convolution
- Softmax
- ReLU Activation
Platform-independent concept
Kernel
Hardware-specific implementation
For MatMul:
- CPU kernel (uses AVX-512)
- GPU kernel (uses Tensor Cores)
- ARM kernel (uses NEON)
- TPU kernel (custom silicon)
Hardware-specific code
🎯 Example: Matrix Multiplication Kernels
CPU Kernel (Intel)
Uses MKL (Math Kernel Library) with AVX-512 instructions

```c
// Optimized for Intel CPUs
void matmul_cpu(float* A, float* B, float* C) {
    cblas_sgemm(...);  // MKL function
    // Uses AVX-512 SIMD instructions
}
```

GPU Kernel (NVIDIA)
Uses CUDA with Tensor Cores

```c
// CUDA kernel
__global__ void matmul_gpu(float* A, float* B, float* C) {
    // Parallel execution across thousands of cores
    // Uses Tensor Cores for FP16/INT8
}
```

ARM Kernel
Uses NEON SIMD instructions

```c
// ARM NEON optimized
void matmul_arm(float* A, float* B, float* C) {
    // Uses NEON vector instructions
}
```
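The operation/kernel split can be mimicked with a dispatch table (purely illustrative Python; real runtimes register their kernels in C++):

```python
def matmul_reference(a, b):
    """Naive 'generic' kernel: correct everywhere, optimized for nothing."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# One abstract operation, several kernels keyed by target hardware.
KERNELS = {
    ("matmul", "generic"): matmul_reference,
    # ("matmul", "avx512"): matmul_avx512,  # would be a SIMD C implementation
    # ("matmul", "cuda"):   matmul_cuda,    # would be a GPU kernel
}

def dispatch(op, hardware):
    """Pick the best available kernel, falling back to the generic one."""
    return KERNELS.get((op, hardware), KERNELS[(op, "generic")])

kernel = dispatch("matmul", "avx512")  # no AVX-512 kernel here -> generic
print(kernel([[1, 2]], [[3], [4]]))    # [[11]]
```

Runtimes like OpenVINO do exactly this lookup at model-compile time, so every node in the graph is bound to the fastest kernel the hardware supports.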
📊 Common Deep Learning Operations
Matrix Operations
- GEMM (General Matrix Multiply)
- BatchMatMul
Convolution
- Conv2D (2D Convolution)
- DepthwiseConv
Activation Functions
- ReLU, GELU, Swish
- Softmax, Sigmoid
Normalization
- LayerNorm, BatchNorm
- GroupNorm
Attention
- Multi-Head Attention
- Scaled Dot-Product
Pooling
- MaxPool, AvgPool
- AdaptivePool
⚡ Kernel Libraries
CPU: Intel MKL
Math Kernel Library - highly optimized for Intel CPUs
GPU: cuDNN
CUDA Deep Neural Network library - NVIDIA's DL primitives
GPU: cuBLAS
CUDA Basic Linear Algebra Subprograms - matrix operations
ARM: Compute Library
Optimized kernels for ARM CPUs (NEON) and Mali GPUs
🎯 Why This Matters
Runtimes like OpenVINO and TensorRT are fast because they:
- Select the best kernel for your hardware
- Fuse multiple kernels into one
- Use hardware-specific optimizations
- Minimize kernel launch overhead
15. CPU vs GPU Architecture
Plain Explanation
CPUs and GPUs are designed for fundamentally different workloads. Understanding their architectures helps you choose the right hardware for your inference needs.
🏗️ Architecture Comparison
CPU (Central Processing Unit)
Design Philosophy:
Few powerful cores optimized for sequential tasks
Cores:
4-64 powerful cores
Cache:
Large (32-256 MB)
Clock Speed:
High (2-5 GHz)
Memory:
RAM (DDR4/DDR5)
Bandwidth:
~50-100 GB/s
Best For:
Control flow, branching, general computing
GPU (Graphics Processing Unit)
Design Philosophy:
Thousands of simple cores optimized for parallel tasks
Cores:
1,000-10,000+ CUDA cores
Cache:
Small per core (~KB)
Clock Speed:
Lower (1-2 GHz)
Memory:
VRAM (HBM2/GDDR6)
Bandwidth:
~500-2,000 GB/s
Best For:
Parallel math, matrix operations
💡 Mental Models
- CPU = Ferrari: few fast cores, sequential excellence
- GPU = Bus Fleet: many slow cores, parallel powerhouse
📊 Performance Characteristics
| Task | CPU | GPU |
|---|---|---|
| Matrix Multiply (Large) | Slow | Very Fast |
| Single-threaded Code | Very Fast | Slow |
| Branching / If-else | Excellent | Poor |
| Parallel Operations | Limited | Excellent |
| Memory Bandwidth | 50-100 GB/s | 500-2,000 GB/s |
| Power Efficiency | Better (50-150W) | Hungry (250-700W) |
🎯 When to Use Each for AI Inference
Choose CPU When:
- ✅ Low latency, small batch (batch=1)
- ✅ Cost-sensitive deployments
- ✅ Already have CPU infrastructure
- ✅ Models fit in RAM with quantization
- ✅ ASR models (with OpenVINO)
Choose GPU When:
- ✅ High throughput needed
- ✅ Large batch sizes
- ✅ Very large models (70B+ LLMs)
- ✅ Ultra-low latency critical
- ✅ Budget allows ($$)
💡 Real-World Example
ASR Call Center (Whisper Large-v3):
- CPU (Xeon + OpenVINO INT8): ~200ms latency, $0.50/hour
- GPU (T4 + TensorRT FP16): ~50ms latency, $2.50/hour
Decision: CPU wins for call centers (200ms is acceptable, 5x cost savings)
16. Instruction Set Architectures (ISA)
Plain Explanation
An Instruction Set Architecture (ISA) is the language that your CPU speaks. It defines what operations the processor can perform and how software communicates with hardware.
💡 Mental Model
ISA = CPU's native language
Like English vs Spanish vs Mandarin for humans
🖥️ Major CPU ISAs
x86-64 (AMD64)
Dominant. Used by: Intel (Core, Xeon), AMD (Ryzen, EPYC)
Market: Servers, desktops, laptops
SIMD Extensions:
- SSE (Streaming SIMD Extensions)
- AVX (Advanced Vector Extensions)
- AVX-512 (512-bit vectors for AI/HPC)
- AMX (Advanced Matrix Extensions - new!)
AI Performance: Excellent with AVX-512/AMX
ARM64 (AArch64)
Growing. Used by: Apple (M1/M2/M3), AWS Graviton, NVIDIA Grace
Market: Mobile, edge, emerging servers
SIMD Extensions:
- NEON (Advanced SIMD)
- SVE (Scalable Vector Extension)
- SVE2 (Enhanced for AI/ML)
AI Performance: Good, improving rapidly
Advantage: Power efficiency
RISC-V
Emerging. Used by: SiFive, StarFive, various startups
Market: Edge devices, research, future servers
Key Feature: Open-source ISA (no licensing fees!)
Extensions:
- V extension (Vector operations)
- Zve (Embedded vector)
Status: Early but promising for AI
🎮 GPU ISAs
NVIDIA: PTX â SASS
PTX (Parallel Thread Execution): Virtual ISA (like Java bytecode)
SASS: Native GPU machine code (hardware-specific)
CUDA → PTX → SASS → Hardware
AMD: GCN / RDNA / CDNA
GCN: Graphics Core Next (older)
RDNA: Gaming GPUs
CDNA: Compute/AI GPUs (MI series)
Intel: Gen Graphics
Used in Intel Xe GPUs (Arc, Flex, Max series)
📊 SIMD: Single Instruction Multiple Data
SIMD extensions allow CPUs to perform the same operation on multiple data points simultaneously - crucial for AI inference.
Evolution of Intel SIMD: MMX → SSE → AVX (256-bit) → AVX2 → AVX-512 → AMX (matrix)
Impact: AVX-512 + AMX make modern Intel CPUs competitive with GPUs for INT8 inference
🚀 Why ISA Matters for Inference
For Deployment:
- Kernels are ISA-specific
- x86 binaries won't run on ARM
- Need right compiler/runtime
For Performance:
- AVX-512 = 2-3x faster than AVX2
- AMX = dedicated matrix ops
- NEON optimizes ARM inference
⚠️ Practical Implications
- Docker images must match architecture (amd64 vs arm64)
- OpenVINO automatically detects and uses best ISA features
- ARM Macs (M1/M2) need ARM-specific builds
- Check CPU flags:

```bash
lscpu | grep Flags
```
17. CUDA and Alternatives
Plain Explanation
CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model. It allows developers to use GPUs for general-purpose computing, not just graphics.
💡 Mental Model
CUDA = The "C++" of GPU programming
Dominant, powerful, but NVIDIA-only
🏗️ CUDA Ecosystem
Frameworks (PyTorch, TensorFlow) → CUDA Libraries (cuDNN, cuBLAS, TensorRT) → CUDA Runtime/Driver (Memory management, kernel launch) → GPU Hardware
📚 Key CUDA Libraries for AI
cuDNN (CUDA Deep Neural Network library)
GPU-accelerated primitives for deep learning (convolution, pooling, normalization)
Used by: All major frameworks
cuBLAS (CUDA Basic Linear Algebra Subprograms)
GPU-accelerated matrix operations (GEMM, GEMV)
Critical for: Transformer models
TensorRT
High-performance inference optimizer (covered earlier)
Best for: Production inference
cuSPARSE
Sparse matrix operations
Useful for: Pruned models
NCCL (NVIDIA Collective Communications Library)
Multi-GPU communication
For: Multi-GPU training/inference
🔄 CUDA Alternatives
ROCm (AMD)
AMD GPUs. Radeon Open Compute platform - AMD's answer to CUDA
Advantages:
- Open source
- AMD MI series (CDNA)
- HIP (CUDA compatibility layer)
Challenges:
- Smaller ecosystem
- Fewer optimized libraries
- Limited framework support
oneAPI / SYCL (Intel)
Intel GPUs. Unified programming model for CPUs, GPUs, FPGAs
Advantages:
- Cross-architecture
- Standards-based (SYCL)
- Intel Xe GPUs (Arc, Flex, Max)
Status:
- Growing adoption
- Good for Intel stack
- Still maturing
OpenCL
Vendor neutral. Open standard for parallel programming across platforms
Advantages:
- True cross-vendor
- CPU, GPU, FPGA support
- Open standard
Reality:
- Slower than CUDA on NVIDIA
- Less AI library support
- Declining adoption
Metal (Apple)
Apple Silicon. Apple's GPU programming framework for M1/M2/M3 chips
Advantages:
- Excellent on Apple Silicon
- Unified memory architecture
- Growing ML support (MLX)
Limitation:
- Apple devices only
- Not for datacenter
📊 Market Reality Check
| Platform | AI Market Share | Ecosystem Maturity |
|---|---|---|
| CUDA (NVIDIA) | ~95% | ★★★★★ |
| ROCm (AMD) | ~3% | ★★★ |
| oneAPI (Intel) | ~1% | ★★ |
| Metal (Apple) | ~1% | ★★★ |
| OpenCL | <1% | ★★ |
💡 Practical Advice
- For production AI: CUDA (NVIDIA GPUs) is still the safest bet
- For cost optimization: Consider AMD MI series with ROCm
- For Apple Silicon: Use Metal/MLX for local development
- For maximum portability: Use high-level frameworks (PyTorch, ONNX)
18. Specialized AI Hardware
Plain Explanation
Beyond CPUs and GPUs, there are specialized accelerators designed specifically for AI workloads. These chips sacrifice flexibility for extreme performance and efficiency.
🚀 Major AI Accelerators
TPU (Tensor Processing Unit) - Google
Custom ASIC for TensorFlow, excellent for training and inference
Apple Neural Engine
Built into M-series chips, optimized for CoreML
AWS Inferentia / Trainium
Amazon's custom chips for cloud inference
Intel Gaudi
Deep learning accelerator for training and inference
✅ Part IV Complete!
You now understand hardware from kernels to silicon, ISAs, CUDA, and specialized accelerators. Ready for memory optimization!
19. RAM vs VRAM
Plain Explanation
RAM (system memory) and VRAM (video memory) serve the same purpose (storing data), but they're optimized for different processors. Understanding the difference is critical for inference deployment decisions.
💡 Mental Model
RAM = CPU's storage
VRAM = GPU's storage
Data must live where it's processed
📊 Key Differences
| Characteristic | RAM | VRAM |
|---|---|---|
| Type | DDR4/DDR5 | GDDR6/HBM2 |
| Bandwidth | 50-100 GB/s | 500-2,000 GB/s |
| Capacity | 64-512 GB | 8-80 GB |
| Cost per GB | $1-3 | $20-100 |
| Latency | ~60ns | ~200ns |
đ Memory Transfer Bottleneck
Moving data between RAM and VRAM is expensive:
Transfer time for 7B model (14GB): ~1 second
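That figure is just size divided by bus bandwidth. A back-of-the-envelope sketch (the ~25 GB/s value is an assumed effective PCIe 4.0 x16 rate, not a measured one):

```python
def transfer_time_seconds(model_gb: float, bandwidth_gbps: float = 25.0) -> float:
    """Estimate host-to-device copy time for a model of model_gb gigabytes.

    bandwidth_gbps: assumed effective PCIe bandwidth in GB/s.
    PCIe 4.0 x16 peaks near ~32 GB/s; ~25 GB/s sustained is realistic.
    """
    return model_gb / bandwidth_gbps

# A 7B-parameter model in FP16 is ~14 GB of weights:
print(f"{transfer_time_seconds(14):.2f} s")  # 0.56 s
```

This is why inference servers load weights once at startup and keep them resident in VRAM, rather than paging them per request.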
RAM: CPU Inference
Advantages:
- Much larger capacity (512 GB is possible)
- Lower cost per GB
- Easier to upgrade
Limitations:
- Lower bandwidth
- CPU is slower for parallel ops
VRAM: GPU Inference
Advantages:
- Massive bandwidth (roughly 20× that of system RAM)
- GPU optimized for parallel ops
- Lower latency for inference
Limitations:
- Limited capacity (24-80 GB typical)
- Very expensive
- Cannot upgrade
Memory Requirements by Model
Whisper Large-v3 (1.5B params)
- FP32: 6 GB
- FP16: 3 GB
- INT8: 1.5 GB
→ Fits in most GPUs (even a T4 with 16 GB)
LLaMA-7B
- FP32: 28 GB
- FP16: 14 GB
- INT8: 7 GB
- INT4: 3.5 GB
→ INT8 fits in a 16 GB GPU, INT4 fits in 8 GB
LLaMA-70B
- FP16: 140 GB
- INT8: 70 GB
- INT4: 35 GB
→ Requires multiple GPUs or aggressive quantization
Practical Decision Guide
- Model fits in VRAM: use the GPU (much faster)
- Model too large for VRAM: quantize to INT8/INT4 or use the CPU
- Budget constrained: use CPU with quantization
- Very large models (70B+): multi-GPU, or llama.cpp on CPU with INT4
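All the per-precision numbers above come from one rule of thumb: parameter count × bytes per parameter. A small helper to reproduce them (weights only; activations and KV-cache come on top):

```python
def model_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weights-only memory footprint in GB (decimal)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

PRECISION_BYTES = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

for name, b in PRECISION_BYTES.items():
    # LLaMA-7B: 28 / 14 / 7 / 3.5 GB, matching the table above
    print(f"LLaMA-7B @ {name}: {model_memory_gb(7, b):.1f} GB")
```

Compare the result against your GPU's VRAM (minus a safety margin for activations and the KV-cache) before picking a precision.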
20. Quantization Techniques
Plain Explanation
Quantization means reducing the precision of model weights and activations. Instead of 32-bit floats, use 16-bit floats or 8-bit or even 4-bit integers. This makes models smaller and faster with minimal accuracy loss.
Mental Model
Quantization = compression with controlled quality loss
Like JPEG for images, but for AI models
Precision Levels
FP32 (Float32)
4 bytes. Full precision, baseline accuracy
Range: ±3.4 × 10³⁸
FP16 (Float16)
2 bytes. Half precision, minimal loss
2× speedup, 2× memory savings, <0.1% accuracy loss
INT8 (8-bit Integer)
1 byte. Most popular for inference
4× speedup, 4× memory savings, <1% accuracy loss
INT4 (4-bit Integer)
0.5 bytes. Aggressive compression
8× memory savings, 1-3% accuracy loss, great for LLMs
Quantization Approaches
Post-Training Quantization (PTQ)
Quantize after training is complete
Advantages:
- No retraining needed
- Fast (minutes)
- Easy to use
Limitations:
- Can lose 1-3% accuracy
- Needs calibration data
Quantization-Aware Training (QAT)
Train with quantization in mind
Advantages:
- Better accuracy
- Handles INT4 better
- Model adapts to low precision
Limitations:
- Requires retraining
- Time-consuming (days/weeks)
Impact on Model Size
LLaMA-7B model size by precision:
| Precision | Size |
|---|---|
| FP32 | 28 GB |
| FP16 | 14 GB |
| INT8 | 7 GB |
| INT4 | 3.5 GB |
ASR Quantization
Whisper models handle quantization well:
- FP16: recommended for GPU, no measurable loss
- INT8: great fit for CPU (OpenVINO), <0.5% WER increase
- INT4: use with caution, test accuracy first
LLM Quantization
LLMs are quantization-friendly:
- INT8: minimal perplexity increase (<1%)
- INT4: popular for llama.cpp, 1-3% loss
- GPTQ/AWQ: advanced INT4 methods
Quick Recommendations
- GPU inference: use FP16 (native support, negligible accuracy loss)
- CPU inference: use INT8 (up to 4× faster, <1% loss)
- Memory constrained: use INT4 (llama.cpp, GPTQ)
- Always calibrate: test on real data before deploying
21. Calibration for Quantization
Plain Explanation
Calibration is the process of finding the right scale factors when converting from high precision (FP32) to low precision (INT8). Without calibration, quantization causes significant accuracy loss.
Mental Model
Calibration = measuring before compressing
Like setting the right exposure for a photo
Why Calibration Matters
When quantizing, we need to map floating-point ranges to integer ranges:
FP32 range: -5.2 to +8.7
↓
INT8 range: -128 to +127
Calibration finds the optimal mapping to minimize accuracy loss.
Calibration Methods
1. Min-Max Calibration
Uses the observed min/max values from the calibration data:
scale = (max - min) / 255
Pros:
- Simple
- Fast
Cons:
- Sensitive to outliers
- Less accurate
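To make the scale formula concrete, here is a minimal min-max quantization round trip in NumPy. It maps values onto a 256-level (uint8) grid with a zero point; this is one common layout, and the exact integer grid varies by framework:

```python
import numpy as np

def quantize_minmax(x: np.ndarray):
    """Asymmetric min-max quantization onto a 256-level uint8 grid."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0
    zero_point = round(-lo / scale)  # integer that represents 0.0 (maps lo near 0)
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale

x = np.linspace(-5.2, 8.7, 1000).astype(np.float32)  # the FP32 range from above
q, scale, zp = quantize_minmax(x)
err = float(np.abs(dequantize(q, scale, zp) - x).max())
print(f"scale={scale:.4f}, max round-trip error={err:.4f}")
```

The maximum round-trip error stays on the order of one scale step, which is exactly why a tighter (calibrated) range means less quantization noise.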
2. Entropy Calibration (KL Divergence)
Minimizes information loss between the FP32 and INT8 distributions
Finds the clipping threshold that minimizes KL(P||Q)
Pros:
- More accurate
- Robust to outliers
Cons:
- Slower
- More complex
3. Percentile Calibration
Uses percentiles (e.g., 99.9%) to clip outliers
Ignores extreme values that hurt quantization
Pros:
- Good balance
- Handles outliers
Cons:
- Requires tuning
Calibration Dataset
You need representative data to calibrate:
Size:
100-1,000 samples is typical (more is not always better)
Diversity:
Cover all types of inputs (different accents, languages, topics)
Source:
Use a validation set or production samples
Calibration Example (TensorRT)
import tensorrt as trt

# 1. Create calibrator
class Calibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, data_loader):
        super().__init__()
        self.data_loader = data_loader
        self.batch_size = 32

    def get_batch(self, names):
        # Return the next batch of calibration data
        return next(self.data_loader)

    def get_batch_size(self):
        return self.batch_size

# 2. Load calibration data
calibration_data = load_samples(count=500)

# 3. Build engine config with calibration
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = Calibrator(calibration_data)

# 4. Build the quantized engine
engine = builder.build_serialized_network(network, config)

TensorRT INT8 calibration using the entropy method
Calibration Best Practices
Do:
- Use 100-500 diverse samples
- Match the production data distribution
- Test multiple calibration methods
- Validate accuracy after calibration
- Cache calibration results
Don't:
- Use training data for calibration
- Calibrate with <50 samples
- Skip accuracy validation
- Use non-representative data
- Forget to version calibration data
Key Takeaways
- Calibration is essential for INT8 quantization quality
- Use entropy (KL divergence) calibration for best accuracy
- 500 diverse samples is a good target
- Always validate accuracy on real data after calibration
- Cache calibration results to avoid recomputation
22. KV-Cache Optimization
Plain Explanation
The KV-cache (Key-Value cache) stores intermediate attention results in transformer models. It avoids recomputing attention for past tokens, making generation much faster but consuming significant memory.
Mental Model
KV-cache = memory for what the model has "seen"
A speed vs memory tradeoff
How KV-Cache Works
Without a cache, generating each token requires recomputing attention for all previous tokens:
Compute attention (1 token)
Recompute attention (2 tokens)
Recompute attention (3 tokens)
Total: O(n²) work per generated sequence - very slow!
With a KV-cache, we store and reuse the past keys and values:
Compute & cache
Use cache + compute only the new token
Use cache + compute only the new token
Total: O(n) work - much faster!
KV-Cache Memory Requirements
Memory formula per token:
KV_memory = 2 × num_layers × hidden_size × precision_bytes
(2 = key + value; multiply by sequence length and batch size for the total)
LLaMA-7B Example
- 2 (K+V) × 32 layers × 4096 hidden × 2 bytes (FP16)
- = 0.5 MB per token
- For a 2048-token context: 1 GB just for the KV-cache!
GPT-3 (175B) Example
- 2 (K+V) × 96 layers × 12288 hidden × 2 bytes (FP16)
- = 4.5 MB per token
- For a 2048-token context: ~9 GB for the KV-cache!
KV-Cache Optimizations
1. PagedAttention (vLLM)
Manages the KV-cache like virtual memory pages
- Reduces memory waste by ~40%
- Enables dynamic memory allocation
- Better batching efficiency
2. KV-Cache Quantization
Quantize the KV-cache to INT8 or INT4
- 2-4× memory savings
- Minimal accuracy loss (<1%)
- Allows longer contexts
3. Multi-Query Attention (MQA)
Share one key/value head across all attention heads
- 8× less KV-cache memory
- Faster inference
- Requires a model architecture change
4. Grouped-Query Attention (GQA)
Middle ground: groups of heads share K/V
- 2-4× less memory than MHA
- Better accuracy than MQA
- Used in LLaMA-2 (70B) and Mistral
Memory Calculation Tool
def calculate_kv_cache_memory(
    num_layers: int,
    hidden_size: int,
    sequence_length: int,
    batch_size: int = 1,
    precision_bytes: int = 2,  # FP16
):
    """Calculate KV-cache memory in GB."""
    # Key + Value = 2, per layer, per token
    bytes_per_token = 2 * num_layers * hidden_size * precision_bytes
    # Total for the full sequence and batch
    total_bytes = bytes_per_token * sequence_length * batch_size
    return total_bytes / (1024**3)  # convert to GB

# LLaMA-7B example
memory_gb = calculate_kv_cache_memory(
    num_layers=32,
    hidden_size=4096,
    sequence_length=2048,
    batch_size=1,
    precision_bytes=2,
)
print(f"KV-cache memory: {memory_gb:.2f} GB")
# Output: KV-cache memory: 1.00 GB

Practical Recommendations
- For single-user chatbots: the standard KV-cache is fine
- For high-throughput serving: use vLLM with PagedAttention
- For very long contexts (8K+): enable KV-cache quantization
- For new models: consider a GQA architecture for better efficiency
- Always monitor VRAM usage - the KV-cache can exceed the weights' memory!
23. Batching Strategies
Plain Explanation
Batching means processing multiple inputs together instead of one at a time. It dramatically increases throughput but adds latency. The key is choosing the right batching strategy for your use case.
Mental Model
Batching = loading a bus vs sending taxis
Higher throughput, but people wait for the bus to fill
The Latency-Throughput Tradeoff
Batching typically yields large throughput gains on GPU inference, at the cost of latency:
| Batch Size | Latency | Throughput | Use Case |
|---|---|---|---|
| 1 | Lowest (10 ms) | Low | Real-time apps |
| 8-16 | Medium (50 ms) | Good | Interactive services |
| 32-64 | High (200 ms) | Excellent | Batch processing |
| 128+ | Very high (500 ms+) | Maximum | Offline workloads |
Batching Strategies
1. Static Batching
Wait for N requests, then process them together
Pros:
- Simple to implement
- Predictable throughput
- Easy to reason about
Cons:
- High latency (wait time)
- Wasted capacity at low load
- A fixed batch size is inefficient
Example:
Wait for 32 requests → process batch → wait again
2. Dynamic Batching
Wait for a timeout OR the max batch size, whichever comes first
Pros:
- Better latency vs throughput balance
- Adapts to load
- More efficient than static
Cons:
- Still wastes some capacity
- Padding overhead
- Timeout needs tuning
Example:
Batch size 32 OR 50 ms timeout → process batch
3. Continuous Batching (vLLM)
Add/remove requests from the batch as they arrive/complete
Pros:
- Best throughput
- Best latency
- No wasted capacity
- Adapts to variable lengths
Cons:
- Complex implementation
- Requires a scheduler
- Framework-specific (vLLM, TGI)
How it works:
Iterate the batch continuously, adding new requests and removing finished ones between steps
Padding Overhead
When batching variable-length inputs, every sequence must be padded to the longest in the batch, so a batch with one long and several short sequences can easily waste ~40% of its compute on padding.
Solution: use continuous batching, or group similar-length sequences together.
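The padding waste is easy to quantify: it is the gap between the padded batch size and the useful tokens. A small sketch:

```python
def padding_waste(lengths: list[int]) -> float:
    """Fraction of batch compute spent on padding when padding to the longest sequence."""
    longest = max(lengths)
    total = longest * len(lengths)   # tokens actually computed (incl. padding)
    useful = sum(lengths)            # tokens that carry real input
    return (total - useful) / total

# One long sequence dominates the batch:
print(f"{padding_waste([100, 500, 200]):.0%}")  # 47%
# Similar lengths waste almost nothing:
print(f"{padding_waste([480, 500, 490]):.0%}")  # 2%
```

This is why sorting requests by length before forming batches (or using continuous batching) recovers so much throughput.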
ASR Batching
Real-time:
Batch size = 1 (low latency required)
Batch processing:
Batch size = 16-32 (group similar-length audio)
Tip:
Sort by audio length before batching to minimize padding
LLM Batching
Interactive chatbots:
Dynamic batching (16-32) or continuous batching
API serving:
Continuous batching (vLLM) for best efficiency
Recommendation:
Use vLLM for production LLM serving
Dynamic Batching Example
import asyncio
from collections import deque

class DynamicBatcher:
    def __init__(self, model, max_batch_size=32, timeout_ms=50):
        self.model = model
        self.max_batch_size = max_batch_size
        self.timeout_ms = timeout_ms
        self.queue = deque()

    async def add_request(self, request):
        """Add a request to the batch queue."""
        self.queue.append(request)
        # Process immediately when the batch is full
        if len(self.queue) >= self.max_batch_size:
            return await self.process_batch()
        # Otherwise wait for the timeout, then flush whatever accumulated
        await asyncio.sleep(self.timeout_ms / 1000)
        if self.queue:
            return await self.process_batch()

    async def process_batch(self):
        """Process the accumulated batch."""
        batch = [self.queue.popleft()
                 for _ in range(min(len(self.queue),
                                    self.max_batch_size))]
        # Run inference on the whole batch at once
        return self.model.inference(batch)

(A simplified sketch: a production batcher coordinates a single flush timer across concurrent callers instead of sleeping per request.)
Best Practices
- Real-time apps: use batch size 1 or small dynamic batches
- High-throughput serving: use continuous batching (vLLM)
- Batch processing: use large static batches (32-64)
- Monitor P95/P99 latency - batching impacts tail latency!
- Group similar lengths together to minimize padding waste
24. ASR Deployment Guide
Plain Explanation
This is a practical, copy-paste guide for deploying Automatic Speech Recognition models in production. We'll cover Whisper deployment on both CPU and GPU.
Deployment Decision Tree
Real-time streaming (<200 ms latency)?
→ GPU (TensorRT) or CPU (OpenVINO INT8)
Batch processing (latency flexible)?
→ CPU (OpenVINO) for cost savings
High throughput (>100 req/sec)?
→ GPU (TensorRT) with batching
Option 1: CPU Deployment (OpenVINO)
# Step 1: Install dependencies
pip install openvino openvino-dev
pip install transformers torch torchaudio

# Step 2: Convert Whisper to OpenVINO
from transformers import WhisperForConditionalGeneration
from openvino.tools import mo

model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-base"
)

# Convert to OpenVINO IR
mo.convert_model(
    model,
    output_dir="whisper_openvino",
    compress_to_fp16=True
)

# Step 3: Run inference
from openvino.runtime import Core
import numpy as np

core = Core()
model = core.read_model("whisper_openvino/model.xml")
compiled = core.compile_model(model, "CPU")

# Inference
output = compiled([audio_features])[0]
Option 2: GPU Deployment (TensorRT)
# Step 1: Export to ONNX
import torch
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-base"
)
model.eval().cuda()  # model and dummy input must live on the same device

dummy_input = torch.randn(1, 80, 3000).cuda()
torch.onnx.export(
    model,
    dummy_input,
    "whisper.onnx",
    opset_version=17
)

# Step 2: Build TensorRT engine
trtexec --onnx=whisper.onnx \
        --saveEngine=whisper.trt \
        --fp16 \
        --workspace=4096
Production Configuration
| Parameter | Real-time | Batch |
|---|---|---|
| Batch Size | 1-4 | 16-32 |
| Beam Size | 1-3 | 5 |
| Precision | INT8/FP16 | INT8 |
| Chunk Size | 5-10 s | 20-30 s |
Quick Start: Fastest Path to Production
# Use faster-whisper (CTranslate2 backend)
pip install faster-whisper

from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.mp3")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
25. LLM Deployment Guide
Plain Explanation
Large Language Models require specialized serving infrastructure. This guide covers the most popular deployment options, from local CPU to production GPU serving.
LLM Deployment Options
llama.cpp / Ollama
CPU/Local. Best for: development, edge deployment, no GPU
vLLM
GPU/Production. Best for: high-throughput GPU serving, PagedAttention
Text Generation Inference (TGI)
GPU/HuggingFace. Best for: the HuggingFace ecosystem, production serving
TensorRT-LLM
GPU/Ultra-fast. Best for: lowest latency on NVIDIA GPUs
Option 1: llama.cpp (CPU)
# Build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# Download a quantized model (GGUF format)
wget https://huggingface.co/.../model-q4_k_m.gguf

# Run inference
./main -m model-q4_k_m.gguf \
    -p "Write a Python function to" \
    -n 128 \
    -t 8
Or use Ollama (easier):
curl https://ollama.ai/install.sh | sh
ollama run llama2
ollama run mistral
Option 2: vLLM (GPU Production)
# Install vLLM
pip install vllm

# Python API
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
sampling_params = SamplingParams(temperature=0.7, top_p=0.9)

prompts = ["Hello, my name is", "The future of AI is"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

# Or run an OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --tensor-parallel-size 1

# Then use it with any OpenAI client
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-2-7b-hf",
         "prompt": "San Francisco is",
         "max_tokens": 50}'
Performance Comparison
| Method | Hardware | Throughput | Latency |
|---|---|---|---|
| llama.cpp (INT4) | CPU (16 cores) | ~10 tok/sec | Medium |
| vLLM (FP16) | A100-40GB | ~100 tok/sec | Low |
| TensorRT-LLM | A100-40GB | ~150 tok/sec | Very low |
Production Configuration Best Practices
Memory Management
- Use INT8/INT4 quantization to fit larger models
- Enable KV-cache quantization (vLLM supports this)
- Monitor GPU memory usage continuously
Batching
- Start with batch size 16-32
- Use continuous batching (vLLM does this automatically)
- Monitor P95/P99 latency, not just the average
Sampling Parameters
- temperature: 0.7-0.9 (lower = more deterministic)
- top_p: 0.9 (nucleus sampling)
- max_tokens: set based on the use case (limits costs)
Cost Optimization Tips
- CPU (llama.cpp): $0.05-0.20/hr, good for <100 req/day
- GPU (T4): ~$0.50/hr, good for moderate traffic
- GPU (A100): $3-5/hr, for high throughput
- Consider spot instances for 60-80% cost savings
26. Benchmarking and Metrics
Plain Explanation
Benchmarking inference systems requires understanding multiple metrics beyond just latency. This guide teaches you how to measure and interpret performance correctly.
Key Metrics Explained
Latency
Time to process a single request
Metrics to track:
- P50 (median): typical user experience
- P95: 95% of requests are faster than this
- P99: tail latency (important!)
- Max: worst-case scenario
Throughput
Requests processed per second
Measured as:
- QPS (Queries Per Second)
- Tokens/second (for LLMs)
- Audio hours processed per hour (for ASR)
Accuracy
Quality of model predictions
ASR metrics:
- WER (Word Error Rate): lower is better
- CER (Character Error Rate)
LLM metrics:
- Perplexity, BLEU, ROUGE
- Human evaluation
Resource Utilization
Hardware efficiency
- CPU/GPU utilization (%)
- Memory usage (RAM/VRAM)
- Power consumption (Watts)
Common Benchmarking Mistakes
Only measuring P50 latency
Problem: ignores tail latency; some users get a 10× worse experience.
Fix: always report P95 and P99.
Not warming up the model
Problem: the first inference is slow (loading weights, JIT compilation).
Fix: run 10-100 warmup iterations before measuring.
Testing with unrealistic data
Problem: production has noise, accents, and variable lengths.
Fix: use a production-representative test set.
Single-threaded benchmarks
Problem: doesn't test concurrent load.
Fix: load test with multiple concurrent requests.
Benchmarking Code Example
import time
import numpy as np

def benchmark_inference(model, test_data, warmup=10):
    # Warmup
    for i in range(warmup):
        model(test_data[0])
    # Measure
    latencies = []
    for data in test_data:
        start = time.perf_counter()
        output = model(data)
        latency = (time.perf_counter() - start) * 1000  # ms
        latencies.append(latency)
    # Report
    print(f"P50: {np.percentile(latencies, 50):.2f}ms")
    print(f"P95: {np.percentile(latencies, 95):.2f}ms")
    print(f"P99: {np.percentile(latencies, 99):.2f}ms")
    print(f"Max: {np.max(latencies):.2f}ms")
    print(f"Throughput: {len(test_data) / np.sum(latencies) * 1000:.2f} req/sec")
16. Instruction Set Architectures (ISA)
Plain Explanation
An Instruction Set Architecture (ISA) is the language that your CPU speaks. It defines what operations the processor can perform and how software communicates with hardware.
Mental Model
ISA = the CPU's native language
Like English vs Spanish vs Mandarin for humans
Major CPU ISAs
x86-64 (AMD64)
Dominant
Used by: Intel (Core, Xeon), AMD (Ryzen, EPYC)
Market: servers, desktops, laptops
SIMD extensions:
- SSE (Streaming SIMD Extensions)
- AVX (Advanced Vector Extensions)
- AVX-512 (512-bit vectors for AI/HPC)
- AMX (Advanced Matrix Extensions - new!)
AI performance: excellent with AVX-512/AMX
ARM64 (AArch64)
Growing
Used by: Apple (M1/M2/M3), AWS Graviton, NVIDIA Grace
Market: mobile, edge, emerging servers
SIMD extensions:
- NEON (Advanced SIMD)
- SVE (Scalable Vector Extension)
- SVE2 (enhanced for AI/ML)
AI performance: good, improving rapidly
Advantage: power efficiency
RISC-V
Emerging
Used by: SiFive, StarFive, various startups
Market: edge devices, research, future servers
Key feature: open-source ISA (no licensing fees!)
Extensions:
- V extension (vector operations)
- Zve (embedded vector)
Status: early but promising for AI
GPU ISAs
NVIDIA: PTX → SASS
PTX (Parallel Thread Execution): virtual ISA (like Java bytecode)
SASS: native GPU machine code (hardware-specific)
CUDA → PTX → SASS → hardware
AMD: GCN / RDNA / CDNA
GCN: Graphics Core Next (older)
RDNA: gaming GPUs
CDNA: compute/AI GPUs (MI series)
Intel: Gen Graphics
Used in Intel Xe GPUs (Arc, Flex, Max series)
SIMD: Single Instruction, Multiple Data
SIMD extensions let CPUs perform the same operation on multiple data points simultaneously - crucial for AI inference.
Evolution of Intel SIMD:
| Extension | Width | Introduced |
|---|---|---|
| SSE | 128-bit (4 floats) | ~1999 |
| AVX | 256-bit (8 floats) | ~2011 |
| AVX2 | 256-bit + FMA | ~2013 |
| AVX-512 | 512-bit (16 floats) | ~2016 |
| AMX | Matrix tiles (INT8/BF16) | ~2021 |
Impact: AVX-512 and AMX make modern Intel CPUs far more competitive for INT8 inference.
Why ISA Matters for Inference
For deployment:
- Kernels are ISA-specific
- x86 binaries won't run on ARM
- You need the right compiler/runtime
For performance:
- AVX-512 can be 2-3× faster than AVX2
- AMX adds dedicated matrix ops
- NEON optimizes ARM inference
Practical Implications
- Docker images must match the architecture (amd64 vs arm64)
- OpenVINO automatically detects and uses the best available ISA features
- ARM Macs (M1/M2) need ARM-specific builds
- Check CPU flags with: lscpu | grep Flags
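Beyond `lscpu`, you can check for these features programmatically. A small sketch that parses `/proc/cpuinfo`-style text (Linux on x86; the flag list is a hand-picked subset, not exhaustive):

```python
def simd_features(cpuinfo_text: str) -> list[str]:
    """Return the AI-relevant SIMD flags found in /proc/cpuinfo-style text."""
    interesting = {"sse4_2", "avx", "avx2", "avx512f", "amx_tile"}
    for line in cpuinfo_text.splitlines():
        if line.lower().startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            return sorted(interesting & flags)
    return []

sample = "flags\t\t: fpu sse4_2 avx avx2 avx512f"
print(simd_features(sample))  # ['avx', 'avx2', 'avx512f', 'sse4_2']
```

On a real host you would feed it `open("/proc/cpuinfo").read()`; runtimes like OpenVINO do an equivalent check at startup to pick their fastest kernels.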
27. Production Monitoring
Plain Explanation
Production monitoring means tracking key metrics to ensure your AI inference system is healthy, fast, and accurate. Without monitoring, you fly blind.
Mental Model
Monitoring = a health dashboard for your inference system
See problems before users complain
Critical Metrics to Track
1. Latency Metrics
P50 (median):
Typical user experience. Target: <100 ms for interactive use
P95:
95% of requests are faster. A good SLA metric. Target: <200 ms
P99:
Tail latency - critical! Target: <500 ms
Max:
Worst case. Should not exceed 2× P99
2. Throughput Metrics
Requests per second (RPS/QPS):
Total load on the system
Tokens per second (LLMs):
Generation speed
Batch utilization:
% of the max batch size actually used
3. Resource Metrics
GPU utilization:
Should be 70-90% for good efficiency
GPU memory usage:
Watch for OOM! Alert at 90%
CPU utilization:
For CPU inference or preprocessing
GPU temperature:
Alert if >85°C (thermal throttling risk)
4. Quality Metrics
WER (ASR):
Word Error Rate on production data
Perplexity (LLMs):
Model confidence metric
Error rate:
% of requests that fail or time out
5. Cost Metrics
Cost per request:
Total infrastructure cost / requests
Cost per token (LLMs):
Important for usage-based pricing
Alerting Strategy
Critical Alerts (Page On-Call)
- Error rate > 5%
- P99 latency > 2× baseline
- GPU memory > 95%
- Service down / no responses
Warning Alerts (Review Next Day)
- P95 latency > 1.5× baseline
- GPU memory > 85%
- GPU temperature > 80°C
- Throughput dropped > 30%
Info Alerts (Monitor Trends)
- WER increased > 10%
- Cost per request trending up
- Traffic patterns changing
Monitoring Stack
Metrics collection:
- Prometheus: time-series metrics
- StatsD: application metrics
- CloudWatch: AWS metrics
Visualization:
- Grafana: dashboards
- Datadog: all-in-one (paid)
- Kibana: logs + metrics
GPU monitoring:
- nvidia-smi: basic GPU stats
- DCGM: NVIDIA Data Center GPU Manager
- nvtop: real-time GPU monitor
Alerting:
- PagerDuty: on-call management
- Alertmanager: Prometheus alerts
- Opsgenie: incident response
Monitoring Example (Prometheus + Python)
from prometheus_client import Counter, Histogram, Gauge
import time

# Define metrics
request_count = Counter('inference_requests_total',
                        'Total inference requests')
latency = Histogram('inference_latency_seconds',
                    'Inference latency')
gpu_memory = Gauge('gpu_memory_used_bytes',
                   'GPU memory usage')

def inference_with_monitoring(model, input_data):
    # Track the request
    request_count.inc()
    # Measure latency around the inference call
    start = time.time()
    result = model(input_data)
    latency.observe(time.time() - start)
    # Track GPU memory
    gpu_memory.set(get_gpu_memory_usage())
    return result
Sample Grafana Dashboard
Latency panel: P50/P95/P99 over time
Throughput panel: requests/sec graph
GPU usage panel: utilization % + memory
Error rate panel: failed requests %
Monitoring Best Practices
- Always track P95 and P99 - the median alone hides problems
- Set up alerts BEFORE going to production
- Monitor GPU memory proactively - OOM crashes are sudden
- Track quality metrics (WER, perplexity) to catch model degradation
- Review dashboards weekly to spot trends
- Keep historical data (90+ days) for capacity planning
28. Hyperparameter Tuning
Plain Explanation
Hyperparameters control how inference behaves without changing the model weights. Tuning them properly is critical for balancing quality, speed, and cost.
Mental Model
Hyperparameters = knobs to trade performance against quality
ASR Hyperparameters
beam_size (1-10)
Number of candidate transcriptions explored
Lower (1-3): faster, less accurate
Higher (5-10): slower, more accurate
Recommended: 5 for production, 1-3 for real-time
temperature (0.0-1.0)
Controls randomness in output selection
0.0: deterministic (always the same output)
0.8-1.0: more creative/random
Recommended: 0.0-0.3 for ASR (you want consistency)
no_speech_threshold (0.0-1.0)
Probability threshold for detecting silence
Recommended: 0.6 (prevents hallucinations on silence)
compression_ratio_threshold (1.0-3.0)
Detects repetitive/gibberish output
Recommended: 2.4 (reject output whose compression ratio is too high)
LLM Hyperparameters
temperature (0.0-2.0)
Controls creativity vs consistency
0.0-0.3: factual tasks
0.7-0.9: conversational
1.0-2.0: creative writing
top_p (0.0-1.0)
Nucleus sampling: consider the top tokens until their cumulative probability reaches p
Recommended: 0.9-0.95 for most tasks
max_tokens (1-4096+)
Maximum output length
Critical: controls costs! Set it based on the use case
presence_penalty (-2.0 to 2.0)
Penalizes tokens that have already appeared
Use 0.5-1.0 to reduce repetition
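Temperature works by dividing the logits before the softmax, so low values sharpen the distribution toward the top token and high values flatten it. A toy illustration (the logits are made up):

```python
import math

def token_probs(logits: list[float], temperature: float) -> list[float]:
    """Softmax over temperature-scaled logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.2, 0.7, 1.5):
    print(t, [round(p, 3) for p in token_probs(logits, t)])
```

At temperature 0.2 nearly all probability mass sits on the top token (effectively deterministic); at 1.5 the three tokens become much closer, which is where repetitive or erratic output trade-offs come from.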
Common Trade-offs
| Parameter (increased ↑) | Effect |
|---|---|
| beam_size | Better quality, slower speed |
| temperature | More creative, less predictable |
| max_tokens | Longer output, higher cost |
| batch_size | Higher throughput, more latency |
Quick Start Configs
Real-time ASR:
beam_size=1, temperature=0.0, batch_size=1
Batch ASR:
beam_size=5, temperature=0.0, batch_size=32
Chatbot LLM:
temperature=0.7, top_p=0.9, max_tokens=512
Code Generation:
temperature=0.2, top_p=0.95, max_tokens=2048
29. Call Center ASR Optimization
Plain Explanation
Call center ASR is one of the most challenging real-world deployments: long audio, background noise, multiple speakers, accents, and regulatory requirements.
Unique Challenges
Audio quality issues
- Phone-line compression (8 kHz)
- Background noise
- Poor microphones
- Echo and feedback
Content challenges
- Multiple speakers
- Overlapping speech
- Accents and dialects
- Domain-specific jargon
Scale requirements
- Long recordings (30-60 min)
- High volume (1000s of calls/day)
- Real-time + batch
- Cost sensitivity
Compliance
- PCI-DSS (credit cards)
- HIPAA (healthcare)
- Data retention policies
- Audit trails
Optimization Strategy
1. Audio Pre-processing
- Resampling: upsample 8 kHz phone audio to 16 kHz
- Noise reduction: apply spectral subtraction or Wiener filtering
- Normalization: standardize volume levels
- VAD (Voice Activity Detection): remove silence and hold music
2. Model Selection
Recommended: Whisper Large-v3
- Excellent on noisy audio
- Multilingual support
- Good with accents
- Can handle phone quality
3. Chunking Strategy
Problem: 60-minute calls exceed the model's context
Solution: sliding window with overlap
- Chunk size: 30 seconds
- Overlap: 3 seconds
- Merge chunks with overlap deduplication
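The sliding-window arithmetic is easy to get wrong at the tail of a call. A sketch with the 30 s / 3 s values above (`chunk_spans` is a hypothetical helper, not a Whisper API):

```python
def chunk_spans(duration_s: float, chunk_s: float = 30.0, overlap_s: float = 3.0):
    """Return (start, end) times for sliding-window chunks with overlap."""
    spans, start = [], 0.0
    step = chunk_s - overlap_s  # each window starts overlap_s before the previous ends
    while start < duration_s:
        spans.append((start, min(start + chunk_s, duration_s)))
        if start + chunk_s >= duration_s:
            break  # this window already covers the end of the audio
        start += step
    return spans

print(chunk_spans(75))  # [(0.0, 30.0), (27.0, 57.0), (54.0, 75.0)]
```

Each consecutive pair of windows shares the 3-second overlap region, which is what the merge step deduplicates when stitching transcripts back together.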
4. Hyperparameter Tuning
# Production config for call centers
config = {
    "beam_size": 5,                      # quality over speed
    "temperature": 0.0,                  # deterministic
    "no_speech_threshold": 0.6,          # detect silence
    "compression_ratio_threshold": 2.4,
    "condition_on_previous_text": True,  # carry context across chunks
    "language": "en",                    # avoid auto-detect errors
    "vad_filter": True,                  # remove silence
    "vad_parameters": {
        "threshold": 0.5,
        "min_speech_duration_ms": 250,
        "min_silence_duration_ms": 2000
    }
}
Cost Optimization
| Approach | Cost/Hour | Throughput | Best For |
|---|---|---|---|
| CPU (OpenVINO INT8) | $0.20-0.50 | 5-10× realtime | Batch processing |
| GPU T4 (TensorRT) | $0.50-1.00 | 20-30× realtime | Mixed workload |
| GPU A100 | $3-5 | 50-100× realtime | Real-time only |
Quality Metrics for Call Centers
Word Error Rate (WER)
Target: <10% for good quality, <15% acceptable
Speaker Diarization Accuracy
Target: >85% correct speaker attribution
Processing Time
Target: <0.1× realtime (a 60-minute call in under 6 minutes)
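WER itself is just word-level edit distance divided by the reference word count. A self-contained implementation for spot-checking transcripts against reference text:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / len(ref)

print(f"{wer('please hold the line', 'please hold on line'):.2f}")  # 0.25
```

Production libraries such as jiwer add normalization (casing, punctuation) on top of this core; normalize consistently or your monthly WER trend will be noise.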
Production Checklist
- Implement VAD to remove silence (saves 30-50% compute)
- Use INT8 quantization for up to 4× cost reduction
- Enable batching for non-realtime workloads
- Monitor WER on production data monthly
- Implement PII redaction (credit cards, SSNs)
- Store only transcripts; delete audio per policy
- Set up alerts for quality degradation
30. Common Failure Modes
Plain Explanation
AI inference systems fail in predictable ways. Knowing these patterns helps you prevent issues before they hit production.
Memory Failures
OOM: Out of Memory
Symptom: process crashes with "CUDA out of memory" or is killed by the OOM killer
Common causes:
- Activations exceed VRAM (not the weights!)
- Batch size too large
- Input sequence too long
- Memory leak in application code
Fix: reduce batch size, quantize the KV-cache, use gradient checkpointing (training)
Memory Fragmentation
Symptom: OOM despite memory appearing available
Fix: restart the service periodically, use memory pools, enable PagedAttention
Performance Degradation
CPU Oversubscription
Symptom: slow inference despite low GPU usage
Cause: too many threads fighting for CPU cores
Fix: set OMP_NUM_THREADS to the physical core count; don't count hyperthreads
NUMA Issues
Symptom: inconsistent CPU performance
Cause: Non-Uniform Memory Access on multi-socket servers
Fix: use numactl to bind the process to a single NUMA node
Thermal Throttling
Symptom: performance degrades over time
Cause: GPU/CPU overheating; clock speed is reduced
Fix: improve cooling, reduce the power limit, monitor temperature
Quality Issues
Hallucinations (ASR)
Symptom: the model generates text on silence or music
Fix: use a VAD filter; increase no_speech_threshold to 0.6
Repetitive Output (LLM)
Symptom: the model repeats the same phrases
Fix: increase presence_penalty; use the repetition_penalty parameter
Quantization Degradation
Symptom: accuracy drops after quantization
Fix: use a proper calibration dataset; try QAT instead of PTQ
System-Level Failures
CUDA Version Mismatch
Symptom: import errors or runtime failures
Fix: match the CUDA version to your PyTorch/TensorRT build
Driver Issues
Symptom: GPU not detected or slow performance
Fix: update NVIDIA drivers; verify with nvidia-smi
Debugging Checklist
- Check GPU memory: nvidia-smi
- Monitor CPU usage: htop
- Check system logs: dmesg | grep -i error
- Verify CUDA: python -c "import torch; print(torch.cuda.is_available())"
- Profile memory: use torch.cuda.memory_summary()
31. Troubleshooting Guide
Plain Explanation
A systematic guide for diagnosing and fixing common inference issues in production.
Issue: Slow Inference
Step 1: Identify the bottleneck
- Check GPU utilization (should be >80%)
- Check CPU utilization
- Profile with the PyTorch Profiler
Step 2: Common fixes
- Increase batch size (if memory allows)
- Use FP16/INT8 precision
- Enable TensorRT/OpenVINO optimizations
- Check that data loading isn't the bottleneck
Issue: Out of Memory
Step 1: Identify what's using memory
import torch
print(torch.cuda.memory_summary())
Step 2: Reduce memory usage
- Reduce batch size (most effective)
- Use quantization (INT8/INT4)
- Enable gradient checkpointing (training)
- Clear the cache: torch.cuda.empty_cache()
Issue: Poor Accuracy
Step 1: Isolate the problem
- Compare with a baseline model
- Test on known-good samples
- Check whether quantization caused it
Step 2: Common fixes
- Recalibrate the quantization
- Increase beam size
- Adjust temperature
- Enable a VAD filter (ASR)
Quick Reference Commands
# Check GPU status
nvidia-smi

# Monitor GPU continuously
watch -n 1 nvidia-smi

# Check CUDA version
nvcc --version

# Test PyTorch CUDA
python -c "import torch; print(torch.cuda.is_available())"

# Profile inference (Python)
from torch.profiler import profile
with profile() as prof:
    model(input)
print(prof.key_averages().table())
Congratulations! Handbook Complete!
You've completed all 31 topics across 7 parts of the AI Inference Engineering Handbook!
What you've learned:
- Model fundamentals
- Hardware architectures
- Inference runtimes
- Optimization techniques
- Production deployment
- Troubleshooting
Next steps:
- Deploy your first model
- Benchmark performance
- Share this with your team
- Contribute improvements
@junAiD