AI Engineer Study Guide
Reference for ML Production Roles
Target Audience: AI/ML Engineers with production LLM experience preparing for founding engineer interviews
Focus: Theory + Whiteboard Design + Paper Discussions
1. Mathematical Foundations
1.1 Linear Algebra (The Core)
Every neural network operation is matrix multiplication. Master matrix shapes, eigenvalues (PCA, gradient behavior), and SVD (LoRA uses this). Interview focus: attention mechanism dimensions and low-rank approximations.
Why it matters: Every neural network operation is matrix multiplication. Understanding shapes, ranks, and transformations is non-negotiable.
Key Concepts
Matrix Multiplication & Dimensionality
- Matrix multiplication: (m × n) @ (n × p) = (m × p) - the inner dimensions must match
- Tricky bit: In attention, Q @ K^T works because (seq_len × d_k) @ (d_k × seq_len) = (seq_len × seq_len)
- Interview trap: "Why do we transpose K in attention?" → Creates compatibility AND semantic meaning (query-key similarity matrix)
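A quick NumPy sanity check of these shapes (the sizes are arbitrary illustrative values, not from any particular model):

```python
import numpy as np

seq_len, d_model, d_k = 10, 64, 16         # illustrative sizes

X = np.random.randn(seq_len, d_model)      # token embeddings
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)

Q = X @ W_Q                                # (seq_len, d_k)
K = X @ W_K                                # (seq_len, d_k)
scores = Q @ K.T                           # (seq_len, seq_len) query-key similarity matrix

print(Q.shape, K.shape, scores.shape)      # (10, 16) (10, 16) (10, 10)
```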
Eigenvalues & Eigenvectors
Av = λv - a direction v that doesn't change under transformation A, only scales by λ
Why it matters:
- PCA finds principal components (eigenvectors of covariance matrix)
- Gradient explosion/vanishing relates to eigenvalues of weight matrices
- Spectral normalization uses largest eigenvalue for stability
Singular Value Decomposition (SVD)
Formula: A = UΣV^T
- U: left singular vectors (output space basis)
- Σ: singular values (scaling factors)
- V^T: right singular vectors (input space basis)
Applications in ML:
- Low-rank matrix factorization (LoRA for LLM fine-tuning)
- Dimensionality reduction
- Matrix completion (recommendation systems)
Key Relationships:
• Matrix Multiplication: (m×n) @ (n×p) → (m×p) - inner dimensions must match
• Eigenvectors: Directions unchanged by transformation A, scaled by eigenvalue λ
• SVD: A = UΣV^T decomposes any matrix into rotation-scale-rotation
• Applications: PCA (eigenvectors), LoRA (low-rank), attention (Q@K^T)
Tricky Interview Question: "How does LoRA use SVD concepts?"
→ LoRA approximates weight updates as low-rank: ΔW = BA where B is (d × r) and A is (r × d) with r << d. This is inspired by SVD's idea that most information lives in top singular values.
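A minimal sketch of that low-rank update (shapes only; the real LoRA implementation also scales the update by α/r and keeps W frozen during training):

```python
import numpy as np

d, r = 4096, 8                      # illustrative: model dim and low rank, r << d

W = np.random.randn(d, d)           # frozen pretrained weight
A = np.random.randn(r, d) * 0.01    # "down" factor (d -> r), small random init
B = np.zeros((d, r))                # "up" factor (r -> d), zero-initialized so ΔW starts at 0

delta_W = B @ A                     # rank-r update: (d × d) shape, but only 2*d*r trainable params
W_effective = W + delta_W           # what the forward pass effectively uses

print(delta_W.shape, 2 * d * r, d * d)   # (4096, 4096) 65536 16777216
```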
Norms & Distance Metrics
L1 vs L2 Norms
- L1 (Manhattan): ||x||₁ = Σ|xᵢ| → Sparse solutions, robust to outliers
- L2 (Euclidean): ||x||₂ = √(Σxᵢ²) → Smooth gradients, penalizes large weights more
Cosine Similarity (Critical for embeddings)
Formula: cos(θ) = (A · B) / (||A|| ||B||)
- Range: [-1, 1]
- Why cosine not Euclidean for embeddings? → Scale-invariant, captures angle not magnitude
- Interview trap: "When would cosine similarity fail?" → When magnitude matters (e.g., word frequency in TF-IDF)
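A minimal sketch contrasting cosine similarity with Euclidean distance on toy vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (a · b) / (||a|| ||b||); scale-invariant, range [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, 2 * a))     # 1.0 — same direction, magnitude ignored
print(np.linalg.norm(a - 2 * a))       # ~3.74 — Euclidean distance is not scale-invariant
```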
1.2 Calculus & Optimization
Chain Rule - The Backpropagation Foundation
Formula: ∂L/∂w = ∂L/∂y × ∂y/∂z × ∂z/∂w
Tricky bit - Matrix Calculus:
When differentiating matrix operations, track dimensions carefully:
- ∂(Wx)/∂W = x^T (outer product creates the right shape)
- ∂(Wx)/∂x = W^T (transpose for correct flow)
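A small NumPy sketch of backprop through a single linear layer, assuming a random upstream gradient, just to confirm the shapes work out:

```python
import numpy as np

# Backprop through y = W @ x for a single sample (values are random; shapes are the point).
d_in, d_out = 3, 4
x = np.random.randn(d_in)
W = np.random.randn(d_out, d_in)
y = W @ x                                  # forward: (d_out,)

dL_dy = np.random.randn(d_out)             # upstream gradient from the loss
dL_dW = np.outer(dL_dy, x)                 # (d_out, d_in) — outer product gives W's shape
dL_dx = W.T @ dL_dy                        # (d_in,)  — transpose routes the gradient to the input

print(dL_dW.shape == W.shape, dL_dx.shape == x.shape)   # True True
```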
Gradient Descent Variants
| Algorithm | Update Rule | Key Property | When to Use |
|---|---|---|---|
| SGD | w -= lr × ∇w | Noisy, explores well | Small datasets, need exploration |
| Momentum | v = βv + ∇w | Accelerates in consistent directions | When gradients have high variance |
| RMSProp | v = βv + (1-β)∇w² | Adapts per-parameter learning rate | Non-stationary objectives |
| Adam | Combines momentum + RMSProp | Fast convergence | Default choice for transformers |
| AdamW | Adam + decoupled weight decay | Better regularization | SOTA for LLM training |
Tricky Interview Question: "Why use AdamW over Adam?"
→ In Adam, L2 regularization is added to the gradient and then passed through the adaptive scaling, so parameters with large gradient magnitudes receive less effective decay. AdamW decouples it: the adaptive step uses only the gradient, and decay is applied directly to the weights, w -= lr × (adam_update(∇w) + λw).
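A minimal sketch of the decoupling for a single parameter tensor (bias correction omitted for brevity; hyperparameter values are illustrative defaults):

```python
import numpy as np

def adamw_step(w, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """One AdamW step (bias correction omitted for brevity)."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2        # second moment (RMSProp-style)
    adam_update = m / (np.sqrt(v) + eps)         # adaptive step computed from the gradient only
    w = w - lr * (adam_update + wd * w)          # weight decay applied outside the adaptive scaling
    return w, m, v
```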
1.3 Probability & Statistics
Central Limit Theorem (CLT) - Foundation of Modern ML
Statement: The distribution of sample means approximates a normal distribution as sample size increases, regardless of the population's distribution shape.
Mathematical Form: X̄ ~ N(μ, σ²/n) as n → ∞
Why It Matters in ML:
- Batch Training Stability: Averaging gradients over a batch of n samples reduces the gradient variance by a factor of n (standard deviation by √n). This is why larger batches lead to more stable updates.
- Why SGD Converges: Noisy gradient estimates from mini-batches approximate true gradients via CLT, enabling convergence guarantees.
- Batch Normalization: Assumes activations follow approximately normal distribution per batch, which CLT supports for large batches.
- Confidence Intervals: Error bars on model performance metrics rely on CLT for validity.
Tricky Interview Question: "Why do larger batches in SGD lead to worse generalization despite more stable gradients?"
→ Large batches converge to sharp minima (high curvature) which generalize poorly. Small batches' noise helps escape sharp minima and find flat minima (low curvature) with better test performance. This is the generalization gap phenomenon.
Key Statistical Concepts
Maximum Likelihood Estimation (MLE)
θ* = argmax_θ Π p(xᵢ|θ) = argmax_θ Σ log p(xᵢ|θ)
- Why log? Converts products to sums (numerical stability + easier gradients)
- Connection to loss: Minimizing cross-entropy = MLE for categorical distribution
Bias-Variance Decomposition
E[(y - ŷ)²] = Bias² + Variance + Irreducible Error
- High bias: Underfitting (too simple model)
- High variance: Overfitting (too complex, memorizes noise)
- Sweet spot: Balance through regularization, model capacity
1.4 Information Theory
Entropy (Measure of uncertainty)
H(X) = -Σ p(x) log p(x)
- High entropy = high uncertainty (uniform distribution)
- Low entropy = low uncertainty (peaked distribution)
- In ML: We want models with low entropy predictions (confident)
Cross-Entropy (Distance between distributions)
H(P, Q) = -Σ p(x) log q(x)
- P: true distribution, Q: predicted distribution
- Cross-entropy loss: Minimizing this = matching distributions
- Binary: -[y log(ŷ) + (1-y) log(1-ŷ)]
- Multi-class: -Σ yᵢ log(ŷᵢ) (categorical cross-entropy)
KL Divergence (How different are two distributions?)
D_KL(P || Q) = Σ p(x) log(p(x)/q(x)) = H(P,Q) - H(P)
- Properties: Always ≥ 0, asymmetric (P||Q ≠ Q||P)
- In VAE: Regularization term KL(q(z|x) || p(z)) keeps latent space structured
- In RLHF: KL penalty keeps model close to reference policy
Tricky Interview Question: "Why is cross-entropy preferred over MSE for classification?"
→ With a sigmoid/softmax output, MSE's gradient with respect to the logits contains a σ'(z) factor that vanishes when the output saturates, so confidently wrong predictions (ŷ close to 0 or 1) learn very slowly. Cross-entropy cancels that factor: its gradient with respect to the logits is simply (ŷ - y), so the more wrong the prediction, the larger the update.
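A small NumPy check of these three quantities on toy distributions, confirming D_KL(P||Q) = H(P,Q) - H(P) and the asymmetry:

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])    # "true" distribution (toy example)
Q = np.array([0.5, 0.3, 0.2])    # "predicted" distribution

H_P  = -np.sum(P * np.log(P))          # entropy of P
H_PQ = -np.sum(P * np.log(Q))          # cross-entropy H(P, Q)
KL   = np.sum(P * np.log(P / Q))       # KL divergence D_KL(P || Q)

print(np.isclose(KL, H_PQ - H_P))                  # True: D_KL(P||Q) = H(P,Q) - H(P)
print(np.isclose(KL, np.sum(Q * np.log(Q / P))))   # False: KL is asymmetric
```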
1.5 Statistical Inference & Hypothesis Testing
Why This Matters: A/B testing, model comparison, and determining if performance improvements are real or noise all rely on hypothesis testing. Production ML decisions need statistical rigor.
Hypothesis Testing Framework
The Setup:
- Null Hypothesis (H₀): The "boring" hypothesis - no effect, no difference
- Alternative Hypothesis (H₁): The claim you're trying to prove
- Significance Level (α): Threshold for rejecting H₀ (typically 0.05 or 0.01)
- p-value: Probability of observing this extreme data if H₀ is true
Decision Rule: If p-value < α, reject H₀ (statistically significant result)
Type I and Type II Errors
Type I Error (False Positive, α):
- Definition: Rejecting H₀ when it's actually true
- In ML: Saying model B is better when it's not
- Real-world cost: Wasted resources deploying inferior model
- Control: Set lower α (0.01 instead of 0.05) for critical decisions
Type II Error (False Negative, β):
- Definition: Failing to reject H₀ when H₁ is true
- In ML: Missing a real improvement
- Real-world cost: Leaving better model undiscovered
- Control: Increase sample size, increase α (trade-off!)
Statistical Power (1 - β):
- Definition: Probability of correctly rejecting H₀ when H₁ is true
- Target: Power ≥ 0.80 (80% chance of detecting true effect)
- Affected by: Sample size, effect size, α, test type
Tricky Interview Question: "Your model shows 2% accuracy improvement. Is it significant?"
→ Depends on: (1) Sample size - is it 100 examples or 10,000? (2) Variance - consistent or noisy? (3) Business context - is 2% valuable? Run a paired t-test on validation predictions, compute confidence interval, consider practical significance vs statistical significance.
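A minimal sketch of that check with a paired t-test (scipy's ttest_rel; the per-example correctness arrays here are simulated, not real model outputs):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1000                                   # hypothetical validation set size
model_a = rng.binomial(1, 0.80, size=n)    # per-example correctness for model A (simulated)
model_b = rng.binomial(1, 0.82, size=n)    # per-example correctness for model B (simulated)

t_stat, p_value = stats.ttest_rel(model_b, model_a)    # paired test on the same examples
diff = model_b.mean() - model_a.mean()
se = (model_b - model_a).std(ddof=1) / np.sqrt(n)
ci = (diff - 1.96 * se, diff + 1.96 * se)               # ~95% confidence interval on the gap

print(f"diff={diff:.3f}, p={p_value:.3f}, 95% CI={ci}")
```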
A/B Testing in ML
Setup: Compare model A (baseline) vs model B (new model)
Randomization: Crucial for causal inference
- Random assignment of users/requests to A or B
- Eliminates confounding variables
- Enables causal claims ("B caused the improvement")
Multiple Testing Problem:
- Running 20 tests with α=0.05 → Expect 1 false positive by chance
- Solution: Bonferroni correction (α_adjusted = α/k where k = # tests)
- Better: False Discovery Rate (FDR) control for many tests
Tricky Interview Question: "Your A/B test shows model B is 3% better after 3 days. Should you deploy?"
→ No! (1) Too short - haven't captured weekly patterns, (2) Possible novelty effect, (3) Statistical power may be insufficient, (4) Need to verify across different user segments, (5) Check if improvement is consistent across days or just a lucky spike.
2. Foundational Machine Learning
2.1 Core Concepts Review
| Problem | Symptoms | Solutions |
|---|---|---|
| Underfitting | High train & test error | Increase model capacity, more features, train longer |
| Overfitting | Low train error, high test error | Regularization, more data, early stopping, dropout |
| Just Right | Low train & test error, small gap | You're good! Monitor for data drift |
2.2 Regularization Techniques
L1 (Lasso) Regularization: Loss + λΣ|wᵢ|
- Encourages sparsity (many weights → 0)
- Feature selection built-in
- Non-differentiable at 0 (use subgradient)
L2 (Ridge) Regularization: Loss + λΣwᵢ²
- Encourages small weights (weight decay)
- Smoother than L1, all weights shrink
- Equivalent to Gaussian prior in Bayesian view
Dropout
- Randomly zero out activations during training
- Intuition: Ensemble of subnetworks
- Inference: Scale activations by keep_prob (or use inverted dropout)
- Tricky bit: Acts as an adaptive form of regularization (the implicit penalty depends on each unit's activation statistics rather than being uniform like L2) and discourages co-adaptation between neurons
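A minimal inverted-dropout sketch (p is the drop probability; scaling happens at training time so inference is a no-op):

```python
import numpy as np

def inverted_dropout(a: np.ndarray, p: float = 0.1, training: bool = True) -> np.ndarray:
    """Zero out activations with probability p; rescale survivors by 1/(1-p) during training."""
    if not training or p == 0.0:
        return a                                # inference: identity, no rescaling needed
    mask = (np.random.rand(*a.shape) >= p)      # keep each unit with probability 1-p
    return a * mask / (1.0 - p)                 # rescale so expected activation matches inference
```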
2.3 Model Evaluation
Classification Metrics
Precision vs Recall Trade-off
- Precision: TP/(TP+FP) - "Of predicted positives, how many are correct?"
- Recall: TP/(TP+FN) - "Of actual positives, how many did we catch?"
- F1 Score: 2 × (Precision × Recall)/(Precision + Recall) - harmonic mean
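A quick sketch computing these from confusion-matrix counts (the counts are made up for illustration):

```python
def classification_metrics(tp: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp)                          # of predicted positives, fraction correct
    recall = tp / (tp + fn)                             # of actual positives, fraction caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"precision": precision, "recall": recall, "f1": f1}

print(classification_metrics(tp=80, fp=20, fn=40))
# {'precision': 0.8, 'recall': 0.666..., 'f1': 0.727...}
```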
When to optimize what?
- Spam detection: High precision (don't block good emails)
- Cancer screening: High recall (catch all cases)
- Search: Precision@K for top results
ROC-AUC vs PR-AUC
- ROC-AUC: Good for balanced datasets, plots TPR vs FPR
- PR-AUC: Better for imbalanced datasets, focuses on positive class
- Tricky Interview Question: "Why PR-AUC for imbalanced data?" → ROC can look good even with poor minority class performance due to high TN count
3. Deep Learning Fundamentals
3.1 Neural Network Basics
Forward Pass
The forward pass involves two steps:
- Linear transformation: z = Wx + b
- Activation function: a = σ(z)
Activation Functions
| Function | Formula | Range | Pros | Cons | Use Case |
|---|---|---|---|---|---|
| Sigmoid | 1/(1+e^(-x)) | (0,1) | Smooth, probabilistic | Saturates, slow gradients | Output layer, binary classification |
| ReLU | max(0,x) | [0,∞) | Fast, no saturation for x>0 | Dead neurons (x<0) | Default choice for hidden layers |
| GELU | x×Φ(x) | (-∞,∞) | Smooth, stochastic interpretation | Slower to compute | Transformers (GPT, BERT) |
Why GELU in transformers? Smooth, non-monotonic, allows negative values with probability (stochastic regularization effect), empirically better for NLP.
3.2 Initialization
Why initialization matters: Poor init → vanishing/exploding gradients before training even starts
Xavier/Glorot Initialization
w ~ U(-√(6/(n_in + n_out)), √(6/(n_in + n_out)))
- For sigmoid/tanh activations
- Keeps variance similar across layers
He Initialization (Kaiming)
w ~ N(0, 2/n_in) for ReLU activations
- Accounts for ReLU zeroing half the neurons
- Default for modern architectures
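A minimal NumPy sketch of both schemes (fan_in/fan_out are the layer's input and output widths):

```python
import numpy as np

def xavier_uniform(fan_in: int, fan_out: int) -> np.ndarray:
    """Glorot/Xavier: uniform in [-limit, limit]; keeps activation variance stable for tanh/sigmoid."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_out, fan_in))

def he_normal(fan_in: int, fan_out: int) -> np.ndarray:
    """He/Kaiming: N(0, 2/fan_in); compensates for ReLU zeroing half the activations."""
    return np.random.randn(fan_out, fan_in) * np.sqrt(2.0 / fan_in)

W1 = xavier_uniform(768, 768)
W2 = he_normal(768, 3072)
print(W1.std(), W2.std())   # roughly sqrt(2/(768+768)) ≈ 0.036 and sqrt(2/768) ≈ 0.051
```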
3.3 Normalization Techniques
Batch Normalization
y = γ((x - μ_batch)/σ_batch) + β
- Normalizes across batch dimension
- Pros: Faster training, acts as regularizer, less sensitive to init
- Cons: Batch size dependent, different behavior train/test, breaks for seq2seq
Layer Normalization (Used in Transformers)
y = γ((x - μ_layer)/σ_layer) + β
- Normalizes across feature dimension (per sample)
- Pros: Batch-independent, works for any sequence length, stable for RNNs/Transformers
- Cons: Slightly slower than BatchNorm for CNNs
RMS Normalization (Root Mean Square)
y = x / RMS(x) × γ where RMS(x) = √(mean(x²))
- Removes mean subtraction (faster, simpler)
- Used in modern LLMs (LLaMA, Mistral, T5)
- Why? Empirically works as well, 10-20% faster
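A minimal RMSNorm sketch over the feature dimension (eps and the learned scale γ follow the formula above):

```python
import numpy as np

def rms_norm(x: np.ndarray, gamma: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm over the last (feature) dimension: x / RMS(x) * gamma, no mean subtraction."""
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

x = np.random.randn(2, 8)          # (batch, features), toy sizes
gamma = np.ones(8)                 # learned scale, initialized to 1
print(rms_norm(x, gamma).shape)    # (2, 8)
```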
Tricky Interview Question: "Why LayerNorm in transformers not BatchNorm?"
→ Transformers process variable-length sequences, BatchNorm would require padding/masking complexities. LayerNorm works per-sample so length-agnostic. Also, small batch sizes (memory constraints with long sequences) make BatchNorm statistics noisy.
4. Classical Architectures - RNNs & LSTMs
4.1 Recurrent Neural Networks (RNNs)
The Basic Idea: Process sequences by maintaining hidden state
RNN Formulas:
- Hidden state update: h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t + b_h)
- Output: y_t = W_hy @ h_t + b_y
The Fatal Flaw: Vanishing/exploding gradients over time
Why RNNs failed at long sequences:
- Tanh saturation → gradients < 1
- Matrix W_hh multiplied T times
- For T=100 and a gradient factor of 0.9 per step: 0.9^100 ≈ 0 (vanished!)
4.2 Long Short-Term Memory (LSTM)
The Solution: Gates that control information flow
The Four Gates
- Forget Gate (what to throw away from cell state): f_t = σ(W_f @ [h_{t-1}, x_t] + b_f)
- Input Gate (what new info to store): i_t = σ(W_i @ [h_{t-1}, x_t] + b_i)
- Cell State Update (with candidate C̃_t = tanh(W_C @ [h_{t-1}, x_t] + b_C)): C_t = f_t * C_{t-1} + i_t * C̃_t
- Output Gate (what to output): o_t = σ(W_o @ [h_{t-1}, x_t] + b_o), giving h_t = o_t * tanh(C_t)
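A minimal NumPy sketch of a single LSTM time step under these equations (weights are assumed to act on the concatenated [h_{t-1}, x_t]):

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step; each W_* maps the concatenated [h_{t-1}, x_t] to hidden size."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    hx = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ hx + b_f)            # forget gate
    i_t = sigmoid(W_i @ hx + b_i)            # input gate
    c_tilde = np.tanh(W_c @ hx + b_c)        # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde       # additive cell-state update (the gradient highway)
    o_t = sigmoid(W_o @ hx + b_o)            # output gate
    h_t = o_t * np.tanh(c_t)                 # new hidden state
    return h_t, c_t

hidden, inp = 8, 4
rand = lambda *s: np.random.randn(*s) * 0.1
h, c = lstm_step(rand(inp), rand(hidden), rand(hidden),
                 *(rand(hidden, hidden + inp) for _ in range(4)),
                 *(np.zeros(hidden) for _ in range(4)))
print(h.shape, c.shape)   # (8,) (8,)
```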
Why LSTMs work better:
- Additive updates: C_t = f_t * C_{t-1} + ... (not multiplicative like RNNs)
- Gradient highway: Gradients flow through cell state with fewer transformations
- Selective memory: Gates learn what to remember/forget
4.3 Why Transformers Killed RNNs/LSTMs
Sequential Processing Problem
- LSTMs must process token-by-token sequentially
- Can't parallelize across sequence (unlike transformers)
- For sequence length T, need T sequential steps
The Death Blow: "Attention Is All You Need" (2017)
- Showed transformers outperform RNN/LSTM models on machine translation benchmarks, and the approach quickly displaced RNNs across NLP
- 10x faster training on modern hardware
- Better at long-range dependencies
- End of the RNN era for NLP
When to still use LSTMs:
- Streaming applications (process token-by-token in real-time)
- Very long sequences where O(n²) attention is prohibitive
- Limited hardware (mobile deployment)
- Time-series forecasting where sequential structure helps
5. Transformer Architecture - Deep Dive
THE critical interview section. Master: Self-attention math (Q@K^T/√d_k, then softmax, then @V), multi-head parallelization (8 heads learn different patterns), positional encoding (sinusoidal adds sequence order), layer norm placement (pre-norm vs post-norm), and why it works (parallel processing, O(1) path between any tokens).
This is the most critical section. Transformers are THE architecture for modern LLMs.
5.1 The Core Innovation: Attention Mechanism
The Problem Transformers Solve
- RNNs compress entire history into fixed-size hidden state → information bottleneck
- Need direct access to all previous tokens for context
Self-Attention Intuition
For each token, compute how much to "attend to" every other token in the sequence.
Example: "The cat sat on the mat because it was tired"
- "it" should attend strongly to "cat" (resolved reference)
- "tired" should attend to "sat" (action-state relationship)
Visual Flow:
Input X (n × d_model) → Linear Projections → Q, K, V
↓
Q @ K^T → (n × n) attention scores → / √d_k (scale)
↓
Softmax → attention weights (sum to 1 per row)
↓
@ V → weighted combination of values → Output (n × d_v)
Example: Token "it" looks at all tokens via Q@K^T, softmax weights highest for "cat", outputs V-weighted mix
5.2 Attention Mathematics (Step-by-Step)
Input: Sequence of embeddings X = [x₁, x₂, ..., x_n] where each xᵢ ∈ ℝ^d_model
Step 1: Create Queries, Keys, Values
Linear projections create Q, K, V matrices:
- Q = X @ W_Q with shape (n × d_model) @ (d_model × d_k) = (n × d_k)
- K = X @ W_K with shape (n × d_model) @ (d_model × d_k) = (n × d_k)
- V = X @ W_V with shape (n × d_model) @ (d_model × d_v) = (n × d_v)
Intuition:
- Query: "What am I looking for?"
- Key: "What do I offer?"
- Value: "What do I actually communicate?"
Step 2: Compute Attention Scores
scores = Q @ K.T / √d_k
- Q @ K.T computes similarity between all pairs
- Division by √d_k prevents softmax saturation (numerical stability)
- Why scale? Dot products grow with dimension, pushing softmax into saturation regions
Tricky Interview Question: "Why divide by √d_k not d_k?"
→ The dot product of two d_k-dimensional vectors with unit-variance components is a sum of d_k terms, so its variance is ≈ d_k. Dividing by √d_k brings the variance back to ≈ 1, keeping pre-softmax values in a reasonable range. Empirically, dividing by √d_k works better than dividing by d_k.
Step 3: Apply Softmax (Normalize)
attention_weights = softmax(scores, dim=-1)
Each row sums to 1 → weighted average over values
Step 4: Weighted Sum of Values
output = attention_weights @ V
Complete Scaled Dot-Product Attention Formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
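Putting the four steps together, a minimal single-head NumPy sketch (no masking; softmax is written with the usual max-subtraction for numerical stability):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)        # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for shapes (n, d_k) and (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (n, n) similarity scores
    weights = softmax(scores, axis=-1)             # each row sums to 1
    return weights @ V                             # (n, d_v) weighted combination of values

n, d_k, d_v = 5, 16, 16
out = scaled_dot_product_attention(np.random.randn(n, d_k),
                                   np.random.randn(n, d_k),
                                   np.random.randn(n, d_v))
print(out.shape)   # (5, 16)
```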
5.3 Multi-Head Attention
Why Multiple Heads?
Single attention might focus on one relationship type. Multiple heads learn different patterns:
- Head 1: Subject-verb agreement
- Head 2: Coreference resolution
- Head 3: Positional proximity
- Head 4: Semantic similarity
Implementation Algorithm:
- Split d_model across heads: d_k = d_model // num_heads
- Create projections for each head: Each head has separate W_Q_i, W_K_i, W_V_i matrices
- Parallel attention: Compute attention independently for each head
- Concatenate heads: Combine all head outputs along feature dimension (n × d_model)
- Final projection: Apply W_O to get final output
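A compact NumPy sketch of the whole procedure (one shared projection per Q/K/V, reshaped into heads, which is how typical implementations realize the per-head W_Q_i, W_K_i, W_V_i):

```python
import numpy as np

def softmax(x):                                   # numerically stable softmax over last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    """X: (n, d_model); W_Q/K/V/O: (d_model, d_model)."""
    n, d_model = X.shape
    d_k = d_model // num_heads                            # split d_model across heads
    def split(W):                                         # project, then reshape to (heads, n, d_k)
        return (X @ W).reshape(n, num_heads, d_k).transpose(1, 0, 2)
    Q, K, V = split(W_Q), split(W_K), split(W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (heads, n, n), per-head attention
    out = softmax(scores) @ V                             # (heads, n, d_k)
    concat = out.transpose(1, 0, 2).reshape(n, d_model)   # concatenate heads (not average)
    return concat @ W_O                                   # final mixing projection

n, d_model, h = 6, 64, 8
rand = lambda *s: np.random.randn(*s) * 0.1
print(multi_head_attention(rand(n, d_model), rand(d_model, d_model), rand(d_model, d_model),
                           rand(d_model, d_model), rand(d_model, d_model), h).shape)  # (6, 64)
```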
Tricky Interview Question: "Why concatenate heads instead of averaging?"
→ Averaging loses information - all heads would need to agree. Concatenation preserves each head's unique perspective, then W_O learns how to combine them optimally.
5.4 Positional Encoding
The Problem: Attention has no notion of order! "Cat sat mat" = "Mat sat cat" without position info.
Solution: Add positional information to input embeddings
Sinusoidal Positional Encoding (Original transformer)
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Why this weird formula?
- Different frequencies for different dimensions
- Smooth, continuous representation
- Can extrapolate to longer sequences than seen during training
- Relative positions encoded: PE(pos+k) can be expressed as a linear function of PE(pos)
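A minimal sketch of the sinusoidal table (returns a (max_len × d_model) matrix that gets added to the token embeddings):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]                # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]        # even dimension indices = 2i
    angle = pos / np.power(10000.0, two_i / d_model) # a different frequency per dimension
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=64)
print(pe.shape)   # (128, 64) — added to token embeddings before the first layer
```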
Rotary Position Embeddings (RoPE) (Modern LLMs like LLaMA)
Algorithm:
- Rotate query vectors based on position: Q_rotated = rotate(Q, pos)
- Rotate key vectors based on position: K_rotated = rotate(K, pos)
- Compute attention: attention_scores = Q_rotated @ K_rotated.T
Why RoPE is better:
- Encodes relative positions directly in attention computation
- Better extrapolation to longer sequences
- More efficient than adding positional encodings
5.5 Transformer Block Architecture
Complete Transformer Encoder Block:
- Multi-head self-attention layer: MultiHeadAttention(d_model, num_heads)
- First layer normalization: LayerNorm(d_model)
- Feed-forward network: FeedForward(d_model, d_ff)
- Second layer normalization: LayerNorm(d_model)
- Dropout for regularization
Feed-Forward Network (Position-wise)
FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
- Often uses GELU instead of ReLU in modern transformers
- Applied independently to each position
- Typically:
d_ff = 4 × d_model(expansion then projection) - Why needed? Attention is linear (weighted average), FFN adds non-linearity
Residual Connections (Skip Connections)
x = x + SubLayer(x)
- Why critical? Enables gradient flow through many layers (like ResNet)
- Without residuals, deep transformers (>12 layers) very hard to train
- Allows identity mapping (model can learn to skip layers if not needed)
Layer Normalization Placement: Pre-norm vs Post-norm
Post-norm (Original transformer):
x = LayerNorm(x + Attention(x))
x = LayerNorm(x + FFN(x))
Pre-norm (Modern practice):
x = x + Attention(LayerNorm(x))
x = x + FFN(LayerNorm(x))
Why Pre-norm became standard?
- More stable training for deep transformers (>24 layers)
- Gradients flow better through residual path
- Allows training without learning rate warmup (though warmup still helps)
- Used in GPT-2, GPT-3, LLaMA, and most modern decoder-only LLMs (BERT and the original transformer use post-norm)
5.6 Encoder vs Decoder Architecture
Encoder (BERT-style):
- Bidirectional attention (each token sees all tokens)
- Used for understanding tasks: classification, NER, Q&A
Decoder (GPT-style):
- Causal/masked attention (token i can only see tokens ≤ i)
- Used for generation tasks: language modeling, completion
- Causal masking: Set attention scores to -∞ for future positions before softmax
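A minimal sketch of causal masking applied to the score matrix (scores are random here just to show the mechanics):

```python
import numpy as np

n = 5                                             # sequence length (toy)
scores = np.random.randn(n, n)                    # stand-in for Q @ K^T / sqrt(d_k)

causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)    # True above the diagonal = future positions
scores = np.where(causal_mask, -np.inf, scores)             # block attention to the future

e = np.exp(scores - scores.max(axis=-1, keepdims=True))     # softmax with -inf → weight 0
weights = e / e.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # upper triangle is exactly 0: token i only attends to tokens <= i
```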
When to use what?
- Encoder-only: Classification, tagging, embeddings (BERT, RoBERTa)
- Decoder-only: Generation, completion, few-shot (GPT series)
- Encoder-Decoder: Translation, summarization, structured generation (T5, BART)
6. Modern LLM Architectures
6.1 BERT (Bidirectional Encoder Representations from Transformers)
Architecture: Encoder-only transformer
Key Innovations:
1. Masked Language Modeling (MLM)
Example:
- Input: "The cat [MASK] on the mat"
- Task: Predict [MASK] = "sat"
- Randomly mask 15% of tokens
- Of those: 80% → [MASK], 10% → random token, 10% → unchanged
- Forces bidirectional understanding
Variants:
- RoBERTa: Better BERT (removed NSP, longer training, more data)
- ALBERT: Parameter sharing across layers (much smaller)
- DistilBERT: Distilled version (40% smaller, 97% performance)
6.2 GPT Series (Generative Pre-trained Transformer)
Architecture: Decoder-only transformer with causal attention
Evolution:
GPT-1 (2018):
- 12 layers, 117M parameters
- Showed pre-training + fine-tuning works
GPT-2 (2019):
- 48 layers, 1.5B parameters
- Zero-shot learning: performs tasks without fine-tuning
GPT-3 (2020):
- 96 layers, 175B parameters
- Few-shot in-context learning
- Key insight: Large enough models can learn from examples in prompt
In-Context Learning (The GPT-3 Breakthrough)
Example Prompt:
- Review: "This movie was great!" Sentiment: Positive
- Review: "Terrible experience." Sentiment: Negative
- Review: "Amazing plot twist!" Sentiment: [MODEL PREDICTS: Positive]
Key properties:
- No gradient updates, just prompt engineering
- Model learns task from examples in context window
6.3 LLaMA (Large Language Model Meta AI)
Key Improvements over GPT:
1. RMSNorm instead of LayerNorm
- Simpler, faster (no mean subtraction)
- Comparable performance
2. RoPE (Rotary Position Embeddings)
- Better length extrapolation
- More efficient than learned/sinusoidal
3. SwiGLU Activation (Swish-Gated Linear Unit)
SwiGLU(x) = Swish(xW) ⊙ (xV) where Swish(x) = x × sigmoid(x)
- Better than ReLU/GELU for LLMs
4. Grouped-Query Attention (LLaMA 2)
- Between multi-head and multi-query attention
- Shares keys and values across groups of queries
- Faster inference, minimal quality loss
LLaMA vs Others:
- Open-source (weights available)
- Trained on publicly available data (no private datasets)
- Smaller models competitive with much larger closed models
- LLaMA 13B ≈ GPT-3 175B on many tasks (better training)
6.4 Mixture of Experts (MoE)
Concept: Sparse model activation
- Many expert networks, only use a few per token
- Router network decides which experts to use
MoELayer Algorithm:
- Router decision: Router computes probabilities for each expert
- Select top-K experts: Choose the K experts with highest probabilities (typically K=2)
- Combine outputs: Weighted sum of selected expert outputs
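A minimal sketch of top-K routing for a single token (tiny linear "experts"; everything here is illustrative rather than any specific MoE implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x, router_W, experts, k=2):
    """Route token x to the top-k experts and combine their outputs by router weight."""
    logits = router_W @ x                        # one logit per expert
    probs = softmax(logits)
    top_k = np.argsort(probs)[-k:]               # indices of the k most probable experts
    weights = probs[top_k] / probs[top_k].sum()  # renormalize over the selected experts
    return sum(w * experts[i](x) for i, w in zip(top_k, weights))

d, num_experts = 16, 8
experts = [lambda x, W=np.random.randn(d, d) * 0.1: W @ x for _ in range(num_experts)]
router_W = np.random.randn(num_experts, d) * 0.1
print(moe_forward(np.random.randn(d), router_W, experts).shape)   # (16,) — only 2 of 8 experts ran
```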
Benefits:
- Massive parameter count with manageable compute
- Each expert can specialize (different languages, domains)
- Only activate ~10-20% of parameters per forward pass
Examples:
- Switch Transformer: 1.6T parameters, only activates 10B per token
- GPT-4 (rumored): MoE with ~8 experts, ~200B params each
6.5 State Space Models (Mamba, S4)
The Problem with Transformers: O(n²) attention complexity
State Space Models: Alternative to attention
Key Properties:
- Linear time complexity O(n) for long sequences
- Can be parallelized for training (using convolution view)
- Efficient autoregressive inference (recurrent view)
Mamba (Recent breakthrough):
- Selective state spaces (context-dependent dynamics)
- Matches transformer quality on language tasks
- 5x faster inference for long sequences (>2K tokens)
When to use SSMs vs Transformers?
- SSMs: Very long sequences (100K+ tokens), efficiency critical
- Transformers: State of the art quality, standard for LLMs
7. RAG & Retrieval Systems
RAG = Retrieve relevant docs, augment prompt, generate with context. Key concepts: Dense retrieval (embeddings), hybrid search (semantic + keyword), chunking strategies, reranking, production challenges (latency, cost, cache hits). Interview focus: when to use RAG vs fine-tuning, semantic vs keyword trade-offs.
7.1 Why RAG?
Problem: LLMs have limited context and outdated knowledge
Solution: Retrieval-Augmented Generation
- Retrieve relevant documents from knowledge base
- Augment prompt with retrieved context
- Generate answer using both parametric knowledge (model weights) and non-parametric knowledge (retrieved docs)
Benefits:
- Up-to-date information without retraining
- Cite sources (explainability)
- Reduced hallucinations
- Domain-specific knowledge injection
Query Processing:
User Query → Query Embedding (via embedding model)
↓
Retrieval: Vector DB search → Top-k similar chunks (cosine similarity)
↓
Reranking (optional): Cross-encoder reranks results for relevance
↓
Augmentation: Inject retrieved docs into prompt template
↓
Generation: LLM generates answer with context → Response with citations
7.2 Dense Retrieval
Old Way: BM25 (Sparse Retrieval)
- Keyword matching with TF-IDF
- Fast, interpretable, but misses semantic similarity
- "How to train a dog" won't match "Canine obedience techniques"
New Way: Dense Retrieval with Embeddings
Bi-Encoder Architecture:
Offline phase (precompute once):
- Encode all documents: doc_embeddings = encoder(documents) with shape (N × d)
Online phase (at query time):
- Encode query: query_embedding = encoder(query) with shape (1 × d)
- Compute scores: scores = cosine_similarity(query_embedding, doc_embeddings)
- Retrieve top-K: top_k_docs = argsort(scores)[-k:]
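A minimal sketch of the bi-encoder scoring and top-K step (the encoder is stubbed out with random unit vectors; in practice it would be a sentence-embedding model):

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Offline: pretend we've encoded N document chunks into d-dimensional embeddings
N, d, k = 1000, 384, 5
doc_embeddings = normalize(np.random.randn(N, d))     # stand-in for encoder(documents)

# Online: encode the query, score by cosine similarity, take top-k
query_embedding = normalize(np.random.randn(1, d))    # stand-in for encoder(query)
scores = (query_embedding @ doc_embeddings.T).ravel() # cosine similarity (unit vectors)
top_k_docs = np.argsort(scores)[-k:][::-1]            # indices of the k most similar chunks

print(top_k_docs, scores[top_k_docs])
```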
Hard Negatives: Critical for good retrieval
- Random negatives too easy
- Mine hard negatives: high BM25 score but wrong answer
- In-batch negatives: Use other queries' positives as your negatives
7.3 Embedding Models
Modern Embedding Models:
- Sentence-BERT: BERT with siamese network
- E5: Multilingual, instruction-aware embeddings
- BGE: State-of-the-art for retrieval
- OpenAI text-embedding-3: Commercial API
- Jina Embeddings v3: 8K context, excellent for long docs
Matryoshka Embeddings:
- Single model produces embeddings at multiple dimensions
- 768 → 512 → 256 → 128 → 64
- Truncate to smaller dim for speed/storage trade-off
- Minimal quality loss for many tasks
7.4 Vector Databases & Approximate Nearest Neighbor (ANN)
Exact Search Problem: O(N) for N documents - too slow for millions of docs
ANN Algorithms:
1. HNSW (Hierarchical Navigable Small World)
- Graph-based, navigates through layers
- Very fast queries, high recall
- Used by: Qdrant, Weaviate, Pinecone
2. IVF (Inverted File Index)
- Cluster embeddings, search only relevant clusters
- Memory efficient
- Used by: FAISS
Vector DB Comparison:
| Database | Best For | Key Feature |
|---|---|---|
| Qdrant | Production RAG | Filtering + vector search |
| Pinecone | Managed service | Easiest to use |
| Weaviate | Hybrid search | GraphQL, BM25 + vector |
| FAISS | Offline/research | Facebook, highly optimized |
7.5 Hybrid Search
Combine sparse + dense retrieval:
Hybrid Search Algorithm:
- Get results from BM25: bm25_scores = bm25_search(query, docs)
- Get results from vector search: vector_scores = vector_search(query_emb, doc_embs)
- Combine using Reciprocal Rank Fusion (RRF)
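A minimal reciprocal rank fusion sketch (k=60 is the constant commonly used in practice; the ranked lists are hypothetical doc IDs):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["doc3", "doc1", "doc7"]          # hypothetical keyword-search ranking
vector_results = ["doc1", "doc9", "doc3"]        # hypothetical dense-retrieval ranking
print(reciprocal_rank_fusion([bm25_results, vector_results]))
# ['doc1', 'doc3', 'doc9', 'doc7'] — docs ranked well by both lists float to the top
```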
Why hybrid?
- Dense: Semantic similarity, synonyms, paraphrases
- Sparse: Exact keyword matches, rare terms, names
- Together: Best of both worlds
7.6 Advanced RAG Techniques
Re-ranking:
Two-stage retrieval:
- Fast retrieval: Get top 100 candidates using vector search
- Slow but accurate re-ranking: Use cross-encoder to score all (query, candidate) pairs
- Select final top-K: Pick top 5 based on re-ranker scores
HyDE (Hypothetical Document Embeddings):
Algorithm:
- Generate hypothetical answer using LLM: hypo_doc = llm("Answer this question: " + query)
- Use hypothetical answer for retrieval: docs = vector_search(embed(hypo_doc))
- Why it works: Bridges query-document gap
Query Rewriting:
Example:
- Original query: "it" (ambiguous)
- Rewritten with context: "GPT-4 architecture details"
7.7 Production RAG Pipeline
Complete Pipeline:
Configuration:
- chunk_size: 512 tokens per chunk
- chunk_overlap: 50 tokens
- retrieval_top_k: 20 candidates
- rerank_top_k: 5 final docs
Ingestion Pipeline:
- Chunk documents: Break documents into overlapping chunks (512 tokens, 50 overlap)
- Generate embeddings: Embed all chunks using embedding model
- Store in vector DB: Upsert embeddings with metadata
Retrieval Pipeline:
- Embed query: Convert query to embedding vector
- Hybrid search: Vector + BM25, combine using reciprocal rank fusion
- Re-rank: Use cross-encoder to re-rank top 20, select top 5
- Return: Final top-5 documents with context
Tricky Interview Question: "How do you handle outdated information in RAG?"
→ Timestamp metadata + periodic re-ingestion. Filter results by recency. Implement cache invalidation when documents update. For time-sensitive queries, boost recent documents in ranking.
8. Memory Architectures
8.1 The Memory Problem
Challenge: How do models remember information across interactions?
Context Window Limitations:
- GPT-4: 128K tokens (~300 pages) - expensive, slow
- Most models: 4K-32K tokens
- What about conversations over days/weeks/months?
8.2 Modern Conversational Memory
Conversational AI Memory Hierarchy:
1. Short-term (Context Window)
- Last N tokens in conversation
- Directly in model context
- Fast, but limited capacity
2. Medium-term (Session Memory)
- Summary of current conversation
- Vector DB with session embeddings
- Retrieve relevant parts when context full
3. Long-term (Episodic Memory)
- Past conversations, user preferences
- Graph DB (entities, relationships, events)
- Retrieve when semantically relevant
Practical Architecture:
add_message Algorithm:
- Add to short-term: Append message to recent messages list
- Check capacity: If short_term > MAX_CONTEXT:
- Summarize oldest 10 messages using LLM
- Store summary in vector DB (medium-term)
- Remove oldest 10 from short-term
- Extract facts: Parse message for entities/relationships
- Store in graph: Add facts to graph database (long-term)
retrieve_context Algorithm:
- Get short-term: All recent messages
- Search medium-term: Vector search for relevant summaries (top-5)
- Query long-term: Graph traversal for related facts
- Combine and format: Merge all three sources into coherent context
8.3 Knowledge Graphs for Memory
Graph Structure:
- Nodes: Entities (people, places, concepts)
- Edges: Relationships (knows, works_at, discussed_on)
- Properties: Attributes (age, location, sentiment)
Example:
User talked about meeting John at coffee shop:
- (User)-[:MENTIONED]->(John:Person)
- (John)-[:MET_AT {date: "2024-01-15"}]->(Starbucks:Place)
Temporal Graphs:
- Relationships have timestamps
- Query: "Who did I meet last week?"
- Decay old information (importance ∝ recency)
Graph + Vector Hybrid:
- Vector search for semantic similarity
- Graph traversal for structured relationships
- Combine results
Tricky Interview Question: "How do you handle conflicting information in long-term memory?"
→ Temporal priority (newer > older), confidence scores, user feedback loop. Keep version history with timestamps. Flag conflicts for user resolution. Use graph structure to track belief updates over time.
9. Whiteboard Design Scenarios
9.1 Design a Conversational Memory System
Requirements:
- Store unlimited conversation history
- Fast retrieval (<100ms)
- Understand context from weeks ago
- Scale to millions of users
- Privacy-first (user data isolated)
Components:
1. Storage Layers
- Hot Storage: Redis (last 24h of conversation)
- Warm Storage: Vector DB (last 30 days, embeddings)
- Cold Storage: S3 + Graph DB (all history, structured facts)
2. Retrieval Strategy
retrieve_memory Algorithm:
- Fast path - Recent conversation: Fetch from Redis cache (last 24h)
- Semantic search - Relevant past: Vector search filtered by user_id for top-10
- Graph query - Entity-based: Graph traversal for relationships
- Merge and rank: Combine all results with recency decay weighting
3. Scaling Considerations
- Sharding: By user_id (each shard handles subset of users)
- Caching: Frequently accessed memories in L1 cache
- Compression: Older conversations summarized, original stored in cold
- Pruning: Remove low-importance memories (based on access frequency + age)
9.2 Production RAG Pipeline Design
Requirements:
- 1M documents, 10K queries/sec
- <200ms end-to-end latency
- Accuracy > 90% (user satisfaction)
- Handle document updates in real-time
Pipeline Stages:
1. Document Ingestion
- Extract text from PDF/HTML
- Clean and normalize
- Intelligent chunking (512 tokens, 50 overlap, sentence-aware)
- Generate embeddings (batch encode, batch_size=32)
- Store with metadata
2. Query Processing
- Check cache: Hash query + context, return if cached
- Parallel retrieval: Run vector search and BM25 concurrently
- Fusion: Apply reciprocal rank fusion, take top-20
- Re-rank: Use cross-encoder to re-rank, select top-5
- Generate: LLM generates answer using context
- Cache result: Store in Redis with 1-hour TTL
3. Latency Budget
- Retrieval: 50ms (vector search)
- Re-ranking: 30ms (cross-encoder on 20 docs)
- Generation: 100ms (LLM with streaming)
- Overhead: 20ms (network, serialization)
- Total: ~200ms
4. Optimization Techniques
- Caching: 70% cache hit rate → 70% queries <10ms
- Approximate search: HNSW with ef=32 (vs brute force)
- Quantization: 8-bit embeddings (768 dims → 192 bytes)
- Batching: Batch re-ranking for efficiency
- Streaming: Start generating while re-ranking completes
9.3 Multi-Agent Coordination System
Requirements:
- 5-10 specialized agents (sales, support, coding, research)
- Route user queries to appropriate agents
- Agents can call each other for help
- Maintain conversation context across agents
- Avoid loops and deadlocks
Components:
1. Router Agent (Master orchestrator)
Routing Algorithm:
- Classify intent: Determine query category and confidence
- High confidence (>0.9): Route to single specialized agent
- Low confidence: Consult all agents in parallel, ensemble results
- Return final answer
2. Inter-Agent Communication
Safe Agent Call Algorithm:
- Deadlock prevention: Check if call would create cycle in call graph
- Add edge: Record agent-to-agent call in graph
- Rate limiting: Ensure agent hasn't exceeded MAX_CALLS
- Execute: Call target agent's process method
- Return result
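A minimal sketch of the deadlock-prevention check from step 1 (a DFS over an in-memory adjacency map; agent names and the depth limit are illustrative):

```python
from collections import defaultdict

class CallGraph:
    """Tracks in-flight agent-to-agent calls and rejects edges that would create a cycle."""
    def __init__(self, max_depth: int = 5):
        self.edges = defaultdict(set)
        self.max_depth = max_depth

    def _reachable(self, start: str, target: str, depth: int = 0) -> bool:
        if depth > self.max_depth:
            return True                        # treat excessive call depth as unsafe
        if target in self.edges[start]:
            return True
        return any(self._reachable(nxt, target, depth + 1) for nxt in self.edges[start])

    def try_call(self, caller: str, callee: str) -> bool:
        if self._reachable(callee, caller):    # would callee eventually call back into caller?
            return False                       # adding this edge would create a cycle
        self.edges[caller].add(callee)
        return True

g = CallGraph()
print(g.try_call("router", "support"))   # True
print(g.try_call("support", "coding"))   # True
print(g.try_call("coding", "router"))    # False — router → support → coding → router would loop
```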
3. Evaluation & Monitoring
Key Metrics:
- routing_accuracy: % queries correctly routed to right agent
- agent_success_rate: % queries successfully resolved
- avg_agent_calls: Average # agents invoked per query
- latency_by_agent: Latency breakdown per agent type
Tricky Interview Question: "How do you prevent agent loops?"
→ Track call graph in real-time. Before allowing agent A to call agent B, check if it creates cycle. Maintain maximum call depth limit. Use timeout for entire query processing. Log suspicious patterns (A→B→A) for review.
10. Paper Discussion Prep
10.1 "Attention Is All You Need" (Vaswani et al., 2017)
Key Contributions:
- Transformer architecture (encoder-decoder)
- Multi-head self-attention mechanism
- Positional encoding (sinusoidal)
- Completely replaced RNNs/LSTMs
Interview Questions:
Q: Why is self-attention O(n²) and why is that a problem?
A: Each token attends to all other tokens → n × n attention matrix. For long sequences (>10K tokens), this becomes memory-prohibitive (O(n²) space) and computationally expensive (O(n²d) time for computing attention scores).
Q: Why multi-head attention instead of single large attention?
A: Multiple heads learn different relationship patterns (syntax vs semantics vs position). Similar to multiple filters in CNNs. Empirically, 8-16 heads work better than one head with 8-16× dimensions.
Q: How would you modify transformers for longer sequences?
A: Sparse attention (Longformer, BigBird), Linformer (low-rank approximation), Reformer (LSH), Flash Attention (memory-efficient), Mixture of experts (sparse activation)
10.2 RAG Papers
"Dense Passage Retrieval" (Karpukhin et al., 2020)
Key Idea: Use dense embeddings for retrieval instead of BM25
Training:
- Bi-encoder: Separate encoders for query and passage
- In-batch negatives: Other passages in batch are negatives
- Hard negatives: High BM25 score but wrong answer
Interview Q: Why in-batch negatives?
A: Computationally efficient (no extra forward passes), provides diverse negatives, scales to large batch sizes. Limitation: If batch size small, negatives may be too easy.
"Retrieval-Augmented Generation" (Lewis et al., 2020)
Key Idea: Combine parametric (model weights) and non-parametric (retrieval) knowledge
Two Variants:
- RAG-Sequence: Retrieve once, use for entire generation
- RAG-Token: Retrieve for each generated token
10.3 Modern LLM Papers
"Language Models are Few-Shot Learners" (GPT-3, Brown et al., 2020)
Key Findings:
- Scale is all you need (175B parameters)
- In-context learning emerges at scale
- No fine-tuning needed for many tasks
Interview Q: Why does in-context learning emerge? What's happening?
A: During pre-training on internet text, model sees many examples of pattern completion, Q&A, etc. It learns meta-learning: "given these examples, continue the pattern." At sufficient scale, this generalizes to new tasks.
"LLaMA: Open and Efficient Foundation Language Models" (Touvron et al., 2023)
Key Contributions:
- Smaller models competitive with larger ones (better training)
- Open-source weights
- Architectural improvements: RMSNorm, SwiGLU, RoPE
Interview Q: Why is LLaMA 13B competitive with GPT-3 175B?
A: Better data quality + curation, longer training (1.4T tokens vs 300B), architectural improvements, training tricks (better learning rate schedule, etc.). Shows that data quality >> model size for many tasks.
11. Interview Traps & Gotchas
11.1 Math Traps
"Explain backprop from scratch"
- Trap: Glossing over chain rule details
- Show: Matrix dimensions at each step, transpose operations
"When would you use L1 vs L2 regularization?"
- Trap: "L1 for sparsity, L2 for smoothness" (too vague)
- Better: L1 when you need feature selection (eliminates features entirely), L2 when all features matter but you want small weights. L1 gradient discontinuity at zero means SGD implementations need special handling.
11.2 Architecture Traps
"Why do transformers work better than LSTMs?"
- Trap: Only mentioning parallelization
- Complete answer: Parallelization + direct connections (better long-range) + better gradient flow + more efficient for modern hardware
"What happens if you don't use positional encoding?"
- Trap: "Model won't know order"
- Nuance: Model can still learn some positional info from content (e.g., "first" and "finally" are positional markers), but explicit encoding much more effective
11.3 Training Traps
"Your model isn't learning. What do you check?"
Systematic debugging:
- Overfit single batch (proves model capacity + no bugs)
- Check gradients (vanishing? exploding? NaN?)
- Learning rate (too high? too low? plot loss curve)
- Data (labels correct? normalized? shuffled?)
- Architecture (bottlenecks? activation choice?)
- Loss function (appropriate for task? numerically stable?)
"How do you choose hyperparameters?"
- Trap: "Grid search"
- Better: Start with known good defaults (from papers), use learning rate finder, random search > grid search, Bayesian optimization for expensive searches, monitor during training and adjust.
11.4 RAG Traps
"Why not just increase context window instead of using RAG?"
- Costs: Larger context = much more expensive (quadratic in tokens)
- Quality: Attention dilutes with more context ("lost in the middle" problem)
- Freshness: Can't update knowledge without retraining
- Attribution: RAG can cite sources
"How do you evaluate RAG quality?"
- Trap: Only end-to-end accuracy
- Decompose: Retrieval quality (recall@k, precision@k, MRR), relevance (are retrieved docs actually relevant?), generation quality (fluency, factuality, groundedness)
11.5 Production ML Traps
"How do you monitor ML models in production?"
- Not just accuracy: Data drift, prediction drift, latency, error rates
- Business metrics: User engagement, conversion, retention
- Model-specific: Attention patterns, confidence scores, embedding drift
"Your ML model shows bias. What do you do?"
- Understand source: Training data bias? Model architecture? Evaluation metric?
- Measure: Define fairness metrics for your use case (demographic parity, equalized odds, etc.)
- Mitigate: Re-sample training data, re-weight loss, adversarial debiasing, fairness constraints
- Monitor: Continuous bias metrics in production, broken down by demographic groups
11.6 Behavioral/System Design Traps
"How would you explain transformers to a non-technical person?"
- Use analogy: "Like reading a book where you can instantly flip to any related section, instead of reading page-by-page. The model learns which parts to focus on for any given question."
- Avoid jargon: Don't say "attention mechanism" - say "focus on relevant information"
"How do you prioritize when building an ML system?"
- Start with simplest baseline
- Identify bottlenecks (data? model? engineering?)
- Measure impact of improvements
- 80/20 rule: Simple models often get 80% of the way there
Appendix: Quick Reference
Key Formulas
- Attention: Attention(Q,K,V) = softmax(QK^T / √d_k)V
- Cross-Entropy Loss: L = -Σ y_i log(ŷ_i)
- Layer Norm: y = γ(x - μ)/σ + β
Typical Hyperparameters
Transformer (GPT-style):
- Layers: 12-96
- d_model: 768-12288
- Heads: 12-96
- d_ff: 3072-49152 (4× d_model)
- Dropout: 0.1
- Learning rate: 1e-4 to 6e-4
- Warmup: 2000-10000 steps
RAG Pipeline:
- Chunk size: 256-512 tokens
- Overlap: 50-100 tokens
- Top-k retrieval: 20-50
- Top-k rerank: 3-5
- Embedding dim: 768-1024
Final Tips for Interview Success
Before the interview:
- Re-read papers on company's core tech (RAG? Agents? Specific architecture?)
- Practice whiteboarding system designs out loud
- Prepare 2-3 deep technical stories from your experience
- Review recent ML news (new models, techniques)
During technical discussion:
- Think out loud (show reasoning process)
- Ask clarifying questions before diving in
- Start simple, then add complexity
- Discuss trade-offs explicitly
- Admit when you don't know (then reason through it)
Red flags to avoid:
- Claiming to know everything
- Not asking questions
- Ignoring trade-offs ("this is always better")
- Overcomplicating simple problems
- Not testing your solution
Green flags to show:
- Systematic problem-solving
- Awareness of latest research
- Production ML experience
- Clear communication
- Collaborative attitude
Remember: Founding engineer roles value pragmatism + depth. Show you can ship fast while understanding fundamentals deeply. Balance is key.