AI Engineer Study Guide
Reference for ML Production Roles
Target Audience: AI/ML Engineers with production LLM experience preparing for founding engineer interviews
Focus: Theory + Whiteboard Design + Paper Discussions
1. Mathematical Foundations
1.1 Linear Algebra (The Core)
Every neural network operation is matrix multiplication. Master matrix shapes, eigenvalues (PCA, gradient behavior), and SVD (LoRA uses this). Interview focus: attention mechanism dimensions and low-rank approximations.
Why it matters: Every neural network operation is matrix multiplication. Understanding shapes, ranks, and transformations is non-negotiable.
Key Concepts
Matrix Multiplication & Dimensionality
- Matrix multiplication: (m × n) @ (n × p) = (m × p) - the inner dimensions must match
- Tricky bit: In attention, Q @ K^T works because (seq_len × d_k) @ (d_k × seq_len) = (seq_len × seq_len)
- Interview trap: "Why do we transpose K in attention?" → Creates compatibility AND semantic meaning (query-key similarity matrix)
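A quick NumPy sanity check of these shapes (the sizes are arbitrary illustrative values, not from any particular model):

```python
import numpy as np

seq_len, d_model, d_k = 10, 64, 16         # illustrative sizes

X = np.random.randn(seq_len, d_model)      # token embeddings
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)

Q = X @ W_Q                                # (seq_len, d_k)
K = X @ W_K                                # (seq_len, d_k)
scores = Q @ K.T                           # (seq_len, seq_len) query-key similarity matrix

print(Q.shape, K.shape, scores.shape)      # (10, 16) (10, 16) (10, 10)
```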
Eigenvalues & Eigenvectors
Av = λv - a direction v that doesn't change under transformation A, only scales by λ
Why it matters:
- PCA finds principal components (eigenvectors of covariance matrix)
- Gradient explosion/vanishing relates to eigenvalues of weight matrices
- Spectral normalization uses largest eigenvalue for stability
Singular Value Decomposition (SVD)
Formula: A = UΣV^T
- U: left singular vectors (output space basis)
- Σ: singular values (scaling factors)
- V^T: right singular vectors (input space basis)
Applications in ML:
- Low-rank matrix factorization (LoRA for LLM fine-tuning)
- Dimensionality reduction
- Matrix completion (recommendation systems)
Key Relationships:
• Matrix Multiplication: (m×n) @ (n×p) → (m×p) - inner dimensions must match
• Eigenvectors: Directions unchanged by transformation A, scaled by eigenvalue λ
• SVD: A = UΣV^T decomposes any matrix into rotation-scale-rotation
• Applications: PCA (eigenvectors), LoRA (low-rank), attention (Q@K^T)
Tricky Interview Question: "How does LoRA use SVD concepts?"
→ LoRA approximates weight updates as low-rank: ΔW = BA where B is (d × r) and A is (r × d) with r << d. This is inspired by SVD's idea that most information lives in top singular values.
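A minimal sketch of that low-rank update (shapes only; the real LoRA implementation also scales the update by α/r and keeps W frozen during training):

```python
import numpy as np

d, r = 4096, 8                      # illustrative: model dim and low rank, r << d

W = np.random.randn(d, d)           # frozen pretrained weight
A = np.random.randn(r, d) * 0.01    # "down" factor (d -> r), small random init
B = np.zeros((d, r))                # "up" factor (r -> d), zero-initialized so ΔW starts at 0

delta_W = B @ A                     # rank-r update: (d × d) shape, but only 2*d*r trainable params
W_effective = W + delta_W           # what the forward pass effectively uses

print(delta_W.shape, 2 * d * r, d * d)   # (4096, 4096) 65536 16777216
```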
Norms & Distance Metrics
L1 vs L2 Norms
- L1 (Manhattan): ||x||₁ = Σ|xᵢ| → Sparse solutions, robust to outliers
- L2 (Euclidean): ||x||₂ = √(Σxᵢ²) → Smooth gradients, penalizes large weights more
Cosine Similarity (Critical for embeddings)
Formula: cos(θ) = (A · B) / (||A|| ||B||)
- Range: [-1, 1]
- Why cosine not Euclidean for embeddings? → Scale-invariant, captures angle not magnitude
- Interview trap: "When would cosine similarity fail?" → When magnitude matters (e.g., word frequency in TF-IDF)
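A minimal sketch contrasting cosine similarity with Euclidean distance on toy vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (a · b) / (||a|| ||b||); scale-invariant, range [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, 2 * a))     # 1.0 — same direction, magnitude ignored
print(np.linalg.norm(a - 2 * a))       # ~3.74 — Euclidean distance is not scale-invariant
```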
1.2 Calculus & Optimization
Chain Rule - The Backpropagation Foundation
Formula: ∂L/∂w = ∂L/∂y × ∂y/∂z × ∂z/∂w
Tricky bit - Matrix Calculus:
When differentiating matrix operations, track dimensions carefully:
- ∂(Wx)/∂W = x^T (outer product creates the right shape)
- ∂(Wx)/∂x = W^T (transpose for correct flow)
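A small NumPy sketch of backprop through a single linear layer, assuming a random upstream gradient, just to confirm the shapes work out:

```python
import numpy as np

# Backprop through y = W @ x for a single sample (values are random; shapes are the point).
d_in, d_out = 3, 4
x = np.random.randn(d_in)
W = np.random.randn(d_out, d_in)
y = W @ x                                  # forward: (d_out,)

dL_dy = np.random.randn(d_out)             # upstream gradient from the loss
dL_dW = np.outer(dL_dy, x)                 # (d_out, d_in) — outer product gives W's shape
dL_dx = W.T @ dL_dy                        # (d_in,)  — transpose routes the gradient to the input

print(dL_dW.shape == W.shape, dL_dx.shape == x.shape)   # True True
```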
Gradient Descent Variants
| Algorithm | Update Rule | Key Property | When to Use |
|---|---|---|---|
| SGD | w -= lr × ∇w | Noisy, explores well | Small datasets, need exploration |
| Momentum | v = βv + ∇w | Accelerates in consistent directions | When gradients have high variance |
| RMSProp | v = βv + (1-β)∇w² | Adapts per-parameter learning rate | Non-stationary objectives |
| Adam | Combines momentum + RMSProp | Fast convergence | Default choice for transformers |
| AdamW | Adam + decoupled weight decay | Better regularization | SOTA for LLM training |
Tricky Interview Question: "Why use AdamW over Adam?"
→ In Adam, L2 regularization is added to the gradient and then passed through the adaptive scaling, so parameters with large gradient magnitudes receive less effective decay. AdamW decouples it: the adaptive step uses only the gradient, and decay is applied directly to the weights, w -= lr × (adam_update(∇w) + λw).
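A minimal sketch of the decoupling for a single parameter tensor (bias correction omitted for brevity; hyperparameter values are illustrative defaults):

```python
import numpy as np

def adamw_step(w, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """One AdamW step (bias correction omitted for brevity)."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2        # second moment (RMSProp-style)
    adam_update = m / (np.sqrt(v) + eps)         # adaptive step computed from the gradient only
    w = w - lr * (adam_update + wd * w)          # weight decay applied outside the adaptive scaling
    return w, m, v
```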
1.3 Probability & Statistics
Central Limit Theorem (CLT) - Foundation of Modern ML
Statement: The distribution of sample means approximates a normal distribution as sample size increases, regardless of the population's distribution shape.
Mathematical Form: X̄ ~ N(μ, σ²/n) as n → ∞
Why It Matters in ML:
- Batch Training Stability: Averaging gradients over a batch of n samples reduces the gradient variance by a factor of n (standard deviation by √n). This is why larger batches lead to more stable updates.
- Why SGD Converges: Noisy gradient estimates from mini-batches approximate true gradients via CLT, enabling convergence guarantees.
- Batch Normalization: Assumes activations follow approximately normal distribution per batch, which CLT supports for large batches.
- Confidence Intervals: Error bars on model performance metrics rely on CLT for validity.
Tricky Interview Question: "Why do larger batches in SGD lead to worse generalization despite more stable gradients?"
→ Large batches converge to sharp minima (high curvature) which generalize poorly. Small batches' noise helps escape sharp minima and find flat minima (low curvature) with better test performance. This is the generalization gap phenomenon.
Key Statistical Concepts
Maximum Likelihood Estimation (MLE)
θ* = argmax_θ Π p(xᵢ|θ) = argmax_θ Σ log p(xᵢ|θ)
- Why log? Converts products to sums (numerical stability + easier gradients)
- Connection to loss: Minimizing cross-entropy = MLE for categorical distribution
Bias-Variance Decomposition
E[(y - ŷ)²] = Bias² + Variance + Irreducible Error
- High bias: Underfitting (too simple model)
- High variance: Overfitting (too complex, memorizes noise)
- Sweet spot: Balance through regularization, model capacity
1.4 Information Theory
Entropy (Measure of uncertainty)
H(X) = -Σ p(x) log p(x)
- High entropy = high uncertainty (uniform distribution)
- Low entropy = low uncertainty (peaked distribution)
- In ML: We want models with low entropy predictions (confident)
Cross-Entropy (Distance between distributions)
H(P, Q) = -Σ p(x) log q(x)
- P: true distribution, Q: predicted distribution
- Cross-entropy loss: Minimizing this = matching distributions
- Binary: -[y log(ŷ) + (1-y) log(1-ŷ)]
- Multi-class: -Σ yᵢ log(ŷᵢ) (categorical cross-entropy)
KL Divergence (How different are two distributions?)
D_KL(P || Q) = Σ p(x) log(p(x)/q(x)) = H(P,Q) - H(P)
- Properties: Always ≥ 0, asymmetric (P||Q ≠ Q||P)
- In VAE: Regularization term KL(q(z|x) || p(z)) keeps latent space structured
- In RLHF: KL penalty keeps model close to reference policy
Tricky Interview Question: "Why is cross-entropy preferred over MSE for classification?"
→ With a sigmoid/softmax output, MSE's gradient with respect to the logits contains a σ'(z) factor that vanishes when the output saturates, so confidently wrong predictions (ŷ close to 0 or 1) learn very slowly. Cross-entropy cancels that factor: its gradient with respect to the logits is simply (ŷ - y), so the more wrong the prediction, the larger the update.
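A small NumPy check of these three quantities on toy distributions, confirming D_KL(P||Q) = H(P,Q) - H(P) and the asymmetry:

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])    # "true" distribution (toy example)
Q = np.array([0.5, 0.3, 0.2])    # "predicted" distribution

H_P  = -np.sum(P * np.log(P))          # entropy of P
H_PQ = -np.sum(P * np.log(Q))          # cross-entropy H(P, Q)
KL   = np.sum(P * np.log(P / Q))       # KL divergence D_KL(P || Q)

print(np.isclose(KL, H_PQ - H_P))                  # True: D_KL(P||Q) = H(P,Q) - H(P)
print(np.isclose(KL, np.sum(Q * np.log(Q / P))))   # False: KL is asymmetric
```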
1.5 Statistical Inference & Hypothesis Testing
Why This Matters: A/B testing, model comparison, and determining if performance improvements are real or noise all rely on hypothesis testing. Production ML decisions need statistical rigor.
Hypothesis Testing Framework
The Setup:
- Null Hypothesis (H₀): The "boring" hypothesis - no effect, no difference
- Alternative Hypothesis (H₁): The claim you're trying to prove
- Significance Level (α): Threshold for rejecting H₀ (typically 0.05 or 0.01)
- p-value: Probability of observing this extreme data if H₀ is true
Decision Rule: If p-value < α, reject H₀ (statistically significant result)
Type I and Type II Errors
Type I Error (False Positive, α):
- Definition: Rejecting H₀ when it's actually true
- In ML: Saying model B is better when it's not
- Real-world cost: Wasted resources deploying inferior model
- Control: Set lower α (0.01 instead of 0.05) for critical decisions
Type II Error (False Negative, β):
- Definition: Failing to reject H₀ when H₁ is true
- In ML: Missing a real improvement
- Real-world cost: Leaving better model undiscovered
- Control: Increase sample size, increase α (trade-off!)
Statistical Power (1 - β):
- Definition: Probability of correctly rejecting H₀ when H₁ is true
- Target: Power ≥ 0.80 (80% chance of detecting true effect)
- Affected by: Sample size, effect size, α, test type
Tricky Interview Question: "Your model shows 2% accuracy improvement. Is it significant?"
→ Depends on: (1) Sample size - is it 100 examples or 10,000? (2) Variance - consistent or noisy? (3) Business context - is 2% valuable? Run a paired t-test on validation predictions, compute confidence interval, consider practical significance vs statistical significance.
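A minimal sketch of that check with a paired t-test (scipy's ttest_rel; the per-example correctness arrays here are simulated, not real model outputs):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1000                                   # hypothetical validation set size
model_a = rng.binomial(1, 0.80, size=n)    # per-example correctness for model A (simulated)
model_b = rng.binomial(1, 0.82, size=n)    # per-example correctness for model B (simulated)

t_stat, p_value = stats.ttest_rel(model_b, model_a)    # paired test on the same examples
diff = model_b.mean() - model_a.mean()
se = (model_b - model_a).std(ddof=1) / np.sqrt(n)
ci = (diff - 1.96 * se, diff + 1.96 * se)               # ~95% confidence interval on the gap

print(f"diff={diff:.3f}, p={p_value:.3f}, 95% CI={ci}")
```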
A/B Testing in ML
Setup: Compare model A (baseline) vs model B (new model)
Randomization: Crucial for causal inference
- Random assignment of users/requests to A or B
- Eliminates confounding variables
- Enables causal claims ("B caused the improvement")
Multiple Testing Problem:
- Running 20 tests with α=0.05 → Expect 1 false positive by chance
- Solution: Bonferroni correction (α_adjusted = α/k where k = # tests)
- Better: False Discovery Rate (FDR) control for many tests
Tricky Interview Question: "Your A/B test shows model B is 3% better after 3 days. Should you deploy?"
→ No! (1) Too short - haven't captured weekly patterns, (2) Possible novelty effect, (3) Statistical power may be insufficient, (4) Need to verify across different user segments, (5) Check if improvement is consistent across days or just a lucky spike.
2. Foundational Machine Learning
2.1 Core Concepts Review
| Problem | Symptoms | Solutions |
|---|---|---|
| Underfitting | High train & test error | Increase model capacity, more features, train longer |
| Overfitting | Low train error, high test error | Regularization, more data, early stopping, dropout |
| Just Right | Low train & test error, small gap | You're good! Monitor for data drift |
2.2 Regularization Techniques
L1 (Lasso) Regularization: Loss + λΣ|wᵢ|
- Encourages sparsity (many weights → 0)
- Feature selection built-in
- Non-differentiable at 0 (use subgradient)
L2 (Ridge) Regularization: Loss + λΣwᵢ²
- Encourages small weights (weight decay)
- Smoother than L1, all weights shrink
- Equivalent to Gaussian prior in Bayesian view
Dropout
- Randomly zero out activations during training
- Intuition: Ensemble of subnetworks
- Inference: Scale activations by keep_prob (or use inverted dropout)
- Tricky bit: Acts as an adaptive form of regularization (the implicit penalty depends on each unit's activation statistics rather than being uniform like L2) and discourages co-adaptation between neurons
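A minimal inverted-dropout sketch (p is the drop probability; scaling happens at training time so inference is a no-op):

```python
import numpy as np

def inverted_dropout(a: np.ndarray, p: float = 0.1, training: bool = True) -> np.ndarray:
    """Zero out activations with probability p; rescale survivors by 1/(1-p) during training."""
    if not training or p == 0.0:
        return a                                # inference: identity, no rescaling needed
    mask = (np.random.rand(*a.shape) >= p)      # keep each unit with probability 1-p
    return a * mask / (1.0 - p)                 # rescale so expected activation matches inference
```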
2.3 Model Evaluation
Classification Metrics
Precision vs Recall Trade-off
- Precision: TP/(TP+FP) - "Of predicted positives, how many are correct?"
- Recall: TP/(TP+FN) - "Of actual positives, how many did we catch?"
- F1 Score: 2 × (Precision × Recall)/(Precision + Recall) - harmonic mean
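A quick sketch computing these from confusion-matrix counts (the counts are made up for illustration):

```python
def classification_metrics(tp: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp)                          # of predicted positives, fraction correct
    recall = tp / (tp + fn)                             # of actual positives, fraction caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"precision": precision, "recall": recall, "f1": f1}

print(classification_metrics(tp=80, fp=20, fn=40))
# {'precision': 0.8, 'recall': 0.666..., 'f1': 0.727...}
```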
When to optimize what?
- Spam detection: High precision (don't block good emails)
- Cancer screening: High recall (catch all cases)
- Search: Precision@K for top results
ROC-AUC vs PR-AUC
- ROC-AUC: Good for balanced datasets, plots TPR vs FPR
- PR-AUC: Better for imbalanced datasets, focuses on positive class
- Tricky Interview Question: "Why PR-AUC for imbalanced data?" → ROC can look good even with poor minority class performance due to high TN count
3. Deep Learning Fundamentals
3.1 Neural Network Basics
Forward Pass
The forward pass involves two steps:
- Linear transformation: z = Wx + b
- Activation function: a = σ(z)
Activation Functions
| Function | Formula | Range | Pros | Cons | Use Case |
|---|---|---|---|---|---|
| Sigmoid | 1/(1+e^(-x)) | (0,1) | Smooth, probabilistic | Saturates, slow gradients | Output layer, binary classification |
| ReLU | max(0,x) | [0,∞) | Fast, no saturation for x>0 | Dead neurons (x<0) | Default choice for hidden layers |
| GELU | x×Φ(x) | (-∞,∞) | Smooth, stochastic interpretation | Slower to compute | Transformers (GPT, BERT) |
Why GELU in transformers? Smooth, non-monotonic, allows negative values with probability (stochastic regularization effect), empirically better for NLP.
3.2 Initialization
Why initialization matters: Poor init → vanishing/exploding gradients before training even starts
Xavier/Glorot Initialization
w ~ U(-√(6/(n_in + n_out)), √(6/(n_in + n_out)))
- For sigmoid/tanh activations
- Keeps variance similar across layers
He Initialization (Kaiming)
w ~ N(0, 2/n_in) for ReLU activations
- Accounts for ReLU zeroing half the neurons
- Default for modern architectures
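A minimal NumPy sketch of both schemes (fan_in/fan_out are the layer's input and output widths):

```python
import numpy as np

def xavier_uniform(fan_in: int, fan_out: int) -> np.ndarray:
    """Glorot/Xavier: uniform in [-limit, limit]; keeps activation variance stable for tanh/sigmoid."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_out, fan_in))

def he_normal(fan_in: int, fan_out: int) -> np.ndarray:
    """He/Kaiming: N(0, 2/fan_in); compensates for ReLU zeroing half the activations."""
    return np.random.randn(fan_out, fan_in) * np.sqrt(2.0 / fan_in)

W1 = xavier_uniform(768, 768)
W2 = he_normal(768, 3072)
print(W1.std(), W2.std())   # roughly sqrt(2/(768+768)) ≈ 0.036 and sqrt(2/768) ≈ 0.051
```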
3.3 Normalization Techniques
Batch Normalization
y = γ((x - μ_batch)/σ_batch) + β
- Normalizes across batch dimension
- Pros: Faster training, acts as regularizer, less sensitive to init
- Cons: Batch size dependent, different behavior train/test, breaks for seq2seq
Layer Normalization (Used in Transformers)
y = γ((x - μ_layer)/σ_layer) + β
- Normalizes across feature dimension (per sample)
- Pros: Batch-independent, works for any sequence length, stable for RNNs/Transformers
- Cons: Slightly slower than BatchNorm for CNNs
RMS Normalization (Root Mean Square)
y = x / RMS(x) × γ where RMS(x) = √(mean(x²))
- Removes mean subtraction (faster, simpler)
- Used in modern LLMs (LLaMA, Mistral, T5)
- Why? Empirically works as well, 10-20% faster
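A minimal RMSNorm sketch over the feature dimension (eps and the learned scale γ follow the formula above):

```python
import numpy as np

def rms_norm(x: np.ndarray, gamma: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm over the last (feature) dimension: x / RMS(x) * gamma, no mean subtraction."""
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

x = np.random.randn(2, 8)          # (batch, features), toy sizes
gamma = np.ones(8)                 # learned scale, initialized to 1
print(rms_norm(x, gamma).shape)    # (2, 8)
```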
Tricky Interview Question: "Why LayerNorm in transformers not BatchNorm?"
→ Transformers process variable-length sequences, BatchNorm would require padding/masking complexities. LayerNorm works per-sample so length-agnostic. Also, small batch sizes (memory constraints with long sequences) make BatchNorm statistics noisy.
4. Classical Architectures - RNNs & LSTMs
4.1 Recurrent Neural Networks (RNNs)
The Basic Idea: Process sequences by maintaining hidden state
RNN Formulas:
- Hidden state update: h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t + b_h)
- Output: y_t = W_hy @ h_t + b_y
The Fatal Flaw: Vanishing/exploding gradients over time
Why RNNs failed at long sequences:
- Tanh saturation → gradients < 1
- Matrix W_hh multiplied T times
- For T=100 and a gradient factor of 0.9 per step: 0.9^100 ≈ 0 (vanished!)
4.2 Long Short-Term Memory (LSTM)
The Solution: Gates that control information flow
The Four Gates
- Forget Gate (what to throw away from cell state): f_t = σ(W_f @ [h_{t-1}, x_t] + b_f)
- Input Gate (what new info to store): i_t = σ(W_i @ [h_{t-1}, x_t] + b_i)
- Cell State Update (with candidate C̃_t = tanh(W_C @ [h_{t-1}, x_t] + b_C)): C_t = f_t * C_{t-1} + i_t * C̃_t
- Output Gate (what to output): o_t = σ(W_o @ [h_{t-1}, x_t] + b_o), giving h_t = o_t * tanh(C_t)
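A minimal NumPy sketch of a single LSTM time step under these equations (weights are assumed to act on the concatenated [h_{t-1}, x_t]):

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step; each W_* maps the concatenated [h_{t-1}, x_t] to hidden size."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    hx = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ hx + b_f)            # forget gate
    i_t = sigmoid(W_i @ hx + b_i)            # input gate
    c_tilde = np.tanh(W_c @ hx + b_c)        # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde       # additive cell-state update (the gradient highway)
    o_t = sigmoid(W_o @ hx + b_o)            # output gate
    h_t = o_t * np.tanh(c_t)                 # new hidden state
    return h_t, c_t

hidden, inp = 8, 4
rand = lambda *s: np.random.randn(*s) * 0.1
h, c = lstm_step(rand(inp), rand(hidden), rand(hidden),
                 *(rand(hidden, hidden + inp) for _ in range(4)),
                 *(np.zeros(hidden) for _ in range(4)))
print(h.shape, c.shape)   # (8,) (8,)
```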
Why LSTMs work better:
- Additive updates: C_t = f_t * C_{t-1} + ... (not multiplicative like RNNs)
- Gradient highway: Gradients flow through cell state with fewer transformations
- Selective memory: Gates learn what to remember/forget
4.3 Why Transformers Killed RNNs/LSTMs
Sequential Processing Problem
- LSTMs must process token-by-token sequentially
- Can't parallelize across sequence (unlike transformers)
- For sequence length T, need T sequential steps
The Death Blow: "Attention Is All You Need" (2017)
- Showed transformers outperform RNN/LSTM models on machine translation benchmarks, and the approach quickly displaced RNNs across NLP
- 10x faster training on modern hardware
- Better at long-range dependencies
- End of the RNN era for NLP
When to still use LSTMs:
- Streaming applications (process token-by-token in real-time)
- Very long sequences where O(n²) attention is prohibitive
- Limited hardware (mobile deployment)
- Time-series forecasting where sequential structure helps
5. Transformer Architecture - Deep Dive
THE critical interview section. Master: Self-attention math (Q@K^T/√d_k, then softmax, then @V), multi-head parallelization (8 heads learn different patterns), positional encoding (sinusoidal adds sequence order), layer norm placement (pre-norm vs post-norm), and why it works (parallel processing, O(1) path between any tokens).
This is the most critical section. Transformers are THE architecture for modern LLMs.
5.1 The Core Innovation: Attention Mechanism
The Problem Transformers Solve
- RNNs compress entire history into fixed-size hidden state → information bottleneck
- Need direct access to all previous tokens for context
Self-Attention Intuition
For each token, compute how much to "attend to" every other token in the sequence.
Example: "The cat sat on the mat because it was tired"
- "it" should attend strongly to "cat" (resolved reference)
- "tired" should attend to "sat" (action-state relationship)
Visual Flow:
Input X (n × d_model) → Linear Projections → Q, K, V
↓
Q @ K^T → (n × n) attention scores → / √d_k (scale)
↓
Softmax → attention weights (sum to 1 per row)
↓
@ V → weighted combination of values → Output (n × d_v)
Example: Token "it" looks at all tokens via Q@K^T, softmax weights highest for "cat", outputs V-weighted mix
5.2 Attention Mathematics (Step-by-Step)
Input: Sequence of embeddings X = [x₁, x₂, ..., x_n] where each xᵢ ∈ ℝ^d_model
Step 1: Create Queries, Keys, Values
Linear projections create Q, K, V matrices:
- Q = X @ W_Q with shape (n × d_model) @ (d_model × d_k) = (n × d_k)
- K = X @ W_K with shape (n × d_model) @ (d_model × d_k) = (n × d_k)
- V = X @ W_V with shape (n × d_model) @ (d_model × d_v) = (n × d_v)
Intuition:
- Query: "What am I looking for?"
- Key: "What do I offer?"
- Value: "What do I actually communicate?"
Step 2: Compute Attention Scores
scores = Q @ K.T / √d_k
- Q @ K.T computes similarity between all pairs
- Division by √d_k prevents softmax saturation (numerical stability)
- Why scale? Dot products grow with dimension, pushing softmax into saturation regions
Tricky Interview Question: "Why divide by √d_k not d_k?"
→ The dot product of two d_k-dimensional vectors with unit-variance components is a sum of d_k terms, so its variance is ≈ d_k. Dividing by √d_k brings the variance back to ≈ 1, keeping pre-softmax values in a reasonable range. Empirically, dividing by √d_k works better than dividing by d_k.
Step 3: Apply Softmax (Normalize)
attention_weights = softmax(scores, dim=-1)
Each row sums to 1 → weighted average over values
Step 4: Weighted Sum of Values
output = attention_weights @ V
Complete Scaled Dot-Product Attention Formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
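Putting the four steps together, a minimal single-head NumPy sketch (no masking; softmax is written with the usual max-subtraction for numerical stability):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)        # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for shapes (n, d_k) and (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (n, n) similarity scores
    weights = softmax(scores, axis=-1)             # each row sums to 1
    return weights @ V                             # (n, d_v) weighted combination of values

n, d_k, d_v = 5, 16, 16
out = scaled_dot_product_attention(np.random.randn(n, d_k),
                                   np.random.randn(n, d_k),
                                   np.random.randn(n, d_v))
print(out.shape)   # (5, 16)
```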
5.3 Multi-Head Attention
Why Multiple Heads?
Single attention might focus on one relationship type. Multiple heads learn different patterns:
- Head 1: Subject-verb agreement
- Head 2: Coreference resolution
- Head 3: Positional proximity
- Head 4: Semantic similarity
Implementation Algorithm:
- Split d_model across heads: d_k = d_model // num_heads
- Create projections for each head: Each head has separate W_Q_i, W_K_i, W_V_i matrices
- Parallel attention: Compute attention independently for each head
- Concatenate heads: Combine all head outputs along feature dimension (n × d_model)
- Final projection: Apply W_O to get final output
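A compact NumPy sketch of the whole procedure (one shared projection per Q/K/V, reshaped into heads, which is how typical implementations realize the per-head W_Q_i, W_K_i, W_V_i):

```python
import numpy as np

def softmax(x):                                   # numerically stable softmax over last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    """X: (n, d_model); W_Q/K/V/O: (d_model, d_model)."""
    n, d_model = X.shape
    d_k = d_model // num_heads                            # split d_model across heads
    def split(W):                                         # project, then reshape to (heads, n, d_k)
        return (X @ W).reshape(n, num_heads, d_k).transpose(1, 0, 2)
    Q, K, V = split(W_Q), split(W_K), split(W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (heads, n, n), per-head attention
    out = softmax(scores) @ V                             # (heads, n, d_k)
    concat = out.transpose(1, 0, 2).reshape(n, d_model)   # concatenate heads (not average)
    return concat @ W_O                                   # final mixing projection

n, d_model, h = 6, 64, 8
rand = lambda *s: np.random.randn(*s) * 0.1
print(multi_head_attention(rand(n, d_model), rand(d_model, d_model), rand(d_model, d_model),
                           rand(d_model, d_model), rand(d_model, d_model), h).shape)  # (6, 64)
```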
Tricky Interview Question: "Why concatenate heads instead of averaging?"
→ Averaging loses information - all heads would need to agree. Concatenation preserves each head's unique perspective, then W_O learns how to combine them optimally.
5.4 Positional Encoding
The Problem: Attention has no notion of order! "Cat sat mat" = "Mat sat cat" without position info.
Solution: Add positional information to input embeddings
Sinusoidal Positional Encoding (Original transformer)
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Why this weird formula?
- Different frequencies for different dimensions
- Smooth, continuous representation
- Can extrapolate to longer sequences than seen during training
- Relative positions encoded: PE(pos+k) can be expressed as a linear function of PE(pos)
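A minimal sketch of the sinusoidal table (returns a (max_len × d_model) matrix that gets added to the token embeddings):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]                # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]        # even dimension indices = 2i
    angle = pos / np.power(10000.0, two_i / d_model) # a different frequency per dimension
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=64)
print(pe.shape)   # (128, 64) — added to token embeddings before the first layer
```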
Rotary Position Embeddings (RoPE) (Modern LLMs like LLaMA)
Algorithm:
- Rotate query vectors based on position: Q_rotated = rotate(Q, pos)
- Rotate key vectors based on position: K_rotated = rotate(K, pos)
- Compute attention: attention_scores = Q_rotated @ K_rotated.T
Why RoPE is better:
- Encodes relative positions directly in attention computation
- Better extrapolation to longer sequences
- More efficient than adding positional encodings
5.5 Transformer Block Architecture
Complete Transformer Encoder Block:
- Multi-head self-attention layer: MultiHeadAttention(d_model, num_heads)
- First layer normalization: LayerNorm(d_model)
- Feed-forward network: FeedForward(d_model, d_ff)
- Second layer normalization: LayerNorm(d_model)
- Dropout for regularization
Feed-Forward Network (Position-wise)
FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
- Often uses GELU instead of ReLU in modern transformers
- Applied independently to each position
- Typically:
d_ff = 4 × d_model(expansion then projection) - Why needed? Attention is linear (weighted average), FFN adds non-linearity
Residual Connections (Skip Connections)
x = x + SubLayer(x)
- Why critical? Enables gradient flow through many layers (like ResNet)
- Without residuals, deep transformers (>12 layers) very hard to train
- Allows identity mapping (model can learn to skip layers if not needed)
Layer Normalization Placement: Pre-norm vs Post-norm
Post-norm (Original transformer):
x = LayerNorm(x + Attention(x))
x = LayerNorm(x + FFN(x))
Pre-norm (Modern practice):
x = x + Attention(LayerNorm(x))
x = x + FFN(LayerNorm(x))
Why Pre-norm became standard?
- More stable training for deep transformers (>24 layers)
- Gradients flow better through residual path
- Allows training without learning rate warmup (though warmup still helps)
- Used in GPT-2, GPT-3, LLaMA, and most modern decoder-only LLMs (BERT and the original transformer use post-norm)
5.6 Encoder vs Decoder Architecture
Encoder (BERT-style):
- Bidirectional attention (each token sees all tokens)
- Used for understanding tasks: classification, NER, Q&A
Decoder (GPT-style):
- Causal/masked attention (token i can only see tokens ≤ i)
- Used for generation tasks: language modeling, completion
- Causal masking: Set attention scores to -∞ for future positions before softmax
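A minimal sketch of causal masking applied to the score matrix (scores are random here just to show the mechanics):

```python
import numpy as np

n = 5                                             # sequence length (toy)
scores = np.random.randn(n, n)                    # stand-in for Q @ K^T / sqrt(d_k)

causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)    # True above the diagonal = future positions
scores = np.where(causal_mask, -np.inf, scores)             # block attention to the future

e = np.exp(scores - scores.max(axis=-1, keepdims=True))     # softmax with -inf → weight 0
weights = e / e.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # upper triangle is exactly 0: token i only attends to tokens <= i
```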
When to use what?
- Encoder-only: Classification, tagging, embeddings (BERT, RoBERTa)
- Decoder-only: Generation, completion, few-shot (GPT series)
- Encoder-Decoder: Translation, summarization, structured generation (T5, BART)
6. Modern LLM Architectures
6.1 BERT (Bidirectional Encoder Representations from Transformers)
Architecture: Encoder-only transformer
Key Innovations:
1. Masked Language Modeling (MLM)
Example:
- Input: "The cat [MASK] on the mat"
- Task: Predict [MASK] = "sat"
- Randomly mask 15% of tokens
- Of those: 80% → [MASK], 10% → random token, 10% → unchanged
- Forces bidirectional understanding
Variants:
- RoBERTa: Better BERT (removed NSP, longer training, more data)
- ALBERT: Parameter sharing across layers (much smaller)
- DistilBERT: Distilled version (40% smaller, 97% performance)
6.2 GPT Series (Generative Pre-trained Transformer)
Architecture: Decoder-only transformer with causal attention
Evolution:
GPT-1 (2018):
- 12 layers, 117M parameters
- Showed pre-training + fine-tuning works
GPT-2 (2019):
- 48 layers, 1.5B parameters
- Zero-shot learning: performs tasks without fine-tuning
GPT-3 (2020):
- 96 layers, 175B parameters
- Few-shot in-context learning
- Key insight: Large enough models can learn from examples in prompt
In-Context Learning (The GPT-3 Breakthrough)
Example Prompt:
- Review: "This movie was great!" Sentiment: Positive
- Review: "Terrible experience." Sentiment: Negative
- Review: "Amazing plot twist!" Sentiment: [MODEL PREDICTS: Positive]
Key properties:
- No gradient updates, just prompt engineering
- Model learns task from examples in context window
6.3 LLaMA (Large Language Model Meta AI)
Key Improvements over GPT:
1. RMSNorm instead of LayerNorm
- Simpler, faster (no mean subtraction)
- Comparable performance
2. RoPE (Rotary Position Embeddings)
- Better length extrapolation
- More efficient than learned/sinusoidal
3. SwiGLU Activation (Swish-Gated Linear Unit)
SwiGLU(x) = Swish(xW) ⊙ (xV) where Swish(x) = x × sigmoid(x)
- Better than ReLU/GELU for LLMs
4. Grouped-Query Attention (LLaMA 2)
- Between multi-head and multi-query attention
- Shares keys and values across groups of queries
- Faster inference, minimal quality loss
LLaMA vs Others:
- Open-source (weights available)
- Trained on publicly available data (no private datasets)
- Smaller models competitive with much larger closed models
- LLaMA 13B ≈ GPT-3 175B on many tasks (better training)
6.4 Mixture of Experts (MoE)
Concept: Sparse model activation
- Many expert networks, only use a few per token
- Router network decides which experts to use
MoELayer Algorithm:
- Router decision: Router computes probabilities for each expert
- Select top-K experts: Choose the K experts with highest probabilities (typically K=2)
- Combine outputs: Weighted sum of selected expert outputs
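A minimal sketch of top-K routing for a single token (tiny linear "experts"; everything here is illustrative rather than any specific MoE implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x, router_W, experts, k=2):
    """Route token x to the top-k experts and combine their outputs by router weight."""
    logits = router_W @ x                        # one logit per expert
    probs = softmax(logits)
    top_k = np.argsort(probs)[-k:]               # indices of the k most probable experts
    weights = probs[top_k] / probs[top_k].sum()  # renormalize over the selected experts
    return sum(w * experts[i](x) for i, w in zip(top_k, weights))

d, num_experts = 16, 8
experts = [lambda x, W=np.random.randn(d, d) * 0.1: W @ x for _ in range(num_experts)]
router_W = np.random.randn(num_experts, d) * 0.1
print(moe_forward(np.random.randn(d), router_W, experts).shape)   # (16,) — only 2 of 8 experts ran
```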
Benefits:
- Massive parameter count with manageable compute
- Each expert can specialize (different languages, domains)
- Only activate ~10-20% of parameters per forward pass
Examples:
- Switch Transformer: 1.6T parameters, only activates 10B per token
- GPT-4 (rumored): MoE with ~8 experts, ~200B params each
6.5 State Space Models (Mamba, S4)
The Problem with Transformers: O(n²) attention complexity
State Space Models: Alternative to attention
Key Properties:
- Linear time complexity O(n) for long sequences
- Can be parallelized for training (using convolution view)
- Efficient autoregressive inference (recurrent view)
Mamba (Recent breakthrough):
- Selective state spaces (context-dependent dynamics)
- Matches transformer quality on language tasks
- 5x faster inference for long sequences (>2K tokens)
When to use SSMs vs Transformers?
- SSMs: Very long sequences (100K+ tokens), efficiency critical
- Transformers: State of the art quality, standard for LLMs
7. RAG & Retrieval Systems
RAG = Retrieve relevant docs, augment prompt, generate with context. Key concepts: Dense retrieval (embeddings), hybrid search (semantic + keyword), chunking strategies, reranking, production challenges (latency, cost, cache hits). Interview focus: when to use RAG vs fine-tuning, semantic vs keyword trade-offs.
7.1 Why RAG?
Problem: LLMs have limited context and outdated knowledge
Solution: Retrieval-Augmented Generation
- Retrieve relevant documents from knowledge base
- Augment prompt with retrieved context
- Generate answer using both parametric knowledge (model weights) and non-parametric knowledge (retrieved docs)
Benefits:
- Up-to-date information without retraining
- Cite sources (explainability)
- Reduced hallucinations
- Domain-specific knowledge injection
Query Processing:
User Query → Query Embedding (via embedding model)
↓
Retrieval: Vector DB search → Top-k similar chunks (cosine similarity)
↓
Reranking (optional): Cross-encoder reranks results for relevance
↓
Augmentation: Inject retrieved docs into prompt template
↓
Generation: LLM generates answer with context → Response with citations
7.2 Dense Retrieval
Old Way: BM25 (Sparse Retrieval)
- Keyword matching with TF-IDF
- Fast, interpretable, but misses semantic similarity
- "How to train a dog" won't match "Canine obedience techniques"
New Way: Dense Retrieval with Embeddings
Bi-Encoder Architecture:
Offline phase (precompute once):
- Encode all documents: doc_embeddings = encoder(documents) with shape (N × d)
Online phase (at query time):
- Encode query: query_embedding = encoder(query) with shape (1 × d)
- Compute scores: scores = cosine_similarity(query_embedding, doc_embeddings)
- Retrieve top-K: top_k_docs = argsort(scores)[-k:]
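A minimal sketch of the bi-encoder scoring and top-K step (the encoder is stubbed out with random unit vectors; in practice it would be a sentence-embedding model):

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Offline: pretend we've encoded N document chunks into d-dimensional embeddings
N, d, k = 1000, 384, 5
doc_embeddings = normalize(np.random.randn(N, d))     # stand-in for encoder(documents)

# Online: encode the query, score by cosine similarity, take top-k
query_embedding = normalize(np.random.randn(1, d))    # stand-in for encoder(query)
scores = (query_embedding @ doc_embeddings.T).ravel() # cosine similarity (unit vectors)
top_k_docs = np.argsort(scores)[-k:][::-1]            # indices of the k most similar chunks

print(top_k_docs, scores[top_k_docs])
```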
Hard Negatives: Critical for good retrieval
- Random negatives too easy
- Mine hard negatives: high BM25 score but wrong answer
- In-batch negatives: Use other queries' positives as your negatives
7.3 Embedding Models
Modern Embedding Models:
- Sentence-BERT: BERT with siamese network
- E5: Multilingual, instruction-aware embeddings
- BGE: State-of-the-art for retrieval
- OpenAI text-embedding-3: Commercial API
- Jina Embeddings v3: 8K context, excellent for long docs
Matryoshka Embeddings:
- Single model produces embeddings at multiple dimensions
- 768 → 512 → 256 → 128 → 64
- Truncate to smaller dim for speed/storage trade-off
- Minimal quality loss for many tasks
7.4 Vector Databases & Approximate Nearest Neighbor (ANN)
Exact Search Problem: O(N) for N documents - too slow for millions of docs
ANN Algorithms:
1. HNSW (Hierarchical Navigable Small World)
- Graph-based, navigates through layers
- Very fast queries, high recall
- Used by: Qdrant, Weaviate, Pinecone
2. IVF (Inverted File Index)
- Cluster embeddings, search only relevant clusters
- Memory efficient
- Used by: FAISS
Vector DB Comparison:
| Database | Best For | Key Feature |
|---|---|---|
| Qdrant | Production RAG | Filtering + vector search |
| Pinecone | Managed service | Easiest to use |
| Weaviate | Hybrid search | GraphQL, BM25 + vector |
| FAISS | Offline/research | Facebook, highly optimized |
7.5 Hybrid Search
Combine sparse + dense retrieval:
Hybrid Search Algorithm:
- Get results from BM25: bm25_scores = bm25_search(query, docs)
- Get results from vector search: vector_scores = vector_search(query_emb, doc_embs)
- Combine using Reciprocal Rank Fusion (RRF)
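A minimal reciprocal rank fusion sketch (k=60 is the constant commonly used in practice; the ranked lists are hypothetical doc IDs):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["doc3", "doc1", "doc7"]          # hypothetical keyword-search ranking
vector_results = ["doc1", "doc9", "doc3"]        # hypothetical dense-retrieval ranking
print(reciprocal_rank_fusion([bm25_results, vector_results]))
# ['doc1', 'doc3', 'doc9', 'doc7'] — docs ranked well by both lists float to the top
```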
Why hybrid?
- Dense: Semantic similarity, synonyms, paraphrases
- Sparse: Exact keyword matches, rare terms, names
- Together: Best of both worlds
7.6 Advanced RAG Techniques
Re-ranking:
Two-stage retrieval:
- Fast retrieval: Get top 100 candidates using vector search
- Slow but accurate re-ranking: Use cross-encoder to score all (query, candidate) pairs
- Select final top-K: Pick top 5 based on re-ranker scores
HyDE (Hypothetical Document Embeddings):
Algorithm:
- Generate hypothetical answer using LLM: hypo_doc = llm("Answer this question: " + query)
- Use hypothetical answer for retrieval: docs = vector_search(embed(hypo_doc))
- Why it works: Bridges query-document gap
Query Rewriting:
Example:
- Original query: "it" (ambiguous)
- Rewritten with context: "GPT-4 architecture details"
7.7 Production RAG Pipeline
Complete Pipeline:
Configuration:
- chunk_size: 512 tokens per chunk
- chunk_overlap: 50 tokens
- retrieval_top_k: 20 candidates
- rerank_top_k: 5 final docs
Ingestion Pipeline:
- Chunk documents: Break documents into overlapping chunks (512 tokens, 50 overlap)
- Generate embeddings: Embed all chunks using embedding model
- Store in vector DB: Upsert embeddings with metadata
Retrieval Pipeline:
- Embed query: Convert query to embedding vector
- Hybrid search: Vector + BM25, combine using reciprocal rank fusion
- Re-rank: Use cross-encoder to re-rank top 20, select top 5
- Return: Final top-5 documents with context
Tricky Interview Question: "How do you handle outdated information in RAG?"
→ Timestamp metadata + periodic re-ingestion. Filter results by recency. Implement cache invalidation when documents update. For time-sensitive queries, boost recent documents in ranking.
8. Memory Architectures
8.1 The Memory Problem
Challenge: How do models remember information across interactions?
Context Window Limitations:
- GPT-4: 128K tokens (~300 pages) - expensive, slow
- Most models: 4K-32K tokens
- What about conversations over days/weeks/months?
8.2 Modern Conversational Memory
Conversational AI Memory Hierarchy:
1. Short-term (Context Window)
- Last N tokens in conversation
- Directly in model context
- Fast, but limited capacity
2. Medium-term (Session Memory)
- Summary of current conversation
- Vector DB with session embeddings
- Retrieve relevant parts when context full
3. Long-term (Episodic Memory)
- Past conversations, user preferences
- Graph DB (entities, relationships, events)
- Retrieve when semantically relevant
Practical Architecture:
add_message Algorithm:
- Add to short-term: Append message to recent messages list
- Check capacity: If short_term > MAX_CONTEXT:
- Summarize oldest 10 messages using LLM
- Store summary in vector DB (medium-term)
- Remove oldest 10 from short-term
- Extract facts: Parse message for entities/relationships
- Store in graph: Add facts to graph database (long-term)
retrieve_context Algorithm:
- Get short-term: All recent messages
- Search medium-term: Vector search for relevant summaries (top-5)
- Query long-term: Graph traversal for related facts
- Combine and format: Merge all three sources into coherent context
8.3 Knowledge Graphs for Memory
Graph Structure:
- Nodes: Entities (people, places, concepts)
- Edges: Relationships (knows, works_at, discussed_on)
- Properties: Attributes (age, location, sentiment)
Example:
User talked about meeting John at coffee shop:
- (User)-[:MENTIONED]->(John:Person)
- (John)-[:MET_AT {date: "2024-01-15"}]->(Starbucks:Place)
Temporal Graphs:
- Relationships have timestamps
- Query: "Who did I meet last week?"
- Decay old information (importance ∝ recency)
Graph + Vector Hybrid:
- Vector search for semantic similarity
- Graph traversal for structured relationships
- Combine results
Tricky Interview Question: "How do you handle conflicting information in long-term memory?"
→ Temporal priority (newer > older), confidence scores, user feedback loop. Keep version history with timestamps. Flag conflicts for user resolution. Use graph structure to track belief updates over time.
9. Whiteboard Design Scenarios
9.1 Design a Conversational Memory System
Requirements:
- Store unlimited conversation history
- Fast retrieval (<100ms)
- Understand context from weeks ago
- Scale to millions of users
- Privacy-first (user data isolated)
Components:
1. Storage Layers
- Hot Storage: Redis (last 24h of conversation)
- Warm Storage: Vector DB (last 30 days, embeddings)
- Cold Storage: S3 + Graph DB (all history, structured facts)
2. Retrieval Strategy
retrieve_memory Algorithm:
- Fast path - Recent conversation: Fetch from Redis cache (last 24h)
- Semantic search - Relevant past: Vector search filtered by user_id for top-10
- Graph query - Entity-based: Graph traversal for relationships
- Merge and rank: Combine all results with recency decay weighting
3. Scaling Considerations
- Sharding: By user_id (each shard handles subset of users)
- Caching: Frequently accessed memories in L1 cache
- Compression: Older conversations summarized, original stored in cold
- Pruning: Remove low-importance memories (based on access frequency + age)
9.2 Production RAG Pipeline Design
Requirements:
- 1M documents, 10K queries/sec
- <200ms end-to-end latency
- Accuracy > 90% (user satisfaction)
- Handle document updates in real-time
Pipeline Stages:
1. Document Ingestion
- Extract text from PDF/HTML
- Clean and normalize
- Intelligent chunking (512 tokens, 50 overlap, sentence-aware)
- Generate embeddings (batch encode, batch_size=32)
- Store with metadata
2. Query Processing
- Check cache: Hash query + context, return if cached
- Parallel retrieval: Run vector search and BM25 concurrently
- Fusion: Apply reciprocal rank fusion, take top-20
- Re-rank: Use cross-encoder to re-rank, select top-5
- Generate: LLM generates answer using context
- Cache result: Store in Redis with 1-hour TTL
3. Latency Budget
- Retrieval: 50ms (vector search)
- Re-ranking: 30ms (cross-encoder on 20 docs)
- Generation: 100ms (LLM with streaming)
- Overhead: 20ms (network, serialization)
- Total: ~200ms
4. Optimization Techniques
- Caching: 70% cache hit rate → 70% queries <10ms
- Approximate search: HNSW with ef=32 (vs brute force)
- Quantization: 8-bit embeddings (768 dims → 192 bytes)
- Batching: Batch re-ranking for efficiency
- Streaming: Start generating while re-ranking completes
9.3 Multi-Agent Coordination System
Requirements:
- 5-10 specialized agents (sales, support, coding, research)
- Route user queries to appropriate agents
- Agents can call each other for help
- Maintain conversation context across agents
- Avoid loops and deadlocks
Components:
1. Router Agent (Master orchestrator)
Routing Algorithm:
- Classify intent: Determine query category and confidence
- High confidence (>0.9): Route to single specialized agent
- Low confidence: Consult all agents in parallel, ensemble results
- Return final answer
2. Inter-Agent Communication
Safe Agent Call Algorithm:
- Deadlock prevention: Check if call would create cycle in call graph
- Add edge: Record agent-to-agent call in graph
- Rate limiting: Ensure agent hasn't exceeded MAX_CALLS
- Execute: Call target agent's process method
- Return result
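A minimal sketch of the deadlock-prevention check from step 1 (a DFS over an in-memory adjacency map; agent names and the depth limit are illustrative):

```python
from collections import defaultdict

class CallGraph:
    """Tracks in-flight agent-to-agent calls and rejects edges that would create a cycle."""
    def __init__(self, max_depth: int = 5):
        self.edges = defaultdict(set)
        self.max_depth = max_depth

    def _reachable(self, start: str, target: str, depth: int = 0) -> bool:
        if depth > self.max_depth:
            return True                        # treat excessive call depth as unsafe
        if target in self.edges[start]:
            return True
        return any(self._reachable(nxt, target, depth + 1) for nxt in self.edges[start])

    def try_call(self, caller: str, callee: str) -> bool:
        if self._reachable(callee, caller):    # would callee eventually call back into caller?
            return False                       # adding this edge would create a cycle
        self.edges[caller].add(callee)
        return True

g = CallGraph()
print(g.try_call("router", "support"))   # True
print(g.try_call("support", "coding"))   # True
print(g.try_call("coding", "router"))    # False — router → support → coding → router would loop
```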
3. Evaluation & Monitoring
Key Metrics:
- routing_accuracy: % queries correctly routed to right agent
- agent_success_rate: % queries successfully resolved
- avg_agent_calls: Average # agents invoked per query
- latency_by_agent: Latency breakdown per agent type
Tricky Interview Question: "How do you prevent agent loops?"
→ Track call graph in real-time. Before allowing agent A to call agent B, check if it creates cycle. Maintain maximum call depth limit. Use timeout for entire query processing. Log suspicious patterns (A→B→A) for review.
10. Paper Discussion Prep
10.1 "Attention Is All You Need" (Vaswani et al., 2017)
Key Contributions:
- Transformer architecture (encoder-decoder)
- Multi-head self-attention mechanism
- Positional encoding (sinusoidal)
- Completely replaced RNNs/LSTMs
Interview Questions:
Q: Why is self-attention O(n²) and why is that a problem?
A: Each token attends to all other tokens → n × n attention matrix. For long sequences (>10K tokens), this becomes memory-prohibitive (O(n²) space) and computationally expensive (O(n²d) time for computing attention scores).
Q: Why multi-head attention instead of single large attention?
A: Multiple heads learn different relationship patterns (syntax vs semantics vs position). Similar to multiple filters in CNNs. Empirically, 8-16 heads work better than one head with 8-16× dimensions.
Q: How would you modify transformers for longer sequences?
A: Sparse attention (Longformer, BigBird), Linformer (low-rank approximation), Reformer (LSH), Flash Attention (memory-efficient), Mixture of experts (sparse activation)
10.2 RAG Papers
"Dense Passage Retrieval" (Karpukhin et al., 2020)
Key Idea: Use dense embeddings for retrieval instead of BM25
Training:
- Bi-encoder: Separate encoders for query and passage
- In-batch negatives: Other passages in batch are negatives
- Hard negatives: High BM25 score but wrong answer
Interview Q: Why in-batch negatives?
A: Computationally efficient (no extra forward passes), provides diverse negatives, scales to large batch sizes. Limitation: If batch size small, negatives may be too easy.
"Retrieval-Augmented Generation" (Lewis et al., 2020)
Key Idea: Combine parametric (model weights) and non-parametric (retrieval) knowledge
Two Variants:
- RAG-Sequence: Retrieve once, use for entire generation
- RAG-Token: Retrieve for each generated token
10.3 Modern LLM Papers
"Language Models are Few-Shot Learners" (GPT-3, Brown et al., 2020)
Key Findings:
- Scale is all you need (175B parameters)
- In-context learning emerges at scale
- No fine-tuning needed for many tasks
Interview Q: Why does in-context learning emerge? What's happening?
A: During pre-training on internet text, model sees many examples of pattern completion, Q&A, etc. It learns meta-learning: "given these examples, continue the pattern." At sufficient scale, this generalizes to new tasks.
"LLaMA: Open and Efficient Foundation Language Models" (Touvron et al., 2023)
Key Contributions:
- Smaller models competitive with larger ones (better training)
- Open-source weights
- Architectural improvements: RMSNorm, SwiGLU, RoPE
Interview Q: Why is LLaMA 13B competitive with GPT-3 175B?
A: Better data quality + curation, longer training (1.4T tokens vs 300B), architectural improvements, training tricks (better learning rate schedule, etc.). Shows that data quality >> model size for many tasks.
11. Interview Traps & Gotchas
11.1 Math Traps
"Explain backprop from scratch"
- Trap: Glossing over chain rule details
- Show: Matrix dimensions at each step, transpose operations
"When would you use L1 vs L2 regularization?"
- Trap: "L1 for sparsity, L2 for smoothness" (too vague)
- Better: L1 when you need feature selection (eliminates features entirely), L2 when all features matter but you want small weights. L1 gradient discontinuity at zero means SGD implementations need special handling.
11.2 Architecture Traps
"Why do transformers work better than LSTMs?"
- Trap: Only mentioning parallelization
- Complete answer: Parallelization + direct connections (better long-range) + better gradient flow + more efficient for modern hardware
"What happens if you don't use positional encoding?"
- Trap: "Model won't know order"
- Nuance: Model can still learn some positional info from content (e.g., "first" and "finally" are positional markers), but explicit encoding much more effective
11.3 Training Traps
"Your model isn't learning. What do you check?"
Systematic debugging:
- Overfit single batch (proves model capacity + no bugs)
- Check gradients (vanishing? exploding? NaN?)
- Learning rate (too high? too low? plot loss curve)
- Data (labels correct? normalized? shuffled?)
- Architecture (bottlenecks? activation choice?)
- Loss function (appropriate for task? numerically stable?)
"How do you choose hyperparameters?"
- Trap: "Grid search"
- Better: Start with known good defaults (from papers), use learning rate finder, random search > grid search, Bayesian optimization for expensive searches, monitor during training and adjust.
11.4 RAG Traps
"Why not just increase context window instead of using RAG?"
- Costs: Larger context = much more expensive (quadratic in tokens)
- Quality: Attention dilutes with more context ("lost in the middle" problem)
- Freshness: Can't update knowledge without retraining
- Attribution: RAG can cite sources
"How do you evaluate RAG quality?"
- Trap: Only end-to-end accuracy
- Decompose: Retrieval quality (recall@k, precision@k, MRR), relevance (are retrieved docs actually relevant?), generation quality (fluency, factuality, groundedness)
11.5 Production ML Traps
"How do you monitor ML models in production?"
- Not just accuracy: Data drift, prediction drift, latency, error rates
- Business metrics: User engagement, conversion, retention
- Model-specific: Attention patterns, confidence scores, embedding drift
"Your ML model shows bias. What do you do?"
- Understand source: Training data bias? Model architecture? Evaluation metric?
- Measure: Define fairness metrics for your use case (demographic parity, equalized odds, etc.)
- Mitigate: Re-sample training data, re-weight loss, adversarial debiasing, fairness constraints
- Monitor: Continuous bias metrics in production, broken down by demographic groups
11.6 Behavioral/System Design Traps
"How would you explain transformers to a non-technical person?"
- Use analogy: "Like reading a book where you can instantly flip to any related section, instead of reading page-by-page. The model learns which parts to focus on for any given question."
- Avoid jargon: Don't say "attention mechanism" - say "focus on relevant information"
"How do you prioritize when building an ML system?"
- Start with simplest baseline
- Identify bottlenecks (data? model? engineering?)
- Measure impact of improvements
- 80/20 rule: Simple models often get 80% of the way there
Appendix: Quick Reference
Key Formulas
- Attention: Attention(Q,K,V) = softmax(QK^T / √d_k)V
- Cross-Entropy Loss: L = -Σ y_i log(ŷ_i)
- Layer Norm: y = γ(x - μ)/σ + β
Typical Hyperparameters
Transformer (GPT-style):
- Layers: 12-96
- d_model: 768-12288
- Heads: 12-96
- d_ff: 3072-49152 (4× d_model)
- Dropout: 0.1
- Learning rate: 1e-4 to 6e-4
- Warmup: 2000-10000 steps
RAG Pipeline:
- Chunk size: 256-512 tokens
- Overlap: 50-100 tokens
- Top-k retrieval: 20-50
- Top-k rerank: 3-5
- Embedding dim: 768-1024
Final Tips for Interview Success
Before the interview:
- Re-read papers on company's core tech (RAG? Agents? Specific architecture?)
- Practice whiteboarding system designs out loud
- Prepare 2-3 deep technical stories from your experience
- Review recent ML news (new models, techniques)
During technical discussion:
- Think out loud (show reasoning process)
- Ask clarifying questions before diving in
- Start simple, then add complexity
- Discuss trade-offs explicitly
- Admit when you don't know (then reason through it)
Red flags to avoid:
- Claiming to know everything
- Not asking questions
- Ignoring trade-offs ("this is always better")
- Overcomplicating simple problems
- Not testing your solution
Green flags to show:
- Systematic problem-solving
- Awareness of latest research
- Production ML experience
- Clear communication
- Collaborative attitude
Remember: Founding engineer roles value pragmatism + depth. Show you can ship fast while understanding fundamentals deeply. Balance is key.