AI Engineer Study Guide

Reference for ML Production Roles

Target Audience: AI/ML Engineers with production LLM experience preparing for founding engineer interviews

Focus: Theory + Whiteboard Design + Paper Discussions

1. Mathematical Foundations

1.1 Linear Algebra (The Core)

Every neural network operation is matrix multiplication. Master matrix shapes, eigenvalues (PCA, gradient behavior), and SVD (LoRA uses this). Interview focus: attention mechanism dimensions and low-rank approximations.

Why it matters: Understanding shapes, ranks, and transformations is non-negotiable; most whiteboard architecture and debugging questions reduce to tracking dimensions.

Key Concepts

Matrix Multiplication & Dimensionality

  • Matrix multiplication (m × n) @ (n × p) = (m × p) - the inner dimensions must match
  • Tricky bit: In attention, Q @ K^T works because (seq_len × d_k) @ (d_k × seq_len) = (seq_len × seq_len)
  • Interview trap: "Why do we transpose K in attention?" → Creates compatibility AND semantic meaning (query-key similarity matrix)
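
A quick shape check makes both points concrete; a minimal NumPy sketch with illustrative sizes:

```python
import numpy as np

seq_len, d_k = 10, 64                 # illustrative sizes
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)

scores = Q @ K.T                      # (10, 64) @ (64, 10) -> (10, 10)
print(scores.shape)                   # one similarity score per query-key pair
```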

Eigenvalues & Eigenvectors

  • Av = λv - direction v that doesn't change under transformation A, only scales by λ
  • Why it matters:
    • PCA finds principal components (eigenvectors of covariance matrix)
    • Gradient explosion/vanishing relates to eigenvalues of weight matrices
    • Spectral normalization uses largest eigenvalue for stability

Singular Value Decomposition (SVD)

Formula: A = UΣV^T

  • U: left singular vectors (output space basis)
  • Σ: singular values (scaling factors)
  • V^T: right singular vectors (input space basis)

Applications in ML:

  • Low-rank matrix factorization (LoRA for LLM fine-tuning)
  • Dimensionality reduction
  • Matrix completion (recommendation systems)

📊 Linear Algebra Concepts Visualization

Key Relationships:

Matrix Multiplication: (m×n) @ (n×p) → (m×p) - inner dimensions must match

Eigenvectors: Directions unchanged by transformation A, scaled by eigenvalue λ

SVD: A = UΣV^T decomposes any matrix into rotation-scale-rotation

Applications: PCA (eigenvectors), LoRA (low-rank), attention (Q@K^T)

Conceptual representation of matrix operations and transformations in neural networks

Tricky Interview Question: "How does LoRA use SVD concepts?"
→ LoRA approximates weight updates as low-rank: ΔW = BA where B is (d × r) and A is (r × d) with r << d. This is inspired by SVD's idea that most information lives in top singular values.
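
A minimal NumPy sketch of the low-rank idea (dimensions and rank are illustrative, not taken from any particular LoRA configuration):

```python
import numpy as np

d, r = 512, 8                         # full dimension vs. low rank (r << d)
B = np.random.randn(d, r) * 0.01      # (d x r)
A = np.random.randn(r, d) * 0.01      # (r x d)

delta_W = B @ A                       # (d x d) update, parameterized by only 2*d*r values
print(delta_W.shape)                              # (512, 512)
print(np.linalg.matrix_rank(delta_W) <= r)        # True: the update has rank at most r
# Full fine-tuning would train d*d ≈ 262K params here; LoRA trains 2*d*r ≈ 8K.
```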

Norms & Distance Metrics

L1 vs L2 Norms

  • L1 (Manhattan): ||x||₁ = Σ|xᵢ| → Sparse solutions, robust to outliers
  • L2 (Euclidean): ||x||₂ = √(Σxᵢ²) → Smooth gradients, penalizes large weights more

Cosine Similarity (Critical for embeddings)

Formula: cos(θ) = (A · B) / (||A|| ||B||)

  • Range: [-1, 1]
  • Why cosine not Euclidean for embeddings? → Scale-invariant, captures angle not magnitude
  • Interview trap: "When would cosine similarity fail?" → When magnitude matters (e.g., word frequency in TF-IDF)

1.2 Calculus & Optimization

Chain Rule - The Backpropagation Foundation

Formula: ∂L/∂w = ∂L/∂y × ∂y/∂z × ∂z/∂w

Tricky bit - Matrix Calculus:

When differentiating through a linear layer z = Wx, track dimensions carefully. Writing δ = ∂L/∂z for the upstream gradient:

  • ∂L/∂W = δ x^T (outer product, same shape as W)
  • ∂L/∂x = W^T δ (the transpose routes the gradient back to the input)
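
A small NumPy sketch of those shapes (sizes illustrative):

```python
import numpy as np

d_in, d_out = 4, 3
W = np.random.randn(d_out, d_in)
x = np.random.randn(d_in)

z = W @ x                        # forward pass: (3,)
delta = np.random.randn(d_out)   # stand-in for the upstream gradient dL/dz

grad_W = np.outer(delta, x)      # (3, 4): same shape as W
grad_x = W.T @ delta             # (4,): same shape as x
print(grad_W.shape, grad_x.shape)
```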

Gradient Descent Variants

  • SGD: w -= lr × ∇w. Noisy but explores well; use for small datasets or when exploration helps.
  • Momentum: v = βv + ∇w, then w -= lr × v. Accelerates in consistent directions; use when gradients have high variance.
  • RMSProp: s = βs + (1-β)∇w², then w -= lr × ∇w / √(s + ε). Adapts the learning rate per parameter; use for non-stationary objectives.
  • Adam: combines momentum and RMSProp. Fast convergence; the default choice for transformers.
  • AdamW: Adam with decoupled weight decay. Better regularization; standard for LLM training.

Tricky Interview Question: "Why use AdamW over Adam?"
→ In Adam, L2 regularization is folded into the gradient, so the decay term gets divided by the adaptive scaling (√v̂) and weights with a large gradient history are barely decayed. AdamW decouples it: w ← w − lr × m̂/(√v̂ + ε) − lr × λ × w, applying the decay directly to the weights, outside the adaptive scaling.
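
A simplified sketch of the decoupled update (single parameter tensor, bias correction omitted, hyperparameters illustrative):

```python
import numpy as np

def adamw_step(w, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """One simplified AdamW step: the decay is applied directly to w, outside
    the adaptive scaling (bias correction omitted for brevity)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    w = w - lr * m / (np.sqrt(v) + eps)   # adaptive update
    w = w - lr * wd * w                   # decoupled weight decay
    return w, m, v

# Adam-with-L2 would instead fold wd*w into `grad` before the adaptive scaling,
# so the decay term gets divided by sqrt(v) and loses much of its effect.
w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
w, m, v = adamw_step(w, np.array([0.1, -0.2, 0.3]), m, v)
print(w)
```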

1.3 Probability & Statistics

Central Limit Theorem (CLT) - Foundation of Modern ML

Statement: The distribution of sample means approximates a normal distribution as sample size increases, regardless of the population's distribution shape.

Mathematical Form: X̄ ~ N(μ, σ²/n) as n → ∞

Why It Matters in ML:

  1. Batch Training Stability: Averaging gradients over a batch of n samples cuts the gradient's variance by a factor of n (its standard deviation by √n). This is why larger batches lead to more stable updates.
  2. Why SGD Converges: Noisy gradient estimates from mini-batches approximate true gradients via CLT, enabling convergence guarantees.
  3. Batch Normalization: Assumes activations follow approximately normal distribution per batch, which CLT supports for large batches.
  4. Confidence Intervals: Error bars on model performance metrics rely on CLT for validity.

Tricky Interview Question: "Why do larger batches in SGD lead to worse generalization despite more stable gradients?"
→ Large batches converge to sharp minima (high curvature) which generalize poorly. Small batches' noise helps escape sharp minima and find flat minima (low curvature) with better test performance. This is the generalization gap phenomenon.

Key Statistical Concepts

Maximum Likelihood Estimation (MLE)

θ* = argmax_θ Π p(xᵢ|θ) = argmax_θ Σ log p(xᵢ|θ)

  • Why log? Converts products to sums (numerical stability + easier gradients)
  • Connection to loss: Minimizing cross-entropy = MLE for categorical distribution

Bias-Variance Decomposition

E[(y - ŷ)²] = Bias² + Variance + Irreducible Error

  • High bias: Underfitting (too simple model)
  • High variance: Overfitting (too complex, memorizes noise)
  • Sweet spot: Balance through regularization, model capacity

1.4 Information Theory

Entropy (Measure of uncertainty)

H(X) = -Σ p(x) log p(x)

  • High entropy = high uncertainty (uniform distribution)
  • Low entropy = low uncertainty (peaked distribution)
  • In ML: We want models with low entropy predictions (confident)

Cross-Entropy (Distance between distributions)

H(P, Q) = -Σ p(x) log q(x)

  • P: true distribution, Q: predicted distribution
  • Cross-entropy loss: Minimizing this = matching distributions
  • Binary: -[y log(ŷ) + (1-y)log(1-ŷ)]
  • Multi-class: -Σ yᵢ log(ŷᵢ) (categorical cross-entropy)

KL Divergence (How different are two distributions?)

D_KL(P || Q) = Σ p(x) log(p(x)/q(x)) = H(P,Q) - H(P)

  • Properties: Always ≥ 0, asymmetric (P||Q ≠ Q||P)
  • In VAE: Regularization term KL(q(z|x) || p(z)) keeps latent space structured
  • In RLHF: KL penalty keeps model close to reference policy
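
All three quantities are one line each to compute; a small NumPy sketch with toy distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution P
q = np.array([0.5, 0.3, 0.2])   # predicted distribution Q

entropy = -np.sum(p * np.log(p))          # H(P)
cross_entropy = -np.sum(p * np.log(q))    # H(P, Q)
kl = np.sum(p * np.log(p / q))            # D_KL(P || Q)

print(entropy, cross_entropy, kl)
print(np.isclose(kl, cross_entropy - entropy))   # True: D_KL(P||Q) = H(P,Q) - H(P)
```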

Tricky Interview Question: "Why is cross-entropy preferred over MSE for classification?"
→ With a sigmoid/softmax output, cross-entropy gives a gradient of (ŷ - y) with respect to the logits, while MSE gives (ŷ - y) × ŷ(1 - ŷ), which vanishes when the output saturates (ŷ near 0 or 1), exactly where a confidently wrong model needs the strongest learning signal. Cross-entropy is also the MLE objective for categorical outputs.

1.5 Statistical Inference & Hypothesis Testing

Why This Matters: A/B testing, model comparison, and determining if performance improvements are real or noise all rely on hypothesis testing. Production ML decisions need statistical rigor.

Hypothesis Testing Framework

The Setup:

  • Null Hypothesis (H₀): The "boring" hypothesis - no effect, no difference
  • Alternative Hypothesis (H₁): The claim you're trying to prove
  • Significance Level (α): Threshold for rejecting H₀ (typically 0.05 or 0.01)
  • p-value: Probability of observing this extreme data if H₀ is true

Decision Rule: If p-value < α, reject H₀ (statistically significant result)

Type I and Type II Errors

Type I Error (False Positive, α):

  • Definition: Rejecting H₀ when it's actually true
  • In ML: Saying model B is better when it's not
  • Real-world cost: Wasted resources deploying inferior model
  • Control: Set lower α (0.01 instead of 0.05) for critical decisions

Type II Error (False Negative, β):

  • Definition: Failing to reject H₀ when H₁ is true
  • In ML: Missing a real improvement
  • Real-world cost: Leaving better model undiscovered
  • Control: Increase sample size, increase α (trade-off!)

Statistical Power (1 - β):

  • Definition: Probability of correctly rejecting H₀ when H₁ is true
  • Target: Power ≥ 0.80 (80% chance of detecting true effect)
  • Affected by: Sample size, effect size, α, test type

Tricky Interview Question: "Your model shows 2% accuracy improvement. Is it significant?"
→ Depends on: (1) Sample size - is it 100 examples or 10,000? (2) Variance - consistent or noisy? (3) Business context - is 2% valuable? Run a paired t-test on validation predictions, compute confidence interval, consider practical significance vs statistical significance.
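
One way to run that check with SciPy, assuming you have per-example 0/1 correctness for both models on the same validation set (the arrays below are synthetic placeholders):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
model_a = rng.binomial(1, 0.80, size=2000)   # placeholder correctness indicators, baseline
model_b = rng.binomial(1, 0.82, size=2000)   # same validation examples, new model

t_stat, p_value = stats.ttest_rel(model_b, model_a)          # paired t-test
diff = model_b.mean() - model_a.mean()
se = (model_b - model_a).std(ddof=1) / np.sqrt(len(model_a))
ci_95 = (diff - 1.96 * se, diff + 1.96 * se)                  # normal-approximation CI

print(f"accuracy diff = {diff:.3f}, p = {p_value:.3f}, 95% CI = {ci_95}")
```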

A/B Testing in ML

Setup: Compare model A (baseline) vs model B (new model)

Randomization: Crucial for causal inference

  • Random assignment of users/requests to A or B
  • Eliminates confounding variables
  • Enables causal claims ("B caused the improvement")

Multiple Testing Problem:

  • Running 20 tests with α=0.05 → Expect 1 false positive by chance
  • Solution: Bonferroni correction (α_adjusted = α/k where k = # tests)
  • Better: False Discovery Rate (FDR) control for many tests

Tricky Interview Question: "Your A/B test shows model B is 3% better after 3 days. Should you deploy?"
→ No! (1) Too short - haven't captured weekly patterns, (2) Possible novelty effect, (3) Statistical power may be insufficient, (4) Need to verify across different user segments, (5) Check if improvement is consistent across days or just a lucky spike.

2. Foundational Machine Learning

2.1 Core Concepts Review

  • Underfitting: high train and test error. Fix: increase model capacity, add features, train longer.
  • Overfitting: low train error, high test error. Fix: regularization, more data, early stopping, dropout.
  • Just right: low train and test error with a small gap. You're good; monitor for data drift.

2.2 Regularization Techniques

L1 (Lasso) Regularization: Loss + λΣ|wᵢ|

  • Encourages sparsity (many weights → 0)
  • Feature selection built-in
  • Non-differentiable at 0 (use subgradient)

L2 (Ridge) Regularization: Loss + λΣwᵢ²

  • Encourages small weights (weight decay)
  • Smoother than L1, all weights shrink
  • Equivalent to Gaussian prior in Bayesian view

Dropout

  • Randomly zero out activations during training
  • Intuition: Ensemble of subnetworks
  • Inference: Scale activations by keep_prob (or use inverted dropout)
  • Tricky bit: Acts like an adaptive regularizer and prevents co-adaptation; since drops are random, no neuron can rely on specific other neurons being present

2.3 Model Evaluation

Classification Metrics

Precision vs Recall Trade-off

  • Precision: TP/(TP+FP) - "Of predicted positives, how many are correct?"
  • Recall: TP/(TP+FN) - "Of actual positives, how many did we catch?"
  • F1 Score: 2 × (Precision × Recall)/(Precision + Recall) - harmonic mean

When to optimize what?

  • Spam detection: High precision (don't block good emails)
  • Cancer screening: High recall (catch all cases)
  • Search: Precision@K for top results

ROC-AUC vs PR-AUC

  • ROC-AUC: Good for balanced datasets, plots TPR vs FPR
  • PR-AUC: Better for imbalanced datasets, focuses on positive class
  • Tricky Interview Question: "Why PR-AUC for imbalanced data?" → ROC can look good even with poor minority class performance due to high TN count
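
A quick way to see the gap with scikit-learn on a synthetic, heavily imbalanced dataset (results vary run to run):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# roughly 1% positive class, with some label noise
X, y = make_classification(n_samples=20000, weights=[0.99], flip_y=0.02, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

print("ROC-AUC:", roc_auc_score(y_te, scores))            # often looks comfortably high
print("PR-AUC :", average_precision_score(y_te, scores))  # usually far lower and more honest
```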

3. Deep Learning Fundamentals

3.1 Neural Network Basics

Forward Pass

The forward pass involves two steps:

  1. Linear transformation: z = Wx + b
  2. Activation function: a = σ(z)

Activation Functions

  • Sigmoid: 1/(1+e^(-x)), range (0,1). Smooth and probabilistic, but saturates and slows gradients. Use for binary classification output layers.
  • ReLU: max(0,x), range [0,∞). Fast, no saturation for x>0, but dead neurons for x<0. Default choice for hidden layers.
  • GELU: x×Φ(x), range (-∞,∞). Smooth and stochastic, but slower to compute. Used in transformers (GPT, BERT).

Why GELU in transformers? It is smooth and non-monotonic, and it weights each input by the probability of keeping it (x × Φ(x)), so small negative values pass through partially instead of being hard-zeroed; empirically it works better for NLP.


3.2 Initialization

Why initialization matters: Poor init → vanishing/exploding gradients before training even starts

Xavier/Glorot Initialization

w ~ U(-√(6/(n_in + n_out)), √(6/(n_in + n_out)))

  • For sigmoid/tanh activations
  • Keeps variance similar across layers

He Initialization (Kaiming)

w ~ N(0, 2/n_in) for ReLU activations

  • Accounts for ReLU zeroing half the neurons
  • Default for modern architectures
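
A hand-rolled sketch of both schemes (layer sizes illustrative):

```python
import numpy as np

def xavier_uniform(n_in, n_out):
    limit = np.sqrt(6.0 / (n_in + n_out))
    return np.random.uniform(-limit, limit, size=(n_out, n_in))

def he_normal(n_in, n_out):
    return np.random.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

W_tanh = xavier_uniform(256, 256)   # pair with sigmoid/tanh layers
W_relu = he_normal(256, 256)        # pair with ReLU layers
print(W_tanh.std(), W_relu.std())   # He std is larger to compensate for ReLU zeroing half the units
```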

3.3 Normalization Techniques

Batch Normalization

y = γ((x - μ_batch)/σ_batch) + β

  • Normalizes across batch dimension
  • Pros: Faster training, acts as regularizer, less sensitive to init
  • Cons: Batch size dependent, different behavior train/test, breaks for seq2seq

Layer Normalization (Used in Transformers)

y = γ((x - μ_layer)/σ_layer) + β

  • Normalizes across feature dimension (per sample)
  • Pros: Batch-independent, works for any sequence length, stable for RNNs/Transformers
  • Cons: Slightly slower than BatchNorm for CNNs

RMS Normalization (Root Mean Square)

y = x / RMS(x) × γ where RMS(x) = √(mean(x²))

  • Removes mean subtraction (faster, simpler)
  • Used in modern LLMs (LLaMA, Mistral, Gemma)
  • Why? Empirically works as well, 10-20% faster
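
A minimal NumPy sketch of both normalizations over the feature dimension (the learnable γ and β are omitted for clarity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)       # then scale by gamma, shift by beta

def rms_norm(x, eps=1e-5):
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms                             # then scale by gamma; no mean subtraction, no beta

x = np.random.randn(2, 8) * 3 + 1              # (batch, features)
print(layer_norm(x).mean(axis=-1))                    # ~0 per sample
print(np.sqrt((rms_norm(x) ** 2).mean(axis=-1)))      # ~1 per sample (unit RMS, mean not forced to 0)
```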

Tricky Interview Question: "Why LayerNorm in transformers not BatchNorm?"
→ Transformers process variable-length sequences, BatchNorm would require padding/masking complexities. LayerNorm works per-sample so length-agnostic. Also, small batch sizes (memory constraints with long sequences) make BatchNorm statistics noisy.

4. Classical Architectures - RNNs & LSTMs

4.1 Recurrent Neural Networks (RNNs)

The Basic Idea: Process sequences by maintaining hidden state

RNN Formulas:

  • Hidden state update: h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t + b_h)
  • Output: y_t = W_hy @ h_t + b_y

The Fatal Flaw: Vanishing/exploding gradients over time

Why RNNs failed at long sequences:

  • Tanh saturation → gradients < 1
  • Matrix W_hh multiplied T times
  • For T=100 and gradient 0.9 per step: 0.9^100 ≈ 0 (vanished!)

4.2 Long Short-Term Memory (LSTM)

The Solution: Gates that control information flow

The Gates and State Updates

  1. Forget Gate (what to discard from the cell state): f_t = σ(W_f @ [h_{t-1}, x_t] + b_f)
  2. Input Gate (what new info to store): i_t = σ(W_i @ [h_{t-1}, x_t] + b_i), with candidate C̃_t = tanh(W_C @ [h_{t-1}, x_t] + b_C)
  3. Cell State Update: C_t = f_t * C_{t-1} + i_t * C̃_t
  4. Output Gate (what to expose as the hidden state): o_t = σ(W_o @ [h_{t-1}, x_t] + b_o), with h_t = o_t * tanh(C_t)

Why LSTMs work better:

  • Additive updates: C_t = f_t * C_{t-1} + ... (not multiplicative like RNNs)
  • Gradient highway: Gradients flow through cell state with fewer transformations
  • Selective memory: Gates learn what to remember/forget

4.3 Why Transformers Killed RNNs/LSTMs

Sequential Processing Problem

  • LSTMs must process token-by-token sequentially
  • Can't parallelize across sequence (unlike transformers)
  • For sequence length T, need T sequential steps

The Death Blow: "Attention Is All You Need" (2017)

  • Showed transformers outperform RNN/LSTM seq2seq models on machine translation benchmarks
  • 10x faster training on modern hardware
  • Better at long-range dependencies
  • End of the RNN era for NLP

When to still use LSTMs:

  • Streaming applications (process token-by-token in real-time)
  • Very long sequences where O(n²) attention is prohibitive
  • Limited hardware (mobile deployment)
  • Time-series forecasting where sequential structure helps

5. Transformer Architecture - Deep Dive

THE critical interview section. Master: Self-attention math (Q@K^T/√d_k, then softmax, then @V), multi-head parallelization (8 heads learn different patterns), positional encoding (sinusoidal adds sequence order), layer norm placement (pre-norm vs post-norm), and why it works (parallel processing, O(1) path between any tokens).

This is the most critical section. Transformers are THE architecture for modern LLMs.

5.1 The Core Innovation: Attention Mechanism

The Problem Transformers Solve

  • RNNs compress entire history into fixed-size hidden state → information bottleneck
  • Need direct access to all previous tokens for context

Self-Attention Intuition

For each token, compute how much to "attend to" every other token in the sequence.

Example: "The cat sat on the mat because it was tired"

  • "it" should attend strongly to "cat" (resolved reference)
  • "tired" should attend to "sat" (action-state relationship)
🔍 Self-Attention Mechanism Flow

Visual Flow:

Input X (n × d_model) → Linear Projections → Q, K, V

Q @ K^T → (n × n) attention scores → / √d_k (scale)

Softmax → attention weights (sum to 1 per row)

@ V → weighted combination of values → Output (n × d_v)

Example: Token "it" looks at all tokens via Q@K^T, softmax weights highest for "cat", outputs V-weighted mix

Self-attention allows each token to attend to all other tokens in O(1) steps

5.2 Attention Mathematics (Step-by-Step)

Input: Sequence of embeddings X = [x₁, x₂, ..., x_n] where each xᵢ ∈ ℝ^d_model

Step 1: Create Queries, Keys, Values

Linear projections create Q, K, V matrices:

  • Q = X @ W_Q with shape (n × d_model) @ (d_model × d_k) = (n × d_k)
  • K = X @ W_K with shape (n × d_model) @ (d_model × d_k) = (n × d_k)
  • V = X @ W_V with shape (n × d_model) @ (d_model × d_v) = (n × d_v)

Intuition:

  • Query: "What am I looking for?"
  • Key: "What do I offer?"
  • Value: "What do I actually communicate?"

Step 2: Compute Attention Scores

scores = Q @ K.T / √d_k

  • Q @ K.T computes similarity between all pairs
  • Division by √d_k prevents softmax saturation (numerical stability)
  • Why scale? Dot products grow with dimension, pushing softmax into saturation regions

Tricky Interview Question: "Why divide by √d_k not d_k?"
→ For query and key vectors with i.i.d. zero-mean, unit-variance components, their dot product has variance d_k (standard deviation √d_k). Dividing by √d_k restores roughly unit variance, keeping pre-softmax logits in a range where softmax is not saturated; dividing by d_k would shrink them too far and push attention toward uniform.

Step 3: Apply Softmax (Normalize)

attention_weights = softmax(scores, dim=-1)

Each row sums to 1 → weighted average over values

Step 4: Weighted Sum of Values

output = attention_weights @ V

Complete Scaled Dot-Product Attention Formula:

Attention(Q, K, V) = softmax(QK^T / √d_k) V
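
Pulling the four steps into one compact NumPy reference sketch (single head, no masking, shapes illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n, n) similarity matrix
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights                # (n, d_v) output

n, d_model, d_k, d_v = 6, 32, 16, 16
X = np.random.randn(n, d_model)
W_Q, W_K, W_V = (np.random.randn(d_model, d) / np.sqrt(d_model) for d in (d_k, d_k, d_v))

out, attn = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape, attn.shape)                   # (6, 16) (6, 6)
print(np.allclose(attn.sum(axis=-1), 1.0))     # True
```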

5.3 Multi-Head Attention

Why Multiple Heads?

Single attention might focus on one relationship type. Multiple heads learn different patterns:

  • Head 1: Subject-verb agreement
  • Head 2: Coreference resolution
  • Head 3: Positional proximity
  • Head 4: Semantic similarity

Implementation Algorithm:

  1. Split d_model across heads: d_k = d_model // num_heads
  2. Create projections for each head: Each head has separate W_Q_i, W_K_i, W_V_i matrices
  3. Parallel attention: Compute attention independently for each head
  4. Concatenate heads: Combine all head outputs along feature dimension (n × d_model)
  5. Final projection: Apply W_O to get final output
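
A sketch of those five steps in NumPy (shapes illustrative; real implementations batch this and add masking and dropout):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    """X: (n, d_model); projection matrices: (d_model, d_model)."""
    n, d_model = X.shape
    d_k = d_model // num_heads

    def split_heads(M):                                 # (n, d_model) -> (heads, n, d_k)
        return M.reshape(n, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)    # (heads, n, n), computed in parallel
    heads = softmax(scores, axis=-1) @ V                # (heads, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # concatenate along the feature dim
    return concat @ W_O                                 # final output projection

n, d_model, num_heads = 6, 64, 8
X = np.random.randn(n, d_model)
W_Q, W_K, W_V, W_O = (np.random.randn(d_model, d_model) / np.sqrt(d_model) for _ in range(4))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads).shape)   # (6, 64)
```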

Tricky Interview Question: "Why concatenate heads instead of averaging?"
→ Averaging loses information - all heads would need to agree. Concatenation preserves each head's unique perspective, then W_O learns how to combine them optimally.

5.4 Positional Encoding

The Problem: Attention has no notion of order! "Cat sat mat" = "Mat sat cat" without position info.

Solution: Add positional information to input embeddings

Sinusoidal Positional Encoding (Original transformer)

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Why this weird formula?

  • Different frequencies for different dimensions
  • Smooth, continuous representation
  • Can extrapolate to longer sequences than seen during training
  • Relative positions encoded: PE(pos+k) can be expressed as linear function of PE(pos)
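
A direct NumPy implementation of the formulas above (sizes illustrative); the resulting matrix is simply added to the token embeddings:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)     # different frequency per dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=64)
print(pe.shape)    # (128, 64): add row `pos` to the embedding of the token at position `pos`
```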

Rotary Position Embeddings (RoPE) (Modern LLMs like LLaMA)

Algorithm:

  1. Rotate query vectors based on position: Q_rotated = rotate(Q, pos)
  2. Rotate key vectors based on position: K_rotated = rotate(K, pos)
  3. Compute attention: attention_scores = Q_rotated @ K_rotated.T

Why RoPE is better:

  • Encodes relative positions directly in attention computation
  • Better extrapolation to longer sequences
  • More efficient than adding positional encodings

5.5 Transformer Block Architecture

Complete Transformer Encoder Block:

  1. Multi-head self-attention layer: MultiHeadAttention(d_model, num_heads)
  2. First layer normalization: LayerNorm(d_model)
  3. Feed-forward network: FeedForward(d_model, d_ff)
  4. Second layer normalization: LayerNorm(d_model)
  5. Dropout for regularization

Feed-Forward Network (Position-wise)

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂

  • Often uses GELU instead of ReLU in modern transformers
  • Applied independently to each position
  • Typically: d_ff = 4 × d_model (expansion then projection)
  • Why needed? Attention is linear (weighted average), FFN adds non-linearity

Residual Connections (Skip Connections)

x = x + SubLayer(x)

  • Why critical? Enables gradient flow through many layers (like ResNet)
  • Without residuals, deep transformers (>12 layers) very hard to train
  • Allows identity mapping (model can learn to skip layers if not needed)

Layer Normalization Placement: Pre-norm vs Post-norm

Post-norm (Original transformer):

  • x = LayerNorm(x + Attention(x))
  • x = LayerNorm(x + FFN(x))

Pre-norm (Modern practice):

  • x = x + Attention(LayerNorm(x))
  • x = x + FFN(LayerNorm(x))

Why Pre-norm became standard?

  • More stable training for deep transformers (>24 layers)
  • Gradients flow better through residual path
  • Allows training without learning rate warmup (though warmup still helps)
  • Used in GPT-2, GPT-3, and LLaMA (the original Transformer and BERT used post-norm)
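
A compact NumPy sketch tying the block together in pre-norm form (single attention head, ReLU FFN, no dropout, random weights purely for illustration):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def self_attention(x, W_Q, W_K, W_V):
    d_k = W_Q.shape[1]
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def ffn(x, W1, W2):
    return np.maximum(0, x @ W1) @ W2        # ReLU here; modern models often use GELU or SwiGLU

def pre_norm_block(x, p):
    x = x + self_attention(layer_norm(x), p["W_Q"], p["W_K"], p["W_V"])   # x = x + Attn(LN(x))
    x = x + ffn(layer_norm(x), p["W1"], p["W2"])                          # x = x + FFN(LN(x))
    return x

n, d_model, d_ff = 6, 32, 128                # d_ff = 4 x d_model
rng = np.random.default_rng(0)
p = {name: rng.normal(size=shape) / np.sqrt(shape[0]) for name, shape in [
    ("W_Q", (d_model, d_model)), ("W_K", (d_model, d_model)), ("W_V", (d_model, d_model)),
    ("W1", (d_model, d_ff)), ("W2", (d_ff, d_model))]}
print(pre_norm_block(rng.normal(size=(n, d_model)), p).shape)   # (6, 32)
```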

5.6 Encoder vs Decoder Architecture

Encoder (BERT-style):

  • Bidirectional attention (each token sees all tokens)
  • Used for understanding tasks: classification, NER, Q&A

Decoder (GPT-style):

  • Causal/masked attention (token i can only see tokens ≤ i)
  • Used for generation tasks: language modeling, completion
  • Causal masking: Set attention scores to -∞ for future positions before softmax
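
The causal mask is a one-liner in practice; a NumPy sketch with a random score matrix:

```python
import numpy as np

n = 5
scores = np.random.randn(n, n)                       # stand-in for Q @ K^T / sqrt(d_k)
mask = np.triu(np.ones((n, n), dtype=bool), k=1)     # True above the diagonal = future positions
scores = np.where(mask, -np.inf, scores)             # block attention to the future

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.triu(weights, k=1).sum())                   # 0.0: no probability mass on future tokens
```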

When to use what?

  • Encoder-only: Classification, tagging, embeddings (BERT, RoBERTa)
  • Decoder-only: Generation, completion, few-shot (GPT series)
  • Encoder-Decoder: Translation, summarization, structured generation (T5, BART)

6. Modern LLM Architectures

6.1 BERT (Bidirectional Encoder Representations from Transformers)

Architecture: Encoder-only transformer

Key Innovations:

1. Masked Language Modeling (MLM)

Example:

  • Input: "The cat [MASK] on the mat"
  • Task: Predict [MASK] = "sat"
  • Randomly mask 15% of tokens
  • Of those: 80% → [MASK], 10% → random token, 10% → unchanged
  • Forces bidirectional understanding

Variants:

  • RoBERTa: Better BERT (removed NSP, longer training, more data)
  • ALBERT: Parameter sharing across layers (much smaller)
  • DistilBERT: Distilled version (40% smaller, 97% performance)

6.2 GPT Series (Generative Pre-trained Transformer)

Architecture: Decoder-only transformer with causal attention

Evolution:

GPT-1 (2018):

  • 12 layers, 117M parameters
  • Showed pre-training + fine-tuning works

GPT-2 (2019):

  • 48 layers, 1.5B parameters
  • Zero-shot learning: performs tasks without fine-tuning

GPT-3 (2020):

  • 96 layers, 175B parameters
  • Few-shot in-context learning
  • Key insight: Large enough models can learn from examples in prompt

In-Context Learning (The GPT-3 Breakthrough)

Example Prompt:

  • Review: "This movie was great!" Sentiment: Positive
  • Review: "Terrible experience." Sentiment: Negative
  • Review: "Amazing plot twist!" Sentiment: [MODEL PREDICTS: Positive]

Key properties:

  • No gradient updates, just prompt engineering
  • Model learns task from examples in context window

6.3 LLaMA (Large Language Model Meta AI)

Key Improvements over GPT:

1. RMSNorm instead of LayerNorm

  • Simpler, faster (no mean subtraction)
  • Comparable performance

2. RoPE (Rotary Position Embeddings)

  • Better length extrapolation
  • More efficient than learned/sinusoidal

3. SwiGLU Activation (Swish-Gated Linear Unit)

SwiGLU(x) = Swish(xW) ⊙ (xV) where Swish(x) = x × sigmoid(x)

  • Better than ReLU/GELU for LLMs

4. Grouped-Query Attention (LLaMA 2)

  • Between multi-head and multi-query attention
  • Shares keys and values across groups of queries
  • Faster inference, minimal quality loss

LLaMA vs Others:

  • Open-source (weights available)
  • Trained on publicly available data (no private datasets)
  • Smaller models competitive with much larger closed models
  • LLaMA 13B ≈ GPT-3 175B on many tasks (better training)

6.4 Mixture of Experts (MoE)

Concept: Sparse model activation

  • Many expert networks, only use a few per token
  • Router network decides which experts to use

MoELayer Algorithm:

  1. Router decision: Router computes probabilities for each expert
  2. Select top-K experts: Choose the K experts with highest probabilities (typically K=2)
  3. Combine outputs: Weighted sum of selected expert outputs
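
A toy sketch of that routing for a single token, with each expert reduced to one linear map (shapes and K=2 are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, router_W, experts, k=2):
    """x: (d,) token representation; experts: list of (d, d) matrices standing in for expert FFNs."""
    probs = softmax(x @ router_W)                  # (num_experts,) routing probabilities
    top_k = np.argsort(probs)[-k:]                 # the K most likely experts
    gate = probs[top_k] / probs[top_k].sum()       # renormalize gates over the selected experts
    # Only the selected experts run; the remaining parameters stay idle for this token.
    return sum(g * (x @ experts[i]) for g, i in zip(gate, top_k))

d, num_experts = 16, 8
rng = np.random.default_rng(0)
experts = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(num_experts)]
router_W = rng.normal(size=(d, num_experts)) / np.sqrt(d)
print(moe_layer(rng.normal(size=d), router_W, experts).shape)   # (16,)
```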

Benefits:

  • Massive parameter count with manageable compute
  • Each expert can specialize (different languages, domains)
  • Only activate ~10-20% of parameters per forward pass

Examples:

  • Switch Transformer: 1.6T parameters, only activates 10B per token
  • GPT-4 (rumored): MoE with ~8 experts, ~200B params each

6.5 State Space Models (Mamba, S4)

The Problem with Transformers: O(n²) attention complexity

State Space Models: Alternative to attention

Key Properties:

  • Linear time complexity O(n) for long sequences
  • Can be parallelized for training (using convolution view)
  • Efficient autoregressive inference (recurrent view)

Mamba (Recent breakthrough):

  • Selective state spaces (context-dependent dynamics)
  • Matches transformer quality on language tasks
  • 5x faster inference for long sequences (>2K tokens)

When to use SSMs vs Transformers?

  • SSMs: Very long sequences (100K+ tokens), efficiency critical
  • Transformers: State of the art quality, standard for LLMs

7. RAG & Retrieval Systems

RAG = Retrieve relevant docs, augment prompt, generate with context. Key concepts: Dense retrieval (embeddings), hybrid search (semantic + keyword), chunking strategies, reranking, production challenges (latency, cost, cache hits). Interview focus: when to use RAG vs fine-tuning, semantic vs keyword trade-offs.

7.1 Why RAG?

Problem: LLMs have limited context and outdated knowledge

Solution: Retrieval-Augmented Generation

  1. Retrieve relevant documents from knowledge base
  2. Augment prompt with retrieved context
  3. Generate answer using both parametric knowledge (model weights) and non-parametric knowledge (retrieved docs)

Benefits:

  • Up-to-date information without retraining
  • Cite sources (explainability)
  • Reduced hallucinations
  • Domain-specific knowledge injection

🔄 RAG Pipeline Architecture Flow

Query Processing:

User Query → Query Embedding (via embedding model)

Retrieval: Vector DB search → Top-k similar chunks (cosine similarity)

Reranking (optional): Cross-encoder reranks results for relevance

Augmentation: Inject retrieved docs into prompt template

Generation: LLM generates answer with context → Response with citations

End-to-end RAG: Chunking → Embedding → Indexing → Retrieval → Generation

7.2 Dense Retrieval

Old Way: BM25 (Sparse Retrieval)

  • Keyword matching with TF-IDF
  • Fast, interpretable, but misses semantic similarity
  • "How to train a dog" won't match "Canine obedience techniques"

New Way: Dense Retrieval with Embeddings

Bi-Encoder Architecture:

Offline phase (precompute once):

  • Encode all documents: doc_embeddings = encoder(documents) with shape (N × d)

Online phase (at query time):

  • Encode query: query_embedding = encoder(query) with shape (1 × d)
  • Compute scores: scores = cosine_similarity(query_embedding, doc_embeddings)
  • Retrieve top-K: top_k_docs = argsort(scores)[-k:]
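
A minimal sketch of both phases using sentence-transformers (the model name and corpus are illustrative; any bi-encoder works):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")     # illustrative bi-encoder choice

docs = [
    "Canine obedience techniques for beginners",
    "How to bake sourdough bread",
    "Transformer architectures for NLP",
]
doc_emb = model.encode(docs)                                        # offline: (N, d)
doc_emb = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)  # unit-normalize once

query_emb = model.encode(["how to train a dog"])                    # online: (1, d)
query_emb = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)

scores = (query_emb @ doc_emb.T)[0]                 # cosine similarity via dot product
top_k = np.argsort(scores)[::-1][:2]
print([(docs[i], round(float(scores[i]), 3)) for i in top_k])
# The dog-training query matches the "canine obedience" doc despite zero keyword overlap.
```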

Hard Negatives: Critical for good retrieval

  • Random negatives too easy
  • Mine hard negatives: high BM25 score but wrong answer
  • In-batch negatives: Use other queries' positives as your negatives

7.3 Embedding Models

Modern Embedding Models:

  • Sentence-BERT: BERT with siamese network
  • E5: Multilingual, instruction-aware embeddings
  • BGE: State-of-the-art for retrieval
  • OpenAI text-embedding-3: Commercial API
  • Jina Embeddings v3: 8K context, excellent for long docs

Matryoshka Embeddings:

  • Single model produces embeddings at multiple dimensions
  • 768 → 512 → 256 → 128 → 64
  • Truncate to smaller dim for speed/storage trade-off
  • Minimal quality loss for many tasks

7.4 Vector Databases & Approximate Nearest Neighbor (ANN)

Exact Search Problem: O(N) for N documents - too slow for millions of docs

ANN Algorithms:

1. HNSW (Hierarchical Navigable Small World)

  • Graph-based, navigates through layers
  • Very fast queries, high recall
  • Used by: Qdrant, Weaviate, Pinecone

2. IVF (Inverted File Index)

  • Cluster embeddings, search only relevant clusters
  • Memory efficient
  • Used by: FAISS

Vector DB Comparison:

  • Qdrant: best for production RAG; metadata filtering combined with vector search
  • Pinecone: best as a managed service; easiest to use
  • Weaviate: best for hybrid search; GraphQL API, BM25 + vector
  • FAISS: best for offline/research use; Facebook's highly optimized library

7.5 Hybrid Search

Combine sparse + dense retrieval:

Hybrid Search Algorithm:

  1. Get results from BM25: bm25_scores = bm25_search(query, docs)
  2. Get results from vector search: vector_scores = vector_search(query_emb, doc_embs)
  3. Combine using Reciprocal Rank Fusion (RRF)
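
RRF itself is only a few lines; a sketch using the commonly cited constant k=60 (the ranked lists are hypothetical):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked doc-id lists (best first), e.g. [bm25_ids, vector_ids]."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["doc3", "doc1", "doc7"]      # hypothetical keyword ranking
vector_results = ["doc1", "doc4", "doc3"]    # hypothetical semantic ranking
print(reciprocal_rank_fusion([bm25_results, vector_results]))
# doc1 and doc3 rise to the top because both retrievers rank them highly.
```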

Why hybrid?

  • Dense: Semantic similarity, synonyms, paraphrases
  • Sparse: Exact keyword matches, rare terms, names
  • Together: Best of both worlds

7.6 Advanced RAG Techniques

Re-ranking:

Two-stage retrieval:

  1. Fast retrieval: Get top 100 candidates using vector search
  2. Slow but accurate re-ranking: Use cross-encoder to score all (query, candidate) pairs
  3. Select final top-K: Pick top 5 based on re-ranker scores

HyDE (Hypothetical Document Embeddings):

Algorithm:

  1. Generate hypothetical answer using LLM: hypo_doc = llm("Answer this question: " + query)
  2. Use hypothetical answer for retrieval: docs = vector_search(embed(hypo_doc))
  • Why it works: Bridges query-document gap

Query Rewriting:

Example:

  • Original query: "it" (ambiguous)
  • Rewritten with context: "GPT-4 architecture details"

7.7 Production RAG Pipeline

Complete Pipeline:

Configuration:

  • chunk_size: 512 tokens per chunk
  • chunk_overlap: 50 tokens
  • retrieval_top_k: 20 candidates
  • rerank_top_k: 5 final docs

Ingestion Pipeline:

  1. Chunk documents: Break documents into overlapping chunks (512 tokens, 50 overlap)
  2. Generate embeddings: Embed all chunks using embedding model
  3. Store in vector DB: Upsert embeddings with metadata
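
A minimal chunker matching that configuration, using whitespace tokens as a stand-in for a real tokenizer:

```python
def chunk_document(text, chunk_size=512, overlap=50):
    """Split text into overlapping chunks; whitespace 'tokens' stand in for tokenizer tokens."""
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks

doc = "word " * 1200                        # toy 1200-token document
chunks = chunk_document(doc)
print([len(c.split()) for c in chunks])     # [512, 512, 276], each overlapping the previous by 50
```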

Retrieval Pipeline:

  1. Embed query: Convert query to embedding vector
  2. Hybrid search: Vector + BM25, combine using reciprocal rank fusion
  3. Re-rank: Use cross-encoder to re-rank top 20, select top 5
  4. Return: Final top-5 documents with context

Tricky Interview Question: "How do you handle outdated information in RAG?"
→ Timestamp metadata + periodic re-ingestion. Filter results by recency. Implement cache invalidation when documents update. For time-sensitive queries, boost recent documents in ranking.

8. Memory Architectures

8.1 The Memory Problem

Challenge: How do models remember information across interactions?

Context Window Limitations:

  • GPT-4: 128K tokens (~300 pages) - expensive, slow
  • Most models: 4K-32K tokens
  • What about conversations over days/weeks/months?

8.2 Modern Conversational Memory

Conversational AI Memory Hierarchy:

1. Short-term (Context Window)

  • Last N tokens in conversation
  • Directly in model context
  • Fast, but limited capacity

2. Medium-term (Session Memory)

  • Summary of current conversation
  • Vector DB with session embeddings
  • Retrieve relevant parts when context full

3. Long-term (Episodic Memory)

  • Past conversations, user preferences
  • Graph DB (entities, relationships, events)
  • Retrieve when semantically relevant

Practical Architecture:

add_message Algorithm:

  1. Add to short-term: Append message to recent messages list
  2. Check capacity: If short_term > MAX_CONTEXT:
    • Summarize oldest 10 messages using LLM
    • Store summary in vector DB (medium-term)
    • Remove oldest 10 from short-term
  3. Extract facts: Parse message for entities/relationships
  4. Store in graph: Add facts to graph database (long-term)

retrieve_context Algorithm:

  1. Get short-term: All recent messages
  2. Search medium-term: Vector search for relevant summaries (top-5)
  3. Query long-term: Graph traversal for related facts
  4. Combine and format: Merge all three sources into coherent context
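
A skeletal sketch of that tiered design, with in-memory stand-ins for Redis, the vector DB, and the graph DB, and with the LLM summarizer and embedding model passed in as functions (all names here are hypothetical):

```python
from collections import deque

MAX_CONTEXT = 20   # illustrative short-term capacity, in messages

class MemoryManager:
    def __init__(self, summarize_fn, embed_fn):
        self.short_term = deque()      # recent messages (stands in for the context window / Redis)
        self.medium_term = []          # (embedding, summary) pairs (stands in for a vector DB)
        self.long_term = []            # (subject, relation, object) facts (stands in for a graph DB)
        self.summarize = summarize_fn  # e.g. an LLM call
        self.embed = embed_fn          # e.g. an embedding-model call

    def add_message(self, message, facts=()):
        self.short_term.append(message)
        self.long_term.extend(facts)                      # structured facts extracted upstream
        if len(self.short_term) > MAX_CONTEXT:
            oldest = [self.short_term.popleft() for _ in range(10)]
            summary = self.summarize(oldest)
            self.medium_term.append((self.embed(summary), summary))

    def retrieve_context(self, query, top_k=5):
        q = self.embed(query)
        by_similarity = sorted(self.medium_term,
                               key=lambda item: -sum(a * b for a, b in zip(item[0], q)))
        summaries = [s for _, s in by_similarity[:top_k]]
        facts = [f for f in self.long_term if any(part in query for part in f)]
        return list(self.short_term), summaries, facts

mm = MemoryManager(summarize_fn=lambda msgs: " | ".join(msgs)[:100],
                   embed_fn=lambda text: [float(len(text)), float(text.count("a"))])
mm.add_message("User met John at Starbucks", facts=[("User", "MET", "John")])
print(mm.retrieve_context("John"))
```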

8.3 Knowledge Graphs for Memory

Graph Structure:

  • Nodes: Entities (people, places, concepts)
  • Edges: Relationships (knows, works_at, discussed_on)
  • Properties: Attributes (age, location, sentiment)

Example:

User talked about meeting John at coffee shop:

  • (User)-[:MENTIONED]->(John:Person)
  • (John)-[:MET_AT {date: "2024-01-15"}]->(Starbucks:Place)

Temporal Graphs:

  • Relationships have timestamps
  • Query: "Who did I meet last week?"
  • Decay old information (importance ∝ recency)

Graph + Vector Hybrid:

  1. Vector search for semantic similarity
  2. Graph traversal for structured relationships
  3. Combine results

Tricky Interview Question: "How do you handle conflicting information in long-term memory?"
→ Temporal priority (newer > older), confidence scores, user feedback loop. Keep version history with timestamps. Flag conflicts for user resolution. Use graph structure to track belief updates over time.

9. Whiteboard Design Scenarios

9.1 Design a Conversational Memory System

Requirements:

  • Store unlimited conversation history
  • Fast retrieval (<100ms)
  • Understand context from weeks ago
  • Scale to millions of users
  • Privacy-first (user data isolated)

Components:

1. Storage Layers

  • Hot Storage: Redis (last 24h of conversation)
  • Warm Storage: Vector DB (last 30 days, embeddings)
  • Cold Storage: S3 + Graph DB (all history, structured facts)

2. Retrieval Strategy

retrieve_memory Algorithm:

  1. Fast path - Recent conversation: Fetch from Redis cache (last 24h)
  2. Semantic search - Relevant past: Vector search filtered by user_id for top-10
  3. Graph query - Entity-based: Graph traversal for relationships
  4. Merge and rank: Combine all results with recency decay weighting

3. Scaling Considerations

  • Sharding: By user_id (each shard handles subset of users)
  • Caching: Frequently accessed memories in L1 cache
  • Compression: Older conversations summarized, original stored in cold
  • Pruning: Remove low-importance memories (based on access frequency + age)

9.2 Production RAG Pipeline Design

Requirements:

  • 1M documents, 10K queries/sec
  • <200ms end-to-end latency
  • Accuracy > 90% (user satisfaction)
  • Handle document updates in real-time

Pipeline Stages:

1. Document Ingestion

  1. Extract text from PDF/HTML
  2. Clean and normalize
  3. Intelligent chunking (512 tokens, 50 overlap, sentence-aware)
  4. Generate embeddings (batch encode, batch_size=32)
  5. Store with metadata

2. Query Processing

  1. Check cache: Hash query + context, return if cached
  2. Parallel retrieval: Run vector search and BM25 concurrently
  3. Fusion: Apply reciprocal rank fusion, take top-20
  4. Re-rank: Use cross-encoder to re-rank, select top-5
  5. Generate: LLM generates answer using context
  6. Cache result: Store in Redis with 1-hour TTL

3. Latency Budget

  • Retrieval: 50ms (vector search)
  • Re-ranking: 30ms (cross-encoder on 20 docs)
  • Generation: 100ms (LLM with streaming)
  • Overhead: 20ms (network, serialization)
  • Total: ~200ms

4. Optimization Techniques

  • Caching: 70% cache hit rate → 70% queries <10ms
  • Approximate search: HNSW with ef=32 (vs brute force)
  • Quantization: 8-bit embeddings (768 dims: 3,072 bytes as float32 → 768 bytes)
  • Batching: Batch re-ranking for efficiency
  • Streaming: Start generating while re-ranking completes

9.3 Multi-Agent Coordination System

Requirements:

  • 5-10 specialized agents (sales, support, coding, research)
  • Route user queries to appropriate agents
  • Agents can call each other for help
  • Maintain conversation context across agents
  • Avoid loops and deadlocks

Components:

1. Router Agent (Master orchestrator)

Routing Algorithm:

  1. Classify intent: Determine query category and confidence
  2. High confidence (>0.9): Route to single specialized agent
  3. Low confidence: Consult all agents in parallel, ensemble results
  4. Return final answer

2. Inter-Agent Communication

Safe Agent Call Algorithm:

  1. Deadlock prevention: Check if call would create cycle in call graph
  2. Add edge: Record agent-to-agent call in graph
  3. Rate limiting: Ensure agent hasn't exceeded MAX_CALLS
  4. Execute: Call target agent's process method
  5. Return result
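
A sketch of the deadlock-prevention step as cycle detection on the call graph (data structures simplified; rate limiting and timeouts omitted):

```python
from collections import defaultdict

class CallGraph:
    def __init__(self):
        self.edges = defaultdict(set)    # caller -> callees recorded for the current query

    def _reaches(self, start, target):
        """Depth-first search: is there already a path from `start` to `target`?"""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(self.edges[node])
        return False

    def allow_call(self, caller, callee):
        # Refuse the call if the callee can already reach the caller (it would close a cycle).
        if self._reaches(callee, caller):
            return False
        self.edges[caller].add(callee)
        return True

graph = CallGraph()
print(graph.allow_call("router", "support"))   # True
print(graph.allow_call("support", "coding"))   # True
print(graph.allow_call("coding", "router"))    # False: router -> support -> coding -> router would loop
```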

3. Evaluation & Monitoring

Key Metrics:

  • routing_accuracy: % queries correctly routed to right agent
  • agent_success_rate: % queries successfully resolved
  • avg_agent_calls: Average # agents invoked per query
  • latency_by_agent: Latency breakdown per agent type

Tricky Interview Question: "How do you prevent agent loops?"
→ Track call graph in real-time. Before allowing agent A to call agent B, check if it creates cycle. Maintain maximum call depth limit. Use timeout for entire query processing. Log suspicious patterns (A→B→A) for review.

10. Paper Discussion Prep

10.1 "Attention Is All You Need" (Vaswani et al., 2017)

Key Contributions:

  1. Transformer architecture (encoder-decoder)
  2. Multi-head self-attention mechanism
  3. Positional encoding (sinusoidal)
  4. Completely replaced RNNs/LSTMs

Interview Questions:

Q: Why is self-attention O(n²) and why is that a problem?
A: Each token attends to all other tokens → n × n attention matrix. For long sequences (>10K tokens), this becomes memory-prohibitive (O(n²) space) and computationally expensive (O(n²d) time for computing attention scores).

Q: Why multi-head attention instead of single large attention?
A: Multiple heads learn different relationship patterns (syntax vs semantics vs position). Similar to multiple filters in CNNs. Empirically, 8-16 heads work better than one head with 8-16× dimensions.

Q: How would you modify transformers for longer sequences?
A: Sparse attention (Longformer, BigBird), Linformer (low-rank approximation), Reformer (LSH), Flash Attention (memory-efficient), Mixture of experts (sparse activation)

10.2 RAG Papers

"Dense Passage Retrieval" (Karpukhin et al., 2020)

Key Idea: Use dense embeddings for retrieval instead of BM25

Training:

  • Bi-encoder: Separate encoders for query and passage
  • In-batch negatives: Other passages in batch are negatives
  • Hard negatives: High BM25 score but wrong answer

Interview Q: Why in-batch negatives?
A: Computationally efficient (no extra forward passes), provides diverse negatives, scales to large batch sizes. Limitation: If batch size small, negatives may be too easy.

"Retrieval-Augmented Generation" (Lewis et al., 2020)

Key Idea: Combine parametric (model weights) and non-parametric (retrieval) knowledge

Two Variants:

  1. RAG-Sequence: Retrieve once, use for entire generation
  2. RAG-Token: Retrieve for each generated token

10.3 Modern LLM Papers

"Language Models are Few-Shot Learners" (GPT-3, Brown et al., 2020)

Key Findings:

  1. Scale is all you need (175B parameters)
  2. In-context learning emerges at scale
  3. No fine-tuning needed for many tasks

Interview Q: Why does in-context learning emerge? What's happening?
A: During pre-training on internet text, model sees many examples of pattern completion, Q&A, etc. It learns meta-learning: "given these examples, continue the pattern." At sufficient scale, this generalizes to new tasks.

"LLaMA: Open and Efficient Foundation Language Models" (Touvron et al., 2023)

Key Contributions:

  • Smaller models competitive with larger ones (better training)
  • Open-source weights
  • Architectural improvements: RMSNorm, SwiGLU, RoPE

Interview Q: Why is LLaMA 13B competitive with GPT-3 175B?
A: Better data quality + curation, longer training (1.4T tokens vs 300B), architectural improvements, training tricks (better learning rate schedule, etc.). Shows that data quality >> model size for many tasks.

11. Interview Traps & Gotchas

11.1 Math Traps

"Explain backprop from scratch"

  • Trap: Glossing over chain rule details
  • Show: Matrix dimensions at each step, transpose operations

"When would you use L1 vs L2 regularization?"

  • Trap: "L1 for sparsity, L2 for smoothness" (too vague)
  • Better: L1 when you need feature selection (eliminates features entirely), L2 when all features matter but you want small weights. L1 gradient discontinuity at zero means SGD implementations need special handling.

11.2 Architecture Traps

"Why do transformers work better than LSTMs?"

  • Trap: Only mentioning parallelization
  • Complete answer: Parallelization + direct connections (better long-range) + better gradient flow + more efficient for modern hardware

"What happens if you don't use positional encoding?"

  • Trap: "Model won't know order"
  • Nuance: Model can still learn some positional info from content (e.g., "first" and "finally" are positional markers), but explicit encoding much more effective

11.3 Training Traps

"Your model isn't learning. What do you check?"

Systematic debugging:

  1. Overfit single batch (proves model capacity + no bugs)
  2. Check gradients (vanishing? exploding? NaN?)
  3. Learning rate (too high? too low? plot loss curve)
  4. Data (labels correct? normalized? shuffled?)
  5. Architecture (bottlenecks? activation choice?)
  6. Loss function (appropriate for task? numerically stable?)

"How do you choose hyperparameters?"

  • Trap: "Grid search"
  • Better: Start with known good defaults (from papers), use learning rate finder, random search > grid search, Bayesian optimization for expensive searches, monitor during training and adjust.

11.4 RAG Traps

"Why not just increase context window instead of using RAG?"

  • Costs: Larger context = much more expensive (quadratic in tokens)
  • Quality: Attention dilutes with more context ("lost in the middle" problem)
  • Freshness: Can't update knowledge without retraining
  • Attribution: RAG can cite sources

"How do you evaluate RAG quality?"

  • Trap: Only end-to-end accuracy
  • Decompose: Retrieval quality (recall@k, precision@k, MRR), relevance (are retrieved docs actually relevant?), generation quality (fluency, factuality, groundedness)

11.5 Production ML Traps

"How do you monitor ML models in production?"

  • Not just accuracy: Data drift, prediction drift, latency, error rates
  • Business metrics: User engagement, conversion, retention
  • Model-specific: Attention patterns, confidence scores, embedding drift

"Your ML model shows bias. What do you do?"

  • Understand source: Training data bias? Model architecture? Evaluation metric?
  • Measure: Define fairness metrics for your use case (demographic parity, equalized odds, etc.)
  • Mitigate: Re-sample training data, re-weight loss, adversarial debiasing, fairness constraints
  • Monitor: Continuous bias metrics in production, broken down by demographic groups

11.6 Behavioral/System Design Traps

"How would you explain transformers to a non-technical person?"

  • Use analogy: "Like reading a book where you can instantly flip to any related section, instead of reading page-by-page. The model learns which parts to focus on for any given question."
  • Avoid jargon: Don't say "attention mechanism" - say "focus on relevant information"

"How do you prioritize when building an ML system?"

  • Start with simplest baseline
  • Identify bottlenecks (data? model? engineering?)
  • Measure impact of improvements
  • 80/20 rule: Simple models often get 80% of the way there

Appendix: Quick Reference

Key Formulas

  • Attention: Attention(Q,K,V) = softmax(QK^T / √d_k)V
  • Cross-Entropy Loss: L = -Σ y_i log(ŷ_i)
  • Layer Norm: y = γ(x - μ)/σ + β

Typical Hyperparameters

Transformer (GPT-style):

  • Layers: 12-96
  • d_model: 768-12288
  • Heads: 12-96
  • d_ff: 3072-49152 (4× d_model)
  • Dropout: 0.1
  • Learning rate: 1e-4 to 6e-4
  • Warmup: 2000-10000 steps

RAG Pipeline:

  • Chunk size: 256-512 tokens
  • Overlap: 50-100 tokens
  • Top-k retrieval: 20-50
  • Top-k rerank: 3-5
  • Embedding dim: 768-1024

Final Tips for Interview Success

Before the interview:

  1. Re-read papers on company's core tech (RAG? Agents? Specific architecture?)
  2. Practice whiteboarding system designs out loud
  3. Prepare 2-3 deep technical stories from your experience
  4. Review recent ML news (new models, techniques)

During technical discussion:

  1. Think out loud (show reasoning process)
  2. Ask clarifying questions before diving in
  3. Start simple, then add complexity
  4. Discuss trade-offs explicitly
  5. Admit when you don't know (then reason through it)

Red flags to avoid:

  • Claiming to know everything
  • Not asking questions
  • Ignoring trade-offs ("this is always better")
  • Overcomplicating simple problems
  • Not testing your solution

Green flags to show:

  • Systematic problem-solving
  • Awareness of latest research
  • Production ML experience
  • Clear communication
  • Collaborative attitude

Remember: Founding engineer roles value pragmatism + depth. Show you can ship fast while understanding fundamentals deeply. Balance is key.