// AI Engineering Mastery · Java Edition · June 2026

Build production AI systems.
With current tools.

12 weeks. Claude 4 internals, Cursor 3.x agent mode, Adaptive Thinking, MCP 2025-06-18 spec, production RAG, multi-agent systems, Spring AI 1.1 + LangChain4j 1.x, RAGAS evals, and everything for 2026 AI engineering interviews.

12 weeks · 6 phases · 30+ modules · 60+ resources
✓ UPDATED JUNE 2026 — all content current
Phase 0 — Weeks 1–2

Internals: Transformers, Claude 4, Cursor 3.x, MCP 2025

The mental models that change how you build. Transformers, context windows, Cursor's 2026 feature set (Cloud Agents, BugBot, Memories), MCP 2025-06-18 spec, and the full Claude 4 model lineup.

2 weeks · 2h per day
WK 1
Transformers + Claude 4 Models
Attention · context · 1M window · model selection
WK 2
Cursor 3.x + MCP 2025 spec
Cloud Agents · BugBot · Memories · MCP 2025-06-18
00
How Transformers Work — The Parts That Matter
Days 1–2 · Attention, position effects, context windows
Core
DAY 1
Attention Mechanics & "Lost in the Middle"
  • Watch: 3Blue1Brown — But what is a GPT? — best visual explanation of attention
  • Watch Part 2: 3Blue1Brown — Attention in transformers
  • O(n²) cost: every token attends to every other. Doubling context length quadruples compute. This is why 1M-token windows are expensive.
  • "Lost in the middle" (2023, still applies in 2026): content at the start and end of a prompt gets more attention weight. Content buried in the middle is systematically under-attended. Put critical instructions FIRST.
  • Read: Lost in the Middle — Stanford 2023 — required reading
Still true in Claude 4: even with 1M-token context windows, position effects exist. Critical system prompt content belongs at the top.
DAY 2
Tokens, Temperature, Constitutional AI
  • Watch: Andrej Karpathy — Let's build the GPT tokeniser — understand BPE tokenisation from scratch
  • Tokens ≠ characters. "JavaScriptDeveloper" = ~4 tokens. Test in OpenAI's tokeniser. Claude's tokeniser uses the same subword approach, though exact counts differ.
  • Temperature 0: near-deterministic (minor run-to-run variation can remain). Use for JSON extraction, tool calls, classification. Temperature 0.7+: creative, varied. Use for brainstorming, writing.
  • Read: Claude Model Spec — the public document behind Claude's behaviour. Understanding it changes how you prompt.
📖 Paper · Lost in the Middle (Stanford) · Proves position effects in long-context LLMs. Still the standard reference. · arxiv.org · Free
📖 Docs · Claude Model Spec · Anthropic's public document on Claude's values, behaviour, and decision-making. · anthropic.com · Free
▶ YouTube · 3Blue1Brown: GPT + Attention (2 parts) · The definitive visual explanation. Watch both. Non-negotiable for understanding the internals. · Free
▶ YouTube · Andrej Karpathy: Build GPT from scratch · Best way to internalise transformers: build a small one in Python. 2h deep dive. · Free
01
Claude 4 Model Family — Current Lineup
Day 3 · Which model to use for what
Core
DAY 3
2026 Model Selection Guide
Model | Best for | Context | Price (in/out per 1M)
Opus 4.8 (LATEST) | Complex agents, long-horizon tasks, multi-step coding, browser use. 4x less likely to let code flaws pass. | 1M tokens | $5 / $25 standard; Fast mode: $10 / $50 at 2.5× speed
Sonnet 4.6 (DEFAULT) | Daily driver. 70% of devs prefer over Sonnet 4.5. Matches Nov 2025 Opus 4.5 quality at Sonnet price. 1M context in beta. | 1M (beta) | $3 / $15
Haiku 4.5 | Classification, routing, simple summarisation, task scheduling. Claude Code smart-routes here automatically. | 200K | $1 / $5
⚠ Deprecated: Claude 3.x models, including Claude 3.7 Sonnet. The upgrade path is straightforward: if you're still on 3.7 Sonnet, move to Sonnet 4.6 for the same pricing and dramatically better performance.
Decision rule: Sonnet 4.6 by default → Haiku 4.5 for routing/classification → Opus 4.8 for complex architecture decisions, long agentic sessions, and browser/computer use.
02
Context Window Engineering — 1M Tokens
Days 4–5 · Managing cost at scale
Concept
DAYS 4–5
Cost Traps, Caching, Compression Strategies
  • 1M tokens ≠ free: at Sonnet 4.6 pricing, 1M input tokens = $3. A 10-turn conversation re-sends everything each turn — graph your costs.
  • Prompt caching (1-hour TTL in 2026): Claude 4 supports 1-hour cache duration for long agentic tasks. Previously 5 minutes — this changes long-running agent economics significantly. UPDATED
  • Cache with thinking: extended thinking tasks often take >5 min — use the 1-hour cache to maintain hits across multi-step workflows.
  • Read: Prompt caching docs — includes 1-hour TTL details
  • Build ContextManager.java with 4 strategies: full history, sliding window, summarisation, selective inclusion. Benchmark cost vs quality. ☕ Java
Strategy → When to use → Cost profile
Full hist → Short sessions (<20 turns) → O(n²), explodes
Sliding → Fixed budget, any length → O(1) per turn
Summarise → Quality matters, long sess → Medium, one-time
Selective → Task-specific context → Lowest, most complex
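The gap between "full history" and "sliding window" is easiest to see as arithmetic. The sketch below is illustrative only: the class and method names, the 2,000 tokens per turn, and the 4-turn window are assumptions, and it counts input tokens only at the $3 per 1M Sonnet 4.6 price quoted above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch: compares cumulative input tokens for two context strategies. */
class ContextCostSketch {
    static final double USD_PER_MILLION_INPUT = 3.0; // Sonnet 4.6 input price quoted above

    /** Full history: turn t re-sends all t turns so far, so the total is 1+2+...+n turns' worth. */
    static long fullHistoryTokens(int turns, int tokensPerTurn) {
        long total = 0;
        for (int t = 1; t <= turns; t++) total += (long) t * tokensPerTurn;
        return total;
    }

    /** Sliding window: each turn sends at most windowTurns of the most recent turns. */
    static long slidingWindowTokens(int turns, int tokensPerTurn, int windowTurns) {
        Deque<Integer> window = new ArrayDeque<>();
        long total = 0;
        for (int t = 1; t <= turns; t++) {
            window.addLast(tokensPerTurn);
            if (window.size() > windowTurns) window.removeFirst();
            total += window.stream().mapToLong(Integer::longValue).sum();
        }
        return total;
    }

    static double usd(long inputTokens) {
        return inputTokens / 1_000_000.0 * USD_PER_MILLION_INPUT;
    }

    public static void main(String[] args) {
        long full = fullHistoryTokens(10, 2_000);
        long sliding = slidingWindowTokens(10, 2_000, 4);
        System.out.printf("full history: %d tokens ($%.4f)%n", full, usd(full));
        System.out.printf("sliding (4):  %d tokens ($%.4f)%n", sliding, usd(sliding));
    }
}
```

A real ContextManager would count actual tokens per message rather than a fixed size, but the shape of the curve is the same: full history grows with the square of the turn count, while the window flattens out once it is full.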
03
Cursor 3.x — Current Feature Set
Days 6–7 · Cloud Agents, BugBot, Memories, Rules
Core
DAY 6
What's Current in Cursor (2026)
  • Notepads are deprecated. Replaced by Cursor Memories — a persistent knowledge base the AI maintains automatically across sessions. No manual creation needed. CHANGED
  • Cloud Agents (formerly Background Agents): run up to 8 agents in parallel on a single prompt. Each agent operates in its own isolated git worktree or remote machine — no file conflicts. NEW
  • BugBot: automatically reviews PRs, identifies potential issues, assists debugging at project level. Runs without being asked. NEW
  • Cursor Composer model: Cursor's first agentic coding model, 4x faster than similarly intelligent models. Default in Agent Mode.
  • Browser for Agent (GA): Agent can browse the web inline. Embeddable in-editor with element selection and DOM forwarding.
  • Read: Cursor 2.0 changelog and latest changelog
DAY 7
Cursor Rules — Current Behaviour
  • Rules location changed: `.cursorrules` is deprecated. Current system uses .cursor/rules/ folder with .mdc files. CHANGED
  • Rule types: Always (every request), Auto-attached (when matching files open), Agent-requested (agent decides when relevant), Manual (you add with @)
  • Cursor Memories replaces Notepads for cross-session context — seed it with your architecture conventions from the first session
  • Team Rules: share custom rules, commands, prompts across the team via Cursor Docs deeplinks NEW
  • Read: Cursor Rules docs — current .mdc format
⚠ Outdated info to ignore: any tutorial that references .cursorrules files or Notepads (rather than Memories) predates late 2025 and is partially obsolete.
📖 Changelog · Cursor 2.0 Release Notes · Cloud Agents, parallel execution, BugBot, browser GA, new agentic model. · cursor.com · Free · 2026
📖 Docs · Cursor Rules (current) · .mdc files in .cursor/rules/, the current rules system replacing .cursorrules. · docs.cursor.com · Free · Current
▶ YouTube · Cursor AI 2026 Guide (DEV Community) · Comprehensive 2026 walkthrough of all current Cursor features including Cloud Agents. · dev.to · Free · 2026
04
MCP Protocol — 2025-06-18 Spec
Days 8–10 · The universal AI tool standard
Core
DAYS 8–9
MCP in 2026 — Now the Industry Standard
  • 97 million monthly SDK downloads, 10,000+ public servers, adopted by OpenAI, Google DeepMind, Microsoft, AWS. MCP is the de facto connectivity standard for agentic AI. 2026 STATUS
  • Donated to the Linux Foundation (Dec 2025): MCP now sits under the Agentic AI Foundation, which Anthropic co-founded with Block and OpenAI, backed by Google, Microsoft, AWS, Cloudflare. Governance is vendor-neutral for the long term.
  • Latest spec: 2025-06-18 — introduces structured tool outputs, enhanced OAuth security, server-initiated user interactions. Removed JSON-RPC batching (added in March spec) to simplify. LATEST SPEC
  • Transport change: HTTP+SSE replaced by Streamable HTTP — more robust, proxy-friendly. BREAKING from 2024-11-05
  • Tool annotations: tools now declare their behaviour (read-only, destructive) for safer execution
  • Spring AI 1.1 + LangChain4j 1.x both support MCP as of late 2025. No custom wiring needed.
  • Read: MCP 2025-06-18 spec
MCP 2024-11-05 → 2025-03-26 → 2025-06-18 (current)
stdio/HTTP+SSE → Streamable HTTP → Structured outputs
Basic auth → OAuth 2.0 → Enhanced OAuth + server-init interactions
Simple tools → Tool annotations → Safer destructive-op model
DAY 10
Build: MCP Server with Spring AI 1.1
  • Spring AI 1.1 (Nov 2025): full MCP auto-configuration. Expose @Bean methods as MCP tools. NEW
  • LangChain4j 1.x: dedicated langchain4j-mcp module. Works on Quarkus, Micronaut — not just Spring.
  • Build a server with all 3 primitives: Tools (actions), Resources (read-only), Prompts (templates)
  • Use the 2025-06-18 transport: Streamable HTTP, not the old SSE transport
  • Add tool annotations: mark destructive tools, mark read-only tools
  • Security reminder: never trust LLM-supplied arguments. Claude can be prompt-injected via retrieved content. Validate all tool inputs server-side regardless of source.
  • Read: Spring AI 1.1 MCP docs
  • Read: LangChain4j MCP module
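The security reminder above is worth making concrete. A minimal sketch, assuming a hypothetical file tool whose only argument is "path" and a 512-character limit chosen for illustration: the server resolves the model-supplied value against a sandbox root and rejects anything that escapes it. Nothing here is Spring AI or LangChain4j API; it is plain Java you would call inside your tool method.

```java
import java.nio.file.Path;
import java.util.Map;

/** Hypothetical sketch: validate LLM-supplied tool arguments before acting on them. */
class ToolInputValidator {
    private final Path sandboxRoot;

    ToolInputValidator(Path sandboxRoot) {
        this.sandboxRoot = sandboxRoot.toAbsolutePath().normalize();
    }

    /** Resolves the "path" argument and rejects anything missing, oversized, or outside the sandbox. */
    Path validatePath(Map<String, Object> args) {
        Object raw = args.get("path");
        if (!(raw instanceof String s) || s.isBlank() || s.length() > 512) {
            throw new IllegalArgumentException("path must be a non-empty string of at most 512 chars");
        }
        // normalize() collapses "..", so traversal attempts end up outside the root and fail the check.
        Path resolved = sandboxRoot.resolve(s).normalize();
        if (!resolved.startsWith(sandboxRoot)) {
            throw new IllegalArgumentException("path escapes the sandbox: " + s);
        }
        return resolved;
    }
}
```

The same pattern applies to every argument type: parse, bound, and allow-list on the server, and treat the model's output as untrusted user input.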
📖 Spec · MCP 2025-06-18 · Current spec. Structured outputs, OAuth, Streamable HTTP transport. Read the diff from 2025-03-26. · modelcontextprotocol.io · Free · June 2025
📖 Docs · Spring AI 1.1 MCP · Auto-configuration for MCP servers in Spring Boot. MCP tool annotations on @Bean methods. · spring.io · Free · ☕ Java · Nov 2025
📖 Docs · LangChain4j MCP module · langchain4j-mcp: works on Quarkus, Spring, Micronaut. First-class MCP support in 1.x. · langchain4j.dev · Free · ☕ Java · 1.x
📖 Guide · MCP Complete Guide 2026 · Comprehensive 2026 MCP guide: architecture, all primitives, security, ecosystem. · sureprompts.com · Free · 2026
📖 Blog · MCP 2026 Roadmap · Official MCP roadmap blog: transport scalability, agent communication, governance. · blog.modelcontextprotocol.io · Free · 2026
05
Java AI Frameworks — Spring AI vs LangChain4j
Day 11 · Both hit 1.0 GA in May 2025 — choose correctly
Concept
DAY 11
Current State — Both Production-Ready Since May 2025
FrameworkVersion (June 2026)Choose if…
Spring AI | 1.1 (Nov 2025) + 2.0-M1 preview | Already on Spring Boot. Advisors API, Micrometer observability, autoconfiguration. Tightest Spring integration.
LangChain4j | 1.10.x (monthly releases since May 2025) | Quarkus/Micronaut/Helidon/plain Java. More provider support (25+). GraalVM native image: 100ms start, 50MB RAM. Explicit/modular style.
Direct HttpClient | Java 21+ | Simple tasks, no framework lock-in, full control. What we used in this plan. JEP 517 (JDK 26) adds HTTP/3.
Decision: Spring Boot team → Spring AI 1.1. Quarkus/other → LangChain4j. Both support MCP, RAG, tool calling, chat memory, 20+ LLM providers. The choice is architectural, not capability-driven.

// Phase 0 Checkpoint

I can explain the "lost in the middle" effect and have restructured my prompts accordingly
I know when to use Opus 4.8 vs Sonnet 4.6 vs Haiku 4.5 — with cost reasoning
I updated my project to use .cursor/rules/*.mdc files instead of .cursorrules
I understand MCP 2025-06-18 changes and the Streamable HTTP transport
I chose between Spring AI 1.1 and LangChain4j 1.x for my project with clear reasoning
Phase 1 — Weeks 3–4

Advanced Claude 4 — Adaptive Thinking, Prompt Caching, Batch API

Adaptive Thinking (the 2026 replacement for budget_tokens), interleaved thinking with tools, 1-hour prompt cache, systematic prompt engineering, and a self-testing prompt management system.

2 weeks · 2–3h per day
06
Adaptive Thinking — The 2026 Standard
Days 15–17 · Replaces manual budget_tokens
Concept
DAYS 15–17
Adaptive vs Extended Thinking, Interleaved, Effort Levels
  • Adaptive Thinking (Claude Opus 4.6+, Sonnet 4.6): set thinking: {type: "adaptive"} — Claude decides when and how much to think based on query complexity and your effort setting. NEW — replaces budget_tokens
  • Effort parameter: low / medium / high / max / xhigh. You control intensity, not token count. "high" is the default, and at that level Claude almost always thinks.
  • Interleaved thinking (auto-enabled in adaptive mode): Claude can think between tool calls. Previous limitation: thought once at the start, then called tools without further thinking. Now: think → call tool → think about result → call next tool → think → answer.
  • Extended thinking (budget_tokens) is deprecated on Sonnet 4.6 and Opus 4.6. Still functional but prefer adaptive. On Opus 4.5 and older, budget_tokens is still required.
  • Summarised thinking: Claude 4 returns a summary of its reasoning, not the full token stream. Full trace available via Anthropic request (for audit/compliance). CHANGED from 3.x
  • For Opus 4.8 at max/xhigh effort: set max_tokens to 64k minimum — the model needs space for subagents and tool calls
  • Read: Adaptive Thinking docs — current reference
  • Read: Extended Thinking docs — for Sonnet/Haiku 4.5 and older
// Adaptive thinking: 2026 standard
{
  "model": "claude-sonnet-4-6",
  "max_tokens": 16000,
  "thinking": { "type": "adaptive" }  // let Claude decide
  // OR override effort: "thinking": { "type": "adaptive", "effort": "high" }
}

// Extended thinking: still used for Opus/Sonnet 4.5 and older
{
  "model": "claude-opus-4-5",
  "max_tokens": 16000,
  "thinking": { "type": "enabled", "budget_tokens": 8000 }
}
Build adaptive budget: use effort: "low" for simple queries (faster, cheaper), effort: "high" for architecture decisions, effort: "max" for complex multi-step agents. Benchmark quality vs cost per effort level.
📖 Docs · Adaptive Thinking · Official docs for adaptive thinking: effort levels, interleaved, cache considerations. · claude.com · Free · 2026 API
📖 Docs · Prompt Engineering: Thinking Tips · Anthropic's official guide to getting the most from thinking: when to trigger, what to guide. · claude.com · Free
📖 Article · Building with Claude Extended Thinking (2026) · Practical article covering adaptive mode, interleaved tool use, and summarised thinking. April 2026. · Medium · Free · Apr 2026
07
Prompt Caching — 1-Hour TTL for Agents
Days 18–19 · 90% cost reduction on repeated contexts
Production
DAYS 18–19
TTL Update, Agent-Aware Caching, 3 Patterns
  • Cache TTL is now 1 hour for long agentic tasks — up from 5 minutes. This changes the economics of long agent loops significantly. UPDATED 2025
  • Use 1-hour cache for: large system prompts in multi-step agents, RAG context reused across turns, few-shot examples referenced throughout a session
  • Thinking + cache: changes to thinking parameters (enabled/disabled or budget changes) invalidate cache breakpoints. Interleaved thinking amplifies this. Plan cache strategy before enabling adaptive thinking.
  • Pattern 1 — Static system prompt: large instructions cached, only user message varies. Best ROI.
  • Pattern 2 — RAG context caching: when retrieved docs are reused across turns in same session, cache them. Significant savings on document Q&A.
  • Pattern 3 — Few-shot bank: cache example bank, vary only the query.
  • Build a cache-aware service: log cache_hit rate, cost saved per request. Target >60% hit rate.
  • Read: Prompt caching docs — includes 1-hour TTL and thinking interactions
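For the cache-aware service, the per-request numbers can be derived from the usage block of each response. The sketch below assumes the three counters correspond to the input_tokens, cache_creation_input_tokens and cache_read_input_tokens usage fields, and it hard-codes two billing multipliers (reads at 10% of base input, 1-hour writes at 2x) as assumptions to check against the current pricing page. The record and method names are hypothetical.

```java
/** Hypothetical sketch: per-request cache accounting from the usage block of a Messages API response. */
record CacheUsage(long inputTokens, long cacheCreationTokens, long cacheReadTokens) {

    static final double READ_MULTIPLIER = 0.10;      // assumed: cache reads billed at 10% of base input
    static final double WRITE_1H_MULTIPLIER = 2.00;  // assumed: 1-hour cache writes billed at 2x base input

    /** Share of prompt tokens served from cache (0.0 to 1.0). */
    double hitRate() {
        long total = inputTokens + cacheCreationTokens + cacheReadTokens;
        return total == 0 ? 0.0 : (double) cacheReadTokens / total;
    }

    /** Input cost in USD with caching, given the base input price per 1M tokens. */
    double costUsd(double usdPerMillionInput) {
        double weighted = inputTokens
                + cacheCreationTokens * WRITE_1H_MULTIPLIER
                + cacheReadTokens * READ_MULTIPLIER;
        return weighted / 1_000_000.0 * usdPerMillionInput;
    }

    /** What the same prompt would have cost with no caching at all. */
    double uncachedCostUsd(double usdPerMillionInput) {
        return (inputTokens + cacheCreationTokens + cacheReadTokens) / 1_000_000.0 * usdPerMillionInput;
    }

    double savedUsd(double usdPerMillionInput) {
        return uncachedCostUsd(usdPerMillionInput) - costUsd(usdPerMillionInput);
    }
}
```

Note that savedUsd goes negative on a request that only writes the cache. That is expected: the write premium is paid once and recovered on later hits, which is why the hit rate is worth tracking across a session rather than per call.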
08
Systematic Prompt Engineering — 2026 Patterns
Days 20–21 · From guessing to engineering
Concept
DAYS 20–21
XML Structure, Self-Critique, Evaluator-Optimizer, Promptability of Thinking
  • Read: Anthropic prompt engineering guide — comprehensive reference
  • Rebuild a prompt using full XML structure: <role>, <context>, <task>, <constraints>, <output_format>
  • Adaptive thinking is promptable: if the model over-thinks on simple queries (large system prompts can trigger this), add guidance: "Only use extended reasoning for genuinely complex problems. Respond directly for simple queries."
  • Self-critique loop: answer → Claude critiques → Claude revises. 3-turn Java method. Measurably improves output quality.
  • Evaluator-optimizer: generate → score (1–5) → if <4, regenerate with critique. Loop max 3x. Use Claude-as-judge.
  • Free course: DeepLearning.AI: Prompt Engineering with Anthropic Claude
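The evaluator-optimizer loop from the list above (generate, score 1 to 5, regenerate with critique if below 4, at most 3 rounds) can be sketched without committing to any client library. Here the generator and judge are injected as plain functions, so the real Claude calls are left out; all class, record and parameter names are hypothetical.

```java
import java.util.function.BiFunction;
import java.util.function.Function;

/** Hypothetical sketch of the evaluator-optimizer loop; the model calls are injected as functions. */
class EvaluatorOptimizer {
    record Verdict(int score, String critique) {}
    record Outcome(String answer, int score, int attempts) {}

    static final int PASS_SCORE = 4;
    static final int MAX_ATTEMPTS = 3;

    /**
     * generator: (task, critique or null on the first attempt) -> draft answer
     * judge:     draft -> score 1..5 plus a critique
     */
    static Outcome run(String task,
                       BiFunction<String, String, String> generator,
                       Function<String, Verdict> judge) {
        String critique = null;
        String best = null;
        int bestScore = 0;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            String draft = generator.apply(task, critique);
            Verdict verdict = judge.apply(draft);
            if (verdict.score() > bestScore) { // keep the best draft seen so far
                best = draft;
                bestScore = verdict.score();
            }
            if (verdict.score() >= PASS_SCORE) return new Outcome(best, bestScore, attempt);
            critique = verdict.critique();     // feed the critique into the next attempt
        }
        return new Outcome(best, bestScore, MAX_ATTEMPTS);
    }
}
```

Returning the best draft rather than the last one is a design choice: a regeneration can score lower than the draft it replaced, and the loop should never hand back something worse than it has already seen.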
09
Batch API & Async Pipelines
Days 22–23 · 50% cost on offline workloads
Production
DAYS 22–23
Message Batches API, Spring WebFlux, Decision Matrix
  • Read: Message Batches API — 50% cheaper, async, up to 100k requests per batch
  • Build Java batch processor: List<String> → submit batch → poll status → process results ☕ Java
  • Convert streaming endpoint to reactive: Flux<String> with Spring WebFlux ☕ WebFlux
  • Decision rule: sync+streaming for interactive (user is watching). Batch API for document processing, classifiers, bulk analysis, nightly jobs.
  • Build a cost calculator: compare real-time vs batch for your specific workload mix

// Phase 1 Checkpoint

Adaptive thinking implemented — effort levels tested, not using deprecated budget_tokens on 4.6 models
Prompt caching saves ≥60% — using 1-hour TTL for my agent's system prompt
Self-critique loop measurably improves output on my 20-question test set
Batch API used for at least one offline workload — cost savings measured
Phase 2 — Weeks 5–7

Production RAG — 2026 Techniques

Hybrid retrieval, GraphRAG, contextual compression, RAGAS evaluation, hallucination detection. RAG has evolved beyond simple chunk-and-retrieve — this phase covers what production systems actually do in 2026.

3 weeks · 2–3h per day
WK 5
Retrieval Foundations
Chunking · embeddings · recall benchmarks
WK 6
Advanced Retrieval 2026
GraphRAG · hybrid · reranking · contextual compression
WK 7
RAG Evaluation + Guardrails
RAGAS · LLM-as-judge · hallucination prevention
10
Chunking Strategies — What Production Uses
Days 29–30 · Semantic chunking wins
Deep Dive
DAYS 29–30
Contextual Headers, Parent-Child, Semantic Boundaries
  • 2026 consensus: semantic chunking with contextual headers outperforms fixed-size chunking. Preserve document structure (sections, headings). Tools like LlamaIndex can do LLM-based compression of retrieved sets.
  • Parent-child pattern: index small chunks (precise retrieval), return parent chunk to LLM (full context). Gains 10–20% recall. Still the best single improvement for most pipelines.
  • Contextual retrieval (Anthropic, 2024): prepend a sentence of chunk context before embedding — "This chunk is from the section about X in document Y". Reported 49% reduction in retrieval failures when combined with BM25.
  • Read: Contextual Retrieval — Anthropic blog
  • Implement 3 chunkers in Java + LangChain4j: fixed-size, sentence-aware, semantic. Benchmark on 50 golden questions: recall@5, precision, latency. ☕ LangChain4j
📖 Blog · Contextual Retrieval (Anthropic) · 49% retrieval failure reduction by prepending chunk context before embedding. Read this. · anthropic.com · Free · 2024, still current
📖 Docs · LangChain4j RAG + Document Splitters · Java-native recursive, sentence, semantic splitters. Parent-document retriever built-in. · langchain4j.dev · Free · ☕ Java
📖 Guide · RAG Pipeline 2026 (kapa.ai) · Production RAG guide updated Jan 2026. Chunking, hybrid retrieval, evaluation, deployment. · kapa.ai · Free · Jan 2026
▶ YouTube · James Briggs: Advanced RAG Techniques · Practical RAG: hybrid search, reranking, parent-document retrieval. Highly applied. · Free
11
Hybrid Search + GraphRAG
Days 31–33 · Beyond vector-only retrieval
Deep Dive
DAYS 31–33
BM25 + Semantic + RRF + Knowledge Graphs
  • Hybrid search (BM25 + semantic) with RRF is the baseline for production RAG in 2026. Pure vector-only is no longer sufficient. Implement: add a keyword leg with PostgreSQL full-text search (tsvector ranking as a BM25 stand-in) alongside pgvector, then merge with RRF: score = Σ 1/(rank_i + 60). Target: ≥15% recall improvement over pure semantic.
  • GraphRAG (Microsoft, 2024 — mainstream 2025): index documents as a knowledge graph. Retrieve via graph traversal, not just vector similarity. Handles complex multi-hop questions that vector search misses ("what connects A to C via B?"). PRODUCTION in 2025
  • GraphRAG use cases: legal research (connecting related cases), medical literature (drug interaction chains), enterprise knowledge (org relationships)
  • Neo4j LangChain4j integration available for Java GraphRAG. ☕ Java
  • Read: Advanced RAG techniques including GraphRAG — Neo4j, Oct 2025
  • Benchmark: hybrid vs vector-only on 50 multi-hop questions in your domain
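The RRF merge is only a few lines once each retriever returns an ordered list of document ids. A minimal sketch, assuming ranks start at 1 and ties are broken alphabetically by id (both are choices made here, not part of the formula); the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of Reciprocal Rank Fusion over several ranked lists of document ids. */
class ReciprocalRankFusion {
    static final int K = 60; // constant from the formula above

    /** Sums 1 / (rank + K) per document across all lists, with rank starting at 1. */
    static Map<String, Double> scores(List<List<String>> rankings) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                scores.merge(ranking.get(i), 1.0 / (i + 1 + K), Double::sum);
            }
        }
        return scores;
    }

    /** Returns document ids ordered by fused score, highest first. */
    static List<String> fuse(List<List<String>> rankings) {
        return scores(rankings).entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Double>comparingByKey()))
                .map(Map.Entry::getKey)
                .toList();
    }
}
```

Because RRF uses ranks rather than raw scores, the keyword and vector legs need no score normalisation, which is the main reason it is a convenient default for hybrid search.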
12
Reranking, HyDE, Query Decomposition
Days 34–36 · The retrieval quality stack
Deep Dive
DAYS 34–36
Cross-Encoders, HyDE, Sub-Questions, Conversational Rewriting
  • Reranking: retrieve top 50 with bi-encoder → rerank to top 5 with Cohere Rerank API. Two-stage pipeline is standard. Build Reranker interface: Cohere, LLM, passthrough.
  • HyDE: generate a hypothetical answer → embed that → retrieve. Dramatically improves vague query retrieval. Read: HyDE paper.
  • Query decomposition: complex → multiple single-fact sub-questions → merge. Essential for "What are the tax implications of X given Y and Z?"
  • Conversational rewriting: "What about Q3?" → standalone "What was Apple's Q3 2024 revenue?" using history. Required for any multi-turn RAG app.
  • Contextual compression (new 2025 pattern): ask Claude to extract only the relevant sentence(s) from each retrieved chunk before passing to the LLM. Reduces noise in the context.
13
RAG Evaluation — RAGAS + LLM-as-Judge
Days 37–38 · What gets measured gets improved
Eval
DAYS 37–38
4 RAGAS Metrics, Golden Sets, CI Integration
  • Read: RAGAS docs — all 4 metrics: Faithfulness, Answer Relevance, Context Precision, Context Recall
  • Faithfulness is the most important metric in 2026. LLM hallucination in RAG is the #1 production complaint. Every claim must derive from retrieved context.
  • Implement all 4 as JUnit-compatible Java evaluators using Claude-as-judge (Claude grading Claude)
  • Span-level evaluation (2025 practice): for multi-step RAG, evaluate each stage independently — retrieval quality separate from generation quality. Maxim AI and similar platforms support this.
  • Generate 100 synthetic Q&A pairs from your corpus with Claude. Curate to 50 golden questions. Run full pipeline. Record baseline score. This is your regression suite forever.
  • Add to GitHub Actions: PR that drops faithfulness below 3.5/5 fails automatically.
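Two pieces of the regression suite are pure arithmetic and need no model call: recall@k over the golden set, and the faithfulness gate that fails the build. The sketch below assumes judge scores arrive as a list of 1 to 5 values produced elsewhere; the class, record and method names are hypothetical, and in a JUnit setup the final method would be called from a test.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: retrieval recall@k over a golden set plus a pass/fail gate for CI. */
class RagRegressionGate {
    record GoldenCase(String question, Set<String> relevantIds, List<String> retrievedIds) {}

    /** Fraction of the relevant ids that appear in the top k retrieved ids. */
    static double recallAtK(GoldenCase c, int k) {
        if (c.relevantIds().isEmpty()) return 1.0;
        long hits = c.retrievedIds().stream().limit(k).distinct()
                .filter(c.relevantIds()::contains).count();
        return (double) hits / c.relevantIds().size();
    }

    static double meanRecallAtK(List<GoldenCase> cases, int k) {
        return cases.stream().mapToDouble(c -> recallAtK(c, k)).average().orElse(0.0);
    }

    /** Fails the build (by throwing) when mean judge-scored faithfulness drops below the threshold. */
    static void assertFaithfulness(List<Double> judgeScores, double threshold) {
        double mean = judgeScores.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        if (mean < threshold) {
            throw new AssertionError("faithfulness " + mean + " is below threshold " + threshold);
        }
    }
}
```

Keeping retrieval metrics separate from the judge-scored gate matches the span-level idea above: a drop in recall@5 and a drop in faithfulness point at different stages of the pipeline.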
📖 Docs · RAGAS Framework · Standard RAG eval framework. Metric definitions are language-agnostic: implement in Java. · docs.ragas.io · Free
📖 Guide · Anthropic Evals Guide · How to use Claude as a judge for structured evaluation. Core pattern for all evals. · docs.anthropic.com · Free
📖 Guide · RAG Eval Guide 2025 (Maxim AI) · Comprehensive 2025 guide on RAG metrics, LLM-as-judge, span-level evaluation. · getmaxim.ai · Free · Nov 2025
▶ YouTube · Evaluating RAG Systems (LangChain) · Build RAG eval pipelines with RAGAS. Python but concepts identical for Java implementation. · Free
14
RAG vs Fine-tuning vs Long Context — 2026 Decision
Day 39 · The right tool for the right problem
Concept
DAY 39
When to use which approach (updated for 1M context windows)
  • Few-shot first: always. 3–5 examples in the prompt. Zero infrastructure, fastest to test. If this works, stop here.
  • Long context window (1M tokens): with Sonnet 4.6's 1M token context (beta), small-to-medium private knowledge bases can now be injected directly. RAG becomes optional for datasets under ~500k tokens. 2026 CHANGE — re-evaluate your RAG decisions
  • RAG when: corpus >500k tokens, knowledge changes frequently, need source citations, multi-document search. Still the production standard for large corpora.
  • Fine-tuning when: consistent output format or style across thousands of calls, domain-specific vocabulary, latency is critical. Never fine-tune to add factual knowledge — use RAG. Fine-tuned facts hallucinate.
  • New 2025 pattern — "Needle in a haystack" evals: test whether your model can actually find a specific fact in a 500k-token context. Long context windows don't automatically mean accurate recall of buried facts.
Decision tree:
Corpus < 500k tokens? → Try long context first (1M beta)
Changes frequently? → RAG required
Need citations? → RAG required
Style consistency? → Fine-tune (format, not facts)
Anything else? → RAG + few-shot is the right stack
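Encoding the decision tree as a function makes the precedence explicit. One assumption in this sketch: the two "RAG required" conditions are checked before corpus size, on the reading that a hard requirement overrides the long-context shortcut. The 500k threshold comes from the text above; the type and method names are hypothetical.

```java
/** Hypothetical sketch encoding the decision tree above as a pure function. */
class KnowledgeStrategy {
    enum Approach { LONG_CONTEXT, RAG, FINE_TUNE, RAG_PLUS_FEW_SHOT }

    record Workload(long corpusTokens, boolean changesFrequently,
                    boolean needsCitations, boolean needsStyleConsistency) {}

    static final long LONG_CONTEXT_LIMIT = 500_000; // threshold quoted above

    static Approach choose(Workload w) {
        // Hard requirements first: these force RAG regardless of corpus size (assumed precedence).
        if (w.changesFrequently() || w.needsCitations()) return Approach.RAG;
        if (w.corpusTokens() < LONG_CONTEXT_LIMIT) return Approach.LONG_CONTEXT;
        if (w.needsStyleConsistency()) return Approach.FINE_TUNE; // format and style, not facts
        return Approach.RAG_PLUS_FEW_SHOT;
    }
}
```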

// Phase 2 Checkpoint

Contextual retrieval implemented — 50-question benchmark shows improvement over plain chunking
Hybrid search beats pure semantic by ≥15% recall@5 — with benchmark evidence
All 4 RAGAS metrics run automatically on every PR — faithfulness threshold enforced in CI
I evaluated whether long-context (1M) eliminates the need for RAG in my use case
Phase 3 — Weeks 8–10

Agentic Systems — 2026 Patterns

Multi-agent orchestration with interleaved thinking, all 6 memory types, computer use, A2A protocol, human-in-the-loop, and a production research agent with full evaluation suite.

3 weeks · 2–3h per day
15
Tool Design — 2026 Best Practices
Days 50–51 · Annotations, descriptions, error contracts
Concept
DAYS 50–51
Tool Annotations, Parallel Calls, Validation
  • Read: Building Effective Agents (Anthropic) — still the canonical reference
  • Read: Tool definition best practices
  • Tool annotations (MCP 2025-03-26+): declare readOnly: true for read tools, destructive: true for write/delete. Claude uses these to make safer decisions. NEW IN MCP
  • Parallel tool calls (Claude 4): Claude can call multiple independent tools simultaneously. Design tools to be idempotent where possible. Build your parallel-safe tool set. Claude 4 feature
  • Description is still the most important field — tell Claude WHEN to use the tool, not just what it does. "Use this when you need current stock price data" not just "Gets stock prices."
  • Build ToolResult record with: success, data, error, requiresConfirmation, isIdempotent ☕ Java
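One possible shape for the ToolResult record, using the five fields named above. The factory methods, the invariants in the compact constructor, and the safeToRetry rule are assumptions added for illustration, not part of any framework.

```java
/** Hypothetical sketch of the ToolResult record described above; factory names are illustrative. */
record ToolResult<T>(boolean success, T data, String error,
                     boolean requiresConfirmation, boolean isIdempotent) {

    ToolResult {
        if (success && error != null) {
            throw new IllegalArgumentException("a successful result cannot carry an error");
        }
        if (!success && (error == null || error.isBlank())) {
            throw new IllegalArgumentException("a failed result needs an error message");
        }
    }

    static <T> ToolResult<T> ok(T data, boolean isIdempotent) {
        return new ToolResult<>(true, data, null, false, isIdempotent);
    }

    static <T> ToolResult<T> failed(String error, boolean isIdempotent) {
        return new ToolResult<>(false, null, error, false, isIdempotent);
    }

    /** For destructive tools: nothing has run yet, the caller must get human approval first. */
    static <T> ToolResult<T> needsConfirmation(String reason) {
        return new ToolResult<>(false, null, reason, true, false);
    }

    /** Only idempotent tools that failed are safe to retry automatically. */
    boolean safeToRetry() {
        return !success && isIdempotent && !requiresConfirmation;
    }
}
```

Carrying isIdempotent on the result lets the agent loop decide on retries without a lookup back into the tool registry, which matters once parallel calls can fail independently.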
📖 Research · Building Effective Agents · Anthropic's canonical guide to agent architectures. Required reading for this phase. · anthropic.com · Free · Still current
📖 Course · DeepLearning.AI: AI Agents · Free short course on building AI agents with tool use and memory. Made with Anthropic. · learn.deeplearning.ai · Free
▶ YouTube · Dave Ebbelaar: LLM Agents (2025) · Practical agentic AI patterns: tool design, memory, orchestration. Updated 2025. · Free
16
Memory Architecture — All 6 Types
Days 52–54 · Production memory system
Deep Dive
DAYS 52–54
In-Context · Episodic · Semantic · Procedural · Cache · File-based
  • 1. In-context (working memory): context window. Fast, temporary, expensive at scale.
  • 2. Episodic: past conversation logs. Retrieve when relevant. PostgreSQL with semantic search.
  • 3. Semantic: distilled facts in vector store. "User prefers functional Java style." Update via end-of-session consolidation.
  • 4. Procedural: how-to knowledge in tool definitions and few-shot examples. Most durable.
  • 5. Prompt cache: Anthropic KV cache. 1-hour TTL in 2026. Reduces cost, doesn't change knowledge.
  • 6. File-based memory (NEW in Claude 4): when given access to local files, Claude can extract and save key facts, building persistent memory with continuity. More reliable than semantic memory for structured facts. Claude 4 feature
  • Read: Generative Agents (Stanford). Its retrieval scores each memory by a weighted sum of recency, importance and relevance.
  • Build unified MemorySystem.java with all 6 types, store() / retrieve(query, userId) interface ☕ Java
17
Multi-Agent Orchestration + A2A Protocol
Days 55–57 · Agent-to-Agent communication
Deep Dive
DAYS 55–57
Orchestrator-Subagent, Parallel Futures, A2A (Google), Checkpointing
  • Orchestrator-subagent pattern: orchestrator decomposes task → delegates to specialised subagents → merges results
  • Parallel with CompletableFuture: independent subagents run simultaneously. Handle partial failures (1 of 5 fails → continue with 4 results). ☕ Java
  • A2A (Agent-to-Agent) protocol (Google, 2025): standardises how agents communicate across systems. LangChain4j 1.x has A2A support in the agentic module. Enables agent teams from different vendors/frameworks to coordinate. 2025 STANDARD
  • Reviewer subagents: a reviewer agent can delegate to a test-writer, which can delegate further. Each level keeps its own prompt and model (Cursor 2.0 pattern — applicable to your own systems).
  • Checkpointing: persist state after each tool call. Build resume endpoint. Critical for long-running agents.
  • Per-job budget limits with Resilience4j. Parallel agents multiply cost — measure it.
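The partial-failure rule above (1 of 5 fails, continue with 4) takes a little care with CompletableFuture, because a plain join() rethrows the first failure. A minimal sketch, assuming each subagent is a Supplier<String> running on the common pool; the class and record names are hypothetical and no budget or checkpoint logic is included.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Hypothetical sketch: run independent subagents in parallel and keep whatever succeeds. */
class ParallelSubagents {
    record Attempt(String value, Throwable error) {}
    record Gathered(List<String> results, List<String> failures) {}

    static Gathered runAll(List<Supplier<String>> subagents) {
        // handle() turns a failure into a value, so one bad subagent cannot poison the batch.
        List<CompletableFuture<Attempt>> futures = subagents.stream()
                .map(agent -> CompletableFuture.supplyAsync(agent)
                        .handle((value, error) -> new Attempt(value, error)))
                .toList();

        List<String> results = new ArrayList<>();
        List<String> failures = new ArrayList<>();
        // All futures are already running; joining in order waits no longer than the slowest one.
        for (CompletableFuture<Attempt> future : futures) {
            Attempt attempt = future.join();
            if (attempt.error() == null) {
                results.add(attempt.value());
            } else {
                Throwable e = attempt.error();
                Throwable cause = e.getCause() != null ? e.getCause() : e;
                failures.add(String.valueOf(cause.getMessage()));
            }
        }
        return new Gathered(results, failures);
    }
}
```

The orchestrator can then decide whether the surviving results are enough to merge, or whether the failure list should trigger a retry of just those subagents.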
18
Computer Use & Browser Agents
Day 58 · Claude can control a browser
Deep Dive
DAY 58
Computer Use API, Prompt Injection Defence
  • Computer Use (GA in Claude 4): Claude can move a cursor, click, type in browser windows. Opus 4.8 scored 84% on Online-Mind2Web (browser agent SoTA).
  • OSWorld-Verified (Jul 2025): updated benchmark replacing original OSWorld. Sonnet 4.6 shows dramatic improvement in computer use vs Sonnet 4.5.
  • Read: Computer Use API docs
  • Prompt injection risk: browser agents reading web content can be hijacked by hidden instructions on visited pages. This is not theoretical — it's actively exploited. Mitigations: sandboxed browser, instruction validation, human approval before any write action.
  • Use case focus: web scraping automation, form filling, GUI testing, research agents that browse. Not for actions with financial consequences without human approval.
19
Agent Evaluation & Failure Catalogue
Days 59–60 · Agents fail in non-obvious ways
Eval
DAYS 59–60
Task Completion, Trajectory, Adversarial, 2026 Eval Platforms
  • Task completion eval: JUnit harness, submit task, capture trace, score with Claude judge
  • Trajectory eval: right tools, right order, no wasted steps. Efficiency = theoretical minimum steps / steps taken, so 1.0 is ideal and lower means wasted steps.
  • Adversarial tests: ambiguous instructions (does it ask?), conflicting results (does it reconcile?), errors (does it recover?), empty tool results (does it loop?)
  • Agent eval platforms (2026): Maxim AI now rated #1 for multi-agent eval. Supports simulation, experimentation, and observability for agent teams. 2026 tooling
  • Build failure mode catalogue: ≥10 patterns with fix strategies. Document as you discover them.
  • Common 2026 failure modes: infinite tool loop on empty results, ignoring tool errors, over-calling the same tool, not using parallel calls when independent

// Phase 3 Checkpoint

All 6 memory types implemented — including file-based memory for persistent facts (Claude 4 feature)
Multi-agent system uses parallel CompletableFuture, handles partial failures, includes checkpointing
Failure mode catalogue has ≥10 patterns — all tested adversarially
Research agent scores ≥80% task completion on 20-task eval suite
Phase 4 — Weeks 11–12

Ship & Scale

Cost engineering with Opus 4.8 Fast mode, latency SLAs, prompt A/B testing, AI system design decisions, and a portfolio capstone that integrates everything from the plan.

2 weeks · 3h per day
20
Cost Engineering — 2026 Model Economics
Days 71–72 · Fast modes, smart routing, budget management
Production
DAYS 71–72
Opus 4.8 Fast Mode, Smart Routing, Attribution
  • Opus 4.8 Fast mode ($10/$50 per 1M): 2.5× faster than standard, 3× cheaper than previous Opus Fast modes. Use for latency-sensitive Opus-quality work. 2026 PRICING
  • Smart model routing (like Claude Code does it): classify query → route Haiku 4.5 ($1/$5) for simple/routine → Sonnet 4.6 ($3/$15) for most work → Opus 4.8 ($5/$25) for complex architecture/long agents. Track quality delta per route.
  • Build CostTracker: log cost per user/feature/model/day to PostgreSQL ☕ Java
  • Per-user budget limits with Resilience4j rate limiter
  • Response caching with TTL: near-identical queries return cached response
  • Context compression: summarise long conversations before sending. Measure cost saved vs quality lost.
  • Use Langfuse Java SDK for cost + latency observability
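Routing and cost attribution share one lookup table. The sketch below uses the per-1M prices listed in this plan; the short model keys ("haiku-4.5" and so on) are placeholders rather than real API model ids, and the task classifier that produces a TaskClass is assumed to exist elsewhere.

```java
import java.util.Map;

/** Hypothetical sketch: route by task class, then price the call with the rates quoted above. */
class ModelRouter {
    enum TaskClass { SIMPLE, STANDARD, COMPLEX }

    record Price(double inputPerMillion, double outputPerMillion) {}

    // Model keys are illustrative placeholders, not API model ids.
    static final Map<TaskClass, String> MODEL = Map.of(
            TaskClass.SIMPLE, "haiku-4.5",
            TaskClass.STANDARD, "sonnet-4.6",
            TaskClass.COMPLEX, "opus-4.8");

    // USD per 1M tokens (input, output) as listed in this plan.
    static final Map<String, Price> PRICE = Map.of(
            "haiku-4.5", new Price(1, 5),
            "sonnet-4.6", new Price(3, 15),
            "opus-4.8", new Price(5, 25));

    static String route(TaskClass taskClass) {
        return MODEL.get(taskClass);
    }

    static double costUsd(String model, long inputTokens, long outputTokens) {
        Price p = PRICE.get(model);
        if (p == null) throw new IllegalArgumentException("unknown model: " + model);
        return inputTokens / 1_000_000.0 * p.inputPerMillion()
                + outputTokens / 1_000_000.0 * p.outputPerMillion();
    }
}
```

A CostTracker would call costUsd with the token counts from each response and write the result, keyed by user, feature, model and day, to PostgreSQL.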
21
Latency Optimisation & SLA Monitoring
Days 73–74 · Every 100ms matters
Production
DAYS 73–74
Profile → Parallelise → Circuit Breaker → Stream
  • Profile every stage with Langfuse: embedding → search → rerank → LLM → parse. The bottleneck is usually not where you expect.
  • Parallelise independent operations: fetch user context + embed query + check cache simultaneously with CompletableFuture.allOf()
  • Set SLAs: p50 <1s, p95 <3s, p99 <8s. Alert on breach.
  • Circuit breaker: Claude API latency >10s → fail fast with cached/fallback response (Resilience4j)
  • Stream every response — perceived latency drops even when total latency is unchanged
  • Opus 4.8 Fast mode is the answer for latency-sensitive Opus-quality work — not a different architecture.
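The parallelise and fail-fast steps above can be sketched with the JDK alone (Java 17+ assumed). `ParallelStagesSketch`, `RequestContext`, and the supplier parameters are hypothetical stand-ins for real service calls. Note that `callWithFallback` is only a timeout with a fallback value, not a full circuit breaker: Resilience4j adds the open/half-open state tracking on top of this idea.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Sketch: run independent pipeline stages concurrently, and fail fast on slow calls. */
class ParallelStagesSketch {

    /** Hypothetical bundle of the three independent lookups. */
    record RequestContext(String userContext, String queryEmbedding, String cachedAnswer) {}

    /** Starts all three stages at once, then waits for all of them. */
    static RequestContext gather(Supplier<String> fetchUserContext,
                                 Supplier<String> embedQuery,
                                 Supplier<String> checkCache) {
        CompletableFuture<String> user = CompletableFuture.supplyAsync(fetchUserContext);
        CompletableFuture<String> embedding = CompletableFuture.supplyAsync(embedQuery);
        CompletableFuture<String> cache = CompletableFuture.supplyAsync(checkCache);

        // allOf completes once every stage has finished; the joins below then return immediately.
        CompletableFuture.allOf(user, embedding, cache).join();
        return new RequestContext(user.join(), embedding.join(), cache.join());
    }

    /** Returns the fallback if the call has not completed within the timeout. */
    static String callWithFallback(Supplier<String> slowCall, long timeoutMillis, String fallback) {
        return CompletableFuture.supplyAsync(slowCall)
                .completeOnTimeout(fallback, timeoutMillis, TimeUnit.MILLISECONDS)
                .join();
    }
}
```

With sequential calls the latency is the sum of the three stages; with `gather` it is roughly the slowest one, which is why profiling first tells you whether this step is worth doing.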
22
Prompt A/B Testing & Architecture Decisions
Days 75–76 · Treat prompts like code
Eval
DAYS 75–76
Experiment Infrastructure, RAG vs Long Context vs Fine-tune, Patterns
  • Extend prompt library: traffic split A/B, metric tracking, statistical significance (≥200 samples per variant)
  • 2026 decision framework: few-shot first → long context (1M) if corpus <500k tokens → RAG for larger/dynamic corpora → fine-tune only for style/format at high volume. Never fine-tune for knowledge.
  • Architecture patterns: Chain / Router / Evaluator-Optimizer / Parallelisation / Orchestrator-Subagent — know when each is right
  • Read: Effective Agents workflow patterns — the canonical reference, remains current
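A sample count alone does not establish significance, so the A/B bullet above needs a test statistic. The sketch below assumes a binary success metric per response (for example, pass/fail from an eval rubric) and uses a standard two-proportion z-test. `PromptAbTestSketch` and its method names are hypothetical, and the hash-based split is one simple way to keep a user on the same variant.

```java
/** Sketch: deterministic traffic split plus a two-proportion z-test for prompt variants. */
class PromptAbTestSketch {

    /** 50/50 split keyed on user ID, so the same user always sees the same variant. */
    static String assignVariant(String userId) {
        return Math.floorMod(userId.hashCode(), 2) == 0 ? "A" : "B";
    }

    /**
     * z statistic for the difference in success rate between B and A.
     * Undefined (NaN) when every sample succeeded or every sample failed.
     */
    static double zScore(int successA, int totalA, int successB, int totalB) {
        double rateA = (double) successA / totalA;
        double rateB = (double) successB / totalB;
        double pooled = (double) (successA + successB) / (totalA + totalB);
        double standardError = Math.sqrt(pooled * (1 - pooled) * (1.0 / totalA + 1.0 / totalB));
        return (rateB - rateA) / standardError;
    }

    /** Two-sided test at the 5% level. */
    static boolean isSignificant(double z) {
        return Math.abs(z) > 1.96;
    }
}
```

For example, 140/200 successes on A against 164/200 on B gives z of roughly 2.81, which clears the 1.96 threshold; 140/200 against 146/200 does not, even though both variants have the 200 samples the bullet asks for.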
23
Capstone: Your AI Product
Days 77–84 · Everything comes together
Build
DAYS 77–84
Real problem · Full stack · All phases integrated · Deploy · Document
  • Choose a real problem: Java code reviewer, legal doc analyser, customer support agent, competitive intelligence, enterprise knowledge base
  • Must use from Phase 0: MCP server (2025-06-18 spec), tool annotations, Spring AI 1.1 or LangChain4j 1.x
  • Must use from Phase 1: adaptive thinking (not budget_tokens), 1-hour prompt cache, versioned prompts
  • Must use from Phase 2: contextual retrieval + hybrid search, RAGAS evals in CI
  • Must use from Phase 3: ≥2 agents with parallel execution, all 6 memory types, HITL for at least 1 action
  • Must use from Phase 4: smart model routing, cost tracking, latency SLAs, 1 completed A/B test
  • Deliverables: ADR, README, 5-min Loom demo, GitHub Actions CI/CD with RAGAS gate, live URL, Langfuse dashboard screenshot

// Phase 4 Final Checkpoint

Smart routing reduces average cost vs single model — with evidence from Langfuse traces
System meets p95 <3s SLA — proven with production traces
Capstone integrates RAG + agents + memory + HITL — deployed publicly with live URL
Eval suite caught at least one regression before it reached production
Can whiteboard: Claude 4 model selection, MCP 2025-06-18 architecture, adaptive thinking, RAG pipeline, multi-agent orchestration
// Interview Prep — 2026 AI Engineering Roles

AI engineering interviews in 2026 are 60%+ GenAI-focused. Based on real 2026 interview loops. Eval methodology is the new system design. Questions about adaptive thinking, MCP, GraphRAG, and A2A are now common. Answers must include projects from this plan.

🌍 Real-World Examples — Reference in Interviews
Code Assistant · Cursor (Anysphere)
$9.9B valuation (Series C, 2025), team under 100. Composer model (built in-house by Cursor). Cloud Agents run 8 parallel instances with git worktrees. BugBot auto-reviews PRs. Context assembly is the core product.
Tags: Parallel agents · MCP · Cloud execution · Smart routing
Lesson: The quality of context assembly is more valuable than model choice. What goes in the prompt determines everything.
Enterprise Search · Glean
RAG over Slack, Jira, email, docs. Permission-aware retrieval (show only what you can access). Hybrid search across all sources. Production since 2024.
Tags: Hybrid RAG · Permission filtering · Multi-source · 2026 standard
Lesson: Permission filtering must happen at the retrieval layer, not post-retrieval. Security can't be bolted on.
Customer Support · Intercom Fin
Handles 50%+ tickets autonomously. RAG over help docs. Escalates when confidence is low. HITL for sensitive issues. Model: knows when NOT to answer.
Tags: Agentic RAG · Confidence scoring · HITL · Escalation
Lesson: Knowing when NOT to answer is as important as answering well. Confidence thresholds and graceful escalation are the key design decisions.
Legal AI · Harvey
LLM for legal docs, contract review, research. Citation grounding is non-negotiable. Extended/adaptive thinking for complex legal analysis. Hallucination detection is the primary quality metric.
Tags: Citation grounding · Adaptive thinking · GraphRAG for precedents
Lesson: In high-stakes domains, hallucination detection > answer quality. Build it first, always.
AI Search · Perplexity
Real-time RAG: query → web search → rerank → LLM answer with citations. Sub-second perceived latency via streaming. Freshness vs accuracy tradeoff managed by caching strategy.
Tags: Real-time RAG · Reranking · Streaming · Citation grounding
Lesson: Streaming hides latency. Real-time retrieval is slow; caching stales. Manage the tradeoff explicitly.
Enterprise AI · Microsoft Copilot for M365
MCP-based tool connectivity to Office apps. GraphRAG for organisational knowledge. A2A protocol for cross-agent coordination. 1M context for large documents.
Tags: MCP 2025-06-18 · GraphRAG · A2A · 1M context
Lesson: MCP is now the enterprise integration layer. Building proprietary connectors is the wrong choice in 2026.
💡 Technical Questions — Current (2026)
Technical · Claude 4
What is adaptive thinking and how does it differ from the old budget_tokens approach?
Core Answer

budget_tokens (deprecated on 4.6 models): you specify a fixed thinking-token budget, and Claude can spend up to that many thinking tokens on any query regardless of its complexity. This gives fine-grained control but is over-specified: simple queries can still consume budget.

Adaptive thinking (Sonnet 4.6, Opus 4.6+): you set an effort level (low/medium/high/max/xhigh). Claude decides when and how much to think based on query complexity. On simple queries, it may skip thinking entirely. On complex multi-step problems, it thinks extensively. In internal evaluations, adaptive thinking outperforms fixed budget_tokens.

Key addition — interleaved thinking: automatically enabled in adaptive mode. Claude can think between tool calls, not just before the first one. Critical for complex agentic workflows.

Your Real Example
Project: "I migrated my agent from budget_tokens: 8000 to adaptive thinking with effort: high. Cost dropped 30% because simple routing queries no longer consumed thinking budget, and complex architecture queries got more thinking than 8k tokens allowed. The interleaved thinking also resolved a bug where the agent would call the wrong tool after a failed tool result — it now reasons about the result before deciding the next step."
Technical · RAG
With 1M token context windows available, when would you still choose RAG over stuffing the context?
The 2026 Answer (updated — don't give the old answer)

Long context windows changed this calculation. For corpora under ~500k tokens that don't change frequently, injecting directly into a 1M context is now viable and often simpler than a RAG pipeline.

Still choose RAG when:
  • Corpus exceeds 500k–1M tokens — physically can't fit, or cost is prohibitive ($0.15–3 per request at current pricing)
  • Knowledge changes frequently — re-injecting a full corpus each request is expensive and wasteful
  • Citation grounding required — RAG retrieves traceable sources; long context doesn't
  • Latency matters — loading 1M tokens takes time; RAG retrieves relevant subset in milliseconds
  • "Needle in a haystack" accuracy — even 1M context has position effects. For precise fact recall in large corpora, retrieval is more reliable than long context
  • Multi-user systems — each user has a different relevant subset. RAG is personalised; shared long context isn't
2026 decision rule: "I evaluate corpus size, update frequency, citation needs, and latency budget. For our enterprise search product, the corpus is 5M documents and changes hourly — RAG is the only viable choice. For our internal FAQ bot with 200 documents that rarely change, I switched to long context and retired the RAG pipeline."
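The criteria above can be written down as an explicit rule, which is useful in design reviews because it forces each input to be stated. This is a minimal sketch (Java 17+ assumed): the 500k-token threshold comes from the text, while `RetrievalStrategySketch`, `Corpus`, and the boolean flags are hypothetical simplifications of what would really be measured values.

```java
/** Sketch: the long-context vs RAG decision as an explicit, testable rule. */
class RetrievalStrategySketch {

    enum Strategy { LONG_CONTEXT, RAG }

    /** Hypothetical summary of the workload; real inputs would be measured, not flags. */
    record Corpus(long tokens,
                  boolean changesFrequently,
                  boolean needsCitations,
                  boolean latencySensitive,
                  boolean perUserSubsets) {}

    /** Approximate threshold quoted in the text for direct injection into a 1M window. */
    static final long LONG_CONTEXT_LIMIT_TOKENS = 500_000;

    /** RAG wins if any one of the listed conditions holds; otherwise long context is simpler. */
    static Strategy choose(Corpus corpus) {
        boolean ragRequired = corpus.tokens() > LONG_CONTEXT_LIMIT_TOKENS
                || corpus.changesFrequently()
                || corpus.needsCitations()
                || corpus.latencySensitive()
                || corpus.perUserSubsets();
        return ragRequired ? Strategy.RAG : Strategy.LONG_CONTEXT;
    }
}
```

Treating every condition as a hard trigger is a deliberate simplification; in practice some of them (latency, citations) are trade-offs to weigh rather than switches.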
Technical · MCP
What changed in MCP between the 2024 spec and 2025-06-18? Why does it matter?
Key Changes
  • Transport: HTTP+SSE → Streamable HTTP. More proxy-friendly, used in enterprise environments
  • Auth: no standard auth in the 2024 spec → an OAuth 2.1-based authorization framework. Enterprise-grade security built into the protocol
  • Tool annotations: tools can now declare read-only, destructive, etc. Claude uses these for safer autonomous decisions
  • Structured tool outputs: richer return types, not just strings
  • Server-initiated interactions: servers can now prompt users for input mid-execution
  • Donated to Linux Foundation (Dec 2025): vendor-neutral governance. Long-term stability guaranteed.
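One way a client can use tool annotations is as an input to its approval policy. The sketch below (Java 17+ assumed) mirrors two of the spec's hint fields, `readOnlyHint` and `destructiveHint`; the `ToolAnnotationGateSketch` class, the `Decision` enum, and the policy itself are hypothetical. The spec treats annotations as hints, so the sketch only trusts them when the server itself is trusted.

```java
/** Sketch: gate autonomous tool execution on MCP tool annotation hints. */
class ToolAnnotationGateSketch {

    /** Mirrors two hint fields from MCP tool annotations; hints are advisory, not guarantees. */
    record ToolAnnotations(boolean readOnlyHint, boolean destructiveHint) {}

    enum Decision { AUTO_RUN, ASK_HUMAN }

    /**
     * Hypothetical conservative policy: only read-only, non-destructive tools
     * from a trusted server run without a human in the loop.
     */
    static Decision decide(ToolAnnotations annotations, boolean serverTrusted) {
        boolean safe = serverTrusted
                && annotations.readOnlyHint()
                && !annotations.destructiveHint();
        return safe ? Decision.AUTO_RUN : Decision.ASK_HUMAN;
    }
}
```

This is the same read-only vs write split the capstone asks for, expressed as a single decision point that can be unit tested.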
Why it matters for interviews

MCP is now the de facto standard — adopted by OpenAI, Google, Microsoft, AWS. 10,000+ public servers, 97M monthly downloads. Any senior AI engineer in 2026 is expected to know MCP deeply, including protocol evolution. Showing you know the spec version history signals you're working with real production systems.

Technical · Agents
How would you evaluate an agent that calls four tools in a loop? What metrics matter?
Eval Framework for Agentic Systems

This is the #1 thing interviewers ask about in 2026 — "eval methodology is the new system design." Most candidates can describe building an agent; fewer can describe evaluating one rigorously.

  • Task completion rate: did the agent achieve the stated goal? Binary for simple tasks, rubric-scored for complex ones.
  • Trajectory efficiency: steps taken / theoretical minimum steps. An agent that achieves a goal in 12 steps when 4 were needed is a problem.
  • Tool call correctness: was the right tool called with the right arguments? Log and score each call.
  • Loop detection: does it detect when it's stuck? Does it call the same tool repeatedly with the same arguments?
  • Graceful degradation: when a tool fails, does it recover or cascade-fail?
  • Token efficiency: total tokens used per task completed. Cost per successful completion.
Your Real Example
Project: "I built a research agent that calls 4 tools: web_search, pdf_reader, summariser, fact_checker. I ran 20 research tasks and scored: task completion (17/20 = 85%), trajectory efficiency (avg 1.3x optimal), tool correctness (94%). The worst failures were empty web_search results causing infinite retry loops — I fixed by adding a max-retries constraint and a 'conclude with partial results' fallback."
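Three of the metrics above (task completion rate, trajectory efficiency, loop detection) are simple enough to compute directly from logged tool calls. The sketch below assumes Java 17+ and a hypothetical log shape: `AgentEvalSketch`, `ToolCall`, and `TaskRun` are illustrative names, and arguments are compared as raw strings, which a real harness would normalise first.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch: agent eval metrics computed from logged tool calls. */
class AgentEvalSketch {

    /** One logged call; arguments kept as a raw string for simple equality checks. */
    record ToolCall(String tool, String arguments) {}

    /** One eval task: outcome, hand-labelled minimum step count, and the call trajectory. */
    record TaskRun(boolean completed, int optimalSteps, List<ToolCall> calls) {}

    /** Fraction of tasks where the agent reached the stated goal. */
    static double completionRate(List<TaskRun> runs) {
        long done = runs.stream().filter(TaskRun::completed).count();
        return (double) done / runs.size();
    }

    /** Steps taken divided by the theoretical minimum; 1.0 is optimal. */
    static double trajectoryEfficiency(TaskRun run) {
        return (double) run.calls().size() / run.optimalSteps();
    }

    /** Flags a run that issues an identical (tool, arguments) call more than once. */
    static boolean hasRepeatedCall(TaskRun run) {
        Set<ToolCall> seen = new HashSet<>();
        for (ToolCall call : run.calls()) {
            if (!seen.add(call)) {
                return true;
            }
        }
        return false;
    }
}
```

The repeated-call check is the cheapest signal for the infinite-retry failure described in the example: an empty `web_search` result retried with unchanged arguments trips it on the second call.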
Technical · Eval
What is contextual retrieval and why did it reduce RAG failures by 49%?
The Technique

Contextual retrieval (Anthropic, 2024) prepends a short context sentence to each chunk before embedding it. Without this, a chunk like "Revenue increased by 3% year-over-year" has no context when retrieved. With contextual retrieval, it becomes: "This is from the Q3 2024 earnings section of Apple's annual report. Revenue increased by 3% year-over-year."

The embedding of the contextualised chunk is much richer: it captures both the content AND its place in the document. Combined with BM25 hybrid search, Anthropic reported a 49% reduction in retrieval failures.

Implementation Detail

For each chunk, call Claude with: the full document + the chunk + "Please give a short context sentence (max 2 sentences) explaining where this chunk fits in the document." Prepend that to the chunk before embedding. This adds one LLM call per chunk at indexing time, not at query time — a one-time cost.
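The indexing step described above reduces to a small loop. In this sketch (Java 17+ assumed) the LLM call is abstracted behind a `BiFunction` taking the full document and the chunk, so the code runs without any API client; `ContextualChunkerSketch` and `contextualise` are hypothetical names, and in a real pipeline the function would wrap a Claude call with the prompt quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

/** Sketch: prepend generated context to each chunk before it is embedded and indexed. */
class ContextualChunkerSketch {

    /**
     * @param document         the full source document
     * @param chunks           the chunks produced by the splitter
     * @param contextGenerator (document, chunk) -> short context; an LLM call in production
     * @return chunks with their context prepended, ready for embedding and BM25 indexing
     */
    static List<String> contextualise(String document,
                                      List<String> chunks,
                                      BiFunction<String, String, String> contextGenerator) {
        List<String> contextualised = new ArrayList<>();
        for (String chunk : chunks) {
            String context = contextGenerator.apply(document, chunk).strip();
            contextualised.add(context + " " + chunk);
        }
        return contextualised;
    }
}
```

Because the generator runs once per chunk at indexing time, its cost scales with corpus size and re-index frequency, not with query volume.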

Your example: "I implemented contextual retrieval on our product documentation RAG system. Precision improved from 0.61 to 0.79. The biggest gains were on vague queries — users asking things like 'how does billing work' where the chunk alone gave no context about which product feature it referred to."
Technical · Java
Spring AI 1.1 or LangChain4j 1.x — which would you choose and why?
The 2026 Answer

Both hit 1.0 GA in May 2025. Both are production-ready, both support MCP, RAG, tool calling, chat memory, and 20+ LLM providers. The choice is architectural, not capability-driven.

Choose Spring AI 1.1 if:
  • Your team is on Spring Boot — seamless autoconfiguration, familiar patterns
  • You need Micrometer observability integration out of the box
  • You want the Advisors API for standardised RAG and chat patterns
Choose LangChain4j 1.x if:
  • Running Quarkus, Micronaut, Helidon, or plain Java (not Spring Boot)
  • Need GraalVM native image (100ms start, 50MB RAM vs Spring's ~300MB)
  • Want more LLM provider coverage (25+ vs Spring AI's 20+)
  • Need A2A protocol support or advanced agentic modules
Your answer: "We chose Spring AI 1.1 for our order-management AI assistant because we're a Spring Boot shop — the autoconfiguration and Micrometer integration let us add Claude capability to an existing service in half a day. For a separate serverless classification service where cold start matters, I used LangChain4j on Quarkus with GraalVM — 80ms cold start instead of 8 seconds."
🏗️ System Design — 2026 Questions
System Design
Design an enterprise document Q&A system for a legal firm with 100,000 changing documents.
Key Design Decisions (2026)
  • Don't use long context: 100k docs far exceed a 1M-token window, so RAG is required.
  • GraphRAG for legal reasoning: legal documents reference each other (precedents, statutes). Graph traversal retrieves related cases that vector search misses.
  • Contextual retrieval + BM25 hybrid: legal docs have precise terminology (exact clause names, statute numbers) that BM25 handles better than semantic alone.
  • Permission-aware retrieval: lawyer A cannot see client B's documents. Row-level security in pgvector — filter at retrieval, not after.
  • Citation grounding + adaptive thinking: every claim must cite source. Use adaptive thinking with effort: high for complex legal analysis.
  • Audit trail: every query + what was retrieved + what was answered. Compliance requirement in legal.
  • MCP server for internal APIs: expose document management, matter lookup, billing codes as MCP tools with tool annotations (read-only vs write).
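The "filter at retrieval, not after" decision above is easiest to see in code. This in-memory sketch (Java 17+ assumed) applies the permission filter before ranking, so a chunk from another client's matter can never occupy a top-k slot. `PermissionAwareSearchSketch`, `Chunk`, and `matterId` are hypothetical; in the pgvector design the same filter would be a row-level security policy or a WHERE clause in the similarity query.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Set;

/** Sketch: permission filter applied before similarity ranking. */
class PermissionAwareSearchSketch {

    /** Hypothetical indexed chunk tagged with the legal matter it belongs to. */
    record Chunk(String id, String matterId, double[] embedding) {}

    /** Cosine similarity of two equal-length, non-zero vectors. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the topK most similar chunks among those the caller may see. */
    static List<Chunk> search(List<Chunk> index, Set<String> allowedMatters,
                              double[] query, int topK) {
        return index.stream()
                .filter(chunk -> allowedMatters.contains(chunk.matterId())) // permissions first
                .sorted(Comparator.comparingDouble(
                        (Chunk chunk) -> cosine(chunk.embedding(), query)).reversed())
                .limit(topK)
                .toList();
    }
}
```

Filtering after ranking has two problems this ordering avoids: restricted chunks can crowd out permitted ones so the caller gets fewer than k results, and restricted content is loaded into application memory before the check runs.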
System Design
How would you reduce your LLM application's p95 latency from 8 seconds to 2 seconds?
Systematic Approach — profile first, then act
  • Step 1 — Profile with Langfuse: find where time actually goes. Often the bottleneck is embedding or retrieval, not the LLM call.
  • Step 2 — Parallelise independent work: embed query + check cache + fetch user context simultaneously with CompletableFuture.allOf(). Easy 40–60% reduction.
  • Step 3 — Opus 4.8 Fast mode: if you need Opus quality, Fast mode runs at 2.5× speed at $10/$50 per 1M tokens — 3× cheaper than previous Opus Fast.
  • Step 4 — Smart model routing: Haiku 4.5 for classification (10× faster than Sonnet), Sonnet 4.6 for generation, Opus only for complex architecture decisions.
  • Step 5 — Prompt caching (1-hour TTL): large system prompts cached, TTFT drops significantly.
  • Step 6 — Stream everything: p95 wall-clock unchanged but perceived latency drops dramatically.
  • Step 7 — Contextual compression: extract only the relevant sentences from retrieved chunks before sending to LLM. Reduces input tokens.
Your example: "Profiling showed 3.2s of 8s was embedding + retrieval running sequentially. After CompletableFuture parallelisation: 0.9s. Switched classification from Sonnet to Haiku: 60% latency reduction on simple queries. Final p95: 1.8s."
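To claim a p95 number you need an agreed way to compute it from traces. The sketch below uses the nearest-rank method on a list of request latencies in milliseconds and checks the SLA targets from the latency module (p50 <1s, p95 <3s, p99 <8s). `LatencySlaSketch` is a hypothetical name; observability tools such as Langfuse report percentiles directly and may use a different interpolation.

```java
import java.util.Arrays;

/** Sketch: nearest-rank percentiles and an SLA check over recorded latencies. */
class LatencySlaSketch {

    /**
     * Nearest-rank percentile: the smallest sample such that at least p percent
     * of samples are less than or equal to it. Requires a non-empty array.
     */
    static long percentile(long[] latenciesMillis, int p) {
        long[] sorted = latenciesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Targets from the latency module: p50 < 1s, p95 < 3s, p99 < 8s. */
    static boolean meetsSla(long[] latenciesMillis) {
        return percentile(latenciesMillis, 50) < 1_000
                && percentile(latenciesMillis, 95) < 3_000
                && percentile(latenciesMillis, 99) < 8_000;
    }
}
```

With small sample counts the tail percentiles are dominated by single requests (with three samples, p95 is simply the slowest one), so alerting on p99 only makes sense once each window holds enough traffic.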
👔 STAR Behavioral Questions
Behavioral · STAR
Tell me about a time you improved the quality of an AI system. How did you measure it?
Framework
SITUATION
Our Document Q&A system had faithfulness of 2.4/5; users reported answers with claims not in the source documents.
TASK
Improve faithfulness to ≥3.8/5 without degrading relevance or exceeding 3s p95 latency.
ACTION
Ran RAGAS eval against 50-question golden set to find root cause. Two issues: (1) plain chunking was splitting mid-sentence, causing Claude to bridge with invented facts. (2) vague queries were retrieving irrelevant chunks. Fix: semantic sentence-boundary chunking + contextual retrieval (prepend document context to each chunk). Added citation grounding requirement to system prompt.
RESULT
Faithfulness 2.4 → 3.9. Context precision 0.61 → 0.84. RAGAS now runs on every PR and caught two subsequent regressions before production.
Behavioral · Opinion
What's the hardest part of building AI systems in 2026? What changed from 2024?
Strong Answer (2026-specific)
  • Evaluation complexity grew: in 2024, RAG eval was novel. In 2026, you're expected to have LLM-as-judge, golden sets, span-level eval, and regression gating in CI. The bar moved.
  • Model selection is harder: in 2024 there was one good model. In 2026 you choose between Opus 4.8, Sonnet 4.6, Haiku 4.5, and Opus 4.8 Fast mode, and a wrong choice is now expensive, not just suboptimal.
  • Context vs RAG decision: 1M token windows created a new architecture choice that didn't exist. You have to evaluate this correctly or you're over-engineering.
  • Long-horizon agents are in production: in 2024 agents were demos. In 2026, if your agent fails after 45 minutes of work, that's a production incident. Checkpointing, resumability, cost limits — table stakes now.
  • What didn't change: non-determinism is still the hardest thing to test. Prompt brittleness is still real. Hallucinations in high-stakes domains still require the same defence-in-depth.
🎯 Quick-Fire Q&A
Quick-Fire
What is the difference between adaptive and extended thinking?

Extended thinking (older): you set budget_tokens — a hard cap on how many thinking tokens Claude uses. Claude uses up to the budget regardless of whether the problem needs it.

Adaptive thinking (current, Sonnet 4.6 / Opus 4.6+): you set an effort level. Claude decides whether and how much to think. On simple queries it may not think at all. On complex ones it thinks as much as needed. Interleaved thinking (thinking between tool calls) is automatically enabled. Outperforms fixed budget in Anthropic's internal evaluations.

Quick-Fire
What is contextual retrieval?

A technique from Anthropic (2024) that prepends a short context sentence to each document chunk before embedding it. "This chunk discusses the cancellation policy from the refund section of the terms of service." The embedding is richer, retrieval is more accurate. Combined with BM25 hybrid search, reported 49% fewer retrieval failures. The contextualisation uses Claude at indexing time — a one-time cost, not per-query.

Quick-Fire
What is GraphRAG and when would you use it?

GraphRAG (Microsoft, 2024) indexes document content as a knowledge graph. Retrieval traverses graph relationships, not just vector similarity. Handles multi-hop questions that vector search misses: "What connected Person A to Event B via Organisation C?" Use for: legal research (connecting related statutes/cases), medical literature (drug interactions, treatment chains), enterprise knowledge (org charts, project dependencies). Neo4j + LangChain4j provide a Java implementation path. It is more expensive to build and maintain than vector RAG, so use it only when multi-hop reasoning is required.
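The multi-hop idea can be illustrated with a plain breadth-first traversal over an adjacency map (Java 17+ assumed). This is only the traversal step, not Microsoft's GraphRAG pipeline, which also involves entity extraction and graph summarisation; `GraphHopSketch` and the example entity names are hypothetical, and a real system would run the equivalent query in a graph database such as Neo4j.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch: collect entities within a fixed number of hops of a starting entity. */
class GraphHopSketch {

    /** Breadth-first expansion up to maxHops edges; the start node is excluded from the result. */
    static Set<String> neighbours(Map<String, List<String>> graph, String start, int maxHops) {
        Set<String> visited = new LinkedHashSet<>();
        visited.add(start);
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(start);

        for (int hop = 0; hop < maxHops && !frontier.isEmpty(); hop++) {
            Deque<String> next = new ArrayDeque<>();
            for (String node : frontier) {
                for (String neighbour : graph.getOrDefault(node, List.of())) {
                    if (visited.add(neighbour)) { // true only the first time we see this node
                        next.add(neighbour);
                    }
                }
            }
            frontier = next;
        }
        visited.remove(start);
        return visited;
    }
}
```

For the question quoted above, a one-hop expansion from Person A reaches only Organisation C; Event B appears at two hops, which is the connection a single vector lookup on Person A tends to miss.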

Quick-Fire
What changed in the MCP protocol in 2025?

March 2025 (2025-03-26): OAuth 2.1-based authorization, Streamable HTTP transport (replaced HTTP+SSE), tool annotations (read-only, destructive), JSON-RPC batching (later removed). June 2025 (2025-06-18): structured tool outputs, enhanced OAuth, server-initiated user interactions (elicitation), JSON-RPC batching removed. December 2025: donated to the Linux Foundation for vendor-neutral governance. Status June 2026: 97M monthly downloads, 10,000+ servers, adopted by OpenAI, Google, Microsoft, AWS. De facto industry standard.

Quick-Fire
Which Java AI framework do you use and why?

Both Spring AI 1.1 and LangChain4j 1.x are production-ready as of May 2025. Spring AI 1.1 — best for Spring Boot teams. Autoconfiguration, Micrometer observability, Advisors API, tight Boot integration. LangChain4j 1.x — best for Quarkus/non-Spring teams, GraalVM native image (100ms start, 50MB RAM), more LLM provider coverage (25+), A2A protocol support in 1.x agentic module. For new projects on Spring Boot: Spring AI. For existing Quarkus services or serverless: LangChain4j.

Quick-Fire
Why was the Cursor Notepads feature deprecated?

Notepads were deprecated in late 2025 and replaced by Cursor Memories. Notepads required manual creation and maintenance. Memories is a persistent knowledge base that the AI maintains automatically across sessions — it extracts and stores conventions, patterns, and preferences from your conversations without you having to curate them. The result: more up-to-date context with less developer overhead. Team Memories can be shared via deeplinks (Cursor 2.0). Rules (.cursor/rules/*.mdc files) still exist for static, always-on constraints — Memories handles dynamic, evolving project knowledge.

// 2026 Interview Readiness Checklist

I can explain adaptive thinking vs budget_tokens and why the change was made
I can explain contextual retrieval and its 49% failure reduction result
I have a clear opinion on when 1M context replaces RAG vs when RAG is still required
I can explain MCP 2025-06-18 changes and why the spec evolution matters
I have a STAR story with metrics for: improving RAG quality, debugging an agent failure, reducing cost
I can explain the Spring AI vs LangChain4j choice with concrete architectural reasoning
I've done ≥2 mock interviews: "Interview me for a senior AI engineering role in 2026"
My GitHub shows 5+ AI projects built with current tools (Spring AI 1.1 / LangChain4j 1.x / MCP 2025)