Internals: Transformers, Claude 4, Cursor 3.x, MCP 2025
The mental models that change how you build. Transformers, context windows, Cursor's 2026 feature set (Cloud Agents, BugBot, Memories), MCP 2025-06-18 spec, and the full Claude 4 model lineup.
- Watch: 3Blue1Brown — But what is a GPT? — best visual explanation of attention
- Watch Part 2: 3Blue1Brown — Attention in transformers
- O(n²) cost: every token attends to every other. Doubling context length quadruples compute. This is why 1M-token windows are expensive.
- "Lost in the middle" (2023, still applies in 2026): content at the start and end of a prompt gets more attention weight. Content buried in the middle is systematically under-attended. Put critical instructions FIRST.
- Read: Lost in the Middle — Stanford 2023 — required reading
- Watch: Andrej Karpathy — Let's build the GPT tokeniser — understand BPE tokenisation from scratch
- Tokens ≠ characters. "JavaScriptDeveloper" = ~4 tokens. Test in OpenAI's tokeniser. Claude uses the same general approach but its own vocabulary, so exact counts will differ.
- Temperature 0: near-deterministic (small run-to-run variation can remain). Use for JSON extraction, tool calls, classification. Temperature 0.7+: creative, varied. Use for brainstorming, writing.
- Read: Claude Model Spec — the public document behind Claude's behaviour. Understanding it changes how you prompt.
| Model | Best for | Context | Price (in/out per 1M) |
|---|---|---|---|
| Opus 4.8 LATEST | Complex agents, long-horizon tasks, multi-step coding, browser use. 4x less likely to let code flaws pass. | 1M tokens | $5 / $25 standard; Fast mode: $10 / $50 at 2.5× speed |
| Sonnet 4.6 DEFAULT | Daily driver. 70% of devs prefer over Sonnet 4.5. Matches Nov 2025 Opus 4.5 quality at Sonnet price. 1M context in beta. | 1M (beta) | $3 / $15 |
| Haiku 4.5 | Classification, routing, simple summarisation, task scheduling. Claude Code smart-routes here automatically. | 200K | $1 / $5 |
- Read: Current Anthropic pricing — verify before budgeting
- Read: Model overview with API names
- 1M tokens ≠ free: at Sonnet 4.6 pricing, 1M input tokens = $3. A 10-turn conversation re-sends everything each turn — graph your costs.
- Prompt caching (1-hour TTL in 2026): Claude 4 supports 1-hour cache duration for long agentic tasks. Previously 5 minutes — this changes long-running agent economics significantly. UPDATED
- Cache with thinking: extended thinking tasks often take >5 min — use the 1-hour cache to maintain hits across multi-step workflows.
- Read: Prompt caching docs — includes 1-hour TTL details
- Build `ContextManager.java` with 4 strategies: full history, sliding window, summarisation, selective inclusion. Benchmark cost vs quality. ☕ Java
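A minimal sketch of why re-sending history each turn matters, comparing the full-history and sliding-window strategies. It assumes every turn adds the same number of tokens; the class and method names are hypothetical, and the price passed in is whatever the pricing page currently lists.

```java
/** Hypothetical helper: estimates input tokens billed across a multi-turn conversation. */
final class ConversationCost {

    /** Total input tokens when each turn re-sends every earlier turn (full-history strategy). */
    static long fullHistoryInputTokens(int turns, long tokensPerTurn) {
        long total = 0;
        for (int t = 1; t <= turns; t++) {
            total += t * tokensPerTurn; // turn t carries t turns' worth of tokens
        }
        return total;
    }

    /** Same conversation, but only the most recent `window` turns are sent (sliding-window strategy). */
    static long slidingWindowInputTokens(int turns, long tokensPerTurn, int window) {
        long total = 0;
        for (int t = 1; t <= turns; t++) {
            total += Math.min(t, window) * tokensPerTurn;
        }
        return total;
    }

    /** Converts a token count to USD at a per-million input price. */
    static double costUsd(long inputTokens, double usdPerMillionInput) {
        return inputTokens / 1_000_000.0 * usdPerMillionInput;
    }
}
```

With 10 turns of 2,000 tokens each, full history bills 110,000 input tokens while a 3-turn window bills 54,000. The quality cost of dropping older turns is what the benchmark has to measure.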
- Notepads are deprecated. Replaced by Cursor Memories — a persistent knowledge base the AI maintains automatically across sessions. No manual creation needed. CHANGED
- Cloud Agents (formerly Background Agents): run up to 8 agents in parallel on a single prompt. Each agent operates in its own isolated git worktree or remote machine — no file conflicts. NEW
- BugBot: automatically reviews PRs, identifies potential issues, assists debugging at project level. Runs without being asked. NEW
- Cursor Composer model: Cursor's own first agentic coding model (not an Anthropic model), 4x faster than similarly intelligent models. Default in Agent Mode.
- Browser for Agent (GA): Agent can browse the web inline. Embeddable in-editor with element selection and DOM forwarding.
- Read: Cursor 2.0 changelog and latest changelog
- Rules location changed: `.cursorrules` is deprecated. Current system uses a `.cursor/rules/` folder with `.mdc` files. CHANGED
- Rule types: Always (every request), Auto-attached (when matching files open), Agent-requested (agent decides when relevant), Manual (you add with @)
- Cursor Memories replaces Notepads for cross-session context — seed it with your architecture conventions from the first session
- Team Rules: share custom rules, commands, prompts across the team via Cursor Docs deeplinks NEW
- Read: Cursor Rules docs — current .mdc format
- 97 million monthly SDK downloads, 10,000+ public servers, adopted by OpenAI, Google DeepMind, Microsoft, AWS. MCP is the de facto connectivity standard for agentic AI. 2026 STATUS
- Donated to Linux Foundation (Dec 2025): co-founded with Block and OpenAI, backed by Google, Microsoft, AWS, Cloudflare. Long-term neutrality guaranteed.
- Latest spec: 2025-06-18 — introduces structured tool outputs, enhanced OAuth security, server-initiated user interactions. Removed JSON-RPC batching (added in March spec) to simplify. LATEST SPEC
- Transport change: HTTP+SSE replaced by Streamable HTTP — more robust, proxy-friendly. BREAKING from 2024-11-05
- Tool annotations: tools now declare their behaviour (read-only, destructive) for safer execution
- Spring AI 1.1 + LangChain4j 1.x both support MCP as of late 2025. No custom wiring needed.
- Read: MCP 2025-06-18 spec
- Spring AI 1.1 (Nov 2025): full MCP auto-configuration. Expose `@Bean` methods as MCP tools. NEW
- LangChain4j 1.x: dedicated `langchain4j-mcp` module. Works on Quarkus, Micronaut, not just Spring.
- Build a server with all 3 primitives: Tools (actions), Resources (read-only), Prompts (templates)
- Use the 2025-06-18 transport: Streamable HTTP, not the old SSE transport
- Add tool annotations: mark destructive tools, mark read-only tools
- Security reminder: never trust LLM-supplied arguments. Claude can be prompt-injected via retrieved content. Validate all tool inputs server-side regardless of source.
- Read: Spring AI 1.1 MCP docs
- Read: LangChain4j MCP module
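The security reminder above (validate all tool inputs server-side) can be sketched without any MCP SDK. This is an illustration only: the `delete_document` tool, its argument names, and the validation rules are all hypothetical, and a real server would run a check like this before dispatching to the tool handler.

```java
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical validator for a "delete_document" tool; the rules are illustrative only. */
final class ToolArgumentValidator {
    private static final Pattern DOC_ID = Pattern.compile("[A-Za-z0-9_-]{1,64}");
    private static final Set<String> ALLOWED_KEYS = Set.of("documentId", "reason");

    /** Returns null when the arguments are acceptable, otherwise a rejection message. */
    static String validate(Map<String, Object> args) {
        if (args == null) return "arguments missing";
        // Reject anything the schema does not name, rather than silently ignoring it.
        for (String key : args.keySet()) {
            if (!ALLOWED_KEYS.contains(key)) return "unexpected argument: " + key;
        }
        Object id = args.get("documentId");
        if (!(id instanceof String s) || !DOC_ID.matcher(s).matches()) {
            return "documentId must match " + DOC_ID.pattern();
        }
        Object reason = args.get("reason");
        if (reason != null && !(reason instanceof String r && r.length() <= 200)) {
            return "reason must be a string of at most 200 characters";
        }
        return null;
    }
}
```

The allow-list style means a prompt-injected extra argument or a path-like id is rejected before any side effect happens, regardless of what the model intended.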
| Framework | Version (June 2026) | Choose if… |
|---|---|---|
| Spring AI | 1.1 (Nov 2025) + 2.0-M1 preview | Already on Spring Boot. Advisors API, Micrometer observability, autoconfiguration. Tightest Spring integration. |
| LangChain4j | 1.10.x (monthly releases since May 2025) | Quarkus/Micronaut/Helidon/plain Java. More provider support (25+). GraalVM native image: 100ms start, 50MB RAM. Explicit/modular style. |
| Direct HttpClient | Java 21+ | Simple tasks, no framework lock-in, full control. What we used in this plan. JEP 517 (JDK 26) adds HTTP/3. |
- Read: Spring AI 1.1 reference
- Read: LangChain4j 1.x docs
- Read: Detailed 2026 comparison
// Phase 0 Checkpoint
Advanced Claude 4 — Adaptive Thinking, Prompt Caching, Batch API
Adaptive Thinking (the 2026 replacement for budget_tokens), interleaved thinking with tools, 1-hour prompt cache, systematic prompt engineering, and a self-testing prompt management system.
- Adaptive Thinking (Claude Opus 4.6+, Sonnet 4.6): set `thinking: {type: "adaptive"}`. Claude decides when and how much to think based on query complexity and your effort setting. NEW, replaces budget_tokens
- Effort parameter: `low/medium/high/max/xhigh`. You control intensity, not token count. "high" is the default; Claude almost always thinks at high.
- Interleaved thinking (auto-enabled in adaptive mode): Claude can think between tool calls. Previous limitation: thought once at the start, then called tools without further thinking. Now: think → call tool → think about result → call next tool → think → answer.
- Extended thinking (budget_tokens) is deprecated on Sonnet 4.6 and Opus 4.6. Still functional but prefer adaptive. On Opus 4.5 and older, budget_tokens is still required.
- Summarised thinking: Claude 4 returns a summary of its reasoning, not the full token stream. Full trace available via Anthropic request (for audit/compliance). CHANGED from 3.x
- For Opus 4.8 at max/xhigh effort: set `max_tokens` to 64k minimum. The model needs space for subagents and tool calls.
- Read: Adaptive Thinking docs (current reference)
- Read: Extended Thinking docs — for Sonnet/Haiku 4.5 and older
- `effort: "low"` for simple queries (faster, cheaper), `effort: "high"` for architecture decisions, `effort: "max"` for complex multi-step agents. Benchmark quality vs cost per effort level.
- Cache TTL is now 1 hour for long agentic tasks, up from 5 minutes. This changes the economics of long agent loops significantly. UPDATED 2025
- Use 1-hour cache for: large system prompts in multi-step agents, RAG context reused across turns, few-shot examples referenced throughout a session
- Thinking + cache: changes to thinking parameters (enabled/disabled or budget changes) invalidate cache breakpoints. Interleaved thinking amplifies this. Plan cache strategy before enabling adaptive thinking.
- Pattern 1 — Static system prompt: large instructions cached, only user message varies. Best ROI.
- Pattern 2 — RAG context caching: when retrieved docs are reused across turns in same session, cache them. Significant savings on document Q&A.
- Pattern 3 — Few-shot bank: cache example bank, vary only the query.
- Build a cache-aware service: log cache_hit rate, cost saved per request. Target >60% hit rate.
- Read: Prompt caching docs — includes 1-hour TTL and thinking interactions
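A sketch of the metrics a cache-aware service might log per request. The record is hypothetical; its three fields are meant to mirror the uncached, cache-write, and cache-read token counts reported in a response's usage block. The write and read multipliers are passed in as assumptions and should be checked against the caching docs.

```java
/** Hypothetical per-request usage snapshot: uncached input, cache-write, and cache-read token counts. */
record CacheUsage(long inputTokens, long cacheCreationTokens, long cacheReadTokens) {

    /** Fraction of prompt tokens served from cache (0.0 when the prompt is empty). */
    double hitRate() {
        long total = inputTokens + cacheCreationTokens + cacheReadTokens;
        return total == 0 ? 0.0 : (double) cacheReadTokens / total;
    }

    /** Input cost in USD, weighting cache writes and reads by the supplied multipliers. */
    double costUsd(double usdPerMillion, double writeMultiplier, double readMultiplier) {
        double weighted = inputTokens
                + cacheCreationTokens * writeMultiplier
                + cacheReadTokens * readMultiplier;
        return weighted / 1_000_000.0 * usdPerMillion;
    }

    /** Savings compared with sending every prompt token uncached. */
    double savedUsd(double usdPerMillion, double writeMultiplier, double readMultiplier) {
        long total = inputTokens + cacheCreationTokens + cacheReadTokens;
        double uncached = total / 1_000_000.0 * usdPerMillion;
        return uncached - costUsd(usdPerMillion, writeMultiplier, readMultiplier);
    }
}
```

For example, `new CacheUsage(1000, 0, 9000).hitRate()` gives a 90% hit rate for that request. Averaging this value over a day of traffic is one way to track the >60% target.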
- Read: Anthropic prompt engineering guide — comprehensive reference
- Rebuild a prompt using full XML structure: `<role>`, `<context>`, `<task>`, `<constraints>`, `<output_format>`
- Adaptive thinking is promptable: if the model over-thinks on simple queries (large system prompts can trigger this), add guidance: "Only use extended reasoning for genuinely complex problems. Respond directly for simple queries."
- Self-critique loop: answer → Claude critiques → Claude revises. 3-turn Java method. Measurably improves output quality.
- Evaluator-optimizer: generate → score (1–5) → if <4, regenerate with critique. Loop max 3x. Use Claude-as-judge.
- Free course: DeepLearning.AI: Prompt Engineering with Anthropic Claude
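The evaluator-optimizer loop above (generate, score 1–5, regenerate with critique, at most 3 attempts) can be sketched with the generator and judge injected as functions, so any LLM client can back them. All names here are hypothetical.

```java
import java.util.function.BiFunction;
import java.util.function.Function;

/** Hypothetical evaluator-optimizer loop; generator and judge are supplied by the caller. */
final class EvaluatorOptimizer {
    record Verdict(int score, String critique) {}
    record Outcome(String answer, int score, int attempts) {}

    static Outcome run(String task,
                       BiFunction<String, String, String> generator, // (task, critique or null) -> answer
                       Function<String, Verdict> judge,               // answer -> score plus critique
                       int passScore,
                       int maxAttempts) {
        String critique = null;
        String best = null;
        int bestScore = Integer.MIN_VALUE;
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            String answer = generator.apply(task, critique);
            Verdict verdict = judge.apply(answer);
            if (verdict.score() > bestScore) { // keep the best attempt, not just the last one
                best = answer;
                bestScore = verdict.score();
            }
            if (verdict.score() >= passScore) break;
            critique = verdict.critique();     // feed the critique into the next attempt
        }
        return new Outcome(best, bestScore, attempts);
    }
}
```

Returning the best-scoring attempt instead of the final one guards against a regeneration that scores lower than an earlier draft.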
- Read: Message Batches API — 50% cheaper, async, up to 100k requests per batch
- Build Java batch processor: `List<String>` → submit batch → poll status → process results ☕ Java
- Convert streaming endpoint to reactive: `Flux<String>` with Spring WebFlux ☕ WebFlux
- Decision rule: sync+streaming for interactive (user is watching). Batch API for document processing, classifiers, bulk analysis, nightly jobs.
- Build a cost calculator: compare real-time vs batch for your specific workload mix
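A starting point for the cost calculator, assuming the 50% batch discount quoted above applies uniformly to input and output tokens. The class name and the notion of a "batch share" are hypothetical.

```java
/** Hypothetical calculator comparing real-time and batch pricing for a workload mix. */
final class BatchCostCalculator {

    /** Cost of one request in USD at the given per-million input and output prices. */
    static double requestCostUsd(long inputTokens, long outputTokens,
                                 double inPerMillion, double outPerMillion) {
        return inputTokens / 1_000_000.0 * inPerMillion
             + outputTokens / 1_000_000.0 * outPerMillion;
    }

    /**
     * Blended cost when `batchShare` (0.0 to 1.0) of requests move to batch.
     * `batchDiscount` is the fractional saving, e.g. 0.5 for 50% cheaper.
     */
    static double blendedCostUsd(long requests, double batchShare,
                                 double perRequestUsd, double batchDiscount) {
        double realtime = requests * (1.0 - batchShare) * perRequestUsd;
        double batch = requests * batchShare * perRequestUsd * (1.0 - batchDiscount);
        return realtime + batch;
    }
}
```

For 10,000 requests of 2,000 input and 500 output tokens at $3 / $15 per 1M, moving 80% to batch lowers the bill from $135 to $81.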
// Phase 1 Checkpoint
Production RAG — 2026 Techniques
Hybrid retrieval, GraphRAG, contextual compression, RAGAS evaluation, hallucination detection. RAG has evolved beyond simple chunk-and-retrieve — this phase covers what production systems actually do in 2026.
- 2026 consensus: semantic chunking with contextual headers outperforms fixed-size chunking. Preserve document structure (sections, headings). Tools like LlamaIndex can do LLM-based compression of retrieved sets.
- Parent-child pattern: index small chunks (precise retrieval), return parent chunk to LLM (full context). Gains 10–20% recall. Still the best single improvement for most pipelines.
- Contextual retrieval (Anthropic, 2024): prepend a sentence of chunk context before embedding — "This chunk is from the section about X in document Y". Reported 49% reduction in retrieval failures when combined with BM25.
- Read: Contextual Retrieval — Anthropic blog
- Implement 3 chunkers in Java + LangChain4j: fixed-size, sentence-aware, semantic. Benchmark on 50 golden questions: recall@5, precision, latency. ☕ LangChain4j
- Hybrid search (BM25 + semantic) with RRF is the baseline for production RAG in 2026. Pure vector-only is no longer sufficient. Implement: add BM25 via pgvector full-text search, merge with RRF `score = Σ 1/(rank_i + 60)`. Target: ≥15% recall improvement over pure semantic.
- GraphRAG (Microsoft, 2024; mainstream in 2025): index documents as a knowledge graph. Retrieve via graph traversal, not just vector similarity. Handles complex multi-hop questions that vector search misses ("what connects A to C via B?"). PRODUCTION in 2025
- GraphRAG use cases: legal research (connecting related cases), medical literature (drug interaction chains), enterprise knowledge (org relationships)
- Neo4j LangChain4j integration available for Java GraphRAG. ☕ Java
- Read: Advanced RAG techniques including GraphRAG — Neo4j, Oct 2025
- Benchmark: hybrid vs vector-only on 50 multi-hop questions in your domain
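The RRF merge from the hybrid-search bullet, `score = Σ 1/(rank_i + 60)`, can be sketched as follows. It assumes each retriever returns an ordered list of document ids with 1-based ranks; the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical reciprocal rank fusion over ranked lists of document ids. */
final class ReciprocalRankFusion {

    /** Fuses rankings; `k` is the smoothing constant (60 in the formula above). */
    static List<String> fuse(List<List<String>> rankings, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                int rank = i + 1; // ranks are 1-based
                scores.merge(ranking.get(i), 1.0 / (k + rank), Double::sum);
            }
        }
        List<String> ids = new ArrayList<>(scores.keySet());
        // Highest fused score first; ties broken by id so the order is deterministic.
        ids.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return ids;
    }
}
```

Because RRF uses ranks only, BM25 scores and cosine similarities never need to be normalised onto a common scale.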
- Reranking: retrieve top 50 with bi-encoder → rerank to top 5 with Cohere Rerank API. Two-stage pipeline is standard. Build `Reranker` interface: Cohere, LLM, passthrough.
- HyDE: generate a hypothetical answer → embed that → retrieve. Dramatically improves vague query retrieval. Read: HyDE paper.
- Query decomposition: complex → multiple single-fact sub-questions → merge. Essential for "What are the tax implications of X given Y and Z?"
- Conversational rewriting: "What about Q3?" → standalone "What was Apple's Q3 2024 revenue?" using history. Required for any multi-turn RAG app.
- Contextual compression (new 2025 pattern): ask Claude to extract only the relevant sentence(s) from each retrieved chunk before passing to the LLM. Reduces noise in the context.
- Read: RAGAS docs — all 4 metrics: Faithfulness, Answer Relevance, Context Precision, Context Recall
- Faithfulness is the most important metric in 2026. LLM hallucination in RAG is the #1 production complaint. Every claim must derive from retrieved context.
- Implement all 4 as JUnit-compatible Java evaluators using Claude-as-judge (Claude grading Claude)
- Span-level evaluation (2025 practice): for multi-step RAG, evaluate each stage independently — retrieval quality separate from generation quality. Maxim AI and similar platforms support this.
- Generate 100 synthetic Q&A pairs from your corpus with Claude. Curate to 50 golden questions. Run full pipeline. Record baseline score. This is your regression suite forever.
- Add to GitHub Actions: PR that drops faithfulness below 3.5/5 fails automatically.
- Few-shot first: always. 3–5 examples in the prompt. Zero infrastructure, fastest to test. If this works, stop here.
- Long context window (1M tokens): with Sonnet 4.6's 1M token context (beta), small-to-medium private knowledge bases can now be injected directly. RAG becomes optional for datasets under ~500k tokens. 2026 CHANGE — re-evaluate your RAG decisions
- RAG when: corpus >500k tokens, knowledge changes frequently, need source citations, multi-document search. Still the production standard for large corpora.
- Fine-tuning when: consistent output format or style across thousands of calls, domain-specific vocabulary, latency is critical. Never fine-tune to add factual knowledge — use RAG. Fine-tuned facts hallucinate.
- New 2025 pattern — "Needle in a haystack" evals: test whether your model can actually find a specific fact in a 500k-token context. Long context windows don't automatically mean accurate recall of buried facts.
// Phase 2 Checkpoint
Agentic Systems — 2026 Patterns
Multi-agent orchestration with interleaved thinking, all 6 memory types, computer use, A2A protocol, human-in-the-loop, and a production research agent with full evaluation suite.
- Read: Building Effective Agents (Anthropic) — still the canonical reference
- Read: Tool definition best practices
- Tool annotations (MCP 2025-03-26+): declare `readOnlyHint: true` for read tools, `destructiveHint: true` for write/delete. Claude uses these to make safer decisions. NEW IN MCP
- Parallel tool calls (Claude 4): Claude can call multiple independent tools simultaneously. Design tools to be idempotent where possible. Build your parallel-safe tool set. Claude 4 feature
- Description is still the most important field — tell Claude WHEN to use the tool, not just what it does. "Use this when you need current stock price data" not just "Gets stock prices."
- Build `ToolResult` record with: success, data, error, requiresConfirmation, isIdempotent ☕ Java
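One possible shape for the `ToolResult` record, using the five fields listed above. The factory methods, the invariants in the compact constructor, and `safeToRetry()` are assumptions added for illustration.

```java
/** Sketch of a ToolResult record; factories and invariants are hypothetical additions. */
record ToolResult(boolean success, String data, String error,
                  boolean requiresConfirmation, boolean isIdempotent) {

    // Compact constructor: reject contradictory combinations early.
    ToolResult {
        if (success && error != null) {
            throw new IllegalArgumentException("successful result cannot carry an error");
        }
        if (!success && (error == null || error.isBlank())) {
            throw new IllegalArgumentException("failed result needs an error message");
        }
    }

    static ToolResult ok(String data, boolean idempotent) {
        return new ToolResult(true, data, null, false, idempotent);
    }

    static ToolResult failed(String error, boolean idempotent) {
        return new ToolResult(false, null, error, false, idempotent);
    }

    /** A failed call is safe to retry automatically only when the tool is idempotent. */
    boolean safeToRetry() {
        return !success && isIdempotent;
    }
}
```

Carrying `isIdempotent` on the result lets the agent loop decide on retries without looking the tool definition up again.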
- 1. In-context (working memory): context window. Fast, temporary, expensive at scale.
- 2. Episodic: past conversation logs. Retrieve when relevant. PostgreSQL with semantic search.
- 3. Semantic: distilled facts in vector store. "User prefers functional Java style." Update via end-of-session consolidation.
- 4. Procedural: how-to knowledge in tool definitions and few-shot examples. Most durable.
- 5. Prompt cache: Anthropic KV cache. 1-hour TTL in 2026. Reduces cost, doesn't change knowledge.
- 6. File-based memory (NEW in Claude 4): when given access to local files, Claude can extract and save key facts, building persistent memory with continuity. More reliable than semantic memory for structured facts. Claude 4 feature
- Read: Generative Agents (Stanford) — importance × recency × relevance scoring formula
- Build unified `MemorySystem.java` with all 6 types, `store()` / `retrieve(query, userId)` interface ☕ Java
- Orchestrator-subagent pattern: orchestrator decomposes task → delegates to specialised subagents → merges results
- Parallel with CompletableFuture: independent subagents run simultaneously. Handle partial failures (1 of 5 fails → continue with 4 results). ☕ Java
- A2A (Agent-to-Agent) protocol (Google, 2025): standardises how agents communicate across systems. LangChain4j 1.x has A2A support in the agentic module. Enables agent teams from different vendors/frameworks to coordinate. 2025 STANDARD
- Reviewer subagents: a reviewer agent can delegate to a test-writer, which can delegate further. Each level keeps its own prompt and model (Cursor 2.0 pattern — applicable to your own systems).
- Checkpointing: persist state after each tool call. Build resume endpoint. Critical for long-running agents.
- Per-job budget limits with Resilience4j. Parallel agents multiply cost — measure it.
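The partial-failure handling described above (1 of 5 subagents fails, continue with 4 results) can be sketched with `CompletableFuture`. Subagents are modelled as plain `Supplier<String>` values, which is an assumption; a real subagent would wrap an LLM call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.Supplier;

/** Hypothetical fan-out helper: runs subagents concurrently and tolerates individual failures. */
final class ParallelSubagents {
    record Results(List<String> succeeded, List<String> failures) {}

    static Results runAll(List<Supplier<String>> subagents) {
        // Start everything first so the subagents actually overlap in time.
        List<CompletableFuture<String>> futures = new ArrayList<>();
        for (Supplier<String> agent : subagents) {
            futures.add(CompletableFuture.supplyAsync(agent));
        }
        List<String> ok = new ArrayList<>();
        List<String> failed = new ArrayList<>();
        for (CompletableFuture<String> future : futures) {
            try {
                ok.add(future.join());
            } catch (CompletionException e) {
                // join() wraps the subagent's exception; record it and keep going.
                Throwable cause = e.getCause() != null ? e.getCause() : e;
                failed.add(String.valueOf(cause.getMessage()));
            }
        }
        return new Results(ok, failed);
    }
}
```

Results come back in submission order, and the orchestrator can decide from `failures()` whether the remaining results are enough to merge.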
- Computer Use (GA in Claude 4): Claude can move a cursor, click, type in browser windows. Opus 4.8 scored 84% on Online-Mind2Web (browser agent SoTA). GA in Claude 4
- OSWorld-Verified (Jul 2025): updated benchmark replacing original OSWorld. Sonnet 4.6 shows dramatic improvement in computer use vs Sonnet 4.5.
- Read: Computer Use API docs
- Prompt injection risk: browser agents reading web content can be hijacked by hidden instructions on visited pages. This is not theoretical — it's actively exploited. Mitigations: sandboxed browser, instruction validation, human approval before any write action.
- Use case focus: web scraping automation, form filling, GUI testing, research agents that browse. Not for actions with financial consequences without human approval.
- Task completion eval: JUnit harness, submit task, capture trace, score with Claude judge
- Trajectory eval: right tools, right order, no wasted steps. Efficiency ratio = steps taken / theoretical minimum (1.0 is ideal; higher means wasted steps)
- Adversarial tests: ambiguous instructions (does it ask?), conflicting results (does it reconcile?), errors (does it recover?), empty tool results (does it loop?)
- Agent eval platforms (2026): Maxim AI now rated #1 for multi-agent eval. Supports simulation, experimentation, and observability for agent teams. 2026 tooling
- Build failure mode catalogue: ≥10 patterns with fix strategies. Document as you discover them.
- Common 2026 failure modes: infinite tool loop on empty results, ignoring tool errors, over-calling the same tool, not using parallel calls when independent
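Two of the checks above, the steps-taken ratio and the repeated-tool-call loop, are simple enough to sketch directly. The trace format (tool name plus serialised arguments) and the threshold are assumptions.

```java
import java.util.List;

/** Hypothetical trajectory checks over a recorded list of tool calls. */
final class TrajectoryEval {
    record ToolCall(String tool, String arguments) {}

    /** Steps taken divided by the theoretical minimum; 1.0 is ideal, larger means wasted steps. */
    static double stepRatio(int stepsTaken, int theoreticalMinimum) {
        if (theoreticalMinimum <= 0) {
            throw new IllegalArgumentException("minimum must be positive");
        }
        return (double) stepsTaken / theoreticalMinimum;
    }

    /** True when the same tool is called with identical arguments `threshold` times in a row. */
    static boolean hasLoop(List<ToolCall> trace, int threshold) {
        int run = 1;
        for (int i = 1; i < trace.size(); i++) {
            run = trace.get(i).equals(trace.get(i - 1)) ? run + 1 : 1;
            if (run >= threshold) return true;
        }
        return threshold <= 1 && !trace.isEmpty();
    }
}
```

This only catches consecutive repeats. An agent that alternates between two calls would need a windowed count instead.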
// Phase 3 Checkpoint
Ship & Scale
Cost engineering with Opus 4.8 Fast mode, latency SLAs, prompt A/B testing, AI system design decisions, and a portfolio capstone that integrates everything from the plan.
- Opus 4.8 Fast mode ($10/$50 per 1M): 2.5× faster than standard, 3× cheaper than previous Opus Fast modes. Use for latency-sensitive Opus-quality work. 2026 PRICING
- Smart model routing (like Claude Code does it): classify query → route Haiku 4.5 ($1/$5) for simple/routine → Sonnet 4.6 ($3/$15) for most work → Opus 4.8 ($5/$25) for complex architecture/long agents. Track quality delta per route.
- Build `CostTracker`: log cost per user/feature/model/day to PostgreSQL ☕ Java
- Per-user budget limits with Resilience4j rate limiter
- Response caching with TTL: near-identical queries return cached response
- Context compression: summarise long conversations before sending. Measure cost saved vs quality lost.
- Use Langfuse Java SDK for cost + latency observability
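A sketch of the routing idea above, using the per-1M prices quoted in this plan. The `Complexity` classification is assumed to come from an upstream classifier that is not shown, and the enum names are hypothetical labels, not API model ids.

```java
/** Hypothetical router: maps a pre-classified complexity to a model tier and estimates cost. */
final class ModelRouter {
    enum Complexity { SIMPLE, STANDARD, COMPLEX }

    enum Tier {
        HAIKU(1.0, 5.0), SONNET(3.0, 15.0), OPUS(5.0, 25.0);

        final double inPerMillion;
        final double outPerMillion;

        Tier(double inPerMillion, double outPerMillion) {
            this.inPerMillion = inPerMillion;
            this.outPerMillion = outPerMillion;
        }

        /** Cost in USD of one call at this tier's prices. */
        double costUsd(long inputTokens, long outputTokens) {
            return inputTokens / 1_000_000.0 * inPerMillion
                 + outputTokens / 1_000_000.0 * outPerMillion;
        }
    }

    static Tier route(Complexity complexity) {
        return switch (complexity) {
            case SIMPLE -> Tier.HAIKU;     // classification, routing, simple summaries
            case STANDARD -> Tier.SONNET;  // most day-to-day work
            case COMPLEX -> Tier.OPUS;     // architecture decisions, long agents
        };
    }
}
```

Logging `route(...)` alongside `costUsd(...)` per request gives the per-model rows a cost tracker needs, and makes the quality delta per route measurable.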
- Profile every stage with Langfuse: embedding → search → rerank → LLM → parse. The bottleneck is usually not where you expect.
- Parallelise independent operations: fetch user context + embed query + check cache simultaneously with `CompletableFuture.allOf()`
- Set SLAs: p50 <1s, p95 <3s, p99 <8s. Alert on breach.
- Circuit breaker: Claude API latency >10s → fail fast with cached/fallback response (Resilience4j)
- Stream every response — perceived latency drops even when total latency is unchanged
- Opus 4.8 Fast mode is the answer for latency-sensitive Opus-quality work — not a different architecture.
- Extend prompt library: traffic split A/B, metric tracking, statistical significance (≥200 samples per variant)
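For the statistical-significance step, one common choice is a two-proportion z-test on a binary success metric (for example, "judge score ≥ 4"). That choice is an assumption, since the plan does not name a test, and the class name is hypothetical.

```java
/** Hypothetical A/B helper: two-proportion z-test on success counts for prompt variants A and B. */
final class AbTest {

    /** z statistic for the difference in success rate (B minus A), using the pooled proportion. */
    static double zScore(int successA, int totalA, int successB, int totalB) {
        double pA = (double) successA / totalA;
        double pB = (double) successB / totalB;
        double pooled = (double) (successA + successB) / (totalA + totalB);
        double standardError = Math.sqrt(pooled * (1 - pooled) * (1.0 / totalA + 1.0 / totalB));
        return standardError == 0 ? 0 : (pB - pA) / standardError;
    }

    /** Two-sided test at the 5% level: significant when |z| exceeds 1.96. */
    static boolean significantAt5Percent(double z) {
        return Math.abs(z) > 1.96;
    }
}
```

With 200 samples per variant, 120 successes against 150 gives z ≈ 3.2, which clears the threshold. A gap of a few percentage points usually will not at that sample size.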
- 2026 decision framework: few-shot first → long context (1M) if corpus <500k → RAG for larger/dynamic → fine-tune only for style/format at high volume. Never fine-tune for knowledge.
- Architecture patterns: Chain / Router / Evaluator-Optimizer / Parallelisation / Orchestrator-Subagent — know when each is right
- Read: Effective Agents workflow patterns — the canonical reference, remains current
- Choose a real problem: Java code reviewer, legal doc analyser, customer support agent, competitive intelligence, enterprise knowledge base
- Must use from Phase 0: MCP server (2025-06-18 spec), tool annotations, Spring AI 1.1 or LangChain4j 1.x
- Must use from Phase 1: adaptive thinking (not budget_tokens), 1-hour prompt cache, versioned prompts
- Must use from Phase 2: contextual retrieval + hybrid search, RAGAS evals in CI
- Must use from Phase 3: ≥2 agents with parallel execution, all 6 memory types, HITL for at least 1 action
- Must use from Phase 4: smart model routing, cost tracking, latency SLAs, 1 completed A/B test
- Deliverables: ADR, README, 5-min Loom demo, GitHub Actions CI/CD with RAGAS gate, live URL, Langfuse dashboard screenshot
// Phase 4 Final Checkpoint
AI engineering interviews in 2026 are 60%+ GenAI-focused. Based on real 2026 interview loops. Eval methodology is the new system design. Questions about adaptive thinking, MCP, GraphRAG, and A2A are now common. Answers must include projects from this plan.
budget_tokens (deprecated on 4.6 models): you specify a fixed token budget. Claude used up to that many thinking tokens regardless of complexity. Fine-grained but over-specified — simple queries still consumed budget.
Adaptive thinking (Sonnet 4.6, Opus 4.6+): you set an effort level (low/medium/high/max/xhigh). Claude decides when and how much to think based on query complexity. On simple queries, it may skip thinking entirely. On complex multi-step problems, it thinks extensively. In internal evaluations, adaptive thinking outperforms fixed budget_tokens.
Key addition — interleaved thinking: automatically enabled in adaptive mode. Claude can think between tool calls, not just before the first one. Critical for complex agentic workflows.
Long context windows changed this calculation. For corpora under ~500k tokens that don't change frequently, injecting directly into a 1M context is now viable and often simpler than a RAG pipeline.
- Corpus exceeds 500k–1M tokens — physically can't fit, or cost is prohibitive ($0.15–3 per request at current pricing)
- Knowledge changes frequently — re-injecting a full corpus each request is expensive and wasteful
- Citation grounding required — RAG retrieves traceable sources; long context doesn't
- Latency matters — loading 1M tokens takes time; RAG retrieves relevant subset in milliseconds
- "Needle in a haystack" accuracy — even 1M context has position effects. For precise fact recall in large corpora, retrieval is more reliable than long context
- Multi-user systems — each user has a different relevant subset. RAG is personalised; shared long context isn't
- Transport: HTTP+SSE → Streamable HTTP. More proxy-friendly, used in enterprise environments
- Auth: basic auth → structured OAuth 2.0. Enterprise-grade security built into the protocol
- Tool annotations: tools can now declare read-only, destructive, etc. Claude uses these for safer autonomous decisions
- Structured tool outputs: richer return types, not just strings
- Server-initiated interactions: servers can now prompt users for input mid-execution
- Donated to Linux Foundation (Dec 2025): vendor-neutral governance. Long-term stability guaranteed.
MCP is now the de facto standard — adopted by OpenAI, Google, Microsoft, AWS. 10,000+ public servers, 97M monthly downloads. Any senior AI engineer in 2026 is expected to know MCP deeply, including protocol evolution. Showing you know the spec version history signals you're working with real production systems.
This is the #1 thing interviewers ask about in 2026 — "eval methodology is the new system design." Most candidates can describe building an agent; fewer can describe evaluating one rigorously.
- Task completion rate: did the agent achieve the stated goal? Binary for simple tasks, rubric-scored for complex ones.
- Trajectory efficiency: steps taken / theoretical minimum steps, where 1.0 is ideal. An agent that achieves a goal in 12 steps when 4 were needed scores 3.0, which is a problem.
- Tool call correctness: was the right tool called with the right arguments? Log and score each call.
- Loop detection: does it detect when it's stuck? Does it call the same tool repeatedly with the same arguments?
- Graceful degradation: when a tool fails, does it recover or cascade-fail?
- Token efficiency: total tokens used per task completed. Cost per successful completion.
Contextual retrieval (Anthropic, 2024) prepends a short context sentence to each chunk before embedding it. Without this, a chunk like "Revenue increased by 3% year-over-year" has no context when retrieved. With contextual retrieval, it becomes: "This is from the Q3 2024 earnings section of Apple's annual report. Revenue increased by 3% year-over-year."
The embedding of the contextualised chunk is much richer — it captures both the content AND its place in the document. Combined with BM25 hybrid search, Anthropic reported 49% reduction in retrieval failures.
For each chunk, call Claude with: the full document + the chunk + "Please give a short context sentence (max 2 sentences) explaining where this chunk fits in the document." Prepend that to the chunk before embedding. This adds one LLM call per chunk at indexing time, not at query time — a one-time cost.
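The indexing step described above can be sketched with the context generator injected as a function. `contextGenerator` is a hypothetical hook taking (full document, chunk) and returning the short context; in practice it would be backed by an LLM call, which is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

/** Hypothetical indexing helper: prepends generated context to each chunk before embedding. */
final class ContextualChunker {

    static List<String> contextualise(String document, List<String> chunks,
                                      BiFunction<String, String, String> contextGenerator) {
        List<String> contextualised = new ArrayList<>(chunks.size());
        for (String chunk : chunks) {
            String context = contextGenerator.apply(document, chunk).strip();
            // Fall back to the raw chunk if the generator returns nothing useful.
            contextualised.add(context.isEmpty() ? chunk : context + "\n\n" + chunk);
        }
        return contextualised;
    }
}
```

The returned strings are what gets embedded and indexed for BM25. Keeping the original chunk text alongside them makes it possible to show users the unmodified source.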
Both hit 1.0 GA in May 2025. Both are production-ready, both support MCP, RAG, tool calling, chat memory, and 20+ LLM providers. The choice is architectural, not capability-driven.
- Your team is on Spring Boot — seamless autoconfiguration, familiar patterns
- You need Micrometer observability integration out of the box
- You want the Advisors API for standardised RAG and chat patterns
- Running Quarkus, Micronaut, Helidon, or plain Java (not Spring Boot)
- Need GraalVM native image (100ms start, 50MB RAM vs Spring's ~300MB)
- Want more LLM provider coverage (25+ vs Spring AI's 20+)
- Need A2A protocol support or advanced agentic modules
- Don't use long context: 100k docs far exceed a 1M token window. RAG required.
- GraphRAG for legal reasoning: legal documents reference each other (precedents, statutes). Graph traversal retrieves related cases that vector search misses.
- Contextual retrieval + BM25 hybrid: legal docs have precise terminology (exact clause names, statute numbers) that BM25 handles better than semantic alone.
- Permission-aware retrieval: lawyer A cannot see client B's documents. Row-level security in pgvector — filter at retrieval, not after.
- Citation grounding + adaptive thinking: every claim must cite source. Use adaptive thinking with effort: high for complex legal analysis.
- Audit trail: every query + what was retrieved + what was answered. Compliance requirement in legal.
- MCP server for internal APIs: expose document management, matter lookup, billing codes as MCP tools with tool annotations (read-only vs write).
- Step 1 — Profile with Langfuse: find where time actually goes. Often the bottleneck is embedding or retrieval, not the LLM call.
- Step 2 — Parallelise independent work: embed query + check cache + fetch user context simultaneously with CompletableFuture.allOf(). Easy 40–60% reduction.
- Step 3 — Opus 4.8 Fast mode: if you need Opus quality, Fast mode runs at 2.5× speed at $10/$50 per 1M tokens — 3× cheaper than previous Opus Fast.
- Step 4 — Smart model routing: Haiku 4.5 for classification (10× faster than Sonnet), Sonnet 4.6 for generation, Opus only for complex architecture decisions.
- Step 5 — Prompt caching (1-hour TTL): large system prompts cached, TTFT drops significantly.
- Step 6 — Stream everything: p95 wall-clock unchanged but perceived latency drops dramatically.
- Step 7 — Contextual compression: extract only the relevant sentences from retrieved chunks before sending to LLM. Reduces input tokens.
- Evaluation complexity grew: in 2024, RAG eval was novel. In 2026, you're expected to have LLM-as-judge, golden sets, span-level eval, and regression gating in CI. The bar moved.
- Model selection is harder: in 2024 there was one good model. In 2026 you choose among Opus 4.8, Sonnet 4.6, Haiku 4.5, and Opus 4.8 Fast mode, and a wrong choice is now expensive, not just suboptimal.
- Context vs RAG decision: 1M token windows created a new architecture choice that didn't exist. You have to evaluate this correctly or you're over-engineering.
- Long-horizon agents are in production: in 2024 agents were demos. In 2026, if your agent fails after 45 minutes of work, that's a production incident. Checkpointing, resumability, cost limits — table stakes now.
- What didn't change: non-determinism is still the hardest thing to test. Prompt brittleness is still real. Hallucinations in high-stakes domains still require the same defence-in-depth.
Extended thinking (older): you set budget_tokens — a hard cap on how many thinking tokens Claude uses. Claude uses up to the budget regardless of whether the problem needs it.
Adaptive thinking (current, Sonnet 4.6 / Opus 4.6+): you set an effort level. Claude decides whether and how much to think. On simple queries it may not think at all. On complex ones it thinks as much as needed. Interleaved thinking (thinking between tool calls) is automatically enabled. Outperforms fixed budget in Anthropic's internal evaluations.
A technique from Anthropic (2024) that prepends a short context sentence to each document chunk before embedding it. "This chunk discusses the cancellation policy from the refund section of the terms of service." The embedding is richer, retrieval is more accurate. Combined with BM25 hybrid search, reported 49% fewer retrieval failures. The contextualisation uses Claude at indexing time — a one-time cost, not per-query.
GraphRAG (Microsoft, 2024) indexes document content as a knowledge graph. Retrieval traverses graph relationships, not just vector similarity. Handles multi-hop questions that vector search misses: "What connected Person A to Event B via Organisation C?" Use for: legal research (connecting related statutes/cases), medical literature (drug interactions, treatment chains), enterprise knowledge (org charts, project dependencies). Neo4j + LangChain4j provides Java implementation. More expensive to build and maintain than vector RAG — use only when multi-hop reasoning is required.
March 2025 (2025-03-26): OAuth 2.0 authorization, Streamable HTTP transport (replaced HTTP+SSE), tool annotations (read-only, destructive), JSON-RPC batching (later removed). June 2025 (2025-06-18): structured tool outputs, enhanced OAuth, server-initiated user interactions, JSON-RPC batching removed. December 2025: donated to Linux Foundation — vendor-neutral permanently. Status June 2026: 97M monthly downloads, 10,000+ servers, adopted by OpenAI, Google, Microsoft, AWS. De facto industry standard.
Both Spring AI and LangChain4j have been production-ready since their 1.0 GA releases in May 2025. Spring AI 1.1: best for Spring Boot teams. Autoconfiguration, Micrometer observability, Advisors API, tight Boot integration. LangChain4j 1.x: best for Quarkus/non-Spring teams, GraalVM native image (100ms start, 50MB RAM), more LLM provider coverage (25+), A2A protocol support in the 1.x agentic module. For new projects on Spring Boot: Spring AI. For existing Quarkus services or serverless: LangChain4j.
Notepads were deprecated in late 2025 and replaced by Cursor Memories. Notepads required manual creation and maintenance. Memories is a persistent knowledge base that the AI maintains automatically across sessions — it extracts and stores conventions, patterns, and preferences from your conversations without you having to curate them. The result: more up-to-date context with less developer overhead. Team Memories can be shared via deeplinks (Cursor 2.0). Rules (.cursor/rules/*.mdc files) still exist for static, always-on constraints — Memories handles dynamic, evolving project knowledge.