AI / LLM Intelligence Briefing — Dec 2025 to 14 Jun 2026

Frontier AI & large language models · six-month lookback · technology-intelligence delta

First run. No ai_seen_items.md existed, so this is the initial AI/LLM briefing: all items are reported for the first time and seeded into memory. Tiers: Demonstrated (peer-reviewed or replicated) · Reported (vendor, single-source, or preprint) · Projected (roadmap) · Contested.

Source-quality caveat. This topic attracts heavy AI-generated aggregator content. Load-bearing items were verified against primary sources (lab blogs, Nature/Axios/Time/CNBC, HuggingFace). Almost all benchmark numbers are vendor- or partner-reported and un-replicated — treat "state of the art," olympiad "gold-level," and partner anecdotes as marketing-adjacent. Items resting only on aggregators are flagged unverified.

1 · Top takeaways

The single most consequential, best-corroborated story is governance, not capability: on 12–13 Jun 2026 the U.S. government ordered Anthropic to suspend all access to its new Fable 5 and Mythos 5 models for any foreign national — arguably the first use of export-control powers to switch off a deployed frontier model, three days after launch.
Anthropic overtook OpenAI as the most valuable AI startup — a $65B Series H at ~$965B post-money (28 May 2026), with stated run-rate revenue crossing ~$47B.
Capability releases clustered around agentic coding: OpenAI's GPT-5.3-Codex (the first OpenAI model rated "High capability" for cybersecurity) and Anthropic's Mythos-class tier above Opus headline a window where coding/computer-use is the dominant capability axis.
Open-weights stayed competitive with DeepSeek V4 (MIT-licensed, ~1.6T-param MoE, new sparse-attention architecture) and the Qwen3.5+ line — though DeepSeek's V3.2 line drew benchmark-contamination scrutiny.

2 · By area

Models & releases

OpenAI GPT-5.3-Codex Reported: 5 Feb 2026. Described by OpenAI as its most capable agentic coding model at release; the first OpenAI model classified "High capability" for cybersecurity under its Preparedness Framework and the first trained directly to find software vulnerabilities. Vendor benchmarks (SWE-Bench Pro 56.8%, Terminal-Bench 2.0 77.3%) are self-reported. (openai.com, 5 Feb 2026)
OpenAI GPT-5.3 / 5.4 / 5.5 line Reported — early–mid 2026. The family's existence is primary-confirmed via OpenAI's index; exact 5.4/5.5 dates rest on secondary sources. (openai.com)
Anthropic Claude Opus 4.8 Reported — 28 May 2026. Builds on 4.7 at the same price ($5/$25 per M tokens); adds user-controllable "effort," dynamic Claude Code workflows, and a ~2.5× fast mode. The circulating "4× fewer code flaws" claim is aggregator-only — unverified. (anthropic.com, 28 May 2026)
Anthropic Claude Fable 5 & Mythos 5 Reported: 9 Jun 2026. A new "Mythos-class" tier above Opus; Fable 5 is general-access with classifiers, while Mythos 5 is the same model with cyber safeguards lifted, restricted to vetted partners and the U.S. government. $10/$50 per M tokens. All capability/customer claims are vendor/partner-sourced and unreplicated. Access for foreign nationals was suspended 3 days later (see Policy). (anthropic.com, 9 Jun 2026)
Google DeepMind Gemini 3.1 Pro (19 Feb) & Gemini 3.5 / 3.5 Flash (19 May 2026) Reported — in-window follow-ons to the Nov 2025 Gemini 3 launch; 3.5 framed as "frontier intelligence with action" (agentic), 1M context. Benchmark deltas vendor-reported. (blog.google)
DeepSeek V4 (Pro + Flash) Reported — 24 Apr 2026, open weights, MIT license. V4-Pro ~1.6T total / 49B active MoE, ~1M context; replaces V3's MLA with hybrid attention + DeepSeek Sparse Attention (DSA). NVIDIA published an NVFP4 variant. R2 remains unreleased (rumor-only). Apply caution to self-reported scores (see Evaluation). (api-docs.deepseek.com; huggingface.co)
Alibaba Qwen3.5 → 3.7 line Reported — Qwen3.5 mid-Feb 2026 ("agentic AI era," open-weight ~397B/17B-active); 3.6 (~Apr) and 3.7/3.7-Plus (May–Jun) followed. CNBC reliable for 3.5; later specifics secondary-sourced. (cnbc.com, 17 Feb 2026)
Mistral Voxtral TTS Reported — ~26 Mar 2026, first Mistral open-source speech-generation model, 9 languages. (techcrunch.com)
Meta strategy pivot & xAI Grok 5 Projected: Meta's reported shift toward a closed-weight model ("Muse Spark," ~Apr 2026) and xAI's Grok 5 both rest on aggregator-only or rumor-level sourcing; neither is counted as a confirmed release, and both are tracked as roadmap items.
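The per-token prices quoted above for Opus 4.8 and the Mythos-class tier can be turned into per-request costs with simple arithmetic. The sketch below assumes each "$X/$Y per M tokens" pair means input price then output price (the briefing does not state the order). The dictionary keys are illustrative labels, not API model identifiers, and the workload is hypothetical.

```python
# Prices in USD per million tokens, as quoted in this briefing.
# Assumption: first figure is the input price, second is the output price.
PRICES = {
    "opus-4.8": (5.0, 25.0),   # illustrative label, not an API model ID
    "fable-5": (10.0, 50.0),   # illustrative label, not an API model ID
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the quoted list prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Hypothetical workload: 200k input tokens and 20k output tokens.
for name in PRICES:
    print(name, round(request_cost(name, 200_000, 20_000), 2))
```

Under these assumptions the same request costs twice as much on the Mythos-class tier, since both quoted prices double relative to Opus 4.8.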

Training & architecture

RL-compute scaling laws for reasoning Reported — a 2026 preprint wave (CoScale-RL, BroRL, "The Art of Scaling RL Compute for LLMs") formalizing how to co-scale SFT data and RL rollout compute for reasoning/agentic capability. Standard-preprint experiments, no independent replication noted. (arXiv)
ASI-EVOLVE — autonomous architecture/algorithm discovery Reported — claims to autonomously generate 105 novel linear-attention architectures beating human baselines on math reasoning. Single-outlet, strong claim; needs peer review/replication. (venturebeat.com)
DeepSeek V4 architecture (DSA + hybrid attention) — a concrete architecture advance tied to shipped open weights (see Models).
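For the mixture-of-experts figures reported under Models, the ratio of active to total parameters gives a rough sense of how sparse each model is per token. The sketch below only restates the approximate counts from this briefing; it says nothing about quality or real serving cost, and the weights still have to be held in memory in full.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of parameters used per token in a mixture-of-experts model."""
    return active_params_b / total_params_b

# Approximate figures as reported in this briefing (billions of parameters).
MODELS = {
    "DeepSeek V4-Pro": (1600.0, 49.0),
    "Qwen3.5": (397.0, 17.0),
}

for name, (total, active) in MODELS.items():
    frac = active_fraction(total, active)
    print(f"{name}: {frac:.1%} of weights active per token")
```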

Inference & systems

Long-context KV-cache compression cluster Reported — an active 2026 preprint line (VecInfer, DynSplit-KV, LLM-CoOpt) targeting ~10M-token inference via low-bit KV quantization. Long-context serving cost/memory is now the dominant inference bottleneck (inference cited as >90% of operational cost). Individual-paper gains, not a settled standard. (arXiv)
Production stack economics Reported — FP8 + Flash Attention 3 + continuous batching + speculative decoding cited at ~5–8× cost-efficiency vs naïve FP16/H100. Directional, secondary-sourced. (morphllm.com)
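To see why low-bit KV quantization matters at the ~10M-token scale mentioned above, a back-of-the-envelope sizing helps. The sketch below uses the standard estimate for plain multi-head or grouped-query attention (keys plus values, per layer, per KV head). The model dimensions are hypothetical and not taken from any model in this briefing. The estimate ignores quantization metadata and architecture-level compression such as latent or sparse attention.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bits: int) -> float:
    """Bytes needed to store keys and values for one sequence.

    Ignores quantization metadata (scales, zero points) and any
    architecture-specific compression of the cache.
    """
    elements_per_token = 2 * layers * kv_heads * head_dim  # 2 = keys + values
    return tokens * elements_per_token * bits / 8

# Hypothetical dense-attention config, NOT from any model in this briefing.
CONFIG = dict(layers=64, kv_heads=8, head_dim=128)

for bits in (16, 4, 2):
    gib = kv_cache_bytes(10_000_000, bits=bits, **CONFIG) / 2**30
    print(f"{bits:>2}-bit KV cache at 10M tokens: {gib:,.0f} GiB")
```

With this hypothetical configuration, a 16-bit cache for a single 10M-token sequence runs to roughly 2.4 TiB, which is the kind of gap the quantization preprints are trying to close.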

Evaluation & benchmarks

DeepSeek V3.2 contamination scrutiny Contested — independent evaluators reportedly flagged statistically unusual score patterns in 2026; relevant when reading any DeepSeek self-reported numbers (incl. V4). (llm-stats.com)
Benchmark saturation → new SWE-* suites Reported — MMLU/HumanEval no longer separate frontier models (clustered >90%); the field is migrating to contamination-resistant evals (LiveCodeBench, ConStat) and new coding benchmarks (SWE-Bench Pro, SWE-EVO, SWE-Universe, SWE-Hub). A methodological shift, not a leaderboard rerun. Hype check: aggregator leaderboards listing exact "Mythos 5 / Fable 5 / Opus 4.8" SWE-bench figures are un-replicated third-party tabulations — harness differences make cross-model comparison unreliable.
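Beyond harness differences, sampling noise alone can blur small leaderboard gaps. The sketch below is an illustration added here, not a method used by any cited evaluator. It computes a Wilson score interval for a pass rate and checks whether two intervals overlap. The 500-task suite size and the second model's score are hypothetical; only the 56.8% figure echoes a vendor number quoted earlier.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical: two models scored on the same 500-task suite.
N_TASKS = 500
model_a = wilson_interval(284, N_TASKS)  # 56.8% pass rate
model_b = wilson_interval(270, N_TASKS)  # 54.0% pass rate
print(model_a, model_b, intervals_overlap(model_a, model_b))
```

Overlapping intervals are a conservative heuristic rather than a formal significance test, but they show why a gap of a few points on a suite of this size should not be read as a ranking.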

Agents & applications

Coding / computer-use agents are the headline capability axis Reported: Flagship releases (GPT-5.3-Codex, Fable 5) are explicitly agentic, with vendor claims of multi-hour-to-multi-day autonomous work, OSWorld jumps (GPT-5.3-Codex 38%→65%), and partner anecdotes (Stripe: a 2-month, 50M-line migration in a day). Separate demo from deployment: these are curated partner quotes, not controlled studies; measured deployed autonomy remains below the impression they create.

Safety, alignment & interpretability

Anthropic Mythos-class safeguards Reported — first public deployment of a model Anthropic judged to cross a serious dual-use threshold. Classifier models route cyber/bio-chem queries to Opus 4.8 (<5% of sessions); an external bug bounty found no universal jailbreak in 1,000+ hours, though UK AISI reportedly made progress toward one — a candid disclosure. The most concrete "responsible scaling in practice" datapoint of the window. (anthropic.com, 9 Jun 2026)
Automated alignment / interpretability Reported — Anthropic's automated-alignment and circuit-tracing work continued; a specific "0.97 PGR in 5 days" figure is aggregator-only — unverified. (anthropic.com)

Policy, standards & governance

U.S. export-control directive suspends Fable 5 & Mythos 5 Demonstrated (multi-source: Time, Fortune, CNBC, Al Jazeera, Anthropic) — 12–13 Jun 2026. Citing national-security authorities, the government ordered all access suspended for any foreign national, inside or outside the U.S., including Anthropic's own foreign-national employees; Anthropic complied within hours. Reported trigger: a method of jailbreaking Fable 5 to analyze code for flaws. Opus/Sonnet/Haiku unaffected. A landmark — first apparent use of export-control powers to switch off a deployed frontier model. (time.com / cnbc.com / anthropic.com, 13 Jun 2026)
EU AI Act — 2 Aug 2026 high-risk deadline approaching Reported — high-risk obligations take effect 2 Aug 2026; mid-2026 saw "timeline relief / targeted simplification" debate plus new prohibitions, with extraterritorial reach affecting U.S. providers. (hklaw.com; globalpolicywatch.com)

3 · New commercial activity

Org	What they do	Stage / funding	This window's update	Tier
Anthropic	Frontier LLMs (Claude)	$65B Series H, ~$965B post-money; ~$47B run-rate revenue	Overtook OpenAI as most valuable AI startup (28 May 2026)	Demonstrated
OpenAI	Frontier LLMs (GPT)	~$110B raised at ~$840B post (secondary-sourced)	Mega-round ~Feb 2026 (SoftBank, Nvidia, Amazon)	Reported
xAI	Frontier LLMs (Grok)	~$250B all-stock (reported)	SpaceX reportedly acquired xAI (~Feb 2026) — needs primary confirmation	Reported
DeepSeek	Open-weights LLMs	—	V4 (MIT license) released 24 Apr 2026; R2 still unreleased	Reported

U.S. venture funding reportedly hit a record ~$267B (PitchBook) with OpenAI/Anthropic/xAI dominating; large Amazon Trainium and Nvidia GPU commitments reported (specific GW/GPU figures secondary-sourced).

4 · Watch list

Fable 5 / Mythos 5 reinstatement terms — whether and how access is restored, and the precedent it sets for export-controlled model access.
Meta "Muse Spark" closed-weight pivot — unverified; awaiting a Meta primary source.
SpaceX–xAI acquisition — single-source; needs primary confirmation.
DeepSeek R2 — still unreleased; rumor-only.
EU AI Act 2 Aug 2026 deadline — high-risk compliance impact on U.S. providers.
Independent SWE-bench replication — third-party leaderboard figures need replication before cross-model comparison.

5 · Quiet areas

Cohere — no credible frontier release surfaced in-window.
xAI Grok 5 / DeepSeek R2 — roadmap/rumor only.
Standalone accelerator launches tied to a concrete model advance — beyond GB200/Trainium/NVFP4 mentions, nothing met the bar.
Formal NIST/ISO AI standards — no notable in-window primary item (EU AI Act is regulation, covered above).

6 · Sources

GPT-5.3-Codex — openai.com (5 Feb 2026); GPT-5.5 — openai.com
Claude Opus 4.8 — anthropic.com (28 May 2026)
Claude Fable 5 & Mythos 5 — anthropic.com (9 Jun 2026); suspension statement — anthropic.com
Anthropic valuation — axios.com (28 May 2026)
Export-control suspension — time.com · cnbc.com · simonwillison.net (13 Jun 2026)
Gemini 3.5 — blog.google (19 May 2026); Gemini 3 Pro card — deepmind.google
DeepSeek V4 — api-docs.deepseek.com · huggingface.co (24 Apr 2026)
Qwen3.5 — cnbc.com (17 Feb 2026)
Mistral Voxtral — techcrunch.com (26 Mar 2026)
OpenAI / venture funding — siliconangle.com (Apr 2026)
EU AI Act — hklaw.com · globalpolicywatch.com
RL-scaling / contamination / KV preprints — CoScale-RL · RL-compute scaling · VecInfer; index — Raschka 2026 list; contamination — llm-stats.com

Confidence note: Mixed. The policy/governance headline and Anthropic's valuation are well-corroborated across reputable outlets; nearly all capability/benchmark claims are vendor- or partner-reported and un-replicated, and several confident aggregator specifics (Meta "Muse Spark," SpaceX–xAI, exact GPT-5.4/5.5 dates, third-party SWE-bench figures, the "0.97 PGR" and "4× fewer flaws" claims) remain unverified. The standout reliability caveat on the open-weights side is DeepSeek contamination scrutiny.