<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>The AI Timeline</title>
    <description>Follow The Latest Cutting Edge AI Research in 5 minutes a week.</description>
    
    <link>https://mail.bycloud.ai/</link>
    <atom:link href="https://rss.beehiiv.com/feeds/Vy37NcFo03.xml" rel="self"/>
    
    <lastBuildDate>Tue, 14 Apr 2026 20:44:53 +0000</lastBuildDate>
    <pubDate>Tue, 14 Apr 2026 19:30:00 +0000</pubDate>
    <atom:published>2026-04-14T19:30:00Z</atom:published>
    <atom:updated>2026-04-14T20:44:53Z</atom:updated>
    
      <category>Machine Learning</category>
      <category>Software Engineering</category>
      <category>Artificial Intelligence</category>
    <copyright>Copyright 2026, The AI Timeline</copyright>
    
    <image>
      <url>https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/publication/logo/23043e0b-1a8b-4e75-85b4-a980ed68d059/143861144.png</url>
      <title>The AI Timeline</title>
      <link>https://mail.bycloud.ai/</link>
    </image>
    
    <docs>https://www.rssboard.org/rss-specification</docs>
    <generator>beehiiv</generator>
    <language>en-us</language>
    <webMaster>support@beehiiv.com (Beehiiv Support)</webMaster>

      <item>
  <title>Neural Computer: Running an OS within an AI?!</title>
  <description>plus more about In-Place TTT, TriAttention, and Interleaved Head Attention. </description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b828e8e4-97fe-42e9-8a9c-a1d8746a14f4/issue_103.jpg" length="232128" type="image/jpeg"/>
  <link>https://mail.bycloud.ai/p/neural-computer-running-an-os-within-an-ai</link>
  <guid isPermaLink="true">https://mail.bycloud.ai/p/neural-computer-running-an-os-within-an-ai</guid>
  <pubDate>Tue, 14 Apr 2026 19:30:00 +0000</pubDate>
  <atom:published>2026-04-14T19:30:00Z</atom:published>
    <dc:creator>by cloud</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h6 class="heading" style="text-align:left;" id="nov-18-th-nov-24-th-33-latest-ai-re"><i>Apr 7th ~ Apr 14th</i><br><i>#103 Latest AI Research Explained Simply</i></h6><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="industry-news-in-1-line">🗞️ Industry News in 1 Line</h2><ol start="1"><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 5.5k</span></span> Following its initial debut last month, MiniMax has now made the weights for <a class="link" href="https://www.minimax.io/news/minimax-m27-en?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=neural-computer-running-an-os-within-an-ai" target="_blank" rel="noopener noreferrer nofollow"><b>MiniMax M2.7</b></a><b> openly available</b> to the public under a restrictive license that limits commercial use and derivative works. The model shows SoTA performance in software engineering and command-line tasks, achieving a 56.22% on SWE-Pro and 57.0% on Terminal Bench 2. You can try it today on <a class="link" href="https://huggingface.co/MiniMaxAI/MiniMax-M2.7?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=neural-computer-running-an-os-within-an-ai" target="_blank" rel="noopener noreferrer nofollow">Hugging Face</a>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/1061a0ff-50cb-4527-a57e-3935d2404aa6/image.png?t=1776193852"/></div><p class="paragraph" style="text-align:left;"> </p></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 10k</span></span> Meta Superintelligence Lab (MSL) recently released their first ever model <a class="link" href="https://ai.meta.com/blog/introducing-muse-spark-msl/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=neural-computer-running-an-os-within-an-ai" target="_blank" rel="noopener noreferrer nofollow">Muse Spark</a>, a natively multimodal reasoning model that features a &quot;contemplating mode&quot; for complex, parallel agent orchestration. The model serves as the backbone for Meta AI&#39;s new deep reasoning and shopping capabilities, demonstrating performance competitive with other leading frontier models like GPT Pro and Gemini Deep Think. 
You can try it on <a class="link" href="https://www.meta.ai/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=neural-computer-running-an-os-within-an-ai" target="_blank" rel="noopener noreferrer nofollow">Meta AI Platform</a> for free.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/ec018918-dfe5-4890-993a-085be190a25e/image.png?t=1776189287"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 10k</span></span> <a class="link" href="https://z.ai/blog/glm-5.1?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=neural-computer-running-an-os-within-an-ai" target="_blank" rel="noopener noreferrer nofollow">Z.ai has launched GLM-5.1</a>, an open-source model that currently leads the open-weight rankings with a state-of-the-art 58.4 score on SWE-Bench Pro. The model is specifically optimized for long-horizon agentic tasks, capable of running autonomously for up to eight hours to solve complex engineering and database optimization problems. You can try it on <a class="link" href="https://huggingface.co/zai-org/GLM-5.1?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=neural-computer-running-an-os-within-an-ai" target="_blank" rel="noopener noreferrer nofollow">Hugging Face</a> or via the <a class="link" href="https://docs.z.ai/guides/llm/glm-5.1?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=neural-computer-running-an-os-within-an-ai" target="_blank" rel="noopener noreferrer nofollow">API</a>. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/15a2fa71-432c-432a-91c4-c9ddb8b91ef1/20260407-235121.jpeg?t=1776189454"/></div></li></ol><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><div class="section" style="background-color:transparent;border-color:#2C81E5;border-style:solid;border-width:5px;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">Thunder Compute: The cheapest cloud GPU</h2><div class="image"><a class="image__link" href="https://www.thundercompute.com/?utm_source=bycloud&utm_medium=newsletter&utm_campaign=bycloud_newsletter" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/6e90658f-59f2-4bef-8d09-4929bb07ffa3/CleanShot_2026-04-14_at_16.54.58_2x.png?t=1776165909"/></a><div class="image__source"><span class="image__source_text"><p>H100 @ $1.38/GPU/hr!!!</p></span></div></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.thundercompute.com/?utm_source=bycloud&utm_medium=newsletter&utm_campaign=bycloud_newsletter" target="_blank" rel="noopener noreferrer nofollow">Thunder Compute</a> has one of the cheapest cloud GPUs for developers. 
They offer on-demand GPU cloud instances in enterprise-grade data centers for a fraction of the price of competitors.</p><p class="paragraph" style="text-align:left;">With on-demand H100 sitting at <b>$</b><b>1.38/GPU/hr</b>, you’d get best-in-class reliability and networking, compared to other competitors that offer at least $4/GPU/hr.</p><div class="image"><a class="image__link" href="https://www.thundercompute.com/?utm_source=bycloud&utm_medium=newsletter&utm_campaign=bycloud_newsletter" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/fea4b71c-8570-456e-8239-c85733a5caaf/CleanShot_2026-04-14_at_16.56.10_2x.png?t=1776165983"/></a></div><p class="paragraph" style="text-align:left;">They have additional features like:</p><ul><li><p class="paragraph" style="text-align:left;">VSCode extension and CLI which let you connect to instances without SSH config.</p></li><li><p class="paragraph" style="text-align:left;">Snapshots to save instance state and restore on any number of instances</p></li><li><p class="paragraph" style="text-align:left;">Templates for ComfyUI, Ollama, Unsloth Studio, and more</p></li><li><p class="paragraph" style="text-align:left;"><b>$20 of free credit for students</b></p></li></ul><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.thundercompute.com/?utm_source=bycloud&utm_medium=newsletter&utm_campaign=bycloud_newsletter"><span class="button__text" style=""> Create a GPU instance now! </span></a></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://theaitimeline.carrd.co/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=neural-computer-running-an-os-within-an-ai" target="_blank" rel="noopener noreferrer nofollow">Advertise with The AI Timeline! </a></p></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="in-place-test-time-training">In-Place Test-Time Training</h2><p class="paragraph" style="text-align:left;"><i>Feng et al. [ByteDance Seed, Peking University]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Test time training </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Current LLMs follow a strict “train then deploy” rule, meaning once they are released, their underlying knowledge is completely frozen. 
They cannot adjust their internal wiring to absorb continuous streams of new information in real time.</p><p class="paragraph" style="text-align:left;">While scientists have tried a workaround called Test-Time Training (allowing a tiny fraction of the model to update on the fly) it historically required changing the system&#39;s architecture and undertaking a massive, costly retraining process.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/9837480a-d19b-40a3-91c8-608d33ba582b/pipeline.png?t=1776187992"/></div><p class="paragraph" style="text-align:left;">To overcome this, researchers designed a brilliant upgrade called In-Place Test-Time Training. Instead of bolting on brand-new components to the system, they realized they could simply repurpose existing ones. They targeted ubiquitous processing centers inside the model, known as MLP blocks, which normally store the static knowledge acquired during initial training.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/998b6a4f-27f0-4c9b-ac49-b5312a945b94/CleanShot_2026-04-14_at_23.03.33_2x.png?t=1776188024"/><div class="image__source"><span class="image__source_text"><p>Efficiency analysis of In-Place TTT.</p></span></div></div><p class="paragraph" style="text-align:left;">The team unlocked the final layer of these blocks to act as a flexible, fast-updating memory. Because this elegant drop-in design leaves the original architecture perfectly intact, it preserves the system&#39;s foundational knowledge while seamlessly granting it the ability to adapt as it processes new data.</p><p class="paragraph" style="text-align:left;">The team paired this structural cleverness with an efficient engine that updates the memory in scalable chunks, avoiding heavy computing bottlenecks. Additionally, they aligned this real-time learning with the system&#39;s natural goal of predicting the next word.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2604.06169?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=neural-computer-running-an-os-within-an-ai"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="tri-attention-efficient-long-reason">TriAttention: Efficient Long Reasoning with Trigonometric KV Compression</h2><p class="paragraph" style="text-align:left;"><i>Mao et al. 
[MIT, NVIDIA, ZJU]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1.1K </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Attention </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> bycloud’s pick </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Modern LLMs generate incredibly long chains of thought to solve logic puzzles, but this creates a massive memory bottleneck. Every thought the model holds onto is stored in a cache, and as reasoning grows, this memory gets completely overwhelmed. Until now, the best solution was deleting older memories based on recent observations. However, just like someone forgetting the beginning of a math problem midway through, this causes the system to lose critical context.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/0287f294-7c3f-409d-abe2-aa4c87415e61/motivation.png?t=1776188257"/><div class="image__source"><span class="image__source_text"><p>Q/K concentration and its implications for attention.</p></span></div></div><p class="paragraph" style="text-align:left;">To solve this, researchers looked deeper into the architecture of the model, the raw space before the system applies rotational math to track word positions. Here, they noticed a beautifully consistent pattern. The specific components the model uses to match questions with answers naturally cluster around stable centers, regardless of the actual text.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/303546c6-9b97-4289-8ba1-2207e05e9f29/results.png?t=1776188271"/><div class="image__source"><span class="image__source_text"><p>Performance comparison on Qwen3-8B.</p></span></div></div><p class="paragraph" style="text-align:left;">Because these centers never wander, researchers realized they could predict exactly which memories the model would need in the future based purely on distance. By using a mathematical curve known as a trigonometric series, they mapped out the natural distance preferences of the model to perfectly score memory importance.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/1eb0dd17-9212-47d2-8d7e-f65ca0142908/tradeoff.png?t=1776188281"/><div class="image__source"><span class="image__source_text"><p>Performance trade-offs on AIME25 (Qwen3-8B)</p></span></div></div><p class="paragraph" style="text-align:left;">Building on this insight, the team designed a system named TriAttention. Instead of guessing which past thoughts matter based on recent windows, it calculates the true future importance of every piece of data.</p><p class="paragraph" style="text-align:left;">Researchers demonstrated that their method perfectly matches the reasoning accuracy of an uncompressed model, yet it slashes memory usage by nearly eleven times and runs two and a half times faster. 
This shortcut frees up enough memory that advanced AI can now run smoothly on a single consumer graphics card.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2604.04921?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=neural-computer-running-an-os-within-an-ai"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="neural-computers">Neural Computers</h2><p class="paragraph" style="text-align:left;"><i>Zhuge et al. [Meta AI, KAUST]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1.2K </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM Computer Use </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Think about how computers operate today: the hardware, the operating system, the applications, and the AI tools navigating them are all completely separate pieces. Researchers are trying to solve this fundamental fragmentation by asking a bold question: what if a single AI model could actually <i>be</i> the entire computer?</p><p class="paragraph" style="text-align:left;">Currently, traditional computers execute explicit programs, AI agents click around those programs from the outside, and predictive models guess what a screen should look like next. To bridge this gap, scientists are building Neural Computers.</p><p class="paragraph" style="text-align:left;">Instead of relying on a rigid, traditional stack of physical processors, memory banks, and standard code, this approach combines computation, working memory, and user inputs into one continuously learning neural network.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b351e445-5c0a-4ec4-bc44-cd2fa25a6fa7/action_inject.png?t=1776188431"/></div><p class="paragraph" style="text-align:left;">To test this ambitious idea, the team built early prototypes using advanced video generation technology, observing how the AI handled both text-heavy command-line terminals and traditional visual desktops. By feeding the system streams of user actions, text prompts, and starting screen visuals, the network&#39;s internal state essentially became the computer’s processor and RAM.</p><p class="paragraph" style="text-align:left;">The researchers discovered that these neural systems can intuitively learn the physical rules of our digital worlds from observation alone. 
The models successfully rendered fast-scrolling text, aligned precise cursor movements, and perfectly simulated short-term desktop responses like hovering over menus or clicking buttons.</p><p class="paragraph" style="text-align:left;">While they still require careful instruction to solve complex math or maintain focus over long periods, this fascinating discovery proves the foundational building blocks of a completely neural computer are already within our reach.</p><div class="embed"><a class="embed__url" href="https://metauto.ai/neuralcomputer/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=neural-computer-running-an-os-within-an-ai" target="_blank"><div class="embed__content"><p class="embed__title"> Neural Computer: A New Machine Form Is Emerging </p><p class="embed__description"> A research essay on Neural Computer: how it differs from agents, world models, and conventional computers; what runtime and CNC would mean; what current prototypes already show; and how software and hardware might change. </p><p class="embed__link"> METAUTO.ai • Mingchen Zhuge </p></div><img class="embed__image embed__image--right" src="https://metauto.ai/neuralcomputer/references/assets/teaser10_top.png"/></a></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2604.06425?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=neural-computer-running-an-os-within-an-ai"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="not-all-bits-are-equal-scale-depend">Interleaved Head Attention</h2><p class="paragraph" style="text-align:left;"><i>Duvvuri et al. [Meta, UT Austin, UC Berkeley, Harvard University, MIT]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 487 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM Attention </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">When modern artificial intelligence reads a prompt, it relies on independent processors called &quot;attention heads.&quot; Think of these heads as a team of isolated researchers, where each person analyzes a document in a sealed room without speaking to their colleagues. While this works for simple facts, researchers realized it creates a massive bottleneck for multi-step reasoning.</p><p class="paragraph" style="text-align:left;">If you ask an AI where the author of a specific book was born, the system must first identify the author, then find their birthplace. 
Because these processors cannot communicate during their computation, standard models are forced to rely on an inefficient, ever-growing number of isolated heads to piece together these chains of logic.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/ba6790fc-c2c3-475f-a392-831d7cea2b3d/CleanShot_2026-04-14_at_23.15.39_2x.png?t=1776188749"/><div class="image__source"><span class="image__source_text"><p>Overview of Interleaved Head Attention (IHA).</p></span></div></div><p class="paragraph" style="text-align:left;">To solve this, researchers developed a brilliant new approach called Interleaved Head Attention. Instead of forcing processors to work in isolation, the system constructs &quot;pseudo-heads&quot; that actively blend information from the entire team before analyzing the text. By mixing their perspectives together, these virtual processors can suddenly share context.</p><p class="paragraph" style="text-align:left;">Rather than learning just one pattern per head, this collaborative mixing allows a single processor to recognize multiple complex patterns simultaneously. It literally multiplies the model&#39;s ability to connect the dots while capturing complex, overlapping relationships.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/2b2efac7-86e0-472a-9a1c-de61542ca26b/CleanShot_2026-04-14_at_23.16.07_2x.png?t=1776188775"/><div class="image__source"><span class="image__source_text"><p>RULER long-context results after 64k fine-tuning.</p></span></div></div><p class="paragraph" style="text-align:left;">The researchers proved that this technique requires vastly less underlying code to achieve the same reasoning power as older models. When put to the test, this cross-mixing improved the system&#39;s ability to retrieve multiple facts hidden in extremely long documents by up to twenty percent.</p><p class="paragraph" style="text-align:left;">Furthermore, when fine-tuned for complex logic, it boosted accuracy on advanced math problem-solving benchmarks by nearly six percent.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2602.21371?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=neural-computer-running-an-os-within-an-ai"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><blockquote align="center" class="twitter-tweet"><a href="https://twitter.com/TheAITimeline/status/2043201732717531578?s=20&utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=neural-computer-running-an-os-within-an-ai"><p> Twitter tweet </p></a></blockquote></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=5782c04a-5edd-4989-9e6e-5e6c0575d8f0&utm_medium=post_rss&utm_source=the_ai_timeline">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Embarrassingly Simple Self-Distillation Technique</title>
  <description>plus more on Path-Constrained MoE, HISA, and Screening Is Enough</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/c3ed40f5-8df8-47fc-b60b-3850488c60ba/issue_102.jpg" length="173190" type="image/jpeg"/>
  <link>https://mail.bycloud.ai/p/embarrassingly-simple-self-distillation-technique</link>
  <guid isPermaLink="true">https://mail.bycloud.ai/p/embarrassingly-simple-self-distillation-technique</guid>
  <pubDate>Tue, 07 Apr 2026 18:52:00 +0000</pubDate>
  <atom:published>2026-04-07T18:52:00Z</atom:published>
    <dc:creator>by cloud</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h6 class="heading" style="text-align:left;" id="nov-18-th-nov-24-th-33-latest-ai-re"><i>Apr 1st ~ Apr 7th</i><br><i>#102 Latest AI Research Explained Simply</i></h6><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="industry-news-in-1-line">🗞️ Industry News in 1 Line</h2><ol start="1"><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 2k</span></span> <a class="link" href="https://www.arcee.ai/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=embarrassingly-simple-self-distillation-technique" target="_blank" rel="noopener noreferrer nofollow">Arcee.ai</a> has introduced Trinity-Large-Thinking, a new open-weights language model designed specifically for complex agent workflows and long-horizon tool use. This model offers improved multi-turn coherence, stable instruction following, and the efficiency required for production-scale deployments. Try the model out for yourself via API on <a class="link" href="https://openrouter.ai/arcee-ai/trinity-large-thinking?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=embarrassingly-simple-self-distillation-technique" target="_blank" rel="noopener noreferrer nofollow">OpenRouter</a> or <a class="link" href="https://huggingface.co/collections/arcee-ai/trinity-large-thinking?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=embarrassingly-simple-self-distillation-technique" target="_blank" rel="noopener noreferrer nofollow">explore it on Hugging Face</a>. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/155b1917-938a-4ab5-87d3-7713d7d77665/image.png?t=1775580721"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 7.2k</span></span> Google has introduced <a class="link" href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=embarrassingly-simple-self-distillation-technique" target="_blank" rel="noopener noreferrer nofollow">Gemma 4</a>, a new family of open models released under the permissive Apache 2.0 license. The lineup features four distinct sizes tailored for various deployment needs, including a 31B dense model for raw performance, a 26B Mixture-of-Experts (MoE) variant for low latency, and efficient 2B and 4B options optimized for edge devices. 
You can download the model weights and start fine-tuning for specific tasks today by checking it out on <a class="link" href="https://huggingface.co/collections/google/gemma-4?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=embarrassingly-simple-self-distillation-technique" target="_blank" rel="noopener noreferrer nofollow">Hugging Face</a> or <a class="link" href="https://ollama.com/library/gemma4?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=embarrassingly-simple-self-distillation-technique" target="_blank" rel="noopener noreferrer nofollow">Ollama</a>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/3cfdfaa7-5942-41b7-8552-37e772c12ec7/image.png?t=1775580910"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 5.8k</span></span> Z.ai has launched <a class="link" href="https://docs.z.ai/guides/vlm/glm-5v-turbo?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=embarrassingly-simple-self-distillation-technique" target="_blank" rel="noopener noreferrer nofollow">GLM-5V-Turbo</a>, a native vision coding model capable of translating multimodal inputs (such as design drafts, videos, and UI screenshots) directly into executable code. It is backed by a new CogViT visual encoder and collaborative reinforcement learning across over 30 task types. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/ef56eb49-f1c3-4838-8bba-292fbb2ccb85/image.png?t=1775581330"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 4.6k</span></span> Alibaba&#39;s Qwen team has introduced <a class="link" href="https://qwen.ai/blog?id=qwen3.5-omni&utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=embarrassingly-simple-self-distillation-technique" target="_blank" rel="noopener noreferrer nofollow">Qwen3.5-Omni</a>, a new family of native multimodal models designed to seamlessly process and integrate text, image, audio, and video inputs. Available in Plus, Flash, and Light variants, the models feature massive context capacities capable of handling up to 10 hours of audio, along with advanced real-time capabilities like emotion-controlled voice interaction and &quot;Audio-Visual Vibe Coding.&quot; You can explore the <a class="link" href="https://huggingface.co/spaces/Qwen/Qwen3.5-Omni-Online-Demo?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=embarrassingly-simple-self-distillation-technique" target="_blank" rel="noopener noreferrer nofollow">real-time voice demo</a> on Hugging Face or access the models via the <a class="link" href="https://www.alibabacloud.com/help/en/model-studio/qwen-omni?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=embarrassingly-simple-self-distillation-technique" target="_blank" rel="noopener noreferrer nofollow">Alibaba Cloud API</a>. 
</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/0cbbb601-d595-4f46-a19c-8394357c5778/qwen3.5-omni-banner.png?t=1775581578"/></div></li></ol><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><div class="section" style="background-color:transparent;border-color:#2C81E5;border-style:solid;border-width:5px;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">Intuitive AI Academy - NEW MoE Chapter!</h2><div class="image"><a class="image__link" href="https://www.intuitiveai.academy/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=embarrassingly-simple-self-distillation-technique" rel="noopener" target="_blank"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/734c79dc-aaa6-46ce-ac7d-41a5f4d84381/image.png?t=1769003669"/></a></div><p class="paragraph" style="text-align:left;">My latest project: Intuitive AI Academy has the perfect starting point for you! We focus on<b> building your intuition to understand LLMs</b>, from transformer components, to post-training logic. All in one place.</p><p class="paragraph" style="text-align:left;"><b>We just added a new chapter on MoE</b>, that goes through the history, the key techniques, and the current state of MoE that frontier model uses. With over 10,000 words written!</p><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e379f8f4-ca30-4804-a3fd-bfb8df42f3d8/image.png?t=1773169810"/></div><p class="paragraph" style="text-align:left;">We currently have an early bird offer, where you would get 40% off on the yearly plan for our early users. </p><p class="paragraph" style="text-align:left;">Use code: <b>TIMELINE</b></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.intuitiveai.academy/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=embarrassingly-simple-self-distillation-technique"><span class="button__text" style=""> Check Out Intuitive AI Academy </span></a></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://theaitimeline.carrd.co/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=embarrassingly-simple-self-distillation-technique" target="_blank" rel="noopener noreferrer nofollow">Advertise with The AI Timeline! </a></p></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="hisa-efficient-hierarchical-indexin">HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention</h2><p class="paragraph" style="text-align:left;"><i>Xu et al. 
</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 256 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Attention </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">When we ask an LLM to analyze a massive document, it uses a clever trick called sparse attention, where it only focuses on the most relevant words instead of processing everything equally. The internal tool that selects these important words still has to scan every single word in the document one by one.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/97450e98-843f-4ba6-8562-3fef35e501e1/CleanShot_2026-04-07_at_21.43.54_2x.png?t=1775578444"/><div class="image__source"><span class="image__source_text"><p> Comparison of the DSA token-wise indexer (left) and HISA hierarchical block-level coarse filter followed by token-level refinement (right)</p></span></div></div><p class="paragraph" style="text-align:left;">To solve this, researchers developed an elegant workaround called Hierarchical Indexed Sparse Attention. Instead of a flat, word-by-word scan, this new method uses a brilliant two-step strategy. First, it chunks the massive document into larger blocks and looks at a quick summary of each block to instantly filter out the irrelevant sections.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/33b098cb-0d54-4630-b9cc-c3b854506578/CleanShot_2026-04-07_at_21.44.10_2x.png?t=1775578458"/><div class="image__source"><span class="image__source_text"><p>Latency comparison of the indexer kernel between the original DSA</p></span></div></div><p class="paragraph" style="text-align:left;">Once the bulk of the text is safely discarded, the system zooms in on the surviving blocks, scanning only those specific words to find the exact information the AI needs. 
It is much like skimming the chapter titles of a textbook to find the right section before reading the actual sentences.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/cd855bd8-e8d1-40c3-a19d-98b6b2d8ae9f/CleanShot_2026-04-07_at_21.45.42_2x.png?t=1775578551"/></div><p class="paragraph" style="text-align:left;">By rewriting this search path, researchers managed to speed up the scanning process by nearly four times for exceptionally long texts, all while perfectly preserving the model&#39;s accuracy.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2603.28458?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=embarrassingly-simple-self-distillation-technique"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="embarrassingly-simple-self-distilla">Embarrassingly Simple Self-Distillation Improves Code Generation</h2><p class="paragraph" style="text-align:left;"><i>Zhang et al. [Apple]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1.5k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Distillation </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> bycloud’s pick </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Teaching an AI to write better code requires expensive human examples, a smarter teacher model, or highly complex reward systems that verify every single line of code. This heavy reliance on outside help has become a massive bottleneck in AI development. Researchers recently asked a fascinating question to address this: could a model pull itself up by its bootstraps, improving its own capabilities using absolutely nothing but its own raw, unverified outputs?</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/5b50eed6-b51c-4dde-8d2e-9ad06b13a48f/CleanShot_2026-04-07_at_21.53.41_2x.png?t=1775579033"/><div class="image__source"><span class="image__source_text"><p>Simple self-distillation (SSD) is embarrassingly simple, yet yields substantial LiveCodeBench v6 gains across five models spanning two families, three scales, with both instruct and thinking variants.</p></span></div></div><p class="paragraph" style="text-align:left;">This can be done through a method called simple self-distillation. Researchers asked the AI to generate solutions to coding prompts (without running test cases or checking if the code actually worked) and then retrained the AI on those exact responses. Remarkably, this caused a massive leap in performance across several different models, with the absolute biggest gains seen on the hardest coding challenges. 
Rather than just memorizing a single dominant way to solve a problem, the AI actually preserved its ability to explore multiple viable solution paths.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/d9f6b568-1a1b-4d34-a970-dc2e02a70510/CleanShot_2026-04-07_at_21.54.23_2x.png?t=1775579073"/><div class="image__source"><span class="image__source_text"><p>SSD improves every evaluated model on LiveCodeBench, with the largest gains on medium and hard problems</p></span></div></div><p class="paragraph" style="text-align:left;">To understand why this works, researchers uncovered a fascinating tug-of-war inside the model called the precision-exploration conflict. When writing code, an AI encounters &quot;locks&quot; (moments requiring rigid exactness with zero ambiguity) and &quot;forks&quot;, i.e. moments requiring creative exploration to choose a problem-solving approach.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b7f1e707-71d7-4029-9c9a-5811137e2431/CleanShot_2026-04-07_at_21.54.53_2x.png?t=1775579103"/><div class="image__source"><span class="image__source_text"><p>Training and evaluation temperatures compose through a broad effective-temperature band, while truncation raises the achievable pass@1 within that band.</p></span></div></div><p class="paragraph" style="text-align:left;">Normally, adjusting an AI&#39;s generation settings forces a strict compromise: making it flexible enough to navigate creative forks causes it to make sloppy, distracting errors at the rigid locks.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2604.01193?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=embarrassingly-simple-self-distillation-technique"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="path-constrained-mixtureof-experts">Path-Constrained Mixture-of-Experts</h2><p class="paragraph" style="text-align:left;"><i>Gu et al. [Apple, Google]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 356 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> MoE </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">In &quot;Mixture-of-Experts&quot; architectures, instead of activating the entire AI for every word, this design acts like a traffic system, routing each piece of information only to specialized mini-programs, or experts. But there is a catch.</p><p class="paragraph" style="text-align:left;">Historically, these models make independent routing decisions at every single layer, creating an astronomical number of possible pathways. Because the vast majority of these paths are never explored during training, researchers realized this scattered approach represents a massive inefficiency. 
They wondered if they could guide information along more intentional, organized routes.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/3e266131-f595-436d-86e8-e2d291f3f111/CleanShot_2026-04-07_at_22.03.19_2x.png?t=1775579611"/><div class="image__source"><span class="image__source_text"><p>Spectrum of routing constraints in MoE architectures.</p></span></div></div><p class="paragraph" style="text-align:left;">When peering inside these systems, researchers discovered something fascinating: language naturally organizes itself. Even without strict guidance, words cluster into a tiny fraction of specific pathways based on their linguistic purpose, with dedicated paths emerging for things like punctuation, names, or action verbs.</p><p class="paragraph" style="text-align:left;">To amplify this natural structure, the team introduced a streamlined approach called PathMoE. Instead of letting every layer make its own isolated traffic decisions, PathMoE groups consecutive layers into blocks that share the exact same routing rules. Because neighboring layers process similar information, this gentle constraint guides the data along highly concentrated, specialized routes without restricting the model&#39;s overall potential.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/c1ae77cd-8907-4b32-84cf-b17b39b45167/CleanShot_2026-04-07_at_22.03.58_2x.png?t=1775579646"/><div class="image__source"><span class="image__source_text"><p>Main results on Fineweb-100B with 0.9B total / 0.37B active MoE architecture. Throughput is reported per GPU and memory reports peak active GPU memory.</p></span></div></div><p class="paragraph" style="text-align:left;">By simply encouraging these natural pathways, models equipped with PathMoE demonstrated consistent improvements in accuracy and language comprehension. This method naturally keeps the AI&#39;s workload perfectly balanced, eliminating the need for the clunky, manual tuning formulas engineers previously relied on.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2603.18297?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=embarrassingly-simple-self-distillation-technique"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="not-all-bits-are-equal-scale-depend">Screening Is Enough</h2><p class="paragraph" style="text-align:left;"><i>Nakanishi [RIKEN]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 762 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Attention </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">AI models distribute a fixed budget of attention across every piece of information they read. 
Because this budget is strictly capped, the system evaluates data relatively, comparing words against one another rather than valuing them on their own merits. As a document grows, this fixed attention dilutes. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/297d70a7-3911-4b6b-8561-1d9f3f9db806/CleanShot_2026-04-07_at_22.10.00_2x.png?t=1775580011"/></div><p class="paragraph" style="text-align:left;">A new architecture called Multiscreen introduces a beautifully intuitive solution to this problem through a mechanism researchers call &quot;screening.&quot; Instead of forcing words to compete for a slice of a fixed attention pie, screening judges every piece of information entirely independently against a strict, absolute threshold. If a piece of data is useful, it passes the screen; if it is irrelevant, it is completely discarded.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b0887da0-7ce0-461e-b10c-97da07c62a22/CleanShot_2026-04-07_at_22.10.34_2x.png?t=1775580044"/><div class="image__source"><span class="image__source_text"><p>Long-context perplexity comparison between 353M Transformer and 286M Multiscreen models.</p></span></div></div><p class="paragraph" style="text-align:left;">By eliminating global competition among data points, the model cleanly filters out the noise and confidently gathers only the information that actually matters, allowing it to adapt its focus without getting overwhelmed by sheer volume.</p><p class="paragraph" style="text-align:left;">Researchers found that Multiscreen achieves comparable performance using roughly forty percent fewer parameters than standard models. Even more impressively, a vastly scaled-down version of Multiscreen consistently outperformed much larger standard models in retrieving specific information, all while cutting processing delays by over three times on massive texts.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2604.01178?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=embarrassingly-simple-self-distillation-technique"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="true" class="youtube_embed" frameborder="0" height="100%" src="https://youtube.com/embed/4Ij9YOyrNdM" width="100%"></iframe></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=a9a2b4c4-00c2-40cb-988d-f65cd5890d2b&utm_medium=post_rss&utm_source=the_ai_timeline">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>LeWorldModel: JEPA but more practical</title>
  <description>plus more on Claudini, Composer 2, and self-distillation</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b29ce3f1-0135-4633-9143-6357808d0347/issue_101.jpg" length="244929" type="image/jpeg"/>
  <link>https://mail.bycloud.ai/p/leworldmodel-jepa-but-more-practical</link>
  <guid isPermaLink="true">https://mail.bycloud.ai/p/leworldmodel-jepa-but-more-practical</guid>
  <pubDate>Tue, 31 Mar 2026 18:41:00 +0000</pubDate>
  <atom:published>2026-03-31T18:41:00Z</atom:published>
    <dc:creator>by cloud</dc:creator>
    <category><![CDATA[Weekly Papers Recap]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h6 class="heading" style="text-align:left;" id="nov-18-th-nov-24-th-33-latest-ai-re"><i>Mar 24th ~ Mar 31st</i><br><i>#101 Latest AI Research Explained Simply</i></h6><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="industry-news-in-1-line">🗞️ Industry News in 1 Line</h2><ol start="1"><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 5.5k</span></span> <a class="link" href="https://z.ai/subscribe?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=leworldmodel-jepa-but-more-practical" target="_blank" rel="noopener noreferrer nofollow">GLM-5.1 by z.ai</a> is now officially available to all GLM Coding Plan users. To start using the new model, simply update your configuration file, such as <code>~/.claude/settings.json</code>, by manually changing the model name to &quot;glm-5.1&quot;. If you look at the benchmarks, it doesn’t look that impressive, but what’s great is that it offers <b>3× usage of the Claude Pro plan</b>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/6a8af96d-e48e-4b97-adda-ea8ade7a62e7/image.png?t=1774973154"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 2.2k</span></span> <a class="link" href="https://ai.meta.com/research/sam3/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=leworldmodel-jepa-but-more-practical" target="_blank" rel="noopener noreferrer nofollow">Meta has released SAM 3.1</a>, a drop-in update that introduces &quot;object multiplexing&quot; to track up to 16 objects simultaneously in a single forward pass. This update doubles video processing throughput and removes memory bottlenecks, making high-performance AI applications feasible on smaller, more accessible hardware. <a class="link" href="https://github.com/facebookresearch/sam3?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=leworldmodel-jepa-but-more-practical" target="_blank" rel="noopener noreferrer nofollow">View on GitHub</a>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/1c35d6ea-7301-41a3-ab84-32ee16a61b9d/image.png?t=1774973405"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 15k</span></span> <a class="link" href="https://aidemos.atmeta.com/tribev2/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=leworldmodel-jepa-but-more-practical" target="_blank" rel="noopener noreferrer nofollow">Meta has introduced TRIBE v2</a>, a new foundation model trained on over 500 hours of fMRI data to predict how the human brain responds to visual and auditory stimuli. This &quot;digital twin&quot; of neural activity achieves a nearly 3x improvement in zero-shot predictions over previous methods. 
<a class="link" href="https://github.com/facebookresearch/tribev2?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=leworldmodel-jepa-but-more-practical" target="_blank" rel="noopener noreferrer nofollow">View on GitHub</a>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/06cfcf13-e7f8-484a-9049-0532d624524c/image.png?t=1774973586"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 38k</span></span> <a class="link" href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=leworldmodel-jepa-but-more-practical" target="_blank" rel="noopener noreferrer nofollow">Google has released TurboQuant</a>, which is a compression algorithm designed to optimize the key-value (KV) cache of Large Language Models. This new method <b>reduces memory requirements by at least sixfold</b>, allowing developers to run massive models on significantly more modest hardware. Moreover, it also delivers up to an <b>8x speedup</b> in inference. Most importantly, TurboQuant achieves these performance gains with zero loss in accuracy, ensuring that model quality remains untouched despite the extreme compression. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/5d14f6ce-f331-4a34-9d8c-d42ae9362292/image.png?t=1774973817"/><div class="image__source"><span class="image__source_text"><p><i>TurboQuant demonstrates robust KV cache compression performance across the</i> <a class="link" href="#industry-news-in-1-line" rel="noopener noreferrer nofollow" style="color: rgb(26, 115, 232)">LongBench</a><i> benchmark relative to various compression methods</i></p></span></div></div><p class="paragraph" style="text-align:left;">The hype for Google’s TurboQuant is facing pushback from the research community as the paper contains serious technical inaccuracies and misleading comparisons. <a class="link" href="https://x.com/gaoj0017/status/2037532673812443214?s=20&utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=leworldmodel-jepa-but-more-practical" target="_blank" rel="noopener noreferrer nofollow">Lead researchers from the RaBitQ project claim</a> that TurboQuant misrepresents their methodology and fails to acknowledge fundamental similarities, specifically regarding the use of the Johnson-Lindenstrauss transform. According to public statements, these flaws were flagged to the authors prior to submission, yet the paper was allegedly published and promoted without the necessary corrections. 
</p></li></ol><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><div class="section" style="background-color:transparent;border-color:#2C81E5;border-style:solid;border-width:5px;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">Intuitive AI Academy - NEW Distillation Chapter!</h2><div class="image"><a class="image__link" href="https://www.intuitiveai.academy/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=leworldmodel-jepa-but-more-practical" rel="noopener" target="_blank"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/734c79dc-aaa6-46ce-ac7d-41a5f4d84381/image.png?t=1769003669"/></a></div><p class="paragraph" style="text-align:left;">My latest project: Intuitive AI Academy has the perfect starting point for you! We focus on<b> building your intuition to understand LLMs</b>, from transformer components, to post-training logic. All in one place.</p><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e379f8f4-ca30-4804-a3fd-bfb8df42f3d8/image.png?t=1773169810"/></div><p class="paragraph" style="text-align:left;">We currently have an early bird offer, where you would get 40% off on the yearly plan for our early users. </p><p class="paragraph" style="text-align:left;">Use code: <b>TIMELINE</b></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.intuitiveai.academy/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=leworldmodel-jepa-but-more-practical"><span class="button__text" style=""> Check Out Intuitive AI Academy </span></a></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://theaitimeline.carrd.co/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=leworldmodel-jepa-but-more-practical" target="_blank" rel="noopener noreferrer nofollow">Advertise with The AI Timeline! </a></p></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="composer-2-technical-report">Composer 2 Technical Report</h2><p class="paragraph" style="text-align:left;"><i>Cursor Research Team</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 5.3k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Coding </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">When developers ask AI models to write code, the models perform brilliantly on laboratory tests but stumble in the messy, ambiguous world of actual software engineering. 
Researchers realized that existing AI coding benchmarks were simply too neat, providing detailed instructions for isolated bugs. Developers constantly deal with vague bug reports, massive codebases, and confusing production logs. To bridge this gap, scientists set out to build a specialized assistant that genuinely thinks like a seasoned engineer navigating the authentic friction of daily development.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/f713f25e-f45b-4a4f-9555-96a75c35cb22/CleanShot_2026-03-31_at_20.55.39_2x.png?t=1774970751"/><div class="image__source"><span class="image__source_text"><p>Overview of a single grouped GEMM training flow in our Mixture-of-Experts layer.</p></span></div></div><p class="paragraph" style="text-align:left;">This paper introduces <b>Composer 2</b>, which is a new model that dramatically improves how AI handles long-term programming tasks. The researchers achieved this through a clever two-step training process.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/1dd3f4c7-2697-42f7-8890-c0d87ff3756e/CleanShot_2026-03-31_at_20.55.14_2x.png?t=1774970728"/><div class="image__source"><span class="image__source_text"><p>Example CursorBench task</p></span></div></div><p class="paragraph" style="text-align:left;">First, they immersed the model in massive amounts of code to build profound baseline knowledge. Then, they put it through rigorous reinforcement learning, simulating actual user sessions where the AI practiced solving diverse problems. To help the model stay perfectly focused during lengthy programming tasks, the team introduced a brilliant self-summarization technique.</p><p class="paragraph" style="text-align:left;">Instead of getting overwhelmed by a long history of commands, the model constantly writes little internal summaries for itself. This preserves crucial context while discarding clutter. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e6391c0c-94aa-4c4f-a1d5-ccd79222e870/CleanShot_2026-03-31_at_20.54.54_2x.png?t=1774970704"/></div><p class="paragraph" style="text-align:left;">To prove this approach works, the team bypassed traditional public tests and created a rigorous new evaluation based entirely on authentic engineering scenarios. In these demanding simulations, which require extensive code changes from incredibly brief instructions, the model demonstrated remarkable accuracy and coherence.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2603.24477?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=leworldmodel-jepa-but-more-practical"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="claudini-autoresearch-discovers-sta">Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs</h2><p class="paragraph" style="text-align:left;"><i>Panfilov et al. 
[MATS, ELLIS Institute Tubingen & Max Planck Institute for Intelligent Systems, Tubingen AI Center, Imperial College London]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1.5k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Coding Vulnerabilities </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> bycloud’s pick </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Can AI automatically conduct research to make future technology safer? Researchers built an automated security researcher to see if an AI could invent entirely new methods for pressure-testing digital vulnerabilities. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/80275241-8d37-44a5-adb3-d7e850531489/autoresearch_loop.png?t=1774971183"/></div><p class="paragraph" style="text-align:left;">To test this, the research team built a continuous loop where the AI reviewed dozens of existing security-testing algorithms, wrote new software to improve them, and then evaluated its own creations. Instead of simply generating tricky text prompts to bypass safety filters, the AI engineered the underlying mathematical algorithms that actively search for these vulnerabilities.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/4ba4cf30-1d80-47f1-af0f-20109ed307be/pareto_evolution_small.png?t=1774971198"/><div class="image__source"><span class="image__source_text"><p>Claudini Strongly Outperforms a Classical AutoML Method.</p></span></div></div><p class="paragraph" style="text-align:left;">By intelligently splicing together previous techniques and writing clever mechanisms to avoid getting stuck during its analysis, the AI made massive strides. It successfully generated novel algorithms that dramatically outperformed human-made baselines, jumping from a success rate of less than ten percent to forty percent on a targeted safety filter.</p><div class="embed"><a class="embed__url" href="https://github.com/romovpa/claudini?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=leworldmodel-jepa-but-more-practical" target="_blank"><div class="embed__content"><p class="embed__title"> GitHub - romovpa/claudini: Autoresearch for LLM adversarial attacks </p><p class="embed__description"> Autoresearch for LLM adversarial attacks. Contribute to romovpa/claudini development by creating an account on GitHub. </p><p class="embed__link"> github.com/romovpa/claudini </p></div></a></div><p class="paragraph" style="text-align:left;">The most remarkable finding is just how adaptable these new AI-authored algorithms proved to be. 
When scientists tested these methods on a completely different, highly secured model that the AI had never even encountered, the new algorithms bypassed the defenses with a hundred percent success rate.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2603.24511?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=leworldmodel-jepa-but-more-practical"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="le-world-model-stable-endto-end-joi">LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels</h2><p class="paragraph" style="text-align:left;"><i>Maes et al. [Mila & Université de Montréal, New York University, Samsung SAIL, Brown University]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 3.7k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM World Models </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">AI needs an internal &quot;world model&quot; to predict the consequences of its actions. Researchers have tried teaching AI to build these models directly from raw video pixels, compressing complex visual scenes into a streamlined imagination space. However, this process faces a frustrating hurdle known as representation collapse.</p><p class="paragraph" style="text-align:left;">When asked to predict the future, the AI often takes the lazy route, mapping every image to the exact same uniform representation just to guarantee a perfect prediction score. To prevent this, previous systems relied on highly fragile workarounds.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/71d739b0-2b31-40d4-ae5d-8127d628a788/CleanShot_2026-03-31_at_21.06.49_2x.png?t=1774971420"/><div class="image__source"><span class="image__source_text"><p>LeWorldModel Training Pipeline. </p></span></div></div><p class="paragraph" style="text-align:left;">This paper introduces LeWorldModel, a highly stable world model trained from scratch using just two simple rules. First, the model predicts the next compressed state of its environment based on a given action. Second, a clever mathematical regulator forces these compressed representations to stay continuously diverse, spreading them out to match a natural bell-curve distribution. 
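</p><p class="paragraph" style="text-align:left;">To make those two rules concrete, here is a minimal PyTorch-style sketch of what such a training step could look like. The <code>encoder</code> and <code>predictor</code> modules and the exact form of the Gaussian-matching penalty are illustrative assumptions, not the authors’ implementation.</p><pre><code>import torch
import torch.nn.functional as F


def training_step(encoder, predictor, obs, action, next_obs, reg_weight=1.0):
    """One JEPA-style world-model update (illustrative sketch, not the paper's code).

    Rule 1: predict the next latent state from the current latent and the action.
    Rule 2: keep the batch of latents spread out like a standard Gaussian,
            so the encoder cannot collapse every frame to the same point.
    """
    z = encoder(obs)                      # latent for the current observation
    with torch.no_grad():
        z_next = encoder(next_obs)        # prediction target (no gradient through it)

    z_pred = predictor(z, action)         # rule 1: action-conditioned prediction
    pred_loss = F.mse_loss(z_pred, z_next)

    # rule 2: push per-dimension statistics toward zero mean and unit variance
    mean = z.mean(dim=0)
    var = z.var(dim=0)
    reg_loss = (mean ** 2).mean() + (var - 1.0).pow(2).mean()

    return pred_loss + reg_weight * reg_loss


if __name__ == "__main__":
    # toy shapes: 8 samples, 32-dim observations, 4-dim actions, 16-dim latents
    enc = torch.nn.Linear(32, 16)
    head = torch.nn.Linear(16 + 4, 16)
    predictor = lambda z, a: head(torch.cat([z, a], dim=-1))
    obs, nxt, act = torch.randn(8, 32), torch.randn(8, 32), torch.randn(8, 4)
    print(training_step(enc, predictor, obs, act, nxt))
</code></pre><p class="paragraph" style="text-align:left;">The single <code>reg_weight</code> knob here mirrors the paper’s claim that tuning reduces to one parameter, though the regularizer in the paper may be defined differently. </p><p class="paragraph" style="text-align:left;">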
By enforcing this varied shape, the model is strictly prevented from collapsing into a single, lazy answer.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/d4d5a3ea-149e-447d-8f82-ee42ca6648f7/CleanShot_2026-03-31_at_21.09.04_2x.png?t=1774971554"/><div class="image__source"><span class="image__source_text"><p>Pseudo-code for the training procedure of LeWorldModel.</p></span></div></div><p class="paragraph" style="text-align:left;">By reducing the complex tuning down to a single parameter, this compact model trains on one standard graphics card in just a <b>few hours</b>. Impressively, it plans up to <b>forty-eight times faster</b> than bulkier alternatives across complex control tasks. Even more fascinating, without explicit physics lessons, the model organically learns to track object locations and registers mathematical surprise when shown impossible events like spontaneous teleportation.</p><div class="embed"><a class="embed__url" href="https://le-wm.github.io/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=leworldmodel-jepa-but-more-practical" target="_blank"><div class="embed__content"><p class="embed__title"> LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels </p><p class="embed__description"> End-to-end joint-embedding predictive architecture from pixels. </p><p class="embed__link"> le-wm.github.io </p></div><img class="embed__image embed__image--right" src=""/></a></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2603.19312?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=leworldmodel-jepa-but-more-practical"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="not-all-bits-are-equal-scale-depend">Self-Distillation of Hidden Layers for Self-Supervised Representation Learning</h2><p class="paragraph" style="text-align:left;"><i>Lowe et al. [Vector Institute, Carleton University, Dalhousie University</i>, <i>University of British Columbia, University of Guelph]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 780 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Distillation </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Teaching an AI to understand the visual world is complex. For a long time, scientists have had to choose between two extreme training methods, both with frustrating limitations. One approach forces the AI to perfectly reconstruct raw, low-level pixels. 
While this keeps the system safely grounded in reality, it leaves the AI struggling to grasp big-picture concepts without a lot of extra hand-holding.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/450e77c4-7fdd-42bb-8723-9373a9dc790d/CleanShot_2026-03-31_at_21.09.58_2x.png?t=1774971612"/><div class="image__source"><span class="image__source_text"><p>Multi-layer self-distillation with Bootleg. </p></span></div></div><p class="paragraph" style="text-align:left;">The other approach asks the AI to predict only highly abstract, final-stage ideas. However, because the system is essentially generating its own study material in a continuous loop, it often becomes unstable, loses touch with the actual image, and completely breaks down during training. </p><p class="paragraph" style="text-align:left;">To solve this, this research has introduced Bootleg. Instead of making the AI guess only the most basic details or only the most complex final concepts, they tasked it with predicting multiple hidden layers of information all at once. This beautifully mirrors how the human brain processes sights, where early visual processing picks up simple edges and colors, while deeper brain regions recognize complete objects.</p><p class="paragraph" style="text-align:left;">By forcing the AI to simultaneously predict early, middle, and late stages of understanding, the system has to compress a wealth of varied knowledge through a tight informational bottleneck. This multi-level approach brilliantly keeps the AI anchored to actual visual features while it masters complex, abstract ideas.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e80bd88e-05eb-4135-8422-343fc4ddf932/CleanShot_2026-03-31_at_21.13.07_2x.png?t=1774971798"/><div class="image__source"><span class="image__source_text"><p>Results for masked self-supervised learning with Bootleg and baselines. </p></span></div></div><p class="paragraph" style="text-align:left;">The researchers discovered that this technique creates a dramatically smarter system, outperforming previous methods by significant margins in both recognizing what is in an image and mapping out exact scenes.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2603.15553?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=leworldmodel-jepa-but-more-practical"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="true" class="youtube_embed" frameborder="0" height="100%" src="https://youtube.com/embed/P9uNy71YukQ" width="100%"></iframe></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=6f4577e6-93dc-4f9b-9f9c-b42c0fb72ada&utm_medium=post_rss&utm_source=the_ai_timeline">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Rotate attention by 90 degrees...? Kimi&#39;s New Attention Residuals</title>
  <description>plus more about V-JEPA 2.1, Mamba 3, and latent planning</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/5ab46a60-9b20-4927-a94d-9639d7021963/issue_100.jpg" length="244934" type="image/jpeg"/>
  <link>https://mail.bycloud.ai/p/rotate-attention-by-90-degrees-kimi-s-new-attention-residuals</link>
  <guid isPermaLink="true">https://mail.bycloud.ai/p/rotate-attention-by-90-degrees-kimi-s-new-attention-residuals</guid>
  <pubDate>Wed, 25 Mar 2026 18:32:00 +0000</pubDate>
  <atom:published>2026-03-25T18:32:00Z</atom:published>
    <dc:creator>by cloud</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h6 class="heading" style="text-align:left;" id="nov-18-th-nov-24-th-33-latest-ai-re"><i>Mar 17th ~ Mar 24th</i><br><i>#100 Latest AI Research Explained Simply</i></h6><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="industry-news-in-1-line">🗞️ Industry News in 1 Line</h2><ol start="1"><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 1.3k</span></span> Xiaomi has announced the release of <a class="link" href="https://mimo.xiaomi.com/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rotate-attention-by-90-degrees-kimi-s-new-attention-residuals#blog" target="_blank" rel="noopener noreferrer nofollow">MiMo-V2-Pro</a>, Omni, and TTS models. These new models introduce advanced features with global top-tier agent performance, multimodal interaction for seeing and hearing, and expressive voice synthesis. You can test these models on the <a class="link" href="https://aistudio.xiaomimimo.com/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rotate-attention-by-90-degrees-kimi-s-new-attention-residuals#/" target="_blank" rel="noopener noreferrer nofollow">web</a> or via the API portal.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e70dfda8-958b-4e01-ad4c-e9b0f8b0c665/image.png?t=1774372024"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 1.5k</span></span> MiniMax has launched its <a class="link" href="https://www.minimax.io/news/minimax-m27-en?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rotate-attention-by-90-degrees-kimi-s-new-attention-residuals" target="_blank" rel="noopener noreferrer nofollow">M2.7 model</a>, which uses a <b>recursive self-evolution architecture</b> that contributed to an 88% win-rate over its predecessor. This model achieves state-of-the-art performance in software engineering benchmarks and high-fidelity document editing, while demonstrating enhanced agentic capabilities with 97% skill adherence. 
<a class="link" href="https://agent.minimax.io/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rotate-attention-by-90-degrees-kimi-s-new-attention-residuals" target="_blank" rel="noopener noreferrer nofollow">Try it today on the web</a>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/938b60fb-2159-4dc4-8f64-317e9772eb85/image.png?t=1774372219"/></div><p class="paragraph" style="text-align:left;"></p></li></ol><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><div class="section" style="background-color:transparent;border-color:#2C81E5;border-style:solid;border-width:5px;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">Intuitive AI Academy - NEW Distillation Chapter!</h2><div class="image"><a class="image__link" href="https://www.intuitiveai.academy/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rotate-attention-by-90-degrees-kimi-s-new-attention-residuals" rel="noopener" target="_blank"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/734c79dc-aaa6-46ce-ac7d-41a5f4d84381/image.png?t=1769003669"/></a></div><p class="paragraph" style="text-align:left;">My latest project: Intuitive AI Academy has the perfect starting point for you! We focus on<b> building your intuition to understand LLMs</b>, from transformer components, to post-training logic. All in one place.</p><p class="paragraph" style="text-align:left;"><b>We just added a new chapter on Distillaion!</b></p><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e379f8f4-ca30-4804-a3fd-bfb8df42f3d8/image.png?t=1773169810"/></div><p class="paragraph" style="text-align:left;">We currently have an early bird offer, where you would get 40% off on the yearly plan for our early users. </p><p class="paragraph" style="text-align:left;">Use code: <b>TIMELINE</b></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.intuitiveai.academy/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rotate-attention-by-90-degrees-kimi-s-new-attention-residuals"><span class="button__text" style=""> Check Out Intuitive AI Academy </span></a></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://theaitimeline.carrd.co/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rotate-attention-by-90-degrees-kimi-s-new-attention-residuals" target="_blank" rel="noopener noreferrer nofollow">Advertise with The AI Timeline! 
</a></p></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="attention-residuals">Attention Residuals</h2><p class="paragraph" style="text-align:left;"><i>Kimi Team</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 15k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Attention </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">LLMs rely on residual connections to pass information through deep stacks of layers. While these connections act as a vital &quot;gradient highway&quot;, they work like a blunt instrument. In current architectures, every layer simply adds its output to a uniform sum of all previous layers. This approach causes the information representing the model’s &quot;hidden state&quot; to swell uncontrollably as it moves deeper. Over time, this leads to a dilution effect where early, important information becomes buried and difficult for the model to retrieve effectively, effectively limiting how well the model can leverage its own depth.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/d505ae16-d0a9-4540-afdb-b284be0d51dc/overview.png?t=1774371585"/><div class="image__source"><span class="image__source_text"><p>Overview of Attention Residuals.</p></span></div></div><p class="paragraph" style="text-align:left;">Researchers have introduced &quot;Attention Residuals&quot; (AttnRes) to replace this clumsy, fixed accumulation with a smarter mechanism. AttnRes allows each layer to selectively aggregate information from previous layers using learned, input-dependent weights. Instead of blindly adding everything together, the model now chooses which previous layers are most relevant to its current task.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/a23a156e-cd15-4169-aa3d-61fd80800d1e/CleanShot_2026-03-24_at_22.30.36_2x.png?t=1774371666"/><div class="image__source"><span class="image__source_text"><p>PyTorch-style pseudo code for Block Attention Residuals.</p></span></div></div><p class="paragraph" style="text-align:left;">To ensure this doesn&#39;t create excessive memory demands in massive models, the team developed &quot;Block AttnRes,&quot; which organizes layers into smaller groups. This allows the model to maintain the benefits of selective, smart aggregation while keeping memory and communication costs efficient.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/1f0e6564-c8e7-428a-8401-349c89427f58/training_dynamics.png?t=1774371608"/></div><p class="paragraph" style="text-align:left;">Experiments on large-scale models, including a 48B-parameter architecture, showed that this method successfully prevents the hidden-state growth that plagued previous models. 
By creating a more uniform distribution of signals and gradients across all layers, AttnRes consistently improves performance on complex reasoning, math, and coding benchmarks. </p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2603.15031?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rotate-attention-by-90-degrees-kimi-s-new-attention-residuals"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="mamba-3-improved-sequence-modeling-">Mamba-3: Improved Sequence Modeling using State Space Principles</h2><p class="paragraph" style="text-align:left;"><i>Lahoti et al. [Carnegie Mellon University, Princeton University, Together AI, Cartesia AI]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1.5k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;">Mamba</span></span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> bycloud’s pick </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Transformer-based AI models are incredibly powerful, but they suffer from a &quot;bottleneck&quot; in efficiency. As models grow, the computational effort required to generate each new piece of information and the memory needed to store that context both increase very quickly. </p><p class="paragraph" style="text-align:left;">This paper has introduced Mamba-3, a new architecture designed with an &quot;inference-first&quot; mindset to solve these efficiency challenges. By revisiting the mathematical foundations of State Space Models (SSMs), the team implemented three core improvements.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/7b52ae1c-dbee-4a9b-9122-22519fcd2a75/mamba3.png?t=1774371479"/></div><p class="paragraph" style="text-align:left;">First, they developed a more expressive discretization method called &quot;exponential-trapezoidal,&quot; which allows the model to handle data dynamics with greater precision than previous versions.</p><p class="paragraph" style="text-align:left;">Second, they incorporated a complex-valued state update rule. 
This acts like a &quot;rotary&quot; mechanism that enables the model to solve state-tracking problems, such as arithmetic parity, that were previously impossible for similar linear models to master.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/f3fb3642-2897-470d-bd87-4ac78ec7ba79/selection.png?t=1774371491"/></div><p class="paragraph" style="text-align:left;">Finally, they shifted to a multi-input, multi-output (MIMO) formulation. This clever adjustment allows the model to perform more computation during the memory-heavy decoding phase without increasing the actual size of its state or slowing down its response time.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/6442bb60-ae31-49a9-87f9-93e3c78d6a33/ssd_algorithm.png?t=1774371506"/></div><p class="paragraph" style="text-align:left;">These refinements allow Mamba-3 to achieve a remarkable balance. At the 1.5B scale, it outperforms top-tier competitors in downstream accuracy while simultaneously matching the language-modeling capabilities of its predecessor at half the state size. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b8e7467e-fd11-4b2f-88fa-f4c75bdf40a1/CleanShot_2026-03-24_at_22.28.50_2x.png?t=1774371539"/><div class="image__source"><span class="image__source_text"><p>Prefill and Prefill+Decode latency across sequence lengths.</p></span></div></div><div class="embed"><a class="embed__url" href="https://github.com/state-spaces/mamba?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rotate-attention-by-90-degrees-kimi-s-new-attention-residuals" target="_blank"><div class="embed__content"><p class="embed__title"> GitHub - state-spaces/mamba: Mamba SSM architecture </p><p class="embed__description"> Mamba SSM architecture. Contribute to state-spaces/mamba development by creating an account on GitHub. </p><p class="embed__link"> github.com/state-spaces/mamba </p></div><img class="embed__image embed__image--right" src="https://opengraph.githubassets.com/0929e33a4425521a199ffad2179c61bdbc88e4af7b47b558ccd6975412a203cf/state-spaces/mamba"/></a></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2603.15569?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rotate-attention-by-90-degrees-kimi-s-new-attention-residuals"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="vjepa-21-unlocking-dense-features-i">V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning</h2><p class="paragraph" style="text-align:left;"><i>Mur-Labadia et al. 
[FAIR at Meta, Universidad de Zaragoza]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1.3k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> JEPA </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">To be truly useful, an AI needs to be a &quot;jack-of-all-trades&quot;: it must understand both the big picture (like identifying a person’s action) and fine-grained local details (like pinpointing the exact edge of a glass for a robot to grasp). Previous models, such as the V-JEPA family, excelled at global understanding but struggled to extract precise local information, resulting in noisy, fragmented visual representations.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/ef8ff685-a421-4368-9269-34211852d0c2/architecture_vjepa2_1.jpg?t=1774371360"/><div class="image__source"><span class="image__source_text"><p>V-JEPA 2.1 Architecture</p></span></div></div><p class="paragraph" style="text-align:left;">This gap limited their effectiveness in tasks that demand spatial precision, such as depth estimation or delicate robotic manipulation. The team discovered that the &quot;missing link&quot; was in the training objective. Traditional V-JEPA models only practiced predicting what was missing from an image or video, the masked, hidden patches.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/36735fae-fafe-41e2-9d5d-24b94c164344/bars_teaser_tikz-1.png?t=1774371375"/><div class="image__source"><span class="image__source_text"><p>V-JEPA 2.1 ViT-G performance across dense and global prediction tasks.</p></span></div></div><p class="paragraph" style="text-align:left;">Because the model wasn’t forced to analyze the visible parts of the scene, it essentially learned to treat those visible sections as global summaries rather than detailed spatial maps. To fix this, this paper has introduced a &quot;dense prediction loss,&quot; which forces the model to learn from both the hidden <i>and</i> the visible parts of the input.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b6d0b241-f4bb-46f4-bb4c-f992fd9a6c2f/flowchart.png?t=1774371412"/></div><p class="paragraph" style="text-align:left;">By supervising every part of the scene, the model is compelled to build a coherent, fine-grained understanding of where objects actually exist in space. 
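</p><p class="paragraph" style="text-align:left;">A minimal sketch of the difference between the classic masked-only objective and a dense objective could look like the following. The tensor shapes and the way the loss is split are assumptions for illustration, not the exact V-JEPA 2.1 recipe.</p><pre><code>import torch
import torch.nn.functional as F


def jepa_objectives(pred_feats, target_feats, masked):
    """Compare masked-only vs. dense patch supervision (illustrative sketch).

    pred_feats, target_feats: (batch, num_patches, dim) predicted features and
    targets (targets would typically come from a frozen or EMA teacher encoder).
    masked: (batch, num_patches) boolean mask marking the hidden patches.
    """
    per_patch = F.mse_loss(pred_feats, target_feats, reduction="none").mean(dim=-1)

    masked_only_loss = per_patch[masked].mean()   # supervise hidden patches only
    dense_loss = per_patch.mean()                 # supervise visible patches as well

    return masked_only_loss, dense_loss


# toy example: 2 clips, 196 patches, 128-dim features, roughly 75% of patches hidden
pred = torch.randn(2, 196, 128)
target = torch.randn(2, 196, 128)
mask = torch.rand(2, 196).lt(0.75)
print(jepa_objectives(pred, target, mask))
</code></pre><p class="paragraph" style="text-align:left;">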
They further refined this by applying the learning signal at multiple layers deep within the network, a method called &quot;deep self-supervision&quot;, and by using specialized tokenizers that handle images and videos in their native formats.</p><p class="paragraph" style="text-align:left;">These changes, combined with scaling the model and the diversity of training data, resulted in a system capable of state-of-the-art performance in both high-level action forecasting and precise, low-level spatial tasks like depth perception.</p><div class="embed"><a class="embed__url" href="https://github.com/facebookresearch/vjepa2?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rotate-attention-by-90-degrees-kimi-s-new-attention-residuals" target="_blank"><div class="embed__content"><p class="embed__title"> GitHub - facebookresearch/vjepa2: PyTorch code and models for VJEPA2 self-supervised learning from video. </p><p class="embed__description"> PyTorch code and models for VJEPA2 self-supervised learning from video. - facebookresearch/vjepa2 </p><p class="embed__link"> github.com/facebookresearch/vjepa2 </p></div><img class="embed__image embed__image--right" src="https://opengraph.githubassets.com/b371955bb418f677b546ce6319dcef580d2f0ef6c9a1bbae69c0e1ef648ad642/facebookresearch/vjepa2"/></a></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2603.14482?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rotate-attention-by-90-degrees-kimi-s-new-attention-residuals"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="not-all-bits-are-equal-scale-depend">Temporal Straightening for Latent Planning</h2><p class="paragraph" style="text-align:left;"><i>Wang et al. [New York University, Brown University, University of Toronto]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1.2k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> JEPA </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">When AI agents interact with complex environments, they typically translate high-dimensional sensory data, like raw video, into a &quot;latent space&quot; to make decisions. For a computer trying to navigate, this means the shortest path between two points in its digital map doesn&#39;t actually correspond to the shortest physical path. 
Because these internal maps are so twisted, gradient-based planners, which rely on finding the smoothest way to reach a goal, often get stuck or perform poorly, forcing researchers to rely on computationally expensive search-based methods.</p><table width="100%" class="bh__column_wrapper"><tr><td width="50%" class="bh__column"><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/da403088-bd99-43cf-9650-81f49112a9cd/umaze_gt.png?t=1774371120"/><div class="image__source"><span class="image__source_text"><p><i>UMaze: ground-truth geodesic distance.</i></p></span></div></div></td><td width="50%" class="bh__column"><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/1e6bf35b-99fc-4b21-beb8-5ae70299ec18/umaze_resnet_global.png?t=1774371184"/><div class="image__source"><span class="image__source_text"><p>UMaze: ResNet-global after straightening.</p></span></div></div></td></tr></table><p class="paragraph" style="text-align:left;">This paper introduces a way to &quot;straighten&quot; these maps, helping AI agents see the world more linearly so they can plan their actions more efficiently. The researchers developed a method called &quot;temporal straightening&quot; to fix the distorted geometry of latent spaces.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/0f97268c-96e1-4cb5-9f0e-dff3e1bb0e90/curvature_bars.png?t=1774371216"/><div class="image__source"><span class="image__source_text"><p>Latent Curvature and Open-Loop GD Success Rate for Different Encoders. Higher cosine similarity indicates lower curvature.</p></span></div></div><p class="paragraph" style="text-align:left;">This is based on a simple idea: as the agent moves, the model tracks its latent trajectory and minimizes the &quot;curvature&quot; by keeping the velocity vectors of consecutive steps as aligned as possible. It forces the AI to learn representations where movement in the latent space feels like travel along a straight, predictable line rather than a chaotic curve.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/4be08189-e092-4441-b3b8-909d98b197ca/architecture.png?t=1774371108"/></div><p class="paragraph" style="text-align:left;">By adding a straightening objective to the standard training process, the model learns to map observations in a way that Euclidean distance, the simplest way to measure &quot;how far&quot; a goal is, finally matches reality. 
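</p><p class="paragraph" style="text-align:left;">As a rough sketch, such a straightening term can be written as a penalty on the angle between consecutive velocity vectors along an encoded trajectory. The exact weighting and where it plugs into the world-model loss are assumptions here, not the authors’ precise objective.</p><pre><code>import torch
import torch.nn.functional as F


def straightening_penalty(latents):
    """Curvature penalty over a latent trajectory (illustrative sketch).

    latents: (batch, T, dim) encoder outputs for T consecutive observations.
    Velocities are differences between consecutive latents; the penalty is zero
    when consecutive velocities point in exactly the same direction, i.e. when
    the trajectory is a straight line in latent space.
    """
    vel = latents[:, 1:] - latents[:, :-1]                         # (batch, T-1, dim)
    cos = F.cosine_similarity(vel[:, 1:], vel[:, :-1], dim=-1)     # (batch, T-2)
    return (1.0 - cos).mean()


# toy usage: penalty for 4 trajectories of 10 steps in a 32-dim latent space
print(straightening_penalty(torch.randn(4, 10, 32)))
</code></pre><p class="paragraph" style="text-align:left;">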
This has a profound effect on planning: because the path is straighter, the math behind gradient-based optimization becomes much more stable.</p><p class="paragraph" style="text-align:left;">In experiments, this approach drastically improved success rates across a variety of navigation and manipulation tasks, allowing agents to reach goals with far greater precision and efficiency without needing the heavy compute power typically required for complex decision-making.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/8de89c07-fe60-40db-b295-8a40d19f4817/results_table.png?t=1774371279"/></div><div class="embed"><a class="embed__url" href="https://agenticlearning.ai/temporal-straightening/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rotate-attention-by-90-degrees-kimi-s-new-attention-residuals" target="_blank"><div class="embed__content"><p class="embed__title"> Temporal Straightening for Latent Planning </p><p class="embed__description"> Temporal straightening improves latent planning with world models. </p><p class="embed__link"> agenticlearning.ai/temporal-straightening </p></div></a></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2603.12231?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rotate-attention-by-90-degrees-kimi-s-new-attention-residuals"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="true" class="youtube_embed" frameborder="0" height="100%" src="https://youtube.com/embed/xUlX6jvwVfM" width="100%"></iframe></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=b3d4d11e-e424-4e1d-ac2c-18d1ab400aaf&utm_medium=post_rss&utm_source=the_ai_timeline">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>You can train OpenClaw just by talking to it?</title>
  <description>and more about GLM-OCR, pre-pre-training on NCA, IndexCache, and neural thickets</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/1d2bc86a-9536-44e2-b56a-7fe7d40b5018/issue_99.jpg" length="291057" type="image/jpeg"/>
  <link>https://mail.bycloud.ai/p/you-can-train-openclaw-just-by-talking-to-it</link>
  <guid isPermaLink="true">https://mail.bycloud.ai/p/you-can-train-openclaw-just-by-talking-to-it</guid>
  <pubDate>Tue, 17 Mar 2026 22:10:00 +0000</pubDate>
  <atom:published>2026-03-17T22:10:00Z</atom:published>
    <dc:creator>by cloud</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h6 class="heading" style="text-align:left;" id="nov-18-th-nov-24-th-33-latest-ai-re"><i>Mar 10th ~ Mar 17th</i><br><i>#99 Latest AI Research Explained Simply</i></h6><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="industry-news-in-1-line">🗞️ Industry News in 1 Line</h2><ol start="1"><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 24k</span></span> Claude 3.5 models (Opus and Sonnet) now support a <a class="link" href="https://claude.com/blog/1m-context-ga?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=you-can-train-openclaw-just-by-talking-to-it" target="_blank" rel="noopener noreferrer nofollow">1-million-token context window</a>, and allow users to process large codebases, extensive document sets, and up to 600 images or PDF pages per request. This update is available across all plans and is integrated by default into Claude Code at standard pricing. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/f11c0ca9-21ef-40d7-b65a-4524007ee59a/69b49c06e1c573f3ce50276b_image__3_.png?t=1773763196"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 44k</span></span> <a class="link" href="https://blog.google/products-and-platforms/products/maps/ask-maps-immersive-navigation/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=you-can-train-openclaw-just-by-talking-to-it" target="_blank" rel="noopener noreferrer nofollow">Google Maps has integrated Gemini AI</a> to help you explore and navigate more easily. You can now use a new &quot;Ask Maps&quot; feature to get conversational answers to specific, real-world questions, like finding a place to charge your phone or a well-lit tennis court. Additionally, a new &quot;Immersive Navigation&quot; tool is rolling out, which provides vivid 3D visuals and more detailed route guidance to help you navigate your surroundings with more confidence.</p></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 5.3k</span></span> <a class="link" href="https://hermes-agent.nousresearch.com/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=you-can-train-openclaw-just-by-talking-to-it" target="_blank" rel="noopener noreferrer nofollow">Hermes Agent</a> by Nous Research is an open-source, Python-based tool designed to grow with you by utilizing a multi-level memory system and persistent machine access, similar to OpenClaw. 
It works across your CLI and various messaging platforms, and offers developers an extensible framework for complex tasks like subagent management and programmatic tool calling.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/717b1744-c254-4544-a813-5be565ec7663/image.png?t=1773785089"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 6.2k</span></span> <a class="link" href="https://huggingface.co/1Covenant/Covenant-72B?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=you-can-train-openclaw-just-by-talking-to-it" target="_blank" rel="noopener noreferrer nofollow">Covenant-72B</a> is the largest decentralized LLM pre-training run in history. It is a 72B parameter model trained across commodity internet connections without centralized clusters or whitelisting. By using innovative techniques like SparseLoCo for bandwidth efficiency and a blockchain-based &quot;Gauntlet&quot; system for validation, the project achieved performance levels competitive with models trained in traditional data centers. <a class="link" href="https://www.tplr.ai/chat?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=you-can-train-openclaw-just-by-talking-to-it" target="_blank" rel="noopener noreferrer nofollow">Try it in browser</a>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/3308bd02-f95f-45e1-a2c1-7cfaf0b9bd3a/image.png?t=1773763642"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 19k</span></span> Yann LeCun’s new startup <a class="link" href="https://amilabs.xyz/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=you-can-train-openclaw-just-by-talking-to-it" target="_blank" rel="noopener noreferrer nofollow">Advanced Machine Intelligence</a> (AMI) has secured $1.03 billion in one of the largest seed rounds in history to develop AI systems capable of advanced reasoning, persistent memory, and world-model understanding. 
These models are designed to understand the physical world while featuring persistent memory and the ability to reason, plan, and operate safely.</p></li></ol><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><div class="section" style="background-color:transparent;border-color:#2C81E5;border-style:solid;border-width:5px;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">Intuitive AI Academy - NEW Distillation Chapter!</h2><div class="image"><a class="image__link" href="https://www.intuitiveai.academy/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=you-can-train-openclaw-just-by-talking-to-it" rel="noopener" target="_blank"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/734c79dc-aaa6-46ce-ac7d-41a5f4d84381/image.png?t=1769003669"/></a></div><p class="paragraph" style="text-align:left;">My latest project: Intuitive AI Academy has the perfect starting point for you! We focus on<b> building your intuition to understand LLMs</b>, from transformer components, to post-training logic. All in one place.</p><p class="paragraph" style="text-align:left;"><b>We just added a new chapter on Distillation</b> <b>too!</b></p><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e379f8f4-ca30-4804-a3fd-bfb8df42f3d8/image.png?t=1773169810"/></div><p class="paragraph" style="text-align:left;">We currently have an early bird offer, where you would get 40% off on the yearly plan for our early users. </p><p class="paragraph" style="text-align:left;">Use code: <b>TIMELINE</b></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.intuitiveai.academy/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=you-can-train-openclaw-just-by-talking-to-it"><span class="button__text" style=""> Check Out Intuitive AI Academy </span></a></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://theaitimeline.carrd.co/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=you-can-train-openclaw-just-by-talking-to-it" target="_blank" rel="noopener noreferrer nofollow">Advertise with The AI Timeline! </a></p></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="open-claw-rl-train-any-agent-simply">OpenClaw-RL: Train Any Agent Simply by Talking</h2><p class="paragraph" style="text-align:left;"><i>Wang et al. 
[Princeton University]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 674 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> RL </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Every time we interact with an AI, it receives immediate feedback: a follow-up question, a software error, or a screen transition. Existing AI systems throw this valuable experience away. They treat our replies merely as context for their very next move, missing a massive opportunity to learn. Researchers wanted to solve this by capturing these everyday reactions, what they call next-state signals, and turning them into a live, continuous learning stream.</p><p class="paragraph" style="text-align:left;">We can build a system where an AI naturally improves simply by being used, turning ordinary conversations and software tasks into a seamless training loop without needing to pause for offline updates. The researchers built a unified framework called OpenClaw-RL.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/ccc5932b-f549-4e87-9c66-3688f60fcce5/framework.png?t=1773760687"/></div><p class="paragraph" style="text-align:left;">The AI receives two powerful forms of information. First, there are evaluative signals, which act like a simple score indicating whether an action succeeded or frustrated the user. Second, there are directive signals. When a user corrects an AI by explaining how it should have responded, or when a software tool outputs a detailed error, it provides a clear map for improvement.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/ffc13183-eb06-465b-8f18-17dcb068f6fb/rlserver.png?t=1773760715"/></div><p class="paragraph" style="text-align:left;">The framework extracts these specific textual hints to give the AI rich, word-by-word guidance. 
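</p><p class="paragraph" style="text-align:left;">To make that concrete, here is a minimal Python sketch of how a single next-state signal could be folded into an evaluative score plus a directive hint. It is purely illustrative, not the paper&#39;s code; the marker list and every name in it are invented.</p><pre style="background-color:#F1F1F1;border:1px solid #C0C0C0;padding:5px;overflow-x:auto;"><code>
# Illustrative only: mapping the "next state" after an assistant reply (a user
# follow-up or a tool error) into the two signal types described above.
# The marker list and all names here are made up, not taken from the paper.
from dataclasses import dataclass

@dataclass
class Experience:
    prompt: str
    response: str
    reward: float      # evaluative signal: rough success score
    directive: str     # directive signal: verbatim hint for improvement

NEGATIVE_MARKERS = ("that is wrong", "no,", "error:", "traceback")

def signals_from_next_state(prompt, response, next_state):
    lowered = next_state.lower()
    failed = any(marker in lowered for marker in NEGATIVE_MARKERS)
    reward = -1.0 if failed else 1.0
    # When the next state explains what should have happened, keep its text
    # verbatim so the update step gets word-by-word guidance.
    directive = next_state if failed else ""
    return Experience(prompt, response, reward, directive)

exp = signals_from_next_state(
    prompt="Rename the config file",
    response="mv config.yml settings.yml",
    next_state="error: config.yml not found, it lives in ./conf/",
)
print(exp.reward, "|", exp.directive)   # -1.0 | error: config.yml not found, ...
</code></pre><p class="paragraph" style="text-align:left;">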
Because the system is built asynchronously, the AI can chat with a user, a background judge can evaluate its performance, and a training engine can update the AI&#39;s core behavior all at the exact same time, without ever interrupting the workflow.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/99ad0a95-e5a4-4b24-b4fe-9b7b6521cbf9/openclawrl1performance.png?t=1773760703"/></div><p class="paragraph" style="text-align:left;">By combining basic scoring with these rich textual hints, researchers found that personal assistants rapidly adapt their tone, becoming much more natural after just a handful of conversations.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2603.10165?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=you-can-train-openclaw-just-by-talking-to-it"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="neural-thickets-diverse-task-expert">Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights</h2><p class="paragraph" style="text-align:left;"><i>Gan and Isola [MIT CSAIL]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 818 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Pretraining </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> bycloud’s pick </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Teaching an AI a new skill feels like searching for a microscopic needle in a haystack. Researchers have long believed that adapting a massive, billion-parameter model required highly complex, meticulous, step-by-step mathematical adjustments just to find a version of the system that performed a specific task well. Blindly guessing the right settings was considered mathematically impossible and entirely out of the question.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/f6a5b673-9ba8-466d-855d-b147850e6379/CleanShot_2026-03-17_at_20.52.39_2x.png?t=1773760972"/></div><p class="paragraph" style="text-align:left;">However, this study discovered that as models grow larger and undergo extensive initial training, adapting them to specific tasks like mathematical reasoning or coding no longer requires such rigid, painstaking effort. 
The heavy lifting of learning has already been done, signaling an exciting era where customizing powerful technology is becoming surprisingly natural, fast, and accessible.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/da09c942-e16a-427c-b4ff-cbec467d428e/image.png?t=1773760932"/></div><p class="paragraph" style="text-align:left;">The researchers found that scaling up these models fundamentally transforms their underlying structure. Instead of a desolate landscape where a good solution is a lone needle, massive models are surrounded by a dense, flourishing &quot;thicket&quot; of specialized solutions.</p><p class="paragraph" style="text-align:left;">By making random, tiny adjustments to the model&#39;s underlying numbers, the researchers uncovered an abundance of hidden specialists waiting nearby. One random tweak might produce an expert in creative writing, while another creates a brilliant chemist, each specializing in one area while forgetting others.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b55e95a0-a834-4c9e-b907-ba6cf42e88e3/image.png?t=1773760946"/></div><p class="paragraph" style="text-align:left;">To harness this rich diversity, the team tested a beautifully simple approach: they generated thousands of random tweaks simultaneously, kept the top performers for a specific task, and had them vote on the final answer. This parallel guess-and-check strategy matched the accuracy of today’s most advanced training methods but operated in a <b>fraction of the time</b> because it avoided slow, sequential updates. </p><div class="embed"><a class="embed__url" href="https://thickets.mit.edu/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=you-can-train-openclaw-just-by-talking-to-it" target="_blank"><div class="embed__content"><p class="embed__title"> Neural Thickets · MIT </p><p class="embed__description"> Diverse Task Experts Are Dense Around Pretrained Weights </p><p class="embed__link"> thickets.mit.edu </p></div><img class="embed__image embed__image--right" src=""/></a></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2603.12228?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=you-can-train-openclaw-just-by-talking-to-it"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="index-cache-accelerating-sparse-att">IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse</h2><p class="paragraph" style="text-align:left;"><i>Bai et al. 
[Tsinghua University, Z.ai]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 540 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Attention </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">AI models use an attention mechanism to connect different pieces of information, but as the text gets longer, this demands a staggering amount of computing power. Developers recently introduced a clever shortcut called sparse attention, which uses a specialized &quot;indexer&quot; tool to scan the text and pick only the most relevant words for the AI to focus on at each step. It is a brilliant fix, but researchers ran into a new bottleneck. This indexer operates independently at every single layer of the AI network. As the text grows, just running this indexer consumes a massive chunk of the system&#39;s processing time, slowing everything down.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/0faeefe4-a5f3-4892-9ada-6e795d560984/CleanShot_2026-03-17_at_20.59.42_2x.png?t=1773761393"/><div class="image__source"><span class="image__source_text"><p>Side-by-side comparison of inference loops.</p></span></div></div><p class="paragraph" style="text-align:left;">Looking closely at how these models process data, researchers noticed an incredible inefficiency. They discovered that consecutive layers of the AI were repeatedly selecting almost the exact same important words, often with a near-perfect overlap. To solve this, the researchers developed an elegant, hopeful solution called IndexCache. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/5c9dd9f4-553b-4d7b-8983-fc61ed557c7a/CleanShot_2026-03-17_at_20.59.59_2x.png?t=1773761411"/></div><p class="paragraph" style="text-align:left;">Instead of forcing every layer to do the heavy lifting of scanning and selecting information, IndexCache designates a few specific layers as &quot;Full&quot; layers to run the indexer. The remaining layers become &quot;Shared&quot; layers, which simply borrow the selected words from the nearest Full layer. The team created two ways to apply this: a training-free method that calculates the absolute best pattern of Full and Shared layers for existing models, and a training-aware method that actively teaches the AI to share this data.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e5e8d358-d8b0-4118-b1ca-7bb548b9a795/CleanShot_2026-03-17_at_21.00.30_2x.png?t=1773761439"/><div class="image__source"><span class="image__source_text"><p>Training-free IndexCache at 1/2, 1/4, and 1/8 indexer retention. ‘Long’ and ‘G&R’ aggregate benchmark scores.</p></span></div></div><p class="paragraph" style="text-align:left;">By simply reusing this cached information, researchers eliminated 75 percent of the heavy indexer computations with negligible drops in the model&#39;s reasoning quality. 
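</p><p class="paragraph" style="text-align:left;">The reuse itself is easy to picture. In the toy Python sketch below, which assumes a simple dot-product indexer and a fixed layer schedule rather than the paper&#39;s actual components, only the designated Full layers recompute the token selection and every other layer borrows it.</p><pre style="background-color:#F1F1F1;border:1px solid #C0C0C0;padding:5px;overflow-x:auto;"><code>
# Toy sketch of cross-layer index reuse. The dot-product indexer and the fixed
# "one Full layer every N layers" schedule are simplifications for illustration.
import numpy as np

def top_k_indices(query, keys, k):
    scores = keys @ query                 # cheap relevance score per cached token
    return np.argsort(scores)[-k:]        # keep only the k most relevant tokens

def run_layers(query, keys, num_layers=8, full_every=4, k=4):
    cached, selections, indexer_runs = None, [], 0
    for layer in range(num_layers):
        if layer % full_every == 0:       # designated "Full" layer: run the indexer
            cached = top_k_indices(query, keys, k)
            indexer_runs += 1
        selections.append(cached)         # "Shared" layers simply borrow the result
    # (A real model projects queries and keys per layer; only the selection is shared.)
    return selections, indexer_runs

rng = np.random.default_rng(0)
query, keys = rng.normal(size=16), rng.normal(size=(128, 16))
selections, indexer_runs = run_layers(query, keys)
print(f"{len(selections)} layers attended sparsely, indexer ran {indexer_runs} times")
</code></pre><p class="paragraph" style="text-align:left;">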
When tested on massive systems, this straightforward change nearly doubled the speed at which the AI reads information and significantly accelerated its ability to generate answers.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2603.12201?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=you-can-train-openclaw-just-by-talking-to-it"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="not-all-bits-are-equal-scale-depend">Training Language Models via Neural Cellular Automata</h2><p class="paragraph" style="text-align:left;"><i>Lee et al. [MIT, Improbable AI Lab]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1.5k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM Sampling </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">LLMs rely on massive amounts of human-written text to learn how to reason and communicate. However, this approach faces a looming wall: high-quality human data is finite, often riddled with biases, and mixes actual reasoning with messy, subjective language. They investigated whether models could learn the fundamental mechanics of reasoning by training on purely synthetic, non-linguistic data before ever seeing a human sentence.</p><p class="paragraph" style="text-align:left;">To test this, researchers turned to neural cellular automata (NCA), algorithmic systems that generate complex, ever-changing grid patterns using simple, local rules. Unlike static text, these patterns can be generated cheaply and in infinite supply, allowing for precise control over the &quot;complexity&quot; of the training data.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/5524ce76-66dd-4891-be54-98935b454324/image.png?t=1773761613"/></div><p class="paragraph" style="text-align:left;">The team pre-trained models on these synthetic grid trajectories, then followed up with standard training on natural language. The results show that models that practiced on just 164 million synthetic tokens learned faster and performed better than those trained solely on significantly larger amounts of traditional internet text.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/7e2ad5b0-e9c5-4e39-8137-0c2051ab7c38/image.png?t=1773761635"/></div><p class="paragraph" style="text-align:left;">While traditional text training can cause a model to rely on human biases or semantic shortcuts, the synthetic grids force the model to focus purely on tracking long-range patterns and inferring underlying rules. 
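</p><p class="paragraph" style="text-align:left;">For intuition, here is a tiny sketch of what such synthetic pretraining data can look like. The update rule below is a hand-written local rule standing in for the learned neural one, and the tokenization is deliberately naive; the point is only how grid trajectories turn into token streams.</p><pre style="background-color:#F1F1F1;border:1px solid #C0C0C0;padding:5px;overflow-x:auto;"><code>
# Illustrative generator for synthetic "grid trajectory" pretraining tokens.
# A real NCA learns its update rule with a tiny neural net; the hand-written
# totalistic rule below is only a stand-in to show the shape of the data.
import numpy as np

def step(grid):
    # Each cell looks at its four neighbours on a wrap-around grid.
    neighbours = sum(np.roll(grid, shift, axis) for shift in (-1, 1) for axis in (0, 1))
    return ((neighbours + grid) % 3 == 1).astype(np.int8)    # arbitrary local rule

def trajectory_tokens(size=8, steps=4, seed=0):
    rng = np.random.default_rng(seed)
    grid = rng.integers(0, 2, size=(size, size), dtype=np.int8)
    frames = [grid]
    for _ in range(steps):
        grid = step(grid)
        frames.append(grid)
    # Flatten every frame into one long sequence the language model can ingest.
    return np.concatenate([frame.ravel() for frame in frames])

tokens = trajectory_tokens()
print(tokens.shape, tokens[:16])    # (320,) followed by the first 16 cell states
</code></pre><p class="paragraph" style="text-align:left;">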
Furthermore, the researchers found they could tune the complexity of this synthetic data to match specific domains; for instance, code benefited from simpler rules, while math and web text thrived on higher complexity.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/50dd5649-a6bb-49af-8857-54207f5f196e/CleanShot_2026-03-17_at_21.04.21_2x.png?t=1773761678"/></div><div class="embed"><a class="embed__url" href="https://hanseungwook.github.io/blog/nca-pre-pre-training/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=you-can-train-openclaw-just-by-talking-to-it" target="_blank"><div class="embed__content"><p class="embed__title"> Training Language Models via Neural Cellular Automata </p><p class="embed__link"> hanseungwook.github.io/blog/nca-pre-pre-training </p></div><img class="embed__image embed__image--right" src=""/></a></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2603.10055?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=you-can-train-openclaw-just-by-talking-to-it"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="glmocr-technical-report">GLM-OCR Technical Report</h2><p class="paragraph" style="text-align:left;"><i>Duan et al. [Zhipu AI, Tsinghua University]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1.1k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM RL </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Modern information systems rely heavily on extracting knowledge from complex, visually dense documents like financial reports, invoices, and scientific papers. While recent multimodal AI models have improved how we read these documents, they often suffer from a major drawback: their massive size makes them slow, memory-intensive, and difficult to deploy in practical, real-world settings.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/115f675f-7e46-41af-b090-3147108a1f17/CleanShot_2026-03-17_at_21.08.05_2x.png?t=1773761902"/><div class="image__source"><span class="image__source_text"><p>Architecture and workflow of the GLM-OCR framework</p></span></div></div><p class="paragraph" style="text-align:left;">The researchers developed GLM-OCR, a lightweight, highly optimized framework that packs significant power into a compact <b>0.9-billion-parameter</b> design. The system merges a specialized visual encoder with a streamlined language decoder. 
What makes it particularly clever is the shift away from standard “one-token-at-a-time” generation, which is notoriously slow for structured documents.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/33a6a399-2319-4610-90c4-3c1092f69e02/docparse.png?t=1773761864"/></div><p class="paragraph" style="text-align:left;">Instead, the researchers implemented a Multi-Token Prediction mechanism that allows the model to predict several tokens simultaneously. By using a shared-parameter scheme to keep memory usage low, the model effectively boosts throughput, the speed at which it processes data, by roughly 50% without sacrificing accuracy.</p><p class="paragraph" style="text-align:left;">To handle real-world complexities, the system employs a two-stage pipeline. First, it uses an analysis module to detect the layout of a document, breaking a complex page into manageable regions. These regions are then processed in parallel, allowing for faster and more robust recognition of everything from handwritten text to complicated table structures.</p><div class="embed"><a class="embed__url" href="https://huggingface.co/zai-org/GLM-OCR?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=you-can-train-openclaw-just-by-talking-to-it" target="_blank"><div class="embed__content"><p class="embed__title"> zai-org/GLM-OCR · Hugging Face </p><p class="embed__description"> We’re on a journey to advance and democratize artificial intelligence through open source and open science. </p><p class="embed__link"> huggingface.co/zai-org/GLM-OCR </p></div><img class="embed__image embed__image--right" src="https://cdn-thumbnails.huggingface.co/social-thumbnails/models/zai-org/GLM-OCR.png"/></a></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2603.10910?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=you-can-train-openclaw-just-by-talking-to-it"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="true" class="youtube_embed" frameborder="0" height="100%" src="https://youtube.com/embed/qznFV59f3Uk" width="100%"></iframe></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=18f3814c-56e8-4b68-8e85-f4475dd94f24&utm_medium=post_rss&utm_source=the_ai_timeline">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Flash Attention 4 is nuts</title>
  <description>and more about Speculative Speculative Decoding, SWE-CI, and Beyond Language Modeling</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/242099e8-0ada-4e3c-8919-7c8f74e9104f/issue_98.jpg" length="421025" type="image/jpeg"/>
  <link>https://mail.bycloud.ai/p/flash-attention-4-is-nuts</link>
  <guid isPermaLink="true">https://mail.bycloud.ai/p/flash-attention-4-is-nuts</guid>
  <pubDate>Tue, 10 Mar 2026 19:30:00 +0000</pubDate>
  <atom:published>2026-03-10T19:30:00Z</atom:published>
    <dc:creator>by cloud</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h6 class="heading" style="text-align:left;" id="nov-18-th-nov-24-th-33-latest-ai-re"><i>Mar 3rd ~ Mar 10th</i><br><i>#98 Latest AI Research Explained Simply</i></h6><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="industry-news-in-1-line">🗞️ Industry News in 1 Line</h2><ol start="1"><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 6.8k</span></span> Sarvam AI has announced the <a class="link" href="https://www.sarvam.ai/blogs/sarvam-30b-105b?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-4-is-nuts" target="_blank" rel="noopener noreferrer nofollow">open-source release of its 30B and 105B parameter models</a>, which were developed entirely in-house to target both global benchmarks and Indian language tasks. The model weights are now available on Hugging Face and AIKosh, featuring day-zero support for SGLang with vLLM compatibility expected soon. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/cf4452b8-d497-484a-a7df-31b778e5e372/image.png?t=1773165675"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 23k</span></span> OpenAI has <a class="link" href="https://openai.com/index/introducing-gpt-5-4/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-4-is-nuts" target="_blank" rel="noopener noreferrer nofollow">launched GPT-5.4 Thinking and GPT-5.4 Pro models</a> across ChatGPT, the API, and Codex. These new models are designed to enhance reasoning, coding, and agentic workflows. It also includes features such as advanced deep web research capabilities and the ability for users to interrupt the model mid-process to provide real-time instructions or course corrections.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/0483e7fc-d71d-437f-a91e-adc9d49bb8d5/SWE-Bench_Pro__public_.png?t=1773165809"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 15k</span></span> Andrej Karpathy has developed an <a class="link" href="https://github.com/karpathy/autoresearch?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-4-is-nuts" target="_blank" rel="noopener noreferrer nofollow">&quot;autoresearch&quot; agentic workflow</a> designed to autonomously optimize neural network training through iterative experimentation. When applied to his nanochat project, the tool identified 20 additive changes that reduced the &quot;Time to GPT-2&quot; training benchmark from 2.02 hours to 1.80 hours. 
</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/cc49f92d-ccdd-4a13-b192-b33e98b2076d/image.png?t=1773165985"/></div></li></ol><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><div class="section" style="background-color:transparent;border-color:#2C81E5;border-style:solid;border-width:5px;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">Intuitive AI Academy - NEW MoE Chapter!</h2><div class="image"><a class="image__link" href="https://www.intuitiveai.academy/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-4-is-nuts" rel="noopener" target="_blank"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/734c79dc-aaa6-46ce-ac7d-41a5f4d84381/image.png?t=1769003669"/></a></div><p class="paragraph" style="text-align:left;">My latest project: Intuitive AI Academy has the perfect starting point for you! We focus on<b> building your intuition to understand LLMs</b>, from transformer components, to post-training logic. All in one place.</p><p class="paragraph" style="text-align:left;"><b>We just added a new chapter on MoE</b>, which goes through the history, the key techniques, and the current state of MoE that frontier models use. With over 10,000 words written!</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e379f8f4-ca30-4804-a3fd-bfb8df42f3d8/image.png?t=1773169810"/></div><p class="paragraph" style="text-align:left;">We currently have an early bird offer, where early users get 40% off the yearly plan. </p><p class="paragraph" style="text-align:left;">Use code: <b>TIMELINE</b></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.intuitiveai.academy/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-4-is-nuts"><span class="button__text" style=""> Check Out Intuitive AI Academy </span></a></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://theaitimeline.carrd.co/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-4-is-nuts" target="_blank" rel="noopener noreferrer nofollow">Advertise with The AI Timeline! </a></p></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="speculative-speculative-decoding">Speculative Speculative Decoding</h2><p class="paragraph" style="text-align:left;"><i>Kumar et al. 
[Stanford University, Princeton University, Together AI]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 22k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Decoding </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">The primary hurdle in making LLMs feel instantaneous is a phenomenon known as the sequential bottleneck. Standard AI models generate text one word, or &quot;token&quot;, at a time, which fails to fully use the massive parallel computing power of modern hardware. Researchers previously introduced &quot;speculative decoding&quot;, a method where a small, fast model drafts a few guesses for a larger model to verify. The drafting model has to wait for the larger model to finish checking its work before it can start guessing the next set of words. This creates a persistent lag that limits how fast even the most advanced systems can communicate.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/6c894daa-cc17-47b9-99da-8c9fb68f1991/CleanShot_2026-03-10_at_22.56.12_2x.png?t=1773163585"/><div class="image__source"><span class="image__source_text"><p>Ordinary speculative decoding (SD) requires the verifier to wait idly for the draft to speculate.</p></span></div></div><p class="paragraph" style="text-align:left;">To eliminate this idle time, researchers have developed Speculative Speculative Decoding (SSD) via an optimized algorithm called Saguaro. The biggest change is decoupling the &quot;guesser&quot; from the &quot;checker&quot; entirely. While the large target model is busy verifying a current batch of text, Saguaro’s draft model looks ahead and predicts several possible outcomes of that verification. It prepares a menu of &quot;potential futures.&quot; If the larger model confirms one of these predicted outcomes, the system can immediately provide the next set of words without any drafting delay. This parallel approach effectively hides the time spent guessing, transforming a sequential process into a streamlined, continuous flow.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/8209abcd-3b89-425b-b631-38eb8d009ab5/CleanShot_2026-03-10_at_22.57.11_2x.png?t=1773163646"/></div><p class="paragraph" style="text-align:left;">By using a clever &quot;geometric fan-out&quot; strategy, Saguaro focuses its computational effort on the most likely verification results, ensuring the &quot;speculation cache&quot; is highly accurate even at higher creative temperatures. 
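</p><p class="paragraph" style="text-align:left;">A toy version of that speculation cache, with stand-in draft and verify functions and an invented outcome model, might look like the sketch below; the real system runs drafting and verification genuinely in parallel, whereas this only shows the bookkeeping.</p><pre style="background-color:#F1F1F1;border:1px solid #C0C0C0;padding:5px;overflow-x:auto;"><code>
# Toy bookkeeping for a speculation cache. The draft/verify functions and the
# outcome model are stand-ins; the real system overlaps them on real hardware.
import random

def draft(prefix, n=4):
    # Stand-in draft model: propose the next n tokens.
    return [f"tok{len(prefix) + i}" for i in range(n)]

def likely_outcomes(tokens, fanout=2):
    # Guess the most probable verification results (accepted prefixes of the draft).
    return [tokens[:len(tokens) - i] for i in range(fanout)]

def verify(tokens):
    # Stand-in target model: accepts a random-length prefix of the draft.
    return tokens[:random.randint(0, len(tokens))]

random.seed(0)
context, next_draft, hits = [], None, 0
for _ in range(20):
    tokens = next_draft if next_draft is not None else draft(context)
    # While verification is "running", pre-draft continuations for a small menu
    # of potential futures, keyed by which prefix ends up being accepted.
    cache = {tuple(out): draft(context + out) for out in likely_outcomes(tokens)}
    accepted = verify(tokens)
    context = context + accepted
    next_draft = cache.get(tuple(accepted))   # hit: the next draft is already ready
    hits += 1 if next_draft is not None else 0
print(f"speculation cache hits: {hits} of 20 rounds")
</code></pre><p class="paragraph" style="text-align:left;">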
This method is entirely lossless, meaning it achieves the exact same high-quality output as the original model but at much higher speeds.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/7e6b4af7-9d1e-4ca6-b2e2-b94ddd748d6d/CleanShot_2026-03-10_at_22.58.11_2x.png?t=1773163702"/><div class="image__source"><span class="image__source_text"><p>Advantage of geometric fan out strategy increases at higher temperatures, improving both speculation cache hit rate (right) and thus end-to-end speed (left).</p></span></div></div><p class="paragraph" style="text-align:left;">Initial results show that Saguaro can deliver text up to five times faster than traditional generation methods and twice as fast as previous state-of-the-art speculative techniques.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2603.03251?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-4-is-nuts"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="beyond-language-modeling-an-explora">Beyond Language Modeling: An Exploration of Multimodal Pretraining</h2><p class="paragraph" style="text-align:left;"><i>Tong et al. [FAIR, New York University]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 424 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Multimodal LLMs </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> bycloud’s pick </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">AI is getting good at manipulating language, but it lacks a fundamental grasp of the physical world. Researchers describe this limitation using the &quot;allegory of the cave&quot;: current models have mastered the description of shadows on a wall, text, without ever seeing the actual objects casting those shadows. Because text is a human abstraction, it is a &quot;lossy&quot; version of reality that misses the raw physics, geometry, and causality of our environment. 
</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b9e32d82-c9ea-43b4-ad90-ee7563414f84/CleanShot_2026-03-10_at_23.03.51_2x.png?t=1773164041"/><div class="image__source"><span class="image__source_text"><p>Overview of this study.</p></span></div></div><p class="paragraph" style="text-align:left;">By training models to &quot;see&quot; and &quot;read&quot; simultaneously from birth, we can build a more grounded intelligence that understands the world’s dynamics directly.</p><p class="paragraph" style="text-align:left;">The researchers developed a unified model using a framework called Transfusion, which trains a single &quot;brain&quot; to perform two different tasks at once: predicting the next word in a sequence and reconstructing visual frames through a process called diffusion.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/5e68c62c-b95a-47d1-8559-e1402a06236c/CleanShot_2026-03-10_at_23.04.21_2x.png?t=1773164081"/><div class="image__source"><span class="image__source_text"><p>Examples of training data.</p></span></div></div><p class="paragraph" style="text-align:left;">They discovered that the most effective way to do this is by using a single, high-quality visual representation known as a Representation Autoencoder. This contradicts the traditional belief that you need different &quot;eyes&quot; for understanding an image versus creating one; instead, a unified representation excels at both while keeping the model’s language skills sharp.</p><p class="paragraph" style="text-align:left;">The study also showed that learning from diverse visual data, like video and image-text pairs, actually improves the model’s performance on downstream tasks like reasoning and planning. To manage this complexity, they utilized a &quot;Mixture-of-Experts&quot; architecture.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/753b0252-9988-4ad6-a43d-8c9c3251ae9a/CleanShot_2026-03-10_at_23.05.00_2x.png?t=1773164109"/><div class="image__source"><span class="image__source_text"><p>Multimodal co-training exceeds unimodal performance.</p></span></div></div><p class="paragraph" style="text-align:left;">This design allows the model to naturally evolve specialized internal &quot;experts&quot; for different tasks. It learned to dedicate more capacity to language while efficiently processing the massive amounts of data required by vision. Most impressively, this unified training allowed &quot;world modeling&quot; capabilities to emerge.</p><p class="paragraph" style="text-align:left;">The model could predict the physical outcome of actions, like navigating a robot through a room, using simple text commands, proving that a truly multimodal foundation can bridge the gap between human language and physical reality.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b040cd9f-0240-4e01-b5a9-4265be7ae36d/dense_scaling.png?t=1773164142"/><div class="image__source"><span class="image__source_text"><p>Scaling laws for unified dense models. 
</p></span></div></div><div class="embed"><a class="embed__url" href="https://beyond-llms.github.io/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-4-is-nuts" target="_blank"><div class="embed__content"><p class="embed__title"> Beyond Language Modeling: An Exploration of Multimodal Pretraining </p><p class="embed__description"> Empirical insights on representation, data, architecture, and scaling for native multimodal pretraining. </p><p class="embed__link"> beyond-llms.github.io </p></div><img class="embed__image embed__image--right" src="https://beyond-llms.github.io/assets/figures/progression.png"/></a></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2603.03276?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-4-is-nuts"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="flash-attention-4-algorithm-and-ker">FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling</h2><p class="paragraph" style="text-align:left;"><i>Zadouri et al. [Princeton University, Meta, Colfax Research, NVIDIA, Georgia Tech, Together AI]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 430 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Attention </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Latest AI hardware is significantly faster at basic matrix multiplication, but other components, like the units responsible for specialized math or moving data around, haven&#39;t kept the same pace. This creates a digital traffic jam where the fastest parts of a processor are frequently left idling, waiting for slower sections to finish their work.</p><p class="paragraph" style="text-align:left;">FlashAttention-4 was designed to bridge this gap, by offering a clever software design that can overcome the physical limitations of even the most advanced hardware.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/4410b593-2aea-4683-8a58-7cfd1edeef45/CleanShot_2026-03-10_at_23.09.29_2x.png?t=1773164404"/><div class="image__source"><span class="image__source_text"><p>FlashAttention-4 forward pipeline.</p></span></div></div><p class="paragraph" style="text-align:left;">The researchers taught the software to find creative shortcuts around these hardware bottlenecks. 
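</p><p class="paragraph" style="text-align:left;">To get a feel for the kind of shortcut involved, here is a rough CPU-side sketch of emulating the exponential with range reduction plus a small fitted polynomial, so the hot loop needs only fused multiply-adds. It illustrates the general trick, not the paper&#39;s kernel.</p><pre style="background-color:#F1F1F1;border:1px solid #C0C0C0;padding:5px;overflow-x:auto;"><code>
# CPU-side illustration (not the actual kernel): emulate exp() with range
# reduction plus a low-degree polynomial, so the inner loop needs only fused
# multiply-adds and an exponent adjustment instead of a special-function call.
import math
import numpy as np

LOG2E = math.log2(math.e)
fs = np.linspace(0.0, 1.0, 2001)
poly = np.polynomial.Chebyshev.fit(fs, 2.0 ** fs, deg=6)   # fitted once, offline

def fast_exp(x):
    y = x * LOG2E                     # exp(x) equals 2**y
    n = math.floor(y)
    f = y - n                         # fractional part, always in [0, 1)
    return math.ldexp(poly(f), n)     # polynomial for 2**f, exact scaling by 2**n

xs = np.linspace(-20.0, 0.0, 5001)    # softmax inputs are shifted to be non-positive
approx = np.array([fast_exp(x) for x in xs])
rel_err = np.max(np.abs(approx - np.exp(xs)) / np.exp(xs))
print(f"max relative error of the degree-6 emulation: {rel_err:.1e}")
</code></pre><p class="paragraph" style="text-align:left;">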
Because the chip’s dedicated unit for calculating exponentials is often the slowest link in the chain, they developed a way to emulate its functions using more plentiful, general-purpose math units.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/416000d7-9f8d-4de6-84f4-b3375a1e6f4e/CleanShot_2026-03-10_at_23.10.29_2x.png?t=1773164446"/><div class="image__source"><span class="image__source_text"><p>FlashAttention-4 backward computation graph (5 MMA operations + 2 elementwise operations), showing the 1-CTA MMA mode software pipeline order across the prologue, main loop, and tail.</p></span></div></div><p class="paragraph" style="text-align:left;">By using a mathematical technique called polynomial approximation, the software effectively mimics the specialized unit, achieving nearly identical accuracy at much higher speeds. Additionally, they introduced a &quot;conditional rescaling&quot; method that intelligently skips redundant calculations unless they are truly necessary for precision.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/84be4ffc-797a-4dee-95e7-3fbabf052516/CleanShot_2026-03-10_at_23.11.11_2x.png?t=1773164482"/><div class="image__source"><span class="image__source_text"><p>Backward pass TFLOPS on B200 (FP16/BF16) with head dimension 128.</p></span></div></div><p class="paragraph" style="text-align:left;">By keeping more data in high-speed &quot;tensor memory&quot; and coordinating tasks so that different parts of the chip work in perfect sync, FlashAttention-4 can reach incredible speeds, hitting up to 1613 trillion operations per second on the latest Blackwell GPUs. </p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2603.05451?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-4-is-nuts"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="not-all-bits-are-equal-scale-depend">SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration</h2><p class="paragraph" style="text-align:left;"><i>Chen et al. [Sun Yat-sen University, Alibaba Group]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 855 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM Evaluation </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Software engineering is rarely about writing a perfect piece of code in one go; it is an ongoing marathon of maintenance, updates, and evolving requirements. 
While current AI models have become quite good at solving isolated, one-shot coding tasks, these models are often tested on their ability to provide a quick fix rather than their capacity to sustain a healthy codebase over time.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/6a9b38f2-3c2f-42bc-bce1-3794987fc8c1/CleanShot_2026-03-10_at_23.17.38_2x.png?t=1773164868"/><div class="image__source"><span class="image__source_text"><p>Unlike previous benchmarks, SWE-CI proposes an evolution-based evaluation.</p></span></div></div><p class="paragraph" style="text-align:left;">To address this, a new benchmark called SWE-CI shifts the focus from simple functional correctness to long-term maintainability, providing a much-needed lens into how AI agents handle the messy, continuous reality of software development.</p><p class="paragraph" style="text-align:left;">The researchers built this benchmark using actual evolutionary histories from real-world software repositories, with tasks spanning an average of seven months and dozens of consecutive updates. To mirror a professional environment, they employed a dual-agent system where one AI acts as an Architect to identify gaps and set requirements, while a second Programmer agent implements the changes. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/3fd18bb6-d095-4d4a-b4b1-2319b1542084/CleanShot_2026-03-10_at_23.18.07_2x.png?t=1773164901"/><div class="image__source"><span class="image__source_text"><p>Data curation process of SWE-CI.</p></span></div></div><p class="paragraph" style="text-align:left;">Success is measured by a novel metric called EvoScore, which specifically rewards agents that make decisions facilitating future growth rather than just immediate fixes. </p><p class="paragraph" style="text-align:left;">The results show that the newest AI models are improving at an accelerating pace, but most still struggle to prevent regressions, the frustrating phenomenon where adding a new feature accidentally breaks an old one. 
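</p><p class="paragraph" style="text-align:left;">As a purely hypothetical illustration of why such a metric matters (the formula below is invented, not the paper&#39;s EvoScore), compare an agent that racks up quick fixes while breaking old tests against one that grows the codebase without regressions:</p><pre style="background-color:#F1F1F1;border:1px solid #C0C0C0;padding:5px;overflow-x:auto;"><code>
# Hypothetical illustration only; this is NOT the paper's EvoScore formula.
# It just shows why a regression-aware score separates quick fixes from
# changes that keep the codebase healthy across consecutive CI stages.
def maintainability_score(stages, regression_penalty=2.0):
    score = 0.0
    for stage in stages:
        score += stage["new_passed"]                         # progress on new requirements
        score -= regression_penalty * stage["old_broken"]    # regressions hurt extra
    return score / max(1, len(stages))

quick_fixer = [{"new_passed": 3, "old_broken": 2}, {"new_passed": 3, "old_broken": 3}]
maintainer  = [{"new_passed": 2, "old_broken": 0}, {"new_passed": 3, "old_broken": 0}]
print(maintainability_score(quick_fixer), maintainability_score(maintainer))   # -2.0 2.5
</code></pre><p class="paragraph" style="text-align:left;">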
This discovery highlights that the next great frontier for AI developers isn&#39;t just writing code that works, but writing code that lasts.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e3bf44d7-f49b-4a1b-ba2a-d8471677be5f/CleanShot_2026-03-10_at_23.18.41_2x.png?t=1773164952"/><div class="image__source"><span class="image__source_text"><p>SWE-CI uses an architect-programmer dual-agent workflow to model the continuous integration cycle of professional software teams in the real world.</p></span></div></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2603.03823?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-4-is-nuts"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="real-money-fake-models-deceptive-mo"><b>Real Money, Fake Models: Deceptive Model Claims in Shadow APIs</b></h1><p class="paragraph" style="text-align:left;"><i>Zhang et al. [CISPA Helmholtz Center for Information Security]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 805 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM RL </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Access to frontier models is restricted by high costs, complex payment barriers, or geographical limitations. To bridge this gap, many researchers and developers have turned to &quot;shadow APIs&quot;, third-party services that promise the same power as official models like GPT-5 or Gemini through unofficial, often cheaper channels. While these services appear to democratize access, they operate in a digital gray market with almost no transparency.</p><p class="paragraph" style="text-align:left;">Researchers recently set out to investigate whether these shadow services are truly delivering what they advertise or if they are quietly undermining the integrity of the scientific work built upon them. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/0e303853-0a65-491c-84ce-90e8cc3e488b/CleanShot_2026-03-10_at_23.21.12_2x.png?t=1773165088"/><div class="image__source"><span class="image__source_text"><p>Landscape of the shadow APIs.</p></span></div></div><p class="paragraph" style="text-align:left;">The investigation revealed a concerning gap between marketing promises and technical reality. 
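</p><p class="paragraph" style="text-align:left;">The auditing idea can be pictured with a toy sketch: send the same fixed probes to the official model and to the questionable endpoint, then measure how far apart the replies sit. Everything below, from the probes to the bag-of-words embedding, is invented for illustration.</p><pre style="background-color:#F1F1F1;border:1px solid #C0C0C0;padding:5px;overflow-x:auto;"><code>
# Toy response fingerprint: same fixed probes to both endpoints, then a rough
# distance between the replies. Probes, replies, and the bag-of-words
# "embedding" are all invented; a real audit uses far richer signatures.
from collections import Counter
import math

def embed(text):
    return Counter(text.lower().split())

def cosine_distance(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

reference = ["I am unable to share that, but here is a safe alternative.",
             "The integral evaluates to pi squared over six."]
shadow    = ["Sure thing!! Here u go, no worries at all lol.",
             "It equals about 1.64, hope that helps."]

distances = [cosine_distance(embed(r), embed(s)) for r, s in zip(reference, shadow)]
print(f"mean distance to the reference model: {sum(distances) / len(distances):.2f}")
</code></pre><p class="paragraph" style="text-align:left;">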
By auditing these services through &quot;model fingerprinting&quot;, a technique that identifies an AI by analyzing the unique statistical patterns and &quot;signatures&quot; in its responses, researchers discovered that nearly half of the shadow services were not using the models they claimed.</p><p class="paragraph" style="text-align:left;">In a classic &quot;<b>bait-and-switch</b>&quot;, premium proprietary models were frequently swapped for cheaper, open-source alternatives behind the scenes. This deception led to significant performance collapses; in high-stakes fields like medicine and law, accuracy dropped by nearly half when compared to official versions.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/60bea12e-9481-4b55-b0f6-9aa9422f0d04/CleanShot_2026-03-10_at_23.26.05_2x.png?t=1773165377"/><div class="image__source"><span class="image__source_text"><p>Fingerprinting results via LLMmap matched model and mean cosine distance D with standard.</p></span></div></div><p class="paragraph" style="text-align:left;">The researchers found that while these shadow services might handle simple tasks well, they often fail during complex reasoning and display unpredictable safety behaviors. In addition to identifying the mismatch, the study used statistical testing to prove that the outputs from these services were fundamentally different from the official sources.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2603.01919?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-4-is-nuts"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="true" class="youtube_embed" frameborder="0" height="100%" src="https://youtube.com/embed/aQr_FWJETOk" width="100%"></iframe></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=1fddbe5c-9018-46d7-9b07-7ec128ff5df5&utm_medium=post_rss&utm_source=the_ai_timeline">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Compress Context... Into a LoRA!?</title>
  <description>plus more on Learning Without Training and The Geometry of Noise</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/786806a7-849f-46b5-9ae8-211df95da43c/issue_97.jpg" length="138646" type="image/jpeg"/>
  <link>https://mail.bycloud.ai/p/compress-context-into-a-lora</link>
  <guid isPermaLink="true">https://mail.bycloud.ai/p/compress-context-into-a-lora</guid>
  <pubDate>Wed, 04 Mar 2026 20:00:00 +0000</pubDate>
  <atom:published>2026-03-04T20:00:00Z</atom:published>
    <dc:creator>by cloud</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h6 class="heading" style="text-align:left;" id="nov-18-th-nov-24-th-33-latest-ai-re"><i>Feb 24th ~ Mar 4th</i><br><i>#97 Latest AI Research Explained Simply</i></h6><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="industry-news-in-1-line">🗞️ Industry News in 1 Line</h2><ol start="1"><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 10k</span></span> Google has begun rolling out Nano Banana 2, its latest image generation model. The updated model uses real-time web search data to improve real-world accuracy and introduces the ability to render clear, multilingual text for designs like posters and logos. Additionally, Nano Banana 2 brings faster generation speeds alongside enhancements to lighting, textures, and overall image detail. Try it today via the <a class="link" href="https://gemini.google.com/app?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=compress-context-into-a-lora" target="_blank" rel="noopener noreferrer nofollow">Gemini app</a> and <a class="link" href="https://aistudio.google.com/prompts/new_chat?model=gemini-3.1-flash-image-preview&utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=compress-context-into-a-lora" target="_blank" rel="noopener noreferrer nofollow">web interface</a>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/50859a00-0a5a-49b2-8549-48636a46a2b0/image.png?t=1772642049"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 3.7k</span></span> <a class="link" href="https://openai.com/index/our-agreement-with-the-department-of-war/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=compress-context-into-a-lora" target="_blank" rel="noopener noreferrer nofollow">OpenAI has signed a classified deployment contract with the Department of War</a>, insisting that a &quot;cloud-only&quot; architecture, an internal safety stack, and cleared engineers will somehow strictly prevent the military from using their models for <b>autonomous lethal weapons</b> or mass NSA surveillance. Entrusting a tech corporation to independently self-police lethal and intelligence applications sounds straight out of a black mirror episode.</p></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 20k</span></span> Alibaba&#39;s Qwen team has launched the Qwen 3.5 Small Model Series, a family of native multimodal models ranging from a highly compact <b>0.8B</b> to a remarkably capable <b>9B</b> designed for edge devices and lightweight agents. 
Both the base and instruct models are now available on <a class="link" href="https://huggingface.co/collections/Qwen/qwen35?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=compress-context-into-a-lora" target="_blank" rel="noopener noreferrer nofollow">Hugging Face</a> and <a class="link" href="https://modelscope.cn/collections/Qwen/Qwen35?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=compress-context-into-a-lora" target="_blank" rel="noopener noreferrer nofollow">ModelScope</a>. Additionally, the entire suite is already optimized for local deployment via <a class="link" href="https://ollama.com/library/qwen3.5?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=compress-context-into-a-lora" target="_blank" rel="noopener noreferrer nofollow">Ollama</a>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/48c9eb2c-c884-45a8-8601-e59eaa5828de/image.png?t=1772642591"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 8k</span></span> Following the Small series, Alibaba has also introduced the <a class="link" href="https://chat.qwen.ai/?models=qwen3.5-flash&utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=compress-context-into-a-lora" target="_blank" rel="noopener noreferrer nofollow">Qwen 3.5 Medium Model</a> Series, which includes the <b>27B</b>, <b>35B-A3B</b>, and <b>122B-A10B</b> models designed to bridge the gap between mid-sized and frontier AI capabilities. Highlighting a shift toward architectural efficiency over parameter size, the new <b>35B-A3B</b> model notably outperforms the previous generation&#39;s massive 235B model thanks to improved data quality and reinforcement learning. <a class="link" href="https://chat.qwen.ai/?models=qwen3.5-122b-a10b&utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=compress-context-into-a-lora" target="_blank" rel="noopener noreferrer nofollow">Try it in browser</a>.</p></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 8.2k</span></span> Google has announced <a class="link" href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-lite/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=compress-context-into-a-lora" target="_blank" rel="noopener noreferrer nofollow">Gemini 3.1 Flash-Lite</a>, its fastest and most cost-effective Gemini 3 model to date. It offers dynamic &quot;thinking levels&quot;, instantly processing high-volume queries while scaling up its reasoning for complex edge cases, and delivers a 2.5X faster time-to-first-token than its 2.5 Flash predecessor. 
<a class="link" href="https://aistudio.google.com/prompts/new_chat?model=gemini-3.1-flash-lite-preview&utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=compress-context-into-a-lora" target="_blank" rel="noopener noreferrer nofollow">Try it in browser</a>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/8df63b5f-70b3-4fbe-b844-00acfd49ddfa/gemini-3.1-flash-lite-table_1.gif?t=1772643155"/></div></li></ol><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><div class="section" style="background-color:transparent;border-color:#2C81E5;border-style:solid;border-width:5px;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">Intuitive AI Academy - NEW MoE Chapter!</h2><div class="image"><a class="image__link" href="https://www.intuitiveai.academy/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=compress-context-into-a-lora" rel="noopener" target="_blank"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/734c79dc-aaa6-46ce-ac7d-41a5f4d84381/image.png?t=1769003669"/></a></div><p class="paragraph" style="text-align:left;">My latest project: Intuitive AI Academy has the perfect starting point for you! We focus on<b> building your intuition to understand LLMs</b>, from transformer components, to post-training logic. All in one place.</p><p class="paragraph" style="text-align:left;"><b>We just added a new chapter on MoE</b>, that goes through the history, the key techniques, and the current state of MoE that frontier model uses. With over 10,000 words written.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/a0011805-b53d-4cc5-9d88-d2cc98e1c565/image.png?t=1772650337"/></div><p class="paragraph" style="text-align:left;">We currently have a early bird offer, where you would get 40% off yearly plan for our early users. </p><p class="paragraph" style="text-align:left;">Use code: <b>TIMELINE</b></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.intuitiveai.academy/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=compress-context-into-a-lora"><span class="button__text" style=""> Check Out Intuitive AI Academy </span></a></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://theaitimeline.carrd.co/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=compress-context-into-a-lora" target="_blank" rel="noopener noreferrer nofollow">Advertise with The AI Timeline! 
</a></p></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="learning-without-training"><b>Learning Without Training</b></h1><p class="paragraph" style="text-align:left;"><i>Ryan O’Dowd [Claremont Graduate University]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 720 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM Training </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Engineers usually build machine learning models by guessing a structure and running exhaustive optimization processes to train them. Researchers wanted to know if there was a more elegant way to overcome the hurdles of high-dimensional, noisy data without relying on these brute-force training methods.</p><p class="paragraph" style="text-align:left;">The current approach assumes a model will eventually learn the underlying patterns, but it lacks constructive mathematical guarantees. By rooting their approach in classical approximation theory, we can save immense computational power while tackling complex problems like tracking brain diseases or analyzing hyperspectral images.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/14240e47-718d-40f1-a2b0-780d02534b82/CleanShot_2026-03-04_at_21.02.01_2x.png?t=1772638331"/><div class="image__source"><span class="image__source_text"><p>Normalized histogram of the density of interest (left), paired with our density estimation by σ128 based on 3900 samples (right).</p></span></div></div><p class="paragraph" style="text-align:left;">This paper discovered how to mathematically construct highly accurate models directly on unknown, complex data surfaces, known as manifolds. This method bypasses the need to map out the entire geometry of the dataset first; it only requires knowing the dimension of the data. </p><p class="paragraph" style="text-align:left;">The team unlocked a breakthrough in transfer learning by figuring out how to successfully lift learned information from just a localized portion of one data space and apply it to a completely different domain. As a result, adapting a massive model to a new problem no longer requires processing the entire original dataset.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/545991e8-e24f-4603-ba38-67eec3f1b16f/CleanShot_2026-03-04_at_21.02.32_2x.png?t=1772638363"/></div><p class="paragraph" style="text-align:left;">Finally, the researchers reimagined data classification by treating it like a signal separation problem. By mathematically estimating the underlying sources of these signals, their new algorithm quickly zeroes in on the most informative data points. 
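</p><p class="paragraph" style="text-align:left;">The &quot;no training&quot; idea is easier to picture with a toy example. Below is a short NumPy sketch of building an estimator directly from samples with a fixed kernel and no optimization loop; it illustrates the general spirit only, not the paper&#39;s localized construction, and the Gaussian kernel, the bandwidth, and the circle-shaped data are placeholder assumptions.</p><pre><code>import numpy as np

def kernel_estimate(x_query, X, y, bandwidth=0.1):
    """Training-free estimate: weight the observed samples with a Gaussian kernel.
    A generic stand-in for constructive approximation; no optimization loop is run."""
    d2 = np.sum((X - x_query) ** 2, axis=1)       # squared distances to every sample
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))      # kernel weights
    return np.sum(w * y) / (np.sum(w) + 1e-12)    # weighted average of the labels

# Samples of an unknown function living on a low-dimensional surface
# (here simply a circle embedded in 2D, as an illustrative manifold).
t = np.random.rand(3900) * 2 * np.pi
X = np.stack([np.cos(t), np.sin(t)], axis=1)      # points on the unit circle
y = np.sin(3 * t)                                 # target values to recover

q = np.array([np.cos(1.0), np.sin(1.0)])          # query point on the same curve
print(kernel_estimate(q, X, y))                   # roughly sin(3.0), with no training step
</code></pre><p class="paragraph" style="text-align:left;">Because the estimate is assembled directly from the samples, changing the data or the query needs no retraining, which is the spirit of the constructive approach the paper develops with far stronger guarantees.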
</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2602.17985?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=compress-context-into-a-lora"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="docto-lo-ra-learning-to-instantly-i">Doc-to-LoRA: Learning to Instantly Internalize Contexts</h2><p class="paragraph" style="text-align:left;"><i>Charakorn et al. [Sakana AI, Minerva University]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1.4k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LoRA </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> bycloud’s pick </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Let’s assume you asked an AI to analyze a massive technical manual. Currently, every time you ask a follow-up question, the system has to re-read the entire document. This repetitive reading eats up massive amounts of computing power, memory, and time.</p><p class="paragraph" style="text-align:left;">While researchers can technically train the AI to memorize the document permanently, that traditional training process is painfully slow, expensive, and completely impractical for quick updates.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/bbb89628-6d3d-4f78-9864-1885cb320838/CleanShot_2026-03-04_at_21.12.40_2x.png?t=1772638972"/></div><p class="paragraph" style="text-align:left;">To solve this, researchers developed a brilliant workaround called Doc-to-LoRA, or D2L. Instead of forcing the main AI to constantly re-read text or undergo grueling training, they built a specialized, lightweight helper system called a hypernetwork.</p><p class="paragraph" style="text-align:left;">This helper reads the document exactly once and instantly generates a tiny, customized plug-in. Think of it like instantly downloading a new skill directly into the AI&#39;s brain. Once this plug-in is attached, the main AI can answer subsequent queries fluidly without ever needing the original text in its prompt. 
It performs this complex mental compression in just a single step, completely bypassing the steep costs of standard training.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/a5557701-506c-43e2-b515-718c6c17edd8/CleanShot_2026-03-04_at_21.13.00_2x.png?t=1772638992"/><div class="image__source"><span class="image__source_text"><p>QA performance on SQuAD compared to the used context length ratio (left), update latency (middle), and additional memory needed for model updates (right).</p></span></div></div><p class="paragraph" style="text-align:left;">The results are incredibly promising. In testing, D2L successfully hunted down specific facts hidden inside massive walls of text, achieving near-perfect accuracy on documents over four times larger than the AI’s normal limits.</p><p class="paragraph" style="text-align:left;">It works drastically faster and uses far less memory than previous memorization methods. It can even translate visual information from image-based models into these text plug-ins, allowing a text-only AI to classify images.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/d1e8c8ec-6143-43e9-8c20-2e131d5ae1d5/CleanShot_2026-03-04_at_21.13.32_2x.png?t=1772639027"/><div class="image__source"><span class="image__source_text"><p>Long document QA performance. LLMLingua-2 compresses the input with [20%, 40%, 60%, 80%, 90%] compression rates from right to left (gray dots).</p></span></div></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2602.15902?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=compress-context-into-a-lora"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="the-geometry-of-noise-why-diffusion">The Geometry of Noise: Why Diffusion Models Don&#39;t Need Noise Conditioning</h2><p class="paragraph" style="text-align:left;"><i>Sahraee-Ardakan et al. [Google]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 382 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Diffusion Models </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Can you restore a severely damaged painting without knowing how much damage was originally done? Standard diffusion models avoid this problem by relying on a strict timer that tells them exactly how much &quot;noise&quot; or corruption they are dealing with at any given step.</p><p class="paragraph" style="text-align:left;">Recently, researchers have been incredibly hopeful about &quot;autonomous&quot; models that strip away this timer, learning a single rule to handle everything from pure static to nearly perfect data. However, this creates a profound mathematical paradox. 
</p><p class="paragraph" style="text-align:left;">As these blind models approach the clean data, the underlying mathematical landscape forms an infinitely deep pit. The directional signals diverge completely, creating a severe geometric singularity. By all conventional logic, these models should become hopelessly unstable and crash.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/c0930cfa-02fa-4142-bee5-dd014a509eb1/CleanShot_2026-03-04_at_21.14.23_2x.png?t=1772639072"/><div class="image__source"><span class="image__source_text"><p>The Singular Geometry of the Marginal Energy Landscape.</p></span></div></div><p class="paragraph" style="text-align:left;">Yet, researchers have beautifully resolved this mystery. They discovered that these autonomous systems are actually charting a course across a unified map called &quot;Marginal Energy.&quot;</p><p class="paragraph" style="text-align:left;">More importantly, the scientists proved that the models naturally develop a hidden geometric shock absorber. As the AI approaches that infinitely deep mathematical pit, this built-in feature perfectly counteracts the extreme steepness. It transforms a catastrophic plunge into a smooth, stable descent known as a Riemannian gradient flow.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/28dbfec3-260b-496d-9501-8ecacb8f0570/CleanShot_2026-03-04_at_21.14.50_2x.png?t=1772639104"/><div class="image__source"><span class="image__source_text"><p>Generative performance on Fashion MNIST.</p></span></div></div><p class="paragraph" style="text-align:left;">Models attempting to directly predict the noise act like faulty amplifiers, magnifying tiny errors until the system catastrophically breaks down. Conversely, models based on predicting &quot;velocity&quot; inherently absorb that uncertainty into a smooth, stable drift. </p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2602.18428?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=compress-context-into-a-lora"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="not-all-bits-are-equal-scale-depend">dLLM: Simple Diffusion Language Modeling</h2><p class="paragraph" style="text-align:left;"><i>Zhou et al. [UC Berkeley, UIUC]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 120 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Diffusion LLM </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Language models traditionally generate text strictly left to right. But recently, researchers have found a promising alternative: diffusion language models. 
These systems can generate words in any order and iteratively refine their answers, unlocking highly flexible AI.</p><div class="embed"><a class="embed__url" href="https://github.com/ZHZisZZ/dllm?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=compress-context-into-a-lora" target="_blank"><div class="embed__content"><p class="embed__title"> GitHub - ZHZisZZ/dllm: dLLM: Simple Diffusion Language Modeling </p><p class="embed__description"> dLLM: Simple Diffusion Language Modeling. Contribute to ZHZisZZ/dllm development by creating an account on GitHub. </p></div><img class="embed__image embed__image--right" src="https://opengraph.githubassets.com/cf5bb5f1995e865bce2db24a0df4ad74a8833f65b1be06aa1997556ee01b71c2/ZHZisZZ/dllm"/></a></div><p class="paragraph" style="text-align:left;">However, the underlying code was scattered across complex, isolated research repositories, making it incredibly difficult for developers to reproduce results or build upon each other’s work. To solve this, researchers created dLLM, a unified open-source framework that elegantly standardizes the development pipeline so the community can innovate together.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/6f00097b-7faf-4531-85f6-5f94d3c16f32/CleanShot_2026-03-04_at_21.15.35_2x.png?t=1772639149"/><div class="image__source"><span class="image__source_text"><p>Inference pipeline: sampler swap from vanilla to FastdLLM MDLM sampler.</p></span></div></div><p class="paragraph" style="text-align:left;">The framework seamlessly connects the core pillars of AI development: training, generation, and testing. Using a highly modular design, dLLM allows developers to snap different components together effortlessly. A researcher can easily swap out a training method or plug in a high-speed generation algorithm without rewriting the model&#39;s core architecture.</p><p class="paragraph" style="text-align:left;">By standardizing how these models are evaluated, researchers also uncovered a hidden quirk: diffusion models are intensely sensitive to tiny adjustments in generation settings. A single tweaked parameter can drastically alter a model&#39;s performance, highlighting exactly why a transparent, shared testing environment is vital for meaningful progress.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/594a3fcc-0f64-435a-991e-e6dfc3ce600b/CleanShot_2026-03-04_at_21.16.01_2x.png?t=1772639171"/><div class="image__source"><span class="image__source_text"><p>Terminal Visualizer showing transition from masked to decoded tokens.</p></span></div></div><p class="paragraph" style="text-align:left;">Perhaps the most hopeful breakthrough is how this framework democratizes AI research. The team proved that building these dynamic models does not require massive supercomputers. Using their new recipes, they successfully transformed standard, off-the-shelf systems (including traditional discriminative architectures like BERT) into functional diffusion chatbots. 
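</p><p class="paragraph" style="text-align:left;">The recipe for turning an off-the-shelf masked language model into a diffusion-style denoiser can be sketched in a few lines. The snippet below is a generic illustration of one masked-denoising training step written against the Hugging Face transformers API; it is not dLLM&#39;s actual code, and the bert-base-uncased backbone and the uniform masking schedule are placeholder assumptions.</p><pre><code>import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Placeholder backbone: any off-the-shelf masked-LM (a BERT-style encoder here).
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def diffusion_step(texts):
    """One masked-diffusion training step: corrupt tokens at a random ratio,
    then train the model to denoise (predict) only the corrupted positions."""
    batch = tok(texts, return_tensors="pt", padding=True)
    ids = batch["input_ids"]
    t = torch.rand(ids.size(0), 1)                       # per-example "noise level"
    mask = torch.rand_like(ids, dtype=torch.float).lt(t) & (batch["attention_mask"] == 1)

    labels = ids.clone()
    labels[~mask] = -100                                 # compute loss only where we corrupted
    corrupted = ids.clone()
    corrupted[mask] = tok.mask_token_id                  # replace those tokens with [MASK]

    out = model(input_ids=corrupted, attention_mask=batch["attention_mask"], labels=labels)
    return out.loss                                      # backpropagate in a normal training loop

loss = diffusion_step(["diffusion language models can decode in any order",
                       "a bert style encoder can learn to denoise text"])
print(float(loss))
</code></pre><p class="paragraph" style="text-align:left;">Sampling the corruption ratio separately for every example is what lets a single network handle everything from lightly masked text to fully masked sequences, which it can then fill in step by step at generation time.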
</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/cb8c5c67-b464-432e-8919-46e3ca1f9f61/CleanShot_2026-03-04_at_21.16.23_2x.png?t=1772639200"/><div class="image__source"><span class="image__source_text"><p>Sensitivity to decoding hyperparameters.</p></span></div></div><p class="paragraph" style="text-align:left;">They achieved this with minimal computing power and simple fine-tuning, requiring no structural changes to the original models.</p><div class="embed"><a class="embed__url" href="https://huggingface.co/dllm-hub?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=compress-context-into-a-lora" target="_blank"><img class="embed__image embed__image--left" src="https://cdn-thumbnails.huggingface.co/social-thumbnails/dllm-hub.png"/><div class="embed__content"><p class="embed__title"> dllm-hub (dLLM) </p><p class="embed__description"> Org profile for dLLM on Hugging Face, the AI community building the future. </p></div></a></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2602.22661?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=compress-context-into-a-lora"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="true" class="youtube_embed" frameborder="0" height="100%" src="https://youtube.com/embed/qFttD0060QA" width="100%"></iframe></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=7c762817-d958-44d2-a40b-cb9a8160ccdf&utm_medium=post_rss&utm_source=the_ai_timeline">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Google Presents A Brand New Way To Train Latents</title>
  <description>plus more about Experiential RL, GLM-5 Report, and Attention Matching</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/12aeafde-3816-4305-8cf8-c2cfb5de4a4c/issue_96.jpg" length="167350" type="image/jpeg"/>
  <link>https://mail.bycloud.ai/p/google-presents-a-brand-new-way-to-train-latents</link>
  <guid isPermaLink="true">https://mail.bycloud.ai/p/google-presents-a-brand-new-way-to-train-latents</guid>
  <pubDate>Tue, 24 Feb 2026 19:15:22 +0000</pubDate>
  <atom:published>2026-02-24T19:15:22Z</atom:published>
    <dc:creator>by cloud</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h6 class="heading" style="text-align:left;" id="nov-18-th-nov-24-th-33-latest-ai-re"><i>Feb 19th ~ Feb 24th</i><br><i>#96 Latest AI Research Explained Simply</i></h6><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="industry-news-in-1-line">🗞️ Industry News in 1 Line</h2><ol start="1"><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 8.2k</span></span> <a class="link" href="https://x.com/Replit/status/2024578806208745637?s=20&utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=google-presents-a-brand-new-way-to-train-latents" target="_blank" rel="noopener noreferrer nofollow">Replit has introduced Replit Animation</a>, a new tool that lets users create animated videos in minutes using conversational prompts. It is powered by Gemini 3.1 Pro and makes it easier to produce polished, shareable content without traditional editing software.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/63b38e00-c1a8-45c6-8d31-40180c9b4379/CleanShot_2026-02-24_at_21.31.48_2x.png?t=1771948927"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 22k</span></span> Anthropic has released Claude Sonnet 4.6, which comes with stronger performance in complex spreadsheet tasks, multi-step web forms, and expanded integrations through Excel MCP connectors. The Claude API now also supports <a class="link" href="https://claude.com/blog/improved-web-search-with-dynamic-filtering?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=google-presents-a-brand-new-way-to-train-latents" target="_blank" rel="noopener noreferrer nofollow">more accurate web search</a>, dynamic filtering, and general availability of code execution, memory, and programmatic tool use.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/88ebd7bb-7fd3-4c53-8ec0-5f79625645e8/1206645ef5a618dabce8587b472b21c67a30a0db-3840x1948.webp?t=1771949348"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 49k</span></span> <a class="link" href="https://www.anthropic.com/news/claude-code-security?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=google-presents-a-brand-new-way-to-train-latents" target="_blank" rel="noopener noreferrer nofollow">Anthropic has announced Claude Code Security</a> to scan codebases for vulnerabilities and suggest targeted patches for human review. This release triggered a sharp selloff in cybersecurity stocks, with companies like JFrog, CrowdStrike, Okta, and Cloudflare all recording declines. The system uses reasoning to trace data flows and flag subtle errors, while enforcing a human-in-the-loop safeguard to ensure developer oversight. 
If you want to try it, then <a class="link" href="https://claude.com/solutions/claude-code-security?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=google-presents-a-brand-new-way-to-train-latents" target="_blank" rel="noopener noreferrer nofollow">join the waitlist here</a>.</p></li></ol><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><div class="section" style="background-color:transparent;border-color:#2C81E5;border-style:solid;border-width:5px;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">Learn LLMs Intuitively - Intuitive AI Academy</h2><div class="image"><a class="image__link" href="https://www.intuitiveai.academy/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=google-presents-a-brand-new-way-to-train-latents" rel="noopener" target="_blank"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/734c79dc-aaa6-46ce-ac7d-41a5f4d84381/image.png?t=1769003669"/></a></div><p class="paragraph" style="text-align:left;">Want to learn about LLMs, but never had a good place to start?</p><p class="paragraph" style="text-align:left;">My latest project: Intuitive AI Academy has the perfect starting point for you! We focus on<b> building your intuition to understand LLMs</b>, from transformer components to post-training logic. All in one place.</p><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/377a8b3f-2557-4a32-8115-242eb6a2146e/image.png?t=1769004014"/><div class="image__source"><span class="image__source_text"><p>content overview</p></span></div></div><p class="paragraph" style="text-align:left;">We currently have an early bird offer: early users get 40% off the yearly plan. </p><p class="paragraph" style="text-align:left;">Use code: <b>TIMELINE</b></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.intuitiveai.academy/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=google-presents-a-brand-new-way-to-train-latents"><span class="button__text" style=""> Check Out Intuitive AI Academy </span></a></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://theaitimeline.carrd.co/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=google-presents-a-brand-new-way-to-train-latents" target="_blank" rel="noopener noreferrer nofollow">Advertise with The AI Timeline! 
</a></p></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="unified-latents-ul-how-to-train-you">Unified Latents (UL): How to train your latents</h2><p class="paragraph" style="text-align:left;"><i>Heek [Google DeepMind Amsterdam]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 2.1k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> ALDiffusion </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">To create high-quality images and videos efficiently, models usually compress data into a &quot;latent&quot; space (a digital shorthand that represents the original image in a smaller, more manageable package). If you compress the data too much, the AI loses fine details like textures and sharp edges; if you don&#39;t compress it enough, the AI becomes incredibly slow and expensive to train.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/d95f0533-b5ca-4584-9088-db02e704fa9c/CleanShot_2026-02-23_at_17.10.28_2x.png?t=1771846840"/><div class="image__source"><span class="image__source_text"><p>Schematic overview of our model, include the Encoder, the prior latent diffusion model, and the diffusion decoder model.</p></span></div></div><p class="paragraph" style="text-align:left;">Traditional methods often rely on manual tuning to strike this balance, which is often more of an art than a science. Researchers have recently introduced a framework called Unified Latents (UL) to turn this guesswork into a systematic, more efficient process. By co-training the compression and generation steps together, they have found a way to maintain stunningly high-quality details while actually lowering the computational cost required for training.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/87043c06-a8f6-4960-97b8-5cd4ff3956ef/CleanShot_2026-02-23_at_17.11.12_2x.png?t=1771846881"/></div><p class="paragraph" style="text-align:left;">Instead of treating the compressed representation as a static container, Unified Latents uses a &quot;diffusion prior&quot; to monitor and regularize the information flow. 
By linking the noise level produced during the encoding process directly to the precision of the diffusion model, researchers could create a mathematically tight way to control the latent bitrate.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/cb8077f4-0de0-4ad7-9919-3391d102c993/CleanShot_2026-02-23_at_17.11.31_2x.png?t=1771846903"/><div class="image__source"><span class="image__source_text"><p>A selection of samples from a text-to-image trained with Unified Latents</p></span></div></div><p class="paragraph" style="text-align:left;">They also paired this with a diffusion-based decoder, which is remarkably better at reconstructing high-frequency details than previous methods. This unified approach allows the system to navigate the trade-off between compression and quality with much greater precision.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2602.17270?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=google-presents-a-brand-new-way-to-train-latents"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="fast-kv-compaction-via-attention-ma">Fast KV Compaction via Attention Matching</h2><p class="paragraph" style="text-align:left;"><i>Zweiger et al. [Massachusetts Institute of Technology]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 196 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Attention </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> bycloud’s pick </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">AI models can take on more complex tasks like long-form coding and multi-day conversations, but they face a significant memory hurdle known as the key-value (KV) cache bottleneck. Every word the model processes adds to a digital &quot;short-term memory&quot; that can quickly balloon into several gigabytes of data. </p><p class="paragraph" style="text-align:left;">Until now, researchers have managed this by either summarizing the text (which often strips away vital nuances) or by using expensive optimization techniques that take hours of computing time to compress a single document.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/7c67a3bb-49f1-4d53-8ac7-bbec52deb6a5/CleanShot_2026-02-23_at_17.16.12_2x.png?t=1771847181"/></div><p class="paragraph" style="text-align:left;">This paper has introduced a technique called Attention Matching that changes how we think about memory compaction. 
Instead of relying on slow, iterative training to condense information, this approach treats memory like a mathematical puzzle that can be solved directly in &quot;latent space&quot;.</p><p class="paragraph" style="text-align:left;">By focusing on how a model &quot;pays attention&quot; to specific pieces of information, the researchers found they could create a compact version of the memory that mimics the original&#39;s behavior. They discovered that this problem can be broken down into smaller sub-problems with efficient, closed-form solutions, allowing them to bypass the slow trial-and-error process of traditional machine learning.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/ab8f5664-b736-4ff6-881f-5072e0016409/CleanShot_2026-02-23_at_17.16.30_2x.png?t=1771847205"/><div class="image__source"><span class="image__source_text"><p>Accuracy vs. compaction ratio across methods.</p></span></div></div><p class="paragraph" style="text-align:left;">This new framework can shrink a model&#39;s memory by up to 50 times in just a matter of seconds, rather than hours, with almost no impact on the quality of the model&#39;s output. </p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2602.16284?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=google-presents-a-brand-new-way-to-train-latents"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="experiential-reinforcement-learning">Experiential Reinforcement Learning</h2><p class="paragraph" style="text-align:left;"><i>Shi et al. [KRAFTON, University of Wisconsin–Madison, UC Berkeley, Microsoft Research]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM RL </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Teaching artificial intelligence to navigate complex tasks is often a game of high-stakes guessing. In standard reinforcement learning, a model typically receives a single &quot;reward&quot; signal. This makes it incredibly difficult for the AI to pinpoint exactly where it tripped up or how to adjust its behavior for the next try.</p><p class="paragraph" style="text-align:left;">Researchers recognized that this &quot;blind&quot; trial-and-error is far less efficient than how humans naturally learn. When we fail at a task, we don&#39;t just try again at random; we stop, reflect on what went wrong, and form a mental plan to do better. 
To solve this, researchers introduced Experiential Reinforcement Learning (ERL), a new approach designed to turn silent failures into structured, durable lessons.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/aeff92cc-0799-4f0f-9a31-c3d7ff12a915/CleanShot_2026-02-23_at_17.26.15_2x.png?t=1771847784"/><div class="image__source"><span class="image__source_text"><p>In Experiential Reinforcement Learning (ERL), instead of learning from feedback or outcome directly</p></span></div></div><p class="paragraph" style="text-align:left;">The researchers developed a clever &quot;experience-reflection-consolidation&quot; loop that embeds human-like reasoning directly into the training process. Instead of moving on immediately after an attempt, the model receives feedback from its environment and is prompted to generate a verbal reflection on its own performance.</p><p class="paragraph" style="text-align:left;">It looks at its errors and produces a self-critique that guides a refined second attempt at the same task. If this second try succeeds, the model &quot;internalizes&quot; the successful correction. This is the breakthrough moment: by training the model to reproduce the improved behavior from the original task alone, the AI eventually learns to skip the reflection step entirely.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/5aa552dc-9e18-4470-a671-d245bb33690c/CleanShot_2026-02-23_at_17.26.47_2x.png?t=1771847818"/><div class="image__source"><span class="image__source_text"><p>Conceptual comparison of learning dynamics in RLVR and Experiential Reinforcement Learning (ERL)</p></span></div></div><p class="paragraph" style="text-align:left;">This ensures that the final model is both smarter and faster, maintaining high performance at deployment without any extra computational cost.</p><p class="paragraph" style="text-align:left;">In complex multi-step tasks like Sokoban, which require deep planning and spatial reasoning, ERL improved performance by a staggering 81% over standard methods. It also showed reliable gains in agentic reasoning tasks that involve using external tools to answer questions. 
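</p><p class="paragraph" style="text-align:left;">Here is a compact, illustrative sketch of the experience-reflection-consolidation loop described above; it is not the authors&#39; implementation, and the generate and evaluate callables are toy stand-ins for the policy model and the environment&#39;s feedback signal.</p><pre><code>from dataclasses import dataclass, field

@dataclass
class ConsolidationBuffer:
    """Persistent store of (task, corrected answer) pairs used for later training."""
    examples: list = field(default_factory=list)

def erl_episode(task, generate, evaluate, buffer):
    """One experience-reflection-consolidation episode (illustrative sketch only)."""
    first = generate(task)                                # 1. experience: initial attempt
    ok, feedback = evaluate(task, first)
    if ok:
        return first

    reflection = generate(                                # 2. reflection: verbal self-critique
        "Task: " + task + "\nAttempt: " + first + "\nFeedback: " + feedback +
        "\nExplain what went wrong and how to fix it.")
    second = generate("Task: " + task + "\nLesson: " + reflection)  # refined retry

    ok, _ = evaluate(task, second)
    if ok:
        # 3. consolidation: store the corrected behaviour keyed by the *original* task,
        # so later training reproduces it without the intermediate reflection.
        buffer.examples.append((task, second))
    return second

# Tiny runnable demo with toy stand-ins for the model and the environment.
def toy_generate(prompt):
    if "Lesson:" in prompt:
        return "corrected answer"
    if "Explain" in prompt:
        return "analysis of the failure"
    return "first guess"

def toy_evaluate(task, answer):
    return ("corrected" in answer), "the first attempt ignored a constraint"

buf = ConsolidationBuffer()
print(erl_episode("push the box onto the target square", toy_generate, toy_evaluate, buf))
print(buf.examples)   # pairs that the consolidation phase would train on
</code></pre><p class="paragraph" style="text-align:left;">Because the buffer is keyed by the original task rather than by the reflection, the consolidation step teaches the model to produce the corrected behaviour directly, which is why no extra reflection pass is needed at deployment time.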
</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/4da17c0b-eb2c-4ba5-895a-f81c534edd59/CleanShot_2026-02-23_at_17.27.15_2x.png?t=1771847847"/><div class="image__source"><span class="image__source_text"><p>Overview of Experiential Reinforcement Learning (ERL).</p></span></div></div><p class="paragraph" style="text-align:left;">By allowing the AI to accumulate &quot;corrective knowledge&quot; in a persistent memory, researchers have created a way for models to build on their past successes rather than repeating the same mistakes.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e57f4395-edd3-42e9-83bb-02865e4ff28c/CleanShot_2026-02-23_at_17.27.40_2x.png?t=1771847869"/></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2602.13949?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=google-presents-a-brand-new-way-to-train-latents"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="not-all-bits-are-equal-scale-depend">GLM-5: from Vibe Coding to Agentic Engineering</h2><p class="paragraph" style="text-align:left;"><i>Zhipu AI & Tsinghua University</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 290 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM Agents </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Researchers are working to shift our relationship with AI from &quot;vibe coding&quot; (where humans provide constant prompts) toward a more autonomous era of &quot;agentic engineering.&quot; Traditional models often struggle with the sheer computational cost of keeping track of long conversations, or they lose their way during tasks that require hours of planning and execution.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/34475712-6dcc-4869-bb82-35137a00e8d3/bench.png?t=1771848560"/></div><p class="paragraph" style="text-align:left;">GLM-5 was designed to overcome these bottlenecks by creating a more independent assistant that can plan, implement, and iterate on technical challenges with minimal human intervention.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/49ee0681-4b06-4479-9225-9cab867c60a6/realworld_bench.png?t=1771848577"/></div><p class="paragraph" style="text-align:left;">To make this possible, the researchers introduced a dynamic mechanism called DeepSeek Sparse Attention. 
Rather than the model straining to analyze every single word in a massive document with equal intensity (which is incredibly expensive and slow), it now identifies which tokens are truly important for the task at hand.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/ab27461c-7c68-47be-827e-776bd9b3448d/CleanShot_2026-02-23_at_17.40.29_2x.png?t=1771848644"/></div><p class="paragraph" style="text-align:left;">This allows the model to manage massive amounts of data, such as entire codebases or long-term business simulations, while significantly reducing the hardware power required. The model also features an &quot;interleaved thinking&quot; process where it pauses to reason before every action it takes.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/deec37e5-87f7-48b7-91fb-649be1ed9748/vending_bench.png?t=1771848588"/></div><p class="paragraph" style="text-align:left;">It can &quot;preserve&quot; these thoughts across a long conversation, ensuring it doesn&#39;t lose its train of thought when moving between different stages of a project. By using a new asynchronous training infrastructure that allows the model to learn from complex tasks, it was able to reach an <b>unprecedented ability</b> to solve end-to-end software engineering challenges.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2602.15763?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=google-presents-a-brand-new-way-to-train-latents"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="true" class="youtube_embed" frameborder="0" height="100%" src="https://youtube.com/embed/httnhdpu_W4" width="100%"></iframe></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=81de0e5a-b3cd-49dd-9898-aaa969440917&utm_medium=post_rss&utm_source=the_ai_timeline">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Using Diffusion To Interpret LLMs?! Generative Latent Prior</title>
  <description>plus more on Evolving Agents via Recursive Skill-Augmented RL and Low Hanging Fruits in Vision Transformers</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/01d3e3be-9457-45bc-ac39-9262d5b12592/issue_95.jpg" length="202752" type="image/jpeg"/>
  <link>https://mail.bycloud.ai/p/using-diffusion-to-interpret-llms-generative-latent-prior</link>
  <guid isPermaLink="true">https://mail.bycloud.ai/p/using-diffusion-to-interpret-llms-generative-latent-prior</guid>
  <pubDate>Tue, 17 Feb 2026 19:44:12 +0000</pubDate>
  <atom:published>2026-02-17T19:44:12Z</atom:published>
    <dc:creator>by cloud</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h6 class="heading" style="text-align:left;" id="nov-18-th-nov-24-th-33-latest-ai-re"><i>Feb 9th ~ Feb 16th</i><br><i>#95 Latest AI Research Explained Simply</i></h6><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="industry-news-in-1-line">🗞️ Industry News in 1 Line</h2><ol start="1"><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 5.7k</span></span> Google DeepMind has upgraded its <a class="link" href="https://blog.google/products-and-platforms/products/gemini/gemini-3/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=using-diffusion-to-interpret-llms-generative-latent-prior#gemini-3-deep-think" target="_blank" rel="noopener noreferrer nofollow">Gemini 3 Deep Think reasoning mode</a> to tackle complex scientific and engineering challenges. This new mode is setting new performance records on frontier benchmarks like ARC-AGI-2. You can access it via Google AI Ultra subscription and is available to researchers via a Vertex AI Early Access Program. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/2a03e0fd-803a-492a-b5d8-59738fa19216/final_dt_blog_evals_2.gif?t=1771344414"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 3.1k</span></span> <a class="link" href="https://blog.e01.ai/glm5-gameboy-and-long-task-era-64db7074a026?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=using-diffusion-to-interpret-llms-generative-latent-prior" target="_blank" rel="noopener noreferrer nofollow">Zhipu AI’s GLM-5</a> has demonstrated the ability to <a class="link" href="https://e01.ai/gba/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=using-diffusion-to-interpret-llms-generative-latent-prior" target="_blank" rel="noopener noreferrer nofollow">manage over 700 tool calls and 800 context handoffs</a> during continuous <b>24-hour operations</b>. This advancement allows AI agents to handle highly complex, multi-step workflows, such as autonomously reverse-engineering software code from visual video demonstrations. Currently #1 open source model, and beating Gemini-3-Pro-high.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e5faa503-5325-4032-9cc1-f87061f17d08/image.png?t=1771346505"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 8.9k</span></span> <a class="link" href="https://www.minimax.io/models/text?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=using-diffusion-to-interpret-llms-generative-latent-prior" target="_blank" rel="noopener noreferrer nofollow">MiniMax has launched M2.5</a>, an open-source model that delivers state-of-the-art performance in coding, search, and agentic tool-calling, including a top-tier 80.2% score on SWE-Bench. 
This model is optimized for productivity and it is 37% faster and can be used as a <a class="link" href="https://platform.minimax.io/subscribe/coding-plan?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=using-diffusion-to-interpret-llms-generative-latent-prior" target="_blank" rel="noopener noreferrer nofollow">drop-in replacement for Claude Code via its coding plan</a>. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/25800d40-9e88-4288-ab38-d95dd43cd3ee/image.png?t=1771344934"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 4.2k</span></span> <a class="link" href="https://qwen.ai/blog?id=qwen3.5&utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=using-diffusion-to-interpret-llms-generative-latent-prior" target="_blank" rel="noopener noreferrer nofollow">Alibaba has published Qwen3.5-397B-A17B</a>, its first open-weight, native multimodal model featuring a sparse Mixture of Experts (MoE) architecture that is optimized for real-world agents. It is <a class="link" href="https://huggingface.co/Qwen/Qwen3.5-397B-A17B?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=using-diffusion-to-interpret-llms-generative-latent-prior" target="_blank" rel="noopener noreferrer nofollow">licensed under Apache 2.0</a> and it supports over 200 languages and delivers up to 19 times the decoding throughput of its predecessor through advanced reinforcement learning environment scaling. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/4a5dfd73-5da8-4e7f-be20-6cd3349bd61c/qwen3.5_397b_a17b_inference.png?t=1771345095"/></div></li></ol><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><div class="section" style="background-color:transparent;border-color:#2C81E5;border-style:solid;border-width:5px;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">Learn LLMs Intuitively - Intuitive AI Academy</h2><div class="image"><a class="image__link" href="https://www.intuitiveai.academy/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=using-diffusion-to-interpret-llms-generative-latent-prior" rel="noopener" target="_blank"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/734c79dc-aaa6-46ce-ac7d-41a5f4d84381/image.png?t=1769003669"/></a></div><p class="paragraph" style="text-align:left;">Want to learn about LLMs, but never have a good place to start?</p><p class="paragraph" style="text-align:left;">My latest project: Intuitive AI Academy has the perfect starting point for you! We focus on<b> building your intuition to understand LLMs</b>, from transformer components, to post-training logic. 
All in one place.</p><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/377a8b3f-2557-4a32-8115-242eb6a2146e/image.png?t=1769004014"/><div class="image__source"><span class="image__source_text"><p>content overview (a total of 100k words explainer so far!)</p></span></div></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.intuitiveai.academy/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=using-diffusion-to-interpret-llms-generative-latent-prior"><span class="button__text" style=""> Check Out Intuitive AI Academy </span></a></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://theaitimeline.carrd.co/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=using-diffusion-to-interpret-llms-generative-latent-prior" target="_blank" rel="noopener noreferrer nofollow">Advertise with The AI Timeline! </a></p></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="vi-t-5-vision-transformers-for-the-">ViT-5: Vision Transformers for The Mid-2020s</h2><p class="paragraph" style="text-align:left;"><i>Wang et al. [</i>Johns Hopkins University, UC Santa Cruz<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> ViT </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">LLMs are sprinting ahead with rapid architectural refinements, but Vision Transformers (ViTs) have remained largely stagnant since their debut in 2020. 
Vision models struggle with stability issues and a limited ability to handle complex spatial reasoning.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/d7b8e08e-5ac4-4444-a937-e08cd4bcce38/fig1.png?t=1771343596"/><div class="image__source"><span class="image__source_text"><p>ViT architecture</p></span></div></div><p class="paragraph" style="text-align:left;">The research team developed ViT-5 by systematically testing five years of AI advancements to see which ones actually improve a model&#39;s &quot;eyesight.&quot; They discovered that simply copying language model tricks doesn&#39;t always work; for instance, a popular method for filtering information in text models actually caused &quot;over-gating&quot; in vision, making the internal representations too sparse to be useful.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/1ac4c4b9-62f8-4c7a-8127-58c731ef82fd/fig4.png?t=1771343611"/></div><p class="paragraph" style="text-align:left;">Instead, they found success by combining a more efficient normalization method with a clever dual-positioning system. This allows the model to understand where every pixel is relative to its neighbors while still maintaining a &quot;big picture&quot; sense of the entire image.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/6e605599-daea-4ce7-b73f-18c9a8015bef/fig6.png?t=1771343623"/></div><p class="paragraph" style="text-align:left;">To further refine performance, the researchers introduced &quot;register tokens,&quot; which act like digital scratchpads to clean up visual artifacts and help the model focus on what is semantically important. They also implemented a technique called QK-normalization, which smoothed out the training process and eliminated the frustrating &quot;error spikes&quot; that often crash large-scale AI projects.</p><p class="paragraph" style="text-align:left;">The final model can handle images of varying sizes with ease and consistently outperforms previous standards in identifying objects and generating new images.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2602.08071?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=using-diffusion-to-interpret-llms-generative-latent-prior"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="skill-rl-evolving-agents-via-recurs">SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning</h2><p class="paragraph" style="text-align:left;"><i>Xia et al. 
[UNC-Chapel Hill, University of Chicago, University of California San Diego, NEC Labs America, University of California Berkeley, University of California Santa Cruz]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 455 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Skill </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> bycloud’s pick </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Even the most advanced models often treat every new task as a blank slate. Researchers have long tried to give these agents a memory, but simply feeding them long, messy logs of past actions often results in &quot;noisy&quot; confusion that slows the system down.</p><p class="paragraph" style="text-align:left;">The team behind SKILLRL realized that for AI to truly evolve, it shouldn&#39;t just record what happened; it needs to distill those experiences into compact, actionable skills. This team developed a framework that transforms raw, verbose interaction data into a structured &quot;SkillBank.&quot;</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/8aaacab4-081d-4055-89f9-df71198111a1/pipeline.png?t=1771343801"/></div><p class="paragraph" style="text-align:left;">Instead of saving every redundant step of a task, the system uses a teacher model to extract the core logic behind a success and the critical lessons from a failure. These insights are organized into a hierarchy: general principles for broad strategy and specialized tactics for specific tasks.</p><p class="paragraph" style="text-align:left;">To make this work, the researchers introduced a recursive evolution process. As the agent practices using reinforcement learning, it doesn&#39;t just improve its own performance; it simultaneously updates its library.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/33d278fa-6faf-4467-ab03-7d3b44a16fb9/CleanShot_2026-02-17_at_21.27.14_2x.png?t=1771343845"/></div><p class="paragraph" style="text-align:left;">When the agent hits a new type of roadblock, the system analyzes the failure, writes a new &quot;skill&quot; to handle it, and adds it to the collection. 
This co-evolution creates a virtuous cycle where the agent becomes more efficient and avoids &quot;context bloat,&quot; using ten to twenty times less data than raw logs.</p><p class="paragraph" style="text-align:left;">The results are striking, showing that smaller, open-source models can actually outperform massive, closed-source giants like GPT-4o by using this structured expertise.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/920066ed-5cde-49bd-8104-3bd11f038e34/CleanShot_2026-02-17_at_21.27.29_2x.png?t=1771343863"/><div class="image__source"><span class="image__source_text"><p>Performance on ALFWorld and WebShop.</p></span></div></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2602.08234?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=using-diffusion-to-interpret-llms-generative-latent-prior"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="learning-a-generative-meta-model-of">Learning a Generative Meta-Model of LLM Activations</h2><p class="paragraph" style="text-align:left;"><i>Luo et al. [UC Berkeley, Transluce]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 353 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM Activations </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Current tools for analyzing LLMs rely on rigid mathematical guesses that don&#39;t quite capture the messy, organic complexity of an AI’s inner workings. When we try to &quot;steer&quot; an AI toward a specific persona or sentiment, the intervention often corrupts the model’s internal logic, causing it to descend into repetitive or nonsensical babble.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/af8ab16b-9a0e-4674-8ac3-7f1cd6ab80a6/teaser.jpg?t=1771343902"/></div><p class="paragraph" style="text-align:left;">To address this, the team developed a &quot;meta-model&quot; called a Generative Latent Prior, or GLP. Rather than forcing the data into simple boxes, they trained a massive diffusion model on over a billion internal states of a functioning language model. This meta-model effectively learns the &quot;natural habitat&quot; of the AI&#39;s internal activations.</p><p class="paragraph" style="text-align:left;">When a human intervention pushes the AI into an unnatural or &quot;off-manifold&quot; state, the GLP acts as a corrective guide. 
It &quot;denoises&quot; the internal signal, pulling it back onto a path that makes sense to the language model while preserving the intended change in behavior.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/9681400a-c4fb-4ede-92a1-2048dc904f06/CleanShot_2026-02-17_at_21.32.24_2x.png?t=1771344162"/><div class="image__source"><span class="image__source_text"><p>GLP generates activation samples near-indistinguishable from real activations</p></span></div></div><p class="paragraph" style="text-align:left;">This approach follows reliable scaling laws, meaning that as they invested more computing power, the meta-model became significantly better at predicting and refining the AI’s internal states. They also found that the meta-model’s own internal units, which they call &quot;meta-neurons,&quot; are exceptionally good at isolating specific, human-understandable concepts like geography or mathematics.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/1cc0c7a0-bbd1-4c97-8695-11698c06a25e/CleanShot_2026-02-17_at_21.32.55_2x.png?t=1771344185"/></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2602.06964?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=using-diffusion-to-interpret-llms-generative-latent-prior"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="true" class="youtube_embed" frameborder="0" height="100%" src="https://youtube.com/embed/qFpps5Ur-qs" width="100%"></iframe></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=d9244129-8883-4b9e-9590-e9af0e2607d8&utm_medium=post_rss&utm_source=the_ai_timeline">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>New Generative Paradigm: Drifting Model</title>
  <description>an insanely big week in AI research</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b4bc9e1d-b153-479c-8604-442c6d66e4f6/issue_94.jpg" length="210268" type="image/jpeg"/>
  <link>https://mail.bycloud.ai/p/new-generative-paradigm-drifting-model</link>
  <guid isPermaLink="true">https://mail.bycloud.ai/p/new-generative-paradigm-drifting-model</guid>
  <pubDate>Tue, 10 Feb 2026 19:01:27 +0000</pubDate>
  <atom:published>2026-02-10T19:01:27Z</atom:published>
    <dc:creator>by cloud</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h6 class="heading" style="text-align:left;" id="nov-18-th-nov-24-th-33-latest-ai-re"><i>Feb 3rd ~ Feb 9th</i><br><i>#94 Latest AI Research Explained Simply</i></h6><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="industry-news-in-1-line">🗞️ Industry News in 1 Line</h2><ol start="1"><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 72k</span></span> <a class="link" href="https://seed.bytedance.com/en/seedance?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=new-generative-paradigm-drifting-model" target="_blank" rel="noopener noreferrer nofollow">ByteDance has released Seedance 2.0 in China</a>, a powerful new AI tool that empowers creators to generate stunningly realistic, high-resolution videos from simple prompts. It is so advanced and its output so lifelike that it is already being hailed as a game-changing force which can revolutionize the filmmaking and visual effects industries.</p></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 10k</span></span> OpenAI has launched GPT-5.3-Codex, which has the ability to handle long-running, complex tasks that involve research and tool use, effectively acting as an interactive collaborator. It is 25% faster and has set new industry standards in performance (real-world software engineering to building complex games and apps from scratch). </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b3f1cf66-ad1f-466d-99fa-1b35fef6668c/CleanShot_2026-02-10_at_20.23.51_2x.png?t=1770735240"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 39k</span></span> <a class="link" href="https://www.anthropic.com/news/claude-opus-4-6?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=new-generative-paradigm-drifting-model" target="_blank" rel="noopener noreferrer nofollow">Anthropic has launched Claude Opus 4.6</a>, its latest model for agentic coding, reasoning, and complex knowledge work. It now supports brand-consistent slide creation in PowerPoint, sophisticated multi-step planning in Excel, and autonomous &quot;agent teams&quot; for parallelized coding. 
Claude Opus 4.6 provides creators with an incredibly smart and efficient partner for handling long-running, high-stakes projects.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/0fa53b1b-26a2-4d0b-931f-28bf18073630/653e04afc43612d3a0f8427da86b6549800005f9-3840x2160.jpg?t=1770735505"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 961</span></span> <a class="link" href="https://chat.intern-ai.org.cn/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=new-generative-paradigm-drifting-model" target="_blank" rel="noopener noreferrer nofollow">Intern-S1-Pro</a> has launched as a <b>1T open-source model</b> that delivers state-of-the-art scientific reasoning, rivaling the world&#39;s leading closed-source models in complex AI4Science tasks. It uses Fourier Position Encoding for superior time-series modeling and integrates with vLLM, providing researchers with an incredibly efficient and accessible way to master complex physical signals and advanced multimodal data. Get <a class="link" href="https://huggingface.co/internlm/Intern-S1-Pro?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=new-generative-paradigm-drifting-model" target="_blank" rel="noopener noreferrer nofollow">weights on Hugging Face</a> or <a class="link" href="https://internlm.intern-ai.org.cn/api/document?lang=en&utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=new-generative-paradigm-drifting-model" target="_blank" rel="noopener noreferrer nofollow">try via API</a>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/230f3d02-0c1b-40da-887d-d1bda12825df/performance.jpeg?t=1770735785"/></div></li></ol><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><div class="section" style="background-color:transparent;border-color:#2C81E5;border-style:solid;border-width:5px;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">Learn LLMs Intuitively - Intuitive AI Academy</h2><div class="image"><a class="image__link" href="https://www.intuitiveai.academy/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=new-generative-paradigm-drifting-model" rel="noopener" target="_blank"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/734c79dc-aaa6-46ce-ac7d-41a5f4d84381/image.png?t=1769003669"/></a></div><p class="paragraph" style="text-align:left;">Want to learn about LLMs, but never have a good place to start?</p><p class="paragraph" style="text-align:left;">My latest project: Intuitive AI Academy has the perfect starting point for you! We focus on<b> building your intuition to understand LLMs</b>, from transformer components, to post-training logic. 
All in one place.</p><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/377a8b3f-2557-4a32-8115-242eb6a2146e/image.png?t=1769004014"/><div class="image__source"><span class="image__source_text"><p>content overview (a total of 100k words explainer so far!)</p></span></div></div><p class="paragraph" style="text-align:left;">We currently have a New Year New Me launch offer, where early users get 50% off the yearly plan FOREVER. Only a few left & the deal will not be back! </p><p class="paragraph" style="text-align:left;">Use code: <b>2026</b></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.intuitiveai.academy/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=new-generative-paradigm-drifting-model"><span class="button__text" style=""> Check Out Intuitive AI Academy </span></a></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://theaitimeline.carrd.co/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=new-generative-paradigm-drifting-model" target="_blank" rel="noopener noreferrer nofollow">Advertise with The AI Timeline! </a></p></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="generative-modeling-via-drifting"><b>Generative Modeling via Drifting</b></h1><p class="paragraph" style="text-align:left;">Deng et al.<i> [MIT, Harvard University]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1.1k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLMs </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Generative modeling is about teaching a computer to map simple random noise into complex, meaningful data, like a realistic image. Until now, leading approaches have worked by taking a noisy sample and refining it step-by-step until it becomes clear. This multi-step requirement makes the actual generation process computationally heavy and time-consuming. The researchers wanted to know whether it is possible to condense this complex evolution into a single, instant step without sacrificing quality.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/279ca11a-28ed-4f7e-895d-64880799dd9b/CleanShot_2026-02-08_at_23.42.24_2x.png?t=1770574357"/><div class="image__source"><span class="image__source_text"><p>Illustration of drifting a sample. </p></span></div></div><p class="paragraph" style="text-align:left;">The researchers introduced a new approach called &quot;Drifting Models,&quot; which reimagines how a neural network learns to create data. 
Instead of forcing the model to refine an image iteratively every time it runs, this approach allows the network to act as a single-pass generator.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/3258d551-0398-40b9-a226-60af63584cac/CleanShot_2026-02-08_at_23.42.49_2x.png?t=1770574379"/></div><p class="paragraph" style="text-align:left;">The biggest change is the introduction of a &quot;drifting field&quot; that operates during the training process. This field governs how the distribution of generated samples moves, or &quot;drifts,&quot; toward the target data distribution as the network optimizes. The system is designed to seek equilibrium: when the generated samples match the real data, the drift becomes zero, and the model stabilizes.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/d4b36ddb-345b-44fc-85e9-ed828ea57846/CleanShot_2026-02-08_at_23.43.07_2x.png?t=1770574401"/><div class="image__source"><span class="image__source_text"><p>Evolution of samples.</p></span></div></div><p class="paragraph" style="text-align:left;">This method effectively moves the iterative evolution out of the inference stage and into the learning stage. Once the model is trained, it can generate high-fidelity content in just one step. The results are highly encouraging, with the method achieving state-of-the-art performance on ImageNet benchmarks.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/c987be00-a698-4899-b515-f8a6ae149a0f/CleanShot_2026-02-08_at_23.43.42_2x.png?t=1770574431"/><div class="image__source"><span class="image__source_text"><p>System-level comparison: ImageNet 256×256 generation in pixel space</p></span></div></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2602.04770?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=new-generative-paradigm-drifting-model"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="learning-to-reason-in-13-parameters"><b>Learning to Reason in 13 Parameters</b></h1><p class="paragraph" style="text-align:left;"><i>Morris et al. 
[FAIR at Meta, Cornell University, Carnegie Mellon University]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 2K </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Reasoning </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> bycloud’s pick </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Researchers are teaching models to &quot;reason&quot; and thoughtfully work through math and logic problems rather than just predicting the next word. Fine-tuning a massive model to improve these cognitive skills has been a heavy computational lift, which requires engineers to adjust millions, if not billions, of internal connections.</p><p class="paragraph" style="text-align:left;">It is a common belief that learning a complex new behavior requires a substantial overhaul of the model’s weights, which makes personalization expensive and difficult to scale. However, this paper suggests that when it comes to reasoning, a model might not need to learn nearly as much new information as previously thought; perhaps it just needs a very precise, microscopic nudge in the right direction.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/865105b2-cec7-4693-a0e2-4cb738e58345/CleanShot_2026-02-08_at_23.44.43_2x.png?t=1770574492"/><div class="image__source"><span class="image__source_text"><p>Using Qwen2.5-7B-Instruct as a base model, our TinyLoRA achieves performance within 5% of full finetuning on GSM8K with only 13 parameters. Dashed lines indicate untrained and full-FT baselines.</p></span></div></div><p class="paragraph" style="text-align:left;">This paper introduced a method called <b>TinyLoRA</b>, which showed that large language models can learn to reason effectively by updating as few as thirteen parameters (a change so small it occupies merely <b>26 bytes</b> of computer memory).</p><p class="paragraph" style="text-align:left;">By using reinforcement learning, they were able to train a massive model to achieve over 90% accuracy on complex math problems, effectively matching the performance of traditional methods that update thousands of times more data. The researchers found that this extreme efficiency works because reinforcement learning acts as a highly effective filter.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/4136ac49-12a3-40da-900f-99731c2378db/CleanShot_2026-02-08_at_23.51.40_2x.png?t=1770574910"/><div class="image__source"><span class="image__source_text"><p>TinyLoRA performance during training on MATH.</p></span></div></div><p class="paragraph" style="text-align:left;">Standard training methods train the model with &quot;noisy&quot; data that it tries to memorize, but reinforcement learning provides a clean, sparse signal (simply telling the model if its logic was right or wrong). 
This clarity allows the model to ignore irrelevant details and isolate the specific neural pathways needed for reasoning.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2602.04118?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=new-generative-paradigm-drifting-model"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="maximum-likelihood-reinforcement-le"><b>Maximum Likelihood Reinforcement Learning</b></h1><p class="paragraph" style="text-align:left;"><i>Tajwar et al. [Carnegie Mellon University, Tsinghua University, Zhejiang University</i>, <i>UC Berkeley, Impossible Inc.]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 714 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM RL </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Researchers use RL to teach AI models how to solve tasks with clear right-or-wrong outcomes, such as writing code or solving complex math proofs. The authors of this paper suggest that this strategy is actually just a rough, first-order approximation of a far more powerful mathematical goal: maximum likelihood.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/3abfb1ce-e25f-4cfa-9167-6fcad03631c9/teaser6.png?t=1770574951"/></div><p class="paragraph" style="text-align:left;">The problem has always been that targeting maximum likelihood directly in these complex, open-ended scenarios was considered computationally intractable. The research team set out to fix this misalignment, asking if there was a principled way to trade more computing power for a better, more mathematically rigorous learning objective without hitting a computational wall.</p><p class="paragraph" style="text-align:left;">The researchers introduced a framework called <b>MaxRL</b>, which elegantly bridges the gap between standard reinforcement learning and the ideal maximum likelihood objective. They demonstrated that the learning signal can be viewed as a mathematical series where standard methods only optimize the very first term. MaxRL unlocks a &quot;compute-indexed&quot; family of objectives. By generating more sample attempts (or &quot;rollouts&quot;) during training, the algorithm incorporates higher-order terms from this series, creating a progressively more accurate approximation of the ideal objective.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/2a9ce19f-8cf7-47f2-82df-6c4dc78e80fb/large_scale_experiments_summary_qwen_3_4B.png?t=1770574966"/><div class="image__source"><span class="image__source_text"><p>Results on Qwen3-4B. MaxRL Pareto-dominates GRPO across all benchmarks, achieving similar or better Pass@1 while significantly improving Pass@K. 
This translates to 7.9×–19.2× gains at test-time scaling efficiency.</p></span></div></div><p class="paragraph" style="text-align:left;">This approach effectively transforms extra computing time into a smarter learning signal. The study reveals that MaxRL focuses the model’s attention on harder, lower-probability successes rather than just reinforcing easy wins.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/8b800a5d-1760-4fd2-8a29-cd45b70861df/maze-example-dual.png?t=1770574999"/><div class="image__source"><span class="image__source_text"><p>Example maze: successful navigation (left) vs. failure case (right).</p></span></div></div><p class="paragraph" style="text-align:left;">Additionally, the method consistently outperforms existing techniques in mathematical reasoning and navigation tasks. The researchers found that their approach prevents the model from &quot;overfitting,&quot; or losing its creativity, as training progresses.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2602.02710?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=new-generative-paradigm-drifting-model"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="kimi-k-25-visual-agentic-intelligen"><b>Kimi K2.5: Visual Agentic Intelligence</b></h1><p class="paragraph" style="text-align:left;"><i>Kimi team at Moonshot AI</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 16K </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Visual LLM </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Models have long faced two problems: they process language and images in very different ways, and they struggle to handle massive, multi-step tasks efficiently. Until now, visual capabilities were often bolted onto text models like an afterthought, and &quot;agents&quot; worked linearly. The team behind Kimi K2.5 created a model where vision and language are treated as equal partners from day one, while simultaneously reinventing how AI manages complex workflows. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/1b672441-8b2a-4bfd-8b32-456b46cf5aba/CleanShot_2026-02-09_at_00.02.26_2x.png?t=1770575557"/><div class="image__source"><span class="image__source_text"><p>Vision RL training curves on vision benchmarks starting from minimal zero-vision SFT.</p></span></div></div><p class="paragraph" style="text-align:left;">It has two clever architectural innovations. First, rather than teaching the model to read and then teaching it to see, researchers integrated visual data and text together at the very beginning of the training process. 
This &quot;early fusion&quot; created a surprising synergy: learning to interpret images actually made the model better at text-based reasoning, and text training helped it understand visuals without needing specific image examples. It turns out that a balanced diet of inputs helps the AI generalize better across the board, creating a shared understanding where vision refines text and text bootstraps vision.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/28fe6782-c48e-4630-b217-bc0f008df1eb/CleanShot_2026-02-09_at_00.02.51_2x.png?t=1770575581"/><div class="image__source"><span class="image__source_text"><p>An agent swarm has a trainable orchestrator that dynamically creates specialized frozen subagents and decomposes complex tasks into parallelizable subtasks for efficient distributed execution.</p></span></div></div><p class="paragraph" style="text-align:left;">Secondly, the researchers introduced a framework called &quot;Agent Swarm.&quot; Instead of a single AI struggling through a long to-do list, Kimi K2.5 acts as an orchestrator. It looks at a complex problem, breaks it down into smaller pieces, and assigns them to specialized sub-agents that work in parallel.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/d4a0d66b-87c4-4f9d-a008-a20381361ef9/CleanShot_2026-02-09_at_00.03.25_2x.png?t=1770575616"/><div class="image__source"><span class="image__source_text"><p>Overview of training stages: data composition, token volumes, sequence lengths, and trainable components.</p></span></div></div><p class="paragraph" style="text-align:left;">By treating the AI as a manager of a digital swarm rather than a solo worker, the system can handle heavy workloads in coding and research that would typically overwhelm a standard model.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/0e702234-2ce6-4a79-97f9-a13aaa5c33b9/CleanShot_2026-02-09_at_00.03.54_2x.png?t=1770575658"/><div class="image__source"><span class="image__source_text"><p>Performance and token efficiency of some reasoning models. 
Average output token counts (in thousands) are shown in parentheses.</p></span></div></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2602.02276?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=new-generative-paradigm-drifting-model"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="true" class="youtube_embed" frameborder="0" height="100%" src="https://youtube.com/embed/IdV5TEIsJhs" width="100%"></iframe></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=80a6ce4f-8505-4aac-8d56-0b159e5f993c&utm_medium=post_rss&utm_source=the_ai_timeline">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Adaptive Intelligence 2026: The Rise of Continual Learning &amp; The End of Frozen AI Models?</title>
  <description>An early preview of Continual Learning in 2026</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/f6d79fda-932a-4f0b-a488-5def4c87b278/premium_insights_continual_learning.jpg" length="158224" type="image/jpeg"/>
  <link>https://mail.bycloud.ai/p/adaptive-intelligence-2026-the-rise-of-continual-learning-the-end-of-frozen-ai-models</link>
  <guid isPermaLink="true">https://mail.bycloud.ai/p/adaptive-intelligence-2026-the-rise-of-continual-learning-the-end-of-frozen-ai-models</guid>
  <pubDate>Fri, 06 Feb 2026 19:00:09 +0000</pubDate>
  <atom:published>2026-02-06T19:00:09Z</atom:published>
    <dc:creator>by cloud</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">We are seeing ground-breaking discoveries in the field of AI every couple of days, and in 2026, AI is entering a new period where static models are becoming a thing of the past. A lot of new research is starting to move toward systems that learn, adapt, and evolve in real time. For nearly a decade (2017-2025), we followed a simple approach: <span style="color:rgb(24, 128, 56);"><i>train → freeze → deploy</i></span>. Models became fixed artifacts the moment training ended, unable to grow with new data or shifting environments.</p><p class="paragraph" style="text-align:left;">But in 2026, we are seeing a new shift where the boundary between <i>training</i> and <i>inference</i> is disappearing. This is giving rise to a new age of <b>Adaptive Intelligence</b>. Continual Learning (CL) is no longer just about preventing catastrophic forgetting; it is now about active adaptation through Test-Time Training (TTT) and the stability offered by Reinforcement Learning (RL) compared to traditional Supervised Fine-Tuning (SFT). </p><p class="paragraph" style="text-align:left;">In this post, we will see how this shift is redefining how AI systems evolve, turning them into dynamic partners rather than frozen tools.</p><p class="paragraph" style="text-align:left;"><i>Let’s get started!</i></p><h2 class="heading" style="text-align:left;" id="what-is-continual-learning-in-ll-ms"><b>What is Continual Learning in LLMs</b></h2><p class="paragraph" style="text-align:left;">For over a decade, we built models under the following assumptions: &quot;Training&quot; was the expensive, compute-heavy phase where knowledge was forged, and &quot;Inference&quot; was the cheap, static phase where that knowledge was applied. This separation created a fundamental brittleness. A model frozen at the end of training could effectively handle the &quot;average&quot; case it saw during supervision, but it lacked the plasticity to adapt to the specific nuances of a new, complex problem instance.</p><p class="paragraph" style="text-align:left;">In 2026, we are seeing a lot of new work that no longer views a neural network as a fixed artifact. Instead, we view the inference process itself as a continuous learning loop. The model does not just <i>read</i> the test instance; it <i>trains</i> on it.</p><p class="paragraph" style="text-align:left;">This shift is currently driven by two distinct but complementary breakthroughs: <b>TTT for Context</b> (solving the memory bottleneck) and <b>TTT for Discovery</b> (solving the search bottleneck).</p><h3 class="heading" style="text-align:left;" id="ttt-for-context"><b>TTT for Context</b></h3><div class="paywall"><hr class="paywall__break"/><div class="paywall__content"><h2 class="paywall__header"> Subscribe to our premium insights to read more </h2><p class="paywall__description"> Become a paying subscriber to get access to this post and other subscriber-only content. 
</p><p class="paywall__links"><a class="paywall__upgrade_link" href="https://mail.bycloud.ai/upgrade?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=adaptive-intelligence-2026-the-rise-of-continual-learning-the-end-of-frozen-ai-models">Upgrade</a> Translation missing: en.app.shared.conjuction.or <a class="paywall__login_link" href="https://mail.bycloud.ai/login?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=adaptive-intelligence-2026-the-rise-of-continual-learning-the-end-of-frozen-ai-models">Sign In</a></p></div></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=5ff089f2-573d-40fd-ae57-b357afbeef17&utm_medium=post_rss&utm_source=the_ai_timeline">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>The First End-to-End Interpretability Method for Transformers</title>
  <description>and more on Quantization-Aware Distillation for NVFP4, RL via Self-Distillation </description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/3a3dea7e-97a5-47d9-a64b-2ae51862da97/issue_93.jpg" length="216703" type="image/jpeg"/>
  <link>https://mail.bycloud.ai/p/the-first-end-to-end-interpretability-method-for-transformers</link>
  <guid isPermaLink="true">https://mail.bycloud.ai/p/the-first-end-to-end-interpretability-method-for-transformers</guid>
  <pubDate>Tue, 03 Feb 2026 19:57:15 +0000</pubDate>
  <atom:published>2026-02-03T19:57:15Z</atom:published>
    <dc:creator>by cloud</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h6 class="heading" style="text-align:left;" id="nov-18-th-nov-24-th-33-latest-ai-re"><i>Jan 26th ~ Feb 2nd</i><br><i>#93 Latest AI Research Explained Simply</i></h6><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="industry-news-in-1-line">🗞️ Industry News in 1 Line</h2><ol start="1"><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 34k</span></span> <a class="link" href="https://blog.google/innovation-and-ai/models-and-research/google-deepmind/project-genie/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=the-first-end-to-end-interpretability-method-for-transformers" target="_blank" rel="noopener noreferrer nofollow">Google DeepMind has released Project Genie</a>, which is an experimental prototype that uses the Genie 3 world model to generate interactive virtual environments from text and visual prompts. This new tool allows users to design, edit, and <b>explore</b> their own worlds in <b>real-time</b>. Google subscribers in the US can try it today on <a class="link" href="http://labs.google/projectgenie/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=the-first-end-to-end-interpretability-method-for-transformers" target="_blank" rel="noopener noreferrer nofollow">Google Labs</a>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/10f80d0b-e295-4d62-aa4d-40beef4d8319/CleanShot_2026-02-03_at_22.09.44_2x.png?t=1770136820"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 15k</span></span> Claude has rolled out a <a class="link" href="https://claude.ai/directory?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=the-first-end-to-end-interpretability-method-for-transformers" target="_blank" rel="noopener noreferrer nofollow">suite of interactive integrations</a> for paid subscribers. Users can connect directly with tools like Slack, Figma, Asana, and Box. These updates enable users to draft messages, visualize diagrams, manage timelines, and query data from apps like Hex and Clay without leaving the interface. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/8b8e417d-25b5-43a0-b6d5-01e98dd1fb4e/CleanShot_2026-02-03_at_22.14.49_2x.png?t=1770137099"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 2.9k</span></span> <b><a class="link" href="https://Z.ai?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=the-first-end-to-end-interpretability-method-for-transformers" target="_blank" rel="noopener noreferrer nofollow">Z.ai</a></b> has released <b>GLM-OCR</b>, a <b>0.9B</b>-parameter OCR model tuned for complex documents (tables, formulas, code-heavy pages). It achieves SOTA results on major doc-understanding benchmarks and runs fast, at up to <b>1.86 pages/sec</b> on PDFs. 
Now available through <a class="link" href="https://docs.z.ai/guides/vlm/glm-ocr?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=the-first-end-to-end-interpretability-method-for-transformers" target="_blank" rel="noopener noreferrer nofollow">API</a>, their <a class="link" href="https://ocr.z.ai/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=the-first-end-to-end-interpretability-method-for-transformers" target="_blank" rel="noopener noreferrer nofollow">website</a>, and <a class="link" href="https://huggingface.co/zai-org/GLM-OCR?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=the-first-end-to-end-interpretability-method-for-transformers" target="_blank" rel="noopener noreferrer nofollow">HuggingFace</a>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e8167713-5631-41b4-b413-ba8f01af9f9a/image.png?t=1770141000"/><div class="image__source"><span class="image__source_text"><p>GLM-OCR benchmark</p></span></div></div></li></ol><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><div class="section" style="background-color:transparent;border-color:#2C81E5;border-style:solid;border-width:5px;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">Support My Newsletter</h2><p class="paragraph" style="text-align:left;"><span style="color:rgb(34, 34, 34);font-family:Georgia, "Times New Roman", serif;font-size:16px;">As I aim to keep this newsletter free forever, your support means a lot. If you like reading The AI Timeline, consider forwarding it to another research enthusiast. It helps us keep this up for free!</span></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://patreon.com/bycloud/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=the-first-end-to-end-interpretability-method-for-transformers"><span class="button__text" style=""> Check Out My Patreon </span></a></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://theaitimeline.carrd.co/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=the-first-end-to-end-interpretability-method-for-transformers" target="_blank" rel="noopener noreferrer nofollow">Advertise with The AI Timeline! </a></p></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="quantization-aware-distillation-for"><b>Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery</b></h1><p class="paragraph" style="text-align:left;">Xin<i> et al. 
[</i>NVIDIA<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 4.3k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;">Quantization</span></span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">The race to make Artificial Intelligence more efficient is facing a common problem. As engineers try to shrink massive language models into ultra-fast, energy-saving 4-bit formats (specifically NVFP4), they often encounter a steep penalty: the compressed models become less intelligent.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/9133f95d-b082-4e56-b2b1-52b0939ee953/CleanShot_2026-02-03_at_21.47.53_2x.png?t=1770135483"/><div class="image__source"><span class="image__source_text"><p>Comparison of quantization-aware training (QAT) and quantization-aware distillation (QAD).</p></span></div></div><p class="paragraph" style="text-align:left;">A common fix has been to simply retrain the compressed model, but this is no longer practical. Modern AI training has become a labyrinth of complex steps (such as supervised fine-tuning, reinforcement learning, and model merging) that is incredibly difficult to replicate perfectly. Researchers faced a common challenge: finding a way to preserve a model’s complex reasoning capabilities during compression without needing to retrace every complicated step of its original education.</p><p class="paragraph" style="text-align:left;">The team discovered that a technique called Quantization-Aware Distillation (QAD) can change how compressed models recover their lost accuracy. Instead of forcing the compressed model to relearn tasks from raw data, the researchers set up a digital mentorship. The original, full-precision model acts as a &quot;teacher,&quot; and the compressed 4-bit model acts as a &quot;student.&quot; Through a mathematical process involving KL divergence, the student stops trying to solve problems from scratch and instead focuses entirely on mimicking the teacher’s exact output patterns.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/f1df69cd-e735-444b-a604-ec16ac658936/CleanShot_2026-02-03_at_21.48.23_2x.png?t=1770135513"/><div class="image__source"><span class="image__source_text"><p>Impact of training data on QAD for AceReason Nemotron 1.1 7B.</p></span></div></div><p class="paragraph" style="text-align:left;">This method successfully restores compressed models to nearly the same accuracy as their full-sized counterparts. Since the student is learning behavior directly from the teacher rather than facts from a textbook, it does not require the original, high-quality training datasets. 
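In code, the heart of this setup is just a KL term between the two models&#39; token distributions; the sketch below is a generic illustration of that recipe (not NVIDIA&#39;s implementation), where student_logits come from the fake-quantized NVFP4 forward pass and teacher_logits from the frozen full-precision model.</p><pre><code>import torch
import torch.nn.functional as F

def qad_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    # Both tensors: (batch, seq_len, vocab). The teacher is the frozen full-precision
    # model; the student is the quantized copy being trained to mimic its outputs.
    t = temperature
    student_logprobs = F.log_softmax(student_logits / t, dim=-1).flatten(0, 1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1).flatten(0, 1)
    # KL(teacher || student), averaged per token; scaled by t^2 as is customary in distillation.
    return F.kl_div(student_logprobs, teacher_probs, reduction="batchmean") * (t * t)

# Sketch of a training step: note that no ground-truth labels are needed,
# only the teacher's output distribution on whatever text happens to be available.
# with torch.no_grad():
#     teacher_logits = teacher(input_ids).logits
# loss = qad_distillation_loss(student(input_ids).logits, teacher_logits)
</code></pre><p class="paragraph" style="text-align:left;">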
The researchers found that the student could recover its capabilities even when looking at partial data or synthetic information, making high-performance, energy-efficient AI far more accessible than previously thought.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2601.20088?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=the-first-end-to-end-interpretability-method-for-transformers"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="shaping-capabilities-with-tokenleve"><b>Shaping capabilities with token-level data filtering</b></h1><p class="paragraph" style="text-align:left;">Rathi and Radford<i> [</i>Anthropic, Stanford<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 882 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Data filtering </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> bycloud’s pick </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Building safe artificial intelligence often feels like a cat-and-mouse game. Currently, developers train massive models on the entire internet, which inevitably means the AI learns dangerous information (like how to synthesize bioweapons) alongside useful facts.</p><p class="paragraph" style="text-align:left;">Safety teams then try to suppress this knowledge after the fact, essentially putting a muzzle on the model. The problem is that because the dangerous knowledge is still buried deep inside the system, adversaries can often find ways to break the muzzle and retrieve the information.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/90b82484-1d65-45a0-bfca-bf1fa238c0fa/CleanShot_2026-02-03_at_21.52.08_2x.png?t=1770135736"/><div class="image__source"><span class="image__source_text"><p>Operationalizing token filtering.</p></span></div></div><p class="paragraph" style="text-align:left;">Researchers recently tackled this foundational issue by asking a simple but difficult question: Is it possible to surgically remove dangerous concepts from the training data itself, so the model never learns them in the first place? They tested this by trying to teach an AI general biology while completely preventing it from learning medical advice, using this as a proxy for blocking dangerous capabilities.</p><p class="paragraph" style="text-align:left;">Traditionally, if a training document contained harmful information, engineers would discard the entire file. This is a blunt instrument that wastes valuable context and unrelated knowledge. Instead, these researchers developed a method called token-level filtering. 
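In training terms, that usually amounts to masking the loss on the flagged tokens instead of discarding whole documents; the snippet below is a hypothetical illustration of the idea (the classifier that decides which tokens to flag is assumed, not shown).</p><pre><code>import torch
import torch.nn.functional as F

def token_filtered_loss(logits, labels, keep_mask):
    # logits: (batch, seq, vocab) model outputs; labels: (batch, seq) target token ids;
    # keep_mask: (batch, seq) bool, False for tokens a classifier flagged as unsafe.
    vocab = logits.size(-1)
    per_token = F.cross_entropy(
        logits.reshape(-1, vocab), labels.reshape(-1), reduction="none"
    ).reshape(labels.shape)
    per_token = per_token * keep_mask.float()          # zero the loss on "redacted" tokens
    return per_token.sum() / keep_mask.float().sum().clamp(min=1.0)

# The document itself stays in the batch; only the flagged tokens stop contributing
# gradient, so the surrounding context is still learned from.
</code></pre><p class="paragraph" style="text-align:left;">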
Think of it like a government redacting a classified file: instead of shredding the whole page, they simply take a black marker to the specific dangerous words and phrases while leaving the surrounding sentences visible.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/bd9f8bde-7e14-452b-8914-ae0f239cd5c2/CleanShot_2026-02-03_at_21.52.30_2x.png?t=1770135759"/><div class="image__source"><span class="image__source_text"><p>Token filtering scales better than document filtering.</p></span></div></div><p class="paragraph" style="text-align:left;">By identifying and hiding these specific pieces of information during the training process, they found they could effectively lobotomize the model’s ability to perform dangerous tasks without hurting its general intelligence or ability to understand related topics.</p><p class="paragraph" style="text-align:left;">What makes this approach so hopeful is how well it scales. The study revealed that this surgical filtering actually becomes more effective as the AI models get larger and more powerful, creating a massive efficiency gap between the effort required to train a safe model versus the effort required to make it dangerous again.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/3ef01ace-ea8e-49d5-b433-1f39e9680029/CleanShot_2026-02-03_at_21.52.57_2x.png?t=1770135790"/><div class="image__source"><span class="image__source_text"><p>Data filtering decreases MCQ performance on the forget domain without substantial damage to the retain domain. </p></span></div></div><p class="paragraph" style="text-align:left;">Even more surprisingly, the researchers found that &quot;forgetting&quot; the data didn’t blind the model completely. These filtered models could still be easily trained to recognize and refuse questions about the forbidden topics, proving that an AI doesn&#39;t need to know how to build a bomb to know it should refuse to help you build one. </p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2601.21571?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=the-first-end-to-end-interpretability-method-for-transformers"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="reinforcement-learning-via-self-dis"><b>Reinforcement Learning via Self-Distillation</b></h1><p class="paragraph" style="text-align:left;">Hübotter<i> et al. 
[</i>ETH Zurich, Max Planck Institute for Intelligent Systems, MIT, Stanford<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 320 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM Scaling Law </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Current methods for teaching artificial intelligence to handle complex reasoning are like a particularly harsh school exam: the model attempts a math problem or writes code, and the system simply tells it &quot;Pass&quot; or &quot;Fail.&quot; This approach, known as Reinforcement Learning with Verifiable Rewards, creates a significant bottleneck.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/32a91017-8152-4053-841c-e7cd7c20e461/sdpo-fig.png?t=1770136007"/></div><p class="paragraph" style="text-align:left;">It is incredibly difficult for a model to figure out exactly which step in a long chain of reasoning caused the failure when the only feedback is a single score. Researchers realized that the software environments these models operate in usually offer much more detail (like error messages) that were being ignored. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/5315e835-a0cc-4459-bcaa-0c50625410b5/chemistry-accuracy-response.png?t=1770135990"/><div class="image__source"><span class="image__source_text"><p>Training progression of Olmo3-7B-Instruct on Chemistry.</p></span></div></div><p class="paragraph" style="text-align:left;">To solve this, researchers developed a method called Self-Distillation Policy Optimization (SDPO), which allows the model to act as its own mentor. When the model generates an answer that fails, it takes the detailed feedback from the environment and re-evaluates its original attempt. With this new context, the model can retrospectively see exactly where it went wrong and calculate what it <i>should</i> have done differently. It then &quot;distills&quot; this insight back into its own network, effectively updating its behavior to match this wiser version of itself.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/d7b3ed7f-388a-43b6-9314-8b5456227e30/sdpo-fig-training-loop.png?t=1770136018"/><div class="image__source"><span class="image__source_text"><p>Self-Distilled Policy Optimization (SDPO) Loop</p></span></div></div><p class="paragraph" style="text-align:left;">This approach transforms error messages from simple penalties into dense, actionable lessons. The study found that this self-teaching method allowed models to reach high levels of accuracy much faster than traditional methods, often requiring four times fewer attempts to reach the same performance. 
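To make the loop concrete, here is a heavily simplified sketch in the style of the Hugging Face transformers API (an illustration of the idea, not the paper&#39;s exact SDPO objective; the model name and the stand-in environment are placeholders).</p><pre><code>import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical setup: any small instruction-tuned causal LM works for the sketch.
name = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

def run_environment(candidate_code):
    # Stand-in for the real environment; imagine unit tests returning an error trace.
    return "AssertionError: factorial(5) returned None, expected 120"

def complete(prompt, max_new_tokens=128):
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

prompt = "Write a Python function factorial(n) that returns n!."
attempt = complete(prompt)                 # rollout from the current policy
feedback = run_environment(attempt)        # rich feedback, not just pass/fail

# Self-teaching pass: shown its failed attempt plus the feedback, the same model
# produces a revised answer, which plays the role of the wiser "teacher" target.
revision = complete(prompt + "\n\nPrevious attempt:\n" + attempt
                    + "\n\nFeedback:\n" + feedback + "\n\nCorrected answer:\n")

# Distillation step: the plain-prompt policy is updated toward the revised answer,
# internalising the lesson so the feedback is no longer needed at inference time.
# (For brevity the loss here is taken over the whole prompt-plus-answer sequence.)
batch = tok(prompt + "\n" + revision, return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward(); opt.step(); opt.zero_grad()
</code></pre><p class="paragraph" style="text-align:left;">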
The models became more efficient reasoners as they learned to avoid the circular logic and verbal filler that often plague AI &quot;thinking&quot; processes.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/fe437bf0-20f0-4073-9580-c0897a1dda4b/very-hard-questions.png?t=1770136037"/><div class="image__source"><span class="image__source_text"><p>Test-time self-distillation on hard coding problems.</p></span></div></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2601.20802?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=the-first-end-to-end-interpretability-method-for-transformers"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="tensor-lens-endto-end-transformer-a"><b>TensorLens: End-to-End Transformer Analysis via High-Order Attention Tensors</b></h1><p class="paragraph" style="text-align:left;">Atad<i> et al. [</i>Tel Aviv University<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 902 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Transformer Interpretability </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">If you try to understand how a massive, complex machine works, you can usually only inspect one gear at a time. Researchers face the same problem when analyzing Transformer models. While scientists can examine individual &quot;attention heads&quot; or specific layers to guess how the model processes information, they have struggled to see the full picture.</p><p class="paragraph" style="text-align:left;">Previous attempts to map the model&#39;s global behavior used rough averages or incomplete combinations of these parts, which ignored components like normalization or feed-forward networks. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/1e6d9bc3-e938-4c2c-8317-114de93cb598/CleanShot_2026-02-03_at_22.01.32_2x.png?t=1770136301"/><div class="image__source"><span class="image__source_text"><p>Transformers are re-formulated as data-controlled linear operators, characterized by an input-dependent high-order attention tensor T.</p></span></div></div><p class="paragraph" style="text-align:left;">To solve this, the team introduced a new mathematical framework called TensorLens. Instead of treating the Transformer as a collection of disjointed parts, they successfully reformulated the entire architecture into a single, unified concept known as a high-order interaction tensor. 
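A toy example makes the &quot;data-controlled linear operator&quot; view concrete: for a single attention head, once the attention weights have been computed from the input, the whole operation collapses into one input-dependent matrix acting on the flattened input. The sketch below only illustrates that reformulation idea; it is not the paper&#39;s full construction, which also folds in layer normalization, feed-forward blocks, and residual paths.</p><pre><code>import torch

torch.manual_seed(0)
n, d = 5, 8                                   # sequence length, hidden size
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
x = torch.randn(n, d)

# Ordinary single-head self-attention (no mask, no output projection).
A = torch.softmax((x @ Wq) @ (x @ Wk).T / d ** 0.5, dim=-1)
out = A @ (x @ Wv)

# The same computation viewed as one data-controlled linear operator acting on
# the flattened input: T depends on x (through A), but for this x the map is linear.
T = torch.kron(A, Wv.T)                       # shape (n*d, n*d)
out_as_operator = (T @ x.flatten()).reshape(n, d)
print(torch.allclose(out, out_as_operator, atol=1e-5))   # True
</code></pre><p class="paragraph" style="text-align:left;">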
This approach captures every element of the computation (including the attention mechanisms, feed-forward networks, activations, and residual connections) and expresses them as one cohesive &quot;linear operator&quot; that adapts based on the input data.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/6b18451d-34c8-4f96-855b-f190407b8ef5/CleanShot_2026-02-03_at_22.02.35_2x.png?t=1770136365"/><div class="image__source"><span class="image__source_text"><p>A schematic visualization of our method, where each sub-component of the transformer architecture</p></span></div></div><p class="paragraph" style="text-align:left;">By replacing simple matrices with these high-order tensor structures, the researchers created a theoretically grounded way to represent exactly how the model transforms information from start to finish. Their validation showed that this method yields much richer and more accurate representations of the model’s behavior than previous aggregation techniques. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/12d958f8-7a5d-40f4-bb57-a3dbe14cb321/CleanShot_2026-02-03_at_22.03.16_2x.png?t=1770136405"/><div class="image__source"><span class="image__source_text"><p>Perturbation Tests in NLP</p></span></div></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2601.17958?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=the-first-end-to-end-interpretability-method-for-transformers"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="true" class="youtube_embed" frameborder="0" height="100%" src="https://youtube.com/embed/9GWOksNjFpY" width="100%"></iframe></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=b9ba486d-37a3-40e7-ab5f-080cc9192365&utm_medium=post_rss&utm_source=the_ai_timeline">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Learning to Discover at Test Time</title>
  <description>plus more on Memorization Dynamics in Knowledge Distillation and  Efficient Agents</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/d3694795-5a8c-4eea-a5be-7c09bac10d14/issue_92.jpg" length="145133" type="image/jpeg"/>
  <link>https://mail.bycloud.ai/p/learning-to-discover-at-test-time</link>
  <guid isPermaLink="true">https://mail.bycloud.ai/p/learning-to-discover-at-test-time</guid>
  <pubDate>Tue, 27 Jan 2026 19:13:14 +0000</pubDate>
  <atom:published>2026-01-27T19:13:14Z</atom:published>
    <dc:creator>by cloud</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h6 class="heading" style="text-align:left;" id="nov-18-th-nov-24-th-33-latest-ai-re"><i>Jan 20th ~ Jan 27th</i><br><i>#92 Latest AI Research Explained Simply</i></h6><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="industry-news-in-1-line">🗞️ Industry News in 1 Line</h2><ol start="1"><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 11k</span></span> <a class="link" href="https://docs.molt.bot/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=learning-to-discover-at-test-time" target="_blank" rel="noopener noreferrer nofollow">Moltbot is THE personal AI assistant</a> everyone imagined AI to be. Previously known as Clawdbot, it is an open source AI assistant that can connect messaging platforms like WhatsApp, Telegram, and Slack to a private gateway running on your own hardware. It can run seamlessly on Linux and Windows (via WSL2), and it even integrates with iMessage (requires macOS hardware) and automates everything for you. Just be aware of <a class="link" href="https://x.com/theonejvo/status/2015401219746128322?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=learning-to-discover-at-test-time" target="_blank" rel="noopener noreferrer nofollow">security</a> when setting it up!</p></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 44k</span></span> <a class="link" href="https://claude.com/claude-in-excel?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=learning-to-discover-at-test-time" target="_blank" rel="noopener noreferrer nofollow">Claude has released a beta integration</a> that adds AI reasoning directly into your spreadsheets to analyze complex formulas and multi-tab dependencies. The tool allows you to debug errors, update assumptions, and build models while preserving your original structure, all accessible via a simple keyboard shortcut. Unlike standard chat interfaces, it provides transparent explanations with cell-level citations, ensuring you can verify the logic behind every calculation.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/d6da0762-9918-4898-a923-eff2c34130b0/CleanShot_2026-01-27_at_20.34.33_2x.png?t=1769526286"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 6.1k</span></span> <a class="link" href="https://qwen.ai/blog?id=qwen3tts-0115&utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=learning-to-discover-at-test-time" target="_blank" rel="noopener noreferrer nofollow">Qwen has open-sourced the Qwen3-TTS family</a>, which is a suite of high-performance speech generation models (1.7B and 0.6B) that deliver <b>state-of-the-art voice cloning</b> and design across 10 languages. Built on a novel non-DiT architecture with Dual-Track modeling, the tool enables ultra-low latency streaming (outputting audio after a single character input) and allows for precise control over emotion, tone, and prosody via natural language instructions. 
Try it on <a class="link" href="https://huggingface.co/spaces/Qwen/Qwen3-TTS?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=learning-to-discover-at-test-time" target="_blank" rel="noopener noreferrer nofollow">Hugging Face</a> or via the <a class="link" href="https://www.alibabacloud.com/help/en/model-studio/qwen-tts-voice-design?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=learning-to-discover-at-test-time" target="_blank" rel="noopener noreferrer nofollow">Alibaba Cloud API</a>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/103c23ec-a25f-43cc-a0d6-3f823477ef29/archi.png?t=1769526718"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 4.4k</span></span> <a class="link" href="https://plus.excalidraw.com/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=learning-to-discover-at-test-time" target="_blank" rel="noopener noreferrer nofollow">Excalidraw</a> has added a new &quot;smarter, faster, stronger&quot; text-to-diagram feature that uses a streaming chat interface for real-time visual generation. This update is accessible now within the free version of the tool (with some usage limits) and remains open-source on GitHub, though users may need to refresh their browser cache or use incognito mode to ensure they aren&#39;t loading the older build.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/9cba6448-8ac6-44c9-b688-0261326274ab/CleanShot_2026-01-27_at_20.46.03_2x.png?t=1769526976"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 7.4k</span></span> <a class="link" href="https://www.kimi.com/blog/kimi-k2-5.html?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=learning-to-discover-at-test-time" target="_blank" rel="noopener noreferrer nofollow">Moonshot AI has released Kimi K2.5</a>, an open-source visual agentic model that achieves state-of-the-art performance on major benchmarks, including HLE (50.2%) and BrowseComp (74.9%). The model excels at converting multimodal inputs (like chats, images, and videos) into fully functional, aesthetic websites and features a beta &quot;Agent Swarm&quot; capability that allows up to 100 self-directed sub-agents to collaborate on complex tasks 4.5× faster than single-agent setups. 
<a class="link" href="https://platform.moonshot.ai/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=learning-to-discover-at-test-time" target="_blank" rel="noopener noreferrer nofollow">Try it now</a> or download from <a class="link" href="https://huggingface.co/moonshotai/Kimi-K2.5?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=learning-to-discover-at-test-time" target="_blank" rel="noopener noreferrer nofollow">Hugging Face.</a></p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/84b48350-1575-4ef6-b6ad-7e585177d5f2/orchestrator-1.png?t=1769527208"/></div></li></ol><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><div class="section" style="background-color:transparent;border-color:#2C81E5;border-style:solid;border-width:5px;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">Learn LLMs Intuitively - Intuitive AI Academy</h2><div class="image"><a class="image__link" href="https://www.intuitiveai.academy/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=learning-to-discover-at-test-time" rel="noopener" target="_blank"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/734c79dc-aaa6-46ce-ac7d-41a5f4d84381/image.png?t=1769003669"/></a></div><p class="paragraph" style="text-align:left;">Want to learn about LLMs, but never have a good place to start?</p><p class="paragraph" style="text-align:left;">Intuitive AI Academy has the perfect starting point for you! We focus on<b> building your intuition to understand LLMs</b>, from transformer components, to post-training logic. All in one place.</p><p class="paragraph" style="text-align:left;">We have just added a brand new chapter, an in-depth walkthrough of the current <a class="link" href="https://www.intuitiveai.academy/en/llm-fundamentals/advanced/lora?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=learning-to-discover-at-test-time" target="_blank" rel="noopener noreferrer nofollow">LoRA research landscape</a>. Coming up next is MoE.</p><div class="image"><a class="image__link" href="https://www.intuitiveai.academy/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=learning-to-discover-at-test-time" rel="noopener" target="_blank"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/377a8b3f-2557-4a32-8115-242eb6a2146e/image.png?t=1769004014"/></a><div class="image__source"><span class="image__source_text"><p>content overview (a total of 110k words explainer so far!)</p></span></div></div><p class="paragraph" style="text-align:left;">We currently have a New Year New Me launch offer, where you would get 50% off yearly plan FOREVER for our early users. 
Use code: <b>2026</b></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.intuitiveai.academy/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=learning-to-discover-at-test-time"><span class="button__text" style=""> Check Out Intuitive AI Academy </span></a></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://theaitimeline.carrd.co/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=learning-to-discover-at-test-time" target="_blank" rel="noopener noreferrer nofollow">Advertise with The AI Timeline! </a></p></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="memorization-dynamics-in-knowledge-"><b>Memorization Dynamics in Knowledge Distillation for Language Models</b></h1><p class="paragraph" style="text-align:left;">Borkar<i> et al. [</i>Meta Superintelligence Labs, Meta Central Applied Science, FAIR at Meta, Northeastern University, Carnegie Mellon University<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 502 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;">Knowledge Distillation</span></span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></span></p><p class="paragraph" style="text-align:left;">Knowledge distillation is a process where a highly capable &quot;teacher&quot; AI helps train a smaller, more efficient &quot;student&quot; model. It allows us to capture the teacher&#39;s intelligence without the massive computational cost. However, large models are notorious for memorizing specific snippets of their training data, from phone numbers to unique sentences.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/aa77473a-ed55-4773-82f2-9ffa0b1d4621/CleanShot_2026-01-27_at_19.54.14_2x.png?t=1769523867"/><div class="image__source"><span class="image__source_text"><p>Experimental framework</p></span></div></div><p class="paragraph" style="text-align:left;">The researchers of this paper want to know: does the student simply parrot the teacher&#39;s memories, or can it learn to think like the teacher while forgetting the specific, sensitive data that was used for training?</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/24ff8222-0a13-45a6-84f3-b515f917c041/CleanShot_2026-01-27_at_19.54.59_2x.png?t=1769523915"/><div class="image__source"><span class="image__source_text"><p>Overlap of memorized examples.</p></span></div></div><p class="paragraph" style="text-align:left;">The study shows that when a student model learns from a teacher, it actually acts as a privacy filter. 
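Memorization here is typically measured with an extraction-style test: prompt the model with a prefix taken from a training document and check whether greedy decoding reproduces the exact continuation. The function below is a rough sketch of such a check (definitions vary across papers, and this is not the authors&#39; code); model and tok can be any Hugging Face-style causal LM and tokenizer.</p><pre><code>import torch

def is_memorized(model, tok, text, prefix_tokens=32, suffix_tokens=32):
    # Crude extraction test on one training snippet, assumed long enough:
    # prompt with the first prefix_tokens tokens and see whether greedy decoding
    # reproduces the next suffix_tokens tokens verbatim.
    ids = tok(text, return_tensors="pt").input_ids[0]
    prefix = ids[:prefix_tokens].unsqueeze(0)
    target = ids[prefix_tokens:prefix_tokens + suffix_tokens]
    with torch.no_grad():
        out = model.generate(prefix, max_new_tokens=suffix_tokens, do_sample=False)
    continuation = out[0, prefix_tokens:prefix_tokens + suffix_tokens]
    return bool(torch.equal(continuation, target))

# Comparing how often this returns True for a directly trained model versus a
# distilled student, over the same training snippets, is roughly the comparison
# the paper makes.
</code></pre><p class="paragraph" style="text-align:left;">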
These distilled models ended up memorizing <b>significantly less</b> training data, cutting the rate of memorization by more than half compared to models trained the standard way. It turns out that the student primarily holds onto &quot;easy&quot; information that is simple to compress and understand, while successfully rejecting the complex, specific examples that the teacher had memorized.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/af012521-c18f-415a-b69c-805929510e88/CleanShot_2026-01-27_at_19.55.36_2x.png?t=1769523947"/><div class="image__source"><span class="image__source_text"><p>Shannon Entropy vs. Log-Probability Analysis.</p></span></div></div><p class="paragraph" style="text-align:left;">By allowing the student to mimic the teacher’s uncertainty rather than forcing it to memorize hard answers, the distillation process naturally prevents the model from overfitting to specific training examples. The student effectively learns the general skills without retaining the exact data used to teach them. </p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2601.15394?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=learning-to-discover-at-test-time"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="learning-to-discover-at-test-time"><b>Learning to Discover at Test Time</b></h1><p class="paragraph" style="text-align:left;"><i>Yuksekgonul et al. [</i>Stanford University, NVIDIA, Astera Institute, UC San Diego, Together AI<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 680 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> bycloud’s pick </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Until now, most AI models have functioned like the student who stops learning once the test begins; they are &quot;frozen&quot; and can only search for answers based on past training. To succeed, they need to learn and adapt in the moment, internalizing new concepts from their own failed attempts.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/859e4f87-a4d8-4523-9a12-8a4e7849e2c7/CleanShot_2026-01-27_at_20.08.03_2x.png?t=1769524695"/><div class="image__source"><span class="image__source_text"><p>TTT-Discover continues to train an LLM on a single problem at test time.</p></span></div></div><p class="paragraph" style="text-align:left;">The researchers for this paper have developed a method called &quot;Test-Time Training to Discover&quot;. 
Instead of trying to be good at everything on average, this approach allows the AI to become a hyper-specialist on a single problem. By using reinforcement learning during the test phase, the model treats its own search attempts as new training data, ignoring safe, average answers to hunt for the single, exceptional &quot;eureka&quot; moment.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/0e411c49-87d2-4074-af18-97217f5e650c/CleanShot_2026-01-27_at_20.08.54_2x.png?t=1769524746"/></div><p class="paragraph" style="text-align:left;">In mathematics, the system improved upon the bounds of a problem posed by Erdős in 1955, discovering a complex, asymmetric solution that had eluded both human mathematicians and previous AI systems.</p><p class="paragraph" style="text-align:left;">In computer engineering, it wrote code for GPU chips that ran significantly faster than kernels hand-optimized by human experts. It achieved these state-of-the-art results using an open-source model, outperforming larger, proprietary models simply by having the freedom to learn while it worked.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/38e45423-558d-4414-ae32-f5500bf52525/CleanShot_2026-01-27_at_20.09.57_2x.png?t=1769524808"/><div class="image__source"><span class="image__source_text"><p>Results in two AtCoder Heuristic Competitions.</p></span></div></div><p class="paragraph" style="text-align:left;">By shifting the focus from massive, static training to dynamic, problem-specific adaptation, this research has opened the door to solving highly specific challenges in biology, engineering, and mathematics that currently seem out of reach. </p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2601.16175?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=learning-to-discover-at-test-time"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="toward-efficient-agents-memory-tool"><b>Toward Efficient Agents: Memory, Tool learning, and Planning</b></h1><p class="paragraph" style="text-align:left;">Yang et al.<i> [</i>Shanghai Artificial Intelligence Laboratory, Fudan University, University of Science and Technology of China, Shanghai Jiaotong University, Chinese Academy of Sciences, The Chinese University of Hong Kong (Shenzhen), Hong Kong Polytechnic University, Wuhan University, Tsinghua University<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 96 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM Agents Survey </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">AI models can generate text, but we need &quot;agents&quot; capable of actively interacting with the world to execute complex workflows. 
However, this autonomy comes with a heavy price. Unlike a standard chatbot that answers a question and moves on, an agent operates in a recursive loop (planning, acting, remembering, and observing).</p><p class="paragraph" style="text-align:left;">This creates a compounding accumulation of information, where the output of one step becomes the costly input for the next. This paper aims to solve the problem of &quot;context window saturation&quot; to ensure that AI systems remain sustainable, responsive, and accessible to everyone rather than becoming too computationally expensive to run.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/a825d40c-24dd-4f0c-acd1-867daecf8662/CleanShot_2026-01-27_at_20.21.07_2x.png?t=1769525477"/><div class="image__source"><span class="image__source_text"><p>From LLMs to agents: standalone reasoning to trajectory-level reasoning with memory, planning, and tool learning, while introducing additional cost sources.</p></span></div></div><p class="paragraph" style="text-align:left;">This survey suggests that creating an &quot;efficient agent&quot; does not simply mean building a smaller model; it requires optimizing the entire system to maximize success while minimizing resource consumption. The researchers found that efficiency gains come from rethinking three specific areas: memory, tool usage, and planning.</p><p class="paragraph" style="text-align:left;">In terms of <b>memory</b>, the field is moving away from forcing an agent to re-read every past interaction. Instead, new techniques allow agents to compress history into summaries or &quot;latent&quot; states, essentially giving the AI a working memory that retains the gist of a conversation without the computational weight of the raw text.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/7af5e30c-82df-4b79-9a99-4e7c3ad0b8e7/paper-memory-picture.png?t=1769525439"/><div class="image__source"><span class="image__source_text"><p>Efficient memory overview.</p></span></div></div><p class="paragraph" style="text-align:left;">When it comes to <b>using external tools</b>, we should focus on optimizing how the agent interacts with its environment. Rather than testing tools sequentially or randomly, efficient agents are now using sophisticated retrieval systems to identify the correct tool instantly. Advanced agents can also perform parallel execution, where agents can run multiple tasks simultaneously rather than waiting for one to finish before starting another, drastically reducing the latency between a thought and an action.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/f24a7841-5df4-464b-8274-a0faa5c376df/CleanShot_2026-01-27_at_20.22.06_2x.png?t=1769525535"/><div class="image__source"><span class="image__source_text"><p>Efficient tool learning comprises three stages</p></span></div></div><p class="paragraph" style="text-align:left;">The researchers propose reimagining <b>how agents plan</b> their next moves by treating &quot;thinking&quot; as a limited resource. 
Instead of allowing for unbounded reasoning, new methods introduce the concept of &quot;budgeted deliberation.&quot;</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/c97a0434-b3a6-4f90-8b7f-40a9eef44200/CleanShot_2026-01-27_at_20.22.36_2x.png?t=1769525567"/><div class="image__source"><span class="image__source_text"><p>Overview of Efficient Planning.</p></span></div></div><p class="paragraph" style="text-align:left;">By using reinforcement learning, systems are being trained to value brevity, effectively rewarding the agent not just for getting the correct answer, but for solving the problem with the fewest possible steps and the least amount of computational effort.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2601.14192?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=learning-to-discover-at-test-time"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="true" class="youtube_embed" frameborder="0" height="100%" src="https://youtube.com/embed/IdV5TEIsJhs" width="100%"></iframe></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=a742ef62-ae12-4e4b-936b-4cd15cfc011f&utm_medium=post_rss&utm_source=the_ai_timeline">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Yet Another DeepSeek Architectural Research: Engram</title>
  <description>plus more on DroPE: Dropping RoPE, STEM, and Dr. Zero</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b01f9f81-8c7e-4fd5-bd51-f7f5509f2cfa/issue_91.jpg" length="175813" type="image/jpeg"/>
  <link>https://mail.bycloud.ai/p/yet-another-deepseek-architectural-research-engram</link>
  <guid isPermaLink="true">https://mail.bycloud.ai/p/yet-another-deepseek-architectural-research-engram</guid>
  <pubDate>Wed, 21 Jan 2026 15:00:47 +0000</pubDate>
  <atom:published>2026-01-21T15:00:47Z</atom:published>
    <dc:creator>by cloud</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h6 class="heading" style="text-align:left;" id="nov-18-th-nov-24-th-33-latest-ai-re"><i>Jan 13th ~ Jan 19th</i><br><i>#91 Latest AI Research Explained Simply</i></h6><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="industry-news-in-1-line">🗞️ Industry News in 1 Line</h2><ol start="1"><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 5.9k</span></span> <a class="link" href="https://Z.ai?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=yet-another-deepseek-architectural-research-engram" target="_blank" rel="noopener noreferrer nofollow">Z.ai</a> has released<b> GLM-4.7-Flash</b>, a lightweight ~30B-class local model aimed at coding and agentic workflows, with support for long-context writing, translation, and roleplay. Weights are available on<a class="link" href="https://huggingface.co/zai-org/GLM-4.7-Flash?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=yet-another-deepseek-architectural-research-engram" target="_blank" rel="noopener noreferrer nofollow"> Hugging Face</a>, and the API includes a free GLM-4.7-Flash tier (1 concurrency) plus a faster GLM-4.7-FlashX option.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/f1d88ab2-a74e-4372-b28d-19e39b6ccdc0/image.png?t=1769004158"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 9.5k</span></span> OpenAI says it will begin<b> testing ads in ChatGPT </b>in the coming weeks for logged-in adults on the Free and Go tiers in the U.S., with ads clearly separated and labeled and responses not influenced by advertising. OpenAI also says chats won’t be shared or sold to advertisers, and Pro, Business, and Enterprise tiers will remain ad-free</p></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 6.9k</span></span> Google has introduced <a class="link" href="https://t.co/7b09LlhYh7?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=yet-another-deepseek-architectural-research-engram" target="_blank" rel="noopener noreferrer nofollow"><b>Personal Intelligence</b></a> inside Gemini App, an opt-in beta that lets it securely connect your Gmail, Photos, Search, and YouTube history to deliver more personalized help. </p></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 7.4k</span></span> Google DeepMind has released <a class="link" href="https://blog.google/innovation-and-ai/technology/developers-tools/translategemma?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=yet-another-deepseek-architectural-research-engram" target="_blank" rel="noopener noreferrer nofollow"><b>TranslateGemma</b></a>, a new family of open translation models built on Gemma 3 with 55-language support in 4B, 12B, and 27B sizes. 
It’s designed for efficient, low-latency translation (including text-in-image) and is available on Hugging Face and Kaggle.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/6fa82edd-9014-44a9-92c4-dab42183e581/image.png?t=1769004177"/></div></li></ol><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><div class="section" style="background-color:transparent;border-color:#2C81E5;border-style:solid;border-width:5px;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">Learn LLMs Intuitively - Intuitive AI Academy</h2><div class="image"><a class="image__link" href="https://www.intuitiveai.academy/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=yet-another-deepseek-architectural-research-engram" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/734c79dc-aaa6-46ce-ac7d-41a5f4d84381/image.png?t=1769003669"/></a></div><p class="paragraph" style="text-align:left;">Want to learn about LLMs, but never have a good place to start?</p><p class="paragraph" style="text-align:left;">Intuitive AI Academy has the perfect starting point for you! We focus on<b> building your intuition to understand LLMs</b>, from transformer components, to post-training logic. All in one place.</p><p class="paragraph" style="text-align:left;">This is my latest project, and we will include write-up accompanying my latest videos on the frontier of AI research, all to help you keep up without the hassle of digging through papers and understanding math.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/377a8b3f-2557-4a32-8115-242eb6a2146e/image.png?t=1769004014"/><div class="image__source"><span class="image__source_text"><p>content overview (a total of 100k words explainer so far!)</p></span></div></div><p class="paragraph" style="text-align:left;">We currently have a New Year New Me launch offer, where you would get 50% off yearly plan FOREVER for our early users. Use code: <b>2026</b></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.intuitiveai.academy/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=yet-another-deepseek-architectural-research-engram"><span class="button__text" style=""> Check Out Intuitive AI Academy </span></a></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://theaitimeline.carrd.co/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=yet-another-deepseek-architectural-research-engram" target="_blank" rel="noopener noreferrer nofollow">Advertise with The AI Timeline! 
</a></p></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="extending-the-context-of-pretrained"><b>Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings</b></h1><p class="paragraph" style="text-align:left;"><i>Gelberg et al. [</i>Sakana AI, University of Oxford<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1.7k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;">Positional Embeddings</span></span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Teaching LLMs to process long documents is often very costly. Models are trained on shorter snippets of text to save computing power. But when they are asked to handle longer sequences, like analyzing a whole book instead of a chapter, they struggle because they rely on specific &quot;positional embeddings&quot; (mathematical tags that mark the order of words). Until now, fixing this required an expensive second phase of training to retune the model for longer contexts.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/ba15b9a3-2405-4b1c-85eb-dabe37b9f4be/CleanShot_2026-01-20_at_22.27.39_2x.png?t=1768928271"/><div class="image__source"><span class="image__source_text"><p>DroPE matches RoPE’s in-context perplexity.</p></span></div></div><p class="paragraph" style="text-align:left;">This paper made a counterintuitive discovery that the tools used to help models learn word order are actually holding them back. The team found that while positional embeddings act as a guide during the initial learning process, they become a hindrance later on, which prevents the model from generalizing to lengths it hasn&#39;t seen before. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/737448ff-86bd-422d-a999-35e5b3739495/CleanShot_2026-01-20_at_22.28.07_2x.png?t=1768928299"/><div class="image__source"><span class="image__source_text"><p>RoPE transformers have higher positional bias gradients at initialization.</p></span></div></div><p class="paragraph" style="text-align:left;">By simply removing these embeddings after the initial training (a method they call DroPE), the model acts more flexibly. Surprisingly, once these rigid guides were removed and the model underwent a very brief recalibration, it could immediately understand and retrieve information from sequences far longer than anything it had seen before. 
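Mechanically, &quot;dropping&quot; the positional embedding is as simple as it sounds: in a rotary-embedding attention head it just means skipping the rotation step, as in the single-head sketch below (an illustration of the mechanism using a common RoPE formulation, not the authors&#39; code; the paper pairs the removal with a brief recalibration phase).</p><pre><code>import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x, positions, base=10000.0):
    # Standard rotary embedding for a (seq, dim) tensor; dim must be even.
    dim = x.size(-1)
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = positions[:, None].float() * inv_freq[None, :]
    cos = torch.cat((angles.cos(), angles.cos()), dim=-1)
    sin = torch.cat((angles.sin(), angles.sin()), dim=-1)
    return x * cos + rotate_half(x) * sin

def attention_scores(q, k, positions, use_rope=True):
    # "Dropping" the embedding is literally setting use_rope=False:
    # the rest of the attention computation is unchanged.
    if use_rope:
        q, k = apply_rope(q, positions), apply_rope(k, positions)
    return torch.softmax(q @ k.T / q.size(-1) ** 0.5, dim=-1)

# Example: scores for a toy 6-token, 16-dim head, with and without the rotation.
q, k = torch.randn(6, 16), torch.randn(6, 16)
pos = torch.arange(6)
with_rope = attention_scores(q, k, pos, use_rope=True)
without_rope = attention_scores(q, k, pos, use_rope=False)
</code></pre><p class="paragraph" style="text-align:left;">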
This method effectively <b>unlocks long-context capabilities</b> without compromising the model’s performance on standard tasks.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/8b79e724-30fd-419a-a578-4a29f57b4083/CleanShot_2026-01-20_at_22.28.54_2x.png?t=1768928349"/><div class="image__source"><span class="image__source_text"><p>Length generalization results on larger models.</p></span></div></div><p class="paragraph" style="text-align:left;">By proving that complex fine-tuning isn&#39;t always necessary to upgrade a model&#39;s memory, this approach opens the door to systems that can digest vast amounts of information (from entire codebases to legal archives) at a fraction of the current cost.</p><div class="embed"><a class="embed__url" href="https://github.com/SakanaAI/DroPE?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=yet-another-deepseek-architectural-research-engram" target="_blank"><div class="embed__content"><p class="embed__title"> GitHub - SakanaAI/DroPE: Extending the Context of Pretrained LLMs by Dropping Their Positional Embedding </p><p class="embed__description"> Extending the Context of Pretrained LLMs by Dropping Their Positional Embedding - SakanaAI/DroPE </p><p class="embed__link"> github.com/SakanaAI/DroPE </p></div><img class="embed__image embed__image--right" src="https://opengraph.githubassets.com/426c0b6aa44676a233aa7426ea4713c1085f496f49b82b9924f2310a1dab45bb/SakanaAI/DroPE"/></a></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.arxiv.org/abs/2512.12167?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=yet-another-deepseek-architectural-research-engram"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="stem-scaling-transformers-with-embe"><b>STEM: Scaling Transformers with Embedding Modules</b></h1><p class="paragraph" style="text-align:left;">Sadhukhan<i> et al. [</i>Carnegie Mellon University, Meta AI<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 257 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;">Transformers</span></span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></span></p><p class="paragraph" style="text-align:left;">When we need a smarter LLM, we often try to get a bigger model, which makes them slow and expensive to run. To solve this problem, researchers are trying to build sparse models by teaching the model to activate only a small fraction of its &quot;brain&quot; for any given task.</p><p class="paragraph" style="text-align:left;">However, current methods for doing this are notoriously fickle. 
They often suffer from training instability and create complex communication bottlenecks between computer chips, all while making the model’s decision-making process harder for humans to interpret. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/9fbfa37a-0628-425d-b7d6-4d3db3427f10/CleanShot_2026-01-20_at_22.32.34_2x.png?t=1768928566"/><div class="image__source"><span class="image__source_text"><p>Schematics of (a) SwiGLU FFN, (b) MoE FFN, and (c) STEM with a single prefetched token embedding.</p></span></div></div><p class="paragraph" style="text-align:left;">The team developed a new architecture called STEM (Scaling Transformers with Embedding Modules) that simplifies how models process information. Instead of forcing every piece of data through the same complex mathematical transformations, the system uses a static, specialized approach. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/c159f06c-c3b6-44e6-bff1-9ad486f75d0e/CleanShot_2026-01-20_at_22.33.01_2x.png?t=1768928592"/><div class="image__source"><span class="image__source_text"><p>Knowledge injection/edit demonstration.</p></span></div></div><p class="paragraph" style="text-align:left;">By replacing a computationally heavy slice of the network with a simple, dedicated lookup table for each specific word, they successfully decoupled the model&#39;s storage capacity from its processing cost. This approach proved remarkably stable, effectively removing about one-third of the usual parameters required for these calculations while actually improving accuracy on knowledge-heavy tasks.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/96758f67-3b94-4501-835b-1d3649afc5b5/CleanShot_2026-01-20_at_22.33.29_2x.png?t=1768928620"/><div class="image__source"><span class="image__source_text"><p>STEM-based knowledge editing schemes for length-mismatched source (ns) and target (nt) entity tokenizations.</p></span></div></div><p class="paragraph" style="text-align:left;">Because this information is organized by specific tokens rather than hidden inside a &quot;black box&quot; of varying experts, the model became much more interpretable. The researchers found they could even &quot;edit&quot; the model&#39;s knowledge in a transparent way without retraining the entire system.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.arxiv.org/abs/2601.10639?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=yet-another-deepseek-architectural-research-engram"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="conditional-memory-via-scalable-loo"><b>Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models</b></h1><p class="paragraph" style="text-align:left;">Cheng<i> et al. 
[</i>Peking University, DeepSeek-AI<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 2.8k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM Memory </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> bycloud’s pick </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">LLMs are getting better at complex reasoning, but they don’t have a built-in mechanism to simply &quot;remember&quot; fixed information. To recall a common phrase, a specific name, or a formulaic pattern, a standard Transformer model must spend valuable processing power computing the answer from scratch, effectively &quot;thinking&quot; its way to a memory every single time.</p><p class="paragraph" style="text-align:left;">What if we gave the model a dedicated, instantly accessible memory bank for static facts, allowing its neural circuitry to focus entirely on difficult problem-solving? This approach acknowledges that language consists of two distinct parts (creative composition and rote knowledge) and suggests that our AI architectures should reflect that duality.</p><p class="paragraph" style="text-align:left;">The team introduced &quot;Engram,&quot; a sophisticated memory module that modernizes classic text-analysis techniques to function as a high-speed lookup table for the model. By analyzing how to balance this new memory system with traditional computation, they discovered a distinct &quot;U-shaped&quot; scaling law, pinpointing the optimal ratio between processing power and static memory.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/8a3cce19-ba80-4c7e-92e4-0990e7ca3f41/arch.png?t=1768928713"/></div><p class="paragraph" style="text-align:left;">When they built a large-scale model using these principles, the results were counterintuitive. While the system predictably improved at factual retrieval, it saw even larger gains in complex reasoning, mathematics, and coding. 
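</p><p class="paragraph" style="text-align:left;">As a rough mental model (our toy sketch, not DeepSeek&#39;s actual module), you can picture an Engram-style memory as a large table of learned vectors addressed by hashing the last few token ids, whose output is simply added to the hidden state instead of being recomputed by attention and FFN layers.</p><pre><code>
import numpy as np

class ToyEngramMemory:
    """Toy n-gram lookup memory: hash the most recent n token ids into a slot of a
    learned table and fetch that vector directly (no matrix multiplies needed)."""
    def __init__(self, table_size=2**16, dim=64, ngram=2, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.standard_normal((table_size, dim)) * 0.02  # "learned" vectors
        self.table_size = table_size
        self.ngram = ngram

    def lookup(self, token_ids):
        out = np.zeros((len(token_ids), self.table.shape[1]))
        for t in range(len(token_ids)):
            ctx = tuple(token_ids[max(0, t - self.ngram + 1): t + 1])  # last n tokens
            out[t] = self.table[hash(ctx) % self.table_size]           # static lookup
        return out

mem = ToyEngramMemory()
tokens = [17, 42, 42, 99, 7]
hidden = np.random.default_rng(1).standard_normal((len(tokens), 64))
hidden = hidden + mem.lookup(tokens)   # rote pattern recall is offloaded to the table
print(hidden.shape)
</code></pre><p class="paragraph" style="text-align:left;">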
The researchers found that by offloading the &quot;easy&quot; work of recognizing local patterns to the Engram module, the model’s deeper layers were relieved of the burden of reconstruction.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/2fe02c42-d953-4bcb-a87e-e8a2efeaa321/case.png?t=1768928744"/></div><p class="paragraph" style="text-align:left;">This effectively freed up the network&#39;s attention mechanisms to process global context and tackle much harder cognitive tasks, leading to superior performance without increasing the computational budget.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/fdbeb154-f9b0-40f5-a4ff-0346598ab0fa/27b_exp_results.png?t=1768928692"/></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.arxiv.org/abs/2601.07372?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=yet-another-deepseek-architectural-research-engram"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="dr-zero-self-evolving-search-agents"><b>Dr. Zero: Self-Evolving Search Agents without Training Data</b></h1><p class="paragraph" style="text-align:left;">Yue<i> et al. [</i>Meta Superintelligence Labs, University of Illinois Urbana-Champaign<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 426 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM RL </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Building bigger LLMs requires massive amounts of high-quality data curated by humans. But we are slowly running out of high-quality human data, particularly for complex tasks like researching the web to answer open-ended questions.</p><p class="paragraph" style="text-align:left;">While AI has successfully taught itself to solve math problems or play chess (domains with clear rules), teaching a machine to navigate the messy, unstructured internet without human supervision has proven far more difficult. 
</p><p class="paragraph" style="text-align:left;">Researchers at the University of Illinois Urbana-Champaign and Meta set out to solve this puzzle, asking if an AI could essentially pull itself up by its bootstraps to become a better researcher without seeing a single human example.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/cdf3cca8-b3aa-4ed5-b5c4-1dfd4a8497b0/CleanShot_2026-01-20_at_22.38.20_2x.png?t=1768928911"/><div class="image__source"><span class="image__source_text"><p>The self-evolving LLM training framework (Huang et al., 2025a) that iteratively trains a proposer and a solver with minimal supervision.</p></span></div></div><p class="paragraph" style="text-align:left;">The team developed a framework called Dr. Zero, which creates a symbiotic relationship between two versions of the same artificial intelligence. They split the system into two distinct roles: a &quot;Proposer&quot; and a &quot;Solver.&quot; The Proposer is tasked with inventing questions based on documents it finds, while the Solver uses a search engine to answer them. </p><p class="paragraph" style="text-align:left;">The brilliance lies in how they learn from each other. The system uses a specialized feedback loop that rewards the Proposer only when it creates questions that are challenging but ultimately solvable. This creates an automated curriculum that gets progressively harder as the Solver gets smarter.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/c3042d83-a0ce-471b-b7fc-3efd6d6e26e4/CleanShot_2026-01-20_at_22.38.47_2x.png?t=1768928940"/><div class="image__source"><span class="image__source_text"><p>The Dr. Zero self-evolution feedback loop.</p></span></div></div><p class="paragraph" style="text-align:left;">To make this computationally feasible, the team introduced a technique called Hop-Grouped Relative Policy Optimization (HRPO). This method cleverly groups similar questions to estimate their difficulty efficiently, allowing the model to learn complex, multi-step reasoning without requiring the massive computing power usually needed to test every possible outcome repeatedly.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/24f16f4b-db66-4f91-bc1c-12808743b940/CleanShot_2026-01-20_at_22.40.32_2x.png?t=1768929046"/><div class="image__source"><span class="image__source_text"><p>System prompt for the proposer in Dr. Zero</p></span></div></div><p class="paragraph" style="text-align:left;">The study revealed that this self-taught system often outperformed AI agents that were explicitly trained with human supervision. 
This suggests that complex reasoning and search capabilities can emerge naturally through self-evolution, without needing the expensive human datasets.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.arxiv.org/abs/2601.07055?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=yet-another-deepseek-architectural-research-engram"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="true" class="youtube_embed" frameborder="0" height="100%" src="https://youtube.com/embed/82DyXL0ZXI8" width="100%"></iframe></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=1eebf271-8639-4d04-8749-ff135ab3b924&utm_medium=post_rss&utm_source=the_ai_timeline">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Wait, Wait, Wait... Why Do Reasoning Models Loop?</title>
  <description>and more on Dead Salmons of AI Interp, GDPO, From Entropy to Epiplexity</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b36b85e6-43b9-4d36-bab5-99454d16eb5d/issue_90.jpg" length="171322" type="image/jpeg"/>
  <link>https://mail.bycloud.ai/p/wait-wait-wait-why-do-reasoning-models-loop</link>
  <guid isPermaLink="true">https://mail.bycloud.ai/p/wait-wait-wait-why-do-reasoning-models-loop</guid>
  <pubDate>Tue, 13 Jan 2026 21:00:14 +0000</pubDate>
  <atom:published>2026-01-13T21:00:14Z</atom:published>
    <dc:creator>by cloud</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h6 class="heading" style="text-align:left;" id="nov-18-th-nov-24-th-33-latest-ai-re"><i>Jan 7th ~ Jan 13th</i><br><i>#90 Latest AI Research Explained Simply</i></h6><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="industry-news-in-1-line">🗞️ Industry News in 1 Line</h2><ol start="1"><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 6.7k</span></span> <a class="link" href="https://x.com/midjourney/status/2009748519133827304?s=20&utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=wait-wait-wait-why-do-reasoning-models-loop" target="_blank" rel="noopener noreferrer nofollow">Midjourney has launched Niji V7</a>, and it delivers significantly more realistic anime aesthetics alongside improved text rendering and coherence. Many users are already praising the stunning quality of the new model as it is nearly indistinguishable from Anime created by traditional artists.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/dc567572-0500-4db8-9aef-fe8882105db6/CleanShot_2026-01-13_at_22.00.37_2x.png?t=1768321854"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 596</span></span> <a class="link" href="https://x.com/MiniMax_AI/status/2009491818690547938?s=20&utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=wait-wait-wait-why-do-reasoning-models-loop" target="_blank" rel="noopener noreferrer nofollow">MiniMax has made a debut on the Hong Kong Stock Exchange</a> and it achieved a staggering $13.7 billion valuation on its first day. MiniMax is one of the few open-source models dominating global benchmarks; it has an Anthropic compatible API that delivers top-tier performance, but you can also <a class="link" href="https://agent.minimax.io/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=wait-wait-wait-why-do-reasoning-models-loop" target="_blank" rel="noopener noreferrer nofollow">access the MiniMax agent on the web</a>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/2363bd23-3016-48e9-8076-771469ae843d/image.png?t=1768322411"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 486</span></span> <a class="link" href="https://axiommath.ai/territory/from-seeing-why-to-checking-everything?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=wait-wait-wait-why-do-reasoning-models-loop" target="_blank" rel="noopener noreferrer nofollow">Axiom has released an analysis of its AxiomProver model</a>, which reveals that the AI successfully solved complex problems (like A6) that stumped human mathematicians, even while struggling with calculus concepts humans consider &quot;obvious.&quot; The breakdown highlights a fascinating divergence in logic, where the model often ignores human elegance in favor of brute-force analysis or unexpected geometric strategies to construct valid Lean proofs. 
<a class="link" href="https://axiommath.github.io/Putnam2025/B6/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=wait-wait-wait-why-do-reasoning-models-loop#spacing=850&external=1&theme=light" target="_blank" rel="noopener noreferrer nofollow">View a </a><a class="link" href="https://axiommath.github.io/Putnam2025/B6/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=wait-wait-wait-why-do-reasoning-models-loop#spacing=850&external=1&theme=light" target="_blank" rel="noopener noreferrer nofollow">lean dependcy </a><a class="link" href="https://axiommath.github.io/Putnam2025/B6/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=wait-wait-wait-why-do-reasoning-models-loop#spacing=850&external=1&theme=light" target="_blank" rel="noopener noreferrer nofollow">graph of how AI solved Putnam problems</a>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/ce92b8c5-9392-460d-8e19-13d0f435ab36/image.png?t=1768322622"/></div><p class="paragraph" style="text-align:left;"></p></li></ol><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><div class="section" style="background-color:transparent;border-color:#2C81E5;border-style:solid;border-width:5px;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">Support My Newsletter</h2><p class="paragraph" style="text-align:left;"><span style="color:rgb(34, 34, 34);font-family:Georgia, "Times New Roman", serif;font-size:16px;">As I aim to keep this newsletter free forever, your support means a lot. If you like reading The AI Timeline, consider forwarding it to another research enthusiast. It helps us keep this up for free!</span></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://patreon.com/bycloud/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=wait-wait-wait-why-do-reasoning-models-loop"><span class="button__text" style=""> Check Out My Patreon </span></a></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://theaitimeline.carrd.co/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=wait-wait-wait-why-do-reasoning-models-loop" target="_blank" rel="noopener noreferrer nofollow">Advertise with The AI Timeline! </a></p></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="the-dead-salmons-of-ai-interpretabi">The Dead Salmons of AI Interpretability</h2><p class="paragraph" style="text-align:left;"><i>Meloux et al. 
[</i>Universite Grenoble Alpes, Icahn School of Medicine at Mount Sinai<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 4.9k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;">Interpretability</span></span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Many years ago, neuroscientists detected brain activity in a dead salmon, which was a mistake caused by statistical errors. Similarly, today, the researchers are arguing that the field of AI interpretability (which tries to explain how complex models think) is facing its own &quot;dead salmon&quot; moment.</p><p class="paragraph" style="text-align:left;">As artificial intelligence becomes a part of our lives, we need to know if the methods we use to understand these systems are actually working. The authors of this study tried to determine whether current techniques are finding true insights into machine logic or simply hallucinating patterns in the noise.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/167cf00c-7b9c-4fae-b8ac-a0881dec2496/CleanShot_2026-01-13_at_18.07.51_2x.png?t=1768307882"/><div class="image__source"><span class="image__source_text"><p>The tuple of a target behavior PU and the computational system with its internal components form an SCM.</p></span></div></div><p class="paragraph" style="text-align:left;">The team discovered that many popular methods used to interpret AI behave surprisingly well on neural networks that are completely random and untrained. Much like the salmon experiment, these tools generated plausible-sounding explanations for mathematical gibberish, which reveals that the field suffers from significant statistical fragility. </p><p class="paragraph" style="text-align:left;">The researchers explained that this happens because current queries are often &quot;non-identifiable,&quot; meaning convincing answers can be found even where no real logic exists. 
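</p><p class="paragraph" style="text-align:left;">That failure mode is easy to reproduce in miniature. The sketch below (ours, not from the paper) fits a linear probe on the activations of a completely untrained random network and still &quot;finds&quot; a concept, simply because the concept is already decodable from the input rather than computed by the network.</p><pre><code>
import numpy as np

rng = np.random.default_rng(0)

# An "untrained network": a fixed random projection followed by a nonlinearity.
W = rng.standard_normal((64, 256)) * 0.05
def untrained_net(x):
    return np.tanh(x @ W)

# A concept defined purely by the input: the sign of the first input feature.
x = rng.standard_normal((2000, 64))
concept = (x[:, 0] > 0).astype(float)

# Fit a linear probe on the untrained network's activations.
feats = np.hstack([untrained_net(x), np.ones((len(x), 1))])
w, *_ = np.linalg.lstsq(feats, concept, rcond=None)
accuracy = ((feats @ w > 0.5) == (concept > 0.5)).mean()
print(f"probe accuracy on an untrained network: {accuracy:.2f}")  # far above chance
</code></pre><p class="paragraph" style="text-align:left;">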
They propose that instead of just accepting an explanation at face value, we must treat it as a statistical estimate.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/a57c09c0-af7e-4292-88a2-f6e6a5af9fc9/CleanShot_2026-01-13_at_18.08.32_2x.png?t=1768307922"/><div class="image__source"><span class="image__source_text"><p>An interpretability task is defined by three elements: E, the hypothesis space; µ, the distribution over causal queries about the SCM (model and behavior); and D, the error measure.</p></span></div></div><p class="paragraph" style="text-align:left;">This requires a new framework where findings are rigorously tested against random baselines to prove they describe a real computational mechanism rather than a statistical fluke.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2512.18792?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=wait-wait-wait-why-do-reasoning-models-loop"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="from-entropy-to-epiplexity-rethinki">From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence</h2><p class="paragraph" style="text-align:left;"><i>Zheng et al. [New York University]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1.9K </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;">Entropy</span></span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;">Computationally Bounded Intelligence</span></span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Traditional theories suggest that processing data deterministically cannot create new information, but despite this, AI systems like AlphaZero develop superhuman strategies without seeing human data, and models trained on synthetic data often outperform their predecessors.</p><p class="paragraph" style="text-align:left;">Researchers realized that treating all observers as having infinite computing power (which is a standard mathematical assumption) misses the point. 
This paper tries to define how a computer with limited processing power actually perceives value in data to distinguish between random noise and the useful, learnable structures.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/becad03a-a056-46eb-9ec1-7684918cc981/CleanShot_2026-01-13_at_18.13.10_2x.png?t=1768308201"/><div class="image__source"><span class="image__source_text"><p>Illustration of random vs structural information.</p></span></div></div><p class="paragraph" style="text-align:left;">The team introduced a concept called &quot;epiplexity,&quot; a <b>measure of the structural information</b> a specific, resource-constrained observer can extract. They found that information isn&#39;t just about what is in the data, but how hard a computer has to work to decode it. When models are forced to solve harder problems (such as deducing the logic of a chess game from the board state rather than just predicting the next move), they acquire higher epiplexity.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/33001b97-d17d-4ed6-a228-b2f1c13c4d82/CleanShot_2026-01-13_at_18.14.08_2x.png?t=1768308264"/><div class="image__source"><span class="image__source_text"><p>Experiments on Factorization </p></span></div></div><p class="paragraph" style="text-align:left;">This struggle forces the AI to construct richer internal programs and sophisticated mental models. The research also explains why language data, which is dense with logical rules, often builds more versatile intelligence than image data, which contains high randomness but less complex, learnable structure.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/ec7f3d95-3c01-46a3-8793-500ca5935c08/CleanShot_2026-01-13_at_18.13.48_2x.png?t=1768308237"/><div class="image__source"><span class="image__source_text"><p>Information created with cellular automata.</p></span></div></div><p class="paragraph" style="text-align:left;">This framework suggests that by measuring epiplexity, engineers could move beyond trial-and-error to scientifically select training data that maximizes learning.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2601.03220?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=wait-wait-wait-why-do-reasoning-models-loop"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="gdpo-group-reward-decoupled-normali">GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization</h2><p class="paragraph" style="text-align:left;"><i>Liu et al. 
[NVIDIA]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 680 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> GDPO </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> RL </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">As artificial intelligence evolves, users are no longer satisfied with models that simply output correct facts. We now expect them to juggle complex behaviors simultaneously (writing clean code, adhering to strict formatting, and remaining concise), all while being accurate.</p><p class="paragraph" style="text-align:left;">However, teaching models to balance these competing actions has proven surprisingly difficult. The standard training technique, known as GRPO, tends to blur these distinct goals together. When researchers looked closely, they realized this approach causes a &quot;signal collapse,&quot; where the mathematical feedback for different levels of success looks identical to the model.</p><p class="paragraph" style="text-align:left;">This means that the AI becomes unable to distinguish between a partially correct attempt and a truly successful one, which often causes it to ignore harder tasks in favor of easier ones.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/1d7af77b-b17f-4add-86a7-b2fb1782b51e/tool_gdpo.png?t=1768308482"/></div><p class="paragraph" style="text-align:left;">To solve this, researchers at NVIDIA and HKUST introduced a new method called Group reward-Decoupled Normalization Policy Optimization (GDPO). Instead of mixing every piece of feedback into a single, muddy signal, this approach processes each goal independently before combining them.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/87e8ba7b-7aa9-457a-9133-7ad306fef5f5/gdpo_toy.png?t=1768308499"/></div><p class="paragraph" style="text-align:left;">By normalizing rewards separately, the method preserves the fine-grained resolution of the training signal, ensuring the model understands that hitting two targets is numerically better than hitting just one. In tests spanning mathematical reasoning, coding, and tool usage, this clearer feedback allowed models to significantly outperform previous standards. 
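</p><p class="paragraph" style="text-align:left;">As a toy contrast (our sketch, assuming binary rewards and equal weights; the paper&#39;s exact formulation may differ), the difference comes down to the order of operations: GRPO normalizes the summed reward across the group, while GDPO normalizes each reward across the group before combining them.</p><pre><code>
import numpy as np

def grpo_advantages(rewards):
    """GRPO-style: add the reward components per rollout, then normalize the totals."""
    total = rewards.sum(axis=1)
    return (total - total.mean()) / (total.std() + 1e-6)

def gdpo_advantages(rewards):
    """GDPO-style, per the description above: normalize each reward component
    across the group first, then combine."""
    z = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-6)
    return z.sum(axis=1)

# Four rollouts scored on two goals: [correct answer, stays within the length limit].
# The length goal is "easy" and met by almost every rollout.
group = np.array([
    [1.0, 1.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.0, 0.0],
])
print("GRPO advantages:", np.round(grpo_advantages(group), 2))
print("GDPO advantages:", np.round(gdpo_advantages(group), 2))
</code></pre><p class="paragraph" style="text-align:left;">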
The AI could finally balance strict constraints, such as keeping answers short, without sacrificing the accuracy of its reasoning.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/a86e7d69-3bfc-4da4-b57b-3a7c207b1cd1/GDPO_FORMULA.png?t=1768308511"/></div><p class="paragraph" style="text-align:left;">This allows researchers to balance incentives used during training and fine-tune models to respect complex, multi-layered preferences without destabilizing the learning process.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2601.05242?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=wait-wait-wait-why-do-reasoning-models-loop"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="not-all-bits-are-equal-scale-depend">How to Set the Learning Rate for Large-Scale Pre-training?</h2><p class="paragraph" style="text-align:left;">Zhou<i> et al. [</i>Shanghai AI Laboratory, Shanghai JiaoTong University, Fudan University<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 450 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM Learning Rate </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Training the massive AI systems of tomorrow is incredibly expensive, consuming vast amounts of time and computational power. One of the biggest headaches engineers face is setting the &quot;learning rate&quot;—essentially the speed at which the AI absorbs new information. If the rate is set too slow, training drags on inefficiently; if it is set too fast, the model gets confused and fails to learn. Until now, finding that &quot;Goldilocks&quot; speed for giant models has felt like a high-stakes guessing game, because running trial-and-error tests on such a massive scale costs a fortune. Researchers set out to solve this by determining if we can run small, cost-effective experiments to accurately predict the perfect settings for the big leagues.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/2ae3f6fd-5c44-460d-9e47-9f9e48d7ddc6/CleanShot_2026-01-13_at_21.34.00_2x.png?t=1768320253"/><div class="image__source"><span class="image__source_text"><p>Visualization of the optimal learning rate relative to model size N and data size D</p></span></div></div><p class="paragraph" style="text-align:left;">The team compared two major strategies: trying to mechanically transfer settings from small models to big ones, versus using mathematical &quot;scaling laws&quot; to predict the best numbers based on trends. The results were illuminating. 
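</p><p class="paragraph" style="text-align:left;">To picture the scaling-law strategy, here is a hypothetical sketch of fitting a power law lr(N, D) = c * N^a * D^b to a handful of small sweeps and extrapolating to a large run; the sweep numbers and the exact functional form are made up for illustration.</p><pre><code>
import numpy as np

# Hypothetical small-scale sweeps: (params N, tokens D, best LR found by search).
runs = np.array([
    # N,     D,      lr_opt
    [1e8,   2e9,   3.0e-3],
    [3e8,   6e9,   2.0e-3],
    [1e9,   2e10,  1.2e-3],
    [3e9,   6e10,  8.0e-4],
])

# Fit log lr = log c + a*log N + b*log D  (a power law in N and D).
X = np.column_stack([np.ones(len(runs)), np.log(runs[:, 0]), np.log(runs[:, 1])])
y = np.log(runs[:, 2])
(logc, a, b), *_ = np.linalg.lstsq(X, y, rcond=None)

def predict_lr(N, D):
    return float(np.exp(logc) * N**a * D**b)

# Extrapolate to a large target run (e.g. 70B params, 2T tokens).
print(f"predicted LR: {predict_lr(7e10, 2e12):.2e}")
</code></pre><p class="paragraph" style="text-align:left;">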
The researchers found that by analyzing the relationship between the model&#39;s size and the amount of data it consumes, they could derive a precise formula to predict the optimal learning speed. This &quot;fitting&quot; approach significantly outperformed older methods. Surprisingly, they also discovered that simpler is often better. While some theories suggest tweaking the learning speed for different parts of the AI’s &quot;brain&quot; separately, this study showed that a single, globally optimized speed works just as well. Modern AI architectures proved robust enough to learn effectively without needing complex, piece-by-piece micromanagement.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/cb98c1aa-a0bc-4b64-aec3-5971e081b4d6/CleanShot_2026-01-13_at_21.34.50_2x.png?t=1768320305"/><div class="image__source"><span class="image__source_text"><p>Performance comparison between the global optimal LR (red line) and module-wise optimal LR (blue line) on a 4B model trained for 120B tokens.</p></span></div></div><p class="paragraph" style="text-align:left;"></p><p class="paragraph" style="text-align:left;">This discovery is a breath of fresh air for the future of AI development. It offers a reliable map for engineers, allowing them to skip the expensive, wasteful trial-and-error phase and move straight to efficient training. By understanding exactly how data volume and model size influence the learning process, developers can confidently scale up their systems to unprecedented sizes. It suggests a future where building more capable AI isn&#39;t just about having the biggest budget, but about leveraging the fundamental laws that govern how machines learn to build smarter and more sustainable systems.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2601.05049?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=wait-wait-wait-why-do-reasoning-models-loop"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="wait-wait-wait-why-do-reasoning-mod">Wait, Wait, Wait... Why Do Reasoning Models Loop?</h2><p class="paragraph" style="text-align:left;">Pipis<i> et al. [</i>MIT, Microsoft Research, University of Wisconsin-Madison<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 731 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Reasoning Models </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">AI is evolving from simple chatbots to complex reasoning engines and a peculiar behavior has emerged: when faced with difficult math or logic puzzles, models sometimes get stuck in endless repetitive loops.</p><p class="paragraph" style="text-align:left;">Researchers recently launched a deep dive to understand why these models get stuck in a rut. 
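</p><p class="paragraph" style="text-align:left;">The role of temperature here is easy to picture with a toy next-token distribution (our illustration, not the paper&#39;s setup): greedy decoding keeps choosing the same most-likely token forever, while sampling with some temperature eventually escapes the loop.</p><pre><code>
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "answer", "is", "wait", "so", "done"]
# Toy next-token distributions: after "wait", the most likely token is "wait" again.
AFTER_WAIT = np.array([0.05, 0.10, 0.05, 0.60, 0.15, 0.05])
OTHERWISE  = np.array([0.20, 0.15, 0.20, 0.10, 0.20, 0.15])

def next_token(prev, temperature):
    probs = AFTER_WAIT if prev == "wait" else OTHERWISE
    if temperature == 0.0:                    # greedy decoding
        return vocab[int(np.argmax(probs))]
    logits = np.log(probs) / temperature      # temperature sampling
    p = np.exp(logits - logits.max())
    p = p / p.sum()
    return vocab[int(rng.choice(len(vocab), p=p))]

for T in (0.0, 1.0):
    tok, out = "wait", []
    for _ in range(8):
        tok = next_token(tok, T)
        out.append(tok)
    print(f"T={T}: " + " ".join(out))   # T=0 loops on "wait", T=1 breaks out
</code></pre><p class="paragraph" style="text-align:left;">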
This paper tries to answer whether adding randomness (raising the &quot;temperature&quot;) is a genuine solution or just a temporary band-aid. By investigating how smaller models learn from larger, smarter ones, the team sought to uncover the root causes of this stalling behavior to build more reliable thinkers.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/926e26c6-5ba5-4c61-8b89-ec1b84c0df87/CleanShot_2026-01-13_at_21.35.43_2x.png?t=1768320354"/><div class="image__source"><span class="image__source_text"><p>Looping with greedy decoding.</p></span></div></div><p class="paragraph" style="text-align:left;">The investigation revealed that looping is often a symptom of &quot;risk aversion&quot; born from imperfect learning. When a smaller &quot;student&quot; model tries to mimic a &quot;teacher,&quot; it often fails to grasp the difficult, precise steps required to make progress. Instead, it retreats to safe, easy-to-learn actions, such as restating the problem or repeating a previous thought.</p><p class="paragraph" style="text-align:left;">The researchers discovered that while <b>dialing up the randomness does help break these loops</b>, it does not actually fix the model&#39;s underlying confusion. Instead of learning the correct path, the model simply explores more chaotically until it potentially stumbles across the solution, resulting in reasoning chains that are much longer than necessary.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/bf0583ec-6b31-41c5-aa66-2d11a7f7e9c8/CleanShot_2026-01-13_at_21.36.18_2x.png?t=1768320388"/><div class="image__source"><span class="image__source_text"><p>Temporally correlated errors induce low-temperature loops.</p></span></div></div><p class="paragraph" style="text-align:left;">Now that we understand looping is caused by specific learning errors rather than just bad settings, engineers can design better training benchmarks. Additionally, instead of relying on randomness to shake models out of a loop, researchers can target these &quot;hard-to-learn&quot; steps.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2512.12895?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=wait-wait-wait-why-do-reasoning-models-loop"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="true" class="youtube_embed" frameborder="0" height="100%" src="https://youtube.com/embed/TVlFf_Po1bs" width="100%"></iframe></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=9907aafc-cec8-4a46-bbe1-b10aed08c148&utm_medium=post_rss&utm_source=the_ai_timeline">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>DeepSeek Just Added Parameters Where There Were None...</title>
  <description>And more about Recursive Language Models, LongCat ZigZag Attention, and LoRA RL</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/4c5f161f-c31f-4593-8f52-02a22b9a1224/issue_89.jpg" length="183756" type="image/jpeg"/>
  <link>https://mail.bycloud.ai/p/deepseek-just-added-parameters-where-there-were-none</link>
  <guid isPermaLink="true">https://mail.bycloud.ai/p/deepseek-just-added-parameters-where-there-were-none</guid>
  <pubDate>Tue, 06 Jan 2026 18:19:11 +0000</pubDate>
  <atom:published>2026-01-06T18:19:11Z</atom:published>
    <dc:creator>by cloud</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h6 class="heading" style="text-align:left;" id="nov-18-th-nov-24-th-33-latest-ai-re"><i>Dec 31st ~ Jan 6th</i><br><i>#89 Latest AI Research Explained Simply</i></h6><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="industry-news-in-1-line">🗞️ Industry News in 1 Line</h2><ol start="1"><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 7.2k</span></span> NVIDIA has announced <a class="link" href="https://nvidianews.nvidia.com/news/alpamayo-autonomous-vehicle-development?=mail.bycloud.ai&utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=deepseek-just-added-parameters-where-there-were-none" target="_blank" rel="noopener noreferrer nofollow">Alpamayo</a>, a new “thinking, reasoning” autonomous-vehicle AI. The first rollout is slated to <b>reach U.S. roads</b> in Q1 2026, starting with the all-new Mercedes-Benz CLA. NVIDIA’s first model, <b>Alpamayo 1 (10B parameters)</b>, uses video to generate driving trajectories and reasoning traces, now available <a class="link" href="https://huggingface.co/nvidia/Alpamayo-R1-10B?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=deepseek-just-added-parameters-where-there-were-none" target="_blank" rel="noopener noreferrer nofollow">on HuggingFace</a>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/f68d59e0-555a-4d9f-a81c-4f9f23d0a1a7/nvidia-alpamayo.jpg?t=1767717691"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 17k</span></span> Boston Dynamics has released a new video of its upgraded next-gen humanoid robot, <a class="link" href="https://bostondynamics.com/products/atlas/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=deepseek-just-added-parameters-where-there-were-none" target="_blank" rel="noopener noreferrer nofollow">Atlas</a>, now fully electric with a <b>4-hour swappable battery</b> for continuous operation. Atlas stands 6&#39;2&quot;, weighs 198 lbs, has 56 degrees of freedom, can lift 110 lbs (66 lbs sustained), and reach 7.5 ft, using tactile, reconfigurable hands to adapt grip in real time. 
</p><div class="image"><a class="image__link" href="https://bostondynamics.com/products/atlas/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=deepseek-just-added-parameters-where-there-were-none" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/6d548227-d06a-4a5d-882a-a39fe42a8dc2/image.png?t=1767717649"/></a></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 1k</span></span> Liquid AI has released <a class="link" href="https://www.liquid.ai/blog/introducing-lfm2-5-the-next-generation-of-on-device-ai?=mail.bycloud.ai&utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=deepseek-just-added-parameters-where-there-were-none" target="_blank" rel="noopener noreferrer nofollow">LFM2.5</a>, its most capable family of <b>tiny on-device foundation models</b> (~1B class). Built on the LFM2 hybrid, device-optimized architecture, LFM2.5 scales pretraining from 10T → 28T tokens and expands RL post-training to improve instruction following. The initial open-weight lineup (including 1.2B Base/Instruct, plus vision-language and native audio-language variants) is available on <a class="link" href="https://huggingface.co/collections/LiquidAI/lfm25?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=deepseek-just-added-parameters-where-there-were-none" target="_blank" rel="noopener noreferrer nofollow">HuggingFace</a>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/7fa4ed84-8984-4c6c-b34f-1d1eba1470b1/image.png?t=1767717792"/></div></li></ol><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><div class="section" style="background-color:transparent;border-color:#2C81E5;border-style:solid;border-width:5px;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">Support My Newsletter</h2><p class="paragraph" style="text-align:left;"><span style="color:rgb(34, 34, 34);font-family:Georgia, "Times New Roman", serif;font-size:16px;">As I aim to keep this newsletter free forever, your support means a lot. If you like reading The AI Timeline, consider forwarding it to another research enthusiast. It helps us keep this up for free!</span></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://patreon.com/bycloud/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=deepseek-just-added-parameters-where-there-were-none"><span class="button__text" style=""> Check Out My Patreon </span></a></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://theaitimeline.carrd.co/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=deepseek-just-added-parameters-where-there-were-none" target="_blank" rel="noopener noreferrer nofollow">Advertise with The AI Timeline! 
</a></p></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="recursive-language-models">Recursive Language Models</h2><p class="paragraph" style="text-align:left;"><i>Zhang et al. [MIT CSAIL]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 2k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LM Context </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">LLMs are making significant progress and can handle several complex tasks, but they struggle when asked to process massive amounts of information in a single context window, like trying to read a library of books simultaneously without losing the plot. Even the most advanced models suffer from a phenomenon known as &quot;context rot,&quot; where their reasoning ability degrades as the context window fills up.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e76afe45-4fcc-4250-93ce-2b85d5677ef1/CleanShot_2026-01-06_at_18.56.39_2x.png?t=1767706010"/><div class="image__source"><span class="image__source_text"><p>A Recursive Language Model (RLM) treats prompts as part of the environment.</p></span></div></div><p class="paragraph" style="text-align:left;">This paper tries to determine if it is possible to enable AI to tackle long-horizon tasks involving millions of words without needing a bigger brain, but rather a smarter way to manage information.</p><p class="paragraph" style="text-align:left;">The team introduced a concept called Recursive Language Models (RLMs). Instead of forcing a neural network to ingest a massive document all at once, this approach treats the text as an external part of the environment, much like a reference book sitting on a desk rather than a memory in one&#39;s head.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/c35bac89-c7b3-455b-b7da-5e14b93437f9/CleanShot_2026-01-06_at_18.57.14_2x.png?t=1767706045"/><div class="image__source"><span class="image__source_text"><p>Performance comparison of different methods across long-context benchmarks of varying complexity.</p></span></div></div><p class="paragraph" style="text-align:left;">The AI effectively acts as a programmer, writing code to peek into specific parts of the text, break complex problems into smaller chunks, and recursively call upon copies of itself to analyze those snippets. 
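</p><p class="paragraph" style="text-align:left;">A minimal way to picture this behaviour is a recursive map-reduce over the document (our sketch, assuming a generic llm(prompt) call that is stubbed out below; in the paper, the model writes this kind of orchestration code itself).</p><pre><code>
# A toy recursive map-reduce over a long document, using a stubbed llm() call.
def llm(prompt):
    return prompt[:120]               # stub: swap in a real model call

def rlm_answer(question, document, chunk_chars=4000):
    if len(document) > chunk_chars:   # too long for one call: split and recurse
        mid = len(document) // 2
        left = rlm_answer(question, document[:mid], chunk_chars)
        right = rlm_answer(question, document[mid:], chunk_chars)
        return llm(f"Combine these partial answers to '{question}':\n1) {left}\n2) {right}")
    return llm(f"Context:\n{document}\n\nQuestion: {question}")   # fits: answer directly

print(rlm_answer("Who is the narrator?", "a very long novel " * 2000)[:60])
</code></pre><p class="paragraph" style="text-align:left;">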
This strategy allowed models to successfully handle inputs up to two orders of magnitude larger than their designed limits.</p><p class="paragraph" style="text-align:left;">On complex reasoning tasks, this method dramatically <b>outperformed standard models</b> and summarization techniques, while maintaining high accuracy even as the information load grew immense.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.arxiv.org/abs/2512.24601?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=deepseek-just-added-parameters-where-there-were-none"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="m-hc-manifold-constrained-hyper-con"><b>mHC: Manifold-Constrained Hyper-Connections</b></h1><p class="paragraph" style="text-align:left;">Xie<i> et al. [</i>DeepSeek AI<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 3k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Residual Connection </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> bycloud’s pick </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Transformers lean heavily on residual connections because they keep information and gradients flowing cleanly through many layers. Hyper-Connections (HC) try to push this idea further by widening the residual stream into multiple parallel “streams” and learning how to mix them, so the model can exchange information across depth without increasing the core layer FLOPs much.</p><p class="paragraph" style="text-align:left;">The problem is that the more freedom HC gives those cross-stream mixing matrices, the less it behaves like an identity path. When you stack many layers, the product of these unconstrained residual mixing matrices can amplify or shrink signals unpredictably, which shows up as training instability in large runs. In their 27B setup, the paper reports a loss surge for HC around 12k steps and extremely large composite “gain magnitudes” that can peak around 3000, a sign of exploding residual dynamics.</p><p class="paragraph" style="text-align:left;">Their fix is <b>Manifold-Constrained Hyper-Connections (mHC)</b>. Instead of letting the residual mixing matrix be anything, they project it onto the manifold of doubly stochastic matrices, meaning all entries are non-negative and every row and column sums to 1. 
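</p><p class="paragraph" style="text-align:left;">Below is a minimal sketch of that projection, using a few rounds of Sinkhorn-style row and column normalization (illustrative only, not DeepSeek&#39;s fused kernel).</p><pre><code>
import numpy as np

def sinkhorn_project(M, iters=20, eps=1e-9):
    """Push a matrix toward the doubly stochastic manifold by alternately
    normalizing the rows and columns of its positive part."""
    A = np.exp(M)                                  # make all entries positive
    for _ in range(iters):
        A = A / (A.sum(axis=1, keepdims=True) + eps)   # rows sum to 1
        A = A / (A.sum(axis=0, keepdims=True) + eps)   # columns sum to 1
    return A

rng = np.random.default_rng(0)
H = sinkhorn_project(rng.standard_normal((4, 4)))
print(np.round(H.sum(axis=1), 3), np.round(H.sum(axis=0), 3))  # both close to 1
# Products of doubly stochastic matrices stay doubly stochastic, which is why the
# composed residual mapping across many layers stays bounded.
print(np.round((H @ H).sum(axis=1), 3))
</code></pre><p class="paragraph" style="text-align:left;">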
That keeps the residual pathway closer to a stable identity-like behavior while still allowing streams to mix, since each stream becomes a convex combination of the others rather than an arbitrary linear remapping.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/68e2e976-8c5b-41ea-be3a-eb1fdde71892/CleanShot_2026-01-06_at_18.59.11_2x.png?t=1767706172"/><div class="image__source"><span class="image__source_text"><p>Illustrations of Residual Connection Paradigms.</p></span></div></div><p class="paragraph" style="text-align:left;">Practically, they build this projection with the Sinkhorn-Knopp algorithm, running a limited number of iterations (they use 20) to turn an unconstrained matrix into an approximately doubly stochastic one. Because doubly stochastic matrices stay doubly stochastic under multiplication, the stability property should persist even when you multiply many layers’ residual mappings together, which is exactly where HC would tend to drift.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/8dfc6c2b-9827-4b95-9b6f-254704dba951/CleanShot_2026-01-06_at_19.00.20_2x.png?t=1767706233"/><div class="image__source"><span class="image__source_text"><p>Visualizations of Learnable Mappings.</p></span></div></div><p class="paragraph" style="text-align:left;">They also treat systems cost as part of the method. Widening the residual stream increases memory traffic and activation storage, so they add fused kernels, mixed precision kernels, and selective recomputation, and adjust pipeline overlap. With expansion rate n = 4, they report only about a 6.7% training time overhead after these optimizations.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/0b439787-6306-4463-a4ce-910a14d863dc/CleanShot_2026-01-06_at_19.00.59_2x.png?t=1767706269"/><div class="image__source"><span class="image__source_text"><p>Training Stability of Manifold-Constrained Hyper-Connections (mHC).</p></span></div></div><p class="paragraph" style="text-align:left;">On results, mHC appears to keep HC’s accuracy benefits while avoiding its instability. In the 27B run, mHC reaches a final training loss reduction of 0.021 versus the baseline and keeps gradient norms closer to baseline behavior. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/906b37b3-675b-43b5-a749-e82ec63cc1bb/image.png?t=1767719908"/><div class="image__source"><span class="image__source_text"><p>mHC benchmark against baseline</p></span></div></div><p class="paragraph" style="text-align:left;">On downstream benchmarks, mHC beats the baseline across the board and usually edges out HC too, for example improving BBH and DROP relative to HC by about 2.1 and 2.3 points, respectively. 
On the stability metrics, the composite gain that could hit ~3000 in HC stays bounded around ~1.6 in mHC, which matches the paper’s story that constraining the residual topology can make this kind of widened residual stream scale more safely.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2512.24880?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=deepseek-just-added-parameters-where-there-were-none"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="efficient-context-scaling-with-long"><b>Efficient Context Scaling with LongCat ZigZag Attention</b></h1><p class="paragraph" style="text-align:left;">Zhang<i> et al. [</i>Meituan<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 211 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM Attention </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">There is a bottleneck in how AI processes information, as models try to &quot;read&quot; longer documents (like entire books, legal archives, or massive codebases), the computational cost skyrockets because the system traditionally tries to pay equal attention to every single connection between every word.</p><p class="paragraph" style="text-align:left;">The research team sought a way to break this inefficient cycle, aiming to create a model that can handle up to one million tokens of context without the computational weight that comes with it. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/a3787feb-b2f5-4ab2-aabd-bd36da2fb43f/CleanShot_2026-01-06_at_19.20.26_2x.png?t=1767707437"/><div class="image__source"><span class="image__source_text"><p>The illustration of LongCat ZigZag Attention (LoZA), which involves first calibration and then training for realizing the sparsity.</p></span></div></div><p class="paragraph" style="text-align:left;">The team discovered a method called <b>LongCat ZigZag Attention (LoZA)</b>, which effectively teaches the model how to &quot;skim&quot; intelligently without missing the details. By carefully calibrating the system, researchers identified which layers of the network were doing the most important work and which could be optimized.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b1df88f2-cc3f-49bf-bed1-84c241ee91da/CleanShot_2026-01-06_at_19.21.19_2x.png?t=1767707493"/><div class="image__source"><span class="image__source_text"><p>The efficiency of LoZA. 
The relative cost and speed-up are practically measured on H20 clusters.</p></span></div></div><p class="paragraph" style="text-align:left;">They found a streamlined structure hidden inside the larger model, converting about half of the attention mechanisms to a &quot;sparse&quot; mode that focuses only on essential information. They found that by retraining the model mid-process after this switch, they could lock in significant speed improvements while maintaining the same high level of intelligence and accuracy as the heavier, original models.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/18d69597-1672-4cab-a28d-cc440dfe229a/CleanShot_2026-01-06_at_19.21.54_2x.png?t=1767707523"/><div class="image__source"><span class="image__source_text"><p>The effectiveness of LongCat-Flash-Exp-Chat across different context lengths on MRCR. </p></span></div></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2512.23966?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=deepseek-just-added-parameters-where-there-were-none"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="evaluating-parameter-efficient-meth"><b>Evaluating Parameter Efficient Methods for RLVR</b></h1><p class="paragraph" style="text-align:left;">Yin<i> et al. [</i>Zhejiang University, HKUST, WUST, USTC, Brown University, Hong Kong Polytechnic University, INSAIT<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 433 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM RLVR </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">As artificial intelligence moves from simply predicting the next word to solving complex mathematical problems, the training process has evolved. Researchers are increasingly relying on <b>Reinforcement Learning with Verifiable Rewards (RLVR)</b>, a method where models improve by receiving a simple &quot;correct&quot; or &quot;incorrect&quot; signal on their reasoning. While this approach is powerful, retraining an entire massive model is incredibly expensive.</p><p class="paragraph" style="text-align:left;">To save cost and time, the industry has largely settled on a specific efficiency shortcut known as <b>LoRA (Low-Rank Adaptation)</b>. However, is this tool that everyone uses actually the best one for this specific type of learning, or are we leaving performance on the table by ignoring better alternatives?</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/64fac71e-fa13-4e03-91e6-eeb2a226f193/CleanShot_2026-01-06_at_19.22.43_2x.png?t=1767707583"/></div><p class="paragraph" style="text-align:left;">The team discovered that the industry standard is suboptimal for reinforcement learning. 
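</p><p class="paragraph" style="text-align:left;">To make the contrast concrete, here is a minimal sketch (my own simplification, not the paper&#39;s code) of a plain LoRA update next to a DoRA-style structural variant, which re-parameterizes the frozen weight into a per-column magnitude and a direction; the shapes and initialization below are purely illustrative:</p><pre><code>import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 8           # illustrative sizes; r is the adapter rank
W0 = rng.normal(size=(d_out, d_in))  # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01
B = np.zeros((d_out, r))             # standard LoRA init: B = 0, so training starts at W0

# Plain LoRA: add a low-rank adapter on top of the frozen weight.
W_lora = W0 + B @ A

# DoRA-style structural variant: split the weight into a per-column magnitude
# and a direction, apply the low-rank update to the direction only, and train
# the magnitude as its own small set of parameters.
m = np.linalg.norm(W0, axis=0, keepdims=True)     # (1, d_in) trainable magnitudes
V = W0 + B @ A                                    # direction carries the LoRA update
W_dora = m * V / np.linalg.norm(V, axis=0, keepdims=True)
</code></pre><p class="paragraph" style="text-align:left;">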
By testing over a dozen different efficiency methods, they found that newer &quot;structural&quot; variants (approaches that change how weight updates are structured rather than just adding a simple adapter) consistently outperformed the default method. In some cases, these structural variants even surpassed the performance of full-parameter training, which is typically considered the gold standard.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/0384ea49-6284-4356-bda5-0902d01b2b1e/CleanShot_2026-01-06_at_19.23.12_2x.png?t=1767707604"/><div class="image__source"><span class="image__source_text"><p>A variety of PEFT methods are listed, each with its specific update formulation and initialization strategy. LN denotes Layernorm.</p></span></div></div><p class="paragraph" style="text-align:left;">The study also revealed a fascinating mismatch in how models learn. Some advanced methods try to initialize training by focusing on the model&#39;s &quot;loudest&quot; or most significant existing features. The researchers found this causes the training to collapse because reinforcement learning actually thrives by tweaking the quieter, less dominant parts of the network.</p><p class="paragraph" style="text-align:left;">Additionally, the team identified a strict limit to efficiency. While it is possible to freeze large portions of a model, attempting extreme compression bottlenecks the system. To learn complex reasoning, the model retains a need for a minimum amount of &quot;plasticity,&quot; or trainable parameters, without which its ability to evolve stalls completely.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/c692261a-315e-42e5-af10-0e4394156a8a/CleanShot_2026-01-06_at_19.24.05_2x.png?t=1767707657"/><div class="image__source"><span class="image__source_text"><p>Comparison of accuracy and pass scores (all values are reported in percentages).</p></span></div></div><p class="paragraph" style="text-align:left;">By moving away from the default adoption of standard LoRA and using structural variants like DoRA, engineers can build models that are not only computationally cheaper but also significantly smarter at math and logic.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.arxiv.org/abs/2512.23165?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=deepseek-just-added-parameters-where-there-were-none"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="true" class="youtube_embed" frameborder="0" height="100%" src="https://youtube.com/embed/ZgwHaI2C-9s" width="100%"></iframe></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=c9c36eb9-51b0-427e-9a34-0fd7f67b1f0d&utm_medium=post_rss&utm_source=the_ai_timeline">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>RoPE Is Inherently Flawed</title>
  <description>plus more on Self-Play SWE-RL, Step DeepResearch, and Attention Is Not What You Need</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/f6c621ce-0db9-4b93-8ad9-893a2b3e3f82/issue_88.jpg" length="120605" type="image/jpeg"/>
  <link>https://mail.bycloud.ai/p/rope-is-inherently-flawed</link>
  <guid isPermaLink="true">https://mail.bycloud.ai/p/rope-is-inherently-flawed</guid>
  <pubDate>Tue, 30 Dec 2025 20:00:16 +0000</pubDate>
  <atom:published>2025-12-30T20:00:16Z</atom:published>
    <dc:creator>by cloud</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h6 class="heading" style="text-align:left;" id="nov-18-th-nov-24-th-33-latest-ai-re"><i>Dec 23rd ~ Dec 30th</i><br><i>#88 Latest AI Research Explained Simply</i></h6><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="industry-news-in-1-line">🗞️ Industry News in 1 Line</h2><ol start="1"><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 1.2k</span></span> <a class="link" href="https://Z.ai?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rope-is-inherently-flawed" target="_blank" rel="noopener noreferrer nofollow">Z.ai</a> is set for its IPO on Jan 8, 2026 on the Hong Kong Stock Exchange and set to raise $560 million at a valuation of 5.6 billion</p></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 2.1k</span></span> Liquid AI has announced the release of LFM2-2.6B-Exp, an experimental checkpoint built on LFM2-2.6B using pure reinforcement learning. The model delivers consistent gains in instruction following, knowledge, and math benchmarks, and outperforms other 3B models across these areas. Liquid AI also reports that its IFBench score surpasses DeepSeek R1-0528, despite being 263× smaller. Now on <a class="link" href="https://huggingface.co/LiquidAI/LFM2-2.6B-Exp?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rope-is-inherently-flawed" target="_blank" rel="noopener noreferrer nofollow">Hugging Face</a>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/cbb48679-1245-426b-a889-321964d35b79/image.png?t=1767120437"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 2.4k</span></span> Nvidia recently made a massive $20 billion deal with AI chip startup <a class="link" href="https://groq.com/newsroom/groq-and-nvidia-enter-non-exclusive-inference-technology-licensing-agreement-to-accelerate-ai-inference-at-global-scale?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rope-is-inherently-flawed" target="_blank" rel="noopener noreferrer nofollow">Groq</a>, acquiring most of Groq&#39;s AI chip assets and licensing its inference tech (LPU), while also &quot;acquihiring&quot; Groq&#39;s key team, including founder Jonathan Ross, to boost Nvidia&#39;s AI inference performance, though Groq remains independent for its cloud services.</p></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 1.4k</span></span> MiniMax has announced the open-source release of <a class="link" href="https://www.minimax.io/news/minimax-m21?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rope-is-inherently-flawed" target="_blank" rel="noopener noreferrer nofollow">MiniMax M2.1</a>, a SoTA model for real-world dev workflows and agentic applications. It uses a MoE with 10B active parameters and 230B total, aiming to be faster to run and easier to deploy than comparable models. 
Rankings #1 among open-source models and #6 overall for web dev arena. Now on <a class="link" href="https://huggingface.co/MiniMaxAI/MiniMax-M2.1?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rope-is-inherently-flawed" target="_blank" rel="noopener noreferrer nofollow">HuggingFace</a>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/6cd4a7b8-73d0-4236-901f-0145a897b018/image.png?t=1767120610"/></div></li></ol><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><div class="section" style="background-color:transparent;border-color:#2C81E5;border-style:solid;border-width:5px;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">Support My Newsletter</h2><p class="paragraph" style="text-align:left;"><span style="color:rgb(34, 34, 34);font-family:Georgia, "Times New Roman", serif;font-size:16px;">As I aim to keep this newsletter free forever, your support means a lot. If you like reading The AI Timeline, consider forwarding it to another research enthusiast. It helps us keep this up for free!</span></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://patreon.com/bycloud/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rope-is-inherently-flawed"><span class="button__text" style=""> Check Out My Patreon </span></a></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://theaitimeline.carrd.co/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rope-is-inherently-flawed" target="_blank" rel="noopener noreferrer nofollow">Advertise with The AI Timeline! </a></p></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="decoupling-the-what-and-where-with-">Decoupling the &quot;What&quot; and &quot;Where&quot; With Polar Coordinate Positional Embeddings</h1><p class="paragraph" style="text-align:left;">Gopalakrishnan<i> et al. [</i>The Swiss AI Lab (IDSIA), OpenAI, Center for Generative AI, University of Colorado<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1.3k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> RoPE </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">For modern AI to understand the world, it needs to track two distinct things: <i>what</i> a piece of information is and <i>where</i> it sits in a sequence. Rotary Position Embedding (RoPE) accidentally tangles the &quot;what&quot; and the &quot;where&quot; together, which confuses the model when it needs to make precise decisions based on just one of those factors. 
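</p><p class="paragraph" style="text-align:left;">To see where that tangling comes from, here is a minimal sketch of standard RoPE itself (not the paper&#39;s PoPE; the vector size and positions are illustrative): every pair of feature dimensions gets rotated by an angle proportional to the token&#39;s position, so a single dot product ends up carrying content and position at the same time.</p><pre><code>import numpy as np

def rope(x, pos, base=10000.0):
    """Standard rotary position embedding for one query/key vector.
    x is split into d/2 two-dimensional slices; slice i is rotated by
    pos * theta_i, writing position into the same numbers that carry content."""
    d = x.shape[0]
    theta = base ** (-np.arange(d // 2) * 2.0 / d)   # per-slice frequencies
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    x1, x2 = x[0::2], x[1::2]                        # pair up the dimensions
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

q = np.ones(8)                           # identical "content" in both cases
print(rope(q, pos=0) @ rope(q, pos=0))   # 8.0
print(rope(q, pos=0) @ rope(q, pos=7))   # ~7.03: same content, different score
</code></pre><p class="paragraph" style="text-align:left;">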
To solve this, the team developed a new approach called Polar Coordinate Position Embedding, or PoPE, designed to mathematically untangle these signals so the AI can process content and position independently.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/c1a06335-26f7-436b-b263-18545567fa5e/CleanShot_2025-12-30_at_18.25.15_2x.png?t=1767099330"/><div class="image__source"><span class="image__source_text"><p>How RoPE and PoPE encode relative positions via rotations of queries.</p></span></div></div><p class="paragraph" style="text-align:left;">In the polar coordinate system, the magnitude of a signal represents the content, and the angle represents the position. PoPE reduces the confusion found in previous models. When tested against standard baselines, this new method shows superior performance across a diverse range of complex tasks, including the generation of classical music and modeling the human genome.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/1ae07dc1-b9a8-4bdc-a841-1d0c1151afdc/CleanShot_2025-12-30_at_18.25.55_2x.png?t=1767099369"/><div class="image__source"><span class="image__source_text"><p>Zero-shot performance on downstream tasks using Transformer models pretrained on OpenWebText with RoPE or PoPE positional encoding.</p></span></div></div><p class="paragraph" style="text-align:left;">The researchers found that models using PoPE demonstrated a remarkable ability to handle sequences ten times longer than those for which they were trained. Unlike current state-of-the-art methods that require complex fine-tuning to &quot;stretch&quot; a model&#39;s attention span, PoPE naturally generalized to these longer contexts immediately, proving to be both more robust and data-efficient.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/f24bcf6a-34e1-4a1d-ac53-05b68350af30/CleanShot_2025-12-30_at_18.28.48_2x.png?t=1767099544"/></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2509.10534v2?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rope-is-inherently-flawed"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="toward-training-superintelligent-so"><b>Toward Training Superintelligent Software Agents through Self-Play SWE-RL</b></h1><p class="paragraph" style="text-align:left;">Wei<i> et al. 
[</i>Meta FAIR, Meta, TBD Lab, UIUC, CMU<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1.6k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM RL </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> bycloud’s pick </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;"><b>Self-play SWE-RL (SSR)</b> is a new framework designed to train superintelligent software engineering agents without relying on human-curated data. Current agents are limited by their dependence on finite resources, such as GitHub issues and manually written tests, which forces them to imitate human developers rather than discover new solutions. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/08ea4305-0dce-479d-aa8e-6be3473e4ecc/CleanShot_2025-12-30_at_18.38.58_2x.png?t=1767100155"/><div class="image__source"><span class="image__source_text"><p>Overview of Self-play SWE-RL.</p></span></div></div><p class="paragraph" style="text-align:left;">To overcome this barrier, SSR allows a Large Language Model (LLM) to self-improve by interacting with raw, sandboxed code repositories. The system requires only the source code and dependencies, which eliminates the need for pre-existing test suites or natural language issue descriptions.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e37cd236-d494-42e8-af0d-d1b08eb24a96/CleanShot_2025-12-30_at_18.34.37_2x.png?t=1767099888"/><div class="image__source"><span class="image__source_text"><p>Bug-injection patches generated by code hunk removal (left) and historical change reversion (right). </p></span></div></div><p class="paragraph" style="text-align:left;">The training process uses a single LLM alternating between two roles: a <b>bug-injection agent</b> and a <b>bug-solving agent</b>. The injection agent explores the repository to generate a &quot;bug artifact,&quot; which consists of a bug-inducing patch, a custom test script, and a patch that weakens existing tests to hide the bug. </p><p class="paragraph" style="text-align:left;">These valid bug artifacts are then passed to the solver agent, which attempts to fix the codebase using the strict test specifications defined by the injector. 
Failed attempts by the solver are converted into &quot;higher-order bugs,&quot; creating an evolving curriculum that becomes increasingly complex as the agent improves.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/ee769c62-240a-4ba5-a35d-ddfeef8780cf/CleanShot_2025-12-30_at_18.38.06_2x.png?t=1767100099"/><div class="image__source"><span class="image__source_text"><p>Key consistency checks applied to validate bug artifacts, the full set described in the text.</p></span></div></div><p class="paragraph" style="text-align:left;">SSR was tested on the <b>SWE-bench Verified</b> and <b>SWE-Bench Pro</b> benchmarks using the Code World Model (CWM) as a base. The results show that SSR achieves significant self-improvement (+10.4 and +7.8 points, respectively) and consistently outperforms baselines trained on human-curated data. </p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2512.18552?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rope-is-inherently-flawed"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="meta-rl-induces-exploration-in-lang"><b>Meta-RL Induces Exploration in Language Agents</b></h1><p class="paragraph" style="text-align:left;">Jiang<i> et al. [</i>EPFL, ETH Zurich, Idiap Research Institute<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 877 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM RL </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">AI can handle some complex tasks, but it struggles when we ask it to explore. While current LLMs can be trained via RL to solve specific problems, they often become rigid, and end up memorizing a single successful path rather than understanding how to adapt.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/1ae95bda-a27a-41d3-b674-7e76c7878a9b/CleanShot_2025-12-30_at_18.46.17_2x.png?t=1767100589"/><div class="image__source"><span class="image__source_text"><p>Comparison of RL and Meta-RL training on the MineSweeper environment.</p></span></div></div><p class="paragraph" style="text-align:left;">When faced with a new or slightly changed environment, these agents frequently fail because they haven&#39;t learned how to learn from their mistakes. The research team sought to bridge this gap by designing a system that treats failure not as a dead end, but as a strategic investment. 
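</p><p class="paragraph" style="text-align:left;">A toy example (entirely my own, far simpler than the paper&#39;s environments) shows why that trade can pay off once success is scored over a short series of attempts rather than a single try: an agent that remembers its failed guesses beats one that ignores them.</p><pre><code>import numpy as np

rng = np.random.default_rng(0)

def run(policy, attempts=3, trials=10_000):
    """Average success over meta-episodes of several attempts at a hidden target."""
    wins = 0
    for _ in range(trials):
        target = rng.integers(3)          # the hidden "correct door"
        history = []                      # failed guesses, visible to the policy
        for _ in range(attempts):
            guess = int(policy(history))
            if guess == target:
                wins += 1
                break
            history.append(guess)
    return wins / trials

def memoryless(history):
    return rng.integers(3)                          # ignores its own failures

def reflective(history):
    return min(set(range(3)) - set(history))        # never repeats a failed guess

print(run(memoryless), run(reflective))             # roughly 0.70 vs 1.00
</code></pre><p class="paragraph" style="text-align:left;">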
Their goal was to create agents that actively experiment with their surroundings and use that feedback to improve, mimicking the way a human might play a few practice rounds of a new game to understand the rules before trying to win.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/bc9461ce-844f-4896-930a-f51bc69f6256/CleanShot_2025-12-30_at_18.46.45_2x.png?t=1767100614"/><div class="image__source"><span class="image__source_text"><p>Comparison between the training processes of RL (top) and Meta-RL used in LAMER (bottom).</p></span></div></div><p class="paragraph" style="text-align:left;">The team introduced a framework called LAMER that fundamentally shifts the training objective from winning a single episode to maximizing success over a series of attempts. By analyzing how agents behave across multiple tries, the researchers found that their model learned to sacrifice immediate rewards in favor of gathering information.</p><p class="paragraph" style="text-align:left;">The agent effectively &quot;reflects&quot; on its previous performance, using the context of past failures to adjust its strategy in real-time without needing complex mathematical updates to its core programming. In testing across diverse environments (ranging from logic puzzles like Minesweeper to web navigation tasks), this approach created agents that were significantly more successful and creative. Instead of collapsing into repetitive behaviors, these agents maintained a diverse set of strategies and proved capable of solving problems that standard reinforcement learning models simply could not handle.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/6a409df5-36a1-4958-ad37-66fbce7dfc20/CleanShot_2025-12-30_at_18.47.46_2x.png?t=1767100677"/><div class="image__source"><span class="image__source_text"><p>Performance on Sokoban, MineSweeper and Webshop environments.</p></span></div></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2512.16848?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rope-is-inherently-flawed"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="step-deep-research-technical-report"><b>Step-DeepResearch Technical Report</b></h1><p class="paragraph" style="text-align:left;"><i>StepFun</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 757 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM Deep Research </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Researchers have identified an important distinction between simple search and true research. 
While current AI is excellent at answering specific, closed questions, it often stumbles when faced with open-ended projects that require long-term planning and logical structuring. It is challenging to create an agent that doesn&#39;t just retrieve links but understands the intent behind a request and can navigate the ambiguity of the real world.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b40ae042-a431-40c5-9a03-6be739715fde/CleanShot_2025-12-30_at_18.54.31_2x.png?t=1767101082"/><div class="image__source"><span class="image__source_text"><p>Comprehensive Evaluation of Step-DeepResearch.</p></span></div></div><p class="paragraph" style="text-align:left;">The team developed Step-DeepResearch, a framework that achieves expert-level performance without relying on massive, expensive computational resources. Instead of just feeding the model more data, they focused on training &quot;atomic capabilities&quot;, fundamental skills like decomposing a complex problem, verifying information across multiple sources, and reflecting on mistakes in real-time.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/d90440bb-144a-42d9-83bd-bfb495843efd/CleanShot_2025-12-30_at_18.55.09_2x.png?t=1767101124"/></div><p class="paragraph" style="text-align:left;">By teaching the model to internalize this cognitive loop of planning, executing, and self-correcting, they created a medium-sized system that rivals the performance of the industry&#39;s largest proprietary models. It shows that a refined training strategy, which prioritizes decision-making and synthesis over raw size, can produce an agent that effectively navigates complex workflows to produce comprehensive reports.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/f5ca0e30-e320-422f-88fd-157f94cd6282/CleanShot_2025-12-30_at_18.56.17_2x.png?t=1767101187"/><div class="image__source"><span class="image__source_text"><p>Step-DeepResearch System Architecture.</p></span></div></div><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/288c1cbb-4416-4b2e-8d01-151e5dde5655/CleanShot_2025-12-30_at_18.57.05_2x.png?t=1767101249"/></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2512.20491?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rope-is-inherently-flawed"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="attention-is-not-what-you-need"><b>Attention Is Not What You Need</b></h1><p class="paragraph" style="text-align:left;"><i>CHONG [Meta, UT Austin, UCL, UC Berkeley, Harvard University, Periodic Labs]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1.1k </span></span><span 
style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Attention Alternative </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">If you know anything about how LLMs work, then you would have heard about Transformers. Transformers use &quot;self-attention&quot;, which is a mechanism that requires every word in a sequence to continuously check its relationship with every other word. </p><p class="paragraph" style="text-align:left;">Although it is incredibly effective, this process creates a massive, opaque web of calculations that becomes computationally expensive and notoriously difficult for humans to interpret. Researchers recently posed a provocative question: Is this expensive &quot;attention&quot; mechanism actually necessary for AI to reason, or is it just one inefficient way to achieve a goal? </p><p class="paragraph" style="text-align:left;">To solve this, the team developed a new architecture called the Causal Grassmann Transformer, which completely removes the standard attention mechanism. Instead of building a massive grid of connections between all words, the model treats language processing as a flow through a specific mathematical landscape known as a Grassmann manifold. The system condenses information into lower dimensions and interprets the relationships between words as geometric subspaces.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/af56be0b-f42a-446c-9db8-c9de4895a0bf/image.png?t=1767120832"/></div><p class="paragraph" style="text-align:left;">This geometry-first approach proved that high performance doesn&#39;t require the traditional heavy machinery of self-attention. When tested on standard language modeling tasks, the simplified Grassmann model performed competitively with standard Transformers, achieving accuracy levels within a close margin of the established baselines.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/1e64508b-deb8-4454-ac43-f8cb9e64bd6c/image.png?t=1767120809"/></div><p class="paragraph" style="text-align:left;">This paper suggests that the future of language models may not rely solely on scaling up existing architectures, but rather on redesigning their mathematical foundations. By proving that &quot;attention&quot; can be replaced by &quot;geometric evolution&quot;, this work opens the door to AI systems that are drastically more efficient and easier to audit.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2512.19428?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=rope-is-inherently-flawed"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><p class="paragraph" style="text-align:left;">Message from bycloud:</p><p class="paragraph" style="text-align:left;">This will be the last issue of 2025! 
I hope you had a great holiday, and I wish you all the best in your endeavors in 2026!</p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=da760bcc-ba54-4b3a-a35f-f3e508006c13&utm_medium=post_rss&utm_source=the_ai_timeline">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Flash Attention Author&#39;s New Work: SonicMoE</title>
  <description>Next-Embedding Prediction Makes Strong Vision Learners, Let&#39;s (not) just put things in Context, Spherical Equivariant Graph Transformers, and more</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/fc8ce780-ddd7-4451-84b4-e11a645d8ccc/issue_87.jpg" length="198756" type="image/jpeg"/>
  <link>https://mail.bycloud.ai/p/flash-attention-author-s-new-work-sonicmoe</link>
  <guid isPermaLink="true">https://mail.bycloud.ai/p/flash-attention-author-s-new-work-sonicmoe</guid>
  <pubDate>Tue, 23 Dec 2025 18:44:19 +0000</pubDate>
  <atom:published>2025-12-23T18:44:19Z</atom:published>
    <dc:creator>by cloud</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h6 class="heading" style="text-align:left;" id="nov-18-th-nov-24-th-33-latest-ai-re"><i>Dec 15th ~ Dec 22nd</i><br><i>#87 Latest AI Research Explained Simply</i></h6><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="industry-news-in-1-line">🗞️ Industry News in 1 Line</h2><ol start="1"><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 1.5k</span></span> <a class="link" href="https://www.minimax.io/news/minimax-m21?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-author-s-new-work-sonicmoe" target="_blank" rel="noopener noreferrer nofollow">MiniMax has launched its M2.1 model</a>, which is designed for complex, real-world tasks. In addition to coding, the model introduces powerful capabilities for office automation, enabling it to handle &quot;digital employee&quot; workflows and complex instructions. M2.1 is also more efficient, delivering <b>faster</b>, more concise responses that use <b>fewer resources</b>. You can access the M2.1 model now through the <a class="link" href="https://platform.minimax.io/docs/guides/text-generation?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-author-s-new-work-sonicmoe" target="_blank" rel="noopener noreferrer nofollow">MiniMax API</a> and as an open-source download on HuggingFace.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/5c0bbd40-f3da-419d-833f-39b506954754/5c13032c-d33a-4e6f-be03-bac1606294a2.PNG?t=1766514080"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 2.8k</span></span> <a class="link" href="https://z.ai/blog/glm-4.7?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-author-s-new-work-sonicmoe" target="_blank" rel="noopener noreferrer nofollow">Z.ai has launched GLM-4.7</a>, a new AI model that delivers significant improvements in coding and reasoning capabilities. This update introduces advanced features like &quot;Preserved Thinking,&quot; which allows the model to retain its reasoning process across multiple steps, making it more stable and effective for complex, long-term tasks. 
Developers can access the model now through the <a class="link" href="https://docs.z.ai/guides/llm/glm-4.7?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-author-s-new-work-sonicmoe" target="_blank" rel="noopener noreferrer nofollow">Z.ai platform</a> and OpenRouter, or download it for local use via <a class="link" href="https://huggingface.co/zai-org/GLM-4.7?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-author-s-new-work-sonicmoe" target="_blank" rel="noopener noreferrer nofollow">HuggingFace</a>.<br></p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/c1d64e32-fa32-4641-a59c-62c58b6423ff/upload_058e166eb117f1c394d0505429b6248c.png?t=1766514356"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 1.1k</span></span> <a class="link" href="https://seed.bytedance.com/en/seed1_8?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-author-s-new-work-sonicmoe" target="_blank" rel="noopener noreferrer nofollow">Bytedance has introduced Seed-1.8</a>, a generalized agentic model designed to efficiently handle complex, real-world tasks. This new release excels in multimodal processing and it supports both text and image inputs with strong performance in <b>Graphical User Interface (GUI) interaction</b>, coding, and information retrieval. The model can perceive <b>video streams in real-time</b>, perform non-blocking interactions, and utilize specific video tools to analyze details or extract highlights from long-form content.<br></p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b77e0243-95a0-4b89-97be-dcf461043b9c/4og2ymj9w2ywr.png?t=1766514622"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 1.8k</span></span> <a class="link" href="https://mimo.xiaomi.com/blog/mimo-v2-flash?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-author-s-new-work-sonicmoe" target="_blank" rel="noopener noreferrer nofollow">Xiaomi has released MiMo-V2-Flash</a>, a new open-source AI model built for high-speed reasoning and coding tasks. This model uses a specialized &quot;Mixture-of-Experts&quot; architecture that delivers exceptionally fast responses (up to 150 tokens/s) while remaining highly cost-effective. With a massive context window capable of handling long documents and seamless integration with coding tools like Cursor and Claude Code, MiMo-V2-Flash is designed to act as an efficient digital assistant. 
You can access the model now through <a class="link" href="https://huggingface.co/xiaomimimo/MiMo-V2-Flash?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-author-s-new-work-sonicmoe" target="_blank" rel="noopener noreferrer nofollow">HuggingFace</a> or <a class="link" href="https://platform.xiaomimimo.com/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-author-s-new-work-sonicmoe" target="_blank" rel="noopener noreferrer nofollow">Xiaomi&#39;s API platform</a>.<br></p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/d5f34d25-0ce4-488b-b0d1-41c787ac3c4f/image.png?t=1766514799"/></div></li></ol><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h3 class="heading" style="text-align:left;" id="modernize-your-marketing-with-ad-qu">Modernize your marketing with AdQuick</h3><div class="image"><a class="image__link" href="https://www.AdQuick.com/?utm_campaign={{publication_alphanumeric_id}}&utm_source=beehiiv&_bhiiv=opp_00ba7f49-9d58-4075-bb99-f5b66f0c1901_a0e96baa&bhcl_id=68526b51-37c8-4c36-846b-c764fcf78b1a_{{subscriber_id}}_{{email_address_id}}" rel="noopener" target="_blank"><img class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/dcbc0411-8010-4366-a920-d2f1da8f4081/AdQuick_Newsletter_Hero_2025__1_.png?t=1738376775"/></a></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.AdQuick.com/?utm_campaign={{publication_alphanumeric_id}}&utm_source=beehiiv&_bhiiv=opp_00ba7f49-9d58-4075-bb99-f5b66f0c1901_a0e96baa&bhcl_id=68526b51-37c8-4c36-846b-c764fcf78b1a_{{subscriber_id}}_{{email_address_id}}" target="_blank" rel="noopener noreferrer nofollow">AdQuick</a> unlocks the benefits of Out Of Home (OOH) advertising in a way no one else has. Approaching the problem with eyes to performance, created for marketers with the engineering excellence you’ve come to expect for the internet.</p><p class="paragraph" style="text-align:left;">Marketers agree OOH is one of the best ways for building brand awareness, reaching new customers, and reinforcing your brand message. It’s just been difficult to scale. 
But with AdQuick, you can easily plan, deploy and measure campaigns just as easily as digital ads, making them a no-brainer to add to your team’s toolbox.</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.AdQuick.com/?utm_campaign={{publication_alphanumeric_id}}&utm_source=beehiiv&_bhiiv=opp_00ba7f49-9d58-4075-bb99-f5b66f0c1901_a0e96baa&bhcl_id=68526b51-37c8-4c36-846b-c764fcf78b1a_{{subscriber_id}}_{{email_address_id}}" target="_blank" rel="noopener noreferrer nofollow">Learn more now.</a></p><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="lets-not-just-put-things-in-context"><b>Let&#39;s (not) just put things in Context: Test-Time Training for Long-Context LLMs</b></h1><p class="paragraph" style="text-align:left;">Bansal et al.<i> [</i>Meta, Harvard University, Kempner Institute at Harvard, OpenAI, UC Berkeley, UT Austin<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 444 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM Context </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">AI models can theoretically process millions of words at once. However, there is a gap between what these models can read and what they can actually use. When you ask a model to find a specific &quot;needle&quot; in a massive &quot;haystack&quot; of text, it often fails, getting distracted by the sheer volume of information.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/5d08c4ee-47df-4ffb-8f4f-ef05be5e01dd/CleanShot_2025-12-23_at_22.19.33_2x.png?t=1766508581"/></div><p class="paragraph" style="text-align:left;">Until now, the industry standard solution has been to let the model &quot;think&quot; longer by generating more text before answering. But researchers have discovered that for truly long documents, simply generating more words doesn&#39;t help as the model’s internal attention mechanism gets overwhelmed.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/a04979af-2750-4efe-94e1-387356f58737/CleanShot_2025-12-23_at_22.19.46_2x.png?t=1766508595"/></div><p class="paragraph" style="text-align:left;">This paper has identified a mathematical bottleneck called &quot;<b>score dilution</b>&quot;. As a document gets longer, the &quot;signal&quot; of the correct answer gets drowned out by the &quot;noise&quot; of unrelated text. To fix this, they developed a technique called query-only test-time training (qTTT). 
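</p><p class="paragraph" style="text-align:left;">The dilution effect itself is easy to reproduce in a toy setting (my own numerical illustration, not the paper&#39;s analysis): hold the one relevant &quot;needle&quot; token&#39;s attention score fixed and watch its share of the softmax shrink as irrelevant tokens pile up.</p><pre><code>import numpy as np

def needle_attention_weight(haystack_len, needle_score=5.0, noise_score=1.0):
    """Softmax weight on a single high-scoring token as the context grows."""
    scores = np.full(haystack_len + 1, noise_score)
    scores[0] = needle_score                  # the one relevant token
    weights = np.exp(scores - scores.max())
    return weights[0] / weights.sum()

for n in (100, 1_000, 10_000, 100_000):
    print(n, round(needle_attention_weight(n), 4))
# the needle's share falls roughly like 1/n even though its raw score never
# changes; that is the score dilution described above
</code></pre><p class="paragraph" style="text-align:left;">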
Instead of asking the model to generate more text to solve a problem, this method allows the model to pause and perform a tiny, temporary update to its internal settings based specifically on the document it is reading.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/641ca0b4-08f6-4a17-b105-5862baf42a46/CleanShot_2025-12-23_at_22.19.58_2x.png?t=1766508608"/><div class="image__source"><span class="image__source_text"><p>A visual representation of how qTTT improves the logit margin. </p></span></div></div><p class="paragraph" style="text-align:left;">It effectively lets the model &quot;study&quot; the context for a moment before answering. This approach proved far more effective than standard methods, and provided massive double-digit improvements in accuracy on difficult tasks like finding bugs in code or details in long records.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2512.13898?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-author-s-new-work-sonicmoe"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="a-complete-guide-to-spherical-equiv">A Complete Guide to Spherical Equivariant Graph Transformers</h2><p class="paragraph" style="text-align:left;">Sophia Tang<i> [</i>University of Pennsylvania<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 876 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Protein Generation </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> bycloud’s pick </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">For a long time, teaching AI to navigate the 3D world of atoms and molecules was a challenge. The laws of physics remain constant regardless of which way a molecule is facing, but the standard ML models often struggle to recognize a structure simply because it has been rotated. To solve this, researchers have developed a<b> Spherical Equivariant Graph Neural Network</b>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/8c8db974-596f-49f9-a2b8-129663677a5d/CleanShot_2025-12-23_at_22.36.27_2x.png?t=1766509601"/></div><p class="paragraph" style="text-align:left;">Rather than treating molecular features as static lists of numbers, this framework uses &quot;spherical tensors&quot;, which is a mathematical concept taken from <b>quantum mechanics</b>. The researchers proved that by representing data this way, the model can perform &quot;equivariant&quot; message-passing. 
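</p><p class="paragraph" style="text-align:left;">Equivariance can be written down as a simple test (a toy check of the property on plain vector features, not the paper&#39;s full architecture): rotating the inputs and then running the layer must give the same answer as running the layer and then rotating its output.</p><pre><code>import numpy as np

def vector_messages(pos):
    """Toy equivariant message passing on type-1 (vector) features: each atom
    sums its neighbors' relative position vectors, weighted by a
    rotation-invariant scalar of the distance."""
    rel = pos[None, :, :] - pos[:, None, :]            # rel[i, j] = pos[j] - pos[i]
    dist = np.linalg.norm(rel, axis=-1, keepdims=True)
    weights = np.exp(-dist ** 2)                       # invariant edge weights
    return (weights * rel).sum(axis=1)                 # one output vector per atom

rng = np.random.default_rng(0)
pos = rng.normal(size=(5, 3))                          # 5 atoms in 3D
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))           # a random orthogonal transform

print(np.allclose(vector_messages(pos @ R.T),          # rotate, then process
                  vector_messages(pos) @ R.T))         # process, then rotate -> True
</code></pre><p class="paragraph" style="text-align:left;">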
This means that if a molecule rotates in virtual space, the model&#39;s internal calculations transform in perfect synchronization, and it preserves the correct physical relationships between atoms.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/116451f3-5ecb-4dda-ab23-85afc45a858a/CleanShot_2025-12-23_at_22.36.55_2x.png?t=1766509625"/><div class="image__source"><span class="image__source_text"><p>A feature vector f is split into its type-0, type-1, and type-2 components and arranged into a feature tensor with a tensor axis, a channel axis, and a tensor-component axis.</p></span></div></div><p class="paragraph" style="text-align:left;">By adding specific geometric tools like spherical harmonics and Clebsch-Gordan coefficients, the architecture guarantees that the AI respects the rotational symmetries of the physical world without needing to be retrained on every possible orientation of a protein or chemical compound.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/3a446adc-e095-4743-bc47-e22a32bfa33e/CleanShot_2025-12-23_at_22.37.49_2x.png?t=1766509685"/></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2512.13927?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-author-s-new-work-sonicmoe"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:transparent;border-color:#2C81E5;border-style:solid;border-width:5px;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">New Premium Insights release</h2><p class="paragraph" style="text-align:left;">For context, Premium Insights is where I write down longer form content that I think is interesting but not long enough to be made into YouTube videos. </p><p class="paragraph" style="text-align:left;">Last week, I published the following blog:</p><div class="embed"><a class="embed__url" href="https://mail.bycloud.ai/p/aug-nov-ai-research-trend-report?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-author-s-new-work-sonicmoe" target="_blank"><div class="embed__content"><p class="embed__title"> Aug~Nov AI Research Trend Report </p><p class="embed__description"> Basically recapping what I missed in the last 4 months </p><p class="embed__link"> mail.bycloud.ai/p/aug-nov-ai-research-trend-report </p></div><img class="embed__image embed__image--right" src=""/></a></div><h3 class="heading" style="text-align:left;">Aug~Nov AI Research Trend Report</h3><p class="paragraph" style="text-align:left;">Basically recapping what I missed in the last 4 months </p><p class="paragraph" style="text-align:left;">I spent a lot of time on this quarterly(?) AI research trend report (~4000 words), so don’t miss out! 
</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://mail.bycloud.ai/p/the-only-perfect-score-paper-at-neurips-2025?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-author-s-new-work-sonicmoe"><span class="button__text" style=""> Check It Out Now </span></a></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://theaitimeline.carrd.co/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-author-s-new-work-sonicmoe" target="_blank" rel="noopener noreferrer nofollow">Advertise with The AI Timeline! </a></p></div><h1 class="heading" style="text-align:left;" id="t-5-gemma-2-seeing-reading-and-unde"><b>T5Gemma 2: Seeing, Reading, and Understanding Longer</b></h1><p class="paragraph" style="text-align:left;"><i>Zhang et al. [</i>Google DeepMind<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1.4k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">AI can generate text, but it can’t see images, process multiple languages, and remember long conversations without losing the thread. Recent models have focused on &quot;decoder-only&quot; models, but we still don’t have a lightweight model that can handle all these tasks simultaneously.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/cda70a65-7f81-42db-9bd5-941b28297b16/CleanShot_2025-12-23_at_22.45.30_2x.png?t=1766510141"/><div class="image__source"><span class="image__source_text"><p>Summary of pretraining (top) and post-training (bottom) performance for Gemma 3 and T5Gemma 2 at 270M, 1B and 4B over five capabilities.</p></span></div></div><p class="paragraph" style="text-align:left;">This paper introduces T5Gemma 2, which is a new family of models that successfully adapts the powerful Gemma 3 foundation into this specialized encoder-decoder structure. They utilized a clever &quot;recipe&quot; that initializes the new system using pre-existing technology. This teaches a text-generator to become a better reader and observer.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/391ca6fa-43ff-4afa-9788-e6a7e3f9fe23/CleanShot_2025-12-23_at_22.46.05_2x.png?t=1766510176"/><div class="image__source"><span class="image__source_text"><p>Overview of T5Gemma 2. 
Encoder/decoder parameters are initialized from the pretrained decoder-only model, and then pretrained with UL2.</p></span></div></div><p class="paragraph" style="text-align:left;">To make these models highly efficient, the researchers streamlined the internal machinery by merging different attention mechanisms into a single, unified module and sharing vocabulary tools across the system.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/fa2dd73c-7611-4e07-9564-c8531f8b4cb7/CleanShot_2025-12-23_at_22.47.19_2x.png?t=1766510249"/><div class="image__source"><span class="image__source_text"><p>Detailed post-training results for Gemma 3, T5Gemma, and T5Gemma 2.</p></span></div></div><p class="paragraph" style="text-align:left;">The results were impressive: despite being trained on shorter sequences of data, the models demonstrated a surprising ability to <b>handle extremely long contexts</b>, extrapolating well beyond their training wheels. Furthermore, even the smallest versions of the model proved capable of understanding images alongside text, matching or even outperforming their predecessors.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2512.14856?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-author-s-new-work-sonicmoe"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="sonic-mo-e-accelerating-mo-e-with-i"><b>SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations</b></h1><p class="paragraph" style="text-align:left;"><i>Karan and Du [Harvard University]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1.4k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM MoE Optimization </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">We want better AI, but don’t want to pay millions of dollars to train it. Right now, one popular solution is to use the &quot;Mixture of Experts&quot; approach, which works by dividing a large neural network into many specialized sub-networks, activating only the few necessary ones for any given task.</p><p class="paragraph" style="text-align:left;">This approach theoretically promises smarter and more efficient models, but as these experts become smaller and more numerous to improve precision, the computer chips running them struggle to keep up. 
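To picture what &quot;activating only a few experts&quot; looks like in code, here is a toy top-k routing step (purely illustrative, not SonicMoE&#39;s optimized kernels); the gather and scatter of tokens across many small experts is exactly the data movement that becomes the problem:</p><pre><code># Toy top-k MoE routing (illustrative only): each token is dispatched to its k
# highest-scoring experts, so most expert weights sit idle for any given token.
import torch
import torch.nn as nn

num_experts, k, dim = 8, 2, 64
tokens = torch.randn(16, dim)                              # a small batch of 16 tokens
router = nn.Linear(dim, num_experts)
experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))

scores = router(tokens).softmax(dim=-1)                    # (16, 8) routing probabilities
topk_scores, topk_idx = scores.topk(k, dim=-1)             # keep only k experts per token

out = torch.zeros_like(tokens)
for e, expert in enumerate(experts):
    rows, slots = (topk_idx == e).nonzero(as_tuple=True)   # tokens routed to expert e
    if rows.numel():
        out[rows] += topk_scores[rows, slots, None] * expert(tokens[rows])
</code></pre><p class="paragraph" style="text-align:left;">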
The <b>hardware begins spending more time simply moving data around than actually processing it</b>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/9735b758-d055-4bea-9e52-c209b4e1e93c/CleanShot_2025-12-23_at_22.58.27_2x.png?t=1766510918"/><div class="image__source"><span class="image__source_text"><p>Computational workflow of SonicMoE’s 8 launched kernels, grouped by yellow boxes. </p></span></div></div><p class="paragraph" style="text-align:left;">This paper has introduced a new system called SonicMoE that changes how these complex networks use computer memory. They discovered that by intelligently reorganizing the mathematical operations required for training, they could drastically reduce the amount of temporary data the system needs to remember, which cuts the memory footprint nearly in half without losing any information.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/7f43334c-adeb-4834-a777-440f51adbad8/CleanShot_2025-12-23_at_22.58.52_2x.png?t=1766510941"/></div><p class="paragraph" style="text-align:left;">Additionally, they introduced a clever &quot;token rounding&quot; strategy that ensures data is assigned to experts in perfectly sized chunks. Previously, the hardware would often waste energy processing empty filler space just to satisfy rigid computational requirements.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/fbd07b2e-39ef-4073-a210-0c9ef1ff0840/CleanShot_2025-12-23_at_22.57.52_2x.png?t=1766510883"/><div class="image__source"><span class="image__source_text"><p>MoE Scaling Trends</p></span></div></div><p class="paragraph" style="text-align:left;">By aligning the data flow with the physical design of the chips and performing data transfer and calculation simultaneously, the researchers were able to process information nearly <b>twice as fast</b> as previous methods on modern graphics processors.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2512.14080?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-author-s-new-work-sonicmoe"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="next-embedding-prediction-makes-str"><b>Next-Embedding Prediction Makes Strong Vision Learners</b></h1><p class="paragraph" style="text-align:left;">Xu<i> et al. 
[</i>University of Michigan, New York University, Princeton University, University of Virginia<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 624 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Vision </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Teaching computers to &quot;see&quot; has required a different playbook than teaching them to read. While LLMs grew powerful by simply <b>guessing the next word in a sentence</b>, vision models rely on complex engineering tricks, such as reconstructing missing pixels like a puzzle or comparing thousands of image pairs to learn differences.</p><p class="paragraph" style="text-align:left;">A team of researchers recently asked: could the simple &quot;predict what comes next&quot; strategy that worked for NLP work just as well for images? The team introduced a method called Next-Embedding Predictive Autoregression, or NEPA. Instead of asking the model to paint back missing parts of a picture pixel-by-pixel, they trained it to predict the abstract features (or &quot;embeddings&quot;) of the next patch of an image based on what it has already seen.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/21cd41c2-961a-4b03-9b50-366e33375f87/CleanShot_2025-12-23_at_23.08.33_2x.png?t=1766511525"/></div><p class="paragraph" style="text-align:left;">In simple words, this is like asking a person to guess the meaning of the next puzzle piece before picking it up, rather than trying to draw the picture from memory. Remarkably, this simplified approach allowed a standard Transformer model to achieve top-tier accuracy on major classification benchmarks without needing complex decoders, specific visual vocabularies, or heavy data augmentation.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/50a61796-6b59-4cc2-9769-b548765e256c/CleanShot_2025-12-23_at_23.08.58_2x.png?t=1766511549"/><div class="image__source"><span class="image__source_text"><p>Comparison of different self-supervised learning frameworks on ImageNet-1K classification. </p></span></div></div><p class="paragraph" style="text-align:left;">By focusing on predicting high-level information rather than raw details, the model naturally learned rich, transferable visual concepts that performed exceptionally well even on dense tasks like segmentation. 
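A minimal sketch of the underlying next-embedding objective, with stand-in modules (the actual backbone, feature targets, and loss in the paper may differ), looks like this:</p><pre><code># Hedged sketch of a NEPA-style objective: a causally masked Transformer predicts
# the feature of patch t+1 from patches up to t. The backbone and target encoder
# below are stand-ins, not the architecture used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, num_patches = 256, 196
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)      # stand-in causal ViT trunk
target_encoder = nn.Linear(dim, dim)                       # stand-in for frozen target features

patches = torch.randn(4, num_patches, dim)                 # patch embeddings in raster order
causal = nn.Transformer.generate_square_subsequent_mask(num_patches)

preds = backbone(patches, mask=causal)                     # prediction at every position
with torch.no_grad():
    targets = target_encoder(patches)                      # feature targets (no gradient)

loss = F.smooth_l1_loss(preds[:, :-1], targets[:, 1:])     # predict the embedding of the next patch
loss.backward()
</code></pre><p class="paragraph" style="text-align:left;">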
This discovery suggests that the heavy architectural machinery often used in computer vision might be unnecessary.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2512.16922?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=flash-attention-author-s-new-work-sonicmoe"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="true" class="youtube_embed" frameborder="0" height="100%" src="https://youtube.com/embed/cJeqGq0Bx1M" width="100%"></iframe></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=d45df430-c681-4cef-bd0c-65da01ede0ba&utm_medium=post_rss&utm_source=the_ai_timeline">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Scaling Up Diffusion Language Models to 100B!</title>
  <description>Scaling Up Diffusion Language Models to 100B, Adding 1 Attention Layer &amp; Make Visual Encoders Generate Images, LayerNorm Is Not Needed In Transformer, and more</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/03776770-8985-4587-9f95-1171066ed722/issue_86-1.jpg" length="122483" type="image/jpeg"/>
  <link>https://mail.bycloud.ai/p/layernorm-is-not-needed-in-transformer</link>
  <guid isPermaLink="true">https://mail.bycloud.ai/p/layernorm-is-not-needed-in-transformer</guid>
  <pubDate>Tue, 16 Dec 2025 19:44:16 +0000</pubDate>
  <atom:published>2025-12-16T19:44:16Z</atom:published>
    <dc:creator>by cloud</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h6 class="heading" style="text-align:left;" id="nov-18-th-nov-24-th-33-latest-ai-re"><i>Dec 8th ~ Dec 15th</i><br><i>#86 Latest AI Research Explained Simply</i></h6><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="industry-news-in-1-line">🗞️ Industry News in 1 Line</h2><ol start="1"><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 1.3k</span></span> Alibaba’s Qwen Team has released <a class="link" href="https://qwen.ai/blog?id=qwen3-omni-flash-20251201&utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=scaling-up-diffusion-language-models-to-100b" target="_blank" rel="noopener noreferrer nofollow">Qwen3-Omni-Flash</a>, which is capable of processing text, audio, and video with seamless real-time responses. This upgraded iteration significantly enhances audio-visual interactions by resolving previous stability issues and offering precise control over system prompts and personas. Furthermore, it seamlessly integrates natural multi-turn video and audio understanding with <b>indistinguishable human voices</b> and fully customizable personalities, all backed by robust support for 119 text and 19 speech languages. Try it on <a class="link" href="https://modelscope.cn/studios/Qwen/Qwen3-Omni-Demo?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=scaling-up-diffusion-language-models-to-100b" target="_blank" rel="noopener noreferrer nofollow">Modelscope</a> or <a class="link" href="https://huggingface.co/spaces/Qwen/Qwen3-Omni-Demo?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=scaling-up-diffusion-language-models-to-100b" target="_blank" rel="noopener noreferrer nofollow">HuggingFace</a>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/8a80f748-07a5-40cb-9e61-be51df4d58b8/q3o251201_metric.png?t=1765886772"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 712</span></span> Ai2 has announced the release of <a class="link" href="https://allenai.org/blog/olmo3?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=scaling-up-diffusion-language-models-to-100b" target="_blank" rel="noopener noreferrer nofollow">Olmo 3.1</a>, which is their most capable model to date with the new Think 32B and Instruct 32B variants. The Olmo 3.1 Think 32B model achieves significant performance gains in logic and reasoning benchmarks. They also launched updated <b>7B models optimized for math and code</b> along with full weights, data, and training recipes for the entire suite. 
<a class="link" href="https://playground.allenai.org/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=scaling-up-diffusion-language-models-to-100b" target="_blank" rel="noopener noreferrer nofollow">Try it online</a> or download from <a class="link" href="https://huggingface.co/collections/allenai/olmo-31?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=scaling-up-diffusion-language-models-to-100b" target="_blank" rel="noopener noreferrer nofollow">HuggingFace</a>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/0ebbb83d-4627-4395-bf99-2df44762f619/1765558559-unnamed-2025-12-12t115545-244.png?t=1765887021"/></div></li><li><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;">♥ 1k</span></span> <a class="link" href="https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=scaling-up-diffusion-language-models-to-100b" target="_blank" rel="noopener noreferrer nofollow">NVIDIA has launched the Nemotron 3 family</a> with a suite of open models, datasets, and libraries designed to advance specialized agentic AI. <a class="link" href="https://nvidianews.nvidia.com/news/nvidia-debuts-nemotron-3-family-of-open-models/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=scaling-up-diffusion-language-models-to-100b" target="_blank" rel="noopener noreferrer nofollow">Nemotron 3 Nano</a> is a hybrid <b>Mamba-Transformer mixture-of-experts</b> (MoE) model that delivers highly efficient inference and a 1 million token context window while using only roughly 3B active parameters. NVIDIA has released the model under an open license alongside <a class="link" href="https://docs.nvidia.com/nemo/gym/latest/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=scaling-up-diffusion-language-models-to-100b" target="_blank" rel="noopener noreferrer nofollow">NeMo Gym</a>, a new open-source reinforcement learning library for scalable agent training. If you are feeling adventurous, why not <a class="link" href="https://developer.nvidia.com/blog/inside-nvidia-nemotron-3-techniques-tools-and-data-that-make-it-efficient-and-accurate?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=scaling-up-diffusion-language-models-to-100b#join_the_nemotron_model_reasoning_challenge" target="_blank" rel="noopener noreferrer nofollow">join the Nemotron Model Reasoning challenge</a>?</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/9739627d-b608-4666-9740-388cc5f68c4f/Nemotron-3-Fig-5-png.jpg?t=1765887698"/></div></li></ol><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><div class="section" style="background-color:transparent;border-color:#2C81E5;border-style:solid;border-width:5px;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">New Premium Insights release</h2><p class="paragraph" style="text-align:left;">For context, Premium Insights is where I write down longer form content that I think is interesting but not long enough to be make into YouTube videos. 
</p><p class="paragraph" style="text-align:left;">Last week I have published the below blog:</p><div class="embed"><a class="embed__url" href="https://mail.bycloud.ai/p/aug-nov-ai-research-trend-report?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=scaling-up-diffusion-language-models-to-100b" target="_blank"><div class="embed__content"><p class="embed__title"> Aug~Nov AI Research Trend Report </p><p class="embed__description"> Basically recapping what I missed in the last 4 months </p><p class="embed__link"> mail.bycloud.ai/p/aug-nov-ai-research-trend-report </p></div><img class="embed__image embed__image--right" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/fdf3146d-19cc-4b7e-96fb-b003bb0ccd1f/Q4_2025_research_trend_report.jpg?t=1765514046"/></a></div><p class="paragraph" style="text-align:left;">I spent a lot of time on this quarterly(?) AI research trend report (~4000 words), so don’t miss out! </p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://mail.bycloud.ai/p/the-only-perfect-score-paper-at-neurips-2025?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=scaling-up-diffusion-language-models-to-100b"><span class="button__text" style=""> Check It Out Now </span></a></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://theaitimeline.carrd.co/?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=scaling-up-diffusion-language-models-to-100b" target="_blank" rel="noopener noreferrer nofollow">Advertise with The AI Timeline! </a></p></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h1 class="heading" style="text-align:left;" id="one-layer-is-enough-adapting-pretra"><b>One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation</b></h1><p class="paragraph" style="text-align:left;">Gao<i> et al. [Apple]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 450 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Image Generation </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Using powerful, pre-trained visual AI models with image generators has always been tricky. These understanding models produce rich, high-dimensional features, but today’s best generators need to work in a much smaller, more stable space to create images efficiently. This mismatch usually forces complex solutions. 
But a new method shows we might have been overcomplicating things.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e5809b93-83f1-4d2c-831e-9721f79ec639/CleanShot_2025-12-16_at_16.53.13_2x.png?t=1765884203"/><div class="image__source"><span class="image__source_text"><p>Comparison between standard VAE, VA-VAE, RAE, and FAE.</p></span></div></div><p class="paragraph" style="text-align:left;">Instead of forcing the generator to work with the bulky original features or building a complex translator, the method, called FAE, uses a minimal encoder (just <b>a single attention layer</b>) to gently compress those features into a compact, generation-friendly space. The real cleverness is in the double-decoder setup.</p><p class="paragraph" style="text-align:left;">First, a dedicated decoder faithfully reconstructs the original high-quality features from this compact code. Then, a second, separate decoder uses those reconstructed features as its guide to generate the final image pixels. This separation of duties means the system retains the semantic understanding from the powerful pre-trained model while giving the image decoder the clear, low-dimensional signals it needs to work reliably.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/401f7067-19d0-4391-bff7-9e791de59c5c/CleanShot_2025-12-16_at_16.54.11_2x.png?t=1765884259"/><div class="image__source"><span class="image__source_text"><p>An illustration of Training Stages of FAE. Stage Ia and Ib can be trained independently.</p></span></div></div><p class="paragraph" style="text-align:left;">It can plug into different types of generators, like diffusion models or normalizing flows, and can use features from various popular pre-trained models like DINO or SigLIP. The results are impressive. On the standard ImageNet 256x256 benchmark, a diffusion model using FAE achieved top-tier image quality, scoring a <b>near state-of-the-art FID</b> of 1.29.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/63fdde1d-0c88-4924-9141-85d664b64e3c/CleanShot_2025-12-16_at_16.54.52_2x.png?t=1765884303"/><div class="image__source"><span class="image__source_text"><p>FID results of different models on MS-COCO validation (256 × 256).</p></span></div></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.arxiv.org/abs/2512.07829?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=scaling-up-diffusion-language-models-to-100b"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="stronger-normalization-free-transfo">Stronger Normalization-Free Transformers</h2><p class="paragraph" style="text-align:left;"><i>Chen et al. 
[</i>Princeton University, NYU, Carnegie Mellon University<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Transformers </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> bycloud’s pick </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Normalization layers are a common trick to help deep neural networks train smoothly, but they come with a cost. They need extra computation to track statistics and can be sensitive to settings like batch size. Researchers have been looking for simpler, drop-in replacements, and point-wise functions like Dynamic Tanh (DyT) showed it was possible to match normalization performance. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/7a6db0f4-0a58-4794-a829-71ed83150fcc/CleanShot_2025-12-16_at_16.58.16_2x.png?t=1765884511"/><div class="image__source"><span class="image__source_text"><p>Structure of Dynamic erf (Derf), a point-wise function, that outperforms normalization layers and other point-wise functions.</p></span></div></div><p class="paragraph" style="text-align:left;">To find an optimal design, the researchers first identified what makes a point-wise function work well as a normalization replacement. They tested four key properties: the function should be zero-centered, bounded in its output, sensitive to small changes around zero, and monotonic (always increasing or decreasing). Functions that broke these rules often led to unstable training or worse performance. With these principles as a guide, they performed a large-scale search across many candidate S-shaped functions.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/ee396e45-cd8a-4463-bbc7-d6859292482c/CleanShot_2025-12-16_at_16.59.01_2x.png?t=1765884549"/><div class="image__source"><span class="image__source_text"><p>Results of zero-centeredness on ViT-Base.</p></span></div></div><p class="paragraph" style="text-align:left;">The search identified a clear winner: a dynamically parameterized version of the error function, called Derf. This function, related to the Gaussian cumulative distribution, naturally has all the desired properties. When integrated into a model, it simply transforms each neuron&#39;s activation using learnable scaling and shifting parameters. The team tested Derf extensively, replacing normalization layers in Transformers for vision, speech, DNA, and language tasks. 
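As a rough sketch (assuming a DyT-style parameterization with one learnable input scale and per-channel affine parameters; the exact setup in the paper may differ), a Derf layer drops in where a LayerNorm would go:</p><pre><code># Hedged sketch of a Derf (dynamic erf) layer, assuming a DyT-style
# parameterization; not the reference implementation from the paper.
import torch
import torch.nn as nn

class Derf(nn.Module):
    def __init__(self, dim: int, alpha_init: float = 1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))   # learnable input scale
        self.gamma = nn.Parameter(torch.ones(dim))             # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))              # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # erf is zero-centered, bounded, steep near zero, and monotonic --
        # the four properties the search selected for.
        return self.gamma * torch.erf(self.alpha * x) + self.beta

# Drop-in usage: swap nn.LayerNorm(dim) for Derf(dim) inside a Transformer block.
x = torch.randn(2, 16, 768)
print(Derf(768)(x).shape)                                       # torch.Size([2, 16, 768])
</code></pre><p class="paragraph" style="text-align:left;">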
In nearly every case, <b>Derf outperformed both standard normalization layers</b> and the previous best alternative, DyT.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.arxiv.org/abs/2512.10938?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=scaling-up-diffusion-language-models-to-100b"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="on-the-interplay-of-pre-training-mi">On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models</h2><p class="paragraph" style="text-align:left;"><i>Zhang et al. [</i>Carnegie Mellon University, Language Technologies Institute<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1.3k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM Reasoning </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">We often see language models get better at reasoning after reinforcement learning, but it&#39;s hard to tell if RL is teaching them new skills or just polishing what they already learned in pre-training. To solve this, researchers built a fully controlled test using synthetic math problems. This let them isolate and study the distinct roles of pre-training, mid-training, and RL.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/a3735c4f-802f-41bc-949a-09c4b7cbee00/CleanShot_2025-12-16_at_17.03.09_2x.png?t=1765884809"/><div class="image__source"><span class="image__source_text"><p>Overview of the data generation framework, task setup, and process-verified evaluation.</p></span></div></div><p class="paragraph" style="text-align:left;">The study shows <b>RL can create new reasoning ability</b>, but only under specific conditions. If a task is already well understood from pre-training, RL just fine-tunes the model&#39;s existing skill. For RL to teach something genuinely new, the task must be slightly beyond the model&#39;s current ability. The model also needs a seed of knowledge from pre-training.</p><p class="paragraph" style="text-align:left;">For example, to solve a problem in a new context like a &quot;zoo&quot; scenario, the model must have seen that context at least a little bit during pre-training. Even just 1% exposure gives RL enough material to work with and generalize from.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/1f84341a-ab22-4ce2-9aee-1fd059862adc/findings.png?t=1765884778"/></div><p class="paragraph" style="text-align:left;">The researchers found that mixing mid-training with RL leads to the best generalization. Mid-training, which uses supervised learning on data from the model&#39;s edge of competence, builds a strong foundation. RL then explores from that foundation. 
</p><p class="paragraph" style="text-align:left;">Allocating most compute to mid-training with a little RL is great for known tasks, while dedicating more budget to RL is better for tackling completely new, harder problems. Adding rewards for correct reasoning steps, not just the final answer, further reduced errors and improved the reliability of the model&#39;s solutions.</p><div class="embed"><a class="embed__url" href="https://huggingface.co/Interplay-LM-Reasoning?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=scaling-up-diffusion-language-models-to-100b" target="_blank"><img class="embed__image embed__image--top" src="https://cdn-thumbnails.huggingface.co/social-thumbnails/Interplay-LM-Reasoning.png"/><div class="embed__content"><p class="embed__title"> Interplay-LM-Reasoning (Interplay-LM-Reasoning) </p><p class="embed__description"> Org profile for Interplay-LM-Reasoning on Hugging Face, the AI community building the future. </p><p class="embed__link"> huggingface.co/Interplay-LM-Reasoning </p></div></a></div><p class="paragraph" style="text-align:left;">In tests, a well-calibrated RL approach improved performance on harder tasks by up to 42%, and proper pre-training seeding allowed contextual generalization improvements of up to 60%. These findings show us that when we design training pipelines, we need to ensure broad pre-training coverage of basic concepts, use mid-training to build robust priors, and then apply RL to explore just beyond the model&#39;s current limits.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.arxiv.org/abs/2512.07783?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=scaling-up-diffusion-language-models-to-100b"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="not-all-bits-are-equal-scale-depend">LLaDA2.0: Scaling Up Diffusion Language Models to 100B</h2><p class="paragraph" style="text-align:left;"><i>Bie et al. [</i>Ant Group, Renmin University of China, Zhejiang University, Westlake University, HongKong University of Science and Technology<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 757 </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> Diffusion LLM </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Today&#39;s powerful language models generate text one word at a time, which creates a bottleneck. This new approach aims to break that bottleneck by converting existing models into a different kind that can predict many words at once.</p><p class="paragraph" style="text-align:left;">It uses a <b>three-stage</b> training method. Instead of building a parallel model from scratch, which is very costly, the researchers start with a strong existing model trained for sequential generation. 
They then carefully retrain it using a process called Warmup-Stable-Decay.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/ab68b450-1058-461e-a10b-499e673dcf4d/CleanShot_2025-12-16_at_17.10.30_2x.png?t=1765885238"/><div class="image__source"><span class="image__source_text"><p>A schematic of the progressive training framework for transforming an AR model into a MDLM.</p></span></div></div><p class="paragraph" style="text-align:left;">First, in the Warmup phase, the model slowly learns to reconstruct small, masked blocks of text instead of just the next word. The block size gradually increases until the model is reconstructing entire sequences at once in the Stable phase. Finally, in the Decay phase, the model is tuned back to working with smaller blocks, which makes it much faster for practical use while keeping its new parallel skills.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/6f231383-cc3d-44fa-a5ac-c25a36e72964/CleanShot_2025-12-16_at_17.11.14_2x.png?t=1765885283"/></div><p class="paragraph" style="text-align:left;">To make this conversion stable and efficient, the team introduced important techniques. They use a special document-level attention mask during training. This prevents the model from getting confused by attending to unrelated text when multiple documents are packed together for efficiency, ensuring it learns clean, coherent reconstructions. For the final instruction-tuning phase, they also apply a method called complementary masking. This clever trick ensures nearly every token in the training data contributes to the learning signal in each step, speeding up training and improving the model&#39;s grasp of language.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/2f028488-c0e2-4da6-9447-9d6f59fef7d5/llada2_flash_main_bench.png?t=1765885133"/></div><p class="paragraph" style="text-align:left;">The final models, including a 100B-parameter version called LLaDA2.0-flash, show strong performance on reasoning, coding, and general knowledge benchmarks. It also shows a significant boost in inference speed, generating many more tokens per second than comparable auto-regressive models.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://github.com/inclusionAI/LLaDA2.0?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=scaling-up-diffusion-language-models-to-100b"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><h2 class="heading" style="text-align:left;" id="closing-the-train-test-gap-in-world">Closing the Train-Test Gap in World Models for Gradient-Based Planning</h2><p class="paragraph" style="text-align:left;"><i>Parthasarathy et al. 
[</i>Columbia University, New York University<i>]</i></p><p class="paragraph" style="text-align:left;"><span style="background-color:#e0e0e0;"><span style="color:rgb(255, 58, 58);font-size:0.6rem;"> ♥ 1.3k </span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span><span style="background-color:#e0e0e0;"><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> LLM</span></span><span style="color:rgb(44, 129, 229);font-size:0.6rem;"> </span></p><p class="paragraph" style="text-align:left;">Gradient-based planning world models can be used for intelligent robot control, but in practice, its performance has often fallen short of slower, search-based alternatives. This is because of train-test mismatch: these models learn to predict the next state from expert demonstrations, but are later used to optimize sequences of actions, which often leads them into unfamiliar and unreliable territory.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/be06feb2-b7bb-4f44-a4ec-ebcf6f9b84e2/524776366-2ad42535-1e5b-474a-9217-efba54f24b18.png?t=1765885371"/><div class="image__source"><span class="image__source_text"><p>Overview of our two proposed methods.</p></span></div></div><p class="paragraph" style="text-align:left;">To bridge this gap, the researchers developed two clever fine-tuning methods. Online World Modeling tackles the problem of the model venturing into unknown states during planning. It works by using a simulator to correct the trajectories that gradient-based planning produces, and then training the model on these new, corrected paths.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/7b1cab7c-1428-4b54-a83d-ff1998227a6d/CleanShot_2025-12-16_at_17.13.55_2x.png?t=1765885442"/></div><p class="paragraph" style="text-align:left;">This teaches the model to be accurate even for the non-expert actions it will encounter during optimization. Separately, Adversarial World Modeling focuses on smoothing the optimization landscape itself. It trains the model on deliberately perturbed versions of expert data, making the model more robust and creating a loss surface that is easier for gradient descent to navigate.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/fc1ac9ea-9aec-49d5-b827-5da29f8ebe37/CleanShot_2025-12-16_at_17.14.07_2x.png?t=1765885471"/></div><p class="paragraph" style="text-align:left;">These techniques significantly close the train-test performance gap. 
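To make the first idea concrete, here is a toy version of the online world-modeling loop on a one-dimensional system (all names and dynamics here are illustrative, not the paper&#39;s setup):</p><pre><code># Hedged toy sketch of online world modeling: plan with gradients through a learned
# dynamics model, let a simulator supply the corrected next states, and fine-tune
# the model on exactly those visited transitions. Not the authors' code.
import torch
import torch.nn as nn

simulator = lambda s, a: s + 0.1 * torch.tanh(a)              # ground-truth 1-D dynamics
world_model = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))
model_opt = torch.optim.Adam(world_model.parameters(), lr=1e-3)

def plan(s0, goal, horizon=5, steps=50):
    actions = torch.zeros(horizon, 1, requires_grad=True)
    plan_opt = torch.optim.Adam([actions], lr=0.1)
    for _ in range(steps):                                     # gradient-based planning
        s = s0
        for a in actions:
            s = world_model(torch.cat([s, a]))                 # imagined rollout
        loss = (s - goal).pow(2).sum()
        plan_opt.zero_grad(); loss.backward(); plan_opt.step()
    return actions.detach()

for _ in range(20):                                            # online world-modeling rounds
    s, goal = torch.zeros(1), torch.ones(1)
    for a in plan(s, goal):                                    # execute the planned actions
        s_next = simulator(s, a)                               # simulator corrects the trajectory
        pred = world_model(torch.cat([s, a]))
        model_loss = (pred - s_next).pow(2).mean()             # train on the states planning visits
        model_opt.zero_grad(); model_loss.backward(); model_opt.step()
        s = s_next.detach()
</code></pre><p class="paragraph" style="text-align:left;">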
In tests on object manipulation and navigation tasks, gradient-based planning with an adversarially fine-tuned model matched or exceeded the success rates of the powerful but computationally heavy Cross-Entropy Method (CEM) <b>using only 10% of the computation time</b>.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://arxiv.org/abs/2512.09929?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=scaling-up-diffusion-language-models-to-100b"><span class="button__text" style=""> Read Full Paper </span></a></div><div class="section" style="background-color:#222222;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><p class="paragraph" style="text-align:left;"></p></div><iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="true" class="youtube_embed" frameborder="0" height="100%" src="https://youtube.com/embed/pljoUcBniPQ" width="100%"></iframe></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=c9388796-3e8d-4b54-a8fb-58b8fbebc127&utm_medium=post_rss&utm_source=the_ai_timeline">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Aug~Nov AI Research Trend Report</title>
  <description>Basically recapping what I missed in the last 4 months </description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/fdf3146d-19cc-4b7e-96fb-b003bb0ccd1f/Q4_2025_research_trend_report.jpg" length="570534" type="image/jpeg"/>
  <link>https://mail.bycloud.ai/p/aug-nov-ai-research-trend-report</link>
  <guid isPermaLink="true">https://mail.bycloud.ai/p/aug-nov-ai-research-trend-report</guid>
  <pubDate>Thu, 11 Dec 2025 20:16:17 +0000</pubDate>
  <atom:published>2025-12-11T20:16:17Z</atom:published>
    <dc:creator>by cloud</dc:creator>
    <category><![CDATA[Premium Insights]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h2 class="heading" style="text-align:left;">Table of Contents</h2><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="#what-the-so-ta-is-doing" rel="noopener noreferrer nofollow">What the SoTA Is Doing</a></p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="#the-hybrid-attention-saga" rel="noopener noreferrer nofollow">The Hybrid Attention Saga</a></p></li></ul></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#what-other-so-ta-labs-are-doing" rel="noopener noreferrer nofollow">What other SoTA labs are doing</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#the-new-cool-architectural-ideas-th" rel="noopener noreferrer nofollow">The New Cool Architectural Ideas That I like</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#new-unique-ideas-thats-not-architec" rel="noopener noreferrer nofollow">New Unique Ideas that’s not architecturally bound</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#new-insights-that-you-should-be-awa" rel="noopener noreferrer nofollow">New Insights That You Should Be Aware</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#non-llm-research-bangers" rel="noopener noreferrer nofollow">Non-LLM Research Bangers</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#conclusionending-notes" rel="noopener noreferrer nofollow">Conclusion/ending notes</a></p></li></ul><hr class="content_break"><p class="paragraph" style="text-align:left;">Some of you might know that I was gone for 4 months. (August to November due to mandatory military service).</p><p class="paragraph" style="text-align:left;">For those of you who don’t, well that’s why I am recapping all the papers properly that I missed in the last 4 months here. </p><p class="paragraph" style="text-align:left;">This list of papers are what I found to have fascinating ideas, potentially pivotal, or discuss/propose critical updates to the current AI research landscape.</p><p class="paragraph" style="text-align:left;">The focus is as usual the LLM papers, but I do have some non-LLM papers too that are pretty big in their own field.</p><p class="paragraph" style="text-align:left;"></p><h1 class="heading" style="text-align:left;" id="what-the-so-ta-is-doing"><b>What the SoTA Is Doing</b></h1><h3 class="heading" style="text-align:left;" id="the-hybrid-attention-saga"><b>The Hybrid Attention Saga</b></h3><p class="paragraph" style="text-align:left;">In the 4 months I have been gone, the Chinese open source community has undergone a pretty active discussion surrounding hybrid attention. I think this tweet sums it up the best.</p><div class="image"><a class="image__link" href="https://x.com/jbhuang0604/status/1984051086140043661?s=20&utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=aug-nov-ai-research-trend-report" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/a3ddf6ad-bf20-407b-9934-faf08e925b51/image.png?t=1765042371"/></a></div><p class="paragraph" style="text-align:left;">So what was the timeline? 
And can we figure out what’s actually going on?</p><p class="paragraph" style="text-align:left;">While I was gone, optimism around hybrid attention shifted quickly. Because much of this work is open source, the discussion played out publicly through papers, blogs, and even X. For context, hybrid architectures are usually a combination of both linear attention & standard attention. For the sake of simplicity, I’ll only be covering the blogs and papers. X’s discussion was all over the place, without much empirical evidence to go on. </p><div class="paywall"><hr class="paywall__break"/><div class="paywall__content"><h2 class="paywall__header"> Subscribe to our premium insights to read more </h2><p class="paywall__description"> Become a paying subscriber to get access to this post and other subscriber-only content. </p><p class="paywall__links"><a class="paywall__upgrade_link" href="https://mail.bycloud.ai/upgrade?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=aug-nov-ai-research-trend-report">Upgrade</a> or <a class="paywall__login_link" href="https://mail.bycloud.ai/login?utm_source=mail.bycloud.ai&utm_medium=newsletter&utm_campaign=aug-nov-ai-research-trend-report">Sign In</a></p></div></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=7ccf05f6-593a-4c61-b375-7c732f619d7a&utm_medium=post_rss&utm_source=the_ai_timeline">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

  </channel>
</rss>
