<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Hello, World! System Design Newsletter</title>
    <description>Learn System Design through case studies of how big tech companies solved their own problems. Highly relevant for software engineers.</description>
    
    <link>https://hw.glich.co/</link>
    <atom:link href="https://rss.beehiiv.com/feeds/CM06WwHqni.xml" rel="self"/>
    
    <lastBuildDate>Wed, 17 Jun 2026 04:37:28 +0000</lastBuildDate>
    <pubDate>Wed, 17 Jun 2026 04:30:00 +0000</pubDate>
    <atom:published>2026-06-17T04:30:00Z</atom:published>
    <atom:updated>2026-06-17T04:37:28Z</atom:updated>
    
      <category>Programming</category>
      <category>Software Engineering</category>
      <category>Technology</category>
    <copyright>Copyright 2026, Hello, World! System Design Newsletter</copyright>
    
    <image>
      <url>https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/publication/logo/8715568a-af68-4718-adfc-016dbffd2b09/Group_55-1.png</url>
      <title>Hello, World! System Design Newsletter</title>
      <link>https://hw.glich.co/</link>
    </image>
    
    <docs>https://www.rssboard.org/rss-specification</docs>
    <generator>beehiiv</generator>
    <language>en-us</language>
    <webMaster>support@beehiiv.com (Beehiiv Support)</webMaster>

      <item>
  <title>What is an OSI Model?</title>
  <description>Demystify network communication with this practical guide to the OSI model. Understand all 7 layers with real-world examples and analogies.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/3424aaf7-dd05-4b8a-aef8-10ad47678f06/article-image-bd77aaaf-98df-49b9-b020-48e7611903f6.jpg" length="63664" type="image/jpeg"/>
  <link>https://hw.glich.co/p/osi-model</link>
  <guid isPermaLink="true">https://hw.glich.co/p/osi-model</guid>
  <pubDate>Wed, 17 Jun 2026 04:30:00 +0000</pubDate>
  <atom:published>2026-06-17T04:30:00Z</atom:published>
    <dc:creator>Rohit Lakhotia</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><hr class="content_break"><hr class="content_break"><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/adc025e0-5db4-4533-8c61-4864283a8cd2/ad1_15-kubernetes-metrics-ebook_1200x628_241229__1_.png?t=1781636734"/></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://r2trck.com/hello-world-datadog-5?utm_medium=newsletter&utm_source=hello-world-r&utm_campaign=dg-content-ebook-DevOpsKubernetes-infra-containers-ww-en-701VY00000F5QHNYA3&utm_content=paid&utm_term=1-1-2026" target="_blank" rel="noopener noreferrer nofollow"><b>15 Kubernetes Metrics Every DevOps Team Should Track</b></a></p><p class="paragraph" style="text-align:left;">Enhance Your Kubernetes Strategy with These Essential Metrics</p><p class="paragraph" style="text-align:left;">Download our comprehensive eBook on optimizing Kubernetes performance. This guide delves into crucial cluster state, resource, and control plane metrics, highlighting 15 of the most essential metrics your DevOps team should be tracking. Learn how to gain complete visibility into your containerized environments and optimize Kubernetes performance with Datadog.</p><hr class="content_break"><h2 class="heading" style="text-align:left;" id="what-does-osi-stand-for">What Does OSI Stand For?</h2><p class="paragraph" style="text-align:left;">OSI stands for <b>Open Systems Interconnection</b>. The name itself reveals its purpose: to create an <i>open</i> standard for different computer systems to connect and communicate with one another, regardless of their underlying architecture. 
It was developed by the International Organization for Standardization (<a class="link" href="https://www.iso.org/home.html?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-an-osi-model" target="_blank" rel="noopener noreferrer nofollow">ISO</a>) to guide the creation of universal networking protocols.</p><h3 class="heading" style="text-align:left;" id="the-7-layers-of-the-osi-model-at-a-">The 7 Layers of the OSI Model at a Glance</h3><div style="padding:14px 20px 14px;"><table class="bh__table" width="100%" style="border-collapse:collapse;"><tr class="bh__table_row"><th class="bh__table_header" width="25%"><p class="paragraph" style="text-align:left;">Layer Number</p></th><th class="bh__table_header" width="25%"><p class="paragraph" style="text-align:left;">Layer Name</p></th><th class="bh__table_header" width="25%"><p class="paragraph" style="text-align:left;">Primary Function</p></th><th class="bh__table_header" width="25%"><p class="paragraph" style="text-align:left;">Simple Analogy (Sending a Letter)</p></th></tr><tr class="bh__table_row"><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">7</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Application</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Provides network services to end-user applications.</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Writing the actual content of your letter.</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">6</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Presentation</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Translates, encrypts, and compresses data.</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" 
style="text-align:left;">Placing the letter in a standard envelope.</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">5</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Session</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Manages communication sessions between devices.</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Addressing the letter to a specific person.</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">4</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Transport</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Ensures reliable data delivery and handles flow control.</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Choosing a mail service (e.g., certified, express).</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">3</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Network</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Routes data packets across different networks.</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">The postal service routing the letter across cities.</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">2</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Data Link</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Manages data transfer between nodes on the same 
network.</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">The local mail carrier delivering to the correct street.</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">1</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Physical</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Transmits raw data bits over a physical medium.</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">The physical truck or plane carrying the mail.</p></td></tr></table></div><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/67d12855-283f-404f-a36b-24ce9c4ad4c9/cc1346ef-9656-4a3e-8f1b-dcc84320d224.jpg?t=1781635622"/></div><h2 class="heading" style="text-align:left;" id="diving-into-the-upper-layers-where-">Diving Into the Upper Layers: Where Applications Talk</h2><p class="paragraph" style="text-align:left;">Our trip through the <b>OSI model</b> kicks off at the very top of the stack, with the layers closest to you, the end user. These three layers (Application, Presentation, and Session) are often lumped together as the &quot;upper layers.&quot; They&#39;re almost always handled by software and are all about getting data from your applications ready for its big journey across the network. Their job is to make sure the data is understandable, secure, and part of a smooth, organized conversation.</p><h3 class="heading" style="text-align:left;" id="layer-7-the-application-layer">Layer 7: The Application Layer</h3><p class="paragraph" style="text-align:left;">The Application Layer is the layer closest to the user, but it often gets confused with the application itself, like Chrome or Outlook. 
Instead, it&#39;s the set of protocols these applications use to communicate with the network. When you input a URL or send an email, you&#39;re using a Layer 7 protocol, allowing software to access network services directly.</p><p class="paragraph" style="text-align:left;"><b>Examples of Application Layer Protocols:</b></p><ul><li><p class="paragraph" style="text-align:left;"><b>HTTP/S:</b> Used by browsers to request and display web pages.</p></li><li><p class="paragraph" style="text-align:left;"><b>SMTP:</b> Used by email clients to send messages to mail servers.</p></li><li><p class="paragraph" style="text-align:left;"><b>FTP:</b> Designed for transferring files between a client and server.</p></li><li><p class="paragraph" style="text-align:left;"><b>DNS:</b> Translates domain names into IP addresses when you enter a URL.</p></li></ul><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;"><b>Key Point:</b> The Application Layer provides the protocols that enable applications to send and receive information, connecting users with the network.</p><figcaption class="blockquote__byline"></figcaption></blockquote></div><h3 class="heading" style="text-align:left;" id="layer-6-the-presentation-layer">Layer 6: The Presentation Layer</h3><p class="paragraph" style="text-align:left;">Once the Application Layer prepares its message, it passes it to the Presentation Layer, which acts as a universal translator. This layer ensures data from one system&#39;s application layer is readable by another&#39;s. 
Its main tasks are:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Data Translation and Formatting:</b> Converts data into a standard format, like changing EBCDIC to ASCII, ensuring consistency across systems.</p></li><li><p class="paragraph" style="text-align:left;"><b>Data Compression:</b> Reduces data size for efficient transmission, with decompression on the receiving end.</p></li><li><p class="paragraph" style="text-align:left;"><b>Encryption and Decryption:</b> Secures data by encrypting it at the sender&#39;s end and decrypting it at the receiver&#39;s, using protocols like SSL and TLS.</p></li></ol><p class="paragraph" style="text-align:left;"><b>Example:</b> When accessing your bank&#39;s website (<code>https://</code>), TLS encryption at Layer 6 secures your login credentials.</p><h3 class="heading" style="text-align:left;" id="layer-5-the-session-layer">Layer 5: The Session Layer</h3><p class="paragraph" style="text-align:left;">The Session Layer acts as the conversation manager, organizing communication &quot;sessions&quot; between devices. Unlike lower layers that handle raw data transfer, Layer 5 sets up, manages, and dismantles these sessions. For instance, when accessing an online bank account, it maintains your session as you navigate between different sections, preventing repeated logins.</p><p class="paragraph" style="text-align:left;">Additionally, the Session Layer can implement <b>checkpoints</b> during data transfers. If a download, like a 2GB file, is interrupted at 1.5GB, it can resume from the last checkpoint rather than restarting. 
Checkpointing complements the reliable delivery handled at the Transport Layer, which is explained further in our guide on <a class="link" href="https://hw.glich.co/p/what-is-tcp?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-an-osi-model" target="_blank" rel="noopener noreferrer nofollow">what is TCP</a>.</p><h2 class="heading" style="text-align:left;" id="understanding-the-core-transport-an">Understanding the Core Transport and Network Layers</h2><p class="paragraph" style="text-align:left;">As our data makes its way down from the application-focused upper layers, it hits the very heart of the <b>OSI model</b>: the Transport and Network layers. These two layers are the engine room of network communication. They work hand in hand to make sure your data doesn&#39;t just get to the right place, but that it also arrives reliably and in one piece.</p><p class="paragraph" style="text-align:left;">Think of them as the logisticians and traffic directors of the internet.</p><h3 class="heading" style="text-align:left;" id="layer-4-the-transport-layer">Layer 4: The Transport Layer</h3><p class="paragraph" style="text-align:left;">The Transport Layer, or Layer 4, ensures effective end-to-end communication and data integrity. With TCP, it divides data from higher layers into smaller units called <b>segments</b> and reassembles them in order on the receiving side; with UDP, each message travels as an independent <b>datagram</b> with no ordering guarantee.</p><p class="paragraph" style="text-align:left;">Think of Layer 4 as a shipping manager deciding on the delivery method: whether a package needs confirmation upon receipt or can be quickly dropped off. This decision is crucial, leading to the two key protocols of the Transport Layer:</p><p class="paragraph" style="text-align:left;"><b>Transport Layer Protocols:</b></p><ul><li><p class="paragraph" style="text-align:left;"><b>TCP (Transmission Control Protocol):</b> Acts like &quot;certified mail&quot;: slower but reliable. 
It establishes a connection via a <b>three-way handshake</b>, numbers each segment, acknowledges what arrives, and retransmits lost segments. Ideal for tasks where every byte must arrive intact, such as web pages, emails, and file downloads.</p></li><li><p class="paragraph" style="text-align:left;"><b>UDP (User Datagram Protocol):</b> Functions as a speedy, connectionless &quot;fire-and-forget&quot; method. It sends datagrams without establishing a connection or verifying arrival, reducing overhead. This is perfect for applications where speed is crucial, like video streaming, online gaming, or DNS lookups.</p></li></ul><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;"><b>Key Insight:</b> The choice between TCP and UDP at Layer 4 is a fundamental trade-off between reliability and speed. TCP provides reliable, ordered delivery but comes with higher overhead, while UDP prioritizes low latency at the risk of some data loss. You can explore a detailed comparison in our article covering <b><a class="link" href="https://hw.glich.co/p/what-is-udp?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-an-osi-model" target="_blank" rel="noopener noreferrer nofollow">what is UDP</a></b>.</p><figcaption class="blockquote__byline"></figcaption></blockquote></div><h3 class="heading" style="text-align:left;" id="layer-3-the-network-layer">Layer 3: The Network Layer</h3><p class="paragraph" style="text-align:left;">Below the Transport Layer is Layer 3, the Network Layer, which handles routing data packets from a source to a destination network. It relies on logical addressing: each device on a network is identified by an IP address. The Network Layer adds a header with source and destination IPs to segments, forming packets.</p><p class="paragraph" style="text-align:left;">For instance, when accessing a website, your computer (e.g., IP <code>192.168.1.10</code>) sends a request to the server (e.g., IP <code>104.18.30.123</code>). 
Routers forward packets using the destination IP until they reach the server, a process known as routing. If a route gets congested or fails, routers dynamically find alternate paths to ensure data transmission continues smoothly. For more on network addresses and routing, explore topics like <a class="link" href="https://www.mushroomnetworks.com/infographics/ipv4-vs-ipv6-and-nat/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-an-osi-model" target="_blank" rel="noopener noreferrer nofollow">IPv4, IPv6, and NAT</a>.</p><h2 class="heading" style="text-align:left;" id="navigating-the-lower-layers-of-hard">Navigating the Lower Layers of Hardware and Links</h2><p class="paragraph" style="text-align:left;">As data travels down from the core layers, it finally hits the world of hardware and local connections. Here, abstract ideas like routing and transport get very real. We&#39;re now talking about how to get data between machines that are physically on the same network. This is the realm of MAC addresses, network switches, and the raw electrical signals that form the bedrock of the entire <b>OSI model</b>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/163a8b3c-a564-4bdc-9c6f-2c1da1ebafb7/df29c518-6e6b-4938-abef-0e82843cd1f5.jpg?t=1781635622"/></div><p class="paragraph" style="text-align:left;">We&#39;ll kick things off with Layer 2, which serves as the crucial bridge between the logical addressing of the Network Layer and the raw physical transmission happening just below it.</p><h3 class="heading" style="text-align:left;" id="layer-2-the-data-link-layer">Layer 2: The Data Link Layer</h3><p class="paragraph" style="text-align:left;">The Data Link Layer facilitates direct data transfer between devices on the same local network. It encapsulates packets from Layer 3 into a <b>frame</b>. 
This layer relies on physical addressing: the <b>MAC (Media Access Control) address</b> acts like an apartment number within a building, while the IP address plays the role of the building&#39;s street address.</p><p class="paragraph" style="text-align:left;">Key responsibilities include:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Framing:</b> Encapsulates packets with a header containing source and destination MAC addresses.</p></li><li><p class="paragraph" style="text-align:left;"><b>Physical Addressing:</b> Uses MAC addresses to identify devices on a local network.</p></li><li><p class="paragraph" style="text-align:left;"><b>Error Control:</b> Checks each frame for corruption during transmission so that damaged frames can be discarded.</p></li></ul><h3 class="heading" style="text-align:left;" id="layer-1-the-physical-layer">Layer 1: The Physical Layer</h3><p class="paragraph" style="text-align:left;">The Physical Layer, the base of the OSI model, focuses solely on transmitting raw bits (1s and 0s) across a physical medium, with no awareness of packets, frames, or addresses. It defines the physical and electrical standards for signal transmission.</p><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;"><b>Key Takeaway:</b> The Physical Layer concerns the hardware that transmits data, such as cables, connectors, and antennas.</p><figcaption class="blockquote__byline"></figcaption></blockquote></div><p class="paragraph" style="text-align:left;">Key components include cables, network interface cards, and physical ports. 
For businesses, resources like <a class="link" href="https://infrazen.tech/what-is-a-leased-line/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-an-osi-model" target="_blank" rel="noopener noreferrer nofollow">leased lines</a> provide reliable Layer 1 infrastructure.</p><p class="paragraph" style="text-align:left;"><b>Layer 1 Technology Examples:</b></p><ul><li><p class="paragraph" style="text-align:left;"><b>Cabling:</b> Involves Ethernet, fiber optic, and coaxial cables, detailing pin layouts and max lengths.</p></li><li><p class="paragraph" style="text-align:left;"><b>Radio Frequencies:</b> Defines radio waves for wireless data transmission.</p></li><li><p class="paragraph" style="text-align:left;"><b>Electrical Signals:</b> Specifies voltage for data on copper wires and light pulses for fiber optics.</p></li></ul><p class="paragraph" style="text-align:left;"><b>Hubs vs. Switches: A Layer 1 vs. Layer 2 Story</b></p><p class="paragraph" style="text-align:left;">To understand the difference between the bottom two layers, compare a hub (Layer 1) with a switch (Layer 2).</p><ul><li><p class="paragraph" style="text-align:left;">A <b>hub</b> is basic and repeats signals to all ports without understanding addresses, operating at the Physical Layer.</p></li><li><p class="paragraph" style="text-align:left;">A <b>switch</b> is more advanced, reading MAC addresses to send data to the correct port, enhancing network efficiency and security.</p></li></ul><p class="paragraph" style="text-align:left;">This distinction is crucial for network design and troubleshooting. 
For more on how human-readable names map to IP addresses, see our guide on <a class="link" href="https://hw.glich.co/p/what-is-dns-and-how-does-it-work?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-an-osi-model" target="_blank" rel="noopener noreferrer nofollow">DNS</a>.</p><h2 class="heading" style="text-align:left;" id="tracing-data-flow-through-the-osi-m">Tracing Data Flow Through the OSI Model</h2><p class="paragraph" style="text-align:left;">To understand how the <b>OSI model</b> layers function together, let&#39;s look at the journey of a single piece of data, an email, from the moment you hit &quot;send&quot; until it appears in your friend&#39;s inbox.</p><h3 class="heading" style="text-align:left;" id="the-journey-encapsulation">The Journey: Encapsulation</h3><p class="paragraph" style="text-align:left;">Your data begins at the top of the OSI stack, moving downward. Each layer adds its own control information, usually a header, to prepare it for the network.</p><ul><li><p class="paragraph" style="text-align:left;"><b>Layers 7, 6, and 5 (Application, Presentation, Session):</b> Your email client uses SMTP to send &quot;See you at 8!&quot; The Presentation Layer standardizes and possibly encrypts the text, while the Session Layer establishes the connection with the email server.</p></li><li><p class="paragraph" style="text-align:left;"><b>Layer 4 (Transport):</b> The Transport Layer uses <b>TCP</b> to segment the message into smaller parts, adding port numbers to direct the data to the email application.</p></li><li><p class="paragraph" style="text-align:left;"><b>Layer 3 (Network):</b> The Network Layer wraps each segment in a <b>packet</b>, attaching IP addresses for routing.</p></li><li><p class="paragraph" style="text-align:left;"><b>Layer 2 (Data Link):</b> The packet is enclosed within an <b>Ethernet frame</b>, which includes MAC addresses and error-checking data.</p></li><li><p class="paragraph" style="text-align:left;"><b>Layer 1 (Physical):</b> The frame is 
converted into bits and transmitted as electrical signals or radio waves.</p></li></ul><p class="paragraph" style="text-align:left;">Your email is now a series of signals rapidly departing your device.</p><h3 class="heading" style="text-align:left;" id="across-the-internet-and-back-up-dee">Across the Internet and Back Up: De-encapsulation</h3><p class="paragraph" style="text-align:left;">Once signals reach your local router, it reverses the process: it removes the Layer 2 frame to read the Layer 3 IP packet, determines the best path, and forwards it. This hop-by-hop journey continues until the packets reach your friend&#39;s network.</p><p class="paragraph" style="text-align:left;">Upon arrival at the recipient&#39;s computer, de-encapsulation occurs, reversing the initial process:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Layer 1:</b> Physical signals become a bitstream.</p></li><li><p class="paragraph" style="text-align:left;"><b>Layer 2:</b> Bits form frames, errors are checked, and Ethernet headers/trailers are removed.</p></li><li><p class="paragraph" style="text-align:left;"><b>Layer 3:</b> IP packets are unwrapped, the destination IP is confirmed, and the IP header is discarded.</p></li><li><p class="paragraph" style="text-align:left;"><b>Layer 4:</b> TCP segments are reordered and original data is recreated, removing the TCP header.</p></li><li><p class="paragraph" style="text-align:left;"><b>Layers 5, 6, and 7:</b> The connection is managed, data is decrypted/formatted, and the message is delivered to the email client.</p></li></ol><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;"><b>The Big Picture:</b> This process, from encapsulation to de-encapsulation, occurs in milliseconds, showcasing the OSI model&#39;s layered abstraction. 
Each layer performs its role without needing to know how the others work.</p><figcaption class="blockquote__byline"></figcaption></blockquote></div><h2 class="heading" style="text-align:left;" id="the-osi-vs-the-tcpip-model">The OSI vs. The TCP/IP Model</h2><p class="paragraph" style="text-align:left;">Think of the two models like this: TCP/IP is the trusty, street-legal car that millions of people drive every single day. The OSI model is the incredibly detailed engineering schematic that a master mechanic uses to understand how every component, from the engine timing to the wiring, works together. You don&#39;t need the schematic to drive the car, but if you want to diagnose a tricky problem or build a new car from scratch, that schematic is priceless.</p><p class="paragraph" style="text-align:left;">The OSI model’s true value became undeniable during the internet&#39;s explosive growth. Its layered approach provided a clear guide for creating new, interoperable technologies. You can dive deeper into how this <a class="link" href="https://en.wikipedia.org/wiki/OSI_model?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-an-osi-model" target="_blank" rel="noopener noreferrer nofollow">conceptual framework became foundational to networking</a> on Wikipedia.</p><h3 class="heading" style="text-align:left;" id="comparing-osi-model-layers-to-tcpip">Comparing OSI Model Layers to TCP/IP Model Layers</h3><p class="paragraph" style="text-align:left;">The best way to see the relationship between the two is to map them side by side. 
The OSI model’s seven layers offer a fine-grained view, while the TCP/IP model’s four layers group several functions together.</p><p class="paragraph" style="text-align:left;">This table shows how they line up.</p><div style="padding:14px 20px 14px;"><table class="bh__table" width="100%" style="border-collapse:collapse;"><tr class="bh__table_row"><th class="bh__table_header" width="33%"><p class="paragraph" style="text-align:left;">OSI Model Layer</p></th><th class="bh__table_header" width="33%"><p class="paragraph" style="text-align:left;">TCP/IP Model Layer</p></th><th class="bh__table_header" width="33%"><p class="paragraph" style="text-align:left;">Key Protocols and Technologies</p></th></tr><tr class="bh__table_row"><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">7. Application</p></td><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">4. Application</p></td><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">HTTP, FTP, SMTP, DNS</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">6. Presentation</p></td><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;"></p></td><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">SSL/TLS, JPEG, ASCII</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">5. Session</p></td><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;"></p></td><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">NetBIOS, RPC</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">4. Transport</p></td><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">3. 
Transport</p></td><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">TCP, UDP</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">3. Network</p></td><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">2. Internet</p></td><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">IP, ICMP</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">2. Data Link</p></td><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">1. Network Access</p></td><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">Ethernet, PPP, Switches</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">1. Physical</p></td><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;"></p></td><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">Hubs, Cables, Wi-Fi</p></td></tr></table></div><p class="paragraph" style="text-align:left;">As you can see, TCP/IP bundles the OSI model&#39;s top three layers (Application, Presentation, and Session) into a single Application layer. It does the same at the bottom, combining the Physical and Data Link layers into its Network Access layer.</p><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;"><b>Key Takeaway:</b> You really need to know both. TCP/IP is the model for how things <i>actually work</i> in the real world. 
But the <b>OSI model</b> is a far better tool for learning, teaching, and systematically troubleshooting network problems because it breaks everything down so logically.</p><figcaption class="blockquote__byline"></figcaption></blockquote></div><hr class="content_break"></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=fb0a9bed-684f-4467-a52d-1a43b14935d3&utm_medium=post_rss&utm_source=hello_world_system_design_newsletter">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How Uber Standardized Mobile Analytics (Without Slowing Down Teams)</title>
  <description>Uber standardized mobile analytics by moving event logic to the platform, automating metadata, and ensuring consistent, reliable data across apps.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/60caf0a1-5cc0-4e21-9651-a515051d7526/imresizer-_-_visual_selection__5_.png" length="139422" type="image/png"/>
  <link>https://hw.glich.co/p/how-uber-standardized-mobile-analytics-without-slowing-down-teams</link>
  <guid isPermaLink="true">https://hw.glich.co/p/how-uber-standardized-mobile-analytics-without-slowing-down-teams</guid>
  <pubDate>Mon, 15 Jun 2026 04:30:00 +0000</pubDate>
  <atom:published>2026-06-15T04:30:00Z</atom:published>
    <dc:creator>Rohit Lakhotia</dc:creator>
    <category><![CDATA[Uber]]></category>
    <category><![CDATA[System Design]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;"></p><div class="section" style="background-color:#FFFFFF;border-color:#fd5621;border-radius:4px;border-style:solid;border-width:1px;margin:16.0px 16.0px 16.0px 16.0px;padding:16.0px 16.0px 16.0px 16.0px;"><p class="paragraph" style="text-align:left;"><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;font-size:16px;"><i>Welcome to </i></span><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;"><i><b><a class="link" href="https://hw.glich.co/subscribe?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=how-uber-standardized-mobile-analytics-without-slowing-down-teams" target="_blank" rel="noopener noreferrer nofollow">Hello World</a></b></i></span><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;font-size:16px;"><i>, we help software engineers learn the art of building scalable and resilient systems.</i></span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;font-size:16px;"><i>You can also check out: </i></span><b><a class="link" href="https://scaleengineer.com/blog/when-microservices-get-messy-how-api-federation-brings-order?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=how-uber-standardized-mobile-analytics-without-slowing-down-teams" target="_blank" rel="noopener noreferrer nofollow">When Microservices Get Messy: How API Federation Brings Order</a></b></p></div><hr class="content_break"><hr class="content_break"><div class="section" style="background-color:transparent;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">Table of Contents</h2><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="#where-things-started-breaking" rel="noopener noreferrer nofollow">Where Things Started Breaking</a></p></li><li><p class="paragraph" 
style="text-align:left;"><a class="link" href="#the-key-insight-the-platform-should" rel="noopener noreferrer nofollow">The Key Insight: The Platform Should Own Analytics</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#step-1-standardizing-events-across-" rel="noopener noreferrer nofollow">Step 1: Standardizing Events Across the App</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#step-2-separating-analytics-from-ui" rel="noopener noreferrer nofollow">Step 2: Separating Analytics from UI Logic</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#step-3-moving-analytics-into-ui-com" rel="noopener noreferrer nofollow">Step 3: Moving Analytics Into UI Components</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#step-4-standardizing-metadata-the-h" rel="noopener noreferrer nofollow">Step 4: Standardizing Metadata (The Hidden Problem …</a></p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="#1-app-level-metadata" rel="noopener noreferrer nofollow">1. App-Level Metadata</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#2-event-level-metadata" rel="noopener noreferrer nofollow">2. 
Event-Level Metadata</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#standardizing-ui-surfaces" rel="noopener noreferrer nofollow">Standardizing UI Surfaces</a></p></li></ul></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#step-5-preventing-data-loss-with-sa" rel="noopener noreferrer nofollow">Step 5: Preventing Data Loss with Sampling</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#step-6-testing-before-full-rollout" rel="noopener noreferrer nofollow">Step 6: Testing Before Full Rollout</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#step-7-scaling-across-the-organizat" rel="noopener noreferrer nofollow">Step 7: Scaling Across the Organization</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#impact-cleaner-data-faster-developm" rel="noopener noreferrer nofollow">Impact: Cleaner Data, Faster Development</a></p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="#for-data-teams" rel="noopener noreferrer nofollow">For Data Teams</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#for-engineers" rel="noopener noreferrer nofollow">For Engineers</a></p></li></ul></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#whats-next-component-based-analytic" rel="noopener noreferrer nofollow">What’s Next: Component-Based Analytics</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#takeaways" rel="noopener noreferrer nofollow">Takeaways</a></p></li></ul></div><p class="paragraph" style="text-align:left;">When you use an app like Uber, every interaction matters. 
A tap, a scroll, a screen view: all of these generate signals that help teams understand:</p><ul><li><p class="paragraph" style="text-align:left;">What users are doing</p></li><li><p class="paragraph" style="text-align:left;">Where they’re facing friction</p></li><li><p class="paragraph" style="text-align:left;">Which features are actually working</p></li></ul><p class="paragraph" style="text-align:left;">These signals power product decisions, A/B experimentation, ML-driven recommendations, and personalization. But collecting this data at scale? That’s where things get complicated. Let’s look at how Uber handled it.</p><h2 class="heading" style="text-align:left;" id="how-mobile-analytics-works-at-a-hig">How Mobile Analytics Works (At a High Level)</h2><p class="paragraph" style="text-align:left;">Before diving into the problem, let’s quickly understand how analytics works inside Uber.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/6c4f06f7-5398-4b42-9f9f-4fe94b5d1672/image.png?t=1779365664"/><div class="image__source"><span class="image__source_text"><p>Source: <a class="link" href="https://www.uber.com/in/en/blog/how-uber-standardized-mobile-analytics/?uclick_id=4c49f26d-24a9-41ec-902b-e3aae76c3e64&utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=how-uber-standardized-mobile-analytics-without-slowing-down-teams" target="_blank" rel="noopener noreferrer nofollow">How Uber Standardized Mobile Analytics for Cross-Platform Insights</a></p></span></div></div><p class="paragraph" style="text-align:left;">Don’t worry if the diagram above looks like too much for now; let me simplify it for you. 
</p><p class="paragraph" style="text-align:left;">Here is the flow, end to end:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">Engineers define an event (along with metadata)</p></li><li><p class="paragraph" style="text-align:left;">It’s added to the UI component</p></li><li><p class="paragraph" style="text-align:left;">The event gets triggered when the user interacts</p></li><li><p class="paragraph" style="text-align:left;">It’s batched and sent to backend systems</p></li><li><p class="paragraph" style="text-align:left;">Teams use this data for insights</p></li></ol><p class="paragraph" style="text-align:left;">This pipeline is critical for monitoring feature health, understanding user journeys, and powering ML-driven recommendations. So far, everything sounds clean. But systems don’t fail at small scale; they fail when multiple teams start building on top of them.</p><h2 class="heading" style="text-align:left;" id="where-things-started-breaking">Where Things Started Breaking</h2><p class="paragraph" style="text-align:left;">Even though Uber had a structured system, it didn’t scale well. The core issue? <b>Every team handled analytics differently. </b>The problem wasn’t the pipeline. 
It was <b>how teams were using it</b>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b3e01603-baf6-4aea-9434-e3cc1515fb96/image.png?t=1779365792"/></div><p class="paragraph" style="text-align:left;">Each team:</p><ul><li><p class="paragraph" style="text-align:left;">Defined events differently</p></li><li><p class="paragraph" style="text-align:left;">Logged data in their own way</p></li><li><p class="paragraph" style="text-align:left;">Added metadata manually</p></li><li><p class="paragraph" style="text-align:left;">Created custom events for everything</p></li></ul><p class="paragraph" style="text-align:left;">Over time, this created chaos:</p><ul><li><p class="paragraph" style="text-align:left;">No standard definition of events</p></li><li><p class="paragraph" style="text-align:left;">Missing analytics in shared UI components</p></li><li><p class="paragraph" style="text-align:left;">Duplicate and inconsistent metadata</p></li><li><p class="paragraph" style="text-align:left;">Over 40% of events became generic “custom logs”</p></li><li><p class="paragraph" style="text-align:left;">Disabled events were silently dropped</p></li></ul><p class="paragraph" style="text-align:left;">Now think about the impact. If data is inconsistent, insights become unreliable, and if insights are unreliable, decisions become risky. This wasn’t just a data problem. It was a <b>product and business problem</b>.</p><h2 class="heading" style="text-align:left;" id="the-key-insight-the-platform-should">The Key Insight: The Platform Should Own Analytics</h2><p class="paragraph" style="text-align:left;">Uber realized something very important:</p><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;">If every team defines analytics differently, the system will never be consistent. 
As long as analytics is owned by feature teams, consistency is impossible.</p><figcaption class="blockquote__byline"> Because every team optimizes for speed, no one optimizes for global consistency. </figcaption></blockquote></div><p class="paragraph" style="text-align:left;">So they made a bold shift: moving analytics logic from feature teams into the <b>platform. </b>This changed the model completely:</p><ul><li><p class="paragraph" style="text-align:left;">Before, teams decided how to log events. Now, the platform defines the rules and teams just use them.</p></li><li><p class="paragraph" style="text-align:left;">Engineers no longer decide “how to log”</p></li><li><p class="paragraph" style="text-align:left;">They just use standardized tools</p></li></ul><h2 class="heading" style="text-align:left;" id="step-1-standardizing-events-across-">Step 1: Standardizing Events Across the App</h2><p class="paragraph" style="text-align:left;">Uber started by standardizing the most common user interactions. It defined three core event types, each with a clear definition:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Tap</b>: when a user interacts with a UI element</p></li><li><p class="paragraph" style="text-align:left;"><b>Impression</b>: when ≥50% of a view is visible for 500ms</p></li><li><p class="paragraph" style="text-align:left;"><b>Scroll</b>: when scrolling starts and stops</p></li></ul><p class="paragraph" style="text-align:left;">This might sound simple, but it’s huge: earlier, each team had its own definition of “impression”. Now there is one definition across the entire system.</p><h2 class="heading" style="text-align:left;" id="step-2-separating-analytics-from-ui">Step 2: Separating Analytics from UI Logic</h2><p class="paragraph" style="text-align:left;">Another major improvement was introducing <b>AnalyticsBuilder</b>. 
Instead of mixing analytics inside UI code:</p><ul><li><p class="paragraph" style="text-align:left;">AnalyticsBuilder handles when events should fire</p></li><li><p class="paragraph" style="text-align:left;">UI components focus only on rendering</p></li></ul><p class="paragraph" style="text-align:left;">It tracks lifecycle events like <b>View appearing </b>and<b> View disappearing </b>and decides when to trigger analytics events. This makes the system more reusable, testable, and consistent.</p><h2 class="heading" style="text-align:left;" id="step-3-moving-analytics-into-ui-com">Step 3: Moving Analytics Into UI Components</h2><p class="paragraph" style="text-align:left;">Earlier, engineers manually added analytics to each feature. Now Uber has instrumented <b>50+ common UI components</b>, so things like buttons, lists, and views automatically emit analytics events. </p><p class="paragraph" style="text-align:left;">This means engineers don’t need to “remember” analytics anymore. Analytics becomes <b>default behavior</b>, not manual effort.</p><h2 class="heading" style="text-align:left;" id="step-4-standardizing-metadata-the-h">Step 4: Standardizing Metadata (The Hidden Problem)</h2><p class="paragraph" style="text-align:left;">Another major issue was metadata inconsistency.</p><p class="paragraph" style="text-align:left;">Different teams:</p><ul><li><p class="paragraph" style="text-align:left;">Logged the same data differently</p></li><li><p class="paragraph" style="text-align:left;">Used different naming conventions</p></li><li><p class="paragraph" style="text-align:left;">Repeated the same values</p></li></ul><p class="paragraph" style="text-align:left;">Uber fixed this by introducing two layers of metadata.</p><h3 class="heading" style="text-align:left;" id="1-app-level-metadata">1. 
App-Level Metadata</h3><p class="paragraph" style="text-align:left;">Data common to the entire app (like location or restaurant ID) is defined once and automatically attached to every event.</p><h3 class="heading" style="text-align:left;" id="2-event-level-metadata">2. Event-Level Metadata</h3><p class="paragraph" style="text-align:left;">Things like scroll direction, list index, and view position are automatically captured by the system.</p><h3 class="heading" style="text-align:left;" id="standardizing-ui-surfaces">Standardizing UI Surfaces</h3><p class="paragraph" style="text-align:left;">Every UI component is also categorized:</p><ul><li><p class="paragraph" style="text-align:left;">Buttons → BUTTON</p></li><li><p class="paragraph" style="text-align:left;">Containers → CONTAINER_VIEW</p></li><li><p class="paragraph" style="text-align:left;">Sliders → SLIDER</p></li></ul><p class="paragraph" style="text-align:left;">This ensures that every event clearly indicates where it came from.</p><h2 class="heading" style="text-align:left;" id="step-5-preventing-data-loss-with-sa">Step 5: Preventing Data Loss with Sampling</h2><p class="paragraph" style="text-align:left;">Earlier, if an event was disabled or unmapped, it was completely dropped, which meant losing that data forever.</p><p class="paragraph" style="text-align:left;">Uber fixed this by introducing <b>sampling</b>: 0.1% of sessions log <i>all events</i>, even disabled or unmapped ones.</p><p class="paragraph" style="text-align:left;">This creates a safety net so you never fully lose visibility into your data.</p><h2 class="heading" style="text-align:left;" id="step-6-testing-before-full-rollout">Step 6: Testing Before Full Rollout</h2><p class="paragraph" style="text-align:left;">Before scaling, Uber ran a pilot.</p><p class="paragraph" style="text-align:left;">They:</p><ul><li><p class="paragraph" style="text-align:left;">Ran old and new systems in parallel</p></li><li><p class="paragraph" 
style="text-align:left;">Logged events using both old and new systems</p></li><li><p class="paragraph" style="text-align:left;">Compared outputs</p></li><li><p class="paragraph" style="text-align:left;">Validated consistency and correctness</p></li></ul><p class="paragraph" style="text-align:left;">They checked event volume, metadata accuracy and behavioural correctness.</p><p class="paragraph" style="text-align:left;">This helped uncover:</p><ul><li><p class="paragraph" style="text-align:left;">Differences between iOS and Android</p></li><li><p class="paragraph" style="text-align:left;">Opportunities to simplify event structures</p></li></ul><h2 class="heading" style="text-align:left;" id="step-7-scaling-across-the-organizat">Step 7: Scaling Across the Organization</h2><p class="paragraph" style="text-align:left;">Once validated, Uber rolled it out in two ways:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">Central Migration: Platform team replaced old APIs directly</p></li><li><p class="paragraph" style="text-align:left;">Distributed Migration</p><ol start="1"><li><p class="paragraph" style="text-align:left;">Teams were supported with scripts</p></li><li><p class="paragraph" style="text-align:left;">High-impact events were prioritized</p></li></ol></li></ol><p class="paragraph" style="text-align:left;">They also added a linter to block non-standard event usage so the system stays clean going forward.</p><h2 class="heading" style="text-align:left;" id="impact-cleaner-data-faster-developm">Impact: Cleaner Data, Faster Development</h2><h3 class="heading" style="text-align:left;" id="for-data-teams">For Data Teams</h3><ul><li><p class="paragraph" style="text-align:left;">Consistent event definitions</p></li><li><p class="paragraph" style="text-align:left;">Reliable cross-platform analysis</p></li><li><p class="paragraph" style="text-align:left;">More accurate metrics</p></li><li><p class="paragraph" style="text-align:left;">Reduced 
noise</p></li></ul><p class="paragraph" style="text-align:left;">For example, impression accuracy improved significantly due to stricter rules.</p><h3 class="heading" style="text-align:left;" id="for-engineers">For Engineers</h3><ul><li><p class="paragraph" style="text-align:left;">Much less code to write</p></li><li><p class="paragraph" style="text-align:left;">No manual metadata handling</p></li><li><p class="paragraph" style="text-align:left;">Built-in analytics in UI components</p></li><li><p class="paragraph" style="text-align:left;">Faster onboarding</p></li></ul><h2 class="heading" style="text-align:left;" id="whats-next-component-based-analytic">What’s Next: Component-Based Analytics</h2><p class="paragraph" style="text-align:left;">Uber is now pushing this even further. They’re moving towards <b>component-based analytics</b>. Instead of manually defining events:</p><p class="paragraph" style="text-align:left;">Each component <b>gets a unique ID </b>and<b> automatically generates event names.</b></p><p class="paragraph" style="text-align:left;">Format:</p><div class="codeblock"><pre><code>[component]_[surface]_[event]</code></pre></div><p class="paragraph" style="text-align:left;">Example:</p><div class="codeblock"><pre><code>production_selection_button_tap</code></pre></div><p class="paragraph" style="text-align:left;">This reduces duplication, improves naming consistency, and simplifies tracking.</p><hr class="content_break"><hr class="content_break"><h2 class="heading" style="text-align:left;" id="takeaways">Takeaways</h2><p class="paragraph" style="text-align:left;">Uber’s biggest learning is simple: <b>Don’t leave analytics to individual teams; build it into the platform.</b></p><p class="paragraph" style="text-align:left;">By:</p><ul><li><p class="paragraph" style="text-align:left;">Standardizing events</p></li><li><p class="paragraph" style="text-align:left;">Automating metadata</p></li><li><p class="paragraph" style="text-align:left;">Instrumenting UI 
components</p></li><li><p class="paragraph" style="text-align:left;">Preserving data through sampling</p></li></ul><p class="paragraph" style="text-align:left;">They built a system that is scalable, reliable, and easy to use and, most importantly, one that teams can actually trust.</p><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;">Official blog from Uber: <a class="link" href="https://www.uber.com/in/en/blog/how-uber-standardized-mobile-analytics/?uclick_id=4c49f26d-24a9-41ec-902b-e3aae76c3e64&utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=how-uber-standardized-mobile-analytics-without-slowing-down-teams" target="_blank" rel="noopener noreferrer nofollow">How Uber Standardized Mobile Analytics for Cross-Platform Insights</a></p><figcaption class="blockquote__byline"></figcaption></blockquote></div><p class="paragraph" style="text-align:left;">By now, you should have a clear idea of <b>how Uber standardized mobile analytics (without slowing down teams). </b>In a nutshell, Uber fixed inconsistent mobile analytics by moving event logic from feature teams into the platform and standardizing events, metadata, and UI instrumentation. This ensured reliable, scalable data while reducing engineering effort and improving cross-platform insights.</p><p class="paragraph" style="text-align:left;"><b>Congratulations! You&#39;ve just advanced another step in your tech journey. Keep progressing!</b></p><hr class="content_break"></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=2e543026-d0db-4445-89f6-f61d52908620&utm_medium=post_rss&utm_source=hello_world_system_design_newsletter">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Claude vs MiMo, Telegram on watch and share your AI Agents freely!</title>
  <description>Claude vs MiMo, Telegram on watch, and everything you need to know about the new Claude Fable 5</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/c5690d6c-9ae6-4add-b7e6-9a1a23826103/Pasted_image__33_.png" length="997746" type="image/png"/>
  <link>https://hw.glich.co/p/claude-vs-mimo-telegram-on-watch-and-share-your-ai-agents-freely</link>
  <guid isPermaLink="true">https://hw.glich.co/p/claude-vs-mimo-telegram-on-watch-and-share-your-ai-agents-freely</guid>
  <pubDate>Sat, 13 Jun 2026 04:30:00 +0000</pubDate>
  <atom:published>2026-06-13T04:30:00Z</atom:published>
    <dc:creator>Aniket Rawat</dc:creator>
    <category><![CDATA[News]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;"><b>DevSecOps Maturity Model</b></p><p class="paragraph" style="text-align:left;"><a class="link" href="https://r2trck.com/hello-world-datadog-6?utm_medium=newsletter&utm_source=hello-world-r&utm_campaign=dg-content-whitepaper-DevSecOpsMaturityModel-security-app-ww-en-701VY00000DxPsZYAV&utm_content=paid&utm_term=1-1-2026" target="_blank" rel="noopener noreferrer nofollow">The 6 Core Competencies of Mature DevSecOps Orgs</a></p><p class="paragraph" style="text-align:left;">Understand the core competencies that define mature DevSecOps organizations. This whitepaper offers a clear framework to assess your organization&#39;s current capabilities, define where you want to be, and outline practical steps to advance in your journey. Evaluate and strengthen your DevSecOps practices with Datadog&#39;s maturity model. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/06a8a852-b3cc-4629-a094-ec437bd9f1b4/ad_maturitymodel_1080x1080_220725.png?t=1781289325"/></div><hr class="content_break"><p class="paragraph" style="text-align:left;"><b>Claude vs MiMo -</b> Xiaomi’s new <a class="link" href="https://github.com/XiaomiMiMo/MiMo-Code?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=claude-vs-mimo-telegram-on-watch-and-share-your-ai-agents-freely" target="_blank" rel="noopener noreferrer nofollow">MiMo</a> Code is an open-source AI coding agent designed to tackle long-running development tasks without losing context. Built on OpenCode under the MIT license, it preserves memory across sessions using a multi-layered knowledge system. 
<a class="link" href="https://www.developer-tech.com/news/xiaomi-mimo-code-executes-200-step-agentic-developer-workflows/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=claude-vs-mimo-telegram-on-watch-and-share-your-ai-agents-freely" target="_blank" rel="noopener noreferrer nofollow">Read full blog here.</a></p><p class="paragraph" style="text-align:left;"><b>Telegram enters your smartwatch:</b> Telegram’s latest update brings dedicated smartwatch apps, letting users send messages, listen to voice notes, view media, and manage chats directly from their wrist. The release also upgrades bots with rich formatting options like tables, carousels, and collapsible sections, making chat experiences far more interactive. <a class="link" href="https://telegram.org/blog/watch-apps-and-more?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=claude-vs-mimo-telegram-on-watch-and-share-your-ai-agents-freely" target="_blank" rel="noopener noreferrer nofollow">Read more.</a></p><hr class="content_break"><h2 class="heading" style="text-align:left;" id="everything-you-need-to-know-about-c">Everything You Need to Know About Claude Fable 5 and Claude Mythos 5</h2><p class="paragraph" style="text-align:left;">Anthropic has introduced Claude Fable 5 and Claude Mythos 5, its most advanced AI models yet. These models mark the arrival of a new capability tier that goes beyond traditional chatbots, focusing on long-running tasks, autonomous problem-solving, and advanced reasoning. Both models are built on the same foundation, but Fable 5 is designed for general users, while Mythos 5 is reserved for trusted cybersecurity and scientific research partners.</p><h3 class="heading" style="text-align:left;" id="built-for-long-and-complex-tasks">Built for Long and Complex Tasks</h3><p class="paragraph" style="text-align:left;">One of the biggest improvements in Fable 5 is its ability to stay focused during lengthy workflows. 
Many AI models perform well on short prompts but struggle when tasks span hundreds of steps. Fable 5 is designed to maintain context across millions of tokens, allowing it to tackle projects that may take hours or even days to complete. Anthropic says the model can create and use its own notes, helping it remember key information and improve its outputs over time.</p><h3 class="heading" style="text-align:left;" id="major-advances-in-software-engineer">Major Advances in Software Engineering</h3><p class="paragraph" style="text-align:left;">Software development is one area where Fable 5 shines. During early testing, companies reported dramatic productivity gains. In one example, the model completed a migration across a massive 50-million-line codebase in just one day, a project that would normally require months of manual work. Fable 5 can analyze large repositories, understand dependencies, update multiple files simultaneously, and generate production-ready code while remaining more token-efficient than previous Claude models.</p><p class="paragraph" style="text-align:left;">Beyond coding, Fable 5 delivers significant improvements in research and analytical tasks. The model performs exceptionally well at document analysis, financial reasoning, chart interpretation, root-cause analysis, and problem-solving. Anthropic claims it achieved leading results on several reasoning benchmarks, making it useful for consultants, analysts, researchers, and business professionals who work with large amounts of information daily.</p><p class="paragraph" style="text-align:left;">Vision is another area where the new model stands out. Fable 5 can extract information from scientific figures, understand diagrams, analyze screenshots, and even recreate software applications from images alone. 
Anthropic demonstrated these capabilities by showing the model successfully completing Pokémon FireRed using only visual inputs, without access to additional game information or navigation tools.</p><h3 class="heading" style="text-align:left;" id="what-makes-mythos-5-different">What Makes Mythos 5 Different?</h3><p class="paragraph" style="text-align:left;">Claude Mythos 5 is essentially the unrestricted version of Fable 5. It is designed for cybersecurity experts and scientific researchers who require access to the model&#39;s full capabilities. Anthropic claims Mythos 5 possesses the strongest cybersecurity abilities of any AI model currently available. The model has also shown promising results in drug discovery, protein engineering, molecular biology research, and genomics, helping researchers accelerate complex scientific workflows.</p><h3 class="heading" style="text-align:left;" id="safety-comes-first">Safety Comes First</h3><p class="paragraph" style="text-align:left;">With greater capability comes greater risk. To prevent misuse, Anthropic has implemented advanced safeguards in Fable 5. Requests involving cybersecurity, biology, chemistry, or AI model distillation are automatically routed to a less capable model. According to the company, these safeguards affect fewer than 5% of user sessions while helping prevent harmful use cases and jailbreak attempts.</p><h3 class="heading" style="text-align:left;" id="the-bigger-picture">The Bigger Picture</h3><p class="paragraph" style="text-align:left;">The launch of Claude Fable 5 and Claude Mythos 5 highlights a major shift in AI development. The industry is moving beyond simply building smarter models and focusing on how to deploy powerful AI responsibly. By combining cutting-edge capabilities with layered safety mechanisms, Anthropic is attempting to balance innovation with security. 
As AI continues to evolve, models like Fable 5 and Mythos 5 offer a glimpse into what the next generation of intelligent systems may look like.<br><br><a class="link" href="https://www.anthropic.com/news/claude-fable-5-mythos-5?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=claude-vs-mimo-telegram-on-watch-and-share-your-ai-agents-freely" target="_blank" rel="noopener noreferrer nofollow">Read Official Blog.</a></p><hr class="content_break"><p class="paragraph" style="text-align:left;"><b>Amazon’s new Side Hustle -</b> Amazon is bringing AI-generated merchandise directly into its shopping app, letting users create custom designs with simple Alexa prompts. From T-shirts and hoodies to tumblers and water bottles, the platform handles everything from design generation to production and Prime delivery. <a class="link" href="https://techcrunch.com/2026/06/08/amazon-now-lets-you-design-custom-merch-using-ai/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=claude-vs-mimo-telegram-on-watch-and-share-your-ai-agents-freely" target="_blank" rel="noopener noreferrer nofollow">Read more.</a></p><p class="paragraph" style="text-align:left;"><b>Share AI Agents, skills securely now:</b> Databricks has unveiled OpenSharing, an open protocol designed to let organizations securely share not just data, but also AI models, agent skills, and unstructured assets across platforms. Hosted by the Linux Foundation, it expands on Delta Sharing to create a vendor-neutral standard for the AI era. 
<a class="link" href="https://sdtimes.com/data/databricks-announces-opensharing-a-protocol-for-sharing-data-ai-assets/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=claude-vs-mimo-telegram-on-watch-and-share-your-ai-agents-freely" target="_blank" rel="noopener noreferrer nofollow">Read more.</a></p><hr class="content_break"><h3 class="heading" style="text-align:left;" id="buzz-of-the-week">Buzz of the Week:</h3><h3 class="heading" style="text-align:left;" id="context-persistence">Context Persistence</h3><p class="paragraph" style="text-align:left;">Context Persistence is the ability of an AI system to maintain and retrieve state across long-running tasks, sessions, and workflows. It combines structured memory, retrieval mechanisms, and state management to prevent context loss during extended agent execution. Modern AI agents use context persistence to coordinate multi-step reasoning, resume interrupted processes, and maintain consistency across millions of tokens. It is becoming a foundational capability for autonomous coding agents, enterprise copilots, and agentic AI systems.</p><hr class="content_break"><h3 class="heading" style="text-align:left;" id="things-that-launched-things-that-we">Things that launched. Things that went viral. 
Things you&#39;ll pretend to try.</h3><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/9765551e-0037-4459-b1c5-75b0e223908c/image.png?t=1751469165"/></div><p class="paragraph" style="text-align:left;"><b>chisel</b></p><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/jpillora/chisel?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=claude-vs-mimo-telegram-on-watch-and-share-your-ai-agents-freely" target="_blank" rel="noopener noreferrer nofollow">chisel</a> is a fast TCP/UDP tunnel, transported over HTTP and secured via SSH.</p><p class="paragraph" style="text-align:left;"><b>s5cmd</b></p><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/peak/s5cmd?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=claude-vs-mimo-telegram-on-watch-and-share-your-ai-agents-freely" target="_blank" rel="noopener noreferrer nofollow">s5cmd</a> is a blazing-fast alternative to the AWS CLI for S3 operations.</p><p class="paragraph" style="text-align:left;"><b>pistol</b></p><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/doronbehar/pistol?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=claude-vs-mimo-telegram-on-watch-and-share-your-ai-agents-freely" target="_blank" rel="noopener noreferrer nofollow">pistol</a> previews files in the terminal automatically while you navigate directories.</p><hr class="content_break"><h3 class="heading" style="text-align:left;" id="build-braincells-not-just-features"><b>Build Braincells, Not Just Features</b></h3><p class="paragraph" style="text-align:left;">This weekend’s read: <a class="link" href="https://x.com/elder_plinius/status/2064478648057610422?utm_source=tldrai" 
target="_blank" rel="noopener noreferrer nofollow">Fable 5 SYS prompt leak.</a></p><p class="paragraph" style="text-align:left;">This week’s watch: <a class="link" href="https://youtu.be/i1R-61zTbD4?si=c9qlHhP0Lodp91iF&utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=claude-vs-mimo-telegram-on-watch-and-share-your-ai-agents-freely" target="_blank" rel="noopener noreferrer nofollow">Why is everyone going Blind.</a></p><hr class="content_break"><p class="paragraph" style="text-align:left;">Meanwhile…</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e1a7f6b1-1bc1-4c8c-a0d2-18fc052252a8/image.png?t=1781288733"/></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=1bb7bc87-cb54-4bcd-9f10-8e0f03c4bdfd&utm_medium=post_rss&utm_source=hello_world_system_design_newsletter">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>What is Database Replication?</title>
  <description>A practical guide to database replication in system design. Learn about key architectures, trade-offs, and real-world strategies for building scalable systems.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/3424aaf7-dd05-4b8a-aef8-10ad47678f06/article-image-bd77aaaf-98df-49b9-b020-48e7611903f6.jpg" length="63664" type="image/jpeg"/>
  <link>https://hw.glich.co/p/database-replication-in-system-design</link>
  <guid isPermaLink="true">https://hw.glich.co/p/database-replication-in-system-design</guid>
  <pubDate>Wed, 10 Jun 2026 04:30:00 +0000</pubDate>
  <atom:published>2026-06-10T04:30:00Z</atom:published>
    <dc:creator>Rohit Lakhotia</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><hr class="content_break"><hr class="content_break"><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/14ce61ee-18ca-4e12-948b-9e55311cd376/BRAND-8629_ebook_ads_benefits_of_end-to-end_observability_1_1200x628px_250812__1_.png?t=1781020362"/></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://r2trck.com/hello-world-datadog-2?utm_medium=newsletter&utm_source=hello-world-r&utm_campaign=dg-content-ebook-BenefitsofEndtoEndObservability-apm-app-ww-en-701VY00000P6SHmYAN&utm_content=paid&utm_term=1-1-2026" target="_blank" rel="noopener noreferrer nofollow"><b>Benefits of End-to-End Observability</b></a><a class="link" href="https://r2trck.com/hello-world-datadog-2?utm_medium=newsletter&utm_source=hello-world-r&utm_campaign=dg-content-ebook-BenefitsofEndtoEndObservability-apm-app-ww-en-701VY00000P6SHmYAN&utm_content=paid&utm_term=1-1-2026" target="_blank" rel="noopener noreferrer nofollow"><b> </b></a></p><p class="paragraph" style="text-align:left;">The real benefits of end-to-end observability<br>How does full-stack observability impact engineering speed, incident response, and cost control?</p><p class="paragraph" style="text-align:left;">In this <a class="link" href="https://r2trck.com/hello-world-datadog-2?utm_medium=newsletter&utm_source=hello-world-r&utm_campaign=dg-content-ebook-BenefitsofEndtoEndObservability-apm-app-ww-en-701VY00000P6SHmYAN&utm_content=paid&utm_term=1-1-2026" target="_blank" rel="noopener noreferrer nofollow">eBook from Datadog</a>, you&#39;ll learn how real teams across industries are using observability to:</p><ul><li><p class="paragraph" style="text-align:left;">Reduce mean time to resolution (MTTR)</p></li><li><p class="paragraph" style="text-align:left;">Cut tooling costs and improve team efficiency</p></li><li><p class="paragraph" 
style="text-align:left;">Align business and engineering KPIs</p></li></ul><p class="paragraph" style="text-align:left;">See how unifying your stack leads to faster troubleshooting and long-term operational gains.</p><p class="paragraph" style="text-align:left;">In most applications, the database is the core of the system, and if it fails, everything built on it fails too. Database replication guards against this by maintaining copies of the data on multiple servers. This approach helps keep the system available, scalable, and fast for users globally.</p><h2 class="heading" style="text-align:left;" id="why-replication-is-your-systems-saf">Why Replication Is Your System&#39;s Safety Net</h2><p class="paragraph" style="text-align:left;">Imagine your entire application depends on a single library. If that library suddenly closes, even for a few minutes, everything grinds to a halt. This is a single point of failure.</p><p class="paragraph" style="text-align:left;">Database replication is like having synchronized copies of that library in different cities. If one closes, the others are still open and up-to-date. This approach is a cornerstone of any resilient, modern architecture. By distributing data, you build a system that can survive outages and handle growth.</p><h3 class="heading" style="text-align:left;" id="the-core-goals-of-replication">The Core Goals of Replication</h3><ul><li><p class="paragraph" style="text-align:left;"><b>High Availability:</b> If the primary database server fails, a replica can quickly take over, ensuring continuous operation without noticeable disruption for users.</p></li><li><p class="paragraph" style="text-align:left;"><b>Improved Scalability:</b> As traffic increases, replication allows workload distribution. 
&quot;Write&quot; requests go to the main database, while &quot;read&quot; requests are handled by multiple replicas, preventing bottlenecks.</p></li><li><p class="paragraph" style="text-align:left;"><b>Lower Latency:</b> By placing replicas near users, data access is faster. For example, a replica in Tokyo speeds up access for users in Japan, reducing response times.</p></li></ul><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;">Database replication strengthens infrastructure by converting a single point of failure into a resilient network. Understanding these advantages is crucial for reliable software design, as replication is essential for fault tolerance and global performance.</p><figcaption class="blockquote__byline"></figcaption></blockquote></div><h2 class="heading" style="text-align:left;" id="exploring-core-replication-architec">Exploring Core Replication Architectures</h2><p class="paragraph" style="text-align:left;">Choosing a data replication strategy is a crucial design decision, given the variety of models and their trade-offs in complexity, scalability, and failure handling. This is reflected in the global data replication market&#39;s considerable growth.</p><p class="paragraph" style="text-align:left;">Three main models are prevalent: Leader-Follower, Multi-Leader, and Leaderless replication.</p><h3 class="heading" style="text-align:left;" id="leader-follower-model">Leader-Follower Model</h3><p class="paragraph" style="text-align:left;">This common approach involves a leader node handling write operations and streaming data changes to follower nodes, which handle read-only queries. 
Suitable for read-heavy applications like blogs and e-commerce sites, it enhances read throughput by offloading reads to followers.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/928802e4-a561-491c-ba3f-355a3187f688/369a99f8-4b67-4234-a393-dada6a5afc4d.jpg?t=1781019895"/></div><p class="paragraph" style="text-align:left;">The model&#39;s beauty is its simplicity. With the leader as the single source of truth, maintaining write consistency is straightforward. However, the leader is a single point of failure for writes. If it goes down, your system can&#39;t accept new data until a follower is promoted. This is a classic example of the trade-offs in <a class="link" href="https://hw.glich.co/p/vertical-vs-horizontal-scaling?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-database-replication" target="_blank" rel="noopener noreferrer nofollow">scaling strategies for your system</a>.</p><h3 class="heading" style="text-align:left;" id="multi-leader-and-leaderless-replica">Multi-Leader and Leaderless Replication</h3><p class="paragraph" style="text-align:left;">For systems requiring intensive write operations, other models may be more appropriate:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Multi-Leader Replication:</b> Multiple nodes can accept writes, which benefits applications that need low-latency writes in different regions. The main challenge is resolving write conflicts when simultaneous edits occur in different locations.</p></li><li><p class="paragraph" style="text-align:left;"><b>Leaderless Replication:</b> Eliminates the leader concept, allowing any node to accept writes, which are then distributed to other nodes. 
Used by databases like <a class="link" href="https://cassandra.apache.org/_/index.html?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-database-replication" target="_blank" rel="noopener noreferrer nofollow">Apache Cassandra</a> for high fault tolerance and availability, ensuring minimal impact if a node fails.</p></li></ul><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;">Each architecture serves a different need. Choose based on your application&#39;s requirements: simple read scaling, low-latency global writes, or maximum resilience.</p><figcaption class="blockquote__byline"></figcaption></blockquote></div><h3 class="heading" style="text-align:left;" id="comparing-replication-architectures">Comparing Replication Architectures at a Glance</h3><p class="paragraph" style="text-align:left;">This table helps clarify how these models stack up.</p><div style="padding:14px 20px 14px;"><table class="bh__table" width="100%" style="border-collapse:collapse;"><tr class="bh__table_row"><th class="bh__table_header" width="20%"><p class="paragraph" style="text-align:left;">Architecture</p></th><th class="bh__table_header" width="20%"><p class="paragraph" style="text-align:left;">Write Complexity</p></th><th class="bh__table_header" width="20%"><p class="paragraph" style="text-align:left;">Read Scalability</p></th><th class="bh__table_header" width="20%"><p class="paragraph" style="text-align:left;">Fault Tolerance</p></th><th class="bh__table_header" width="20%"><p class="paragraph" style="text-align:left;">Best For</p></th></tr><tr class="bh__table_row"><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;"><b>Leader-Follower</b></p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">Low</p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">High</p></td><td class="bh__table_cell" width="20%"><p 
class="paragraph" style="text-align:left;">Medium</p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">Read-heavy applications like blogs or e-commerce sites.</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;"><b>Multi-Leader</b></p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">High</p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">High</p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">High</p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">Globally distributed systems needing low write latency.</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;"><b>Leaderless</b></p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">Medium</p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">High</p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">Very High</p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">Applications demanding maximum availability and resilience.</p></td></tr></table></div><p class="paragraph" style="text-align:left;">There&#39;s no single &quot;best&quot; architecture. The Leader-Follower model is a solid starting point for many applications, while Multi-Leader and Leaderless models offer powerful solutions for more complex, high-availability scenarios.</p><h2 class="heading" style="text-align:left;" id="balancing-data-consistency-and-syst">Balancing Data Consistency and System Availability</h2><p class="paragraph" style="text-align:left;">When replicating databases in your system design, you face a classic dilemma. 
Every distributed system must balance guaranteeing that data is up-to-date everywhere (<b>consistency</b>) against ensuring the system always responds to requests (<b>availability</b>). This trade-off is famously described by the CAP theorem.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/c6ef5580-5de4-46d6-ab60-d406232e80ed/f68b5cfd-42e1-477f-b5f3-efd253bd6636.jpg?t=1781019895"/></div><p class="paragraph" style="text-align:left;">This isn&#39;t just theory; it has real consequences. Our guide on <a class="link" href="https://hw.glich.co/p/what-is-cap-theorem?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-database-replication" target="_blank" rel="noopener noreferrer nofollow">the CAP theorem</a> explains it in detail, but the core idea is that during a network partition, you can&#39;t have both perfect consistency and full availability.</p><h3 class="heading" style="text-align:left;" id="synchronous-vs-asynchronous-replica">Synchronous vs. Asynchronous Replication</h3><p class="paragraph" style="text-align:left;">Choosing between replication methods depends on how much latency and data loss you can tolerate:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Synchronous Replication:</b> The leader waits for a follower&#39;s confirmation before responding to the client, so every acknowledged write exists on more than one machine. The cost is higher write latency, because each write must wait for that confirmation.</p></li><li><p class="paragraph" style="text-align:left;"><b>Asynchronous Replication:</b> The leader immediately confirms writes to the client and updates followers later, offering faster performance and high availability. 
The risk is losing recent writes if the leader fails before followers receive them.</p></li></ul><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;">The choice comes down to balancing strong consistency against availability and speed.</p><figcaption class="blockquote__byline"></figcaption></blockquote></div><h3 class="heading" style="text-align:left;" id="real-world-consistency-models">Real-World Consistency Models</h3><p class="paragraph" style="text-align:left;">Let&#39;s see how this plays out.</p><p class="paragraph" style="text-align:left;">A banking app <b>must</b> prioritize consistency. A money transfer must be perfectly reflected across all systems. Seeing an old, incorrect balance is unacceptable. Such an application would favor synchronous replication to ensure zero data loss.</p><p class="paragraph" style="text-align:left;">Conversely, a social media feed can be more relaxed. If a &quot;like&quot; takes a few seconds to appear for others, it&#39;s not a critical failure. This model uses <b>eventual consistency</b>, where data syncs across replicas over time, making asynchronous replication a perfect fit.</p><h2 class="heading" style="text-align:left;" id="designing-systems-that-survive-fail">Designing Systems That Survive Failure</h2><p class="paragraph" style="text-align:left;">A <b>database replication</b> strategy is like a safety net, but that net can still get tangled. The real goal is to build a system that expects failure and knows how to recover gracefully.</p><h3 class="heading" style="text-align:left;" id="when-leaders-go-down-and-networks-g">When Leaders Go Down and Networks Get Weird</h3><p class="paragraph" style="text-align:left;">One issue is <b>leader failure</b>; if the leader crashes, new writes can&#39;t be accepted. 
Automated failover is crucial; a monitoring service detects the failure, holds an &quot;election&quot; among followers, and appoints a new leader quickly to reduce downtime.</p><p class="paragraph" style="text-align:left;">Another problem is <b>network partition</b>, where nodes run but can&#39;t communicate. In systems with automated failover, this can cause <b>split-brain</b>, where each side of the partition ends up with its own leader accepting writes, resulting in conflicting data that must be reconciled later.</p><h3 class="heading" style="text-align:left;" id="battling-replica-lag-and-data-confl">Battling Replica Lag and Data Conflicts</h3><p class="paragraph" style="text-align:left;">Even in normal operation, systems must deal with <b>replica lag</b>, the delay between a write hitting the leader and appearing on a follower. High lag means users see stale data, which can break application logic. Monitoring this is critical, and you can learn more about key performance indicators in this guide to <a class="link" href="https://www.integrate.io/blog/database-replication-speed-metrics/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-database-replication" target="_blank" rel="noopener noreferrer nofollow">database replication speed metrics</a>.</p><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;">A truly resilient system isn&#39;t just about having copies; it&#39;s about having an intelligent, automated plan for when those copies lose sync or their leader disappears.</p><figcaption class="blockquote__byline"></figcaption></blockquote></div><p class="paragraph" style="text-align:left;">Designing for failure means building systems that embrace instability. 
Our article on <a class="link" href="https://hw.glich.co/p/how-netflix-ensures-reliability-with-prioritized-load-shedding?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-database-replication" target="_blank" rel="noopener noreferrer nofollow">how Netflix ensures reliability</a> provides a great look into how they tackle these challenges at scale.</p><h2 class="heading" style="text-align:left;" id="how-top-companies-use-database-repl">How Top Companies Use Database Replication</h2><p class="paragraph" style="text-align:left;">Theory is one thing, but seeing <b>database replication in system design</b> in practice makes it real. Major tech companies use replication strategically to solve massive performance and availability problems.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b80093da-dcb0-40bf-828e-7b952d52a0a1/226f1f7a-03b7-4861-96a6-e094369dcf4d.jpg?t=1781019895"/></div><p class="paragraph" style="text-align:left;">This strategic importance is why the database replication software market is growing rapidly. You can find more <a class="link" href="https://www.verifiedmarketreports.com/product/database-replication-software-market/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-database-replication" target="_blank" rel="noopener noreferrer nofollow">insights on this market&#39;s growth</a> and its projections.</p><p class="paragraph" style="text-align:left;">Let&#39;s look at a couple of real-world examples.</p><h3 class="heading" style="text-align:left;" id="ecommerce-giants-and-multi-leader-r">E-commerce Giants and Multi-Leader Replication</h3><p class="paragraph" style="text-align:left;">A global e-commerce site needs to ensure fast user experiences across New York, Berlin, and Tokyo. 
If all &quot;add to cart&quot; actions went to a single database in North America, delays would occur for users in Asia and Europe.</p><p class="paragraph" style="text-align:left;">To address this, they implement a <b>Multi-Leader architecture</b> with leader nodes in key regions. A shopper in Germany, for instance, updates their cart through a European leader, providing an instant experience. The data is then replicated to other leaders in the background.</p><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;">This local write capability greatly enhances user experience, maintaining eventual consistency without compromising local performance.</p><figcaption class="blockquote__byline"></figcaption></blockquote></div><h3 class="heading" style="text-align:left;" id="streaming-services-and-global-conte">Streaming Services and Global Content Delivery</h3><p class="paragraph" style="text-align:left;">A major video streaming service has a different challenge. Its catalog of titles and user watchlists doesn&#39;t change frequently, but it must serve content to millions of viewers globally without buffering.</p><p class="paragraph" style="text-align:left;">For this, a <b>Leader-Follower</b> model is ideal.</p><ul><li><p class="paragraph" style="text-align:left;">A central leader database is the single source of truth for all content metadata.</p></li><li><p class="paragraph" style="text-align:left;">This data is replicated to hundreds of read-only followers worldwide.</p></li><li><p class="paragraph" style="text-align:left;">These followers are placed alongside video files on a Content Delivery Network (CDN).</p></li></ul><p class="paragraph" style="text-align:left;">When you search for a movie, your request hits a nearby replica, giving a lightning-fast response. 
This principle of distributing read load is used in many massive systems, as detailed in our analysis of <a class="link" href="https://hw.glich.co/p/how-facebook-handles-billions-of-messages-daily?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-database-replication" target="_blank" rel="noopener noreferrer nofollow">how Facebook handles billions of messages daily</a>.</p><h2 class="heading" style="text-align:left;" id="common-questions-about-database-rep">Common Questions About Database Replication</h2><h3 class="heading" style="text-align:left;" id="what-is-the-difference-between-data">What Is the Difference Between Database Replication and Backups?</h3><p class="paragraph" style="text-align:left;">This is a common point of confusion. Both involve copying data, but they solve different problems.</p><p class="paragraph" style="text-align:left;"><b>Replication</b> is a live, continuous process for high availability and performance. It keeps multiple database copies in sync so one can take over quickly if another fails.</p><p class="paragraph" style="text-align:left;"><b>Backups</b> are periodic, offline snapshots for disaster recovery. You use them to restore data to a specific point in time, like after an accidental mass deletion.</p><h3 class="heading" style="text-align:left;" id="how-does-replication-lag-affect-my-">How Does Replication Lag Affect My Application?</h3><p class="paragraph" style="text-align:left;"><b>Replication lag</b> is the delay between a write on the leader and its visibility on a follower. During that window, users can see outdated data, such as an old profile photo after refreshing the page.</p><p class="paragraph" style="text-align:left;">A typical solution is read-your-writes routing: for a few seconds after a user makes a write, direct that user’s reads to the leader so they see their own updates. 
Other reads continue to go to followers.</p><h3 class="heading" style="text-align:left;" id="when-to-use-synchronous-vs-asynchro">When to Use Synchronous vs. Asynchronous Replication</h3><p class="paragraph" style="text-align:left;">The choice depends on data loss tolerance:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Synchronous replication:</b> Use when data loss is unacceptable, such as in payment systems, guaranteeing writes are saved to a follower before success confirmation, though it increases latency.</p></li><li><p class="paragraph" style="text-align:left;"><b>Asynchronous replication:</b> Opt for this when speed is crucial and minor data loss is tolerable, suitable for social media, logging, or analytics where performance is prioritized over absolute data durability.</p></li></ul><p class="paragraph" style="text-align:left;">Balancing these is essential in system design, with many large systems employing a combination of both.</p><hr class="content_break"></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=fa24ddb1-e591-4e0c-8cf8-d06a5cef5b23&utm_medium=post_rss&utm_source=hello_world_system_design_newsletter">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>When Microservices Get Messy: How API Federation Brings Order</title>
  <description>API Federation combines multiple services into one API using shared models and modular features, simplifying complex microservice architectures.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/7dce0ac9-0932-439f-b1de-bf951dc8a1ca/image.png" length="275571" type="image/png"/>
  <link>https://hw.glich.co/p/when-microservices-get-messy-how-api-federation-brings-order</link>
  <guid isPermaLink="true">https://hw.glich.co/p/when-microservices-get-messy-how-api-federation-brings-order</guid>
  <pubDate>Mon, 08 Jun 2026 04:30:00 +0000</pubDate>
  <atom:published>2026-06-08T04:30:00Z</atom:published>
    <dc:creator>Rohit Lakhotia</dc:creator>
    <category><![CDATA[Salesforce]]></category>
    <category><![CDATA[System Design]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><div class="section" style="background-color:#FFFFFF;border-color:#fd5621;border-radius:4px;border-style:solid;border-width:1px;margin:16.0px 16.0px 16.0px 16.0px;padding:16.0px 16.0px 16.0px 16.0px;"><p class="paragraph" style="text-align:left;"><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;font-size:16px;"><i>Welcome to </i></span><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;"><i><b><a class="link" href="https://hw.glich.co/subscribe?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=when-microservices-get-messy-how-api-federation-brings-order" target="_blank" rel="noopener noreferrer nofollow">Hello World</a></b></i></span><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;font-size:16px;"><i>, where we help software engineers learn the art of building scalable and resilient systems.</i></span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;font-size:16px;"><i>You can also check out: </i></span><b><a class="link" href="https://scaleengineer.com/blog/how-slack-makes-its-mobile-app-feel-seamless-even-on-bad-internet?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=when-microservices-get-messy-how-api-federation-brings-order" target="_blank" rel="noopener noreferrer nofollow">How Slack makes its Mobile App Feel Seamless (Even on Bad Internet)</a></b></p></div><hr class="content_break"><hr class="content_break"><hr class="content_break"><div class="section" style="background-color:transparent;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">Table of Contents</h2><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="#the-problem-with-scaling-microservi" rel="noopener noreferrer nofollow">The Problem with Scaling Microservices</a></p></li><li><p class="paragraph" 
style="text-align:left;"><a class="link" href="#so-what-is-api-federation" rel="noopener noreferrer nofollow">So, what is API Federation?</a></p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="#what-its-not" rel="noopener noreferrer nofollow">What It’s NOT</a></p></li></ul></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#breaking-systems-into-bounded-conte" rel="noopener noreferrer nofollow">Breaking Systems into Bounded Contexts</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#how-federation-actually-works" rel="noopener noreferrer nofollow">How Federation Actually Works</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#a-smarter-way-to-design-ap-is" rel="noopener noreferrer nofollow">A Smarter Way to Design APIs</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#how-data-is-fetched-across-services" rel="noopener noreferrer nofollow">How Data is Fetched Across Services</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#balancing-flexibility-and-control" rel="noopener noreferrer nofollow">Balancing Flexibility and Control</a></p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="#what-is-a-federation-protocol" rel="noopener noreferrer nofollow">What is a federation protocol?</a></p></li></ul></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#handling-changes-over-time" rel="noopener noreferrer nofollow">Handling Changes Over Time</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#keeping-things-stable-in-production" rel="noopener noreferrer nofollow">Keeping Things Stable in Production</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#does-federation-affect-performance" rel="noopener noreferrer nofollow">Does Federation Affect 
Performance?</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#making-it-work-at-scale" rel="noopener noreferrer nofollow">Making it Work at Scale</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#tools-that-help-like-data-graph" rel="noopener noreferrer nofollow">Tools That Help (Like DataGraph)</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#key-takeaways" rel="noopener noreferrer nofollow">Key Takeaways</a></p></li></ul></div><p class="paragraph" style="text-align:left;">Microservices promise a lot: better scalability, faster development, independent teams, and more flexibility. And honestly, in the beginning, they do deliver. But as systems grow, something strange starts happening. Instead of clarity, you get chaos. You now have hundreds of APIs, different formats, inconsistent data, hidden services, and broken integrations. What once felt like a clean architecture slowly turns into what people jokingly call a <b>“microservices death star.”</b></p><p class="paragraph" style="text-align:left;">So the real question becomes: <b>How do you scale microservices without losing control? </b>And that’s exactly where <b>API Federation</b> comes in.</p><h2 class="heading" style="text-align:left;" id="the-problem-with-scaling-microservi">The Problem with Scaling Microservices</h2><p class="paragraph" style="text-align:left;">When organizations grow, their API landscape grows with them. You start with a few services, then dozens, then hundreds. 
At that point, new challenges appear:</p><ul><li><p class="paragraph" style="text-align:left;">APIs become inconsistent</p></li><li><p class="paragraph" style="text-align:left;">Discoverability becomes difficult</p></li><li><p class="paragraph" style="text-align:left;">Observability becomes harder</p></li><li><p class="paragraph" style="text-align:left;">Changes start breaking things</p></li></ul><p class="paragraph" style="text-align:left;">And the worst part? Every team is moving fast, but in slightly different directions. Without a structured approach, your system becomes <b>hard to understand, hard to evolve, </b>and even<b> harder to maintain.</b></p><h2 class="heading" style="text-align:left;" id="so-what-is-api-federation">So, what is API Federation?</h2><p class="paragraph" style="text-align:left;">To solve this, we need a better way to manage APIs at scale. That’s where API Federation comes in! In simple terms: <b>API Federation is a way to combine multiple services into one unified API without forcing teams to tightly couple their systems.</b></p><p class="paragraph" style="text-align:left;">It allows:</p><ul><li><p class="paragraph" style="text-align:left;">Different services to evolve independently</p></li><li><p class="paragraph" style="text-align:left;">The unified API to expose a consistent interface externally</p></li></ul><p class="paragraph" style="text-align:left;">You can think of it like this: instead of clients calling 10 different APIs, they call <b>one federated API</b>, and behind the scenes, it fetches data from multiple services.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/7dce0ac9-0932-439f-b1de-bf951dc8a1ca/image.png?t=1779111433"/></div><h3 class="heading" style="text-align:left;" id="what-its-not">What It’s NOT</h3><p class="paragraph" style="text-align:left;">Before going deeper, let’s clear up some 
misconceptions:</p><p class="paragraph" style="text-align:left;"><b>1. It’s NOT a single “universal model”</b></p><p class="paragraph" style="text-align:left;">Trying to create one global schema for the entire organization doesn’t work. Why?</p><p class="paragraph" style="text-align:left;">Because:</p><ul><li><p class="paragraph" style="text-align:left;">Teams have different needs</p></li><li><p class="paragraph" style="text-align:left;">Domains evolve differently</p></li><li><p class="paragraph" style="text-align:left;">Central control slows everything down</p></li></ul><p class="paragraph" style="text-align:left;">This often leads to rigid systems that eventually break.</p><p class="paragraph" style="text-align:left;"><b>2. It’s NOT just GraphQL</b></p><p class="paragraph" style="text-align:left;">GraphQL is commonly used for federation, but it’s not mandatory. API Federation is a <b>design approach</b>, not a specific technology.</p><p class="paragraph" style="text-align:left;">It can work with:</p><ul><li><p class="paragraph" style="text-align:left;">REST APIs</p></li><li><p class="paragraph" style="text-align:left;">Event-driven systems</p></li><li><p class="paragraph" style="text-align:left;">Other architectures</p></li></ul><h2 class="heading" style="text-align:left;" id="breaking-systems-into-bounded-conte">Breaking Systems into Bounded Contexts</h2><p class="paragraph" style="text-align:left;">The first big idea behind API Federation comes from Domain-Driven Design (DDD). 
Instead of treating your system as one giant API, you divide it into <b>bounded contexts</b>.</p><p class="paragraph" style="text-align:left;"><b>What is a bounded context?</b></p><p class="paragraph" style="text-align:left;">A group of services that:</p><ul><li><p class="paragraph" style="text-align:left;">Share the same domain concepts</p></li><li><p class="paragraph" style="text-align:left;">Speak the same “language”</p></li><li><p class="paragraph" style="text-align:left;">Represent data consistently</p></li></ul><p class="paragraph" style="text-align:left;">For example:</p><ul><li><p class="paragraph" style="text-align:left;">A “Customer” should mean the same thing across services in that context</p></li><li><p class="paragraph" style="text-align:left;">Identifiers for entities should match everywhere</p></li><li><p class="paragraph" style="text-align:left;">Data types should be consistent</p></li></ul><p class="paragraph" style="text-align:left;">This matters because federation works best when <b>data meanings are aligned, IDs are consistent, </b>and<b> schemas are compatible.</b> Without this, combining APIs becomes messy.</p><h2 class="heading" style="text-align:left;" id="how-federation-actually-works">How Federation Actually Works</h2><p class="paragraph" style="text-align:left;">Once you have a bounded context, federation combines services using a few key patterns:</p><p class="paragraph" style="text-align:left;"><b>1. Entity Extension</b></p><p class="paragraph" style="text-align:left;">Different services may provide different fields for the same entity. Federation merges them into one unified view.</p><p class="paragraph" style="text-align:left;">For example, one service might return customer details, while another returns their orders. The federated API combines both into a single response.</p><p class="paragraph" style="text-align:left;"><b>2. 
Entity Linking</b></p><p class="paragraph" style="text-align:left;">Services can reference entities from other services using shared identifiers. Instead of linking to each other directly, they connect through <b>common entity types and keys</b>.</p><p class="paragraph" style="text-align:left;">This is powerful because if one service changes or disappears, the rest of the system doesn’t break.</p><p class="paragraph" style="text-align:left;"><b>3. Entity Composition</b></p><p class="paragraph" style="text-align:left;">Sometimes entities don’t have identifiers (like value objects). In such cases, federation combines the available attributes dynamically.</p><h2 class="heading" style="text-align:left;" id="a-smarter-way-to-design-ap-is">A Smarter Way to Design APIs</h2><p class="paragraph" style="text-align:left;">Traditionally, APIs are built as large, monolithic interfaces. But federation encourages something different: build APIs as <b>small reusable features</b>.</p><p class="paragraph" style="text-align:left;">Instead of one big API doing everything, break it into smaller capabilities.</p><p class="paragraph" style="text-align:left;">Examples of features:</p><ul><li><p class="paragraph" style="text-align:left;">CRUD operations</p></li><li><p class="paragraph" style="text-align:left;">Pagination</p></li><li><p class="paragraph" style="text-align:left;">Filtering</p></li><li><p class="paragraph" style="text-align:left;">Search</p></li><li><p class="paragraph" style="text-align:left;">Event streaming</p></li></ul><p class="paragraph" style="text-align:left;">Each feature:</p><ul><li><p class="paragraph" style="text-align:left;">Can be developed independently</p></li><li><p class="paragraph" style="text-align:left;">Can be reused across services</p></li><li><p class="paragraph" style="text-align:left;">Can evolve separately</p></li></ul><p class="paragraph" 
style="text-align:left;">This works better because <b>teams no longer depend on entire APIs; they depend only on specific features. </b>And federation combines these features into one unified interface.</p><h2 class="heading" style="text-align:left;" id="how-data-is-fetched-across-services">How Data is Fetched Across Services</h2><p class="paragraph" style="text-align:left;">To make federation work, services must support something called <b>entity resolution</b>.</p><p class="paragraph" style="text-align:left;">This means: given an entity ID, return its data.</p><p class="paragraph" style="text-align:left;">Using this, the federation layer can:</p><ul><li><p class="paragraph" style="text-align:left;">Fetch data from multiple services</p></li><li><p class="paragraph" style="text-align:left;">Join the results using shared keys</p></li><li><p class="paragraph" style="text-align:left;">Return a single response</p></li></ul><p class="paragraph" style="text-align:left;">For example: <b>fetch customer data, fetch related orders, and combine everything, all in one query.</b></p><h2 class="heading" style="text-align:left;" id="balancing-flexibility-and-control">Balancing Flexibility and Control</h2><p class="paragraph" style="text-align:left;">Here’s the tricky part. You want teams to move independently. 
But you also want APIs to stay consistent. This is where the idea of a <b>federation protocol</b> comes in.</p><h3 class="heading" style="text-align:left;" id="what-is-a-federation-protocol">What is a federation protocol?</h3><p class="paragraph" style="text-align:left;">A set of rules that:</p><ul><li><p class="paragraph" style="text-align:left;">Services must follow to be part of the federated API</p></li><li><p class="paragraph" style="text-align:left;">Define how conflicts are handled</p></li><li><p class="paragraph" style="text-align:left;">Control how schemas evolve</p></li></ul><p class="paragraph" style="text-align:left;"><b>Example rules:</b></p><ul><li><p class="paragraph" style="text-align:left;">If two services define the same entity → they must agree on keys</p></li><li><p class="paragraph" style="text-align:left;">If attributes conflict → resolve or rename</p></li><li><p class="paragraph" style="text-align:left;">If a service can’t fully support federation → fallback mechanisms apply</p></li></ul><p class="paragraph" style="text-align:left;">It’s not fully centralized, and not completely decentralized either. It sits somewhere in between, giving teams freedom while still maintaining consistency.</p><h2 class="heading" style="text-align:left;" id="handling-changes-over-time">Handling Changes Over Time</h2><p class="paragraph" style="text-align:left;">In real systems, APIs change all the time. 
Federation supports this by:</p><ul><li><p class="paragraph" style="text-align:left;">Automatically adapting to new attributes</p></li><li><p class="paragraph" style="text-align:left;">Reflecting removed fields</p></li><li><p class="paragraph" style="text-align:left;">Allowing gradual adoption of federation patterns</p></li></ul><p class="paragraph" style="text-align:left;">This means you don’t need a perfect system from day one; you can evolve into federation over time.</p><h2 class="heading" style="text-align:left;" id="keeping-things-stable-in-production">Keeping Things Stable in Production</h2><p class="paragraph" style="text-align:left;">Adding a federation layer shouldn’t break things, so it must work with your existing:</p><ul><li><p class="paragraph" style="text-align:left;">Authentication systems</p></li><li><p class="paragraph" style="text-align:left;">Caching</p></li><li><p class="paragraph" style="text-align:left;">Rate limiting</p></li><li><p class="paragraph" style="text-align:left;">Monitoring</p></li><li><p class="paragraph" style="text-align:left;">SLAs</p></li></ul><p class="paragraph" style="text-align:left;">At the same time, federation introduces shared requirements like common authentication across services and shared entity identifiers.</p><h2 class="heading" style="text-align:left;" id="does-federation-affect-performance">Does Federation Affect Performance?</h2><p class="paragraph" style="text-align:left;">A common concern: “Won’t federation slow things down?”</p><p class="paragraph" style="text-align:left;">Well, not necessarily. 
In fact:</p><ul><li><p class="paragraph" style="text-align:left;">It can improve efficiency by optimizing queries</p></li><li><p class="paragraph" style="text-align:left;">It allows scaling specific services independently</p></li></ul><p class="paragraph" style="text-align:left;">The key is:</p><ul><li><p class="paragraph" style="text-align:left;">Efficient query planning</p></li><li><p class="paragraph" style="text-align:left;">Leveraging the capabilities of the underlying services</p></li></ul><h2 class="heading" style="text-align:left;" id="making-it-work-at-scale">Making it Work at Scale</h2><p class="paragraph" style="text-align:left;">Federation isn’t just an architectural idea; it also impacts your development lifecycle.</p><p class="paragraph" style="text-align:left;">You need:</p><ul><li><p class="paragraph" style="text-align:left;">CI/CD pipelines</p></li><li><p class="paragraph" style="text-align:left;">Metadata management</p></li><li><p class="paragraph" style="text-align:left;">API contract validation</p></li><li><p class="paragraph" style="text-align:left;">Automated deployments</p></li></ul><p class="paragraph" style="text-align:left;">Every time a service changes:</p><ul><li><p class="paragraph" style="text-align:left;">The federation layer must update</p></li><li><p class="paragraph" style="text-align:left;">Consistency checks must run</p></li></ul><h2 class="heading" style="text-align:left;" id="tools-that-help-like-data-graph">Tools That Help (Like DataGraph)</h2><p class="paragraph" style="text-align:left;">Tools like MuleSoft’s DataGraph help automate this entire process.</p><p class="paragraph" style="text-align:left;">They:</p><ul><li><p class="paragraph" style="text-align:left;">Use API metadata (like OAS/RAML)</p></li><li><p class="paragraph" style="text-align:left;">Generate federated schemas</p></li><li><p class="paragraph" style="text-align:left;">Handle query execution</p></li><li><p class="paragraph" style="text-align:left;">Provide monitoring and 
logging</p></li></ul><p class="paragraph" style="text-align:left;">And importantly, they simplify federation into a <b>manageable workflow.</b></p><hr class="content_break"><hr class="content_break"><h2 class="heading" style="text-align:left;" id="key-takeaways">Key Takeaways</h2><p class="paragraph" style="text-align:left;">API Federation is not just a technical solution; it’s a way of thinking. It helps solve one of the biggest problems in modern systems: <b>How do you scale without losing control?</b></p><p class="paragraph" style="text-align:left;">By:</p><ul><li><p class="paragraph" style="text-align:left;">Structuring APIs into bounded contexts</p></li><li><p class="paragraph" style="text-align:left;">Designing modular features</p></li><li><p class="paragraph" style="text-align:left;">Combining services intelligently</p></li><li><p class="paragraph" style="text-align:left;">Balancing autonomy with governance</p></li></ul><p class="paragraph" style="text-align:left;">In the end, it gives you the best of both worlds: independent teams and a unified experience. And that’s what makes it powerful.</p><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;">Official blog from Salesforce: <a class="link" href="https://engineering.salesforce.com/api-federation-growing-scalable-api-landscapes-a0f1f0dad506/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=when-microservices-get-messy-how-api-federation-brings-order" target="_blank" rel="noopener noreferrer nofollow">API Federation: growing scalable API landscapes</a></p><figcaption class="blockquote__byline"></figcaption></blockquote></div><p class="paragraph" style="text-align:left;">By now, you should have a clear idea of <b>how API Federation brings order when microservices get messy. </b>In a nutshell, API Federation unifies multiple microservices into a single API using bounded contexts, shared models, and feature-based design. 
It balances team autonomy with consistency while simplifying complex API ecosystems.</p><p class="paragraph" style="text-align:left;"><b>Congratulations! You&#39;ve just advanced another step in your tech journey. Keep progressing!</b></p><hr class="content_break"></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=528a1ccf-98e1-42e0-80de-9c32eb3fbef5&utm_medium=post_rss&utm_source=hello_world_system_design_newsletter">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Redis leak, Zapocalypse and AI Chief of Staff!</title>
  <description>Redis leak, Zapocalypse, and how you can turn all your photos into a movie ft. Google</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/9368df95-73ef-420b-8f6b-06ef51c0053d/Pasted_image__32_.png" length="1036995" type="image/png"/>
  <link>https://hw.glich.co/p/bots-generate-more-web-traffic-than-humans</link>
  <guid isPermaLink="true">https://hw.glich.co/p/bots-generate-more-web-traffic-than-humans</guid>
  <pubDate>Sat, 06 Jun 2026 04:30:00 +0000</pubDate>
  <atom:published>2026-06-06T04:30:00Z</atom:published>
    <dc:creator>Aniket Rawat</dc:creator>
    <category><![CDATA[News]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;"><b>State of AI Engineering</b></p><p class="paragraph" style="text-align:left;"><a class="link" href="https://r2trck.com/hello-world-datadog-10?utm_medium=newsletter&utm_source=hello-world-r&utm_campaign=dg-content-researchreport-2026StateofAIEngineering-ai-aiob-ww-en-701VY00000mzxcFYAQ&utm_content=paid&utm_term=1-1-2026" target="_blank" rel="noopener noreferrer nofollow">What The State of AI Engineering Report Reveals About AI in Production</a></p><p class="paragraph" style="text-align:left;">AI engineering has moved past experimentation, but are your observability practices keeping up? Datadog analyzed LLM telemetry from 1,000+ customers to reveal what&#39;s actually happening at scale: model adoption shifts, hidden token costs, and what the rise of agentic frameworks means for reliability. Download the report to benchmark your AI stack against production reality.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/2aec51b1-9bbf-444c-8f76-5c2c4dfe568a/state_of_ai_llm_2026_type_graphics_260130_v01_hero_3800_x_1930.png?t=1780716635"/></div><hr class="content_break"><p class="paragraph" style="text-align:left;"><b>AI Finds What Humans Missed ft. Redis - </b>An AI security tool discovered a critical Redis vulnerability that had gone unnoticed for over two years. The flaw could have allowed attackers to compromise vulnerable systems. It’s another sign that AI is becoming a powerful ally in modern cybersecurity. 
<a class="link" href="https://www.cyberkendra.com/2026/06/an-ai-security-tool-dug-up-2-year-old.html?utm_source=tldrinfosec#google_vignette" target="_blank" rel="noopener noreferrer nofollow">Read more.</a></p><p class="paragraph" style="text-align:left;"><b>Zapier got Zapocalypse - </b>Researchers uncovered an attack chain that could have enabled full takeover of authenticated Zapier accounts. By chaining together multiple seemingly minor security weaknesses, they achieved escalating levels of access. The issue has since been fixed, highlighting the risks hidden in complex cloud environments. <a class="link" href="https://www.token.security/blog/zapocalypse-the-attack-chain-that-could-have-hijacked-zapier?utm_source=tldrinfosec" target="_blank" rel="noopener noreferrer nofollow">Read more.</a></p><hr class="content_break"><h2 class="heading" style="text-align:left;" id="the-internet-has-a-new-majority-bot">The Internet Has a New Majority: Bots</h2><p class="paragraph" style="text-align:left;">For decades, the web was built around human users. Every click, search, purchase, and page visit was driven by people navigating the internet themselves. That reality is changing faster than many experts expected.</p><p class="paragraph" style="text-align:left;">According to Cloudflare CEO Matthew Prince, automated bots now generate more web traffic than humans for the first time in internet history. Recent Cloudflare data shows bots account for 57.5% of HTTP requests, while human traffic has fallen to 42.5%. What makes this shift remarkable is that Prince previously predicted this crossover would not happen until 2027.</p><h3 class="heading" style="text-align:left;" id="not-the-bots-youre-thinking-of">Not the Bots You&#39;re Thinking Of</h3><p class="paragraph" style="text-align:left;">When most people hear the word &quot;bot,&quot; they think of spam, scams, or search engine crawlers. 
However, the new wave of traffic comes from AI-powered agents acting on behalf of users.</p><p class="paragraph" style="text-align:left;">These agents can browse websites, compare products, check prices, search for flights, gather information, and even complete multi-step tasks without requiring direct human interaction. Instead of people visiting ten websites to compare options, an AI agent may do that work automatically and return a summarized result.</p><p class="paragraph" style="text-align:left;">This represents a significant change in how the internet is being used.</p><h3 class="heading" style="text-align:left;" id="the-rise-of-agentic-traffic">The Rise of Agentic Traffic</h3><p class="paragraph" style="text-align:left;">The growth of AI assistants and autonomous agents is fueling this trend. Modern AI systems are increasingly capable of navigating websites, collecting information, and performing actions that once required human clicks.</p><p class="paragraph" style="text-align:left;">Cloudflare has been tracking these new categories of visitors, including verified bots and signed agents. Their data suggests that AI agents are now operating at a scale large enough to reshape internet traffic patterns.</p><p class="paragraph" style="text-align:left;">For website owners, this means a growing percentage of visitors may never actually see a webpage. Instead, AI systems may consume the content, extract relevant information, and present it directly to users elsewhere.</p><h3 class="heading" style="text-align:left;" id="why-humans-still-dominate-attention">Why Humans Still Dominate Attention</h3><p class="paragraph" style="text-align:left;">Despite the traffic numbers, humans remain the primary consumers of online content in terms of engagement and time spent.</p><p class="paragraph" style="text-align:left;">Activities such as watching videos, scrolling social media feeds, gaming, and using mobile apps generate long sessions but relatively few HTTP requests. 
AI agents, on the other hand, can generate thousands of requests in a short period while gathering information across multiple websites.</p><p class="paragraph" style="text-align:left;">In other words, bots may be winning the traffic battle, but humans still dominate the attention economy.</p><h3 class="heading" style="text-align:left;" id="what-comes-next">What Comes Next?</h3><p class="paragraph" style="text-align:left;">The shift signals the beginning of a new internet era. Websites were originally designed for human visitors, but they may increasingly need to serve both people and AI agents.</p><p class="paragraph" style="text-align:left;">Businesses, publishers, and developers will face new challenges around content access, AI scraping, monetization, and digital identity. As agentic AI becomes more capable, the web may evolve from a place humans browse directly into a network where software agents do much of the browsing for us.</p><p class="paragraph" style="text-align:left;">Cloudflare&#39;s data suggests that future has already started. The only surprise is how quickly we arrived there.</p><hr class="content_break"><p class="paragraph" style="text-align:left;"><b>AI Chief of Staff -</b> <a class="link" href="https://www.computerworld.com/article/4181295/asana-launches-ai-chief-of-staff-to-keep-projects-on-track.html?utm_source=tldrit" target="_blank" rel="noopener noreferrer nofollow">Asana has launched Dash</a>, an AI-powered &quot;chief of staff&quot; designed to monitor projects across emails, calendars, messaging apps, and Asana itself. 
It can spot risks, recommend next steps, and even coordinate AI agents to keep work moving.</p><p class="paragraph" style="text-align:left;"><b>Google DreamBeans turns your life into a cartoon:</b> <a class="link" href="https://techcrunch.com/2026/06/03/googles-dreambeans-its-weirdest-named-ai-tool-to-date-will-turn-your-life-into-a-cartoon/?utm_source=tldrdesign" target="_blank" rel="noopener noreferrer nofollow">Google’s experimental AI tool, DreamBeans</a>, turns data from your Google apps into personalized illustrated stories and lifestyle recommendations. It can transform memories, trips, and interests into cartoon-like visual narratives generated by AI.</p><hr class="content_break"><h3 class="heading" style="text-align:left;" id="buzz-of-the-week">Buzz of the Week:</h3><h3 class="heading" style="text-align:left;" id="capability-routing">Capability Routing</h3><p class="paragraph" style="text-align:left;">Capability Routing is the process of dynamically selecting the best AI model, tool, API, or agent for a specific task instead of sending every request to a single system. Modern AI platforms use capability routing to decide whether a query needs code generation, web search, reasoning, image creation, or data analysis. Done well, this can improve accuracy, reduce costs, and speed up response times. As AI agents become more specialized, capability routing is emerging as a core architectural pattern for agentic systems. Many developers use it unknowingly through AI platforms, but few recognize it as a distinct engineering concept.</p><hr class="content_break"><h3 class="heading" style="text-align:left;" id="things-that-launched-things-that-we">Things that launched. Things that went viral. 
Things you&#39;ll pretend to try.</h3><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/9765551e-0037-4459-b1c5-75b0e223908c/image.png?t=1751469165"/></div><p class="paragraph" style="text-align:left;"><b>xsv</b></p><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/burntsushi/xsv?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=redis-leak-zapocalypse-and-ai-chief-of-staff" target="_blank" rel="noopener noreferrer nofollow">xsv</a> is a lightning-fast toolkit for CSV manipulation.</p><p class="paragraph" style="text-align:left;"><b>ripgrep-all</b></p><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/phiresky/ripgrep-all?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=redis-leak-zapocalypse-and-ai-chief-of-staff" target="_blank" rel="noopener noreferrer nofollow">ripgrep-all</a> extends ripgrep to search inside PDFs, Office docs, ebooks, and archives.</p><p class="paragraph" style="text-align:left;"><b>jqp</b></p><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/noahgorstein/jqp?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=redis-leak-zapocalypse-and-ai-chief-of-staff" target="_blank" rel="noopener noreferrer nofollow">jqp</a> is an interactive playground for building jq queries visually.</p><hr class="content_break"><h3 class="heading" style="text-align:left;" id="build-braincells-not-just-features"><b>Build Braincells, Not Just Features</b></h3><p class="paragraph" style="text-align:left;">This weekend’s read: <a class="link" href="https://engineering.atspotify.com/2026/6/code-with-claude-coding-is-no-longer-the-constraint?utm_source=tldrdevops" target="_blank" rel="noopener 
noreferrer nofollow">Coding is no Longer the Constraint.</a></p><p class="paragraph" style="text-align:left;">This week’s watch: <a class="link" href="https://youtu.be/AotCL-v--4Q?si=VrY_8DN4gSXylkdn&utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=redis-leak-zapocalypse-and-ai-chief-of-staff" target="_blank" rel="noopener noreferrer nofollow">Why everyone is Shifting to Wired earphones.</a></p><hr class="content_break"><p class="paragraph" style="text-align:left;">Meanwhile…</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/23060ebe-711a-4364-8138-78f80c466f12/image.png?t=1780716412"/></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=4112ad39-517a-4095-972b-da58b23b24df&utm_medium=post_rss&utm_source=hello_world_system_design_newsletter">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>What Is Weak Consistency?</title>
  <description>Struggling to understand weak consistency? This guide explains the concept with simple analogies, real-world examples, and performance trade-offs.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/3424aaf7-dd05-4b8a-aef8-10ad47678f06/article-image-bd77aaaf-98df-49b9-b020-48e7611903f6.jpg" length="63664" type="image/jpeg"/>
  <link>https://hw.glich.co/p/what-is-weak-consistency</link>
  <guid isPermaLink="true">https://hw.glich.co/p/what-is-weak-consistency</guid>
  <pubDate>Wed, 03 Jun 2026 04:30:00 +0000</pubDate>
  <atom:published>2026-06-03T04:30:00Z</atom:published>
    <dc:creator>Rohit Lakhotia</dc:creator>
    <category><![CDATA[News]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><hr class="content_break"><p class="paragraph" style="text-align:left;">Weak consistency is a model in distributed systems where updates to data don&#39;t always show up for everyone instantly. This means if you read from the system, you&#39;re <b>not guaranteed</b> to see the very latest write. For a little while, different locations might see different versions of the same information.</p><h2 class="heading" style="text-align:left;" id="defining-weak-consistency-without-t">Defining Weak Consistency Without the Jargon</h2><p class="paragraph" style="text-align:left;">Ever collaborated with a team on a shared document in the cloud? You make a crucial edit, but for a few seconds, or maybe even longer, your colleagues are still looking at the old version.</p><p class="paragraph" style="text-align:left;">That delay, that brief moment of data misalignment, is the core idea behind <b>weak consistency</b>. It’s not a bug; it&#39;s a deliberate design choice that prioritizes system speed and availability over perfect, real-time data synchronization across every single location.</p><p class="paragraph" style="text-align:left;">In distributed systems, weak consistency means replicas synchronize with only minimal guarantees. When data is updated on one server, other servers may still show outdated information until the update propagates through the network. Although this might seem like a drawback, it keeps the system fast by avoiding the need for immediate global agreement on every update. 
More insights can be found at <a class="link" href="https://www.geeksforgeeks.org/system-design/weak-vs-eventual-consistency-in-system-design/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-weak-consistency" target="_blank" rel="noopener noreferrer nofollow">GeeksforGeeks</a>.</p><h3 class="heading" style="text-align:left;" id="prioritizing-speed-over-immediacy">Prioritizing Speed Over Immediacy</h3><p class="paragraph" style="text-align:left;">Weak consistency enables systems to quickly respond to user requests without waiting for global consensus, ensuring high availability and low latency in large-scale applications. For instance, on a viral social media post, a slight delay in displaying the exact &quot;like&quot; count to all users is acceptable. The system focuses on recording your &quot;like&quot; instantly rather than synchronizing the count globally at every moment. In practice, updates usually become visible to everyone eventually, although under the strictest definition of weak consistency there is no guarantee that they ever will.</p><h3 class="heading" style="text-align:left;" id="how-it-compares-to-other-models">How It Compares to Other Models</h3><p class="paragraph" style="text-align:left;">To really understand <b>what weak consistency is</b>, it helps to see how it stacks up against the alternatives. Different systems need different levels of data integrity. A banking application, for instance, could never tolerate stale data: every single transaction has to be immediate and universally visible. 
This is where stronger consistency models come into play.</p><p class="paragraph" style="text-align:left;">Let&#39;s break down the fundamental trade-offs between weak consistency and two other common models.</p><h3 class="heading" style="text-align:left;" id="quick-comparison-of-consistency-mod">Quick Comparison of Consistency Models</h3><p class="paragraph" style="text-align:left;">This table lays out the core differences in guarantees, performance, and availability.</p><div style="padding:14px 20px 14px;"><table class="bh__table" width="100%" style="border-collapse:collapse;"><tr class="bh__table_row"><th class="bh__table_header" width="25%"><p class="paragraph" style="text-align:left;">Consistency Model</p></th><th class="bh__table_header" width="25%"><p class="paragraph" style="text-align:left;">Read Guarantee</p></th><th class="bh__table_header" width="25%"><p class="paragraph" style="text-align:left;">Performance</p></th><th class="bh__table_header" width="25%"><p class="paragraph" style="text-align:left;">Availability</p></th></tr><tr class="bh__table_row"><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;"><b>Strong Consistency</b></p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Guarantees reads return the most recent write.</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Slower due to synchronization overhead.</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Lower, as it may be unavailable during partitions.</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;"><b>Eventual Consistency</b></p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Guarantees that if no new updates are made, all replicas will eventually converge.</p></td><td class="bh__table_cell" width="25%"><p 
class="paragraph" style="text-align:left;">Faster, as writes can proceed without immediate consensus.</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Higher, as it can operate during partitions.</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;"><b>Weak Consistency</b></p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Offers no guarantee about when a read will return the latest write.</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Fastest, as it has the fewest synchronization rules.</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Highest, offering maximum operational uptime.</p></td></tr></table></div><p class="paragraph" style="text-align:left;">As you can see, there&#39;s a clear trade-off: the stronger the guarantee you need for data accuracy, the more you typically sacrifice in performance and availability. <b>Weak consistency</b> sits at the far end of this spectrum, giving you maximum speed by providing the most relaxed data guarantees.</p><h2 class="heading" style="text-align:left;" id="comparing-consistency-models-strong">Comparing Consistency Models: Strong, Eventual, and Weak</h2><p class="paragraph" style="text-align:left;">To understand weak consistency, it helps to compare it with stricter models. Choosing the right consistency model is a significant architectural decision, focusing on what&#39;s suitable for the application rather than what&#39;s &quot;best.&quot;</p><p class="paragraph" style="text-align:left;">Consider some real-world examples:</p><h3 class="heading" style="text-align:left;" id="strong-consistency">Strong Consistency</h3><p class="paragraph" style="text-align:left;">In banking apps like Chase or Bank of America, transactions require <b>strong consistency</b>. 
When you transfer money, your balance must update immediately and correctly for subsequent actions, with no allowance for outdated data.</p><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;"><b>Strong Consistency:</b> Ensures any read reflects the most recent successful write, providing immediate and accurate data across the system.</p><figcaption class="blockquote__byline"></figcaption></blockquote></div><h3 class="heading" style="text-align:left;" id="the-middle-ground-eventual-consiste">The Middle Ground: Eventual Consistency</h3><p class="paragraph" style="text-align:left;">Consider DNS (the Domain Name System). When a website owner updates their domain&#39;s IP address, the change doesn&#39;t have to reach every DNS server instantly. DNS relies on <b>eventual consistency</b>: the update is accepted right away, and as cached records expire (typically within hours) all DNS servers come to reflect the new address. Tolerating this delay supports the resilience and decentralization of the internet.</p><p class="paragraph" style="text-align:left;">These aspects are central to distributed systems. For more insight into managing these trade-offs, check out our guide on <a class="link" href="https://hw.glich.co/p/what-is-cap-theorem?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-weak-consistency" target="_blank" rel="noopener noreferrer nofollow">the CAP theorem</a>, which explains the trade-off between consistency, availability, and partition tolerance.</p><h3 class="heading" style="text-align:left;" id="weak-consistency-in-action">Weak Consistency in Action</h3><p class="paragraph" style="text-align:left;">Think of a VoIP service like Skype, or a multiplayer online game, that uses UDP for real-time communication. Both are examples of <b>weak consistency</b>. 
The system focuses on sending data packets quickly, without ensuring every packet arrives or maintains order.</p><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;"><b>Weak Consistency&#39;s Promise:</b> The system makes no promises about when, or even if, an update will be visible to all subsequent reads. Performance and availability are king here, and it’s okay if the data is temporarily out of sync.</p><figcaption class="blockquote__byline"></figcaption></blockquote></div><p class="paragraph" style="text-align:left;">This image helps visualize where different models fall under this broad umbrella.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/8832b3b7-b982-42db-af50-f7ddb629b0ec/4e3c4109-5788-4a41-9afb-9ee0db69afde.jpg?t=1780415470"/></div><p class="paragraph" style="text-align:left;">Even &quot;weak&quot; isn&#39;t singular; there are different guarantee levels from eventual synchronization to models like causal and session consistency. By allowing some delay in data sync, architects can significantly boost performance and resilience, particularly in large-scale systems where strict consistency would cause bottlenecks.</p><h2 class="heading" style="text-align:left;" id="performance-and-availability-trade-">Performance and Availability Trade-Offs</h2><p class="paragraph" style="text-align:left;">Why allow temporarily outdated data? It seems problematic, but it&#39;s intentional. 
The trade-off is giving up some immediate data uniformity in exchange for a large increase in performance and availability.</p><p class="paragraph" style="text-align:left;">Weak consistency is a strategic decision to create faster, more resilient applications that can handle heavy traffic.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/f408f6b6-e535-4b63-a47e-c99da140e11f/8261d267-04a6-409d-b701-2e8b4623f181.jpg?t=1780415470"/></div><p class="paragraph" style="text-align:left;">This decision makes perfect sense once you look at the cost of strong consistency. To guarantee every single user sees the latest update the <i>instant</i> it happens, a system has to jump through hoops with complex synchronization protocols. A write operation isn&#39;t considered &quot;done&quot; until the required replicas (in the strictest setups, every server in the network) confirm that they have received and applied the update. All that waiting adds precious milliseconds of latency to every request.</p><h3 class="heading" style="text-align:left;" id="slashing-latency-for-a-better-user-">Slashing Latency for a Better User Experience</h3><p class="paragraph" style="text-align:left;">In a weakly consistent system, a server can immediately respond to a user&#39;s request without waiting for global consensus, which significantly reduces latency and makes the experience feel quick. For example, when posting a comment on a YouTube video, the system accepts it instantly and updates it across servers in the background. In contrast, a strongly consistent system might delay your confirmation to ensure it&#39;s visible everywhere. Benchmarks on systems such as Apache Cassandra are reported to show that weakly consistent writes can be much faster, with latency dropping from over 100 milliseconds to under 30 milliseconds. 
For more details, see this <a class="link" href="https://www.geeksforgeeks.org/system-design/weak-consistency-in-system-design/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-weak-consistency" target="_blank" rel="noopener noreferrer nofollow">consistency models deep dive</a>.</p><h2 class="heading" style="text-align:left;" id="seeing-weak-consistency-in-real-wor">Seeing Weak Consistency in Real-World Systems</h2><p class="paragraph" style="text-align:left;">Theory is great, but the real &quot;aha!&quot; moment comes when you see where this stuff actually gets used. The truth is, you probably interact with weakly consistent systems every single day without a second thought. These aren&#39;t just abstract concepts for textbooks; they&#39;re the practical engineering choices that make many of the world&#39;s biggest apps feel so fast and responsive.</p><p class="paragraph" style="text-align:left;">Engineers choose <b>weak consistency</b> for a simple reason: sometimes, speed and uptime are just flat-out more important than having perfectly synchronized data across the globe at every single millisecond.</p><p class="paragraph" style="text-align:left;">The diagram below gives you a quick visual. A &quot;write&quot; happens, but for a little while, different nodes (P1, P2) might still serve up the old data. Eventually, they catch up.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/98a986b1-57d9-4e17-a81c-9a888d70335c/1581d21f-37d3-4e47-8246-e6fb376e8e50.jpg?t=1780415470"/></div><p class="paragraph" style="text-align:left;">This is the core trade-off in action. 
The system temporarily allows data to be &quot;stale&quot; in some places to keep everything available and performing at lightning speed.</p><h3 class="heading" style="text-align:left;" id="non-critical-analytics-and-logging">Non-Critical Analytics and Logging</h3><p class="paragraph" style="text-align:left;">Netflix collects a vast amount of data from user interactions, such as clicks and views, which is ideal for weak consistency.</p><ul><li><p class="paragraph" style="text-align:left;"><b>Goal:</b> Log billions of events without slowing down the app.</p></li><li><p class="paragraph" style="text-align:left;"><b>Method:</b> Events are sent to the nearest server, without waiting for global sync.</p></li><li><p class="paragraph" style="text-align:left;"><b>Result:</b> The app remains smooth for users, while data is processed later. A slight delay in analytics is acceptable for a seamless user experience.</p></li></ul><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;">The main advantage is decoupling: the app&#39;s functionality isn&#39;t hindered by the analytics system. It sends data and continues, trusting the system to handle it later.</p><figcaption class="blockquote__byline"></figcaption></blockquote></div><h3 class="heading" style="text-align:left;" id="high-stakes-scenarios">High-Stakes Scenarios</h3><p class="paragraph" style="text-align:left;">Certain domains require the immediate, unified view of data that only strong consistency can provide. Using a weaker model in these cases could be costly and potentially disastrous.</p><p class="paragraph" style="text-align:left;">Examples where weak consistency is unsuitable include:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Financial Transactions:</b> Online banking, fund transfers, and payment processing demand atomic operations visible across the entire system. 
A weak model might display incorrect balances or allow duplicate transactions.</p></li><li><p class="paragraph" style="text-align:left;"><b>E-commerce Systems:</b> In platforms like Amazon, a weakly consistent system could lead to overselling by allowing multiple users to purchase the last item due to outdated inventory views.</p></li><li><p class="paragraph" style="text-align:left;"><b>Booking and Reservation Systems:</b> Platforms like Expedia and Booking.com must ensure that once a reservation is made, it&#39;s immediately unavailable to others to avoid double-booking.</p></li></ul><p class="paragraph" style="text-align:left;">If operations need a &quot;read-your-own-writes&quot; guarantee or must prevent user conflicts over the same resource, strong consistency is essential. Data accuracy is crucial in these contexts.</p><p class="paragraph" style="text-align:left;">When designing systems, it&#39;s important to weigh these needs carefully and safeguard sensitive financial and user data. Understanding <a class="link" href="https://hw.glich.co/p/what-is-encryption-and-different-types-of-encryption?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-weak-consistency" target="_blank" rel="noopener noreferrer nofollow">encryption and its types</a> is also vital.</p><h2 class="heading" style="text-align:left;" id="questions-about-weak-consistency">Questions About Weak Consistency</h2><h3 class="heading" style="text-align:left;" id="is-weak-consistency-the-same-as-eve">Is Weak Consistency the Same as Eventual Consistency?</h3><p class="paragraph" style="text-align:left;">No, they are related but distinct. Weak consistency means there&#39;s no guarantee of reading the latest write, with no assurance of when or if data will sync across all nodes. 
Eventual consistency, a form of weak consistency, ensures that all replicas will eventually align if no new updates occur.</p><h3 class="heading" style="text-align:left;" id="how-do-systems-handle-data-conflict">How Do Systems Handle Data Conflicts?</h3><p class="paragraph" style="text-align:left;">Conflicts arise due to the lack of immediate synchronization. A common method is <b>Last Write Wins (LWW)</b>, where updates are timestamped, and the latest is accepted. Other methods, like vector clocks, offer more advanced conflict resolution, but LWW is popular in systems like Apache Cassandra.</p><h3 class="heading" style="text-align:left;" id="can-a-single-application-use-multip">Can a Single Application Use Multiple Consistency Models?</h3><p class="paragraph" style="text-align:left;">Yes, it&#39;s common in modern systems. For instance, an e-commerce platform might use:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Strong Consistency</b> for shopping carts and inventory.</p></li><li><p class="paragraph" style="text-align:left;"><b>Eventual Consistency</b> for user reviews and recommendations.</p></li><li><p class="paragraph" style="text-align:left;"><b>Weak Consistency</b> for analytics tracking.</p></li></ul><p class="paragraph" style="text-align:left;">This approach balances strict guarantees with performance and availability for different system parts.</p><hr class="content_break"></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=7d251427-eb05-43ce-9db3-9a34c31c57d0&utm_medium=post_rss&utm_source=hello_world_system_design_newsletter">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How Slack makes its Mobile App Feel Seamless (Even on Bad Internet)</title>
  <description>Slack optimizes mobile performance using prioritized APIs, caching + versioning, offline sync, and scalable architecture for reliable user experience.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/efb5c745-030b-4a1f-9b79-1c00a888a4d2/_-_visual_selection__3_.png" length="223586" type="image/png"/>
  <link>https://hw.glich.co/p/how-slack-makes-its-mobile-app-feel-seamless-even-on-bad-internet</link>
  <guid isPermaLink="true">https://hw.glich.co/p/how-slack-makes-its-mobile-app-feel-seamless-even-on-bad-internet</guid>
  <pubDate>Mon, 01 Jun 2026 04:30:00 +0000</pubDate>
  <atom:published>2026-06-01T04:30:00Z</atom:published>
    <dc:creator>Rohit Lakhotia</dc:creator>
    <category><![CDATA[Slack]]></category>
    <category><![CDATA[System Design]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;"></p><div class="section" style="background-color:#FFFFFF;border-color:#fd5621;border-radius:4px;border-style:solid;border-width:1px;margin:16.0px 16.0px 16.0px 16.0px;padding:16.0px 16.0px 16.0px 16.0px;"><p class="paragraph" style="text-align:left;"><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;font-size:16px;"><i>Welcome to </i></span><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;"><i><b><a class="link" href="https://hw.glich.co/subscribe?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=how-slack-makes-its-mobile-app-feel-seamless-even-on-bad-internet" target="_blank" rel="noopener noreferrer nofollow">Hello World</a></b></i></span><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;font-size:16px;"><i>, where we help software engineers learn the art of building scalable and resilient systems.</i></span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;font-size:16px;"><i>You can also check out: </i></span><a class="link" href="https://scaleengineer.com/blog/how-linkedin-rebuilt-service-discovery-to-scale-to-millions-of-services?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=how-slack-makes-its-mobile-app-feel-seamless-even-on-bad-internet" target="_blank" rel="noopener noreferrer nofollow">How LinkedIn Rebuilt Service Discovery to Scale to Millions of Services</a></p></div><hr class="content_break"><div class="image"><a class="image__link" href="https://r2trck.com/hello-world-datadog-11?utm_medium=newsletter&utm_source=hello-world-r&utm_campaign=dg-content-toolkit-2026AIEraDeveloper-delivery-cipipe-ww-en-701VY00000kMeE2YAK&utm_content=paid&utm_term=1-1-2026" rel="noopener" target="_blank"><img alt="" class="image__image" style="" 
src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/fa977086-b7fa-464a-8aa4-fee7e93e096b/frontend_developer_kit_ad4_1080x1080.png_page-0001.jpg?t=1780251773"/></a></div><h2 class="heading" style="text-align:left;" id="how-to-ship-better-frontend-experie">How to ship better frontend experiences faster</h2><p class="paragraph" style="text-align:left;">The Datadog <a class="link" href="https://r2trck.com/hello-world-datadog-11?utm_medium=newsletter&utm_source=hello-world-r&utm_campaign=dg-content-toolkit-2026AIEraDeveloper-delivery-cipipe-ww-en-701VY00000kMeE2YAK&utm_content=paid&utm_term=1-1-2026" target="_blank" rel="noopener noreferrer nofollow">Frontend Developer Kit</a> brings together essential resources to help you monitor performance, fix errors faster, and improve real user experience.</p><p class="paragraph" style="text-align:left;">Inside the kit, you’ll find:</p><ul><li><p class="paragraph" style="text-align:left;">Best practices guide on strengthening your frontend monitoring and testing strategies</p></li><li><p class="paragraph" style="text-align:left;">Step-by-step brief on catching and resolving frontend issues proactively</p></li><li><p class="paragraph" style="text-align:left;">Technical talk discussing how to better monitor and optimize frontend apps</p></li></ul><p class="paragraph" style="text-align:left;">If you’re building modern web apps, this kit helps you see what users see and fix what matters. 
</p><p class="paragraph" style="text-align:left;">Link to the Kit: <a class="link" href="https://r2trck.com/hello-world-datadog-11?utm_medium=newsletter&utm_source=hello-world-r&utm_campaign=dg-content-toolkit-2026AIEraDeveloper-delivery-cipipe-ww-en-701VY00000kMeE2YAK&utm_content=paid&utm_term=1-1-2026" target="_blank" rel="noopener noreferrer nofollow">Frontend Developer Kit</a></p><hr class="content_break"><hr class="content_break"><div class="section" style="background-color:transparent;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">Table of Contents</h2><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="#the-core-problem-mobile-is-unpredic" rel="noopener noreferrer nofollow">The Core Problem: Mobile is Unpredictable</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#smart-data-fetching" rel="noopener noreferrer nofollow">Smart Data Fetching</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#reducing-network-calls-with-smarter" rel="noopener noreferrer nofollow">Reducing Network Calls with Smarter Caching</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#versioning-this-is-the-secret-behin" rel="noopener noreferrer nofollow">Versioning (This is the Secret Behind Fresh Data)</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#fixing-the-bigger-problem-first-ie-" rel="noopener noreferrer nofollow">Fixing the Bigger Problem first i.e. 
Technical Deb …</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#risk-management-thinking-what-if-th" rel="noopener noreferrer nofollow">Risk Management (Thinking what if Things Break?)</a></p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="#what-are-feature-flags" rel="noopener noreferrer nofollow">What are feature flags?</a></p></li></ul></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#scaling-challenges-its-not-just-use" rel="noopener noreferrer nofollow">Scaling Challenges (It’s Not Just Users, It’s Comp …</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#building-for-offline-mode" rel="noopener noreferrer nofollow">Building for Offline Mode</a></p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="#what-does-this-component-do" rel="noopener noreferrer nofollow">What does this component do?</a></p></li></ul></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#why-this-matters" rel="noopener noreferrer nofollow">Why This Matters</a></p></li></ul></div><p class="paragraph" style="text-align:left;">When we use apps like Slack on our phones, we expect things to just work. Messages should load instantly, channels should update in real time, and everything should feel smooth, even when we’re switching networks, riding the metro, or dealing with poor connectivity. But behind the scenes, making this happen is far from simple. In this post, let’s look at how Slack’s engineering team solves some very real mobile challenges without overcomplicating things, and how they keep the app fast, reliable, and scalable.</p><h2 class="heading" style="text-align:left;" id="the-core-problem-mobile-is-unpredic">The Core Problem: Mobile is Unpredictable</h2><p class="paragraph" style="text-align:left;">On desktop, things are relatively easy. You’re usually on stable Wi-Fi. 
The app stays open for long sessions. Data can be fetched continuously without interruptions. But mobile is a completely different story.</p><p class="paragraph" style="text-align:left;">Users:</p><ul><li><p class="paragraph" style="text-align:left;">Move between networks (Wi-Fi → cellular → no signal)</p></li><li><p class="paragraph" style="text-align:left;">Open the app for short bursts</p></li><li><p class="paragraph" style="text-align:left;">Expect instant responses despite unstable connections</p></li></ul><p class="paragraph" style="text-align:left;">So the biggest challenge becomes: <b>How do you fetch the most important data quickly even when the network is unreliable?</b></p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/603f63bd-3d7f-4cde-9d5c-31b2961fe778/image.png?t=1778256449"/></div><h2 class="heading" style="text-align:left;" id="smart-data-fetching">Smart Data Fetching</h2><p class="paragraph" style="text-align:left;">Slack tackles this by prioritizing API requests. Instead of treating all data equally, the system decides:</p><ul><li><p class="paragraph" style="text-align:left;">What is <i>critical</i> right now?</p></li><li><p class="paragraph" style="text-align:left;">What can wait?</p></li></ul><p class="paragraph" style="text-align:left;">For example:</p><ul><li><p class="paragraph" style="text-align:left;">Messages in the current channel are treated as high priority</p></li><li><p class="paragraph" style="text-align:left;">Background updates are treated as lower priority</p></li></ul><p class="paragraph" style="text-align:left;">So when the app gets even a small window of connectivity, it fetches the <b>most important data first</b>. This ensures that users can read messages and interact with conversations even if everything else hasn’t loaded yet. 
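</p><p class="paragraph" style="text-align:left;">The prioritization idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Slack’s actual code: the <i>RequestQueue</i> class, the two priority levels, and the request names are all hypothetical.</p>

```python
import heapq
import itertools

# Hypothetical priority levels: a lower number is fetched sooner.
CRITICAL = 0    # e.g. messages in the channel the user is looking at
BACKGROUND = 1  # e.g. updates for data that is not on screen


class RequestQueue:
    """Orders pending requests so that critical ones are sent first."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker: FIFO within one level

    def add(self, priority, name):
        heapq.heappush(self._heap, (priority, next(self._order), name))

    def drain(self, budget):
        """Return up to `budget` request names for one connectivity window."""
        sent = []
        while self._heap and budget > len(sent):
            _, _, name = heapq.heappop(self._heap)
            sent.append(name)
        return sent


queue = RequestQueue()
queue.add(BACKGROUND, "custom_emoji")
queue.add(CRITICAL, "current_channel_messages")
queue.add(BACKGROUND, "unread_counts")

# Assume only two requests fit into this short window of connectivity.
first_window = queue.drain(budget=2)
print(first_window)
```

<p class="paragraph" style="text-align:left;">With room for only two requests in the window, the current channel’s messages go out before either background request, and the remaining background request waits for the next window.</p><p class="paragraph" style="text-align:left;">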
This one decision massively improves perceived performance.</p><h2 class="heading" style="text-align:left;" id="reducing-network-calls-with-smarter">Reducing Network Calls with Smarter Caching</h2><p class="paragraph" style="text-align:left;">Another key idea is <b>avoiding unnecessary data fetching</b>. Slack uses caching, but not in a naive way. Instead of blindly re-fetching data every time:</p><ul><li><p class="paragraph" style="text-align:left;">The app stores previously fetched data</p></li><li><p class="paragraph" style="text-align:left;">It checks whether that data is still valid before making new requests</p></li></ul><p class="paragraph" style="text-align:left;">This reduces <b>network usage, load time, </b>and<b> dependency on connectivity. </b>But it raises an important question: <b>How does the app know if cached data is still up to date?</b></p><h2 class="heading" style="text-align:left;" id="versioning-this-is-the-secret-behin">Versioning (This is the Secret Behind Fresh Data)</h2><p class="paragraph" style="text-align:left;">Slack solves this using something called <b>object versioning</b>. Here’s the idea in simple terms: <i><b>Every piece of data (like a message or channel) has a version.</b></i></p><p class="paragraph" style="text-align:left;">When the app has cached data:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">It asks the server: <i>“Is my version still the latest?”</i></p></li><li><p class="paragraph" style="text-align:left;">If yes → no need to download again</p></li><li><p class="paragraph" style="text-align:left;">If no → fetch the updated version</p></li></ol><p class="paragraph" style="text-align:left;">This avoids <b>re-downloading identical data, wasting bandwidth, </b>and<b> slowing down the app. </b>The implementation isn’t trivial, though. 
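</p><p class="paragraph" style="text-align:left;">The version check in the numbered steps above might look roughly like the following Python sketch. Everything named here is an assumption made for illustration (the <i>FakeServer</i> and <i>CachedClient</i> classes, integer version numbers, and a separate cheap “latest version” call); the source does not describe Slack’s real API.</p>

```python
class FakeServer:
    """Stand-in for the backend: maps an object id to (version, payload)."""

    def __init__(self):
        self.objects = {"channel-1": (3, "general")}
        self.payload_downloads = 0

    def latest_version(self, object_id):
        # Cheap call: returns only the version number.
        return self.objects[object_id][0]

    def fetch(self, object_id):
        # Expensive call: returns the version together with the full payload.
        self.payload_downloads += 1
        return self.objects[object_id]


class CachedClient:
    """Hypothetical client that re-downloads only when the version differs."""

    def __init__(self, server):
        self.server = server
        self.cache = {}  # object id -> (version, payload)

    def get(self, object_id):
        cached = self.cache.get(object_id)
        if cached is not None and cached[0] == self.server.latest_version(object_id):
            return cached[1]  # version matches: reuse the cached copy
        fresh = self.server.fetch(object_id)  # missing or stale: download
        self.cache[object_id] = fresh
        return fresh[1]


server = FakeServer()
client = CachedClient(server)
client.get("channel-1")  # first read downloads the payload
client.get("channel-1")  # second read is served from the cache
server.objects["channel-1"] = (4, "general-chat")  # object changes on the server
name = client.get("channel-1")  # version mismatch triggers a re-download
print(name, server.payload_downloads)
```

<p class="paragraph" style="text-align:left;">In this sketch the payload is downloaded twice across three reads: once on the first read and once after the version changes on the server.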
</p><p class="paragraph" style="text-align:left;">Versioning logic:</p><ul><li><p class="paragraph" style="text-align:left;">Differs based on the type of object</p></li><li><p class="paragraph" style="text-align:left;">Requires coordination between client and server</p></li><li><p class="paragraph" style="text-align:left;">Needs accurate version calculation</p></li></ul><p class="paragraph" style="text-align:left;">But once done right, it leads to a <b>significant reduction in data usage</b>.</p><h2 class="heading" style="text-align:left;" id="fixing-the-bigger-problem-first-ie-">Fixing the Bigger Problem first i.e. Technical Debt</h2><p class="paragraph" style="text-align:left;">Over time, Slack’s mobile codebase started facing issues:</p><ul><li><p class="paragraph" style="text-align:left;">Inconsistent feature development</p></li><li><p class="paragraph" style="text-align:left;">Slower delivery</p></li><li><p class="paragraph" style="text-align:left;">Difficulty meeting product deadlines</p></li><li><p class="paragraph" style="text-align:left;">Growing technical debt</p></li></ul><p class="paragraph" style="text-align:left;">This is something almost every large system faces eventually. So instead of patching things slowly, the team took a bold step: <b>They revamped the mobile architecture.</b></p><p class="paragraph" style="text-align:left;">This included:</p><ul><li><p class="paragraph" style="text-align:left;">Major codebase changes</p></li><li><p class="paragraph" style="text-align:left;">Cleaning up legacy code</p></li><li><p class="paragraph" style="text-align:left;">Introducing a new feature architecture (especially on iOS)</p></li></ul><p class="paragraph" style="text-align:left;">This wasn’t easy, but it resulted in <b>faster development, better consistency across features, </b>and<b> an improved ability to scale. 
</b></p><p class="paragraph" style="text-align:left;">This shows us that sometimes the only way forward is to <b>rebuild the foundation properly</b>.</p><h2 class="heading" style="text-align:left;" id="risk-management-thinking-what-if-th">Risk Management (Thinking what if Things Break?)</h2><p class="paragraph" style="text-align:left;">Making changes to a mobile app, especially to something like networking, is risky. One wrong change can break critical functionality, impact millions of users, and cause cascading failures. So Slack uses <b>feature flags (change toggling)</b>.</p><h3 class="heading" style="text-align:left;" id="what-are-feature-flags">What are feature flags?</h3><p class="paragraph" style="text-align:left;">They allow engineers to:</p><ul><li><p class="paragraph" style="text-align:left;">Turn features ON/OFF remotely</p></li><li><p class="paragraph" style="text-align:left;">Roll out changes gradually</p></li><li><p class="paragraph" style="text-align:left;">Quickly roll back if something goes wrong</p></li></ul><p class="paragraph" style="text-align:left;">So instead of pushing risky changes blindly, they test in controlled environments, monitor real-world impact, and disable a change instantly if needed. Still, no system is perfect.</p><p class="paragraph" style="text-align:left;">So they also:</p><ul><li><p class="paragraph" style="text-align:left;">Push <b>hotfixes</b> when urgent issues arise</p></li><li><p class="paragraph" style="text-align:left;">Perform deep analysis after failures</p></li><li><p class="paragraph" style="text-align:left;">Improve testing processes</p></li></ul><p class="paragraph" style="text-align:left;">This constant feedback loop helps them improve over time.</p><h2 class="heading" style="text-align:left;" id="scaling-challenges-its-not-just-use">Scaling Challenges (It’s Not Just Users, It’s Complexity)</h2><p class="paragraph" style="text-align:left;">Slack doesn’t just deal with millions of users. 
It also deals with:</p><ul><li><p class="paragraph" style="text-align:left;">Workspaces with <b>80,000 custom emojis</b></p></li><li><p class="paragraph" style="text-align:left;">Users in <b>20,000 channels</b></p></li><li><p class="paragraph" style="text-align:left;">Large teams constantly modifying the app</p></li></ul><p class="paragraph" style="text-align:left;">On top of that, their mobile engineering team itself has grown to <b>120+ engineers. </b>This introduces a different kind of scaling problem: <b>How do you manage a massive codebase with so many developers working on it?</b></p><p class="paragraph" style="text-align:left;">The answer is to <b>keep the architecture clean, build reusable components, </b>and<b> continuously refactor, </b>because scaling isn’t just about handling traffic; it’s also about handling <b>people and code complexity</b>.</p><h2 class="heading" style="text-align:left;" id="building-for-offline-mode">Building for Offline Mode</h2><p class="paragraph" style="text-align:left;">One of the most interesting problems Slack solved is: <b>How do you make features work even without internet?</b></p><p class="paragraph" style="text-align:left;">A feature team wanted users to:</p><ul><li><p class="paragraph" style="text-align:left;">Perform actions offline</p></li><li><p class="paragraph" style="text-align:left;">Save changes</p></li><li><p class="paragraph" style="text-align:left;">Sync them later</p></li></ul><p class="paragraph" style="text-align:left;">Instead of solving this problem again and again for every feature, the infrastructure team built a <b>shared component for offline data handling</b>.</p><h3 class="heading" style="text-align:left;" id="what-does-this-component-do">What does this component do?</h3><p class="paragraph" style="text-align:left;">It stores user actions when offline, tracks what needs to be synced, and applies changes once connectivity returns.</p><p class="paragraph" style="text-align:left;">So, now any 
feature can:</p><ul><li><p class="paragraph" style="text-align:left;">Plug into this system</p></li><li><p class="paragraph" style="text-align:left;">Define what actions to store</p></li><li><p class="paragraph" style="text-align:left;">Let the system handle syncing</p></li></ul><p class="paragraph" style="text-align:left;">This leads to a much better user experience, less duplicated work, and cleaner code across teams.</p><hr class="content_break"><hr class="content_break"><h2 class="heading" style="text-align:left;" id="why-this-matters">Why This Matters</h2><p class="paragraph" style="text-align:left;">If you step back, you’ll notice a pattern in all these decisions:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Prioritize what matters</b>: Not all data is equal, so fetch the important stuff first.</p></li><li><p class="paragraph" style="text-align:left;"><b>Avoid unnecessary work: </b>Don’t refetch what hasn’t changed.</p></li><li><p class="paragraph" style="text-align:left;"><b>Build reusable systems: </b>Solve problems once, not repeatedly.</p></li><li><p class="paragraph" style="text-align:left;"><b>Expect failures: </b>Design systems that can recover quickly.</p></li><li><p class="paragraph" style="text-align:left;"><b>Continuously improve:</b> Refactor, learn, and adapt.</p></li></ol><p class="paragraph" style="text-align:left;">What makes Slack’s mobile app feel smooth isn’t just good UI. 
It’s the result of:</p><ul><li><p class="paragraph" style="text-align:left;">Smart networking strategies</p></li><li><p class="paragraph" style="text-align:left;">Efficient data handling</p></li><li><p class="paragraph" style="text-align:left;">Thoughtful architecture decisions</p></li><li><p class="paragraph" style="text-align:left;">Continuous iteration</p></li></ul><p class="paragraph" style="text-align:left;">The real takeaway here is simple: <b>building great mobile apps isn’t about handling the best-case scenario; it’s about designing for the worst conditions and still delivering a great experience. </b>And Slack does exactly that.</p><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;">Official blog from Slack: <a class="link" href="https://engineering.salesforce.com/slack-behind-the-scenes-overcoming-key-challenges-to-craft-a-seamless-mobile-app/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=how-slack-makes-its-mobile-app-feel-seamless-even-on-bad-internet" target="_blank" rel="noopener noreferrer nofollow">Slack Behind the Scenes: Overcoming Key Challenges to Craft a Seamless Mobile App</a></p><figcaption class="blockquote__byline"></figcaption></blockquote></div><p class="paragraph" style="text-align:left;">By now, you should have a clear idea of<b> how Slack makes its mobile app feel seamless (even on bad internet). </b>In a nutshell, Slack makes its mobile app fast and reliable by prioritizing critical data, using smart caching with versioning, and handling offline and poor-network scenarios efficiently. It also improves scalability and development speed through better architecture, reusable components, and safe feature rollouts.</p><p class="paragraph" style="text-align:left;"><b>Congratulations! You&#39;ve just advanced another step in your tech journey. 
Keep progressing!</b></p><hr class="content_break"></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=1b7cc629-71f4-4ced-8086-29b6f3c87ba8&utm_medium=post_rss&utm_source=hello_world_system_design_newsletter">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>MSFT says No to Claude and iOS 27 leak!</title>
  <description>MSFT says No to Claude, iOS 27 leak and your private images may not be private</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/6ee02406-a3bd-4e67-8050-9ed4ca1c1845/Pasted_image__30_.png" length="581826" type="image/png"/>
  <link>https://hw.glich.co/p/everything-you-need-to-know-about-claude-opus-4-8</link>
  <guid isPermaLink="true">https://hw.glich.co/p/everything-you-need-to-know-about-claude-opus-4-8</guid>
  <pubDate>Sat, 30 May 2026 04:30:00 +0000</pubDate>
  <atom:published>2026-05-30T04:30:00Z</atom:published>
    <dc:creator>Aniket Rawat</dc:creator>
    <category><![CDATA[News]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;"><a class="link" href="https://r2trck.com/hello-world-datadog-5?utm_medium=newsletter&utm_source=hello-world-r&utm_campaign=dg-content-ebook-DevOpsKubernetes-infra-containers-ww-en-701VY00000F5QHNYA3&utm_content=paid&utm_term=1-1-2026" target="_blank" rel="noopener noreferrer nofollow">15 Kubernetes Metrics Every DevOps Team Should Track</a></p><p class="paragraph" style="text-align:left;">Enhance Your Kubernetes Strategy with These Essential Metrics</p><p class="paragraph" style="text-align:left;">Download our comprehensive eBook on optimizing Kubernetes performance. This guide delves into crucial cluster state, resource, and control plane metrics, highlighting 15 of the most essential metrics your DevOps team should be tracking. Learn how to gain complete visibility into your containerized environments and optimize Kubernetes performance with Datadog.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/00243fda-b63b-4f2d-a8bd-eb5987a0cac8/ad1_15-kubernetes-metrics-ebook_1200x628_241229.png?t=1780077642"/></div><hr class="content_break"><p class="paragraph" style="text-align:left;"><b>Microsoft says NO to Claude! - </b><a class="link" href="https://www.developer-tech.com/news/microsoft-claude-code-github-copilot-cli/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=msft-says-no-to-claude-and-ios-27-leak" target="_blank" rel="noopener noreferrer nofollow">Microsoft is reportedly replacing Claude Code internally with GitHub Copilot CLI</a> as it doubles down on its own AI developer ecosystem. The shift signals rising competition between AI coding assistants and a stronger push toward tightly integrated, enterprise-controlled tooling. 
It also reflects how AI-powered developer workflows are becoming a strategic battleground for major tech companies.</p><p class="paragraph" style="text-align:left;"><b>Hackers Crash Carnival - </b>Carnival has confirmed a massive data breach affecting nearly 6 million people after attackers used social engineering to compromise an employee account. The breach reportedly exposed sensitive customer information, including personal identification data, raising serious identity theft concerns. The incident also highlights how human-targeted attacks remain one of the weakest links in enterprise cybersecurity defenses. <a class="link" href="https://www.securityweek.com/carnival-data-breach-exposed-6-million-people/?utm_source=tldrinfosec" target="_blank" rel="noopener noreferrer nofollow">Read more.</a></p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="everything-you-need-to-know-about-c">Everything you need to know about Claude Opus 4.8</h1><p class="paragraph" style="text-align:left;">Anthropic has officially launched Claude Opus 4.8, the latest version of its flagship AI model. While the update may look incremental on paper, it brings several meaningful improvements for developers, teams, and businesses using AI for coding and large-scale workflows.</p><p class="paragraph" style="text-align:left;">The company says Opus 4.8 is more capable across coding, reasoning, and agentic tasks while remaining available at the same price as Opus 4.7. Alongside the model upgrade, Anthropic also introduced new workflow tools, effort controls, and faster performance options designed to make Claude more practical for real-world engineering work.</p><h2 class="heading" style="text-align:left;" id="better-at-coding-and-collaboration">Better at Coding and Collaboration</h2><p class="paragraph" style="text-align:left;">One of the biggest focuses of Opus 4.8 is reliability. 
Anthropic says the model is now better at handling complex coding tasks and collaborating with developers over long sessions.</p><p class="paragraph" style="text-align:left;">Early testers reported that Opus 4.8 makes fewer unsupported claims and is more likely to acknowledge uncertainty when it is unsure about something. That may sound small, but it addresses one of the biggest frustrations developers have with AI coding assistants: confident mistakes.</p><p class="paragraph" style="text-align:left;">According to Anthropic’s evaluations, Opus 4.8 is around four times less likely than its predecessor to ignore flaws in generated code. This improvement could make the model more trustworthy for enterprise workflows where accuracy matters far more than speed alone.</p><h2 class="heading" style="text-align:left;" id="dynamic-workflows-bring-bigger-auto">Dynamic Workflows Bring Bigger Automation</h2><p class="paragraph" style="text-align:left;">Anthropic also introduced a new research preview feature called Dynamic Workflows for Claude Code.</p><p class="paragraph" style="text-align:left;">This allows Claude to plan large tasks and run hundreds of parallel subagents in a single session. 
The system can verify outputs before returning results, making it useful for massive engineering projects such as code migrations across huge codebases.</p><p class="paragraph" style="text-align:left;">The feature is currently available for Enterprise, Team, and Max plans, and it pushes Claude closer to becoming a fully autonomous engineering assistant instead of just a chatbot that writes snippets of code.</p><h2 class="heading" style="text-align:left;" id="new-effort-controls-for-users">New Effort Controls for Users</h2><p class="paragraph" style="text-align:left;">Another major addition is effort control inside <a class="link" href="https://claude.ai?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=msft-says-no-to-claude-and-ios-27-leak" target="_blank" rel="noopener noreferrer nofollow">claude.ai</a>.</p><p class="paragraph" style="text-align:left;">Users can now decide how much “thinking effort” Claude should put into a task. Lower effort settings generate faster responses while consuming less of the user’s rate limit. Higher effort settings allow the model to spend more tokens and reasoning time for better-quality outputs.</p><p class="paragraph" style="text-align:left;">Anthropic says Opus 4.8 defaults to high effort mode because it provides the best balance between speed and quality. 
However, users working on difficult coding problems or long-running workflows can switch to “extra” or “max” effort modes for deeper reasoning.</p><h2 class="heading" style="text-align:left;" id="faster-and-cheaper-performance">Faster and Cheaper Performance</h2><p class="paragraph" style="text-align:left;">Anthropic also introduced a fast mode for Opus 4.8 that runs at 2.5× the speed while being three times cheaper than previous fast modes.</p><p class="paragraph" style="text-align:left;">For developers and businesses managing high API usage, this could significantly reduce operational costs without sacrificing too much quality.</p><p class="paragraph" style="text-align:left;">Pricing for standard usage remains unchanged at $5 per million input tokens and $25 per million output tokens.</p><p class="paragraph" style="text-align:left;">Anthropic says Opus 4.8 is only a step toward even more advanced systems. The company is already working on “Mythos-class” models under Project Glasswing, which are reportedly powerful enough to require stronger cybersecurity safeguards before public release.</p><p class="paragraph" style="text-align:left;">For now, Opus 4.8 represents a practical upgrade focused less on flashy benchmarks and more on making AI systems more dependable for real software engineering work.</p><hr class="content_break"><p class="paragraph" style="text-align:left;"><b>iOS 27 leak! -</b> New iOS 27 leaks suggest Apple is planning a major Siri redesign with smarter AI features and a fresh visual interface. The update could also introduce a redesigned Camera app, deeper AI integration, and improved customization across the iPhone experience. If true, iOS 27 may become Apple’s biggest software refresh since the introduction of Apple Intelligence. 
<a class="link" href="https://9to5mac.com/2026/05/28/ios-27-leak-reveals-new-siri-design-camera-app-more/?utm_source=tldrdesign" target="_blank" rel="noopener noreferrer nofollow">Read more.</a></p><p class="paragraph" style="text-align:left;"><b>Your Private Images Might Not Be Private:</b> <a class="link" href="https://www.techtimes.com/articles/317291/20260527/gitea-flaw-left-30000-deployments-private-container-images-readable-4-years.htm?utm_source=tldrinfosec" target="_blank" rel="noopener noreferrer nofollow">A critical Gitea vulnerability</a> reportedly exposed private container images from over 30,000 deployments for nearly four years without requiring authentication. The flaw allowed attackers to access sensitive code, API keys, and infrastructure data simply by pulling “private” images from affected servers. Security researchers are urging all Gitea and Forgejo users to immediately upgrade to patched versions before the issue is actively exploited.</p><hr class="content_break"><h3 class="heading" style="text-align:left;" id="buzz-of-the-week">Buzz of the Week:</h3><p class="paragraph" style="text-align:left;"><b>Cognitive Load Budget</b></p><p class="paragraph" style="text-align:left;">Cognitive Load Budget is the idea that every developer has a limited amount of mental energy they can spend while coding, debugging, reviewing PRs, or switching between tools. Modern engineering problems are no longer just about compute limits, they’re about human attention limits. Teams with too many dashboards, alerts, frameworks, meetings, and AI tools often slow down because developers spend more time context-switching than building. Good engineering teams actively reduce cognitive load by simplifying workflows, documentation, architecture, and tooling. 
This concept is becoming increasingly important in the AI era because developers are now managing both human complexity and AI-generated complexity at the same time.</p><hr class="content_break"><h3 class="heading" style="text-align:left;" id="things-that-launched-things-that-we">Things that launched. Things that went viral. Things you&#39;ll pretend to try.</h3><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/9765551e-0037-4459-b1c5-75b0e223908c/image.png?t=1751469165"/></div><p class="paragraph" style="text-align:left;"><b>Hurl</b></p><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/Orange-OpenSource/hurl?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=msft-says-no-to-claude-and-ios-27-leak" target="_blank" rel="noopener noreferrer nofollow">Hurl</a> runs and tests HTTP requests defined in plain text files.<br>It feels like curl mixed with integration testing.</p><p class="paragraph" style="text-align:left;"><b>Vegeta</b></p><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/tsenart/vegeta?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=msft-says-no-to-claude-and-ios-27-leak" target="_blank" rel="noopener noreferrer nofollow">Vegeta</a> is a flexible HTTP load-testing tool written in Go.</p><p class="paragraph" style="text-align:left;"><b>Grit</b></p><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/shub39/Grit?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=msft-says-no-to-claude-and-ios-27-leak" target="_blank" rel="noopener noreferrer nofollow">Grit</a> is an AI-powered code migration and transformation engine.</p><hr class="content_break"><h3 class="heading" 
style="text-align:left;" id="build-braincells-not-just-features"><b>Build Braincells, Not Just Features</b></h3><p class="paragraph" style="text-align:left;">This weekend’s read: <a class="link" href="https://dpereira.substack.com/p/past-present-and-future-of-pms-with?utm_source=tldrproduct" target="_blank" rel="noopener noreferrer nofollow">Past present and future of Product Management.</a></p><p class="paragraph" style="text-align:left;">This week’s watch: <a class="link" href="https://www.youtube.com/watch?v=Af6Lv8M0V0Y&utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=msft-says-no-to-claude-and-ios-27-leak" target="_blank" rel="noopener noreferrer nofollow">I tested what Vape does to your body.</a></p><hr class="content_break"><p class="paragraph" style="text-align:left;">Meanwhile…</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/5e57930b-60f8-4cb5-bbfc-2816df9948ee/image.png?t=1780076000"/></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=49ca7224-dd56-4fca-a308-1fc82a132789&utm_medium=post_rss&utm_source=hello_world_system_design_newsletter">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>What is Site Reliability Engineering?</title>
  <description>Discover what is site reliability engineering and the best practices. Learn SLOs, automation, and more with real-world examples to build resilient systems.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/3424aaf7-dd05-4b8a-aef8-10ad47678f06/article-image-bd77aaaf-98df-49b9-b020-48e7611903f6.jpg" length="63664" type="image/jpeg"/>
  <link>https://hw.glich.co/p/site-reliability-engineering-best-practices</link>
  <guid isPermaLink="true">https://hw.glich.co/p/site-reliability-engineering-best-practices</guid>
  <pubDate>Wed, 27 May 2026 04:30:00 +0000</pubDate>
  <atom:published>2026-05-27T04:30:00Z</atom:published>
    <dc:creator>Rohit Lakhotia</dc:creator>
    <category><![CDATA[Concepts]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><hr class="content_break"><hr class="content_break"><p class="paragraph" style="text-align:left;">In today&#39;s digital landscape, reliability is essential for user trust and business success. As systems become more complex, traditional operations models often fall short, resulting in downtime and user dissatisfaction. Site Reliability Engineering (SRE), developed at Google, addresses these issues by applying a software engineering approach to infrastructure and operations, leading to scalable and reliable software systems.</p><p class="paragraph" style="text-align:left;">This article delves into essential <b>site reliability engineering best practices</b>, offering practical insights and examples from companies like Netflix and Spotify. It covers topics such as setting data-driven Service Level Objectives (SLOs), reducing operational burdens, promoting a culture of learning, and adopting secure deployment strategies.</p><p class="paragraph" style="text-align:left;">Building resilient systems requires integrating security from the start, as outlined in <a class="link" href="https://infrazen.tech/why-you-need-to-understand-secure-by-design-cybersecurity-practices/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-site-reliability-engineering" target="_blank" rel="noopener noreferrer nofollow">Secure by Design cybersecurity practices</a>. This guide offers the knowledge needed to develop, manage, and scale systems capable of handling production challenges.</p><h2 class="heading" style="text-align:left;" id="1-service-level-objectives-sl-os-an">1. Service Level Objectives (SLOs) and Error Budgets</h2><p class="paragraph" style="text-align:left;">One of the foundational site reliability engineering best practices is the establishment of Service Level Objectives (SLOs) and their corresponding error budgets. 
Instead of aiming for an unrealistic 100% reliability, SLOs define a precise, measurable target for a service&#39;s performance, such as 99.9% uptime over 30 days. This practice shifts the conversation from subjective feelings about stability to an objective, data-driven framework.</p><p class="paragraph" style="text-align:left;">The error budget is the mathematical inverse of the SLO (100% - SLO). For a 99.9% SLO, the error budget is 0.1%, which translates to approximately 43 minutes of acceptable downtime or degraded performance per month. This budget becomes a shared currency between development and operations teams. As long as the service operates within its error budget, development teams have the autonomy to release new features and take calculated risks. If the budget is exhausted due to incidents or performance degradation, the focus automatically shifts to reliability improvements, halting non-essential feature releases until the service is stable.</p><h3 class="heading" style="text-align:left;" id="actionable-tips-for-implementation">Actionable Tips for Implementation</h3><p class="paragraph" style="text-align:left;">To adopt SLOs and error budgets effectively, follow these steps:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Focus on Users:</b> Start with user-centric metrics, like latency for key API endpoints. 
For example, an e-commerce site&#39;s SLO might be &quot;99.9% of &#39;add to cart&#39; API calls should succeed in under 500ms.&quot; Avoid internal metrics that don&#39;t impact the user experience directly.</p></li><li><p class="paragraph" style="text-align:left;"><b>Start Conservatively:</b> Initially set achievable SLOs (e.g., 99.5%) and tighten them as your processes improve to avoid discouragement from unachievable goals.</p></li><li><p class="paragraph" style="text-align:left;"><b>Automate Monitoring:</b> Use tools like Prometheus, Grafana, and SLO management platforms such as Nobl9 or Datadog for real-time monitoring and alerting of SLO compliance and error budgets.</p></li></ul><h2 class="heading" style="text-align:left;" id="2-toil-reduction-and-automation">2. Toil Reduction and Automation</h2><p class="paragraph" style="text-align:left;">A key aspect of site reliability engineering involves reducing toil through automation. Toil refers to manual, repetitive work that could be automated and that grows in step with the service, such as restarting services or provisioning servers. By eliminating these tasks, SRE teams can focus on enhancing system reliability and scalability.</p><p class="paragraph" style="text-align:left;">The SRE approach, as practiced at Google, advises engineers to limit toil to no more than 50% of their time, dedicating the rest to engineering work like automation development and system re-architecture. This encourages teams to prioritize long-term engineering solutions over reactive work.</p><h3 class="heading" style="text-align:left;" id="actionable-tips-for-implementation">Actionable Tips for Implementation</h3><p class="paragraph" style="text-align:left;">To effectively decrease toil and automate processes, consider these steps:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Track Toil Precisely:</b> Have team members log time spent on toil using systems like Jira to create a data-driven baseline. 
This helps identify time-consuming tasks and justify automation investments.</p></li><li><p class="paragraph" style="text-align:left;"><b>Apply the &#39;Rule of Three&#39;:</b> Automate any task after doing it three times. Learn during the first attempt, note inefficiencies the second time, and create a robust automation tool by the third.</p></li><li><p class="paragraph" style="text-align:left;"><b>Develop Self-Service Tools:</b> Build a self-service portal with tools like Backstage.io to enable developers to provision resources and run tests independently, reducing interrupt-driven toil.</p></li></ul><h2 class="heading" style="text-align:left;" id="3-blameless-post-mortems-and-learni">3. Blameless Post-Mortems and Learning from Failures</h2><p class="paragraph" style="text-align:left;">A key practice in site reliability engineering is conducting blameless post-mortems. This involves reviewing incidents to understand systemic failures and process gaps, not to assign individual blame. The approach acknowledges that complex systems will fail and views human error as a sign of deeper issues like inadequate training or poor tools. By fostering psychological safety, this method encourages transparency and learning from mistakes. It turns failures into learning opportunities, aiming to identify contributing factors and vulnerabilities without assigning blame, leading to effective and lasting improvements.</p><h3 class="heading" style="text-align:left;" id="real-world-implementation-and-benef">Real-World Implementation and Benefits</h3><p class="paragraph" style="text-align:left;">The blameless post-mortem culture, supported by figures like John Allspaw at <b>Etsy</b>, is key to advanced engineering organizations. Etsy&#39;s public post-mortems enhance customer trust and tech community knowledge. <b>Google</b>&#39;s SRE teams perform detailed post-mortems for major incidents with executive attention to foster systemic improvements. 
<b>GitLab</b> offers transparency by making incident reports and post-mortems publicly accessible. This feedback loop ensures failures inform system design and operations, enhancing service resilience. Learn more about incident handling <a class="link" href="https://hw.glich.co/p/outages-password-leaks-what-else-happened-this-week?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-site-reliability-engineering" target="_blank" rel="noopener noreferrer nofollow">here</a>.</p><h3 class="heading" style="text-align:left;" id="actionable-tips-for-implementation">Actionable Tips for Implementation</h3><p class="paragraph" style="text-align:left;">To embed a blameless culture:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Act Quickly:</b> Conduct post-mortems within 48-72 hours using a template in tools like Confluence or Google Docs, focusing on timeline, impact, and action items.</p></li><li><p class="paragraph" style="text-align:left;"><b>Use the &#39;Five Whys&#39;:</b> Ask &quot;Why?&quot; repeatedly to identify root causes beyond immediate issues.</p></li><li><p class="paragraph" style="text-align:left;"><b>Share Learnings:</b> Distribute post-mortem insights via email, Slack, or sessions, and maintain a central, searchable repository.</p></li></ul><h2 class="heading" style="text-align:left;" id="4-monitoring-observability-and-aler">4. Monitoring, Observability, and Alerting Excellence</h2><p class="paragraph" style="text-align:left;">In site reliability engineering, establishing systems for monitoring, observability, and alerting is essential. Monitoring involves analyzing data to assess system health, while observability allows engineers to deduce internal states from external outputs without new code. 
Intelligent alerting targets user-impacting symptoms, helping teams quickly resolve issues.</p><p class="paragraph" style="text-align:left;">This approach surpasses basic health checks like CPU usage, focusing on the &quot;three pillars of observability&quot;: metrics, logs, and traces. By integrating these, SREs can pinpoint specific issues, such as increased API latency in certain regions, which is vital for managing complex microservices and distributed architectures.</p><h3 class="heading" style="text-align:left;" id="real-world-implementation-and-benef">Real-World Implementation and Benefits</h3><p class="paragraph" style="text-align:left;">Leading tech companies demonstrate the power of this practice. <b>Netflix</b> relies on its Atlas telemetry system and distributed tracing to manage thousands of microservices, ensuring a smooth streaming experience. Similarly, <b>Uber</b> developed its own M3 platform to handle billions of metrics, providing deep insights into its vast, real-time operations. The primary benefit is a drastic reduction in Mean Time To Resolution (MTTR). 
Teams can pinpoint root causes faster, understand the blast radius of an incident, and validate fixes with confidence, all while minimizing noise from non-actionable alerts.</p><h3 class="heading" style="text-align:left;" id="actionable-tips-for-implementation">Actionable Tips for Implementation</h3><p class="paragraph" style="text-align:left;">To achieve excellence, focus on these steps:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Monitor Key Signals:</b> Focus on latency, traffic, errors, and saturation to assess service health from the user&#39;s perspective.</p></li><li><p class="paragraph" style="text-align:left;"><b>Alert on Symptoms:</b> Base alerts on user-facing issues, ensuring they are actionable and protect user experience.</p></li><li><p class="paragraph" style="text-align:left;"><b>Link Runbooks to Alerts:</b> Connect actionable alerts with runbooks detailing investigation and remediation steps to improve response times. For data visualization, explore <a class="link" href="https://hw.glich.co/p/what-is-grafana?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-site-reliability-engineering" target="_blank" rel="noopener noreferrer nofollow">Grafana</a>.</p></li></ul><h2 class="heading" style="text-align:left;" id="5-capacity-planning-and-performance">5. Capacity Planning and Performance Engineering</h2><p class="paragraph" style="text-align:left;">In site reliability engineering, proactive system resource management through capacity planning and performance engineering is essential. This involves predicting future demand to ensure infrastructure can handle loads and optimizing system architecture for efficiency. Instead of reacting to traffic spikes, SRE teams prevent resource exhaustion and maintain a fast user experience as the user base grows. System capacity is viewed as a dynamic variable aligned with business goals. 
Performance engineering ensures software efficiency, reducing hardware needs, lowering costs, and minimizing performance risks under stress.</p><h3 class="heading" style="text-align:left;" id="actionable-tips-for-implementation">Actionable Tips for Implementation</h3><p class="paragraph" style="text-align:left;">To integrate capacity planning and performance engineering effectively, follow these steps:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Maintain Headroom:</b> Keep a 20-30% capacity buffer above your expected peak load to manage traffic spikes and minor failures.</p></li><li><p class="paragraph" style="text-align:left;"><b>Perform Realistic Load Tests:</b> Use tools like k6 or JMeter to create tests that mimic actual user behavior and transaction mixes. Test at peak scale plus extra to detect scaling issues.</p></li><li><p class="paragraph" style="text-align:left;"><b>Forecast Using Data:</b> Combine historical usage, business growth projections, and marketing calendars for demand forecasting. Automate with trend analysis for better accuracy. <a class="link" href="https://hw.glich.co/p/what-is-load-balancing?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-site-reliability-engineering" target="_blank" rel="noopener noreferrer nofollow">Learn about load balancing and capacity planning</a>.</p></li><li><p class="paragraph" style="text-align:left;"><b>Profile Applications Regularly:</b> Use tools like pprof or YourKit in your CI/CD pipeline to identify and optimize resource-heavy functions, reducing capacity needs.</p></li><li><p class="paragraph" style="text-align:left;"><b>Conduct Quarterly Reviews:</b> Hold reviews with stakeholders to ensure alignment and adjust forecasts and budgets as needed.</p></li></ul><h2 class="heading" style="text-align:left;" id="6-incident-management-and-on-call-p">6. 
Incident Management and On-Call Practices</h2><p class="paragraph" style="text-align:left;">A key aspect of site reliability engineering is a structured approach to incident management and on-call duties. This practice formalizes detecting, responding to, resolving, and learning from service interruptions, shifting from chaotic responses to a predictable and sustainable system that prioritizes quick service recovery and engineer well-being.</p><p class="paragraph" style="text-align:left;">Effective incident management sets clear roles and communication channels to reduce Mean Time To Resolution (MTTR). Along with thoughtful on-call practices, it ensures fair distribution of the 24/7 reliability burden, preventing burnout. This focus on both system and engineer health is essential for long-term operational success, as tired engineers are more likely to make errors.</p><h3 class="heading" style="text-align:left;" id="actionable-tips-for-implementation">Actionable Tips for Implementation</h3><p class="paragraph" style="text-align:left;">To establish a strong incident management framework, follow these steps:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Define Roles:</b> Assign clear roles such as Incident Commander, Communications Lead, and Subject Matter Experts. The Incident Commander declares their role and delegates tasks to avoid confusion.</p></li><li><p class="paragraph" style="text-align:left;"><b>Use Dedicated Channels:</b> Utilize tools like PagerDuty or Opsgenie to automatically create specific Slack or Teams channels for incidents, ensuring centralized and efficient communication.</p></li><li><p class="paragraph" style="text-align:left;"><b>On-Call Rotations:</b> Limit on-call shifts to one week to prevent fatigue and adopt a &quot;follow-the-sun&quot; model for global teams to handle responsibilities across time zones. 
Ensure a clear escalation policy is in place.</p></li><li><p class="paragraph" style="text-align:left;"><b>Maintain Runbooks:</b> Document incident patterns and resolution steps in runbooks, including diagnostic queries and mitigation commands.</p></li></ul><h2 class="heading" style="text-align:left;" id="site-reliability-engineering-practi">Site Reliability Engineering Practices Comparison</h2><div style="padding:14px 20px 14px;"><table class="bh__table" width="100%" style="border-collapse:collapse;"><tr class="bh__table_row"><th class="bh__table_header" width="16%"><p class="paragraph" style="text-align:left;">Practice</p></th><th class="bh__table_header" width="16%"><p class="paragraph" style="text-align:left;">Implementation Complexity 🔄</p></th><th class="bh__table_header" width="16%"><p class="paragraph" style="text-align:left;">Resource Requirements ⚡</p></th><th class="bh__table_header" width="16%"><p class="paragraph" style="text-align:left;">Expected Outcomes 📊</p></th><th class="bh__table_header" width="16%"><p class="paragraph" style="text-align:left;">Ideal Use Cases 💡</p></th><th class="bh__table_header" width="16%"><p class="paragraph" style="text-align:left;">Key Advantages ⭐</p></th></tr><tr class="bh__table_row"><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Service Level Objectives (SLOs) and Error Budgets</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Medium to High: requires data collection and policy setup</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Moderate: monitoring tools and cross-team coordination</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Balanced innovation and reliability with clear risk management</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Services requiring clear reliability targets and release 
policies</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Objective metrics, aligned teams, risk-based decisions</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Toil Reduction and Automation</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Medium: requires process analysis and automation development</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Moderate to High: automation tooling and maintenance</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Reduced manual work, faster remediation, improved engineer productivity</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Teams facing high manual repetitive tasks</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Scalable ops, less human error, improved job satisfaction</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Blameless Post-Mortems and Learning</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Low to Medium: structured reviews and documentation needed</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Low: mainly time and collaboration</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Improved incident understanding, reduced blame culture, faster recovery</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Incident response and continuous learning cultures</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Psychological safety, systemic improvements, team trust</p></td></tr><tr 
class="bh__table_row"><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Monitoring, Observability, and Alerting Excellence</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">High: extensive instrumentation and alert tuning needed</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">High: storage and processing of large data volumes</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Proactive issue detection, rapid troubleshooting, data-driven decisions</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Complex distributed systems needing deep visibility</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Early detection, reduced alert fatigue, rich context for debugging</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Capacity Planning and Performance Engineering</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Medium to High: forecasting, testing, and tuning required</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Moderate: tooling for load testing and monitoring</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Optimized resource use, predictable performance, outage prevention</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Systems with variable or growing load demands</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Cost efficiency, outage prevention, performance optimization</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Incident Management and On-Call 
Practices</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Medium: process definition and role assignments required</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Low to Moderate: communication and tooling support</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Faster incident resolution, structured response, reduced engineer burnout</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">24/7 service reliability and rapid incident handling</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Clear roles, fair on-call, improved communication and wellbeing</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Progressive Rollouts and Safe Deployment</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">High: complex deployment infrastructure needed</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Moderate to High: rollout automation and monitoring</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Minimized deployment risks, quick rollback, staged feature releases</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">High-risk deployments, continuous delivery environments</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Reduced blast radius, faster recovery, real-world validation</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Infrastructure as Code and Configuration Management</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Medium to High: learning curve for IaC 
tools and processes</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Moderate: tooling, version control, and validation pipelines</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Reproducible infrastructure, reduced drift, consistent environments</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Teams managing cloud or large-scale infrastructure</p></td><td class="bh__table_cell" width="16%"><p class="paragraph" style="text-align:left;">Faster scaling, disaster recovery, versioned infrastructure</p></td></tr></table></div><hr class="content_break"></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=ca6b1c42-c207-4ba0-acff-c1368accfa79&utm_medium=post_rss&utm_source=hello_world_system_design_newsletter">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How LinkedIn Rebuilt Service Discovery to Scale to Millions of Services</title>
  <description>LinkedIn rebuilt service discovery using Kafka and Observer, enabling scalable, push-based updates with lower latency and higher availability.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/7a3de821-aedc-4b35-a2e6-fdee53a49a5a/image.png" length="55453" type="image/png"/>
  <link>https://hw.glich.co/p/how-linkedin-rebuilt-service-discovery-to-scale-to-millions-of-services</link>
  <guid isPermaLink="true">https://hw.glich.co/p/how-linkedin-rebuilt-service-discovery-to-scale-to-millions-of-services</guid>
  <pubDate>Mon, 25 May 2026 04:30:00 +0000</pubDate>
  <atom:published>2026-05-25T04:30:00Z</atom:published>
    <dc:creator>Rohit Lakhotia</dc:creator>
    <category><![CDATA[Linkedin]]></category>
    <category><![CDATA[System Design]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;"></p><div class="section" style="background-color:#FFFFFF;border-color:#fd5621;border-radius:4px;border-style:solid;border-width:1px;margin:16.0px 16.0px 16.0px 16.0px;padding:16.0px 16.0px 16.0px 16.0px;"><p class="paragraph" style="text-align:left;"><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;font-size:16px;"><i>Welcome to </i></span><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;"><i><b><a class="link" href="https://hw.glich.co/subscribe?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=how-linkedin-rebuilt-service-discovery-to-scale-to-millions-of-services" target="_blank" rel="noopener noreferrer nofollow">Hello World</a></b></i></span><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;font-size:16px;"><i>, where we help software engineers learn the art of building scalable and resilient systems.</i></span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;font-size:16px;"><i>You can also check out: </i></span><span style="background-color:lab(100 0 0);"><b><a class="link" href="https://scaleengineer.com/blog/how-snowflake-reduced-query-time-by-20-without-you-doing-anything?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=how-linkedin-rebuilt-service-discovery-to-scale-to-millions-of-services" target="_blank" rel="noopener noreferrer nofollow">How Snowflake Reduced Query Time by 20% (Without You Doing Anything)</a></b></span></p></div><hr class="content_break"><hr class="content_break"><div class="section" style="background-color:transparent;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">Table of Contents</h2><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="#what-is-service-discovery" rel="noopener noreferrer nofollow">What 
is Service Discovery</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#old-system-zookeeper-based-architec" rel="noopener noreferrer nofollow">Old System: ZooKeeper-Based Architecture</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#problems-with-the-old-system" rel="noopener noreferrer nofollow">Problems with the Old System</a></p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="#1-scalability-issues" rel="noopener noreferrer nofollow">1. Scalability Issues</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#2-compatibility-issues" rel="noopener noreferrer nofollow">2. Compatibility Issues</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#3-extensibility-problems" rel="noopener noreferrer nofollow">3. Extensibility Problems</a></p></li></ul></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#new-system-next-gen-service-discove" rel="noopener noreferrer nofollow">New System: Next-Gen Service Discovery</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#why-this-architecture-works-better" rel="noopener noreferrer nofollow">Why This Architecture Works Better</a></p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="#1-scalability" rel="noopener noreferrer nofollow">1. Scalability</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#2-availability-over-strict-consiste" rel="noopener noreferrer nofollow">2. Availability Over Strict Consistency</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#3-fault-tolerance" rel="noopener noreferrer nofollow">3. 
Fault Tolerance</a></p></li></ul></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#dual-mode" rel="noopener noreferrer nofollow">Dual Mode</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#observability-the-backbone-of-migra" rel="noopener noreferrer nofollow">Observability: The Backbone of Migration</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#metrics" rel="noopener noreferrer nofollow">Metrics</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#migration-dependencies" rel="noopener noreferrer nofollow">Migration Dependencies</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#takeaways" rel="noopener noreferrer nofollow">Takeaways</a></p></li></ul></div><p class="paragraph" style="text-align:left;">When you open LinkedIn and scroll your feed, send a message, or load notifications, a lot more is happening behind the scenes than you might expect. Each action triggers <b>multiple services talking to each other</b>. Now imagine this at LinkedIn scale: tens of thousands of microservices running across global data centers and handling billions of requests daily. The question is, “How do all these services find and talk to each other reliably?” The answer lies in something called <b>service discovery</b>, and LinkedIn recently rebuilt this system from scratch. Let’s look at what it is and how LinkedIn rebuilt it.</p><h2 class="heading" style="text-align:left;" id="what-is-service-discovery">What is Service Discovery</h2><p class="paragraph" style="text-align:left;">You can think of service discovery like a directory. If one service wants to talk to another, it needs to know where that service is running: its IP address and port.</p><p class="paragraph" style="text-align:left;">But in modern systems, services keep scaling up and down, and instances keep changing. 
So hardcoding locations doesn’t work. Instead:</p><ul><li><p class="paragraph" style="text-align:left;">Services register themselves in a central system</p></li><li><p class="paragraph" style="text-align:left;">Other services query this system to find them</p></li></ul><p class="paragraph" style="text-align:left;">This central system is called the <b>control plane</b>.</p><h2 class="heading" style="text-align:left;" id="old-system-zookeeper-based-architec">Old System: ZooKeeper-Based Architecture</h2><p class="paragraph" style="text-align:left;">For years, LinkedIn used <b>Apache ZooKeeper</b> as its service discovery control plane. Here’s how it worked:</p><ul><li><p class="paragraph" style="text-align:left;">Services registered their endpoints in ZooKeeper</p></li><li><p class="paragraph" style="text-align:left;">Clients read this data directly</p></li><li><p class="paragraph" style="text-align:left;">ZooKeeper also performed health checks</p></li></ul><p class="paragraph" style="text-align:left;">This sounds simple, and it worked well initially. But as LinkedIn scaled, cracks started to appear.</p><h2 class="heading" style="text-align:left;" id="problems-with-the-old-system">Problems with the Old System</h2><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/7a3de821-aedc-4b35-a2e6-fdee53a49a5a/image.png?t=1777406087"/></div><h3 class="heading" style="text-align:left;" id="1-scalability-issues">1. Scalability Issues</h3><p class="paragraph" style="text-align:left;">ZooKeeper handled all reads, all writes, and all health checks in one place, which created a bottleneck.</p><p class="paragraph" style="text-align:left;">During large deployments, <b>service data changed frequently</b> and <b>thousands of clients started reading updates</b>, which caused <b>read storms</b> (massive spikes in read requests). 
Since ZooKeeper enforces strong consistency, everything goes through a single queue. So when reads increase:</p><ul><li><p class="paragraph" style="text-align:left;">writes get delayed</p></li><li><p class="paragraph" style="text-align:left;">health checks fail</p></li><li><p class="paragraph" style="text-align:left;">sessions drop</p></li></ul><p class="paragraph" style="text-align:left;">As a result, <b>service instances get removed, capacity drops, and systems become unavailable.</b></p><h3 class="heading" style="text-align:left;" id="2-compatibility-issues">2. Compatibility Issues</h3><p class="paragraph" style="text-align:left;">LinkedIn used a custom format (D2), which didn’t work well with modern systems like <b>gRPC</b> or <b>Envoy</b>, and it was heavily Java-centric.</p><p class="paragraph" style="text-align:left;">This made:</p><ul><li><p class="paragraph" style="text-align:left;">multi-language support difficult</p></li><li><p class="paragraph" style="text-align:left;">onboarding new systems harder</p></li></ul><h3 class="heading" style="text-align:left;" id="3-extensibility-problems">3. Extensibility Problems</h3><p class="paragraph" style="text-align:left;">The architecture lacked an intermediate layer between clients and ZooKeeper. So it was hard to:</p><ul><li><p class="paragraph" style="text-align:left;">add centralized load balancing</p></li><li><p class="paragraph" style="text-align:left;">integrate with Kubernetes</p></li><li><p class="paragraph" style="text-align:left;">evolve the system</p></li></ul><h2 class="heading" style="text-align:left;" id="new-system-next-gen-service-discove">New System: Next-Gen Service Discovery</h2><p class="paragraph" style="text-align:left;">To solve these issues, LinkedIn built a completely new system. 
Instead of one monolithic control plane, they introduced a <b>decoupled architecture</b>.</p><p class="paragraph" style="text-align:left;"><b>Key Components</b></p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Kafka (Write Path): </b>Services now send updates via events to <b>Apache Kafka. </b>These include <b>service registrations</b> and <b>heartbeats.</b></p></li><li><p class="paragraph" style="text-align:left;"><b>Service Discovery Observer (Read Path): </b><br>A new component called <b>Observer</b>:</p><ul><li><p class="paragraph" style="text-align:left;">consumes events from Kafka</p></li><li><p class="paragraph" style="text-align:left;">stores data in memory</p></li><li><p class="paragraph" style="text-align:left;">serves clients</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>gRPC Streams: </b>Clients:</p><ul><li><p class="paragraph" style="text-align:left;">open persistent connections using gRPC</p></li><li><p class="paragraph" style="text-align:left;">receive updates in real time</p></li></ul></li></ol><p class="paragraph" style="text-align:left;">The key shift was moving from a pull-based model (clients fetching data) to a push-based model where the Observer streams updates to clients in real time.</p><h2 class="heading" style="text-align:left;" id="why-this-architecture-works-better">Why this Architecture Works Better</h2><h3 class="heading" style="text-align:left;" id="1-scalability">1. Scalability</h3><p class="paragraph" style="text-align:left;">Observer is horizontally scalable and highly concurrent. Thus one Observer can:</p><ul><li><p class="paragraph" style="text-align:left;">handle <b>40K client connections</b></p></li><li><p class="paragraph" style="text-align:left;">process <b>10K updates/sec</b></p></li></ul><h3 class="heading" style="text-align:left;" id="2-availability-over-strict-consiste">2. 
Availability Over Strict Consistency</h3><p class="paragraph" style="text-align:left;">Where ZooKeeper prioritized strong consistency, the new system favors availability with eventual consistency.</p><p class="paragraph" style="text-align:left;">This means:</p><ul><li><p class="paragraph" style="text-align:left;">small inconsistencies are okay temporarily</p></li><li><p class="paragraph" style="text-align:left;">the system stays responsive</p></li></ul><h3 class="heading" style="text-align:left;" id="3-fault-tolerance">3. Fault Tolerance</h3><p class="paragraph" style="text-align:left;">Even if Kafka is slow or down, the Observer keeps serving its cached data. So services still function with no downtime.</p><h2 class="heading" style="text-align:left;" id="dual-mode">Dual Mode</h2><p class="paragraph" style="text-align:left;">Replacing a core system like service discovery is risky. So LinkedIn didn’t switch everything at once.</p><p class="paragraph" style="text-align:left;">They used <b>Dual Mode (Dual Read + Dual Write)</b>.</p><p class="paragraph" style="text-align:left;"><b>Dual Read</b></p><ul><li><p class="paragraph" style="text-align:left;">clients read from both the old (ZooKeeper) and new systems</p></li><li><p class="paragraph" style="text-align:left;">results are compared in the background</p></li></ul><p class="paragraph" style="text-align:left;"><b>Dual Write</b></p><ul><li><p class="paragraph" style="text-align:left;">services register in both systems</p></li></ul><p class="paragraph" style="text-align:left;">This approach is powerful because it verifies correctness and catches mismatches early, before they can impact production.</p><h2 class="heading" style="text-align:left;" id="observability-the-backbone-of-migra">Observability: The Backbone of Migration</h2><p class="paragraph" style="text-align:left;">To ensure everything worked, LinkedIn added deep monitoring.</p><p class="paragraph" style="text-align:left;">They tracked:</p><ul><li><p class="paragraph" 
style="text-align:left;">connection health</p></li><li><p class="paragraph" style="text-align:left;">latency</p></li><li><p class="paragraph" style="text-align:left;">data consistency</p></li><li><p class="paragraph" style="text-align:left;">system resource usage</p></li></ul><h2 class="heading" style="text-align:left;" id="metrics">Metrics</h2><p class="paragraph" style="text-align:left;">End-to-end propagation latency improved from <b>P50 &lt; 10s and P99 &lt; 30s</b> in the old system to <b>P50 &lt; 1s and P99 &lt; 5s</b> in the new system.</p><p class="paragraph" style="text-align:left;">And that’s some crazy improvement!</p><h2 class="heading" style="text-align:left;" id="migration-dependencies">Migration Dependencies</h2><p class="paragraph" style="text-align:left;">One of the trickiest challenges was migration dependencies: clients needed to move first, but write migration depended on read migration, creating a dependency loop.</p><p class="paragraph" style="text-align:left;"><b>Solution</b></p><p class="paragraph" style="text-align:left;">LinkedIn:</p><ul><li><p class="paragraph" style="text-align:left;">analyzed dependency graphs</p></li><li><p class="paragraph" style="text-align:left;">tracked which services depend on others</p></li><li><p class="paragraph" style="text-align:left;">migrated carefully in phases</p></li></ul><p class="paragraph" style="text-align:left;">They also built tools to detect regressions and monitored which apps still relied on the old system.</p><hr class="content_break"><hr class="content_break"><h2 
class="heading" style="text-align:left;" id="takeaways">Takeaways</h2><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Decouple read and write paths</b><br>Separating Kafka (write path) and Observer (read path) removes bottlenecks and allows each side to scale independently.</p></li><li><p class="paragraph" style="text-align:left;"><b>Prefer push over pull at scale</b><br>Instead of clients repeatedly polling for updates, streaming changes via persistent connections reduces load and improves latency.</p></li><li><p class="paragraph" style="text-align:left;"><b>Prioritize availability over strict consistency (when appropriate)</b><br>In systems like service discovery, it’s more important to stay responsive than perfectly consistent at every moment.</p></li><li><p class="paragraph" style="text-align:left;"><b>Design for multi-language ecosystems</b><br>Using standards like gRPC and xDS makes the system compatible across different languages and frameworks.</p></li><li><p class="paragraph" style="text-align:left;"><b>Migrate critical systems gradually</b><br>Techniques like dual read and dual write help validate the new system safely without risking production stability.</p></li></ol><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;">Official blog from LinkedIn: <a class="link" href="https://www.linkedin.com/blog/engineering/infrastructure/scalable-multi-language-service-discovery-at-linkedin?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=how-linkedin-rebuilt-service-discovery-to-scale-to-millions-of-services" target="_blank" rel="noopener noreferrer nofollow">Scalable, multi-language service discovery at LinkedIn</a></p><figcaption class="blockquote__byline"></figcaption></blockquote></div><p class="paragraph" style="text-align:left;">By now, you should have a clear idea of <b>how LinkedIn rebuilt service discovery to scale to millions of services.</b>
In a nutshell, LinkedIn replaced its ZooKeeper-based service discovery with a Kafka + Observer system to handle massive scale more reliably. By shifting to a push-based, scalable architecture, it improved latency, availability, and multi-language support.</p><p class="paragraph" style="text-align:left;"><b>Congratulations! You&#39;ve just advanced another step in your tech journey. Keep progressing!</b></p><hr class="content_break"></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=6ec4364d-9316-4ea0-981b-1e37525b30bb&utm_medium=post_rss&utm_source=hello_world_system_design_newsletter">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>GitHub&#39;s worst leak, Meta&#39;s new Reddit and everything you need to know about Google I/O 2026!</title>
  <description>GitHub&#39;s worst leak, Meta&#39;s new Reddit and everything you need to know about Google I/O!</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/ddef2778-3b4d-4bc9-8ebf-e5e4faf0568f/Pasted_image__29_.png" length="791125" type="image/png"/>
  <link>https://hw.glich.co/p/google-io-2026-everything-you-need-to-know</link>
  <guid isPermaLink="true">https://hw.glich.co/p/google-io-2026-everything-you-need-to-know</guid>
  <pubDate>Sat, 23 May 2026 04:30:00 +0000</pubDate>
  <atom:published>2026-05-23T04:30:00Z</atom:published>
    <dc:creator>Aniket Rawat</dc:creator>
    <category><![CDATA[News]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><hr class="content_break"><p class="paragraph" style="text-align:left;"><b>GitHub’s worst leak ever! - </b>GitHub has confirmed a major breach affecting nearly 3,800 internal repositories after a poisoned VS Code extension compromised an employee device. The attack, linked to the TeamPCP hacking group, highlights the growing danger of software supply-chain attacks targeting developer tools and open-source ecosystems. <a class="link" href="https://thehackernews.com/2026/05/github-internal-repositories-breached.html?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=github-s-worst-leak-meta-s-new-reddit-and-everything-you-need-to-know-about-google-i-o-2026" target="_blank" rel="noopener noreferrer nofollow">Read full story.</a></p><p class="paragraph" style="text-align:left;"><b>Meta’s new Reddit - </b><a class="link" href="https://www.engadget.com/2179165/meta-forum-groups-app/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=github-s-worst-leak-meta-s-new-reddit-and-everything-you-need-to-know-about-google-i-o-2026" target="_blank" rel="noopener noreferrer nofollow">Meta has quietly launched a new app called Forum,</a> a standalone experience built around Facebook Groups that feels heavily inspired by Reddit. The app focuses on community discussions, AI-powered answers, and interest-based feeds while separating group conversations from the clutter of the main Facebook app. It also includes an “Ask” AI feature that pulls insights from multiple groups, showing Meta’s growing push toward AI-assisted social discovery.</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="google-io-2026-was-not-about-ai-fea">Google I/O 2026 Was Not About AI Features. It Was About AI Taking Over the Interface</h1><p class="paragraph" style="text-align:left;">Google I/O 2026 made one thing very clear. Gemini is no longer just an AI chatbot sitting inside an app. 
Google is turning it into the operating layer across Search, Android, Workspace, YouTube, and even wearable devices.</p><p class="paragraph" style="text-align:left;">This year’s announcements were not focused on one breakthrough model. Instead, Google introduced an entire ecosystem where AI quietly works in the background, takes actions for users, and becomes deeply integrated into everyday digital workflows.</p><h2 class="heading" style="text-align:left;" id="gemini-is-becoming-an-active-agent">Gemini Is Becoming an Active Agent</h2><p class="paragraph" style="text-align:left;">The biggest shift came with Gemini Spark, which Google described as a “personal agent” capable of performing tasks on behalf of users. Unlike traditional assistants that only respond to prompts, Spark is designed to actively manage parts of your digital life.</p><p class="paragraph" style="text-align:left;">It can interact with Gmail, Docs, Calendar, and Tasks while also connecting to third-party tools through MCP integrations later this year. Google is clearly moving toward a future where AI does more than answer questions. It schedules, organizes, monitors, and executes tasks with minimal input.</p><p class="paragraph" style="text-align:left;">The new Daily Brief feature reinforces this direction. By scanning emails, calendars, and task lists, Gemini creates a personalized summary of your day and even suggests next steps automatically.</p><h2 class="heading" style="text-align:left;" id="search-is-turning-into-a-continuous">Search Is Turning Into a Continuous Experience</h2><p class="paragraph" style="text-align:left;">Google Search also received one of its biggest transformations in years. AI Mode is now powered by Gemini 3.5 Flash, which promises faster reasoning and better multimodal understanding.</p><p class="paragraph" style="text-align:left;">The traditional search bar is evolving into a conversational interface that grows dynamically as users type longer questions. 
More importantly, Google introduced information agents that continuously monitor topics across the web.</p><p class="paragraph" style="text-align:left;">Instead of searching repeatedly for updates, users can now let AI track news, finance, shopping trends, or sports in the background and surface relevant changes automatically.</p><p class="paragraph" style="text-align:left;">This signals a major shift from reactive search to persistent AI-driven discovery.</p><h2 class="heading" style="text-align:left;" id="gemini-models-are-getting-faster-an">Gemini Models Are Getting Faster and More Creative</h2><p class="paragraph" style="text-align:left;">Google also unveiled Gemini 3.5 Flash, which combines strong coding and reasoning performance with faster response speeds. According to Google, it delivers output four times faster than competing frontier models in some scenarios.</p><p class="paragraph" style="text-align:left;">Meanwhile, Gemini Omni expands AI generation beyond text. The new model accepts image, audio, video, and text inputs while generating editable video outputs grounded in real-world knowledge.</p><p class="paragraph" style="text-align:left;">This technology is already being integrated into YouTube Shorts, Google Flow, and creative tools inside the Gemini ecosystem.</p><h2 class="heading" style="text-align:left;" id="android-xr-and-ai-wearables-are-rea">Android XR and AI Wearables Are Real</h2><p class="paragraph" style="text-align:left;">Another standout announcement was Android XR eyewear. Google confirmed that its first AI-powered audio glasses will launch later this year in partnership with Samsung, Qualcomm, Gentle Monster, and Warby Parker.</p><p class="paragraph" style="text-align:left;">These glasses are positioned as “intelligent eyewear” and will even support iPhones. 
Combined with Android Halo, which shows live agent activity at the top of the screen, Google is preparing users for ambient AI experiences that exist beyond phones and laptops.</p><p class="paragraph" style="text-align:left;">Google I/O 2026 showed that the company is no longer treating AI as a separate product category. Gemini is becoming the connective layer across the entire Google ecosystem.</p><hr class="content_break"><p class="paragraph" style="text-align:left;"><b>You cannot search “disregard” in Google now -</b> Google Search users discovered a strange AI bug where typing words like “disregard,” “ignore,” or “skip” caused Search to respond like a chatbot instead of showing normal results. The issue appears tied to Google’s AI Overviews system, which seems to misinterpret certain commands as prompt instructions. <a class="link" href="https://techcrunch.com/2026/05/22/you-can-no-longer-google-the-word-disregard/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=github-s-worst-leak-meta-s-new-reddit-and-everything-you-need-to-know-about-google-i-o-2026" target="_blank" rel="noopener noreferrer nofollow">Read more.</a></p><p class="paragraph" style="text-align:left;"><b>NotebookLM Rival Is Here ft. Spotify:</b> <a class="link" href="https://techcrunch.com/2026/05/21/spotify-debuts-a-new-desktop-app-for-creating-personal-podcasts/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=github-s-worst-leak-meta-s-new-reddit-and-everything-you-need-to-know-about-google-i-o-2026" target="_blank" rel="noopener noreferrer nofollow">Spotify is expanding deeper into AI-generated audio with a new desktop app that lets users create personalized podcasts using prompts, documents, calendars, and web content.</a> The tool, called Studio by Spotify Labs, works similarly to Google’s NotebookLM and generates private AI podcasts synced across your Spotify account. 
Spotify says the app is still an early research preview and may occasionally produce inaccurate results.</p><hr class="content_break"><h3 class="heading" style="text-align:left;" id="buzz-of-the-week">Buzz of the Week:</h3><p class="paragraph" style="text-align:left;"><b>Capability Sandboxing</b></p><p class="paragraph" style="text-align:left;"><b>Capability Sandboxing</b> is a security architecture approach where applications, AI agents, or processes receive extremely limited and specific permissions instead of broad system-wide access. Rather than giving software full control, it only gets small “capabilities” such as reading one file, accessing one API endpoint, or using a single device feature. This model reduces the damage caused by compromised apps, prompt injection attacks, or malicious plugins. It is becoming increasingly important in AI agents, browser engines, WebAssembly runtimes, containers, and cloud-native systems where autonomous tools can execute actions on behalf of users.</p><hr class="content_break"><h3 class="heading" style="text-align:left;" id="things-that-launched-things-that-we">Things that launched. Things that went viral. 
Things you&#39;ll pretend to try.</h3><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/9765551e-0037-4459-b1c5-75b0e223908c/image.png?t=1751469165"/></div><p class="paragraph" style="text-align:left;"><b>Rnr</b></p><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/ismaelgv/rnr?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=github-s-worst-leak-meta-s-new-reddit-and-everything-you-need-to-know-about-google-i-o-2026" target="_blank" rel="noopener noreferrer nofollow">Rnr</a> is a fast Rust-powered batch rename utility with regex support.</p><p class="paragraph" style="text-align:left;"><b>Deadnix</b></p><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/astro/deadnix?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=github-s-worst-leak-meta-s-new-reddit-and-everything-you-need-to-know-about-google-i-o-2026" target="_blank" rel="noopener noreferrer nofollow">Deadnix</a> detects unused variables and dead code in Nix files.</p><p class="paragraph" style="text-align:left;"><b>Tparse</b></p><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/mfridman/tparse?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=github-s-worst-leak-meta-s-new-reddit-and-everything-you-need-to-know-about-google-i-o-2026" target="_blank" rel="noopener noreferrer nofollow">Tparse</a> converts noisy test outputs into clean summaries and analytics dashboards.</p><hr class="content_break"><h3 class="heading" style="text-align:left;" id="build-braincells-not-just-features"><b>Build Braincells, Not Just Features</b></h3><p class="paragraph" style="text-align:left;">This weekend’s read: <a class="link" 
href="https://techcrunch.com/2026/05/20/openai-claims-it-solved-an-80-year-old-math-problem-for-real-this-time/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=github-s-worst-leak-meta-s-new-reddit-and-everything-you-need-to-know-about-google-i-o-2026" target="_blank" rel="noopener noreferrer nofollow">OpenAI claims it solved an 80-year-old math problem — for real this time</a>.</p><p class="paragraph" style="text-align:left;">This week’s watch: <a class="link" href="https://youtu.be/IT0Aao0LJrw?si=-Hrjr0RAPHUpfmUb&utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=github-s-worst-leak-meta-s-new-reddit-and-everything-you-need-to-know-about-google-i-o-2026" target="_blank" rel="noopener noreferrer nofollow">I Investigated The Country Where it&#39;s Illegal To Be Fat.</a></p><hr class="content_break"><p class="paragraph" style="text-align:left;">Meanwhile…</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/049937db-da23-4e7a-9f81-19e2e148abf3/image.png?t=1779472249"/></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=c1fb65e5-1eea-4b71-b870-77bff15e9537&utm_medium=post_rss&utm_source=hello_world_system_design_newsletter">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>What is Cache Invalidation?</title>
  <description>Discover what cache invalidation is and which strategies boost performance and keep data consistent. Learn to implement the right approach for your system.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/3424aaf7-dd05-4b8a-aef8-10ad47678f06/article-image-bd77aaaf-98df-49b9-b020-48e7611903f6.jpg" length="63664" type="image/jpeg"/>
  <link>https://hw.glich.co/p/cache-invalidation-strategies</link>
  <guid isPermaLink="true">https://hw.glich.co/p/cache-invalidation-strategies</guid>
  <pubDate>Wed, 20 May 2026 04:30:00 +0000</pubDate>
  <atom:published>2026-05-20T04:30:00Z</atom:published>
    <dc:creator>Rohit Lakhotia</dc:creator>
    <category><![CDATA[Concepts]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><hr class="content_break"><hr class="content_break"><p class="paragraph" style="text-align:left;">A cache invalidation strategy is a set of rules that decides when old, stored data gets thrown out. Think of it as the bouncer for your cache, ensuring only fresh, relevant information is served to your users. Getting this right is critical for keeping your application’s data consistent between the cache and your main database, which directly impacts performance and reliability.</p><h2 class="heading" style="text-align:left;" id="why-cache-invalidation-is-tricky">Why Cache Invalidation is Tricky</h2><p class="paragraph" style="text-align:left;">Imagine your cache is an assistant that memorizes answers to common questions to reply instantly, saving a slow trip to the database. A user asks for their profile info? The cache provides it immediately.</p><p class="paragraph" style="text-align:left;">But what happens when that user updates their profile picture? The cache, unaware of the change, will keep serving the old photo. This is called serving <b>stale data</b>, and it can cause anything from user confusion to showing the wrong price for a product. The process of telling the cache to forget the old info and grab the new version is the essence of cache invalidation.</p><p class="paragraph" style="text-align:left;">The core challenge is balancing speed and accuracy. Do you invalidate data the instant it changes? This guarantees accuracy but adds complexity. Or do you let it expire after a set time? 
This is simpler but means knowingly serving stale data for a period.</p><p class="paragraph" style="text-align:left;">This balancing act is why computer scientist Phil Karlton famously said, &quot;There are only two hard things in Computer Science: cache invalidation and naming things.&quot; Mastering this skill can seriously <a class="link" href="https://www.shorepod.com/post/boost-your-team-increase-developer-productivity-today?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-cache-invalidation" target="_blank" rel="noopener noreferrer nofollow">increase developer productivity</a> by preventing bugs and simplifying system maintenance.</p><h2 class="heading" style="text-align:left;" id="exploring-common-cache-invalidation">Exploring Common Cache Invalidation Strategies</h2><p class="paragraph" style="text-align:left;">Picking the right <b>cache invalidation strategy</b> comes down to your application&#39;s needs. Each one offers a different trade-off between data freshness, performance, and complexity.</p><h3 class="heading" style="text-align:left;" id="time-to-live-ttl-the-set-it-and-for">Time-To-Live (TTL): The &quot;Set It and Forget It&quot; Approach</h3><p class="paragraph" style="text-align:left;">TTL is the simplest strategy. You put a &quot;best before&quot; date on data when you add it to the cache. Once the timer expires, the data is removed. The next request must go back to the database for a fresh copy, which then gets a new timer.</p><ul><li><p class="paragraph" style="text-align:left;"><b>Practical Example:</b> A news website can cache an article for 10 minutes. This is simple and greatly reduces database load, as the content doesn&#39;t change frequently.</p></li><li><p class="paragraph" style="text-align:left;"><b>Actionable Insight:</b> Use TTL for data that is not time-sensitive. 
A long TTL (hours or days) works for static assets, while a shorter TTL (minutes) is better for content that updates periodically.</p></li></ul><p class="paragraph" style="text-align:left;">The main drawback is that you knowingly serve stale data until the TTL expires.</p><h3 class="heading" style="text-align:left;" id="write-through-caching-for-guarantee">Write-Through Caching: For Guaranteed Consistency</h3><p class="paragraph" style="text-align:left;">With this approach, every write operation goes to <i>both</i> the cache and the primary database simultaneously. The operation isn&#39;t complete until both writes are confirmed.</p><ul><li><p class="paragraph" style="text-align:left;"><b>Practical Example:</b> An e-commerce inventory system. When a product is sold, the count must be updated in both the cache and database immediately to prevent overselling.</p></li><li><p class="paragraph" style="text-align:left;"><b>Actionable Insight:</b> Choose write-through when data consistency is non-negotiable. This eliminates stale data at the cost of higher write latency, as you have to wait for two systems to confirm the write.</p></li></ul><h3 class="heading" style="text-align:left;" id="write-back-caching-write-behind-for">Write-Back Caching (Write-Behind): For High-Speed Writes</h3><p class="paragraph" style="text-align:left;">Write-back caching prioritizes write performance. Data is written only to the fast in-memory cache, and the write to the slower database happens later, either after a delay or in batches.</p><ul><li><p class="paragraph" style="text-align:left;"><b>Practical Example:</b> A system logging user activity. Writes are frequent and speed is critical. The system can buffer logs in the cache and write them to the database every few seconds.</p></li><li><p class="paragraph" style="text-align:left;"><b>Actionable Insight:</b> Use write-back for write-heavy applications where immediate data persistence isn&#39;t required. 
Be aware of the risk: if the cache server crashes before data is written to the database, those updates are lost.</p></li></ul><h3 class="heading" style="text-align:left;" id="event-driven-invalidation-for-real-">Event-Driven Invalidation: For Real-Time Updates</h3><p class="paragraph" style="text-align:left;">Also known as manual invalidation, this is the most precise strategy. When data changes in the database, your application logic sends a specific command to purge that exact entry from the cache.</p><ul><li><p class="paragraph" style="text-align:left;"><b>Practical Example:</b> A live sports score feed. When a team scores, a &quot;score_updated&quot; event is published. A listener service catches this event and immediately invalidates the cached score, ensuring all users see the new score instantly.</p></li><li><p class="paragraph" style="text-align:left;"><b>Actionable Insight:</b> Implement this using a publish/subscribe (pub/sub) model with tools like Redis Pub/Sub or Kafka. This decouples your services and provides precise control but adds significant architectural complexity.</p></li></ul><h2 class="heading" style="text-align:left;" id="how-to-choose-the-right-strategy">How to Choose the Right Strategy</h2><p class="paragraph" style="text-align:left;">To pick the best strategy, answer three key questions about your data.</p><h3 class="heading" style="text-align:left;" id="1-how-often-does-your-data-change">1. How Often Does Your Data Change?</h3><ul><li><p class="paragraph" style="text-align:left;">Low Frequency (e.g., Blog Posts): Time-To-Live (TTL) is ideal. A long TTL (e.g., 24 hours) reduces database load with minimal risk of stale content.</p></li><li><p class="paragraph" style="text-align:left;">High Frequency (e.g., Social Media Feeds): An event-driven approach is better. It provides precise control to invalidate content the moment it updates.</p></li></ul><h3 class="heading" style="text-align:left;" id="2-how-critical-is-data-freshness">2. 
How Critical Is Data Freshness?</h3><ul><li><p class="paragraph" style="text-align:left;">Low Criticality (e.g., User Profile Avatar): TTL is acceptable. A user seeing an old avatar for a few minutes is a minor issue.</p></li><li><p class="paragraph" style="text-align:left;">High Criticality (e.g., Bank Account Balance): Write-through is necessary. It guarantees the cache and database are always in sync, which is essential for financial data.</p></li></ul><h3 class="heading" style="text-align:left;" id="3-is-your-application-read-heavy-or">3. Is Your Application Read-Heavy or Write-Heavy?</h3><ul><li><p class="paragraph" style="text-align:left;">Read-Heavy (e.g., Content Websites): Strategies that speed up reads, like TTL, are effective. The same article is read thousands of times for every one update.</p></li><li><p class="paragraph" style="text-align:left;">Write-Heavy (e.g., Analytics Platforms): Write-back shines here. It makes writes feel instantaneous by buffering them in the cache, improving performance for write-intensive operations.</p></li></ul><p class="paragraph" style="text-align:left;">Understanding these trade-offs is fundamental to building high-<a class="link" href="https://hw.glich.co/p/performance-and-scalability-in-web-applications?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-cache-invalidation" target="_blank" rel="noopener noreferrer nofollow">performance and scalability in web applications</a>.</p><h2 class="heading" style="text-align:left;" id="common-questions-about-cache-invali">Common Questions About Cache Invalidation</h2><h3 class="heading" style="text-align:left;" id="whats-the-difference-between-invali">What&#39;s the difference between invalidation and eviction?</h3><p class="paragraph" style="text-align:left;"><b>Invalidation</b> is a deliberate action to remove data known to be stale. 
<b>Eviction</b> is an automatic process where the cache removes data (often the least recently used) to make space when it&#39;s full. Invalidation is about accuracy; eviction is about capacity.</p><h3 class="heading" style="text-align:left;" id="what-is-a-cache-stampede-and-how-do">What is a cache stampede and how do I prevent it?</h3><p class="paragraph" style="text-align:left;">A cache stampede happens when a popular cached item expires, causing a flood of concurrent requests to hit your database at once.</p><ul><li><p class="paragraph" style="text-align:left;"><b>Prevention Tactic:</b> Use a &quot;stale-while-revalidate&quot; approach. When an item expires, serve the stale data to users while a single background process fetches the fresh version. This shields the database from the stampede.</p></li></ul><h3 class="heading" style="text-align:left;" id="can-i-combine-different-strategies">Can I combine different strategies?</h3><p class="paragraph" style="text-align:left;">Yes, and you should for complex applications. A common pattern is to use <b>TTL</b> as a default and layer <b>event-driven invalidation</b> for critical updates. For example, cache a user&#39;s profile with a 24-hour TTL but use an event to purge it immediately if they change their password. 
This hybrid model balances performance and consistency, a challenge seen in systems like those discussed in <a class="link" href="https://hw.glich.co/p/how-uber-handles-40-million-reads-per-second-using-an-integrated-cache?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-cache-invalidation" target="_blank" rel="noopener noreferrer nofollow">how Uber handles 40 million reads per second using an integrated cache</a>.</p><hr class="content_break"></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=67548b41-0fc0-4219-8b19-5d7af0062faf&utm_medium=post_rss&utm_source=hello_world_system_design_newsletter">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How Snowflake Reduced Query Time by 20% (Without You Doing Anything)</title>
  <description>Snowflake reduces query time by 20% via continuous engine optimizations, improving real workloads automatically without user changes.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/c08423d7-b50b-439a-800f-f5bea976f1dd/image.png" length="86027" type="image/png"/>
  <link>https://hw.glich.co/p/how-snowflake-reduced-query-time-by-20-without-you-doing-anything</link>
  <guid isPermaLink="true">https://hw.glich.co/p/how-snowflake-reduced-query-time-by-20-without-you-doing-anything</guid>
  <pubDate>Mon, 18 May 2026 04:30:00 +0000</pubDate>
  <atom:published>2026-05-18T04:30:00Z</atom:published>
    <dc:creator>Rohit Lakhotia</dc:creator>
    <category><![CDATA[Snowflake]]></category>
    <category><![CDATA[System Design]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;"></p><div class="section" style="background-color:#FFFFFF;border-color:#fd5621;border-radius:4px;border-style:solid;border-width:1px;margin:16.0px 16.0px 16.0px 16.0px;padding:16.0px 16.0px 16.0px 16.0px;"><p class="paragraph" style="text-align:left;"><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;font-size:16px;"><i>Welcome to </i></span><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;"><i><b><a class="link" href="https://hw.glich.co/subscribe?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=how-snowflake-reduced-query-time-by-20-without-you-doing-anything" target="_blank" rel="noopener noreferrer nofollow">Hello World</a></b></i></span><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;font-size:16px;"><i>, where we help software engineers learn the art of building scalable and resilient systems.</i></span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;font-size:16px;"><i>You can also check out: </i></span><b><a class="link" href="https://scaleengineer.com/blog/how-github-uses-codeql-to-secure-code-at-scale?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=how-snowflake-reduced-query-time-by-20-without-you-doing-anything" target="_blank" rel="noopener noreferrer nofollow">How GitHub Uses CodeQL to Secure Code at Scale</a></b></p></div><hr class="content_break"><hr class="content_break"><div class="section" 
style="background-color:transparent;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">Table of Contents</h2><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="#what-is-the-snowflake-performance-i" rel="noopener noreferrer nofollow">What is the Snowflake Performance Index (SPI)?</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#what-do-the-numbers-say" rel="noopener noreferrer nofollow">What do the Numbers Say?</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#so-how-snowflake-did-it" rel="noopener noreferrer nofollow">So, How Did Snowflake Do It?</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#what-actually-improved-under-the-ho" rel="noopener noreferrer nofollow">What Actually Improved Under the Hood?</a></p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="#1-faster-and-smarter-query-compilat" rel="noopener noreferrer nofollow">1. Faster and Smarter Query Compilation</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#2-improved-materialized-view-perfor" rel="noopener noreferrer nofollow">2. Improved Materialized View Performance</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#3-better-performance-on-non-cluster" rel="noopener noreferrer nofollow">3. Better Performance on Non-Clustered Data</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#4-smarter-search-optimization" rel="noopener noreferrer nofollow">4. Smarter Search Optimization</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#5-faster-metadata-operations-show-c" rel="noopener noreferrer nofollow">5. 
Faster Metadata Operations (SHOW Commands)</a></p></li></ul></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#the-bigger-picture" rel="noopener noreferrer nofollow">The Bigger Picture</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#how-snowflake-measures-this-fairly" rel="noopener noreferrer nofollow">How Snowflake Measures this Fairly</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#takeaways" rel="noopener noreferrer nofollow">Takeaways</a></p></li></ul></div><p class="paragraph" style="text-align:left;">Imagine running the same query today and then running it again a few months later. Same SQL, same data, but this time it runs <b>20% faster</b>. You didn’t optimize it and you didn’t rewrite anything; it just got faster. This is what Snowflake has been achieving, and it tracks the improvement with a metric called the <b>Snowflake Performance Index (SPI)</b>. Let’s look at how Snowflake did it.</p><p class="paragraph" style="text-align:left;">If you’ve read our earlier blog on how Snowflake improved overall performance by 27%, this is a continuation of that story. That blog explored <i><b>what kinds of optimizations Snowflake made</b></i>. This one focuses on <i><b>how those improvements translate into real-world query performance over time</b></i>, measured using the Snowflake Performance Index (SPI). 
</p><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;">In case you missed the blog: <a class="link" href="https://scaleengineer.com/blog/how-snowflake-improved-performance-by-27-without-users-noticing?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=how-snowflake-reduced-query-time-by-20-without-you-doing-anything" target="_blank" rel="noopener noreferrer nofollow"><b>How Snowflake Improved Performance by 27% (Without Users Noticing)</b></a></p><figcaption class="blockquote__byline"></figcaption></blockquote></div><h2 class="heading" style="text-align:left;" id="what-is-the-snowflake-performance-i">What is the Snowflake Performance Index (SPI)?</h2><p class="paragraph" style="text-align:left;">The Snowflake Performance Index (SPI) is a metric designed to measure how much faster Snowflake is getting over time for real customer workloads.</p><p class="paragraph" style="text-align:left;">Now, this is important because most companies measure performance using:</p><ul><li><p class="paragraph" style="text-align:left;">synthetic benchmarks</p></li><li><p class="paragraph" style="text-align:left;">controlled test environments</p></li></ul><p class="paragraph" style="text-align:left;">But Snowflake does something different. Instead of testing in ideal conditions, SPI tracks <b>real queries running on real production workloads</b>. That makes the measurement more meaningful, because real-world workloads are unpredictable, complex and constantly changing. 
So if performance improves there, it actually means something.</p><p class="paragraph" style="text-align:left;">SPI focuses specifically on:</p><ul><li><p class="paragraph" style="text-align:left;"><b>stable workloads</b> (same type of queries over time)</p></li><li><p class="paragraph" style="text-align:left;">similar data volume</p></li><li><p class="paragraph" style="text-align:left;">consistent query patterns</p></li></ul><p class="paragraph" style="text-align:left;">This allows Snowflake to compare performance fairly across time.</p><h2 class="heading" style="text-align:left;" id="what-do-the-numbers-say">What do the Numbers Say?</h2><p class="paragraph" style="text-align:left;">Based on Snowflake’s internal data:</p><ul><li><p class="paragraph" style="text-align:left;">Query duration improved <b>11% in one year</b></p></li><li><p class="paragraph" style="text-align:left;">Up to <b>20% improvement since SPI tracking began</b></p></li></ul><p class="paragraph" style="text-align:left;">This isn’t a one-time spike. It reflects continuous, steady improvement in how queries execute over time.</p><h2 class="heading" style="text-align:left;" id="so-how-snowflake-did-it">So, How Did Snowflake Do It?</h2><p class="paragraph" style="text-align:left;">One of the most important ideas here is Snowflake’s approach to performance. They don’t rely on big upgrades, migrations or manual tuning. Instead, they focus on continuously improving the <b>core database engine</b>. These improvements are shipped through <b>weekly releases</b>.</p><p class="paragraph" style="text-align:left;">For users, this means small improvements are deployed every week and applied automatically to your workloads. So over time your queries just become faster. 
No action is required on the user’s end.</p><h2 class="heading" style="text-align:left;" id="what-actually-improved-under-the-ho">What Actually Improved Under the Hood?</h2><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/c08423d7-b50b-439a-800f-f5bea976f1dd/image.png?t=1777400820"/></div><h3 class="heading" style="text-align:left;" id="1-faster-and-smarter-query-compilat">1. Faster and Smarter Query Compilation</h3><p class="paragraph" style="text-align:left;">Before any query runs, Snowflake first compiles it. This step decides <b>how the query will execute</b> and <b>how data will be processed</b>. Snowflake improved this phase in multiple ways.</p><p class="paragraph" style="text-align:left;"><b>What changed?</b></p><ul><li><p class="paragraph" style="text-align:left;">It skips optimization steps when they’re not needed</p></li><li><p class="paragraph" style="text-align:left;">It evaluates SQL expressions more efficiently</p></li><li><p class="paragraph" style="text-align:left;">It improves compilation for queries using materialized views</p></li></ul><p class="paragraph" style="text-align:left;">If compilation is faster, <b>queries start executing sooner</b> and <b>overall latency drops</b>. It’s like reducing the “thinking time” before doing the actual work.</p><h3 class="heading" style="text-align:left;" id="2-improved-materialized-view-perfor">2. Improved Materialized View Performance</h3><p class="paragraph" style="text-align:left;">Materialized views are precomputed results that help speed up queries. But maintaining them can be expensive. 
Snowflake improved how efficiently they are updated and how queries interact with them.</p><p class="paragraph" style="text-align:left;"><b>Result</b></p><ul><li><p class="paragraph" style="text-align:left;">faster query performance</p></li><li><p class="paragraph" style="text-align:left;">reduced compute overhead</p></li></ul><p class="paragraph" style="text-align:left;">So you get the benefit of precomputed data without paying a high maintenance cost.</p><h3 class="heading" style="text-align:left;" id="3-better-performance-on-non-cluster">3. Better Performance on Non-Clustered Data</h3><p class="paragraph" style="text-align:left;">In an ideal world, data is perfectly organized, but in reality most data is not clustered optimally. Queries on such data used to be slower. Snowflake has now improved how queries run on <b>non-clustered tables</b>.</p><p class="paragraph" style="text-align:left;">This matters because most real-world datasets are <b>messy</b> and <b>don’t follow a perfect structure</b>. Improving performance here means <b>better performance for real use cases, not just ideal ones</b>.</p><h3 class="heading" style="text-align:left;" id="4-smarter-search-optimization">4. Smarter Search Optimization</h3><p class="paragraph" style="text-align:left;">Snowflake introduced improved search optimization techniques. Instead of scanning large chunks of data, <b>the system can quickly narrow down relevant data</b>.</p><p class="paragraph" style="text-align:left;"><b>Result</b></p><ul><li><p class="paragraph" style="text-align:left;">less data scanned</p></li><li><p class="paragraph" style="text-align:left;">faster query execution</p></li></ul><p class="paragraph" style="text-align:left;">This is essentially about doing less work to get the same result.</p><h3 class="heading" style="text-align:left;" id="5-faster-metadata-operations-show-c">5. 
Faster Metadata Operations (SHOW Commands)</h3><p class="paragraph" style="text-align:left;">Snowflake also improved the performance of <code>SHOW</code> commands. These are used to:</p><ul><li><p class="paragraph" style="text-align:left;">inspect tables</p></li><li><p class="paragraph" style="text-align:left;">check schemas</p></li><li><p class="paragraph" style="text-align:left;">understand system state</p></li></ul><p class="paragraph" style="text-align:left;">This matters because even though these are not heavy queries, <b>they are used frequently</b> and <b>they impact developer workflows</b>. Faster metadata operations improve the overall user experience.</p><h2 class="heading" style="text-align:left;" id="the-bigger-picture">The Bigger Picture</h2><p class="paragraph" style="text-align:left;">None of these individual improvements alone results in a 20% performance gain. But when you combine them, they create a compounding effect.</p><p class="paragraph" style="text-align:left;">That’s the key idea here:</p><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;">Many small optimizations across different parts of the system can together lead to significant overall improvements.</p><figcaption class="blockquote__byline"></figcaption></blockquote></div><p class="paragraph" style="text-align:left;">This is exactly how Snowflake achieves consistent performance gains over time: not through one big change, but through continuous, incremental enhancements.</p><h2 class="heading" style="text-align:left;" id="how-snowflake-measures-this-fairly">How Snowflake Measures This Fairly</h2><p class="paragraph" style="text-align:left;">Snowflake doesn’t just claim performance improvements; it measures them carefully. 
To ensure accuracy, it follows a structured approach:</p><ul><li><p class="paragraph" style="text-align:left;">Identify <b>stable workloads</b></p></li><li><p class="paragraph" style="text-align:left;">Ensure <b>similar query patterns</b></p></li><li><p class="paragraph" style="text-align:left;">Keep <b>data volume consistent</b></p></li><li><p class="paragraph" style="text-align:left;">Track <b>query duration over time</b></p></li></ul><p class="paragraph" style="text-align:left;">This approach ensures:</p><ul><li><p class="paragraph" style="text-align:left;"><b>fair comparison under the same conditions</b></p></li><li><p class="paragraph" style="text-align:left;"><b>actual performance improvements (not artificial gains)</b></p></li><li><p class="paragraph" style="text-align:left;"><b>no misleading benchmarks</b></p></li></ul><p class="paragraph" style="text-align:left;">In other words, Snowflake isn’t just making things faster, it’s proving that real-world workloads are actually improving over time.</p><hr class="content_break"><hr class="content_break"><h2 class="heading" style="text-align:left;" id="takeaways">Takeaways</h2><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Continuous optimization beats big releases</b><br>Instead of relying on occasional major upgrades, Snowflake delivers small, frequent improvements that compound into significant gains over time.</p></li><li><p class="paragraph" style="text-align:left;"><b>Optimize the engine, not the user</b><br>Great systems improve internally so users don’t have to tune queries or change configurations to get better performance.</p></li><li><p class="paragraph" style="text-align:left;"><b>Measure real workloads, not benchmarks</b><br>Synthetic benchmarks can be misleading. 
Tracking real customer workloads ensures improvements actually matter in production.</p></li><li><p class="paragraph" style="text-align:left;"><b>Eliminate unnecessary work</b><br>Skipping redundant optimization steps and avoiding extra computation directly improves performance without adding more resources.</p></li><li><p class="paragraph" style="text-align:left;"><b>Small gains compound over time</b><br>Individual improvements may seem minor, but together they create a noticeable and meaningful impact on overall system performance.</p></li></ol><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;">Official blog from Snowflake: <a class="link" href="https://www.snowflake.com/en/blog/measuring-performance-improvements-spi/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=how-snowflake-reduced-query-time-by-20-without-you-doing-anything" target="_blank" rel="noopener noreferrer nofollow">Snowflake Improves Query Duration by 20% on Stable Workloads Since We Began Tracking the Snowflake Performance Index</a></p><figcaption class="blockquote__byline"></figcaption></blockquote></div><p class="paragraph" style="text-align:left;">By now, you should have a clear idea of <b>how Snowflake reduced query time by 20%</b>. In a nutshell, Snowflake improves query performance by continuously optimizing its core engine and measuring real-world impact using SPI. These incremental improvements reduce query duration by up to 20% without requiring any user changes.</p><p class="paragraph" style="text-align:left;"><b>Congratulations! You&#39;ve just advanced another step in your tech journey. 
Keep progressing!</b></p><hr class="content_break"></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=ccdfbef0-9e3d-4532-92da-20bc41e7c6a1&utm_medium=post_rss&utm_source=hello_world_system_design_newsletter">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Your mouse can now think and Codex comes to mobile!</title>
  <description>Your mouse can now think, Codex comes to mobile and &quot;Shai-Hulud&quot; attacks again</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/500c553f-b2d9-48d3-9448-485d9fed4f2a/Pasted_image__26_.png" length="993573" type="image/png"/>
  <link>https://hw.glich.co/p/why-developers-still-dont-fully-trust-ai-coding-agents</link>
  <guid isPermaLink="true">https://hw.glich.co/p/why-developers-still-dont-fully-trust-ai-coding-agents</guid>
  <pubDate>Sat, 16 May 2026 04:30:00 +0000</pubDate>
  <atom:published>2026-05-16T04:30:00Z</atom:published>
    <dc:creator>Aniket Rawat</dc:creator>
    <category><![CDATA[News]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><hr class="content_break"><p class="paragraph" style="text-align:left;"><b>“Mini Shai-Hulud” attacks npm, PyPI again - </b>A new “Mini Shai-Hulud” supply-chain worm is spreading through popular npm and PyPI packages, compromising developer tools linked to TanStack, Mistral AI, UiPath, and other major ecosystems. Instead of targeting end users directly, the malware steals GitHub, cloud, and CI/CD credentials by hijacking trusted publishing pipelines and developer environments. <a class="link" href="https://thehackernews.com/2026/05/mini-shai-hulud-worm-compromises.html?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=your-mouse-can-now-think-and-codex-comes-to-mobile" target="_blank" rel="noopener noreferrer nofollow">Read more.</a></p><p class="paragraph" style="text-align:left;"><b>Now your mouse can think! - </b>DeepMind is rethinking how humans interact with AI by <a class="link" href="https://deepmind.google/blog/ai-pointer/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=your-mouse-can-now-think-and-codex-comes-to-mobile" target="_blank" rel="noopener noreferrer nofollow">turning the mouse pointer into a contextual control system for embedded agents.</a> Instead of writing long prompts, users can simply point at objects, tables, images, or text and give natural commands like “explain this” or “turn this into a chart.” The system combines gestures with voice input, making AI interaction feel less like prompting a chatbot and more like collaborating with an intelligent operating system.</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="why-developers-still-dont-fully-tru">Why Developers Still Don’t Fully Trust AI Coding Agents</h1><p class="paragraph" style="text-align:left;"><i>AI tools can write code faster than ever, but speed alone is not enough. 
The real challenge is whether developers can actually trust what these agents are doing behind the scenes.</i></p><h2 class="heading" style="text-align:left;" id="ai-agents-are-fast-but-confidence-i">AI Agents Are Fast, But Confidence Is Fragile</h2><p class="paragraph" style="text-align:left;">Over the last few years, AI coding assistants have evolved from autocomplete tools into fully autonomous agents capable of generating features, fixing bugs, refactoring applications, and even opening pull requests. The productivity gains are undeniable. Developers are shipping faster, experimenting more, and automating tedious work that once consumed entire afternoons.</p><p class="paragraph" style="text-align:left;">But there’s a growing problem beneath the excitement: trust.</p><p class="paragraph" style="text-align:left;">Most developers who regularly use AI tools have experienced moments where the output felt… off. Maybe the agent edited files you never asked it to touch. Maybe it introduced logic that technically worked but made the codebase harder to maintain. Sometimes it confidently generated instructions that were simply wrong.</p><p class="paragraph" style="text-align:left;">What makes these moments frustrating is not just the mistake itself, it’s the uncertainty that follows. Developers often spend more time auditing the AI’s work than actually writing code.</p><h2 class="heading" style="text-align:left;" id="the-real-problem-isnt-capability">The Real Problem Isn’t Capability</h2><p class="paragraph" style="text-align:left;">AI companies often focus on benchmark scores and coding performance metrics. But developers rarely stick with tools simply because they are powerful. They stick with tools that are predictable.</p><p class="paragraph" style="text-align:left;">A fast tool that behaves inconsistently creates anxiety. 
Developers want systems that behave reliably under pressure, especially in production environments where every change has consequences.</p><p class="paragraph" style="text-align:left;">This is where many coding agents struggle today. They optimize heavily for task completion, not collaboration.</p><p class="paragraph" style="text-align:left;">One of the biggest frustrations with AI agents is the lack of visibility into their decision-making process.</p><p class="paragraph" style="text-align:left;">You might receive a large diff explaining <i>what</i> changed, but not <i>why</i> those decisions were made. Without understanding the reasoning behind the implementation, reviewing AI-generated code becomes mentally exhausting.</p><p class="paragraph" style="text-align:left;">Consider a simple example.</p><p class="paragraph" style="text-align:left;">An AI agent may generate a compressed one-liner that technically solves the problem. But human developers often prefer readable abstractions, descriptive helper functions, and maintainable naming conventions that future teammates can quickly understand.</p><p class="paragraph" style="text-align:left;">The issue is not correctness. It’s maintainability.</p><p class="paragraph" style="text-align:left;">Readable code is collaborative code, and many AI agents still optimize for efficiency over clarity.</p><h2 class="heading" style="text-align:left;" id="scope-creep-makes-reviews-harder">Scope Creep Makes Reviews Harder</h2><p class="paragraph" style="text-align:left;">Another major trust issue is uncontrolled scope.</p><p class="paragraph" style="text-align:left;">Developers frequently ask agents to make one targeted fix, only to discover changes scattered across unrelated files. Sometimes those edits are useful. Sometimes they even improve the project. 
But they still increase the burden of review.</p><p class="paragraph" style="text-align:left;">This creates a dangerous pattern where engineers must constantly audit AI-generated work for hidden surprises.</p><p class="paragraph" style="text-align:left;">In traditional software development, pull requests are easier to review when they remain tightly scoped. AI agents often break that expectation by behaving like overly enthusiastic contributors trying to “improve everything” at once.</p><p class="paragraph" style="text-align:left;">The result is review fatigue. The industry loves the phrase “human-in-the-loop,” but in many tools that simply means approving the final output after the work is already done.</p><p class="paragraph" style="text-align:left;">Real collaboration should happen earlier.</p><p class="paragraph" style="text-align:left;">Developers need the ability to see the plan before execution, modify the approach mid-process, and guide the agent while it works, not just approve or reject the result afterward. Some newer systems are experimenting with workflows built around planning, context gathering, review stages, and controlled execution. That direction feels promising because it treats AI as a collaborator instead of an autonomous black box.</p><h2 class="heading" style="text-align:left;" id="the-future-of-ai-development-depend">The Future of AI Development Depends on Trust</h2><p class="paragraph" style="text-align:left;">AI coding agents are not failing because they lack intelligence. They are failing because software engineering is fundamentally collaborative work.</p><p class="paragraph" style="text-align:left;">Developers do not just want faster outputs. They want systems they can reason about, maintain, and confidently deploy.</p><p class="paragraph" style="text-align:left;">The next generation of AI tools will not win by being the most autonomous. 
They will win by being the most trustworthy.</p><hr class="content_break"><p class="paragraph" style="text-align:left;"><b>Anthropic moves into AWS -</b> <a class="link" href="https://claude.com/blog/claude-platform-on-aws?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=your-mouse-can-now-think-and-codex-comes-to-mobile" target="_blank" rel="noopener noreferrer nofollow">Anthropic is bringing its full Claude developer platform directly into AWS</a>, giving companies access to native Claude APIs, agent tools, code execution, MCP connectors, and experimental features without leaving the AWS ecosystem. Instead of using a separate Anthropic account, developers can now manage authentication, billing, and security through existing AWS infrastructure and IAM controls.</p><p class="paragraph" style="text-align:left;"><b>Codex comes to mobile phone:</b> <a class="link" href="https://sdtimes.com/ai/openai-announces-codex-for-mobile-devices/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=your-mouse-can-now-think-and-codex-comes-to-mobile" target="_blank" rel="noopener noreferrer nofollow">OpenAI is bringing Codex to mobile devices </a>through the ChatGPT app, letting developers monitor, guide, and approve AI coding tasks directly from their phones. Instead of coding on mobile, users can remotely manage long-running AI agents, review outputs, switch models, and keep development workflows moving from anywhere.</p><hr class="content_break"><h3 class="heading" style="text-align:left;" id="buzz-of-the-week">Buzz of the Week:</h3><p class="paragraph" style="text-align:left;"><b>Deterministic Replay</b></p><p class="paragraph" style="text-align:left;"><b>Deterministic Replay</b> is a debugging technique where a system records enough information about a program’s execution so the exact same behavior can be replayed later, instruction by instruction. 
Most engineers debug by reproducing bugs manually, but deterministic replay lets you “time travel” through crashes, race conditions, and distributed system failures that normally disappear once the system changes state. It is especially important in multithreaded systems, AI agents, trading platforms, game engines, and distributed infrastructure where bugs are often non-deterministic and nearly impossible to reproduce consistently. Tools using deterministic replay can capture inputs, thread scheduling, network events, and system calls so developers can inspect the exact moment something went wrong.</p><hr class="content_break"><h3 class="heading" style="text-align:left;" id="things-that-launched-things-that-we">Things that launched. Things that went viral. Things you&#39;ll pretend to try.</h3><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/9765551e-0037-4459-b1c5-75b0e223908c/image.png?t=1751469165"/></div><p class="paragraph" style="text-align:left;"><b>Czkawka</b></p><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/qarmin/czkawka?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=your-mouse-can-now-think-and-codex-comes-to-mobile" target="_blank" rel="noopener noreferrer nofollow">Czkawka</a> is a powerful duplicate finder for developers.<br>Helps clean massive workspaces, caches, and downloaded assets.</p><h5 class="heading" style="text-align:left;" id="kondo">Kondo</h5><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/tbillington/kondo?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=your-mouse-can-now-think-and-codex-comes-to-mobile" target="_blank" rel="noopener noreferrer nofollow">Kondo</a><b> </b>cleans build 
artifacts and temporary dev files automatically.</p><p class="paragraph" style="text-align:left;"><b>jless</b></p><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/pauljuliusmartinez/jless?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=your-mouse-can-now-think-and-codex-comes-to-mobile" target="_blank" rel="noopener noreferrer nofollow"><b>jless</b></a> is a terminal JSON viewer with navigation and syntax awareness.</p><hr class="content_break"><h3 class="heading" style="text-align:left;" id="build-braincells-not-just-features"><b>Build Braincells, Not Just Features</b></h3><p class="paragraph" style="text-align:left;">This weekend’s read: <a class="link" href="https://sdtimes.com/ai/may-8-2026-ai-updates-from-the-past-week-coder-agents-launch-snyk-claude-partnership-opsera-cursor-partnership-and-more/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=your-mouse-can-now-think-and-codex-comes-to-mobile" target="_blank" rel="noopener noreferrer nofollow">AI updates from the past week.</a></p><p class="paragraph" style="text-align:left;">This week’s watch: <a class="link" href="https://youtu.be/DU9JCFMJp8E?si=SVCvM6TM23I_6T6x&utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=your-mouse-can-now-think-and-codex-comes-to-mobile" target="_blank" rel="noopener noreferrer nofollow">Aviloop: YouTube&#39;s Darkest Mystery</a></p><hr class="content_break"><p class="paragraph" style="text-align:left;">Meanwhile…</p><div class="image"><img alt="" class="image__image" style="" 
src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/bd3d850e-0ed0-4394-a033-0d6d4994ab22/image.png?t=1778819972"/></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=a7540c34-9840-4536-8a77-0f9dbb249e6b&utm_medium=post_rss&utm_source=hello_world_system_design_newsletter">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>What is Service Discovery?</title>
  <description>Explore service discovery for microservices. Understand key patterns, compare tools like Consul and Eureka, and learn best practices for resilient systems.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/3424aaf7-dd05-4b8a-aef8-10ad47678f06/article-image-bd77aaaf-98df-49b9-b020-48e7611903f6.jpg" length="63664" type="image/jpeg"/>
  <link>https://hw.glich.co/p/service-discovery-for-microservices</link>
  <guid isPermaLink="true">https://hw.glich.co/p/service-discovery-for-microservices</guid>
  <pubDate>Wed, 13 May 2026 04:30:00 +0000</pubDate>
  <atom:published>2026-05-13T04:30:00Z</atom:published>
    <dc:creator>Rohit Lakhotia</dc:creator>
    <category><![CDATA[Concepts]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><hr class="content_break"><p class="paragraph" style="text-align:left;">When you&#39;re building with a <a class="link" href="https://hw.glich.co/p/what-are-microservices?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-service-discovery" target="_blank" rel="noopener noreferrer nofollow">microservices architecture</a>, <b>service discovery</b> is the magic that lets all your different services find and talk to each other automatically. Forget hardcoding IP addresses or network locations.</p><p class="paragraph" style="text-align:left;">Think of it like a dynamic, self-updating phonebook for your entire application. In an environment where services are constantly spinning up, shutting down, or scaling out, this isn&#39;t just a nice-to-have; it&#39;s absolutely essential.</p><h2 class="heading" style="text-align:left;" id="why-service-discovery-is-essential-">Why Service Discovery Is Essential for Microservices</h2><p class="paragraph" style="text-align:left;">Imagine trying to call a friend whose phone number changes every few minutes. To have any hope of reaching them, you&#39;d need a contact list that updates in real time. That&#39;s the exact problem microservices face in the cloud.</p><p class="paragraph" style="text-align:left;">Old-school methods, like static IP addresses and manual config files, just can&#39;t keep up when service instances are temporary by design. They come and go as needed.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/bf18bfe3-3533-4cd7-8c2c-fa87048cbc81/b05cef61-9986-485f-95ed-fdb5cb5615f7.jpg?t=1778600546"/></div><p class="paragraph" style="text-align:left;">This is where a <b>service registry</b> steps in to become the single source of truth. 
When a new service instance boots up, it registers its current location (IP address and port) with this central registry. When it’s time to shut down, it de-registers itself.</p><p class="paragraph" style="text-align:left;">Now, any other service that needs to connect can just ask the registry for the current location, so callers never depend on a hardcoded address.</p><h3 class="heading" style="text-align:left;" id="the-issue-with-static-configuration">The Issue with Static Configurations</h3><p class="paragraph" style="text-align:left;">In a large monolithic application, direct interaction between code parts is easy since they run in the same process. However, breaking it into distributed services turns communication into a network challenge.</p><p class="paragraph" style="text-align:left;">Tracking network locations for numerous service instances manually is not only tedious but also error-prone and unmanageable at scale. The problem grows with modern practices like auto-scaling, where the number of service instances fluctuates with traffic. Without automation, stale addresses lead to failed requests and recurring operational issues. A robust <b>service discovery</b> system addresses this by abstracting service locations.</p><h2 class="heading" style="text-align:left;" id="breaking-down-the-core-service-disc">Breaking Down the Core Service Discovery Patterns</h2><p class="paragraph" style="text-align:left;">In a microservices world, services are constantly spinning up and shutting down. So, how do they find each other in this ever-changing environment? The whole process boils down to two fundamental approaches: <b>Client-Side Discovery</b> and <b>Server-Side Discovery</b>.</p><p class="paragraph" style="text-align:left;">Each pattern tackles the same essential question: &quot;Where is the service I need to talk to <i>right now</i>?&quot; Your choice here will shape your application&#39;s architecture, complexity, and even its performance. 
Let&#39;s dig into how they work.</p><h3 class="heading" style="text-align:left;" id="the-client-side-discovery-pattern">The Client-Side Discovery Pattern</h3><p class="paragraph" style="text-align:left;">In the client-side pattern, each client service determines where to send requests. When calling another service like &quot;Payment Service,&quot; the client queries a service registry for available instances and their IPs and ports. The client then uses a <b>load-balancing algorithm</b> to select an instance and connects to it directly.</p><p class="paragraph" style="text-align:left;">The essence is that the client manages service selection and connection. This gives it full control over load balancing, which is useful for custom routing logic, and avoids an intermediary between the client and the service, which can lower network latency.</p><p class="paragraph" style="text-align:left;">However, this pattern couples the client to the service registry: the discovery logic must be implemented in every language and framework you use, which can be challenging in diverse environments.</p><h3 class="heading" style="text-align:left;" id="the-server-side-discovery-pattern">The Server-Side Discovery Pattern</h3><p class="paragraph" style="text-align:left;">In server-side discovery, the client does not query the service registry itself. Instead, it offloads discovery logic to a central point such as a router or load balancer. This intermediary checks the service registry, selects an instance using a load-balancing algorithm, and forwards the request. The client remains unaware of this process. 
For more on how DNS plays a role in routing, see our guide on <a class="link" href="https://hw.glich.co/p/what-is-dns-and-how-does-it-work?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-service-discovery" target="_blank" rel="noopener noreferrer nofollow">DNS and its function</a>.</p><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;">With server-side discovery, the client simply requests, &quot;I need to talk to the Payment Service,&quot; and the infrastructure handles it.</p><figcaption class="blockquote__byline"></figcaption></blockquote></div><p class="paragraph" style="text-align:left;">This approach simplifies client code, reducing the need for discovery and load-balancing logic in each service, resulting in a cleaner system. However, it introduces an additional network hop, adding slight latency, and makes the router or load balancer a crucial component that must be highly reliable.</p><p class="paragraph" style="text-align:left;">Service discovery is part of a larger system. To understand its role, it&#39;s useful to explore various <a class="link" href="https://opsmoon.com/blog/microservices-architecture-design-patterns?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-service-discovery" target="_blank" rel="noopener noreferrer nofollow">microservices architecture patterns</a>. Choosing between client-side and server-side discovery depends on priorities like control, simplicity, or performance.</p><h2 class="heading" style="text-align:left;" id="a-practical-comparison-of-popular-s">A Practical Comparison of Popular Service Discovery Tools</h2><p class="paragraph" style="text-align:left;">Choosing the right service discovery tool for microservices is a significant architectural decision. It&#39;s essential to understand the trade-offs each tool makes between consistency, availability, and features. 
Major tools like Consul, Eureka, etcd, and ZooKeeper differ mainly in their consistency models, which map onto the CAP theorem: during a network partition, a tool has to favor either consistency or availability. Tools are generally either <b>CP</b> (Consistency and Partition Tolerance) or <b>AP</b> (Availability and Partition Tolerance).</p><h3 class="heading" style="text-align:left;" id="hashi-corp-consul">HashiCorp Consul</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.consul.io/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-service-discovery" target="_blank" rel="noopener noreferrer nofollow">Consul</a> is more than a service discovery tool; it&#39;s a service networking platform with a service mesh, key-value store, and robust health checks. It uses the Raft consensus algorithm, ensuring all server nodes agree on the service registry&#39;s state.</p><p class="paragraph" style="text-align:left;">Consul excels with multi-datacenter support, ideal for global applications, and its health checks can assess application-level health.</p><h3 class="heading" style="text-align:left;" id="netflix-eureka">Netflix Eureka</h3><p class="paragraph" style="text-align:left;">Eureka, part of the Netflix OSS stack, is focused on extreme resilience and availability as an AP system. Its main aim is to keep the service registry online, even during network issues. 
If a Eureka server stops receiving the heartbeats it expects, for example during a network partition, it enters self-preservation mode: it stops evicting registered instances and keeps serving the last known data, so lookups keep working even though some entries may be stale.</p><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;"><b>Key Insight:</b> Choosing between a CP tool like Consul and an AP tool like Eureka depends on whether your system can better tolerate brief unavailability (CP) or slightly outdated data (AP).</p><figcaption class="blockquote__byline"></figcaption></blockquote></div><h3 class="heading" style="text-align:left;" id="etcd-and-zookeeper">etcd and ZooKeeper</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://etcd.io/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-service-discovery" target="_blank" rel="noopener noreferrer nofollow">etcd</a> and <a class="link" href="https://zookeeper.apache.org/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-service-discovery" target="_blank" rel="noopener noreferrer nofollow">ZooKeeper</a> are distributed key-value stores often used in service discovery. Both are <b>CP systems</b> known for their consistency, essential for reliable service registries.</p><ul><li><p class="paragraph" style="text-align:left;"><b>etcd:</b> Known for its role in Kubernetes, etcd manages cluster state using the Raft algorithm and is optimized for read operations.</p></li><li><p class="paragraph" style="text-align:left;"><b>ZooKeeper:</b> This Apache project uses the ZAB protocol and has long supported distributed systems like Kafka and Hadoop.</p></li></ul><p class="paragraph" style="text-align:left;">Organizations often use these tools for coordination tasks, extending them to service discovery. 
For more advanced routing, see our article on <b><a class="link" href="https://hw.glich.co/p/api-gateway-patterns-in-microservies?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-service-discovery" target="_blank" rel="noopener noreferrer nofollow">API gateway patterns in microservices</a></b>.</p><h3 class="heading" style="text-align:left;" id="comparison-of-service-discovery-too">Comparison of Service Discovery Tools</h3><p class="paragraph" style="text-align:left;">Here&#39;s a concise guide to help you choose the right tool based on key features and common use cases.</p><div style="padding:14px 20px 14px;"><table class="bh__table" width="100%" style="border-collapse:collapse;"><tr class="bh__table_row"><th class="bh__table_header" width="20%"><p class="paragraph" style="text-align:left;">Feature</p></th><th class="bh__table_header" width="20%"><p class="paragraph" style="text-align:left;">Consul</p></th><th class="bh__table_header" width="20%"><p class="paragraph" style="text-align:left;">Eureka</p></th><th class="bh__table_header" width="20%"><p class="paragraph" style="text-align:left;">etcd</p></th><th class="bh__table_header" width="20%"><p class="paragraph" style="text-align:left;">ZooKeeper</p></th></tr><tr class="bh__table_row"><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;"><b>Consistency Model</b></p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">CP (Raft)</p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">AP (Peer-to-Peer)</p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">CP (Raft)</p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">CP (ZAB)</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;"><b>Primary Use Case</b></p></td><td 
class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">Service Discovery & Mesh</p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">Service Discovery</p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">Distributed Key-Value Store</p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">Distributed Coordination</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;"><b>Health Checking</b></p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">Advanced (App & Script)</p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">Basic (Client Heartbeats)</p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">Basic (TTL on keys)</p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">Basic (Ephemeral nodes)</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;"><b>Multi-Datacenter</b></p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">Yes, first-class support</p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">Custom setup needed</p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">Yes</p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">Yes</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;"><b>K/V Store</b></p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">Yes, fully integrated</p></td><td class="bh__table_cell" width="20%"><p class="paragraph" 
style="text-align:left;">No</p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">Yes, core function</p></td><td class="bh__table_cell" width="20%"><p class="paragraph" style="text-align:left;">Yes, core function</p></td></tr></table></div><p class="paragraph" style="text-align:left;">Ultimately, the right <b>service discovery</b> tool for microservices hinges on your specific requirements. <b>Consul</b> offers an all-in-one solution with service mesh capabilities. <b>Eureka</b> is a good fit when availability matters most. If you already run <b>etcd</b> or <b>ZooKeeper</b> for coordination, you can reuse them for discovery.</p><h3 class="heading" style="text-align:left;" id="exploring-different-service-types">Exploring Different Service Types</h3><p class="paragraph" style="text-align:left;">Kubernetes offers various Service types to expose applications according to your needs:</p><ul><li><p class="paragraph" style="text-align:left;"><b>ClusterIP:</b> The default option, providing an internal IP accessible only within the cluster, ideal for internal microservice communication.</p></li><li><p class="paragraph" style="text-align:left;"><b>NodePort:</b> Exposes the Service on a static port on each node&#39;s IP, allowing external access, typically for development or specific proxy configurations.</p></li><li><p class="paragraph" style="text-align:left;"><b>LoadBalancer:</b> Suitable for cloud environments, this option automatically provisions an external load balancer from your cloud provider to route internet traffic to your Service.</p></li><li><p class="paragraph" style="text-align:left;"><b>Ingress:</b> Although not a Service type, an Ingress controller serves as an L7 load balancer for HTTP/S routing, allowing multiple services to share a single IP based on hostname or path rules, commonly used for web applications.</p></li></ul><h2 class="heading" style="text-align:left;" id="best-practices-for-building-resilie">Best Practices for Building Resilient 
Service Discovery</h2><p class="paragraph" style="text-align:left;">Getting microservices to communicate is essential, but keeping them connected when something fails is what matters in production. A reliable service discovery setup is fundamental for production environments. The focus should be on designing systems with built-in failure tolerance from the start.</p><h3 class="heading" style="text-align:left;" id="implement-meaningful-health-checks">Implement Meaningful Health Checks</h3><p class="paragraph" style="text-align:left;">A simple network ping is insufficient for assessing service health. It&#39;s akin to confirming a chef&#39;s presence without checking their ingredients. A robust system requires in-depth health checks.</p><p class="paragraph" style="text-align:left;">Go beyond network status: verify database connections, dependency access, and core functionality. For example, a payment service is unhealthy if it can&#39;t connect to the payment gateway, even if its own API still responds.</p><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;">A service is &quot;healthy&quot; only if it&#39;s fully operational. Application-level health checks ensure traffic reaches instances that are completely ready.</p><figcaption class="blockquote__byline"></figcaption></blockquote></div><h3 class="heading" style="text-align:left;" id="use-caching-and-smart-retries">Use Caching and Smart Retries</h3><p class="paragraph" style="text-align:left;">If your service registry goes down, services can&#39;t find each other, risking system failure. Client-side caching can help by storing the last known locations of dependencies. If the registry fails, clients use this cached data to continue functioning. Combine this with smart retry logic, like exponential backoff, to avoid overwhelming a struggling service. 
For more on these strategies, see our guide on the <a class="link" href="https://hw.glich.co/p/circuit-breaker-vs-retry?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-is-service-discovery" target="_blank" rel="noopener noreferrer nofollow">circuit breaker vs. retry pattern</a>.</p><h3 class="heading" style="text-align:left;" id="secure-and-federate-your-registry">Secure and Federate Your Registry</h3><p class="paragraph" style="text-align:left;">An unsecured service registry poses a major security risk, as it reveals your infrastructure map. Unauthorized access could allow an attacker to reroute traffic or take services offline. Ensure your registry is secured with robust access controls and encryption.</p><p class="paragraph" style="text-align:left;">For large-scale, multi-region setups, consider registry federation. This involves separate but linked registry clusters in each region, offering significant advantages:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Improved Latency:</b> Services access their local registry for quicker lookups.</p></li><li><p class="paragraph" style="text-align:left;"><b>High Availability:</b> A regional registry outage won’t affect other regions.</p></li><li><p class="paragraph" style="text-align:left;"><b>Fault Isolation:</b> Issues remain confined to one area, avoiding widespread disruptions.</p></li></ul><p class="paragraph" style="text-align:left;">By combining thorough health checks, smart caching, and a secure, distributed registry, you can build a resilient service discovery setup for microservices.</p><p class="paragraph" style="text-align:left;">Even engineers who know these patterns and tools run into recurring questions when implementing service discovery for microservices. Addressing these common confusions can help avoid mistakes and lead to better architectural decisions.</p><h3 class="heading" style="text-align:left;" id="service-mesh-vs-service-registry">Service Mesh vs. 
Service Registry</h3><p class="paragraph" style="text-align:left;">A common question is, &quot;What&#39;s the difference between a service mesh and a service registry?&quot; While they often work in tandem, they address distinct issues.</p><p class="paragraph" style="text-align:left;">A <b>service registry</b> (such as Consul or Eureka) acts like a phonebook, maintaining an updated list of network locations for all services. It is straightforward and crucial.</p><p class="paragraph" style="text-align:left;">A <b>service mesh</b> (like Istio or Linkerd) is more akin to the entire telephone network. It uses the registry to locate services but offers additional features like intelligent traffic routing, encryption, circuit breaking, and observability.</p><p class="paragraph" style="text-align:left;">A service registry identifies <i>where</i> a service is, while a service mesh manages <i>how</i> services interact once identified. Though every service mesh requires a discovery mechanism, not all discovery setups are service meshes. A registry can suffice, but a mesh provides a more robust management toolkit.</p><hr class="content_break"></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=581f2e4e-935d-469a-b20a-30fc599449d5&utm_medium=post_rss&utm_source=hello_world_system_design_newsletter">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How GitHub Uses CodeQL to Secure Code at Scale</title>
  <description>GitHub uses CodeQL to scan code as data, detect vulnerabilities, and secure thousands of repos automatically at scale.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e4af35ab-07e5-444e-b469-e89dbe3d0e3b/image.png" length="127403" type="image/png"/>
  <link>https://hw.glich.co/p/how-github-uses-codeql-to-secure-code-at-scale</link>
  <guid isPermaLink="true">https://hw.glich.co/p/how-github-uses-codeql-to-secure-code-at-scale</guid>
  <pubDate>Mon, 11 May 2026 04:30:00 +0000</pubDate>
  <atom:published>2026-05-11T04:30:00Z</atom:published>
    <dc:creator>Rohit Lakhotia</dc:creator>
    <category><![CDATA[Github]]></category>
    <category><![CDATA[System Design]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h2 class="heading" style="text-align:left;" id="heading-2"></h2><div class="section" style="background-color:#FFFFFF;border-color:#fd5621;border-radius:4px;border-style:solid;border-width:1px;margin:16.0px 16.0px 16.0px 16.0px;padding:16.0px 16.0px 16.0px 16.0px;"><p class="paragraph" style="text-align:left;"><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;font-size:16px;"><i>Welcome to </i></span><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;"><i><b><a class="link" href="https://hw.glich.co/subscribe?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=how-github-uses-codeql-to-secure-code-at-scale" target="_blank" rel="noopener noreferrer nofollow">Hello World</a></b></i></span><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;font-size:16px;"><i>, where we help software engineers learn the art of building scalable and resilient systems.</i></span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;font-size:16px;"><i>You can also check out: </i></span><b><a class="link" href="https://scaleengineer.com/blog/how-snowflake-improved-performance-by-27-without-users-noticing?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=how-github-uses-codeql-to-secure-code-at-scale" target="_blank" rel="noopener noreferrer nofollow">How Snowflake Improved Performance by 27% (Without Users Noticing)</a></b></p></div><hr class="content_break"><div class="section" style="background-color:transparent;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">Table of Contents</h2><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="#what-is-code-ql" rel="noopener noreferrer nofollow">What is CodeQL?</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" 
href="#how-git-hub-uses-code-ql-internally" rel="noopener noreferrer nofollow">How GitHub Uses CodeQL Internally</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#the-better-approach" rel="noopener noreferrer nofollow">The Better Approach</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#treating-security-queries-like-prod" rel="noopener noreferrer nofollow">Treating Security Queries Like Production Code</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#what-do-these-queries-actually-dete" rel="noopener noreferrer nofollow">What Do These Queries Actually Detect?</a></p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="#1-detecting-unsafe-api-usage" rel="noopener noreferrer nofollow">1. Detecting Unsafe API Usage</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#2-enforcing-authorization-rules" rel="noopener noreferrer nofollow">2. Enforcing Authorization Rules</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#3-catching-unsafe-patterns" rel="noopener noreferrer nofollow">3. 
Catching Unsafe Patterns</a></p></li></ul></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#not-all-alerts-are-blockers" rel="noopener noreferrer nofollow">Not All Alerts Are Blockers</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#variant-analysis" rel="noopener noreferrer nofollow">Variant Analysis</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#scaling-this-across-repositories-mr" rel="noopener noreferrer nofollow">Scaling This Across Repositories: MRVA</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#security-built-into-ci" rel="noopener noreferrer nofollow">Security Built into CI</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#takeaways" rel="noopener noreferrer nofollow">Takeaways</a></p></li></ul></div><p class="paragraph" style="text-align:left;">When you think about GitHub, you probably think about code hosting, pull requests, and collaboration. But behind the scenes, GitHub is also responsible for securing <b>thousands of repositories and millions of lines of code</b>. And doing that manually? Not even remotely possible. The answer isn’t just “security engineers reviewing code.” That would never scale. Instead, GitHub relies heavily on something much more powerful, <a class="link" href="https://codeql.github.com/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=how-github-uses-codeql-to-secure-code-at-scale" target="_blank" rel="noopener noreferrer nofollow"><b>CodeQL</b></a>. Let’s discover what it is and how GitHub uses it to secure the code at this huge scale.</p><h2 class="heading" style="text-align:left;" id="what-is-code-ql">What is CodeQL?</h2><p class="paragraph" style="text-align:left;">CodeQL is a <b>static analysis tool</b>. But unlike traditional tools that scan code using simple patterns or keyword matching, it treats code like structured data. 
This means you can actually <i>query your codebase</i> the same way you would query a database.</p><p class="paragraph" style="text-align:left;">So instead of asking “Where is this string used?”, you can ask questions like:</p><ul><li><p class="paragraph" style="text-align:left;">“Where is user input flowing into a database query?”</p></li><li><p class="paragraph" style="text-align:left;">“Which APIs are being used without proper validation?”</p></li><li><p class="paragraph" style="text-align:left;">“Where are we missing authorization checks?”</p></li></ul><p class="paragraph" style="text-align:left;">That shift from text search to semantic understanding is what makes CodeQL so powerful.</p><h2 class="heading" style="text-align:left;" id="how-git-hub-uses-code-ql-internally">How GitHub Uses CodeQL Internally</h2><p class="paragraph" style="text-align:left;">At GitHub, CodeQL isn’t optional; it’s part of the <b>default development workflow</b>. For most repositories, CodeQL runs automatically on every pull request. So whenever a developer pushes code or opens a PR, CodeQL scans it and flags potential issues before the code is merged.</p><p class="paragraph" style="text-align:left;">This ensures that:</p><ul><li><p class="paragraph" style="text-align:left;">vulnerabilities are caught early</p></li><li><p class="paragraph" style="text-align:left;">developers get instant feedback</p></li><li><p class="paragraph" style="text-align:left;">security becomes part of development, not an afterthought</p></li></ul><p class="paragraph" style="text-align:left;">For the majority of GitHub’s repositories, this default setup is enough to maintain a strong security baseline. However, not all codebases are the same. Some systems, like GitHub’s large Ruby monolith, have unique patterns, internal APIs, and specific security risks. 
To handle this, GitHub builds <b>custom </b><a class="link" href="https://docs.github.com/en/code-security/codeql-cli/getting-started-with-the-codeql-cli/customizing-analysis-with-codeql-packs?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=how-github-uses-codeql-to-secure-code-at-scale#about-codeql-packs" target="_blank" rel="noopener noreferrer nofollow"><b>query packs</b></a>.</p><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;"><b>What is a Query Pack?</b></p><p class="paragraph" style="text-align:left;">A query pack is simply a collection of CodeQL queries designed for a specific codebase or use case.</p><p class="paragraph" style="text-align:left;">These queries help detect issues like GitHub-specific risky APIs, missing authorization checks, and unsafe framework usage.</p><figcaption class="blockquote__byline"></figcaption></blockquote></div><p class="paragraph" style="text-align:left;"><b>Why Not Just Write Queries Directly?</b></p><p class="paragraph" style="text-align:left;">Initially, GitHub stored queries directly inside repositories. But this caused problems:</p><ul><li><p class="paragraph" style="text-align:left;">Every update required a deployment</p></li><li><p class="paragraph" style="text-align:left;">Queries weren’t precompiled, which slowed down CI</p></li><li><p class="paragraph" style="text-align:left;">Version mismatches caused confusing failures</p></li></ul><h2 class="heading" style="text-align:left;" id="the-better-approach">The Better Approach</h2><p class="paragraph" style="text-align:left;">GitHub moved query packs to a central registry (GitHub Container Registry). 
This allowed them to version queries properly, deploy updates faster and avoid CI instability.</p><p class="paragraph" style="text-align:left;">They also follow a smart strategy:</p><ul><li><p class="paragraph" style="text-align:left;">During development: use latest dependencies</p></li><li><p class="paragraph" style="text-align:left;">During release: lock versions for stability</p></li></ul><p class="paragraph" style="text-align:left;">This balance ensures both <b>innovation and reliability</b>.</p><h2 class="heading" style="text-align:left;" id="treating-security-queries-like-prod">Treating Security Queries Like Production Code</h2><p class="paragraph" style="text-align:left;">One of the most interesting things GitHub does is how seriously they treat their queries. They don’t just write them and hope they work. Instead, they write unit tests for queries, test them on sample code and run them through CI pipelines. This ensures fewer false positives, better developer trust and stable security checks.</p><h2 class="heading" style="text-align:left;" id="what-do-these-queries-actually-dete">What Do These Queries Actually Detect?</h2><p class="paragraph" style="text-align:left;">The real value of CodeQL comes from what it can detect. GitHub uses custom queries to enforce important security rules.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e4af35ab-07e5-444e-b469-e89dbe3d0e3b/image.png?t=1777393697"/></div><h3 class="heading" style="text-align:left;" id="1-detecting-unsafe-api-usage">1. 
Detecting Unsafe API Usage</h3><p class="paragraph" style="text-align:left;">Some internal APIs become dangerous when they handle unsanitized user input.</p><p class="paragraph" style="text-align:left;">CodeQL identifies:</p><ul><li><p class="paragraph" style="text-align:left;">where input is not sanitized</p></li><li><p class="paragraph" style="text-align:left;">where risky APIs are used improperly</p></li></ul><h3 class="heading" style="text-align:left;" id="2-enforcing-authorization-rules">2. Enforcing Authorization Rules</h3><p class="paragraph" style="text-align:left;">GitHub ensures that every REST API endpoint defines proper access control. If a developer creates an endpoint but forgets to include the required <code>control_access</code> method, CodeQL flags it immediately.</p><p class="paragraph" style="text-align:left;">This prevents:</p><ul><li><p class="paragraph" style="text-align:left;">unauthorized access</p></li><li><p class="paragraph" style="text-align:left;">security gaps in APIs</p></li></ul><h3 class="heading" style="text-align:left;" id="3-catching-unsafe-patterns">3. Catching Unsafe Patterns</h3><p class="paragraph" style="text-align:left;">Example: Using <code>.decrypt</code> on ActiveRecord models.</p><p class="paragraph" style="text-align:left;">In simple terms, this method takes data that was stored in an <b>encrypted form</b> (for safety) and converts it back into plain text. Now here’s the problem. When you use <code>.decrypt</code>, it doesn’t just <b>read </b>the data, it can <b>permanently store it in an unencrypted form</b>. 
That means something that was supposed to stay protected (like passwords, tokens, or personal data) could accidentally become visible in plain text.</p><p class="paragraph" style="text-align:left;">This can:</p><ul><li><p class="paragraph" style="text-align:left;">expose sensitive data</p></li><li><p class="paragraph" style="text-align:left;">break encryption guarantees</p></li></ul><p class="paragraph" style="text-align:left;">CodeQL automatically scans the code and detects when <code>.decrypt</code> is used.</p><p class="paragraph" style="text-align:left;">Instead of waiting for a security engineer to find this manually, CodeQL flags the call during the pull request and alerts the developer immediately. The developer can then avoid using it or replace it with a safer approach.</p><h2 class="heading" style="text-align:left;" id="not-all-alerts-are-blockers">Not All Alerts Are Blockers</h2><p class="paragraph" style="text-align:left;">GitHub uses two types of alerts:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Blocking alerts</b>: Must be fixed before merge</p></li><li><p class="paragraph" style="text-align:left;"><b>Advisory alerts</b>: Guidance for developers</p></li></ul><p class="paragraph" style="text-align:left;">This balance is important. If the rules are too strict, they slow down development. If they’re too loose, they put security at risk. 
GitHub finds the right middle ground.</p><h2 class="heading" style="text-align:left;" id="variant-analysis">Variant Analysis</h2><p class="paragraph" style="text-align:left;">One of the most powerful techniques GitHub uses is <b>variant analysis</b>.</p><p class="paragraph" style="text-align:left;"><b>What is Variant Analysis?</b></p><p class="paragraph" style="text-align:left;">When a vulnerability is found in one place, GitHub asks: “Where else could this same issue exist?” Instead of fixing just one instance, they search for similar patterns across all repositories.</p><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;"><b>Example: IDOR Vulnerability</b></p><p class="paragraph" style="text-align:left;">GitHub once investigated a case where user input was used to fetch a database object and then the same input was reused later incorrectly, leading to unauthorized access: an <b>Insecure Direct Object Reference (IDOR)</b>.</p><p class="paragraph" style="text-align:left;"><b>How They Solved It</b></p><p class="paragraph" style="text-align:left;">They wrote a custom CodeQL query to track user input, follow its flow through the code, and detect risky patterns. Even if the results weren’t perfect, the query narrowed down the search significantly.</p><figcaption class="blockquote__byline"></figcaption></blockquote></div><h2 class="heading" style="text-align:left;" id="scaling-this-across-repositories-mr">Scaling This Across Repositories: MRVA</h2><p class="paragraph" style="text-align:left;">So far, we’ve seen how CodeQL can detect issues in a single repository. But here’s the real problem GitHub faces: What if the same vulnerability exists in <b>hundreds of repositories</b>? Fixing one repo is easy, but finding and fixing it everywhere? 
That’s the hard part.</p><p class="paragraph" style="text-align:left;">This is where <b>Multi-Repository Variant Analysis (MRVA)</b> comes in.</p><p class="paragraph" style="text-align:left;">Instead of fixing a vulnerability in just one place, GitHub writes a <b>CodeQL query for that pattern</b> and runs it across multiple repositories at once. This helps them quickly identify all similar cases across the system.</p><p class="paragraph" style="text-align:left;"><b>How It Works</b></p><p class="paragraph" style="text-align:left;">Let’s say a vulnerability is found where user input is passed unsafely into a database query. Instead of manually searching, GitHub writes a CodeQL query that:</p><ul><li><p class="paragraph" style="text-align:left;">tracks user input</p></li><li><p class="paragraph" style="text-align:left;">follows how it flows through the code</p></li><li><p class="paragraph" style="text-align:left;">detects where it is used dangerously</p></li></ul><p class="paragraph" style="text-align:left;">This query is then executed across many repositories.</p><p class="paragraph" style="text-align:left;"><b>Why Not Use Simple Search?</b></p><p class="paragraph" style="text-align:left;">Simple code search works on text and doesn’t understand logic or data flow at all. CodeQL, on the other hand, understands how data moves and detects real patterns, not just keywords.</p><h2 class="heading" style="text-align:left;" id="security-built-into-ci">Security Built into CI</h2><p class="paragraph" style="text-align:left;">One of the biggest reasons this works is integration with CI. 
CodeQL runs automatically on every pull request without developer intervention.</p><p class="paragraph" style="text-align:left;">This means:</p><ul><li><p class="paragraph" style="text-align:left;">issues are caught early</p></li><li><p class="paragraph" style="text-align:left;">developers fix problems immediately</p></li><li><p class="paragraph" style="text-align:left;">less reliance on separate, after-the-fact security audits</p></li></ul><hr class="content_break"><hr class="content_break"><h2 class="heading" style="text-align:left;" id="takeaways">Takeaways</h2><p class="paragraph" style="text-align:left;">There are some strong system design takeaways here.</p><p class="paragraph" style="text-align:left;"><b>1. Automation Is Non-Negotiable: </b>At scale, manual security review simply doesn’t work. You need automated detection systems.</p><p class="paragraph" style="text-align:left;"><b>2. Customize for Your System: </b>Generic tools are not enough. GitHub builds custom queries for its own needs.</p><p class="paragraph" style="text-align:left;"><b>3. Shift Security Left: </b>Security should happen during development and not after deployment.</p><p class="paragraph" style="text-align:left;"><b>4. Detect Patterns, Not Just Bugs: </b>Fixing one issue is not enough. The better approach is to find all similar issues.</p><p class="paragraph" style="text-align:left;"><b>5. 
Build Trust in Tooling: </b>Testing queries and reducing false positives helps developers actually trust the system.</p><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;">Official blog from GitHub: <a class="link" href="https://github.blog/engineering/how-github-uses-codeql-to-secure-github/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=how-github-uses-codeql-to-secure-code-at-scale" target="_blank" rel="noopener noreferrer nofollow"><b>How GitHub uses CodeQL to secure GitHub</b></a></p><figcaption class="blockquote__byline"></figcaption></blockquote></div><p class="paragraph" style="text-align:left;">By now, you should have a clear idea of <b>how GitHub uses CodeQL to secure code at scale</b>. In a nutshell, GitHub uses CodeQL to automatically scan code for security issues by treating code like data and running queries on it. With custom queries and variant analysis, it detects and prevents vulnerabilities at scale across thousands of repositories.</p><p class="paragraph" style="text-align:left;"><b>Congratulations! You&#39;ve just advanced another step in your tech journey. Keep progressing!</b></p><hr class="content_break"></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=12f3b91f-b094-4362-a7ca-1facd7a537f5&utm_medium=post_rss&utm_source=hello_world_system_design_newsletter">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Apple lets you choose your AI and Anthropic dreams!</title>
  <description>Apple lets you choose your AI, vibe coding has a new name, and Anthropic dreams.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b3ecf0af-bda9-4fea-98af-b39b680724a5/Pasted_image__19_.png" length="908300" type="image/png"/>
  <link>https://hw.glich.co/p/the-bottleneck-was-never-the-code</link>
  <guid isPermaLink="true">https://hw.glich.co/p/the-bottleneck-was-never-the-code</guid>
  <pubDate>Sat, 09 May 2026 04:30:00 +0000</pubDate>
  <atom:published>2026-05-09T04:30:00Z</atom:published>
    <dc:creator>Aniket Rawat</dc:creator>
    <category><![CDATA[News]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><hr class="content_break"><p class="paragraph" style="text-align:left;"><b>Ubuntu goes AI - </b>Ubuntu is preparing to bring AI features directly into the OS, but with a strong focus on local inference, privacy, and user control instead of cloud-first AI assistants. Canonical says the new features will improve accessibility, automation, and troubleshooting while staying optional and open-source friendly. <a class="link" href="https://www.developer-tech.com/news/ubuntu-ai-features-local-inference/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=apple-lets-you-choose-your-ai-and-anthropic-dreams" target="_blank" rel="noopener noreferrer nofollow">Read the full blog here.</a></p><p class="paragraph" style="text-align:left;"><b>Vibe coding has a new name! - </b><a class="link" href="https://sdtimes.com/ai/andrej-karpathy-has-renamed-vibe-coding-heres-what-engineering-leaders-need-to-do-about-it/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=apple-lets-you-choose-your-ai-and-anthropic-dreams" target="_blank" rel="noopener noreferrer nofollow">Andrej Karpathy says “vibe coding” is no longer</a> the right term for modern AI-assisted development, proposing “agentic engineering” instead as AI agents take on larger coding and decision-making roles. The shift reflects how developers are moving from simply prompting AI to actively supervising autonomous coding workflows, architecture, and validation.</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="the-bottleneck-was-never-the-code">The Bottleneck Was Never the Code</h1><p class="paragraph" style="text-align:left;">For years, the software industry believed the biggest problem in engineering was writing code faster. Better programming languages, smarter IDEs, automation tools, and now AI coding agents all promised the same thing: higher developer productivity. And in many ways, they delivered. 
AI tools today can generate features, write tests, fix bugs, and even build prototypes in hours instead of weeks.</p><p class="paragraph" style="text-align:left;">But something interesting is happening inside teams using these tools heavily: coding is no longer the slowest part of software development.</p><p class="paragraph" style="text-align:left;">The real bottleneck is deciding what should actually be built.</p><p class="paragraph" style="text-align:left;">As AI agents become better at implementation, engineers spend less time waiting for code and more time waiting for clarity. Product managers, tech leads, and leadership teams now have to create extremely detailed specifications, acceptance criteria, workflows, and architectural direction so agents can execute correctly. The challenge has shifted from “can we build this?” to “do we clearly understand what we want?”</p><p class="paragraph" style="text-align:left;">That sounds simple, but it exposes a problem software teams have always had: alignment.</p><h2 class="heading" style="text-align:left;" id="collaboration-was-always-the-hard-p">Collaboration Was Always the Hard Part</h2><p class="paragraph" style="text-align:left;">Software has never been just about typing code. Large systems are built by groups of people negotiating priorities, trade-offs, and product decisions. AI agents do not remove this complexity. They amplify it.</p><p class="paragraph" style="text-align:left;">When code becomes cheap to generate, companies naturally start building more things. Features that once took months now take days. Internal tools appear overnight. Experimental products multiply quickly. 
But users still absorb products at the same speed, and teams still struggle with focus.</p><p class="paragraph" style="text-align:left;">This creates a new risk: shipping too much without enough coherence.</p><p class="paragraph" style="text-align:left;">A product with twenty AI-generated features is not automatically better than one with five carefully designed ones. In fact, faster development often increases the need for discipline and prioritization.</p><h2 class="heading" style="text-align:left;" id="why-context-matters-more-than-ever">Why Context Matters More Than Ever</h2><p class="paragraph" style="text-align:left;">One of the biggest hidden challenges in software engineering is context. Senior engineers often make decisions based on years of undocumented knowledge: old outages, failed migrations, architectural compromises, and unwritten team conventions.</p><p class="paragraph" style="text-align:left;">Humans absorb this naturally through meetings, debugging sessions, Slack conversations, and shared experiences. AI agents cannot.</p><p class="paragraph" style="text-align:left;">Agents only know what is written down or explicitly connected to them. Missing context can lead to technically correct but strategically wrong decisions. That is why many companies are now investing in systems that turn scattered company knowledge into searchable, structured context for AI agents.</p><p class="paragraph" style="text-align:left;">This may become one of the most important infrastructure layers of the AI era.</p><h2 class="heading" style="text-align:left;" id="the-new-competitive-advantage">The New Competitive Advantage</h2><p class="paragraph" style="text-align:left;">The companies that succeed with AI will probably not just be the ones with the best models. They will be the organizations that maintain alignment while scaling output rapidly.</p><p class="paragraph" style="text-align:left;">AI is becoming a multiplier for organizational quality. 
Strong teams become dramatically faster. Weak teams become chaotic faster.</p><p class="paragraph" style="text-align:left;">The future of software engineering may depend less on raw coding ability and more on communication, documentation, focus, and the ability to preserve shared understanding across teams. The bottleneck was never truly the code. It was always the humans trying to stay aligned while building something together.</p><hr class="content_break"><p class="paragraph" style="text-align:left;"><b>Anthropic’s new feature that Dreams! -</b> <a class="link" href="https://venturebeat.com/technology/anthropic-introduces-dreaming-a-system-that-lets-ai-agents-learn-from-their-own-mistakes?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=apple-lets-you-choose-your-ai-and-anthropic-dreams" target="_blank" rel="noopener noreferrer nofollow">Anthropic has introduced a new “dreaming” system</a> that allows AI agents to review past actions, identify mistakes, and improve future performance without direct human retraining. The feature is designed to make long-running AI agents more autonomous by giving them a kind of reflective memory between tasks and sessions.</p><p class="paragraph" style="text-align:left;"><b>Apple lets you choose your AI models:</b> <a class="link" href="https://www.theverge.com/tech/924515/apple-intelligence-third-party-chatbot-extensions-ios-27?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=apple-lets-you-choose-your-ai-and-anthropic-dreams" target="_blank" rel="noopener noreferrer nofollow">Apple is reportedly planning to open Apple Intelligence to third-party AI chatbots in iOS 27,</a> letting users choose models like Claude, Gemini, or ChatGPT as their default AI assistant. 
The new “Extensions” system could plug external AI directly into Siri, Writing Tools, and other system features, marking a major shift away from Apple’s traditionally closed ecosystem.</p><hr class="content_break"><h3 class="heading" style="text-align:left;" id="buzz-of-the-week">Buzz of the Week:</h3><p class="paragraph" style="text-align:left;"><b>Semantic Caching</b></p><p class="paragraph" style="text-align:left;"><b>Semantic Caching</b> is an AI infrastructure technique where systems cache answers based on meaning instead of exact text matches, dramatically reducing latency and inference costs for LLM applications. Most engineers know traditional caching, but semantic caching uses embeddings to detect when two different prompts are “close enough” to reuse an earlier response. For example, “How do I reset my password?” and “I forgot my password, what do I do?” may trigger the same cached AI result even though the wording is different. This is becoming critical because AI apps are expensive to run at scale, especially with agent workflows repeatedly asking similar questions. Companies building production AI systems are now combining vector databases, embedding models, and semantic similarity search to create smarter caching layers.</p><hr class="content_break"><h3 class="heading" style="text-align:left;" id="things-that-launched-things-that-we">Things that launched. Things that went viral. 
Things you&#39;ll pretend to try.</h3><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/9765551e-0037-4459-b1c5-75b0e223908c/image.png?t=1751469165"/></div><p class="paragraph" style="text-align:left;"><b>Comby</b></p><p class="paragraph" style="text-align:left;"><a class="link" href="https://comby.dev/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=apple-lets-you-choose-your-ai-and-anthropic-dreams" target="_blank" rel="noopener noreferrer nofollow">Comby</a> is a structural search-and-replace tool for codebases.</p><h5 class="heading" style="text-align:left;" id="zoxide">Zoxide</h5><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/ajeetdsouza/zoxide?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=apple-lets-you-choose-your-ai-and-anthropic-dreams" target="_blank" rel="noopener noreferrer nofollow">Zoxide</a> is a smarter <code>cd</code> command that learns your habits.</p><p class="paragraph" style="text-align:left;"><b>WinFindGrep</b></p><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/valginer0/WinFindGrep?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=apple-lets-you-choose-your-ai-and-anthropic-dreams" target="_blank" rel="noopener noreferrer nofollow"><b>WinFindGrep</b></a> is a multi-directory text search-and-replace utility for Windows with grep-like functionality.</p><hr class="content_break"><h3 class="heading" style="text-align:left;" id="build-braincells-not-just-features"><b>Build Braincells, Not Just Features</b></h3><p class="paragraph" style="text-align:left;">This weekend’s read: <a class="link" 
href="https://sdtimes.com/test/ai-is-generating-more-tests-but-are-they-preventing-the-next-cloud-outage/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=apple-lets-you-choose-your-ai-and-anthropic-dreams" target="_blank" rel="noopener noreferrer nofollow">AI Is Generating More Tests. But Are They Preventing the Next Cloud Outage?</a></p><p class="paragraph" style="text-align:left;">This week’s watch: <a class="link" href="https://youtu.be/cOAaonpTLlc?si=VoYxmR7thGN4nHN0&utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=apple-lets-you-choose-your-ai-and-anthropic-dreams" target="_blank" rel="noopener noreferrer nofollow">The Hunt for Lux: The Internet’s Most Disturbed User.</a></p><hr class="content_break"><p class="paragraph" style="text-align:left;">Meanwhile…</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/9923adb5-d7f1-4c5b-b102-badfda5f4017/image.png?t=1778252341"/></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=02d1edbd-d79d-4fa1-96d7-d6763953bd04&utm_medium=post_rss&utm_source=hello_world_system_design_newsletter">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>What are Distributed Systems?</title>
  <description>Explore distributed systems architecture with practical insights, design patterns, and real-world examples to enhance your understanding and skills.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/3424aaf7-dd05-4b8a-aef8-10ad47678f06/article-image-bd77aaaf-98df-49b9-b020-48e7611903f6.jpg" length="63664" type="image/jpeg"/>
  <link>https://hw.glich.co/p/distributed-systems-architecture</link>
  <guid isPermaLink="true">https://hw.glich.co/p/distributed-systems-architecture</guid>
  <pubDate>Wed, 06 May 2026 04:30:00 +0000</pubDate>
  <atom:published>2026-05-06T04:30:00Z</atom:published>
    <dc:creator>Rohit Lakhotia</dc:creator>
    <category><![CDATA[Distributed System]]></category>
    <category><![CDATA[Concepts]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><hr class="content_break"><hr class="content_break"><p class="paragraph" style="text-align:left;">A distributed systems architecture is a method of building software where different application components run on separate, networked computers. These components communicate by passing messages, coordinating their actions to achieve a common goal. This approach is the backbone of modern, large-scale services like <i>Netflix</i> and <i>Amazon</i>, enabling them to handle massive user loads and remain available even when individual parts fail.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/1a95108c-1bf0-4c9d-b5b7-bebc792f041d/d73667a4-2bae-471c-8bc4-e309eca4546e.jpg?t=1777997353"/></div><h3 class="heading" style="text-align:left;" id="why-distributed-systems-matter">Why Distributed Systems Matter</h3><p class="paragraph" style="text-align:left;">Contrast a traditional monolithic system with a distributed system architecture. A monolith is akin to a single chef managing an entire restaurant kitchen, from preparation to cleaning, eventually becoming a bottleneck. 
In contrast, a distributed system is like a culinary team, where specialized chefs handle specific tasks, working together to deliver a complete meal, much like distributed components collaborate to deliver an application.</p><p class="paragraph" style="text-align:left;">This model offers practical benefits:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Scalability</b>: Handle user surges by adding more computers to the network, as seen with services like Instagram.</p></li><li><p class="paragraph" style="text-align:left;"><b>Fault Tolerance</b>: If one component fails, the system continues to function, preventing total outages.</p></li><li><p class="paragraph" style="text-align:left;"><b>Performance</b>: Tasks are processed in parallel across machines, improving efficiency and user experience.</p></li></ul><p class="paragraph" style="text-align:left;">Implementing distributed systems is crucial for resilient applications. For further understanding, explore the core <a class="link" href="https://hw.glich.co/p/system-design-fundamentals?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-are-distributed-systems" target="_blank" rel="noopener noreferrer nofollow">system design fundamentals</a>.</p><h3 class="heading" style="text-align:left;" id="core-principles-of-distributed-syst">Core Principles of Distributed Systems at a Glance</h3><div style="padding:14px 20px 14px;"><table class="bh__table" width="100%" style="border-collapse:collapse;"><tr class="bh__table_row"><th class="bh__table_header" width="33%"><p class="paragraph" style="text-align:left;">Principle</p></th><th class="bh__table_header" width="33%"><p class="paragraph" style="text-align:left;">Description</p></th><th class="bh__table_header" width="33%"><p class="paragraph" style="text-align:left;">Practical Example</p></th></tr><tr class="bh__table_row"><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;"><b>Concurrency</b></p></td><td 
class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">Multiple processes execute simultaneously across different machines.</p></td><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">A ride-sharing app processing thousands of simultaneous ride requests, location updates, and driver assignments across its server fleet.</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;"><b>No Global Clock</b></p></td><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">Each computer has its own independent clock, making it difficult to determine the precise global order of events.</p></td><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">A financial trading system cannot rely on wall-clock time alone; it can use logical clocks such as Lamport timestamps to establish a consistent order for trades submitted from different geographical locations.</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;"><b>Independent Failures</b></p></td><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">One component of the system can fail without necessarily bringing down the entire application.</p></td><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;">In an e-commerce platform, the recommendation engine might fail, but users can still search for products, add them to the cart, and complete a purchase.</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="33%"><p class="paragraph" style="text-align:left;"><b>Message Passing</b></p></td><td class="bh__table_cell" width="33%"><p class="paragraph" 
style="text-align:left;">A microservice for user authentication sends a message to the notification service to trigger a welcome email. The network could delay or drop this message.</p></td></tr></table></div><p class="paragraph" style="text-align:left;">Each of these principles introduces unique challenges. The lack of a global clock complicates data consistency, while independent failures necessitate robust error-handling mechanisms. However, mastering these principles is what allows engineers to build systems that can operate reliably at a global scale.</p><h2 class="heading" style="text-align:left;" id="understanding-core-distributed-syst">Understanding Core Distributed Systems Concepts</h2><p class="paragraph" style="text-align:left;">Before designing a distributed system, you need to understand a few key concepts. The most important are <b>nodes</b> (individual networked computers), <b>latency</b> (communication delay between nodes), and <b>fault tolerance</b> (the ability to function despite failures).</p><p class="paragraph" style="text-align:left;">Latency has a physical lower bound. For instance, a video call across continents has a delay because the signal must travel through undersea cables. Similarly, data moving between servers in different data centers is limited by the speed of light.</p><p class="paragraph" style="text-align:left;">A tough challenge is <b>partial failure</b>, where one node or link fails while the rest of the system keeps running. For example, if a database replica is unreachable, some users may get stale data while others receive current information. 
Designing for partial failure is what distinguishes reliable systems from fragile ones.</p><p class="paragraph" style="text-align:left;">This model&#39;s growing adoption shows up in market trends, with the distributed hybrid infrastructure market projected to reach <b>$234.56 billion</b> by 2030, according to a <a class="link" href="https://www.360iresearch.com/library/intelligence/distributed-hybrid-infrastructure?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-are-distributed-systems" target="_blank" rel="noopener noreferrer nofollow">360iResearch report</a>. In such settings, performance techniques are essential, as discussed in our guide on <a class="link" href="https://hw.glich.co/p/what-is-distributed-caching?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-are-distributed-systems" target="_blank" rel="noopener noreferrer nofollow">distributed caching</a>.</p><h2 class="heading" style="text-align:left;" id="navigating-tradeoffs-with-the-cap-t">Navigating Trade-offs with the CAP Theorem</h2><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/c392cd37-51d4-463c-8ba2-911d39f3e188/d59d3c90-a6a2-4294-aef9-dfe60e880a37.jpg?t=1777997353"/></div><p class="paragraph" style="text-align:left;">In distributed systems, design decisions involve trade-offs, and the <b>CAP Theorem</b> describes one of the most important. It states that a distributed data store can guarantee at most two of the following three properties at the same time: <b>C</b>onsistency, <b>A</b>vailability, and <b>P</b>artition Tolerance.</p><p class="paragraph" style="text-align:left;">Because network partitions are unavoidable in practice, the real choice during a partition is between consistency and availability. 
For instance, in an airline booking system:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Availability (AP System)</b>: Users can keep booking in both separated data centers, at the risk of double-bookings that must be reconciled later.</p></li><li><p class="paragraph" style="text-align:left;"><b>Consistency (CP System)</b>: One side of the partition becomes read-only or unavailable to prevent double-bookings, which may turn away customers until the partition heals.</p></li></ul><p class="paragraph" style="text-align:left;">A banking system typically prioritizes consistency to avoid financial errors, while social media favors availability to keep content accessible, even if outdated. For more detail, see our guide to <a class="link" href="https://hw.glich.co/p/what-is-cap-theorem?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-are-distributed-systems" target="_blank" rel="noopener noreferrer nofollow">the CAP theorem</a> and its impact on system design.</p><h2 class="heading" style="text-align:left;" id="key-patterns-for-building-resilient">Key Patterns for Building Resilient Systems</h2><p class="paragraph" style="text-align:left;">With the core concepts in place, let&#39;s examine the patterns that underpin an effective <b>distributed systems architecture</b>. These patterns are proven solutions for scaling challenges.</p><p class="paragraph" style="text-align:left;">Two essential patterns are <b>replication</b> and <b>sharding</b>. Replication keeps multiple copies of data on different nodes so the data stays accessible if a server fails. Sharding, or partitioning, splits massive datasets into smaller pieces, called shards, spread across multiple servers. 
For instance, a social media app might shard user data by region to speed up local queries.</p><p class="paragraph" style="text-align:left;">The infographic below shows a typical layered structure for these systems.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/36a408f8-104b-4530-bfd7-f0c2856bec97/855f9f52-abe2-4618-8cd2-0a147dd30d56.jpg?t=1777997353"/></div><p class="paragraph" style="text-align:left;">This design divides the presentation layer, application logic, and data storage, isolating failures to prevent them from affecting other layers. Efficient data movement between layers is addressed by data pipeline architectures like Lambda and Kappa. Implementing these requires knowledge of fault-tolerance strategies, such as the circuit breaker and retry patterns.</p><h3 class="heading" style="text-align:left;" id="comparison-of-core-distributed-syst">Comparison of Core Distributed System Patterns</h3><p class="paragraph" style="text-align:left;">To make sense of the most common architectural patterns, it helps to see them side-by-side. 
The table below breaks down their primary goals, typical use cases, and the key benefits they bring to the table.</p><div style="padding:14px 20px 14px;"><table class="bh__table" width="100%" style="border-collapse:collapse;"><tr class="bh__table_row"><th class="bh__table_header" width="25%"><p class="paragraph" style="text-align:left;">Pattern</p></th><th class="bh__table_header" width="25%"><p class="paragraph" style="text-align:left;">Primary Goal</p></th><th class="bh__table_header" width="25%"><p class="paragraph" style="text-align:left;">Common Use Case</p></th><th class="bh__table_header" width="25%"><p class="paragraph" style="text-align:left;">Key Benefit</p></th></tr><tr class="bh__table_row"><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;"><b>Replication</b></p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Enhance availability and durability</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Ensuring a database can survive a server crash by keeping copies on other nodes.</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">High availability</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;"><b>Sharding</b></p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Improve scalability and performance</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Splitting a massive user database across multiple servers to speed up queries.</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Horizontal scaling</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;"><b>Load Balancing</b></p></td><td class="bh__table_cell" width="25%"><p class="paragraph" 
style="text-align:left;">Distribute incoming traffic evenly</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Spreading web requests across a fleet of application servers to prevent overload.</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Improved performance</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;"><b>Service Discovery</b></p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Enable dynamic communication</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Allowing a microservice to find the network location of another service it depends on.</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Agility and resilience</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;"><b>Circuit Breaker</b></p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Prevent cascading failures</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Stopping requests to a failing service to give it time to recover.</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Fault tolerance</p></td></tr><tr class="bh__table_row"><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;"><b>API Gateway</b></p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Simplify client-side interactions</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" style="text-align:left;">Providing a single entry point for all client requests to a microservices backend.</p></td><td class="bh__table_cell" width="25%"><p class="paragraph" 
style="text-align:left;">Centralized management</p></td></tr></table></div><p class="paragraph" style="text-align:left;">Each of these patterns addresses a specific piece of the distributed systems puzzle. By combining them thoughtfully, you can build systems that are not just powerful but also resilient enough to withstand inevitable failures.</p><h2 class="heading" style="text-align:left;" id="how-leading-tech-companies-use-thes">How Leading Tech Companies Use These Architectures</h2><p class="paragraph" style="text-align:left;">Top technology companies apply these patterns at large scale, turning <b>distributed systems architecture</b> from theory into everyday products.</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://netflix.com?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-are-distributed-systems" target="_blank" rel="noopener noreferrer nofollow">Netflix</a> exemplifies microservices architecture: its platform is built from numerous independent services, each handling a specific task such as billing or user authentication. This design helps ensure that a failure in one service, such as the recommendation engine, doesn&#39;t take down other functions like streaming.</p><p class="paragraph" style="text-align:left;">E-commerce leaders such as <a class="link" href="https://amazon.com?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-are-distributed-systems" target="_blank" rel="noopener noreferrer nofollow">Amazon</a> use sharding and replication to handle millions of transactions during events like Prime Day. 
Their data is sharded and replicated globally, providing low-latency access and high availability even under heavy load.</p><p class="paragraph" style="text-align:left;">For more on this topic, see <a class="link" href="https://hw.glich.co/p/how-meta-distributes-exabytes-of-data-across-the-world-so-fast?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-are-distributed-systems" target="_blank" rel="noopener noreferrer nofollow">how Meta efficiently manages exabytes of data worldwide</a>.</p><h2 class="heading" style="text-align:left;" id="designing-for-a-distributed-future">Designing for a Distributed Future</h2><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/681a754c-0810-425d-bb62-da4b955c2f7f/7866432e-cefc-47b8-b9e0-abfeb22052e2.jpg?t=1777997353"/></div><p class="paragraph" style="text-align:left;">Adopting a distributed systems architecture is a strategic choice that supports global scale, high availability, and resilience. Initially utilized by web-scale companies, these practices are now widespread in various industries.</p><p class="paragraph" style="text-align:left;">In the industrial sector, distributed control systems (DCS) oversee large-scale manufacturing and energy operations. The DCS market is expected to grow from <b>USD 22.71 billion</b> to <b>USD 29.37 billion</b> by 2030, as reported by <a class="link" href="https://www.mordorintelligence.com/industry-reports/distributed-control-system-market?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=what-are-distributed-systems" target="_blank" rel="noopener noreferrer nofollow">Mordor Intelligence</a>. 
This growth highlights significant investment in distributed technologies for critical applications.</p><p class="paragraph" style="text-align:left;">Trends like serverless and edge computing enhance this model by reducing latency and expanding possibilities, offering essential tools to tackle complex engineering challenges.</p><h2 class="heading" style="text-align:left;" id="frequently-asked-questions">Frequently Asked Questions</h2><h3 class="heading" style="text-align:left;" id="what-is-the-biggest-challenge-in-di">What Is the Biggest Challenge in Distributed Systems Design?</h3><p class="paragraph" style="text-align:left;">Handling <b>partial failures</b> is a major challenge when transitioning from monoliths. Unlike monolithic apps, which are either fully operational or not, distributed systems can experience ongoing partial failures like network unreliability, server crashes, or slow databases while still functioning overall. This necessitates designing for failure from the start, using essential tools like data replication and circuit breakers. The main challenge lies in ensuring all nodes agree on the system&#39;s state, despite some being unreliable or unreachable, maintaining availability and consistency. Thus, you must prepare for a broader range of failure scenarios than in centralized systems.</p><h3 class="heading" style="text-align:left;" id="when-is-a-monolithic-architecture-a">When Is a Monolithic Architecture a Better Choice?</h3><p class="paragraph" style="text-align:left;">Despite the hype surrounding distributed systems, a monolith can be ideal for smaller projects or early-stage startups with a straightforward scope. Its simplicity allows easier development, testing, and deployment, offering a speed advantage crucial for small teams. If there&#39;s no urgent need to scale parts of your application separately, a monolith is a practical choice. 
Start simple and consider a shift to a distributed architecture only if complexity necessitates it.</p><hr class="content_break"></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=300e150b-bb19-4346-a87b-256c8f569ad2&utm_medium=post_rss&utm_source=hello_world_system_design_newsletter">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How Snowflake Improved Performance by 27% (Without Users Noticing)</title>
  <description>Snowflake boosted performance by 27% through backend optimizations in ingestion, planning, and execution, delivering faster queries and lower costs automatically.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/2f6d77d1-b744-4a1a-819a-bf4b88efb1dc/image.png" length="125751" type="image/png"/>
  <link>https://hw.glich.co/p/how-snowflake-improved-performance-by-27-without-users-noticing</link>
  <guid isPermaLink="true">https://hw.glich.co/p/how-snowflake-improved-performance-by-27-without-users-noticing</guid>
  <pubDate>Mon, 04 May 2026 04:30:00 +0000</pubDate>
  <atom:published>2026-05-04T04:30:00Z</atom:published>
    <dc:creator>Rohit Lakhotia</dc:creator>
    <category><![CDATA[Snowflake]]></category>
    <category><![CDATA[System Design]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><div class="section" style="background-color:#FFFFFF;border-color:#fd5621;border-radius:4px;border-style:solid;border-width:1px;margin:16.0px 16.0px 16.0px 16.0px;padding:16.0px 16.0px 16.0px 16.0px;"><p class="paragraph" style="text-align:left;"><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;font-size:16px;"><i>Welcome to </i></span><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;"><i><b><a class="link" href="https://hw.glich.co/subscribe?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=how-snowflake-improved-performance-by-27-without-users-noticing" target="_blank" rel="noopener noreferrer nofollow">Hello World</a></b></i></span><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;font-size:16px;"><i>, where we help software engineers learn the art of building scalable and resilient systems.</i></span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(0, 0, 0);font-family:Georgia, Times New Roman, serif;font-size:16px;"><i>You can also check out: </i></span><a class="link" href="https://scaleengineer.com/blog/how-nomad-by-hashicorp-reduced-scheduler-load-by-90?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=how-snowflake-improved-performance-by-27-without-users-noticing" target="_blank" rel="noopener noreferrer nofollow">How Nomad by HashiCorp Reduced Scheduler Load by 90%</a></p></div><hr class="content_break"><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.vpdae.com/redirect/7u7zeuig4s7wjgmwuxb92wlfwm?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=how-snowflake-improved-performance-by-27-without-users-noticing"><span class="button__text" style=""> Explore Updates </span></a></div><hr class="content_break"><hr class="content_break"><div class="section" style="background-color:transparent;margin:0.0px 0.0px 
0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><h2 class="heading" style="text-align:left;">Table of Contents</h2><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="#what-is-snowflake" rel="noopener noreferrer nofollow">What is Snowflake?</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#what-is-the-snowflake-performance-i" rel="noopener noreferrer nofollow">What is the Snowflake Performance Index (SPI)?</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#the-numbers" rel="noopener noreferrer nofollow">The Numbers</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#how-snowflake-improves-performance" rel="noopener noreferrer nofollow">How Snowflake Improves Performance</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#key-areas-where-snowflake-improved-" rel="noopener noreferrer nofollow">Key Areas Where Snowflake Improved Performance</a></p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="#1-faster-data-ingestion" rel="noopener noreferrer nofollow">1. Faster Data Ingestion</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#2-faster-communication-between-node" rel="noopener noreferrer nofollow">2. Faster Communication Between Nodes</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#3-smarter-query-optimization" rel="noopener noreferrer nofollow">3. Smarter Query Optimization</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#4-improvements-in-query-execution" rel="noopener noreferrer nofollow">4. Improvements in Query Execution</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#5-better-memory-management" rel="noopener noreferrer nofollow">5. 
Better Memory Management</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#6-smarter-query-pruning-with-top-k-" rel="noopener noreferrer nofollow">6. Smarter Query Pruning with Top-K Optimization</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#7-query-acceleration-service-enhanc" rel="noopener noreferrer nofollow">7. Query Acceleration Service Enhancements</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#8-better-clustering-and-data-prunin" rel="noopener noreferrer nofollow">8. Better Clustering and Data Pruning</a></p></li></ul></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="#takeaways" rel="noopener noreferrer nofollow">Takeaways</a></p></li></ul></div><p class="paragraph" style="text-align:left;">Imagine using a product that keeps getting faster every week without upgrades, without migrations, and without touching a single setting. That’s exactly what Snowflake has been doing. Over time, Snowflake has improved performance by <b>27%</b>, as tracked by a metric called the <b>Snowflake Performance Index (SPI)</b>. This isn’t just a random number; it represents <b>real improvements experienced by real users on real workloads</b>. 
Let’s look at what exactly happened and how Snowflake achieved it.</p><h2 class="heading" style="text-align:left;" id="what-is-snowflake">What is Snowflake?</h2><p class="paragraph" style="text-align:left;">Snowflake is a <b>cloud data platform</b>.</p><p class="paragraph" style="text-align:left;">In simple terms, it helps companies store, process, and analyze massive amounts of data.</p><p class="paragraph" style="text-align:left;">You can:</p><ul><li><p class="paragraph" style="text-align:left;">Run SQL queries</p></li><li><p class="paragraph" style="text-align:left;">Build dashboards</p></li><li><p class="paragraph" style="text-align:left;">Process analytics workloads</p></li><li><p class="paragraph" style="text-align:left;">Ingest data from multiple sources</p></li></ul><p class="paragraph" style="text-align:left;">All without worrying about infrastructure.</p><h2 class="heading" style="text-align:left;" id="what-is-the-snowflake-performance-i">What is the Snowflake Performance Index (SPI)?</h2><p class="paragraph" style="text-align:left;">The <b>Snowflake Performance Index (SPI)</b> is a metric Snowflake introduced to measure how much faster customer workloads are getting over time.</p><p class="paragraph" style="text-align:left;">This is important because most companies:</p><ul><li><p class="paragraph" style="text-align:left;">Use synthetic benchmarks (artificial tests)</p></li><li><p class="paragraph" style="text-align:left;">Show ideal scenarios</p></li></ul><p class="paragraph" style="text-align:left;">Snowflake does something different. 
It measures performance on <b>real production workloads</b>, which means real queries, real data, and real usage patterns.</p><h2 class="heading" style="text-align:left;" id="the-numbers">The Numbers</h2><ul><li><p class="paragraph" style="text-align:left;"><b>27% improvement since Aug 2022</b></p></li><li><p class="paragraph" style="text-align:left;"><b>12% improvement in the last 12 months</b></p></li></ul><p class="paragraph" style="text-align:left;">This is not theoretical; it is what customers actually experience.</p><h2 class="heading" style="text-align:left;" id="how-snowflake-improves-performance">How Snowflake Improves Performance</h2><p class="paragraph" style="text-align:left;">Here’s the interesting part. Snowflake doesn’t rely on:</p><ul><li><p class="paragraph" style="text-align:left;">Manual tuning</p></li><li><p class="paragraph" style="text-align:left;">Configuration changes</p></li><li><p class="paragraph" style="text-align:left;">Version upgrades</p></li></ul><p class="paragraph" style="text-align:left;">Instead, it continuously improves the <b>core engine</b> and releases updates <b>every week</b>.</p><p class="paragraph" style="text-align:left;">This works because Snowflake uses a <b>fully managed cloud architecture</b> and a <b>consumption-based pricing model</b>.</p><p class="paragraph" style="text-align:left;">So when performance improves:</p><ul><li><p class="paragraph" style="text-align:left;">Queries run faster</p></li><li><p class="paragraph" style="text-align:left;">Compute usage drops</p></li><li><p class="paragraph" style="text-align:left;">Costs drop automatically</p></li></ul><h2 class="heading" style="text-align:left;" id="key-areas-where-snowflake-improved-">Key Areas Where Snowflake Improved Performance</h2><h3 class="heading" style="text-align:left;" id="1-faster-data-ingestion">1. Faster Data Ingestion</h3><p class="paragraph" style="text-align:left;">Everything starts with getting data into Snowflake. 
If ingestion itself is slow, everything downstream gets delayed.</p><p class="paragraph" style="text-align:left;">Snowflake improved how it reads and processes semi-structured data formats like JSON and Parquet. These formats are widely used but can be expensive to parse, especially when dealing with case-insensitive data. By optimizing this layer, Snowflake was able to improve ingestion performance by up to 25%.</p><p class="paragraph" style="text-align:left;">As a result, data becomes available for querying sooner. If you’re running pipelines or near real-time analytics, this directly reduces the overall latency of your system.</p><h3 class="heading" style="text-align:left;" id="2-faster-communication-between-node">2. Faster Communication Between Nodes</h3><p class="paragraph" style="text-align:left;">Snowflake doesn’t run your query on a single machine. It spreads the work across multiple machines (nodes). These machines constantly send data to each other while processing a query.</p><p class="paragraph" style="text-align:left;">Previously, this communication could become a bottleneck.</p><p class="paragraph" style="text-align:left;">So Snowflake improved:</p><ul><li><p class="paragraph" style="text-align:left;">how data is transferred between nodes</p></li><li><p class="paragraph" style="text-align:left;">how data is compressed before sending</p></li><li><p class="paragraph" style="text-align:left;">how aggregations are placed, essentially deciding <i>where</i> certain computations (like SUM or COUNT) should happen so that less data needs to move across the network</p></li></ul><p class="paragraph" style="text-align:left;">You can think of it like this: instead of moving a lot of raw data across machines, Snowflake now tries to <b>process data earlier and send less of it around</b>.</p><p class="paragraph" style="text-align:left;">This reduces waiting time and makes queries faster.</p><h3 class="heading" style="text-align:left;" 
id="3-smarter-query-optimization">3. Smarter Query Optimization</h3><p class="paragraph" style="text-align:left;">One of the biggest wins comes from making the query optimizer smarter.</p><p class="paragraph" style="text-align:left;">Whenever you run a SQL query, the system doesn’t just execute it as-is. It first decides the best way to run it: which tables to scan first, how to perform joins, where to filter data, and so on.</p><p class="paragraph" style="text-align:left;">Snowflake improved this decision-making process in a few key ways. Now it:</p><ul><li><p class="paragraph" style="text-align:left;">better estimates how much data each filter will remove</p></li><li><p class="paragraph" style="text-align:left;">chooses a smarter order to join tables</p></li><li><p class="paragraph" style="text-align:left;">decides when to move data and when to process it locally</p></li></ul><p class="paragraph" style="text-align:left;">For example, if one table is huge and the other is small, it can pick the most efficient way to join them instead of blindly following the order in which the query was written.</p><p class="paragraph" style="text-align:left;">You don’t change your query; Snowflake just runs it in a better way.</p><h3 class="heading" style="text-align:left;" id="4-improvements-in-query-execution">4. Improvements in Query Execution</h3><p class="paragraph" style="text-align:left;">Once the plan is ready, the query actually runs. Snowflake improved how this execution happens. 
Instead of processing everything at the end, it now tries to <b>reduce data as early as possible</b>.</p><p class="paragraph" style="text-align:left;">For example:</p><ul><li><p class="paragraph" style="text-align:left;">applying filters early</p></li><li><p class="paragraph" style="text-align:left;">doing aggregations earlier</p></li><li><p class="paragraph" style="text-align:left;">reducing intermediate data</p></li></ul><p class="paragraph" style="text-align:left;">If you shrink the data early, everything that follows becomes faster. It’s like removing unnecessary items from a list before processing it: less work overall.</p><h3 class="heading" style="text-align:left;" id="5-better-memory-management">5. Better Memory Management</h3><p class="paragraph" style="text-align:left;">Memory is one of the most critical resources during query execution. If a query runs out of memory, it spills data to disk, which is significantly slower.</p><p class="paragraph" style="text-align:left;">Snowflake improved how it manages memory, especially for complex queries involving multiple joins. Now it:</p><ul><li><p class="paragraph" style="text-align:left;">avoids unnecessary memory usage</p></li><li><p class="paragraph" style="text-align:left;">reduces the chance of spilling to disk</p></li><li><p class="paragraph" style="text-align:left;">handles large joins more efficiently</p></li></ul><p class="paragraph" style="text-align:left;">The result is fewer spills to disk and more work happening in memory, which leads to faster and more stable query performance.</p><h3 class="heading" style="text-align:left;" id="6-smarter-query-pruning-with-top-k-">6. Smarter Query Pruning with Top-K Optimization</h3><p class="paragraph" style="text-align:left;">One of the more interesting improvements is something called Top-K pruning.</p><p class="paragraph" style="text-align:left;">Consider this query:</p><div class="codeblock"><pre><code>SELECT * FROM products
ORDER BY price DESC
LIMIT 10;</code></pre></div><p class="paragraph" style="text-align:left;">You only need the top 10 results.</p><p class="paragraph" style="text-align:left;">Earlier, the system might scan a large portion of the data before determining the top 10.</p><p class="paragraph" style="text-align:left;">Now Snowflake does something smarter: as soon as it knows the remaining data can’t affect the result, it stops scanning. For example, once it holds 10 candidate rows, it can skip any chunk of data whose highest price is lower than the lowest candidate.</p><p class="paragraph" style="text-align:left;">So instead of doing the full work, it does only what is required, which directly improves performance (by around 12.5% for such queries).</p><h3 class="heading" style="text-align:left;" id="7-query-acceleration-service-enhanc">7. Query Acceleration Service Enhancements</h3><p class="paragraph" style="text-align:left;">Snowflake also improved its Query Acceleration Service (QAS), which is designed to speed up heavy queries by offloading parts of the work to additional compute resources.</p><p class="paragraph" style="text-align:left;">Now, more types of queries can use this service (even INSERT queries). So on warehouses where QAS is enabled, more queries benefit from acceleration automatically, without any further setup.</p><h3 class="heading" style="text-align:left;" id="8-better-clustering-and-data-prunin">8. Better Clustering and Data Pruning</h3><p class="paragraph" style="text-align:left;">Finally, Snowflake improved how data is organized internally through clustering.</p><p class="paragraph" style="text-align:left;">Better clustering means related data is stored closer together, which allows the system to skip irrelevant data during query execution, a process known as pruning.</p><p class="paragraph" style="text-align:left;">When pruning is more effective, queries scan less data. Less scanning directly translates to faster queries and lower compute costs. 
Snowflake estimates this can reduce costs by around 10% for certain workloads.</p><p class="paragraph" style="text-align:left;"><b>Putting It All Together</b></p><p class="paragraph" style="text-align:left;">If you look at everything together, Snowflake improved performance by focusing on a few simple ideas:</p><ul><li><p class="paragraph" style="text-align:left;">Get data in faster</p></li><li><p class="paragraph" style="text-align:left;">Move less data between machines</p></li><li><p class="paragraph" style="text-align:left;">Make smarter decisions before execution</p></li><li><p class="paragraph" style="text-align:left;">Reduce data as early as possible</p></li><li><p class="paragraph" style="text-align:left;">Avoid doing unnecessary work</p></li><li><p class="paragraph" style="text-align:left;">Use memory and compute more efficiently</p></li></ul><p class="paragraph" style="text-align:left;">None of these changes requires you to rewrite queries, tune configs, or upgrade anything.</p><p class="paragraph" style="text-align:left;">And that’s the most important part. Snowflake improved performance not by adding complexity for users, but by <b>making the system smarter internally</b>.</p><hr class="content_break"><hr class="content_break"><h2 class="heading" style="text-align:left;" id="takeaways">Takeaways</h2><p class="paragraph" style="text-align:left;">This isn’t just about Snowflake getting faster; it’s about how good systems are designed.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/2f6d77d1-b744-4a1a-819a-bf4b88efb1dc/image.png?t=1777226566"/></div><p class="paragraph" style="text-align:left;">First, <b>continuous improvement beats big releases</b>. Instead of shipping rare, heavy updates, Snowflake improves performance in small, steady steps. 
Over time, these changes compound into significant gains.</p><p class="paragraph" style="text-align:left;">Second, <b>optimize the system, not the user</b>. Great systems don’t ask users to tune queries or configs. They improve internally so things just work better without extra effort.</p><p class="paragraph" style="text-align:left;">Third, <b>measure real workloads, not benchmarks</b>. Snowflake tracks performance using actual customer queries, not synthetic tests. That’s what makes the improvements meaningful in real-world scenarios.</p><p class="paragraph" style="text-align:left;">Fourth, <b>avoid unnecessary work</b>. With optimizations like Top-K pruning, the system stops processing once it has enough data. This simple principle can drastically improve performance.</p><p class="paragraph" style="text-align:left;">Finally, <b>smarter decisions beat more resources</b>. A better optimizer that chooses the right join order or execution plan can improve performance without adding more compute.</p><div class="blockquote"><blockquote class="blockquote__quote"><p class="paragraph" style="text-align:left;">Official blog from Snowflake: <a class="link" href="https://www.snowflake.com/en/blog/performance-index-27-percent-improvement/?utm_source=hw.glich.co&utm_medium=newsletter&utm_campaign=how-snowflake-improved-performance-by-27-without-users-noticing" target="_blank" rel="noopener noreferrer nofollow"><b>Snowflake Improves Performance by 27%, According to the Snowflake Performance Index</b></a></p><figcaption class="blockquote__byline"></figcaption></blockquote></div><p class="paragraph" style="text-align:left;">By now, you should have a clear idea of <b>how Snowflake improved performance by 27%</b>. In a nutshell, Snowflake improved performance by 27% through continuous backend optimizations in ingestion, query planning, and execution without requiring any user changes. 
By making the system smarter (not heavier), it delivers faster queries and lower costs automatically over time.</p><p class="paragraph" style="text-align:left;"><b>Congratulations! You&#39;ve just advanced another step in your tech journey. Keep progressing!</b></p><hr class="content_break"></div></div>
  ]]></content:encoded>
</item>

  </channel>
</rss>
