<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Quastor</title>
    <description>Get Summaries of Big Tech Engineering Blog Posts on Frontend, Backend, Machine Learning, Data Engineering and more!</description>
    
    <link>https://blog.quastor.org/</link>
    <atom:link href="https://rss.beehiiv.com/feeds/nczRb4PQ6t.xml" rel="self"/>
    
    <lastBuildDate>Mon, 2 Mar 2026 03:08:38 +0000</lastBuildDate>
    <pubDate>Mon, 24 Feb 2025 21:03:00 +0000</pubDate>
    <atom:published>2025-02-24T21:03:00Z</atom:published>
    <atom:updated>2026-03-02T03:08:38Z</atom:updated>
    
    <copyright>Copyright 2026, Quastor</copyright>
    
    <image>
      <url>https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/publication/logo/7a588cb1-1dd5-4201-9945-b1b2cde55baa/Q.png</url>
      <title>Quastor</title>
      <link>https://blog.quastor.org/</link>
    </image>
    
    <docs>https://www.rssboard.org/rss-specification</docs>
    <generator>beehiiv</generator>
    <language>en-us</language>
    <webMaster>support@beehiiv.com (Beehiiv Support)</webMaster>

      <item>
  <title>The Architecture of Grab&#39;s Data Lake</title>
  <description>We&#39;ll talk about data storage formats, merge on read, copy on write and more. Plus, a detailed guide to software architecture documentation, how Figma overhauled their performance testing framework and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/7fbfb779-bae0-4672-ac55-6296d9e41047/Screenshot_2024-05-29_at_1.31.21_PM.png" length="261310" type="image/png"/>
  <link>https://blog.quastor.org/p/the-architecture-of-grab-s-data-lake</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/the-architecture-of-grab-s-data-lake</guid>
  <pubDate>Mon, 24 Feb 2025 21:03:00 +0000</pubDate>
  <atom:published>2025-02-24T21:03:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>The Architecture of Grab&#39;s Data Lake</b></p><ul><li><p class="paragraph" style="text-align:left;">Introduction to Data Storage Formats</p></li><li><p class="paragraph" style="text-align:left;">Design Choices when picking a Data Storage Format</p></li><li><p class="paragraph" style="text-align:left;">High Throughput vs. Low Throughput Data at Grab</p></li><li><p class="paragraph" style="text-align:left;">Using Avro and <i>Merge on Read </i>for High Throughput Data</p></li><li><p class="paragraph" style="text-align:left;">Using Parquet and <i>Copy on Write</i> for Low Throughput Data</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">A Detailed Guide to Software Architecture Documentation</p></li><li><p class="paragraph" style="text-align:left;">The really important job interview questions engineers should ask (but don’t)</p></li><li><p class="paragraph" style="text-align:left;">How Figma overhauled their Performance Testing Framework</p></li><li><p class="paragraph" style="text-align:left;">Is this the simplest sorting algorithm ever?</p></li><li><p class="paragraph" style="text-align:left;">How SQLite got 10x faster for Analytical Queries</p></li></ul></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://api.carsxe.com/?utm_source=quastor&utm_medium=email&utm_campaign=quastor-feb-24" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcJCFACtFpAOgdB_JFqnMVM5uZZknd2HVsutnMWq7V8jPeoOzKRwbZ8tYSFbKSkr1HJu6-Rb9l5pC1iKCYxfnLXE8AzxYPH4Xp7QG2D0zisGbUMwaDvGDPh3PnUKj7CvO55DJDDdw?key=0XiR9bi6yeWSXCHLPsMUXhRu"/></a></div><h1 class="heading" 
style="text-align:left;" id="our-api-could-probably-run-a-small-"><a class="link" href="https://api.carsxe.com/?utm_source=quastor&utm_medium=email&utm_campaign=quastor-feb-24" target="_blank" rel="noopener noreferrer nofollow"><b>Our API Could Probably Run a Small Country.</b></a></h1><p class="paragraph" style="text-align:left;">Your API can barely handle <b>10 requests</b> without throwing a tantrum, and you wanna scale? </p><p class="paragraph" style="text-align:left;"><a class="link" href="https://api.carsxe.com/?utm_source=quastor&utm_medium=email&utm_campaign=quastor-feb-24" target="_blank" rel="noopener noreferrer nofollow">CarsXE</a><b> is so powerful, it could run a startup, a Fortune 500, or a small dictatorship</b>—and still have bandwidth left for your grandma’s car blog. </p><p class="paragraph" style="text-align:left;"><b>Instant license plate decoding, real-time vehicle specs, and enterprise-level security.</b> Your move.</p><p class="paragraph" style="text-align:left;">Or keep using that weak API that breaks when someone sneezes.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://api.carsxe.com/?utm_source=quastor&utm_medium=email&utm_campaign=quastor-feb-24"><span class="button__text" style=""> Try CarsXE for Free Now! </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="the-architecture-of-grabs-data-lake"><span style="color:rgb(14, 16, 26);"><b>The Architecture of Grab&#39;s Data Lake</b></span></h1><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">Grab is one of the largest technology companies in Southeast Asia, with over 35 million monthly users. 
They run a &quot;super-app&quot; that offers ride-sharing, food delivery, banking, and communication all within a single app.</span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">As you might guess, operating all these services generates a </span><span style="color:rgb(14, 16, 26);"><i>lot</i></span><span style="color:rgb(14, 16, 26);"> of data. Grab&#39;s data analysts need to comb through this data for insights that can help the company improve its operations.</span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">The data is primarily stored in a data lake. Some of the data is high-throughput and gets frequent updates (</span><span style="color:rgb(14, 16, 26);"><i>multiple updates every second/minute</i></span><span style="color:rgb(14, 16, 26);">). Other data is low-throughput and is rarely updated (</span><span style="color:rgb(14, 16, 26);"><i>updated daily/weekly</i></span><span style="color:rgb(14, 16, 26);">).</span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">Grab needs to store this data efficiently and also allow data analysts to run ad-hoc queries on it without it being too costly.</span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">One crucial design choice is picking the right</span><span style="color:rgb(14, 16, 26);"><i> storage format </i></span><span style="color:rgb(14, 16, 26);">for the data on the data lake. 
Choosing the wrong format can make data storage </span><span style="color:rgb(14, 16, 26);"><i>significantly</i></span><span style="color:rgb(14, 16, 26);"> more expensive and make gaining insights from the data more difficult.</span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">We&#39;ll first explore data storage formats, discussing the tradeoffs involved and some commonly used technologies.</span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">Then, we&#39;ll talk about what Grab chose and why. You can read the full article by Grab </span><span style="color:rgb(74, 110, 224);"><a class="link" href="https://engineering.grab.com/enabling-near-realtime-data-analytics?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake" target="_blank" rel="noopener noreferrer nofollow" style="color: rgb(74, 110, 224)">here</a></span><span style="color:rgb(14, 16, 26);">.</span></p><h2 class="heading" style="text-align:left;" id="introduction-to-data-storage-format"><span style="color:rgb(14, 16, 26);"><b>Introduction to Data Storage Formats</b></span></h2><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">Let&#39;s say you have a bunch of sensor data from a weather satellite. If you need to store it on S3, how would you do it?</span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">Do you just upload the CSV? What if the file is 50 gigabytes and has many repeat values? 
Would you pick a format that compresses the data?</span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">The format you choose for encoding your data is crucial.</span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">Here are some of the tradeoffs you might consider:</span></p><ul><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Human Readability</b></span><span style="color:rgb(14, 16, 26);"> - It&#39;s pretty easy to read a CSV or JSON file, but these formats aren&#39;t very efficient. Binary formats like Protobuf are much smaller but not human-readable.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Row vs. Column-oriented</b></span><span style="color:rgb(14, 16, 26);"> - In a row-oriented format, you put data in the same row next to each other on disk. In a column-oriented format, you put data in the same </span><span style="color:rgb(14, 16, 26);"><i>column</i></span><span style="color:rgb(14, 16, 26);"> next to each other on disk. This choice has </span><span style="color:rgb(14, 16, 26);"><i>a ton</i></span><span style="color:rgb(14, 16, 26);"> of effects on read/write performance, compression efficiency and more. We did a deep dive on Row vs. 
Column-oriented databases that you can read </span><span style="color:rgb(74, 110, 224);"><a class="link" href="https://blog.quastor.org/p/row-vs-column-oriented-databases-plus-managing-load-robinhood?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-zomato-built-a-petabyte-scale-logging-system" target="_blank" rel="noopener noreferrer nofollow" style="color: rgb(74, 110, 224)">here</a></span><span style="color:rgb(14, 16, 26);">.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Schema Evolution</b></span><span style="color:rgb(14, 16, 26);"> - How easy is it to change the data schema over time without breaking existing data? Some formats support adding/removing fields.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Compression</b></span><span style="color:rgb(14, 16, 26);"> - How efficiently can you compress the data? Compression can save storage costs and reduce latencies.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Split-ability</b></span><span style="color:rgb(14, 16, 26);"> - How easy is it to divide the file format into smaller chunks? Split-ability is important if you need to do parallel processing on the data.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Ecosystem Compatibility</b></span><span style="color:rgb(14, 16, 26);"> - Is the format commonly used? Does it have good support in tools like Spark, Redshift, Snowflake, etc.?</span></p></li></ul><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">Some common formats that you&#39;ll frequently see are:</span></p><ul><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>JSON</b></span><span style="color:rgb(14, 16, 26);"> - this is a text-based format so it&#39;s human-readable. 
Almost every programming language will have support for JSON. I&#39;m assuming you&#39;ve already used JSON before but you can read more </span><span style="color:rgb(74, 110, 224);"><a class="link" href="https://en.wikipedia.org/wiki/JSON?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake" target="_blank" rel="noopener noreferrer nofollow" style="color: rgb(74, 110, 224)">here</a></span><span style="color:rgb(14, 16, 26);"> if you haven&#39;t.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>CSV</b></span><span style="color:rgb(14, 16, 26);"> - a text format for storing data in a table-like way. Each line corresponds to a row and every value in the line is separated by a comma. CSV is also human-readable but it&#39;s not an efficient way to store data.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Avro</b></span><span style="color:rgb(14, 16, 26);"> - a binary data serialization framework developed within the Hadoop ecosystem. Data on disk is row-oriented and compressed. Avro supports robust schema evolution. Learn more </span><span style="color:rgb(74, 110, 224);"><a class="link" href="https://en.wikipedia.org/wiki/Apache_Avro?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake" target="_blank" rel="noopener noreferrer nofollow" style="color: rgb(74, 110, 224)">here</a></span><span style="color:rgb(14, 16, 26);">.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Parquet</b></span><span style="color:rgb(14, 16, 26);"> - developed in 2013 as a joint effort between Twitter and Cloudera. Parquet is column-oriented and provides efficient compression on disk. It supports schema evolution so you can add/remove columns without breaking compatibility. 
Read more </span><span style="color:rgb(74, 110, 224);"><a class="link" href="https://en.wikipedia.org/wiki/Apache_Parquet?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake" target="_blank" rel="noopener noreferrer nofollow" style="color: rgb(74, 110, 224)">here</a></span><span style="color:rgb(14, 16, 26);">.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>ORC</b></span><span style="color:rgb(14, 16, 26);"> - created in 2013 at Facebook. It&#39;s optimized for large streaming reads and efficient data compression. </span><span style="color:rgb(74, 110, 224);"><a class="link" href="https://en.wikipedia.org/wiki/Apache_ORC?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake" target="_blank" rel="noopener noreferrer nofollow" style="color: rgb(74, 110, 224)">Read more</a></span><span style="color:rgb(14, 16, 26);">.</span></p></li></ul><h2 class="heading" style="text-align:left;" id="data-storage-formats-at-grab"><span style="color:rgb(14, 16, 26);"><b>Data Storage Formats at Grab</b></span></h2><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">Grab looked at the characteristics of their data sources and used those to determine their storage formats.</span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">One characteristic they used was throughput.</span></p><ul><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>High Throughput</b></span><span style="color:rgb(14, 16, 26);"> data is updated/changed frequently (</span><span style="color:rgb(14, 16, 26);"><i>several times a second</i></span><span style="color:rgb(14, 16, 26);">). 
An example is the stream of booking events from customers who book a ride share.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Low Throughput</b></span><span style="color:rgb(14, 16, 26);"> data could be transaction events generated from a nightly batch process.</span></p></li></ul><h3 class="heading" style="text-align:left;" id="high-throughput-data"><span style="color:rgb(14, 16, 26);"><b>High Throughput Data</b></span></h3><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">For High Throughput data, Grab uses Apache Avro with a strategy called </span><span style="color:rgb(74, 110, 224);"><i><a class="link" href="https://hudi.apache.org/docs/concepts/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake#merge-on-read-table" target="_blank" rel="noopener noreferrer nofollow" style="color: rgb(74, 110, 224)">Merge on Read (MOR)</a></i></span><span style="color:rgb(14, 16, 26);"><b>.</b></span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">Here are the main operations with Merge on Read:</span></p><ul><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Write Operations</b></span><span style="color:rgb(14, 16, 26);"> - When data is written, it&#39;s appended to the end of a log file. This is </span><span style="color:rgb(14, 16, 26);"><i>much</i></span><span style="color:rgb(14, 16, 26);"> more efficient than merging it into the current data and reduces the latency of writes.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Read Operations</b></span><span style="color:rgb(14, 16, 26);"> - When you need to read data, the base file is combined with the updates in the log file to provide the latest view. 
This can make reads more costly as you have to merge the updated data.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Periodic Compaction</b></span><span style="color:rgb(14, 16, 26);"> - To prevent reads from becoming </span><span style="color:rgb(14, 16, 26);"><i>too</i></span><span style="color:rgb(14, 16, 26);"> costly, updates in the log files are periodically merged with the base files. This limits the number of past updates you must merge during a read.</span></p></li></ul><h3 class="heading" style="text-align:left;" id="low-throughput-data"><span style="color:rgb(14, 16, 26);"><b>Low Throughput Data</b></span></h3><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">For low throughput data, Grab uses Parquet with </span><span style="color:rgb(74, 110, 224);"><i><a class="link" href="https://hudi.apache.org/docs/concepts/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake#copy-on-write-table" target="_blank" rel="noopener noreferrer nofollow" style="color: rgb(74, 110, 224)">Copy on Write (CoW)</a></i></span><span style="color:rgb(14, 16, 26);"><i>.</i></span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">Here are the main operations for Copy on Write:</span></p><ul><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Write Operations</b></span><span style="color:rgb(14, 16, 26);"> - Whenever there&#39;s a write, you create a new version of the file that includes the latest change. You can also keep the previous version for consistency and rollback purposes. This helps prevent data corruption, inconsistent reads, and more.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Read Operations</b></span><span style="color:rgb(14, 16, 26);"> - You read the latest versioned data file. 
Reads are faster than </span><span style="color:rgb(14, 16, 26);"><i>Merge on Read </i></span><span style="color:rgb(14, 16, 26);">since there is no merge process.</span></p></li></ul><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://www.workingsoftware.dev/software-architecture-documentation-the-ultimate-guide/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake" target="_blank"><div class="embed__content"><p class="embed__title"> A Detailed Guide to Software Architecture Documentation </p><p class="embed__description"> This is a fantastic guide to documenting things in your codebase that aren’t code.<br><br>You should be documenting things like<br>- non-functional requirements<br>- architectural decisions and their arguments<br>- data flow<br>- maintenance and update procedures <br>and much more.<br><br>This is a fantastic guide on how to document all these other areas of your system. 
</p><p class="embed__link"> www.workingsoftware.dev/software-architecture-documentation-the-ultimate-guide </p></div></a></div><div class="embed"><a class="embed__url" href="https://posthog.com/founders/what-to-ask-in-interviews?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake" target="_blank"><div class="embed__content"><p class="embed__title"> The really important job interview questions engineers should ask (but don’t) </p><p class="embed__description"> This is a good list of questions you should ask when you’re interviewing for a job (especially if it’s a startup).<br><br>Questions include<br>- Does the company have product-market fit?<br>- How much runway does the company have? What’s the burn?<br>- What’s in store for the future? What is the company strategy?<br>- Who decides what to build?<br><br>and more. </p><p class="embed__link"> posthog.com/founders/what-to-ask-in-interviews </p></div></a></div><div class="embed"><a class="embed__url" href="https://www.figma.com/blog/keeping-figma-fast/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/99459024-bca2-4046-afc4-e34572574d94/4cad57473a76fdc6d0254d48e1c992a1079943f1-4032x3024.jpg?t=1717428698"/><div class="embed__content"><p class="embed__title"> How Figma overhauled their Performance Testing Framework </p><p class="embed__description"> Initially, Figma relied on a single MacBook for their in-house performance testing system. 
As you might imagine, they eventually had to find a more scalable solution.<br><br>This is an interesting read from the Figma engineering blog on how the company built a new system to spot performance regressions in the app.<br><br>They ultimately shipped two systems: a cloud-based system that covered the majority of tests and a hardware system for highly targeted tests. </p><p class="embed__link"> www.figma.com/blog/keeping-figma-fast </p></div></a></div><div class="embed"><a class="embed__url" href="https://arxiv.org/abs/2110.01111?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake" target="_blank"><div class="embed__content"><p class="embed__title"> Is this the simplest (and most surprising) sorting algorithm ever? </p><p class="embed__description"> This is a paper on the “ICan’tBelieveItCanSort” sorting algorithm. It’s an incredibly simple algorithm that appears incorrect at first glance, but still successfully sorts an array in non-decreasing order. </p><p class="embed__link"> arxiv.org/abs/2110.01111 </p></div></a></div><div class="embed"><a class="embed__url" href="https://avi.im/blag/2024/sqlite-past-present-future/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake" target="_blank"><div class="embed__content"><p class="embed__title"> How SQLite got 10x faster for Analytical Queries </p><p class="embed__description"> This is a really interesting read on how researchers used the Bloom Filter data structure to make SQLite 10x faster for analytical queries.<br><br>The article talks about how they were able to use Bloom Filters to cache and reduce expensive B-tree probes. 
</p><p class="embed__link"> avi.im/blag/2024/sqlite-past-present-future </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=23335d4b-3321-4131-bf3d-52396b3d44de&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
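The Merge on Read and Copy on Write mechanics described earlier can be sketched in a few lines of code. The following is a minimal, hypothetical Python model (not Grab's or Apache Hudi's actual implementation; the class and method names are invented for illustration) that stands in for base files and update logs with in-memory dicts:

```python
# Hypothetical sketch of the two Hudi-style table strategies discussed above.
# A "record" is a key -> value pair; dicts stand in for files on the data lake.

class MergeOnReadTable:
    """Writes append to a log; reads merge base + log; compaction folds the log in."""
    def __init__(self, base):
        self.base = dict(base)
        self.log = []                   # append-only update log

    def write(self, key, value):
        self.log.append((key, value))   # O(1) append: no rewrite of the base file

    def read(self):
        view = dict(self.base)
        for key, value in self.log:     # reads pay the merge cost
            view[key] = value
        return view

    def compact(self):
        self.base = self.read()         # periodic compaction bounds future read cost
        self.log = []


class CopyOnWriteTable:
    """Every write produces a new full version; reads just take the latest one."""
    def __init__(self, base):
        self.versions = [dict(base)]    # older versions kept for rollback

    def write(self, key, value):
        new = dict(self.versions[-1])   # copy the whole "file"...
        new[key] = value                # ...with the one change applied
        self.versions.append(new)

    def read(self):
        return self.versions[-1]        # no merge step, so reads are cheap
```

The sketch makes the tradeoff concrete: MergeOnReadTable writes are cheap appends but reads merge the log until compact() runs, while CopyOnWriteTable reads are trivial and every write rewrites the whole file, matching high-throughput vs. low-throughput data respectively.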
  ]]></content:encoded>
</item>

      <item>
  <title>The Architecture of Grab&#39;s Auth System</title>
  <description>We&#39;ll talk about PassKeys, their benefits and how they work. Plus, a dive into the internals of Apache Kafka, a compilation of writing advice from the internet&#39;s best writers and more.</description>
  <link>https://blog.quastor.org/p/the-architecture-of-grab-s-auth-system</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/the-architecture-of-grab-s-auth-system</guid>
  <pubDate>Wed, 19 Feb 2025 16:20:00 +0000</pubDate>
  <atom:published>2025-02-19T16:20:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How Grab uses Passkeys for a Passwordless Authentication System</b></p><ul><li><p class="paragraph" style="text-align:left;"> Introduction to PassKeys and their History</p></li><li><p class="paragraph" style="text-align:left;"> How PassKeys work and their Benefits</p></li><li><p class="paragraph" style="text-align:left;">The Architecture of Grab’s Passkey System</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Exploring the internals of Apache Kafka</p></li><li><p class="paragraph" style="text-align:left;">Interview with the Inventors of Deep Research</p></li><li><p class="paragraph" style="text-align:left;">How My Washing Machine Refreshed My Thinking on Software Effort Estimation</p></li><li><p class="paragraph" style="text-align:left;">NASA’s 10 Rules for Software Development</p></li><li><p class="paragraph" style="text-align:left;">A Compilation of Writing Advice from the Internet’s Best Writers</p></li></ul></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://dub.link/quas-feb3-doom?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdbK6y21UAeLtFOl5PoCbbe22VuOpEt9QYy87S8KK7CzyIW97BwgTsCIoHDThnbkzd0aBDbjRndZfWwpESGfEdCO6AurqeZZ3tZrmsbeY68FWtWBc_kKTwKC8DiWxPrC1v5bblLKQ?key=eXEz32zof7Iu-jmoXAT47A"/></a></div><h1 class="heading" style="text-align:left;" id="why-deadlines-are-slowing-your-team"><b><a class="link" 
href="https://dub.link/quas-feb3-doom?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">Why Deadlines Are Slowing Your Team Down</a></b></h1><p class="paragraph" style="text-align:left;">If you’ve ever been caught in a death spiral of shifting deadlines and mounting tech debt, you’re not alone. Developers often face unrealistic timelines that kill team morale and momentum.</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://dub.link/quas-feb3?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">Product for Engineers</a> recently published a terrific blog post delving into how they avoid falling into this trap.</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Recognize the Vicious Cycle</b> - Understand the cause of tight deadlines. Is it over-promising to customers? Poor or unrealistic planning? Work to eliminate the source. </p></li><li><p class="paragraph" style="text-align:left;"><b>Trust Small Teams</b> - Teams of six or fewer can ship faster than a team twice their size. Small teams mean less coordination, fewer meetings, and more time coding.</p></li><li><p class="paragraph" style="text-align:left;"><b>Ditch Arbitrary Deadlines</b> - Focus on <i>real user </i>feedback instead of guesses on timelines from sales/leadership teams.</p></li><li><p class="paragraph" style="text-align:left;"><b>Hire for Ownership</b> - The best engineers excel when they can drive product decisions. 
Having a sense of ownership about the product helps drive team morale and momentum.</p></li></ol><p class="paragraph" style="text-align:left;">For more on how to build more effective engineering teams, check out the <a class="link" href="https://dub.link/quas-feb3?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">Product for Engineers</a> newsletter below.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://dub.link/quas-feb3?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system"><span class="button__text" style=""> Check out Product for Engineers </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>How Grab uses Passkeys for a Passwordless Authentication System</b></h1><p class="paragraph" style="text-align:left;">Grab is one of the largest tech companies in Southeast Asia. They’re a “super app” and provide services like ride-hailing, food/grocery delivery, mobile payments and more. They operate in over 700 cities and have millions of daily users.</p><p class="paragraph" style="text-align:left;">The company offers digital payment solutions (GrabPay Wallet) and lending products (GrabFinance) so secure authentication is a critical component of the app. 
Phishing attacks, credential stuffing and other common attacks can cost the company hundreds of millions of dollars.</p><p class="paragraph" style="text-align:left;">Recently, Grab added support for Passkeys, a passwordless authentication method based on the FIDO standard.</p><p class="paragraph" style="text-align:left;">The Grab engineering team wrote a fantastic <a class="link" href="https://engineering.grab.com/embracing-passwordless-authentication-with-passkey?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">blog post</a> delving into why they implemented Passkeys, how they work, and the challenges they faced. We’ll be summarizing the blog post and adding some additional context on Passkeys.</p><h2 class="heading" style="text-align:left;" id="introduction-to-passkeys"><b>Introduction to Passkeys</b></h2><p class="paragraph" style="text-align:left;">Passkeys were first proposed in 2009 by engineers at Validity Sensors, a company that developed fingerprint sensors. They were looking for a way to use biometrics for authentication without requiring users to create and remember passwords.</p><p class="paragraph" style="text-align:left;">The idea gained traction in 2012 when PayPal joined forces with Validity Sensors to create the Fast IDentity Online (FIDO) Alliance. The goal of the alliance was to develop open standards for passwordless authentication that could be used across different devices and platforms.</p><p class="paragraph" style="text-align:left;">In 2018, they released FIDO2, which introduced the <a class="link" href="https://en.wikipedia.org/wiki/WebAuthn?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">WebAuthn</a> standard. 
With this, users could authenticate with biometrics (Face ID or fingerprint) or a hardware security key instead of a password.</p><p class="paragraph" style="text-align:left;">Apple announced broad support for passkeys in iOS 16 (<i>in 2022</i>). Since then, companies like TikTok, Adobe, Amazon, Google and more have added support for passkeys.</p><h2 class="heading" style="text-align:left;" id="how-passkeys-work"><b>How Passkeys Work</b></h2><p class="paragraph" style="text-align:left;">Passkeys are based on public-key cryptography (<i>asymmetric cryptography</i>) where you use a pair of mathematically related keys: a public key and a private key.</p><p class="paragraph" style="text-align:left;">The public and private keys are used to create a digital signature that can verify a user’s identity. The private key is kept secret on the user’s device while the public key is stored on the server (<i>it can be freely distributed</i>).</p><p class="paragraph" style="text-align:left;"><b>Passkey Flow</b></p><p class="paragraph" style="text-align:left;">Here are the steps that happen with passkeys:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>User Registration</b> - The user creates a passkey for a website with his iPhone. The authenticator on his phone will generate a new, unique public/private key pair using an algorithm like ECDSA. The public key will be sent to the website while the private key will be securely stored in iCloud Keychain.</p></li><li><p class="paragraph" style="text-align:left;"><b>Passkey Authentication</b><b> </b>- Later, when the user wants to log into the website, he’ll select the “login with Passkey” option. The website will send him a challenge (<i>a random piece of data</i>) to verify his identity. The user will use iCloud Keychain on his iPhone (<i>after authenticating himself with Face ID or Touch ID</i>) to create a digital signature of the challenge using his private key. 
This is done with algorithms like DSA or ECDSA.</p></li><li><p class="paragraph" style="text-align:left;"><b>Server Verification</b><b> </b>- the digitally signed challenge is sent back to the website server. The server uses the public key to verify the signature and grant access to the user.</p></li></ol><h2 class="heading" style="text-align:left;" id="benefits-of-passkeys-over-passwords"><b>Benefits of Passkeys over Passwords</b></h2><p class="paragraph" style="text-align:left;">Passkeys offer several advantages over traditional passwords:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Convenience and No Password Reuse</b>: With passwords, users often reuse the same password across multiple sites, which creates a security risk if one of those sites is breached. Passkeys eliminate the need to remember passwords, so users don&#39;t have to reuse the same password across different services.</p></li><li><p class="paragraph" style="text-align:left;"><b>Companies don’t store Credentials</b>: The public key doesn’t need to be kept secret, so the company never has to store a secret credential the way it does with passwords. Even if the server’s database is breached, the stolen public keys can’t be used to impersonate users.</p></li><li><p class="paragraph" style="text-align:left;"><b>Built-In MFA</b>: Passkeys provide multi-factor authentication by combining something the user has (their device) with something the user is (biometrics) or something the user knows (a PIN).</p></li><li><p class="paragraph" style="text-align:left;"><b>Eliminate Phishing Attacks</b>: Phishing attacks trick users into entering their credentials on a fake website. 
Passkeys are resistant to phishing because the authentication is tied to the user&#39;s device and the specific website or app.</p></li></ul><p class="paragraph" style="text-align:left;"><a class="link" href="https://en.wikipedia.org/wiki/WebAuthn?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">WebAuthn</a> is the web standard that defines the communication between your device, your web browser and the website’s servers. It provides a set of JavaScript APIs that websites can use to interact with a user’s authenticator and specifies the steps for creating/authenticating passkeys.</p><h2 class="heading" style="text-align:left;" id="how-grab-implemented-passkey-authen"><b>How Grab Implemented Passkey Authentication</b></h2><p class="paragraph" style="text-align:left;">Grab&#39;s implementation of passkey authentication follows the standard WebAuthn flow. Here’s the process, step by step:</p><div class="image"><img alt="" class="image__image" style="" src="https://engineering.grab.com/img/embracing-passwordless-authentication-with-passkey/sequence-diagram-passkey-registration.png"/></div><h3 class="heading" style="text-align:left;" id="creating-a-passkey"><b>Creating a Passkey</b></h3><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>User Initiates Passkey Creation</b> - The user selects the option to “Enable Passkey” in the Grab app.</p></li></ol><ol start="2"><li><p class="paragraph" style="text-align:left;"><b>Frontend Requests User Data and Challenge</b>: Grab&#39;s frontend sends a request to Grab&#39;s backend server. 
This request asks for specific user details and a unique, cryptographically secure random number called a &quot;challenge.&quot; The challenge helps prevent <a class="link" href="https://en.wikipedia.org/wiki/Replay_attack?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">replay attacks</a>. The backend will (<i>hopefully</i>) receive the request and reply with the data necessary to initiate the passkey creation.</p></li></ol><ol start="3"><li><p class="paragraph" style="text-align:left;"><b>Frontend Invokes WebAuthn API for Passkey Creation</b>: After receiving the user data and challenge, the frontend will call the WebAuthn API to start the passkey generation process.</p></li></ol><ol start="4"><li><p class="paragraph" style="text-align:left;"><b>Authenticator Creates the Passkey</b>: The authenticator app on the user’s device will receive the request from the WebAuthn API. It will ask the user for consent (<i>with identification through fingerprint scan, facial recognition or a PIN</i>) and then generate a new public-private key pair.</p></li></ol><ol start="5"><li><p class="paragraph" style="text-align:left;"><b>Public Key and Data Sent to Frontend</b>: After creating the key pair, the authenticator will return a PublicKeyCredential object to the frontend. This contains the public key, the credential ID and other relevant data. It <i>does not </i>include the private key. That is kept secret on the user’s device.</p></li></ol><ol start="6"><li><p class="paragraph" style="text-align:left;"><b>Frontend Sends Public Key to Backend</b>: The frontend takes the PublicKeyCredential object and transmits the public key and associated data to Grab&#39;s backend server. Grab’s backend will store this info with the user’s account in its database. The public key will be used later to verify the user’s identity when they try to log in. 
 </p></li></ol><div class="image"><img alt="" class="image__image" style="" src="https://engineering.grab.com/img/embracing-passwordless-authentication-with-passkey/sequency-diagram-passkey-authentication.png"/></div><h3 class="heading" style="text-align:left;" id="authenticating-with-pass-key"><b>Authenticating with a Passkey</b></h3><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>User Initiates Sign-In with Passkey</b>: The user opens the Grab app and chooses the &quot;Sign in with Passkey&quot; option.</p></li></ol><ol start="2"><li><p class="paragraph" style="text-align:left;"><b>Frontend Requests a Challenge</b>: The Grab app&#39;s frontend sends a request to Grab&#39;s backend server, asking for a new, unique challenge. The backend responds with the challenge.</p></li></ol><ol start="3"><li><p class="paragraph" style="text-align:left;"><b>Frontend Invokes WebAuthn API for Authentication</b>: The frontend receives the challenge and calls the WebAuthn API. The authenticator app (iCloud Keychain, for example) will show a list of available passkeys that the user can choose from.</p></li></ol><ol start="4"><li><p class="paragraph" style="text-align:left;"><b>User Selects Passkey and Provides Consent</b>: The user selects the passkey for their Grab account and verifies using facial recognition, fingerprint or a passcode.</p></li></ol><ol start="5"><li><p class="paragraph" style="text-align:left;"><b>Authenticator Signs the Challenge</b>: The authenticator uses the private key associated with the selected passkey to create a digital signature of the challenge and other relevant data. This signature is created using algorithms like ECDSA or RSA. 
The signed data, along with the credential ID, is packaged into a <a class="link" href="https://developer.mozilla.org/en-US/docs/Web/API/PublicKeyCredential?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">PublicKeyCredential</a> object.</p></li></ol><ol start="6"><li><p class="paragraph" style="text-align:left;"><b>Frontend Sends Data to Backend</b> - The frontend receives the PublicKeyCredential object from the authenticator. It forwards this to Grab’s backend server.</p></li></ol><ol start="7"><li><p class="paragraph" style="text-align:left;"><b>Backend Verifies Signature and Logs User In</b>: Grab&#39;s backend receives the PublicKeyCredential object. It will look up the user’s public key from its database and use that to verify the digital signature. If it’s valid (<i>and the challenge matches the one it sent in step 2</i>) then the user can log into their account.</p></li></ol><hr class="content_break"><div class="image"><a class="image__link" href="https://dub.link/quas-feb3?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXecoi0vO47-nlgRj1gvFp2TsfUOapJr-0uVgdwor6II-006TbWNDon_J8cnY1TaOA5UNdAw5eErfF9aJzhRoZi_QHr10ljJOXS54clT3jqu3EfiwHSMOJJ4M670N7m13mGu0lGYjA?key=eXEz32zof7Iu-jmoXAT47A"/></a></div><h1 class="heading" style="text-align:left;" id="how-to-not-lose-a-billion-dollars-w"><b><a class="link" href="https://dub.link/quas-feb3-flag?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">How to NOT Lose a Billion Dollars with Feature Flags</a></b></h1><p class="paragraph" style="text-align:left;">Feature flags are a must for shipping new features quickly and safely. 
But getting them wrong can be a disaster. In 2012, the high-frequency trading firm Knight Capital famously lost $440 million in just 45 minutes due to a bad flag.<br><br><a class="link" href="https://dub.link/quas-feb3?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">Product for Engineers</a> wrote a fantastic <a class="link" href="https://dub.link/quas-feb3-flag?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">blog post</a> on how to avoid some of the biggest feature flag pitfalls:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Structuring flags safely</b> – Keep them out of your business logic to dodge hidden states and insane debugging sessions.</p></li><li><p class="paragraph" style="text-align:left;"><b>Killing zombie flags</b> – Old flags can be lethal tech debt, so set up a reliable removal plan.</p></li><li><p class="paragraph" style="text-align:left;"><b>Ensure graceful failure</b> – Always assume a feature flag can fail or return null. Bugs in the implementation of the feature flag shouldn&#39;t break your code. </p></li></ul><p class="paragraph" style="text-align:left;">For more engineering war stories and best practices, check out <a class="link" href="https://dub.link/quas-feb3?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">Product for Engineers</a>. 
It&#39;s free.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://dub.link/quas-feb3?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system"><span class="button__text" style=""> Check out Product for Engineers </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://www.cosive.com/blog/my-washing-machine-refreshed-my-thinking-on-software-effort-estimation?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank"><div class="embed__content"><p class="embed__title"> How My Washing Machine Refreshed My Thinking on Software Effort Estimation </p><p class="embed__description"> In your professional career, something you’ll constantly have to do is give software estimates to non-technical stakeholders.<br><br>This is a great article to send to these stakeholders when you’re trying to explain *why* giving software estimates can be difficult.<br><br>Many tasks may seem simple but timelines can frequently blow up as the team comes across different blockers. Non-technical stakeholders may not understand the technical reasoning behind the blockers, but explaining it through the analogy of setting up a washing machine can make it easier to comprehend. 
</p><p class="embed__link"> www.cosive.com/blog/my-washing-machine-refreshed-my-thinking-on-software-effort-estimation </p></div></a></div><div class="embed"><a class="embed__url" href="https://chadnauseam.com/advice/writing-advice?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank"><div class="embed__content"><p class="embed__title"> A Compilation of Writing Advice from the Internet’s Best Writers </p><p class="embed__description"> This is a fantastic compilation of advice from writers like Scott Alexander, Eugene Wei, Steven Pinker, Julian Shapiro and more.<br><br>Writing for a blog is fundamentally different from writing for a book/essay and you’ll need to adapt your word choice, flow, style, etc. for the medium. Read the full article to learn how you can do this. </p><p class="embed__link"> chadnauseam.com/advice/writing-advice </p></div></a></div><div class="embed"><a class="embed__url" href="https://www.latent.space/p/gdr?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank"><img class="embed__image embed__image--top" src="https://substackcdn.com/image/fetch/w_1200,h_600,c_fill,f_jpg,q_auto:good,fl_progressive:steep,g_auto/https%3A%2F%2Fsubstack-video.s3.amazonaws.com%2Fvideo_upload%2Fpost%2F157348543%2F213917e3-f26f-4126-b4b5-cf55b6eb0b24%2Ftranscoded-1739896873.png"/><div class="embed__content"><p class="embed__title"> Interview with The Inventors of Deep Research </p><p class="embed__description"> Google Deep Research is an agent that can generate a detailed research report for you on any topic. The creators recently went on the Latent Space podcast to discuss the engineering behind the agent. They talk about RAG, evaluation methods and the limitations they faced. 
 </p><p class="embed__link"> www.latent.space/p/gdr </p></div></a></div><div class="embed"><a class="embed__url" href="https://www.cs.otago.ac.nz/cosc345/resources/nasa-10-rules.htm?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank"><div class="embed__content"><p class="embed__title"> NASA’s 10 Rules for Software Development </p><p class="embed__description"> NASA published a famous list of rules they follow when writing code. These are meant for embedded software on *extremely* expensive spacecraft but they can also be useful when you’re writing web applications, compilers or any other piece of software.<br><br>This is a great article that takes a critical look at these rules and reviews how useful they are when writing application software and programming language processors. </p><p class="embed__link"> www.cs.otago.ac.nz/cosc345/resources/nasa-10-rules.htm </p></div></a></div><div class="embed"><a class="embed__url" href="https://cefboud.com/posts/exploring-kafka-internals/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank"><div class="embed__content"><p class="embed__title"> Exploring Apache Kafka Internals and Codebase </p><p class="embed__description"> This is a fantastic blog post by Moncef Abboud that dives into the Apache Kafka codebase, providing a behind-the-scenes look at its major components.<br><br>It’s a useful read if you work with Kafka on a daily basis but have never explored its internals. The post clarifies how Kafka’s networking, storage and overall architecture fit together. 
</p><p class="embed__link"> cefboud.com/posts/exploring-kafka-internals </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=c728942f-e054-43b3-8fdd-d341f2524e78&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How Pinterest Optimized Video Playback</title>
  <description>An introduction to Adaptive Bitrate Streaming and how Pinterest was able to reduce startup latency for videos. Plus, the architecture of open source applications and how Anthropic was able to improve RAG.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/057e3361-e34a-4cfa-b7ff-a58dcf203704/Screenshot_2024-09-20_at_2.46.01_PM.png" length="218267" type="image/png"/>
  <link>https://blog.quastor.org/p/how-pinterest-optimized-video-playback-02e9</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-pinterest-optimized-video-playback-02e9</guid>
  <pubDate>Fri, 14 Feb 2025 16:30:00 +0000</pubDate>
  <atom:published>2025-02-14T16:30:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How Pinterest Optimized Video Playback</b></p><ul><li><p class="paragraph" style="text-align:left;">Introduction to Adaptive Bitrate Streaming, HLS and DASH </p></li><li><p class="paragraph" style="text-align:left;"> Why Pinterest was experiencing high startup latency for videos</p></li><li><p class="paragraph" style="text-align:left;">Embedding the video manifest files in their metadata API and improving performance with caching</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">A Dive into Database Fundamentals</p></li><li><p class="paragraph" style="text-align:left;">Digital Signatures and how to avoid them</p></li><li><p class="paragraph" style="text-align:left;">The Architecture of Open Source Applications</p></li></ul></li></ul><p class="paragraph" style="text-align:left;"></p><hr class="content_break"><div class="image"><a class="image__link" href="https://www.nango.dev/blog/the-hidden-costs-of-building-product-integrations-in-house?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=02-14-2025" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/8f1eeb52-74a9-404b-9307-4a2bf44bdb05/678f631b7d0d2d6405c73a05_15._Hidden_costs_of_building_product_integrations__in-house-p-1600.png?t=1739487294"/></a></div><h1 class="heading" style="text-align:left;" id="the-hidden-costs-of-building-produc"><a class="link" href="https://www.nango.dev/blog/the-hidden-costs-of-building-product-integrations-in-house?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=02-14-2025" 
target="_blank" rel="noopener noreferrer nofollow"><b>The Hidden Costs of Building Product Integrations In-House</b></a></h1><p class="paragraph" style="text-align:left;">Building integrations in-house might seem like the obvious choice—full control, endlessly customizable, and no external dependencies. But what about the costs you don’t see upfront?</p><p class="paragraph" style="text-align:left;">From ongoing maintenance to unexpected edge cases, hidden costs of building every third party integration in-house can quickly add up and drain engineering resources.</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.nango.dev/blog/the-hidden-costs-of-building-product-integrations-in-house?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=02-14-2025" target="_blank" rel="noopener noreferrer nofollow">This article</a> from Nango spells out the gotchas you should watch out for before deciding to go 100% in-house, including:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Development costs</b> - Understand how engineering hours and project scope can balloon over time</p></li><li><p class="paragraph" style="text-align:left;"><b>Maintenance overhead</b> - See all the nitty gritty your team needs to take care of in order to have stable and scalable integrations in prod</p></li><li><p class="paragraph" style="text-align:left;"><b>Opportunity cost</b> - Consider how building integrations in-house can pull focus away from core product innovation</p></li></ul><div class="button" style="text-align:left;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.nango.dev/blog/the-hidden-costs-of-building-product-integrations-in-house?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=02-14-2025"><span class="button__text" style=""> Read the full article to uncover the true cost of building integrations in-house </span></a></div><p class="paragraph" 
style="text-align:left;"><i>sponsored</i></p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>How Pinterest Optimized Video Playback</b></h1><p class="paragraph" style="text-align:left;">Pinterest is a social media platform that helps you discover ideas and inspiration related to whatever you’re interested in (cooking recipes, home decor, clothing, etc)</p><p class="paragraph" style="text-align:left;">The platform was launched in 2010 and it’s grown to over 500 million monthly active users. Pinterest is now publicly traded and valued at more than $20 billion.</p><p class="paragraph" style="text-align:left;">Like every other social platform, video content is one of the most popular mediums on Pinterest. When you’re serving videos to your users, one of your highest priorities should be to minimize any buffering and startup delay. With the modern day attention span, even having your video buffer for a couple of seconds can result in a huge number of users leaving your app.</p><p class="paragraph" style="text-align:left;">Pinterest engineering published a great <a class="link" href="https://medium.com/pinterest-engineering/improving-abr-video-performance-at-pinterest-f0ea47a6d4fc?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-optimized-video-playback" target="_blank" rel="noopener noreferrer nofollow">blog post</a> on how they optimized video playback and reduced startup latency by 36%.</p><p class="paragraph" style="text-align:left;">We’ll give some context on how videos are streamed, what protocols are involved and what Pinterest did to optimize playback.</p><h2 class="heading" style="text-align:left;" id="introduction-to-adaptive-bitrate-st"><b>Introduction to Adaptive Bitrate Streaming</b></h2><p class="paragraph" style="text-align:left;">When you’re delivering video to users, one technique that’s used universally nowadays is Adaptive Bitrate Streaming.</p><p 
class="paragraph" style="text-align:left;">This is where you take the video and encode it at multiple bitrates and resolutions and store them all on your server. When a user wants to play the video, their phone will select the optimal rendition based on factors like network bandwidth and device characteristics to minimize any buffering.</p><p class="paragraph" style="text-align:left;">With Adaptive Bitrate Streaming, the player can also switch <i>dynamically</i> between different bitrates. If the internet connection weakens while they’re watching a video on their phone, ABR allows the player to automatically switch to a lower bitrate stream so playback can be smooth without any buffering interruptions.</p><p class="paragraph" style="text-align:left;">When the network improves, the player will automatically switch back to the higher bitrate stream to provide better video quality.</p><h2 class="heading" style="text-align:left;" id="basics-of-adaptive-bitrate-streamin"><b>Basics of Adaptive Bitrate Streaming</b></h2><p class="paragraph" style="text-align:left;">There are different protocols you can use for Adaptive Bitrate Streaming, but they share some common fundamentals.</p><ul><li><p class="paragraph" style="text-align:left;"><b>Chunking</b> - the video file is broken up into small chunks. Each chunk ranges from 2-10 seconds in length.</p></li><li><p class="paragraph" style="text-align:left;"><b>Multiple Renditions</b> - Each chunk is encoded at multiple bitrates and resolutions. 
 </p></li><li><p class="paragraph" style="text-align:left;"><b>Manifest File</b> - a manifest file contains metadata about the available renditions for every chunk, including their bitrates and resolutions.</p></li><li><p class="paragraph" style="text-align:left;"><b>Dynamic Selection</b><b> </b>- the user’s video player will use the manifest file to determine which chunk to download based on the current network conditions and device capabilities.</p></li></ul><p class="paragraph" style="text-align:left;">The most widely adopted Adaptive Bitrate protocols are HTTP Live Streaming (HLS) and Dynamic Adaptive Streaming over HTTP (DASH).</p><p class="paragraph" style="text-align:left;">You’ve probably guessed from the names that HLS and DASH are both based on HTTP.</p><h2 class="heading" style="text-align:left;" id="http-live-streaming-hls"><b>HTTP Live Streaming (HLS)</b></h2><p class="paragraph" style="text-align:left;">HLS was developed by Apple in 2009 and it’s one of the earliest and most widely adopted ABR protocols. The video stream is broken into small, HTTP-based downloads. It supports both live and on-demand streaming.</p><p class="paragraph" style="text-align:left;">It’s developed and maintained by Apple so it’s natively supported on iOS, macOS and Safari.</p><p class="paragraph" style="text-align:left;">HLS uses <a class="link" href="https://docs.aws.amazon.com/mediatailor/latest/ug/manifest-hls-example.html?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-optimized-video-playback" target="_blank" rel="noopener noreferrer nofollow">.m3u8 manifest</a> files to guide the player in selecting the most appropriate video chunks based on real-time network conditions. 
</p><h2 class="heading" style="text-align:left;" id="dynamic-adaptive-streaming-over-htt"><b>Dynamic Adaptive Streaming over HTTP (DASH)</b></h2><p class="paragraph" style="text-align:left;">DASH was created by a consortium of companies led by MPEG (Moving Picture Experts Group). The protocol was first published in 2012 and it currently powers platforms like YouTube and Netflix.</p><p class="paragraph" style="text-align:left;">DASH uses <a class="link" href="https://ottverse.com/structure-of-an-mpeg-dash-mpd/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-optimized-video-playback" target="_blank" rel="noopener noreferrer nofollow">.mpd manifest</a> files to provide metadata about the available renditions and chunk URLs.</p><h2 class="heading" style="text-align:left;" id="video-streaming-at-pinterest"><b>Video Streaming at Pinterest</b></h2><p class="paragraph" style="text-align:left;">At Pinterest, HLS and DASH are used for delivering videos on iOS and Android, respectively. </p><ul><li><p class="paragraph" style="text-align:left;"><b>HLS:</b> Utilized for video streaming on iOS devices through Apple’s AVPlayer, accounting for approximately 70% of video playback sessions on iOS apps.</p></li><li><p class="paragraph" style="text-align:left;"><b>DASH:</b> Employed for video streaming on Android devices using ExoPlayer, representing around 55% of video playback sessions on Android.</p></li></ul><p class="paragraph" style="text-align:left;">One of the key metrics Pinterest measures for video performance is <b>startup latency</b> - the time it takes for a video to begin playing after a user initiates playback.</p><p class="paragraph" style="text-align:left;">As we stated above, both HLS and DASH require a manifest file before you can initiate video playback. 
With HLS, you might have to download <i>additional</i> manifest files (<i>for the specific rendition</i>) after downloading the main one.</p><p class="paragraph" style="text-align:left;"> Only <i>after</i> you download the manifest file can the video player start downloading the first few chunks of the video. This is the primary contributor to users’ perceived latency.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/4c3b1701-f6e1-4fb3-b9b8-5884485aacc3/Screenshot_2024-09-20_at_5.00.18_AM.png?t=1726822821"/></div><p class="paragraph" style="text-align:left;">The Pinterest team decided to eliminate the latency from the round trips by embedding all the relevant manifest files in the original API response. When a user first requests metadata for a video (<i>thumbnail, title, etc.</i>), the API response to that request will also contain the manifest files of the video.</p><p class="paragraph" style="text-align:left;">During playback, the player can swiftly access the manifest information locally and immediately start downloading video chunks.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdn-Z1mJjqcBCJXOoxyehpko5l0jhuzoU_jmn64d4wi7asI8qPt2EcrrPnDHY4ObL-BbmQ0hEOsFJ43-9cWWEB3aYDs6BdzW7wbe0Zu7JfiQgMyYOgaXjusOizqlrXfN8kBFyH1vbj5L-DrRho0B5dYSxlL?key=R6LUnjdZjB11ZSxF6NRNGg"/></div><h2 class="heading" style="text-align:left;" id="reducing-api-response-time"><b>Reducing API Response Time</b></h2><p class="paragraph" style="text-align:left;">When Pinterest started including manifest files in the API responses, the primary issue they faced was increased latency for the API endpoint. The backend now had to retrieve manifest files before it could respond with video metadata.</p><p class="paragraph" style="text-align:left;">They were able to solve this issue with caching. 
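</p><p class="paragraph" style="text-align:left;">Conceptually, a metadata response with the manifests embedded might be shaped like the sketch below; the field names are invented for illustration and are not Pinterest’s actual schema:</p>

```python
# Hypothetical pin metadata response with manifests embedded inline,
# so the player skips the extra manifest round trips before playback.
pin_metadata_response = {
    "pin_id": "12345",
    "title": "Example video pin",
    "thumbnail_url": "https://example.com/thumb.jpg",
    "video": {
        # Serialized manifest contents shipped with the metadata itself.
        "hls_manifest": "#EXTM3U\n#EXT-X-STREAM-INF:BANDWIDTH=800000\n360p/playlist.m3u8\n",
        "dash_manifest": "<MPD>...</MPD>",
    },
}

# The client can hand the embedded manifest straight to the video player.
embedded_hls = pin_metadata_response["video"]["hls_manifest"]
```

<p class="paragraph" style="text-align:left;">During playback, the player parses these embedded bytes locally instead of fetching the manifest over the network. 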
They added a <a class="link" href="https://medium.com/pinterest-engineering/improving-distributed-caching-performance-and-efficiency-at-pinterest-92484b5fe39b?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-optimized-video-playback" target="_blank" rel="noopener noreferrer nofollow">MemCache</a> layer into the manifest serving process to cache the most popular video manifest files.</p><p class="paragraph" style="text-align:left;">Here’s the new process for retrieving manifest files.</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>API Request</b> - a client requests Pin metadata</p></li><li><p class="paragraph" style="text-align:left;"><b>Manifest Embedding</b> - the backend retrieves manifest files from S3, serializes them and embeds the bytes within the API response</p></li><li><p class="paragraph" style="text-align:left;"><b>MemCache</b> - Subsequent requests for popular video manifest files are served immediately from the MemCache caching layer.</p></li><li><p class="paragraph" style="text-align:left;"><b>Response Delivery</b><b> </b>- the API delivers the payload with the manifest data embedded</p></li></ol><h2 class="heading" style="text-align:left;" id="results"><b>Results</b></h2><p class="paragraph" style="text-align:left;">With this new setup, Pinterest was able to see a 36.7% reduction in p90 startup latency on iOS. 
They also saw a 12.3% reduction in the number of users who had to wait longer than 1 second for a video to start.</p><hr class="content_break"><div class="image"><a class="image__link" href="https://www.nango.dev/blog/why-is-oauth-still-hard?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=02-14-2025" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/7b8d8805-7294-4130-9f69-584d3eff6e27/678f625ee603abbd012c27fb_4._Why_is_OAuth_still_hard_in_2024_-p-1600.png?t=1739488400"/></a></div><h1 class="heading" style="text-align:left;" id="why-is-o-auth-still-hard-in-2025"><a class="link" href="https://www.nango.dev/blog/why-is-oauth-still-hard?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=02-14-2025" target="_blank" rel="noopener noreferrer nofollow"><b>Why is OAuth still hard in 2025?</b></a></h1><p id="o-auth-in-2025-is-like-java-script-" class="paragraph" style="text-align:left;">OAuth in 2025 is like JavaScript browser APIs in 2008. It’s a complete mess. 
But why is OAuth still hard?<br><br>This article breaks down the key reasons and all the gotchas to think about when implementing OAuth.</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.nango.dev/blog/why-is-oauth-still-hard?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=02-14-2025" target="_blank" rel="noopener noreferrer nofollow">Read this article to see what to watch out for with OAuth.</a></p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://tontinton.com/posts/database-fundementals/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-optimized-video-playback" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/cd7895b7-89e5-4cad-90f0-3d09a4755b31/Screenshot_2025-02-13_at_5.20.39_PM.png?t=1739488856"/><div class="embed__content"><p class="embed__title"> A Dive into Database Fundamentals </p><p class="embed__description"> This article is a terrific deep dive into database fundamentals. It talks about ACID properties (and how they can be implemented in a toy database), storage engines, indexes, LSM trees, the CAP theorem and much more.<br><br>It gives a fantastic overview of many of the most interesting topics in databases and also provides additional resources if you want to go further. 
</p><p class="embed__link"> tontinton.com/posts/database-fundementals </p></div></a></div><div class="embed"><a class="embed__url" href="https://neilmadden.blog/2024/09/18/digital-signatures-and-how-to-avoid-them/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-optimized-video-playback" target="_blank"><img class="embed__image embed__image--top" src="https://neilmadden.blog/wp-content/uploads/2023/10/ef036508-8b30-4229-98bd-a08c1a54dddb-550-000000c5f02d5a70_file.jpg"/><div class="embed__content"><p class="embed__title"> Digital signatures and how to avoid them </p><p class="embed__description"> Neil Madden is the author of API Security in Action and has worked as a Security Architect and software engineer.<br><br>He wrote a really interesting blog post on digital signatures, how they work and when they should be used (and when they should be avoided). He talks about the fragility of current signature schemes and how they can lose important contextual details. For many use-cases, Madden suggests using simpler methods like HMAC for authentication instead of digital signatures. </p><p class="embed__link"> neilmadden.blog/2024/09/18/digital-signatures-and-how-to-avoid-them </p></div></a></div><div class="embed"><a class="embed__url" href="https://aosabook.org/en/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-optimized-video-playback" target="_blank"><div class="embed__content"><p class="embed__title"> The Architecture of Open Source Applications </p><p class="embed__description"> This is a terrific series of free books that teach you software architecture using practical examples from open source.<br><br>The chapters go through applications like Git, CMake, Audacity, Firefox and more and explain how they work. 
</p><p class="embed__link"> aosabook.org/en </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=4116c4af-2ca9-4b12-8ea7-be56636db750&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How GitHub Rebuilt Their Push Processing System</title>
  <description>We&#39;ll talk about how GitHub decoupled their system for processing code pushes. Plus, resources for CTOs and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/993bc404-4763-4623-9565-dbbeb94f550f/4_graphics_-_5_7_2024-06.png" length="59780" type="image/png"/>
  <link>https://blog.quastor.org/p/how-github-rebuilt-their-push-processing-system-0157</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-github-rebuilt-their-push-processing-system-0157</guid>
  <pubDate>Fri, 24 Jan 2025 16:15:00 +0000</pubDate>
  <atom:published>2025-01-24T16:15:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How GitHub Rebuilt their Push Processing System</b></p><ul><li><p class="paragraph" style="text-align:left;">GitHub rebuilt their system for handling code pushes to make it more decoupled</p></li><li><p class="paragraph" style="text-align:left;">We’ll give a brief overview of decoupled architectures and their pros/cons</p></li><li><p class="paragraph" style="text-align:left;">After, we’ll talk about why GitHub split their push processing system from a single job to a group of smaller, independent jobs.</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Pair Programming Antipatterns</p></li><li><p class="paragraph" style="text-align:left;">Resources for CTOs</p></li><li><p class="paragraph" style="text-align:left;">Nine ways to shoot yourself in the foot with Postgres</p></li><li><p class="paragraph" style="text-align:left;">How CloudFlare debugged an issue with dropped packets</p></li></ul></li></ul><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>How GitHub Rebuilt their Push Processing System</b></h1><p class="paragraph" style="text-align:left;">GitHub has over 420 million repos and 100 million registered users. 
Every month, the platform handles over 500 million code pushes from 8.5 million developers.</p><p class="paragraph" style="text-align:left;">Whenever someone pushes code to a GitHub repository, this kicks off a chain of tasks.</p><p class="paragraph" style="text-align:left;">GitHub has to do things like:</p><ul><li><p class="paragraph" style="text-align:left;">Update the repo with the latest commits</p></li><li><p class="paragraph" style="text-align:left;">Dispatch any Push webhooks</p></li><li><p class="paragraph" style="text-align:left;">Trigger relevant GitHub workflows</p></li></ul><p class="paragraph" style="text-align:left;">And much more. In fact, GitHub has 20 different services that run in response to a developer pushing code.</p><p class="paragraph" style="text-align:left;">Previously, push requests were handled by a single, enormous job (called <code>RepositoryPushJob</code>). Whenever you pushed code, GitHub’s Ruby on Rails monolith would enqueue <code>RepositoryPushJob</code> and handle all the underlying sub-jobs in a sequential manner.<br><br>However, the company faced issues with this approach and decided to switch to a more decoupled architecture with Apache Kafka. GitHub published a great <a class="link" href="https://github.blog/2024-06-11-how-we-improved-push-processing-on-github/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-github-rebuilt-their-push-processing-system" target="_blank" rel="noopener noreferrer nofollow">blog post</a> delving into the details.</p><p class="paragraph" style="text-align:left;">In this article, we’ll first give an overview of decoupled architectures and the pros/cons. 
Then, we’ll talk about the changes GitHub made.</p><h2 class="heading" style="text-align:left;" id="overview-of-decoupled-architectures"><b>Overview of Decoupled Architectures</b></h2><p class="paragraph" style="text-align:left;">If a user does some significant action on your app, you might have to perform a series of different jobs. If you have a video sharing website, you’ll have a bunch of different things that need to be done when someone uploads a video (encoding, generating transcripts, checking for piracy, etc.).</p><p class="paragraph" style="text-align:left;">A key question is how coupled you want these jobs to be.</p><p class="paragraph" style="text-align:left;">On one side of the spectrum, you can combine these sub-jobs (<i>encoding, generating transcripts, etc.</i>) into a single larger job (<i>ProcessVideo</i>) and then execute them in a sequential manner.</p><p class="paragraph" style="text-align:left;">On the other side, you can have different services for each of the jobs and have them execute in parallel. Whenever a user uploads a video, you’ll add an event with the video’s details to an event streaming platform (like Kafka). Then, the different sub-jobs will consume the event and run independently. </p><p class="paragraph" style="text-align:left;">Some of the pros of a decoupled approach are</p><ul><li><p class="paragraph" style="text-align:left;"><b>Scalability</b> - Each of the components can be scaled up/down independently based on their specific load and demand.</p></li><li><p class="paragraph" style="text-align:left;"><b>Fault Isolation</b> - Components are independent so a failure in one component can be contained (<i>hopefully</i>).</p></li><li><p class="paragraph" style="text-align:left;"><b>Easier Development</b> - Each component can be deployed independently. This makes things much easier if you have a large number of developers working together. 
</p></li></ul><p class="paragraph" style="text-align:left;">Cons with the decoupled approach include</p><ul><li><p class="paragraph" style="text-align:left;"><b>Increased Complexity</b> - Managing coordination between the independent components can be much more complex. You may need additional tooling for observability and monitoring.</p></li><li><p class="paragraph" style="text-align:left;"><b>System Overhead</b> - Communication between components can become slower, especially if it requires a network request. If there are network requests involved, then you’ll have significantly more latency and failures that you’ll have to deal with.</p></li><li><p class="paragraph" style="text-align:left;"><b>Data Consistency</b> - You’ll need to think about making sure data is consistent across the components. </p></li></ul><h2 class="heading" style="text-align:left;" id="git-hubs-old-tightly-coupled-archit"> <b>GitHub’s Old Tightly Coupled Architecture</b></h2><p class="paragraph" style="text-align:left;">Previously, GitHub used a single massive job called <code>RepositoryPushJob</code> for handling pushes. This job managed all the sub-jobs and triggered them one after another in a sequential series of steps.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-us.googleusercontent.com/docsz/AD_4nXf8KCX4PUFVHjqr0yszMhIuxG0_DFOtAIhDeJh0ByOeOw2st6mNZI_gpv5Eri--0qBPAAT9usitPZC3AgExdzI6VgCvKYEfSBQAdxKyGnmeyAwv1342-JnWuhy8UolaOLDzyT8QZQVZ1R7LGS474gvLF94G?key=b4_VPeSVz4YqXbyB_t9mGA"/></div><p class="paragraph" style="text-align:left;">However, the GitHub team was facing quite a few issues with this approach</p><ul><li><p class="paragraph" style="text-align:left;"><b>Difficulty with Retries</b> - If <code>RepositoryPushJob</code> failed then it would have to be retried. However, this caused issues with some of the sub-jobs that were not idempotent (<i>you couldn’t run them multiple times</i>). 
For example, sending multiple push webhooks could cause issues with clients that were receiving the webhooks. </p></li><li><p class="paragraph" style="text-align:left;"><b>Huge Blast Radius</b><b> </b>- The fact that jobs were set in a sequential series of steps meant that later sub-jobs had an implicit dependency on initial sub-jobs. As you increase the number of sub-jobs in <code>RepositoryPushJob</code>, the probability of failure increases.</p></li><li><p class="paragraph" style="text-align:left;"><b>Too Slow</b><b> </b>- Having a super long sequential process is bad for latency. The sub-jobs at the end of <code>RepositoryPushJob</code> had to wait for the sub-jobs in the beginning. This structure led to unnecessary latency for many user-facing push tasks (<i>over a second in some cases</i>).</p></li></ul><h2 class="heading" style="text-align:left;" id="git-hub-new-architecture"><b>GitHub’s New Architecture</b></h2><p class="paragraph" style="text-align:left;">To decrease the coupling in the push system, GitHub decided to break up <code>RepositoryPushJob</code> into smaller, independent jobs.</p><p class="paragraph" style="text-align:left;">They looked at each of the sub-jobs in <code>RepositoryPushJob</code> and grouped them based on dependencies, retry-ability, owning service, etc. Each group of sub-jobs was placed into an independent job with a clear owner and appropriate retry configuration.<br><br>Whenever a developer pushes to a repo, GitHub will add a new event to Kafka. A Kafka consumer service will monitor the Kafka topic and consume the events.<br><br>If there’s a new event, the service will enqueue all the independent background jobs onto a job queue for processing. 
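</p><p class="paragraph" style="text-align:left;">As a rough sketch of that fan-out step (not GitHub’s actual code; the job names, event fields and queue interface here are all hypothetical):</p>

```python
# Sketch: handling one push event by enqueuing each sub-task as its
# own independent background job instead of one sequential mega-job.
INDEPENDENT_JOBS = ["update_refs", "dispatch_webhooks", "trigger_workflows"]

def fan_out_push_event(event, enqueue):
    # Each job is enqueued separately with its own retry policy, so a
    # webhook failure no longer forces a re-run of the ref update.
    for job_name in INDEPENDENT_JOBS:
        enqueue({"job": job_name, "repo": event["repo"], "commit": event["after"]})

# Example with a simple list standing in for the real job queue.
queue = []
fan_out_push_event({"repo": "octocat/hello-world", "after": "abc123"}, queue.append)
```

<p class="paragraph" style="text-align:left;">Because each job is its own unit of work, retries and failures stay scoped to that job. 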
A dedicated pool of worker nodes will then handle the jobs in the queue.</p><p class="paragraph" style="text-align:left;">In order to catch any issues, GitHub built extensive observability to monitor the flow of events through the pipeline.</p><h2 class="heading" style="text-align:left;" id="results"><b>Results</b></h2><p class="paragraph" style="text-align:left;">GitHub has seen great results with the new system. Some of the improvements include</p><ul><li><p class="paragraph" style="text-align:left;"><b>Reliability Improvements</b> - The old <code>RepositoryPushJob</code> was able to fully process 99.987% of pushes with no failures. The new pipeline is able to fully process 99.999% of pushes.</p></li><li><p class="paragraph" style="text-align:left;"><b>Lower Latency</b> - GitHub saw a notable decrease in the pull request sync time with a drop of nearly 33% (<i>in the P50 time</i>).</p></li><li><p class="paragraph" style="text-align:left;"><b>Smaller Blast Radius</b> - previously, an issue with a single step in <code>RepositoryPushJob</code> could impact all subsequent sub-jobs. 
Now, failures are much more isolated and there’s a smaller blast radius for when things go wrong.</p></li></ul><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://blog.cloudflare.com/lost-in-transit-debugging-dropped-packets-from-negative-header-lengths/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-github-rebuilt-their-push-processing-system" target="_blank"><img class="embed__image embed__image--left" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/69b65d8c-3015-4c6c-b5de-f5639a0006ef/image2-38.png?t=1718515856"/><div class="embed__content"><p class="embed__title"> Debugging Dropped Packets at Cloudflare </p><p class="embed__description"> This is an interesting article that delves into a problem Cloudflare was facing where they had drops in bandwidth and failing API requests after making a change to their load balancers. They eventually traced the problem to a bug in the Linux kernel.<br><br>Terin Stock is a software engineer at Cloudflare and he wrote a post delving into packet handling and using tools like pwru to debug network issues and kprobe for kernel issues. </p><p class="embed__link"> https://blog.cloudflare.com/lost-in-transit-debugging-dropped-packets-from-negative-header-lengths/ </p></div></a></div><div class="embed"><a class="embed__url" href="https://tuple.app/pair-programming-guide/antipatterns?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-github-rebuilt-their-push-processing-system" target="_blank"><div class="embed__content"><p class="embed__title"> Pair Programming Antipatterns </p><p class="embed__description"> Pair programming can be an excellent tool for educating junior developers on the codebase; however, there are quite a few anti-patterns you’ll want to 
avoid.<br><br>This article gives a great list of some of them for the person leading the pair programming session (the driver) and the person following (the navigator). </p><p class="embed__link"> tuple.app/pair-programming-guide/antipatterns </p></div></a></div><div class="embed"><a class="embed__url" href="https://github.com/kuchin/awesome-cto?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-github-rebuilt-their-push-processing-system" target="_blank"><img class="embed__image embed__image--left" src="https://opengraph.githubassets.com/ca672bfe717a37a8bd0e8eccbef22dd1df7a141c1f5f942a60dc7859751f2dee/kuchin/awesome-cto"/><div class="embed__content"><p class="embed__title"> Resources for CTOs </p><p class="embed__description"> This is a great GitHub repo with resources for CTOs (or aspiring CTOs).<br><br>It contains resources on software development processes, architecture, product management, hiring and much more! </p><p class="embed__link"> github.com/kuchin/awesome-cto </p></div></a></div><div class="embed"><a class="embed__url" href="https://philbooth.me/blog/nine-ways-to-shoot-yourself-in-the-foot-with-postgresql?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-github-rebuilt-their-push-processing-system" target="_blank"><div class="embed__content"><p class="embed__title"> Nine ways to shoot yourself in the foot with PostgreSQL </p><p class="embed__description"> Many developers have Postgres as their first choice when they need a database (for good reason).<br><br>However, there are some gotchas you should be aware of, especially if you plan on scaling the database. Phil Booth wrote a great blog post delving into some of these potential pitfalls that can become a problem as you scale. 
</p><p class="embed__link"> philbooth.me/blog/nine-ways-to-shoot-yourself-in-the-foot-with-postgresql </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=e4b2e5f0-4339-4411-9c15-a9ed7d2e02eb&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How GitHub Rebuilt Their Push Processing System</title>
  <description>We&#39;ll talk about how GitHub decoupled their system for processing code pushes. Plus, resources for CTOs and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/993bc404-4763-4623-9565-dbbeb94f550f/4_graphics_-_5_7_2024-06.png" length="59780" type="image/png"/>
  <link>https://blog.quastor.org/p/how-github-rebuilt-their-push-processing-system</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-github-rebuilt-their-push-processing-system</guid>
  <pubDate>Fri, 24 Jan 2025 16:10:00 +0000</pubDate>
  <atom:published>2025-01-24T16:10:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How GitHub Rebuilt their Push Processing System</b></p><ul><li><p class="paragraph" style="text-align:left;">GitHub rebuilt their system for handling code pushes to make it more decoupled</p></li><li><p class="paragraph" style="text-align:left;">We’ll give a brief overview of decoupled architectures and their pros/cons</p></li><li><p class="paragraph" style="text-align:left;">After, we’ll talk about why GitHub split their push processing system from a single job to a group of smaller, independent jobs.</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Pair Programming Antipatterns</p></li><li><p class="paragraph" style="text-align:left;">Resources for CTOs</p></li><li><p class="paragraph" style="text-align:left;">Nine ways to shoot yourself in the foot with Postgres</p></li><li><p class="paragraph" style="text-align:left;">How CloudFlare debugged an issue with dropped packets</p></li></ul></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://workos.com/radar?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/c43ce0c4-58a2-430c-ba71-79dc2c40b9e4/Radar_Quastor_2x1-1.png?t=1737734168"/></a></div><h1 class="heading" style="text-align:left;" id="protect-your-app-with-work-os-radar"><a class="link" href="https://workos.com/radar?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">Protect your App with WorkOS Radar</a></h1><p class="paragraph" 
style="text-align:left;">Does your app get fake signups, throwaway emails, or users abusing your free tier?</p><p class="paragraph" style="text-align:left;">Or <i>worse</i>, bot attacks and brute force attempts?</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://workos.com/radar?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">WorkOS Radar</a> can block all this and more. A simple API gives you advanced device fingerprinting that can detect bad actors, bots, and suspicious behaviors.</p><p class="paragraph" style="text-align:left;">Your users trust you. Let’s keep it that way.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://workos.com/radar?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025"><span class="button__text" style=""> Learn How to Protect Your App with WorkOS Radar </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>How GitHub Rebuilt their Push Processing System</b></h1><p class="paragraph" style="text-align:left;">GitHub has over 420 million repos and 100 million registered users. 
Every month, the platform handles over 500 million code pushes from 8.5 million developers.</p><p class="paragraph" style="text-align:left;">Whenever someone pushes code to a GitHub repository, this kicks off a chain of tasks.</p><p class="paragraph" style="text-align:left;">GitHub has to do things like:</p><ul><li><p class="paragraph" style="text-align:left;">Update the repo with the latest commits</p></li><li><p class="paragraph" style="text-align:left;">Dispatch any Push webhooks</p></li><li><p class="paragraph" style="text-align:left;">Trigger relevant GitHub workflows</p></li></ul><p class="paragraph" style="text-align:left;">And much more. In fact, GitHub has 20 different services that run in response to a developer pushing code.</p><p class="paragraph" style="text-align:left;">Previously, push requests were handled by a single, enormous job (called <code>RepositoryPushJob</code>). Whenever you pushed code, GitHub’s Ruby on Rails monolith would enqueue <code>RepositoryPushJob</code> and handle all the underlying sub-jobs in a sequential manner.<br><br>However, the company faced issues with this approach and decided to switch to a more decoupled architecture with Apache Kafka. GitHub published a great <a class="link" href="https://github.blog/2024-06-11-how-we-improved-push-processing-on-github/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-github-rebuilt-their-push-processing-system" target="_blank" rel="noopener noreferrer nofollow">blog post</a> delving into the details.</p><p class="paragraph" style="text-align:left;">In this article, we’ll first give an overview of decoupled architectures and the pros/cons. 
Then, we’ll talk about the changes GitHub made.</p><h2 class="heading" style="text-align:left;" id="overview-of-decoupled-architectures"><b>Overview of Decoupled Architectures</b></h2><p class="paragraph" style="text-align:left;">If a user does some significant action on your app, you might have to perform a series of different jobs. If you have a video sharing website, you’ll have a bunch of different things that need to be done when someone uploads a video (encoding, generating transcripts, checking for piracy, etc.).</p><p class="paragraph" style="text-align:left;">A key question is how coupled you want these jobs to be.</p><p class="paragraph" style="text-align:left;">On one side of the spectrum, you can combine these sub-jobs (<i>encoding, generating transcripts, etc.</i>) into a single larger job (<i>ProcessVideo</i>) and then execute them in a sequential manner.</p><p class="paragraph" style="text-align:left;">On the other side, you can have different services for each of the jobs and have them execute in parallel. Whenever a user uploads a video, you’ll add an event with the video’s details to an event streaming platform (like Kafka). Then, the different sub-jobs will consume the event and run independently. </p><p class="paragraph" style="text-align:left;">Some of the pros of a decoupled approach are</p><ul><li><p class="paragraph" style="text-align:left;"><b>Scalability</b> - Each of the components can be scaled up/down independently based on their specific load and demand.</p></li><li><p class="paragraph" style="text-align:left;"><b>Fault Isolation</b> - Components are independent so a failure in one component can be contained (<i>hopefully</i>).</p></li><li><p class="paragraph" style="text-align:left;"><b>Easier Development</b> - Each component can be deployed independently. This makes things much easier if you have a large number of developers working together. 
</p></li></ul><p class="paragraph" style="text-align:left;">Cons with the decoupled approach include:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Increased Complexity</b> - Managing coordination between the independent components can be much more complex. You may need additional tooling for observability and monitoring.</p></li><li><p class="paragraph" style="text-align:left;"><b>System Overhead</b> - Communication between components can become slower, especially if it requires a network request. Network hops also bring significantly more latency and failure modes that you’ll have to deal with.</p></li><li><p class="paragraph" style="text-align:left;"><b>Data Consistency</b> - You’ll need to think about making sure data is consistent across the components. </p></li></ul><h2 class="heading" style="text-align:left;" id="git-hubs-old-tightly-coupled-archit"><b>GitHub’s Old Tightly Coupled Architecture</b></h2><p class="paragraph" style="text-align:left;">Previously, GitHub used a single massive job called <code>RepositoryPushJob</code> for handling pushes. This job managed all the sub-jobs and triggered them one after another in a sequential series of steps.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-us.googleusercontent.com/docsz/AD_4nXf8KCX4PUFVHjqr0yszMhIuxG0_DFOtAIhDeJh0ByOeOw2st6mNZI_gpv5Eri--0qBPAAT9usitPZC3AgExdzI6VgCvKYEfSBQAdxKyGnmeyAwv1342-JnWuhy8UolaOLDzyT8QZQVZ1R7LGS474gvLF94G?key=b4_VPeSVz4YqXbyB_t9mGA"/></div><p class="paragraph" style="text-align:left;">However, the GitHub team was facing quite a few issues with this approach:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Difficulty with Retries</b> - If <code>RepositoryPushJob</code> failed, it would have to be retried. However, this caused issues with some of the sub-jobs that were not idempotent (<i>they couldn’t safely be run multiple times</i>). 
For example, sending multiple push webhooks could cause issues with clients that were receiving the webhooks. </p></li><li><p class="paragraph" style="text-align:left;"><b>Huge Blast Radius</b><b> </b>- Because the jobs ran as a sequential series of steps, later sub-jobs had an implicit dependency on earlier sub-jobs. As you increase the number of sub-jobs in <code>RepositoryPushJob</code>, the probability of failure increases.</p></li><li><p class="paragraph" style="text-align:left;"><b>Too Slow</b><b> </b>- Having a super long sequential process is bad for latency. The sub-jobs at the end of <code>RepositoryPushJob</code> had to wait for the sub-jobs in the beginning. This structure led to unnecessary latency for many user-facing push tasks (<i>over a second in some cases</i>).</p></li></ul><h2 class="heading" style="text-align:left;" id="git-hub-new-architecture"><b>GitHub’s New Architecture</b></h2><p class="paragraph" style="text-align:left;">To decrease the coupling in the push system, GitHub decided to break up <code>RepositoryPushJob</code> into smaller, independent jobs.</p><p class="paragraph" style="text-align:left;">They looked at each of the sub-jobs in <code>RepositoryPushJob</code> and grouped them based on dependencies, retry-ability, owning service, etc. Each group of sub-jobs was placed into an independent job with a clear owner and appropriate retry configuration.<br><br>Whenever a developer pushes to a repo, GitHub will add a new event to Kafka. A Kafka consumer service will monitor the Kafka topic and consume the events.<br><br>If there’s a new event, the service will enqueue all the independent background jobs onto a job queue for processing. 
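In code terms, the fan-out pattern described above looks roughly like the following sketch (a simplified, hypothetical stand-in for illustration, not GitHub's actual implementation; the job names, payloads, and retry budgets are all made up):

```python
import queue

JOB_QUEUE = queue.Queue()

def update_repo(event):
    return f"updated {event['repo']} to {event['commit']}"

def dispatch_webhooks(event):
    return f"webhooks sent for {event['repo']}"

def trigger_workflows(event):
    return f"workflows triggered for {event['repo']}"

# Each job is registered independently with its own retry budget.
# dispatch_webhooks gets zero retries because it is not idempotent.
PUSH_JOBS = [
    ("update_repo", update_repo, 3),
    ("dispatch_webhooks", dispatch_webhooks, 0),
    ("trigger_workflows", trigger_workflows, 3),
]

def on_push_event(event):
    """Stand-in for the Kafka consumer: enqueue every job independently."""
    for name, handler, max_retries in PUSH_JOBS:
        JOB_QUEUE.put((name, handler, max_retries, event))

def run_worker():
    """Stand-in for the worker pool: each job succeeds or fails on its own."""
    results = {}
    while not JOB_QUEUE.empty():
        name, handler, max_retries, event = JOB_QUEUE.get()
        for attempt in range(max_retries + 1):
            try:
                results[name] = handler(event)
                break
            except Exception:
                if attempt == max_retries:
                    results[name] = "failed"  # isolated; other jobs unaffected
    return results
```

Because each job carries its own retry policy, a webhook failure no longer forces a re-run of the repository update, and a slow step at the front of the old sequential chain no longer delays everything behind it.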
A dedicated pool of worker nodes will then handle the jobs in the queue.</p><p class="paragraph" style="text-align:left;">In order to catch any issues, GitHub built extensive observability to monitor the flow of events through the pipeline.</p><h2 class="heading" style="text-align:left;" id="results"><b>Results</b></h2><p class="paragraph" style="text-align:left;">GitHub has seen great results with the new system. Some of the improvements include:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Reliability Improvements</b> - The old <code>RepositoryPushJob</code> was able to fully process 99.987% of pushes with no failures. The new pipeline is able to fully process 99.999% of pushes.</p></li><li><p class="paragraph" style="text-align:left;"><b>Lower Latency</b> - GitHub saw a notable decrease in pull request sync time, with a drop of nearly 33% (<i>in the P50 time</i>).</p></li><li><p class="paragraph" style="text-align:left;"><b>Smaller Blast Radius</b> - Previously, an issue with a single step in <code>RepositoryPushJob</code> could impact all subsequent sub-jobs. 
Now, failures are much more isolated and there’s a smaller blast radius for when things go wrong.</p></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://workos.com/radar?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" rel="noopener" target="_blank"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/c43ce0c4-58a2-430c-ba71-79dc2c40b9e4/Radar_Quastor_2x1-1.png?t=1737734168"/></a></div><h1 class="heading" style="text-align:left;" id="protect-your-app-with-work-os-radar"><a class="link" href="https://workos.com/radar?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">Protect your App with WorkOS Radar</a></h1><p class="paragraph" style="text-align:left;">Does your app get fake signups, throwaway emails, or users abusing your free tier?</p><p class="paragraph" style="text-align:left;">Or <i>worse</i>, bot attacks and brute force attempts?</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://workos.com/radar?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">WorkOS Radar</a> can block all this and more. A simple API gives you advanced device fingerprinting that can detect bad actors, bots, and suspicious behaviors.</p><p class="paragraph" style="text-align:left;">Your users trust you. 
Let’s keep it that way.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://workos.com/radar?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025"><span class="button__text" style=""> Learn How to Protect Your App with WorkOS Radar </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://blog.cloudflare.com/lost-in-transit-debugging-dropped-packets-from-negative-header-lengths/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-github-rebuilt-their-push-processing-system" target="_blank"><img class="embed__image embed__image--left" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/69b65d8c-3015-4c6c-b5de-f5639a0006ef/image2-38.png?t=1718515856"/><div class="embed__content"><p class="embed__title"> Debugging Dropped Packets at Cloudflare </p><p class="embed__description"> This is an interesting article that delves into a problem Cloudflare was facing where they had drops in bandwidth and failing API requests after making a change to their load balancers. 
They eventually traced the problem to a bug in the Linux kernel.<br><br>Terain Stock is a software engineer at Cloudflare and he wrote a post delving into packet handling and using tools like pwru to debug network issues and kprobes for kernel issues. </p><p class="embed__link"> https://blog.cloudflare.com/lost-in-transit-debugging-dropped-packets-from-negative-header-lengths/ </p></div></a></div><div class="embed"><a class="embed__url" href="https://tuple.app/pair-programming-guide/antipatterns?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-github-rebuilt-their-push-processing-system" target="_blank"><div class="embed__content"><p class="embed__title"> Pair Programming Antipatterns </p><p class="embed__description"> Pair programming can be an excellent tool for educating junior developers on the codebase; however, there are quite a few anti-patterns you’ll want to avoid.<br><br>This article gives a great list of some of them for the person leading the pair programming session (the driver) and the person following (the navigator). </p><p class="embed__link"> tuple.app/pair-programming-guide/antipatterns </p></div></a></div><div class="embed"><a class="embed__url" href="https://github.com/kuchin/awesome-cto?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-github-rebuilt-their-push-processing-system" target="_blank"><img class="embed__image embed__image--left" src="https://opengraph.githubassets.com/ca672bfe717a37a8bd0e8eccbef22dd1df7a141c1f5f942a60dc7859751f2dee/kuchin/awesome-cto"/><div class="embed__content"><p class="embed__title"> Resources for CTOs </p><p class="embed__description"> This is a great GitHub repo with resources for CTOs (or aspiring CTOs).<br><br>It contains resources on software development processes, architecture, product management, hiring and much more! 
</p><p class="embed__link"> github.com/kuchin/awesome-cto </p></div></a></div><div class="embed"><a class="embed__url" href="https://philbooth.me/blog/nine-ways-to-shoot-yourself-in-the-foot-with-postgresql?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-github-rebuilt-their-push-processing-system" target="_blank"><div class="embed__content"><p class="embed__title"> Nine ways to shoot yourself in the foot with PostgreSQL </p><p class="embed__description"> Many developers have Postgres as their first choice when they need a database (for good reason).<br><br>However, there are some gotchas you should be aware of, especially if you plan on scaling the database. Phil Booth wrote a great blog post delving into some of these potential pitfalls that can become a problem as you scale. </p><p class="embed__link"> philbooth.me/blog/nine-ways-to-shoot-yourself-in-the-foot-with-postgresql </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=b524617c-5b76-4d6a-b03e-18125e1c5528&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How Uber Built an Exabyte-Scale System for Data Processing</title>
  <description>We&#39;ll talk about data processing at Uber and how they revamped their ETL platform to make it modular and scalable. Plus, software testing anti-patterns and how to get better at finishing your side projects.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/af145973-14b6-412c-8d79-313abbed6a09/unnamed__6_.png" length="271760" type="image/png"/>
  <link>https://blog.quastor.org/p/how-uber-built-an-exabyte-scale-system-for-data-processing-e6dc</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-uber-built-an-exabyte-scale-system-for-data-processing-e6dc</guid>
  <pubDate>Thu, 16 Jan 2025 17:10:00 +0000</pubDate>
  <atom:published>2025-01-16T17:10:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>The Architecture of Uber’s ETL Platform</b></p><ul><li><p class="paragraph" style="text-align:left;"> Introduction to ETL</p></li><li><p class="paragraph" style="text-align:left;">Tools used for ETL</p></li><li><p class="paragraph" style="text-align:left;">Architecture of Sparkle, Uber’s ETL framework built on Apache Spark</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Software Testing Anti-Patterns</p></li><li><p class="paragraph" style="text-align:left;">Free University Courses for Learning CS</p></li><li><p class="paragraph" style="text-align:left;">How to Get Better at Finishing Your Side Projects</p></li><li><p class="paragraph" style="text-align:left;">15 Life and Work Principles from Jensen Huang (CEO of Nvidia)</p></li></ul></li></ul><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>The Architecture of Uber’s ETL Platform</b></h1><p class="paragraph" style="text-align:left;">Uber is the largest ride sharing company in the world with over 150 million monthly active users and approximately 25 million daily trips.</p><p class="paragraph" style="text-align:left;">With this scale comes a <i>huge</i> amount of data (<i>and a cloud-bill that’s larger than the GDP of a small island nation</i>). Uber generates petabytes of data daily from ride history, logs, payment transactions, etc.</p><p class="paragraph" style="text-align:left;">This data needs to be extracted from the various data sources (payment processor, OLTP database, logs, etc.) 
and then loaded into data warehouses, data lakes, machine learning platforms and more.</p><p class="paragraph" style="text-align:left;">To do this, Uber relies on ETL (Extract, Transform, Load) processes. The Uber engineering team published a terrific <a class="link" href="https://www.uber.com/blog/sparkle-modular-etl/?uclick_id=da6b6b46-4ebd-4c91-9947-460e32b1ed47&utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing" target="_blank" rel="noopener noreferrer nofollow">blog post</a> talking about exactly how they handle ETL at their scale. They have 20,000+ critical data pipelines and 3,000+ engineers who use this system.</p><h2 class="heading" style="text-align:left;" id="introduction-to-etl"><b>Introduction to ETL</b></h2><p class="paragraph" style="text-align:left;">Extract, Transform, Load (ETL) is the process where you</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Extract</b> - you extract data from the various data sources (<i>places where data is created/temporarily stored</i>). This can be a transactional database, payment processor, a CRM (Salesforce or HubSpot), message queue, etc.</p></li><li><p class="paragraph" style="text-align:left;"><b>Transform</b> - you clean, validate and standardize the data. You might need to check for duplicates, handle missing values, join the data with another dataset and more.</p></li><li><p class="paragraph" style="text-align:left;"><b>Load</b> - you load the data into various data sinks. This can be a data warehouse like Google BigQuery, a data lake like HDFS, archival storage like Amazon Glacier or something else.</p></li></ol><p class="paragraph" style="text-align:left;">Building data pipelines for ETL can be quite painful. You’ll need to consider several things:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Data Integrity</b> - ensure the accuracy and consistency in your data. 
You’ll need to check for duplicate records, missing values, inconsistent formatting, outlier values and more.</p></li><li><p class="paragraph" style="text-align:left;"><b>Schema Evolution</b> - as business requirements change, the data you’ll be processing will change. You’ll need to account for new fields, data type changes, deprecated fields, etc.</p></li><li><p class="paragraph" style="text-align:left;"><b>Monitoring/Debugging</b> - you’ll need logging for the different stages of your pipeline and real-time alerts for failures/performance issues so you minimize downtime (and don’t lose data)</p></li><li><p class="paragraph" style="text-align:left;"><b>Scalability</b><b> </b>- the pipeline shouldn’t require a complete re-architecture as your data volume grows. You may also have to deal with bursts in incoming data depending on the usage patterns.</p></li><li><p class="paragraph" style="text-align:left;"><b>Reliability and Failure Recovery</b><b> </b>- For some systems, you might need to<i> </i>guarantee <i>at least once processing</i>. You’ll have to make sure that the system rarely goes down and that you have a process in place to minimize data loss in case of crashes.</p></li><li><p class="paragraph" style="text-align:left;"><b>Compliance</b><b> </b>- you might have to consider internal data governance/privacy policies when doing transformations.</p></li></ul><p class="paragraph" style="text-align:left;">Some common tools used for ETL include</p><ul><li><p class="paragraph" style="text-align:left;"><b>Apache Spark</b> - an open-source engine for large-scale data processing. 
It’s a great choice for complex ETL jobs at scale.</p></li><li><p class="paragraph" style="text-align:left;"><b>dbt (data build tool)</b> - a toolkit for building data pipelines that encourages software engineering best practices like version control, testing, code review and more.</p></li><li><p class="paragraph" style="text-align:left;"><b>Apache Airflow</b> - a popular open-source platform for orchestrating workflows. You can schedule, monitor and manage your ETL pipelines with Python.</p></li><li><p class="paragraph" style="text-align:left;"><b>AWS Glue</b> - fully managed, serverless ETL service from Amazon. </p></li><li><p class="paragraph" style="text-align:left;"><b>Google Cloud Dataflow</b> - fully managed service for ETL from Google Cloud.</p></li></ul><p class="paragraph" style="text-align:left;">In 2023, Uber migrated all their batch workloads to Apache Spark. Recently, they built <i>Sparkle, </i>a framework on top of Apache Spark with the goal of simplifying data pipeline development and testing.</p><h2 class="heading" style="text-align:left;" id="etl-at-uber-with-sparkle"><b>ETL at Uber with Sparkle</b></h2><p class="paragraph" style="text-align:left;">As your ETL jobs get more and more complex, it becomes crucial to use software engineering best practices when writing/maintaining them. Observability, version control, testing, documentation, etc. are some of the best practices that have become increasingly adopted in the data community.</p><p class="paragraph" style="text-align:left;">Leading this charge is dbt, a data engineering platform that helps you apply these best practices to your data transformations.</p><p class="paragraph" style="text-align:left;">However, at Uber, switching from Spark to an entirely new ETL tool wasn’t possible. Given the scale of Uber’s platform, the developer learning curve and the investment required, it just wasn’t worth it. 
(<i>try telling your boss you need to rewrite 20,000 mission-critical data pipelines</i>)</p><p class="paragraph" style="text-align:left;">Instead, the Uber team decided to build <i>Sparkle</i>, a framework on top of Apache Spark that lets engineers write configuration-based modular ETL jobs. Sparkle added features for observability, testing, data lineage tracking and more.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/17e2eaee-0029-4d4f-b56a-c59e8a55fe36/sparkle-layered-architecture-1-17236489174498.jpeg?t=1737039411"/></div><p class="paragraph" style="text-align:left;">The core idea behind Sparkle is <i>modularity</i>. Rather than writing complex, monolithic Spark jobs, engineers break their ETL logic down into a series of smaller, reusable modules. Each module can be in SQL, Java/Scala or Python and they’re defined with YAML. Check the <a class="link" href="https://www.uber.com/blog/sparkle-modular-etl/?uclick_id=da6b6b46-4ebd-4c91-9947-460e32b1ed47&utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing" target="_blank" rel="noopener noreferrer nofollow">blog post</a> for samples of what Sparkle jobs look like.</p><p class="paragraph" style="text-align:left;">Developers can just focus on the business logic around their data pipeline. 
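To make the modularity idea concrete, here is a toy sketch of a configuration-driven pipeline (the module names, config shape, and data are all hypothetical; this is not Sparkle's actual API):

```python
# Small reusable transform modules; in a real framework each would live in
# its own SQL/Python/Scala file and be referenced by name from YAML config.
MODULES = {
    "dedupe": lambda rows: list({tuple(sorted(r.items())): r for r in rows}.values()),
    "drop_null": lambda rows: [r for r in rows if all(v is not None for v in r.values())],
    "to_usd": lambda rows: [{**r, "fare_usd": r["fare_cents"] / 100} for r in rows],
}

# The part that would be declared in YAML: a named pipeline as a list of steps.
PIPELINE_CONFIG = {"name": "trips_daily", "steps": ["dedupe", "drop_null", "to_usd"]}

def run_pipeline(config, rows):
    """The framework's job: wire the configured modules together in order."""
    for step in config["steps"]:
        rows = MODULES[step](rows)
    return rows

trips = [
    {"trip_id": 1, "fare_cents": 1250},
    {"trip_id": 1, "fare_cents": 1250},  # duplicate record
    {"trip_id": 2, "fare_cents": None},  # missing value
]
clean = run_pipeline(PIPELINE_CONFIG, trips)
# clean == [{"trip_id": 1, "fare_cents": 1250, "fare_usd": 12.5}]
```

Reordering steps or swapping in a new transformation is then a config change rather than a code change, which is also what makes testing individual modules against mock data straightforward.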
Sparkle will handle infrastructure and boilerplate with pre-built components like:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Connectors</b><b> </b>- handles all the connection details to pull data from all the various data sources at Uber</p></li><li><p class="paragraph" style="text-align:left;"><b>Readers/Writers</b><b> </b>- handles translating data into different formats like Parquet, JSON, Avro, etc.</p></li><li><p class="paragraph" style="text-align:left;"><b>Observability</b><b> </b>- provides logging, metrics and data lineage tracking</p></li><li><p class="paragraph" style="text-align:left;"><b>Testing</b> - you can write unit tests for your modules using mock data and SQL assertions to make sure your transformations are doing what you expect.</p></li></ul><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://github.com/prakhar1989/awesome-courses?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing#readme" target="_blank"><div class="embed__content"><p class="embed__title"> GitHub Awesome List of University Courses for Learning CS </p><p class="embed__description"> Awesome CS Courses is a curated list of university-level CS courses available for free. You can learn about the principles of distributed computing from ETH-Zurich or Natural Language Processing with Deep Learning from Oxford.<br><br>You’ll find lecture videos, notes, assignments and more. 
</p><p class="embed__link"> github.com/prakhar1989/awesome-courses#readme </p></div></a></div><div class="embed"><a class="embed__url" href="https://www.bytedrum.com/posts/art-of-finishing/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing" target="_blank"><img class="embed__image embed__image--top" src="https://www.bytedrum.com/assets/art-of-finishing/og.png"/><div class="embed__content"><p class="embed__title"> The Art of Finishing </p><p class="embed__description"> If you’re a serial project starter, this article has some useful strategies to help you finally cross the finish line.<br><br>It talks about the main reasons why we avoid finishing things (fear of imperfection, illusion of productivity, etc.) and how you can overcome these blockers. </p><p class="embed__link"> www.bytedrum.com/posts/art-of-finishing </p></div></a></div><div class="embed"><a class="embed__url" href="https://creatoreconomy.so/p/15-life-and-work-principles-from-jensen?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing" target="_blank"><img class="embed__image embed__image--top" src="https://substackcdn.com/image/fetch/w_1200,h_600,c_fill,f_jpg,q_auto:good,fl_progressive:steep,g_auto/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54791a8e-7d40-4366-8933-1469816d25ca_1280x720.png"/><div class="embed__content"><p class="embed__title"> 15 Life and Work Principles from Jensen Huang (CEO of Nvidia) </p><p class="embed__description"> Jensen Huang has a unique leadership style. He has 60+ direct reports but does no 1:1 meetings. 
Instead, he tries to emphasize transparency and open discourse (where everyone in the company is involved in decision making).<br><br>He also looks to chase “zero-billion dollar markets“, where Nvidia is doing something completely new where they have no competitors.<br><br>Read the full article for the rest of his work/leadership principles. </p><p class="embed__link"> creatoreconomy.so/p/15-life-and-work-principles-from-jensen </p></div></a></div><div class="embed"><a class="embed__url" href="https://blog.codepipes.com/testing/software-testing-antipatterns.html?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing" target="_blank"><div class="embed__content"><p class="embed__title"> Software Testing Anti-Patterns </p><p class="embed__description"> This is an interesting article that dives into common software testing anti-patterns and why they can be detrimental.<br><br>Some of the anti-patterns include: paying excessive attention to test coverage, not converting production bugs to tests, treating test code as a second class citizen and more. </p><p class="embed__link"> blog.codepipes.com/testing/software-testing-antipatterns.html </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=9f072f57-a81b-4d22-9805-65216171d193&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How Uber Built an Exabyte-Scale System for Data Processing</title>
  <description>We&#39;ll talk about data processing at Uber and how they revamped their ETL platform to make it modular and scalable. Plus, software testing anti-patterns and how to get better at finishing your side projects.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/af145973-14b6-412c-8d79-313abbed6a09/unnamed__6_.png" length="271760" type="image/png"/>
  <link>https://blog.quastor.org/p/how-uber-built-an-exabyte-scale-system-for-data-processing</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-uber-built-an-exabyte-scale-system-for-data-processing</guid>
  <pubDate>Thu, 16 Jan 2025 15:05:00 +0000</pubDate>
  <atom:published>2025-01-16T15:05:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>The Architecture of Uber’s ETL Platform</b></p><ul><li><p class="paragraph" style="text-align:left;"> Introduction to ETL</p></li><li><p class="paragraph" style="text-align:left;">Tools used for ETL</p></li><li><p class="paragraph" style="text-align:left;">Architecture of Sparkle, Uber’s ETL framework built on Apache Spark</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Software Testing Anti-Patterns</p></li><li><p class="paragraph" style="text-align:left;">Free University Courses for Learning CS</p></li><li><p class="paragraph" style="text-align:left;">How to Get Better at Finishing Your Side Projects</p></li><li><p class="paragraph" style="text-align:left;">15 Life and Work Principles from Jensen Huang (CEO of Nvidia)</p></li></ul></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://www.nango.dev/blog/product-integrations-build-or-buy?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=01-16-2025" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeQkWVZcOZe_7N0hu6RKYM6bA93fg0uZW6cdI4HsTzn-zYUZycRk4p75tpN5I4TVU__cIt5vuSkZ3pZmMoQzCjpQTLbcqUjHWZjYk9-Q1hxv2vX2CYnDrH3uqlIlKtcnf6IPYUJxQ?key=ArHUjbxqy4BwhRsjMAMFSUA0"/></a></div><h1 class="heading" style="text-align:left;" id="the-developers-guide-to-product-int"><a class="link" href="https://www.nango.dev/blog/product-integrations-build-or-buy?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=01-16-2025" target="_blank" rel="noopener noreferrer nofollow">The Developer’s Guide to Product Integrations: Build vs. 
Buy</a></h1><p class="paragraph" style="text-align:left;">Building product integrations is no small feat—balancing timelines, resources, and technical complexity can feel overwhelming.</p><p class="paragraph" style="text-align:left;">Should you build integrations in-house, or is it better to leverage third-party solutions? </p><p class="paragraph" style="text-align:left;">Nango wrote a <a class="link" href="https://www.nango.dev/blog/product-integrations-build-or-buy?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=01-16-2025" target="_blank" rel="noopener noreferrer nofollow">fantastic, in-depth guide</a> that walks you through everything you need to know to make an informed choice. They talk about the trade-offs involved and offer practical tips to make your decision easier.</p><p class="paragraph" style="text-align:left;">In the guide, you’ll learn:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Key Considerations</b> - Understand the costs, risks, and benefits of building vs. 
buying integrations.</p></li><li><p class="paragraph" style="text-align:left;"><b>When to Build</b> - Discover scenarios where in-house development gives you the most control and flexibility.</p></li><li><p class="paragraph" style="text-align:left;"><b>When to Buy</b> - Learn when leveraging pre-built solutions can accelerate timelines and reduce maintenance overhead.</p></li></ul><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.nango.dev/blog/product-integrations-build-or-buy?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=01-16-2025"><span class="button__text" style=""> Read The Full Guide To Make Smarter Integration Decisions </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>The Architecture of Uber’s ETL Platform</b></h1><p class="paragraph" style="text-align:left;">Uber is the largest ride sharing company in the world with over 150 million monthly active users and approximately 25 million daily trips.</p><p class="paragraph" style="text-align:left;">With this scale comes a <i>huge</i> amount of data (<i>and a cloud-bill that’s larger than the GDP of a small island nation</i>). Uber generates petabytes of data daily from ride history, logs, payment transactions, etc.</p><p class="paragraph" style="text-align:left;">This data needs to be extracted from the various data sources (payment processor, OLTP database, logs, etc.) and then loaded into data warehouses, data lakes, machine learning platforms and more.</p><p class="paragraph" style="text-align:left;">To do this, Uber relies on ETL (Extract, Transform, Load) processes. 
The Uber engineering team published a terrific <a class="link" href="https://www.uber.com/blog/sparkle-modular-etl/?uclick_id=da6b6b46-4ebd-4c91-9947-460e32b1ed47&utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing" target="_blank" rel="noopener noreferrer nofollow">blog post</a> talking about exactly how they handle ETL at their scale. They have 20,000+ critical data pipelines and 3,000+ engineers who use this system.</p><h2 class="heading" style="text-align:left;" id="introduction-to-etl"><b>Introduction to ETL</b></h2><p class="paragraph" style="text-align:left;">Extract, Transform, Load (ETL) is the process where you</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Extract</b> - you extract data from the various data sources (<i>places where data is created/temporarily stored</i>). This can be a transactional database, payment processor, a CRM (Salesforce or HubSpot), message queue, etc.</p></li><li><p class="paragraph" style="text-align:left;"><b>Transform</b> - you clean, validate and standardize the data. You might need to check for duplicates, handle missing values, join the data with another dataset and more.</p></li><li><p class="paragraph" style="text-align:left;"><b>Load</b> - you load the data into various data sinks. This can be a data warehouse like Google BigQuery, a data lake like HDFS, archival storage like Amazon Glacier or something else.</p></li></ol><p class="paragraph" style="text-align:left;">Building data pipelines for ETL can be quite painful. You’ll need to consider several things:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Data Integrity</b> - ensure the accuracy and consistency in your data. 
You’ll need to check for duplicate records, missing values, inconsistent formatting, outlier values and more.</p></li><li><p class="paragraph" style="text-align:left;"><b>Schema Evolution</b> - as business requirements change, the data you’ll be processing will change. You’ll need to account for new fields, data type changes, deprecated fields, etc.</p></li><li><p class="paragraph" style="text-align:left;"><b>Monitoring/Debugging</b> - you’ll need logging for the different stages of your pipeline and real-time alerts for failures/performance issues so you minimize downtime (and don’t lose data)</p></li><li><p class="paragraph" style="text-align:left;"><b>Scalability</b><b> </b>- the pipeline shouldn’t require a complete re-architecture as your data volume grows. You may also have to deal with bursts in incoming data depending on the usage patterns.</p></li><li><p class="paragraph" style="text-align:left;"><b>Reliability and Failure Recovery</b><b> </b>- For some systems, you might need to<i> </i>guarantee <i>at least once processing</i>. You’ll have to make sure that the system rarely goes down and that you have a process in place to minimize data loss in case of crashes.</p></li><li><p class="paragraph" style="text-align:left;"><b>Compliance</b><b> </b>- you might have to consider internal data governance/privacy policies when doing transformations.</p></li></ul><p class="paragraph" style="text-align:left;">Some common tools used for ETL include</p><ul><li><p class="paragraph" style="text-align:left;"><b>Apache Spark</b> - an open-source engine for large-scale data processing. 
It’s a great choice for complex ETL jobs at scale.</p></li><li><p class="paragraph" style="text-align:left;"><b>dbt (data build tool)</b> - a toolkit for building data pipelines that encourages software engineering best practices like version control, testing, code review and more.</p></li><li><p class="paragraph" style="text-align:left;"><b>Apache Airflow</b> - a popular open-source platform for orchestrating workflows. You can schedule, monitor and manage your ETL pipelines with Python.</p></li><li><p class="paragraph" style="text-align:left;"><b>AWS Glue</b> - fully managed, serverless ETL service from Amazon. </p></li><li><p class="paragraph" style="text-align:left;"><b>Google Cloud Dataflow</b> - fully managed service for ETL from Google Cloud.</p></li></ul><p class="paragraph" style="text-align:left;">In 2023, Uber migrated all their batch workloads to Apache Spark. Recently, they built <i>Sparkle, </i>a framework on top of Apache Spark with the goal of simplifying data pipeline development and testing.</p><h2 class="heading" style="text-align:left;" id="etl-at-uber-with-sparkle"><b>ETL at Uber with Sparkle</b></h2><p class="paragraph" style="text-align:left;">As your ETL jobs get more and more complex, it becomes crucial to use software engineering best practices when writing/maintaining them. Observability, version control, testing, documentation, etc. are some of the best practices that have become increasingly adopted in the data community.</p><p class="paragraph" style="text-align:left;">Leading this charge is dbt, a data engineering platform that helps you apply these best practices to your data transformations.</p><p class="paragraph" style="text-align:left;">However, at Uber, switching from Spark to an entirely new ETL tool wasn’t possible. Given the scale of Uber’s platform, the developer learning curve and investment required just weren’t worth it.
(<i>try telling your boss you need to rewrite 20,000 mission-critical data pipelines</i>)</p><p class="paragraph" style="text-align:left;">Instead, the Uber team decided to build <i>Sparkle</i>, a framework on top of Apache Spark that lets engineers write configuration-based modular ETL jobs. Sparkle added features for observability, testing, data lineage tracking and more.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/17e2eaee-0029-4d4f-b56a-c59e8a55fe36/sparkle-layered-architecture-1-17236489174498.jpeg?t=1737039411"/></div><p class="paragraph" style="text-align:left;">The core idea behind Sparkle is <i>modularity</i>. Rather than writing complex, monolithic Spark jobs, engineers break their ETL logic down into a series of smaller, reusable modules. Each module can be in SQL, Java/Scala or Python and they’re defined with YAML. Check the <a class="link" href="https://www.uber.com/blog/sparkle-modular-etl/?uclick_id=da6b6b46-4ebd-4c91-9947-460e32b1ed47&utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing" target="_blank" rel="noopener noreferrer nofollow">blog post</a> for samples of what Sparkle jobs look like.</p><p class="paragraph" style="text-align:left;">Developers can just focus on the business logic around their data pipeline. 
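</p><p class="paragraph" style="text-align:left;">As a generic sketch of the configuration-driven idea (<i>this is not Sparkle’s actual YAML format or API</i>), you can picture small transform modules registered by name and wired together from a declarative job definition:</p>

```python
# Generic sketch of a configuration-driven modular pipeline (not Sparkle's
# actual API). Each module is a small, reusable transform registered by name;
# a declarative config (standing in for YAML here) wires them into a job.

MODULES = {}

def module(name):
    def register(fn):
        MODULES[name] = fn
        return fn
    return register

@module("filter_completed")
def filter_completed(rows):
    # Keep only completed trips.
    return [r for r in rows if r["status"] == "completed"]

@module("add_total")
def add_total(rows):
    # Business logic lives in small, individually unit-testable steps.
    return [{**r, "total": round(r["fare"] * 1.1, 2)} for r in rows]

# In Sparkle this job definition would live in a config file.
job_config = {"steps": ["filter_completed", "add_total"]}

def run_job(config, rows):
    for step in config["steps"]:
        rows = MODULES[step](rows)
    return rows

result = run_job(job_config, [
    {"status": "completed", "fare": 10.0},
    {"status": "cancelled", "fare": 5.0},
])
```

<p class="paragraph" style="text-align:left;">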
Sparkle will handle infrastructure and boilerplate with pre-built components like</p><ul><li><p class="paragraph" style="text-align:left;"><b>Connectors</b><b> </b>- handles all the connection details to pull data from all the various data sources at Uber</p></li><li><p class="paragraph" style="text-align:left;"><b>Readers/Writers</b><b> </b>- handles translating data into different formats like Parquet, JSON, Avro, etc.</p></li><li><p class="paragraph" style="text-align:left;"><b>Observability</b><b> </b>- provides logging, metrics and data lineage tracking</p></li><li><p class="paragraph" style="text-align:left;"><b>Testing</b> - you can write unit tests for your modules using mock data and SQL assertions to make sure your transformations are doing what you expect.</p></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://www.nango.dev/blog/how-we-built-a-salesforce-api-integration-in-3-hours?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=01-16-2025" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcORvf_K3u5hfj80bEMBjzl_fLEo1qlEkwNC1zFDlbWNXb20NpW_iyZhLKdk171y0evmN4lD8dMqbodoAYpvBnMO1uLaBZfznsKICugot4uNYcCtfu43UCuTCv_b9MkMC_j80pA?key=ArHUjbxqy4BwhRsjMAMFSUA0"/></a></div><h1 class="heading" style="text-align:left;" id="connect-your-saa-s-product-to-sales"><a class="link" href="https://www.nango.dev/blog/how-we-built-a-salesforce-api-integration-in-3-hours?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=01-16-2025" target="_blank" rel="noopener noreferrer nofollow">Connect your SaaS product to Salesforce in hours, not weeks</a></h1><p class="paragraph" style="text-align:left;">Stop wasting weeks wrestling with Salesforce&#39;s API.</p><p class="paragraph" style="text-align:left;">Nango wrote a <a class="link" 
href="https://www.nango.dev/blog/how-we-built-a-salesforce-api-integration-in-3-hours?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=01-16-2025" target="_blank" rel="noopener noreferrer nofollow">terrific guide</a> showing how you can build a robust Salesforce integration in just 3 hours.</p><p class="paragraph" style="text-align:left;"><span style="color:rgb(34, 34, 34);">This guide breaks down the entire process with actionable tips and insights that’ll save your team dozens of hours.</span></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.nango.dev/blog/how-we-built-a-salesforce-api-integration-in-3-hours?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=01-16-2025"><span class="button__text" style=""> Read The Guide To Simplify Your Salesforce Integration </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://github.com/prakhar1989/awesome-courses?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing#readme" target="_blank"><div class="embed__content"><p class="embed__title"> GitHub Awesome List of University Courses for Learning CS </p><p class="embed__description"> Awesome CS courses is a curated list of university-level CS courses available for free. You can learn about the principles of distributed computing from ETH Zurich or Natural Language Processing with Deep Learning from Oxford.<br><br>You’ll find lecture videos, notes, assignments and more.
</p><p class="embed__link"> github.com/prakhar1989/awesome-courses#readme </p></div></a></div><div class="embed"><a class="embed__url" href="https://www.bytedrum.com/posts/art-of-finishing/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing" target="_blank"><img class="embed__image embed__image--top" src="https://www.bytedrum.com/assets/art-of-finishing/og.png"/><div class="embed__content"><p class="embed__title"> The Art of Finishing </p><p class="embed__description"> If you’re a serial project starter, this article has some useful strategies to help you finally cross the finish line.<br><br>It talks about the main reasons why we avoid finishing things (fear of imperfection, illusion of productivity, etc.) and how you can overcome these blockers. </p><p class="embed__link"> www.bytedrum.com/posts/art-of-finishing </p></div></a></div><div class="embed"><a class="embed__url" href="https://creatoreconomy.so/p/15-life-and-work-principles-from-jensen?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing" target="_blank"><img class="embed__image embed__image--top" src="https://substackcdn.com/image/fetch/w_1200,h_600,c_fill,f_jpg,q_auto:good,fl_progressive:steep,g_auto/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54791a8e-7d40-4366-8933-1469816d25ca_1280x720.png"/><div class="embed__content"><p class="embed__title"> 15 Life and Work Principles from Jensen Huang (CEO of Nvidia) </p><p class="embed__description"> Jensen Huang has a unique leadership style. He has 60+ direct reports but does no 1:1 meetings. 
Instead, he tries to emphasize transparency and open discourse (where everyone in the company is involved in decision making).<br><br>He also looks to chase “zero-billion dollar markets“, where Nvidia is doing something completely new where they have no competitors.<br><br>Read the full article for the rest of his work/leadership principles. </p><p class="embed__link"> creatoreconomy.so/p/15-life-and-work-principles-from-jensen </p></div></a></div><div class="embed"><a class="embed__url" href="https://blog.codepipes.com/testing/software-testing-antipatterns.html?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing" target="_blank"><div class="embed__content"><p class="embed__title"> Software Testing Anti-Patterns </p><p class="embed__description"> This is an interesting article that dives into common software testing anti-patterns and why they can be detrimental.<br><br>Some of the anti-patterns include: paying excessive attention to test coverage, not converting production bugs to tests, treating test code as a second class citizen and more. </p><p class="embed__link"> blog.codepipes.com/testing/software-testing-antipatterns.html </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=a562209c-eac8-46fb-bb4b-3daf7b1e9f96&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How Pinterest Stores and Transfers Hundreds of Terabytes of Data Daily</title>
  <description>We&#39;ll talk about Change Data Capture and the internal system Pinterest built to handle CDC for all their databases. Plus, a practical guide on how you can improve your coding skills with cognitive psychology, a visual guide to memory allocation and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/f8099f66-d8ac-41aa-90dd-7852a4a40878/Payments.png" length="84761" type="image/png"/>
  <link>https://blog.quastor.org/p/how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily</guid>
  <pubDate>Thu, 09 Jan 2025 20:56:00 +0000</pubDate>
  <atom:published>2025-01-09T20:56:00Z</atom:published>
    <dc:creator>Arpan KG</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How Pinterest Stores and Transfers Hundreds of Terabytes of Data Daily</b></p><ul><li><p class="paragraph" style="text-align:left;"> Introduction to Change Data Capture</p></li><li><p class="paragraph" style="text-align:left;">The Design Goals for CDC at Pinterest </p></li><li><p class="paragraph" style="text-align:left;">Architecture of Pinterest’s CDC System</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">The Programmer’s Brain</p></li><li><p class="paragraph" style="text-align:left;">A Curated List of Cryptography Resources and Links</p></li><li><p class="paragraph" style="text-align:left;">Memory Allocation Visualized</p></li><li><p class="paragraph" style="text-align:left;">Getting Things Done in a Chaotic Environment</p></li></ul></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://dub.link/quas-jan9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXclfPkS36JyPOQTKSMiqUNE7I-Dmryo4gPMahFm9OTD-BTKKBu4l-Haz6k7RzmjnVgy_u5E5BOogh1EA-KgdyjCvq2XjUMaRZr7FfoM2PU2xwrkvOs7NnZ8nKnL0HIG4OpjUabzqQ?key=eXEz32zof7Iu-jmoXAT47A"/></a></div><h1 class="heading" style="text-align:left;" id="how-to-pick-technologies-for-your-t"><a class="link" href="https://dub.link/quas-jan9-tech?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" target="_blank" rel="noopener noreferrer nofollow"><b>How to Pick Technologies for your Tech 
Stack</b></a></h1><p class="paragraph" style="text-align:left;">One of the hardest decisions you’ll have to make is around <i>what</i> technologies your team adopts. A wrong decision can be extremely costly and take years to reverse. On the other hand, <i>not </i>making a decision can be <i>just</i> as costly (lost revenue, poor developer productivity, etc.)</p><p class="paragraph" style="text-align:left;">Product for Engineers wrote <a class="link" href="https://dub.link/quas-jan9-tech?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" target="_blank" rel="noopener noreferrer nofollow">a fantastic blog post</a> on their advice for choosing technologies to adopt.</p><p class="paragraph" style="text-align:left;">Some of their tips include</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Prioritize based on set Criteria</b> - There will always be <i>some</i> shiny new toy that your team can adopt. Instead, prioritize based on problems your team is facing. This can be excessive costs, scaling challenges, or a customer need.</p></li><li><p class="paragraph" style="text-align:left;"><b>Mimic the Real World when Evaluating</b> - The engineers who will be using the technology should have significant sway in the decision. They should be able to test the technology in production (<i>safely</i>) and build proof of concepts before deciding.</p></li><li><p class="paragraph" style="text-align:left;"><b>Ensure you consider technical AND business factors</b> - You should talk to <i>all stakeholders</i> and clarify what the set of evaluation criteria are. 
Some potential criteria include performance, cost, reliability, support, flexibility and more.</p></li></ol><p class="paragraph" style="text-align:left;">Subscribe to <a class="link" href="https://dub.link/quas-jan9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" target="_blank" rel="noopener noreferrer nofollow">Product for Engineers</a> for the rest of their tips on picking technologies. It’s free!</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://dub.link/quas-jan9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily"><span class="button__text" style=""> Check out Product for Engineers </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>How Pinterest built a Change Data Capture System</b></h1><p class="paragraph" style="text-align:left;">Pinterest is a social media platform that lets users share images/links for things they’re interested in. You can use the site to get low-carb pasta recipes, find destinations for a wedding or get inspiration for a DIY project (<i>and learn that DIY projects are way harder than they look</i>).</p><p class="paragraph" style="text-align:left;">Pinterest first launched in 2010 and they’ve scaled to hundreds of millions of users with billions of monthly visits to the platform.</p><p class="paragraph" style="text-align:left;">As you might imagine, Pinterest handles a massive amount of data. 
Real-time data processing is crucial for delivering personalized recommendations, detecting fraudulent accounts, reporting results to advertisers and more.</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://en.wikipedia.org/wiki/Change_data_capture?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" target="_blank" rel="noopener noreferrer nofollow">Change Data Capture (CDC)</a> is a critical tool that Pinterest uses to power their data infrastructure. The engineering team published a <a class="link" href="https://medium.com/pinterest-engineering/change-data-capture-at-pinterest-7e4c357ac527?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" target="_blank" rel="noopener noreferrer nofollow">fantastic blog post</a> talking about how they built a generic CDC system for <i>all</i> the online databases at Pinterest. This system handles millions of database queries per second and transfers hundreds of terabytes per day.</p><h2 class="heading" style="text-align:left;" id="introduction-to-change-data-capture"><b>Introduction to Change Data Capture (CDC)</b></h2><p class="paragraph" style="text-align:left;">Change Data Capture is a set of software design patterns that lets you track database changes in real-time. It captures data modifications like inserts, updates and deletes as they occur. Typically, you’ll set up CDC on your OLTP database (Postgres, MySQL, MongoDB, etc.) and transfer the modifications over to your analytics platform, audit/compliance system, data warehouse, etc.</p><p class="paragraph" style="text-align:left;">This ensures that all your systems are kept up-to-date with the most recent data (<i>compared to running nightly batch jobs for syncing changes</i>). 
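</p><p class="paragraph" style="text-align:left;">As a minimal sketch (<i>hypothetical event shapes, loosely modeled on Debezium-style change events</i>), a downstream consumer applies inserts, updates and deletes to its own copy of the data:</p>

```python
# Minimal sketch of applying a CDC stream to a downstream replica.
# Event shapes are hypothetical, loosely modeled on Debezium's envelope,
# where "c" = create, "u" = update and "d" = delete.

def apply_change(replica, event):
    op, key = event["op"], event["key"]
    if op in ("c", "u"):
        replica[key] = event["after"]  # upsert the new row image
    elif op == "d":
        replica.pop(key, None)         # remove the deleted row

replica = {}
stream = [
    {"op": "c", "key": 1, "after": {"user": "ada", "pins": 3}},
    {"op": "u", "key": 1, "after": {"user": "ada", "pins": 4}},
    {"op": "c", "key": 2, "after": {"user": "alan", "pins": 1}},
    {"op": "d", "key": 2, "after": None},
]
for event in stream:
    apply_change(replica, event)
```

<p class="paragraph" style="text-align:left;">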
It can also reduce the load on your OLTP database since you’re only transferring the changes instead of doing full data loads.</p><p class="paragraph" style="text-align:left;">All the commonly used database systems offer a mechanism for tracking database changes. </p><p class="paragraph" style="text-align:left;">Examples are</p><ul><li><p class="paragraph" style="text-align:left;"><b>Postgres</b> - provides a write-ahead log (WAL) where every change to the database is written to the WAL first. Logical decoding plugins (like wal2json) can decode the WAL entries into JSON for a CDC tool to consume. </p></li><li><p class="paragraph" style="text-align:left;"><b>MySQL</b> - provides a binary log (binlog) that records all data modifications. CDC tools can tap into the binlog to capture changes in real-time. </p></li><li><p class="paragraph" style="text-align:left;"><b>MongoDB</b> - uses an operations log (oplog) to store a rolling record of all write operations. CDC systems can tail the oplog to capture changes. MongoDB also provides change streams that you can subscribe to for real-time data changes (<i>change streams are built on top of the oplog</i>).</p></li><li><p class="paragraph" style="text-align:left;"><b>DynamoDB</b> - offers DynamoDB streams for a time-ordered sequence of records that captures all the data modifications in your tables. </p></li><li><p class="paragraph" style="text-align:left;"><b>Microsoft SQL Server</b> - provides a built-in feature called Change Data Capture (CDC) that captures insert, update and delete activity and makes the details available in an easily consumed relational format.</p></li><li><p class="paragraph" style="text-align:left;"><b>Couchbase</b> - uses the Database Change Protocol (DCP) to stream any mutations that happen within the database.
Applications can connect to DCP and get a real-time feed of changes.</p></li><li><p class="paragraph" style="text-align:left;"><b>Cassandra</b> - provides a feature called CDC on Apache Cassandra that lets you capture changes on a per-table basis and write them to the local filesystem. </p></li></ul><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/debezium/debezium?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" target="_blank" rel="noopener noreferrer nofollow">Debezium</a> is the most popular open source platform for CDC and it supports databases like Postgres, MySQL, MongoDB and more.</p><h2 class="heading" style="text-align:left;" id="architecture-of-pinterests-generic-"><b>Architecture of Pinterest’s Generic CDC System</b></h2><p class="paragraph" style="text-align:left;">Before Pinterest’s Generic CDC System, individual teams at the company were building their own CDC solutions. This led to issues around wasted engineering-time, unclear ownership, reliability issues and more.</p><p class="paragraph" style="text-align:left;">The Pinterest team decided to solve this by building a Generic CDC solution based on Debezium.</p><p class="paragraph" style="text-align:left;">Their goals were</p><ul><li><p class="paragraph" style="text-align:left;"><b>Distributed</b><b> </b>- Pinterest has many distributed databases with some having 10,000+ shards. The CDC system should connect to all these shards and transfer the changes.</p></li><li><p class="paragraph" style="text-align:left;"><b>Reliability</b><b> </b>- data should be transferred reliably with a guarantee of <i>at least once processing</i>.</p></li><li><p class="paragraph" style="text-align:left;"><b>Scalability</b> - the CDC system should scale to hundreds of terabytes of throughput. 
Databases at Pinterest receive millions of queries per second.</p></li><li><p class="paragraph" style="text-align:left;"><b>Configurability</b> - the CDC system should be configurable around connectors, failure recovery, load balancing and more.</p></li></ul><p class="paragraph" style="text-align:left;">Here’s the architecture of the system they built:</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXedJYkETQpzXbaANQdIV9uFUkcNxA1IW6eKBI1XA04ofE_Je3XyNM1G7Kba5NimkTNbuz1j9Zbr0_0fC1kcM7R0x53XZYPK53ukJw548JRdjw2OoYhisZYxV0O1B4uUUbAR1lmqrw?key=Zl7-xBQ8MZROqsO03_EmMA7E"/></div><p class="paragraph" style="text-align:left;">Here are the components:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Control Plane</b><b> </b>- manages the configuration and the coordination of the CDC system. It handles things like creating new connectors for new database shards, fixing failed connectors, updating configuration of existing connectors, etc. It runs on a single host and executes its logic on a scheduled basis (typically every minute). </p></li><li><p class="paragraph" style="text-align:left;"><b>Data Plane</b> - the machines in the data plane are responsible for capturing changes from the Pinterest database shards and sending them to Apache Kafka. Each host runs Kafka Connect with multiple Debezium connectors (each connector responsible for a single database shard).</p></li><li><p class="paragraph" style="text-align:left;"><b>Kafka</b> - Kafka is used as the message broker to store and transport the change events. The actual CDC data is stored in preconfigured topics.
Consumers can subscribe to these topics for updates.</p></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://dub.link/quas-jan9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeUIc_RCCKulOb_RVlBD7cHP5g7eYhTQM6TxjLrc2CgGiJTQ8W38EZFrZ2PnFaTtX7-nWHVbv4MEVndnbe8TDWcBLd0d2GobuxVHygwf3BVU4SZa8bSVs3RsJhX9cl73XB5LKo7FZFwJrs1wTLGWvSFtVs?key=eXEz32zof7Iu-jmoXAT47A"/></a></div><h1 class="heading" style="text-align:left;" id="how-to-run-ab-tests-for-engineers"><a class="link" href="https://dub.link/quas-jan9-ab?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" target="_blank" rel="noopener noreferrer nofollow">How to run A/B tests for Engineers</a></h1><p class="paragraph" style="text-align:left;">Product for Engineers is a fantastic newsletter by PostHog that helps developers learn how to find product-market fit and build apps that users love.</p><p class="paragraph" style="text-align:left;">A/B testing and experimentation are crucial for building a feature roadmap, improving conversion rates and accelerating growth. 
However, many engineers don’t understand the ins-and-outs of how to run these tests effectively (<i>they just leave it to the data scientists</i>).</p><p class="paragraph" style="text-align:left;">This edition of <a class="link" href="https://dub.link/quas-jan9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" target="_blank" rel="noopener noreferrer nofollow">Product for Engineers</a> delves into A/B tests and discusses</p><ul><li><p class="paragraph" style="text-align:left;">The 5 traits of good A/B tests</p></li><li><p class="paragraph" style="text-align:left;">How to think about statistical significance and p-values</p></li><li><p class="paragraph" style="text-align:left;">Avoiding false positives</p></li></ul><p class="paragraph" style="text-align:left;">And more.</p><p class="paragraph" style="text-align:left;">To hone your product skills and read more articles like this, check out Product for Engineers below.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://dub.link/quas-jan9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily"><span class="button__text" style=""> Check Out Product for Engineers. It’s Free! 
</span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://yoan-thirion.gitbook.io/knowledge-base/software-craftsmanship/the-programmers-brain?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/28d82df9-a46f-4ad7-8473-e385ab63e9df/image.jpeg?t=1736100249"/><div class="embed__content"><p class="embed__title"> The Programmer&#39;s Brain </p><p class="embed__description"> This is a practical guide on improving your coding skills by understanding the cognitive processes involved.<br><br>The book introduces the Cognitive Dimensions of Notations (CDN) framework, a tool for assessing the usability of codebases. It also provides great insights for onboarding new devs, like limiting tasks to one programming activity and supporting their memory through extensive diagrams.<br><br>It’s a really useful read if you want to apply cognitive science principles to be more effective and efficient as a developer. </p><p class="embed__link"> yoan-thirion.gitbook.io/knowledge-base/software-craftsmanship/the-programmers-brain </p></div></a></div><div class="embed"><a class="embed__url" href="https://github.com/sobolevn/awesome-cryptography?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily#readme" target="_blank"><div class="embed__content"><p class="embed__title"> A Curated list of Cryptography Resources and Links </p><p class="embed__description"> If you’re looking to learn more about cryptography, this curated list is fantastic.
It covers everything from basic cryptographic theory and algorithms to practical tools/libraries. </p><p class="embed__link"> github.com/sobolevn/awesome-cryptography#readme </p></div><img class="embed__image embed__image--right" src="https://opengraph.githubassets.com/0be5a81bf37ef0fe1fff48622afb1c42e33c7ba4985767b9b6d8212d17292920/sobolevn/awesome-cryptography"/></a></div><div class="embed"><a class="embed__url" href="https://samwho.dev/memory-allocation/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" target="_blank"><img class="embed__image embed__image--top" src="https://samwho.dev/images/memory-allocation-card.png?h=8e970c33deef294c608a"/><div class="embed__content"><p class="embed__title"> Memory Allocation Visualized </p><p class="embed__description"> This is a great article that dives into the basics of memory allocation. It explains how programs use malloc and free to dynamically manage memory and the challenges that arise from this.<br><br>Check it out if you’re looking to understand the inner workings of memory management. </p><p class="embed__link"> samwho.dev/memory-allocation </p></div></a></div><div class="embed"><a class="embed__url" href="https://staysaasy.com/leadership/2024/03/12/Getting-Things-Done.html?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" target="_blank"><div class="embed__content"><p class="embed__title"> Getting Things Done In A Chaotic Environment </p><p class="embed__description"> When you’re trying to get things done, there are four main traps: having multiple main focuses, ignoring pressing issues, not finishing tasks and taking too long.<br><br>This is a great read that goes through each of these and how you should avoid them. 
</p><p class="embed__link"> staysaasy.com/leadership/2024/03/12/Getting-Things-Done.html </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=fafe01a4-a9d1-47e7-ad93-e75a896668c1&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How Cloudflare Mitigates Thousands of DDoS Attacks Every Hour</title>
  <description>We&#39;ll talk about the different types of DDoS attacks and how Cloudflare prevents them. Plus, resources on public speaking for software engineers, things we learned about LLMs in 2024 and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/0319bda7-d6c1-4a48-b2b8-dafc7b5e6ae6/CloudFlare_Tries.gif" length="1808144" type="image/gif"/>
  <link>https://blog.quastor.org/p/how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour-25e0</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour-25e0</guid>
  <pubDate>Mon, 06 Jan 2025 15:10:00 +0000</pubDate>
  <atom:published>2025-01-06T15:10:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How Cloudflare Defines, Measures and Stops DDoS Attacks</b></p><ul><li><p class="paragraph" style="text-align:left;"> DDoS Attacks Explained (Volumetric, Application layer and Protocol layer attacks)</p></li><li><p class="paragraph" style="text-align:left;"> How to Measure DDoS Attacks</p></li><li><p class="paragraph" style="text-align:left;">Steps for Protecting from DDoS Attacks</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Resources on Public Speaking for Software Engineers</p></li><li><p class="paragraph" style="text-align:left;">Things we Learned about LLMs in 2024</p></li><li><p class="paragraph" style="text-align:left;">How to Track Engineering Time</p></li><li><p class="paragraph" style="text-align:left;">Static Search Trees and how they can be 40x Faster than Binary Search</p></li></ul></li></ul><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>How Cloudflare Defines, Measures and Stops DDoS Attacks</b></h1><p class="paragraph" style="text-align:left;">Cloudflare is one of the largest internet infrastructure companies in the world, providing CDN services, DDoS protection, DNS management and more to millions of websites. They operate a massive global network with data centers in over 330 cities worldwide, handling trillions of requests per day.</p><p class="paragraph" style="text-align:left;">One of Cloudflare&#39;s core offerings is protection against Distributed Denial of Service (DDoS) attacks. Unfortunately, these malicious attacks are a constant threat to websites. 
In 2024 alone, Cloudflare mitigated over 14.5 million DDoS attacks (<i>an average of 2,200 DDoS attacks per hour</i>).</p><p class="paragraph" style="text-align:left;">Recently, the Cloudflare engineering team published an awesome blog post delving into the evolution of DDoS attacks over the past decade. We’ll be summarizing the post and adding some extra context on DDoS attacks.</p><h2 class="heading" style="text-align:left;" id="d-do-s-attacks-explained"><b>DDoS Attacks Explained</b></h2><p class="paragraph" style="text-align:left;">With a Distributed Denial of Service Attack, a hacker will use many geographically distributed machines to send traffic to a website. These machines usually belong to unsuspecting users and have been infected with malware to make them part of the attacker’s botnet.</p><p class="paragraph" style="text-align:left;">The goal is to overwhelm the target’s backend with traffic, so the site can no longer serve legitimate users. The attacker might then request a ransom from the company, promising to stop the DDoS attack if the company pays up.</p><p class="paragraph" style="text-align:left;">DDoS Attacks can roughly be split into 3 main types: Volumetric, Application layer and Protocol attacks.</p><h3 class="heading" style="text-align:left;" id="volumetric-attacks"><b>Volumetric Attacks</b></h3><p class="paragraph" style="text-align:left;">These attacks are based on brute force techniques where the target server is flooded with data packets to consume bandwidth and server resources.</p><p class="paragraph" style="text-align:left;">Volumetric attacks will frequently rely on <i>amplification</i> and <i>reflection</i>.</p><p class="paragraph" style="text-align:left;">Amplification is where a request in a certain protocol will result in a much larger response (in terms of the number of bytes); the ratio between the request size and response size is called the Amplification Factor.</p><p class="paragraph" style="text-align:left;">Reflection is 
where the attacker will spoof the source of request packets to be the target victim’s IP address. Servers will be unable to distinguish legitimate requests from spoofed ones so they’ll send the (much larger) response payload to the targeted victim’s servers and unintentionally flood them.</p><p class="paragraph" style="text-align:left;">Network Time Protocol (NTP) DDoS attacks are an example of volumetric attacks where you can send a 234-byte spoofed request to an NTP server, which will then send a 48,000 byte response to the target victim. Attackers will repeat this on many different open NTP servers simultaneously to DDoS the victim with all the NTP responses.</p><h3 class="heading" style="text-align:left;" id="application-layer-attacks"><b>Application Layer Attacks</b></h3><p class="paragraph" style="text-align:left;">These DDoS attacks target the “top” layer in the OSI model - the application layer. Attackers might flood the backend with HTTP requests, exploit expensive API endpoints, create many SSL/TLS handshakes, etc.</p><p class="paragraph" style="text-align:left;">Database DDoS attacks are quite common, where a hacker will look for requests that are particularly database-intensive and then spam those in an attempt to exhaust the database resources. Scaling your database through read replicas takes time, so this attack can be pretty successful.</p><p class="paragraph" style="text-align:left;">HTTP Floods are some of the most widely seen layer 7 DDoS attacks, where hackers will spam a web server with HTTP GET/POST requests. Sophisticated hackers will specifically design these to request resources with low usage in order to maximize the number of cache misses the web server has.</p><h3 class="heading" style="text-align:left;" id="protocol-layer-attacks"><b>Protocol Layer Attacks</b></h3><p class="paragraph" style="text-align:left;">Protocol attacks will rely on weaknesses in how particular protocols are designed. 
Examples of these kinds of exploits are SYN floods, BGP hijacking, Smurf attacks and more.</p><p class="paragraph" style="text-align:left;">A SYN flood attack exploits how TCP is designed, specifically the handshake process. The three-way handshake consists of SYN -&gt; SYN-ACK -&gt; ACK, where the client sends a synchronize (SYN) message to initiate, the server responds with a synchronize-acknowledge (SYN-ACK) message and the client then responds with an acknowledgement (ACK) message.</p><p class="paragraph" style="text-align:left;">In a SYN flood attack, a malicious client will send large volumes of SYN messages to the server, which will then respond with SYN-ACK. The client will ignore these and never respond with an ACK message. The server will waste resources (open ports) waiting for the ACK responses from the malicious client. If repeated on a large scale, this can bring the web server down since the server won’t know which requests are legitimate.</p><h2 class="heading" style="text-align:left;" id="how-to-measure-d-do-s-attacks"><b>How to Measure DDoS Attacks</b></h2><p class="paragraph" style="text-align:left;">First of all, defining an <i>individual </i>DDoS attack can be surprisingly difficult. It is not just a one-time spike in requests. A DDoS attack can last for several hours/days and consist of many smaller incidents (<i>also known as pulses</i>).</p><p class="paragraph" style="text-align:left;">Cloudflare analyzes a combination of factors to create a “fingerprint” that helps identify different attacks as part of the same DDoS targeting. 
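</p><p class="paragraph" style="text-align:left;">As a rough sketch of the idea (<i>hypothetical event data and field names, not Cloudflare’s actual schema</i>), grouping events that share a fingerprint might look like:</p>

```python
from collections import defaultdict

# Hypothetical attack events; the field names are illustrative only.
events = [
    {"vector": "SYN flood", "target": "example.com", "signature": "botnet-a"},
    {"vector": "SYN flood", "target": "example.com", "signature": "botnet-a"},
    {"vector": "UDP flood", "target": "example.com", "signature": "botnet-b"},
]

# Pulses sharing the same (vector, target, signature) fingerprint are
# grouped together and treated as one attack campaign.
campaigns = defaultdict(list)
for event in events:
    fingerprint = (event["vector"], event["target"], event["signature"])
    campaigns[fingerprint].append(event)

print(len(campaigns))  # 2 distinct campaigns
```

<p class="paragraph" style="text-align:left;">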
</p><p class="paragraph" style="text-align:left;">Some factors Cloudflare looks at are:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Attack Vectors:</b> Are the same attack vectors being used across the set of events?</p></li><li><p class="paragraph" style="text-align:left;"><b>Targets:</b> Are all the attacks focused on the same target website/entity?</p></li><li><p class="paragraph" style="text-align:left;"><b>Payload Signatures:</b> Do the payloads share anything that could tie them to a particular botnet?</p></li></ul><p class="paragraph" style="text-align:left;">Once they have gotten a rough sense of which pulses are part of the attack, Cloudflare can measure how large the DDoS was.</p><p class="paragraph" style="text-align:left;">The main metrics they use are:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Bits per Second (BPS):</b> Measures the total data transferred per second. This is useful for evaluating network-layer attacks that aim to saturate bandwidth, like UDP floods.</p></li><li><p class="paragraph" style="text-align:left;"><b>Requests per Second (RPS):</b> Measures the number of protocol requests made each second. This is useful for application-layer attacks (Layer 7).</p></li><li><p class="paragraph" style="text-align:left;"><b>Packets per Second (PPS):</b> Represents the number of individual packets sent to the target per second, regardless of size. This is critical for network-layer attacks (Layers 3 and 4) like SYN floods.</p></li></ul><h2 class="heading" style="text-align:left;" id="how-to-protect-from-d-do-s-attacks"><b>How to Protect from DDoS Attacks</b></h2><p class="paragraph" style="text-align:left;">Unfortunately, there is no single solution for protecting your service from a DDoS attack. Large companies use many different approaches. Some of them include:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Rate Limiting:</b> This is the first line of defense. 
You set thresholds on the number of requests your server will accept from a single IP address within a given time frame.</p></li><li><p class="paragraph" style="text-align:left;"><b>Caching and CDNs:</b> You can significantly reduce the load on your web server by caching content and using a Content Delivery Network (CDN). CDNs will distribute your files across multiple servers globally, so that the impact of a DDoS attack is spread out.</p></li><li><p class="paragraph" style="text-align:left;"><b>Reducing the Attack Surface:</b> Minimize the number of services exposed to the public internet. If you have an endpoint with expensive operations, then protect it through authentication (by requiring an API key).</p></li><li><p class="paragraph" style="text-align:left;"><b>Monitoring:</b> You should continuously monitor network traffic to detect anomalies and potential attacks. First, establish a baseline for normal traffic patterns. That way, you can quickly identify unusual spikes or patterns that indicate an attack.</p></li><li><p class="paragraph" style="text-align:left;"><b>Machine Learning:</b> Services like Cloudflare use machine learning algorithms to identify suspicious traffic patterns in real time. 
They wrote an interesting blog post on the ML models they use <a class="link" href="https://blog.cloudflare.com/training-a-million-models-per-day-to-save-customers-of-all-sizes-from-ddos/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour" target="_blank" rel="noopener noreferrer nofollow">here</a>.</p></li></ul><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://simonwillison.net/2024/Dec/31/llms-in-2024/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour" target="_blank"><img class="embed__image embed__image--top" src="https://static.simonwillison.net/static/2024/arena-dec-2024.jpg"/><div class="embed__content"><p class="embed__title"> Things we learned about LLMs in 2024 </p><p class="embed__description"> Over the course of 2024, a ton of rapid advancements came in LLMs. Multiple models were released that surpassed GPT-4’s capabilities and the competition drove LLM prices down dramatically. Multimodal models also became prevalent with capabilities extending to text, image, audio and video.<br><br>However, there’s still a ton of challenges around creating reliable evals, the environmental impact, agents and more.<br><br>This is a great post that summarizes the year for LLMs. 
</p><p class="embed__link"> simonwillison.net/2024/Dec/31/llms-in-2024 </p></div></a></div><div class="embed"><a class="embed__url" href="https://jacobian.org/2024/feb/7/tracking-engineering-time/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour" target="_blank"><img class="embed__image embed__image--top" src="https://jacobian.org/cards/tracking-engineering-time.png"/><div class="embed__content"><p class="embed__title"> How to Track Engineering Time </p><p class="embed__description"> This article presents a useful playbook for tracking engineering time. It splits time spent into features, bugs/debt and toil.<br><br>Analyzing ratios like the time spent on features vs. bugs/debt can be very helpful for making informed decisions about resource allocation and project prioritization. </p><p class="embed__link"> jacobian.org/2024/feb/7/tracking-engineering-time </p></div></a></div><div class="embed"><a class="embed__url" href="https://github.com/matteofigus/awesome-speaking?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour#readme" target="_blank"><div class="embed__content"><p class="embed__title"> Resources on Public Speaking for Software Engineers </p><p class="embed__description"> The “awesome-speaking” GitHub repository is a fantastic resource for anyone looking to improve their public speaking skills, especially in the software engineering domain.<br><br>It contains links to blog posts, books and public speaking organizations. The content ranges from storytelling techniques to tips on how to handle your nerves. It’s a great repo if you want to become a more confident speaker. 
</p><p class="embed__link"> github.com/matteofigus/awesome-speaking#readme </p></div></a></div><div class="embed"><a class="embed__url" href="https://curiouscoding.nl/posts/static-search-tree/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/036667a0-3f29-4a76-8fdf-8ad781b86e62/Screenshot_2025-01-05_at_12.46.05_PM.png?t=1736099176"/><div class="embed__content"><p class="embed__title"> Static Search Trees and how they can be 40x faster than Binary Search </p><p class="embed__description"> Binary Search can be surprisingly inefficient for large datasets due to poor cache utilization. Ragnar Groot Koerkamp tackled this problem by optimizing S+ trees for high-throughput searching.<br><br>He wrote a terrific article talking about his optimizations and how he was able to achieve a 40x speedup over binary search. </p><p class="embed__link"> curiouscoding.nl/posts/static-search-tree </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=36043d36-3b6f-40fe-afdc-a6707cc08663&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How Cloudflare Mitigates Thousands of DDoS Attacks Every Hour</title>
  <description>We&#39;ll talk about the different types of DDoS attacks and how Cloudflare prevents them. Plus, resources on public speaking for software engineers, things we learned about LLMs in 2024 and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/0319bda7-d6c1-4a48-b2b8-dafc7b5e6ae6/CloudFlare_Tries.gif" length="1808144" type="image/gif"/>
  <link>https://blog.quastor.org/p/how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour</guid>
  <pubDate>Mon, 06 Jan 2025 13:49:00 +0000</pubDate>
  <atom:published>2025-01-06T13:49:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How Cloudflare Defines, Measures and Stops DDoS Attacks</b></p><ul><li><p class="paragraph" style="text-align:left;"> DDoS Attacks Explained (Volumetric, Application layer and Protocol layer attacks)</p></li><li><p class="paragraph" style="text-align:left;"> How to Measure DDoS Attacks</p></li><li><p class="paragraph" style="text-align:left;">Steps for Protecting from DDoS Attacks</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Resources on Public Speaking for Software Engineers</p></li><li><p class="paragraph" style="text-align:left;">Things we Learned about LLMs in 2024</p></li><li><p class="paragraph" style="text-align:left;">How to Track Engineering Time</p></li><li><p class="paragraph" style="text-align:left;">Static Search Trees and how they can be 40x Faster than Binary Search</p></li></ul></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcdKvZbhHDZLSLsERyr_kDIBT3i2tL4XkHcy0RoPuvkISdu1lr6U01KsNsxwEfvH-o08kZHZd1pvwvEjvEM6U5X5hvnhIKiPoEzw98vl9ADcwtGnh3NUm0ADu2x1aZf6zJl3R_jQg?key=WWLTacfcf4ieV0AckeDT6Q"/></a></div><h1 class="heading" style="text-align:left;" id="the-complete-guide-to-o-auth-20"><a class="link" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">The Complete Guide to OAuth 2.0</a></h1><p class="paragraph" 
style="text-align:left;">OAuth 2.0 is the industry standard for authorization. It lets you grant a third-party application access to data on your Google/Meta/Dropbox account without sharing your account’s password.</p><p class="paragraph" style="text-align:left;">If you’re building an app and want to add a “<i>sign in with google</i>” button then you’ll need to understand how OAuth works.</p><p class="paragraph" style="text-align:left;">Recently, WorkOS published a <a class="link" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">fantastic guide on OAuth</a>, covering everything you need to know to implement it.</p><p class="paragraph" style="text-align:left;">The <a class="link" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">guide</a> covers</p><ul><li><p class="paragraph" style="text-align:left;"><b>Roles and Terminology</b>: Explains the OAuth jargon like Resource Owner, Authorization Server, Refresh Token, etc.</p></li><li><p class="paragraph" style="text-align:left;"><b>Tokens and Credentials</b>: Understand the different types of tokens like Access Tokens, Refresh Tokens, and Authorization Codes and more</p></li><li><p class="paragraph" style="text-align:left;"><b>OAuth Flows</b>: Dive into the various OAuth flows, including Authorization Code Grant, PKCE, and Client Credentials, and learn when to use each one based on your application needs</p></li></ul><p class="paragraph" style="text-align:left;">Read the <a class="link" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">full guide</a> to learn more about OAuth 2.0 and how to implement it with WorkOS.</p><div class="button" 
style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025"><span class="button__text" style=""> Read the Complete Guide to OAuth 2.0 </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>How Cloudflare Defines, Measures and Stops DDoS Attacks</b></h1><p class="paragraph" style="text-align:left;">Cloudflare is one of the largest internet infrastructure companies in the world, providing CDN services, DDoS protection, DNS management and more to millions of websites. They operate a massive global network with data centers in over 330 cities worldwide, handling trillions of requests per day.</p><p class="paragraph" style="text-align:left;">One of Cloudflare&#39;s core offerings is protection against Distributed Denial of Service (DDoS) attacks. Unfortunately, these malicious attacks are a constant threat to websites. In 2024 alone, Cloudflare mitigated over 14.5 million DDoS attacks (<i>an average of 2,200 DDoS attacks per hour</i>).</p><p class="paragraph" style="text-align:left;">Recently, the Cloudflare engineering team published an awesome blog post delving into the evolution of DDoS attacks over the past decade. We’ll be summarizing the post and adding some extra context on DDoS attacks.</p><h2 class="heading" style="text-align:left;" id="d-do-s-attacks-explained"><b>DDoS Attacks Explained</b></h2><p class="paragraph" style="text-align:left;">With a Distributed Denial of Service Attack, a hacker will use many geographically distributed machines to send traffic to a website. 
These machines usually belong to unsuspecting users and have been infected with malware to make them part of the attacker’s botnet.</p><p class="paragraph" style="text-align:left;">The goal is to overwhelm the target’s backend with traffic, so the site can no longer serve legitimate users. The attacker might then request a ransom from the company, promising to stop the DDoS attack if the company pays up.</p><p class="paragraph" style="text-align:left;">DDoS Attacks can roughly be split into 3 main types: Volumetric, Application layer and Protocol attacks.</p><h3 class="heading" style="text-align:left;" id="volumetric-attacks"><b>Volumetric Attacks</b></h3><p class="paragraph" style="text-align:left;">These attacks are based on brute force techniques where the target server is flooded with data packets to consume bandwidth and server resources.</p><p class="paragraph" style="text-align:left;">Volumetric attacks will frequently rely on <i>amplification</i> and <i>reflection</i>.</p><p class="paragraph" style="text-align:left;">Amplification is where a request in a certain protocol will result in a much larger response (in terms of the number of bytes); the ratio between the request size and response size is called the Amplification Factor.</p><p class="paragraph" style="text-align:left;">Reflection is where the attacker will spoof the source of request packets to be the target victim’s IP address. Servers will be unable to distinguish legitimate requests from spoofed ones so they’ll send the (much larger) response payload to the targeted victim’s servers and unintentionally flood them.</p><p class="paragraph" style="text-align:left;">Network Time Protocol (NTP) DDoS attacks are an example of volumetric attacks where you can send a 234-byte spoofed request to an NTP server, which will then send a 48,000 byte response to the target victim. 
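</p><p class="paragraph" style="text-align:left;">To make the amplification factor concrete, here’s a quick back-of-the-envelope calculation using the NTP numbers above (<i>simple arithmetic, not Cloudflare’s code</i>):</p>

```python
# Amplification factor = response size / request size.
def amplification_factor(request_bytes: int, response_bytes: int) -> float:
    return response_bytes / request_bytes

# NTP example from above: a 234-byte spoofed request triggers
# a roughly 48,000-byte response aimed at the victim.
print(f"~{amplification_factor(234, 48_000):.0f}x")  # prints "~205x"
```

<p class="paragraph" style="text-align:left;">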
Attackers will repeat this on many different open NTP servers simultaneously to DDoS the victim with all the NTP responses.</p><h3 class="heading" style="text-align:left;" id="application-layer-attacks"><b>Application Layer Attacks</b></h3><p class="paragraph" style="text-align:left;">These DDoS attacks target the “top” layer in the OSI model - the application layer. Attackers might flood the backend with HTTP requests, exploit expensive API endpoints, create many SSL/TLS handshakes, etc.</p><p class="paragraph" style="text-align:left;">Database DDoS attacks are quite common, where a hacker will look for requests that are particularly database-intensive and then spam those in an attempt to exhaust the database resources. Scaling your database through read replicas takes time, so this attack can be pretty successful.</p><p class="paragraph" style="text-align:left;">HTTP Floods are some of the most widely seen layer 7 DDoS attacks, where hackers will spam a web server with HTTP GET/POST requests. Sophisticated hackers will specifically design these to request resources with low usage in order to maximize the number of cache misses the web server has.</p><h3 class="heading" style="text-align:left;" id="protocol-layer-attacks"><b>Protocol Layer Attacks</b></h3><p class="paragraph" style="text-align:left;">Protocol attacks will rely on weaknesses in how particular protocols are designed. Examples of these kinds of exploits are SYN floods, BGP hijacking, Smurf attacks and more.</p><p class="paragraph" style="text-align:left;">A SYN flood attack exploits how TCP is designed, specifically the handshake process. 
The three-way handshake consists of SYN -&gt; SYN-ACK -&gt; ACK, where the client sends a synchronize (SYN) message to initiate, the server responds with a synchronize-acknowledge (SYN-ACK) message and the client then responds with an acknowledgement (ACK) message.</p><p class="paragraph" style="text-align:left;">In a SYN flood attack, a malicious client will send large volumes of SYN messages to the server, which will then respond with SYN-ACK. The client will ignore these and never respond with an ACK message. The server will waste resources (open ports) waiting for the ACK responses from the malicious client. If repeated on a large scale, this can bring the web server down since the server won’t know which requests are legitimate.</p><h2 class="heading" style="text-align:left;" id="how-to-measure-d-do-s-attacks"><b>How to Measure DDoS Attacks</b></h2><p class="paragraph" style="text-align:left;">First of all, defining an <i>individual </i>DDoS attack can be surprisingly difficult. It is not just a one-time spike in requests. A DDoS attack can last for several hours/days and consist of many smaller incidents (<i>also known as pulses</i>).</p><p class="paragraph" style="text-align:left;">Cloudflare analyzes a combination of factors to create a “fingerprint” that helps identify different attacks as part of the same DDoS targeting. 
</p><p class="paragraph" style="text-align:left;">Some factors Cloudflare looks at are:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Attack Vectors:</b> Are the same attack vectors being used across the set of events?</p></li><li><p class="paragraph" style="text-align:left;"><b>Targets:</b> Are all the attacks focused on the same target website/entity?</p></li><li><p class="paragraph" style="text-align:left;"><b>Payload Signatures:</b> Do the payloads share anything that could tie them to a particular botnet?</p></li></ul><p class="paragraph" style="text-align:left;">Once they have gotten a rough sense of which pulses are part of the attack, Cloudflare can measure how large the DDoS was.</p><p class="paragraph" style="text-align:left;">The main metrics they use are:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Bits per Second (BPS):</b> Measures the total data transferred per second. This is useful for evaluating network-layer attacks that aim to saturate bandwidth, like UDP floods.</p></li><li><p class="paragraph" style="text-align:left;"><b>Requests per Second (RPS):</b> Measures the number of protocol requests made each second. This is useful for application-layer attacks (Layer 7).</p></li><li><p class="paragraph" style="text-align:left;"><b>Packets per Second (PPS):</b> Represents the number of individual packets sent to the target per second, regardless of size. This is critical for network-layer attacks (Layers 3 and 4) like SYN floods.</p></li></ul><h2 class="heading" style="text-align:left;" id="how-to-protect-from-d-do-s-attacks"><b>How to Protect from DDoS Attacks</b></h2><p class="paragraph" style="text-align:left;">Unfortunately, there is no single solution for protecting your service from a DDoS attack. Large companies use many different approaches. Some of them include:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Rate Limiting:</b> This is the first line of defense. 
You set thresholds on the number of requests your server will accept from a single IP address within a given time frame.</p></li><li><p class="paragraph" style="text-align:left;"><b>Caching and CDNs:</b> You can significantly reduce the load on your web server by caching content and using a Content Delivery Network (CDN). CDNs will distribute your files across multiple servers globally, so that the impact of a DDoS attack is spread out.</p></li><li><p class="paragraph" style="text-align:left;"><b>Reducing the Attack Surface:</b> Minimize the number of services exposed to the public internet. If you have an endpoint with expensive operations, then protect it through authentication (by requiring an API key).</p></li><li><p class="paragraph" style="text-align:left;"><b>Monitoring:</b> You should continuously monitor network traffic to detect anomalies and potential attacks. First, establish a baseline for normal traffic patterns. That way, you can quickly identify unusual spikes or patterns that indicate an attack.</p></li><li><p class="paragraph" style="text-align:left;"><b>Machine Learning</b><b> </b>- services like Cloudflare use machine learning algorithms to identify suspicious traffic patterns in real time. 
They wrote an interesting blog post on the ML models they use <a class="link" href="https://blog.cloudflare.com/training-a-million-models-per-day-to-save-customers-of-all-sizes-from-ddos/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour" target="_blank" rel="noopener noreferrer nofollow">here</a>.</p></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" rel="noopener" target="_blank"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcdKvZbhHDZLSLsERyr_kDIBT3i2tL4XkHcy0RoPuvkISdu1lr6U01KsNsxwEfvH-o08kZHZd1pvwvEjvEM6U5X5hvnhIKiPoEzw98vl9ADcwtGnh3NUm0ADu2x1aZf6zJl3R_jQg?key=WWLTacfcf4ieV0AckeDT6Q"/></a></div><h1 class="heading" style="text-align:left;" id="the-complete-guide-to-o-auth-20"><a class="link" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">The Complete Guide to OAuth 2.0</a></h1><p class="paragraph" style="text-align:left;">OAuth 2.0 is the industry standard for authorization. 
It lets you grant a third-party application access to data on your Google/Meta/Dropbox account without sharing your account’s password.</p><p class="paragraph" style="text-align:left;">If you’re building an app and want to add a “<i>sign in with Google</i>” button then you’ll need to understand how OAuth works.</p><p class="paragraph" style="text-align:left;">Recently, WorkOS published a <a class="link" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">fantastic guide on OAuth</a>, covering everything you need to know to implement it.</p><p class="paragraph" style="text-align:left;">The <a class="link" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">guide</a> covers</p><ul><li><p class="paragraph" style="text-align:left;"><b>Roles and Terminology</b>: Explains the OAuth jargon like Resource Owner, Authorization Server, Refresh Token, etc.</p></li><li><p class="paragraph" style="text-align:left;"><b>Tokens and Credentials</b>: Understand the different types of tokens, like Access Tokens, Refresh Tokens, and Authorization Codes</p></li><li><p class="paragraph" style="text-align:left;"><b>OAuth Flows</b>: Dive into the various OAuth flows, including Authorization Code Grant, PKCE, and Client Credentials, and learn when to use each one based on your application needs</p></li></ul><p class="paragraph" style="text-align:left;">Read the <a class="link" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">full guide</a> to learn more about OAuth 2.0 and how to implement it with WorkOS.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" 
style="" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025"><span class="button__text" style=""> Read the Complete Guide to OAuth 2.0 </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://simonwillison.net/2024/Dec/31/llms-in-2024/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour" target="_blank"><img class="embed__image embed__image--top" src="https://static.simonwillison.net/static/2024/arena-dec-2024.jpg"/><div class="embed__content"><p class="embed__title"> Things we learned about LLMs in 2024 </p><p class="embed__description"> Over the course of 2024, a ton of rapid advancements came in LLMs. Multiple models were released that surpassed GPT-4’s capabilities and the competition drove LLM prices down dramatically. Multimodal models also became prevalent with capabilities extending to text, image, audio and video.<br><br>However, there’s still a ton of challenges around creating reliable evals, the environmental impact, agents and more.<br><br>This is a great post that summarizes the year for LLMs. </p><p class="embed__link"> simonwillison.net/2024/Dec/31/llms-in-2024 </p></div></a></div><div class="embed"><a class="embed__url" href="https://jacobian.org/2024/feb/7/tracking-engineering-time/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour" target="_blank"><img class="embed__image embed__image--top" src="https://jacobian.org/cards/tracking-engineering-time.png"/><div class="embed__content"><p class="embed__title"> How to Track Engineering Time </p><p class="embed__description"> This article presents a useful playbook for tracking engineering time. 
It splits time spent into features, bugs/debt and toil.<br><br>Analyzing ratios like the time spent on features vs. bugs/debt can be very helpful for making informed decisions about resource allocation and project prioritization. </p><p class="embed__link"> jacobian.org/2024/feb/7/tracking-engineering-time </p></div></a></div><div class="embed"><a class="embed__url" href="https://github.com/matteofigus/awesome-speaking?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour#readme" target="_blank"><div class="embed__content"><p class="embed__title"> Resources on Public Speaking for Software Engineers </p><p class="embed__description"> The “awesome-speaking” GitHub repository is a fantastic resource for anyone looking to improve their public speaking skills, especially in the software engineering domain.<br><br>It contains links to blog posts, books and public speaking organizations. The content ranges from storytelling techniques to tips on how to handle your nerves. It’s a great repo if you want to become a more confident speaker. </p><p class="embed__link"> github.com/matteofigus/awesome-speaking#readme </p></div></a></div><div class="embed"><a class="embed__url" href="https://curiouscoding.nl/posts/static-search-tree/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/036667a0-3f29-4a76-8fdf-8ad781b86e62/Screenshot_2025-01-05_at_12.46.05_PM.png?t=1736099176"/><div class="embed__content"><p class="embed__title"> Static Search Trees and how they can be 40x faster than Binary Search </p><p class="embed__description"> Binary Search can be surprisingly inefficient for large datasets due to poor cache utilization. 
Ragnar Groot Koerkamp tackled this problem by optimizing S+ trees for high-throughput searching.<br><br>He wrote a terrific article talking about his optimizations and how he was able to achieve a 40x speedup over binary search. </p><p class="embed__link"> curiouscoding.nl/posts/static-search-tree </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=95088630-4a87-43c1-b8a7-3e9c6f384b53&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>The Architecture of Canva&#39;s Data Platform</title>
  <description>We&#39;ll talk about Snowflake and how Canva built a monitoring system around it. Plus, how to optimize code, why Unicode is harder than you think, automating your job search and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/dea01670-c170-493f-9db6-fe8f23f8bd40/new_diagram.gif" length="428407" type="image/gif"/>
  <link>https://blog.quastor.org/p/the-architecture-of-canva-s-data-platform-b20a</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/the-architecture-of-canva-s-data-platform-b20a</guid>
  <pubDate>Tue, 31 Dec 2024 10:30:00 +0000</pubDate>
  <atom:published>2024-12-31T10:30:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>The Architecture of Canva’s Data Platform</b></p><ul><li><p class="paragraph" style="text-align:left;"> Introduction to Snowflake</p></li><li><p class="paragraph" style="text-align:left;"> Architecture of Canva’s Data Platform</p></li><li><p class="paragraph" style="text-align:left;">How Canva Monitors their Snowflake Usage to avoid Expensive Surprises</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">The Four Kinds of Optimization</p></li><li><p class="paragraph" style="text-align:left;">Unicode is Harder Than You Think</p></li><li><p class="paragraph" style="text-align:left;">How I Automated My Job Application Process</p></li><li><p class="paragraph" style="text-align:left;">Three Bucket Framework for Engineering Metrics</p></li></ul></li></ul><div class="custom_html"><iframe src="https://embeds.beehiiv.com/c8ef45bf-1b8e-469c-baf2-b51f4701e532" data-test-id="beehiiv-embed" width="100%" height="320" frameborder="0" style="border-radius: 4px; border: 2px solid #e5e7eb; margin: 0; background-color: transparent;"></iframe></div><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user">The Architecture of Canva&#39;s Data Platform</h1><p class="paragraph" style="text-align:left;">Canva is an online graphic design platform that lets you create social media posts, posters, videos, logos and more. 
They have a ton of pre-built templates that help you create professional looking designs even if you have the visual-design skills of a 4 year old.</p><p class="paragraph" style="text-align:left;">The company was founded in 2013 and has exploded in popularity with hundreds of millions of users globally. This hyper growth has caused a ton of interesting engineering challenges, particularly with their data platform.</p><p class="paragraph" style="text-align:left;">Canva uses Snowflake for the core of their data platform. They store over 25 petabytes of data and execute over 90 million queries a month. Over two-thirds of employees at Canva use Snowflake in some way (writing SQL queries, relying on business-intelligence dashboards, etc.)</p><p class="paragraph" style="text-align:left;">However, Snowflake costs can get out-of-hand <i>very</i> quickly if you don’t effectively monitor and optimize your usage. Canva wrote a fantastic <a class="link" href="https://www.canva.dev/blog/engineering/our-journey-to-snowflake-monitoring-mastery/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank" rel="noopener noreferrer nofollow">engineering blog</a> on the tooling they built to do this.</p><h2 class="heading" style="text-align:left;" id="introduction-to-snowflake"><b>Introduction to Snowflake</b></h2><p class="paragraph" style="text-align:left;">Snowflake is a cloud data warehouse platform that has grown incredibly quickly since its launch in 2014.</p><p class="paragraph" style="text-align:left;">At the time of their founding, companies were mainly using on-prem data warehouses like Teradata or Oracle. However, these were <i>super expensive </i>to scale and maintain. 
You’d have to purchase a ton of hardware upfront and managing all the infrastructure was a huge pain.</p><p class="paragraph" style="text-align:left;">In response, Snowflake built a cloud-native data warehouse that runs on top of AWS, Azure and Google Cloud. You pick the cloud provider that fits with the rest of your infrastructure.</p><p class="paragraph" style="text-align:left;">Data in Snowflake is stored in a columnar format in “micro-partitions”. These are contiguous storage units that contain 50-500 MB of data and help massively with speeding up analytical queries.</p><p class="paragraph" style="text-align:left;">You can access data in Snowflake with SQL, the API (with connectors for Python, Java, Node.js, etc.), the web interface and other third-party tools.</p><p class="paragraph" style="text-align:left;">Some of the core selling points of Snowflake are</p><ul><li><p class="paragraph" style="text-align:left;"><b>Separation of Storage and Compute</b> - Snowflake allows you to scale your <i>compute</i> resources independently of your <i>storage</i> resources. If you have a small dataset but need extensive processing, then you can scale up your compute resources without having to pay for storage you don’t need. AWS Redshift added this capability with RA3 instances in late 2019.</p></li><li><p class="paragraph" style="text-align:left;"><b>Low Management and Security</b> - Snowflake is cloud-native so it has far lower maintenance than an on-prem data warehouse. 
It also provides a ton of security features like <a class="link" href="https://docs.snowflake.com/en/user-guide/data-time-travel?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank" rel="noopener noreferrer nofollow">Time Travel</a> that lets you access historical data versions for data recovery and auditing.</p></li><li><p class="paragraph" style="text-align:left;"><b>Data Integrations</b> - Snowflake supports both structured and semi-structured data formats like CSV, JSON, Avro, Parquet and more. It also has integrations with a ton of data tools like Kafka, Spark, Tableau, etc.</p></li></ul><p class="paragraph" style="text-align:left;">The main cons of Snowflake are vendor lock-in and (<i>potentially</i>) the costs.</p><p class="paragraph" style="text-align:left;">Snowflake’s pricing can be notoriously complex and difficult to predict. Not monitoring/optimizing your usage can quickly result in a very expensive surprise.</p><h2 class="heading" style="text-align:left;" id="snowflake-at-canva"><b>Snowflake at Canva</b></h2><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeH_IESlg1LAEEOaLBOpxB-PcHID38XDVb1eU92Be4qzsZm7PK3yI9aAdz16_97dgXNct8SseIPudnFrDmPbMEAy_8pIASSBsq1oM715ENciiBkGMlgpuEbDTc6zUrIcjjBKdgWSw?key=XhIYY0F08Di48z6W3kJ_cavu"/></div><p class="paragraph" style="text-align:left;">Here are the steps in Canva’s Data Platform</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Data Ingestion</b> - First party data (<i>generated by services at Canva</i>) is ingested through AWS S3. Third party data (<i>Facebook ad spend results, Google organic search data, etc.</i>) is ingested with Fivetran for ETL</p></li><li><p class="paragraph" style="text-align:left;"><b>Data Transformation</b><b> </b>- Canva uses dbt for transforming and cleaning raw data into a format that supports analysis and reporting. 
dbt is an open source tool where you can write your data transformation logic in SQL or Python and organize it into maintainable components. </p></li><li><p class="paragraph" style="text-align:left;"><b>Data Views</b><b> </b>- Canva uses Census to synchronize enriched datasets to third-party systems. They also use Looker and Mode for data visualization and exploration.</p></li></ol><h2 class="heading" style="text-align:left;" id="canvas-snowflake-monitoring-system"><b>Canva’s Snowflake Monitoring System</b></h2><p class="paragraph" style="text-align:left;">In order to avoid overspending, the Canva team built out an extensive monitoring system around Snowflake and their usage.</p><p class="paragraph" style="text-align:left;">They started by using Snowflake’s <a class="link" href="https://docs.snowflake.com/en/sql-reference/account-usage/query_history?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank" rel="noopener noreferrer nofollow">account_usage.query_history view</a> to collect usage data and they stored this in a dedicated dbt project. They also captured detailed metadata on all their other dbt runs (<i>for their main data transformations</i>) and stored this data in S3.</p><p class="paragraph" style="text-align:left;">In order to link specific Snowflake queries with the corresponding dbt models, Canva developed a custom dbt query tagging macro that appended JSON metadata to each query. 
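As a rough sketch of how this kind of tagging can work (illustrative only; the tag fields below are hypothetical and this is not Canva's actual macro), Snowflake lets a session attach JSON metadata via its QUERY_TAG parameter, which later surfaces in the query_tag column of the account_usage.query_history view:

```python
import json

def build_query_tag(dbt_model, team, invocation_id):
    # JSON metadata attached to every query the session runs; it later
    # appears in the query_tag column of account_usage.query_history,
    # letting you join per-query cost back to the dbt model that ran it.
    return json.dumps({
        "dbt_model": dbt_model,
        "team": team,
        "invocation_id": invocation_id,
    })

def set_tag_statement(tag):
    # Statement a connector would execute before running the model's SQL.
    # Single quotes are doubled to keep the SQL string literal valid.
    escaped = tag.replace("'", "''")
    return f"ALTER SESSION SET QUERY_TAG = '{escaped}'"

tag = build_query_tag("fct_daily_active_users", "analytics", "run-2024-12-01")
print(set_tag_statement(tag))
```

With tags shaped like this, a cost dashboard can aggregate query_history by the parsed dbt_model or team field instead of only by warehouse.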
They could use this to track the usage at a per-query level and then assign costs to individual queries and transformations.</p><p class="paragraph" style="text-align:left;">With this foundation, Canva created dashboards that provided Data Platform engineers with real-time metrics on how Snowflake was being used, which teams were incurring costs and how usage could be optimized.</p><div class="custom_html"><iframe src="https://embeds.beehiiv.com/c8ef45bf-1b8e-469c-baf2-b51f4701e532" data-test-id="beehiiv-embed" width="100%" height="320" frameborder="0" style="border-radius: 4px; border: 2px solid #e5e7eb; margin: 0; background-color: transparent;"></iframe></div><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://tratt.net/laurie/blog/2023/four_kinds_of_optimisation.html?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank"><div class="embed__content"><p class="embed__title"> The Four Kinds of Optimization </p><p class="embed__description"> This article has a useful list of “mental models” you should have when thinking about optimizing your code.<br><br>The four strategies discussed are<br>1. Find a better algorithm<br>2. Use a more efficient data structure<br>3. Use a lower-level system<br>4. 
Accept a less precise solution<br><br>The article talks about each of these four and the trade-offs involved </p><p class="embed__link"> tratt.net/laurie/blog/2023/four_kinds_of_optimisation.html </p></div></a></div><div class="embed"><a class="embed__url" href="https://mcilloni.ovh/2023/07/23/unicode-is-hard/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank"><div class="embed__content"><p class="embed__title"> Unicode is harder than you think </p><p class="embed__description"> This is a great article that explores Unicode and talks about common misconceptions/pitfalls you may face. It talks about the different encoding formats (UTF-8, UTF-16, etc.), normalization and best practices when working with Unicode. </p><p class="embed__link"> mcilloni.ovh/2023/07/23/unicode-is-hard </p></div></a></div><div class="embed"><a class="embed__url" href="https://blog.daviddodda.com/how-i-automated-my-job-application-process-part-1?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank"><div class="embed__content"><p class="embed__title"> How I Automated My Job Application Process </p><p class="embed__description"> This is an interesting article by David Dodda on how he automated his job search process with LLMs. He built a system that helped him submit 250 job applications in 20 minutes.<br><br>If you’re currently hunting for a job, it might give you some ideas on tactics you can test out to make the process easier. 
</p><p class="embed__link"> blog.daviddodda.com/how-i-automated-my-job-application-process-part-1 </p></div></a></div><div class="embed"><a class="embed__url" href="https://newsletter.getdx.com/p/choosing-engineering-metrics?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/3fe9cd01-b2e8-41a5-870c-1506a26a013b/eba972e1-f1bc-4bac-a011-08210a42ef2e_1578x736.jpg?t=1735585901"/><div class="embed__content"><p class="embed__title"> Three-Bucket Framework for Engineering Metrics </p><p class="embed__description"> Abi Noda wrote a terrific article on how engineering leaders can effectively report metrics to stakeholders.<br><br>His three-bucket framework is<br>1. Demonstrate Business Impact with ROI<br>2. Show System Performance with Uptime, Incidents, Scale<br>3. Present Developer Effectiveness with SPACE and DORA </p><p class="embed__link"> newsletter.getdx.com/p/choosing-engineering-metrics </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=644c3284-d177-469f-9e54-f0e858fd5967&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>The Architecture of Canva&#39;s Data Platform</title>
  <description>We&#39;ll talk about Snowflake and how Canva built a monitoring system around it. Plus, how to optimize code, why Unicode is harder than you think, automating your job search and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/67d36047-ad82-437a-8e7e-7642d2db0841/Canva_Snowflake_Data_Architecture.gif" length="698670" type="image/gif"/>
  <link>https://blog.quastor.org/p/the-architecture-of-canva-s-data-platform</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/the-architecture-of-canva-s-data-platform</guid>
  <pubDate>Mon, 30 Dec 2024 22:15:00 +0000</pubDate>
  <atom:published>2024-12-30T22:15:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>The Architecture of Canva’s Data Platform</b></p><ul><li><p class="paragraph" style="text-align:left;"> Introduction to Snowflake</p></li><li><p class="paragraph" style="text-align:left;"> Architecture of Canva’s Data Platform</p></li><li><p class="paragraph" style="text-align:left;">How Canva Monitors their Snowflake Usage to avoid Expensive Surprises</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">The Four Kinds of Optimization</p></li><li><p class="paragraph" style="text-align:left;">Unicode is Harder Than You Think</p></li><li><p class="paragraph" style="text-align:left;">How I Automated My Job Application Process</p></li><li><p class="paragraph" style="text-align:left;">Three Bucket Framework for Engineering Metrics</p></li></ul></li></ul><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user">The Architecture of Canva&#39;s Data Platform</h1><p class="paragraph" style="text-align:left;">Canva is an online graphic design platform that lets you create social media posts, posters, videos, logos and more. They have a ton of pre-built templates that help you create professional looking designs even if you have the visual-design skills of a 4 year old.</p><p class="paragraph" style="text-align:left;">The company was founded in 2013 and has exploded in popularity with hundreds of millions of users globally. This hyper growth has caused a ton of interesting engineering challenges, particularly with their data platform.</p><p class="paragraph" style="text-align:left;">Canva uses Snowflake for the core of their data platform. 
They store over 25 petabytes of data and execute over 90 million queries a month. Over two-thirds of employees at Canva use Snowflake in some way (writing SQL queries, relying on business-intelligence dashboards, etc.)</p><p class="paragraph" style="text-align:left;">However, Snowflake costs can get out-of-hand <i>very</i> quickly if you don’t effectively monitor and optimize your usage. Canva wrote a fantastic <a class="link" href="https://www.canva.dev/blog/engineering/our-journey-to-snowflake-monitoring-mastery/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank" rel="noopener noreferrer nofollow">engineering blog</a> on the tooling they built to do this.</p><h2 class="heading" style="text-align:left;" id="introduction-to-snowflake"><b>Introduction to Snowflake</b></h2><p class="paragraph" style="text-align:left;">Snowflake is a cloud data warehouse platform that has grown incredibly quickly since its launch in 2014.</p><p class="paragraph" style="text-align:left;">At the time of their founding, companies were mainly using on-prem data warehouses like Teradata or Oracle. However, these were <i>super expensive </i>to scale and maintain. You’d have to purchase a ton of hardware upfront and managing all the infrastructure was a huge pain.</p><p class="paragraph" style="text-align:left;">In response, Snowflake built a cloud-native data warehouse that runs on top of AWS, Azure and Google Cloud. You pick the cloud provider that fits with the rest of your infrastructure.</p><p class="paragraph" style="text-align:left;">Data in Snowflake is stored in a columnar format in “micro-partitions”. 
These are contiguous storage units that contain 50-500 MB of data and help massively with speeding up analytical queries.</p><p class="paragraph" style="text-align:left;">You can access data in Snowflake with SQL, the API (with connectors for Python, Java, Node.js, etc.), the web interface and other third-party tools.</p><p class="paragraph" style="text-align:left;">Some of the core selling points of Snowflake are</p><ul><li><p class="paragraph" style="text-align:left;"><b>Separation of Storage and Compute</b> - Snowflake allows you to scale your <i>compute</i> resources independently of your <i>storage</i> resources. If you have a small dataset but need extensive processing, then you can scale up your compute resources without having to pay for storage you don’t need. AWS Redshift added this capability with RA3 instances in late 2019.</p></li><li><p class="paragraph" style="text-align:left;"><b>Low Management and Security</b> - Snowflake is cloud-native so it has far lower maintenance than an on-prem data warehouse. It also provides a ton of security features like <a class="link" href="https://docs.snowflake.com/en/user-guide/data-time-travel?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank" rel="noopener noreferrer nofollow">Time Travel</a> that lets you access historical data versions for data recovery and auditing.</p></li><li><p class="paragraph" style="text-align:left;"><b>Data Integrations</b> - Snowflake supports both structured and semi-structured data formats like CSV, JSON, Avro, Parquet and more. It also has integrations with a ton of data tools like Kafka, Spark, Tableau, etc.</p></li></ul><p class="paragraph" style="text-align:left;">The main cons of Snowflake are vendor lock-in and (<i>potentially</i>) the costs.</p><p class="paragraph" style="text-align:left;">Snowflake’s pricing can be notoriously complex and difficult to predict. 
Not monitoring/optimizing your usage can quickly result in a very expensive surprise.</p><h2 class="heading" style="text-align:left;" id="snowflake-at-canva"><b>Snowflake at Canva</b></h2><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeH_IESlg1LAEEOaLBOpxB-PcHID38XDVb1eU92Be4qzsZm7PK3yI9aAdz16_97dgXNct8SseIPudnFrDmPbMEAy_8pIASSBsq1oM715ENciiBkGMlgpuEbDTc6zUrIcjjBKdgWSw?key=XhIYY0F08Di48z6W3kJ_cavu"/></div><p class="paragraph" style="text-align:left;">Here are the steps in Canva’s Data Platform</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Data Ingestion</b> - First party data (<i>generated by services at Canva</i>) is ingested through AWS S3. Third party data (<i>Facebook ad spend results, Google organic search data, etc.</i>) is ingested with Fivetran for ETL</p></li><li><p class="paragraph" style="text-align:left;"><b>Data Transformation</b><b> </b>- Canva uses dbt for transforming and cleaning raw data into a format that supports analysis and reporting. dbt is an open source tool where you can write your data transformation logic in SQL or Python and organize it into maintainable components. </p></li><li><p class="paragraph" style="text-align:left;"><b>Data Views</b><b> </b>- Canva uses Census to synchronize enriched datasets to third-party systems. 
They also use Looker and Mode for data visualization and exploration.</p></li></ol><h2 class="heading" style="text-align:left;" id="canvas-snowflake-monitoring-system"><b>Canva’s Snowflake Monitoring System</b></h2><p class="paragraph" style="text-align:left;">In order to avoid overspending, the Canva team built out an extensive monitoring system around Snowflake and their usage.</p><p class="paragraph" style="text-align:left;">They started by using Snowflake’s <a class="link" href="https://docs.snowflake.com/en/sql-reference/account-usage/query_history?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank" rel="noopener noreferrer nofollow">account_usage.query_history view</a> to collect usage data and they stored this in a dedicated dbt project. They also captured detailed metadata on all their other dbt runs (<i>for their main data transformations</i>) and stored this data in S3.</p><p class="paragraph" style="text-align:left;">In order to link specific Snowflake queries with the corresponding dbt models, Canva developed a custom dbt query tagging macro that appended JSON metadata to each query. 
They could use this to track the usage at a per-query level and then assign costs to individual queries and transformations.</p><p class="paragraph" style="text-align:left;">With this foundation, Canva created dashboards that provided Data Platform engineers with real-time metrics on how Snowflake was being used, which teams were incurring costs and how usage could be optimized.</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://tratt.net/laurie/blog/2023/four_kinds_of_optimisation.html?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank"><div class="embed__content"><p class="embed__title"> The Four Kinds of Optimization </p><p class="embed__description"> This article has a useful list of “mental models” you should have when thinking about optimizing your code.<br><br>The four strategies discussed are<br>1. Find a better algorithm<br>2. Use a more efficient data structure<br>3. Use a lower-level system<br>4. Accept a less precise solution<br><br>The article talks about each of these four and the trade-offs involved </p><p class="embed__link"> tratt.net/laurie/blog/2023/four_kinds_of_optimisation.html </p></div></a></div><div class="embed"><a class="embed__url" href="https://mcilloni.ovh/2023/07/23/unicode-is-hard/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank"><div class="embed__content"><p class="embed__title"> Unicode is harder than you think </p><p class="embed__description"> This is a great article that explores Unicode and talks about common misconceptions/pitfalls you may face. It talks about the different encoding formats (UTF-8, UTF-16, etc.), normalization and best practices when working with Unicode. 
</p><p class="embed__link"> mcilloni.ovh/2023/07/23/unicode-is-hard </p></div></a></div><div class="embed"><a class="embed__url" href="https://blog.daviddodda.com/how-i-automated-my-job-application-process-part-1?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank"><div class="embed__content"><p class="embed__title"> How I Automated My Job Application Process </p><p class="embed__description"> This is an interesting article by David Dodda on how he automated his job search process with LLMs. He built a system that helped him submit 250 job applications in 20 minutes.<br><br>If you’re currently hunting for a job, it might give you some ideas on tactics you can test out to make the process easier. </p><p class="embed__link"> blog.daviddodda.com/how-i-automated-my-job-application-process-part-1 </p></div></a></div><div class="embed"><a class="embed__url" href="https://newsletter.getdx.com/p/choosing-engineering-metrics?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/3fe9cd01-b2e8-41a5-870c-1506a26a013b/eba972e1-f1bc-4bac-a011-08210a42ef2e_1578x736.jpg?t=1735585901"/><div class="embed__content"><p class="embed__title"> Three-Bucket Framework for Engineering Metrics </p><p class="embed__description"> Abi Noda wrote a terrific article on how engineering leaders can effectively report metrics to stakeholders.<br><br>His three-bucket framework is<br>1. Demonstrate Business Impact with ROI<br>2. Show System Performance with Uptime, Incidents, Scale<br>3. 
Present Developer Effectiveness with SPACE and DORA </p><p class="embed__link"> newsletter.getdx.com/p/choosing-engineering-metrics </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=747600e3-5713-4e2b-8839-db7a4d557d02&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>The Architecture of Stripe&#39;s Document Database</title>
  <description>Stripe built a document database on top of MongoDB. We&#39;ll go over its architecture and why they built it. Plus, how Nginx scales, how to give constructive feedback and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/fcdd51ce-ee8c-49a9-93ce-3472f0ef459c/Untitled_Diagram.gif" length="1140805" type="image/gif"/>
  <link>https://blog.quastor.org/p/the-architecture-of-stripe-s-document-database</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/the-architecture-of-stripe-s-document-database</guid>
  <pubDate>Mon, 16 Dec 2024 22:15:00 +0000</pubDate>
  <atom:published>2024-12-16T22:15:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>The Architecture of Stripe’s Document Database</b> - Stripe wrote a great blog post describing DocDB, their internal database as a service. DocDB is built on MongoDB and stores petabytes of data.</p><ul><li><p class="paragraph" style="text-align:left;">Brief Intro to MongoDB and its Benefits</p></li><li><p class="paragraph" style="text-align:left;">Why Stripe built DocDB</p></li><li><p class="paragraph" style="text-align:left;">Architecture of DocDB and how it works</p></li><li><p class="paragraph" style="text-align:left;">Rebalancing Data Shards on DocDB for Efficiency</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Going from New Grad to Staff Engineer in 3 Years at Meta</p></li><li><p class="paragraph" style="text-align:left;">How Nginx Scales</p></li><li><p class="paragraph" style="text-align:left;">An Engineering Philosophy of “Let It Fail“</p></li><li><p class="paragraph" style="text-align:left;">How to Give Constructive Feedback to Coworkers</p></li></ul></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://workos.com/blog/the-developers-guide-to-fine-grained-authorization?utm_source=quastor&utm_medium=newsletter&utm_campaign=q42024" rel="noopener" target="_blank"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXecbkovj4J42wPIX6Z3yrzSGitrwqqzDuWTV408-oiB12a6n38kjqO7m0FlDY-lCkx9LJVW8fQtOtCoPsZHiszuvUqF1LHqqQgJnNZUq8HbXHP5yhewHxm18m9SPAjX3iP2B_S1u8-6vtzFhN_0M7SmrRw?key=WWLTacfcf4ieV0AckeDT6Q"/></a></div><h1 class="heading" 
style="text-align:left;" id="the-developers-guide-to-fine-graine"><a class="link" href="https://workos.com/blog/the-developers-guide-to-fine-grained-authorization?utm_source=quastor&utm_medium=newsletter&utm_campaign=q42024" target="_blank" rel="noopener noreferrer nofollow">The Developer’s Guide to Fine Grained Authorization</a></h1><p class="paragraph" style="text-align:left;">As your application grows in complexity, you’ll need to implement a more granular and scalable authorization system.</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://workos.com/blog/the-developers-guide-to-fine-grained-authorization?utm_source=quastor&utm_medium=newsletter&utm_campaign=q42024" target="_blank" rel="noopener noreferrer nofollow">Fine-grained authorization (FGA)</a> systems are where you define permissions <i>at the resource level</i> and give individual users permissions to read/modify specific resources (a document in Google Drive, a bank account at JP Morgan, etc.). To build an FGA system, you’ll need to provide this precision while also handling thousands of authorization requests per second.</p><p class="paragraph" style="text-align:left;">WorkOS published an <a class="link" href="https://workos.com/blog/the-developers-guide-to-fine-grained-authorization?utm_source=quastor&utm_medium=newsletter&utm_campaign=q42024" target="_blank" rel="noopener noreferrer nofollow">in-depth guide</a> that walks you through everything you need to know about FGA: from designing your data model to leveraging third-party solutions and optimizing your UI.</p><p class="paragraph" style="text-align:left;">In the guide, you’ll understand</p><ul><li><p class="paragraph" style="text-align:left;"><b>FGA Basics</b> - Grasp the relationship between users, resources and roles</p></li><li><p class="paragraph" style="text-align:left;"><b>How to Build Resolver Logic</b> - Implement custom logic to handle complex authorization scenarios</p></li><li><p class="paragraph" 
style="text-align:left;"><b>Centralized vs. Decentralized FGA</b> - How to decide the best architectural approach for your authorization system</p></li></ul><p class="paragraph" style="text-align:left;">Read the <a class="link" href="https://workos.com/blog/the-developers-guide-to-fine-grained-authorization?utm_source=quastor&utm_medium=newsletter&utm_campaign=q42024" target="_blank" rel="noopener noreferrer nofollow">full guide</a> to learn more.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://workos.com/blog/the-developers-guide-to-fine-grained-authorization?utm_source=quastor&utm_medium=newsletter&utm_campaign=q42024"><span class="button__text" style=""> Read the full Developer’s Guide to Fine-Grained Authorization </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="the-architecture-of-stripes-documen"><b>The Architecture of Stripe’s Document Database</b></h1><p class="paragraph" style="text-align:left;">Stripe is one of the largest payment processors in the world. In 2023, they processed over $1 trillion USD of payment volume, and they did this with 99.999% uptime.</p><p class="paragraph" style="text-align:left;">A crucial system that helped the company achieve this is DocDB, Stripe&#39;s internal Database as a Service.</p><p class="paragraph" style="text-align:left;">Developers at Stripe can use the API for reading/writing data (<a class="link" href="https://en.wikipedia.org/wiki/Online_transaction_processing?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-stripe-s-document-database" target="_blank" rel="noopener noreferrer nofollow">OLTP</a> reads/writes) and not have to worry about scaling compute, increasing storage, schema changes, etc. 
They can just focus on the product they&#39;re building.</p><p class="paragraph" style="text-align:left;">Stripe&#39;s Database Infrastructure team published a fantastic <a class="link" href="https://stripe.com/blog/how-stripes-document-databases-supported-99.999-uptime-with-zero-downtime-data-migrations?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-stripe-s-document-database" target="_blank" rel="noopener noreferrer nofollow">blog post</a> delving into the internals of DocDB and how it works.</p><p class="paragraph" style="text-align:left;">When Stripe was founded in 2011, the company adopted MongoDB as their online database. They found it to be easier to use than a traditional relational database.</p><p class="paragraph" style="text-align:left;">As the company grew to hundreds of terabytes of data, Stripe built DocDB on top of MongoDB to make scaling easier.</p><p class="paragraph" style="text-align:left;">DocDB handles dynamic rebalancing between shards, gives fine-grained control over data distribution, ensures data consistency during migrations and more.</p><p class="paragraph" style="text-align:left;">In this article, we&#39;ll first give a brief overview of MongoDB and then talk about how Stripe designed DocDB.</p><h2 class="heading" style="text-align:left;" id="mongo-db-overview">MongoDB Overview</h2><p class="paragraph" style="text-align:left;">MongoDB is a document-oriented database. 
It stores your data in semi-structured documents using BSON (<i>a binary format that extends JSON</i>).</p><p class="paragraph" style="text-align:left;">It was first developed in 2007 and released as an open-source database in 2009 (<i>note - in 2018, they changed their license to be more restrictive</i>).</p><p class="paragraph" style="text-align:left;">The database was created by the founders of DoubleClick, the startup that would later get acquired by Google and become Google Ads.</p><p class="paragraph" style="text-align:left;">The founders faced scalability and usability issues with traditional relational databases so they built MongoDB with certain design goals:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Developer-Friendly</b> - Relational databases store data in structured tables with relationships defined by primary and foreign keys. This doesn&#39;t naturally map to the data structures you use in an object-oriented programming language (<i>this issue is called </i><i><a class="link" href="https://en.wikipedia.org/wiki/Object%E2%80%93relational_impedance_mismatch?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-stripe-s-document-database" target="_blank" rel="noopener noreferrer nofollow">Object-Relational Impedance Mismatch</a></i>). MongoDB stores data as flexible, JSON-like documents so it&#39;s much more natural to map to objects in your code.</p></li><li><p class="paragraph" style="text-align:left;"><b>Scalability</b> - Horizontal scalability is built into the core of MongoDB and it supports range, hash and zone-based sharding. Document-oriented databases encourage <i>denormalization</i> (where all related data is embedded into a single document) to minimize joins. 
Avoiding cross-shard joins is crucial for scalability.</p></li><li><p class="paragraph" style="text-align:left;"><b>Schema Flexibility</b> - With a relational database, you need to define a schema and ensure that any data you insert follows the schema. Changing the schema means doing a database migration. On the other hand, MongoDB is <i>schemaless</i>. Each document can have different fields and the data types of those fields can vary from document to document.</p></li></ul><h2 class="heading" style="text-align:left;" id="why-stripe-built-doc-db">Why Stripe built DocDB</h2><p class="paragraph" style="text-align:left;">Stripe originally started with MongoDB. As they grew (<i>at an insanely fast rate</i>), the engineering team desired additional features.</p><p class="paragraph" style="text-align:left;">In order to utilize their database infrastructure most efficiently, they needed to transfer data between different shards in their fleet. Stripe has thousands of shards, so managing this can be very complex.</p><p class="paragraph" style="text-align:left;">They wanted a solution that gave them complete operational control, so they could move individual data chunks between shards. Additionally, this had to be done with minimal downtime and strong data consistency (Stripe is dealing with financial data). 
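</p><p class="paragraph" style="text-align:left;">To make chunk-level granularity concrete, here&#39;s a minimal Python sketch of per-chunk routing with a copy-then-flip move. The names (<i>ChunkRouter, move_chunk, etc.</i>) are hypothetical illustrations rather than Stripe&#39;s actual API, and a real migration replicates trailing writes and verifies snapshots before the route flip.</p>

```python
# Hypothetical sketch: the route map plays the role of a chunk metadata
# service, mapping each data chunk to the shard that currently owns it.
class ChunkRouter:
    def __init__(self, shard_ids):
        self.route = {}                           # chunk_id -> shard_id
        self.shards = {s: {} for s in shard_ids}  # shard_id -> {chunk_id: data}

    def write(self, chunk_id, data):
        self.shards[self.route[chunk_id]][chunk_id] = data

    def read(self, chunk_id):
        return self.shards[self.route[chunk_id]][chunk_id]

    def move_chunk(self, chunk_id, target_shard):
        source = self.route[chunk_id]
        # 1. Bulk import: copy the chunk's data onto the target shard.
        self.shards[target_shard][chunk_id] = self.shards[source][chunk_id]
        # 2. Flip the route so new reads/writes hit the target shard.
        self.route[chunk_id] = target_shard
        # 3. Finalize: drop the source copy.
        del self.shards[source][chunk_id]

router = ChunkRouter(["shard-1", "shard-7"])
router.route["chunk-42"] = "shard-1"
router.write("chunk-42", {"amount": 100})
router.move_chunk("chunk-42", "shard-7")  # one chunk moves; no shard-wide lock
print(router.read("chunk-42"))            # the data survives the move
```

<p class="paragraph" style="text-align:left;">Because every chunk routes independently, any number of chunk moves can be in flight at once, which is exactly the kind of operational control described above.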
</p><p class="paragraph" style="text-align:left;">To solve this, they built DocDB on top of MongoDB.</p><h2 class="heading" style="text-align:left;" id="architecture-of-doc-db">Architecture of DocDB</h2><p class="paragraph" style="text-align:left;">As we mentioned earlier, DocDB is a Database as a Service that Stripe engineers can use through an API.</p><div class="image"><img alt="" class="image__image" style="" src="https://images.ctfassets.net/fzn2n1nzq965/4Sl5p8FugS8KS2ZNmGjdZh/752969d3b2cf4ff37e27b22a68633c4d/Databases_Blog_Chart__900px_wide_5__2x.png?w=1080&q=80&fm=webp"/></div><p class="paragraph" style="text-align:left;">Developers can send a read/write request to DocDB and it&#39;ll first go to a Database Proxy server.</p><p class="paragraph" style="text-align:left;">The proxy server will first check for things like access controls, potential bugs, scalability, etc.</p><p class="paragraph" style="text-align:left;">Then, it&#39;ll figure out which specific data chunks are being read/modified and it&#39;ll talk to a central Chunk Metadata Service to get the locations of the specific database shards.</p><p class="paragraph" style="text-align:left;">Finally, the proxy server will send the read/write requests to the specified database shards. Each of the database shards has multiple replicas so they use a CDC (<i><a class="link" href="https://en.wikipedia.org/wiki/Change_data_capture?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-stripe-s-document-database" target="_blank" rel="noopener noreferrer nofollow">Change Data Capture</a></i>) service to replicate changes between the replicas.</p><h2 class="heading" style="text-align:left;" id="data-movement-platform">Data Movement Platform</h2><p class="paragraph" style="text-align:left;">When you&#39;re sharding horizontally, it&#39;s important to remember that your data won&#39;t be static. 
You&#39;ll need to move data between shards for expanding capacity, hardware upgrades, rebalancing hot/cold shards and more.</p><p class="paragraph" style="text-align:left;">In Stripe&#39;s case, this is especially complicated because they&#39;re handling financial data.</p><p class="paragraph" style="text-align:left;">Some of their requirements were:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Data Consistency</b> - data being migrated needs to be consistent between the source and target shards.</p></li><li><p class="paragraph" style="text-align:left;"><b>Zero Downtime</b> - any prolonged downtime is unacceptable as businesses need to process payments 24/7. Downtime should be under a few seconds so the product application can just retry the read/write request and there is minimal impact on the customer.</p></li><li><p class="paragraph" style="text-align:left;"><b>Granularity</b> - they should be able to migrate an arbitrary number of data chunks between shards without any restrictions on the number of in-flight transfers or the number of migrations a given shard can perform at once.</p></li></ul><p class="paragraph" style="text-align:left;">Here are the steps they follow for zero-downtime migrations across database shards:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Confirm Migration and build Indexes</b> - the system first registers the start of the migration in the Chunk Metadata Service. 
It also builds <a class="link" href="https://en.wikipedia.org/wiki/Database_index?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-stripe-s-document-database" target="_blank" rel="noopener noreferrer nofollow">indexes</a> on the target shards for the data chunks that are being migrated.</p></li><li><p class="paragraph" style="text-align:left;"><b>Bulk Data Import</b> - Take a snapshot of the data chunk on the original shard and copy it onto one or more target database shards.</p></li><li><p class="paragraph" style="text-align:left;"><b>Asynchronous Replication</b> - After you copy the data chunk over to the target shard, the original shard will still be getting writes. In this step, you asynchronously replicate any writes that happen on the original shard over to the target shards.</p></li><li><p class="paragraph" style="text-align:left;"><b>Correctness Check</b> - Take point-in-time snapshots of the source and target shards and compare them to ensure data completeness and correctness.</p></li><li><p class="paragraph" style="text-align:left;"><b>Traffic Switch</b> - Once the data is imported to the target shard and mutations are being properly replicated, DocDB&#39;s coordinator will switch traffic over to the target shard. Stripe does this by first blocking any new writes on the source shard. Then, they wait for the replication service to replicate any outstanding writes to the target shard. Finally, they update the route for the data chunk to point to the target shard in the Chunk Metadata Service.<br> This step takes a few seconds, so any database requests that get blocked during this process can just retry after a small timeout and get served by the new shards.</p></li><li><p class="paragraph" style="text-align:left;"><b>Finalize Migration</b> - The last step is to mark the migration as complete in the chunk metadata service. 
They can also delete the chunk data off the source shard.</p></li></ol><h2 class="heading" style="text-align:left;" id="results">Results</h2><p class="paragraph" style="text-align:left;">DocDB&#39;s ability to migrate data between shards in a consistent, granular and reliable way has made it significantly easier for Stripe to scale.</p><p class="paragraph" style="text-align:left;">In 2023, they migrated petabytes of data between shards and it helped them achieve much better utilization of their database infrastructure.</p><hr class="content_break"><div class="image"><a class="image__link" href="https://magic.beehiiv.com/v1/dfc7e9db-4293-4204-892b-c43aaf834fb0?email={{email}}&utm_source=quastor&utm_medium=email&utm_campaign=crosspromo" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/f7e3c2e0-7bbe-49fe-bc66-9c8ce1471d3b/logo__1_.jpeg?t=1734123762"/></a></div><h1 class="heading" style="text-align:left;" id="get-smarter-about-software-and-ai-i"><a class="link" href="https://magic.beehiiv.com/v1/dfc7e9db-4293-4204-892b-c43aaf834fb0?email={{email}}&utm_source=quastor&utm_medium=email&utm_campaign=crosspromo" target="_blank" rel="noopener noreferrer nofollow"><b>Get smarter about Software and AI in 5 minutes</b></a></h1><p id="save-50-hoursweek-with-deep-dives-t" class="paragraph" style="text-align:left;">Save 50+ hours/week with deep dives, trends, and tools hand-picked from 100+ sources.</p><p id="its-read-by-engineers-at-meta-googl" class="paragraph" style="text-align:left;"><a class="link" href="https://magic.beehiiv.com/v1/dfc7e9db-4293-4204-892b-c43aaf834fb0?email={{email}}&utm_source=quastor&utm_medium=email&utm_campaign=crosspromo" target="_blank" rel="noopener noreferrer nofollow">It’s read</a> by engineers at Meta, Google, Uber, Amazon, and big startups.</p><div class="button" style="text-align:center;"><a 
target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://magic.beehiiv.com/v1/dfc7e9db-4293-4204-892b-c43aaf834fb0?email={{email}}&utm_source=quastor&utm_medium=email&utm_campaign=crosspromo"><span class="button__text" style=""> Join 40,000+ engineers for 1 free email every Monday </span></a></div><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://www.developing.dev/p/new-grad-to-staff-at-meta-in-3-years?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-stripe-s-document-database" target="_blank"><div class="embed__content"><p class="embed__title"> How Evan King went from New Grad to Staff Engineer in 3 Years at Meta </p><p class="embed__description"> Evan King was able to triple his compensation at Meta in 3 years by growing from a new grad to a staff engineer.<br><br>He wrote a great blog post in developing.dev where he broke down six key principles that accelerated his growth.<br><br>Some of the principles include<br>- Working fast and using the extra time to focus on higher-level problems<br>- How to build relationships<br>- Questioning assumptions to find simpler solutions<br><br>Read the full blog post for more. 
</p><p class="embed__link"> www.developing.dev/p/new-grad-to-staff-at-meta-in-3-years </p></div></a></div><div class="embed"><a class="embed__url" href="https://newsletter.systemdesign.one/p/how-does-nginx-work?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-stripe-s-document-database" target="_blank"><img class="embed__image embed__image--top" src="https://substackcdn.com/image/fetch/w_1200,h_600,c_fill,f_jpg,q_auto:good,fl_progressive:steep,g_auto/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd034c6d7-c09a-45f6-ba38-12297d63c11c_1280x720.gif"/><div class="embed__content"><p class="embed__title"> How Nginx Was Able to Support 1 Million Concurrent Connections on a Single Server </p><p class="embed__description"> Neo Kim wrote a fantastic article breaking down how Nginx achieves its impressive scalability through parallelism and concurrency. The web server uses a master-worker model where each worker runs as a single-threaded process with an event loop, allowing it to handle multiple client requests efficiently.<br><br>To prevent blocking operations from impacting performance, workers delegate CPU-intensive tasks to a shared thread pool. Nginx also improves scalability through shared memory that lets workers share cached data, session information, and rate limiting data. </p><p class="embed__link"> newsletter.systemdesign.one/p/how-does-nginx-work </p></div></a></div><div class="embed"><a class="embed__url" href="https://www.maxcountryman.com/articles/let-it-fail?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-stripe-s-document-database" target="_blank"><div class="embed__content"><p class="embed__title"> Let It Fail </p><p class="embed__description"> This is an interesting article on how allowing controlled failures can be extremely effective for engineering organizations.<br><br>When you can’t decide between tech debt vs. 
business priorities, letting small failures happen can create organic “back pressure” that leads to better alignment between engineering and product teams.<br><br>The controlled failures helped product teams understand the importance of reducing tech debt and make the organization more resilient. </p><p class="embed__link"> www.maxcountryman.com/articles/let-it-fail </p></div></a></div><div class="embed"><a class="embed__url" href="https://alexturek.com/2022-03-18-How-to-criticize-coworkers/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-stripe-s-document-database" target="_blank"><div class="embed__content"><p class="embed__title"> How To Criticize Coworkers </p><p class="embed__description"> Need a way to give constructive feedback to your colleagues? This article outlines key principles on how to do that. Tips include<br><br>- give specific, behavior-focused feedback with concrete examples rather than making general statements about someone&#39;s character<br>- use &quot;I&quot; language to describe impact (&quot;I felt...&quot;) instead of making assumptions about others&#39; intentions<br>- look for at least two specific examples of a behavior pattern before bringing it up<br><br>Read the full article for the rest. </p><p class="embed__link"> alexturek.com/2022-03-18-How-to-criticize-coworkers </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=c741005d-7eab-4e6d-8d75-b95c6551fffe&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How Airbnb Processes a Million User Events Every Second</title>
  <description>An introduction to Apache Flink, the Lambda Architecture and the Architecture of Airbnb&#39;s platform. Plus, how Duolingo cut their AWS bill by 20%, Google&#39;s State of the Art Quantum chip and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/90e65364-67cf-49e6-b7cc-c7f97f7d7ca5/Screenshot_2024-12-09_at_11.25.15_PM.png" length="167414" type="image/png"/>
  <link>https://blog.quastor.org/p/how-airbnb-processes-a-million-user-events-every-second-f5f2</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-airbnb-processes-a-million-user-events-every-second-f5f2</guid>
  <pubDate>Wed, 11 Dec 2024 00:30:00 +0000</pubDate>
  <atom:published>2024-12-11T00:30:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How Airbnb Processes a Million User Events Every Second</b></p><ul><li><p class="paragraph" style="text-align:left;">How Airbnb built a User Events Platform to track, process and store billions of user interactions </p></li><li><p class="paragraph" style="text-align:left;">Introduction to the Lambda Architecture</p></li><li><p class="paragraph" style="text-align:left;">Overview of Apache Flink</p></li><li><p class="paragraph" style="text-align:left;">The Architecture of Airbnb’s User Events Platform</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Breaking down the Browser Rendering Process</p></li><li><p class="paragraph" style="text-align:left;">How to Maintain Code Quality in the age of AI</p></li><li><p class="paragraph" style="text-align:left;">How Duolingo cut their Cloud Spend by 20%</p></li><li><p class="paragraph" style="text-align:left;">Explaining the Modular Monolith Architecture</p></li><li><p class="paragraph" style="text-align:left;">Google’s State of the Art Quantum Chip</p></li></ul></li></ul><div class="custom_html"><iframe src="https://embeds.beehiiv.com/c8ef45bf-1b8e-469c-baf2-b51f4701e532" data-test-id="beehiiv-embed" width="100%" height="320" frameborder="0" style="border-radius: 4px; border: 2px solid #e5e7eb; margin: 0; background-color: transparent;"></iframe></div><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>The Architecture of Airbnb’s User Signals Platform</b></h1><p class="paragraph" style="text-align:left;">Airbnb is one of the largest travel platforms in the world with over 200 million active users. 
When travelers browse through the app, there are <i>millions</i> of properties/destinations that Airbnb can recommend. Small improvements in their recommendation system can result in a huge increase in bookings (<i>and hundreds of millions of dollars in revenue</i>).</p><p class="paragraph" style="text-align:left;">To provide the best recommendations, Airbnb needs to keep track of past user actions like viewing listings, favoriting experiences, starting a booking process, etc. This data needs to be processed, cleaned and stored in a database.</p><p class="paragraph" style="text-align:left;">The Airbnb team built the User Signals Platform to handle this. It ingests and processes over 1 million user events per second and stores them in a key-value database. The platform serves 70k+ queries per second to other internal teams at Airbnb that need access to this data.</p><p class="paragraph" style="text-align:left;">Last week, the Airbnb engineering team published a <a class="link" href="https://medium.com/airbnb-engineering/building-a-user-signals-platform-at-airbnb-b236078ec82b?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">terrific blog post</a> delving into how they built this platform and the design choices they made.</p><h2 class="heading" style="text-align:left;" id="user-signal-platform-goals"><b>User Signal Platform Goals</b></h2><p class="paragraph" style="text-align:left;">The Airbnb team had quite a few objectives for the User Signals Platform. 
Some of the goals were: </p><ul><li><p class="paragraph" style="text-align:left;"><b>Ingest Real-time and Historical User Data</b> - The platform should store real-time user engagement data as it occurs, but it should also allow for batch jobs that write historical user engagement data.</p></li><li><p class="paragraph" style="text-align:left;"><b>Low Latency</b> - Other services at Airbnb will be relying on the User Signals Platform for real-time user engagement data, so the platform should ingest and process new user events in under 1 second.</p></li><li><p class="paragraph" style="text-align:left;"><b>Asynchronous Computation</b> - Engineers at Airbnb should be able to run asynchronous computation jobs on the data in the User Signals platform to generate deeper insights.</p></li></ul><p class="paragraph" style="text-align:left;">In this article, we’ll talk about the architecture of the User Signals Platform and also delve into the design patterns and technologies Airbnb used.</p><h2 class="heading" style="text-align:left;" id="introduction-to-lambda-architecture"><b>Introduction to Lambda Architecture</b></h2><p class="paragraph" style="text-align:left;">The core design pattern Airbnb used for their platform is the <a class="link" href="https://en.wikipedia.org/wiki/Lambda_architecture?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">Lambda Architecture</a>. The Lambda architecture is composed of two layers:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Speed/Streaming Layer (Real-Time)</b>: Processes streaming data as it arrives, delivering low-latency, up-to-date results. 
Airbnb implements this with Apache Flink and achieves latencies less than one second.</p></li><li><p class="paragraph" style="text-align:left;"><b>Batch Layer (Offline)</b>: Periodically processes large volumes of historical data to generate more accurate or corrected views. The batch layer ensures long-term accuracy and handles late-arriving data or retrospective fixes. The batch layer will typically operate on a longer timescale, updating views every few hours.</p></li></ul><div class="image"><a class="image__link" href="#user-signal-platform-goals" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/90e65364-67cf-49e6-b7cc-c7f97f7d7ca5/Screenshot_2024-12-09_at_11.25.15_PM.png?t=1733804719"/></a><div class="image__source"><a class="image__source_link" href="https://www.geeksforgeeks.org/what-is-lambda-architecture-system-design/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" rel="noopener" target="_blank"><span class="image__source_text"><p>Credits to GeeksforGeeks</p></span></a></div></div><p class="paragraph" style="text-align:left;">By combining these two layers, the Lambda architecture provides the best of both worlds. The speed layer ensures fresh, low-latency data for online queries and personalization. 
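To make the “best of both worlds” idea concrete, here’s a minimal Python sketch of a Lambda-style read path (illustrative only, with made-up per-user event counts, and not Airbnb’s actual code): the batch view holds the periodically recomputed totals, and the speed view holds only what has arrived since the last batch run.

```python
# Toy Lambda-architecture read path (illustrative only; not Airbnb's code).
# batch_view: recomputed every few hours, authoritative for older data.
# speed_view: per-event updates for everything since the last batch run.

batch_view = {"user_1": 40, "user_2": 7}   # hypothetical per-user event counts
speed_view = {"user_1": 2, "user_3": 1}

def query(user_id):
    """Merge both layers: batch gives long-term accuracy, speed gives freshness."""
    return batch_view.get(user_id, 0) + speed_view.get(user_id, 0)

merged = {u: query(u) for u in ("user_1", "user_2", "user_3")}
# merged combines 40+2, 7+0 and 0+1 across the two layers
```

When the next batch run lands, its corrected totals replace the batch view and the corresponding speed-view entries are dropped.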
The batch layer ensures correctness, allowing retrospective updates and improvements to data quality.</p><h2 class="heading" style="text-align:left;" id="introduction-to-apache-flink"><b>Introduction to Apache Flink</b></h2><p class="paragraph" style="text-align:left;">The core technology Airbnb used for their User Signals Platform is Apache Flink, an open source engine built for processing real-time data with very low latency.</p><p class="paragraph" style="text-align:left;">Prior to Flink, data processing systems would rely on “micro-batching” to process data in “real-time”. They would collect data over a small fixed period (<i>every few seconds/minutes</i>) and then process that data as a batch job.</p><p class="paragraph" style="text-align:left;">On the other hand, Flink takes an event-driven approach. Instead of waiting for a batch window to fill up, Flink processes each event as soon as it arrives. This results in <i>much</i> lower latencies.</p><p class="paragraph" style="text-align:left;">Another benefit of Flink is that it is <i>stateful</i>. Traditional data processing systems might require an external database to maintain state across events. Flink integrates state management directly into the engine, allowing it to remember, accumulate, and update contextual information as events stream in. 
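As a rough analogy (plain Python, not the actual Flink API), a stateful operator that keeps a per-user running count inside the engine might look like this:

```python
from collections import defaultdict

# Rough analogy to Flink's keyed state (not the real Flink API): the operator
# keeps a running count per user inside the engine, no external database needed.
class KeyedCounter:
    def __init__(self):
        self.state = defaultdict(int)  # per-key state held by the operator

    def process(self, event):
        """Handle one event as it arrives; emit (key, updated aggregate)."""
        key = event["user_id"]
        self.state[key] += 1
        return key, self.state[key]

counter = KeyedCounter()
stream = [
    {"user_id": "u1", "type": "view_listing"},
    {"user_id": "u2", "type": "wishlist_add"},
    {"user_id": "u1", "type": "start_booking"},
]
outputs = [counter.process(e) for e in stream]
```

Each event is handled the moment it arrives, and the emitted aggregate is always up to date, which is exactly what the micro-batching approach gives up.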
This is great if you want to do operations like aggregations or joins across your messages.</p><p class="paragraph" style="text-align:left;">Other benefits of Apache Flink are:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Fault Tolerance</b> - Flink provides <a class="link" href="https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/checkpointing/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">checkpointing</a> mechanisms to ensure that, if a job or node fails, the system can recover to a previously consistent state. This guarantees exactly-once processing, so each event is reflected in the application’s state exactly once, even in the face of failures.</p></li><li><p class="paragraph" style="text-align:left;"><b>Understanding Event-Time</b> - Flink understands the concept of <a class="link" href="https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/time/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">“event time“ </a>(the time when an event actually occurred) instead of just processing time (when the event is processed by the system). 
This makes it much easier to handle out-of-order events or late-arriving data accurately.</p></li><li><p class="paragraph" style="text-align:left;"><b>Integration with the Ecosystem </b>- Flink is widely used and comes with connectors to all the other data tools you might be using (Kafka, Postgres, S3, etc.)</p></li></ul><h2 class="heading" style="text-align:left;" id="architecture-of-airbnbs-user-signal"><b>Architecture of Airbnb’s User Signals Platform</b></h2><p class="paragraph" style="text-align:left;">Here’s the architecture of Airbnb’s User Signals Platform</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXerK9079APCW4uSZNoqgU2T9O2NGz-TKAcplja51OFixaUAh85hU0WpGwY8gyNmsIr4jSiyzetf47Q0bNoVexrRZM-bcdop24hKfDxsZeE6CQEoq_jllUmM-NJnhhgxnKpJ1EKl4w?key=RKoPUD1asUNoVOQdiKbhl91e"/></div><p class="paragraph" style="text-align:left;">As mentioned earlier, it’s based on the Lambda Architecture, so it consists of a real-time ingestion layer and a batch layer.</p><p class="paragraph" style="text-align:left;">Here are the steps:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>User Events</b>: Guests interacting with Airbnb’s apps generate raw events when they view properties, add an experience to their wishlist, search for “rooms in London”, etc.</p></li><li><p class="paragraph" style="text-align:left;"><b>Real-Time Transformation (Speed Layer)</b>: Events flow into Kafka, where Flink jobs consume and transform them into “User Signals.”  Some transformations are just simple mappings from raw events, while others may require joining multiple events based on user ID to create richer signals.</p></li><li><p class="paragraph" style="text-align:left;"><b>KV Storage and Serving</b>: The transformed User Signals are stored in a Key-Value store with append-only writes. 
Using append-only writes helps ensure idempotency and makes data operations much simpler.</p></li><li><p class="paragraph" style="text-align:left;"><b>Batch Processing (Batch Layer)</b>: Periodic batch jobs will reprocess the historical data sets and identify any discrepancies or missed events from the speed layer. They’ll backfill the missing/incorrect data to ensure long-term data accuracy and consistency.</p></li><li><p class="paragraph" style="text-align:left;"><b>Asynchronous Computations</b>: In addition to the immediate user signals that are stored in the KV store, Airbnb has Flink jobs that consume the new user signals to generate more insights. These jobs do things like categorize users into segments or group a single user’s actions into “sessions” to get a better understanding of the user’s intent. These jobs are run asynchronously.</p></li><li><p class="paragraph" style="text-align:left;"><b>Online Queries and Services</b>: The USP service provides a way for downstream services at Airbnb to use the user signals data for their own insights.</p></li></ol><h2 class="heading" style="text-align:left;" id="results"><b>Results</b></h2><p class="paragraph" style="text-align:left;">With this setup, the User Signals Platform processes over 1 million events per second across 100+ Flink jobs. 
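As an aside on step 3 above, here’s a small sketch of why append-only writes make replays safe; the (user_id, event_id) key is a hypothetical schema for illustration, not Airbnb’s actual one:

```python
# Append-only KV writes keyed by (user_id, event_id): re-applying the same
# write is a no-op, so speed-layer retries and batch-layer backfills can
# safely overlap. (Hypothetical schema, for illustration only.)

store = {}

def append_signal(user_id, event_id, signal):
    store.setdefault((user_id, event_id), signal)  # write once, never overwrite

append_signal("u1", "e1", {"type": "view_listing"})
append_signal("u1", "e1", {"type": "view_listing"})   # duplicate delivery: no-op
append_signal("u1", "e2", {"type": "start_booking"})  # batch backfill

num_rows = len(store)  # two distinct signals, despite three writes
```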
The USP service serves over 70k queries per second to various teams/services at Airbnb.</p><div class="custom_html"><iframe src="https://embeds.beehiiv.com/c8ef45bf-1b8e-469c-baf2-b51f4701e532" data-test-id="beehiiv-embed" width="100%" height="320" frameborder="0" style="border-radius: 4px; border: 2px solid #e5e7eb; margin: 0; background-color: transparent;"></iframe></div><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://abhisaha.com/blog/exploring-browser-rendering-process/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank"><img class="embed__image embed__image--top" src="https://abhisaha.com/blog/exploring-browser-rendering-process/og.png"/><div class="embed__content"><p class="embed__title"> Breaking down the Browser Rendering Process </p><p class="embed__description"> Abhishek Saha published a fantastic blog post that talks about exactly what happens between going to “www.google.com“ and seeing the page load on your computer.<br><br>He delves into the DNS lookup, TCP/TLS handshake, Browser Rendering process and much more. The article is filled with interactive graphics to help you understand the process. 
</p><p class="embed__link"> abhisaha.com/blog/exploring-browser-rendering-process </p></div></a></div><div class="embed"><a class="embed__url" href="https://blog.google/technology/research/google-willow-quantum-chip/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank"><img class="embed__image embed__image--top" src="https://storage.googleapis.com/gweb-uniblog-publish-prod/images/04_YouTube_Thumbnail_-_Hero_Shot_1280x720.width-1300.png"/><div class="embed__content"><p class="embed__title"> Willow - Google’s state-of-the-art quantum chip </p><p class="embed__description"> Google’s Quantum AI team just announced Willow, their latest quantum processor.<br><br>Google used the Random Circuit Sampling (RCS) benchmark to measure its performance, and it was able to complete an extremely complex computation in under 5 minutes. That same computation would take the world’s fastest supercomputer 10 septillion years.<br><br>At scale, quantum computers would break many current encryption methods, so there’s a big push for quantum-resistant cryptography. 
</p><p class="embed__link"> blog.google/technology/research/google-willow-quantum-chip </p></div></a></div><div class="embed"><a class="embed__url" href="https://blog.levelupcoding.com/p/luc-66-breaking-down-modular-monolithic-architecture-blending-tradition-with-innovation?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/ac500e78-9993-40e4-8f80-f4fc6bce37a4/Modular_Monoliths_Newsletter_Version.png?t=1730542325"/><div class="embed__content"><p class="embed__title"> Explaining the Modular Monolith Architecture </p><p class="embed__description"> Modular Monoliths are becoming increasingly popular; they strike a balance between the efficiency of traditional Monoliths and the separation of Microservices.<br><br>This is a great article that delves into this pattern, its defining characteristics, and its pros and cons. 
</p><p class="embed__link"> blog.levelupcoding.com/p/luc-66-breaking-down-modular-monolithic-architecture-blending-tradition-with-innovation </p></div></a></div><div class="embed"><a class="embed__url" href="https://blog.duolingo.com/reducing-cloud-spending/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/edd159f3-e1be-4ef2-bc64-f508de4fc7f8/Screenshot_2024-12-10_at_1.46.53_AM.png?t=1733813223"/><div class="embed__content"><p class="embed__title"> How Duolingo cut their Cloud Spend by 20% </p><p class="embed__description"> Duolingo published a fantastic blog post delving into the exact strategies they used to cut their AWS cloud spend.<br><br>Some of the key optimizations included<br>- Extending cache TTLs for rarely-changing resources<br>- Reducing unnecessarily verbose logging in production<br>- Changing databases to more optimal configurations<br><br>and more. </p><p class="embed__link"> blog.duolingo.com/reducing-cloud-spending </p></div></a></div><div class="embed"><a class="embed__url" href="https://refactoring.fm/p/code-quality-in-the-age-of-ai?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank"><img class="embed__image embed__image--top" src="https://substackcdn.com/image/fetch/w_1200,h_600,c_fill,f_jpg,q_auto:good,fl_progressive:steep,g_auto/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47090301-73a3-4cf3-a2aa-531023b9b456_1326x742.png"/><div class="embed__content"><p class="embed__title"> How to Maintain Code Quality in the age of AI </p><p class="embed__description"> While AI can help write code faster, it can also create a tradeoff with control vs. quality. 
In this article, Refactoring.fm provides a six-step “Lifecycle of Quality“ process to help ensure that your codebase doesn’t suffer as you use tools like Claude or GPT-4o. </p><p class="embed__link"> refactoring.fm/p/code-quality-in-the-age-of-ai </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=1c9ea4bb-bab6-4637-8d8b-3d91c8e31f9f&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How Airbnb Processes a Million User Events Every Second</title>
  <description>An introduction to Apache Flink, the Lambda Architecture and the Architecture of Airbnb&#39;s platform. Plus, how Duolingo cut their AWS bill by 20%, Google&#39;s State of the Art Quantum chip and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/dea01670-c170-493f-9db6-fe8f23f8bd40/new_diagram.gif" length="428407" type="image/gif"/>
  <link>https://blog.quastor.org/p/how-airbnb-processes-a-million-user-events-every-second</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-airbnb-processes-a-million-user-events-every-second</guid>
  <pubDate>Tue, 10 Dec 2024 17:09:00 +0000</pubDate>
  <atom:published>2024-12-10T17:09:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How Airbnb Processes a Million User Events Every Second</b></p><ul><li><p class="paragraph" style="text-align:left;">How Airbnb built a User Events Platform to track, process and store billions of user interactions </p></li><li><p class="paragraph" style="text-align:left;">Introduction to the Lambda Architecture</p></li><li><p class="paragraph" style="text-align:left;">Overview of Apache Flink</p></li><li><p class="paragraph" style="text-align:left;">The Architecture of Airbnb’s User Events Platform</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Breaking down the Browser Rendering Process</p></li><li><p class="paragraph" style="text-align:left;">How to Maintain Code Quality in the age of AI</p></li><li><p class="paragraph" style="text-align:left;">How Duolingo cut their Cloud Spend by 20%</p></li><li><p class="paragraph" style="text-align:left;">Explaining the Modular Monolith Architecture</p></li><li><p class="paragraph" style="text-align:left;">Google’s State of the Art Quantum Chip</p></li></ul></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://dub.link/quas-dec9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXc7kqV8dbl_pchXavhHkxWSIbRwQn1WTVvajZnFb9UkSgVqqfmAv52AxCyJdfMKdRDCuFxaPG71xpFxfKgprPps24_g-4Ih0ODcgECLpYuc1DvSkY0dl_7iZpOgqyZT5wLz4TK40g?key=eXEz32zof7Iu-jmoXAT47A"/></a></div><h1 class="heading" style="text-align:left;" id="how-to-pick-technologies-for-your-t"><a class="link" 
href="https://dub.link/quas-dec9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow"><b>How to Pick Technologies for your Tech Stack</b></a></h1><p class="paragraph" style="text-align:left;">One of the hardest decisions you’ll have to make is around <i>what</i> technologies your team adopts. A wrong decision can be extremely costly and take years to reverse. On the other hand, <i>not </i>making a decision can be <i>just</i> as costly (lost revenue, poor developer productivity, etc.)</p><p class="paragraph" style="text-align:left;">Product for Engineers wrote <a class="link" href="https://dub.link/quas-dec9-ct?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">a fantastic blog post</a> on their advice for choosing technologies to adopt.</p><p class="paragraph" style="text-align:left;">Some of their tips include</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Prioritize based on set Criteria</b> - There will always be <i>some</i> shiny new toy that your team can adopt. Instead, prioritize based on problems your team is facing. This can be excessive costs, scaling challenges, or a customer need.</p></li><li><p class="paragraph" style="text-align:left;"><b>Mimic the Real World when Evaluating</b> - The engineers who will be using the technology should have significant sway in the decision. They should be able to test the technology in production (<i>safely</i>) and build proof of concepts before deciding.</p></li><li><p class="paragraph" style="text-align:left;"><b>Ensure you consider technical AND business factors</b> - You should talk to <i>all stakeholders</i> and clarify what the set of evaluation criteria are. 
Some potential criteria include performance, cost, reliability, support, flexibility and more.</p></li></ol><p class="paragraph" style="text-align:left;">Subscribe to <a class="link" href="https://dub.link/quas-dec9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">Product for Engineers</a> for the rest of their tips on picking technologies. It’s free!</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://dub.link/quas-dec9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second"><span class="button__text" style=""> Check out Product for Engineers </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>The Architecture of Airbnb’s User Signals Platform</b></h1><p class="paragraph" style="text-align:left;">Airbnb is one of the largest travel platforms in the world with over 200 million active users. When travelers browse through the app, there are <i>millions</i> of properties/destinations that Airbnb can recommend. Small improvements in their recommendation system can result in a huge increase in bookings (<i>and hundreds of millions of dollars in revenue</i>)</p><p class="paragraph" style="text-align:left;">To provide the best recommendations, Airbnb needs to keep track of past user actions like viewing listings, favoriting experiences, starting a booking process, etc. This data needs to be processed, cleaned and stored in a database.</p><p class="paragraph" style="text-align:left;">The Airbnb team built the User Signals Platform to handle this. It ingests and processes over 1 million user events per second and stores them in a key-value database. 
The platform serves over 70,000 queries per second to other internal teams at Airbnb that need access to this data.</p><p class="paragraph" style="text-align:left;">Last week, the Airbnb engineering team published a <a class="link" href="https://medium.com/airbnb-engineering/building-a-user-signals-platform-at-airbnb-b236078ec82b?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">terrific blog post</a> delving into how they built this platform and the design choices they made.</p><h2 class="heading" style="text-align:left;" id="user-signal-platform-goals"><b>User Signal Platform Goals</b></h2><p class="paragraph" style="text-align:left;">The Airbnb team had quite a few objectives for the User Signals Platform. Some of the goals were: </p><ul><li><p class="paragraph" style="text-align:left;"><b>Ingest Real-time and Historical User Data</b> - The platform should store real-time user engagement data as it occurs, but it should also allow for batch jobs that write historical user engagement data.</p></li><li><p class="paragraph" style="text-align:left;"><b>Low Latency</b> - Other services at Airbnb will be relying on the User Signals Platform for real-time user engagement data, so the platform should ingest and process new user events in under 1 second.</p></li><li><p class="paragraph" style="text-align:left;"><b>Asynchronous Computation</b> - Engineers at Airbnb should be able to run asynchronous computation jobs on the data in the User Signals platform to generate deeper insights.</p></li></ul><p class="paragraph" style="text-align:left;">In this article, we’ll talk about the architecture of the User Signals Platform and also delve into the design patterns and technologies Airbnb used.</p><h2 class="heading" style="text-align:left;" id="introduction-to-lambda-architecture"><b>Introduction to Lambda Architecture</b></h2><p class="paragraph" 
style="text-align:left;">The core design pattern Airbnb used for their platform is the <a class="link" href="https://en.wikipedia.org/wiki/Lambda_architecture?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">Lambda Architecture</a>. The Lambda architecture is composed of two layers:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Speed/Streaming Layer (Real-Time)</b>: Processes streaming data as it arrives, delivering low-latency, up-to-date results. Airbnb implements this with Apache Flink and achieves latencies less than one second.</p></li><li><p class="paragraph" style="text-align:left;"><b>Batch Layer (Offline)</b>: Periodically processes large volumes of historical data to generate more accurate or corrected views. The batch layer ensures long-term accuracy and handles late-arriving data or retrospective fixes. The batch layer will typically operate on a longer timescale, updating views every few hours.</p></li></ul><div class="image"><a class="image__link" href="#user-signal-platform-goals" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/90e65364-67cf-49e6-b7cc-c7f97f7d7ca5/Screenshot_2024-12-09_at_11.25.15_PM.png?t=1733804719"/></a><div class="image__source"><a class="image__source_link" href="https://www.geeksforgeeks.org/what-is-lambda-architecture-system-design/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" rel="noopener" target="_blank"><span class="image__source_text"><p>Credits to GeeksforGeeks</p></span></a></div></div><p class="paragraph" style="text-align:left;">By combining these two layers, the Lambda architecture provides the best of both worlds. 
The speed layer ensures fresh, low-latency data for online queries and personalization. The batch layer ensures correctness, allowing retrospective updates and improvements to data quality.</p><h2 class="heading" style="text-align:left;" id="introduction-to-apache-flink"><b>Introduction to Apache Flink</b></h2><p class="paragraph" style="text-align:left;">The core technology Airbnb used for their User Signals Platform is Apache Flink, an open source engine built for processing real-time data with very low latency.</p><p class="paragraph" style="text-align:left;">Prior to Flink, data processing systems would rely on “micro-batching” to process data in “real-time”. They would collect data over a small fixed period (<i>every few seconds/minutes</i>) and then process that data as a batch job.</p><p class="paragraph" style="text-align:left;">On the other hand, Flink takes an event-driven approach. Instead of waiting for a batch window to fill up, Flink processes each event as soon as it arrives. This results in <i>much</i> lower latencies.</p><p class="paragraph" style="text-align:left;">Another benefit of Flink is that it is <i>stateful</i>. Traditional data processing systems might require an external database to maintain state across events. Flink integrates state management directly into the engine, allowing it to remember, accumulate, and update contextual information as events stream in. 
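For example, sessionization, a naturally stateful operation that comes up later in Airbnb’s pipeline, can be sketched in a few lines of plain Python (the 30-minute inactivity gap is an assumption for illustration, not Airbnb’s published logic):

```python
# Simplified sessionization: consecutive events from one user belong to the
# same session unless more than GAP seconds of inactivity separate them.
# (The 30-minute gap is an assumed value, not Airbnb's actual rule.)

GAP = 30 * 60  # seconds of inactivity that closes a session

def sessionize(timestamps):
    sessions, current = [], []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > GAP:
            sessions.append(current)  # gap exceeded: close the session
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

# Three events close together, then one two hours later -> two sessions.
sessions = sessionize([0, 60, 120, 7320])
```

A stateful engine like Flink can keep the `current` session in operator state and update it per event, instead of re-reading history from an external database.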
This is great if you want to do operations like aggregations or joins across your messages.</p><p class="paragraph" style="text-align:left;">Other benefits of Apache Flink are:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Fault Tolerance</b> - Flink provides <a class="link" href="https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/checkpointing/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">checkpointing</a> mechanisms to ensure that, if a job or node fails, the system can recover to a previously consistent state. This guarantees exactly-once processing, so each event is reflected in the application’s state exactly once, even in the face of failures.</p></li><li><p class="paragraph" style="text-align:left;"><b>Understanding Event-Time</b> - Flink understands the concept of <a class="link" href="https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/time/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">“event time“ </a>(the time when an event actually occurred) instead of just processing time (when the event is processed by the system). 
This makes it much easier to handle out-of-order events or late-arriving data accurately.</p></li><li><p class="paragraph" style="text-align:left;"><b>Integration with the Ecosystem </b>- Flink is widely used and comes with connectors to all the other data tools you might be using (Kafka, Postgres, S3, etc.)</p></li></ul><h2 class="heading" style="text-align:left;" id="architecture-of-airbnbs-user-signal"><b>Architecture of Airbnb’s User Signals Platform</b></h2><p class="paragraph" style="text-align:left;">Here’s the architecture of Airbnb’s User Signals Platform</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXerK9079APCW4uSZNoqgU2T9O2NGz-TKAcplja51OFixaUAh85hU0WpGwY8gyNmsIr4jSiyzetf47Q0bNoVexrRZM-bcdop24hKfDxsZeE6CQEoq_jllUmM-NJnhhgxnKpJ1EKl4w?key=RKoPUD1asUNoVOQdiKbhl91e"/></div><p class="paragraph" style="text-align:left;">As mentioned earlier, it’s based on the Lambda Architecture, so it consists of a real-time ingestion layer and a batch layer.</p><p class="paragraph" style="text-align:left;">Here are the steps:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>User Events</b>: Guests interacting with Airbnb’s apps generate raw events when they view properties, add an experience to their wishlist, search for “rooms in London”, etc.</p></li><li><p class="paragraph" style="text-align:left;"><b>Real-Time Transformation (Speed Layer)</b>: Events flow into Kafka, where Flink jobs consume and transform them into “User Signals.”  Some transformations are just simple mappings from raw events, while others may require joining multiple events based on user ID to create richer signals.</p></li><li><p class="paragraph" style="text-align:left;"><b>KV Storage and Serving</b>: The transformed User Signals are stored in a Key-Value store with append-only writes. 
Using append-only writes helps ensure idempotency and makes data operations much simpler.</p></li><li><p class="paragraph" style="text-align:left;"><b>Batch Processing (Batch Layer)</b>: Periodic batch jobs will reprocess the historical data sets and identify any discrepancies or missed events from the speed layer. They’ll backfill the missing/incorrect data to ensure long-term data accuracy and consistency.</p></li><li><p class="paragraph" style="text-align:left;"><b>Asynchronous Computations</b>: In addition to the immediate user signals that are stored in the KV store, Airbnb has Flink jobs that consume the new user signals to generate more insights. These jobs do things like categorize users into segments or group a single user’s actions into “sessions” to get a better understanding of the user’s intent. These jobs are run asynchronously.</p></li><li><p class="paragraph" style="text-align:left;"><b>Online Queries and Services</b>: The USP service provides a way for downstream services at Airbnb to use the user signals data for their own insights.</p></li></ol><h2 class="heading" style="text-align:left;" id="results"><b>Results</b></h2><p class="paragraph" style="text-align:left;">With this setup, the User Signals Platform processes over 1 million events per second across 100+ Flink jobs. 
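The event-time handling described earlier is what lets these backfills land in the right place; here’s a toy Python sketch of bucketing by when events happened rather than when they arrived (illustrative only, not Flink’s windowing API):

```python
from collections import defaultdict

# Toy event-time windowing: events are bucketed by the time they occurred,
# so a late-arriving event still lands in the correct window.
# (Illustrative sketch; not Flink's actual windowing API.)

WINDOW = 60  # one-minute tumbling windows, in seconds

def window_counts(events):
    counts = defaultdict(int)
    for e in events:  # arrival order is irrelevant; only event_time matters
        counts[e["event_time"] // WINDOW] += 1
    return dict(counts)

arrived = [
    {"event_time": 5},
    {"event_time": 70},
    {"event_time": 50},  # late arrival, processed after the 70s event
]
counts = window_counts(arrived)  # the late event is still counted in window 0
```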
The USP service serves over 70k queries per second to various teams/services at Airbnb.</p><hr class="content_break"><div class="image"><a class="image__link" href="https://dub.link/quas-dec9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeQ2aw1ltuDFOVnzV3j1WqRvXiEgBMeVuJ-uDFYT7-5iRyeuqyNFUc1QXTdah-GMjzf9H-Nssn66KuZr7lW9VRL3fXWEqENj9gqqJb2i9l9Z7VrS7j3V-9kvSYtmTZXUbHjOLVfoB7O1obrN3d0IM7ofRL4?key=eXEz32zof7Iu-jmoXAT47A"/></a></div><h1 class="heading" style="text-align:left;" id="6-common-traits-of-the-most-impactf"><a class="link" href="https://dub.link/quas-dec9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">6 Common Traits of the Most Impactful Developers</a></h1><p class="paragraph" style="text-align:left;">You’ll often hear about the mythical “10x engineer” - the go-to person on the team whenever you need a feature shipped fast. However, 10x engineers aren’t just super-technical, they also have a great sense of <i>what </i>to build.</p><p class="paragraph" style="text-align:left;">If you’re working on the wrong feature, then it doesn’t matter how fast you work. 
The company won’t see a big impact from your work.</p><p class="paragraph" style="text-align:left;">Product for Engineers wrote <a class="link" href="https://dub.link/quas-dec9-10x?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">a great article</a> delving into what makes engineers impactful and identifying six common traits that the most effective ones share.</p><p class="paragraph" style="text-align:left;">Here are a couple of the traits:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Always Prototyping and Experimenting</b> - they ship MVPs early and often, iterate quickly based on feedback, and aren’t afraid to pivot or kill features that aren’t working.</p></li><li><p class="paragraph" style="text-align:left;"><b>Are Comfortable Writing</b> - Clear writing skills are a must for documenting features, providing PR feedback, and making big technical decisions with RFCs.</p></li><li><p class="paragraph" style="text-align:left;"><b>Understand the Broader Context</b> - they understand the organization’s goals and align their decisions/work with the company’s strategy.</p></li></ol><p class="paragraph" style="text-align:left;">For the rest of the traits, check out the <a class="link" href="https://dub.link/quas-dec9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">Product for Engineers newsletter</a>.</p><p class="paragraph" style="text-align:left;">They send out fantastic articles every month to help you develop the skills you need to deliver the most impact (<i>and get promoted faster</i>).</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" 
href="https://dub.link/quas-dec9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second"><span class="button__text" style=""> Check out Product for Engineers. It’s free! </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://abhisaha.com/blog/exploring-browser-rendering-process/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank"><img class="embed__image embed__image--top" src="https://abhisaha.com/blog/exploring-browser-rendering-process/og.png"/><div class="embed__content"><p class="embed__title"> Breaking down the Browser Rendering Process </p><p class="embed__description"> Abhishek Saha published a fantastic blog post that talks about exactly what happens between going to “www.google.com“ and seeing the page load on your computer.<br><br>He delves into the DNS lookup, TCP/TLS handshake, Browser Rendering process and much more. The article is filled with interactive graphics to help you understand the process. 
</p><p class="embed__link"> abhisaha.com/blog/exploring-browser-rendering-process </p></div></a></div><div class="embed"><a class="embed__url" href="https://blog.google/technology/research/google-willow-quantum-chip/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank"><img class="embed__image embed__image--top" src="https://storage.googleapis.com/gweb-uniblog-publish-prod/images/04_YouTube_Thumbnail_-_Hero_Shot_1280x720.width-1300.png"/><div class="embed__content"><p class="embed__title"> Willow - Google’s state-of-the-art quantum chip </p><p class="embed__description"> Google’s Quantum AI team just announced Willow, their latest quantum processor.<br><br>Google used the Random Circuit Sampling (RCS) benchmark to measure its performance, and Willow was able to complete an extremely complex computation in under 5 minutes. That same computation would take the world’s fastest supercomputer 10 septillion years.<br><br>At scale, quantum computers would break many current encryption methods, so there’s a big push for quantum-resistant cryptography. 
</p><p class="embed__link"> blog.google/technology/research/google-willow-quantum-chip </p></div></a></div><div class="embed"><a class="embed__url" href="https://blog.levelupcoding.com/p/luc-66-breaking-down-modular-monolithic-architecture-blending-tradition-with-innovation?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/ac500e78-9993-40e4-8f80-f4fc6bce37a4/Modular_Monoliths_Newsletter_Version.png?t=1730542325"/><div class="embed__content"><p class="embed__title"> Explaining the Modular Monolith Architecture </p><p class="embed__description"> Modular Monoliths are becoming increasingly popular; they strike a balance between the efficiency of traditional Monoliths and the separation of Microservices.<br><br>This is a great article that delves into this pattern, its defining characteristics, and its pros and cons. 
</p><p class="embed__link"> blog.levelupcoding.com/p/luc-66-breaking-down-modular-monolithic-architecture-blending-tradition-with-innovation </p></div></a></div><div class="embed"><a class="embed__url" href="https://blog.duolingo.com/reducing-cloud-spending/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/edd159f3-e1be-4ef2-bc64-f508de4fc7f8/Screenshot_2024-12-10_at_1.46.53_AM.png?t=1733813223"/><div class="embed__content"><p class="embed__title"> How Duolingo cut their Cloud Spend by 20% </p><p class="embed__description"> Duolingo published a fantastic blog post delving into the exact strategies they used to cut their AWS cloud spend.<br><br>Some of the key optimizations included<br>- Extending cache TTLs for rarely-changing resources<br>- Reducing unnecessarily verbose logging in production<br>- Changing databases to more optimal configurations<br><br>and more. </p><p class="embed__link"> blog.duolingo.com/reducing-cloud-spending </p></div></a></div><div class="embed"><a class="embed__url" href="https://refactoring.fm/p/code-quality-in-the-age-of-ai?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank"><img class="embed__image embed__image--top" src="https://substackcdn.com/image/fetch/w_1200,h_600,c_fill,f_jpg,q_auto:good,fl_progressive:steep,g_auto/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47090301-73a3-4cf3-a2aa-531023b9b456_1326x742.png"/><div class="embed__content"><p class="embed__title"> How to Maintain Code Quality in the age of AI </p><p class="embed__description"> While AI can help write code faster, it can also create a tradeoff with control vs. quality. 
In this article, Refactoring.fm provides a six-step “Lifecycle of Quality” process to help ensure that your codebase doesn’t suffer as you use tools like Claude or gpt-4o. </p><p class="embed__link"> refactoring.fm/p/code-quality-in-the-age-of-ai </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=b9c48951-43e6-49a2-ba44-4162ebd5aecb&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
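<p class="paragraph" style="text-align:left;">As a code-level footnote to the User Signals Platform section above: the append-only, versioned write pattern from the KV storage step can be sketched in a few lines of Python. This is an illustrative sketch with assumed names (<code>write_signal</code>, <code>latest_signal</code>), not Airbnb’s actual code — the point is that replaying an event (as the batch backfill does) rewrites an identical version rather than clobbering newer data, which is what makes the writes idempotent.</p>

```python
# Illustrative sketch (not Airbnb's code) of idempotent append-only KV writes:
# each version of a signal is keyed by (user_id, signal_type, event_time),
# so replaying an event is a no-op instead of a conflicting overwrite.
from collections import defaultdict

# store[(user_id, signal_type)] maps event_time -> signal payload
store = defaultdict(dict)

def write_signal(user_id, signal_type, event_time, payload):
    """Append-only write: versions are keyed by event time, never mutated in place."""
    store[(user_id, signal_type)][event_time] = payload

def latest_signal(user_id, signal_type):
    """Serve the most recent version by event time (None if no versions exist)."""
    versions = store[(user_id, signal_type)]
    return versions[max(versions)] if versions else None

# The speed layer writes a signal...
write_signal(42, "viewed_listing", 1700000000, {"listing": "L1"})
# ...and the batch backfill later replays the same event: a no-op, not a conflict.
write_signal(42, "viewed_listing", 1700000000, {"listing": "L1"})
# A genuinely newer event adds a new version rather than mutating the old one.
write_signal(42, "viewed_listing", 1700000500, {"listing": "L2"})

print(latest_signal(42, "viewed_listing"))  # {'listing': 'L2'}
```

<p class="paragraph" style="text-align:left;">Because each version is keyed by event time, the speed layer and the batch layer can both write the same signal without coordination, and reads simply serve the latest version.</p>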
  ]]></content:encoded>
</item>

      <item>
  <title>How LinkedIn Scaled Their System to 5 Million Queries Per Second</title>
  <description>How LinkedIn used BitSets, Bloom Filters, Caching Strategies and more to Scale their Safety system to 5 million queries per second. Plus, questions you&#39;ll get asked frequently as an engineering manager and tips on how to scale a large codebase.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/05dd8350-51c8-403d-b35e-039586a84e1a/LinkedIn_Restriction_Inforcement_System_Diagram.gif" length="716309" type="image/gif"/>
  <link>https://blog.quastor.org/p/how-linkedin-scaled-their-system-to-5-million-queries-per-second-f8fe</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-linkedin-scaled-their-system-to-5-million-queries-per-second-f8fe</guid>
  <pubDate>Wed, 27 Nov 2024 18:30:00 +0000</pubDate>
  <atom:published>2024-11-27T18:30:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How LinkedIn scaled their Restrictions and Enforcement System to 5 million queries per second</b></p><ul><li><p class="paragraph" style="text-align:left;"> Introduction to BitSets and their use at LinkedIn</p></li><li><p class="paragraph" style="text-align:left;"> How Bloom Filters work and their use-cases</p></li><li><p class="paragraph" style="text-align:left;">Full Refresh-ahead Caching and the pros/cons</p></li><li><p class="paragraph" style="text-align:left;">The Architecture of LinkedIn’s System</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Latency Numbers Every Programmer Should Know (Visualized)</p></li><li><p class="paragraph" style="text-align:left;">Serving a Billion Web Requests with Boring Code</p></li><li><p class="paragraph" style="text-align:left;">Lies we tell ourselves to keep using Golang</p></li><li><p class="paragraph" style="text-align:left;">7 questions I get asked frequently as an EM</p></li><li><p class="paragraph" style="text-align:left;">How to Scale a Large Codebase </p></li></ul></li></ul><div class="custom_html"><iframe src="https://embeds.beehiiv.com/c8ef45bf-1b8e-469c-baf2-b51f4701e532" data-test-id="beehiiv-embed" width="100%" height="320" frameborder="0" style="border-radius: 4px; border: 2px solid #e5e7eb; margin: 0; background-color: transparent;"></iframe></div><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>The Architecture of LinkedIn’s Restriction Enforcement System</b></h1><p class="paragraph" style="text-align:left;">LinkedIn is the largest professional social network in the world, with over 1 billion users. 
Over 100 million messages are sent daily on the platform.</p><p class="paragraph" style="text-align:left;">With this scale, you’ll inevitably have some bad actors causing issues on the site. It might be users sending harassing/toxic messages, spammers posting about some cryptocurrency or LinkedIn influencers sharing how their camping trip made them better at B2B sales.</p><p class="paragraph" style="text-align:left;">To combat this malicious behavior, LinkedIn provides a number of safeguards like reporting inappropriate content and blocking problematic/annoying users.</p><p class="paragraph" style="text-align:left;">However, implementing these safeguards at LinkedIn’s scale introduces a ton of technical challenges.</p><p class="paragraph" style="text-align:left;">Some of the requirements LinkedIn needs for their Restrictions Enforcement system include:</p><ul><li><p class="paragraph" style="text-align:left;"><b>High QPS</b><b> </b>- The system needs to support 4-5 million queries per second. Many user actions (viewing the feed, sending messages, etc.) require checking the restricted/blocked accounts list. </p></li><li><p class="paragraph" style="text-align:left;"><b>Low Latency</b><b> </b>- Latency needs to be under 5 milliseconds. Otherwise, basic actions like refreshing your feed or sending a message would take too long.</p></li><li><p class="paragraph" style="text-align:left;">High Availability - This system needs to operate with 99.999% availability (5 9s of availability), so less than 30 seconds of downtime per month. </p></li><li><p class="paragraph" style="text-align:left;"><b>Low Ingestion Delay</b> - When a user blocks/reports an account, that should be reflected in the restrictions enforcement system immediately. 
If they refresh their feed right after, the posts from the blocked user should be immediately hidden.</p></li></ul><p class="paragraph" style="text-align:left;">Earlier this year, LinkedIn’s engineering team published a fantastic <a class="link" href="https://www.linkedin.com/blog/engineering/trust-and-safety/evolution-enforcing-our-professional-community-policies-at-scale?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank" rel="noopener noreferrer nofollow">blog post</a> detailing how they built their restrictions enforcement system. They talked about the different generations of the system’s architecture and problems they faced along the way.</p><p class="paragraph" style="text-align:left;">In our Quastor article, we’ll focus on the specific data structures and strategies LinkedIn used. We’ll explain them and delve into the pros and cons LinkedIn saw.</p><h2 class="heading" style="text-align:left;" id="bit-sets"><b>BitSets</b></h2><p class="paragraph" style="text-align:left;">One of the key data structures LinkedIn uses in their restrictions system is BitSets.</p><p class="paragraph" style="text-align:left;">A BitSet is an array of boolean values where each value only takes up 1 bit of space. A bit that’s set represents a true value whereas a bit that’s not set represents false.</p><p class="paragraph" style="text-align:left;">BitSets are <i>extremely</i> memory efficient. If you need to store boolean values for 1 billion users (<i>whether a user is restricted/not restricted</i>), you would only need 1 billion bits (approximately 125 megabytes).</p><p class="paragraph" style="text-align:left;">To give a more concrete example of how LinkedIn uses BitSets, let’s say LinkedIn needs to store restricted/unrestricted account status for 1 billion users. Each user has a memberID from 1 to 1 billion. 
</p><p class="paragraph" style="text-align:left;">To store this, they could use an array of 64-bit integers. Each integer can store the restriction status for 64 different users (<i>we need 1 bit per user</i>) so the array would hold ~15 million integers (around 125 megabytes of space). </p><p class="paragraph" style="text-align:left;">If LinkedIn needs to check whether the user with memberID <code>523234320</code> is restricted, the steps would be:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">Divide 523,234,320 by 64 to get the index in the integer array (which would be 8,175,536)</p></li><li><p class="paragraph" style="text-align:left;">Take 523,234,320 modulo 64 to get which bit to check in that integer (which would be 16)</p></li><li><p class="paragraph" style="text-align:left;">Use bitwise operations to check if that specific bit is set to 1 (restricted) or 0 (not restricted)</p></li></ol><p class="paragraph" style="text-align:left;">The time and space requirements with BitSets are very efficient. Checking whether users are restricted takes constant time (<i>since the membership lookup operations are all O(1)</i>) and the storage necessary is only around 125 megabytes.</p><h2 class="heading" style="text-align:left;" id="bloom-filters"><b>Bloom Filters</b></h2><p class="paragraph" style="text-align:left;">In addition to BitSets, the other data structure LinkedIn found useful was Bloom Filters.</p><p class="paragraph" style="text-align:left;">A Bloom filter is a probabilistic data structure that lets you quickly test whether an item might be in a set. Bloom Filters are <i>probabilistic</i>: if the filter says an item is not in the set, it definitely isn’t, but it will occasionally give false positives (<i>mistakenly saying an item is in the set when it’s not</i>).</p><p class="paragraph" style="text-align:left;">Under the hood, Bloom Filters use hashing to map items to a bit array. 
The issue is that collisions in the hashing function can cause false positives. Here’s a <a class="link" href="https://brilliant.org/wiki/bloom-filter/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank" rel="noopener noreferrer nofollow">fantastic article</a> that delves into how Bloom Filters work with visuals and a basic Python implementation. </p><p class="paragraph" style="text-align:left;">For LinkedIn, they also used Bloom Filters to quickly check whether a user’s account was restricted.  </p><p class="paragraph" style="text-align:left;">The pros were that the Bloom Filters were extremely space efficient compared to traditional caching techniques (<i>using a set or hash table</i>).</p><p class="paragraph" style="text-align:left;">The downside was the false positives. However, the Bloom Filter can be tuned to make false positives extremely rare and LinkedIn didn’t find it to be a big issue. </p><h2 class="heading" style="text-align:left;" id="full-refreshahead-caching"><b>Full Refresh-ahead Caching</b></h2><p class="paragraph" style="text-align:left;">LinkedIn explored various caching strategies to achieve their QPS and latency requirements. 
One approach was their full refresh-ahead cache.</p><p class="paragraph" style="text-align:left;">The dataset of account restrictions was quite small (<i>thanks to using the BitSet and Bloom Filter data structures</i>) so LinkedIn had each client application host store <i>all</i> restriction data in their in-memory cache.</p><p class="paragraph" style="text-align:left;">In order to maintain cache freshness, they implemented a polling mechanism where clients would periodically check for any new changes to member restrictions.</p><p class="paragraph" style="text-align:left;">This system resulted in a huge decrease in latencies but came with some downsides.</p><p class="paragraph" style="text-align:left;">The client-side memory footprint ended up being substantial and strained infrastructure. Additionally, the caches were stored in-memory so they weren’t persistent. Clients had to frequently build and rebuild this cache which put strain on LinkedIn’s underlying database.</p><h2 class="heading" style="text-align:left;" id="system-architecture"><b>System Architecture</b></h2><p class="paragraph" style="text-align:left;">Here’s the current architecture for LinkedIn’s restriction enforcement system.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/09874b2d-8446-409c-9ac6-74c8c6278f3a/1704921888416.png?t=1732730841"/></div><p class="paragraph" style="text-align:left;">The key components are</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Client Layer</b><b> </b>- Clients that use this system include the LinkedIn newsfeed, their recruiting/talent management tools and other products/services at the company. 
These clients use a REST API to query the system.</p></li><li><p class="paragraph" style="text-align:left;"><b>LinkedIn Restrictions Enforcement System</b><b> </b>- This component consists of the BitSet data structures and the main restriction enforcement system. The BitSet data structures are stored client-side and maintain a cache of all the restriction records.</p></li><li><p class="paragraph" style="text-align:left;"><b>Venice Database</b> - this is the central storage (source of truth) for all user restrictions. Venice is an open-source, horizontally scalable, eventually-consistent storage system that LinkedIn built on RocksDB. You can read more about how Venice works <a class="link" href="https://www.linkedin.com/blog/engineering/open-source/open-sourcing-venice-linkedin-s-derived-data-platform?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank" rel="noopener noreferrer nofollow">here</a>.</p></li><li><p class="paragraph" style="text-align:left;"><b>Kafka Restriction Records</b><b> </b>- When a user gets reported/blocked, the change in their account status is sent through Kafka. This allows near real-time propagation of changes.</p></li><li><p class="paragraph" style="text-align:left;"><b>Restriction Management System</b> - LinkedIn has a legacy system (<i>check the </i><i><a class="link" href="https://www.linkedin.com/blog/engineering/trust-and-safety/evolution-enforcing-our-professional-community-policies-at-scale?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank" rel="noopener noreferrer nofollow">blog post</a></i><i> for a full explanation on the previous generations</i>) of the Restriction Enforcement system that connects to Espresso to store and update blocked/restricted users. Espresso is LinkedIn’s distributed, document data store that’s built on MySQL. 
You can read more about Espresso <a class="link" href="https://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank" rel="noopener noreferrer nofollow">here</a>. </p></li></ol><div class="custom_html"><iframe src="https://embeds.beehiiv.com/c8ef45bf-1b8e-469c-baf2-b51f4701e532" data-test-id="beehiiv-embed" width="100%" height="320" frameborder="0" style="border-radius: 4px; border: 2px solid #e5e7eb; margin: 0; background-color: transparent;"></iframe></div><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://samwho.dev/numbers/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank"><img class="embed__image embed__image--top" src="https://samwho.dev/images/numbers.png"/><div class="embed__content"><p class="embed__title"> Latency Numbers Every Programmer Should Know (Visualized) </p><p class="embed__description"> You’ve probably seen the blog posts with latency numbers that every programmer should know.<br><br>These are approximate numbers for how long it takes to read from RAM vs. disk, transfer a byte from the US to Europe, compress 1 kilobyte of data, and more.<br><br>This is an awesome tool that helps you visualize those numbers and understand the orders-of-magnitude differences between some of them. 
</p><p class="embed__link"> samwho.dev/numbers </p></div></a></div><div class="embed"><a class="embed__url" href="https://notes.billmill.org/blog/2024/06/Serving_a_billion_web_requests_with_boring_code.html?utm_source=www.hungryminds.dev&utm_medium=referral&utm_campaign=data-streams-101-how-to-handle-petabytes-of-data" target="_blank"><div class="embed__content"><p class="embed__title"> Serving a Billion Web Requests with Boring Code </p><p class="embed__description"> Bill Mill is a software engineer who worked as a contractor for the US Government. During his stint, he helped build the Medicare plan compare tool on the website.<br><br>He published a fantastic blog post delving into the scale of the website, the “boring” technology the team used (Go, React, Postgres), the architectural bets he made (gRPC, modular backend) and much more. </p><p class="embed__link"> notes.billmill.org/blog/2024/06/Serving_a_billion_web_requests_with_boring_code.html </p></div></a></div><div class="embed"><a class="embed__url" href="https://fasterthanli.me/articles/lies-we-tell-ourselves-to-keep-using-golang?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank"><div class="embed__content"><p class="embed__title"> Lies we tell ourselves to keep using Golang </p><p class="embed__description"> This is an interesting article on some common justifications for using Go and why they can cause issues as your codebase grows.<br><br>Some of the criticisms the author talks about include<br>- the difficulty of integrating Go with other technologies due to its unique toolchain<br>- how Go’s “zero values” can lead to subtle bugs<br>- how Go’s “simplicity” leads to complexity in your application code<br><br>Read the full blog post for all the author’s criticisms. 
</p><p class="embed__link"> fasterthanli.me/articles/lies-we-tell-ourselves-to-keep-using-golang </p></div></a></div><div class="embed"><a class="embed__url" href="https://medium.com/one-to-n/7-questions-i-get-asked-frequently-as-an-em-dc361809d351?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank"><div class="embed__content"><p class="embed__title"> 7 questions I get asked frequently as an EM </p><p class="embed__description"> Nitin Dhar is a Senior Engineering Manager at Carta. He published an interesting blog post talking about the most common questions he gets asked about his team’s performance and how he answers the questions with data.<br><br>Key questions and metrics include<br>- KTLO (Keep The Lights On) Costs - this shows what percentage of capacity goes to maintenance<br>- MTTR (Mean Time to Recovery) - how long it takes to recover after an incident. Tracked via tools like PagerDuty<br>- Project Impact - what are some tangible metrics that came from what his team’s working on. You can use customer surveys or performance metrics to show this.<br><br>Read the full blog post for the rest of the questions and how Nitin answers them. </p><p class="embed__link"> medium.com/one-to-n/7-questions-i-get-asked-frequently-as-an-em-dc361809d351 </p></div></a></div><div class="embed"><a class="embed__url" href="https://vercel.com/blog/how-to-scale-a-large-codebase?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank"><div class="embed__content"><p class="embed__title"> How to Scale a Large Codebase </p><p class="embed__description"> Vercel is a cloud services provider and is also the maintainer of NextJS.<br><br>In this blog post, they delve into their approach for scaling their codebase (a monorepo). 
They emphasize feature flags for safe code releases, incremental builds for quick iteration and skew protection to handle version discrepancies. </p><p class="embed__link"> https://vercel.com/blog/how-to-scale-a-large-codebase </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=53656d43-7d25-475b-bf1c-2fe22895e210&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
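<p class="paragraph" style="text-align:left;">As a code-level footnote to the BitSet section above: the three lookup steps can be sketched in a few lines of Python. This is an illustrative sketch (the function names and the 1-billion-member sizing come from the article’s worked example), not LinkedIn’s production code.</p>

```python
# Hypothetical sketch of the BitSet lookup described above -- not LinkedIn's code.
# Restriction status for ~1 billion members is packed into an array of
# 64-bit words, one bit per member.

WORD_SIZE = 64
NUM_MEMBERS = 1_000_000_000

# ~15.6 million words (~125 MB of bits) cover memberIDs 1 through 1 billion.
bitset = [0] * (NUM_MEMBERS // WORD_SIZE + 1)

def restrict(member_id: int) -> None:
    """Mark a member as restricted by setting their bit."""
    word, bit = divmod(member_id, WORD_SIZE)  # steps 1 and 2: word index + bit offset
    bitset[word] |= 1 << bit

def is_restricted(member_id: int) -> bool:
    """Step 3: O(1) bitwise test of the member's bit."""
    word, bit = divmod(member_id, WORD_SIZE)
    return (bitset[word] >> bit) & 1 == 1

restrict(523_234_320)              # word index 8,175,536; bit offset 16
print(is_restricted(523_234_320))  # True
print(is_restricted(523_234_321))  # False
```

<p class="paragraph" style="text-align:left;"><code>divmod(523234320, 64)</code> returns <code>(8175536, 16)</code>, matching the index/offset arithmetic in the steps above; a production version would use a packed byte buffer rather than a Python list of integers.</p>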
  ]]></content:encoded>
</item>

      <item>
  <title>How LinkedIn Scaled Their System to 5 Million Queries Per Second</title>
  <description>How LinkedIn used BitSets, Bloom Filters, Caching Strategies and more to Scale their Safety system to 5 million queries per second. Plus, questions you&#39;ll get asked frequently as an engineering manager and tips on how to scale a large codebase.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/3496db89-bf35-41ae-84ab-e91f785a23c0/LinkedIn_Restriction_Inforcement_System_Diagram.gif" length="716309" type="image/gif"/>
  <link>https://blog.quastor.org/p/how-linkedin-scaled-their-system-to-5-million-queries-per-second</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-linkedin-scaled-their-system-to-5-million-queries-per-second</guid>
  <pubDate>Wed, 27 Nov 2024 18:20:00 +0000</pubDate>
  <atom:published>2024-11-27T18:20:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How LinkedIn scaled their Restrictions and Enforcement System to 5 million queries per second</b></p><ul><li><p class="paragraph" style="text-align:left;"> Introduction to BitSets and their use at LinkedIn</p></li><li><p class="paragraph" style="text-align:left;"> How Bloom Filters work and their use-cases</p></li><li><p class="paragraph" style="text-align:left;">Full Refresh-ahead Caching and the pros/cons</p></li><li><p class="paragraph" style="text-align:left;">The Architecture of LinkedIn’s System</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Latency Numbers Every Programmer Should Know (Visualized)</p></li><li><p class="paragraph" style="text-align:left;">Serving a Billion Web Requests with Boring Code</p></li><li><p class="paragraph" style="text-align:left;">Lies we tell ourselves to keep using Golang</p></li><li><p class="paragraph" style="text-align:left;">7 questions I get asked frequently as an EM</p></li><li><p class="paragraph" style="text-align:left;">How to Scale a Large Codebase </p></li></ul></li></ul><hr class="content_break"><div class="custom_html"><iframe src="https://embeds.beehiiv.com/c8ef45bf-1b8e-469c-baf2-b51f4701e532" data-test-id="beehiiv-embed" width="100%" height="320" frameborder="0" style="border-radius: 4px; border: 2px solid #e5e7eb; margin: 0; background-color: transparent;"></iframe></div><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>The Architecture of LinkedIn’s Restriction Enforcement System</b></h1><p class="paragraph" style="text-align:left;">LinkedIn is the largest professional social network in the world, with over 1 billion 
users. Over 100 million messages are sent daily on the platform.</p><p class="paragraph" style="text-align:left;">With this scale, you’ll inevitably have some bad actors causing issues on the site. It might be users sending harassing/toxic messages, spammers posting about some cryptocurrency or LinkedIn influencers sharing how their camping trip made them better at B2B sales.</p><p class="paragraph" style="text-align:left;">To combat this malicious behavior, LinkedIn provides a number of safeguards like reporting inappropriate content and blocking problematic/annoying users.</p><p class="paragraph" style="text-align:left;">However, implementing these safeguards at LinkedIn’s scale introduces a ton of technical challenges.</p><p class="paragraph" style="text-align:left;">Some of the requirements LinkedIn needs for their Restrictions Enforcement system include:</p><ul><li><p class="paragraph" style="text-align:left;"><b>High QPS</b><b> </b>- The system needs to support 4-5 million queries per second. Many user actions (viewing the feed, sending messages, etc.) require checking the restricted/blocked accounts list. </p></li><li><p class="paragraph" style="text-align:left;"><b>Low Latency</b><b> </b>- Latency needs to be under 5 milliseconds. Otherwise, basic actions like refreshing your feed or sending a message would take too long.</p></li><li><p class="paragraph" style="text-align:left;"><b>High Availability</b> - This system needs to operate with 99.999% availability (5 9s of availability), so less than 30 seconds of downtime per month. </p></li><li><p class="paragraph" style="text-align:left;"><b>Low Ingestion Delay</b> - When a user blocks/reports an account, that should be reflected in the restrictions enforcement system immediately. 
If they refresh their feed right after, the posts from the blocked user should be immediately hidden.</p></li></ul><p class="paragraph" style="text-align:left;">Earlier this year, LinkedIn’s engineering team published a fantastic <a class="link" href="https://www.linkedin.com/blog/engineering/trust-and-safety/evolution-enforcing-our-professional-community-policies-at-scale?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank" rel="noopener noreferrer nofollow">blog post</a> detailing how they built their restrictions enforcement system. They talked about the different generations of the system’s architecture and problems they faced along the way.</p><p class="paragraph" style="text-align:left;">In our Quastor article, we’ll focus on the specific data structures and strategies LinkedIn used. We’ll explain them and delve into the pros and cons LinkedIn saw.</p><h2 class="heading" style="text-align:left;" id="bit-sets"><b>BitSets</b></h2><p class="paragraph" style="text-align:left;">One of the key data structures LinkedIn uses in their restrictions system is BitSets.</p><p class="paragraph" style="text-align:left;">A BitSet is an array of boolean values where each value only takes up 1 bit of space. A bit that’s set represents a true value whereas a bit that’s not set represents false.</p><p class="paragraph" style="text-align:left;">BitSets are <i>extremely</i> memory efficient. If you need to store boolean values for 1 billion users (<i>whether a user is restricted/not restricted</i>), you would only need 1 billion bits (approximately 125 megabytes).</p><p class="paragraph" style="text-align:left;">To give a more concrete example of how LinkedIn uses BitSets, let’s say LinkedIn needs to store restricted/unrestricted account status for 1 billion users. Each user has a memberID from 1 to 1 billion. 
</p><p class="paragraph" style="text-align:left;">To store this, they could use an array of 64-bit integers. Each integer can store the restriction status for 64 different users (<i>we need 1 bit per user</i>) so the array would hold ~15.6 million integers (around 125 megabytes of space). </p><p class="paragraph" style="text-align:left;">If LinkedIn needs to check whether the user with memberID <code>523234320</code> is restricted, the steps would be:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">Divide 523,234,320 by 64, keeping the integer quotient, to get the index in the integer array (which would be 8,175,536)</p></li><li><p class="paragraph" style="text-align:left;">Take 523,234,320 modulo 64 to get which bit to check in that integer (which would be 16)</p></li><li><p class="paragraph" style="text-align:left;">Use bitwise operations to check if that specific bit is set to 1 (restricted) or 0 (not restricted)</p></li></ol><p class="paragraph" style="text-align:left;">The time and space requirements with BitSets are very efficient. Checking whether users are restricted takes constant time (<i>since the membership lookup operations are all O(1)</i>) and the storage necessary is only a couple hundred megabytes.</p><h2 class="heading" style="text-align:left;" id="bloom-filters"><b>Bloom Filters</b></h2><p class="paragraph" style="text-align:left;">In addition to BitSets, the other data structure LinkedIn found useful was Bloom Filters.</p><p class="paragraph" style="text-align:left;">A Bloom filter is a probabilistic data structure that lets you quickly test whether an item might be in a set. Bloom Filters are <i>probabilistic</i>: a negative result means the item is definitely not in the set, but a positive result is occasionally a false positive (<i>the filter mistakenly says an item is in the set when it’s not</i>).</p><p class="paragraph" style="text-align:left;">Under the hood, Bloom Filters use hashing to map items to a bit array. 
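The BitSet lookup steps described above can be sketched in Python. This is a minimal illustration (the class name and capacity are assumed for the example), not LinkedIn's actual implementation:

```python
# Minimal BitSet sketch following the three lookup steps above.
# Assumes member IDs run from 0 to capacity - 1.

class BitSet:
    def __init__(self, capacity: int):
        # One 64-bit word holds the restriction flags for 64 members.
        self.words = [0] * ((capacity + 63) // 64)

    def restrict(self, member_id: int) -> None:
        # Step 1: integer-divide by 64 to find the word index.
        # Step 2: take modulo 64 to find the bit within that word.
        self.words[member_id // 64] |= 1 << (member_id % 64)

    def is_restricted(self, member_id: int) -> bool:
        # Step 3: a shift plus a bitwise AND tests the specific bit.
        return (self.words[member_id // 64] >> (member_id % 64)) & 1 == 1


bits = BitSet(1_000_000_000)
bits.restrict(523_234_320)              # word index 8_175_536, bit 16
print(bits.is_restricted(523_234_320))  # True
print(bits.is_restricted(523_234_321))  # False
```

Both operations touch a single array slot, which is why the lookup is O(1) regardless of how many members are restricted.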
Collisions in the hash functions are what cause the false positives. Here’s a <a class="link" href="https://brilliant.org/wiki/bloom-filter/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank" rel="noopener noreferrer nofollow">fantastic article</a> that delves into how Bloom Filters work with visuals and a basic Python implementation. </p><p class="paragraph" style="text-align:left;">LinkedIn used Bloom Filters to quickly check whether a user’s account was restricted. </p><p class="paragraph" style="text-align:left;">The pros were that the Bloom Filters were extremely space efficient compared to traditional caching techniques (<i>using a set or hash table</i>).</p><p class="paragraph" style="text-align:left;">The downside was the false positives. However, a Bloom Filter can be tuned to make false positives extremely rare, and LinkedIn didn’t find them to be a big issue. </p><h2 class="heading" style="text-align:left;" id="full-refreshahead-caching"><b>Full Refresh-ahead Caching</b></h2><p class="paragraph" style="text-align:left;">LinkedIn explored various caching strategies to achieve their QPS and latency requirements. 
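The Bloom Filter mechanics described above can also be sketched in Python. This is a toy version: the bit-array size, the hash count, and the use of salted SHA-256 are all assumptions for illustration, not LinkedIn's implementation (real deployments size these from the expected item count and target false-positive rate):

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions by salting one hash function k ways.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means "probably present".
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))


restricted = BloomFilter()
restricted.add("member:523234320")
print(restricted.might_contain("member:523234320"))  # True (no false negatives)
```

An added item always tests positive; a false positive only happens when some other item's hashes happen to set all k of its bits, which tuning m and k makes rare.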
One approach was their full refresh-ahead cache.</p><p class="paragraph" style="text-align:left;">The dataset of account restrictions was quite small (<i>thanks to using the BitSet and Bloom Filter data structures</i>) so LinkedIn had each client application host store <i>all</i> restriction data in an in-memory cache.</p><p class="paragraph" style="text-align:left;">In order to maintain cache freshness, they implemented a polling mechanism where clients would periodically check for any new changes to member restrictions.</p><p class="paragraph" style="text-align:left;">This system resulted in a huge decrease in latencies but came with some downsides.</p><p class="paragraph" style="text-align:left;">The client-side memory footprint ended up being substantial, which strained their infrastructure. Additionally, the caches were stored in-memory so they weren’t persistent. Clients had to frequently build and rebuild this cache, which put strain on LinkedIn’s underlying database.</p><h2 class="heading" style="text-align:left;" id="system-architecture"><b>System Architecture</b></h2><p class="paragraph" style="text-align:left;">Here’s the current architecture for LinkedIn’s restriction enforcement system.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/09874b2d-8446-409c-9ac6-74c8c6278f3a/1704921888416.png?t=1732730841"/></div><p class="paragraph" style="text-align:left;">The key components are:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Client Layer</b><b> </b>- Clients that use this system include the LinkedIn newsfeed, their recruiting/talent management tools and other products/services at the company. 
These clients use a REST API to query the system.</p></li><li><p class="paragraph" style="text-align:left;"><b>LinkedIn Restrictions Enforcement System</b><b> </b>- This component consists of the BitSet data structures and the main restriction enforcement system. The BitSet data structures are stored client-side and maintain a cache of all the restriction records.</p></li><li><p class="paragraph" style="text-align:left;"><b>Venice Database</b> - this is the central storage (source of truth) for all user restrictions. Venice is an open-source, horizontally scalable, eventually-consistent storage system that LinkedIn built on RocksDB. You can read more about how Venice works <a class="link" href="https://www.linkedin.com/blog/engineering/open-source/open-sourcing-venice-linkedin-s-derived-data-platform?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank" rel="noopener noreferrer nofollow">here</a>.</p></li><li><p class="paragraph" style="text-align:left;"><b>Kafka Restriction Records</b><b> </b>- When a user gets reported/blocked, the change in their account status is sent through Kafka. This allows near real-time propagation of changes.</p></li><li><p class="paragraph" style="text-align:left;"><b>Restriction Management System</b> - LinkedIn has a legacy system (<i>check the </i><a class="link" href="https://www.linkedin.com/blog/engineering/trust-and-safety/evolution-enforcing-our-professional-community-policies-at-scale?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank" rel="noopener noreferrer nofollow"><i>blog post</i></a><i> for a full explanation on the previous generations</i>) of the Restriction Enforcement system that connects to Espresso to store and update blocked/restricted users. Espresso is LinkedIn’s distributed, document data store that’s built on MySQL. 
You can read more about Espresso <a class="link" href="https://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank" rel="noopener noreferrer nofollow">here</a>. </p></li></ol><hr class="content_break"><div class="custom_html"><iframe src="https://embeds.beehiiv.com/c8ef45bf-1b8e-469c-baf2-b51f4701e532" data-test-id="beehiiv-embed" width="100%" height="320" frameborder="0" style="border-radius: 4px; border: 2px solid #e5e7eb; margin: 0; background-color: transparent;"></iframe></div><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://samwho.dev/numbers/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank"><img class="embed__image embed__image--top" src="https://samwho.dev/images/numbers.png"/><div class="embed__content"><p class="embed__title"> Latency Numbers Every Programmer Should Know (Visualized) </p><p class="embed__description"> You’ve probably seen the blog posts with latency numbers that every programmer should know.<br><br>These are approximate numbers for how long it takes to search RAM vs. disk, transfer a byte from the US to Europe, compress 1 kilobyte of data and more.<br><br>This is an awesome tool that helps you visualize those numbers and understand the orders of magnitude difference that exists between some of them. 
</p><p class="embed__link"> samwho.dev/numbers </p></div></a></div><div class="embed"><a class="embed__url" href="https://notes.billmill.org/blog/2024/06/Serving_a_billion_web_requests_with_boring_code.html?utm_source=www.hungryminds.dev&utm_medium=referral&utm_campaign=data-streams-101-how-to-handle-petabytes-of-data" target="_blank"><div class="embed__content"><p class="embed__title"> Serving a Billion Web Requests with Boring Code </p><p class="embed__description"> Bill Mill is a software engineer who worked as a contractor for the US Government. During his stint, he helped build the medicare plan compare tool on the website.<br><br>He published a fantastic blog post delving into the scale of the website, the “boring“ technology the team used (golang, reactjs, postgres) and architectural bets he made (gRPC, modular backend) and much more. </p><p class="embed__link"> notes.billmill.org/blog/2024/06/Serving_a_billion_web_requests_with_boring_code.html </p></div></a></div><div class="embed"><a class="embed__url" href="https://fasterthanli.me/articles/lies-we-tell-ourselves-to-keep-using-golang?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank"><div class="embed__content"><p class="embed__title"> Lies we tell ourselves to keep using Golang </p><p class="embed__description"> This is an interesting article on some common justifications for using Go and why they can cause issues as your codebase grows.<br><br>Some of the criticisms the author talks about include<br>- the difficulty of integrating Go with other technologies due to its unique toolchain<br>- how Go’s “zero values“ can lead to subtle bugs<br>- Go’s “simplicity“ leads to complexity in your application code<br><br>Read the full blog post for all the author’s criticisms. 
</p><p class="embed__link"> fasterthanli.me/articles/lies-we-tell-ourselves-to-keep-using-golang </p></div></a></div><div class="embed"><a class="embed__url" href="https://medium.com/one-to-n/7-questions-i-get-asked-frequently-as-an-em-dc361809d351?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank"><div class="embed__content"><p class="embed__title"> 7 questions I get asked frequently as an EM </p><p class="embed__description"> Nitin Dhar is a Senior Engineering Manager at Carta. He published an interesting blog post talking about the most common questions he gets asked about his team’s performance and how he answers the questions with data.<br><br>Key questions and metrics include<br>- KTLO (Keep The Lights On) Costs - this shows what percentage of capacity goes to maintenance<br>- MTTR (Mean Time to Recovery) - how long it takes to recover after an incident. Tracked via tools like PagerDuty<br>- Project Impact - what are some tangible metrics that came from what his team’s working on. You can use customer surveys or performance metrics to show this.<br><br>Read the full blog post for the rest of the questions and how Nitin answers them. </p><p class="embed__link"> medium.com/one-to-n/7-questions-i-get-asked-frequently-as-an-em-dc361809d351 </p></div></a></div><div class="embed"><a class="embed__url" href="https://vercel.com/blog/how-to-scale-a-large-codebase?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank"><div class="embed__content"><p class="embed__title"> How to Scale a Large Codebase </p><p class="embed__description"> Vercel is a cloud services provider and is also the maintainer of NextJS.<br><br>In this blog post, they delve into their approach for scaling their codebase (a monorepo). 
They emphasize feature flags for safe code releases, incremental builds for quick iteration and skew protection to handle version discrepancies. </p><p class="embed__link"> https://vercel.com/blog/how-to-scale-a-large-codebase </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=ddc44129-b203-40f7-be46-5cfaea878477&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How LinkedIn uses Event Driven Architectures to Scale</title>
  <description>An introduction to EDAs and the Actor Model. Plus, how to ship projects at big tech companies, how Coinbase uses ML to predict traffic patterns and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/bccee9e1-6a15-414e-a0ee-6a32b25cca4c/unnamed__20_.png" length="148120" type="image/png"/>
  <link>https://blog.quastor.org/p/how-linkedin-uses-event-driven-architectures-to-scale-6ecc</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-linkedin-uses-event-driven-architectures-to-scale-6ecc</guid>
  <pubDate>Fri, 15 Nov 2024 10:00:00 +0000</pubDate>
  <atom:published>2024-11-15T10:00:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How LinkedIn uses Event Driven Architectures to Scale their Infrastructure</b></p><ul><li><p class="paragraph" style="text-align:left;"> An Introduction to the Actor Model and how it works</p></li><li><p class="paragraph" style="text-align:left;"> How LinkedIn collects and processes server metrics from their fleet with an Event Driven Architecture</p></li><li><p class="paragraph" style="text-align:left;">LinkedIn’s monitoring system for server consoles</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">How I Ship Projects at Big Tech Companies</p></li><li><p class="paragraph" style="text-align:left;">How Binary Vector Embeddings work and why they’re so useful</p></li><li><p class="paragraph" style="text-align:left;">How Coinbase uses ML to Predict Traffic and Auto-scale Databases</p></li></ul></li></ul><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-uses-design-patterns-"><b>How LinkedIn uses Design Patterns to Scale their Infrastructure</b></h1><p class="paragraph" style="text-align:left;">LinkedIn is the largest professional social networking platform in the world with over 950 million users in 200+ countries.</p><p class="paragraph" style="text-align:left;">To serve this user base, they maintain dozens of data centers around the world with hundreds of thousands of servers globally. 
</p><p class="paragraph" style="text-align:left;">In order to manage these servers, LinkedIn makes use of many tried-and-tested design patterns.</p><p class="paragraph" style="text-align:left;">One pattern is the Producer-Consumer pattern, commonly used in event driven architectures (EDAs).</p><p class="paragraph" style="text-align:left;">This pattern consists of three main components:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Producer</b> - generates events/messages (server metrics, status updates, data from queries, etc.)</p></li><li><p class="paragraph" style="text-align:left;"><b>Queue</b> - acts as a buffer to store messages until they’re ready to be processed. LinkedIn uses Redis, Kafka or built-in queues for this.</p></li><li><p class="paragraph" style="text-align:left;"><b>Consumer</b> - reads and processes messages from the queue</p></li></ul><p class="paragraph" style="text-align:left;">Saira Khanum is a Staff Software Engineer at LinkedIn and she wrote a fantastic <a class="link" href="https://www.linkedin.com/blog/engineering/infrastructure/how-design-patterns-power-linkedin-infrastructure?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank" rel="noopener noreferrer nofollow">blog post</a> delving into how the engineering team uses this pattern in three different systems:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">To Collect and Maintain Data from Servers for Real-time and Analytical Queries</p></li><li><p class="paragraph" style="text-align:left;">To Check Servers for Availability and Accessibility</p></li><li><p class="paragraph" style="text-align:left;">To Detect and Fix any Access Policy Violations on the Servers</p></li></ol><p class="paragraph" style="text-align:left;">We’ll explore these and talk about how LinkedIn implemented them.</p><h2 class="heading" style="text-align:left;" id="actor-pattern"><b>Actor Pattern</b></h2><p 
class="paragraph" style="text-align:left;">When building event driven architectures, LinkedIn frequently uses the <i><a class="link" href="https://www.brianstorti.com/the-actor-model/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank" rel="noopener noreferrer nofollow">Actor Pattern</a></i>. Event Driven Architectures are loosely defined; the Actor Pattern (<i>or Actor Model</i>) is one concrete way of implementing an EDA.</p><p class="paragraph" style="text-align:left;">With this model, everything is represented as an <i>actor</i>.</p><p class="paragraph" style="text-align:left;">An actor is an independent entity that can</p><ul><li><p class="paragraph" style="text-align:left;">Send messages to other actors</p></li><li><p class="paragraph" style="text-align:left;">Process messages/requests</p></li><li><p class="paragraph" style="text-align:left;">Create new actors and designate their behavior</p></li><li><p class="paragraph" style="text-align:left;">Have independent state</p></li></ul><p class="paragraph" style="text-align:left;">To give you a better sense of how this might work, here’s a <i>hypothetical</i> example of an Actor model at Uber for handling ride requests.</p><ol start="1"><li><p class="paragraph" style="text-align:left;">When a user first requests a ride, a <b>RequestActor</b> is created specifically for their request. This actor maintains the state of the request (<i>whether it’s active or canceled</i>) and coordinates the entire matching process.</p></li><li><p class="paragraph" style="text-align:left;">The RequestActor might first create a child <b>PricingActor</b> to figure out a reasonable price for the request based on the trip distance and time of day. 
The PricingActor will run internal logic based on the RequestActor’s message and return the ride price.</p></li><li><p class="paragraph" style="text-align:left;">Once it has the pricing figured out, the RequestActor will communicate with nearby DriverActors (<i>one actor per active driver on Uber</i>) by sending them ride offer messages.</p></li><li><p class="paragraph" style="text-align:left;">The DriverActor will then handle sending a notification to the Uber driver that there&#39;s someone looking for a ride. If the driver accepts the ride then the DriverActor might create a new <b>TripActor</b> to handle the ongoing ride (tracking location updates, route changes, payment processing, etc.)</p></li></ol><p class="paragraph" style="text-align:left;">If you’re looking for more details, here’s a <a class="link" href="https://www.brianstorti.com/the-actor-model/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank" rel="noopener noreferrer nofollow">fantastic article</a> that delves deeper on the Actor model.</p><p class="paragraph" style="text-align:left;">Back to LinkedIn…</p><h2 class="heading" style="text-align:left;" id="event-driven-architectures-at-linke"><b>Event Driven Architectures at LinkedIn</b></h2><p class="paragraph" style="text-align:left;">LinkedIn talks about a few systems where they’ve found EDAs useful for managing infrastructure.</p><h3 class="heading" style="text-align:left;" id="distributed-server-queries-at-linke"><b>Distributed Server Queries at LinkedIn</b></h3><p class="paragraph" style="text-align:left;">The first system is LinkedIn’s distributed server query system. This is responsible for collecting system facts (CPU/memory usage, network connections, disk space usage, etc.) 
from across the server fleet and storing them so they can be queried and analyzed.</p><p class="paragraph" style="text-align:left;">Some of the requirements are</p><ul><li><p class="paragraph" style="text-align:left;"><b>Scale</b> - the system needs to process terabytes of data from hundreds of thousands of servers in near real-time</p></li><li><p class="paragraph" style="text-align:left;"><b>Data Refresh</b> - the data needs to be collected several times every hour</p></li><li><p class="paragraph" style="text-align:left;"><b>Data Maintenance </b>- the last known good snapshot of system facts needs to be maintained for a defined retention period. (<i>after the retention period is over, the system facts need to be marked as stale</i>)</p></li></ul><p class="paragraph" style="text-align:left;">Here’s the high level architecture of the system</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdqdrDEyruFDePvPR8ky_7PZBtxfOxX8QfFRlwaNdpl8IdTmWvSDlJJpwwP0rtf6toA9EyGUsLTWUjQaBniSh5h5KJE2XiMtUwAUsVZX3aX9akazZWELyKRqLrd4VSqKEat4oE3Ew?key=_e7itvZYih2R2xyyEuPfoiQw"/></div><ol start="1"><li><p class="paragraph" style="text-align:left;">Agents (producers) are deployed across the server fleet to collect system facts</p></li><li><p class="paragraph" style="text-align:left;">These facts are sent to worker processes (using the Actor Pattern) and stored on Redis</p></li><li><p class="paragraph" style="text-align:left;">Different worker processes consume the data from Redis, process it and store it in different datastores </p></li></ol><p class="paragraph" style="text-align:left;">Some of the choices LinkedIn made were</p><ul><li><p class="paragraph" style="text-align:left;"><b>Redis</b> - LinkedIn picked Redis as the queue since they were looking for low latency. 
The messages are short-lived, and a tool like Kafka would have introduced too much overhead.</p></li><li><p class="paragraph" style="text-align:left;"><b>Actor Pattern</b><b> </b>- Workers that collect and process server metrics use the Actor pattern. They’re implemented with <a class="link" href="https://en.wikipedia.org/wiki/Gunicorn?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank" rel="noopener noreferrer nofollow">Gunicorn</a>. </p></li></ul><h3 class="heading" style="text-align:left;" id="server-console-monitoring"><b>Server Console Monitoring</b></h3><p class="paragraph" style="text-align:left;">The second system is LinkedIn’s distributed system for monitoring the server consoles across their infrastructure. Server consoles (<i>often called service processors</i>) allow administrators to manage and monitor physical servers remotely (<i>even when the server is powered off or unresponsive</i>). They’re essential for troubleshooting, rebooting and maintaining servers.</p><p class="paragraph" style="text-align:left;">LinkedIn’s monitoring system checks that these server management consoles are available and accessible.</p><p class="paragraph" style="text-align:left;">Here’s the architecture for how they do that.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/9898dc1b-5f7c-453e-a6cf-fef0f3ba9033/Screenshot_2024-11-13_at_5.51.33_PM.png?t=1731538296"/></div><ol start="1"><li><p class="paragraph" style="text-align:left;">Satellite servers run checks across the servers in the data center. Each check is handled by a separate actor.</p></li><li><p class="paragraph" style="text-align:left;">Messages from each check are passed through RabbitMQ. 
The result of each check determines if the next check should be run (<i>if the next actor should be created</i>)</p></li><li><p class="paragraph" style="text-align:left;">Final results are sent to Kafka. Consumer applications can read results for storage/analysis from the various Kafka streams.</p></li></ol><p class="paragraph" style="text-align:left;">Some of the tech choices LinkedIn made were</p><ul><li><p class="paragraph" style="text-align:left;"><b>Actor Pattern</b><b> </b>- Each check that LinkedIn has to do is an actor. The checks are done sequentially so they pass messages to each other to send results and status updates.</p></li><li><p class="paragraph" style="text-align:left;"><b>Kafka and RabbitMQ </b>- RabbitMQ is used for communication between the actors whereas Kafka is used for forwarding the final results down to the consumer applications for further processing and storage</p></li></ul><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://www.coinbase.com/blog/how-coinbase-is-using-machine-learning-to-predict?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/bccee9e1-6a15-414e-a0ee-6a32b25cca4c/unnamed__20_.png?t=1731540588"/><div class="embed__content"><p class="embed__title"> How Coinbase uses Machine Learning to Predict Traffic and Autoscale Infrastructure </p><p class="embed__description"> With crypto markets, sudden price movements can cause massive traffic surges on Coinbase.<br><br>Instead of scaling reactively based on CPU usage (which is often too late), Coinbase built an ML model that predicts traffic spikes in advance by analyzing:<br>- price volatility in major cryptocurrencies<br>- current traffic patterns and growth 
rates<br>- historical seasonal trends<br>- load testing data<br><br>This has helped them prevent system outages and reduce costs from over-provisioning resources. </p><p class="embed__link"> www.coinbase.com/blog/how-coinbase-is-using-machine-learning-to-predict </p></div></a></div><div class="embed"><a class="embed__url" href="https://www.seangoedecke.com/how-to-ship/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank"><div class="embed__content"><p class="embed__title"> How I ship projects at big tech companies </p><p class="embed__description"> Sean Goedecke is a Staff Engineer at GitHub. He wrote a terrific article on what it actually means to “ship“ projects at big tech companies.<br><br>Here’s some insights<br>- Shipping isn’t automatic - the default state of most projects is to get delayed indefinitely. Someone needs to take ownership and make sure it gets launched.<br><br>- Shipping is more than just deploying code - A project hasn’t truly shipped until important stakeholders acknowledge it. <br><br>- Deploy Early - Sean recommends deploying features behind flags as early as possible so you can catch issues early.<br><br>Read the full article for more details. 
</p><p class="embed__link"> www.seangoedecke.com/how-to-ship </p></div></a></div><div class="embed"><a class="embed__url" href="https://emschwartz.me/binary-vector-embeddings-are-so-cool/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/8ece7178-e1ff-4b54-85f8-cbf61e88f5bb/Screenshot_2024-11-13_at_6.17.46_PM.png?t=1731539892"/><div class="embed__content"><p class="embed__title"> How Binary Vector Embeddings work and why they’re so useful </p><p class="embed__description"> Vector Embeddings allow you to convert text into numbers that represent meaning. This is super useful for semantic search and similarity matching.<br><br>Traditional embeddings use 32-bit floating point numbers but binary quantization converts each number to a single bit.<br><br>With this approach, you can compress embeddings to just 3% of their original size while retaining 95%+ of the original accuracy.<br><br>Evan Schwartz wrote a fantastic article on this technique. </p><p class="embed__link"> emschwartz.me/binary-vector-embeddings-are-so-cool </p></div></a></div><div class="embed"><a class="embed__url" href="https://www.pointer.io/?utm_source=quastor&utm_medium=crosspromo" target="_blank"><div class="embed__content"><p class="embed__title"> Essential Reading For Engineering Leaders </p><p class="embed__description"> If you find Quastor useful, you should check out Pointer.<br><br>It’s essential reading for engineering leaders to hone their soft skills. 
They send out super high quality engineering-related content twice a week.<br><br>Sign Up for Free!<br><br>(cross promo) </p><p class="embed__link"> www.pointer.io/?utm_source=quastor&utm_medium=crosspromo </p></div><img class="embed__image embed__image--right" src="http://www.pointer.io/static/images/social-og.png"/></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=951773ec-745b-44d4-ada1-c976cdb8b2c2&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How LinkedIn uses Event Driven Architectures to Scale</title>
  <description>An introduction to EDAs and the Actor Model. Plus, how to ship projects at big tech companies, how Coinbase uses ML to predict traffic patterns and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/455e84df-9ef3-45ff-ade2-e9c9623dd02e/ezgif-3-97abac12e7.gif" length="1604502" type="image/gif"/>
  <link>https://blog.quastor.org/p/how-linkedin-uses-event-driven-architectures-to-scale</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-linkedin-uses-event-driven-architectures-to-scale</guid>
  <pubDate>Thu, 14 Nov 2024 14:00:00 +0000</pubDate>
  <atom:published>2024-11-14T14:00:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How LinkedIn uses Event Driven Architectures to Scale their Infrastructure</b></p><ul><li><p class="paragraph" style="text-align:left;"> An Introduction to the Actor Model and how it works</p></li><li><p class="paragraph" style="text-align:left;"> How LinkedIn collects and processes server metrics from their fleet with an Event Driven Architecture</p></li><li><p class="paragraph" style="text-align:left;">LinkedIn’s monitoring system for server consoles</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">How I Ship Projects at Big Tech Companies</p></li><li><p class="paragraph" style="text-align:left;">How Binary Vector Embeddings work and why they’re so useful</p></li><li><p class="paragraph" style="text-align:left;">How Coinbase uses ML to Predict Traffic and Auto-scale Databases</p></li></ul></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXc4AKer6fUop8MY7HQ6c2KSlF0ESztrdXX7kCfyjBhe5twElv9H8aIGEu6qrMA8K9IT8NhVyxz9xFcvK-29eSFrB3J_wXxWF9wn3aaSKhrC2ILsqMZNHyCr7rvGKOCpskIYTaLQaEPXql8wjukmyP8kvcAQ?key=Kc1z5OwJrg6DL2XaUhbRcQ"/></a></div><h1 class="heading" style="text-align:left;" id="how-to-build-ai-agents-that-interac"><a class="link" 
href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial" target="_blank" rel="noopener noreferrer nofollow">How to build AI Agents that interact with Slack and Salesforce in your Product</a></h1><p class="paragraph" style="text-align:left;">Interested in building <a class="link" href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial" target="_blank" rel="noopener noreferrer nofollow">AI agents</a> into your product?</p><p class="paragraph" style="text-align:left;">Your AI agent may need to automate tasks that take place outside your application, in your user’s third-party apps.</p><p class="paragraph" style="text-align:left;">This could be reading data from the third party (<i>checking inventory in Shopify, a ticket’s status in Jira, etc.</i>) and then writing data in those third-party apps (<i>Slack, Salesforce, Jira, etc.</i>)</p><p class="paragraph" style="text-align:left;">That’s where <a class="link" href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial" target="_blank" rel="noopener noreferrer nofollow">AI actions and function tools</a> come into play.</p><p class="paragraph" style="text-align:left;">In this tutorial and demo, you’ll learn how to equip your AI agent with AI actions such as:</p><ul><li><p class="paragraph" style="text-align:left;">Sending a Slack message</p></li><li><p class="paragraph" style="text-align:left;">Creating/Updating a Record in Salesforce</p></li><li><p class="paragraph" style="text-align:left;">Chaining actions together</p></li></ul><p class="paragraph" style="text-align:left;">Read/watch the tutorial and access the 
<a class="link" href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial" target="_blank" rel="noopener noreferrer nofollow">GitHub Repo</a> below.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial"><span class="button__text" style=""> Access the Tutorial and GitHub Repo </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-uses-design-patterns-"><b>How LinkedIn uses Design Patterns to Scale their Infrastructure</b></h1><p class="paragraph" style="text-align:left;">LinkedIn is the largest professional social networking platform in the world with over 950 million users in 200+ countries.</p><p class="paragraph" style="text-align:left;">To serve this user base, they maintain dozens of data centers around the world with hundreds of thousands of servers globally. </p><p class="paragraph" style="text-align:left;">In order to manage these servers, LinkedIn makes use of many tried-and-tested design patterns.</p><p class="paragraph" style="text-align:left;">One pattern is the Producer-Consumer pattern, commonly used in event driven architectures (EDAs).</p><p class="paragraph" style="text-align:left;">This pattern consists of three main components:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Producer</b> - generates events/messages (server metrics, status updates, data from queries, etc.)</p></li><li><p class="paragraph" style="text-align:left;"><b>Queue</b> - acts as a buffer to store messages until they’re ready to be processed. 
LinkedIn uses Redis, Kafka or built-in queues for this.</p></li><li><p class="paragraph" style="text-align:left;"><b>Consumer</b> - reads and processes messages from the queue</p></li></ul><p class="paragraph" style="text-align:left;">Saira Khanum is a Staff Software Engineer at LinkedIn and she wrote a fantastic <a class="link" href="https://www.linkedin.com/blog/engineering/infrastructure/how-design-patterns-power-linkedin-infrastructure?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank" rel="noopener noreferrer nofollow">blog post</a> delving into how the engineering team uses this pattern in three different systems:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">To Collect and Maintain Data from Servers for Real-time and Analytical Queries</p></li><li><p class="paragraph" style="text-align:left;">To Check Servers for Availability and Accessibility</p></li><li><p class="paragraph" style="text-align:left;">To Detect and Fix any Access Policy Violations on the Servers</p></li></ol><p class="paragraph" style="text-align:left;">We’ll explore these and talk about how LinkedIn implemented them.</p><h2 class="heading" style="text-align:left;" id="actor-pattern"><b>Actor Pattern</b></h2><p class="paragraph" style="text-align:left;">When building event driven architectures, LinkedIn frequently uses the <a class="link" href="https://www.brianstorti.com/the-actor-model/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank" rel="noopener noreferrer nofollow"><i>Actor Pattern</i></a>. 
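</p><p class="paragraph" style="text-align:left;">As a rough illustration, the producer/queue/consumer components listed earlier can be sketched in a few lines. This is a hypothetical, in-process Python example - the standard library’s thread-safe queue stands in for Redis or Kafka, and the host names and metric values are made up:</p>

```python
import queue
import threading

# Queue - buffers messages until the consumer is ready (stand-in for Redis/Kafka)
message_queue = queue.Queue()

def producer():
    # Producer - generates events/messages (e.g. server metrics)
    for metric in [{"host": "server-1", "cpu": 0.72}, {"host": "server-2", "cpu": 0.31}]:
        message_queue.put(metric)
    message_queue.put(None)  # sentinel: tells the consumer there's nothing left

def consumer(results):
    # Consumer - reads and processes messages from the queue
    while (msg := message_queue.get()) is not None:
        results.append(f"{msg['host']}: cpu={msg['cpu']:.0%}")

results = []
threads = [threading.Thread(target=producer),
           threading.Thread(target=consumer, args=(results,))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

<p class="paragraph" style="text-align:left;">Because the queue is the only shared state, the producer and consumer can be scaled independently - the same property LinkedIn gets from Redis and Kafka at a much larger scale.</p><p class="paragraph" style="text-align:left;">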
Event Driven Architectures are loosely defined; the Actor Pattern (<i>or Actor Model</i>) is one specific way of implementing an EDA.</p><p class="paragraph" style="text-align:left;">With this model, everything is represented as an <i>actor</i>.</p><p class="paragraph" style="text-align:left;">An actor is an independent entity that can</p><ul><li><p class="paragraph" style="text-align:left;">Send messages to other actors</p></li><li><p class="paragraph" style="text-align:left;">Process messages/requests</p></li><li><p class="paragraph" style="text-align:left;">Create new actors and designate their behavior</p></li><li><p class="paragraph" style="text-align:left;">Have independent state</p></li></ul><p class="paragraph" style="text-align:left;">To give you a better sense of how this might work, here’s a <i>hypothetical</i> example of an Actor model at Uber for handling ride requests.</p><ol start="1"><li><p class="paragraph" style="text-align:left;">When a user first requests a ride, a <b>RequestActor</b> is created specifically for their request. This actor maintains the state of the request (<i>whether it’s active or canceled</i>) and coordinates the entire matching process.</p></li><li><p class="paragraph" style="text-align:left;">The RequestActor might first create a child <b>PricingActor</b> to figure out a reasonable price for the request based on the trip distance and time of day. The PricingActor will run internal logic based on the RequestActor’s message and return the ride price.</p></li><li><p class="paragraph" style="text-align:left;">Once it has the pricing figured out, the RequestActor will communicate with nearby DriverActors (<i>one actor per active driver on Uber</i>) by sending them ride offer messages.</p></li><li><p class="paragraph" style="text-align:left;">The DriverActor will then handle sending a notification to the Uber driver that there&#39;s someone looking for a ride. 
If the driver accepts the ride, the DriverActor might create a new <b>TripActor</b> to handle the ongoing ride (tracking location updates, route changes, payment processing, etc.)</p></li></ol><p class="paragraph" style="text-align:left;">If you’re looking for more details, here’s a <a class="link" href="https://www.brianstorti.com/the-actor-model/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank" rel="noopener noreferrer nofollow">fantastic article</a> that delves deeper into the Actor model.</p><p class="paragraph" style="text-align:left;">Back to LinkedIn…</p><h2 class="heading" style="text-align:left;" id="event-driven-architectures-at-linke"><b>Event Driven Architectures at LinkedIn</b></h2><p class="paragraph" style="text-align:left;">LinkedIn talks about a few systems where they’ve found EDAs useful for managing infrastructure.</p><h3 class="heading" style="text-align:left;" id="distributed-server-queries-at-linke"><b>Distributed Server Queries at LinkedIn</b></h3><p class="paragraph" style="text-align:left;">The first system is LinkedIn’s distributed server query system. This is responsible for collecting system facts (CPU/memory usage, network connections, disk space usage, etc.) from across the server fleet and storing them so they can be queried and analyzed.</p><p class="paragraph" style="text-align:left;">Some of the requirements are</p><ul><li><p class="paragraph" style="text-align:left;"><b>Scale</b> - the system needs to process terabytes of data from hundreds of thousands of servers in near real-time</p></li><li><p class="paragraph" style="text-align:left;"><b>Data Refresh</b> - the data needs to be collected several times every hour</p></li><li><p class="paragraph" style="text-align:left;"><b>Data Maintenance </b>- the last known good snapshot of system facts needs to be maintained for a defined retention period. 
(<i>after the retention period is over, the system facts need to be marked as stale</i>)</p></li></ul><p class="paragraph" style="text-align:left;">Here’s the high level architecture of the system</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdqdrDEyruFDePvPR8ky_7PZBtxfOxX8QfFRlwaNdpl8IdTmWvSDlJJpwwP0rtf6toA9EyGUsLTWUjQaBniSh5h5KJE2XiMtUwAUsVZX3aX9akazZWELyKRqLrd4VSqKEat4oE3Ew?key=_e7itvZYih2R2xyyEuPfoiQw"/></div><ol start="1"><li><p class="paragraph" style="text-align:left;">Agents (producers) are deployed across the server fleet to collect system facts</p></li><li><p class="paragraph" style="text-align:left;">These facts are sent to worker processes (using the Actor Pattern) and stored on Redis</p></li><li><p class="paragraph" style="text-align:left;">Different worker processes consume the data from Redis, process it and store it in different datastores </p></li></ol><p class="paragraph" style="text-align:left;">Some of the choices LinkedIn made were</p><ul><li><p class="paragraph" style="text-align:left;"><b>Redis</b> - LinkedIn picked Redis as the queue since they were looking for low latency. The messages are short-lived and introducing a tool like Kafka would introduce too much overhead.</p></li><li><p class="paragraph" style="text-align:left;"><b>Actor Pattern</b><b> </b>- Workers that collect and process server metrics use the Actor pattern. They’re implemented with <a class="link" href="https://en.wikipedia.org/wiki/Gunicorn?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank" rel="noopener noreferrer nofollow">Gunicorn</a>. </p></li></ul><h3 class="heading" style="text-align:left;" id="server-console-monitoring"><b>Server Console Monitoring</b></h3><p class="paragraph" style="text-align:left;">The second system is LinkedIn’s distributed system to monitor the server console for their infrastructure. 
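</p><p class="paragraph" style="text-align:left;">Before digging in, the first system’s pipeline - agents producing facts, workers with mailboxes consuming them - can be sketched as a toy actor in Python. This is a hypothetical, in-process example: an in-memory queue stands in for Redis, and a plain dict stands in for the downstream datastores:</p>

```python
import queue

class Actor:
    """Toy actor: independent state plus a mailbox of incoming messages."""
    def __init__(self, handler):
        self.mailbox = queue.Queue()  # stand-in for the Redis queue
        self.handler = handler

    def send(self, msg):
        self.mailbox.put(msg)

    def run(self):
        # Process every queued message, one at a time
        while not self.mailbox.empty():
            self.handler(self.mailbox.get())

datastore = {}  # stand-in for the downstream datastores

# Worker actor that consumes collected facts and writes them to the store
store_worker = Actor(lambda fact: datastore.update({fact["host"]: fact}))

# Agents (producers) across the fleet send their system facts as messages
for host in ["server-1", "server-2"]:
    store_worker.send({"host": host, "cpu": 0.5, "disk_free_gb": 120})

store_worker.run()
```

<p class="paragraph" style="text-align:left;">In the real system the mailbox is a Redis queue and the workers are Gunicorn processes, but the shape is the same: each actor owns its state and communicates only through messages.</p><p class="paragraph" style="text-align:left;">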
Server consoles (<i>often called service processors</i>) allow administrators to manage and monitor physical servers remotely (<i>even when the server is powered off or unresponsive</i>). They’re essential for troubleshooting, rebooting and maintaining servers.</p><p class="paragraph" style="text-align:left;">LinkedIn’s monitoring system checks that these server management consoles are available and accessible.</p><p class="paragraph" style="text-align:left;">Here’s the architecture for how they do that.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/9898dc1b-5f7c-453e-a6cf-fef0f3ba9033/Screenshot_2024-11-13_at_5.51.33_PM.png?t=1731538296"/></div><ol start="1"><li><p class="paragraph" style="text-align:left;">Satellite servers run checks across the servers in the data center. Each check is handled by a separate actor.</p></li><li><p class="paragraph" style="text-align:left;">Messages from each check are passed through RabbitMQ. The result of each check determines if the next check should be run (<i>if the next actor should be created</i>)</p></li><li><p class="paragraph" style="text-align:left;">Final results are sent to Kafka. Consumer applications can read results for storage/analysis from the various Kafka streams.</p></li></ol><p class="paragraph" style="text-align:left;">Some of the tech choices LinkedIn made were</p><ul><li><p class="paragraph" style="text-align:left;"><b>Actor Pattern</b><b> </b>- Each check that LinkedIn has to do is an actor. 
The checks are done sequentially so they pass messages to each other to send results and status updates.</p></li><li><p class="paragraph" style="text-align:left;"><b>Kafka and RabbitMQ </b>- RabbitMQ is used for communication between the actors whereas Kafka is used for forwarding the final results down to the consumer applications for further processing and storage</p></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial" rel="noopener" target="_blank"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXc4AKer6fUop8MY7HQ6c2KSlF0ESztrdXX7kCfyjBhe5twElv9H8aIGEu6qrMA8K9IT8NhVyxz9xFcvK-29eSFrB3J_wXxWF9wn3aaSKhrC2ILsqMZNHyCr7rvGKOCpskIYTaLQaEPXql8wjukmyP8kvcAQ?key=Kc1z5OwJrg6DL2XaUhbRcQ"/></a></div><h1 class="heading" style="text-align:left;" id="how-to-build-ai-agents-that-interac"><a class="link" href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial" target="_blank" rel="noopener noreferrer nofollow">How to build AI Agents that interact with Slack and Salesforce in your Product</a></h1><p class="paragraph" style="text-align:left;">Interested in building <a class="link" href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial" target="_blank" rel="noopener noreferrer nofollow">AI agents</a> into your product?</p><p class="paragraph" style="text-align:left;">Your AI agent may need to automate tasks that take place 
outside your application, in your user’s third-party apps.</p><p class="paragraph" style="text-align:left;">This could be reading data from the third party (<i>checking inventory in Shopify, a ticket’s status in Jira, etc.</i>) and then writing data in those third-party apps (<i>Slack, Salesforce, Jira, etc.</i>)</p><p class="paragraph" style="text-align:left;">That’s where <a class="link" href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial" target="_blank" rel="noopener noreferrer nofollow">AI actions and function tools</a> come into play.</p><p class="paragraph" style="text-align:left;">In this tutorial and demo, you’ll learn how to equip your AI agent with AI actions such as:</p><ul><li><p class="paragraph" style="text-align:left;">Sending a Slack message</p></li><li><p class="paragraph" style="text-align:left;">Creating/Updating a Record in Salesforce</p></li><li><p class="paragraph" style="text-align:left;">Chaining actions together</p></li></ul><p class="paragraph" style="text-align:left;">Read/watch the tutorial and access the <a class="link" href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial" target="_blank" rel="noopener noreferrer nofollow">GitHub Repo</a> below.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial"><span class="button__text" style=""> Access the Tutorial and GitHub Repo </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 
class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://www.coinbase.com/blog/how-coinbase-is-using-machine-learning-to-predict?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/bccee9e1-6a15-414e-a0ee-6a32b25cca4c/unnamed__20_.png?t=1731540588"/><div class="embed__content"><p class="embed__title"> How Coinbase uses Machine Learning to Predict Traffic and Autoscale Infrastructure </p><p class="embed__description"> With crypto markets, sudden price movements can cause massive traffic surges on Coinbase.<br><br>Instead of scaling reactively based on CPU usage (which is often too late), Coinbase built an ML model that predicts traffic spikes in advance by analyzing:<br>- price volatility in major cryptocurrencies<br>- current traffic patterns and growth rates<br>- historical seasonal trends<br>- load testing data<br><br>This has helped them prevent system outages and reduce costs from over-provisioning resources. </p><p class="embed__link"> www.coinbase.com/blog/how-coinbase-is-using-machine-learning-to-predict </p></div></a></div><div class="embed"><a class="embed__url" href="https://www.seangoedecke.com/how-to-ship/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank"><div class="embed__content"><p class="embed__title"> How I ship projects at big tech companies </p><p class="embed__description"> Sean Goedecke is a Staff Engineer at GitHub. He wrote a terrific article on what it actually means to “ship” projects at big tech companies.<br><br>Here are some insights:<br>- Shipping isn’t automatic - the default state of most projects is to get delayed indefinitely. 
Someone needs to take ownership and make sure it gets launched.<br><br>- Shipping is more than just deploying code - A project hasn’t truly shipped until important stakeholders acknowledge it. <br><br>- Deploy Early - Sean recommends deploying features behind flags as early as possible so you can catch issues early.<br><br>Read the full article for more details. </p><p class="embed__link"> www.seangoedecke.com/how-to-ship </p></div></a></div><div class="embed"><a class="embed__url" href="https://emschwartz.me/binary-vector-embeddings-are-so-cool/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/8ece7178-e1ff-4b54-85f8-cbf61e88f5bb/Screenshot_2024-11-13_at_6.17.46_PM.png?t=1731539892"/><div class="embed__content"><p class="embed__title"> How Binary Vector Embeddings work and why they’re so useful </p><p class="embed__description"> Vector Embeddings allow you to convert text into numbers that represent meaning. This is super useful for semantic search and similarity matching.<br><br>Traditional embeddings use 32-bit floating point numbers but binary quantization converts each number to a single bit.<br><br>With this approach, you can compress embeddings to just 3% of their original size while retaining 95%+ of the original accuracy.<br><br>Evan Schwartz wrote a fantastic article on this technique. 
</p><p class="embed__link"> emschwartz.me/binary-vector-embeddings-are-so-cool </p></div></a></div><div class="embed"><a class="embed__url" href="https://www.pointer.io/?utm_source=quastor&utm_medium=crosspromo" target="_blank"><div class="embed__content"><p class="embed__title"> Essential Reading For Engineering Leaders </p><p class="embed__description"> If you find Quastor useful, you should check out Pointer.<br><br>It’s essential reading for engineering leaders to hone their soft skills. They send out super high quality engineering-related content twice a week.<br><br>Sign Up for Free!<br><br>(cross promo) </p><p class="embed__link"> www.pointer.io/?utm_source=quastor&utm_medium=crosspromo </p></div><img class="embed__image embed__image--right" src="http://www.pointer.io/static/images/social-og.png"/></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=d60ead66-ff40-4c57-828b-669ab2e8db6f&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How Reddit built a Metadata Store that Handles 100k Reads per Second</title>
  <description>We&#39;ll talk about the design of Reddit&#39;s Metadata Store and the tech behind it. Plus, non-llm software trends to be excited about, how Python&#39;s Asyncio works, how GitLab automates engineering management and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/ce604d24-0895-423c-807d-8cc038c0d904/Screenshot_2024-05-07_at_4.39.21_PM.png" length="20395" type="image/png"/>
  <link>https://blog.quastor.org/p/how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second-9434</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second-9434</guid>
  <pubDate>Wed, 13 Nov 2024 20:12:00 +0000</pubDate>
  <atom:published>2024-11-13T20:12:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How Reddit built a Metadata store that handles 100k reads per second</b></p><ul><li><p class="paragraph" style="text-align:left;">High level goals of Reddit’s metadata store</p></li><li><p class="paragraph" style="text-align:left;">Picking Sharded Postgres vs. Cassandra</p></li><li><p class="paragraph" style="text-align:left;">Data Migration Process</p></li><li><p class="paragraph" style="text-align:left;">Scaling with sharding and denormalization</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">5 Non-LLM Software Trends To Be Excited About</p></li><li><p class="paragraph" style="text-align:left;">How Python Asyncio Works</p></li><li><p class="paragraph" style="text-align:left;">Writing a File System from Scratch in Rust</p></li><li><p class="paragraph" style="text-align:left;">How GitLab automates engineering management</p></li></ul></li></ul><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>How Reddit built a Metadata store that Handles 100k Reads per Second</b></h1><p class="paragraph" style="text-align:left;">Over the past few years, Reddit has seen their user-base <i>triple</i> in size. They went from 430 million monthly active users in 2019 to 1.2 billion in 2024.</p><p class="paragraph" style="text-align:left;">The good news with all this growth is that they finally IPO’d earlier this year and let employees cash in on their stock options. The bad news is that the engineering team had to deal with a bunch of headaches.</p><p class="paragraph" style="text-align:left;">One issue that Reddit faced was with their media metadata store. 
Reddit is built on AWS and GCP, so they store any media uploaded to the site (<i>images, videos, gifs, etc.</i>) on AWS S3.</p><p class="paragraph" style="text-align:left;">Every piece of media uploaded also comes with metadata. Each media file will have metadata like video thumbnails, playback URLs, S3 file locations, image resolution, etc.</p><p class="paragraph" style="text-align:left;">Previously, Reddit’s media metadata was distributed across different storage systems. To make this easier to manage, the engineering team wanted to create a unified system for managing all this data.</p><p class="paragraph" style="text-align:left;">The high-level design goals were</p><ul><li><p class="paragraph" style="text-align:left;"><b>Single System</b> - they needed a <i>single</i> system that could store all of Reddit’s media metadata. Reddit’s growing quickly, so this system needs to be highly scalable. At the current rate of growth, Reddit expects the size of their media metadata to be 50 terabytes by 2030.</p></li><li><p class="paragraph" style="text-align:left;"><b>Read Heavy Workload </b>- This data store will have a <i>very</i> read-heavy workload. It needs to handle over 100k reads <i>per second</i> with latency less than 50 ms.</p></li><li><p class="paragraph" style="text-align:left;"><b>Support Writes</b> - The data store should also support data creation/updates. 
However, these requests have <i>significantly</i> lower traffic than reads, and Reddit can tolerate higher latency for them.</p></li></ul><p class="paragraph" style="text-align:left;">Reddit wrote a <a class="link" href="https://www.reddit.com/r/RedditEng/comments/1avlywv/the_reddit_media_metadata_store/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank" rel="noopener noreferrer nofollow">fantastic article</a> delving into their process for creating a unified media metadata store.</p><p class="paragraph" style="text-align:left;">We’ll be summarizing the article and adding some extra context.</p><h2 class="heading" style="text-align:left;" id="picking-the-database"><b>Picking the Database</b></h2><p class="paragraph" style="text-align:left;">To build this media metadata store, Reddit considered two choices: Sharded Postgres vs. Cassandra.</p><h3 class="heading" style="text-align:left;" id="postgres"><b>Postgres</b></h3><p class="paragraph" style="text-align:left;">Postgres is one of the most popular relational databases in the world and is consistently voted <i>most loved database</i> in Stack Overflow’s developer surveys.</p><p class="paragraph" style="text-align:left;">Some of the pros of Postgres are:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Battle-Tested</b> - Tens of thousands of companies use Postgres, and there have been countless tests of its performance, scalability and more. Postgres is used (<i>or has been used</i>) at companies like Uber, Skype, Spotify, etc.<br><br>With this, there’s a massive wealth of knowledge around potential issues, common bugs, pitfalls, etc. on forums like Stack Overflow, Slack/IRC, mailing threads and more.</p></li><li><p class="paragraph" style="text-align:left;"><b>Open Source & Community</b> - Postgres has been open source since 1995 with a liberal license that’s similar to the BSD and MIT licenses. 
There’s a vibrant community of developers who help others learn the database and provide support for people with issues. Postgres also has outstanding <a class="link" href="https://www.postgresql.org/docs/current/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank" rel="noopener noreferrer nofollow">documentation</a>.</p></li><li><p class="paragraph" style="text-align:left;"><b>Extensibility & Interoperability</b> - One of the initial design goals of Postgres was extensibility. Over its 30-year history, countless extensions have been developed to make Postgres more powerful. We’ll talk about a couple of Postgres extensions that Reddit uses for sharding.</p></li></ul><h3 class="heading" style="text-align:left;" id="cassandra"><b>Cassandra</b></h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://en.wikipedia.org/wiki/Apache_Cassandra?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank" rel="noopener noreferrer nofollow">Cassandra</a> is a NoSQL, distributed database created at Facebook in 2007. 
The initial project was heavily inspired by <a class="link" href="https://en.wikipedia.org/wiki/Bigtable?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank" rel="noopener noreferrer nofollow">Google Bigtable</a> and also took many ideas from <a class="link" href="https://en.wikipedia.org/wiki/Dynamo_(storage_system)?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank" rel="noopener noreferrer nofollow">Amazon’s Dynamo</a>.</p><p class="paragraph" style="text-align:left;">Here are some characteristics of Cassandra:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Large Scale - </b>Cassandra is completely distributed and can scale to a massive size. It has out-of-the-box support for things like distributing/replicating data in different locations.</p><p class="paragraph" style="text-align:left;">Additionally, Cassandra is designed with a <a class="link" href="https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank" rel="noopener noreferrer nofollow"><i>decentralized architecture</i></a> to minimize any central points of failure.</p></li><li><p class="paragraph" style="text-align:left;"><b>Highly Tunable - </b>Cassandra is highly customizable, so you can configure it for your exact workload. 
You can change how communication between nodes happens (gossip protocol), how data is stored on disk (LSM trees), the consensus required between nodes for writes (consistency level) and <a class="link" href="https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/configuration/configCassandra_yaml.html?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second#configCassandra_yaml__PerformanceTuningProps" target="_blank" rel="noopener noreferrer nofollow">much more</a>.</p></li><li><p class="paragraph" style="text-align:left;"><b>Wide Column - </b>Cassandra uses a wide column storage model, which allows for flexible storage. Data is organized into column families, where each family can have multiple rows with varying numbers of columns. You can read/write large amounts of data quickly and also add new columns without having to do a schema migration.</p></li></ul><p class="paragraph" style="text-align:left;">We did a much more detailed dive on Cassandra that you can read <a class="link" href="https://blog.quastor.org/p/uber-scaled-cassandra-tens-thousands-nodes-a4d4?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank" rel="noopener noreferrer nofollow">here</a>.</p><h2 class="heading" style="text-align:left;" id="picking-postgres"><b>Picking Postgres</b></h2><p class="paragraph" style="text-align:left;">After evaluating both choices extensively, Reddit decided to go with Postgres.</p><p class="paragraph" style="text-align:left;">The reasons included:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Challenges with Managing Cassandra</b> - They found managing Cassandra challenging. 
Ad-hoc queries for debugging and visibility were far more difficult than with Postgres.</p></li><li><p class="paragraph" style="text-align:left;"><b>Data Denormalization Issues with Cassandra</b> - In Cassandra, data is typically denormalized and stored in a way that optimizes specific queries (<i>this depends on your specific application</i>). However, this can lead to challenges when creating new queries that your data hasn’t been specifically modeled for.</p></li></ul><p class="paragraph" style="text-align:left;">Reddit uses AWS, so they went with AWS Aurora Postgres. For more on AWS RDS, you can check out a detailed tech dive we did <a class="link" href="https://blog.quastor.org/p/tech-dive-aws-rds-quastor-pro?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank" rel="noopener noreferrer nofollow">here</a>.</p><h2 class="heading" style="text-align:left;" id="data-migration"><b>Data Migration</b></h2><p class="paragraph" style="text-align:left;">Migrating to Postgres was a big challenge for the team. 
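</p><p class="paragraph" style="text-align:left;">At a high level, this kind of live migration follows a classic dual-write, shadow-read pattern. Here’s a minimal Python sketch of that pattern (the store classes and method names are hypothetical stand-ins, not Reddit’s actual code):</p>

```python
# Minimal sketch of a dual-write / shadow-read migration.
# NOTE: DictStore and the method names are hypothetical stand-ins,
# not Reddit's actual interfaces.

class DictStore:
    """Toy key-value store standing in for a real metadata backend."""
    def __init__(self):
        self.rows = {}

    def put(self, key, value):
        self.rows[key] = value

    def get(self, key):
        return self.rows.get(key)

    def scan(self):
        return list(self.rows.items())


class MetadataMigrator:
    def __init__(self, legacy_store, new_store):
        self.legacy = legacy_store
        self.new = new_store
        self.mismatches = []  # data gaps found by shadow reads

    def write(self, post_id, metadata):
        # Dual writes: new metadata lands in both stores.
        self.legacy.put(post_id, metadata)
        self.new.put(post_id, metadata)

    def backfill(self):
        # Backfill: copy historical rows into the new store.
        for post_id, metadata in self.legacy.scan():
            self.new.put(post_id, metadata)

    def read(self, post_id):
        # Dual reads: serve from legacy, shadow-read the new store,
        # and record any gaps for monitoring before ramping up.
        result = self.legacy.get(post_id)
        if self.new.get(post_id) != result:
            self.mismatches.append(post_id)
        return result


legacy, postgres = DictStore(), DictStore()
legacy.put("t3_old", {"s3_key": "media/old.png"})      # pre-existing row

migrator = MetadataMigrator(legacy, postgres)
migrator.backfill()                                    # copy history
migrator.write("t3_new", {"s3_key": "media/new.png"})  # dual write
migrator.read("t3_old")                                # shadow-read check
```

<p class="paragraph" style="text-align:left;">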
They had to transfer terabytes of data from the different systems to Postgres while ensuring that the legacy systems could continue serving over 100k reads per second.</p><p class="paragraph" style="text-align:left;">Here are the steps they went through for the migration:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Dual Writes</b> - Any new media metadata would be written to both the old systems and to Postgres.</p></li><li><p class="paragraph" style="text-align:left;"><b>Backfill Data</b> - Data from the older systems would be backfilled into Postgres.</p></li><li><p class="paragraph" style="text-align:left;"><b>Dual Reads</b> - Once Postgres had enough data, they enabled dual reads so that read requests were served by <i>both</i> Postgres and the old system.</p></li><li><p class="paragraph" style="text-align:left;"><b>Monitoring and Ramp Up</b> - They compared the results from the dual reads and fixed any data gaps, then slowly ramped up traffic to Postgres until they could fully cut over.</p></li></ol><h2 class="heading" style="text-align:left;" id="scaling-strategies"><b>Scaling Strategies</b></h2><p class="paragraph" style="text-align:left;">With that strategy, Reddit was able to successfully migrate over to Postgres.</p><p class="paragraph" style="text-align:left;">Currently, they’re seeing peak loads of ~100k reads <i>per second</i>. 
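</p><p class="paragraph" style="text-align:left;">A quick note on reading latency figures: they’re reported as percentiles rather than averages, where a PXX value means XX% of requests finished faster than that number. Here’s how percentiles can be computed from raw timings (the numbers below are synthetic, not Reddit’s data):</p>

```python
# Computing percentile latencies from raw request timings.
# The sample values are synthetic, just to show the mechanics.

def percentile(samples, pct):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [1.2, 2.1, 2.6, 3.0, 3.4, 4.1, 4.7, 5.9, 8.8, 17.0]
p50 = percentile(latencies_ms, 50)  # half of requests were faster
p90 = percentile(latencies_ms, 90)
p99 = percentile(latencies_ms, 99)  # all but the slowest 1% were faster
```

<p class="paragraph" style="text-align:left;">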
At that load, the latency numbers they’re seeing with Postgres are:</p><ul><li><p class="paragraph" style="text-align:left;"><b>2.6 ms P50</b> - 50% of requests have a latency lower than 2.6 milliseconds</p></li><li><p class="paragraph" style="text-align:left;"><b>4.7 ms P90</b> - 90% of requests have a latency lower than 4.7 milliseconds</p></li><li><p class="paragraph" style="text-align:left;"><b>17 ms P99</b> - 99% of requests have a latency lower than 17 milliseconds</p></li></ul><p class="paragraph" style="text-align:left;">They’re able to achieve this <i>without</i> needing a read-through cache.</p><p class="paragraph" style="text-align:left;">We’ll talk about some of the strategies they’re using to scale.</p><h3 class="heading" style="text-align:left;" id="table-partitioning"><b>Table Partitioning</b></h3><p class="paragraph" style="text-align:left;">At the current pace of media content creation, Reddit expects their media metadata to reach roughly 50 terabytes by 2030. This means they need to implement sharding and partition their tables across multiple Postgres instances.</p><p class="paragraph" style="text-align:left;">Reddit shards their tables on <code>post_id</code>, using <i>range-based</i> partitioning. 
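</p><p class="paragraph" style="text-align:left;">Range-based partitioning can be sketched in a few lines of Python. The ranges and shard names below are invented for illustration; Reddit’s actual shard layout isn’t public:</p>

```python
# Toy illustration of range-based partitioning on post_id.
# The boundaries and shard names are made up for the example.

SHARD_RANGES = [
    (0, 10_000_000, "shard_2019"),           # oldest posts
    (10_000_000, 20_000_000, "shard_2021"),
    (20_000_000, 30_000_000, "shard_2023"),  # newest posts
]

def shard_for(post_id):
    """Route a post_id to the shard whose range contains it."""
    for low, high, shard in SHARD_RANGES:
        if low <= post_id < high:
            return shard
    raise ValueError(f"no shard covers post_id {post_id}")

# Because post_id increases monotonically, a batch of posts from the
# same time period usually lands on a single shard:
batch = [20_000_001, 20_450_000, 21_000_000]
shards_hit = {shard_for(p) for p in batch}  # a single shard
```

<p class="paragraph" style="text-align:left;">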
All posts with a <code>post_id</code> in a certain range will be put on the same database shard.</p><p class="paragraph" style="text-align:left;"><code>post_id</code> increases monotonically, which means their tables are partitioned by time period.</p><p class="paragraph" style="text-align:left;">Many of their read requests involve batch queries on multiple IDs from the same time period, so this design helps minimize cross-shard joins.</p><p class="paragraph" style="text-align:left;">Reddit uses the <a class="link" href="https://github.com/pgpartman/pg_partman?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank" rel="noopener noreferrer nofollow">pg_partman</a> Postgres extension to manage the table partitioning.</p><h3 class="heading" style="text-align:left;" id="denormalization"><b>Denormalization</b></h3><p class="paragraph" style="text-align:left;">Another way Reddit minimizes joins is through denormalization.</p><p class="paragraph" style="text-align:left;">They took all the metadata fields required for displaying an image post and put them together into a single <a class="link" href="https://www.postgresql.org/docs/current/datatype-json.html?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank" rel="noopener noreferrer nofollow">JSONB field</a>. 
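</p><p class="paragraph" style="text-align:left;">For illustration, here’s roughly what that denormalization looks like. The field names are invented; this isn’t Reddit’s actual schema:</p>

```python
import json

# Invented field names, just to illustrate the denormalization idea --
# this is not Reddit's actual schema.

# Normalized view: metadata scattered across separate records.
media_record = {"s3_key": "media/abc.png", "type": "image"}
thumbnail_record = {"url": "https://thumbs.example/abc.png", "width": 140}
resolution_record = {"width": 1920, "height": 1080}

# Denormalized view: everything needed to render the post, written
# once into a single JSON document (a JSONB column in Postgres).
post_metadata_json = json.dumps({
    "media": media_record,
    "thumbnail": thumbnail_record,
    "resolution": resolution_record,
})

# Rendering a post is now one fetch and one parse instead of
# several lookups stitched together in application code.
rendered = json.loads(post_metadata_json)
```

<p class="paragraph" style="text-align:left;">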
Instead of fetching different fields and combining them, they can just fetch that single JSONB field.</p><p class="paragraph" style="text-align:left;">This made it <i>much</i> more efficient to fetch all the data needed to render a post.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/9d8f7620-4d37-428f-9c27-ab485803f304/Screenshot_2024-05-07_at_2.53.33_PM.png?t=1715108016"/><div class="image__source"><span class="image__source_text"><p>All the metadata needed to render an image post</p></span></div></div><p class="paragraph" style="text-align:left;">It also simplified the querying logic, especially across different media types. Instead of worrying about exactly which data fields you needed, you just fetch the single JSONB value.</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://read.engineerscodex.com/p/5-non-llm-software-trends-to-be-excited?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank"><img class="embed__image embed__image--top" src="https://substackcdn.com/image/fetch/w_1200,h_600,c_fill,f_jpg,q_auto:good,fl_progressive:steep,g_auto/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d7e0f9b-fc89-4bd6-8c00-c5ddd149a9c9_1477x1098.png"/><div class="embed__content"><p class="embed__title"> 5 Non-LLM Software Trends To Be Excited About </p><p class="embed__description"> With all the AI hype that’s been going around, almost everyone has been completely focused on LLMs. 
However, there’s been a ton of research and advancements in other areas of software engineering.<br><br>Engineer’s Codex published a fantastic blog post talking about some of the other exciting tech in software.<br><br>Some of the advancements discussed are CRDTs, WebAssembly improvements, strides in Cross-Platform development and more. </p><p class="embed__link"> read.engineerscodex.com/p/5-non-llm-software-trends-to-be-excited </p></div></a></div><div class="embed"><a class="embed__url" href="https://blog.carlosgaldino.com/writing-a-file-system-from-scratch-in-rust.html?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank"><div class="embed__content"><p class="embed__title"> Writing a File System from Scratch in Rust </p><p class="embed__description"> This is a really interesting blog post that walks through the process of building your own file system.<br><br>This post explains how a file system structures data on disks with things like superblocks, bitmaps, inodes, data blocks and more.<br><br>It delves into the author’s implementation of a file system called GotenksFS. Key parts covered include initializing the file system image, mounting the file system, on-disk data structures and more. 
</p><p class="embed__link"> blog.carlosgaldino.com/writing-a-file-system-from-scratch-in-rust.html </p></div></a></div><div class="embed"><a class="embed__url" href="https://jacobpadilla.com/articles/recreating-asyncio?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank"><img class="embed__image embed__image--left" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/480a4920-e8b0-4ac3-afbf-8a24d5f0f4dc/Screenshot_2024-05-07_at_4.26.47_PM.png?t=1715113619"/><div class="embed__content"><p class="embed__title"> How Python Asyncio Works </p><p class="embed__description"> This is an excellent deep dive into Python’s asyncio library. It explains how it works under the hood by recreating a simplified version from scratch.<br><br>The post goes through creating a basic event loop and how to build a sleep generator that pauses a task’s execution for a certain duration. </p><p class="embed__link"> jacobpadilla.com/articles/recreating-asyncio </p></div></a></div><div class="embed"><a class="embed__url" href="https://about.gitlab.com/blog/2021/11/16/engineering-managers-automate-their-jobs/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/ce604d24-0895-423c-807d-8cc038c0d904/Screenshot_2024-05-07_at_4.39.21_PM.png?t=1715114369"/><div class="embed__content"><p class="embed__title"> How GitLab automates engineering management </p><p class="embed__description"> Engineering managers at GitLab use a ton of automation scripts to help them manage projects and keep track of issues.<br><br>This is a post by GitLab Engineering with some examples of automation scripts they use to help keep projects on track. 
</p><p class="embed__link"> about.gitlab.com/blog/2021/11/16/engineering-managers-automate-their-jobs </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=662fe219-a5be-49b6-b425-62aed1e947a1&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

  </channel>
</rss>
