<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Quastor</title>
    <description>Get Summaries of Big Tech Engineering Blog Posts on Frontend, Backend, Machine Learning, Data Engineering and more!</description>
    
    <link>https://blog.quastor.org/</link>
    <atom:link href="https://rss.beehiiv.com/feeds/nczRb4PQ6t.xml" rel="self"/>
    
    <lastBuildDate>Mon, 2 Mar 2026 03:08:38 +0000</lastBuildDate>
    <pubDate>Mon, 24 Feb 2025 21:03:00 +0000</pubDate>
    <atom:published>2025-02-24T21:03:00Z</atom:published>
    <atom:updated>2026-03-02T03:08:38Z</atom:updated>
    
    <copyright>Copyright 2026, Quastor</copyright>
    
    <image>
      <url>https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/publication/logo/7a588cb1-1dd5-4201-9945-b1b2cde55baa/Q.png</url>
      <title>Quastor</title>
      <link>https://blog.quastor.org/</link>
    </image>
    
    <docs>https://www.rssboard.org/rss-specification</docs>
    <generator>beehiiv</generator>
    <language>en-us</language>
    <webMaster>support@beehiiv.com (Beehiiv Support)</webMaster>

      <item>
  <title>The Architecture of Grab&#39;s Data Lake</title>
  <description>We&#39;ll talk about data storage formats, merge on read, copy on write and more. Plus, a detailed guide to software architecture documentation, how Figma overhauled their performance testing framework and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/7fbfb779-bae0-4672-ac55-6296d9e41047/Screenshot_2024-05-29_at_1.31.21_PM.png" length="261310" type="image/png"/>
  <link>https://blog.quastor.org/p/the-architecture-of-grab-s-data-lake</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/the-architecture-of-grab-s-data-lake</guid>
  <pubDate>Mon, 24 Feb 2025 21:03:00 +0000</pubDate>
  <atom:published>2025-02-24T21:03:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>The Architecture of Grab&#39;s Data Lake</b></p><ul><li><p class="paragraph" style="text-align:left;">Introduction to Data Storage Formats</p></li><li><p class="paragraph" style="text-align:left;">Design Choices when picking a Data Storage Format</p></li><li><p class="paragraph" style="text-align:left;">High Throughput vs. Low Throughput Data at Grab</p></li><li><p class="paragraph" style="text-align:left;">Using Avro and <i>Merge on Read </i>for High Throughput Data</p></li><li><p class="paragraph" style="text-align:left;">Using Parquet and <i>Copy on Write</i> for Low Throughput Data</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">A Detailed Guide to Software Architecture Documentation</p></li><li><p class="paragraph" style="text-align:left;">The really important job interview questions engineers should ask (but don’t)</p></li><li><p class="paragraph" style="text-align:left;">How Figma overhauled their Performance Testing Framework</p></li><li><p class="paragraph" style="text-align:left;">Is this the simplest sorting algorithm ever?</p></li><li><p class="paragraph" style="text-align:left;">How SQLite got 10x faster for Analytical Queries</p></li></ul></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://api.carsxe.com/?utm_source=quastor&utm_medium=email&utm_campaign=quastor-feb-24" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcJCFACtFpAOgdB_JFqnMVM5uZZknd2HVsutnMWq7V8jPeoOzKRwbZ8tYSFbKSkr1HJu6-Rb9l5pC1iKCYxfnLXE8AzxYPH4Xp7QG2D0zisGbUMwaDvGDPh3PnUKj7CvO55DJDDdw?key=0XiR9bi6yeWSXCHLPsMUXhRu"/></a></div><h1 class="heading" 
style="text-align:left;" id="our-api-could-probably-run-a-small-"><a class="link" href="https://api.carsxe.com/?utm_source=quastor&utm_medium=email&utm_campaign=quastor-feb-24" target="_blank" rel="noopener noreferrer nofollow"><b>Our API Could Probably Run a Small Country.</b></a></h1><p class="paragraph" style="text-align:left;">Your API can barely handle <b>10 requests</b> without throwing a tantrum, and you wanna scale? </p><p class="paragraph" style="text-align:left;"><a class="link" href="https://api.carsxe.com/?utm_source=quastor&utm_medium=email&utm_campaign=quastor-feb-24" target="_blank" rel="noopener noreferrer nofollow">CarsXE</a><b> is so powerful, it could run a startup, a Fortune 500, or a small dictatorship</b>—and still have bandwidth left for your grandma’s car blog. </p><p class="paragraph" style="text-align:left;"><b>Instant license plate decoding, real-time vehicle specs, and enterprise-level security.</b> Your move.</p><p class="paragraph" style="text-align:left;">Or keep using that weak API that breaks when someone sneezes.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://api.carsxe.com/?utm_source=quastor&utm_medium=email&utm_campaign=quastor-feb-24"><span class="button__text" style=""> Try CarsXE for Free Now! </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="the-architecture-of-grabs-data-lake"><span style="color:rgb(14, 16, 26);"><b>The Architecture of Grab&#39;s Data Lake</b></span></h1><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">Grab is one of the largest technology companies in Southeast Asia, with over 35 million monthly users. 
They run a &quot;super-app&quot; that offers ride-sharing, food delivery, banking, and communication all within a single app.</span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">As you might guess, operating all these services generates a </span><span style="color:rgb(14, 16, 26);"><i>lot</i></span><span style="color:rgb(14, 16, 26);"> of data. Grab&#39;s data analysts need to comb through this data for insights that can help the company improve its operations.</span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">The data is primarily stored in a data lake. Some of the data is high-throughput and gets frequent updates (</span><span style="color:rgb(14, 16, 26);"><i>multiple updates every second/minute</i></span><span style="color:rgb(14, 16, 26);">). Other data is low-throughput and is rarely updated (</span><span style="color:rgb(14, 16, 26);"><i>updated daily/weekly</i></span><span style="color:rgb(14, 16, 26);">).</span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">Grab needs to store this data efficiently and also allow data analysts to run ad-hoc queries on it without it being too costly.</span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">One crucial design choice is picking the right</span><span style="color:rgb(14, 16, 26);"><i> storage format </i></span><span style="color:rgb(14, 16, 26);">for the data on the data lake. 
Choosing the wrong format can make data storage </span><span style="color:rgb(14, 16, 26);"><i>significantly</i></span><span style="color:rgb(14, 16, 26);"> more expensive and make gaining insights from the data more difficult.</span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">We&#39;ll first explore data storage formats, discussing the tradeoffs involved and some commonly used technologies.</span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">Then, we&#39;ll talk about what Grab chose and why. You can read the full article by Grab </span><span style="color:rgb(74, 110, 224);"><a class="link" href="https://engineering.grab.com/enabling-near-realtime-data-analytics?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake" target="_blank" rel="noopener noreferrer nofollow" style="color: rgb(74, 110, 224)">here</a></span><span style="color:rgb(14, 16, 26);">.</span></p><h2 class="heading" style="text-align:left;" id="introduction-to-data-storage-format"><span style="color:rgb(14, 16, 26);"><b>Introduction to Data Storage Formats</b></span></h2><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">Let&#39;s say you have a bunch of sensor data from a weather satellite. If you need to store it on S3, how would you do it?</span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">Do you just upload the CSV? What if the file is 50 gigabytes and has many repeat values? 
Would you pick a format that compresses the data?</span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">The format you choose for encoding your data is crucial.</span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">Here are some of the tradeoffs you might consider:</span></p><ul><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Human Readability</b></span><span style="color:rgb(14, 16, 26);"> - It&#39;s pretty easy to read a CSV or JSON file, but these formats aren&#39;t very efficient. Binary formats like Protobuf are much smaller but not human-readable.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Row vs. Column-oriented</b></span><span style="color:rgb(14, 16, 26);"> - In a row-oriented format, you put data in the same row next to each other on disk. In a column-oriented format, you put data in the same </span><span style="color:rgb(14, 16, 26);"><i>column</i></span><span style="color:rgb(14, 16, 26);"> next to each other on disk. This choice has </span><span style="color:rgb(14, 16, 26);"><i>a ton</i></span><span style="color:rgb(14, 16, 26);"> of effects on read/write performance, compression efficiency and more. We did a deep dive on Row vs. 
Column-oriented databases that you can read </span><span style="color:rgb(74, 110, 224);"><a class="link" href="https://blog.quastor.org/p/row-vs-column-oriented-databases-plus-managing-load-robinhood?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-zomato-built-a-petabyte-scale-logging-system" target="_blank" rel="noopener noreferrer nofollow" style="color: rgb(74, 110, 224)">here</a></span><span style="color:rgb(14, 16, 26);">.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Schema Evolution</b></span><span style="color:rgb(14, 16, 26);"> - How easy is it to change the data schema over time without breaking existing data? Some formats support adding/removing fields.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Compression</b></span><span style="color:rgb(14, 16, 26);"> - How efficiently can you compress the data? Compression can save storage costs and reduce latencies.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Split-ability</b></span><span style="color:rgb(14, 16, 26);"> - How easy is it to divide the file format into smaller chunks? Split-ability is important if you need to do parallel processing on the data.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Ecosystem Compatibility</b></span><span style="color:rgb(14, 16, 26);"> - Is the format commonly used? Does it have good support in tools like Spark, Redshift, Snowflake, etc.?</span></p></li></ul><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">Some common formats that you&#39;ll frequently see are:</span></p><ul><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>JSON</b></span><span style="color:rgb(14, 16, 26);"> - this is a text-based format so it&#39;s human-readable. 
Almost every programming language will have support for JSON. I&#39;m assuming you&#39;ve already used JSON before but you can read more </span><span style="color:rgb(74, 110, 224);"><a class="link" href="https://en.wikipedia.org/wiki/JSON?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake" target="_blank" rel="noopener noreferrer nofollow" style="color: rgb(74, 110, 224)">here</a></span><span style="color:rgb(14, 16, 26);"> if you haven&#39;t.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>CSV</b></span><span style="color:rgb(14, 16, 26);"> - a text format for storing data in a table-like way. Each line corresponds to a row and every value in the line is separated by a comma. CSV is also human-readable but it&#39;s not an efficient way to store data.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Avro</b></span><span style="color:rgb(14, 16, 26);"> - a binary data serialization framework developed within the Hadoop ecosystem. Data on disk is row-oriented and compressed. Avro supports robust schema evolution. Learn more </span><span style="color:rgb(74, 110, 224);"><a class="link" href="https://en.wikipedia.org/wiki/Apache_Avro?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake" target="_blank" rel="noopener noreferrer nofollow" style="color: rgb(74, 110, 224)">here</a></span><span style="color:rgb(14, 16, 26);">.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Parquet</b></span><span style="color:rgb(14, 16, 26);"> - developed in 2013 as a joint effort between Twitter and Cloudera. Parquet is column-oriented and provides efficient compression on disk. It supports schema evolution so you can add/remove columns without breaking compatibility. 
Read more </span><span style="color:rgb(74, 110, 224);"><a class="link" href="https://en.wikipedia.org/wiki/Apache_Parquet?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake" target="_blank" rel="noopener noreferrer nofollow" style="color: rgb(74, 110, 224)">here</a></span><span style="color:rgb(14, 16, 26);">.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>ORC</b></span><span style="color:rgb(14, 16, 26);"> - created in 2013 at Facebook. It&#39;s optimized for large streaming reads and efficient data compression. </span><span style="color:rgb(74, 110, 224);"><a class="link" href="https://en.wikipedia.org/wiki/Apache_ORC?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake" target="_blank" rel="noopener noreferrer nofollow" style="color: rgb(74, 110, 224)">Read more</a></span><span style="color:rgb(14, 16, 26);">.</span></p></li></ul><h2 class="heading" style="text-align:left;" id="data-storage-formats-at-grab"><span style="color:rgb(14, 16, 26);"><b>Data Storage Formats at Grab</b></span></h2><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">Grab looked at the characteristics of their data sources and used those to determine their storage formats.</span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">One characteristic they used was throughput.</span></p><ul><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>High Throughput</b></span><span style="color:rgb(14, 16, 26);"> data is updated/changed frequently (</span><span style="color:rgb(14, 16, 26);"><i>several times a second</i></span><span style="color:rgb(14, 16, 26);">). 
An example is the stream of booking events from customers who book a ride share.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Low Throughput</b></span><span style="color:rgb(14, 16, 26);"> data could be transaction events generated from a nightly batch process.</span></p></li></ul><h3 class="heading" style="text-align:left;" id="high-throughput-data"><span style="color:rgb(14, 16, 26);"><b>High Throughput Data</b></span></h3><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">For High Throughput data, Grab uses Apache Avro with a strategy called </span><span style="color:rgb(74, 110, 224);"><i><a class="link" href="https://hudi.apache.org/docs/concepts/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake#merge-on-read-table" target="_blank" rel="noopener noreferrer nofollow" style="color: rgb(74, 110, 224)">Merge on Read (MOR)</a></i></span><span style="color:rgb(14, 16, 26);"><b>.</b></span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">Here are the main operations with Merge on Read:</span></p><ul><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Write Operations</b></span><span style="color:rgb(14, 16, 26);"> - When data is written, it&#39;s appended to the end of a log file. This is </span><span style="color:rgb(14, 16, 26);"><i>much</i></span><span style="color:rgb(14, 16, 26);"> more efficient than merging it into the current data and reduces the latency of writes.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Read Operations</b></span><span style="color:rgb(14, 16, 26);"> - When you need to read data, the base file is combined with the updates in the log file to provide the latest view. 
This can make reads more costly as you have to merge the updated data.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Periodic Compaction</b></span><span style="color:rgb(14, 16, 26);"> - To prevent reads from becoming </span><span style="color:rgb(14, 16, 26);"><i>too</i></span><span style="color:rgb(14, 16, 26);"> costly, updates in the log files are periodically merged with the base files. This limits the number of past updates you must merge during a read.</span></p></li></ul><h3 class="heading" style="text-align:left;" id="low-throughput-data"><span style="color:rgb(14, 16, 26);"><b>Low Throughput Data</b></span></h3><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">For low throughput data, Grab uses Parquet with </span><span style="color:rgb(74, 110, 224);"><i><a class="link" href="https://hudi.apache.org/docs/concepts/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake#copy-on-write-table" target="_blank" rel="noopener noreferrer nofollow" style="color: rgb(74, 110, 224)">Copy on Write (CoW)</a></i></span><span style="color:rgb(14, 16, 26);"><i>.</i></span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);">Here are the main operations for Copy on Write:</span></p><ul><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Write Operations</b></span><span style="color:rgb(14, 16, 26);"> - Whenever there&#39;s a write, you create a new version of the file that includes the latest change. You can also keep the previous version for consistency and rollback purposes. This helps prevent data corruption, inconsistent reads, and more.</span></p></li><li><p class="paragraph" style="text-align:left;"><span style="color:rgb(14, 16, 26);"><b>Read Operations</b></span><span style="color:rgb(14, 16, 26);"> - You read the latest versioned data file. 
Reads are faster than </span><span style="color:rgb(14, 16, 26);"><i>Merge on Read </i></span><span style="color:rgb(14, 16, 26);">since there is no merge process.</span></p></li></ul><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://www.workingsoftware.dev/software-architecture-documentation-the-ultimate-guide/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake" target="_blank"><div class="embed__content"><p class="embed__title"> A Detailed Guide to Software Architecture Documentation </p><p class="embed__description"> This is a fantastic guide to documenting things in your codebase that aren’t code.<br><br>You should be documenting things like<br>- non-functional requirements<br>- architectural decisions and their arguments<br>- data flow<br>- maintenance and update procedures <br>and much more.<br><br>This is a fantastic guide on how to document all these other areas of your system. 
</p><p class="embed__link"> www.workingsoftware.dev/software-architecture-documentation-the-ultimate-guide </p></div></a></div><div class="embed"><a class="embed__url" href="https://posthog.com/founders/what-to-ask-in-interviews?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake" target="_blank"><div class="embed__content"><p class="embed__title"> The really important job interview questions engineers should ask (but don’t) </p><p class="embed__description"> This is a good list of questions you should ask when you’re interviewing for a job (especially if it’s a startup).<br><br>Questions include<br>- Does the company have product-market fit?<br>- How much runway does the company have? What’s the burn?<br>- What’s in store for the future? What is the company strategy?<br>- Who decides what to build?<br><br>and more. </p><p class="embed__link"> posthog.com/founders/what-to-ask-in-interviews </p></div></a></div><div class="embed"><a class="embed__url" href="https://www.figma.com/blog/keeping-figma-fast/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/99459024-bca2-4046-afc4-e34572574d94/4cad57473a76fdc6d0254d48e1c992a1079943f1-4032x3024.jpg?t=1717428698"/><div class="embed__content"><p class="embed__title"> How Figma overhauled their Performance Testing Framework </p><p class="embed__description"> Initially, Figma relied on a single MacBook for their in-house performance testing system. 
As you might imagine, they eventually had to find a more scalable solution.<br><br>This is an interesting read from the Figma engineering blog on how the company built a new system to spot performance regressions in the app.<br><br>They ultimately shipped two systems: a cloud-based system that covered the majority of tests and a hardware system for highly targeted tests. </p><p class="embed__link"> www.figma.com/blog/keeping-figma-fast </p></div></a></div><div class="embed"><a class="embed__url" href="https://arxiv.org/abs/2110.01111?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake" target="_blank"><div class="embed__content"><p class="embed__title"> Is this the simplest (and most surprising) sorting algorithm ever? </p><p class="embed__description"> This is a paper on the “ICan’tBelieveItCanSort” sorting algorithm. It’s an incredibly simple algorithm that appears incorrect at first glance, but still successfully sorts an array in non-decreasing order. </p><p class="embed__link"> arxiv.org/abs/2110.01111 </p></div></a></div><div class="embed"><a class="embed__url" href="https://avi.im/blag/2024/sqlite-past-present-future/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-data-lake" target="_blank"><div class="embed__content"><p class="embed__title"> How SQLite got 10x faster for Analytical Queries </p><p class="embed__description"> This is a really interesting read on how researchers used the Bloom Filter data structure to make SQLite 10x faster for analytical queries.<br><br>The article talks about how they were able to use Bloom Filters to cache and reduce expensive B-tree probes. 
</p><p class="embed__link"> avi.im/blag/2024/sqlite-past-present-future </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=23335d4b-3321-4131-bf3d-52396b3d44de&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
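The Merge on Read and Copy on Write mechanics described earlier can be sketched in a few lines of code. The following is a minimal, hypothetical Python model (not Grab's or Apache Hudi's actual implementation; the class and method names are invented for illustration) that stands in for base files and update logs with in-memory dicts:

```python
# Hypothetical sketch of the two Hudi-style table strategies discussed above.
# A "record" is a key -> value pair; dicts stand in for files on the data lake.

class MergeOnReadTable:
    """Writes append to a log; reads merge base + log; compaction folds the log in."""
    def __init__(self, base):
        self.base = dict(base)
        self.log = []                   # append-only update log

    def write(self, key, value):
        self.log.append((key, value))   # O(1) append: no rewrite of the base file

    def read(self):
        view = dict(self.base)
        for key, value in self.log:     # reads pay the merge cost
            view[key] = value
        return view

    def compact(self):
        self.base = self.read()         # periodic compaction bounds future read cost
        self.log = []


class CopyOnWriteTable:
    """Every write produces a new full version; reads just take the latest one."""
    def __init__(self, base):
        self.versions = [dict(base)]    # older versions kept for rollback

    def write(self, key, value):
        new = dict(self.versions[-1])   # copy the whole "file"...
        new[key] = value                # ...with the one change applied
        self.versions.append(new)

    def read(self):
        return self.versions[-1]        # no merge step, so reads are cheap
```

The sketch makes the tradeoff concrete: MergeOnReadTable writes are cheap appends but reads merge the log until compact() runs, while CopyOnWriteTable reads are trivial and every write rewrites the whole file, matching high-throughput vs. low-throughput data respectively.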
  ]]></content:encoded>
</item>

      <item>
  <title>The Architecture of Grab&#39;s Auth System</title>
  <description>We&#39;ll talk about PassKeys, their benefits and how they work. Plus, a dive into the internals of Apache Kafka, a compilation of writing advice from the internet&#39;s best writers and more.</description>
  <link>https://blog.quastor.org/p/the-architecture-of-grab-s-auth-system</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/the-architecture-of-grab-s-auth-system</guid>
  <pubDate>Wed, 19 Feb 2025 16:20:00 +0000</pubDate>
  <atom:published>2025-02-19T16:20:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How Grab uses Passkeys for a Passwordless Authentication System</b></p><ul><li><p class="paragraph" style="text-align:left;"> Introduction to PassKeys and their History</p></li><li><p class="paragraph" style="text-align:left;"> How PassKeys work and their Benefits</p></li><li><p class="paragraph" style="text-align:left;">The Architecture of Grab’s Passkey System</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Exploring the internals of Apache Kafka</p></li><li><p class="paragraph" style="text-align:left;">Interview with the Inventors of Deep Research</p></li><li><p class="paragraph" style="text-align:left;">How My Washing Machine Refreshed My Thinking on Software Effort Estimation</p></li><li><p class="paragraph" style="text-align:left;">NASA’s 10 Rules for Software Development</p></li><li><p class="paragraph" style="text-align:left;">A Compilation of Writing Advice from the Internet’s Best Writers</p></li></ul></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://dub.link/quas-feb3-doom?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdbK6y21UAeLtFOl5PoCbbe22VuOpEt9QYy87S8KK7CzyIW97BwgTsCIoHDThnbkzd0aBDbjRndZfWwpESGfEdCO6AurqeZZ3tZrmsbeY68FWtWBc_kKTwKC8DiWxPrC1v5bblLKQ?key=eXEz32zof7Iu-jmoXAT47A"/></a></div><h1 class="heading" style="text-align:left;" id="why-deadlines-are-slowing-your-team"><b><a class="link" 
href="https://dub.link/quas-feb3-doom?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">Why Deadlines Are Slowing Your Team Down</a></b></h1><p class="paragraph" style="text-align:left;">If you’ve ever been caught in a death spiral of shifting deadlines and mounting tech debt, you’re not alone. Developers often face unrealistic timelines that kill team morale and momentum.</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://dub.link/quas-feb3?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">Product for Engineers</a> recently published a terrific blog post delving into how they avoid falling into this trap.</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Recognize the Vicious Cycle</b> - Understand the cause of tight deadlines. Is it over-promising to customers? Poor or unrealistic planning? Work to eliminate the source. </p></li><li><p class="paragraph" style="text-align:left;"><b>Trust Small Teams</b> - Teams of six or fewer can ship faster than a team twice their size. Small teams mean less coordination, fewer meetings, and more time coding.</p></li><li><p class="paragraph" style="text-align:left;"><b>Ditch Arbitrary Deadlines</b> - Focus on <i>real user </i>feedback instead of guesses on timelines from sales/leadership teams.</p></li><li><p class="paragraph" style="text-align:left;"><b>Hire for Ownership</b> - The best engineers excel when they can drive product decisions. 
Having a sense of ownership about the product helps drive team morale and momentum.</p></li></ol><p class="paragraph" style="text-align:left;">For more on how to build more effective engineering teams, check out the <a class="link" href="https://dub.link/quas-feb3?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">Product for Engineers</a> newsletter below.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://dub.link/quas-feb3?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system"><span class="button__text" style=""> Check out Product for Engineers </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>How Grab uses Passkeys for a Passwordless Authentication System</b></h1><p class="paragraph" style="text-align:left;">Grab is one of the largest tech companies in Southeast Asia. They’re a “super app” and provide services like ride-hailing, food/grocery delivery, mobile payments and more. They operate in over 700 cities and have millions of daily users.</p><p class="paragraph" style="text-align:left;">The company offers digital payment solutions (GrabPay Wallet) and lending products (GrabFinance) so secure authentication is a critical component of the app. 
Phishing attacks, credential stuffing and other common attacks can cost the company hundreds of millions of dollars.</p><p class="paragraph" style="text-align:left;">Recently, Grab added support for Passkeys, a passwordless authentication method based on the FIDO standard.</p><p class="paragraph" style="text-align:left;">The Grab engineering team wrote a fantastic <a class="link" href="https://engineering.grab.com/embracing-passwordless-authentication-with-passkey?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">blog post</a> delving into why they implemented Passkeys, how they work, and the challenges they faced. We’ll be summarizing the blog post and adding some additional context on Passkeys.</p><h2 class="heading" style="text-align:left;" id="introduction-to-passkeys"><b>Introduction to Passkeys</b></h2><p class="paragraph" style="text-align:left;">Passkeys were first proposed in 2009 by engineers at Validity Sensors, a company that developed fingerprint sensors. They were looking for a way to use biometrics for authentication without requiring users to create and remember passwords.</p><p class="paragraph" style="text-align:left;">The idea gained traction in 2012 when PayPal joined forces with Validity Sensors to create the Fast IDentity Online (FIDO) Alliance. The goal of the alliance was to develop open standards for passwordless authentication that could be used across different devices and platforms.</p><p class="paragraph" style="text-align:left;">In 2018, they released FIDO2, which introduced the <a class="link" href="https://en.wikipedia.org/wiki/WebAuthn?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">WebAuthn</a> standard. 
With this, users could authenticate with biometrics (Face ID or fingerprint) or a hardware security key instead of a password.</p><p class="paragraph" style="text-align:left;">Apple announced broad support for passkeys in iOS 16 (<i>in 2022</i>). Since then, companies like TikTok, Adobe, Amazon, Google and more have added support for passkeys.</p><h2 class="heading" style="text-align:left;" id="how-passkeys-work"><b>How Passkeys Work</b></h2><p class="paragraph" style="text-align:left;">Passkeys are based on public-key cryptography (<i>asymmetric cryptography</i>) where you use a pair of mathematically related keys: a public key and a private key.</p><p class="paragraph" style="text-align:left;">The public and private keys are used to create a digital signature that can verify a user’s identity. The private key is kept secret on the user’s device while the public key is stored on the server (<i>it can be freely distributed</i>).</p><p class="paragraph" style="text-align:left;"><b>Passkey Flow</b></p><p class="paragraph" style="text-align:left;">Here are the steps that happen with passkeys:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>User Registration</b> - The user creates a passkey for a website with his iPhone. The authenticator on his phone will generate a new, unique public/private key pair using an algorithm like ECDSA. The public key will be sent to the website while the private key will be securely stored in iCloud Keychain.</p></li><li><p class="paragraph" style="text-align:left;"><b>Passkey Authentication</b><b> </b>- Later, when the user wants to log into the website, he’ll select the “login with Passkey” option. The website will send him a challenge (<i>a random piece of data</i>) to verify his identity. The user will use iCloud Keychain on his iPhone (<i>after authenticating himself with Face ID or Touch ID</i>) to create a digital signature of the challenge using his private key. 
This is done with algorithms like DSA or ECDSA.</p></li><li><p class="paragraph" style="text-align:left;"><b>Server Verification</b><b> </b>- the digitally signed challenge is sent back to the website server. The server uses the public key to verify the signature and grant access to the user.</p></li></ol><h2 class="heading" style="text-align:left;" id="benefits-of-passkeys-over-passwords"><b>Benefits of Passkeys over Passwords</b></h2><p class="paragraph" style="text-align:left;">Passkeys offer several advantages over traditional passwords:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Convenience and No Password Reuse</b>: With passwords, users often reuse the same password across multiple sites, which creates a security risk if one of those sites is breached. Passkeys eliminate the need to remember passwords, so users don&#39;t have to reuse the same password across different services.</p></li><li><p class="paragraph" style="text-align:left;"><b>Companies don’t store Credentials</b>: The public key doesn’t need to be kept secret, so the company never has to store a secret credential the way it does with passwords. Even if the server’s database is breached, the stolen public keys can’t be used to impersonate users.</p></li><li><p class="paragraph" style="text-align:left;"><b>Built-In MFA</b>: Passkeys provide multi-factor authentication by combining something the user has (their device) with something the user is (biometrics) or something the user knows (a PIN).</p></li><li><p class="paragraph" style="text-align:left;"><b>Eliminate Phishing Attacks</b>: Phishing attacks trick users into entering their credentials on a fake website. 
Passkeys are resistant to phishing because the authentication is tied to the user&#39;s device and the specific website or app.</p></li></ul><p class="paragraph" style="text-align:left;"><a class="link" href="https://en.wikipedia.org/wiki/WebAuthn?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">WebAuthn</a> is the web standard that defines the communication between your device, your web browser and the website’s servers. It provides a set of JavaScript APIs that websites can use to interact with a user’s authenticator and specifies the steps for creating/authenticating passkeys.</p><h2 class="heading" style="text-align:left;" id="how-grab-implemented-passkey-authen"><b>How Grab Implemented Passkey Authentication</b></h2><p class="paragraph" style="text-align:left;">Grab&#39;s implementation of passkey authentication follows the standard WebAuthn flow. Here’s the process, step by step:</p><div class="image"><img alt="" class="image__image" style="" src="https://engineering.grab.com/img/embracing-passwordless-authentication-with-passkey/sequence-diagram-passkey-registration.png"/></div><h3 class="heading" style="text-align:left;" id="creating-a-passkey"><b>Creating a Passkey</b></h3><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>User Initiates Passkey Creation</b> - The user selects the option to “Enable Passkey” in the Grab app.</p></li></ol><ol start="2"><li><p class="paragraph" style="text-align:left;"><b>Frontend Requests User Data and Challenge</b>: Grab&#39;s frontend sends a request to Grab&#39;s backend server. 
This request asks for specific user details and a unique, cryptographically secure random number called a &quot;challenge.&quot; The challenge helps prevent <a class="link" href="https://en.wikipedia.org/wiki/Replay_attack?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">replay attacks</a>. The backend will (<i>hopefully</i>) receive the request and reply with the data necessary to initiate the passkey creation.</p></li></ol><ol start="3"><li><p class="paragraph" style="text-align:left;"><b>Frontend Invokes WebAuthn API for Passkey Creation</b>: After receiving the user data and challenge, the frontend will call the WebAuthn API to start the passkey generation process.</p></li></ol><ol start="4"><li><p class="paragraph" style="text-align:left;"><b>Authenticator Creates the Passkey</b>: The authenticator app on the user’s device will receive the request from the WebAuthn API. It will ask the user for consent (<i>with identification through fingerprint scan, facial recognition or a PIN</i>) and then generate a new public-private key pair.</p></li></ol><ol start="5"><li><p class="paragraph" style="text-align:left;"><b>Public Key and Data Sent to Frontend</b>: After creating the key pair, the authenticator will return a PublicKeyCredential object to the frontend. This contains the public key, the credential ID and other relevant data. It <i>does not </i>include the private key. That is kept secret on the user’s device.</p></li></ol><ol start="6"><li><p class="paragraph" style="text-align:left;"><b>Frontend Sends Public Key to Backend</b>: The frontend takes the PublicKeyCredential object and transmits the public key and associated data to Grab&#39;s backend server. Grab’s backend will store this info with the user’s account in its database. The public key will be used later to verify the user’s identity when they try to log in. 
 </p></li></ol><div class="image"><img alt="" class="image__image" style="" src="https://engineering.grab.com/img/embracing-passwordless-authentication-with-passkey/sequency-diagram-passkey-authentication.png"/></div><h3 class="heading" style="text-align:left;" id="authenticating-with-pass-key"><b>Authenticating with a Passkey</b></h3><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>User Initiates Sign-In with Passkey</b>: The user opens the Grab app and chooses the &quot;Sign in with Passkey&quot; option.</p></li></ol><ol start="2"><li><p class="paragraph" style="text-align:left;"><b>Frontend Requests a Challenge</b>: The Grab app&#39;s frontend sends a request to Grab&#39;s backend server, asking for a new, unique challenge. The backend responds with the challenge.</p></li></ol><ol start="3"><li><p class="paragraph" style="text-align:left;"><b>Frontend Invokes WebAuthn API for Authentication</b>: The frontend receives the challenge and calls the WebAuthn API. The authenticator app (iCloud Keychain, for example) will show a list of available passkeys that the user can choose from.</p></li></ol><ol start="4"><li><p class="paragraph" style="text-align:left;"><b>User Selects Passkey and Provides Consent</b>: The user selects the passkey for their Grab account and verifies using facial recognition, fingerprint or a passcode.</p></li></ol><ol start="5"><li><p class="paragraph" style="text-align:left;"><b>Authenticator Signs the Challenge</b>: The authenticator uses the private key associated with the selected passkey to create a digital signature of the challenge and other relevant data. This signature is created using algorithms like ECDSA or RSA. 
The signed data, along with the credential ID, is packaged into a <a class="link" href="https://developer.mozilla.org/en-US/docs/Web/API/PublicKeyCredential?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">PublicKeyCredential</a> object.</p></li></ol><ol start="6"><li><p class="paragraph" style="text-align:left;"><b>Frontend Sends Data to Backend</b> - The frontend receives the PublicKeyCredential object from the authenticator. It forwards this to Grab’s backend server.</p></li></ol><ol start="7"><li><p class="paragraph" style="text-align:left;"><b>Backend Verifies Signature and Logs User In</b>: Grab&#39;s backend receives the PublicKeyCredential object. It will look up the user’s public key from its database and use that to verify the digital signature. If it’s valid (<i>and the challenge matches the one it sent in step 2</i>) then the user can log into their account.</p></li></ol><hr class="content_break"><div class="image"><a class="image__link" href="https://dub.link/quas-feb3?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXecoi0vO47-nlgRj1gvFp2TsfUOapJr-0uVgdwor6II-006TbWNDon_J8cnY1TaOA5UNdAw5eErfF9aJzhRoZi_QHr10ljJOXS54clT3jqu3EfiwHSMOJJ4M670N7m13mGu0lGYjA?key=eXEz32zof7Iu-jmoXAT47A"/></a></div><h1 class="heading" style="text-align:left;" id="how-to-not-lose-a-billion-dollars-w"><b><a class="link" href="https://dub.link/quas-feb3-flag?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">How to NOT Lose a Billion Dollars with Feature Flags</a></b></h1><p class="paragraph" style="text-align:left;">Feature flags are a must for shipping new features quickly and safely. 
But getting them wrong can be a disaster. In 2012, the high-frequency trading firm Knight Capital famously lost $440 million in just 45 minutes due to a bad flag.<br><br><a class="link" href="https://dub.link/quas-feb3?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">Product for Engineers</a> wrote a fantastic <a class="link" href="https://dub.link/quas-feb3-flag?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">blog post</a> on how to avoid some of the biggest feature flag pitfalls:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Structuring flags safely</b> – Keep them out of your business logic to dodge hidden states and insane debugging sessions.</p></li><li><p class="paragraph" style="text-align:left;"><b>Killing zombie flags</b> – Old flags can be lethal tech debt, so set up a reliable removal plan.</p></li><li><p class="paragraph" style="text-align:left;"><b>Ensure graceful failure</b> – Always assume a feature flag can fail or return null. Bugs in the implementation of the feature flag shouldn&#39;t break your code. </p></li></ul><p class="paragraph" style="text-align:left;">For more engineering war stories and best practices, check out <a class="link" href="https://dub.link/quas-feb3?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank" rel="noopener noreferrer nofollow">Product for Engineers</a>. 
It&#39;s free.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://dub.link/quas-feb3?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system"><span class="button__text" style=""> Check out Product for Engineers </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://www.cosive.com/blog/my-washing-machine-refreshed-my-thinking-on-software-effort-estimation?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank"><div class="embed__content"><p class="embed__title"> How My Washing Machine Refreshed My Thinking on Software Effort Estimation </p><p class="embed__description"> In your professional career, something you’ll constantly have to do is give software estimates to non-technical stakeholders.<br><br>This is a great article to send to these stakeholders when you’re trying to explain *why* giving software estimates can be difficult.<br><br>Many tasks may seem simple but timelines can frequently blow up as the team comes across different blockers. Non-technical stakeholders may not understand the technical reasoning behind the blockers, but explaining it through the analogy of setting up a washing machine can make it easier to comprehend. 
</p><p class="embed__link"> www.cosive.com/blog/my-washing-machine-refreshed-my-thinking-on-software-effort-estimation </p></div></a></div><div class="embed"><a class="embed__url" href="https://chadnauseam.com/advice/writing-advice?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank"><div class="embed__content"><p class="embed__title"> A Compilation of Writing Advice from the Internet’s Best Writers </p><p class="embed__description"> This is a fantastic compilation of advice from writers like Scott Alexander, Eugene Wei, Steven Pinker, Julian Shapiro and more.<br><br>Writing for a blog is fundamentally different from writing for a book/essay and you’ll need to adapt your word choice, flow, style, etc. for the medium. Read the full article to learn how you can do this. </p><p class="embed__link"> chadnauseam.com/advice/writing-advice </p></div></a></div><div class="embed"><a class="embed__url" href="https://www.latent.space/p/gdr?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank"><img class="embed__image embed__image--top" src="https://substackcdn.com/image/fetch/w_1200,h_600,c_fill,f_jpg,q_auto:good,fl_progressive:steep,g_auto/https%3A%2F%2Fsubstack-video.s3.amazonaws.com%2Fvideo_upload%2Fpost%2F157348543%2F213917e3-f26f-4126-b4b5-cf55b6eb0b24%2Ftranscoded-1739896873.png"/><div class="embed__content"><p class="embed__title"> Interview with The Inventors of Deep Research </p><p class="embed__description"> Google Deep Research is an agent that can generate a detailed research report for you on any topic. The creators recently went on the Latent Space podcast to discuss the engineering behind the agent. They talk about RAG, evaluation methods and the limitations they faced. 
 </p><p class="embed__link"> www.latent.space/p/gdr </p></div></a></div><div class="embed"><a class="embed__url" href="https://www.cs.otago.ac.nz/cosc345/resources/nasa-10-rules.htm?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank"><div class="embed__content"><p class="embed__title"> NASA’s 10 Rules for Software Development </p><p class="embed__description"> NASA published a famous list of rules they follow when writing code. These are meant for embedded software on *extremely* expensive spacecraft but they can also be useful when you’re writing web applications, compilers or any other piece of software.<br><br>This is a great article that takes a critical look at these rules and reviews how useful they are when writing application software and programming language processors. </p><p class="embed__link"> www.cs.otago.ac.nz/cosc345/resources/nasa-10-rules.htm </p></div></a></div><div class="embed"><a class="embed__url" href="https://cefboud.com/posts/exploring-kafka-internals/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-grab-s-auth-system" target="_blank"><div class="embed__content"><p class="embed__title"> Exploring Apache Kafka Internals and Codebase </p><p class="embed__description"> This is a fantastic blog post by Moncef Abboud that dives into the Apache Kafka codebase, providing a behind-the-scenes look at its major components.<br><br>It’s a useful read if you work with Kafka on a daily basis but have never explored its internals. The post clarifies how Kafka’s networking, storage and overall architecture fit together. 
</p><p class="embed__link"> cefboud.com/posts/exploring-kafka-internals </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=c728942f-e054-43b3-8fdd-d341f2524e78&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How Pinterest Optimized Video Playback</title>
  <description>An introduction to Adaptive Bitrate Streaming and how Pinterest was able to reduce startup latency for videos. Plus, the architecture of open source applications and how Anthropic was able to improve RAG.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/057e3361-e34a-4cfa-b7ff-a58dcf203704/Screenshot_2024-09-20_at_2.46.01_PM.png" length="218267" type="image/png"/>
  <link>https://blog.quastor.org/p/how-pinterest-optimized-video-playback-02e9</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-pinterest-optimized-video-playback-02e9</guid>
  <pubDate>Fri, 14 Feb 2025 16:30:00 +0000</pubDate>
  <atom:published>2025-02-14T16:30:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How Pinterest Optimized Video Playback</b></p><ul><li><p class="paragraph" style="text-align:left;">Introduction to Adaptive Bitrate Streaming, HLS and DASH </p></li><li><p class="paragraph" style="text-align:left;"> Why Pinterest was experiencing high startup latency for videos</p></li><li><p class="paragraph" style="text-align:left;">Embedding the video manifest files in their metadata API and improving performance with caching</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">A Dive into Database Fundamentals</p></li><li><p class="paragraph" style="text-align:left;">Digital Signatures and how to avoid them</p></li><li><p class="paragraph" style="text-align:left;">The Architecture of Open Source Applications</p></li></ul></li></ul><p class="paragraph" style="text-align:left;"></p><hr class="content_break"><div class="image"><a class="image__link" href="https://www.nango.dev/blog/the-hidden-costs-of-building-product-integrations-in-house?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=02-14-2025" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/8f1eeb52-74a9-404b-9307-4a2bf44bdb05/678f631b7d0d2d6405c73a05_15._Hidden_costs_of_building_product_integrations__in-house-p-1600.png?t=1739487294"/></a></div><h1 class="heading" style="text-align:left;" id="the-hidden-costs-of-building-produc"><a class="link" href="https://www.nango.dev/blog/the-hidden-costs-of-building-product-integrations-in-house?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=02-14-2025" 
target="_blank" rel="noopener noreferrer nofollow"><b>The Hidden Costs of Building Product Integrations In-House</b></a></h1><p class="paragraph" style="text-align:left;">Building integrations in-house might seem like the obvious choice—full control, endlessly customizable, and no external dependencies. But what about the costs you don’t see upfront?</p><p class="paragraph" style="text-align:left;">From ongoing maintenance to unexpected edge cases, hidden costs of building every third party integration in-house can quickly add up and drain engineering resources.</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.nango.dev/blog/the-hidden-costs-of-building-product-integrations-in-house?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=02-14-2025" target="_blank" rel="noopener noreferrer nofollow">This article</a> from Nango spells out the gotchas you should watch out for before deciding to go 100% in-house, including:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Development costs</b> - Understand how engineering hours and project scope can balloon over time</p></li><li><p class="paragraph" style="text-align:left;"><b>Maintenance overhead</b> - See all the nitty gritty your team needs to take care of in order to have stable and scalable integrations in prod</p></li><li><p class="paragraph" style="text-align:left;"><b>Opportunity cost</b> - Consider how building integrations in-house can pull focus away from core product innovation</p></li></ul><div class="button" style="text-align:left;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.nango.dev/blog/the-hidden-costs-of-building-product-integrations-in-house?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=02-14-2025"><span class="button__text" style=""> Read the full article to uncover the true cost of building integrations in-house </span></a></div><p class="paragraph" 
style="text-align:left;"><i>sponsored</i></p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>How Pinterest Optimized Video Playback</b></h1><p class="paragraph" style="text-align:left;">Pinterest is a social media platform that helps you discover ideas and inspiration related to whatever you’re interested in (cooking recipes, home decor, clothing, etc)</p><p class="paragraph" style="text-align:left;">The platform was launched in 2010 and it’s grown to over 500 million monthly active users. Pinterest is now publicly traded and valued at more than $20 billion.</p><p class="paragraph" style="text-align:left;">Like every other social platform, video content is one of the most popular mediums on Pinterest. When you’re serving videos to your users, one of your highest priorities should be to minimize any buffering and startup delay. With the modern day attention span, even having your video buffer for a couple of seconds can result in a huge number of users leaving your app.</p><p class="paragraph" style="text-align:left;">Pinterest engineering published a great <a class="link" href="https://medium.com/pinterest-engineering/improving-abr-video-performance-at-pinterest-f0ea47a6d4fc?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-optimized-video-playback" target="_blank" rel="noopener noreferrer nofollow">blog post</a> on how they optimized video playback and reduced startup latency by 36%.</p><p class="paragraph" style="text-align:left;">We’ll give some context on how videos are streamed, what protocols are involved and what Pinterest did to optimize playback.</p><h2 class="heading" style="text-align:left;" id="introduction-to-adaptive-bitrate-st"><b>Introduction to Adaptive Bitrate Streaming</b></h2><p class="paragraph" style="text-align:left;">When you’re delivering video to users, one technique that’s used universally nowadays is Adaptive Bitrate Streaming.</p><p 
class="paragraph" style="text-align:left;">This is where you take the video and encode it at multiple bitrates and resolutions and store them all on your server. When a user wants to play the video, their phone will select the optimal rendition based on factors like network bandwidth and device characteristics to minimize any buffering.</p><p class="paragraph" style="text-align:left;">With Adaptive Bitrate Streaming, the player can also switch <i>dynamically</i> between different bitrates. If the internet connection weakens while they’re watching a video on their phone, ABR allows the player to automatically switch to a lower bitrate stream so playback can be smooth without any buffering interruptions.</p><p class="paragraph" style="text-align:left;">When the network improves, the player will automatically switch back to the higher bitrate stream to provide better video quality.</p><h2 class="heading" style="text-align:left;" id="basics-of-adaptive-bitrate-streamin"><b>Basics of Adaptive Bitrate Streaming</b></h2><p class="paragraph" style="text-align:left;">There are different protocols you can use for Adaptive Bitrate Streaming, but they share some common fundamentals.</p><ul><li><p class="paragraph" style="text-align:left;"><b>Chunking</b> - the video file is broken up into small chunks. Each chunk ranges from 2-10 seconds in length.</p></li><li><p class="paragraph" style="text-align:left;"><b>Multiple Renditions</b> - Each chunk is encoded at multiple bitrates and resolutions. 
 </p></li><li><p class="paragraph" style="text-align:left;"><b>Manifest File</b> - a manifest file contains metadata about the available renditions for every chunk, including their bitrates and resolutions.</p></li><li><p class="paragraph" style="text-align:left;"><b>Dynamic Selection</b><b> </b>- the user’s video player will use the manifest file to determine which chunk to download based on the current network conditions and device capabilities.</p></li></ul><p class="paragraph" style="text-align:left;">The most widely adopted Adaptive Bitrate protocols are HTTP Live Streaming (HLS) and Dynamic Adaptive Streaming over HTTP (DASH).</p><p class="paragraph" style="text-align:left;">You’ve probably guessed from the names that HLS and DASH are both based on HTTP.</p><h2 class="heading" style="text-align:left;" id="http-live-streaming-hls"><b>HTTP Live Streaming (HLS)</b></h2><p class="paragraph" style="text-align:left;">HLS was developed by Apple in 2009 and it’s one of the earliest and most widely adopted ABR protocols. The video stream is broken into small, HTTP-based downloads. It supports both live and on-demand streaming.</p><p class="paragraph" style="text-align:left;">It’s developed and maintained by Apple so it’s natively supported on iOS, macOS and Safari.</p><p class="paragraph" style="text-align:left;">HLS uses <a class="link" href="https://docs.aws.amazon.com/mediatailor/latest/ug/manifest-hls-example.html?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-optimized-video-playback" target="_blank" rel="noopener noreferrer nofollow">.m3u8 manifest</a> files to guide the player in selecting the most appropriate video chunks based on real-time network conditions. 
</p><h2 class="heading" style="text-align:left;" id="dynamic-adaptive-streaming-over-htt"><b>Dynamic Adaptive Streaming over HTTP (DASH)</b></h2><p class="paragraph" style="text-align:left;">DASH was created by a consortium of companies led by MPEG (Moving Picture Experts Group). The protocol was first published in 2012 and it currently powers platforms like YouTube and Netflix.</p><p class="paragraph" style="text-align:left;">DASH uses <a class="link" href="https://ottverse.com/structure-of-an-mpeg-dash-mpd/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-optimized-video-playback" target="_blank" rel="noopener noreferrer nofollow">.mpd manifest</a> files to provide metadata about the available renditions and chunk URLs.</p><h2 class="heading" style="text-align:left;" id="video-streaming-at-pinterest"><b>Video Streaming at Pinterest</b></h2><p class="paragraph" style="text-align:left;">At Pinterest, HLS and DASH are used for delivering videos on iOS and Android, respectively. </p><ul><li><p class="paragraph" style="text-align:left;"><b>HLS:</b> Utilized for video streaming on iOS devices through Apple’s AVPlayer, accounting for approximately 70% of video playback sessions on iOS apps.</p></li><li><p class="paragraph" style="text-align:left;"><b>DASH:</b> Employed for video streaming on Android devices using ExoPlayer, representing around 55% of video playback sessions on Android.</p></li></ul><p class="paragraph" style="text-align:left;">One of the key metrics Pinterest measures for video performance is <b>startup latency</b> - the time it takes for a video to begin playing after a user initiates playback.</p><p class="paragraph" style="text-align:left;">As we stated above, both HLS and DASH require a manifest file before you can initiate video playback. 
With HLS, you might have to download <i>additional</i> manifest files (<i>for the specific rendition</i>) after downloading the main one.</p><p class="paragraph" style="text-align:left;"> Only <i>after</i> you download the manifest file can the video player start downloading the first few chunks of the video. This is the primary contributor to users’ perceived latency.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/4c3b1701-f6e1-4fb3-b9b8-5884485aacc3/Screenshot_2024-09-20_at_5.00.18_AM.png?t=1726822821"/></div><p class="paragraph" style="text-align:left;">The Pinterest team decided to eliminate the latency from the round trips by embedding all the relevant manifest files in the original API response. When a user first requests metadata for a video (<i>thumbnail, title, etc.</i>), the API response to that request will also contain the manifest files of the video.</p><p class="paragraph" style="text-align:left;">During playback, the player can swiftly access the manifest information locally and immediately start downloading video chunks.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdn-Z1mJjqcBCJXOoxyehpko5l0jhuzoU_jmn64d4wi7asI8qPt2EcrrPnDHY4ObL-BbmQ0hEOsFJ43-9cWWEB3aYDs6BdzW7wbe0Zu7JfiQgMyYOgaXjusOizqlrXfN8kBFyH1vbj5L-DrRho0B5dYSxlL?key=R6LUnjdZjB11ZSxF6NRNGg"/></div><h2 class="heading" style="text-align:left;" id="reducing-api-response-time"><b>Reducing API Response Time</b></h2><p class="paragraph" style="text-align:left;">When Pinterest started including manifest files in the API responses, the primary issue they faced was increased latency for the API endpoint. The backend now had to retrieve manifest files before it could respond with video metadata.</p><p class="paragraph" style="text-align:left;">They were able to solve this issue with caching. 
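</p><p class="paragraph" style="text-align:left;">Conceptually, a metadata response with the manifests embedded might be shaped like the sketch below; the field names are invented for illustration and are not Pinterest’s actual schema:</p>

```python
# Hypothetical pin metadata response with manifests embedded inline,
# so the player skips the extra manifest round trips before playback.
pin_metadata_response = {
    "pin_id": "12345",
    "title": "Example video pin",
    "thumbnail_url": "https://example.com/thumb.jpg",
    "video": {
        # Serialized manifest contents shipped with the metadata itself.
        "hls_manifest": "#EXTM3U\n#EXT-X-STREAM-INF:BANDWIDTH=800000\n360p/playlist.m3u8\n",
        "dash_manifest": "<MPD>...</MPD>",
    },
}

# The client can hand the embedded manifest straight to the video player.
embedded_hls = pin_metadata_response["video"]["hls_manifest"]
```

<p class="paragraph" style="text-align:left;">During playback, the player parses these embedded bytes locally instead of fetching the manifest over the network. 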
They added a <a class="link" href="https://medium.com/pinterest-engineering/improving-distributed-caching-performance-and-efficiency-at-pinterest-92484b5fe39b?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-optimized-video-playback" target="_blank" rel="noopener noreferrer nofollow">MemCache</a> layer into the manifest serving process to cache the most popular video manifest files.</p><p class="paragraph" style="text-align:left;">Here’s the new process for retrieving manifest files.</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>API Request</b> - a client requests Pin metadata</p></li><li><p class="paragraph" style="text-align:left;"><b>Manifest Embedding</b> - the backend retrieves manifest files from S3, serializes them and embeds the bytes within the API response</p></li><li><p class="paragraph" style="text-align:left;"><b>MemCache</b> - Subsequent requests for popular video manifest files are served immediately from the MemCache caching layer.</p></li><li><p class="paragraph" style="text-align:left;"><b>Response Delivery</b><b> </b>- the API delivers the payload with the manifest data embedded</p></li></ol><h2 class="heading" style="text-align:left;" id="results"><b>Results</b></h2><p class="paragraph" style="text-align:left;">With this new setup, Pinterest was able to see a 36.7% reduction in p90 startup latency on iOS. 
They also saw a 12.3% reduction in the number of users who had to wait longer than 1 second for a video to start.</p><hr class="content_break"><div class="image"><a class="image__link" href="https://www.nango.dev/blog/why-is-oauth-still-hard?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=02-14-2025" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/7b8d8805-7294-4130-9f69-584d3eff6e27/678f625ee603abbd012c27fb_4._Why_is_OAuth_still_hard_in_2024_-p-1600.png?t=1739488400"/></a></div><h1 class="heading" style="text-align:left;" id="why-is-o-auth-still-hard-in-2025"><a class="link" href="https://www.nango.dev/blog/why-is-oauth-still-hard?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=02-14-2025" target="_blank" rel="noopener noreferrer nofollow"><b>Why is OAuth still hard in 2025?</b></a></h1><p id="o-auth-in-2025-is-like-java-script-" class="paragraph" style="text-align:left;">OAuth in 2025 is like JavaScript browser APIs in 2008. It’s a complete mess. 
But why is OAuth still hard?<br><br>This article breaks down the key reasons and all the gotchas to think about when implementing OAuth.</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.nango.dev/blog/why-is-oauth-still-hard?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=02-14-2025" target="_blank" rel="noopener noreferrer nofollow">Read this article to see what to watch out for with OAuth.</a></p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://tontinton.com/posts/database-fundementals/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-optimized-video-playback" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/cd7895b7-89e5-4cad-90f0-3d09a4755b31/Screenshot_2025-02-13_at_5.20.39_PM.png?t=1739488856"/><div class="embed__content"><p class="embed__title"> A Dive into Database Fundamentals </p><p class="embed__description"> This article is a terrific deep dive into database fundamentals. It talks about ACID properties (and how they can be implemented in a toy database), storage engines, indexes, LSM trees, the CAP theorem and much more.<br><br>It gives a fantastic overview of many of the most interesting topics in databases and also provides additional resources if you want to go further. 
</p><p class="embed__link"> tontinton.com/posts/database-fundementals </p></div></a></div><div class="embed"><a class="embed__url" href="https://neilmadden.blog/2024/09/18/digital-signatures-and-how-to-avoid-them/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-optimized-video-playback" target="_blank"><img class="embed__image embed__image--top" src="https://neilmadden.blog/wp-content/uploads/2023/10/ef036508-8b30-4229-98bd-a08c1a54dddb-550-000000c5f02d5a70_file.jpg"/><div class="embed__content"><p class="embed__title"> Digital signatures and how to avoid them </p><p class="embed__description"> Neil Madden is the author of API Security in Action and has worked as a Security Architect and software engineer.<br><br>He wrote a really interesting blog post on digital signatures, how they work and when they should be used (and when they should be avoided). He talks about the fragility of current signature schemes and how they can lose important contextual details. For many use-cases, Madden suggests using simpler methods like HMAC for authentication instead of digital signatures. </p><p class="embed__link"> neilmadden.blog/2024/09/18/digital-signatures-and-how-to-avoid-them </p></div></a></div><div class="embed"><a class="embed__url" href="https://aosabook.org/en/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-optimized-video-playback" target="_blank"><div class="embed__content"><p class="embed__title"> The Architecture of Open Source Applications </p><p class="embed__description"> This is a terrific series of free books that teach you software architecture using practical examples from open source.<br><br>The chapters go through applications like Git, CMake, Audacity, Firefox and more and explain how they work. 
</p><p class="embed__link"> aosabook.org/en </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=4116c4af-2ca9-4b12-8ea7-be56636db750&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How GitHub Rebuilt Their Push Processing System</title>
  <description>We&#39;ll talk about how GitHub decoupled their system for processing code pushes. Plus, resources for CTOs and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/993bc404-4763-4623-9565-dbbeb94f550f/4_graphics_-_5_7_2024-06.png" length="59780" type="image/png"/>
  <link>https://blog.quastor.org/p/how-github-rebuilt-their-push-processing-system-0157</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-github-rebuilt-their-push-processing-system-0157</guid>
  <pubDate>Fri, 24 Jan 2025 16:15:00 +0000</pubDate>
  <atom:published>2025-01-24T16:15:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How GitHub Rebuilt their Push Processing System</b></p><ul><li><p class="paragraph" style="text-align:left;">GitHub rebuilt their system for handling code pushes to make it more decoupled</p></li><li><p class="paragraph" style="text-align:left;">We’ll give a brief overview of decoupled architectures and their pros/cons</p></li><li><p class="paragraph" style="text-align:left;">After, we’ll talk about why GitHub split their push processing system from a single job to a group of smaller, independent jobs.</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Pair Programming Antipatterns</p></li><li><p class="paragraph" style="text-align:left;">Resources for CTOs</p></li><li><p class="paragraph" style="text-align:left;">Nine ways to shoot yourself in the foot with Postgres</p></li><li><p class="paragraph" style="text-align:left;">How CloudFlare debugged an issue with dropped packets</p></li></ul></li></ul><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>How GitHub Rebuilt their Push Processing System</b></h1><p class="paragraph" style="text-align:left;">GitHub has over 420 million repos and 100 million registered users. 
Every month, the platform handles over 500 million code pushes from 8.5 million developers.</p><p class="paragraph" style="text-align:left;">Whenever someone pushes code to a GitHub repository, this kicks off a chain of tasks.</p><p class="paragraph" style="text-align:left;">GitHub has to do things like:</p><ul><li><p class="paragraph" style="text-align:left;">Update the repo with the latest commits</p></li><li><p class="paragraph" style="text-align:left;">Dispatch any Push webhooks</p></li><li><p class="paragraph" style="text-align:left;">Trigger relevant GitHub workflows</p></li></ul><p class="paragraph" style="text-align:left;">And much more. In fact, GitHub has 20 different services that run in response to a developer pushing code.</p><p class="paragraph" style="text-align:left;">Previously, push requests were handled by a single, enormous job (called <code>RepositoryPushJob</code>). Whenever you pushed code, GitHub’s Ruby on Rails monolith would enqueue <code>RepositoryPushJob</code> and handle all the underlying sub-jobs in a sequential manner.<br><br>However, the company faced issues with this approach and decided to switch to a more decoupled architecture with Apache Kafka. GitHub published a great <a class="link" href="https://github.blog/2024-06-11-how-we-improved-push-processing-on-github/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-github-rebuilt-their-push-processing-system" target="_blank" rel="noopener noreferrer nofollow">blog post</a> delving into the details.</p><p class="paragraph" style="text-align:left;">In this article, we’ll first give an overview of decoupled architectures and the pros/cons. 
Then, we’ll talk about the changes GitHub made.</p><h2 class="heading" style="text-align:left;" id="overview-of-decoupled-architectures"><b>Overview of Decoupled Architectures</b></h2><p class="paragraph" style="text-align:left;">If a user does some significant action on your app, you might have to perform a series of different jobs. If you have a video sharing website, you’ll have a bunch of different things that need to be done when someone uploads a video (encoding, generating transcripts, checking for piracy, etc.).</p><p class="paragraph" style="text-align:left;">A key question is how coupled you want these jobs to be.</p><p class="paragraph" style="text-align:left;">On one side of the spectrum, you can combine these sub-jobs (<i>encoding, generating transcripts, etc.</i>) into a single larger job (<i>ProcessVideo</i>) and then execute them in a sequential manner.</p><p class="paragraph" style="text-align:left;">On the other side, you can have different services for each of the jobs and have them execute in parallel. Whenever a user uploads a video, you’ll add an event with the video’s details to an event streaming platform (like Kafka). Then, the different sub-jobs will consume the event and run independently. </p><p class="paragraph" style="text-align:left;">Some of the pros of a decoupled approach are</p><ul><li><p class="paragraph" style="text-align:left;"><b>Scalability</b> - Each of the components can be scaled up/down independently based on their specific load and demand.</p></li><li><p class="paragraph" style="text-align:left;"><b>Fault Isolation</b> - Components are independent so a failure in one component can be contained (<i>hopefully</i>).</p></li><li><p class="paragraph" style="text-align:left;"><b>Easier Development</b> - Each component can be deployed independently. This makes things much easier if you have a large number of developers working together. 
</p></li></ul><p class="paragraph" style="text-align:left;">Cons with the decoupled approach include</p><ul><li><p class="paragraph" style="text-align:left;"><b>Increased Complexity</b> - Managing coordination between the independent components can be much more complex. You may need additional tooling for observability and monitoring.</p></li><li><p class="paragraph" style="text-align:left;"><b>System Overhead</b> - Communication between components can become slower, especially if it requires a network request. If there are network requests involved, then you’ll have significantly more latency and failures that you’ll have to deal with.</p></li><li><p class="paragraph" style="text-align:left;"><b>Data Consistency</b> - You’ll need to think about making sure data is consistent across the components. </p></li></ul><h2 class="heading" style="text-align:left;" id="git-hubs-old-tightly-coupled-archit"> <b>GitHub’s Old Tightly Coupled Architecture</b></h2><p class="paragraph" style="text-align:left;">Previously, GitHub used a single massive job called <code>RepositoryPushJob</code> for handling pushes. This job managed all the sub-jobs and triggered them one after another in a sequential series of steps.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-us.googleusercontent.com/docsz/AD_4nXf8KCX4PUFVHjqr0yszMhIuxG0_DFOtAIhDeJh0ByOeOw2st6mNZI_gpv5Eri--0qBPAAT9usitPZC3AgExdzI6VgCvKYEfSBQAdxKyGnmeyAwv1342-JnWuhy8UolaOLDzyT8QZQVZ1R7LGS474gvLF94G?key=b4_VPeSVz4YqXbyB_t9mGA"/></div><p class="paragraph" style="text-align:left;">However, the GitHub team was facing quite a few issues with this approach</p><ul><li><p class="paragraph" style="text-align:left;"><b>Difficulty with Retries</b> - If <code>RepositoryPushJob</code> failed then it would have to be retried. However, this caused issues with some of the sub-jobs that were not idempotent (<i>you couldn’t run them multiple times</i>). 
For example, sending multiple push webhooks could cause issues with clients that were receiving the webhooks. </p></li><li><p class="paragraph" style="text-align:left;"><b>Huge Blast Radius</b><b> </b>- The fact that jobs were set in a sequential series of steps meant that later sub-jobs had an implicit dependency on initial sub-jobs. As you increase the number of sub-jobs in <code>RepositoryPushJob</code>, the probability of failure increases.</p></li><li><p class="paragraph" style="text-align:left;"><b>Too Slow</b><b> </b>- Having a super long sequential process is bad for latency. The sub-jobs at the end of <code>RepositoryPushJob</code> had to wait for the sub-jobs in the beginning. This structure led to unnecessary latency for many user-facing push tasks (<i>over a second in some cases</i>).</p></li></ul><h2 class="heading" style="text-align:left;" id="git-hub-new-architecture"><b>GitHub’s New Architecture</b></h2><p class="paragraph" style="text-align:left;">To decrease the coupling in the push system, GitHub decided to break up <code>RepositoryPushJob</code> into smaller, independent jobs.</p><p class="paragraph" style="text-align:left;">They looked at each of the sub-jobs in <code>RepositoryPushJob</code> and grouped them based on dependencies, retry-ability, owning service, etc. Each group of sub-jobs was placed into an independent job with a clear owner and appropriate retry configuration.<br><br>Whenever a developer pushes to a repo, GitHub will add a new event to Kafka. A Kafka consumer service will monitor the Kafka topic and consume the events.<br><br>If there’s a new event, the service will enqueue all the independent background jobs onto a job queue for processing. 
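</p><p class="paragraph" style="text-align:left;">As a rough sketch of that fan-out step (not GitHub’s actual code; the job names, event fields and queue interface here are all hypothetical):</p>

```python
# Sketch: handling one push event by enqueuing each sub-task as its
# own independent background job instead of one sequential mega-job.
INDEPENDENT_JOBS = ["update_refs", "dispatch_webhooks", "trigger_workflows"]

def fan_out_push_event(event, enqueue):
    # Each job is enqueued separately with its own retry policy, so a
    # webhook failure no longer forces a re-run of the ref update.
    for job_name in INDEPENDENT_JOBS:
        enqueue({"job": job_name, "repo": event["repo"], "commit": event["after"]})

# Example with a simple list standing in for the real job queue.
queue = []
fan_out_push_event({"repo": "octocat/hello-world", "after": "abc123"}, queue.append)
```

<p class="paragraph" style="text-align:left;">Because each job is its own unit of work, retries and failures stay scoped to that job. 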
A dedicated pool of worker nodes will then handle the jobs in the queue.</p><p class="paragraph" style="text-align:left;">In order to catch any issues, GitHub built extensive observability to monitor the flow of events through the pipeline.</p><h2 class="heading" style="text-align:left;" id="results"><b>Results</b></h2><p class="paragraph" style="text-align:left;">GitHub has seen great results with the new system. Some of the improvements include</p><ul><li><p class="paragraph" style="text-align:left;"><b>Reliability Improvements</b> - The old <code>RepositoryPushJob</code> was able to fully process 99.987% of pushes with no failures. The new pipeline is able to fully process 99.999% of pushes.</p></li><li><p class="paragraph" style="text-align:left;"><b>Lower Latency</b> - GitHub saw a notable decrease in the pull request sync time with a drop of nearly 33% (<i>in the P50 time</i>).</p></li><li><p class="paragraph" style="text-align:left;"><b>Smaller Blast Radius</b> - previously, an issue with a single step in <code>RepositoryPushJob</code> could impact all subsequent sub-jobs. 
Now, failures are much more isolated and there’s a smaller blast radius for when things go wrong.</p></li></ul><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://blog.cloudflare.com/lost-in-transit-debugging-dropped-packets-from-negative-header-lengths/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-github-rebuilt-their-push-processing-system" target="_blank"><img class="embed__image embed__image--left" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/69b65d8c-3015-4c6c-b5de-f5639a0006ef/image2-38.png?t=1718515856"/><div class="embed__content"><p class="embed__title"> Debugging Dropped Packets at Cloudflare </p><p class="embed__description"> This is an interesting article that delves into a problem Cloudflare was facing where they had drops in bandwidth and failing API requests after making a change to their load balancers. They eventually traced the problem to a bug in the Linux kernel.<br><br>Terin Stock is a software engineer at Cloudflare and he wrote a post delving into packet handling and using tools like pwru to debug network issues and kprobe for kernel issues. </p><p class="embed__link"> https://blog.cloudflare.com/lost-in-transit-debugging-dropped-packets-from-negative-header-lengths/ </p></div></a></div><div class="embed"><a class="embed__url" href="https://tuple.app/pair-programming-guide/antipatterns?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-github-rebuilt-their-push-processing-system" target="_blank"><div class="embed__content"><p class="embed__title"> Pair Programming Antipatterns </p><p class="embed__description"> Pair programming can be an excellent tool for educating junior developers on the codebase; however, there are quite a few anti-patterns you’ll want to 
avoid.<br><br>This article gives a great list of some of them for the person leading the pair programming session (the driver) and the person following (the navigator). </p><p class="embed__link"> tuple.app/pair-programming-guide/antipatterns </p></div></a></div><div class="embed"><a class="embed__url" href="https://github.com/kuchin/awesome-cto?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-github-rebuilt-their-push-processing-system" target="_blank"><img class="embed__image embed__image--left" src="https://opengraph.githubassets.com/ca672bfe717a37a8bd0e8eccbef22dd1df7a141c1f5f942a60dc7859751f2dee/kuchin/awesome-cto"/><div class="embed__content"><p class="embed__title"> Resources for CTOs </p><p class="embed__description"> This is a great GitHub repo with resources for CTOs (or aspiring CTOs).<br><br>It contains resources on software development processes, architecture, product management, hiring and much more! </p><p class="embed__link"> github.com/kuchin/awesome-cto </p></div></a></div><div class="embed"><a class="embed__url" href="https://philbooth.me/blog/nine-ways-to-shoot-yourself-in-the-foot-with-postgresql?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-github-rebuilt-their-push-processing-system" target="_blank"><div class="embed__content"><p class="embed__title"> Nine ways to shoot yourself in the foot with PostgreSQL </p><p class="embed__description"> Many developers have Postgres as their first choice when they need a database (for good reason).<br><br>However, there are some gotchas you should be aware of, especially if you plan on scaling the database. Phil Booth wrote a great blog post delving into some of these potential pitfalls that can become a problem as you scale. 
</p><p class="embed__link"> philbooth.me/blog/nine-ways-to-shoot-yourself-in-the-foot-with-postgresql </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=e4b2e5f0-4339-4411-9c15-a9ed7d2e02eb&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How GitHub Rebuilt Their Push Processing System</title>
  <description>We&#39;ll talk about how GitHub decoupled their system for processing code pushes. Plus, resources for CTOs and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/993bc404-4763-4623-9565-dbbeb94f550f/4_graphics_-_5_7_2024-06.png" length="59780" type="image/png"/>
  <link>https://blog.quastor.org/p/how-github-rebuilt-their-push-processing-system</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-github-rebuilt-their-push-processing-system</guid>
  <pubDate>Fri, 24 Jan 2025 16:10:00 +0000</pubDate>
  <atom:published>2025-01-24T16:10:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How GitHub Rebuilt their Push Processing System</b></p><ul><li><p class="paragraph" style="text-align:left;">GitHub rebuilt their system for handling code pushes to make it more decoupled</p></li><li><p class="paragraph" style="text-align:left;">We’ll give a brief overview of decoupled architectures and their pros/cons</p></li><li><p class="paragraph" style="text-align:left;">After, we’ll talk about why GitHub split their push processing system from a single job to a group of smaller, independent jobs.</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Pair Programming Antipatterns</p></li><li><p class="paragraph" style="text-align:left;">Resources for CTOs</p></li><li><p class="paragraph" style="text-align:left;">Nine ways to shoot yourself in the foot with Postgres</p></li><li><p class="paragraph" style="text-align:left;">How CloudFlare debugged an issue with dropped packets</p></li></ul></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://workos.com/radar?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/c43ce0c4-58a2-430c-ba71-79dc2c40b9e4/Radar_Quastor_2x1-1.png?t=1737734168"/></a></div><h1 class="heading" style="text-align:left;" id="protect-your-app-with-work-os-radar"><a class="link" href="https://workos.com/radar?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">Protect your App with WorkOS Radar</a></h1><p class="paragraph" 
style="text-align:left;">Does your app get fake signups, throwaway emails, or users abusing your free tier?</p><p class="paragraph" style="text-align:left;">Or <i>worse</i>, bot attacks and brute force attempts?</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://workos.com/radar?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">WorkOS Radar</a> can block all this and more. A simple API gives you advanced device fingerprinting that can detect bad actors, bots, and suspicious behaviors.</p><p class="paragraph" style="text-align:left;">Your users trust you. Let’s keep it that way.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://workos.com/radar?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025"><span class="button__text" style=""> Learn How to Protect Your App with WorkOS Radar </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>How GitHub Rebuilt their Push Processing System</b></h1><p class="paragraph" style="text-align:left;">GitHub has over 420 million repos and 100 million registered users. 
Every month, the platform handles over 500 million code pushes from 8.5 million developers.</p><p class="paragraph" style="text-align:left;">Whenever someone pushes code to a GitHub repository, this kicks off a chain of tasks.</p><p class="paragraph" style="text-align:left;">GitHub has to do things like:</p><ul><li><p class="paragraph" style="text-align:left;">Update the repo with the latest commits</p></li><li><p class="paragraph" style="text-align:left;">Dispatch any Push webhooks</p></li><li><p class="paragraph" style="text-align:left;">Trigger relevant GitHub workflows</p></li></ul><p class="paragraph" style="text-align:left;">And much more. In fact, GitHub has 20 different services that run in response to a developer pushing code.</p><p class="paragraph" style="text-align:left;">Previously, push requests were handled by a single, enormous job (called <code>RepositoryPushJob</code>). Whenever you pushed code, GitHub’s Ruby on Rails monolith would enqueue <code>RepositoryPushJob</code> and handle all the underlying sub-jobs in a sequential manner.<br><br>However, the company faced issues with this approach and decided to switch to a more decoupled architecture with Apache Kafka. GitHub published a great <a class="link" href="https://github.blog/2024-06-11-how-we-improved-push-processing-on-github/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-github-rebuilt-their-push-processing-system" target="_blank" rel="noopener noreferrer nofollow">blog post</a> delving into the details.</p><p class="paragraph" style="text-align:left;">In this article, we’ll first give an overview of decoupled architectures and the pros/cons. 
Then, we’ll talk about the changes GitHub made.</p><h2 class="heading" style="text-align:left;" id="overview-of-decoupled-architectures"><b>Overview of Decoupled Architectures</b></h2><p class="paragraph" style="text-align:left;">If a user does some significant action on your app, you might have to perform a series of different jobs. If you have a video sharing website, you’ll have a bunch of different things that need to be done when someone uploads a video (encoding, generating transcripts, checking for piracy, etc.).</p><p class="paragraph" style="text-align:left;">A key question is how coupled you want these jobs to be.</p><p class="paragraph" style="text-align:left;">On one side of the spectrum, you can combine these sub-jobs (<i>encoding, generating transcripts, etc.</i>) into a single larger job (<i>ProcessVideo</i>) and then execute them in a sequential manner.</p><p class="paragraph" style="text-align:left;">On the other side, you can have different services for each of the jobs and have them execute in parallel. Whenever a user uploads a video, you’ll add an event with the video’s details to an event streaming platform (like Kafka). Then, the different sub-jobs will consume the event and run independently. </p><p class="paragraph" style="text-align:left;">Some of the pros of a decoupled approach are</p><ul><li><p class="paragraph" style="text-align:left;"><b>Scalability</b> - Each of the components can be scaled up/down independently based on their specific load and demand.</p></li><li><p class="paragraph" style="text-align:left;"><b>Fault Isolation</b> - Components are independent so a failure in one component can be contained (<i>hopefully</i>).</p></li><li><p class="paragraph" style="text-align:left;"><b>Easier Development</b> - Each component can be deployed independently. This makes things much easier if you have a large number of developers working together. 
</p></li></ul><p class="paragraph" style="text-align:left;">Cons with the decoupled approach include:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Increased Complexity</b> - Managing coordination between the independent components can be much more complex. You may need additional tooling for observability and monitoring.</p></li><li><p class="paragraph" style="text-align:left;"><b>System Overhead</b> - Communication between components can become slower, especially if it requires a network request. Network hops also bring significantly more latency and failure modes that you’ll have to deal with.</p></li><li><p class="paragraph" style="text-align:left;"><b>Data Consistency</b> - You’ll need to think about making sure data is consistent across the components. </p></li></ul><h2 class="heading" style="text-align:left;" id="git-hubs-old-tightly-coupled-archit"><b>GitHub’s Old Tightly Coupled Architecture</b></h2><p class="paragraph" style="text-align:left;">Previously, GitHub used a single massive job called <code>RepositoryPushJob</code> for handling pushes. This job managed all the sub-jobs and triggered them one after another in a sequential series of steps.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-us.googleusercontent.com/docsz/AD_4nXf8KCX4PUFVHjqr0yszMhIuxG0_DFOtAIhDeJh0ByOeOw2st6mNZI_gpv5Eri--0qBPAAT9usitPZC3AgExdzI6VgCvKYEfSBQAdxKyGnmeyAwv1342-JnWuhy8UolaOLDzyT8QZQVZ1R7LGS474gvLF94G?key=b4_VPeSVz4YqXbyB_t9mGA"/></div><p class="paragraph" style="text-align:left;">However, the GitHub team was facing quite a few issues with this approach:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Difficulty with Retries</b> - If <code>RepositoryPushJob</code> failed, it would have to be retried. However, this caused issues with some of the sub-jobs that were not idempotent (<i>they couldn’t safely be run multiple times</i>). 
For example, sending multiple push webhooks could cause issues with clients that were receiving the webhooks. </p></li><li><p class="paragraph" style="text-align:left;"><b>Huge Blast Radius</b><b> </b>- Because the jobs ran as a sequential series of steps, later sub-jobs had an implicit dependency on earlier sub-jobs. As you increase the number of sub-jobs in <code>RepositoryPushJob</code>, the probability of failure increases.</p></li><li><p class="paragraph" style="text-align:left;"><b>Too Slow</b><b> </b>- Having a super long sequential process is bad for latency. The sub-jobs at the end of <code>RepositoryPushJob</code> had to wait for the sub-jobs in the beginning. This structure led to unnecessary latency for many user-facing push tasks (<i>over a second in some cases</i>).</p></li></ul><h2 class="heading" style="text-align:left;" id="git-hub-new-architecture"><b>GitHub’s New Architecture</b></h2><p class="paragraph" style="text-align:left;">To decrease the coupling in the push system, GitHub decided to break up <code>RepositoryPushJob</code> into smaller, independent jobs.</p><p class="paragraph" style="text-align:left;">They looked at each of the sub-jobs in <code>RepositoryPushJob</code> and grouped them based on dependencies, retry-ability, owning service, etc. Each group of sub-jobs was placed into an independent job with a clear owner and appropriate retry configuration.<br><br>Whenever a developer pushes to a repo, GitHub will add a new event to Kafka. A Kafka consumer service will monitor the Kafka topic and consume the events.<br><br>If there’s a new event, the service will enqueue all the independent background jobs onto a job queue for processing. 
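In code terms, the fan-out pattern described above looks roughly like the following sketch (a simplified, hypothetical stand-in for illustration, not GitHub's actual implementation; the job names, payloads, and retry budgets are all made up):

```python
import queue

JOB_QUEUE = queue.Queue()

def update_repo(event):
    return f"updated {event['repo']} to {event['commit']}"

def dispatch_webhooks(event):
    return f"webhooks sent for {event['repo']}"

def trigger_workflows(event):
    return f"workflows triggered for {event['repo']}"

# Each job is registered independently with its own retry budget.
# dispatch_webhooks gets zero retries because it is not idempotent.
PUSH_JOBS = [
    ("update_repo", update_repo, 3),
    ("dispatch_webhooks", dispatch_webhooks, 0),
    ("trigger_workflows", trigger_workflows, 3),
]

def on_push_event(event):
    """Stand-in for the Kafka consumer: enqueue every job independently."""
    for name, handler, max_retries in PUSH_JOBS:
        JOB_QUEUE.put((name, handler, max_retries, event))

def run_worker():
    """Stand-in for the worker pool: each job succeeds or fails on its own."""
    results = {}
    while not JOB_QUEUE.empty():
        name, handler, max_retries, event = JOB_QUEUE.get()
        for attempt in range(max_retries + 1):
            try:
                results[name] = handler(event)
                break
            except Exception:
                if attempt == max_retries:
                    results[name] = "failed"  # isolated; other jobs unaffected
    return results
```

Because each job carries its own retry policy, a webhook failure no longer forces a re-run of the repository update, and a slow step at the front of the old sequential chain no longer delays everything behind it.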
A dedicated pool of worker nodes will then handle the jobs in the queue.</p><p class="paragraph" style="text-align:left;">In order to catch any issues, GitHub built extensive observability to monitor the flow of events through the pipeline.</p><h2 class="heading" style="text-align:left;" id="results"><b>Results</b></h2><p class="paragraph" style="text-align:left;">GitHub has seen great results with the new system. Some of the improvements include:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Reliability Improvements</b> - The old <code>RepositoryPushJob</code> was able to fully process 99.987% of pushes with no failures. The new pipeline is able to fully process 99.999% of pushes.</p></li><li><p class="paragraph" style="text-align:left;"><b>Lower Latency</b> - GitHub saw a notable decrease in pull request sync time, with a drop of nearly 33% (<i>in the P50 time</i>).</p></li><li><p class="paragraph" style="text-align:left;"><b>Smaller Blast Radius</b> - Previously, an issue with a single step in <code>RepositoryPushJob</code> could impact all subsequent sub-jobs. 
Now, failures are much more isolated and there’s a smaller blast radius for when things go wrong.</p></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://workos.com/radar?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" rel="noopener" target="_blank"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/c43ce0c4-58a2-430c-ba71-79dc2c40b9e4/Radar_Quastor_2x1-1.png?t=1737734168"/></a></div><h1 class="heading" style="text-align:left;" id="protect-your-app-with-work-os-radar"><a class="link" href="https://workos.com/radar?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">Protect your App with WorkOS Radar</a></h1><p class="paragraph" style="text-align:left;">Does your app get fake signups, throwaway emails, or users abusing your free tier?</p><p class="paragraph" style="text-align:left;">Or <i>worse</i>, bot attacks and brute force attempts?</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://workos.com/radar?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">WorkOS Radar</a> can block all this and more. A simple API gives you advanced device fingerprinting that can detect bad actors, bots, and suspicious behaviors.</p><p class="paragraph" style="text-align:left;">Your users trust you. 
Let’s keep it that way.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://workos.com/radar?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025"><span class="button__text" style=""> Learn How to Protect Your App with WorkOS Radar </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://blog.cloudflare.com/lost-in-transit-debugging-dropped-packets-from-negative-header-lengths/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-github-rebuilt-their-push-processing-system" target="_blank"><img class="embed__image embed__image--left" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/69b65d8c-3015-4c6c-b5de-f5639a0006ef/image2-38.png?t=1718515856"/><div class="embed__content"><p class="embed__title"> Debugging Dropped Packets at Cloudflare </p><p class="embed__description"> This is an interesting article that delves into a problem Cloudflare was facing where they had drops in bandwidth and failing API requests after making a change to their load balancers. 
They eventually traced the problem to a bug in the Linux kernel.<br><br>Terain Stock is a software engineer at Cloudflare and he wrote a post delving into packet handling and using tools like pwru to debug network issues and kprobes for kernel issues. </p><p class="embed__link"> https://blog.cloudflare.com/lost-in-transit-debugging-dropped-packets-from-negative-header-lengths/ </p></div></a></div><div class="embed"><a class="embed__url" href="https://tuple.app/pair-programming-guide/antipatterns?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-github-rebuilt-their-push-processing-system" target="_blank"><div class="embed__content"><p class="embed__title"> Pair Programming Antipatterns </p><p class="embed__description"> Pair programming can be an excellent tool for educating junior developers on the codebase; however, there are quite a few anti-patterns you’ll want to avoid.<br><br>This article gives a great list of some of them for the person leading the pair programming session (the driver) and the person following (the navigator). </p><p class="embed__link"> tuple.app/pair-programming-guide/antipatterns </p></div></a></div><div class="embed"><a class="embed__url" href="https://github.com/kuchin/awesome-cto?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-github-rebuilt-their-push-processing-system" target="_blank"><img class="embed__image embed__image--left" src="https://opengraph.githubassets.com/ca672bfe717a37a8bd0e8eccbef22dd1df7a141c1f5f942a60dc7859751f2dee/kuchin/awesome-cto"/><div class="embed__content"><p class="embed__title"> Resources for CTOs </p><p class="embed__description"> This is a great GitHub repo with resources for CTOs (or aspiring CTOs).<br><br>It contains resources on software development processes, architecture, product management, hiring and much more! 
</p><p class="embed__link"> github.com/kuchin/awesome-cto </p></div></a></div><div class="embed"><a class="embed__url" href="https://philbooth.me/blog/nine-ways-to-shoot-yourself-in-the-foot-with-postgresql?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-github-rebuilt-their-push-processing-system" target="_blank"><div class="embed__content"><p class="embed__title"> Nine ways to shoot yourself in the foot with PostgreSQL </p><p class="embed__description"> Many developers have Postgres as their first choice when they need a database (for good reason).<br><br>However, there are some gotchas you should be aware of, especially if you plan on scaling the database. Phil Booth wrote a great blog post delving into some of these potential pitfalls that can become a problem as you scale. </p><p class="embed__link"> philbooth.me/blog/nine-ways-to-shoot-yourself-in-the-foot-with-postgresql </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=b524617c-5b76-4d6a-b03e-18125e1c5528&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How Uber Built an Exabyte-Scale System for Data Processing</title>
  <description>We&#39;ll talk about data processing at Uber and how they revamped their ETL platform to make it modular and scalable. Plus, software testing anti-patterns and how to get better at finishing your side projects.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/af145973-14b6-412c-8d79-313abbed6a09/unnamed__6_.png" length="271760" type="image/png"/>
  <link>https://blog.quastor.org/p/how-uber-built-an-exabyte-scale-system-for-data-processing-e6dc</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-uber-built-an-exabyte-scale-system-for-data-processing-e6dc</guid>
  <pubDate>Thu, 16 Jan 2025 17:10:00 +0000</pubDate>
  <atom:published>2025-01-16T17:10:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>The Architecture of Uber’s ETL Platform</b></p><ul><li><p class="paragraph" style="text-align:left;"> Introduction to ETL</p></li><li><p class="paragraph" style="text-align:left;">Tools used for ETL</p></li><li><p class="paragraph" style="text-align:left;">Architecture of Sparkle, Uber’s ETL framework built on Apache Spark</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Software Testing Anti-Patterns</p></li><li><p class="paragraph" style="text-align:left;">Free University Courses for Learning CS</p></li><li><p class="paragraph" style="text-align:left;">How to Get Better at Finishing Your Side Projects</p></li><li><p class="paragraph" style="text-align:left;">15 Life and Work Principles from Jensen Huang (CEO of Nvidia)</p></li></ul></li></ul><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>The Architecture of Uber’s ETL Platform</b></h1><p class="paragraph" style="text-align:left;">Uber is the largest ride sharing company in the world with over 150 million monthly active users and approximately 25 million daily trips.</p><p class="paragraph" style="text-align:left;">With this scale comes a <i>huge</i> amount of data (<i>and a cloud-bill that’s larger than the GDP of a small island nation</i>). Uber generates petabytes of data daily from ride history, logs, payment transactions, etc.</p><p class="paragraph" style="text-align:left;">This data needs to be extracted from the various data sources (payment processor, OLTP database, logs, etc.) 
and then loaded into data warehouses, data lakes, machine learning platforms and more.</p><p class="paragraph" style="text-align:left;">To do this, Uber relies on ETL (Extract, Transform, Load) processes. The Uber engineering team published a terrific <a class="link" href="https://www.uber.com/blog/sparkle-modular-etl/?uclick_id=da6b6b46-4ebd-4c91-9947-460e32b1ed47&utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing" target="_blank" rel="noopener noreferrer nofollow">blog post</a> talking about exactly how they handle ETL at their scale. They have 20,000+ critical data pipelines and 3,000+ engineers who use this system.</p><h2 class="heading" style="text-align:left;" id="introduction-to-etl"><b>Introduction to ETL</b></h2><p class="paragraph" style="text-align:left;">Extract, Transform, Load (ETL) is the process where you</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Extract</b> - you extract data from the various data sources (<i>places where data is created/temporarily stored</i>). This can be a transactional database, payment processor, a CRM (Salesforce or HubSpot), message queue, etc.</p></li><li><p class="paragraph" style="text-align:left;"><b>Transform</b> - you clean, validate and standardize the data. You might need to check for duplicates, handle missing values, join the data with another dataset and more.</p></li><li><p class="paragraph" style="text-align:left;"><b>Load</b> - you load the data into various data sinks. This can be a data warehouse like Google BigQuery, a data lake like HDFS, archival storage like Amazon Glacier or something else.</p></li></ol><p class="paragraph" style="text-align:left;">Building data pipelines for ETL can be quite painful. You’ll need to consider several things:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Data Integrity</b> - ensure the accuracy and consistency in your data. 
You’ll need to check for duplicate records, missing values, inconsistent formatting, outlier values and more.</p></li><li><p class="paragraph" style="text-align:left;"><b>Schema Evolution</b> - as business requirements change, the data you’ll be processing will change. You’ll need to account for new fields, data type changes, deprecated fields, etc.</p></li><li><p class="paragraph" style="text-align:left;"><b>Monitoring/Debugging</b> - you’ll need logging for the different stages of your pipeline and real-time alerts for failures/performance issues so you minimize downtime (and don’t lose data)</p></li><li><p class="paragraph" style="text-align:left;"><b>Scalability</b><b> </b>- the pipeline shouldn’t require a complete re-architecture as your data volume grows. You may also have to deal with bursts in incoming data depending on the usage patterns.</p></li><li><p class="paragraph" style="text-align:left;"><b>Reliability and Failure Recovery</b><b> </b>- For some systems, you might need to<i> </i>guarantee <i>at least once processing</i>. You’ll have to make sure that the system rarely goes down and that you have a process in place to minimize data loss in case of crashes.</p></li><li><p class="paragraph" style="text-align:left;"><b>Compliance</b><b> </b>- you might have to consider internal data governance/privacy policies when doing transformations.</p></li></ul><p class="paragraph" style="text-align:left;">Some common tools used for ETL include</p><ul><li><p class="paragraph" style="text-align:left;"><b>Apache Spark</b> - an open-source engine for large-scale data processing. 
It’s a great choice for complex ETL jobs at scale.</p></li><li><p class="paragraph" style="text-align:left;"><b>dbt (data build tool)</b> - a toolkit for building data pipelines that encourages software engineering best practices like version control, testing, code review and more.</p></li><li><p class="paragraph" style="text-align:left;"><b>Apache Airflow</b> - a popular open-source platform for orchestrating workflows. You can schedule, monitor and manage your ETL pipelines with Python.</p></li><li><p class="paragraph" style="text-align:left;"><b>AWS Glue</b> - fully managed, serverless ETL service from Amazon. </p></li><li><p class="paragraph" style="text-align:left;"><b>Google Cloud Dataflow</b> - fully managed service for ETL from Google Cloud.</p></li></ul><p class="paragraph" style="text-align:left;">In 2023, Uber migrated all their batch workloads to Apache Spark. Recently, they built <i>Sparkle, </i>a framework on top of Apache Spark with the goal of simplifying data pipeline development and testing.</p><h2 class="heading" style="text-align:left;" id="etl-at-uber-with-sparkle"><b>ETL at Uber with Sparkle</b></h2><p class="paragraph" style="text-align:left;">As your ETL jobs get more and more complex, it becomes crucial to use software engineering best practices when writing/maintaining them. Observability, version control, testing, documentation, etc. are some of the best practices that have become increasingly adopted in the data community.</p><p class="paragraph" style="text-align:left;">Leading this charge is dbt, a data engineering platform that helps you apply these best practices to your data transformations.</p><p class="paragraph" style="text-align:left;">However, at Uber, switching from Spark to an entirely new ETL tool wasn’t possible. Given the scale of Uber’s platform, the developer learning curve and the investment required, it just wasn’t worth it. 
(<i>try telling your boss you need to rewrite 20,000 mission-critical data pipelines</i>)</p><p class="paragraph" style="text-align:left;">Instead, the Uber team decided to build <i>Sparkle</i>, a framework on top of Apache Spark that lets engineers write configuration-based modular ETL jobs. Sparkle added features for observability, testing, data lineage tracking and more.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/17e2eaee-0029-4d4f-b56a-c59e8a55fe36/sparkle-layered-architecture-1-17236489174498.jpeg?t=1737039411"/></div><p class="paragraph" style="text-align:left;">The core idea behind Sparkle is <i>modularity</i>. Rather than writing complex, monolithic Spark jobs, engineers break their ETL logic down into a series of smaller, reusable modules. Each module can be in SQL, Java/Scala or Python and they’re defined with YAML. Check the <a class="link" href="https://www.uber.com/blog/sparkle-modular-etl/?uclick_id=da6b6b46-4ebd-4c91-9947-460e32b1ed47&utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing" target="_blank" rel="noopener noreferrer nofollow">blog post</a> for samples of what Sparkle jobs look like.</p><p class="paragraph" style="text-align:left;">Developers can just focus on the business logic around their data pipeline. 
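To make the modularity idea concrete, here is a toy sketch of a configuration-driven pipeline (the module names, config shape, and data are all hypothetical; this is not Sparkle's actual API):

```python
# Small reusable transform modules; in a real framework each would live in
# its own SQL/Python/Scala file and be referenced by name from YAML config.
MODULES = {
    "dedupe": lambda rows: list({tuple(sorted(r.items())): r for r in rows}.values()),
    "drop_null": lambda rows: [r for r in rows if all(v is not None for v in r.values())],
    "to_usd": lambda rows: [{**r, "fare_usd": r["fare_cents"] / 100} for r in rows],
}

# The part that would be declared in YAML: a named pipeline as a list of steps.
PIPELINE_CONFIG = {"name": "trips_daily", "steps": ["dedupe", "drop_null", "to_usd"]}

def run_pipeline(config, rows):
    """The framework's job: wire the configured modules together in order."""
    for step in config["steps"]:
        rows = MODULES[step](rows)
    return rows

trips = [
    {"trip_id": 1, "fare_cents": 1250},
    {"trip_id": 1, "fare_cents": 1250},  # duplicate record
    {"trip_id": 2, "fare_cents": None},  # missing value
]
clean = run_pipeline(PIPELINE_CONFIG, trips)
# clean == [{"trip_id": 1, "fare_cents": 1250, "fare_usd": 12.5}]
```

Reordering steps or swapping in a new transformation is then a config change rather than a code change, which is also what makes testing individual modules against mock data straightforward.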
Sparkle will handle infrastructure and boilerplate with pre-built components like:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Connectors</b><b> </b>- handles all the connection details to pull data from all the various data sources at Uber</p></li><li><p class="paragraph" style="text-align:left;"><b>Readers/Writers</b><b> </b>- handles translating data into different formats like Parquet, JSON, Avro, etc.</p></li><li><p class="paragraph" style="text-align:left;"><b>Observability</b><b> </b>- provides logging, metrics and data lineage tracking</p></li><li><p class="paragraph" style="text-align:left;"><b>Testing</b> - you can write unit tests for your modules using mock data and SQL assertions to make sure your transformations are doing what you expect.</p></li></ul><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://github.com/prakhar1989/awesome-courses?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing#readme" target="_blank"><div class="embed__content"><p class="embed__title"> GitHub Awesome List of University Courses for Learning CS </p><p class="embed__description"> Awesome CS Courses is a curated list of university-level CS courses available for free. You can learn about the principles of distributed computing from ETH-Zurich or Natural Language Processing with Deep Learning from Oxford.<br><br>You’ll find lecture videos, notes, assignments and more. 
</p><p class="embed__link"> github.com/prakhar1989/awesome-courses#readme </p></div></a></div><div class="embed"><a class="embed__url" href="https://www.bytedrum.com/posts/art-of-finishing/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing" target="_blank"><img class="embed__image embed__image--top" src="https://www.bytedrum.com/assets/art-of-finishing/og.png"/><div class="embed__content"><p class="embed__title"> The Art of Finishing </p><p class="embed__description"> If you’re a serial project starter, this article has some useful strategies to help you finally cross the finish line.<br><br>It talks about the main reasons why we avoid finishing things (fear of imperfection, illusion of productivity, etc.) and how you can overcome these blockers. </p><p class="embed__link"> www.bytedrum.com/posts/art-of-finishing </p></div></a></div><div class="embed"><a class="embed__url" href="https://creatoreconomy.so/p/15-life-and-work-principles-from-jensen?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing" target="_blank"><img class="embed__image embed__image--top" src="https://substackcdn.com/image/fetch/w_1200,h_600,c_fill,f_jpg,q_auto:good,fl_progressive:steep,g_auto/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54791a8e-7d40-4366-8933-1469816d25ca_1280x720.png"/><div class="embed__content"><p class="embed__title"> 15 Life and Work Principles from Jensen Huang (CEO of Nvidia) </p><p class="embed__description"> Jensen Huang has a unique leadership style. He has 60+ direct reports but does no 1:1 meetings. 
Instead, he tries to emphasize transparency and open discourse (where everyone in the company is involved in decision making).<br><br>He also looks to chase “zero-billion dollar markets“, where Nvidia is doing something completely new where they have no competitors.<br><br>Read the full article for the rest of his work/leadership principles. </p><p class="embed__link"> creatoreconomy.so/p/15-life-and-work-principles-from-jensen </p></div></a></div><div class="embed"><a class="embed__url" href="https://blog.codepipes.com/testing/software-testing-antipatterns.html?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing" target="_blank"><div class="embed__content"><p class="embed__title"> Software Testing Anti-Patterns </p><p class="embed__description"> This is an interesting article that dives into common software testing anti-patterns and why they can be detrimental.<br><br>Some of the anti-patterns include: paying excessive attention to test coverage, not converting production bugs to tests, treating test code as a second class citizen and more. </p><p class="embed__link"> blog.codepipes.com/testing/software-testing-antipatterns.html </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=9f072f57-a81b-4d22-9805-65216171d193&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How Uber Built an Exabyte-Scale System for Data Processing</title>
  <description>We&#39;ll talk about data processing at Uber and how they revamped their ETL platform to make it modular and scalable. Plus, software testing anti-patterns and how to get better at finishing your side projects.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/af145973-14b6-412c-8d79-313abbed6a09/unnamed__6_.png" length="271760" type="image/png"/>
  <link>https://blog.quastor.org/p/how-uber-built-an-exabyte-scale-system-for-data-processing</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-uber-built-an-exabyte-scale-system-for-data-processing</guid>
  <pubDate>Thu, 16 Jan 2025 15:05:00 +0000</pubDate>
  <atom:published>2025-01-16T15:05:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>The Architecture of Uber’s ETL Platform</b></p><ul><li><p class="paragraph" style="text-align:left;"> Introduction to ETL</p></li><li><p class="paragraph" style="text-align:left;">Tools used for ETL</p></li><li><p class="paragraph" style="text-align:left;">Architecture of Sparkle, Uber’s ETL framework built on Apache Spark</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Software Testing Anti-Patterns</p></li><li><p class="paragraph" style="text-align:left;">Free University Courses for Learning CS</p></li><li><p class="paragraph" style="text-align:left;">How to Get Better at Finishing Your Side Projects</p></li><li><p class="paragraph" style="text-align:left;">15 Life and Work Principles from Jensen Huang (CEO of Nvidia)</p></li></ul></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://www.nango.dev/blog/product-integrations-build-or-buy?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=01-16-2025" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeQkWVZcOZe_7N0hu6RKYM6bA93fg0uZW6cdI4HsTzn-zYUZycRk4p75tpN5I4TVU__cIt5vuSkZ3pZmMoQzCjpQTLbcqUjHWZjYk9-Q1hxv2vX2CYnDrH3uqlIlKtcnf6IPYUJxQ?key=ArHUjbxqy4BwhRsjMAMFSUA0"/></a></div><h1 class="heading" style="text-align:left;" id="the-developers-guide-to-product-int"><a class="link" href="https://www.nango.dev/blog/product-integrations-build-or-buy?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=01-16-2025" target="_blank" rel="noopener noreferrer nofollow">The Developer’s Guide to Product Integrations: Build vs. 
Buy</a></h1><p class="paragraph" style="text-align:left;">Building product integrations is no small feat—balancing timelines, resources, and technical complexity can feel overwhelming.</p><p class="paragraph" style="text-align:left;">Should you build integrations in-house, or is it better to leverage third-party solutions? </p><p class="paragraph" style="text-align:left;">Nango wrote a <a class="link" href="https://www.nango.dev/blog/product-integrations-build-or-buy?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=01-16-2025" target="_blank" rel="noopener noreferrer nofollow">fantastic, in-depth guide</a> that walks you through everything you need to know to make an informed choice. They talk about the trade-offs involved and offer practical tips to make your decision easier.</p><p class="paragraph" style="text-align:left;">In the guide, you’ll learn:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Key Considerations</b> - Understand the costs, risks, and benefits of building vs. 
buying integrations.</p></li><li><p class="paragraph" style="text-align:left;"><b>When to Build</b> - Discover scenarios where in-house development gives you the most control and flexibility.</p></li><li><p class="paragraph" style="text-align:left;"><b>When to Buy</b> - Learn when leveraging pre-built solutions can accelerate timelines and reduce maintenance overhead.</p></li></ul><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.nango.dev/blog/product-integrations-build-or-buy?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=01-16-2025"><span class="button__text" style=""> Read The Full Guide To Make Smarter Integration Decisions </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>The Architecture of Uber’s ETL Platform</b></h1><p class="paragraph" style="text-align:left;">Uber is the largest ride sharing company in the world with over 150 million monthly active users and approximately 25 million daily trips.</p><p class="paragraph" style="text-align:left;">With this scale comes a <i>huge</i> amount of data (<i>and a cloud-bill that’s larger than the GDP of a small island nation</i>). Uber generates petabytes of data daily from ride history, logs, payment transactions, etc.</p><p class="paragraph" style="text-align:left;">This data needs to be extracted from the various data sources (payment processor, OLTP database, logs, etc.) and then loaded into data warehouses, data lakes, machine learning platforms and more.</p><p class="paragraph" style="text-align:left;">To do this, Uber relies on ETL (Extract, Transform, Load) processes. 
The Uber engineering team published a terrific <a class="link" href="https://www.uber.com/blog/sparkle-modular-etl/?uclick_id=da6b6b46-4ebd-4c91-9947-460e32b1ed47&utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing" target="_blank" rel="noopener noreferrer nofollow">blog post</a> talking about exactly how they handle ETL at their scale. They have 20,000+ critical data pipelines and 3,000+ engineers who use this system.</p><h2 class="heading" style="text-align:left;" id="introduction-to-etl"><b>Introduction to ETL</b></h2><p class="paragraph" style="text-align:left;">Extract, Transform, Load (ETL) is the process where you</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Extract</b> - you extract data from the various data sources (<i>places where data is created/temporarily stored</i>). This can be a transactional database, payment processor, a CRM (Salesforce or HubSpot), message queue, etc.</p></li><li><p class="paragraph" style="text-align:left;"><b>Transform</b> - you clean, validate and standardize the data. You might need to check for duplicates, handle missing values, join the data with another dataset and more.</p></li><li><p class="paragraph" style="text-align:left;"><b>Load</b> - you load the data into various data sinks. This can be a data warehouse like Google BigQuery, a data lake like HDFS, archival storage like Amazon Glacier or something else.</p></li></ol><p class="paragraph" style="text-align:left;">Building data pipelines for ETL can be quite painful. You’ll need to consider several things:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Data Integrity</b> - ensure the accuracy and consistency in your data. 
You’ll need to check for duplicate records, missing values, inconsistent formatting, outlier values and more.</p></li><li><p class="paragraph" style="text-align:left;"><b>Schema Evolution</b> - as business requirements change, the data you’ll be processing will change. You’ll need to account for new fields, data type changes, deprecated fields, etc.</p></li><li><p class="paragraph" style="text-align:left;"><b>Monitoring/Debugging</b> - you’ll need logging for the different stages of your pipeline and real-time alerts for failures/performance issues so you minimize downtime (and don’t lose data)</p></li><li><p class="paragraph" style="text-align:left;"><b>Scalability</b><b> </b>- the pipeline shouldn’t require a complete re-architecture as your data volume grows. You may also have to deal with bursts in incoming data depending on the usage patterns.</p></li><li><p class="paragraph" style="text-align:left;"><b>Reliability and Failure Recovery</b><b> </b>- For some systems, you might need to<i> </i>guarantee <i>at least once processing</i>. You’ll have to make sure that the system rarely goes down and that you have a process in place to minimize data loss in case of crashes.</p></li><li><p class="paragraph" style="text-align:left;"><b>Compliance</b><b> </b>- you might have to consider internal data governance/privacy policies when doing transformations.</p></li></ul><p class="paragraph" style="text-align:left;">Some common tools used for ETL include</p><ul><li><p class="paragraph" style="text-align:left;"><b>Apache Spark</b> - an open-source engine for large-scale data processing. 
It’s a great choice for complex ETL jobs at scale.</p></li><li><p class="paragraph" style="text-align:left;"><b>dbt (data build tool)</b> - a toolkit for building data pipelines that encourages software engineering best practices like version control, testing, code review and more.</p></li><li><p class="paragraph" style="text-align:left;"><b>Apache Airflow</b> - a popular open-source platform for orchestrating workflows. You can schedule, monitor and manage your ETL pipelines with Python.</p></li><li><p class="paragraph" style="text-align:left;"><b>AWS Glue</b> - fully managed, serverless ETL service from Amazon. </p></li><li><p class="paragraph" style="text-align:left;"><b>Google Cloud Dataflow</b> - fully managed service for ETL from Google Cloud.</p></li></ul><p class="paragraph" style="text-align:left;">In 2023, Uber migrated all their batch workloads to Apache Spark. Recently, they built <i>Sparkle, </i>a framework on top of Apache Spark with the goal of simplifying data pipeline development and testing.</p><h2 class="heading" style="text-align:left;" id="etl-at-uber-with-sparkle"><b>ETL at Uber with Sparkle</b></h2><p class="paragraph" style="text-align:left;">As your ETL jobs get more and more complex, it becomes crucial to use software engineering best practices when writing/maintaining them. Observability, version control, testing, documentation, etc. are some of the best practices that have become increasingly adopted in the data community.</p><p class="paragraph" style="text-align:left;">Leading this charge is dbt, a data engineering platform that helps you apply these best practices to your data transformations.</p><p class="paragraph" style="text-align:left;">However, at Uber, switching from Spark to an entirely new ETL tool wasn’t possible. Given the scale of Uber’s platform, the developer learning curve and investment required just weren’t worth it.
(<i>try telling your boss you need to rewrite 20,000 mission-critical data pipelines</i>)</p><p class="paragraph" style="text-align:left;">Instead, the Uber team decided to build <i>Sparkle</i>, a framework on top of Apache Spark that lets engineers write configuration-based modular ETL jobs. Sparkle added features for observability, testing, data lineage tracking and more.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/17e2eaee-0029-4d4f-b56a-c59e8a55fe36/sparkle-layered-architecture-1-17236489174498.jpeg?t=1737039411"/></div><p class="paragraph" style="text-align:left;">The core idea behind Sparkle is <i>modularity</i>. Rather than writing complex, monolithic Spark jobs, engineers break their ETL logic down into a series of smaller, reusable modules. Each module can be in SQL, Java/Scala or Python and they’re defined with YAML. Check the <a class="link" href="https://www.uber.com/blog/sparkle-modular-etl/?uclick_id=da6b6b46-4ebd-4c91-9947-460e32b1ed47&utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing" target="_blank" rel="noopener noreferrer nofollow">blog post</a> for samples of what Sparkle jobs look like.</p><p class="paragraph" style="text-align:left;">Developers can just focus on the business logic around their data pipeline. 
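</p><p class="paragraph" style="text-align:left;">As a generic sketch of the configuration-driven idea (<i>this is not Sparkle’s actual YAML format or API</i>), you can picture small transform modules registered by name and wired together from a declarative job definition:</p>

```python
# Generic sketch of a configuration-driven modular pipeline (not Sparkle's
# actual API). Each module is a small, reusable transform registered by name;
# a declarative config (standing in for YAML here) wires them into a job.

MODULES = {}

def module(name):
    def register(fn):
        MODULES[name] = fn
        return fn
    return register

@module("filter_completed")
def filter_completed(rows):
    # Keep only completed trips.
    return [r for r in rows if r["status"] == "completed"]

@module("add_total")
def add_total(rows):
    # Business logic lives in small, individually unit-testable steps.
    return [{**r, "total": round(r["fare"] * 1.1, 2)} for r in rows]

# In Sparkle this job definition would live in a config file.
job_config = {"steps": ["filter_completed", "add_total"]}

def run_job(config, rows):
    for step in config["steps"]:
        rows = MODULES[step](rows)
    return rows

result = run_job(job_config, [
    {"status": "completed", "fare": 10.0},
    {"status": "cancelled", "fare": 5.0},
])
```

<p class="paragraph" style="text-align:left;">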
Sparkle will handle infrastructure and boilerplate with pre-built components like</p><ul><li><p class="paragraph" style="text-align:left;"><b>Connectors</b><b> </b>- handles all the connection details to pull data from all the various data sources at Uber</p></li><li><p class="paragraph" style="text-align:left;"><b>Readers/Writers</b><b> </b>- handles translating data into different formats like Parquet, JSON, Avro, etc.</p></li><li><p class="paragraph" style="text-align:left;"><b>Observability</b><b> </b>- provides logging, metrics and data lineage tracking</p></li><li><p class="paragraph" style="text-align:left;"><b>Testing</b> - you can write unit tests for your modules using mock data and SQL assertions to make sure your transformations are doing what you expect.</p></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://www.nango.dev/blog/how-we-built-a-salesforce-api-integration-in-3-hours?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=01-16-2025" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcORvf_K3u5hfj80bEMBjzl_fLEo1qlEkwNC1zFDlbWNXb20NpW_iyZhLKdk171y0evmN4lD8dMqbodoAYpvBnMO1uLaBZfznsKICugot4uNYcCtfu43UCuTCv_b9MkMC_j80pA?key=ArHUjbxqy4BwhRsjMAMFSUA0"/></a></div><h1 class="heading" style="text-align:left;" id="connect-your-saa-s-product-to-sales"><a class="link" href="https://www.nango.dev/blog/how-we-built-a-salesforce-api-integration-in-3-hours?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=01-16-2025" target="_blank" rel="noopener noreferrer nofollow">Connect your SaaS product to Salesforce in hours, not weeks</a></h1><p class="paragraph" style="text-align:left;">Stop wasting weeks wrestling with Salesforce&#39;s API.</p><p class="paragraph" style="text-align:left;">Nango wrote a <a class="link" 
href="https://www.nango.dev/blog/how-we-built-a-salesforce-api-integration-in-3-hours?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=01-16-2025" target="_blank" rel="noopener noreferrer nofollow">terrific guide</a> showing how you can build a robust Salesforce integration in just 3 hours.</p><p class="paragraph" style="text-align:left;"><span style="color:rgb(34, 34, 34);">This guide breaks down the entire process with actionable tips and insights that’ll save your team dozens of hours.</span></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.nango.dev/blog/how-we-built-a-salesforce-api-integration-in-3-hours?utm_source=quastor&utm_medium=newsletter&utm_campaign=primary&utm_content=01-16-2025"><span class="button__text" style=""> Read The Guide To Simplify Your Salesforce Integration </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://github.com/prakhar1989/awesome-courses?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing#readme" target="_blank"><div class="embed__content"><p class="embed__title"> GitHub Awesome List of University Courses for Learning CS </p><p class="embed__description"> Awesome CS courses is a curated list of university-level CS courses available for free. You can learn about the principles of distributed computing from ETH Zurich or Natural Language Processing with Deep Learning from Oxford.<br><br>You’ll find lecture videos, notes, assignments and more.
</p><p class="embed__link"> github.com/prakhar1989/awesome-courses#readme </p></div></a></div><div class="embed"><a class="embed__url" href="https://www.bytedrum.com/posts/art-of-finishing/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing" target="_blank"><img class="embed__image embed__image--top" src="https://www.bytedrum.com/assets/art-of-finishing/og.png"/><div class="embed__content"><p class="embed__title"> The Art of Finishing </p><p class="embed__description"> If you’re a serial project starter, this article has some useful strategies to help you finally cross the finish line.<br><br>It talks about the main reasons why we avoid finishing things (fear of imperfection, illusion of productivity, etc.) and how you can overcome these blockers. </p><p class="embed__link"> www.bytedrum.com/posts/art-of-finishing </p></div></a></div><div class="embed"><a class="embed__url" href="https://creatoreconomy.so/p/15-life-and-work-principles-from-jensen?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing" target="_blank"><img class="embed__image embed__image--top" src="https://substackcdn.com/image/fetch/w_1200,h_600,c_fill,f_jpg,q_auto:good,fl_progressive:steep,g_auto/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54791a8e-7d40-4366-8933-1469816d25ca_1280x720.png"/><div class="embed__content"><p class="embed__title"> 15 Life and Work Principles from Jensen Huang (CEO of Nvidia) </p><p class="embed__description"> Jensen Huang has a unique leadership style. He has 60+ direct reports but does no 1:1 meetings. 
Instead, he tries to emphasize transparency and open discourse (where everyone in the company is involved in decision making).<br><br>He also looks to chase “zero-billion dollar markets“, where Nvidia is doing something completely new where they have no competitors.<br><br>Read the full article for the rest of his work/leadership principles. </p><p class="embed__link"> creatoreconomy.so/p/15-life-and-work-principles-from-jensen </p></div></a></div><div class="embed"><a class="embed__url" href="https://blog.codepipes.com/testing/software-testing-antipatterns.html?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-uber-built-an-exabyte-scale-system-for-data-processing" target="_blank"><div class="embed__content"><p class="embed__title"> Software Testing Anti-Patterns </p><p class="embed__description"> This is an interesting article that dives into common software testing anti-patterns and why they can be detrimental.<br><br>Some of the anti-patterns include: paying excessive attention to test coverage, not converting production bugs to tests, treating test code as a second class citizen and more. </p><p class="embed__link"> blog.codepipes.com/testing/software-testing-antipatterns.html </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=a562209c-eac8-46fb-bb4b-3daf7b1e9f96&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How Pinterest Stores and Transfers Hundreds of Terabytes of Data Daily</title>
  <description>We&#39;ll talk about Change Data Capture and the internal system Pinterest built to handle CDC for all their databases. Plus, a practical guide on how you can improve your coding skills with cognitive psychology, a visual guide to memory allocation and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/f8099f66-d8ac-41aa-90dd-7852a4a40878/Payments.png" length="84761" type="image/png"/>
  <link>https://blog.quastor.org/p/how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily</guid>
  <pubDate>Thu, 09 Jan 2025 20:56:00 +0000</pubDate>
  <atom:published>2025-01-09T20:56:00Z</atom:published>
    <dc:creator>Arpan KG</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How Pinterest Stores and Transfers Hundreds of Terabytes of Data Daily</b></p><ul><li><p class="paragraph" style="text-align:left;"> Introduction to Change Data Capture</p></li><li><p class="paragraph" style="text-align:left;">The Design Goals for CDC at Pinterest </p></li><li><p class="paragraph" style="text-align:left;">Architecture of Pinterest’s CDC System</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">The Programmer’s Brain</p></li><li><p class="paragraph" style="text-align:left;">A Curated List of Cryptography Resources and Links</p></li><li><p class="paragraph" style="text-align:left;">Memory Allocation Visualized</p></li><li><p class="paragraph" style="text-align:left;">Getting Things Done in a Chaotic Environment</p></li></ul></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://dub.link/quas-jan9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXclfPkS36JyPOQTKSMiqUNE7I-Dmryo4gPMahFm9OTD-BTKKBu4l-Haz6k7RzmjnVgy_u5E5BOogh1EA-KgdyjCvq2XjUMaRZr7FfoM2PU2xwrkvOs7NnZ8nKnL0HIG4OpjUabzqQ?key=eXEz32zof7Iu-jmoXAT47A"/></a></div><h1 class="heading" style="text-align:left;" id="how-to-pick-technologies-for-your-t"><a class="link" href="https://dub.link/quas-jan9-tech?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" target="_blank" rel="noopener noreferrer nofollow"><b>How to Pick Technologies for your Tech 
Stack</b></a></h1><p class="paragraph" style="text-align:left;">One of the hardest decisions you’ll have to make is around <i>what</i> technologies your team adopts. A wrong decision can be extremely costly and take years to reverse. On the other hand, <i>not </i>making a decision can be <i>just</i> as costly (lost revenue, poor developer productivity, etc.)</p><p class="paragraph" style="text-align:left;">Product for Engineers wrote <a class="link" href="https://dub.link/quas-jan9-tech?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" target="_blank" rel="noopener noreferrer nofollow">a fantastic blog post</a> on their advice for choosing technologies to adopt.</p><p class="paragraph" style="text-align:left;">Some of their tips include</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Prioritize based on set Criteria</b> - There will always be <i>some</i> shiny new toy that your team can adopt. Instead, prioritize based on problems your team is facing. This can be excessive costs, scaling challenges, or a customer need.</p></li><li><p class="paragraph" style="text-align:left;"><b>Mimic the Real World when Evaluating</b> - The engineers who will be using the technology should have significant sway in the decision. They should be able to test the technology in production (<i>safely</i>) and build proof of concepts before deciding.</p></li><li><p class="paragraph" style="text-align:left;"><b>Ensure you consider technical AND business factors</b> - You should talk to <i>all stakeholders</i> and clarify what the set of evaluation criteria are. 
Some potential criteria include performance, cost, reliability, support, flexibility and more.</p></li></ol><p class="paragraph" style="text-align:left;">Subscribe to <a class="link" href="https://dub.link/quas-jan9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" target="_blank" rel="noopener noreferrer nofollow">Product for Engineers</a> for the rest of their tips on picking technologies. It’s free!</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://dub.link/quas-jan9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily"><span class="button__text" style=""> Check out Product for Engineers </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>How Pinterest built a Change Data Capture System</b></h1><p class="paragraph" style="text-align:left;">Pinterest is a social media platform that lets users share images/links for things they’re interested in. You can use the site to get low-carb pasta recipes, find destinations for a wedding or get inspiration for a DIY project (<i>and learn that DIY projects are way harder than they look</i>).</p><p class="paragraph" style="text-align:left;">Pinterest first launched in 2010 and they’ve scaled to hundreds of millions of users with billions of monthly visits to the platform.</p><p class="paragraph" style="text-align:left;">As you might imagine, Pinterest handles a massive amount of data. 
Real-time data processing is crucial for delivering personalized recommendations, detecting fraudulent accounts, reporting results to advertisers and more.</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://en.wikipedia.org/wiki/Change_data_capture?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" target="_blank" rel="noopener noreferrer nofollow">Change Data Capture (CDC)</a> is a critical tool that Pinterest uses to power their data infrastructure. The engineering team published a <a class="link" href="https://medium.com/pinterest-engineering/change-data-capture-at-pinterest-7e4c357ac527?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" target="_blank" rel="noopener noreferrer nofollow">fantastic blog post</a> talking about how they built a generic CDC system for <i>all</i> the online databases at Pinterest. This system handles millions of database queries per second and transfers hundreds of terabytes per day.</p><h2 class="heading" style="text-align:left;" id="introduction-to-change-data-capture"><b>Introduction to Change Data Capture (CDC)</b></h2><p class="paragraph" style="text-align:left;">Change Data Capture is a set of software design patterns that lets you track database changes in real-time. It captures data modifications like inserts, updates and deletes as they occur. Typically, you’ll set up CDC on your OLTP database (Postgres, MySQL, MongoDB, etc.) and transfer the modifications over to your analytics platform, audit/compliance system, data warehouse, etc.</p><p class="paragraph" style="text-align:left;">This ensures that all your systems are kept up-to-date with the most recent data (<i>compared to running nightly batch jobs for syncing changes</i>). 
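</p><p class="paragraph" style="text-align:left;">As a minimal sketch (<i>hypothetical event shapes, loosely modeled on Debezium-style change events</i>), a downstream consumer applies inserts, updates and deletes to its own copy of the data:</p>

```python
# Minimal sketch of applying a CDC stream to a downstream replica.
# Event shapes are hypothetical, loosely modeled on Debezium's envelope,
# where "c" = create, "u" = update and "d" = delete.

def apply_change(replica, event):
    op, key = event["op"], event["key"]
    if op in ("c", "u"):
        replica[key] = event["after"]  # upsert the new row image
    elif op == "d":
        replica.pop(key, None)         # remove the deleted row

replica = {}
stream = [
    {"op": "c", "key": 1, "after": {"user": "ada", "pins": 3}},
    {"op": "u", "key": 1, "after": {"user": "ada", "pins": 4}},
    {"op": "c", "key": 2, "after": {"user": "alan", "pins": 1}},
    {"op": "d", "key": 2, "after": None},
]
for event in stream:
    apply_change(replica, event)
```

<p class="paragraph" style="text-align:left;">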
It can also reduce the load on your OLTP database since you’re only transferring the changes instead of doing full data loads.</p><p class="paragraph" style="text-align:left;">All the commonly used database systems offer a mechanism for tracking database changes. </p><p class="paragraph" style="text-align:left;">Examples are</p><ul><li><p class="paragraph" style="text-align:left;"><b>Postgres</b> - provides a write-ahead log (WAL) where every change to the database is written to the WAL first. Logical decoding plugins (like wal2json) can decode the WAL entries into JSON for a CDC tool to consume. </p></li><li><p class="paragraph" style="text-align:left;"><b>MySQL</b> - provides a binary log (binlog) that records all data modifications. CDC tools can tap into the binlog to capture changes in real-time. </p></li><li><p class="paragraph" style="text-align:left;"><b>MongoDB</b> - uses an operations log (oplog) to store a rolling record of all write operations. CDC systems can tail the oplog to capture changes. MongoDB also provides change streams that you can subscribe to for real-time data changes (<i>change streams are built on top of the oplog</i>).</p></li><li><p class="paragraph" style="text-align:left;"><b>DynamoDB</b> - offers DynamoDB streams for a time-ordered sequence of records that captures all the data modifications in your tables. </p></li><li><p class="paragraph" style="text-align:left;"><b>Microsoft SQL Server</b> - provides a built-in feature called Change Data Capture (CDC) that captures insert, update and delete activity and makes the details available in an easily consumed relational format.</p></li><li><p class="paragraph" style="text-align:left;"><b>Couchbase</b> - uses the Database Change Protocol (DCP) to stream any mutations that happen within the database.
Applications can connect to DCP and get a real-time feed of changes.</p></li><li><p class="paragraph" style="text-align:left;"><b>Cassandra</b> - provides a feature called CDC on Apache Cassandra that lets you capture changes on a per-table basis and write them to the local filesystem. </p></li></ul><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/debezium/debezium?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" target="_blank" rel="noopener noreferrer nofollow">Debezium</a> is the most popular open source platform for CDC and it supports databases like Postgres, MySQL, MongoDB and more.</p><h2 class="heading" style="text-align:left;" id="architecture-of-pinterests-generic-"><b>Architecture of Pinterest’s Generic CDC System</b></h2><p class="paragraph" style="text-align:left;">Before Pinterest’s Generic CDC System, individual teams at the company were building their own CDC solutions. This led to issues around wasted engineering-time, unclear ownership, reliability issues and more.</p><p class="paragraph" style="text-align:left;">The Pinterest team decided to solve this by building a Generic CDC solution based on Debezium.</p><p class="paragraph" style="text-align:left;">Their goals were</p><ul><li><p class="paragraph" style="text-align:left;"><b>Distributed</b><b> </b>- Pinterest has many distributed databases with some having 10,000+ shards. The CDC system should connect to all these shards and transfer the changes.</p></li><li><p class="paragraph" style="text-align:left;"><b>Reliability</b><b> </b>- data should be transferred reliably with a guarantee of <i>at least once processing</i>.</p></li><li><p class="paragraph" style="text-align:left;"><b>Scalability</b> - the CDC system should scale to hundreds of terabytes of throughput. 
Databases at Pinterest receive millions of queries per second.</p></li><li><p class="paragraph" style="text-align:left;"><b>Configurability</b> - the CDC system should be configurable around connectors, failure recovery, load balancing and more.</p></li></ul><p class="paragraph" style="text-align:left;">Here’s the architecture of the system they built:</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXedJYkETQpzXbaANQdIV9uFUkcNxA1IW6eKBI1XA04ofE_Je3XyNM1G7Kba5NimkTNbuz1j9Zbr0_0fC1kcM7R0x53XZYPK53ukJw548JRdjw2OoYhisZYxV0O1B4uUUbAR1lmqrw?key=Zl7-xBQ8MZROqsO03_EmMA7E"/></div><p class="paragraph" style="text-align:left;">Here are the components:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Control Plane</b><b> </b>- manages the configuration and the coordination of the CDC system. It handles things like creating new connectors for new database shards, fixing failed connectors, updating configuration of existing connectors, etc. It runs on a single host and executes its logic on a scheduled basis (typically every minute). </p></li><li><p class="paragraph" style="text-align:left;"><b>Data Plane</b> - the machines in the data plane are responsible for capturing changes from the Pinterest database shards and sending them to Apache Kafka. Each host runs Kafka Connect with multiple Debezium connectors (each connector responsible for a single database shard).</p></li><li><p class="paragraph" style="text-align:left;"><b>Kafka</b> - Kafka is used as the message broker to store and transport the change events. The actual CDC data is stored in preconfigured topics.
Consumers can subscribe to these topics for updates.</p></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://dub.link/quas-jan9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeUIc_RCCKulOb_RVlBD7cHP5g7eYhTQM6TxjLrc2CgGiJTQ8W38EZFrZ2PnFaTtX7-nWHVbv4MEVndnbe8TDWcBLd0d2GobuxVHygwf3BVU4SZa8bSVs3RsJhX9cl73XB5LKo7FZFwJrs1wTLGWvSFtVs?key=eXEz32zof7Iu-jmoXAT47A"/></a></div><h1 class="heading" style="text-align:left;" id="how-to-run-ab-tests-for-engineers"><a class="link" href="https://dub.link/quas-jan9-ab?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" target="_blank" rel="noopener noreferrer nofollow">How to run A/B tests for Engineers</a></h1><p class="paragraph" style="text-align:left;">Product for Engineers is a fantastic newsletter by PostHog that helps developers learn how to find product-market fit and build apps that users love.</p><p class="paragraph" style="text-align:left;">A/B testing and experimentation are crucial for building a feature roadmap, improving conversion rates and accelerating growth. 
However, many engineers don’t understand the ins-and-outs of how to run these tests effectively (<i>they just leave it to the data scientists</i>).</p><p class="paragraph" style="text-align:left;">This edition of <a class="link" href="https://dub.link/quas-jan9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" target="_blank" rel="noopener noreferrer nofollow">Product for Engineers</a> delves into A/B tests and discusses</p><ul><li><p class="paragraph" style="text-align:left;">The 5 traits of good A/B tests</p></li><li><p class="paragraph" style="text-align:left;">How to think about statistical significance and p-values</p></li><li><p class="paragraph" style="text-align:left;">Avoiding false positives</p></li></ul><p class="paragraph" style="text-align:left;">And more.</p><p class="paragraph" style="text-align:left;">To hone your product skills and read more articles like this, check out Product for Engineers below.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://dub.link/quas-jan9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily"><span class="button__text" style=""> Check Out Product for Engineers. It’s Free! 
</span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://yoan-thirion.gitbook.io/knowledge-base/software-craftsmanship/the-programmers-brain?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/28d82df9-a46f-4ad7-8473-e385ab63e9df/image.jpeg?t=1736100249"/><div class="embed__content"><p class="embed__title"> The Programmer&#39;s Brain </p><p class="embed__description"> This is a practical guide on improving your coding skills by understanding the cognitive processes involved.<br><br>The book introduces the Cognitive Dimensions of Notations (CDN) framework, a tool for assessing the usability of codebases. It also provides great insights for onboarding new devs, like limiting tasks to one programming activity and supporting their memory through extensive diagrams.<br><br>It’s a really useful read if you want to apply cognitive science principles to be more effective and efficient as a developer. </p><p class="embed__link"> yoan-thirion.gitbook.io/knowledge-base/software-craftsmanship/the-programmers-brain </p></div></a></div><div class="embed"><a class="embed__url" href="https://github.com/sobolevn/awesome-cryptography?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily#readme" target="_blank"><div class="embed__content"><p class="embed__title"> A Curated list of Cryptography Resources and Links </p><p class="embed__description"> If you’re looking to learn more about cryptography, this curated list is fantastic.
It covers everything from basic cryptographic theory and algorithms to practical tools/libraries. </p><p class="embed__link"> github.com/sobolevn/awesome-cryptography#readme </p></div><img class="embed__image embed__image--right" src="https://opengraph.githubassets.com/0be5a81bf37ef0fe1fff48622afb1c42e33c7ba4985767b9b6d8212d17292920/sobolevn/awesome-cryptography"/></a></div><div class="embed"><a class="embed__url" href="https://samwho.dev/memory-allocation/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" target="_blank"><img class="embed__image embed__image--top" src="https://samwho.dev/images/memory-allocation-card.png?h=8e970c33deef294c608a"/><div class="embed__content"><p class="embed__title"> Memory Allocation Visualized </p><p class="embed__description"> This is a great article that dives into the basics of memory allocation. It explains how programs use malloc and free to dynamically manage memory and the challenges that arise from this.<br><br>Check it out if you’re looking to understand the inner workings of memory management. </p><p class="embed__link"> samwho.dev/memory-allocation </p></div></a></div><div class="embed"><a class="embed__url" href="https://staysaasy.com/leadership/2024/03/12/Getting-Things-Done.html?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-pinterest-stores-and-transfers-hundreds-of-terabytes-of-data-daily" target="_blank"><div class="embed__content"><p class="embed__title"> Getting Things Done In A Chaotic Environment </p><p class="embed__description"> When you’re trying to get things done, there are four main traps: having multiple main focuses, ignoring pressing issues, not finishing tasks and taking too long.<br><br>This is a great read that goes through each of these and how you should avoid them. 
</p><p class="embed__link"> staysaasy.com/leadership/2024/03/12/Getting-Things-Done.html </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=fafe01a4-a9d1-47e7-ad93-e75a896668c1&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How Cloudflare Mitigates Thousands of DDoS Attacks Every Hour</title>
  <description>We&#39;ll talk about the different types of DDoS attacks and how Cloudflare prevents them. Plus, resources on public speaking for software engineers, things we learned about LLMs in 2024 and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/0319bda7-d6c1-4a48-b2b8-dafc7b5e6ae6/CloudFlare_Tries.gif" length="1808144" type="image/gif"/>
  <link>https://blog.quastor.org/p/how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour-25e0</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour-25e0</guid>
  <pubDate>Mon, 06 Jan 2025 15:10:00 +0000</pubDate>
  <atom:published>2025-01-06T15:10:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How Cloudflare Defines, Measures and Stops DDoS Attacks</b></p><ul><li><p class="paragraph" style="text-align:left;"> DDoS Attacks Explained (Volumetric, Application layer and Protocol layer attacks)</p></li><li><p class="paragraph" style="text-align:left;"> How to Measure DDoS Attacks</p></li><li><p class="paragraph" style="text-align:left;">Steps for Protecting from DDoS Attacks</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Resources on Public Speaking for Software Engineers</p></li><li><p class="paragraph" style="text-align:left;">Things we Learned about LLMs in 2024</p></li><li><p class="paragraph" style="text-align:left;">How to Track Engineering Time</p></li><li><p class="paragraph" style="text-align:left;">Static Search Trees and how they can be 40x Faster than Binary Search</p></li></ul></li></ul><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>How Cloudflare Defines, Measures and Stops DDoS Attacks</b></h1><p class="paragraph" style="text-align:left;">Cloudflare is one of the largest internet infrastructure companies in the world, providing CDN services, DDoS protection, DNS management and more to millions of websites. They operate a massive global network with data centers in over 330 cities worldwide, handling trillions of requests per day.</p><p class="paragraph" style="text-align:left;">One of Cloudflare&#39;s core offerings is protection against Distributed Denial of Service (DDoS) attacks. Unfortunately, these malicious attacks are a constant threat to websites. 
In 2024 alone, Cloudflare mitigated over 14.5 million DDoS attacks (<i>an average of 2,200 DDoS attacks per hour</i>).</p><p class="paragraph" style="text-align:left;">Recently, the Cloudflare engineering team published an awesome blog post delving into the evolution of DDoS attacks over the past decade. We’ll be summarizing the post and adding some extra context on DDoS attacks.</p><h2 class="heading" style="text-align:left;" id="d-do-s-attacks-explained"><b>DDoS Attacks Explained</b></h2><p class="paragraph" style="text-align:left;">With a Distributed Denial of Service Attack, a hacker will use many geographically distributed machines to send traffic to a website. These machines usually belong to unsuspecting users and have been infected with malware to make them part of the attacker’s botnet.</p><p class="paragraph" style="text-align:left;">The goal is to overwhelm the target’s backend with traffic, so the site can no longer serve legitimate users. The attacker might then request a ransom from the company, promising to stop the DDoS attack if the company pays up.</p><p class="paragraph" style="text-align:left;">DDoS Attacks can roughly be split into 3 main types: Volumetric, Application layer and Protocol attacks.</p><h3 class="heading" style="text-align:left;" id="volumetric-attacks"><b>Volumetric Attacks</b></h3><p class="paragraph" style="text-align:left;">These attacks are based on brute force techniques where the target server is flooded with data packets to consume bandwidth and server resources.</p><p class="paragraph" style="text-align:left;">Volumetric attacks will frequently rely on <i>amplification</i> and <i>reflection</i>.</p><p class="paragraph" style="text-align:left;">Amplification is where a request in a certain protocol will result in a much larger response (in terms of the number of bytes); the ratio between the request size and response size is called the Amplification Factor.</p><p class="paragraph" style="text-align:left;">Reflection is 
where the attacker will spoof the source of request packets to be the target victim’s IP address. Servers will be unable to distinguish legitimate requests from spoofed ones so they’ll send the (much larger) response payload to the targeted victim’s servers and unintentionally flood them.</p><p class="paragraph" style="text-align:left;">Network Time Protocol (NTP) DDoS attacks are an example of volumetric attacks where you can send a 234-byte spoofed request to an NTP server, which will then send a 48,000 byte response to the target victim. Attackers will repeat this on many different open NTP servers simultaneously to DDoS the victim with all the NTP responses.</p><h3 class="heading" style="text-align:left;" id="application-layer-attacks"><b>Application Layer Attacks</b></h3><p class="paragraph" style="text-align:left;">These DDoS attacks target the “top” layer in the OSI model - the application layer. Attackers might flood the backend with HTTP requests, exploit expensive API endpoints, create many SSL/TLS handshakes, etc.</p><p class="paragraph" style="text-align:left;">Database DDoS attacks are quite common, where a hacker will look for requests that are particularly database-intensive and then spam those in an attempt to exhaust the database resources. Scaling your database through read replicas takes time, so this attack can be pretty successful.</p><p class="paragraph" style="text-align:left;">HTTP Floods are some of the most widely seen layer 7 DDoS attacks, where hackers will spam a web server with HTTP GET/POST requests. Sophisticated hackers will specifically design these to request resources with low usage in order to maximize the number of cache misses the web server has.</p><h3 class="heading" style="text-align:left;" id="protocol-layer-attacks"><b>Protocol Layer Attacks</b></h3><p class="paragraph" style="text-align:left;">Protocol attacks will rely on weaknesses in how particular protocols are designed. 
Examples of these kinds of exploits are SYN floods, BGP hijacking, Smurf attacks and more.</p><p class="paragraph" style="text-align:left;">A SYN flood attack exploits how TCP is designed, specifically the handshake process. The three-way handshake consists of SYN -&gt; SYN-ACK -&gt; ACK, where the client sends a synchronize (SYN) message to initiate, the server responds with a synchronize-acknowledge (SYN-ACK) message and the client then responds with an acknowledgement (ACK) message.</p><p class="paragraph" style="text-align:left;">In a SYN flood attack, a malicious client will send large volumes of SYN messages to the server, which will then respond with SYN-ACK. The client will ignore these and never respond with an ACK message. The server will waste resources (open ports) waiting for the ACK responses from the malicious client. If repeated on a large scale, this can bring the web server down since the server won’t know which requests are legitimate.</p><h2 class="heading" style="text-align:left;" id="how-to-measure-d-do-s-attacks"><b>How to Measure DDoS Attacks</b></h2><p class="paragraph" style="text-align:left;">First of all, defining an <i>individual </i>DDoS attack can be surprisingly difficult. It is not just a one-time spike in requests. A DDoS attack can last for several hours/days and consist of many smaller incidents (<i>also known as pulses</i>).</p><p class="paragraph" style="text-align:left;">Cloudflare analyzes a combination of factors to create a “fingerprint” that helps identify different attacks as part of the same DDoS targeting. 
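</p><p class="paragraph" style="text-align:left;">As a rough sketch of the idea (<i>hypothetical event data and field names, not Cloudflare’s actual schema</i>), grouping events that share a fingerprint might look like:</p>

```python
from collections import defaultdict

# Hypothetical attack events; the field names are illustrative only.
events = [
    {"vector": "SYN flood", "target": "example.com", "signature": "botnet-a"},
    {"vector": "SYN flood", "target": "example.com", "signature": "botnet-a"},
    {"vector": "UDP flood", "target": "example.com", "signature": "botnet-b"},
]

# Pulses sharing the same (vector, target, signature) fingerprint are
# grouped together and treated as one attack campaign.
campaigns = defaultdict(list)
for event in events:
    fingerprint = (event["vector"], event["target"], event["signature"])
    campaigns[fingerprint].append(event)

print(len(campaigns))  # 2 distinct campaigns
```

<p class="paragraph" style="text-align:left;">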
</p><p class="paragraph" style="text-align:left;">Some factors Cloudflare looks at are:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Attack Vectors:</b> Are the same attack vectors being used across the set of events?</p></li><li><p class="paragraph" style="text-align:left;"><b>Targets:</b> Are all the attacks focused on the same target website/entity?</p></li><li><p class="paragraph" style="text-align:left;"><b>Payload Signatures:</b> Do the payloads share anything that could tie them to a particular botnet?</p></li></ul><p class="paragraph" style="text-align:left;">Once they have gotten a rough sense of which pulses are part of the attack, Cloudflare can measure how large the DDoS was.</p><p class="paragraph" style="text-align:left;">The main metrics they use are:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Bits per Second (BPS):</b> Measures the total data transferred per second. This is useful for evaluating network-layer attacks that aim to saturate bandwidth, like UDP floods.</p></li><li><p class="paragraph" style="text-align:left;"><b>Requests per Second (RPS):</b> Measures the number of protocol requests made each second. This is useful for application-layer attacks (Layer 7).</p></li><li><p class="paragraph" style="text-align:left;"><b>Packets per Second (PPS):</b> Represents the number of individual packets sent to the target per second, regardless of size. This is critical for network-layer attacks (Layers 3 and 4) like SYN floods.</p></li></ul><h2 class="heading" style="text-align:left;" id="how-to-protect-from-d-do-s-attacks"><b>How to Protect from DDoS Attacks</b></h2><p class="paragraph" style="text-align:left;">Unfortunately, there is no single solution for protecting your service from a DDoS attack. Large companies use many different approaches. Some of them include:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Rate Limiting:</b> This is the first line of defense. 
You set thresholds on the number of requests your server will accept from a single IP address within a given time frame.</p></li><li><p class="paragraph" style="text-align:left;"><b>Caching and CDNs:</b> You can significantly reduce the load on your web server by caching content and using a Content Delivery Network (CDN). CDNs will distribute your files across multiple servers globally, so that the impact of a DDoS attack is spread out.</p></li><li><p class="paragraph" style="text-align:left;"><b>Reducing the Attack Surface:</b> Minimize the number of services exposed to the public internet. If you have an endpoint with expensive operations, then protect it through authentication (by requiring an API key).</p></li><li><p class="paragraph" style="text-align:left;"><b>Monitoring:</b> You should continuously monitor network traffic to detect anomalies and potential attacks. First, establish a baseline for normal traffic patterns. That way, you can quickly identify unusual spikes or patterns that indicate an attack.</p></li><li><p class="paragraph" style="text-align:left;"><b>Machine Learning:</b> Services like Cloudflare use machine learning algorithms to identify suspicious traffic patterns in real time. 
They wrote an interesting blog post on the ML models they use <a class="link" href="https://blog.cloudflare.com/training-a-million-models-per-day-to-save-customers-of-all-sizes-from-ddos/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour" target="_blank" rel="noopener noreferrer nofollow">here</a>.</p></li></ul><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://simonwillison.net/2024/Dec/31/llms-in-2024/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour" target="_blank"><img class="embed__image embed__image--top" src="https://static.simonwillison.net/static/2024/arena-dec-2024.jpg"/><div class="embed__content"><p class="embed__title"> Things we learned about LLMs in 2024 </p><p class="embed__description"> Over the course of 2024, a ton of rapid advancements came in LLMs. Multiple models were released that surpassed GPT-4’s capabilities and the competition drove LLM prices down dramatically. Multimodal models also became prevalent with capabilities extending to text, image, audio and video.<br><br>However, there’s still a ton of challenges around creating reliable evals, the environmental impact, agents and more.<br><br>This is a great post that summarizes the year for LLMs. 
</p><p class="embed__link"> simonwillison.net/2024/Dec/31/llms-in-2024 </p></div></a></div><div class="embed"><a class="embed__url" href="https://jacobian.org/2024/feb/7/tracking-engineering-time/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour" target="_blank"><img class="embed__image embed__image--top" src="https://jacobian.org/cards/tracking-engineering-time.png"/><div class="embed__content"><p class="embed__title"> How to Track Engineering Time </p><p class="embed__description"> This article presents a useful playbook for tracking engineering time. It splits time spent into features, bugs/debt and toil.<br><br>Analyzing ratios like the time spent on features vs. bugs/debt can be very helpful for making informed decisions about resource allocation and project prioritization. </p><p class="embed__link"> jacobian.org/2024/feb/7/tracking-engineering-time </p></div></a></div><div class="embed"><a class="embed__url" href="https://github.com/matteofigus/awesome-speaking?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour#readme" target="_blank"><div class="embed__content"><p class="embed__title"> Resources on Public Speaking for Software Engineers </p><p class="embed__description"> The “awesome-speaking” GitHub repository is a fantastic resource for anyone looking to improve their public speaking skills, especially in the software engineering domain.<br><br>It contains links to blog posts, books and public speaking organizations. The content ranges from storytelling techniques to tips on how to handle your nerves. It’s a great repo if you want to become a more confident speaker. 
</p><p class="embed__link"> github.com/matteofigus/awesome-speaking#readme </p></div></a></div><div class="embed"><a class="embed__url" href="https://curiouscoding.nl/posts/static-search-tree/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/036667a0-3f29-4a76-8fdf-8ad781b86e62/Screenshot_2025-01-05_at_12.46.05_PM.png?t=1736099176"/><div class="embed__content"><p class="embed__title"> Static Search Trees and how they can be 40x faster than Binary Search </p><p class="embed__description"> Binary Search can be surprisingly inefficient for large datasets due to poor cache utilization. Ragnar Groot Koerkamp tackled this problem by optimizing S+ trees for high-throughput searching.<br><br>He wrote a terrific article talking about his optimizations and how he was able to achieve a 40x speedup over binary search. </p><p class="embed__link"> curiouscoding.nl/posts/static-search-tree </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=36043d36-3b6f-40fe-afdc-a6707cc08663&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How Cloudflare Mitigates Thousands of DDoS Attacks Every Hour</title>
  <description>We&#39;ll talk about the different types of DDoS attacks and how Cloudflare prevents them. Plus, resources on public speaking for software engineers, things we learned about LLMs in 2024 and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/0319bda7-d6c1-4a48-b2b8-dafc7b5e6ae6/CloudFlare_Tries.gif" length="1808144" type="image/gif"/>
  <link>https://blog.quastor.org/p/how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour</guid>
  <pubDate>Mon, 06 Jan 2025 13:49:00 +0000</pubDate>
  <atom:published>2025-01-06T13:49:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How Cloudflare Defines, Measures and Stops DDoS Attacks</b></p><ul><li><p class="paragraph" style="text-align:left;"> DDoS Attacks Explained (Volumetric, Application layer and Protocol layer attacks)</p></li><li><p class="paragraph" style="text-align:left;"> How to Measure DDoS Attacks</p></li><li><p class="paragraph" style="text-align:left;">Steps for Protecting from DDoS Attacks</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Resources on Public Speaking for Software Engineers</p></li><li><p class="paragraph" style="text-align:left;">Things we Learned about LLMs in 2024</p></li><li><p class="paragraph" style="text-align:left;">How to Track Engineering Time</p></li><li><p class="paragraph" style="text-align:left;">Static Search Trees and how they can be 40x Faster than Binary Search</p></li></ul></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcdKvZbhHDZLSLsERyr_kDIBT3i2tL4XkHcy0RoPuvkISdu1lr6U01KsNsxwEfvH-o08kZHZd1pvwvEjvEM6U5X5hvnhIKiPoEzw98vl9ADcwtGnh3NUm0ADu2x1aZf6zJl3R_jQg?key=WWLTacfcf4ieV0AckeDT6Q"/></a></div><h1 class="heading" style="text-align:left;" id="the-complete-guide-to-o-auth-20"><a class="link" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">The Complete Guide to OAuth 2.0</a></h1><p class="paragraph" 
style="text-align:left;">OAuth 2.0 is the industry standard for authorization. It lets you grant a third-party application access to data on your Google/Meta/Dropbox account without sharing your account’s password.</p><p class="paragraph" style="text-align:left;">If you’re building an app and want to add a “<i>sign in with google</i>” button then you’ll need to understand how OAuth works.</p><p class="paragraph" style="text-align:left;">Recently, WorkOS published a <a class="link" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">fantastic guide on OAuth</a>, covering everything you need to know to implement it.</p><p class="paragraph" style="text-align:left;">The <a class="link" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">guide</a> covers</p><ul><li><p class="paragraph" style="text-align:left;"><b>Roles and Terminology</b>: Explains the OAuth jargon like Resource Owner, Authorization Server, Refresh Token, etc.</p></li><li><p class="paragraph" style="text-align:left;"><b>Tokens and Credentials</b>: Understand the different types of tokens like Access Tokens, Refresh Tokens, and Authorization Codes and more</p></li><li><p class="paragraph" style="text-align:left;"><b>OAuth Flows</b>: Dive into the various OAuth flows, including Authorization Code Grant, PKCE, and Client Credentials, and learn when to use each one based on your application needs</p></li></ul><p class="paragraph" style="text-align:left;">Read the <a class="link" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">full guide</a> to learn more about OAuth 2.0 and how to implement it with WorkOS.</p><div class="button" 
style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025"><span class="button__text" style=""> Read the Complete Guide to OAuth 2.0 </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>How Cloudflare Defines, Measures and Stops DDoS Attacks</b></h1><p class="paragraph" style="text-align:left;">Cloudflare is one of the largest internet infrastructure companies in the world, providing CDN services, DDoS protection, DNS management and more to millions of websites. They operate a massive global network with data centers in over 330 cities worldwide, handling trillions of requests per day.</p><p class="paragraph" style="text-align:left;">One of Cloudflare&#39;s core offerings is protection against Distributed Denial of Service (DDoS) attacks. Unfortunately, these malicious attacks are a constant threat to websites. In 2024 alone, Cloudflare mitigated over 14.5 million DDoS attacks (<i>an average of 2,200 DDoS attacks per hour</i>).</p><p class="paragraph" style="text-align:left;">Recently, the Cloudflare engineering team published an awesome blog post delving into the evolution of DDoS attacks over the past decade. We’ll be summarizing the post and adding some extra context on DDoS attacks.</p><h2 class="heading" style="text-align:left;" id="d-do-s-attacks-explained"><b>DDoS Attacks Explained</b></h2><p class="paragraph" style="text-align:left;">With a Distributed Denial of Service Attack, a hacker will use many geographically distributed machines to send traffic to a website. 
These machines usually belong to unsuspecting users and have been infected with malware to make them part of the attacker’s botnet.</p><p class="paragraph" style="text-align:left;">The goal is to overwhelm the target’s backend with traffic, so the site can no longer serve legitimate users. The attacker might then request a ransom from the company, promising to stop the DDoS attack if the company pays up.</p><p class="paragraph" style="text-align:left;">DDoS Attacks can roughly be split into 3 main types: Volumetric, Application layer and Protocol attacks.</p><h3 class="heading" style="text-align:left;" id="volumetric-attacks"><b>Volumetric Attacks</b></h3><p class="paragraph" style="text-align:left;">These attacks are based on brute force techniques where the target server is flooded with data packets to consume bandwidth and server resources.</p><p class="paragraph" style="text-align:left;">Volumetric attacks will frequently rely on <i>amplification</i> and <i>reflection</i>.</p><p class="paragraph" style="text-align:left;">Amplification is where a request in a certain protocol will result in a much larger response (in terms of the number of bytes); the ratio between the request size and response size is called the Amplification Factor.</p><p class="paragraph" style="text-align:left;">Reflection is where the attacker will spoof the source of request packets to be the target victim’s IP address. Servers will be unable to distinguish legitimate requests from spoofed ones so they’ll send the (much larger) response payload to the targeted victim’s servers and unintentionally flood them.</p><p class="paragraph" style="text-align:left;">Network Time Protocol (NTP) DDoS attacks are an example of volumetric attacks where you can send a 234-byte spoofed request to an NTP server, which will then send a 48,000 byte response to the target victim. 
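</p><p class="paragraph" style="text-align:left;">To make the amplification factor concrete, here’s a quick back-of-the-envelope calculation using the NTP numbers above (<i>simple arithmetic, not Cloudflare’s code</i>):</p>

```python
# Amplification factor = response size / request size.
def amplification_factor(request_bytes: int, response_bytes: int) -> float:
    return response_bytes / request_bytes

# NTP example from above: a 234-byte spoofed request triggers
# a roughly 48,000-byte response aimed at the victim.
print(f"~{amplification_factor(234, 48_000):.0f}x")  # prints "~205x"
```

<p class="paragraph" style="text-align:left;">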
Attackers will repeat this on many different open NTP servers simultaneously to DDoS the victim with all the NTP responses.</p><h3 class="heading" style="text-align:left;" id="application-layer-attacks"><b>Application Layer Attacks</b></h3><p class="paragraph" style="text-align:left;">These DDoS attacks target the “top” layer in the OSI model - the application layer. Attackers might flood the backend with HTTP requests, exploit expensive API endpoints, create many SSL/TLS handshakes, etc.</p><p class="paragraph" style="text-align:left;">Database DDoS attacks are quite common, where a hacker will look for requests that are particularly database-intensive and then spam those in an attempt to exhaust the database resources. Scaling your database through read replicas takes time, so this attack can be pretty successful.</p><p class="paragraph" style="text-align:left;">HTTP Floods are some of the most widely seen layer 7 DDoS attacks, where hackers will spam a web server with HTTP GET/POST requests. Sophisticated hackers will specifically design these to request resources with low usage in order to maximize the number of cache misses the web server has.</p><h3 class="heading" style="text-align:left;" id="protocol-layer-attacks"><b>Protocol Layer Attacks</b></h3><p class="paragraph" style="text-align:left;">Protocol attacks will rely on weaknesses in how particular protocols are designed. Examples of these kinds of exploits are SYN floods, BGP hijacking, Smurf attacks and more.</p><p class="paragraph" style="text-align:left;">A SYN flood attack exploits how TCP is designed, specifically the handshake process. 
The three-way handshake consists of SYN -&gt; SYN-ACK -&gt; ACK, where the client sends a synchronize (SYN) message to initiate, the server responds with a synchronize-acknowledge (SYN-ACK) message and the client then responds with an acknowledgement (ACK) message.</p><p class="paragraph" style="text-align:left;">In a SYN flood attack, a malicious client will send large volumes of SYN messages to the server, which will then respond with SYN-ACK. The client will ignore these and never respond with an ACK message. The server will waste resources (open ports) waiting for the ACK responses from the malicious client. If repeated on a large scale, this can bring the web server down since the server won’t know which requests are legitimate.</p><h2 class="heading" style="text-align:left;" id="how-to-measure-d-do-s-attacks"><b>How to Measure DDoS Attacks</b></h2><p class="paragraph" style="text-align:left;">First of all, defining an <i>individual </i>DDoS attack can be surprisingly difficult. It is not just a one-time spike in requests. A DDoS attack can last for several hours/days and consist of many smaller incidents (<i>also known as pulses</i>).</p><p class="paragraph" style="text-align:left;">Cloudflare analyzes a combination of factors to create a “fingerprint” that helps identify different attacks as part of the same DDoS targeting. 
</p><p class="paragraph" style="text-align:left;">Some factors Cloudflare looks at are:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Attack Vectors:</b> Are the same attack vectors being used across the set of events?</p></li><li><p class="paragraph" style="text-align:left;"><b>Targets:</b> Are all the attacks focused on the same target website/entity?</p></li><li><p class="paragraph" style="text-align:left;"><b>Payload Signatures:</b> Do the payloads share anything that could tie them to a particular botnet?</p></li></ul><p class="paragraph" style="text-align:left;">Once they have gotten a rough sense of which pulses are part of the attack, Cloudflare can measure how large the DDoS was.</p><p class="paragraph" style="text-align:left;">The main metrics they use are:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Bits per Second (BPS):</b> Measures the total data transferred per second. This is useful for evaluating network-layer attacks that aim to saturate bandwidth, like UDP floods.</p></li><li><p class="paragraph" style="text-align:left;"><b>Requests per Second (RPS):</b> Measures the number of protocol requests made each second. This is useful for application-layer attacks (Layer 7).</p></li><li><p class="paragraph" style="text-align:left;"><b>Packets per Second (PPS):</b> Represents the number of individual packets sent to the target per second, regardless of size. This is critical for network-layer attacks (Layers 3 and 4) like SYN floods.</p></li></ul><h2 class="heading" style="text-align:left;" id="how-to-protect-from-d-do-s-attacks"><b>How to Protect from DDoS Attacks</b></h2><p class="paragraph" style="text-align:left;">Unfortunately, there is no single solution for protecting your service from a DDoS attack. Large companies use many different approaches. Some of them include:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Rate Limiting:</b> This is the first line of defense. 
You set thresholds on the number of requests your server will accept from a single IP address within a given time frame.</p></li><li><p class="paragraph" style="text-align:left;"><b>Caching and CDNs:</b> You can significantly reduce the load on your web server by caching content and using a Content Delivery Network (CDN). CDNs will distribute your files across multiple servers globally, so that the impact of a DDoS attack is spread out.</p></li><li><p class="paragraph" style="text-align:left;"><b>Reducing the Attack Surface:</b> Minimize the number of services exposed to the public internet. If you have an endpoint with expensive operations, then protect it through authentication (by requiring an API key).</p></li><li><p class="paragraph" style="text-align:left;"><b>Monitoring:</b> You should continuously monitor network traffic to detect anomalies and potential attacks. First, establish a baseline for normal traffic patterns. That way, you can quickly identify unusual spikes or patterns that indicate an attack.</p></li><li><p class="paragraph" style="text-align:left;"><b>Machine Learning</b><b> </b>- services like Cloudflare use machine learning algorithms to identify suspicious traffic patterns in real time. 
They wrote an interesting blog post on the ML models they use <a class="link" href="https://blog.cloudflare.com/training-a-million-models-per-day-to-save-customers-of-all-sizes-from-ddos/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour" target="_blank" rel="noopener noreferrer nofollow">here</a>.</p></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" rel="noopener" target="_blank"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcdKvZbhHDZLSLsERyr_kDIBT3i2tL4XkHcy0RoPuvkISdu1lr6U01KsNsxwEfvH-o08kZHZd1pvwvEjvEM6U5X5hvnhIKiPoEzw98vl9ADcwtGnh3NUm0ADu2x1aZf6zJl3R_jQg?key=WWLTacfcf4ieV0AckeDT6Q"/></a></div><h1 class="heading" style="text-align:left;" id="the-complete-guide-to-o-auth-20"><a class="link" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">The Complete Guide to OAuth 2.0</a></h1><p class="paragraph" style="text-align:left;">OAuth 2.0 is the industry standard for authorization. 
It lets you grant a third-party application access to data on your Google/Meta/Dropbox account without sharing your account’s password.</p><p class="paragraph" style="text-align:left;">If you’re building an app and want to add a “<i>sign in with Google</i>” button then you’ll need to understand how OAuth works.</p><p class="paragraph" style="text-align:left;">Recently, WorkOS published a <a class="link" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">fantastic guide on OAuth</a>, covering everything you need to know to implement it.</p><p class="paragraph" style="text-align:left;">The <a class="link" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">guide</a> covers</p><ul><li><p class="paragraph" style="text-align:left;"><b>Roles and Terminology</b>: Explains the OAuth jargon like Resource Owner, Authorization Server, Refresh Token, etc.</p></li><li><p class="paragraph" style="text-align:left;"><b>Tokens and Credentials</b>: Understand the different types of tokens, like Access Tokens, Refresh Tokens, and Authorization Codes</p></li><li><p class="paragraph" style="text-align:left;"><b>OAuth Flows</b>: Dive into the various OAuth flows, including Authorization Code Grant, PKCE, and Client Credentials, and learn when to use each one based on your application needs</p></li></ul><p class="paragraph" style="text-align:left;">Read the <a class="link" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025" target="_blank" rel="noopener noreferrer nofollow">full guide</a> to learn more about OAuth 2.0 and how to implement it with WorkOS.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" 
style="" href="https://workos.com/blog/the-complete-guide-to-oauth?utm_source=quastor&utm_medium=newsletter&utm_campaign=q12025"><span class="button__text" style=""> Read the Complete Guide to OAuth 2.0 </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://simonwillison.net/2024/Dec/31/llms-in-2024/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour" target="_blank"><img class="embed__image embed__image--top" src="https://static.simonwillison.net/static/2024/arena-dec-2024.jpg"/><div class="embed__content"><p class="embed__title"> Things we learned about LLMs in 2024 </p><p class="embed__description"> Over the course of 2024, a ton of rapid advancements came in LLMs. Multiple models were released that surpassed GPT-4’s capabilities and the competition drove LLM prices down dramatically. Multimodal models also became prevalent with capabilities extending to text, image, audio and video.<br><br>However, there’s still a ton of challenges around creating reliable evals, the environmental impact, agents and more.<br><br>This is a great post that summarizes the year for LLMs. </p><p class="embed__link"> simonwillison.net/2024/Dec/31/llms-in-2024 </p></div></a></div><div class="embed"><a class="embed__url" href="https://jacobian.org/2024/feb/7/tracking-engineering-time/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour" target="_blank"><img class="embed__image embed__image--top" src="https://jacobian.org/cards/tracking-engineering-time.png"/><div class="embed__content"><p class="embed__title"> How to Track Engineering Time </p><p class="embed__description"> This article presents a useful playbook for tracking engineering time. 
It splits time spent into features, bugs/debt and toil.<br><br>Analyzing ratios like the time spent on features vs. bugs/debt can be very helpful for making informed decisions about resource allocation and project prioritization. </p><p class="embed__link"> jacobian.org/2024/feb/7/tracking-engineering-time </p></div></a></div><div class="embed"><a class="embed__url" href="https://github.com/matteofigus/awesome-speaking?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour#readme" target="_blank"><div class="embed__content"><p class="embed__title"> Resources on Public Speaking for Software Engineers </p><p class="embed__description"> The “awesome-speaking” GitHub repository is a fantastic resource for anyone looking to improve their public speaking skills, especially in the software engineering domain.<br><br>It contains links to blog posts, books and public speaking organizations. The content ranges from storytelling techniques to tips on how to handle your nerves. It’s a great repo if you want to become a more confident speaker. </p><p class="embed__link"> github.com/matteofigus/awesome-speaking#readme </p></div></a></div><div class="embed"><a class="embed__url" href="https://curiouscoding.nl/posts/static-search-tree/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-cloudflare-mitigates-thousands-of-ddos-attacks-every-hour" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/036667a0-3f29-4a76-8fdf-8ad781b86e62/Screenshot_2025-01-05_at_12.46.05_PM.png?t=1736099176"/><div class="embed__content"><p class="embed__title"> Static Search Trees and how they can be 40x faster than Binary Search </p><p class="embed__description"> Binary Search can be surprisingly inefficient for large datasets due to poor cache utilization. 
Ragnar Groot Koerkamp tackled this problem by optimizing S+ trees for high-throughput searching.<br><br>He wrote a terrific article talking about his optimizations and how he was able to achieve a 40x speedup over binary search. </p><p class="embed__link"> curiouscoding.nl/posts/static-search-tree </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=95088630-4a87-43c1-b8a7-3e9c6f384b53&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>The Architecture of Canva&#39;s Data Platform</title>
  <description>We&#39;ll talk about Snowflake and how Canva built a monitoring system around it. Plus, how to optimize code, why Unicode is harder than you think, automating your job search and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/dea01670-c170-493f-9db6-fe8f23f8bd40/new_diagram.gif" length="428407" type="image/gif"/>
  <link>https://blog.quastor.org/p/the-architecture-of-canva-s-data-platform-b20a</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/the-architecture-of-canva-s-data-platform-b20a</guid>
  <pubDate>Tue, 31 Dec 2024 10:30:00 +0000</pubDate>
  <atom:published>2024-12-31T10:30:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>The Architecture of Canva’s Data Platform</b></p><ul><li><p class="paragraph" style="text-align:left;"> Introduction to Snowflake</p></li><li><p class="paragraph" style="text-align:left;"> Architecture of Canva’s Data Platform</p></li><li><p class="paragraph" style="text-align:left;">How Canva Monitors their Snowflake Usage to avoid Expensive Surprises</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">The Four Kinds of Optimization</p></li><li><p class="paragraph" style="text-align:left;">Unicode is Harder Than You Think</p></li><li><p class="paragraph" style="text-align:left;">How I Automated My Job Application Process</p></li><li><p class="paragraph" style="text-align:left;">Three Bucket Framework for Engineering Metrics</p></li></ul></li></ul><div class="custom_html"><iframe src="https://embeds.beehiiv.com/c8ef45bf-1b8e-469c-baf2-b51f4701e532" data-test-id="beehiiv-embed" width="100%" height="320" frameborder="0" style="border-radius: 4px; border: 2px solid #e5e7eb; margin: 0; background-color: transparent;"></iframe></div><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user">The Architecture of Canva&#39;s Data Platform</h1><p class="paragraph" style="text-align:left;">Canva is an online graphic design platform that lets you create social media posts, posters, videos, logos and more. 
They have a ton of pre-built templates that help you create professional looking designs even if you have the visual-design skills of a 4 year old.</p><p class="paragraph" style="text-align:left;">The company was founded in 2013 and has exploded in popularity with hundreds of millions of users globally. This hyper growth has caused a ton of interesting engineering challenges, particularly with their data platform.</p><p class="paragraph" style="text-align:left;">Canva uses Snowflake for the core of their data platform. They store over 25 petabytes of data and execute over 90 million queries a month. Over two-thirds of employees at Canva use Snowflake in some way (writing SQL queries, relying on business-intelligence dashboards, etc.)</p><p class="paragraph" style="text-align:left;">However, Snowflake costs can get out-of-hand <i>very</i> quickly if you don’t effectively monitor and optimize your usage. Canva wrote a fantastic <a class="link" href="https://www.canva.dev/blog/engineering/our-journey-to-snowflake-monitoring-mastery/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank" rel="noopener noreferrer nofollow">engineering blog</a> on the tooling they built to do this.</p><h2 class="heading" style="text-align:left;" id="introduction-to-snowflake"><b>Introduction to Snowflake</b></h2><p class="paragraph" style="text-align:left;">Snowflake is a cloud data warehouse platform that has grown incredibly quickly since its launch in 2014.</p><p class="paragraph" style="text-align:left;">At the time of their founding, companies were mainly using on-prem data warehouses like Teradata or Oracle. However, these were <i>super expensive </i>to scale and maintain. 
You’d have to purchase a ton of hardware upfront and managing all the infrastructure was a huge pain.</p><p class="paragraph" style="text-align:left;">In response, Snowflake built a cloud-native data warehouse that runs on top of AWS, Azure and Google Cloud. You pick the cloud provider that fits with the rest of your infrastructure.</p><p class="paragraph" style="text-align:left;">Data in Snowflake is stored in a columnar format in “micro-partitions”. These are contiguous storage units that contain 50-500 MB of data and help massively with speeding up analytical queries.</p><p class="paragraph" style="text-align:left;">You can access data in Snowflake with SQL, the API (with connectors for Python, Java, Node.js, etc.), the web interface and other third-party tools.</p><p class="paragraph" style="text-align:left;">Some of the core selling points of Snowflake are</p><ul><li><p class="paragraph" style="text-align:left;"><b>Separation of Storage and Compute</b> - Snowflake allows you to scale your <i>compute</i> resources independently of your <i>storage</i> resources. If you have a small dataset but need extensive processing, then you can scale up your compute resources without having to pay for storage you don’t need. AWS Redshift added this capability with RA3 instances in late 2019.</p></li><li><p class="paragraph" style="text-align:left;"><b>Low Management and Security</b> - Snowflake is cloud-native so it has far lower maintenance than an on-prem data warehouse. 
It also provides a ton of security features like <a class="link" href="https://docs.snowflake.com/en/user-guide/data-time-travel?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank" rel="noopener noreferrer nofollow">Time Travel</a> that lets you access historical data versions for data recovery and auditing.</p></li><li><p class="paragraph" style="text-align:left;"><b>Data Integrations</b> - Snowflake supports both structured and semi-structured data formats like CSV, JSON, Avro, Parquet and more. It also has integrations with a ton of data tools like Kafka, Spark, Tableau, etc.</p></li></ul><p class="paragraph" style="text-align:left;">The main cons of Snowflake are vendor lock-in and (<i>potentially</i>) the costs.</p><p class="paragraph" style="text-align:left;">Snowflake’s pricing can be notoriously complex and difficult to predict. Not monitoring/optimizing your usage can quickly result in a very expensive surprise.</p><h2 class="heading" style="text-align:left;" id="snowflake-at-canva"><b>Snowflake at Canva</b></h2><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeH_IESlg1LAEEOaLBOpxB-PcHID38XDVb1eU92Be4qzsZm7PK3yI9aAdz16_97dgXNct8SseIPudnFrDmPbMEAy_8pIASSBsq1oM715ENciiBkGMlgpuEbDTc6zUrIcjjBKdgWSw?key=XhIYY0F08Di48z6W3kJ_cavu"/></div><p class="paragraph" style="text-align:left;">Here are the steps in Canva’s Data Platform</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Data Ingestion</b> - First party data (<i>generated by services at Canva</i>) is ingested through AWS S3. Third party data (<i>Facebook ad spend results, Google organic search data, etc.</i>) is ingested with Fivetran for ETL</p></li><li><p class="paragraph" style="text-align:left;"><b>Data Transformation</b><b> </b>- Canva uses dbt for transforming and cleaning raw data into a format that supports analysis and reporting. 
dbt is an open source tool where you can write your data transformation logic in SQL or Python and organize it into maintainable components. </p></li><li><p class="paragraph" style="text-align:left;"><b>Data Views</b><b> </b>- Canva uses Census to synchronize enriched datasets to third-party systems. They also use Looker and Mode for data visualization and exploration.</p></li></ol><h2 class="heading" style="text-align:left;" id="canvas-snowflake-monitoring-system"><b>Canva’s Snowflake Monitoring System</b></h2><p class="paragraph" style="text-align:left;">In order to avoid overspending, the Canva team built out an extensive monitoring system around Snowflake and their usage.</p><p class="paragraph" style="text-align:left;">They started by using Snowflake’s <a class="link" href="https://docs.snowflake.com/en/sql-reference/account-usage/query_history?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank" rel="noopener noreferrer nofollow">account_usage.query_history view</a> to collect usage data and they stored this in a dedicated dbt project. They also captured detailed metadata on all their other dbt runs (<i>for their main data transformations</i>) and stored this data in S3.</p><p class="paragraph" style="text-align:left;">In order to link specific Snowflake queries with the corresponding dbt models, Canva developed a custom dbt query tagging macro that appended JSON metadata to each query. 
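As a rough sketch of how this kind of tagging can work (illustrative only; the tag fields below are hypothetical and this is not Canva's actual macro), Snowflake lets a session attach JSON metadata via its QUERY_TAG parameter, which later surfaces in the query_tag column of the account_usage.query_history view:

```python
import json

def build_query_tag(dbt_model, team, invocation_id):
    # JSON metadata attached to every query the session runs; it later
    # appears in the query_tag column of account_usage.query_history,
    # letting you join per-query cost back to the dbt model that ran it.
    return json.dumps({
        "dbt_model": dbt_model,
        "team": team,
        "invocation_id": invocation_id,
    })

def set_tag_statement(tag):
    # Statement a connector would execute before running the model's SQL.
    # Single quotes are doubled to keep the SQL string literal valid.
    escaped = tag.replace("'", "''")
    return f"ALTER SESSION SET QUERY_TAG = '{escaped}'"

tag = build_query_tag("fct_daily_active_users", "analytics", "run-2024-12-01")
print(set_tag_statement(tag))
```

With tags shaped like this, a cost dashboard can aggregate query_history by the parsed dbt_model or team field instead of only by warehouse.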
They could use this to track the usage at a per-query level and then assign costs to individual queries and transformations.</p><p class="paragraph" style="text-align:left;">With this foundation, Canva created dashboards that provided Data Platform engineers with real-time metrics on how Snowflake was being used, which teams were incurring costs and how usage could be optimized.</p><div class="custom_html"><iframe src="https://embeds.beehiiv.com/c8ef45bf-1b8e-469c-baf2-b51f4701e532" data-test-id="beehiiv-embed" width="100%" height="320" frameborder="0" style="border-radius: 4px; border: 2px solid #e5e7eb; margin: 0; background-color: transparent;"></iframe></div><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://tratt.net/laurie/blog/2023/four_kinds_of_optimisation.html?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank"><div class="embed__content"><p class="embed__title"> The Four Kinds of Optimization </p><p class="embed__description"> This article has a useful list of “mental models” you should have when thinking about optimizing your code.<br><br>The four strategies discussed are<br>1. Find a better algorithm<br>2. Use a more efficient data structure<br>3. Use a lower-level system<br>4. 
Accept a less precise solution<br><br>The article talks about each of these four and the trade-offs involved </p><p class="embed__link"> tratt.net/laurie/blog/2023/four_kinds_of_optimisation.html </p></div></a></div><div class="embed"><a class="embed__url" href="https://mcilloni.ovh/2023/07/23/unicode-is-hard/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank"><div class="embed__content"><p class="embed__title"> Unicode is harder than you think </p><p class="embed__description"> This is a great article that explores Unicode and talks about common misconceptions/pitfalls you may face. It talks about the different encoding formats (UTF-8, UTF-16, etc.), normalization and best practices when working with Unicode. </p><p class="embed__link"> mcilloni.ovh/2023/07/23/unicode-is-hard </p></div></a></div><div class="embed"><a class="embed__url" href="https://blog.daviddodda.com/how-i-automated-my-job-application-process-part-1?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank"><div class="embed__content"><p class="embed__title"> How I Automated My Job Application Process </p><p class="embed__description"> This is an interesting article by David Dodda on how he automated his job search process with LLMs. He built a system that helped him submit 250 job applications in 20 minutes.<br><br>If you’re currently hunting for a job, it might give you some ideas on tactics you can test out to make the process easier. 
</p><p class="embed__link"> blog.daviddodda.com/how-i-automated-my-job-application-process-part-1 </p></div></a></div><div class="embed"><a class="embed__url" href="https://newsletter.getdx.com/p/choosing-engineering-metrics?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/3fe9cd01-b2e8-41a5-870c-1506a26a013b/eba972e1-f1bc-4bac-a011-08210a42ef2e_1578x736.jpg?t=1735585901"/><div class="embed__content"><p class="embed__title"> Three-Bucket Framework for Engineering Metrics </p><p class="embed__description"> Abi Noda wrote a terrific article on how engineering leaders can effectively report metrics to stakeholders.<br><br>His three-bucket framework is<br>1. Demonstrate Business Impact with ROI<br>2. Show System Performance with Uptime, Incidents, Scale<br>3. Present Developer Effectiveness with SPACE and DORA </p><p class="embed__link"> newsletter.getdx.com/p/choosing-engineering-metrics </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=644c3284-d177-469f-9e54-f0e858fd5967&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>The Architecture of Canva&#39;s Data Platform</title>
  <description>We&#39;ll talk about Snowflake and how Canva built a monitoring system around it. Plus, how to optimize code, why Unicode is harder than you think, automating your job search and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/67d36047-ad82-437a-8e7e-7642d2db0841/Canva_Snowflake_Data_Architecture.gif" length="698670" type="image/gif"/>
  <link>https://blog.quastor.org/p/the-architecture-of-canva-s-data-platform</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/the-architecture-of-canva-s-data-platform</guid>
  <pubDate>Mon, 30 Dec 2024 22:15:00 +0000</pubDate>
  <atom:published>2024-12-30T22:15:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>The Architecture of Canva’s Data Platform</b></p><ul><li><p class="paragraph" style="text-align:left;"> Introduction to Snowflake</p></li><li><p class="paragraph" style="text-align:left;"> Architecture of Canva’s Data Platform</p></li><li><p class="paragraph" style="text-align:left;">How Canva Monitors their Snowflake Usage to avoid Expensive Surprises</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">The Four Kinds of Optimization</p></li><li><p class="paragraph" style="text-align:left;">Unicode is Harder Than You Think</p></li><li><p class="paragraph" style="text-align:left;">How I Automated My Job Application Process</p></li><li><p class="paragraph" style="text-align:left;">Three Bucket Framework for Engineering Metrics</p></li></ul></li></ul><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user">The Architecture of Canva&#39;s Data Platform</h1><p class="paragraph" style="text-align:left;">Canva is an online graphic design platform that lets you create social media posts, posters, videos, logos and more. They have a ton of pre-built templates that help you create professional looking designs even if you have the visual-design skills of a 4 year old.</p><p class="paragraph" style="text-align:left;">The company was founded in 2013 and has exploded in popularity with hundreds of millions of users globally. This hyper growth has caused a ton of interesting engineering challenges, particularly with their data platform.</p><p class="paragraph" style="text-align:left;">Canva uses Snowflake for the core of their data platform. 
They store over 25 petabytes of data and execute over 90 million queries a month. Over two-thirds of employees at Canva use Snowflake in some way (writing SQL queries, relying on business-intelligence dashboards, etc.)</p><p class="paragraph" style="text-align:left;">However, Snowflake costs can get out-of-hand <i>very</i> quickly if you don’t effectively monitor and optimize your usage. Canva wrote a fantastic <a class="link" href="https://www.canva.dev/blog/engineering/our-journey-to-snowflake-monitoring-mastery/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank" rel="noopener noreferrer nofollow">engineering blog</a> on the tooling they built to do this.</p><h2 class="heading" style="text-align:left;" id="introduction-to-snowflake"><b>Introduction to Snowflake</b></h2><p class="paragraph" style="text-align:left;">Snowflake is a cloud data warehouse platform that has grown incredibly quickly since its launch in 2014.</p><p class="paragraph" style="text-align:left;">At the time of their founding, companies were mainly using on-prem data warehouses like Teradata or Oracle. However, these were <i>super expensive </i>to scale and maintain. You’d have to purchase a ton of hardware upfront and managing all the infrastructure was a huge pain.</p><p class="paragraph" style="text-align:left;">In response, Snowflake built a cloud-native data warehouse that runs on top of AWS, Azure and Google Cloud. You pick the cloud provider that fits with the rest of your infrastructure.</p><p class="paragraph" style="text-align:left;">Data in Snowflake is stored in a columnar format in “micro-partitions”. 
These are contiguous storage units that contain 50-500 MB of data and help massively with speeding up analytical queries.</p><p class="paragraph" style="text-align:left;">You can access data in Snowflake with SQL, the API (with connectors for Python, Java, Node.js, etc.), the web interface and other third-party tools.</p><p class="paragraph" style="text-align:left;">Some of the core selling points of Snowflake are</p><ul><li><p class="paragraph" style="text-align:left;"><b>Separation of Storage and Compute</b> - Snowflake allows you to scale your <i>compute</i> resources independently of your <i>storage</i> resources. If you have a small dataset but need extensive processing, then you can scale up your compute resources without having to pay for storage you don’t need. AWS Redshift added this capability with RA3 instances in late 2019.</p></li><li><p class="paragraph" style="text-align:left;"><b>Low Management and Security</b> - Snowflake is cloud-native so it has far lower maintenance than an on-prem data warehouse. It also provides a ton of security features like <a class="link" href="https://docs.snowflake.com/en/user-guide/data-time-travel?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank" rel="noopener noreferrer nofollow">Time Travel</a> that lets you access historical data versions for data recovery and auditing.</p></li><li><p class="paragraph" style="text-align:left;"><b>Data Integrations</b> - Snowflake supports both structured and semi-structured data formats like CSV, JSON, Avro, Parquet and more. It also has integrations with a ton of data tools like Kafka, Spark, Tableau, etc.</p></li></ul><p class="paragraph" style="text-align:left;">The main cons of Snowflake are vendor lock-in and (<i>potentially</i>) the costs.</p><p class="paragraph" style="text-align:left;">Snowflake’s pricing can be notoriously complex and difficult to predict. 
Not monitoring/optimizing your usage can quickly result in a very expensive surprise.</p><h2 class="heading" style="text-align:left;" id="snowflake-at-canva"><b>Snowflake at Canva</b></h2><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeH_IESlg1LAEEOaLBOpxB-PcHID38XDVb1eU92Be4qzsZm7PK3yI9aAdz16_97dgXNct8SseIPudnFrDmPbMEAy_8pIASSBsq1oM715ENciiBkGMlgpuEbDTc6zUrIcjjBKdgWSw?key=XhIYY0F08Di48z6W3kJ_cavu"/></div><p class="paragraph" style="text-align:left;">Here are the steps in Canva’s Data Platform</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Data Ingestion</b> - First party data (<i>generated by services at Canva</i>) is ingested through AWS S3. Third party data (<i>Facebook ad spend results, Google organic search data, etc.</i>) is ingested with Fivetran for ETL</p></li><li><p class="paragraph" style="text-align:left;"><b>Data Transformation</b><b> </b>- Canva uses dbt for transforming and cleaning raw data into a format that supports analysis and reporting. dbt is an open source tool where you can write your data transformation logic in SQL or Python and organize it into maintainable components. </p></li><li><p class="paragraph" style="text-align:left;"><b>Data Views</b><b> </b>- Canva uses Census to synchronize enriched datasets to third-party systems. 
They also use Looker and Mode for data visualization and exploration.</p></li></ol><h2 class="heading" style="text-align:left;" id="canvas-snowflake-monitoring-system"><b>Canva’s Snowflake Monitoring System</b></h2><p class="paragraph" style="text-align:left;">In order to avoid overspending, the Canva team built out an extensive monitoring system around Snowflake and their usage.</p><p class="paragraph" style="text-align:left;">They started by using Snowflake’s <a class="link" href="https://docs.snowflake.com/en/sql-reference/account-usage/query_history?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank" rel="noopener noreferrer nofollow">account_usage.query_history view</a> to collect usage data and they stored this in a dedicated dbt project. They also captured detailed metadata on all their other dbt runs (<i>for their main data transformations</i>) and stored this data in S3.</p><p class="paragraph" style="text-align:left;">In order to link specific Snowflake queries with the corresponding dbt models, Canva developed a custom dbt query tagging macro that appended JSON metadata to each query. 
They could use this to track the usage at a per-query level and then assign costs to individual queries and transformations.</p><p class="paragraph" style="text-align:left;">With this foundation, Canva created dashboards that provided Data Platform engineers with real-time metrics on how Snowflake was being used, which teams were incurring costs and how usage could be optimized.</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://tratt.net/laurie/blog/2023/four_kinds_of_optimisation.html?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank"><div class="embed__content"><p class="embed__title"> The Four Kinds of Optimization </p><p class="embed__description"> This article has a useful list of “mental models” you should have when thinking about optimizing your code.<br><br>The four strategies discussed are<br>1. Find a better algorithm<br>2. Use a more efficient data structure<br>3. Use a lower-level system<br>4. Accept a less precise solution<br><br>The article talks about each of these four and the trade-offs involved </p><p class="embed__link"> tratt.net/laurie/blog/2023/four_kinds_of_optimisation.html </p></div></a></div><div class="embed"><a class="embed__url" href="https://mcilloni.ovh/2023/07/23/unicode-is-hard/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank"><div class="embed__content"><p class="embed__title"> Unicode is harder than you think </p><p class="embed__description"> This is a great article that explores Unicode and talks about common misconceptions/pitfalls you may face. It talks about the different encoding formats (UTF-8, UTF-16, etc.), normalization and best practices when working with Unicode. 
</p><p class="embed__link"> mcilloni.ovh/2023/07/23/unicode-is-hard </p></div></a></div><div class="embed"><a class="embed__url" href="https://blog.daviddodda.com/how-i-automated-my-job-application-process-part-1?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank"><div class="embed__content"><p class="embed__title"> How I Automated My Job Application Process </p><p class="embed__description"> This is an interesting article by David Dodda on how he automated his job search process with LLMs. He built a system that helped him submit 250 job applications in 20 minutes.<br><br>If you’re currently hunting for a job, it might give you some ideas on tactics you can test out to make the process easier. </p><p class="embed__link"> blog.daviddodda.com/how-i-automated-my-job-application-process-part-1 </p></div></a></div><div class="embed"><a class="embed__url" href="https://newsletter.getdx.com/p/choosing-engineering-metrics?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-canva-s-data-platform" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/3fe9cd01-b2e8-41a5-870c-1506a26a013b/eba972e1-f1bc-4bac-a011-08210a42ef2e_1578x736.jpg?t=1735585901"/><div class="embed__content"><p class="embed__title"> Three-Bucket Framework for Engineering Metrics </p><p class="embed__description"> Abi Noda wrote a terrific article on how engineering leaders can effectively report metrics to stakeholders.<br><br>His three-bucket framework is<br>1. Demonstrate Business Impact with ROI<br>2. Show System Performance with Uptime, Incidents, Scale<br>3. 
Present Developer Effectiveness with SPACE and DORA </p><p class="embed__link"> newsletter.getdx.com/p/choosing-engineering-metrics </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=747600e3-5713-4e2b-8839-db7a4d557d02&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>The Architecture of Stripe&#39;s Document Database</title>
  <description>Stripe built a document database on top of MongoDB. We&#39;ll go over its architecture and why they built it. Plus, how Nginx scales, how to give constructive feedback and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/fcdd51ce-ee8c-49a9-93ce-3472f0ef459c/Untitled_Diagram.gif" length="1140805" type="image/gif"/>
  <link>https://blog.quastor.org/p/the-architecture-of-stripe-s-document-database</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/the-architecture-of-stripe-s-document-database</guid>
  <pubDate>Mon, 16 Dec 2024 22:15:00 +0000</pubDate>
  <atom:published>2024-12-16T22:15:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>The Architecture of Stripe’s Document Database</b> - Stripe wrote a great blog post describing DocDB, their internal database as a service. DocDB is built on MongoDB and stores petabytes of data.</p><ul><li><p class="paragraph" style="text-align:left;">Brief Intro to MongoDB and its Benefits</p></li><li><p class="paragraph" style="text-align:left;">Why Stripe built DocDB</p></li><li><p class="paragraph" style="text-align:left;">Architecture of DocDB and how it works</p></li><li><p class="paragraph" style="text-align:left;">Rebalancing Data Shards on DocDB for Efficiency</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Going from New Grad to Staff Engineer in 3 Years at Meta</p></li><li><p class="paragraph" style="text-align:left;">How Nginx Scales</p></li><li><p class="paragraph" style="text-align:left;">An Engineering Philosophy of “Let It Fail“</p></li><li><p class="paragraph" style="text-align:left;">How to Give Constructive Feedback to Coworkers</p></li></ul></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://workos.com/blog/the-developers-guide-to-fine-grained-authorization?utm_source=quastor&utm_medium=newsletter&utm_campaign=q42024" rel="noopener" target="_blank"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXecbkovj4J42wPIX6Z3yrzSGitrwqqzDuWTV408-oiB12a6n38kjqO7m0FlDY-lCkx9LJVW8fQtOtCoPsZHiszuvUqF1LHqqQgJnNZUq8HbXHP5yhewHxm18m9SPAjX3iP2B_S1u8-6vtzFhN_0M7SmrRw?key=WWLTacfcf4ieV0AckeDT6Q"/></a></div><h1 class="heading" 
style="text-align:left;" id="the-developers-guide-to-fine-graine"><a class="link" href="https://workos.com/blog/the-developers-guide-to-fine-grained-authorization?utm_source=quastor&utm_medium=newsletter&utm_campaign=q42024" target="_blank" rel="noopener noreferrer nofollow">The Developer’s Guide to Fine Grained Authorization</a></h1><p class="paragraph" style="text-align:left;">As your application grows in complexity, you’ll need to implement a more granular and scalable authorization system.</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://workos.com/blog/the-developers-guide-to-fine-grained-authorization?utm_source=quastor&utm_medium=newsletter&utm_campaign=q42024" target="_blank" rel="noopener noreferrer nofollow">Fine-grained authorization (FGA)</a> systems are where you define permissions <i>at the resource level</i> and give individual users permissions to read/modify specific resources (a document in Google Drive, a bank account at JP Morgan, etc.). To build an FGA system, you’ll need to provide this precision while also handling thousands of authorization requests per second.</p><p class="paragraph" style="text-align:left;">WorkOS published an <a class="link" href="https://workos.com/blog/the-developers-guide-to-fine-grained-authorization?utm_source=quastor&utm_medium=newsletter&utm_campaign=q42024" target="_blank" rel="noopener noreferrer nofollow">in-depth guide</a> that walks you through everything you need to know about FGA: from designing your data model to leveraging third-party solutions and optimizing your UI.</p><p class="paragraph" style="text-align:left;">In the guide, you’ll understand</p><ul><li><p class="paragraph" style="text-align:left;"><b>FGA Basics</b> - Grasp the relationship between users, resources and roles</p></li><li><p class="paragraph" style="text-align:left;"><b>How to Build Resolver Logic</b> - Implement custom logic to handle complex authorization scenarios</p></li><li><p class="paragraph" 
style="text-align:left;"><b>Centralized vs. Decentralized FGA</b> - How to decide the best architectural approach for your authorization system</p></li></ul><p class="paragraph" style="text-align:left;">Read the <a class="link" href="https://workos.com/blog/the-developers-guide-to-fine-grained-authorization?utm_source=quastor&utm_medium=newsletter&utm_campaign=q42024" target="_blank" rel="noopener noreferrer nofollow">full guide</a> to learn more.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://workos.com/blog/the-developers-guide-to-fine-grained-authorization?utm_source=quastor&utm_medium=newsletter&utm_campaign=q42024"><span class="button__text" style=""> Read the full Developer’s Guide to Fine-Grained Authorization </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="the-architecture-of-stripes-documen"><b>The Architecture of Stripe’s Document Database</b></h1><p class="paragraph" style="text-align:left;">Stripe is one of the largest payment processors in the world. In 2023, they processed over $1 trillion USD of payment volume, and they did this with 99.999% uptime.</p><p class="paragraph" style="text-align:left;">A crucial system that helped the company achieve this is DocDB, Stripe&#39;s internal Database as a Service.</p><p class="paragraph" style="text-align:left;">Developers at Stripe can use the API for reading/writing data (<a class="link" href="https://en.wikipedia.org/wiki/Online_transaction_processing?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-stripe-s-document-database" target="_blank" rel="noopener noreferrer nofollow">OLTP</a> reads/writes) and not have to worry about scaling compute, increasing storage, schema changes, etc. 
They can just focus on the product they&#39;re building.</p><p class="paragraph" style="text-align:left;">Stripe&#39;s Database Infrastructure team published a fantastic <a class="link" href="https://stripe.com/blog/how-stripes-document-databases-supported-99.999-uptime-with-zero-downtime-data-migrations?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-stripe-s-document-database" target="_blank" rel="noopener noreferrer nofollow">blog post</a> delving into the internals of DocDB and how it works.</p><p class="paragraph" style="text-align:left;">When Stripe was founded in 2011, the company adopted MongoDB as their online database. They found it to be easier to use than a traditional relational database.</p><p class="paragraph" style="text-align:left;">As the company grew to hundreds of terabytes of data, Stripe built DocDB on top of MongoDB to make scaling easier.</p><p class="paragraph" style="text-align:left;">DocDB handles dynamic rebalancing between shards, gives fine-grained control over data distribution, ensures data consistency during migrations and more.</p><p class="paragraph" style="text-align:left;">In this article, we&#39;ll first give a brief overview of MongoDB and then talk about how Stripe designed DocDB.</p><h2 class="heading" style="text-align:left;" id="mongo-db-overview">MongoDB Overview</h2><p class="paragraph" style="text-align:left;">MongoDB is a document-oriented database. 
It stores your data in semi-structured documents using BSON (<i>a binary format that extends JSON</i>).</p><p class="paragraph" style="text-align:left;">It was first developed in 2007 and released as an open-source database in 2009 (<i>note - in 2018, they changed their license to be more restrictive</i>).</p><p class="paragraph" style="text-align:left;">The database was created by the founders of DoubleClick, the startup that would later get acquired by Google and become Google Ads.</p><p class="paragraph" style="text-align:left;">The founders faced scalability and usability issues with traditional relational databases so they built MongoDB with certain design goals:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Developer-Friendly</b> - Relational databases store data in structured tables with relationships defined by primary and foreign keys. This doesn&#39;t naturally map to the data structures you use in an object-oriented programming language (<i>this issue is called </i><i><a class="link" href="https://en.wikipedia.org/wiki/Object%E2%80%93relational_impedance_mismatch?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-stripe-s-document-database" target="_blank" rel="noopener noreferrer nofollow">Object-Relational Impedance Mismatch</a></i>). MongoDB stores data as flexible, JSON-like documents so it&#39;s much more natural to map to objects in your code.</p></li><li><p class="paragraph" style="text-align:left;"><b>Scalability</b> - Horizontal scalability is built into the core of MongoDB and it supports range, hash and zone-based sharding. Document-oriented databases encourage <i>denormalization</i> (where all related data is embedded into a single document) to minimize joins. 
Avoiding cross-shard joins is crucial for scalability.</p></li><li><p class="paragraph" style="text-align:left;"><b>Schema Flexibility</b> - With a relational database, you need to define a schema and ensure that any data you insert follows the schema. Changing the schema means doing a database migration. On the other hand, MongoDB is <i>schemaless</i>. Each document can have different fields and the data types of those fields can vary from document to document.</p></li></ul><h2 class="heading" style="text-align:left;" id="why-stripe-built-doc-db">Why Stripe built DocDB</h2><p class="paragraph" style="text-align:left;">Stripe originally started with MongoDB. As they grew (<i>at an insanely fast rate</i>), the engineering team desired additional features.</p><p class="paragraph" style="text-align:left;">In order to utilize their database infrastructure most efficiently, they needed to transfer data between different shards in their fleet. Stripe has thousands of shards, so managing this can be very complex.</p><p class="paragraph" style="text-align:left;">They wanted a solution that gave them complete operational control, so they could move individual data chunks between shards. Additionally, this had to be done with minimal downtime and strong data consistency (Stripe is dealing with financial data). 
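</p><p class="paragraph" style="text-align:left;">To make chunk-level granularity concrete, here&#39;s a minimal Python sketch of per-chunk routing with a copy-then-flip move. The names (<i>ChunkRouter, move_chunk, etc.</i>) are hypothetical illustrations rather than Stripe&#39;s actual API, and a real migration replicates trailing writes and verifies snapshots before the route flip.</p>

```python
# Hypothetical sketch: the route map plays the role of a chunk metadata
# service, mapping each data chunk to the shard that currently owns it.
class ChunkRouter:
    def __init__(self, shard_ids):
        self.route = {}                           # chunk_id -> shard_id
        self.shards = {s: {} for s in shard_ids}  # shard_id -> {chunk_id: data}

    def write(self, chunk_id, data):
        self.shards[self.route[chunk_id]][chunk_id] = data

    def read(self, chunk_id):
        return self.shards[self.route[chunk_id]][chunk_id]

    def move_chunk(self, chunk_id, target_shard):
        source = self.route[chunk_id]
        # 1. Bulk import: copy the chunk's data onto the target shard.
        self.shards[target_shard][chunk_id] = self.shards[source][chunk_id]
        # 2. Flip the route so new reads/writes hit the target shard.
        self.route[chunk_id] = target_shard
        # 3. Finalize: drop the source copy.
        del self.shards[source][chunk_id]

router = ChunkRouter(["shard-1", "shard-7"])
router.route["chunk-42"] = "shard-1"
router.write("chunk-42", {"amount": 100})
router.move_chunk("chunk-42", "shard-7")  # one chunk moves; no shard-wide lock
print(router.read("chunk-42"))            # the data survives the move
```

<p class="paragraph" style="text-align:left;">Because every chunk routes independently, any number of chunk moves can be in flight at once, which is exactly the kind of operational control described above.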
</p><p class="paragraph" style="text-align:left;">To solve this, they built DocDB on top of MongoDB.</p><h2 class="heading" style="text-align:left;" id="architecture-of-doc-db">Architecture of DocDB</h2><p class="paragraph" style="text-align:left;">As we mentioned earlier, DocDB is a Database as a Service that Stripe engineers can use through an API.</p><div class="image"><img alt="" class="image__image" style="" src="https://images.ctfassets.net/fzn2n1nzq965/4Sl5p8FugS8KS2ZNmGjdZh/752969d3b2cf4ff37e27b22a68633c4d/Databases_Blog_Chart__900px_wide_5__2x.png?w=1080&q=80&fm=webp"/></div><p class="paragraph" style="text-align:left;">Developers can send a read/write request to DocDB and it&#39;ll first go to a Database Proxy server.</p><p class="paragraph" style="text-align:left;">The proxy server will first check for things like access controls, potential bugs, scalability, etc.</p><p class="paragraph" style="text-align:left;">Then, it&#39;ll figure out which specific data chunks are being read/modified and it&#39;ll talk to a central Chunk Metadata Service to get the locations of the specific database shards.</p><p class="paragraph" style="text-align:left;">Finally, the proxy server will send the read/write requests to the specified database shards. Each of the database shards has multiple replicas so they use a CDC (<i><a class="link" href="https://en.wikipedia.org/wiki/Change_data_capture?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-stripe-s-document-database" target="_blank" rel="noopener noreferrer nofollow">Change Data Capture</a></i>) service to replicate changes between the replicas.</p><h2 class="heading" style="text-align:left;" id="data-movement-platform">Data Movement Platform</h2><p class="paragraph" style="text-align:left;">When you&#39;re sharding horizontally, it&#39;s important to remember that your data won&#39;t be static. 
You&#39;ll need to move data between shards for expanding capacity, hardware upgrades, rebalancing hot/cold shards and more.</p><p class="paragraph" style="text-align:left;">In Stripe&#39;s case, this is especially complicated because they&#39;re handling financial data.</p><p class="paragraph" style="text-align:left;">Some of their requirements were:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Data Consistency</b> - data being migrated needs to be consistent between the source and target shards.</p></li><li><p class="paragraph" style="text-align:left;"><b>Zero Downtime</b> - any prolonged downtime is unacceptable as businesses need to process payments 24/7. Downtime should be under a few seconds so the product application can just retry the read/write request and there is minimal impact on the customer.</p></li><li><p class="paragraph" style="text-align:left;"><b>Granularity</b> - they should be able to migrate an arbitrary number of data chunks between shards without any restrictions on the number of in-flight transfers or the number of migrations a given shard can perform at once.</p></li></ul><p class="paragraph" style="text-align:left;">Here are the steps they follow for zero-downtime migrations across database shards:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Confirm Migration and build Indexes</b> - the system first registers the start of the migration in the Chunk Metadata Service. 
It also builds <a class="link" href="https://en.wikipedia.org/wiki/Database_index?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-stripe-s-document-database" target="_blank" rel="noopener noreferrer nofollow">indexes</a> on the target shards for the data chunks that are being migrated.</p></li><li><p class="paragraph" style="text-align:left;"><b>Bulk Data Import</b> - Take a snapshot of the data chunk on the original shard and copy it onto one or more target database shards.</p></li><li><p class="paragraph" style="text-align:left;"><b>Asynchronous Replication</b> - After you copy the data chunk over to the target shard, the original shard will still be getting writes. In this step, you asynchronously replicate any writes that happen on the original shard over to the target shards.</p></li><li><p class="paragraph" style="text-align:left;"><b>Correctness Check</b> - Take point-in-time snapshots of the source and target shards and compare them to ensure data completeness and correctness.</p></li><li><p class="paragraph" style="text-align:left;"><b>Traffic Switch</b> - Once the data is imported to the target shard and mutations are being properly replicated, DocDB&#39;s coordinator will switch traffic over to the target shard. Stripe does this by first blocking any new writes on the source shard. Then, they wait for the replication service to replicate any outstanding writes to the target shard. Finally, they update the route for the data chunk to point to the target shard in the Chunk Metadata Service.<br> This step takes a few seconds, so any database requests that get blocked during this process can just retry after a small timeout and get served by the new shards.</p></li><li><p class="paragraph" style="text-align:left;"><b>Finalize Migration</b> - The last step is to mark the migration as complete in the chunk metadata service. 
They can also delete the chunk data off the source shard.</p></li></ol><h2 class="heading" style="text-align:left;" id="results">Results</h2><p class="paragraph" style="text-align:left;">DocDB&#39;s ability to migrate data between shards in a consistent, granular and reliable way has made it significantly easier for Stripe to scale.</p><p class="paragraph" style="text-align:left;">In 2023, they migrated petabytes of data between shards and it helped them achieve much better utilization of their database infrastructure.</p><hr class="content_break"><div class="image"><a class="image__link" href="https://magic.beehiiv.com/v1/dfc7e9db-4293-4204-892b-c43aaf834fb0?email={{email}}&utm_source=quastor&utm_medium=email&utm_campaign=crosspromo" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/f7e3c2e0-7bbe-49fe-bc66-9c8ce1471d3b/logo__1_.jpeg?t=1734123762"/></a></div><h1 class="heading" style="text-align:left;" id="get-smarter-about-software-and-ai-i"><a class="link" href="https://magic.beehiiv.com/v1/dfc7e9db-4293-4204-892b-c43aaf834fb0?email={{email}}&utm_source=quastor&utm_medium=email&utm_campaign=crosspromo" target="_blank" rel="noopener noreferrer nofollow"><b>Get smarter about Software and AI in 5 minutes</b></a></h1><p id="save-50-hoursweek-with-deep-dives-t" class="paragraph" style="text-align:left;">Save 50+ hours/week with deep dives, trends, and tools hand-picked from 100+ sources.</p><p id="its-read-by-engineers-at-meta-googl" class="paragraph" style="text-align:left;"><a class="link" href="https://magic.beehiiv.com/v1/dfc7e9db-4293-4204-892b-c43aaf834fb0?email={{email}}&utm_source=quastor&utm_medium=email&utm_campaign=crosspromo" target="_blank" rel="noopener noreferrer nofollow">It’s read</a> by engineers at Meta, Google, Uber, Amazon, and big startups.</p><div class="button" style="text-align:center;"><a 
target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://magic.beehiiv.com/v1/dfc7e9db-4293-4204-892b-c43aaf834fb0?email={{email}}&utm_source=quastor&utm_medium=email&utm_campaign=crosspromo"><span class="button__text" style=""> Join 40,000+ engineers for 1 free email every Monday </span></a></div><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://www.developing.dev/p/new-grad-to-staff-at-meta-in-3-years?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-stripe-s-document-database" target="_blank"><div class="embed__content"><p class="embed__title"> How Evan King went from New Grad to Staff Engineer in 3 Years at Meta </p><p class="embed__description"> Evan King was able to triple his compensation at Meta in 3 years by growing from a new grad to a staff engineer.<br><br>He wrote a great blog post in developing.dev where he broke down six key principles that accelerated his growth.<br><br>Some of the principles include<br>- Working fast and using the extra time to focus on higher-level problems<br>- How to build relationships<br>- Questioning assumptions to find simpler solutions<br><br>Read the full blog post for more. 
</p><p class="embed__link"> www.developing.dev/p/new-grad-to-staff-at-meta-in-3-years </p></div></a></div><div class="embed"><a class="embed__url" href="https://newsletter.systemdesign.one/p/how-does-nginx-work?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-stripe-s-document-database" target="_blank"><img class="embed__image embed__image--top" src="https://substackcdn.com/image/fetch/w_1200,h_600,c_fill,f_jpg,q_auto:good,fl_progressive:steep,g_auto/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd034c6d7-c09a-45f6-ba38-12297d63c11c_1280x720.gif"/><div class="embed__content"><p class="embed__title"> How Nginx Was Able to Support 1 Million Concurrent Connections on a Single Server </p><p class="embed__description"> Neo Kim wrote a fantastic article breaking down how Nginx achieves its impressive scalability through parallelism and concurrency. The web server uses a master-worker model where each worker runs as a single-threaded process with an event loop, allowing it to handle multiple client requests efficiently.<br><br>To prevent blocking operations from impacting performance, workers delegate CPU-intensive tasks to a shared thread pool. Nginx also improves scalability through shared memory that lets workers share cached data, session information, and rate limiting data. </p><p class="embed__link"> newsletter.systemdesign.one/p/how-does-nginx-work </p></div></a></div><div class="embed"><a class="embed__url" href="https://www.maxcountryman.com/articles/let-it-fail?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-stripe-s-document-database" target="_blank"><div class="embed__content"><p class="embed__title"> Let It Fail </p><p class="embed__description"> This is an interesting article on how allowing controlled failures can be extremely effective for engineering organizations.<br><br>When you can’t decide between tech debt vs. 
business priorities, letting small failures happen can create organic “back pressure” that leads to better alignment between engineering and product teams.<br><br>The controlled failures helped product teams understand the importance of reducing tech debt and make the organization more resilient. </p><p class="embed__link"> www.maxcountryman.com/articles/let-it-fail </p></div></a></div><div class="embed"><a class="embed__url" href="https://alexturek.com/2022-03-18-How-to-criticize-coworkers/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=the-architecture-of-stripe-s-document-database" target="_blank"><div class="embed__content"><p class="embed__title"> How To Criticize Coworkers </p><p class="embed__description"> Need a way to give constructive feedback to your colleagues? This article outlines key principles on how to do that. Tips include<br><br>- give specific, behavior-focused feedback with concrete examples rather than making general statements about someone&#39;s character<br>- use &quot;I&quot; language to describe impact (&quot;I felt...&quot;) instead of making assumptions about others&#39; intentions<br>- look for at least two specific examples of a behavior pattern before bringing it up<br><br>Read the full article for the rest. </p><p class="embed__link"> alexturek.com/2022-03-18-How-to-criticize-coworkers </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=c741005d-7eab-4e6d-8d75-b95c6551fffe&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How Airbnb Processes a Million User Events Every Second</title>
  <description>An introduction to Apache Flink, the Lambda Architecture and the Architecture of Airbnb&#39;s platform. Plus, how Duolingo cut their AWS bill by 20%, Google&#39;s State of the Art Quantum chip and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/90e65364-67cf-49e6-b7cc-c7f97f7d7ca5/Screenshot_2024-12-09_at_11.25.15_PM.png" length="167414" type="image/png"/>
  <link>https://blog.quastor.org/p/how-airbnb-processes-a-million-user-events-every-second-f5f2</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-airbnb-processes-a-million-user-events-every-second-f5f2</guid>
  <pubDate>Wed, 11 Dec 2024 00:30:00 +0000</pubDate>
  <atom:published>2024-12-11T00:30:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How Airbnb Processes a Million User Events Every Second</b></p><ul><li><p class="paragraph" style="text-align:left;">How Airbnb built a User Events Platform to track, process and store billions of user interactions </p></li><li><p class="paragraph" style="text-align:left;">Introduction to the Lambda Architecture</p></li><li><p class="paragraph" style="text-align:left;">Overview of Apache Flink</p></li><li><p class="paragraph" style="text-align:left;">The Architecture of Airbnb’s User Events Platform</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Breaking down the Browser Rendering Process</p></li><li><p class="paragraph" style="text-align:left;">How to Maintain Code Quality in the age of AI</p></li><li><p class="paragraph" style="text-align:left;">How Duolingo cut their Cloud Spend by 20%</p></li><li><p class="paragraph" style="text-align:left;">Explaining the Modular Monolith Architecture</p></li><li><p class="paragraph" style="text-align:left;">Google’s State of the Art Quantum Chip</p></li></ul></li></ul><div class="custom_html"><iframe src="https://embeds.beehiiv.com/c8ef45bf-1b8e-469c-baf2-b51f4701e532" data-test-id="beehiiv-embed" width="100%" height="320" frameborder="0" style="border-radius: 4px; border: 2px solid #e5e7eb; margin: 0; background-color: transparent;"></iframe></div><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>The Architecture of Airbnb’s User Signals Platform</b></h1><p class="paragraph" style="text-align:left;">Airbnb is one of the largest travel platforms in the world with over 200 million active users. 
When travelers browse through the app, there are <i>millions</i> of properties/destinations that Airbnb can recommend. Small improvements in their recommendation system can result in a huge increase in bookings (<i>and hundreds of millions of dollars in revenue</i>).</p><p class="paragraph" style="text-align:left;">To provide the best recommendations, Airbnb needs to keep track of past user actions like viewing listings, favoriting experiences, starting a booking process, etc. This data needs to be processed, cleaned and stored in a database.</p><p class="paragraph" style="text-align:left;">The Airbnb team built the User Signals Platform to handle this. It ingests and processes over 1 million user events per second and stores them in a key-value database. The platform serves 70k+ queries per second to other internal teams at Airbnb that need access to this data.</p><p class="paragraph" style="text-align:left;">Last week, the Airbnb engineering team published a <a class="link" href="https://medium.com/airbnb-engineering/building-a-user-signals-platform-at-airbnb-b236078ec82b?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">terrific blog post</a> delving into how they built this platform and the design choices they made.</p><h2 class="heading" style="text-align:left;" id="user-signal-platform-goals"><b>User Signal Platform Goals</b></h2><p class="paragraph" style="text-align:left;">The Airbnb team had quite a few objectives for the User Signals Platform. 
Some of the goals were: </p><ul><li><p class="paragraph" style="text-align:left;"><b>Ingest Real-time and Historical User Data</b> - The platform should store real-time user engagement data as it occurs, but it should also allow for batch jobs that write historical user engagement data.</p></li><li><p class="paragraph" style="text-align:left;"><b>Low Latency</b> - Other services at Airbnb will be relying on the User Signals Platform for real-time user engagement data, so the platform should ingest and process new user events in under 1 second.</p></li><li><p class="paragraph" style="text-align:left;"><b>Asynchronous Computation</b> - Engineers at Airbnb should be able to run asynchronous computation jobs on the data in the User Signals platform to generate deeper insights.</p></li></ul><p class="paragraph" style="text-align:left;">In this article, we’ll talk about the architecture of the User Signals Platform and also delve into the design patterns and technologies Airbnb used.</p><h2 class="heading" style="text-align:left;" id="introduction-to-lambda-architecture"><b>Introduction to Lambda Architecture</b></h2><p class="paragraph" style="text-align:left;">The core design pattern Airbnb used for their platform is the <a class="link" href="https://en.wikipedia.org/wiki/Lambda_architecture?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">Lambda Architecture</a>. The Lambda architecture is composed of two layers:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Speed/Streaming Layer (Real-Time)</b>: Processes streaming data as it arrives, delivering low-latency, up-to-date results. 
Airbnb implements this with Apache Flink and achieves latencies less than one second.</p></li><li><p class="paragraph" style="text-align:left;"><b>Batch Layer (Offline)</b>: Periodically processes large volumes of historical data to generate more accurate or corrected views. The batch layer ensures long-term accuracy and handles late-arriving data or retrospective fixes. The batch layer will typically operate on a longer timescale, updating views every few hours.</p></li></ul><div class="image"><a class="image__link" href="#user-signal-platform-goals" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/90e65364-67cf-49e6-b7cc-c7f97f7d7ca5/Screenshot_2024-12-09_at_11.25.15_PM.png?t=1733804719"/></a><div class="image__source"><a class="image__source_link" href="https://www.geeksforgeeks.org/what-is-lambda-architecture-system-design/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" rel="noopener" target="_blank"><span class="image__source_text"><p>Credits to GeeksforGeeks</p></span></a></div></div><p class="paragraph" style="text-align:left;">By combining these two layers, the Lambda architecture provides the best of both worlds. The speed layer ensures fresh, low-latency data for online queries and personalization. 
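To make the “best of both worlds” idea concrete, here’s a minimal Python sketch of a Lambda-style read path (illustrative only, with made-up per-user event counts, and not Airbnb’s actual code): the batch view holds the periodically recomputed totals, and the speed view holds only what has arrived since the last batch run.

```python
# Toy Lambda-architecture read path (illustrative only; not Airbnb's code).
# batch_view: recomputed every few hours, authoritative for older data.
# speed_view: per-event updates for everything since the last batch run.

batch_view = {"user_1": 40, "user_2": 7}   # hypothetical per-user event counts
speed_view = {"user_1": 2, "user_3": 1}

def query(user_id):
    """Merge both layers: batch gives long-term accuracy, speed gives freshness."""
    return batch_view.get(user_id, 0) + speed_view.get(user_id, 0)

merged = {u: query(u) for u in ("user_1", "user_2", "user_3")}
# merged combines 40+2, 7+0 and 0+1 across the two layers
```

When the next batch run lands, its corrected totals replace the batch view and the corresponding speed-view entries are dropped.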
The batch layer ensures correctness, allowing retrospective updates and improvements to data quality.</p><h2 class="heading" style="text-align:left;" id="introduction-to-apache-flink"><b>Introduction to Apache Flink</b></h2><p class="paragraph" style="text-align:left;">The core technology Airbnb used for their User Signals Platform is Apache Flink, an open source engine built for processing real-time data with very low latency.</p><p class="paragraph" style="text-align:left;">Prior to Flink, data processing systems would rely on “micro-batching” to process data in “real-time”. They would collect data over a small fixed period (<i>every few seconds/minutes</i>) and then process that data as a batch job.</p><p class="paragraph" style="text-align:left;">On the other hand, Flink takes an event-driven approach. Instead of waiting for a batch window to fill up, Flink processes each event as soon as it arrives. This results in <i>much</i> lower latencies.</p><p class="paragraph" style="text-align:left;">Another benefit of Flink is that it is <i>stateful</i>. Traditional data processing systems might require an external database to maintain state across events. Flink integrates state management directly into the engine, allowing it to remember, accumulate, and update contextual information as events stream in. 
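As a rough analogy (plain Python, not the actual Flink API), a stateful operator that keeps a per-user running count inside the engine might look like this:

```python
from collections import defaultdict

# Rough analogy to Flink's keyed state (not the real Flink API): the operator
# keeps a running count per user inside the engine, no external database needed.
class KeyedCounter:
    def __init__(self):
        self.state = defaultdict(int)  # per-key state held by the operator

    def process(self, event):
        """Handle one event as it arrives; emit (key, updated aggregate)."""
        key = event["user_id"]
        self.state[key] += 1
        return key, self.state[key]

counter = KeyedCounter()
stream = [
    {"user_id": "u1", "type": "view_listing"},
    {"user_id": "u2", "type": "wishlist_add"},
    {"user_id": "u1", "type": "start_booking"},
]
outputs = [counter.process(e) for e in stream]
```

Each event is handled the moment it arrives, and the emitted aggregate is always up to date, which is exactly what the micro-batching approach gives up.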
This is great if you want to do operations like aggregations or joins across your messages.</p><p class="paragraph" style="text-align:left;">Other benefits of Apache Flink are:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Fault Tolerance</b> - Flink provides <a class="link" href="https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/checkpointing/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">checkpointing</a> mechanisms to ensure that, if a job or node fails, the system can recover to a previously consistent state. This guarantees exactly-once processing, so each event is reflected in the application’s state exactly once, even in the face of failures.</p></li><li><p class="paragraph" style="text-align:left;"><b>Understanding Event-Time</b> - Flink understands the concept of <a class="link" href="https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/time/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">“event time“ </a>(the time when an event actually occurred) instead of just processing time (when the event is processed by the system). 
This makes it much easier to handle out-of-order events or late-arriving data accurately.</p></li><li><p class="paragraph" style="text-align:left;"><b>Integration with the Ecosystem </b>- Flink is widely used and comes with connectors to all the other data tools you might be using (Kafka, Postgres, S3, etc.)</p></li></ul><h2 class="heading" style="text-align:left;" id="architecture-of-airbnbs-user-signal"><b>Architecture of Airbnb’s User Signals Platform</b></h2><p class="paragraph" style="text-align:left;">Here’s the architecture of Airbnb’s User Signals Platform</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXerK9079APCW4uSZNoqgU2T9O2NGz-TKAcplja51OFixaUAh85hU0WpGwY8gyNmsIr4jSiyzetf47Q0bNoVexrRZM-bcdop24hKfDxsZeE6CQEoq_jllUmM-NJnhhgxnKpJ1EKl4w?key=RKoPUD1asUNoVOQdiKbhl91e"/></div><p class="paragraph" style="text-align:left;">As mentioned earlier, it’s based on the Lambda Architecture, so it consists of a real-time ingestion layer and a batch layer.</p><p class="paragraph" style="text-align:left;">Here are the steps:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>User Events</b>: Guests interacting with Airbnb’s apps generate raw events when they view properties, add an experience to their wishlist, search for “rooms in London”, etc.</p></li><li><p class="paragraph" style="text-align:left;"><b>Real-Time Transformation (Speed Layer)</b>: Events flow into Kafka, where Flink jobs consume and transform them into “User Signals.”  Some transformations are just simple mappings from raw events, while others may require joining multiple events based on user ID to create richer signals.</p></li><li><p class="paragraph" style="text-align:left;"><b>KV Storage and Serving</b>: The transformed User Signals are stored in a Key-Value store with append-only writes. 
Using append-only writes helps ensure idempotency and makes data operations much simpler.</p></li><li><p class="paragraph" style="text-align:left;"><b>Batch Processing (Batch Layer)</b>: Periodic batch jobs will reprocess the historical data sets and identify any discrepancies or missed events from the speed layer. They’ll backfill the missing/incorrect data to ensure long-term data accuracy and consistency.</p></li><li><p class="paragraph" style="text-align:left;"><b>Asynchronous Computations</b>: In addition to the immediate user signals that are stored in the KV store, Airbnb has Flink jobs that consume the new user signals to generate more insights. These jobs do things like categorize users into segments or group a single user’s actions into “sessions” to get a better understanding of the user’s intent. These jobs are run asynchronously.</p></li><li><p class="paragraph" style="text-align:left;"><b>Online Queries and Services</b>: The USP service provides a way for downstream services at Airbnb to use the user signals data for their own insights.</p></li></ol><h2 class="heading" style="text-align:left;" id="results"><b>Results</b></h2><p class="paragraph" style="text-align:left;">With this setup, the User Signals Platform processes over 1 million events per second across 100+ Flink jobs. 
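As an aside on step 3 above, here’s a small sketch of why append-only writes make replays safe; the (user_id, event_id) key is a hypothetical schema for illustration, not Airbnb’s actual one:

```python
# Append-only KV writes keyed by (user_id, event_id): re-applying the same
# write is a no-op, so speed-layer retries and batch-layer backfills can
# safely overlap. (Hypothetical schema, for illustration only.)

store = {}

def append_signal(user_id, event_id, signal):
    store.setdefault((user_id, event_id), signal)  # write once, never overwrite

append_signal("u1", "e1", {"type": "view_listing"})
append_signal("u1", "e1", {"type": "view_listing"})   # duplicate delivery: no-op
append_signal("u1", "e2", {"type": "start_booking"})  # batch backfill

num_rows = len(store)  # two distinct signals, despite three writes
```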
The USP service serves over 70k queries per second to various teams/services at Airbnb.</p><div class="custom_html"><iframe src="https://embeds.beehiiv.com/c8ef45bf-1b8e-469c-baf2-b51f4701e532" data-test-id="beehiiv-embed" width="100%" height="320" frameborder="0" style="border-radius: 4px; border: 2px solid #e5e7eb; margin: 0; background-color: transparent;"></iframe></div><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://abhisaha.com/blog/exploring-browser-rendering-process/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank"><img class="embed__image embed__image--top" src="https://abhisaha.com/blog/exploring-browser-rendering-process/og.png"/><div class="embed__content"><p class="embed__title"> Breaking down the Browser Rendering Process </p><p class="embed__description"> Abhishek Saha published a fantastic blog post that talks about exactly what happens between going to “www.google.com“ and seeing the page load on your computer.<br><br>He delves into the DNS lookup, TCP/TLS handshake, Browser Rendering process and much more. The article is filled with interactive graphics to help you understand the process. 
</p><p class="embed__link"> abhisaha.com/blog/exploring-browser-rendering-process </p></div></a></div><div class="embed"><a class="embed__url" href="https://blog.google/technology/research/google-willow-quantum-chip/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank"><img class="embed__image embed__image--top" src="https://storage.googleapis.com/gweb-uniblog-publish-prod/images/04_YouTube_Thumbnail_-_Hero_Shot_1280x720.width-1300.png"/><div class="embed__content"><p class="embed__title"> Willow - Google’s state-of-the-art quantum chip </p><p class="embed__description"> Google’s Quantum AI team just announced Willow, their latest quantum processor.<br><br>Google used the Random Circuit Sampling (RCS) benchmark to measure its performance, and it was able to complete an extremely complex computation in under 5 minutes. That same computation would take the world’s fastest supercomputer 10 septillion years.<br><br>At scale, quantum computers would break many current encryption methods, so there’s a big push for quantum-resistant cryptography. 
</p><p class="embed__link"> blog.google/technology/research/google-willow-quantum-chip </p></div></a></div><div class="embed"><a class="embed__url" href="https://blog.levelupcoding.com/p/luc-66-breaking-down-modular-monolithic-architecture-blending-tradition-with-innovation?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/ac500e78-9993-40e4-8f80-f4fc6bce37a4/Modular_Monoliths_Newsletter_Version.png?t=1730542325"/><div class="embed__content"><p class="embed__title"> Explaining the Modular Monolith Architecture </p><p class="embed__description"> Modular Monoliths are becoming increasingly popular; they strike a balance between the efficiency of traditional Monoliths and the separation of Microservices.<br><br>This is a great article that delves into this pattern, its defining characteristics, and its pros and cons. 
</p><p class="embed__link"> blog.levelupcoding.com/p/luc-66-breaking-down-modular-monolithic-architecture-blending-tradition-with-innovation </p></div></a></div><div class="embed"><a class="embed__url" href="https://blog.duolingo.com/reducing-cloud-spending/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/edd159f3-e1be-4ef2-bc64-f508de4fc7f8/Screenshot_2024-12-10_at_1.46.53_AM.png?t=1733813223"/><div class="embed__content"><p class="embed__title"> How Duolingo cut their Cloud Spend by 20% </p><p class="embed__description"> Duolingo published a fantastic blog post delving into the exact strategies they used to cut their AWS cloud spend.<br><br>Some of the key optimizations included<br>- Extending cache TTLs for rarely-changing resources<br>- Reducing unnecessarily verbose logging in production<br>- Changing databases to more optimal configurations<br><br>and more. </p><p class="embed__link"> blog.duolingo.com/reducing-cloud-spending </p></div></a></div><div class="embed"><a class="embed__url" href="https://refactoring.fm/p/code-quality-in-the-age-of-ai?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank"><img class="embed__image embed__image--top" src="https://substackcdn.com/image/fetch/w_1200,h_600,c_fill,f_jpg,q_auto:good,fl_progressive:steep,g_auto/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47090301-73a3-4cf3-a2aa-531023b9b456_1326x742.png"/><div class="embed__content"><p class="embed__title"> How to Maintain Code Quality in the age of AI </p><p class="embed__description"> While AI can help write code faster, it can also create a tradeoff with control vs. quality. 
In this article, Refactoring.fm provides a six-step “Lifecycle of Quality“ process to help ensure that your codebase doesn’t suffer as you use tools like Claude or GPT-4o. </p><p class="embed__link"> refactoring.fm/p/code-quality-in-the-age-of-ai </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=1c9ea4bb-bab6-4637-8d8b-3d91c8e31f9f&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How Airbnb Processes a Million User Events Every Second</title>
  <description>An introduction to Apache Flink, the Lambda Architecture and the Architecture of Airbnb&#39;s platform. Plus, how Duolingo cut their AWS bill by 20%, Google&#39;s State of the Art Quantum chip and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/dea01670-c170-493f-9db6-fe8f23f8bd40/new_diagram.gif" length="428407" type="image/gif"/>
  <link>https://blog.quastor.org/p/how-airbnb-processes-a-million-user-events-every-second</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-airbnb-processes-a-million-user-events-every-second</guid>
  <pubDate>Tue, 10 Dec 2024 17:09:00 +0000</pubDate>
  <atom:published>2024-12-10T17:09:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How Airbnb Processes a Million User Events Every Second</b></p><ul><li><p class="paragraph" style="text-align:left;">How Airbnb built a User Events Platform to track, process and store billions of user interactions </p></li><li><p class="paragraph" style="text-align:left;">Introduction to the Lambda Architecture</p></li><li><p class="paragraph" style="text-align:left;">Overview of Apache Flink</p></li><li><p class="paragraph" style="text-align:left;">The Architecture of Airbnb’s User Events Platform</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Breaking down the Browser Rendering Process</p></li><li><p class="paragraph" style="text-align:left;">How to Maintain Code Quality in the age of AI</p></li><li><p class="paragraph" style="text-align:left;">How Duolingo cut their Cloud Spend by 20%</p></li><li><p class="paragraph" style="text-align:left;">Explaining the Modular Monolith Architecture</p></li><li><p class="paragraph" style="text-align:left;">Google’s State of the Art Quantum Chip</p></li></ul></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://dub.link/quas-dec9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXc7kqV8dbl_pchXavhHkxWSIbRwQn1WTVvajZnFb9UkSgVqqfmAv52AxCyJdfMKdRDCuFxaPG71xpFxfKgprPps24_g-4Ih0ODcgECLpYuc1DvSkY0dl_7iZpOgqyZT5wLz4TK40g?key=eXEz32zof7Iu-jmoXAT47A"/></a></div><h1 class="heading" style="text-align:left;" id="how-to-pick-technologies-for-your-t"><a class="link" 
href="https://dub.link/quas-dec9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow"><b>How to Pick Technologies for your Tech Stack</b></a></h1><p class="paragraph" style="text-align:left;">One of the hardest decisions you’ll have to make is around <i>what</i> technologies your team adopts. A wrong decision can be extremely costly and take years to reverse. On the other hand, <i>not </i>making a decision can be <i>just</i> as costly (lost revenue, poor developer productivity, etc.)</p><p class="paragraph" style="text-align:left;">Product for Engineers wrote <a class="link" href="https://dub.link/quas-dec9-ct?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">a fantastic blog post</a> on their advice for choosing technologies to adopt.</p><p class="paragraph" style="text-align:left;">Some of their tips include</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Prioritize based on set Criteria</b> - There will always be <i>some</i> shiny new toy that your team can adopt. Instead, prioritize based on problems your team is facing. This can be excessive costs, scaling challenges, or a customer need.</p></li><li><p class="paragraph" style="text-align:left;"><b>Mimic the Real World when Evaluating</b> - The engineers who will be using the technology should have significant sway in the decision. They should be able to test the technology in production (<i>safely</i>) and build proof of concepts before deciding.</p></li><li><p class="paragraph" style="text-align:left;"><b>Ensure you consider technical AND business factors</b> - You should talk to <i>all stakeholders</i> and clarify what the set of evaluation criteria are. 
Some potential criteria include performance, cost, reliability, support, flexibility and more.</p></li></ol><p class="paragraph" style="text-align:left;">Subscribe to <a class="link" href="https://dub.link/quas-dec9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">Product for Engineers</a> for the rest of their tips on picking technologies. It’s free!</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://dub.link/quas-dec9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second"><span class="button__text" style=""> Check out Product for Engineers </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>The Architecture of Airbnb’s User Signals Platform</b></h1><p class="paragraph" style="text-align:left;">Airbnb is one of the largest travel platforms in the world with over 200 million active users. When travelers browse through the app, there are <i>millions</i> of properties/destinations that Airbnb can recommend. Small improvements in their recommendation system can result in a huge increase in bookings (<i>and hundreds of millions of dollars in revenue</i>)</p><p class="paragraph" style="text-align:left;">To provide the best recommendations, Airbnb needs to keep track of past user actions like viewing listings, favoriting experiences, starting a booking process, etc. This data needs to be processed, cleaned and stored in a database.</p><p class="paragraph" style="text-align:left;">The Airbnb team built the User Signals Platform to handle this. It ingests and processes over 1 million user events per second and stores them in a key-value database. 
The platform serves over 70,000 queries per second to other internal teams at Airbnb that need access to this data.</p><p class="paragraph" style="text-align:left;">Last week, the Airbnb engineering team published a <a class="link" href="https://medium.com/airbnb-engineering/building-a-user-signals-platform-at-airbnb-b236078ec82b?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">terrific blog post</a> delving into how they built this platform and the design choices they made.</p><h2 class="heading" style="text-align:left;" id="user-signal-platform-goals"><b>User Signal Platform Goals</b></h2><p class="paragraph" style="text-align:left;">The Airbnb team had quite a few objectives for the User Signals Platform. Some of the goals were: </p><ul><li><p class="paragraph" style="text-align:left;"><b>Ingest Real-time and Historical User Data</b> - The platform should store real-time user engagement data as it occurs, but it should also allow for batch jobs that write historical user engagement data.</p></li><li><p class="paragraph" style="text-align:left;"><b>Low Latency</b> - Other services at Airbnb will be relying on the User Signals Platform for real-time user engagement data, so the platform should ingest and process new user events in under 1 second.</p></li><li><p class="paragraph" style="text-align:left;"><b>Asynchronous Computation</b> - Engineers at Airbnb should be able to run asynchronous computation jobs on the data in the User Signals platform to generate deeper insights.</p></li></ul><p class="paragraph" style="text-align:left;">In this article, we’ll talk about the architecture of the User Signals Platform and also delve into the design patterns and technologies Airbnb used.</p><h2 class="heading" style="text-align:left;" id="introduction-to-lambda-architecture"><b>Introduction to Lambda Architecture</b></h2><p class="paragraph" 
style="text-align:left;">The core design pattern Airbnb used for their platform is the <a class="link" href="https://en.wikipedia.org/wiki/Lambda_architecture?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">Lambda Architecture</a>. The Lambda architecture is composed of two layers:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Speed/Streaming Layer (Real-Time)</b>: Processes streaming data as it arrives, delivering low-latency, up-to-date results. Airbnb implements this with Apache Flink and achieves latencies less than one second.</p></li><li><p class="paragraph" style="text-align:left;"><b>Batch Layer (Offline)</b>: Periodically processes large volumes of historical data to generate more accurate or corrected views. The batch layer ensures long-term accuracy and handles late-arriving data or retrospective fixes. The batch layer will typically operate on a longer timescale, updating views every few hours.</p></li></ul><div class="image"><a class="image__link" href="#user-signal-platform-goals" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/90e65364-67cf-49e6-b7cc-c7f97f7d7ca5/Screenshot_2024-12-09_at_11.25.15_PM.png?t=1733804719"/></a><div class="image__source"><a class="image__source_link" href="https://www.geeksforgeeks.org/what-is-lambda-architecture-system-design/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" rel="noopener" target="_blank"><span class="image__source_text"><p>Credits to GeeksforGeeks</p></span></a></div></div><p class="paragraph" style="text-align:left;">By combining these two layers, the Lambda architecture provides the best of both worlds. 
The speed layer ensures fresh, low-latency data for online queries and personalization. The batch layer ensures correctness, allowing retrospective updates and improvements to data quality.</p><h2 class="heading" style="text-align:left;" id="introduction-to-apache-flink"><b>Introduction to Apache Flink</b></h2><p class="paragraph" style="text-align:left;">The core technology Airbnb used for their User Signals Platform is Apache Flink, an open source engine built for processing real-time data with very low latency.</p><p class="paragraph" style="text-align:left;">Prior to Flink, data processing systems would rely on “micro-batching” to process data in “real-time”. They would collect data over a small fixed period (<i>every few seconds/minutes</i>) and then process that data as a batch job.</p><p class="paragraph" style="text-align:left;">On the other hand, Flink takes an event-driven approach. Instead of waiting for a batch window to fill up, Flink processes each event as soon as it arrives. This results in <i>much</i> lower latencies.</p><p class="paragraph" style="text-align:left;">Another benefit of Flink is that it is <i>stateful</i>. Traditional data processing systems might require an external database to maintain state across events. Flink integrates state management directly into the engine, allowing it to remember, accumulate, and update contextual information as events stream in. 
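For example, sessionization, a naturally stateful operation that comes up later in Airbnb’s pipeline, can be sketched in a few lines of plain Python (the 30-minute inactivity gap is an assumption for illustration, not Airbnb’s published logic):

```python
# Simplified sessionization: consecutive events from one user belong to the
# same session unless more than GAP seconds of inactivity separate them.
# (The 30-minute gap is an assumed value, not Airbnb's actual rule.)

GAP = 30 * 60  # seconds of inactivity that closes a session

def sessionize(timestamps):
    sessions, current = [], []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > GAP:
            sessions.append(current)  # gap exceeded: close the session
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

# Three events close together, then one two hours later -> two sessions.
sessions = sessionize([0, 60, 120, 7320])
```

A stateful engine like Flink can keep the `current` session in operator state and update it per event, instead of re-reading history from an external database.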
This is great if you want to do operations like aggregations or joins across your messages.</p><p class="paragraph" style="text-align:left;">Other benefits of Apache Flink are:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Fault Tolerance</b> - Flink provides <a class="link" href="https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/checkpointing/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">checkpointing</a> mechanisms to ensure that, if a job or node fails, the system can recover to a previously consistent state. This guarantees exactly-once processing, so each event is reflected in the application’s state exactly once, even in the face of failures.</p></li><li><p class="paragraph" style="text-align:left;"><b>Understanding Event-Time</b> - Flink understands the concept of <a class="link" href="https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/time/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">“event time“ </a>(the time when an event actually occurred) instead of just processing time (when the event is processed by the system). 
This makes it much easier to handle out-of-order events or late-arriving data accurately.</p></li><li><p class="paragraph" style="text-align:left;"><b>Integration with the Ecosystem </b>- Flink is widely used and comes with connectors to all the other data tools you might be using (Kafka, Postgres, S3, etc.)</p></li></ul><h2 class="heading" style="text-align:left;" id="architecture-of-airbnbs-user-signal"><b>Architecture of Airbnb’s User Signals Platform</b></h2><p class="paragraph" style="text-align:left;">Here’s the architecture of Airbnb’s User Signals Platform</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXerK9079APCW4uSZNoqgU2T9O2NGz-TKAcplja51OFixaUAh85hU0WpGwY8gyNmsIr4jSiyzetf47Q0bNoVexrRZM-bcdop24hKfDxsZeE6CQEoq_jllUmM-NJnhhgxnKpJ1EKl4w?key=RKoPUD1asUNoVOQdiKbhl91e"/></div><p class="paragraph" style="text-align:left;">As mentioned earlier, it’s based on the Lambda Architecture, so it consists of a real-time ingestion layer and a batch layer.</p><p class="paragraph" style="text-align:left;">Here are the steps:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>User Events</b>: Guests interacting with Airbnb’s apps generate raw events when they view properties, add an experience to their wishlist, search for “rooms in London”, etc.</p></li><li><p class="paragraph" style="text-align:left;"><b>Real-Time Transformation (Speed Layer)</b>: Events flow into Kafka, where Flink jobs consume and transform them into “User Signals.”  Some transformations are just simple mappings from raw events, while others may require joining multiple events based on user ID to create richer signals.</p></li><li><p class="paragraph" style="text-align:left;"><b>KV Storage and Serving</b>: The transformed User Signals are stored in a Key-Value store with append-only writes. 
Using append-only writes helps ensure idempotency and makes data operations much simpler.</p></li><li><p class="paragraph" style="text-align:left;"><b>Batch Processing (Batch Layer)</b>: Periodic batch jobs will reprocess the historical data sets and identify any discrepancies or missed events from the speed layer. They’ll backfill the missing/incorrect data to ensure long-term data accuracy and consistency.</p></li><li><p class="paragraph" style="text-align:left;"><b>Asynchronous Computations</b>: In addition to the immediate user signals that are stored in the KV store, Airbnb has Flink jobs that consume the new user signals to generate more insights. These jobs do things like categorize users into segments or group a single user’s actions into “sessions” to get a better understanding of the user’s intent. These jobs are run asynchronously.</p></li><li><p class="paragraph" style="text-align:left;"><b>Online Queries and Services</b>: The USP service provides a way for downstream services at Airbnb to use the user signals data for their own insights.</p></li></ol><h2 class="heading" style="text-align:left;" id="results"><b>Results</b></h2><p class="paragraph" style="text-align:left;">With this setup, the User Signals Platform processes over 1 million events per second across 100+ Flink jobs. 
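The event-time handling described earlier is what lets these backfills land in the right place; here’s a toy Python sketch of bucketing by when events happened rather than when they arrived (illustrative only, not Flink’s windowing API):

```python
from collections import defaultdict

# Toy event-time windowing: events are bucketed by the time they occurred,
# so a late-arriving event still lands in the correct window.
# (Illustrative sketch; not Flink's actual windowing API.)

WINDOW = 60  # one-minute tumbling windows, in seconds

def window_counts(events):
    counts = defaultdict(int)
    for e in events:  # arrival order is irrelevant; only event_time matters
        counts[e["event_time"] // WINDOW] += 1
    return dict(counts)

arrived = [
    {"event_time": 5},
    {"event_time": 70},
    {"event_time": 50},  # late arrival, processed after the 70s event
]
counts = window_counts(arrived)  # the late event is still counted in window 0
```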
The USP service serves over 70k queries per second to various teams/services at Airbnb.</p><hr class="content_break"><div class="image"><a class="image__link" href="https://dub.link/quas-dec9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeQ2aw1ltuDFOVnzV3j1WqRvXiEgBMeVuJ-uDFYT7-5iRyeuqyNFUc1QXTdah-GMjzf9H-Nssn66KuZr7lW9VRL3fXWEqENj9gqqJb2i9l9Z7VrS7j3V-9kvSYtmTZXUbHjOLVfoB7O1obrN3d0IM7ofRL4?key=eXEz32zof7Iu-jmoXAT47A"/></a></div><h1 class="heading" style="text-align:left;" id="6-common-traits-of-the-most-impactf"><a class="link" href="https://dub.link/quas-dec9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">6 Common Traits of the Most Impactful Developers</a></h1><p class="paragraph" style="text-align:left;">You’ll often hear about the mythical “10x engineer” - the go-to person on the team whenever you need a feature shipped fast. However, 10x engineers aren’t just super-technical, they also have a great sense of <i>what </i>to build.</p><p class="paragraph" style="text-align:left;">If you’re working on the wrong feature, then it doesn’t matter how fast you work. 
The company won’t see a big impact from your work.</p><p class="paragraph" style="text-align:left;">Product for Engineers wrote <a class="link" href="https://dub.link/quas-dec9-10x?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">a great article</a> delving into what makes engineers impactful and identifying six common traits that the most effective ones share.</p><p class="paragraph" style="text-align:left;">Here are a couple of the traits:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Always Prototyping and Experimenting</b> - they ship MVPs early and often, iterate quickly based on feedback, and aren’t afraid to pivot or kill features that aren’t working.</p></li><li><p class="paragraph" style="text-align:left;"><b>Are Comfortable Writing</b> - Clear writing skills are a must for documenting features, providing PR feedback, and making big technical decisions with RFCs.</p></li><li><p class="paragraph" style="text-align:left;"><b>Understand the Broader Context</b> - they understand the organization’s goals and align their decisions/work with the company’s strategy.</p></li></ol><p class="paragraph" style="text-align:left;">For the rest of the traits, check out the <a class="link" href="https://dub.link/quas-dec9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank" rel="noopener noreferrer nofollow">Product for Engineers newsletter</a>.</p><p class="paragraph" style="text-align:left;">They send out fantastic articles every month to help you develop the skills you need to deliver the most impact (<i>and get promoted faster</i>).</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" 
href="https://dub.link/quas-dec9?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second"><span class="button__text" style=""> Check out Product for Engineers. It’s free! </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://abhisaha.com/blog/exploring-browser-rendering-process/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank"><img class="embed__image embed__image--top" src="https://abhisaha.com/blog/exploring-browser-rendering-process/og.png"/><div class="embed__content"><p class="embed__title"> Breaking down the Browser Rendering Process </p><p class="embed__description"> Abhishek Saha published a fantastic blog post that talks about exactly what happens between going to “www.google.com“ and seeing the page load on your computer.<br><br>He delves into the DNS lookup, TCP/TLS handshake, Browser Rendering process and much more. The article is filled with interactive graphics to help you understand the process. 
</p><p class="embed__link"> abhisaha.com/blog/exploring-browser-rendering-process </p></div></a></div><div class="embed"><a class="embed__url" href="https://blog.google/technology/research/google-willow-quantum-chip/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank"><img class="embed__image embed__image--top" src="https://storage.googleapis.com/gweb-uniblog-publish-prod/images/04_YouTube_Thumbnail_-_Hero_Shot_1280x720.width-1300.png"/><div class="embed__content"><p class="embed__title"> Willow - Google’s state-of-the-art quantum chip </p><p class="embed__description"> Google’s Quantum AI team just announced Willow, their latest quantum processor.<br><br>Google used the Random Circuit Sampling (RCS) benchmark to measure its performance, and Willow was able to complete an extremely complex computation in under 5 minutes. That same computation would take the world’s fastest supercomputer 10 septillion years.<br><br>At scale, quantum computers would break many current encryption methods, so there’s a big push for quantum-resistant cryptography. 
</p><p class="embed__link"> blog.google/technology/research/google-willow-quantum-chip </p></div></a></div><div class="embed"><a class="embed__url" href="https://blog.levelupcoding.com/p/luc-66-breaking-down-modular-monolithic-architecture-blending-tradition-with-innovation?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/ac500e78-9993-40e4-8f80-f4fc6bce37a4/Modular_Monoliths_Newsletter_Version.png?t=1730542325"/><div class="embed__content"><p class="embed__title"> Explaining the Modular Monolith Architecture </p><p class="embed__description"> Modular Monoliths are becoming increasingly popular; they strike a balance between the efficiency of traditional Monoliths and the separation of Microservices.<br><br>This is a great article that delves into this pattern, its defining characteristics, and its pros and cons. 
</p><p class="embed__link"> blog.levelupcoding.com/p/luc-66-breaking-down-modular-monolithic-architecture-blending-tradition-with-innovation </p></div></a></div><div class="embed"><a class="embed__url" href="https://blog.duolingo.com/reducing-cloud-spending/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/edd159f3-e1be-4ef2-bc64-f508de4fc7f8/Screenshot_2024-12-10_at_1.46.53_AM.png?t=1733813223"/><div class="embed__content"><p class="embed__title"> How Duolingo cut their Cloud Spend by 20% </p><p class="embed__description"> Duolingo published a fantastic blog post delving into the exact strategies they used to cut their AWS cloud spend.<br><br>Some of the key optimizations included<br>- Extending cache TTLs for rarely-changing resources<br>- Reducing unnecessarily verbose logging in production<br>- Changing databases to more optimal configurations<br><br>and more. </p><p class="embed__link"> blog.duolingo.com/reducing-cloud-spending </p></div></a></div><div class="embed"><a class="embed__url" href="https://refactoring.fm/p/code-quality-in-the-age-of-ai?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-airbnb-processes-a-million-user-events-every-second" target="_blank"><img class="embed__image embed__image--top" src="https://substackcdn.com/image/fetch/w_1200,h_600,c_fill,f_jpg,q_auto:good,fl_progressive:steep,g_auto/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47090301-73a3-4cf3-a2aa-531023b9b456_1326x742.png"/><div class="embed__content"><p class="embed__title"> How to Maintain Code Quality in the age of AI </p><p class="embed__description"> While AI can help write code faster, it can also create a tradeoff with control vs. quality. 
In this article, Refactoring.fm provides a six-step “Lifecycle of Quality” process to help ensure that your codebase doesn’t suffer as you use tools like Claude or gpt-4o. </p><p class="embed__link"> refactoring.fm/p/code-quality-in-the-age-of-ai </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=b9c48951-43e6-49a2-ba44-4162ebd5aecb&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
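<p class="paragraph" style="text-align:left;">As a code-level footnote to the User Signals Platform section above: the append-only, versioned write pattern from the KV storage step can be sketched in a few lines of Python. This is an illustrative sketch with assumed names (<code>write_signal</code>, <code>latest_signal</code>), not Airbnb’s actual code — the point is that replaying an event (as the batch backfill does) rewrites an identical version rather than clobbering newer data, which is what makes the writes idempotent.</p>

```python
# Illustrative sketch (not Airbnb's code) of idempotent append-only KV writes:
# each version of a signal is keyed by (user_id, signal_type, event_time),
# so replaying an event is a no-op instead of a conflicting overwrite.
from collections import defaultdict

# store[(user_id, signal_type)] maps event_time -> signal payload
store = defaultdict(dict)

def write_signal(user_id, signal_type, event_time, payload):
    """Append-only write: versions are keyed by event time, never mutated in place."""
    store[(user_id, signal_type)][event_time] = payload

def latest_signal(user_id, signal_type):
    """Serve the most recent version by event time (None if no versions exist)."""
    versions = store[(user_id, signal_type)]
    return versions[max(versions)] if versions else None

# The speed layer writes a signal...
write_signal(42, "viewed_listing", 1700000000, {"listing": "L1"})
# ...and the batch backfill later replays the same event: a no-op, not a conflict.
write_signal(42, "viewed_listing", 1700000000, {"listing": "L1"})
# A genuinely newer event adds a new version rather than mutating the old one.
write_signal(42, "viewed_listing", 1700000500, {"listing": "L2"})

print(latest_signal(42, "viewed_listing"))  # {'listing': 'L2'}
```

<p class="paragraph" style="text-align:left;">Because each version is keyed by event time, the speed layer and the batch layer can both write the same signal without coordination, and reads simply serve the latest version.</p>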
  ]]></content:encoded>
</item>

      <item>
  <title>How LinkedIn Scaled Their System to 5 Million Queries Per Second</title>
  <description>How LinkedIn used BitSets, Bloom Filters, Caching Strategies and more to Scale their Safety system to 5 million queries per second. Plus, questions you&#39;ll get asked frequently as an engineering manager and tips on how to scale a large codebase.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/05dd8350-51c8-403d-b35e-039586a84e1a/LinkedIn_Restriction_Inforcement_System_Diagram.gif" length="716309" type="image/gif"/>
  <link>https://blog.quastor.org/p/how-linkedin-scaled-their-system-to-5-million-queries-per-second-f8fe</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-linkedin-scaled-their-system-to-5-million-queries-per-second-f8fe</guid>
  <pubDate>Wed, 27 Nov 2024 18:30:00 +0000</pubDate>
  <atom:published>2024-11-27T18:30:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How LinkedIn scaled their Restrictions and Enforcement System to 5 million queries per second</b></p><ul><li><p class="paragraph" style="text-align:left;"> Introduction to BitSets and their use at LinkedIn</p></li><li><p class="paragraph" style="text-align:left;"> How Bloom Filters work and their use-cases</p></li><li><p class="paragraph" style="text-align:left;">Full Refresh-ahead Caching and the pros/cons</p></li><li><p class="paragraph" style="text-align:left;">The Architecture of LinkedIn’s System</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Latency Numbers Every Programmer Should Know (Visualized)</p></li><li><p class="paragraph" style="text-align:left;">Serving a Billion Web Requests with Boring Code</p></li><li><p class="paragraph" style="text-align:left;">Lies we tell ourselves to keep using Golang</p></li><li><p class="paragraph" style="text-align:left;">7 questions I get asked frequently as an EM</p></li><li><p class="paragraph" style="text-align:left;">How to Scale a Large Codebase </p></li></ul></li></ul><div class="custom_html"><iframe src="https://embeds.beehiiv.com/c8ef45bf-1b8e-469c-baf2-b51f4701e532" data-test-id="beehiiv-embed" width="100%" height="320" frameborder="0" style="border-radius: 4px; border: 2px solid #e5e7eb; margin: 0; background-color: transparent;"></iframe></div><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>The Architecture of LinkedIn’s Restriction Enforcement System</b></h1><p class="paragraph" style="text-align:left;">LinkedIn is the largest professional social network in the world, with over 1 billion users. 
Over 100 million messages are sent daily on the platform.</p><p class="paragraph" style="text-align:left;">With this scale, you’ll inevitably have some bad actors causing issues on the site. It might be users sending harassing/toxic messages, spammers posting about some cryptocurrency or LinkedIn influencers sharing how their camping trip made them better at B2B sales.</p><p class="paragraph" style="text-align:left;">To combat this malicious behavior, LinkedIn provides a number of safeguards like reporting inappropriate content and blocking problematic/annoying users.</p><p class="paragraph" style="text-align:left;">However, implementing these safeguards at LinkedIn’s scale introduces a ton of technical challenges.</p><p class="paragraph" style="text-align:left;">Some of the requirements LinkedIn needs for their Restrictions Enforcement system include:</p><ul><li><p class="paragraph" style="text-align:left;"><b>High QPS</b><b> </b>- The system needs to support 4-5 million queries per second. Many user actions (viewing the feed, sending messages, etc.) require checking the restricted/blocked accounts list. </p></li><li><p class="paragraph" style="text-align:left;"><b>Low Latency</b><b> </b>- Latency needs to be under 5 milliseconds. Otherwise, basic actions like refreshing your feed or sending a message would take too long.</p></li><li><p class="paragraph" style="text-align:left;">High Availability - This system needs to operate with 99.999% availability (5 9s of availability), so less than 30 seconds of downtime per month. </p></li><li><p class="paragraph" style="text-align:left;"><b>Low Ingestion Delay</b> - When a user blocks/reports an account, that should be reflected in the restrictions enforcement system immediately. 
If they refresh their feed right after, the posts from the blocked user should be immediately hidden.</p></li></ul><p class="paragraph" style="text-align:left;">Earlier this year, LinkedIn’s engineering team published a fantastic <a class="link" href="https://www.linkedin.com/blog/engineering/trust-and-safety/evolution-enforcing-our-professional-community-policies-at-scale?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank" rel="noopener noreferrer nofollow">blog post</a> detailing how they built their restrictions enforcement system. They talked about the different generations of the system’s architecture and problems they faced along the way.</p><p class="paragraph" style="text-align:left;">In our Quastor article, we’ll focus on the specific data structures and strategies LinkedIn used. We’ll explain them and delve into the pros and cons LinkedIn saw.</p><h2 class="heading" style="text-align:left;" id="bit-sets"><b>BitSets</b></h2><p class="paragraph" style="text-align:left;">One of the key data structures LinkedIn uses in their restrictions system is BitSets.</p><p class="paragraph" style="text-align:left;">A BitSet is an array of boolean values where each value only takes up 1 bit of space. A bit that’s set represents a true value whereas a bit that’s not set represents false.</p><p class="paragraph" style="text-align:left;">BitSets are <i>extremely</i> memory efficient. If you need to store boolean values for 1 billion users (<i>whether a user is restricted/not restricted</i>), you would only need 1 billion bits (approximately 125 megabytes).</p><p class="paragraph" style="text-align:left;">To give a more concrete example of how LinkedIn uses BitSets, let’s say LinkedIn needs to store restricted/unrestricted account status for 1 billion users. Each user has a memberID from 1 to 1 billion. 
</p><p class="paragraph" style="text-align:left;">To store this, they could use an array of 64-bit integers. Each integer can store the restriction status for 64 different users (<i>we need 1 bit per user</i>) so the array would hold ~15 million integers (around 125 megabytes of space). </p><p class="paragraph" style="text-align:left;">If LinkedIn needs to check whether the user with memberID <code>523234320</code> is restricted, the steps would be:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">Divide 523,234,320 by 64 to get the index in the integer array (which would be 8,175,536)</p></li><li><p class="paragraph" style="text-align:left;">Take 523,234,320 modulo 64 to get which bit to check in that integer (which would be 16)</p></li><li><p class="paragraph" style="text-align:left;">Use bitwise operations to check if that specific bit is set to 1 (restricted) or 0 (not restricted)</p></li></ol><p class="paragraph" style="text-align:left;">The time and space requirements with BitSets are very efficient. Checking whether users are restricted takes constant time (<i>since the membership lookup operations are all O(1)</i>) and the storage necessary is only around 125 megabytes.</p><h2 class="heading" style="text-align:left;" id="bloom-filters"><b>Bloom Filters</b></h2><p class="paragraph" style="text-align:left;">In addition to BitSets, the other data structure LinkedIn found useful was Bloom Filters.</p><p class="paragraph" style="text-align:left;">A Bloom filter is a probabilistic data structure that lets you quickly test whether an item might be in a set. Bloom Filters are <i>probabilistic</i>: if the filter says an item is not in the set, it definitely isn’t, but it will occasionally give false positives (<i>mistakenly saying an item is in the set when it’s not</i>).</p><p class="paragraph" style="text-align:left;">Under the hood, Bloom Filters use hashing to map items to a bit array. 
The issue is that collisions in the hashing function can cause false positives. Here’s a <a class="link" href="https://brilliant.org/wiki/bloom-filter/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank" rel="noopener noreferrer nofollow">fantastic article</a> that delves into how Bloom Filters work with visuals and a basic Python implementation. </p><p class="paragraph" style="text-align:left;">For LinkedIn, they also used Bloom Filters to quickly check whether a user’s account was restricted.  </p><p class="paragraph" style="text-align:left;">The pros were that the Bloom Filters were extremely space efficient compared to traditional caching techniques (<i>using a set or hash table</i>).</p><p class="paragraph" style="text-align:left;">The downside was the false positives. However, the Bloom Filter can be tuned to make false positives extremely rare and LinkedIn didn’t find it to be a big issue. </p><h2 class="heading" style="text-align:left;" id="full-refreshahead-caching"><b>Full Refresh-ahead Caching</b></h2><p class="paragraph" style="text-align:left;">LinkedIn explored various caching strategies to achieve their QPS and latency requirements. 
One approach was their full refresh-ahead cache.</p><p class="paragraph" style="text-align:left;">The dataset of account restrictions was quite small (<i>thanks to using the BitSet and Bloom Filter data structures</i>) so LinkedIn had each client application host store <i>all</i> restriction data in their in-memory cache.</p><p class="paragraph" style="text-align:left;">In order to maintain cache freshness, they implemented a polling mechanism where clients would periodically check for any new changes to member restrictions.</p><p class="paragraph" style="text-align:left;">This system resulted in a huge decrease in latencies but came with some downsides.</p><p class="paragraph" style="text-align:left;">The client-side memory footprint ended up being substantial and strained infrastructure. Additionally, the caches were stored in-memory so they weren’t persistent. Clients had to frequently build and rebuild this cache which put strain on LinkedIn’s underlying database.</p><h2 class="heading" style="text-align:left;" id="system-architecture"><b>System Architecture</b></h2><p class="paragraph" style="text-align:left;">Here’s the current architecture for LinkedIn’s restriction enforcement system.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/09874b2d-8446-409c-9ac6-74c8c6278f3a/1704921888416.png?t=1732730841"/></div><p class="paragraph" style="text-align:left;">The key components are</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Client Layer</b><b> </b>- Clients that use this system include the LinkedIn newsfeed, their recruiting/talent management tools and other products/services at the company. 
These clients use a REST API to query the system.</p></li><li><p class="paragraph" style="text-align:left;"><b>LinkedIn Restrictions Enforcement System</b><b> </b>- This component consists of the BitSet data structures and the main restriction enforcement system. The BitSet data structures are stored client-side and maintain a cache of all the restriction records.</p></li><li><p class="paragraph" style="text-align:left;"><b>Venice Database</b> - this is the central storage (source of truth) for all user restrictions. Venice is an open-source, horizontally scalable, eventually-consistent storage system that LinkedIn built on RocksDB. You can read more about how Venice works <a class="link" href="https://www.linkedin.com/blog/engineering/open-source/open-sourcing-venice-linkedin-s-derived-data-platform?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank" rel="noopener noreferrer nofollow">here</a>.</p></li><li><p class="paragraph" style="text-align:left;"><b>Kafka Restriction Records</b><b> </b>- When a user gets reported/blocked, the change in their account status is sent through Kafka. This allows near real-time propagation of changes.</p></li><li><p class="paragraph" style="text-align:left;"><b>Restriction Management System</b> - LinkedIn has a legacy system (<i>check the </i><i><a class="link" href="https://www.linkedin.com/blog/engineering/trust-and-safety/evolution-enforcing-our-professional-community-policies-at-scale?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank" rel="noopener noreferrer nofollow">blog post</a></i><i> for a full explanation on the previous generations</i>) of the Restriction Enforcement system that connects to Espresso to store and update blocked/restricted users. Espresso is LinkedIn’s distributed, document data store that’s built on MySQL. 
You can read more about Espresso <a class="link" href="https://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank" rel="noopener noreferrer nofollow">here</a>. </p></li></ol><div class="custom_html"><iframe src="https://embeds.beehiiv.com/c8ef45bf-1b8e-469c-baf2-b51f4701e532" data-test-id="beehiiv-embed" width="100%" height="320" frameborder="0" style="border-radius: 4px; border: 2px solid #e5e7eb; margin: 0; background-color: transparent;"></iframe></div><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://samwho.dev/numbers/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank"><img class="embed__image embed__image--top" src="https://samwho.dev/images/numbers.png"/><div class="embed__content"><p class="embed__title"> Latency Numbers Every Programmer Should Know (Visualized) </p><p class="embed__description"> You’ve probably seen the blog posts with latency numbers that every programmer should know.<br><br>These are approximate numbers for how long it takes to read from RAM vs. disk, transfer a byte from the US to Europe, compress 1 kilobyte of data, and more.<br><br>This is an awesome tool that helps you visualize those numbers and understand the orders-of-magnitude differences between some of them. 
</p><p class="embed__link"> samwho.dev/numbers </p></div></a></div><div class="embed"><a class="embed__url" href="https://notes.billmill.org/blog/2024/06/Serving_a_billion_web_requests_with_boring_code.html?utm_source=www.hungryminds.dev&utm_medium=referral&utm_campaign=data-streams-101-how-to-handle-petabytes-of-data" target="_blank"><div class="embed__content"><p class="embed__title"> Serving a Billion Web Requests with Boring Code </p><p class="embed__description"> Bill Mill is a software engineer who worked as a contractor for the US Government. During his stint, he helped build the Medicare plan compare tool on the website.<br><br>He published a fantastic blog post delving into the scale of the website, the “boring” technology the team used (Go, React, Postgres), the architectural bets he made (gRPC, modular backend) and much more. </p><p class="embed__link"> notes.billmill.org/blog/2024/06/Serving_a_billion_web_requests_with_boring_code.html </p></div></a></div><div class="embed"><a class="embed__url" href="https://fasterthanli.me/articles/lies-we-tell-ourselves-to-keep-using-golang?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank"><div class="embed__content"><p class="embed__title"> Lies we tell ourselves to keep using Golang </p><p class="embed__description"> This is an interesting article on some common justifications for using Go and why they can cause issues as your codebase grows.<br><br>Some of the criticisms the author talks about include<br>- the difficulty of integrating Go with other technologies due to its unique toolchain<br>- how Go’s “zero values” can lead to subtle bugs<br>- how Go’s “simplicity” leads to complexity in your application code<br><br>Read the full blog post for all the author’s criticisms. 
</p><p class="embed__link"> fasterthanli.me/articles/lies-we-tell-ourselves-to-keep-using-golang </p></div></a></div><div class="embed"><a class="embed__url" href="https://medium.com/one-to-n/7-questions-i-get-asked-frequently-as-an-em-dc361809d351?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank"><div class="embed__content"><p class="embed__title"> 7 questions I get asked frequently as an EM </p><p class="embed__description"> Nitin Dhar is a Senior Engineering Manager at Carta. He published an interesting blog post talking about the most common questions he gets asked about his team’s performance and how he answers the questions with data.<br><br>Key questions and metrics include<br>- KTLO (Keep The Lights On) Costs - this shows what percentage of capacity goes to maintenance<br>- MTTR (Mean Time to Recovery) - how long it takes to recover after an incident. Tracked via tools like PagerDuty<br>- Project Impact - what are some tangible metrics that came from what his team’s working on. You can use customer surveys or performance metrics to show this.<br><br>Read the full blog post for the rest of the questions and how Nitin answers them. </p><p class="embed__link"> medium.com/one-to-n/7-questions-i-get-asked-frequently-as-an-em-dc361809d351 </p></div></a></div><div class="embed"><a class="embed__url" href="https://vercel.com/blog/how-to-scale-a-large-codebase?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank"><div class="embed__content"><p class="embed__title"> How to Scale a Large Codebase </p><p class="embed__description"> Vercel is a cloud services provider and is also the maintainer of NextJS.<br><br>In this blog post, they delve into their approach for scaling their codebase (a monorepo). 
They emphasize feature flags for safe code releases, incremental builds for quick iteration and skew protection to handle version discrepancies. </p><p class="embed__link"> https://vercel.com/blog/how-to-scale-a-large-codebase </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=53656d43-7d25-475b-bf1c-2fe22895e210&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
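<p class="paragraph" style="text-align:left;">As a code-level footnote to the BitSet section above: the three lookup steps can be sketched in a few lines of Python. This is an illustrative sketch (the function names and the 1-billion-member sizing come from the article’s worked example), not LinkedIn’s production code.</p>

```python
# Hypothetical sketch of the BitSet lookup described above -- not LinkedIn's code.
# Restriction status for ~1 billion members is packed into an array of
# 64-bit words, one bit per member.

WORD_SIZE = 64
NUM_MEMBERS = 1_000_000_000

# ~15.6 million words (~125 MB of bits) cover memberIDs 1 through 1 billion.
bitset = [0] * (NUM_MEMBERS // WORD_SIZE + 1)

def restrict(member_id: int) -> None:
    """Mark a member as restricted by setting their bit."""
    word, bit = divmod(member_id, WORD_SIZE)  # steps 1 and 2: word index + bit offset
    bitset[word] |= 1 << bit

def is_restricted(member_id: int) -> bool:
    """Step 3: O(1) bitwise test of the member's bit."""
    word, bit = divmod(member_id, WORD_SIZE)
    return (bitset[word] >> bit) & 1 == 1

restrict(523_234_320)              # word index 8,175,536; bit offset 16
print(is_restricted(523_234_320))  # True
print(is_restricted(523_234_321))  # False
```

<p class="paragraph" style="text-align:left;"><code>divmod(523234320, 64)</code> returns <code>(8175536, 16)</code>, matching the index/offset arithmetic in the steps above; a production version would use a packed byte buffer rather than a Python list of integers.</p>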
  ]]></content:encoded>
</item>

      <item>
  <title>How LinkedIn Scaled Their System to 5 Million Queries Per Second</title>
  <description>How LinkedIn used BitSets, Bloom Filters, Caching Strategies and more to Scale their Safety system to 5 million queries per second. Plus, questions you&#39;ll get asked frequently as an engineering manager and tips on how to scale a large codebase.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/3496db89-bf35-41ae-84ab-e91f785a23c0/LinkedIn_Restriction_Inforcement_System_Diagram.gif" length="716309" type="image/gif"/>
  <link>https://blog.quastor.org/p/how-linkedin-scaled-their-system-to-5-million-queries-per-second</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-linkedin-scaled-their-system-to-5-million-queries-per-second</guid>
  <pubDate>Wed, 27 Nov 2024 18:20:00 +0000</pubDate>
  <atom:published>2024-11-27T18:20:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How LinkedIn scaled their Restrictions and Enforcement System to 5 million queries per second</b></p><ul><li><p class="paragraph" style="text-align:left;"> Introduction to BitSets and their use at LinkedIn</p></li><li><p class="paragraph" style="text-align:left;"> How Bloom Filters work and their use-cases</p></li><li><p class="paragraph" style="text-align:left;">Full Refresh-ahead Caching and the pros/cons</p></li><li><p class="paragraph" style="text-align:left;">The Architecture of LinkedIn’s System</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">Latency Numbers Every Programmer Should Know (Visualized)</p></li><li><p class="paragraph" style="text-align:left;">Serving a Billion Web Requests with Boring Code</p></li><li><p class="paragraph" style="text-align:left;">Lies we tell ourselves to keep using Golang</p></li><li><p class="paragraph" style="text-align:left;">7 questions I get asked frequently as an EM</p></li><li><p class="paragraph" style="text-align:left;">How to Scale a Large Codebase </p></li></ul></li></ul><hr class="content_break"><div class="custom_html"><iframe src="https://embeds.beehiiv.com/c8ef45bf-1b8e-469c-baf2-b51f4701e532" data-test-id="beehiiv-embed" width="100%" height="320" frameborder="0" style="border-radius: 4px; border: 2px solid #e5e7eb; margin: 0; background-color: transparent;"></iframe></div><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>The Architecture of LinkedIn’s Restriction Enforcement System</b></h1><p class="paragraph" style="text-align:left;">LinkedIn is the largest professional social network in the world, with over 1 billion 
users. Over 100 million messages are sent daily on the platform.</p><p class="paragraph" style="text-align:left;">With this scale, you’ll inevitably have some bad actors causing issues on the site. It might be users sending harassing/toxic messages, spammers posting about some cryptocurrency or LinkedIn influencers sharing how their camping trip made them better at B2B sales.</p><p class="paragraph" style="text-align:left;">To combat this malicious behavior, LinkedIn provides a number of safeguards like reporting inappropriate content and blocking problematic/annoying users.</p><p class="paragraph" style="text-align:left;">However, implementing these safeguards at LinkedIn’s scale introduces a ton of technical challenges.</p><p class="paragraph" style="text-align:left;">Some of the requirements LinkedIn needs for their Restrictions Enforcement system include:</p><ul><li><p class="paragraph" style="text-align:left;"><b>High QPS</b><b> </b>- The system needs to support 4-5 million queries per second. Many user actions (viewing the feed, sending messages, etc.) require checking the restricted/blocked accounts list. </p></li><li><p class="paragraph" style="text-align:left;"><b>Low Latency</b><b> </b>- Latency needs to be under 5 milliseconds. Otherwise, basic actions like refreshing your feed or sending a message would take too long.</p></li><li><p class="paragraph" style="text-align:left;"><b>High Availability</b> - This system needs to operate with 99.999% availability (5 9s of availability), so less than 30 seconds of downtime per month. </p></li><li><p class="paragraph" style="text-align:left;"><b>Low Ingestion Delay</b> - When a user blocks/reports an account, that should be reflected in the restrictions enforcement system immediately. 
If they refresh their feed right after, the posts from the blocked user should be immediately hidden.</p></li></ul><p class="paragraph" style="text-align:left;">Earlier this year, LinkedIn’s engineering team published a fantastic <a class="link" href="https://www.linkedin.com/blog/engineering/trust-and-safety/evolution-enforcing-our-professional-community-policies-at-scale?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank" rel="noopener noreferrer nofollow">blog post</a> detailing how they built their restrictions enforcement system. They talked about the different generations of the system’s architecture and problems they faced along the way.</p><p class="paragraph" style="text-align:left;">In our Quastor article, we’ll focus on the specific data structures and strategies LinkedIn used. We’ll explain them and delve into the pros and cons LinkedIn saw.</p><h2 class="heading" style="text-align:left;" id="bit-sets"><b>BitSets</b></h2><p class="paragraph" style="text-align:left;">One of the key data structures LinkedIn uses in their restrictions system is BitSets.</p><p class="paragraph" style="text-align:left;">A BitSet is an array of boolean values where each value only takes up 1 bit of space. A bit that’s set represents a true value whereas a bit that’s not set represents false.</p><p class="paragraph" style="text-align:left;">BitSets are <i>extremely</i> memory efficient. If you need to store boolean values for 1 billion users (<i>whether a user is restricted/not restricted</i>), you would only need 1 billion bits (approximately 125 megabytes).</p><p class="paragraph" style="text-align:left;">To give a more concrete example of how LinkedIn uses BitSets, let’s say LinkedIn needs to store restricted/unrestricted account status for 1 billion users. Each user has a memberID from 1 to 1 billion. 
</p><p class="paragraph" style="text-align:left;">To store this, they could use an array of 64-bit integers. Each integer can store the restriction status for 64 different users (<i>we need 1 bit per user</i>) so the array would hold ~15.6 million integers (around 125 megabytes of space). </p><p class="paragraph" style="text-align:left;">If LinkedIn needs to check whether the user with memberID <code>523234320</code> is restricted, the steps would be:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">Divide 523,234,320 by 64, keeping the integer quotient, to get the index in the integer array (which would be 8,175,536)</p></li><li><p class="paragraph" style="text-align:left;">Take 523,234,320 modulo 64 to get which bit to check in that integer (which would be 16)</p></li><li><p class="paragraph" style="text-align:left;">Use bitwise operations to check if that specific bit is set to 1 (restricted) or 0 (not restricted)</p></li></ol><p class="paragraph" style="text-align:left;">The time and space requirements with BitSets are very efficient. Checking whether users are restricted takes constant time (<i>since the membership lookup operations are all O(1)</i>) and the storage necessary is only a couple hundred megabytes.</p><h2 class="heading" style="text-align:left;" id="bloom-filters"><b>Bloom Filters</b></h2><p class="paragraph" style="text-align:left;">In addition to BitSets, the other data structure LinkedIn found useful was Bloom Filters.</p><p class="paragraph" style="text-align:left;">A Bloom filter is a probabilistic data structure that lets you quickly test whether an item might be in a set. Bloom Filters are <i>probabilistic</i>: a negative result means the item is definitely not in the set, but a positive result is occasionally a false positive (<i>the filter mistakenly says an item is in the set when it’s not</i>).</p><p class="paragraph" style="text-align:left;">Under the hood, Bloom Filters use hashing to map items to a bit array. 
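The BitSet lookup steps described above can be sketched in Python. This is a minimal illustration (the class name and capacity are assumed for the example), not LinkedIn's actual implementation:

```python
# Minimal BitSet sketch following the three lookup steps above.
# Assumes member IDs run from 0 to capacity - 1.

class BitSet:
    def __init__(self, capacity: int):
        # One 64-bit word holds the restriction flags for 64 members.
        self.words = [0] * ((capacity + 63) // 64)

    def restrict(self, member_id: int) -> None:
        # Step 1: integer-divide by 64 to find the word index.
        # Step 2: take modulo 64 to find the bit within that word.
        self.words[member_id // 64] |= 1 << (member_id % 64)

    def is_restricted(self, member_id: int) -> bool:
        # Step 3: a shift plus a bitwise AND tests the specific bit.
        return (self.words[member_id // 64] >> (member_id % 64)) & 1 == 1


bits = BitSet(1_000_000_000)
bits.restrict(523_234_320)              # word index 8_175_536, bit 16
print(bits.is_restricted(523_234_320))  # True
print(bits.is_restricted(523_234_321))  # False
```

Both operations touch a single array slot, which is why the lookup is O(1) regardless of how many members are restricted.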
Collisions in the hash functions are what cause the false positives. Here’s a <a class="link" href="https://brilliant.org/wiki/bloom-filter/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank" rel="noopener noreferrer nofollow">fantastic article</a> that delves into how Bloom Filters work with visuals and a basic Python implementation. </p><p class="paragraph" style="text-align:left;">LinkedIn used Bloom Filters to quickly check whether a user’s account was restricted. </p><p class="paragraph" style="text-align:left;">The pros were that the Bloom Filters were extremely space efficient compared to traditional caching techniques (<i>using a set or hash table</i>).</p><p class="paragraph" style="text-align:left;">The downside was the false positives. However, a Bloom Filter can be tuned to make false positives extremely rare, and LinkedIn didn’t find them to be a big issue. </p><h2 class="heading" style="text-align:left;" id="full-refreshahead-caching"><b>Full Refresh-ahead Caching</b></h2><p class="paragraph" style="text-align:left;">LinkedIn explored various caching strategies to achieve their QPS and latency requirements. 
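The Bloom Filter mechanics described above can also be sketched in Python. This is a toy version: the bit-array size, the hash count, and the use of salted SHA-256 are all assumptions for illustration, not LinkedIn's implementation (real deployments size these from the expected item count and target false-positive rate):

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions by salting one hash function k ways.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means "probably present".
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))


restricted = BloomFilter()
restricted.add("member:523234320")
print(restricted.might_contain("member:523234320"))  # True (no false negatives)
```

An added item always tests positive; a false positive only happens when some other item's hashes happen to set all k of its bits, which tuning m and k makes rare.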
One approach was their full refresh-ahead cache.</p><p class="paragraph" style="text-align:left;">The dataset of account restrictions was quite small (<i>thanks to using the BitSet and Bloom Filter data structures</i>) so LinkedIn had each client application host store <i>all</i> restriction data in an in-memory cache.</p><p class="paragraph" style="text-align:left;">In order to maintain cache freshness, they implemented a polling mechanism where clients would periodically check for any new changes to member restrictions.</p><p class="paragraph" style="text-align:left;">This system resulted in a huge decrease in latencies but came with some downsides.</p><p class="paragraph" style="text-align:left;">The client-side memory footprint ended up being substantial, which strained their infrastructure. Additionally, the caches were stored in-memory so they weren’t persistent. Clients had to frequently build and rebuild this cache, which put strain on LinkedIn’s underlying database.</p><h2 class="heading" style="text-align:left;" id="system-architecture"><b>System Architecture</b></h2><p class="paragraph" style="text-align:left;">Here’s the current architecture for LinkedIn’s restriction enforcement system.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/09874b2d-8446-409c-9ac6-74c8c6278f3a/1704921888416.png?t=1732730841"/></div><p class="paragraph" style="text-align:left;">The key components are:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Client Layer</b><b> </b>- Clients that use this system include the LinkedIn newsfeed, their recruiting/talent management tools and other products/services at the company. 
These clients use a REST API to query the system.</p></li><li><p class="paragraph" style="text-align:left;"><b>LinkedIn Restrictions Enforcement System</b><b> </b>- This component consists of the BitSet data structures and the main restriction enforcement system. The BitSet data structures are stored client-side and maintain a cache of all the restriction records.</p></li><li><p class="paragraph" style="text-align:left;"><b>Venice Database</b> - this is the central storage (source of truth) for all user restrictions. Venice is an open-source, horizontally scalable, eventually-consistent storage system that LinkedIn built on RocksDB. You can read more about how Venice works <a class="link" href="https://www.linkedin.com/blog/engineering/open-source/open-sourcing-venice-linkedin-s-derived-data-platform?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank" rel="noopener noreferrer nofollow">here</a>.</p></li><li><p class="paragraph" style="text-align:left;"><b>Kafka Restriction Records</b><b> </b>- When a user gets reported/blocked, the change in their account status is sent through Kafka. This allows near real-time propagation of changes.</p></li><li><p class="paragraph" style="text-align:left;"><b>Restriction Management System</b> - LinkedIn has a legacy system (<i>check the </i><a class="link" href="https://www.linkedin.com/blog/engineering/trust-and-safety/evolution-enforcing-our-professional-community-policies-at-scale?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank" rel="noopener noreferrer nofollow"><i>blog post</i></a><i> for a full explanation on the previous generations</i>) of the Restriction Enforcement system that connects to Espresso to store and update blocked/restricted users. Espresso is LinkedIn’s distributed, document data store that’s built on MySQL. 
You can read more about Espresso <a class="link" href="https://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank" rel="noopener noreferrer nofollow">here</a>. </p></li></ol><hr class="content_break"><div class="custom_html"><iframe src="https://embeds.beehiiv.com/c8ef45bf-1b8e-469c-baf2-b51f4701e532" data-test-id="beehiiv-embed" width="100%" height="320" frameborder="0" style="border-radius: 4px; border: 2px solid #e5e7eb; margin: 0; background-color: transparent;"></iframe></div><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://samwho.dev/numbers/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank"><img class="embed__image embed__image--top" src="https://samwho.dev/images/numbers.png"/><div class="embed__content"><p class="embed__title"> Latency Numbers Every Programmer Should Know (Visualized) </p><p class="embed__description"> You’ve probably seen the blog posts with latency numbers that every programmer should know.<br><br>These are approximate numbers for how long it takes to search RAM vs. disk, transfer a byte from the US to Europe, compress 1 kilobyte of data and more.<br><br>This is an awesome tool that helps you visualize those numbers and understand the orders of magnitude difference that exists between some of them. 
</p><p class="embed__link"> samwho.dev/numbers </p></div></a></div><div class="embed"><a class="embed__url" href="https://notes.billmill.org/blog/2024/06/Serving_a_billion_web_requests_with_boring_code.html?utm_source=www.hungryminds.dev&utm_medium=referral&utm_campaign=data-streams-101-how-to-handle-petabytes-of-data" target="_blank"><div class="embed__content"><p class="embed__title"> Serving a Billion Web Requests with Boring Code </p><p class="embed__description"> Bill Mill is a software engineer who worked as a contractor for the US Government. During his stint, he helped build the medicare plan compare tool on the website.<br><br>He published a fantastic blog post delving into the scale of the website, the “boring“ technology the team used (golang, reactjs, postgres) and architectural bets he made (gRPC, modular backend) and much more. </p><p class="embed__link"> notes.billmill.org/blog/2024/06/Serving_a_billion_web_requests_with_boring_code.html </p></div></a></div><div class="embed"><a class="embed__url" href="https://fasterthanli.me/articles/lies-we-tell-ourselves-to-keep-using-golang?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank"><div class="embed__content"><p class="embed__title"> Lies we tell ourselves to keep using Golang </p><p class="embed__description"> This is an interesting article on some common justifications for using Go and why they can cause issues as your codebase grows.<br><br>Some of the criticisms the author talks about include<br>- the difficulty of integrating Go with other technologies due to its unique toolchain<br>- how Go’s “zero values“ can lead to subtle bugs<br>- Go’s “simplicity“ leads to complexity in your application code<br><br>Read the full blog post for all the author’s criticisms. 
</p><p class="embed__link"> fasterthanli.me/articles/lies-we-tell-ourselves-to-keep-using-golang </p></div></a></div><div class="embed"><a class="embed__url" href="https://medium.com/one-to-n/7-questions-i-get-asked-frequently-as-an-em-dc361809d351?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank"><div class="embed__content"><p class="embed__title"> 7 questions I get asked frequently as an EM </p><p class="embed__description"> Nitin Dhar is a Senior Engineering Manager at Carta. He published an interesting blog post talking about the most common questions he gets asked about his team’s performance and how he answers the questions with data.<br><br>Key questions and metrics include<br>- KTLO (Keep The Lights On) Costs - this shows what percentage of capacity goes to maintenance<br>- MTTR (Mean Time to Recovery) - how long it takes to recover after an incident. Tracked via tools like PagerDuty<br>- Project Impact - what are some tangible metrics that came from what his team’s working on. You can use customer surveys or performance metrics to show this.<br><br>Read the full blog post for the rest of the questions and how Nitin answers them. </p><p class="embed__link"> medium.com/one-to-n/7-questions-i-get-asked-frequently-as-an-em-dc361809d351 </p></div></a></div><div class="embed"><a class="embed__url" href="https://vercel.com/blog/how-to-scale-a-large-codebase?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-scaled-their-system-to-5-million-queries-per-second" target="_blank"><div class="embed__content"><p class="embed__title"> How to Scale a Large Codebase </p><p class="embed__description"> Vercel is a cloud services provider and is also the maintainer of NextJS.<br><br>In this blog post, they delve into their approach for scaling their codebase (a monorepo). 
They emphasize feature flags for safe code releases, incremental builds for quick iteration and skew protection to handle version discrepancies. </p><p class="embed__link"> https://vercel.com/blog/how-to-scale-a-large-codebase </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=ddc44129-b203-40f7-be46-5cfaea878477&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How LinkedIn uses Event Driven Architectures to Scale</title>
  <description>An introduction to EDAs and the Actor Model. Plus, how to ship projects at big tech companies, how Coinbase uses ML to predict traffic patterns and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/bccee9e1-6a15-414e-a0ee-6a32b25cca4c/unnamed__20_.png" length="148120" type="image/png"/>
  <link>https://blog.quastor.org/p/how-linkedin-uses-event-driven-architectures-to-scale-6ecc</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-linkedin-uses-event-driven-architectures-to-scale-6ecc</guid>
  <pubDate>Fri, 15 Nov 2024 10:00:00 +0000</pubDate>
  <atom:published>2024-11-15T10:00:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How LinkedIn uses Event Driven Architectures to Scale their Infrastructure</b></p><ul><li><p class="paragraph" style="text-align:left;"> An Introduction to the Actor Model and how it works</p></li><li><p class="paragraph" style="text-align:left;"> How LinkedIn collects and processes server metrics from their fleet with an Event Driven Architecture</p></li><li><p class="paragraph" style="text-align:left;">LinkedIn’s monitoring system for server consoles</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">How I Ship Projects at Big Tech Companies</p></li><li><p class="paragraph" style="text-align:left;">How Binary Vector Embeddings work and why they’re so useful</p></li><li><p class="paragraph" style="text-align:left;">How Coinbase uses ML to Predict Traffic and Auto-scale Databases</p></li></ul></li></ul><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-uses-design-patterns-"><b>How LinkedIn uses Design Patterns to Scale their Infrastructure</b></h1><p class="paragraph" style="text-align:left;">LinkedIn is the largest professional social networking platform in the world with over 950 million users in 200+ countries.</p><p class="paragraph" style="text-align:left;">To serve this user base, they maintain dozens of data centers around the world with hundreds of thousands of servers globally. 
</p><p class="paragraph" style="text-align:left;">In order to manage these servers, LinkedIn makes use of many tried-and-tested design patterns.</p><p class="paragraph" style="text-align:left;">One pattern is the Producer-Consumer pattern, commonly used in event driven architectures (EDAs).</p><p class="paragraph" style="text-align:left;">This pattern consists of three main components:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Producer</b> - generates events/messages (server metrics, status updates, data from queries, etc.)</p></li><li><p class="paragraph" style="text-align:left;"><b>Queue</b> - acts as a buffer to store messages until they’re ready to be processed. LinkedIn uses Redis, Kafka or built-in queues for this.</p></li><li><p class="paragraph" style="text-align:left;"><b>Consumer</b> - reads and processes messages from the queue</p></li></ul><p class="paragraph" style="text-align:left;">Saira Khanum is a Staff Software Engineer at LinkedIn and she wrote a fantastic <a class="link" href="https://www.linkedin.com/blog/engineering/infrastructure/how-design-patterns-power-linkedin-infrastructure?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank" rel="noopener noreferrer nofollow">blog post</a> delving into how the engineering team uses this pattern in three different systems:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">To Collect and Maintain Data from Servers for Real-time and Analytical Queries</p></li><li><p class="paragraph" style="text-align:left;">To Check Servers for Availability and Accessibility</p></li><li><p class="paragraph" style="text-align:left;">To Detect and Fix any Access Policy Violations on the Servers</p></li></ol><p class="paragraph" style="text-align:left;">We’ll explore these and talk about how LinkedIn implemented them.</p><h2 class="heading" style="text-align:left;" id="actor-pattern"><b>Actor Pattern</b></h2><p 
class="paragraph" style="text-align:left;">When building event driven architectures, LinkedIn frequently uses the <i><a class="link" href="https://www.brianstorti.com/the-actor-model/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank" rel="noopener noreferrer nofollow">Actor Pattern</a></i>. Event Driven Architectures are loosely defined; the Actor Pattern (<i>or Actor Model</i>) is one concrete way of implementing an EDA.</p><p class="paragraph" style="text-align:left;">With this model, everything is represented as an <i>actor</i>.</p><p class="paragraph" style="text-align:left;">An actor is an independent entity that can</p><ul><li><p class="paragraph" style="text-align:left;">Send messages to other actors</p></li><li><p class="paragraph" style="text-align:left;">Process messages/requests</p></li><li><p class="paragraph" style="text-align:left;">Create new actors and designate their behavior</p></li><li><p class="paragraph" style="text-align:left;">Have independent state</p></li></ul><p class="paragraph" style="text-align:left;">To give you a better sense of how this might work, here’s a <i>hypothetical</i> example of an Actor model at Uber for handling ride requests.</p><ol start="1"><li><p class="paragraph" style="text-align:left;">When a user first requests a ride, a <b>RequestActor</b> is created specifically for their request. This actor maintains the state of the request (<i>whether it’s active or canceled</i>) and coordinates the entire matching process.</p></li><li><p class="paragraph" style="text-align:left;">The RequestActor might first create a child <b>PricingActor</b> to figure out a reasonable price for the request based on the trip distance and time of day. 
The PricingActor will run internal logic based on the RequestActor’s message and return the ride price.</p></li><li><p class="paragraph" style="text-align:left;">Once it has the pricing figured out, the RequestActor will communicate with nearby DriverActors (<i>one actor per active driver on Uber</i>) by sending them ride offer messages.</p></li><li><p class="paragraph" style="text-align:left;">The DriverActor will then handle sending a notification to the Uber driver that there&#39;s someone looking for a ride. If the driver accepts the ride then the DriverActor might create a new <b>TripActor</b> to handle the ongoing ride (tracking location updates, route changes, payment processing, etc.)</p></li></ol><p class="paragraph" style="text-align:left;">If you’re looking for more details, here’s a <a class="link" href="https://www.brianstorti.com/the-actor-model/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank" rel="noopener noreferrer nofollow">fantastic article</a> that delves deeper on the Actor model.</p><p class="paragraph" style="text-align:left;">Back to LinkedIn…</p><h2 class="heading" style="text-align:left;" id="event-driven-architectures-at-linke"><b>Event Driven Architectures at LinkedIn</b></h2><p class="paragraph" style="text-align:left;">LinkedIn talks about a few systems where they’ve found EDAs useful for managing infrastructure.</p><h3 class="heading" style="text-align:left;" id="distributed-server-queries-at-linke"><b>Distributed Server Queries at LinkedIn</b></h3><p class="paragraph" style="text-align:left;">The first system is LinkedIn’s distributed server query system. This is responsible for collecting system facts (CPU/memory usage, network connections, disk space usage, etc.) 
from across the server fleet and storing them so they can be queried and analyzed.</p><p class="paragraph" style="text-align:left;">Some of the requirements are</p><ul><li><p class="paragraph" style="text-align:left;"><b>Scale</b> - the system needs to process terabytes of data from hundreds of thousands of servers in near real-time</p></li><li><p class="paragraph" style="text-align:left;"><b>Data Refresh</b> - the data needs to be collected several times every hour</p></li><li><p class="paragraph" style="text-align:left;"><b>Data Maintenance </b>- the last known good snapshot of system facts needs to be maintained for a defined retention period. (<i>after the retention period is over, the system facts need to be marked as stale</i>)</p></li></ul><p class="paragraph" style="text-align:left;">Here’s the high level architecture of the system</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdqdrDEyruFDePvPR8ky_7PZBtxfOxX8QfFRlwaNdpl8IdTmWvSDlJJpwwP0rtf6toA9EyGUsLTWUjQaBniSh5h5KJE2XiMtUwAUsVZX3aX9akazZWELyKRqLrd4VSqKEat4oE3Ew?key=_e7itvZYih2R2xyyEuPfoiQw"/></div><ol start="1"><li><p class="paragraph" style="text-align:left;">Agents (producers) are deployed across the server fleet to collect system facts</p></li><li><p class="paragraph" style="text-align:left;">These facts are sent to worker processes (using the Actor Pattern) and stored on Redis</p></li><li><p class="paragraph" style="text-align:left;">Different worker processes consume the data from Redis, process it and store it in different datastores </p></li></ol><p class="paragraph" style="text-align:left;">Some of the choices LinkedIn made were</p><ul><li><p class="paragraph" style="text-align:left;"><b>Redis</b> - LinkedIn picked Redis as the queue since they were looking for low latency. 
The messages are short-lived, and a tool like Kafka would have introduced too much overhead.</p></li><li><p class="paragraph" style="text-align:left;"><b>Actor Pattern</b><b> </b>- Workers that collect and process server metrics use the Actor pattern. They’re implemented with <a class="link" href="https://en.wikipedia.org/wiki/Gunicorn?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank" rel="noopener noreferrer nofollow">Gunicorn</a>. </p></li></ul><h3 class="heading" style="text-align:left;" id="server-console-monitoring"><b>Server Console Monitoring</b></h3><p class="paragraph" style="text-align:left;">The second system is LinkedIn’s distributed system for monitoring the server consoles across their infrastructure. Server consoles (<i>often called service processors</i>) allow administrators to manage and monitor physical servers remotely (<i>even when the server is powered off or unresponsive</i>). They’re essential for troubleshooting, rebooting and maintaining servers.</p><p class="paragraph" style="text-align:left;">LinkedIn’s monitoring system checks that these server management consoles are available and accessible.</p><p class="paragraph" style="text-align:left;">Here’s the architecture for how they do that.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/9898dc1b-5f7c-453e-a6cf-fef0f3ba9033/Screenshot_2024-11-13_at_5.51.33_PM.png?t=1731538296"/></div><ol start="1"><li><p class="paragraph" style="text-align:left;">Satellite servers run checks across the servers in the data center. Each check is handled by a separate actor.</p></li><li><p class="paragraph" style="text-align:left;">Messages from each check are passed through RabbitMQ. 
The result of each check determines if the next check should be run (<i>if the next actor should be created</i>)</p></li><li><p class="paragraph" style="text-align:left;">Final results are sent to Kafka. Consumer applications can read results for storage/analysis from the various Kafka streams.</p></li></ol><p class="paragraph" style="text-align:left;">Some of the tech choices LinkedIn made were</p><ul><li><p class="paragraph" style="text-align:left;"><b>Actor Pattern</b><b> </b>- Each check that LinkedIn has to do is an actor. The checks are done sequentially so they pass messages to each other to send results and status updates.</p></li><li><p class="paragraph" style="text-align:left;"><b>Kafka and RabbitMQ </b>- RabbitMQ is used for communication between the actors whereas Kafka is used for forwarding the final results down to the consumer applications for further processing and storage</p></li></ul><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://www.coinbase.com/blog/how-coinbase-is-using-machine-learning-to-predict?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/bccee9e1-6a15-414e-a0ee-6a32b25cca4c/unnamed__20_.png?t=1731540588"/><div class="embed__content"><p class="embed__title"> How Coinbase uses Machine Learning to Predict Traffic and Autoscale Infrastructure </p><p class="embed__description"> With crypto markets, sudden price movements can cause massive traffic surges on Coinbase.<br><br>Instead of scaling reactively based on CPU usage (which is often too late), Coinbase built an ML model that predicts traffic spikes in advance by analyzing:<br>- price volatility in major cryptocurrencies<br>- current traffic patterns and growth 
rates<br>- historical seasonal trends<br>- load testing data<br><br>This has helped them prevent system outages and reduce costs from over-provisioning resources. </p><p class="embed__link"> www.coinbase.com/blog/how-coinbase-is-using-machine-learning-to-predict </p></div></a></div><div class="embed"><a class="embed__url" href="https://www.seangoedecke.com/how-to-ship/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank"><div class="embed__content"><p class="embed__title"> How I ship projects at big tech companies </p><p class="embed__description"> Sean Goedecke is a Staff Engineer at GitHub. He wrote a terrific article on what it actually means to “ship“ projects at big tech companies.<br><br>Here’s some insights<br>- Shipping isn’t automatic - the default state of most projects is to get delayed indefinitely. Someone needs to take ownership and make sure it gets launched.<br><br>- Shipping is more than just deploying code - A project hasn’t truly shipped until important stakeholders acknowledge it. <br><br>- Deploy Early - Sean recommends deploying features behind flags as early as possible so you can catch issues early.<br><br>Read the full article for more details. 
</p><p class="embed__link"> www.seangoedecke.com/how-to-ship </p></div></a></div><div class="embed"><a class="embed__url" href="https://emschwartz.me/binary-vector-embeddings-are-so-cool/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/8ece7178-e1ff-4b54-85f8-cbf61e88f5bb/Screenshot_2024-11-13_at_6.17.46_PM.png?t=1731539892"/><div class="embed__content"><p class="embed__title"> How Binary Vector Embeddings work and why they’re so useful </p><p class="embed__description"> Vector Embeddings allow you to convert text into numbers that represent meaning. This is super useful for semantic search and similarity matching.<br><br>Traditional embeddings use 32-bit floating point numbers but binary quantization converts each number to a single bit.<br><br>With this approach, you can compress embeddings to just 3% of their original size while retaining 95%+ of the original accuracy.<br><br>Evan Schwartz wrote a fantastic article on this technique. </p><p class="embed__link"> emschwartz.me/binary-vector-embeddings-are-so-cool </p></div></a></div><div class="embed"><a class="embed__url" href="https://www.pointer.io/?utm_source=quastor&utm_medium=crosspromo" target="_blank"><div class="embed__content"><p class="embed__title"> Essential Reading For Engineering Leaders </p><p class="embed__description"> If you find Quastor useful, you should check out Pointer.<br><br>It’s essential reading for engineering leaders to hone their soft skills. 
They send out super high quality engineering-related content twice a week.<br><br>Sign Up for Free!<br><br>(cross promo) </p><p class="embed__link"> www.pointer.io/?utm_source=quastor&utm_medium=crosspromo </p></div><img class="embed__image embed__image--right" src="http://www.pointer.io/static/images/social-og.png"/></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=951773ec-745b-44d4-ada1-c976cdb8b2c2&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How LinkedIn uses Event Driven Architectures to Scale</title>
  <description>An introduction to EDAs and the Actor Model. Plus, how to ship projects at big tech companies, how Coinbase uses ML to predict traffic patterns and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/455e84df-9ef3-45ff-ade2-e9c9623dd02e/ezgif-3-97abac12e7.gif" length="1604502" type="image/gif"/>
  <link>https://blog.quastor.org/p/how-linkedin-uses-event-driven-architectures-to-scale</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-linkedin-uses-event-driven-architectures-to-scale</guid>
  <pubDate>Thu, 14 Nov 2024 14:00:00 +0000</pubDate>
  <atom:published>2024-11-14T14:00:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How LinkedIn uses Event Driven Architectures to Scale their Infrastructure</b></p><ul><li><p class="paragraph" style="text-align:left;"> An Introduction to the Actor Model and how it works</p></li><li><p class="paragraph" style="text-align:left;"> How LinkedIn collects and processes server metrics from their fleet with an Event Driven Architecture</p></li><li><p class="paragraph" style="text-align:left;">LinkedIn’s monitoring system for server consoles</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">How I Ship Projects at Big Tech Companies</p></li><li><p class="paragraph" style="text-align:left;">How Binary Vector Embeddings work and why they’re so useful</p></li><li><p class="paragraph" style="text-align:left;">How Coinbase uses ML to Predict Traffic and Auto-scale Databases</p></li></ul></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXc4AKer6fUop8MY7HQ6c2KSlF0ESztrdXX7kCfyjBhe5twElv9H8aIGEu6qrMA8K9IT8NhVyxz9xFcvK-29eSFrB3J_wXxWF9wn3aaSKhrC2ILsqMZNHyCr7rvGKOCpskIYTaLQaEPXql8wjukmyP8kvcAQ?key=Kc1z5OwJrg6DL2XaUhbRcQ"/></a></div><h1 class="heading" style="text-align:left;" id="how-to-build-ai-agents-that-interac"><a class="link" 
href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial" target="_blank" rel="noopener noreferrer nofollow">How to build AI Agents that interact with Slack and Salesforce in your Product</a></h1><p class="paragraph" style="text-align:left;">Interested in building <a class="link" href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial" target="_blank" rel="noopener noreferrer nofollow">AI agents</a> into your product?</p><p class="paragraph" style="text-align:left;">Your AI agent may need to automate tasks that take place outside your application, in your user’s third-party apps.</p><p class="paragraph" style="text-align:left;">This could be reading data from the third party (<i>checking inventory in Shopify, a ticket’s status in Jira, etc.</i>) and then writing data in those third-party apps (<i>Slack, Salesforce, Jira, etc.</i>)</p><p class="paragraph" style="text-align:left;">That’s where <a class="link" href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial" target="_blank" rel="noopener noreferrer nofollow">AI actions and function tools</a> come into play.</p><p class="paragraph" style="text-align:left;">In this tutorial and demo, you’ll learn how to equip your AI agent with AI actions such as:</p><ul><li><p class="paragraph" style="text-align:left;">Sending a Slack message</p></li><li><p class="paragraph" style="text-align:left;">Creating/Updating a Record in Salesforce</p></li><li><p class="paragraph" style="text-align:left;">Chaining actions together</p></li></ul><p class="paragraph" style="text-align:left;">Read/watch the tutorial and access the 
<a class="link" href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial" target="_blank" rel="noopener noreferrer nofollow">GitHub Repo</a> below.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial"><span class="button__text" style=""> Access the Tutorial and GitHub Repo </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-uses-design-patterns-"><b>How LinkedIn uses Design Patterns to Scale their Infrastructure</b></h1><p class="paragraph" style="text-align:left;">LinkedIn is the largest professional social networking platform in the world with over 950 million users in 200+ countries.</p><p class="paragraph" style="text-align:left;">To serve this user base, they maintain dozens of data centers around the world with hundreds of thousands of servers globally. </p><p class="paragraph" style="text-align:left;">In order to manage these servers, LinkedIn makes use of many tried-and-tested design patterns.</p><p class="paragraph" style="text-align:left;">One pattern is the Producer-Consumer pattern, commonly used in event driven architectures (EDAs).</p><p class="paragraph" style="text-align:left;">This pattern consists of three main components:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Producer</b> - generates events/messages (server metrics, status updates, data from queries, etc.)</p></li><li><p class="paragraph" style="text-align:left;"><b>Queue</b> - acts as a buffer to store messages until they’re ready to be processed. 
LinkedIn uses Redis, Kafka or built-in queues for this.</p></li><li><p class="paragraph" style="text-align:left;"><b>Consumer</b> - reads and processes messages from the queue</p></li></ul><p class="paragraph" style="text-align:left;">Saira Khanum is a Staff Software Engineer at LinkedIn and she wrote a fantastic <a class="link" href="https://www.linkedin.com/blog/engineering/infrastructure/how-design-patterns-power-linkedin-infrastructure?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank" rel="noopener noreferrer nofollow">blog post</a> delving into how the engineering team uses this pattern in three different systems:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">To Collect and Maintain Data from Servers for Real-time and Analytical Queries</p></li><li><p class="paragraph" style="text-align:left;">To Check Servers for Availability and Accessibility</p></li><li><p class="paragraph" style="text-align:left;">To Detect and Fix any Access Policy Violations on the Servers</p></li></ol><p class="paragraph" style="text-align:left;">We’ll explore these and talk about how LinkedIn implemented them.</p><h2 class="heading" style="text-align:left;" id="actor-pattern"><b>Actor Pattern</b></h2><p class="paragraph" style="text-align:left;">When building event driven architectures, LinkedIn frequently uses the <a class="link" href="https://www.brianstorti.com/the-actor-model/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank" rel="noopener noreferrer nofollow"><i>Actor Pattern</i></a>. 
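</p><p class="paragraph" style="text-align:left;">As a rough illustration, the producer/queue/consumer components listed earlier can be sketched in a few lines. This is a hypothetical, in-process Python example - the standard library’s thread-safe queue stands in for Redis or Kafka, and the host names and metric values are made up:</p>

```python
import queue
import threading

# Queue - buffers messages until the consumer is ready (stand-in for Redis/Kafka)
message_queue = queue.Queue()

def producer():
    # Producer - generates events/messages (e.g. server metrics)
    for metric in [{"host": "server-1", "cpu": 0.72}, {"host": "server-2", "cpu": 0.31}]:
        message_queue.put(metric)
    message_queue.put(None)  # sentinel: tells the consumer there's nothing left

def consumer(results):
    # Consumer - reads and processes messages from the queue
    while (msg := message_queue.get()) is not None:
        results.append(f"{msg['host']}: cpu={msg['cpu']:.0%}")

results = []
threads = [threading.Thread(target=producer),
           threading.Thread(target=consumer, args=(results,))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

<p class="paragraph" style="text-align:left;">Because the queue is the only shared state, the producer and consumer can be scaled independently - the same property LinkedIn gets from Redis and Kafka at a much larger scale.</p><p class="paragraph" style="text-align:left;">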
Event Driven Architectures are loosely defined; the Actor Pattern (<i>or Actor Model</i>) is one specific way of implementing an EDA.</p><p class="paragraph" style="text-align:left;">With this model, everything is represented as an <i>actor</i>.</p><p class="paragraph" style="text-align:left;">An actor is an independent entity that can</p><ul><li><p class="paragraph" style="text-align:left;">Send messages to other actors</p></li><li><p class="paragraph" style="text-align:left;">Process messages/requests</p></li><li><p class="paragraph" style="text-align:left;">Create new actors and designate their behavior</p></li><li><p class="paragraph" style="text-align:left;">Have independent state</p></li></ul><p class="paragraph" style="text-align:left;">To give you a better sense of how this might work, here’s a <i>hypothetical</i> example of an Actor model at Uber for handling ride requests.</p><ol start="1"><li><p class="paragraph" style="text-align:left;">When a user first requests a ride, a <b>RequestActor</b> is created specifically for their request. This actor maintains the state of the request (<i>whether it’s active or canceled</i>) and coordinates the entire matching process.</p></li><li><p class="paragraph" style="text-align:left;">The RequestActor might first create a child <b>PricingActor</b> to figure out a reasonable price for the request based on the trip distance and time of day. The PricingActor will run internal logic based on the RequestActor’s message and return the ride price.</p></li><li><p class="paragraph" style="text-align:left;">Once it has the pricing figured out, the RequestActor will communicate with nearby DriverActors (<i>one actor per active driver on Uber</i>) by sending them ride offer messages.</p></li><li><p class="paragraph" style="text-align:left;">The DriverActor will then handle sending a notification to the Uber driver that there&#39;s someone looking for a ride. 
If the driver accepts the ride, the DriverActor might create a new <b>TripActor</b> to handle the ongoing ride (tracking location updates, route changes, payment processing, etc.)</p></li></ol><p class="paragraph" style="text-align:left;">If you’re looking for more details, here’s a <a class="link" href="https://www.brianstorti.com/the-actor-model/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank" rel="noopener noreferrer nofollow">fantastic article</a> that delves deeper into the Actor model.</p><p class="paragraph" style="text-align:left;">Back to LinkedIn…</p><h2 class="heading" style="text-align:left;" id="event-driven-architectures-at-linke"><b>Event Driven Architectures at LinkedIn</b></h2><p class="paragraph" style="text-align:left;">LinkedIn talks about a few systems where they’ve found EDAs useful for managing infrastructure.</p><h3 class="heading" style="text-align:left;" id="distributed-server-queries-at-linke"><b>Distributed Server Queries at LinkedIn</b></h3><p class="paragraph" style="text-align:left;">The first system is LinkedIn’s distributed server query system. This is responsible for collecting system facts (CPU/memory usage, network connections, disk space usage, etc.) from across the server fleet and storing them so they can be queried and analyzed.</p><p class="paragraph" style="text-align:left;">Some of the requirements are</p><ul><li><p class="paragraph" style="text-align:left;"><b>Scale</b> - the system needs to process terabytes of data from hundreds of thousands of servers in near real-time</p></li><li><p class="paragraph" style="text-align:left;"><b>Data Refresh</b> - the data needs to be collected several times every hour</p></li><li><p class="paragraph" style="text-align:left;"><b>Data Maintenance </b>- the last known good snapshot of system facts needs to be maintained for a defined retention period. 
(<i>after the retention period is over, the system facts need to be marked as stale</i>)</p></li></ul><p class="paragraph" style="text-align:left;">Here’s the high level architecture of the system</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdqdrDEyruFDePvPR8ky_7PZBtxfOxX8QfFRlwaNdpl8IdTmWvSDlJJpwwP0rtf6toA9EyGUsLTWUjQaBniSh5h5KJE2XiMtUwAUsVZX3aX9akazZWELyKRqLrd4VSqKEat4oE3Ew?key=_e7itvZYih2R2xyyEuPfoiQw"/></div><ol start="1"><li><p class="paragraph" style="text-align:left;">Agents (producers) are deployed across the server fleet to collect system facts</p></li><li><p class="paragraph" style="text-align:left;">These facts are sent to worker processes (using the Actor Pattern) and stored on Redis</p></li><li><p class="paragraph" style="text-align:left;">Different worker processes consume the data from Redis, process it and store it in different datastores </p></li></ol><p class="paragraph" style="text-align:left;">Some of the choices LinkedIn made were</p><ul><li><p class="paragraph" style="text-align:left;"><b>Redis</b> - LinkedIn picked Redis as the queue since they were looking for low latency. The messages are short-lived and introducing a tool like Kafka would introduce too much overhead.</p></li><li><p class="paragraph" style="text-align:left;"><b>Actor Pattern</b><b> </b>- Workers that collect and process server metrics use the Actor pattern. They’re implemented with <a class="link" href="https://en.wikipedia.org/wiki/Gunicorn?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank" rel="noopener noreferrer nofollow">Gunicorn</a>. </p></li></ul><h3 class="heading" style="text-align:left;" id="server-console-monitoring"><b>Server Console Monitoring</b></h3><p class="paragraph" style="text-align:left;">The second system is LinkedIn’s distributed system to monitor the server console for their infrastructure. 
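</p><p class="paragraph" style="text-align:left;">Before digging in, the first system’s pipeline - agents producing facts, workers with mailboxes consuming them - can be sketched as a toy actor in Python. This is a hypothetical, in-process example: an in-memory queue stands in for Redis, and a plain dict stands in for the downstream datastores:</p>

```python
import queue

class Actor:
    """Toy actor: independent state plus a mailbox of incoming messages."""
    def __init__(self, handler):
        self.mailbox = queue.Queue()  # stand-in for the Redis queue
        self.handler = handler

    def send(self, msg):
        self.mailbox.put(msg)

    def run(self):
        # Process every queued message, one at a time
        while not self.mailbox.empty():
            self.handler(self.mailbox.get())

datastore = {}  # stand-in for the downstream datastores

# Worker actor that consumes collected facts and writes them to the store
store_worker = Actor(lambda fact: datastore.update({fact["host"]: fact}))

# Agents (producers) across the fleet send their system facts as messages
for host in ["server-1", "server-2"]:
    store_worker.send({"host": host, "cpu": 0.5, "disk_free_gb": 120})

store_worker.run()
```

<p class="paragraph" style="text-align:left;">In the real system the mailbox is a Redis queue and the workers are Gunicorn processes, but the shape is the same: each actor owns its state and communicates only through messages.</p><p class="paragraph" style="text-align:left;">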
Server consoles (<i>often called service processors</i>) allow administrators to manage and monitor physical servers remotely (<i>even when the server is powered off or unresponsive</i>). They’re essential for troubleshooting, rebooting and maintaining servers.</p><p class="paragraph" style="text-align:left;">LinkedIn’s monitoring system checks that these server management consoles are available and accessible.</p><p class="paragraph" style="text-align:left;">Here’s the architecture for how they do that.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/9898dc1b-5f7c-453e-a6cf-fef0f3ba9033/Screenshot_2024-11-13_at_5.51.33_PM.png?t=1731538296"/></div><ol start="1"><li><p class="paragraph" style="text-align:left;">Satellite servers run checks across the servers in the data center. Each check is handled by a separate actor.</p></li><li><p class="paragraph" style="text-align:left;">Messages from each check are passed through RabbitMQ. The result of each check determines if the next check should be run (<i>if the next actor should be created</i>)</p></li><li><p class="paragraph" style="text-align:left;">Final results are sent to Kafka. Consumer applications can read results for storage/analysis from the various Kafka streams.</p></li></ol><p class="paragraph" style="text-align:left;">Some of the tech choices LinkedIn made were</p><ul><li><p class="paragraph" style="text-align:left;"><b>Actor Pattern</b><b> </b>- Each check that LinkedIn has to do is an actor. 
The checks are done sequentially so they pass messages to each other to send results and status updates.</p></li><li><p class="paragraph" style="text-align:left;"><b>Kafka and RabbitMQ </b>- RabbitMQ is used for communication between the actors whereas Kafka is used for forwarding the final results down to the consumer applications for further processing and storage</p></li></ul><hr class="content_break"><div class="image"><a class="image__link" href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial" rel="noopener" target="_blank"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXc4AKer6fUop8MY7HQ6c2KSlF0ESztrdXX7kCfyjBhe5twElv9H8aIGEu6qrMA8K9IT8NhVyxz9xFcvK-29eSFrB3J_wXxWF9wn3aaSKhrC2ILsqMZNHyCr7rvGKOCpskIYTaLQaEPXql8wjukmyP8kvcAQ?key=Kc1z5OwJrg6DL2XaUhbRcQ"/></a></div><h1 class="heading" style="text-align:left;" id="how-to-build-ai-agents-that-interac"><a class="link" href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial" target="_blank" rel="noopener noreferrer nofollow">How to build AI Agents that interact with Slack and Salesforce in your Product</a></h1><p class="paragraph" style="text-align:left;">Interested in building <a class="link" href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial" target="_blank" rel="noopener noreferrer nofollow">AI agents</a> into your product?</p><p class="paragraph" style="text-align:left;">Your AI agent may need to automate tasks that take place 
outside your application, in your user’s third-party apps.</p><p class="paragraph" style="text-align:left;">This could be reading data from the third party (<i>checking inventory in Shopify, a ticket’s status in Jira, etc.</i>) and then writing data in those third-party apps (<i>Slack, Salesforce, Jira, etc.</i>)</p><p class="paragraph" style="text-align:left;">That’s where <a class="link" href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial" target="_blank" rel="noopener noreferrer nofollow">AI actions and function tools</a> come into play.</p><p class="paragraph" style="text-align:left;">In this tutorial and demo, you’ll learn how to equip your AI agent with AI actions such as:</p><ul><li><p class="paragraph" style="text-align:left;">Sending a Slack message</p></li><li><p class="paragraph" style="text-align:left;">Creating/Updating a Record in Salesforce</p></li><li><p class="paragraph" style="text-align:left;">Chaining actions together</p></li></ul><p class="paragraph" style="text-align:left;">Read/watch the tutorial and access the <a class="link" href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial" target="_blank" rel="noopener noreferrer nofollow">GitHub Repo</a> below.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://www.useparagon.com/learn/implementing-agentic-actions-with-third-party-integrations/?utm_source=quastor_newsletter&utm_medium=newsletter_sponsorship&utm_content=rag_agentic_tutorial"><span class="button__text" style=""> Access the Tutorial and GitHub Repo </span></a></div><p class="paragraph" style="text-align:left;">sponsored</p><hr class="content_break"><h1 
class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://www.coinbase.com/blog/how-coinbase-is-using-machine-learning-to-predict?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/bccee9e1-6a15-414e-a0ee-6a32b25cca4c/unnamed__20_.png?t=1731540588"/><div class="embed__content"><p class="embed__title"> How Coinbase uses Machine Learning to Predict Traffic and Autoscale Infrastructure </p><p class="embed__description"> With crypto markets, sudden price movements can cause massive traffic surges on Coinbase.<br><br>Instead of scaling reactively based on CPU usage (which is often too late), Coinbase built an ML model that predicts traffic spikes in advance by analyzing:<br>- price volatility in major cryptocurrencies<br>- current traffic patterns and growth rates<br>- historical seasonal trends<br>- load testing data<br><br>This has helped them prevent system outages and reduce costs from over-provisioning resources. </p><p class="embed__link"> www.coinbase.com/blog/how-coinbase-is-using-machine-learning-to-predict </p></div></a></div><div class="embed"><a class="embed__url" href="https://www.seangoedecke.com/how-to-ship/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank"><div class="embed__content"><p class="embed__title"> How I ship projects at big tech companies </p><p class="embed__description"> Sean Goedecke is a Staff Engineer at GitHub. He wrote a terrific article on what it actually means to “ship” projects at big tech companies.<br><br>Here are some insights:<br>- Shipping isn’t automatic - the default state of most projects is to get delayed indefinitely. 
Someone needs to take ownership and make sure it gets launched.<br><br>- Shipping is more than just deploying code - A project hasn’t truly shipped until important stakeholders acknowledge it. <br><br>- Deploy Early - Sean recommends deploying features behind flags as early as possible so you can catch issues early.<br><br>Read the full article for more details. </p><p class="embed__link"> www.seangoedecke.com/how-to-ship </p></div></a></div><div class="embed"><a class="embed__url" href="https://emschwartz.me/binary-vector-embeddings-are-so-cool/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-linkedin-uses-event-driven-architectures-to-scale" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/8ece7178-e1ff-4b54-85f8-cbf61e88f5bb/Screenshot_2024-11-13_at_6.17.46_PM.png?t=1731539892"/><div class="embed__content"><p class="embed__title"> How Binary Vector Embeddings work and why they’re so useful </p><p class="embed__description"> Vector Embeddings allow you to convert text into numbers that represent meaning. This is super useful for semantic search and similarity matching.<br><br>Traditional embeddings use 32-bit floating point numbers but binary quantization converts each number to a single bit.<br><br>With this approach, you can compress embeddings to just 3% of their original size while retaining 95%+ of the original accuracy.<br><br>Evan Schwartz wrote a fantastic article on this technique. 
</p><p class="embed__link"> emschwartz.me/binary-vector-embeddings-are-so-cool </p></div></a></div><div class="embed"><a class="embed__url" href="https://www.pointer.io/?utm_source=quastor&utm_medium=crosspromo" target="_blank"><div class="embed__content"><p class="embed__title"> Essential Reading For Engineering Leaders </p><p class="embed__description"> If you find Quastor useful, you should check out Pointer.<br><br>It’s essential reading for engineering leaders to hone their soft skills. They send out super high quality engineering-related content twice a week.<br><br>Sign Up for Free!<br><br>(cross promo) </p><p class="embed__link"> www.pointer.io/?utm_source=quastor&utm_medium=crosspromo </p></div><img class="embed__image embed__image--right" src="http://www.pointer.io/static/images/social-og.png"/></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=d60ead66-ff40-4c57-828b-669ab2e8db6f&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How Reddit built a Metadata Store that Handles 100k Reads per Second</title>
  <description>We&#39;ll talk about the design of Reddit&#39;s Metadata Store and the tech behind it. Plus, non-llm software trends to be excited about, how Python&#39;s Asyncio works, how GitLab automates engineering management and more.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/ce604d24-0895-423c-807d-8cc038c0d904/Screenshot_2024-05-07_at_4.39.21_PM.png" length="20395" type="image/png"/>
  <link>https://blog.quastor.org/p/how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second-9434</link>
  <guid isPermaLink="true">https://blog.quastor.org/p/how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second-9434</guid>
  <pubDate>Wed, 13 Nov 2024 20:12:00 +0000</pubDate>
  <atom:published>2024-11-13T20:12:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:left;">Today we’ll be talking about</p><ul><li><p class="paragraph" style="text-align:left;"><b>How Reddit built a Metadata store that handles 100k reads per second</b></p><ul><li><p class="paragraph" style="text-align:left;">High level goals of Reddit’s metadata store</p></li><li><p class="paragraph" style="text-align:left;">Picking Sharded Postgres vs. Cassandra</p></li><li><p class="paragraph" style="text-align:left;">Data Migration Process</p></li><li><p class="paragraph" style="text-align:left;">Scaling with sharding and denormalization</p></li></ul></li><li><p class="paragraph" style="text-align:left;"><b>Tech Snippets</b></p><ul><li><p class="paragraph" style="text-align:left;">5 Non-LLM Software Trends To Be Excited About</p></li><li><p class="paragraph" style="text-align:left;">How Python Asyncio Works</p></li><li><p class="paragraph" style="text-align:left;">Writing a File System from Scratch in Rust</p></li><li><p class="paragraph" style="text-align:left;">How GitLab automates engineering management</p></li></ul></li></ul><hr class="content_break"><h1 class="heading" style="text-align:left;" id="how-linked-in-serves-5-million-user"><b>How Reddit built a Metadata store that Handles 100k Reads per Second</b></h1><p class="paragraph" style="text-align:left;">Over the past few years, Reddit has seen their user-base <i>triple</i> in size. They went from 430 million monthly active users in 2019 to 1.2 billion in 2024.</p><p class="paragraph" style="text-align:left;">The good news with all this growth is that they finally IPO’d earlier this year and let employees cash in on their stock options. The bad news is that the engineering team had to deal with a bunch of headaches.</p><p class="paragraph" style="text-align:left;">One issue that Reddit faced was with their media metadata store. 
Reddit is built on AWS and GCP, so they store any media uploaded to the site (<i>images, videos, gifs, etc.</i>) on AWS S3.</p><p class="paragraph" style="text-align:left;">Every piece of media uploaded also comes with metadata. Each media file will have metadata like video thumbnails, playback URLs, S3 file locations, image resolution, etc.</p><p class="paragraph" style="text-align:left;">Previously, Reddit’s media metadata was distributed across different storage systems. To make this easier to manage, the engineering team wanted to create a unified system for managing all this data.</p><p class="paragraph" style="text-align:left;">The high-level design goals were</p><ul><li><p class="paragraph" style="text-align:left;"><b>Single System</b> - they needed a <i>single</i> system that could store all of Reddit’s media metadata. Reddit’s growing quickly, so this system needs to be highly scalable. At the current rate of growth, Reddit expects the size of their media metadata to be 50 terabytes by 2030.</p></li><li><p class="paragraph" style="text-align:left;"><b>Read Heavy Workload </b>- This data store will have a <i>very</i> read-heavy workload. It needs to handle over 100k reads <i>per second</i> with latency less than 50 ms.</p></li><li><p class="paragraph" style="text-align:left;"><b>Support Writes</b> - The data store should also support data creation/updates. 
However, these requests have <i>significantly</i> lower traffic than reads, and Reddit can tolerate higher latency for them.</p></li></ul><p class="paragraph" style="text-align:left;">Reddit wrote a <a class="link" href="https://www.reddit.com/r/RedditEng/comments/1avlywv/the_reddit_media_metadata_store/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank" rel="noopener noreferrer nofollow">fantastic article</a> delving into their process for creating a unified media metadata store.</p><p class="paragraph" style="text-align:left;">We’ll be summarizing the article and adding some extra context.</p><h2 class="heading" style="text-align:left;" id="picking-the-database"><b>Picking the Database</b></h2><p class="paragraph" style="text-align:left;">To build this media metadata store, Reddit considered two choices: Sharded Postgres vs. Cassandra.</p><h3 class="heading" style="text-align:left;" id="postgres"><b>Postgres</b></h3><p class="paragraph" style="text-align:left;">Postgres is one of the most popular relational databases in the world and is consistently voted <i>most loved database</i> in Stack Overflow’s developer surveys.</p><p class="paragraph" style="text-align:left;">Some of the pros of Postgres are:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Battle-Tested</b> - Tens of thousands of companies use Postgres, and there have been countless tests of its performance, scalability and more. Postgres is used (<i>or has been used</i>) at companies like Uber, Skype, Spotify, etc.<br><br>With this, there’s a massive wealth of knowledge around potential issues, common bugs, pitfalls, etc. on forums like Stack Overflow, Slack/IRC, mailing threads and more.</p></li><li><p class="paragraph" style="text-align:left;"><b>Open Source & Community</b> - Postgres has been open source since 1995 with a liberal license that’s similar to the BSD and MIT licenses. 
There’s a vibrant community of developers who help others learn the database and provide support for people with issues. Postgres also has outstanding <a class="link" href="https://www.postgresql.org/docs/current/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank" rel="noopener noreferrer nofollow">documentation</a>.</p></li><li><p class="paragraph" style="text-align:left;"><b>Extensibility & Interoperability</b> - One of the initial design goals of Postgres was extensibility. Over its 30-year history, countless extensions have been developed to make Postgres more powerful. We’ll talk about a couple of Postgres extensions that Reddit uses for sharding.</p></li></ul><h3 class="heading" style="text-align:left;" id="cassandra"><b>Cassandra</b></h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://en.wikipedia.org/wiki/Apache_Cassandra?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank" rel="noopener noreferrer nofollow">Cassandra</a> is a NoSQL, distributed database created at Facebook in 2007. 
The initial project was heavily inspired by <a class="link" href="https://en.wikipedia.org/wiki/Bigtable?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank" rel="noopener noreferrer nofollow">Google Bigtable</a> and also took many ideas from <a class="link" href="https://en.wikipedia.org/wiki/Dynamo_(storage_system)?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank" rel="noopener noreferrer nofollow">Amazon’s Dynamo</a>.</p><p class="paragraph" style="text-align:left;">Here are some characteristics of Cassandra:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Large Scale - </b>Cassandra is completely distributed and can scale to a massive size. It has out-of-the-box support for things like distributing/replicating data in different locations.</p><p class="paragraph" style="text-align:left;">Additionally, Cassandra is designed with a <a class="link" href="https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank" rel="noopener noreferrer nofollow"><i>decentralized architecture</i></a> to minimize any central points of failure.</p></li><li><p class="paragraph" style="text-align:left;"><b>Highly Tunable - </b>Cassandra is highly customizable, so you can configure it for your exact workload. 
You can change how communication between nodes happens (gossip protocol), how data is stored on disk (LSM trees), the consensus required between nodes for writes (consistency level) and <a class="link" href="https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/configuration/configCassandra_yaml.html?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second#configCassandra_yaml__PerformanceTuningProps" target="_blank" rel="noopener noreferrer nofollow">much more</a>.</p></li><li><p class="paragraph" style="text-align:left;"><b>Wide Column - </b>Cassandra uses a wide column storage model, which allows for flexible storage. Data is organized into column families, where each family can have multiple rows with varying numbers of columns. You can read/write large amounts of data quickly and also add new columns without having to do a schema migration.</p></li></ul><p class="paragraph" style="text-align:left;">We did a much more detailed dive on Cassandra that you can read <a class="link" href="https://blog.quastor.org/p/uber-scaled-cassandra-tens-thousands-nodes-a4d4?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank" rel="noopener noreferrer nofollow">here</a>.</p><h2 class="heading" style="text-align:left;" id="picking-postgres"><b>Picking Postgres</b></h2><p class="paragraph" style="text-align:left;">After evaluating both choices extensively, Reddit decided to go with Postgres.</p><p class="paragraph" style="text-align:left;">The reasons included:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Challenges with Managing Cassandra</b> - They found managing Cassandra challenging. 
Ad-hoc queries for debugging and visibility were far more difficult than with Postgres.</p></li><li><p class="paragraph" style="text-align:left;"><b>Data Denormalization Issues with Cassandra</b> - In Cassandra, data is typically denormalized and stored in a way that optimizes specific queries (<i>this depends on your specific application</i>). However, this can lead to challenges when creating new queries that your data hasn’t been specifically modeled for.</p></li></ul><p class="paragraph" style="text-align:left;">Reddit uses AWS, so they went with AWS Aurora Postgres. For more on AWS RDS, you can check out a detailed tech dive we did <a class="link" href="https://blog.quastor.org/p/tech-dive-aws-rds-quastor-pro?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank" rel="noopener noreferrer nofollow">here</a>.</p><h2 class="heading" style="text-align:left;" id="data-migration"><b>Data Migration</b></h2><p class="paragraph" style="text-align:left;">Migrating to Postgres was a big challenge for the team. 
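</p><p class="paragraph" style="text-align:left;">At a high level, this kind of live migration follows a classic dual-write, shadow-read pattern. Here’s a minimal Python sketch of that pattern (the store classes and method names are hypothetical stand-ins, not Reddit’s actual code):</p>

```python
# Minimal sketch of a dual-write / shadow-read migration.
# NOTE: DictStore and the method names are hypothetical stand-ins,
# not Reddit's actual interfaces.

class DictStore:
    """Toy key-value store standing in for a real metadata backend."""
    def __init__(self):
        self.rows = {}

    def put(self, key, value):
        self.rows[key] = value

    def get(self, key):
        return self.rows.get(key)

    def scan(self):
        return list(self.rows.items())


class MetadataMigrator:
    def __init__(self, legacy_store, new_store):
        self.legacy = legacy_store
        self.new = new_store
        self.mismatches = []  # data gaps found by shadow reads

    def write(self, post_id, metadata):
        # Dual writes: new metadata lands in both stores.
        self.legacy.put(post_id, metadata)
        self.new.put(post_id, metadata)

    def backfill(self):
        # Backfill: copy historical rows into the new store.
        for post_id, metadata in self.legacy.scan():
            self.new.put(post_id, metadata)

    def read(self, post_id):
        # Dual reads: serve from legacy, shadow-read the new store,
        # and record any gaps for monitoring before ramping up.
        result = self.legacy.get(post_id)
        if self.new.get(post_id) != result:
            self.mismatches.append(post_id)
        return result


legacy, postgres = DictStore(), DictStore()
legacy.put("t3_old", {"s3_key": "media/old.png"})      # pre-existing row

migrator = MetadataMigrator(legacy, postgres)
migrator.backfill()                                    # copy history
migrator.write("t3_new", {"s3_key": "media/new.png"})  # dual write
migrator.read("t3_old")                                # shadow-read check
```

<p class="paragraph" style="text-align:left;">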
They had to transfer terabytes of data from the different systems to Postgres while ensuring that the legacy systems could continue serving over 100k reads per second.</p><p class="paragraph" style="text-align:left;">Here are the steps they went through for the migration:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Dual Writes</b> - Any new media metadata would be written to both the old systems and to Postgres.</p></li><li><p class="paragraph" style="text-align:left;"><b>Backfill Data</b> - Data from the older systems would be backfilled into Postgres.</p></li><li><p class="paragraph" style="text-align:left;"><b>Dual Reads</b> - Once Postgres had enough data, they enabled dual reads so that read requests were served by <i>both</i> Postgres and the old system.</p></li><li><p class="paragraph" style="text-align:left;"><b>Monitoring and Ramp Up</b> - They compared the results from the dual reads and fixed any data gaps, then slowly ramped up traffic to Postgres until they could fully cut over.</p></li></ol><h2 class="heading" style="text-align:left;" id="scaling-strategies"><b>Scaling Strategies</b></h2><p class="paragraph" style="text-align:left;">With that strategy, Reddit was able to successfully migrate over to Postgres.</p><p class="paragraph" style="text-align:left;">Currently, they’re seeing peak loads of ~100k reads <i>per second</i>. 
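</p><p class="paragraph" style="text-align:left;">A quick note on reading latency figures: they’re reported as percentiles rather than averages, where a PXX value means XX% of requests finished faster than that number. Here’s how percentiles can be computed from raw timings (the numbers below are synthetic, not Reddit’s data):</p>

```python
# Computing percentile latencies from raw request timings.
# The sample values are synthetic, just to show the mechanics.

def percentile(samples, pct):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [1.2, 2.1, 2.6, 3.0, 3.4, 4.1, 4.7, 5.9, 8.8, 17.0]
p50 = percentile(latencies_ms, 50)  # half of requests were faster
p90 = percentile(latencies_ms, 90)
p99 = percentile(latencies_ms, 99)  # all but the slowest 1% were faster
```

<p class="paragraph" style="text-align:left;">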
At that load, the latency numbers they’re seeing with Postgres are:</p><ul><li><p class="paragraph" style="text-align:left;"><b>2.6 ms P50</b> - 50% of requests have a latency lower than 2.6 milliseconds</p></li><li><p class="paragraph" style="text-align:left;"><b>4.7 ms P90</b> - 90% of requests have a latency lower than 4.7 milliseconds</p></li><li><p class="paragraph" style="text-align:left;"><b>17 ms P99</b> - 99% of requests have a latency lower than 17 milliseconds</p></li></ul><p class="paragraph" style="text-align:left;">They’re able to achieve this <i>without</i> needing a read-through cache.</p><p class="paragraph" style="text-align:left;">We’ll talk about some of the strategies they’re using to scale.</p><h3 class="heading" style="text-align:left;" id="table-partitioning"><b>Table Partitioning</b></h3><p class="paragraph" style="text-align:left;">At the current pace of media content creation, Reddit expects their media metadata to reach roughly 50 terabytes by 2030. This means they need to implement sharding and partition their tables across multiple Postgres instances.</p><p class="paragraph" style="text-align:left;">Reddit shards their tables on <code>post_id</code>, using <i>range-based</i> partitioning. 
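</p><p class="paragraph" style="text-align:left;">Range-based partitioning can be sketched in a few lines of Python. The ranges and shard names below are invented for illustration; Reddit’s actual shard layout isn’t public:</p>

```python
# Toy illustration of range-based partitioning on post_id.
# The boundaries and shard names are made up for the example.

SHARD_RANGES = [
    (0, 10_000_000, "shard_2019"),           # oldest posts
    (10_000_000, 20_000_000, "shard_2021"),
    (20_000_000, 30_000_000, "shard_2023"),  # newest posts
]

def shard_for(post_id):
    """Route a post_id to the shard whose range contains it."""
    for low, high, shard in SHARD_RANGES:
        if low <= post_id < high:
            return shard
    raise ValueError(f"no shard covers post_id {post_id}")

# Because post_id increases monotonically, a batch of posts from the
# same time period usually lands on a single shard:
batch = [20_000_001, 20_450_000, 21_000_000]
shards_hit = {shard_for(p) for p in batch}  # a single shard
```

<p class="paragraph" style="text-align:left;">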
All posts with a <code>post_id</code> in a certain range will be put on the same database shard.</p><p class="paragraph" style="text-align:left;"><code>post_id</code> increases monotonically, which means their tables are partitioned by time period.</p><p class="paragraph" style="text-align:left;">Many of their read requests involve batch queries on multiple IDs from the same time period, so this design helps minimize cross-shard joins.</p><p class="paragraph" style="text-align:left;">Reddit uses the <a class="link" href="https://github.com/pgpartman/pg_partman?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank" rel="noopener noreferrer nofollow">pg_partman</a> Postgres extension to manage the table partitioning.</p><h3 class="heading" style="text-align:left;" id="denormalization"><b>Denormalization</b></h3><p class="paragraph" style="text-align:left;">Another way Reddit minimizes joins is through denormalization.</p><p class="paragraph" style="text-align:left;">They took all the metadata fields required for displaying an image post and put them together into a single <a class="link" href="https://www.postgresql.org/docs/current/datatype-json.html?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank" rel="noopener noreferrer nofollow">JSONB field</a>. 
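</p><p class="paragraph" style="text-align:left;">For illustration, here’s roughly what that denormalization looks like. The field names are invented; this isn’t Reddit’s actual schema:</p>

```python
import json

# Invented field names, just to illustrate the denormalization idea --
# this is not Reddit's actual schema.

# Normalized view: metadata scattered across separate records.
media_record = {"s3_key": "media/abc.png", "type": "image"}
thumbnail_record = {"url": "https://thumbs.example/abc.png", "width": 140}
resolution_record = {"width": 1920, "height": 1080}

# Denormalized view: everything needed to render the post, written
# once into a single JSON document (a JSONB column in Postgres).
post_metadata_json = json.dumps({
    "media": media_record,
    "thumbnail": thumbnail_record,
    "resolution": resolution_record,
})

# Rendering a post is now one fetch and one parse instead of
# several lookups stitched together in application code.
rendered = json.loads(post_metadata_json)
```

<p class="paragraph" style="text-align:left;">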
Instead of fetching different fields and combining them, they can just fetch that single JSONB field.</p><p class="paragraph" style="text-align:left;">This made it <i>much</i> more efficient to fetch all the data needed to render a post.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/9d8f7620-4d37-428f-9c27-ab485803f304/Screenshot_2024-05-07_at_2.53.33_PM.png?t=1715108016"/><div class="image__source"><span class="image__source_text"><p>All the metadata needed to render an image post</p></span></div></div><p class="paragraph" style="text-align:left;">It also simplified the querying logic, especially across different media types. Instead of worrying about exactly which data fields you needed, you just fetch the single JSONB value.</p><hr class="content_break"><h1 class="heading" style="text-align:left;" id="tech-snippets"><b>Tech Snippets</b></h1><div class="embed"><a class="embed__url" href="https://read.engineerscodex.com/p/5-non-llm-software-trends-to-be-excited?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank"><img class="embed__image embed__image--top" src="https://substackcdn.com/image/fetch/w_1200,h_600,c_fill,f_jpg,q_auto:good,fl_progressive:steep,g_auto/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d7e0f9b-fc89-4bd6-8c00-c5ddd149a9c9_1477x1098.png"/><div class="embed__content"><p class="embed__title"> 5 Non-LLM Software Trends To Be Excited About </p><p class="embed__description"> With all the AI hype that’s been going around, almost everyone has been completely focused on LLMs. 
However, there’s been a ton of research and advancements in other areas of software engineering.<br><br>Engineer’s Codex published a fantastic blog post talking about some of the other exciting tech in software.<br><br>Some of the advancements discussed are CRDTs, WebAssembly improvements, strides in Cross-Platform development and more. </p><p class="embed__link"> read.engineerscodex.com/p/5-non-llm-software-trends-to-be-excited </p></div></a></div><div class="embed"><a class="embed__url" href="https://blog.carlosgaldino.com/writing-a-file-system-from-scratch-in-rust.html?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank"><div class="embed__content"><p class="embed__title"> Writing a File System from Scratch in Rust </p><p class="embed__description"> This is a really interesting blog post that walks through the process of building your own file system.<br><br>This post explains how a file system structures data on disks with things like superblocks, bitmaps, inodes, data blocks and more.<br><br>It delves into the author’s implementation of a file system called GotenksFS. Key parts covered include initializing the file system image, mounting the file system, on-disk data structures and more. 
</p><p class="embed__link"> blog.carlosgaldino.com/writing-a-file-system-from-scratch-in-rust.html </p></div></a></div><div class="embed"><a class="embed__url" href="https://jacobpadilla.com/articles/recreating-asyncio?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank"><img class="embed__image embed__image--left" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/480a4920-e8b0-4ac3-afbf-8a24d5f0f4dc/Screenshot_2024-05-07_at_4.26.47_PM.png?t=1715113619"/><div class="embed__content"><p class="embed__title"> How Python Asyncio Works </p><p class="embed__description"> This is an excellent deep dive into Python’s asyncio library. It explains how it works under the hood by recreating a simplified version from scratch.<br><br>The post goes through creating a basic event loop and how to build a sleep generator that pauses a task’s execution for a certain duration. </p><p class="embed__link"> jacobpadilla.com/articles/recreating-asyncio </p></div></a></div><div class="embed"><a class="embed__url" href="https://about.gitlab.com/blog/2021/11/16/engineering-managers-automate-their-jobs/?utm_source=blog.quastor.org&utm_medium=newsletter&utm_campaign=how-reddit-built-a-metadata-store-that-handles-100k-reads-per-second" target="_blank"><img class="embed__image embed__image--top" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/asset/file/ce604d24-0895-423c-807d-8cc038c0d904/Screenshot_2024-05-07_at_4.39.21_PM.png?t=1715114369"/><div class="embed__content"><p class="embed__title"> How GitLab automates engineering management </p><p class="embed__description"> Engineering managers at GitLab use a ton of automation scripts to help them manage projects and keep track of issues.<br><br>This is a post by GitLab Engineering with some examples of automation scripts they use to help keep projects on track. 
</p><p class="embed__link"> about.gitlab.com/blog/2021/11/16/engineering-managers-automate-their-jobs </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=662fe219-a5be-49b6-b425-62aed1e947a1&utm_medium=post_rss&utm_source=quastor">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

  </channel>
</rss>
