Hello, World!

I found jobs for you!

Sat, 27 Jul 2024 04:30:00 +0000

Welcome to Hello World, we help software engineers learn a new software engineering concept every week.

You can also check out: What is Load Balancing?

Fullstack Developer, Remote, 90,000/Year USD @Pesto Tech: https://www.linkedin.com/jobs/view/3982424452
Software Engineer 2 @Uber: https://www.linkedin.com/jobs/view/3982949422
Senior Software Engineer @Atlassian: https://www.linkedin.com/jobs/view/3984161464
Software Engineer @Stripe: https://www.linkedin.com/jobs/view/3984915009
Backend Engineer @Jio: https://www.linkedin.com/jobs/view/3984612862
Backend Software Engineer, Remote @Weave: https://www.linkedin.com/jobs/view/3982438724
Software Engineer - 2, Remote @Nivoda: https://www.linkedin.com/jobs/view/3982693016
Backend Developer, Remote @Soul AI: https://www.linkedin.com/jobs/view/3984925230
Backend Engineer, Remote @Stranger Soccer: https://www.linkedin.com/jobs/view/3980713431
Backend Developer, Remote @Weekday: https://www.linkedin.com/jobs/view/3982714721

EP 36: What is Load Balancing?

Mon, 22 Jul 2024 04:30:00 +0000

You can also check out: EP 35: What is Grafana?

In our fast-paced digital world, we expect websites and applications to be quick and always available, no matter how many people are using them at the same time. Have you ever wondered how this is possible? The answer lies in a crucial technique known as load balancing. Let’s dive into load balancing in this blog today.

What Happens Without a Load Balancer?

Before diving into how load balancers work, let's first understand the issues that arise without one through a simple example.

Imagine you have an application running on a single server, and clients connect directly to this server.

Problems Without a Load Balancer

Single Point of Failure
- If the server goes down or encounters an issue, the entire application becomes unavailable. This causes downtime and a poor user experience, which is unacceptable for any service provider.
Overloaded Servers
- A single server has a limit to how many requests it can handle. As the number of users grows, the server can become overloaded, leading to slow performance or crashes.

Why Load Balancers Are Essential

To handle an increasing number of requests, additional servers need to be added. A load balancer distributes the incoming traffic across these servers, preventing any single server from becoming overwhelmed and ensuring continuous availability and smooth performance for users.

Let's understand what a load balancer is with an example.

Imagine you're running a website that's wildly popular and needs to serve millions of users simultaneously. How can you ensure every user has a smooth experience?

Understanding Load Balancers

The User Experience

A single user opens your website on their laptop, sending a request through the internet to your application server. This server processes the request and sends the data back to the user's device. Simple enough if only one person is accessing your site.

But what if 10,000 people access it at once? One server can't handle that load.

Scaling Out with Multiple Servers

Instead of relying on a single server, you scale out by using multiple servers. You might have three, four, ten, or even thousands of application servers to meet the demand.

But how does the user's request know which server to reach? This is where the load balancer comes in.

The Role of a Load Balancer

A load balancer is either a hardware device or a software-defined solution that sits between the user's device and your application servers. It distributes incoming traffic across multiple servers to ensure no single server gets overwhelmed. The load balancer can also monitor server performance and adjust traffic distribution dynamically.

Types of Load Balancers

There are several types of load balancers, each suited for different environments:

Hardware Load Balancers: Physical devices that distribute network traffic.
Known for high performance and reliability but can be costly.
Software Load Balancers: Applications that perform load balancing tasks.
More flexible and cost-effective than hardware solutions.
Cloud Load Balancers: Services provided by cloud platforms like AWS, Google Cloud, and Azure.
They offer scalability and easy integration with cloud applications.

What Are Load Balancing Algorithms?

A load balancing algorithm is a set of rules that a load balancer follows to determine the best server for handling each client request. These algorithms can be categorized into two main types: static and dynamic.

Static Load Balancing

Static load balancing algorithms follow fixed rules and do not consider the current state of the servers.

Round-Robin Method
- Requests are handed to the servers one after another in a fixed, repeating order. A common variant is round-robin DNS: the servers' IP addresses are registered under one name in the Domain Name System (DNS), and the name server returns those addresses in rotating order as clients resolve the name.
Weighted Round-Robin Method
- This method assigns different weights to each server based on their capacity or priority. Servers with higher weights receive more traffic.
IP Hash Method
- The load balancer uses a hashing function on the client's IP address, converting it into a number that maps to a specific server. This ensures the same client IP always routes to the same server.
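
The three static methods above can be sketched in a few lines of JavaScript. This is a minimal illustration, not how any particular load balancer implements them: the server names, the weights, and the simple string hash are all assumptions made for the example.

```javascript
// Hypothetical pool of backend servers (names are illustrative only).
const servers = ["app-1", "app-2", "app-3"];

// Round robin: hand out servers in a fixed, rotating order.
function createRoundRobin(pool) {
  let next = 0;
  return function pick() {
    const server = pool[next];
    next = (next + 1) % pool.length;
    return server;
  };
}

// Weighted round robin (naive form): repeat each server according to its
// weight, then rotate over the expanded list.
function createWeightedRoundRobin(weights) {
  const expanded = [];
  for (const [server, weight] of Object.entries(weights)) {
    for (let i = 0; i < weight; i++) expanded.push(server);
  }
  return createRoundRobin(expanded);
}

// IP hash: turn the client IP into a number, then map it onto the pool.
// The hash function here is a simple illustrative one.
function pickByIpHash(pool, clientIp) {
  let hash = 0;
  for (const ch of clientIp) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // keep as unsigned 32-bit
  }
  return pool[hash % pool.length];
}
```

Calling `createRoundRobin(servers)` four times in a row yields app-1, app-2, app-3, and then app-1 again, while `pickByIpHash(servers, ip)` returns the same server every time for a given `ip`, as long as the pool does not change.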

Dynamic Load Balancing

Dynamic load balancing algorithms assess the current state of servers before distributing traffic.

Least Connection Method
- The load balancer directs traffic to the server with the fewest active connections, assuming all connections require equal processing power.
Weighted Least Connection Method
- This method assigns different capacities to each server. The load balancer sends new requests to the server with the least number of connections relative to its capacity.
Least Response Time Method
- The load balancer combines server response time and the number of active connections to determine the best server, ensuring faster service for users.
Resource-Based Method
- The load balancer analyzes the current load on each server. Software agents running on each server calculate resource usage, such as computing capacity and memory. The load balancer then checks these agents for available resources before distributing traffic.
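
As a rough sketch of the first two dynamic methods, the snippet below picks a server from a snapshot of connection counts. The server names, connection counts, and capacities are made up for the example; a real load balancer would track these values live.

```javascript
// Hypothetical snapshot of server state.
const pool = [
  { name: "app-1", activeConnections: 12, capacity: 10 },
  { name: "app-2", activeConnections: 8, capacity: 5 },
  { name: "app-3", activeConnections: 9, capacity: 20 },
];

// Least connection: choose the server with the fewest active connections.
function pickLeastConnections(servers) {
  return servers.reduce((best, s) =>
    s.activeConnections < best.activeConnections ? s : best);
}

// Weighted least connection: choose the server with the fewest connections
// relative to its capacity.
function pickWeightedLeastConnections(servers) {
  const load = (s) => s.activeConnections / s.capacity;
  return servers.reduce((best, s) => (load(s) < load(best) ? s : best));
}
```

With this snapshot, `pickLeastConnections(pool)` returns app-2 (8 connections), while `pickWeightedLeastConnections(pool)` returns app-3, because 9 connections on a capacity of 20 is the lowest relative load.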

Examples of Load Balancing
Static Load Balancing: Imagine a company has a website with mostly unchanging content. They use identical web servers to evenly handle predictable traffic through a static load balancer.
Dynamic Load Balancing: Now, consider a company that faces unpredictable spikes and drops in website visitors. This could include an e-commerce site during a big sale or a healthcare provider during vaccine appointments. Dynamic load balancing adjusts resources in real-time to ensure smooth access during high-demand periods.

Benefits of Load Balancers:

Increased Performance: Load balancers reduce downtime and improve performance by distributing traffic, preventing server overload.
Scalability: Combined with auto-scaling, servers provisioned during high traffic are added to the load balancer's pool, maintaining performance.
Failure Management: They exclude unhealthy servers from the traffic distribution, ensuring only healthy servers handle requests.
Traffic Prediction: Some software load balancers offer analytics that help anticipate traffic surges so that bottlenecks can be addressed early.
Session Persistence: Load balancers can route all requests from a given user session to the same server (sticky sessions), which is crucial for stateful applications.
Fault Tolerance: They redirect traffic away from failed servers, maintaining service continuity.
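
Failure management and fault tolerance can be illustrated with a round-robin balancer that skips servers marked as unhealthy. This is a simplified sketch: the `markDown` and `markUp` methods are hypothetical names, and in practice a health check (for example, a periodic probe) would call them.

```javascript
// Round-robin balancer that only returns servers currently marked healthy.
function createHealthAwareBalancer(pool) {
  const healthy = new Set(pool);
  let next = 0;
  return {
    // Called when a health check fails for a server.
    markDown(server) { healthy.delete(server); },
    // Called when a server passes its health check again.
    markUp(server) { if (pool.includes(server)) healthy.add(server); },
    // Return the next healthy server in rotation, or null if none is healthy.
    pick() {
      for (let i = 0; i < pool.length; i++) {
        const candidate = pool[(next + i) % pool.length];
        if (healthy.has(candidate)) {
          next = (next + i + 1) % pool.length;
          return candidate;
        }
      }
      return null;
    },
  };
}
```

For a pool of app-1, app-2, and app-3, marking app-2 as down makes `pick()` alternate between app-1 and app-3 until app-2 is marked up again.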

Drawbacks of Load Balancers:

Single Point of Failure: If the load balancer fails, it can disrupt traffic distribution.
Complexity and Cost: Implementation and management can be complex and expensive.
Configuration Challenges: Correctly configuring load balancers can be difficult, especially in complex environments.
Potential Overhead: Depending on the algorithm, there can be latency and processing time overhead.

By now, you should have a clear idea of load balancing, from what it means to how it works. In a nutshell, load balancing is the process of distributing incoming network traffic across multiple servers so that no single server becomes overwhelmed, improving performance and reliability. So, as you navigate through the vast expanse of the internet, stay informed, stay secure, and embrace the adventure of the digital realm!

Congratulations! You've just advanced another step in your tech journey. Keep progressing!

I found jobs for you!

Sat, 20 Jul 2024 04:30:00 +0000

You can also check out: What is Grafana?

Backend Developer @CRED: https://www.linkedin.com/jobs/view/3976301482
Software Engineer, Remote @Alphablocks: https://www.linkedin.com/jobs/view/3975896767
Software Engineer @Stripe: https://www.linkedin.com/jobs/view/3974392084
Software Engineer @Atlassian: https://www.linkedin.com/jobs/view/3976051484
Software Engineer, Hybrid @ShareChat: https://www.linkedin.com/jobs/view/3979033596
Backend Developer, Remote @Soul Ai: https://www.linkedin.com/jobs/view/3976797871
Backend Developer @Visa: https://www.linkedin.com/jobs/view/3969525344
Full Stack Engineer, Remote @Rare Labs: https://www.linkedin.com/jobs/view/3975394885
Software Engineer, Remote @ProductIn: https://www.linkedin.com/jobs/view/3955548537
Software Engineer, Remote @GoDaddy: https://www.linkedin.com/jobs/view/3974784960

EP 35: What is Grafana?

Mon, 15 Jul 2024 04:30:00 +0000

You can also check out: EP 34: What is Kibana?

Have you ever wondered how to monitor and visualize data from different sources in real-time easily? Grafana provides a straightforward solution, allowing you to create customizable dashboards and real-time graphs to track metrics seamlessly across various systems and applications. Let’s understand how Grafana enables quick insights and informed decision-making based on the latest data in this blog today.

What is Grafana?

Grafana is an open-source observability platform designed for visualizing metrics, logs, and traces from various data sources.

In case you don't know what an observability platform is, here is a short definition:

An observability platform collects and analyzes data to give you insights into the health and performance of your systems. It gathers metrics (like CPU usage), logs (detailed event records), and traces (paths requests take through your system) to provide a complete picture of what's happening inside your applications.

When Should Grafana Be Used?

Key Features of Grafana

Customizable Dashboards: Grafana allows you to create interactive and customizable dashboards to visualize data from various sources.
Visualize your data using different types of graphs and charts, including time series, histograms, and heatmaps.
Wide Data Source Support: It supports a wide range of data sources including Prometheus, InfluxDB, Graphite, Elasticsearch, MySQL, PostgreSQL, and many more.
Alerting: Set up alerts to notify you when metrics reach specific thresholds. Grafana supports multiple notification channels such as email, Slack, and PagerDuty.
Templating: Use variables to create dynamic and reusable dashboards. This allows you to easily switch between different data sources and metrics without creating new dashboards from scratch.
Annotations: Add notes directly to your graphs to provide context or highlight significant events.
Plugins: Extend Grafana’s functionality with plugins for additional data sources, panels, and apps.
Flexible Deployment: Host Grafana on-premises or use the managed Grafana Cloud service.

Architecture of Grafana

Let’s look at a simplified example of how the infrastructure for Grafana might be configured:

Data Producer: This component produces the data that you want to visualize. It could be a Jenkins CI server, a Raspberry Pi, a virtual machine in a data center, Kubernetes pods, or various sensors like IoT sensors or weather instrumentation.
Data Source: In this setup, the data source is a database such as Prometheus, InfluxDB, or MySQL. The data source is connected to the data producer.
Depending on the type of database, it will either pull data from the data producer (e.g., sensor data like temperature or weather data) or the data producer will push data to the database.
- For example, Prometheus typically pulls data from the data producer. It does this by reaching out to a dedicated endpoint provided by the data producer, which aggregates and serves the data on a specific URL endpoint. Prometheus then scrapes this data and stores it in its database.
Grafana Server: This is the front-end that visualizes the data. Grafana queries the data source to retrieve the requested data, which is then displayed on a Grafana dashboard.
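
To make the pull model more concrete, here is a small sketch of what a data producer's scrape endpoint might return. It renders readings in the Prometheus text exposition format; the metric names, values, and port are hypothetical, and a real producer would normally use a Prometheus client library rather than hand-rolling the output.

```javascript
// Hypothetical readings from a data producer (for example, a temperature sensor).
const readings = [
  { name: "room_temperature_celsius", help: "Current room temperature.", type: "gauge", value: 21.5 },
  { name: "sensor_reads_total", help: "Number of sensor reads so far.", type: "counter", value: 42 },
];

// Render readings in the Prometheus text exposition format:
// a HELP line, a TYPE line, then the sample itself.
function renderMetrics(metrics) {
  const lines = [];
  for (const m of metrics) {
    lines.push(`# HELP ${m.name} ${m.help}`);
    lines.push(`# TYPE ${m.name} ${m.type}`);
    lines.push(`${m.name} ${m.value}`);
  }
  return lines.join("\n") + "\n";
}

// Request handler the producer could mount on an HTTP server.
// "/metrics" is the conventional path that Prometheus scrapes.
function handleRequest(req, res) {
  if (req.url === "/metrics") {
    res.writeHead(200, { "Content-Type": "text/plain; version=0.0.4" });
    res.end(renderMetrics(readings));
  } else {
    res.writeHead(404);
    res.end();
  }
}

// To serve it for real (port 9100 is an arbitrary choice for this sketch):
// require("http").createServer(handleRequest).listen(9100);
```

Prometheus would be configured to scrape this endpoint at a regular interval, and Grafana would then query Prometheus rather than the producer itself.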

Real-Life Use Cases of Grafana

Infrastructure Monitoring: Companies like DigitalOcean use Grafana to monitor server performance, network traffic, and system health. This ensures they can quickly identify and resolve issues, maintaining high uptime and performance.
Business Metrics: Grafana helps organizations like Bloomberg track key business metrics such as user engagement, revenue, and transaction volumes.
IoT Applications: Grafana is used in IoT projects to monitor sensor data from various devices.
For example, smart city initiatives use Grafana to track environmental data like air quality, temperature, and humidity.
DevOps: Development and operations teams use Grafana to monitor application performance, error rates, and deployment statuses.

Companies using ELK Stack and Grafana:

Netflix: Uses the ELK stack to monitor and analyze customer service operation logs, leveraging Elasticsearch's robust features.
LinkedIn: Integrates the ELK stack with Kafka to monitor performance and security across 100+ clusters.
Tripwire: Utilizes the ELK stack for global SIEM operations and packet log analysis.
Medium: Employs the ELK stack to debug production issues and detect DynamoDB hotspots, supporting 25 million readers and thousands of posts weekly.

Advantages of Grafana

Open Source: Grafana is free to use and has a large, active community contributing to its development and support.
Versatile: Supports a wide range of data sources and can be used across different industries and applications.
User-Friendly: Intuitive interface makes it easy for users of all skill levels to create and manage dashboards.
Real-Time Insights: Provides real-time data visualization and analysis.
Customization: Highly customizable with support for custom plugins and panels.
Scalable: Suitable for both small projects and large-scale deployments.

Disadvantages of Grafana

Complexity: While Grafana is user-friendly, setting up and configuring dashboards for complex environments can be challenging and time-consuming.
Dependency on Data Sources: Grafana relies on data from external sources, so any issues with these sources can impact the accuracy and reliability of the visualizations.
Resource Overhead: Data retrieval can impact the performance of your monitored resources.

Difference between Grafana and Kibana

Since the last post was on Kibana, here is how the two differ:

Aspect	Grafana	Kibana
Cross-Platform Tool	It is a cross-platform tool.	It is not a cross-platform tool.
Support and Working	It supports InfluxDB, AWS, MySQL, PostgreSQL, etc., and it is metrics-based.	It supports Elasticsearch and is log-based.
Syntax	It uses a query editor.	It follows the Lucene syntax.
Full-Text Queries	It does not support full-text queries.	It supports full-text queries.
Alerts	It gives real-time alerts when the data arrives. Users can define alerts visually for important metrics.	It supports alerts but only with the help of plugins.
Usage	This tool is used by applications that require continuous real-time monitoring metrics.	This tool is used for log file analysis and full-text search queries.
Configuration	It is configured via an .ini file.	It is configured via a YAML file.
Speed	It is slow in speed.	It is fast.
Organizations using	9GAG, DigitalOcean, Postmates, etc. use Grafana.	Trivago, Bitbucket, HubSpot, etc. use Kibana.

By now, you should have a clear idea of Grafana, from what it is to how it works. In a nutshell, Grafana is an open-source platform for monitoring and observability, enabling real-time visualization and analysis of metrics from multiple sources.

I found jobs for you!

Sat, 13 Jul 2024 04:30:00 +0000

You can also check out: What is Kibana?

Software Engineer @Google: https://www.linkedin.com/jobs/view/3968094206
Software Engineer 2 @Uber: https://www.linkedin.com/jobs/view/3970913927
Software Developer @Jar: https://www.linkedin.com/jobs/view/3973164903
Software Engineer, Remote @Airbase: https://www.linkedin.com/jobs/view/3974268782
Software Engineer, Remote @Allica Bank: https://www.linkedin.com/jobs/view/3973601659
Software Developer @Jio: https://www.linkedin.com/jobs/view/3968074310
Software Developer 2, Hybrid @Quince: https://www.linkedin.com/jobs/view/3973479592
Backend Developer, Remote @Calyptus: https://www.linkedin.com/jobs/view/3974301382
Software Engineer, Remote @Precisely: https://www.linkedin.com/jobs/view/3974389076
Backend Developer, Remote @Kirana Club: https://www.linkedin.com/jobs/view/3971392964

EP 34: What is Kibana?

Mon, 08 Jul 2024 04:30:00 +0000

You can also check out: EP 33: What is MongoDB?

If you've ever worked with data visualization or explored the world of big data, you might have come across the term Kibana. But have you ever wondered what Kibana can really do for you? Whether you’re a data analyst, a developer, or a business leader, understanding how to leverage Kibana can transform the way you interact with data. Let’s dive into this blog today to see what makes Kibana a powerful tool for visualizing and analyzing data.

What is Kibana?

Kibana is a part of the Elastic Stack (formerly known as ELK Stack), which includes Elasticsearch, Logstash, and Kibana. While Elasticsearch stores and indexes your data, and Logstash processes it, Kibana provides a way to visualize and explore it. It allows users to create and share dynamic dashboards, visualize data trends, and perform powerful searches with an intuitive and user-friendly interface.

In short:

Kibana is a tool for visualizing and exploring data, particularly useful for log and time-series analytics, application monitoring, and operational intelligence.
It features histograms, line graphs, pie charts, heat maps, and geospatial support, integrating seamlessly with Elasticsearch.

What is the ELK Stack?

The ELK Stack is a collection of three open-source tools: Elasticsearch, Logstash, and Kibana, developed by Elastic.

It provides a centralized logging solution to help identify issues with servers or applications, enabling the search of all logs in one place and connecting logs from multiple servers over a specific timeframe.

ELK Stack Architecture

Logs: Identify server logs that need analysis.
Logstash: Collects, parses, and transforms log and event data from various sources.
Elasticsearch: Stores, searches, and indexes the transformed data from Logstash.
Kibana: Accesses data from Elasticsearch and displays it as visualizations like line graphs, bar charts, and pie charts.
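
The flow through the stack can be mimicked with a toy pipeline in plain JavaScript. This is only an analogy under stated assumptions: the log format, the sample lines, and the function names are invented for the example, and none of this uses the actual Logstash, Elasticsearch, or Kibana APIs.

```javascript
// Assumed log format for this sketch: "<timestamp> <LEVEL> <message>".
const LOG_PATTERN = /^(\S+)\s+(INFO|WARN|ERROR)\s+(.*)$/;

// Logstash's role: parse a raw line into a structured event.
function parseLogLine(line) {
  const match = LOG_PATTERN.exec(line);
  if (!match) return null;
  const [, timestamp, level, message] = match;
  return { timestamp, level, message };
}

// Elasticsearch's role (greatly simplified): keep the structured events.
function indexLogs(lines) {
  return lines.map(parseLogLine).filter((event) => event !== null);
}

// Kibana's role: aggregate the stored events for a chart, here a count per level.
function countByLevel(events) {
  const counts = {};
  for (const { level } of events) counts[level] = (counts[level] || 0) + 1;
  return counts;
}

const rawLogs = [
  "2024-07-08T10:00:00Z INFO user logged in",
  "2024-07-08T10:00:05Z ERROR payment failed",
  "2024-07-08T10:00:09Z ERROR payment failed",
  "not a structured line",
];
const events = indexLogs(rawLogs);
const summary = countByLevel(events);
```

Here `summary` holds one INFO event and two ERROR events; the unstructured line is dropped at the parsing step, which is the kind of filtering the collection stage is responsible for.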

Additionally, Beats can be used for lightweight data collection; its addition led to the ELK Stack being renamed the Elastic Stack. For handling large data volumes, tools like Kafka or RabbitMQ may be used for buffering and resilience, with nginx placed in front for security.

Features of Kibana

Visualization: Create charts, graphs, and maps to visualize data easily.
Dashboard: Combine multiple visualizations into a single dashboard for an overview.
Dev Tools: Manage indexes and test queries.
Reports: Export data as CSV or share via URLs.
Filters and Search: Use filters and search queries to refine data views.
Plugins: Extend functionality with third-party plugins.
Coordinate and Region Maps: Visualize data on geographical maps.
Timelion: Perform time-based data analysis and comparisons.
Canvas: Create rich, colorful visualizations with custom designs.

How Kibana is Used

Kibana allows users to search, view, and interact with data stored in Elasticsearch indices. It enables advanced data analysis and visualizations through tables, charts, and maps. There are different methods for performing searches in Kibana:

Free text searches: Search for a specific string.
Field-level searches: Search for a string within a specific field.
Logical statements: Combine searches into a logical statement.
Proximity searches: Search for terms that appear within a specified distance of each other.
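
As an illustration, the four search types might look like the query strings below when Lucene query syntax is selected in Kibana's search bar. The field names (`status`, `method`) and the search terms are hypothetical, and the two helper functions are just a convenience for building such strings.

```javascript
// Example query strings, one per search type (field names are hypothetical).
const exampleSearches = {
  freeText: '"connection timeout"',      // a specific string anywhere in the document
  fieldLevel: "status:404",              // a value within a specific field
  logical: "status:500 AND method:POST", // clauses combined into a logical statement
  proximity: '"payment failed"~3',       // both terms within 3 words of each other
};

// Build a field-level clause, quoting values that contain whitespace.
function fieldQuery(field, value) {
  const text = String(value);
  return /\s/.test(text) ? `${field}:"${text}"` : `${field}:${text}`;
}

// Combine clauses into one logical statement.
function allOf(...clauses) {
  return clauses.join(" AND ");
}
```

For instance, `allOf(fieldQuery("status", 500), fieldQuery("method", "POST"))` produces the same string as `exampleSearches.logical`.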

Real-Life Use Cases of Kibana

Log Analysis: IT departments use Kibana to analyze server logs, identify issues, and improve performance.
Security Monitoring: Security teams monitor threats and vulnerabilities in real time.
Business Analytics: Companies visualize sales, customer data, and market trends.
Application Performance Monitoring: Developers track application performance metrics and troubleshoot issues.
IoT Data Analysis: Industries analyze sensor data from IoT devices for insights and operational efficiency.

Why Log Analysis is Important

In cloud-based environments, performance and reliability are crucial. Log management platforms monitor system performance, process logs, and help in web traffic analysis, application logs, and AWS logs. This helps DevOps engineers and system admins make better decisions.

Advantages of Kibana

Open-source and browser-based, making it accessible and free.
Easy for beginners to use and understand.
Converts visualizations and dashboards into reports effortlessly.
Advanced features like Canvas and Timelion help analyze complex and time-based data.

Disadvantages of Kibana

Adding plugins can be difficult if there are version mismatches.
Upgrading from older versions to new ones can cause issues.
Can require significant system resources, particularly for large data sets.
Relies on Elasticsearch for data storage and processing.

Case Studies

Companies using ELK stack are Netflix, LinkedIn, Tripwire, Medium, Cisco etc.

Netflix: Uses ELK stack to monitor and analyze customer service operations and security logs, indexing and searching documents from multiple clusters.
LinkedIn: Uses ELK to monitor performance and security, integrated with Kafka to handle real-time loads.
Tripwire: Uses ELK for information packet log analysis in their Security Information Event Management system.
Medium: Uses ELK to debug production issues and detect DynamoDB hotspots, supporting millions of unique readers and thousands of posts each week.

Difference between Kibana and Splunk

Aspect	Kibana	Splunk
Market Trends	It is newer in the market compared to Splunk.	It is well-established software.
Set-Up	Set-up is easy and very flexible.	Set-up is quite complex, but it is very powerful with its on-premise/off-premise integration.
Solaris portability	It offers a Solaris portability feature.	It does not offer Solaris portability feature.
Expenses	It is open-source and hence free.	It is licensed and hence charged. It is quite expensive.
Usage	It uses Apache Lucene’s syntax.	It uses its custom-written Search Processing Language.
Security	It offers security but less when compared with Splunk.	It offers extra security to users’ data.
Speed	It is slow when compared to Splunk.	It is fast.
Focus	The focus is mainly on monitoring tools.	The focus is mainly on log analysis.
Interactive	It is highly interactive and its User Interface is quite friendly.	It is not as interactive as Kibana.
Debugging	Debugging is not available.	It provides debugging and troubleshooting support.
Data Formats	It allows data formats like JSON and can be integrated with third parties to send data in the desired format.	It allows any data format like .csv, log files, JSON, etc. It is quite flexible in integrating with other plugins.
Organizations	LinkedIn, Netflix, and StackOverflow are a few organizations that use Kibana.	Bosch, Cisco, and Adobe are a few organizations that use Splunk. (Cisco acquired Splunk in 2024.)

By now, you should have a clear idea of Kibana, from what it is to how it works. In a nutshell, Kibana is an open-source data visualization tool that works with Elasticsearch to create interactive charts, graphs, and dashboards for exploring and analyzing data.

Job Opening

FAANG:

Software Development Engineer-I (FTC) @Amazon:
https://www.linkedin.com/jobs/view/3965912990

Software Engineer @Microsoft:
https://www.linkedin.com/jobs/view/3964923199

Software Engineer, Front End, Google Cloud @Google:
https://www.linkedin.com/jobs/view/3966119376

Others:

Software Engineer @PayPal:
https://www.linkedin.com/jobs/view/3963910540

React UI developer @Deloitte:
https://www.linkedin.com/jobs/view/3967619115

Software Engineer 2 @Intuit:
https://www.linkedin.com/jobs/view/3963364789

I found jobs for you!

Sat, 06 Jul 2024 04:30:00 +0000

You can also check out: What is MongoDB?

Backend Engineer, Hybrid @Rapido: https://www.linkedin.com/jobs/view/3964481481
Software Engineer II, Fullstack @Uber: https://www.linkedin.com/jobs/view/3959050308
Full Stack Engineer @Stripe: https://www.linkedin.com/jobs/view/3962213345
Software Engineer @Fi: https://www.linkedin.com/jobs/view/3957551904
Backend Engineer @Jio: https://www.linkedin.com/jobs/view/3966907130
Backend Engineer 2 @Weekday: https://www.linkedin.com/jobs/view/3966076638
Software Engineer II @Toast: https://www.linkedin.com/jobs/view/3967336052
Software Engineer, Remote @Microsoft: https://www.linkedin.com/jobs/view/3963962549
Back End Developer, Remote @Kavida Ai: https://www.linkedin.com/jobs/view/3966082906
Software Engineer @Google: https://www.linkedin.com/jobs/view/3965258279

EP 33: What is MongoDB?

Mon, 01 Jul 2024 04:30:00 +0000

You can also check out: EP 32: What is PostgreSQL?

Have you ever needed a database that can effortlessly handle large volumes of diverse data, without being constrained by rigid schemas? MongoDB might just be the answer you're looking for. Let’s discuss more about MongoDB in this blog today.

History of MongoDB

MongoDB was launched in 2009 by 10gen (now MongoDB Inc.), addressing limitations of traditional relational databases with its document-oriented, JSON-based storage. It quickly gained popularity for its scalability, speed, and flexibility, becoming a leading NoSQL database solution.

What is MongoDB?

MongoDB is an open-source, document-oriented NoSQL database designed to handle large amounts of data and provide fast performance.

In simple terms:

MongoDB is a type of database that doesn't use tables like traditional databases do. Instead, it stores data in a way that's similar to how you'd organize information in JSON files.
This makes it easy to handle data that doesn't have a fixed structure. MongoDB can find information quickly because any field can be indexed, and it can replicate data across different machines automatically. Plus, it has easy-to-use tools for working with your data.

Below is an example of a JSON-like document in a MongoDB database:

{
  "company_name": "Glich Technologies",
  "address": {
    "street": "1522 MG Road",
    "city": "Bangalore"
  },
  "phone_number": 1234567890,
  "industry": ["education", "technology"],
  "type": "private",
  "number_of_employees": 1200
}

Elements of MongoDB

For example, let's consider a database named "Glich Technologies". This database contains two collections, and these collections contain two documents. The documents store data in the form of fields. The key terms are defined below:

Document: A single unit of data in MongoDB, similar to a row in a SQL table.
Field: A key-value pair within a document that stores a specific piece of data.
Collection: A grouping of MongoDB documents, similar to a table in a SQL database.
Database: A container for collections in MongoDB, similar to a database in a SQL system.
Schema: Defines the structure of documents in MongoDB. Unlike SQL databases, MongoDB does not require predefined schemas for collections.
Index: A data structure that improves the speed of data retrieval in MongoDB, similar to indexes in SQL databases.
Primary Key: A unique identifier for each document in a MongoDB collection, automatically created and indexed by MongoDB.
Denormalization: Storing duplicated data within documents to improve query performance, contrary to the normalization process in SQL databases.
Joins: A method to combine data from multiple collections in MongoDB, available through the $lookup aggregation stage since version 3.2 and extended in version 3.6, with some limitations.
Transactions: Operations on a single document are atomic, so either all or none of the updates to that document succeed. Multi-document transactions were added in MongoDB version 4.0.

CRUD operations in MongoDB

Creating and Inserting Documents: To add a new document, define its structure using JSON-like objects. For instance, to create a user document with name, email, and age:

{ "name": "John Smith", "email": "john.smith@example.com", "age": 32 }

Insert it into MongoDB using insertOne():

db.users.insertOne({ "name": "John Smith", "email": "john.smith@example.com", "age": 32 });

Reading Documents: Retrieve documents with find(). To get all users:

db.users.find();

To filter by age greater than 30:

db.users.find({ "age": { $gt: 30 } });

Updating Documents: Modify a document with updateOne(). Update John Smith's email:

db.users.updateOne( { "name": "John Smith" }, { $set: { "email": "new.email@example.com" } } );

Deleting Documents: Remove a document using deleteOne(). Delete John Smith's entry:

db.users.deleteOne({ "name": "John Smith" });

What Is MongoDB Used For?

Document-Oriented: Stores data as JSON-like documents (BSON internally), allowing flexible and fast retrieval.
Flexible Querying: MongoDB supports dynamic queries by fields within documents, including regular expressions and range queries. This makes it easy to find specific data as needed.
High Speed: Known for its fast performance, even with massive datasets.
Flexible Indexing: Any field within a document can be indexed to enhance search performance, providing additional flexibility in data access.
Language Agnostic: It offers drivers for various programming languages like Node.js, Python, Java, and more. This versatility allows MongoDB to integrate seamlessly with different application environments.
Sharding: MongoDB can horizontally scale by distributing data across multiple servers using a technique called sharding. This ensures high availability and performance by spreading data load efficiently.
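
To make the sharding idea above concrete, here is a minimal Python sketch of hash-based routing. It is an illustration only, not MongoDB's implementation: the shard names, the user_id shard key, and the pick_shard and distribute helpers are all hypothetical, and a real cluster splits the key space into ranges that it rebalances between shards.

```python
import hashlib

# Hypothetical shard names; in a real cluster each shard is usually a replica set.
SHARDS = ["shard-a", "shard-b", "shard-c"]


def pick_shard(shard_key: str, shards=SHARDS) -> str:
    """Map a shard-key value to one shard using a stable hash."""
    digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def distribute(documents, key_field="user_id"):
    """Group documents by the shard that their key hashes to."""
    placement = {name: [] for name in SHARDS}
    for doc in documents:
        placement[pick_shard(str(doc[key_field]))].append(doc)
    return placement


docs = [{"user_id": i, "name": f"user{i}"} for i in range(1000)]
placement = distribute(docs)
for name, stored in placement.items():
    print(name, len(stored))  # each shard ends up with roughly a third of the documents
```

Because the hash is stable, the same key always lands on the same shard, which is what lets a router send a query for one user to a single server instead of all of them.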

Benefits of MongoDB:

Handles diverse types of data (structured, semi-structured, unstructured)
Supports high availability, scalability, and real-time analytics

Use Cases:

Real-time analytics and processing of large datasets
Content management systems requiring flexible data storage (text, images, video)

Companies using MongoDB

eBay, Shutterfly, Confidential Records Inc, Toyota, Paylocity, Verizon, TIM, Marsello, MetLife, Cisco, Zendesk Inc, Brookings Institution, Forbes, Sanoma, Conrad, Helvetia, Intuit, FEMA, etc.

Disadvantages of MongoDB

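ACID Support: Multi-document ACID transactions (atomicity, consistency, isolation, durability) arrived only in version 4.0, and they cost more than single-document writes, so they are not a substitute for good schema design.
Standardization: Lack of standardized practices compared to relational databases.
Scale-out Performance: MongoDB scales out (horizontally) through sharding, while replica sets provide redundancy. Scale-out performance can still be limited for some workloads, for example when a poorly chosen shard key concentrates traffic on a few shards.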

How MongoDB is different from RDBMS?

MongoDB	RDBMS
It is a non-relational and document-oriented database.	It is a relational database.
It is suitable for hierarchical data storage.	It is not suitable for hierarchical data storage.
It has a dynamic schema.	It has a predefined schema.
It centers around the CAP theorem (Consistency, Availability, and Partition tolerance).	It centers around ACID properties (Atomicity, Consistency, Isolation, and Durability).
In terms of performance, it is often faster for simple reads and writes of whole documents.	In terms of performance, it is often slower for those workloads but stronger for join-heavy queries.

By now, you should have a clear idea of what MongoDB is and how it works. In a nutshell, MongoDB is a flexible, scalable, and high-performance NoSQL database that stores data in JSON-like documents, making it ideal for modern web applications.

Congratulations! You've just advanced another step in your tech journey. Keep progressing!

I found jobs for you!

Sat, 29 Jun 2024 04:30:00 +0000

You can also check out: What is PostgreSQL?

Software Engineer @PayPal: https://www.linkedin.com/jobs/view/3959506784
Software Engineer, Remote @Red Hat: https://www.linkedin.com/jobs/view/3958630910
Software Engineer, Remote @Microsoft: https://www.linkedin.com/jobs/view/3960250697
Software Engineer, Hybrid @Observe Ai: https://www.linkedin.com/jobs/view/3959581185
Software Engineer @Intuit: https://www.linkedin.com/jobs/view/3959338104
Backend Engineer @ShopDeck: https://www.linkedin.com/jobs/view/3960343757
Backend Engineer (Web 3), Remote @Calyptus: https://www.linkedin.com/jobs/view/3962655886
Backend Engineer @Signeasy: https://www.linkedin.com/jobs/view/3961630403
Full Stack Engineer @Calyptus: https://www.linkedin.com/jobs/view/3962662151
Backend Engineer @Zamp: https://www.linkedin.com/jobs/view/3959501021

EP 32: What is PostgreSQL?

Mon, 24 Jun 2024 04:30:00 +0000

You can also check out: EP 31: What is RabbitMQ?

How Figma Hacked Postgres Into Scalability

To handle growing data demands, Figma shifted from vertical partitioning to horizontal sharding, splitting data across multiple databases. This nine-month effort improved performance and scalability, ensuring efficient data management and minimal disruptions.

Read the full blog here: How Figma's Databases Team Lived to Tell the Scale | Figma Blog

Have you ever wondered what makes modern applications so reliable and efficient in handling vast amounts of data? Behind the scenes of many data-driven services lies a robust Relational Database Management System (RDBMS). One standout in this field is PostgreSQL, an open-source powerhouse renowned for its reliability, scalability, and advanced features. Whether you're developing a web application, managing a data warehouse, or exploring IoT solutions, PostgreSQL offers a versatile and powerful platform that meets diverse needs. Let’s discover more about PostgreSQL in this blog today.

PostgreSQL is one of the most trusted names in open-source relational databases. Its development started back in 1986 at UC Berkeley under Michael Stonebraker.
Like other relational databases, it stores data in tables with columns and rows and uses SQL (Structured Query Language) to read and write data.

PostgreSQL is technically an object-relational database, meaning:

It lets you create custom data types to store objects with properties.

It supports advanced features like inheritance and polymorphism.

When writing data, PostgreSQL uses fully ACID-compliant transactions to ensure data integrity. It also features Multi-Version Concurrency Control (MVCC), which allows multiple transactions to run simultaneously by giving each one a snapshot of the database, so readers do not block writers and writers do not block readers.
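
The snapshot idea behind MVCC can be sketched in a few lines of Python. This is a toy model under loud assumptions, not PostgreSQL's implementation: the ToyMVCCStore class and its fields are hypothetical, it holds a single row, and it treats every write as committed immediately, whereas the real system also tracks commit status and cleans up old versions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowVersion:
    value: str
    created_by: int                   # id of the transaction that wrote this version
    deleted_by: Optional[int] = None  # id of the transaction that replaced it, if any


class ToyMVCCStore:
    """Single-row toy store: writers append versions, readers pick one by snapshot."""

    def __init__(self, initial: str):
        self.next_txid = 1
        self.versions = [RowVersion(initial, created_by=0)]

    def begin(self) -> int:
        """Start a transaction; in this sketch its id doubles as its snapshot."""
        txid = self.next_txid
        self.next_txid += 1
        return txid

    def write(self, txid: int, value: str) -> None:
        """Mark the newest version as replaced and append a new one (no in-place update)."""
        self.versions[-1].deleted_by = txid
        self.versions.append(RowVersion(value, created_by=txid))

    def read(self, snapshot: int) -> str:
        """Return the version created at or before the snapshot and not yet replaced for it."""
        for version in self.versions:
            created = version.created_by <= snapshot
            replaced = version.deleted_by is not None and version.deleted_by <= snapshot
            if created and not replaced:
                return version.value
        raise LookupError("no visible version")


store = ToyMVCCStore("john.smith@example.com")
reader = store.begin()   # snapshot taken before the update
writer = store.begin()
store.write(writer, "new.email@example.com")
later = store.begin()    # snapshot taken after the update

print(store.read(reader))  # john.smith@example.com
print(store.read(later))   # new.email@example.com
```

The earlier reader keeps seeing the old email without waiting for the writer, which is the behavior the paragraph above describes.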

Developers love PostgreSQL for its extensibility.

You can reuse queries by writing stored procedures, and it supports procedural languages beyond SQL, such as Python and C. PostgreSQL has a robust ecosystem of extensions, such as PostGIS for geospatial data (used by apps like Uber), Citus for scaling and distributing the database, and PGEmbedding for giving AI chatbots long-term memory. To get started, you can download and install PostgreSQL locally.

Why PostgreSQL is Best for You?

Key Features:

User-defined Types: Customize your own data types.
Table Inheritance: Use table inheritance for better data management.
Sophisticated Locking Mechanism: Ensures data consistency.
Foreign Key Referential Integrity: Maintains database integrity.
Views, Rules, Subquery: Powerful query capabilities.
Nested Transactions (Savepoints): Manage transactions more effectively.
Multi-Version Concurrency Control (MVCC): Allows multiple transactions without conflicts.
Asynchronous Replication: For high availability.
Native Microsoft Windows Server Version: Compatible with Windows servers.
Tablespaces: Organize data storage efficiently.
Point-in-time Recovery: Restore data to a specific point in time.

CRUD Operations in PostgreSQL

To add new records to a PostgreSQL table, you first define the table structure.

Create and Insert Records: To create a table and insert records

CREATE TABLE users ( 
  id SERIAL PRIMARY KEY, 
  name VARCHAR (100), 
  email VARCHAR (100), 
  age INT 
); 

INSERT INTO users (name, email, age) VALUES ('John Smith', 'john.smith@example.com', 32);

Read: To retrieve records

SELECT * FROM users;
SELECT * FROM users WHERE age > 30;

Update: To update a record

UPDATE users SET email = 'new.email@example.com' WHERE name = 'John Smith';

Delete: To delete a record

DELETE FROM users WHERE name = 'John Smith';

Architecture

PostgreSQL's architecture is divided into two primary components: the client and the server.

The client sends requests to the server, which processes these requests and sends responses back. Typically, the client and server communicate over a TCP/IP network if they are on different hosts.

Key Components of PostgreSQL Architecture

1. Shared Memory
Shared memory is crucial for efficient data handling and consists of:

Shared Buffers: These minimize disk I/O by holding frequently accessed data. The default size has increased from 32MB in older versions to 128MB in versions 9.3 and later.
WAL Buffers: Temporarily store changes before they are written to WAL (Write Ahead Log) files, aiding in data recovery.
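Work Memory: Allocates memory for sort operations and hash tables during query execution. Unlike the buffers above, it is allocated privately by each server process for each such operation rather than from shared memory. The default has increased from 1MB to 4MB in newer versions.
Maintenance Work Memory: Allocates memory for maintenance tasks like VACUUM and CREATE INDEX, also privately per process, with the default size increasing from 16MB to 64MB.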
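
The write-ahead idea mentioned under WAL Buffers can be illustrated with a small Python sketch: record the change in a log on disk first, apply it in memory second, and replay the log after a crash. This is a toy under stated assumptions, not PostgreSQL's WAL format: the ToyWAL class, the JSON record layout, and the file name are all hypothetical.

```python
import json
import os
import tempfile


class ToyWAL:
    """Key-value store that logs every change to disk before applying it."""

    def __init__(self, path):
        self.path = path
        self.table = {}  # in-memory state, lost if the process dies

    def set(self, key, value):
        # 1. Append the change to the log and force it to persistent storage.
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then apply the change in memory.
        self.table[key] = value

    def recover(self):
        """Rebuild the in-memory state by replaying the log from the start."""
        self.table = {}
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as log:
                for line in log:
                    record = json.loads(line)
                    self.table[record["key"]] = record["value"]
        return self.table


log_path = os.path.join(tempfile.mkdtemp(), "toy.wal")
db = ToyWAL(log_path)
db.set("users:1", "John Smith")
db.set("users:1", "John A. Smith")

# Simulate a crash: a fresh instance starts with empty memory and replays the log.
restarted = ToyWAL(log_path)
print(restarted.recover())  # {'users:1': 'John A. Smith'}
```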

2. Background Processes
PostgreSQL uses several background processes for various tasks:

Background Writer: Gradually writes dirty pages from shared buffers to disk, so less work is left for checkpoints and for the processes serving queries.
WAL Writer: Periodically writes and flushes WAL data to persistent storage.
Logging Collector: Writes log data to files.
Autovacuum Launcher: Manages vacuum operations on tables.
Archiver: Copies WAL log files to a specified directory if archive mode is enabled.
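Stats Collector: Gathers statistics on database activity (a separate process in versions before PostgreSQL 15).
Checkpointer: At each checkpoint, writes all dirty pages from shared buffers to disk so that crash recovery can start from that point in the WAL.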

3. Data Files / Data Directory Structure
PostgreSQL databases are organized into a cluster, with a specific directory structure:

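PG_VERSION: Contains the major version number of PostgreSQL.
base: Holds one subdirectory per database.
global: Contains cluster-wide tables.
pg_xlog: Stores WAL files (renamed pg_wal in PostgreSQL 10).
pg_clog, pg_multixact, pg_notify: Store transaction commit status, multitransaction status, and LISTEN/NOTIFY status data (pg_clog was renamed pg_xact in PostgreSQL 10).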
pg_stat_tmp: Holds temporary statistics files.

Why PostgreSQL is Unique?

Unique Features:

Pioneered MVCC: PostgreSQL was among the first database systems to implement Multi-Version Concurrency Control (MVCC).
Custom Functions: Add custom functions in languages like C/C++, Python, and Java.
Extensibility: Define custom data types, index types, and functional languages.
Custom Plugins: Enhance and modify the system with custom plugins to meet specific needs.

PostgreSQL vs MySQL: A Quick Comparison

Similarities:

Relational Database Management Systems (RDBMS): Both organize data into tables.
SQL Support: Both use Structured Query Language (SQL) for interacting with the database.
JSON Support: Both can store and transport data using JSON (JavaScript Object Notation).

Differences:

Feature	PostgreSQL	MySQL
Type	Object-Relational Database Management System (ORDBMS)	Relational Database Management System (RDBMS)
Complex Queries	Excellent support for complex queries	Good for simpler queries and transactions
Extensibility	Highly extensible with custom functions and types	Limited extensibility
Concurrency Control	Multi-Version Concurrency Control (MVCC)	Combination of MVCC and locking mechanisms
Performance	Strong for complex queries and large datasets	Optimized for read-heavy operations and speed
Scalability	Vertical and some horizontal scalability	Primarily vertical scalability
High Availability	Asynchronous and synchronous replication	Replication available, less robust
Ease of Use	Steeper learning curve due to advanced features	Easy to set up and use

Conclusion:

PostgreSQL: Choose PostgreSQL for enterprise applications that need robust support for complex queries, multiple data types, and high concurrency.
MySQL: Opt for MySQL if you need a fast, easy-to-use database for small to medium-sized web applications.

Common Use Cases of PostgreSQL

1) Robust Database in the LAPP Stack
LAPP stands for Linux, Apache, PostgreSQL, and PHP (or Python and Perl). PostgreSQL is often used as a reliable back-end database for dynamic websites and web applications.

2) General-Purpose Transaction Database
Both large corporations and startups use PostgreSQL as their main database to support various applications and products.

3) Geospatial Database
With the PostGIS extension, PostgreSQL supports geospatial data, making it well suited to geographic information systems (GIS) and location-based apps like Uber.

Language Support

PostgreSQL supports a wide range of popular programming languages, including:

Python
Java
C#
C/C++
Ruby
JavaScript (Node.js)
Perl
Go
Tcl

Large Scale users of PostgreSQL

Several companies have built products and solutions using PostgreSQL. A few of those companies are Apple, Fujitsu, Red Hat, Cisco, and Juniper Networks.

By now, you should have a clear idea of what PostgreSQL is and how it works. In a nutshell, PostgreSQL is a powerful, open-source object-relational database management system known for its advanced features, scalability, and support for complex queries.

Congratulations! You've just advanced another step in your tech journey. Keep progressing!

I found jobs for you!

Sat, 22 Jun 2024 04:30:00 +0000

You can also check out: What is RabbitMQ?

Grow your Engineering Team with LatAm

🌎 CloudDevs is your gateway to 10,000+ vetted LatAm engineers, hand-selected for your project in just 24 hours.

✏️ Our talent is rigorously tested for IQ, tech stack and communication, with 7+ years of experience.

🌤️Try CloudDevs free for 7 days.

Typescript Developer, Remote @Pesto Tech (70K USD/ year): https://www.linkedin.com/jobs/view/3954373577
Software Engineer - Full Stack, Hybrid @LinkedIn: https://www.linkedin.com/jobs/view/3935675839
Software Engineer - Fullstack, Remote @Red Hat: https://www.linkedin.com/jobs/view/3932745489
AI Engineer, Hybrid @BCG X: https://www.linkedin.com/jobs/view/3940383125
Software Engineer, Hybrid @Simpplr: https://www.linkedin.com/jobs/view/3947503567
Back End Developer @Practo: https://www.linkedin.com/jobs/view/3952890197
Nodejs Developer @Pine Labs: https://www.linkedin.com/jobs/view/3948503480
Back End Developer, Remote @Lucio: https://www.linkedin.com/jobs/view/3951642436
Software Engineer @Jio: https://www.linkedin.com/jobs/view/3953168325
Backend Developer, Remote @alle: https://www.linkedin.com/jobs/view/3948339807

EP 31: What is RabbitMQ?

Mon, 17 Jun 2024 04:30:00 +0000

You can also check out: What is Elasticsearch?

How The Internet Travels Across Oceans

99% of all internet traffic, whether it's streaming a video or chatting on WhatsApp, relies on a vast network of undersea cables. It's important to understand this because our daily lives are becoming more reliant on these hidden cables. Surprisingly, they can even face threats from shark attacks.

Have you ever wondered how large-scale applications efficiently manage communication between different services? The answer often lies in message brokers, and RabbitMQ is one of the most popular and robust solutions available. In this blog, we'll explore what RabbitMQ is, how it works, and why it's a critical component in modern software architectures.

Before moving on to “What is RabbitMQ?”, let’s see “Why RabbitMQ and How it came into the picture?”

Let's travel back to the era of monolithic architecture, where application components were tightly coupled, meaning they were directly connected to one another.

In a simple retail application, if a checkout service needed to communicate with an inventory service, it would typically be done directly, often through a TCP connection.

Both services communicate via a TCP connection

But this had some limitations.

The checkout service had to wait for a reply before moving on to the next task.
If the inventory service went down, the checkout would keep trying until the connection was restored.
If too many checkouts happened at once, the inventory service couldn't keep up, causing the system to slow down.

That's why message queues, or message brokers, were created. A message queue sits between the two services that need to communicate with each other.

Message broker sits between the two services that need to communicate with each other

So, with a message queue, a checkout can add a message to the queue and then immediately move on to the next task.

And then similarly, the inventory, when it's ready, can consume from the queue, process the message and then immediately consume the next message.

So, this is going to decouple the two applications.
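
The decoupling described above can be sketched with Python's standard library, using an in-process queue in place of a real broker. This is only an illustration: the checkout and inventory_worker functions and the message fields are hypothetical, and a real message broker runs as a separate service reached over the network.

```python
import queue
import threading

orders = queue.Queue()  # stands in for the message queue between the two services
processed = []          # order ids the inventory service has handled


def checkout(order_ids):
    """Producer: add one message per order and return without waiting for a reply."""
    for order_id in order_ids:
        orders.put({"order_id": order_id})


def inventory_worker():
    """Consumer: take messages one at a time until a None sentinel arrives."""
    while True:
        message = orders.get()
        if message is None:
            orders.task_done()
            break
        processed.append(message["order_id"])
        orders.task_done()


worker = threading.Thread(target=inventory_worker)
worker.start()

checkout([1, 2, 3])  # returns immediately, even if inventory is still busy
orders.put(None)     # tell the worker to stop after draining the queue
orders.join()
worker.join()

print(processed)  # [1, 2, 3]
```

Starting more worker threads that read from the same queue is the in-process equivalent of adding more inventory services to absorb a burst of checkouts.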

Benefits:

A message broker is also going to help with SCALABILITY.

If a lot of checkouts happen at once, the queue fills up, and you can have multiple inventory services reading from the queue to handle the workload, making the system more scalable.

Another major advantage of message queues is that they can run on separate machines, offloading some of the work from the web application and enhancing the system's overall performance.

So, let's talk about RabbitMQ.

What is RabbitMQ?

RabbitMQ is an implementation of the AMQP (Advanced Message Queuing Protocol) message model, version 0-9-1.

In this messaging model, the producer (in our case, the checkout service) sends messages to an exchange instead of directly to a message queue. Think of the exchange as a post office: it receives all the messages and distributes them based on their addresses. An exchange can connect to multiple queues. For our example, we'll connect it to two queues: one for inventory and one for shipping. The checkout sends a message to the exchange, which uses bindings (connections identified by binding keys) to link to the queues. The consuming services, like inventory and shipping, then subscribe to these queues to receive messages.

One great aspect of this message model is its flexibility in how messages move through the system, thanks to the various types of exchanges available.

Types of Exchange:

Fanout Exchange

In a fanout exchange, the checkout service produces a message to the exchange, which then duplicates the message and sends it to every queue it's connected to.

Direct Exchange

Here, the checkout service produces a message with a routing key. The direct exchange then compares this routing key to the binding keys of the queues. If there's an exact match, the message is routed to the corresponding queue.

Topic Exchange

With a topic exchange, messages can be routed based on a partial match between the routing key and the binding key. For example, if the routing key is "ship.shoes" and the binding key is "ship.*" (where * stands for exactly one word), the message will be routed to the corresponding shipping queue.

Header Exchange

With a header exchange, the routing key is ignored entirely, and messages are routed based on the headers instead.

Default Exchange

The default exchange, also known as the nameless exchange, is a direct exchange with an empty name that the broker declares in advance. Every queue is automatically bound to it with a binding key equal to the queue's name, so messages are routed based on the routing key matching the queue's name. For example, if a message has a routing key "inv" and there's a queue named "inv," the message will be routed to that queue.
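
The routing rules for fanout, direct, and topic exchanges can be sketched as a small Python function. This is a simplified model, not RabbitMQ's code: the route and topic_matches helpers and the queue names are hypothetical, headers and default exchanges are left out, and topic matching follows the AMQP 0-9-1 wildcards, where * stands for exactly one word and # for zero or more words.

```python
def topic_matches(binding_key: str, routing_key: str) -> bool:
    """AMQP-style topic match: '*' is exactly one word, '#' is zero or more words."""

    def match(pattern, words):
        if not pattern:
            return not words
        head, rest = pattern[0], pattern[1:]
        if head == "#":
            # '#' either matches nothing, or absorbs one word and stays active.
            return match(rest, words) or (bool(words) and match(pattern, words[1:]))
        if not words:
            return False
        if head == "*" or head == words[0]:
            return match(rest, words[1:])
        return False

    return match(binding_key.split("."), routing_key.split("."))


def route(exchange_type, bindings, routing_key):
    """Return the queues that receive a message; bindings are (queue, binding_key) pairs."""
    if exchange_type == "fanout":
        return [q for q, _ in bindings]  # routing key is ignored
    if exchange_type == "direct":
        return [q for q, key in bindings if key == routing_key]
    if exchange_type == "topic":
        return [q for q, key in bindings if topic_matches(key, routing_key)]
    raise ValueError(f"unsupported exchange type: {exchange_type}")


bindings = [("inventory", "inv"), ("shipping", "ship.*")]
print(route("fanout", bindings, "ship.shoes"))  # ['inventory', 'shipping']
print(route("direct", bindings, "inv"))         # ['inventory']
print(route("topic", bindings, "ship.shoes"))   # ['shipping']
```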

Advantages of RabbitMQ:

Background Processing: Ideal for handling long-running tasks such as processing uploads on web servers, where users can continue with additional tasks while processing occurs in the background.
Control: Developers can define how messages move through the system using message metadata, rather than relying solely on broker administrators.
Cross-Language Communication: Messages produced in one language can be consumed by applications in different languages, enhancing versatility.
Message Acknowledgements: Ensures message delivery by keeping messages in the queue until receipt is acknowledged by consumers, preventing loss.
Microservices Communication: Acts as a middleman for communication between microservices, enabling seamless message passing and avoiding bottlenecks.
Priority Queueing: Supports priority queues for tasks like batch processing, allowing immediate processing of high-priority tasks rather than waiting for scheduled jobs.

Limitations:

Message Persistence: Messages are removed from the queue once they are delivered and acknowledged, unlike systems like Apache Kafka where messages are retained for a configured retention period and can be read again later, potentially limiting future insights.
Vertical Scaling: RabbitMQ scales vertically, requiring more powerful hardware for increased throughput, unlike Kafka which scales horizontally by adding more machines.
Traditional Messaging: Designed for traditional messaging rather than streaming, limiting its ability to handle massive volumes of data like Kafka.
Memory Management: Stores messages in memory until space is exhausted, impacting performance, whereas Kafka is designed for wider scalability and can handle trillions of messages.
High Throughput Limitation: Not optimized for very high throughput in the way Kafka is, because the broker tracks delivery and acknowledgement for individual messages rather than handing consumers large batches from a log.

Real-World Use Cases of RabbitMQ

Companies that use RabbitMQ include Amazon, Microsoft, Apple, Deloitte, Google, Citi, LinkedIn, Credit Suisse, Robinhood, Reddit, and more.

Alternatives to RabbitMQ

Kafka: Great for handling large volumes of data and real-time processing.
ActiveMQ: Reliable and supports multiple messaging protocols.
SQS (Simple Queue Service): Message queuing service provided by AWS.
Red Hat AMQ: Messaging platform from Red Hat, based on ActiveMQ.
ZeroMQ: Lightweight and designed for high-performance messaging.

By now, you should have a clear idea of what RabbitMQ is and how it works. In a nutshell, RabbitMQ is an open-source message broker that enables reliable, scalable, and asynchronous communication between distributed applications.

Congratulations! You've just advanced another step in your tech journey. Keep progressing!

I found jobs for you!

Sat, 15 Jun 2024 04:30:00 +0000

You can also check out: What is Elasticsearch?

Software Developer 2, Remote @Apna: https://www.linkedin.com/jobs/view/3838857556
Backend Engineer, Remote @Weekday: https://www.linkedin.com/jobs/view/3949254449
Senior Backend Developer, Remote @Typo: https://www.linkedin.com/jobs/view/3905998008
Senior Software Engineer @Attri: https://www.linkedin.com/jobs/view/3942498056
Software Engineer, Hybrid @LinkedIn: https://www.linkedin.com/jobs/view/3935675839
Software Engineer @Coinbase: https://www.linkedin.com/jobs/view/3925559556
Software Engineer @Google: https://www.linkedin.com/jobs/view/3941965753
Founding Engineer @p0: https://www.linkedin.com/jobs/view/3646815481
Software Engineer @Boston Consulting Group: https://www.linkedin.com/jobs/view/3940379462
Full Stack Engineer, Hybrid @GoDaddy: https://www.linkedin.com/jobs/view/3939474380

EP 30: What is Elasticsearch?

Mon, 10 Jun 2024 04:30:00 +0000

You can also check out: EP 29: What is Cassandra?

Discover how Uber Migrates 1 Trillion Records from DynamoDB to LedgerStore to Save $6 Million Annually

Read the article at: Uber Migrates 1 Trillion Records from DynamoDB to LedgerStore to Save $6 Million Annually - InfoQ

Imagine you're running a massive library with millions of books. Visitors come in, asking for specific books or topics, and they expect to find what they need instantly. To manage this efficiently, you need a powerful search system that can quickly sift through vast amounts of information and provide accurate results. This is exactly what Elasticsearch does for data.

Elasticsearch is a distributed, open-source search and analytics engine designed for handling large volumes of data in real-time. Whether you're searching through text documents, logs, metrics, or spatial information, Elasticsearch can provide lightning-fast search capabilities.

When people ask, “What is Elasticsearch?”, you might hear responses like “an index”, “a search engine”, “an analytics database”, “a big data solution”, “fast and scalable”, or “it’s kind of like Google”. Depending on your familiarity with the technology, these answers might either clarify things or add to the confusion. The reality is, all of these descriptions are accurate, which is part of what makes Elasticsearch so appealing.

So, what is Elasticsearch?

In simple terms, Elasticsearch is a distributed database where data is stored as JSON documents. Distributed means the database can run on multiple nodes at the same time.

Say you have an Elasticsearch cluster running on four nodes. The database is then running on four servers, with the data distributed among them. You can also replicate the data, and you can add more nodes in the future, so the database is horizontally scalable.

Analogy to Relational database

Index is a collection of documents. It is like an RDBMS table.

RDBMS	Elasticsearch
Table	Index
Row	JSON Document
Columns in a row	attributes of JSON document

To implement an employee table in Elasticsearch, you create an Employee index and insert each employee's data as a JSON document. The first employee's data is stored in the first JSON document, the second employee's data in the second, and so on. The collection of these JSON documents (three in this example) is the index, which plays the role of a database table. This is how data is stored in Elasticsearch.

Why Elasticsearch is Popular?

According to a survey from StackShare, the following are the top reasons why developers and companies choose Elasticsearch:

Distributed
Horizontally scalable
Elasticsearch is designed for faster queries
Powerful API
Great search engine
Open source
Restful
Near real-time search
Free
Search everything
Easy to get started
Analytics

Elasticsearch stores data in a form that is optimized for queries, which is why queries stay fast even with huge amounts of data. To do this, Elasticsearch uses a data structure called an inverted index.

Let's demonstrate with a simple example.

Suppose there is a table with 8 documents, each with an attribute called geoscope ID. To search for Bangalore in a conventional store, you would go through all the documents, or build some kind of index first, and the search would return 1, 2, and 7, the IDs of the documents that contain the word Bangalore.

Elasticsearch stores the data the other way around: it records that Bangalore is present in documents 1, 2, and 7, that Mumbai is present in document 3, and so on. When a search query such as “Where is Bangalore?” comes in, it can answer instantly with the matching documents.

In short, Elasticsearch uses a data structure called an inverted index, which is the reason it is fast.
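
The inverted index from the example above can be built in a few lines of Python. This is a minimal sketch rather than how Elasticsearch implements it: the build_inverted_index helper is hypothetical, and only the Bangalore (documents 1, 2, 7) and Mumbai (document 3) entries come from the example; the other cities are made-up filler.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Eight documents keyed by ID; values other than Bangalore and Mumbai are hypothetical.
documents = {
    1: "Bangalore", 2: "Bangalore", 3: "Mumbai", 4: "Delhi",
    5: "Chennai", 6: "Pune", 7: "Bangalore", 8: "Kolkata",
}

index = build_inverted_index(documents)
print(index["bangalore"])  # [1, 2, 7]
print(index["mumbai"])     # [3]
```

A search then becomes a single dictionary lookup on the term instead of a scan over every document.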

How Elasticsearch Works

Let’s break down some of its core components and concepts:

Documents:
- Basic units of data in Elasticsearch, similar to rows in a database.
- Stored in JSON format.
- Represent entities like a product, a log entry, or an article.
Indices:
- Collections of documents with similar characteristics.
- Similar to a table in a relational database.
- Identified by a unique name, allowing operations like search and update.
Inverted Index:
- Core mechanism used for searching.
- Maps words to the documents they appear in, enabling quick look-ups.
- Splits documents into individual terms and creates a map from terms to documents.
Cluster:
- A group of interconnected nodes (servers) that work together.
- Distributes tasks like searching and indexing across all nodes.
Nodes:
- Individual servers within a cluster.
- Different types:
  - Master Node: Manages cluster operations like creating indices.
  - Data Node: Stores data and performs search and aggregation.
  - Client Node: Routes requests to the appropriate nodes (called a coordinating node in current Elasticsearch versions).
Shards:
- Subdivisions of an index, each a fully functional "mini-index."
- Distributed across nodes to enhance redundancy and query capacity.
Replicas:
- Copies of primary shards.
- Provide data redundancy and improve read capacity.

In a nutshell: Elasticsearch indexes documents using an inverted index for efficient search. It scales by distributing data across clusters, nodes, and shards, with replicas ensuring data redundancy and high availability.
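
How shards and replicas spread over nodes can be sketched in Python as follows. This is a simplified model under stated assumptions, not Elasticsearch's allocation logic: shard_for and place_shards are hypothetical helpers, crc32 is a stand-in hash, and the round-robin placement only illustrates the rule that a replica is never stored on the same node as its primary.

```python
import zlib


def shard_for(doc_id: str, num_primary_shards: int) -> int:
    """Pick a primary shard for a document: hash of its id modulo the shard count."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def place_shards(nodes, num_primary_shards, replicas_per_shard=1):
    """Round-robin layout that keeps each replica off its primary's node."""
    if replicas_per_shard >= len(nodes):
        raise ValueError("need more nodes than replicas per shard")
    layout = {}
    for shard in range(num_primary_shards):
        primary = nodes[shard % len(nodes)]
        replicas = [
            nodes[(shard + offset) % len(nodes)]
            for offset in range(1, replicas_per_shard + 1)
        ]
        layout[shard] = {"primary": primary, "replicas": replicas}
    return layout


nodes = ["node-1", "node-2", "node-3", "node-4"]
layout = place_shards(nodes, num_primary_shards=4, replicas_per_shard=1)
for shard, copies in layout.items():
    print(shard, copies)

print("emp-1001 is stored in shard", shard_for("emp-1001", 4))
```

If node-1 fails in this layout, the replica of shard 0 on node-2 still holds the data, which is the redundancy the summary above refers to.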

Primary Use Cases for Elasticsearch

Application Search: Enhances data access, retrieval, and reporting in applications.
Website Search: Enables effective and accurate searches on content-heavy websites.
Enterprise Search: Facilitates enterprise-wide searches, including documents, products, blogs, and more.
Logging and Log Analytics: Analyzes log data in near-real-time for operational insights.
Infrastructure Metrics and Container Monitoring: Gathers and analyzes performance data for various use cases.
Security Analytics: Analyzes access logs and system security data in real-time.
Business Analytics: Utilizes built-in features for business insights, though it has a steep learning curve; platforms like Knowi offer easier alternatives.

Company Use Cases

Netflix: Uses Elasticsearch for monitoring and analyzing customer service operations and security logs.
eBay: Employs Elasticsearch for text search and analytics with an 'Elasticsearch-as-a-Service' platform.
Walmart: Utilizes Elasticsearch for real-time insights into customer patterns and store performance.

By now, you should have a clear idea of what Elasticsearch is and how it works. In a nutshell, Elasticsearch is a distributed, open-source search and analytics engine designed to handle large volumes of data in real-time.

Congratulations! You've just advanced another step in your tech journey. Keep progressing!

I found jobs for you!

Sat, 08 Jun 2024 06:12:00 +0000

You can also check out: What is Cassandra?

Instantly calculate the time you can save by automating compliance

Whether you’re starting or scaling your security program, Vanta helps you automate compliance across frameworks like SOC 2, ISO 27001, ISO 42001, HIPAA, HITRUST CSF, NIST AI, and more.

Plus, you can streamline security reviews by automating questionnaires and demonstrating your security posture with a customer-facing Trust Center, all powered by Vanta AI.

Instantly calculate how much time you can save with Vanta.

Senior Product Engineer, Remote @Pesto Tech ($120k / year): https://www.linkedin.com/jobs/view/3943525909
Software Engineer II @Postman: https://www.linkedin.com/jobs/view/3942710542
Node + AWS Developer @Deloitte: https://www.linkedin.com/jobs/view/3943404165
Fullstack Developer @Airbus: https://www.linkedin.com/jobs/view/3943956568
Software Engineer II @Uber: https://www.linkedin.com/jobs/view/3943155161
Software Engineer, Remote @Clipboard Health: https://www.linkedin.com/jobs/view/3941318172
Software Engineer @Coinbase: https://www.linkedin.com/jobs/view/3925559556
Software Engineer (Frontend), Remote @Coursera: https://www.linkedin.com/jobs/view/3931399834
Software Engineer @PayPal: https://www.linkedin.com/jobs/view/3939804508
Engineer I @American Express: https://www.linkedin.com/jobs/view/3939493396

EP 29: What is Cassandra?

Mon, 03 Jun 2024 04:30:00 +0000

You can also check out: What is Kafka?

How Uber Manages a Million Writes Per Second

Discover how Uber manages a million writes per second by using a combination of Apache Mesos and Cassandra across multiple data centers.

Mesos provides efficient resource management and scheduling for Uber's infrastructure, allowing it to handle large-scale, distributed workloads.

Cassandra, a highly scalable NoSQL database, ensures data is written and replicated across multiple data centers, providing high availability and fault tolerance.

This architecture enables Uber to manage vast amounts of real-time data efficiently, supporting their global operations with resilience and scalability.

planetcassandra.org/leaf/how-uber-manages-a-million-writes-per-second-using-mesos-and-cassandra-across-multiple-datacenters-high-scalability

History of Cassandra

Cassandra was originally developed at Facebook, open-sourced in 2008, and later became an Apache project. It gained popularity quickly, and by 2010, it was one of the top NoSQL database systems. Known for its scalability and reliability, Cassandra is used by companies like Netflix, Twitter, and Reddit.

From SQL to NoSQL: Why NoSQL Was Invented

For many years, relational database management systems (RDBMS) were the standard. However, the explosion of data driven by tech giants like Apple, Facebook, and Instagram created challenges that RDBMS couldn't handle. The sheer volume, speed, and variety of new data exceeded their capabilities.

To address these challenges, NoSQL databases were invented. They were designed to manage large amounts of data, handle fast data processing needs, and accommodate various data types and relationships.

Different types of NoSQL databases include:

Time-series databases (e.g., Prometheus): Optimized for time-stamped data.
Document databases (e.g., MongoDB): Store data in document formats like JSON.
Graph databases (e.g., DataStax Graph): Focus on data relationships.
Ledger databases (e.g., Amazon QLDB): Designed for immutable and cryptographically verifiable records.
Key/value databases (e.g., Amazon DynamoDB): Simple storage systems using key-value pairs.

How does Cassandra work?

In Cassandra, all servers (nodes) are treated equally. Unlike traditional systems where one server leads and others follow, Cassandra uses a peer-to-peer setup. This means data is spread across multiple nodes in a cluster (or data center), reducing the risk of a single point of failure.

Nodes and Gossip

Each node in Cassandra holds a chunk of data and can store a few terabytes. These nodes constantly communicate with each other, sharing updates about their status. This process is called "gossiping." If one node fails, another node takes over, ensuring the system keeps running without downtime.

Data Replication

Cassandra keeps your data safe by copying it across multiple nodes. This is called replication. The replication factor (RF) determines how many copies of your data are stored. For example:

RF = 1 means each piece of data is stored on one node.
RF = 2 means each piece of data is stored on two nodes.
RF = 3 (which is standard) means each piece of data is stored on three nodes.

This way, even if one node goes down, the data is still available on other nodes.
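To make the replication factor concrete, here is a minimal Python sketch. It is not Cassandra's actual placement logic (a real cluster uses token ranges and a replication strategy); the helper names `replicas_for` and `surviving_replicas` and the node names are made up for illustration. It simply picks RF distinct nodes for a key and shows how many copies survive a node failure.

```python
import hashlib


def replicas_for(key: str, nodes: list[str], rf: int) -> list[str]:
    """Pick rf distinct nodes for a key by walking a fixed node order.

    Simplified illustration only: real Cassandra places replicas using
    token ranges and a configurable replication strategy.
    """
    if not 1 <= rf <= len(nodes):
        raise ValueError("rf must be between 1 and the number of nodes")
    # Use a stable hash; Python's built-in hash() changes between runs.
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]


def surviving_replicas(key: str, nodes: list[str], rf: int, failed: set[str]) -> list[str]:
    """Return the replicas of a key that are still reachable."""
    return [n for n in replicas_for(key, nodes, rf) if n not in failed]


if __name__ == "__main__":
    cluster = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical nodes
    copies = replicas_for("order-42", cluster, rf=3)
    print("replicas:", copies)
    # Knock out the first replica and see what is left.
    print("after failure:", surviving_replicas("order-42", cluster, 3, {copies[0]}))
```

With RF = 1, losing the single replica makes the key unreadable; with RF = 3, two copies remain after one failure, which is why three is a common choice.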

The CAP theorem: is Cassandra AP or CP?

The CAP theorem is like a rulebook for databases in tough situations. It says that a distributed database can't guarantee all three of consistency, availability, and partition tolerance at once. In practice, network partitions can't be ruled out, so when one happens the database has to choose between staying consistent and staying available.

Consistency means making sure you always get the latest information when you ask for it. If one server tells you something old while others have new data, that's a problem.
Availability is about keeping the system running smoothly. Even if some servers crash, you should still get a response when you ask for data.
Partition tolerance is being able to survive when parts of the system can't talk to each other. If some servers lose connection with others, the system should keep going.

Cassandra is known as an "AP" system, which means it leans more towards availability and partition tolerance, even if it means sometimes the data might not be perfectly up-to-date. But the cool thing about Cassandra is that you can adjust how much you care about consistency versus availability based on what your needs are. So, you have some flexibility to tweak things to fit your situation just right.
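The "adjustable" part usually comes down to simple arithmetic. Cassandra lets you choose per request how many replicas must acknowledge a read or a write (consistency levels such as ONE, QUORUM, and ALL). A common rule of thumb is that a read is guaranteed to overlap with the latest successful write when the read and write replica counts add up to more than the replication factor. The sketch below only does that arithmetic; the function names are illustrative and it does not talk to a real cluster.

```python
def quorum(rf: int) -> int:
    """Number of replicas in a quorum: a strict majority of rf."""
    return rf // 2 + 1


def reads_see_latest_write(rf: int, write_replicas: int, read_replicas: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > rf


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    # QUORUM writes + QUORUM reads: 2 + 2 > 3, so reads overlap writes.
    print("quorum/quorum:", reads_see_latest_write(rf, q, q))
    # ONE + ONE: 1 + 1 is not > 3, so a read may hit a stale replica.
    print("one/one:", reads_see_latest_write(rf, 1, 1))
```

Lower counts favor availability and latency; higher counts favor consistency. That is the dial the paragraph above is describing.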

How does Cassandra structure and distribute data?

Cassandra is designed to handle and distribute large amounts of data across many servers without any downtime. Each server (or node) in a Cassandra cluster knows how data is distributed. This means your application can contact any node and quickly get the data it needs.

Key-Based Partitioning

Cassandra organizes data using key-based partitioning. Here’s a breakdown of its components:

Keyspace: This is like a schema in a relational database, containing multiple tables.
Table: A collection of columns and rows. Data in tables is stored in partitions.
Partition: A set of rows with the same partition key.
Row: A single data entry in a table.

Storing Data in Partitions

Data in Cassandra is stored in partitions, which are groups of rows. Each row has a partition key, which is hashed to decide how data is spread across the nodes in the cluster. This makes it easy to manage and access large datasets because the data is split into manageable chunks.

Why Partitioning is Important

Partitioning makes it easier to scale Cassandra.

Big Data can't fit on a single server, so it's divided into pieces and spread across many servers. If you need more storage or processing power, you can add more nodes, and Cassandra automatically redistributes the data.

Token Ranges and Scaling

When you create a table, you set a partition key. A partitioner hashes this key into tokens, and each node gets a range of these tokens. This process helps distribute data evenly. If you add a new node, Cassandra adjusts the token ranges and redistributes the data. You can also remove nodes without causing problems.
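Here is a small Python sketch of the token idea, under some loud assumptions: it uses MD5 as a stand-in hash, gives each node a single token, and the `Ring` class and node names are invented for this example. Real Cassandra uses its own partitioner and typically assigns many tokens per node. The point is only to show that each key lands on the node owning its token range, and that adding a node moves just the keys in the range the new node takes over.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Hash a string to a token (MD5 here, purely for illustration)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class Ring:
    """A toy token ring: each node owns the range ending at its token."""

    def __init__(self, nodes: list[str]) -> None:
        self._tokens = sorted((token(n), n) for n in nodes)

    def add_node(self, node: str) -> None:
        bisect.insort(self._tokens, (token(node), node))

    def owner(self, partition_key: str) -> str:
        # First node whose token is >= the key's token, wrapping around.
        idx = bisect.bisect_left(self._tokens, (token(partition_key), ""))
        return self._tokens[idx % len(self._tokens)][1]


if __name__ == "__main__":
    ring = Ring(["node-a", "node-b", "node-c"])
    keys = [f"user-{i}" for i in range(1000)]
    before = {k: ring.owner(k) for k in keys}
    ring.add_node("node-d")
    moved = [k for k in keys if ring.owner(k) != before[k]]
    print(f"{len(moved)} of {len(keys)} keys moved to the new node")
```

Every key that changes owner moves to the new node; keys in the other token ranges stay where they were, which is what keeps scaling cheap.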

Designing Partitions

Data architects must design partitions carefully to ensure queries run quickly and accurately. Once you set a primary key for a table, it can’t be changed. If changes are needed, you must create a new table and move the data to it.

What Makes Cassandra So Powerful?

Cassandra is sometimes called the "Lamborghini" of NoSQL databases because of its performance and scalability.

It's a peer-to-peer system without a single leader node, which means any node can handle read or write requests, ensuring no single point of failure.

Key Features of Cassandra:

Big Data Ready: Can handle petabyte-scale data through its distributed architecture. Just add more nodes for more capacity.
High Performance: A single node is powerful, but a cluster of nodes in multiple data centers boosts throughput significantly.
Linear Scalability: No limits on data volume or speed. Cassandra grows with your needs without extra overhead.
Cassandra Query Language (CQL): CQL resembles SQL but is optimized for working with table-based data. It combines tabular database management with key-value operations, offering a familiar syntax for administrators.
Fault Tolerance: Cassandra achieves fault tolerance through data replication, storing multiple copies of data across different nodes. This redundancy ensures high availability and resilience to node failures or data center outages, enhancing backup and recovery capabilities.
High Availability: Designed for 100% uptime with data replication and a decentralized structure.
Self-Healing and Automation: Automatically manages scaling, data replacement, and recovery, reducing operational headaches.
Geographical Distribution: Supports multi-data center deployments, ensuring disaster tolerance and proximity to clients worldwide.
Platform Agnostic: Works with any platform or service provider, ideal for hybrid-cloud and multi-cloud solutions.
Vendor Independent: Maintained by the non-profit Apache Software Foundation, ensuring open access and continuous development.

Real-World Examples:

Netflix: Handles 30 million operations per second on its busiest cluster, storing 98% of streaming data on Cassandra.
Apple: Runs over 160,000 Cassandra instances across thousands of clusters.

Top 3 Benefits of Using Cassandra

1. Performance – Speed

Cassandra processes data quickly due to two key features:

Hashing Algorithm: Decides where to store data swiftly.
Decentralized Data Storage: Any node can make storage decisions, eliminating the need for a central "master node" and speeding up operations.

2. Scalability

Cassandra is highly scalable and can grow easily by adding new nodes.

No Master Node: All nodes are equal, allowing the use of cheaper servers.
Peer-to-Peer Communication: Uses a "gossip protocol" for nodes to communicate and manage metadata, simplifying the addition of new nodes.
Relaxed Consistency: Reduces the need for complex consistency checks, which typically slow down scalability.

3. Reliability – Data Replication and High Availability

Cassandra is designed to be robust:

Data Replication: Automatically makes copies of data and stores them in different locations.
High Availability: Ensures that even if a node fails, the data is still accessible from another copy.

Challenges

Cassandra's focus on availability over consistency means that data can sometimes contradict itself, and resolving these contradictions can slow down data reads.

When Not to Use Cassandra

While Apache Cassandra is powerful, it's not suitable for every situation. Here are some cases where you might want to choose a different tool:

Small-Scale Applications:
- Cassandra is overkill for small apps with limited data and low traffic. A simpler single-node database would be easier to manage and sufficient for these needs.
Systems Needing Strong ACID Compliance:
- Cassandra sacrifices strict ACID properties (Atomicity, Consistency, Isolation, Durability) for high availability and scalability. If your application requires strong consistency and complex transactions, Cassandra may not be the best choice.
Complex Queries and Joins:
- Cassandra excels at fast writes and simple data retrievals using primary keys. However, it's not efficient for applications that rely on complex queries, ad-hoc searches, or joins.
Frequent Updates or Deletes:
- Optimized for high write-throughput, Cassandra handles frequent updates or deletions less efficiently. These operations can create performance issues due to increased storage demands from tombstone markers.
Read-Heavy Workloads:
- While capable of handling reads and writes, Cassandra may not perform as well as databases optimized for read-heavy workloads with minimal writes.
Static or Infrequently Changing Schemas:
- Cassandra is ideal for dynamic and evolving schemas. If your data structure is stable and rarely changes, using Cassandra could add unnecessary complexity.
Single Node Deployments:
- Cassandra is designed for multi-node setups. Running it on a single node misses out on its distributed architecture benefits, making simpler databases more suitable for standalone use.
Short-Lived Data Storage:
- Cassandra is built for long-term data storage. For temporary or short-lived data, other solutions like temporary data stores or caches are more appropriate.

Still not clear? Don't worry, we have something more in store for you!

Let’s understand Cassandra through a Real-Life Example: An Online Shopping Platform

Equal Servers (Nodes)

In Cassandra, all servers (called nodes) are created equal. Imagine an online shopping platform with warehouses in different locations. Each warehouse can handle orders, manage inventory, and process returns. If one warehouse is busy or goes down, the others can still process orders without any disruption, ensuring a smooth shopping experience.

Peer-to-Peer Architecture

Cassandra uses a peer-to-peer architecture. Think of each warehouse being able to communicate directly with any other warehouse without needing a central headquarters to manage operations. This ensures no single point of failure, as all warehouses work together to keep the platform running efficiently.

Data Replication

Data replication in Cassandra means that the same piece of information is stored in multiple locations. On the shopping platform, this is like having the same product catalog and order details available in multiple warehouses. If one warehouse fails, the others still have the data, ensuring that customers can continue to place orders and access product information without interruption.

Partitioning

Cassandra stores data by partitioning it. Imagine dividing the inventory into different categories, such as electronics, clothing, and home goods, and then distributing these categories across various warehouses. Each warehouse manages a portion of the inventory, allowing for easy scaling. If the demand for electronics increases, you can add another warehouse to handle the increased load.

Scaling Up or Down

Scaling in Cassandra is straightforward. If your shopping platform gains more customers, you can easily add more warehouses to handle the increased activity. Conversely, if customer activity decreases, you can reduce the number of warehouses without disrupting operations.

Handling Orders and Inventory

For your online shopping platform, customer orders and inventory details are distributed across different warehouses. When a customer places an order, the order details are stored in multiple warehouses. If one warehouse fails, other warehouses still have the data, ensuring the order is processed smoothly. This replication and distribution mechanism keeps your platform running efficiently even under heavy load.

To conclude,

Cassandra ensures high availability, fault tolerance, and easy scalability by treating all servers equally, replicating data, and partitioning it effectively, much like how a well-organized online shopping platform handles orders and inventory across multiple warehouses.

By now, you should have a clear idea of Cassandra, from what it is to how it works. In a nutshell, Cassandra is a scalable, distributed NoSQL database for handling large data with high availability. So, as you navigate through the vast expanse of the internet, stay informed, stay secure, and embrace the adventure of the digital realm!

Congratulations! You've just advanced another step in your tech journey. Keep progressing!

I found jobs for you!

Sat, 01 Jun 2024 04:30:00 +0000

Welcome to Hello World, we help software engineers learn a new software engineering concept every week.

You can also check out: What is Kafka?

Senior LatAm Tech Hiring in 24 Hours

CloudDevs is the largest pool of tech talent in LatAm. Our 10,000+ pre-vetted LatAm engineers have a minimum of 7 years of experience and are hand-selected for your project in just 24 hours.

Our talent is rigorously tested for problem-solving, tech stack, and communication, and specializes in everything from AI and healthtech to fintech and web3.

Try CloudDevs free for 7 days – no risk, no obligation.

Software Engineer, Fullstack @LinkedIn: https://www.linkedin.com/jobs/view/3881985448
Sr Software Engineer, Fullstack @Uber: https://www.linkedin.com/jobs/view/3937609559
Web Developers, Remote @Zoho: https://www.linkedin.com/jobs/view/3937288974
Software Engineer, Backend - International (Internship) @Coinbase: https://www.linkedin.com/jobs/view/3937577186
Full Stack Engineer, Support Experience @Stripe: https://www.linkedin.com/jobs/view/3919463720
Software Engineer @Myntra: https://www.linkedin.com/jobs/view/3912993271
Full Stack Engineer @Walmart: https://www.linkedin.com/jobs/view/3936977814
Software Engineer-Backend @PhonePe: https://www.linkedin.com/jobs/view/3824547942
Nodejs Developer @Jio: https://www.linkedin.com/jobs/view/3936893976
Software Engineer, Remote @CoverForce: https://www.linkedin.com/jobs/view/3938796274

EP 28: What is Kafka?

Mon, 27 May 2024 04:30:00 +0000

Welcome to Hello World, we help software engineers learn a new software engineering concept every week.

You can also check out: EP 27: What is Kubernetes?

How Uber scaled its Real Time Infrastructure to Trillion events per day

Discover how Uber scaled its real-time infrastructure to handle trillions of events per day by implementing Apache Kafka.

Kafka's distributed streaming platform enabled horizontal scalability, ensuring high availability and low latency. Through efficient data partitioning and optimized configurations, Uber effectively managed the massive volume of data generated by its operations, facilitating seamless scalability and performance.

Have you ever wondered how companies like Netflix deliver personalized recommendations instantly or how e-commerce platforms manage to keep track of inventory in real-time? The answer lies in Apache Kafka, a powerful tool reshaping how businesses handle data. But what exactly is Kafka, and why is it becoming increasingly essential in today's digital landscape?

Let’s find out in this blog today!

Apache Kafka is an open-source distributed event streaming platform designed for building real-time data pipelines and streaming applications, providing high-throughput, low-latency data processing across a cluster of servers.

Here's a breakdown of some terms you might not be familiar with but will come across frequently in this blog:

Latency: Latency refers to the time it takes for a system to respond to a request. In simpler terms, it's the delay between when you ask for something and when you get a response. For example, when you click a link on a webpage, the time it takes for the new page to start loading is its latency. Low latency means things happen quickly, while high latency means there's a noticeable delay.

Throughput: Throughput is the rate at which a system can process or handle a certain amount of work within a given time frame. It's like the capacity of a highway: how many cars can pass through in an hour. In computing, it's about how much data or tasks a system can handle per unit of time. High throughput means the system can handle a lot of work quickly, while low throughput means it struggles to keep up with demand.

Origin of Kafka

Kafka originated at LinkedIn to manage high-throughput, low-latency data feeds from the platform. Initially an in-house solution, it was open-sourced in 2011 and later became an Apache Software Foundation project. Today, Apache Kafka is a widely used distributed streaming platform, powering real-time data pipelines across industries.

What is Kafka mainly used for?

Because of its fault tolerance and scalability, Kafka is often used in the big data space as a reliable way to ingest and move large amounts of data streams very quickly. Let’s look into particular use cases when Kafka is the first choice.

1. Stream processing: Thanks to Kafka Streams, you can build a streaming application that transforms input Kafka topics into output Kafka topics. It ensures that the application is distributed and fault-tolerant.

2. Website activity tracking: This is the original use case at LinkedIn that triggered the invention of Kafka. The company still uses Kafka to track activity data and operational metrics in real time.

3. Metrics collection and monitoring: Kafka can easily be combined with a real-time monitoring application that reads from Kafka topics.

4. Log aggregation: You can publish logs into Kafka topics and store them in a Kafka cluster. From there, logs can easily be aggregated or processed.

5. Real-time analytics: Kafka can be used for real-time analytics because it processes data as soon as it becomes available. It transmits data from producers to data handlers and on to data storage.

6. Microservices: Kafka can be used in microservices as well. They use it as a central intermediary that lets them communicate with each other, which gives them the benefits of the publish-subscribe model: the receiver decides asynchronously which events to receive.

As a result, Kafka-centric applications are more reliable and scalable when compared with architectures without Kafka.

Before we proceed, let's take a moment to explore some everyday Kafka use cases that often go unnoticed!

Imagine you're a customer browsing through an online store. With every click, data is generated, from your browsing history to your preferences. How does the store process this flood of information seamlessly, ensuring a smooth shopping experience for you? That's where Kafka comes into play. It acts as the central nervous system and manages the flow of data in real time, from the moment you click "search" to the instant your order is confirmed.
In the world of e-commerce, Kafka enables platforms to manage inventory, process orders, and deliver personalized recommendations instantaneously. Take Amazon, for example. By leveraging Kafka, Amazon ensures that every interaction you have with its platform – from browsing products to completing a purchase – is processed swiftly and efficiently.
But Kafka's impact extends far beyond e-commerce. In the financial sector, banks rely on Kafka for fraud detection and real-time decision-making.
Social media platforms use Kafka to deliver personalized content to millions of users simultaneously.
From IoT applications to telecommunications, Kafka is at the heart of real-time data processing across various industries, driving innovation and efficiency.

Now that you're aware of the various scenarios where Kafka is applied, let's delve deeper into its architecture and technical aspects.

Kafka Architecture

Here are the key components of the Kafka ecosystem:

Producers: These are the data sources that publish records to Kafka topics, analogous to senders in a messaging system.
Consumers: These subscribe to topics and process the published records, acting as the receivers.
Brokers: Brokers are the servers in a Kafka cluster. A Kafka cluster consists of multiple brokers to balance the load and manage data. Brokers are stateful and represent the unit of scalability in Kafka.
Topics: A topic is a category or feed where records are published. Topics are divided into partitions for scalability and parallel processing.
Partitions: Each partition is an ordered, immutable sequence of records. Partitions enable Kafka to parallelize processing, as each can be consumed independently.
ZooKeeper: This is a centralized service that coordinates the activities of brokers in a Kafka cluster, maintains the list of brokers, and facilitates leader election for partitions. Newer Kafka releases can run without ZooKeeper by using Kafka's built-in KRaft mode instead.

How Does Kafka Work?

Kafka's architecture ensures efficient and reliable data processing by organizing its core components to work together seamlessly:

Producer: Generates and writes large amounts of data to Kafka.
Consumer: Reads data from Kafka produced by the producers.
Topic: Organizes Kafka records into categories or labels, making it easier for consumers to read data from specific topics.
Broker: Handles the data received from producers and stores it on the server.
Partition: A unit of data storage, each partition is an ordered, immutable sequence of messages. A topic can have multiple partitions to manage larger data volumes.
ZooKeeper: Coordinates broker activities, maintains broker lists, and facilitates partition leader elections.

Imagining Kafka as a Mail System

Kafka operates like a highly efficient mail system:

Producers drop off letters (messages) at the post office (broker).
Each letter is sorted into specific PO boxes (topics) and organized into compartments (partitions).
Consumers collect letters from these compartments, ensuring efficient and orderly processing.
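The mail-system picture can be sketched in a few lines of Python. This is an in-memory toy, not the Kafka client API: the `Topic` and `Consumer` classes are invented for illustration, and CRC32 stands in for whatever hash a real client uses to pick a partition. It shows the three ideas that matter: a key always maps to the same partition, each partition is an append-only log with offsets, and every consumer tracks its own position.

```python
import zlib


class Topic:
    """A toy topic: a list of append-only partitions (logs)."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions: list[list[tuple[str, str]]] = [[] for _ in range(num_partitions)]

    def publish(self, key: str, value: str) -> tuple[int, int]:
        """Append a record and return (partition, offset)."""
        # Same key -> same partition, so per-key ordering is preserved.
        partition = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1


class Consumer:
    """A toy consumer that remembers how far it has read in each partition."""

    def __init__(self, topic: Topic) -> None:
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self) -> list[tuple[str, str]]:
        """Return all records not yet seen by this consumer."""
        records = []
        for p, log in enumerate(self.topic.partitions):
            records.extend(log[self.offsets[p]:])
            self.offsets[p] = len(log)
        return records


if __name__ == "__main__":
    letters = Topic("letters", num_partitions=3)
    for i in range(3):
        print(letters.publish("alice", f"letter-{i}"))
    reader = Consumer(letters)
    print(reader.poll())  # all three letters, in the order they were sent
    print(reader.poll())  # nothing new, so an empty list
```

Because reading does not remove records, a second `Consumer` on the same topic would receive the same three letters, which is the publish-subscribe behavior described earlier.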

Let’s understand why we need Kafka through an example

Scenario: Zomato's Food Delivery System

1. Order Placement and Processing

When a customer places an order on Zomato:

The application generates a high volume of data, including customer details, order details, payment information, and more.
This data needs to be processed quickly to confirm the order, notify the restaurant, and arrange for delivery.

2. Challenges with Traditional Databases

Using only traditional databases for this process could present several challenges:

High Throughput Demand: The system needs to handle a large number of orders simultaneously, especially during peak hours (e.g., lunchtime, weekends). Traditional databases might struggle to handle this volume of concurrent transactions with low latency.
Real-Time Updates: The system must provide real-time updates on the order status to customers, restaurants, and delivery personnel. Databases may not be able to efficiently manage and propagate these real-time updates.
Scalability Issues: As Zomato grows, the volume of data increases. Scaling a traditional database to handle this growth can be complex and costly.

3. How Kafka Solves These Challenges

Here's how Kafka can be integrated into Zomato's architecture to address these challenges:

Order Data Flow with Kafka

Order Generation:
- When an order is placed, it is published to a Kafka topic (e.g., order_topic). This topic acts as a queue that holds the order data temporarily.
Real-Time Processing:
- Multiple consumer applications can subscribe to this order_topic. For example, one consumer might handle order validation, another might manage payment processing, and another might update the restaurant.
Decoupling and Buffering:
- Kafka decouples the order generation from the order processing. Even if the downstream systems (like databases or notification services) are slow, Kafka buffers the orders, ensuring that no data is lost and the system remains responsive.
Scalability:
- Kafka is horizontally scalable. As the number of orders increases, Zomato can add more Kafka brokers to handle the load without a significant overhaul of the existing system.
Fault Tolerance and Durability:
- Kafka ensures that the order data is stored reliably and can be replayed in case of failures. This ensures that no order is lost and the system can recover from unexpected downtimes.
Batching Data for Database Insertion:
- Continuous Data Updates: Throughout the lifecycle of an order, its data undergoes various updates within Kafka topics corresponding to each stage of processing.
- Aggregation of Order Data: Rather than inserting every piece of data into the database immediately, a consumer application can collect all the events related to a single order from these topics.
- Single Query Database Insertion: Once all necessary processing steps are completed, the consumer writes the data associated with that order to the database in a single query or batch operation.
Optimizing Database Interaction:
- Reduced Query Overhead: Batching the data for each order before insertion minimizes the number of database queries required.
- Mitigating High Throughput Demand: Under high throughput, traditional databases often struggle with the concurrent processing of numerous transactions. Because Kafka buffers the events and consumers write them in batches, the database sees far fewer interactions, so a large volume of orders can be handled without compromising latency or system responsiveness.
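The batching idea above can be sketched as a small consumer-side helper. Everything here is an assumption made for illustration: the `OrderBatcher` class, the event shape (`order_id`, `stage`, `data`), and the stage names are hypothetical, and `write_batch` stands in for whatever single insert your database layer provides. The helper holds events per order and writes once, when all required stages have arrived.

```python
from collections import defaultdict
from typing import Callable


class OrderBatcher:
    """Collect per-order events and write them to storage in one call."""

    def __init__(self, required_stages: list[str],
                 write_batch: Callable[[str, dict], None]) -> None:
        self.required = set(required_stages)
        self.write_batch = write_batch  # hypothetical single-insert function
        self.pending: dict[str, dict] = defaultdict(dict)

    def handle(self, event: dict) -> None:
        """Process one event consumed from a topic."""
        order_id = event["order_id"]
        self.pending[order_id][event["stage"]] = event["data"]
        # Flush only when every required stage has been seen for this order.
        if self.required.issubset(self.pending[order_id]):
            self.write_batch(order_id, self.pending.pop(order_id))


if __name__ == "__main__":
    def fake_insert(order_id: str, data: dict) -> None:
        print(f"one insert for {order_id}: {sorted(data)}")

    batcher = OrderBatcher(["placed", "validated", "paid"], fake_insert)
    batcher.handle({"order_id": "o1", "stage": "placed", "data": {"items": ["pizza"]}})
    batcher.handle({"order_id": "o1", "stage": "validated", "data": {"ok": True}})
    batcher.handle({"order_id": "o1", "stage": "paid", "data": {"amount": 12}})
```

Three events arrive, but the database is touched once. A production consumer would also need timeouts for orders that never complete, which this sketch leaves out.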

Example: Customer Order to Delivery

Customer places an order:
- The order is published to Kafka topic order_topic.
Order validation:
- A consumer application subscribes to order_topic, validates the order details, and then publishes the validated order to another Kafka topic, validated_order_topic.
Payment Processing:
- Another consumer subscribes to validated_order_topic, processes the payment, and publishes the payment status to payment_status_topic.
Restaurant Notification:
- A consumer subscribes to validated_order_topic to notify the restaurant about the new order.
Delivery Assignment:
- A consumer subscribes to validated_order_topic to assign a delivery person and update the delivery status in delivery_status_topic.
Real-Time Updates to User:
- Consumers subscribing to various topics (like payment_status_topic, delivery_status_topic) update the user in real-time about their order status.
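The flow above can be traced with a toy, in-memory version in Python. Plain lists stand in for Kafka topics, the `run_pipeline` function and the order fields (`order_id`, `items`) are hypothetical, and the "validation" rule (an order must contain at least one item) is made up for the example. The topic names match the ones used above.

```python
from collections import defaultdict


def run_pipeline(orders: list[dict]) -> dict[str, list[dict]]:
    """Push orders through each stage and return the contents of every topic."""
    topics: dict[str, list[dict]] = defaultdict(list)

    # Customer places an order: publish to order_topic.
    topics["order_topic"].extend(orders)

    # Order validation: read order_topic, publish valid orders onward.
    for order in topics["order_topic"]:
        if order.get("items"):  # illustrative rule: must contain items
            topics["validated_order_topic"].append(order)

    # Payment processing: read validated orders, publish payment status.
    for order in topics["validated_order_topic"]:
        topics["payment_status_topic"].append(
            {"order_id": order["order_id"], "status": "paid"})

    # Delivery assignment: also reads validated orders, independently.
    for order in topics["validated_order_topic"]:
        topics["delivery_status_topic"].append(
            {"order_id": order["order_id"], "status": "assigned"})

    return topics


if __name__ == "__main__":
    result = run_pipeline([
        {"order_id": "o1", "items": ["pizza"]},
        {"order_id": "o2", "items": []},  # fails validation, stops here
    ])
    for name, events in result.items():
        print(name, events)
```

Notice that payment and delivery both read `validated_order_topic` without knowing about each other. That independence is the decoupling Kafka provides; in a real deployment each stage would be a separate service with its own consumer.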

Did you know?
More than 80% of all Fortune 100 companies trust and use Kafka.

Limitations and Considerations of Apache Kafka

While Apache Kafka is a powerful tool for real-time data processing, it's important to be aware of its limitations:

Not Ideal for Small Data Sets: Kafka is designed for high-throughput environments and may not be efficient for smaller data sets due to system overhead. For smaller data environments, alternatives like KubeMQ, Google Cloud Pub/Sub, Azure Event Hubs, Amazon MQ, RabbitMQ, and Red Hat AMQ might be more suitable.
Complex Message Transformations: Kafka struggles with complex message transformations, as it isn't designed for heavy ETL (Extract, Transform, Load) operations. For more advanced data transformation needs, tools like Spark Streaming offer better capabilities.
Not a Database Replacement: Kafka lacks traditional database features like indexes and transaction support. It's not meant for long-term data storage or complex concurrency handling, so it shouldn't be used as a database substitute. Instead, Kafka can complement databases for specific use cases.

By now, you should have a clear idea of Kafka, from what it is to how it works. In a nutshell, Kafka is a distributed streaming platform renowned for efficiently handling real-time data feeds while maintaining low latency and high throughput. So, as you navigate through the vast expanse of the internet, stay informed, stay secure, and embrace the adventure of the digital realm!

Congratulations! You've just advanced another step in your tech journey. Keep progressing!

Job Opening

You can check the job openings here: Job Openings

I found jobs for you!

Sat, 25 May 2024 12:30:00 +0000

Welcome to Hello World, we help software engineers learn a new software engineering concept every week.

You can also check out: What is Kubernetes?

Software Engineer @LinkedIn: https://www.linkedin.com/jobs/view/3926238541
Software Engineer, Remote @PayPal: https://www.linkedin.com/jobs/view/3928444763
Computer Scientist @Adobe: https://www.linkedin.com/jobs/view/3774717831
Backend Developer @CRED: https://www.linkedin.com/jobs/view/3908881694
Software Engineer II @Postman: https://www.linkedin.com/jobs/view/3921053379
Full Stack Developer @Adobe: https://www.linkedin.com/jobs/view/3925222768
Full Stack Engineer II, Remote @PriceLabs: https://www.linkedin.com/jobs/view/3931117673
Software Engineer @Stripe: https://www.linkedin.com/jobs/view/3908117985
Software Engineer @PhonePe: https://www.linkedin.com/jobs/view/3798400436
SDE 2 - Backend Developer @Groww: https://www.linkedin.com/jobs/view/3931270024

EP 27: What is Kubernetes?

Mon, 20 May 2024 04:30:00 +0000

Welcome to Hello World, we help software engineers learn a new software engineering concept every week.

You can also check out: What is Docker?

How Big Companies Are Using Kubernetes to Scale Their Businesses

Discover how some of the world's leading companies—Tinder, Reddit, The New York Times, Airbnb, Pinterest, and Niantic's Pokémon Go—have successfully harnessed Kubernetes to transform their infrastructure and drive innovation.

Understanding Kubernetes from Real-world Use Cases

If you are considering adopting Kubernetes for your container orchestration but haven't made the leap, take a look at these six major companies that have.

dzone.com/articles/how-big-companies-are-using-kubernetes

In today's digital landscape, managing and scaling complex software systems can be a daunting task. That's where Kubernetes steps in, offering a powerful solution to automate the deployment, scaling, and management of containerized applications. Let's learn more about Kubernetes and discover how it enhances our digital experiences in this blog today.

Before diving into Kubernetes, let’s go back in time and see why we need it.

In the early days, organizations relied on physical servers to run their applications. However, this posed challenges in resource management, as multiple applications sharing one server could lead to performance issues. To address this, organizations resorted to running each application on separate physical servers, but this was costly and inefficient.

Source: Overview | Kubernetes

Then came the era of virtualization, where multiple virtual machines (VMs) could run on a single physical server. This improved resource utilization and scalability, as applications could be isolated within VMs. However, VMs still carried the overhead of running full operating systems.

Then came containers. Similar to VMs but lighter, containers share the host's operating system kernel and offer better efficiency and portability. They allow for agile application creation and deployment, seamless integration and rollbacks, and consistent performance across different environments.

Containers have become popular due to their numerous benefits:

Agile application creation and deployment
Continuous development, integration, and deployment
Clear separation of concerns between development and operations
Enhanced observability and environmental consistency
Portability across various cloud and OS distributions
Application-centric management and support for microservices architecture

Why do you need Kubernetes?

In simple terms, when you're running applications in containers, you need a way to manage them effectively, ensuring they run smoothly without any downtime. Imagine if, when one container stops working, another starts automatically - that's where Kubernetes steps in.

Kubernetes acts like a smart system that handles all the nitty-gritty tasks of managing containers for you. It ensures your applications are always available by managing scaling and failover seamlessly. Additionally, it provides helpful deployment patterns, like canary deployments, making it easier to roll out updates without disrupting your services.
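To make the canary idea a little more concrete, here is a minimal sketch under simplifying assumptions: a few pods run the new version next to the stable pods behind the same Service, and traffic is spread evenly across all ready pods. The function name and the replica counts are illustrative only and are not part of any Kubernetes API.

```python
def canary_traffic_share(stable_replicas: int, canary_replicas: int) -> float:
    """Approximate fraction of requests that reach the canary pods.

    Assumes requests are spread evenly across all ready pods.
    """
    if stable_replicas < 0 or canary_replicas < 0:
        raise ValueError("replica counts cannot be negative")
    total = stable_replicas + canary_replicas
    if total == 0:
        raise ValueError("at least one replica is required")
    return canary_replicas / total


if __name__ == "__main__":
    # 9 pods on the current version, 1 pod on the new version.
    share = canary_traffic_share(stable_replicas=9, canary_replicas=1)
    print(f"about {share:.0%} of traffic reaches the canary")
```

If the new version misbehaves, only that small share of users is affected, and the canary pods can be removed without touching the stable ones.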

In essence, Kubernetes is like a reliable assistant that takes care of all the heavy lifting involved in running your applications, allowing you to focus on building and improving your software.

Kubernetes defined

Kubernetes (sometimes shortened to K8s with the 8 standing for the number of letters between the “K” and the “s”) is an open-source system to deploy, scale, and manage containerized applications anywhere.

What is Kubernetes not?

Kubernetes is not a traditional all-inclusive PaaS (Platform as a Service) system.
Kubernetes doesn't deploy source code or build applications; CI/CD workflows are determined separately.
It doesn't provide built-in application-level services like databases or message buses but allows them to run alongside applications.
Kubernetes doesn't enforce specific logging, monitoring, or alerting solutions, leaving the choice to users.
It doesn't mandate a specific configuration language or system, offering a flexible declarative API instead.
Kubernetes doesn't provide comprehensive machine configuration, maintenance, or self-healing systems.
It's not merely an orchestration system but a collection of independent processes working towards a desired state without centralized control.
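The "working towards a desired state" idea in the last point can be illustrated with a toy reconcile loop. This is only a sketch, not Kubernetes code: the `reconcile` function and the pod names are hypothetical, and a real controller talks to the API server instead of editing a Python list.

```python
import itertools

_pod_ids = itertools.count(1)  # hypothetical generator for pod names


def reconcile(desired_replicas: int, running_pods: list[str]) -> list[str]:
    """Return the pod list after one reconcile pass.

    Compares the observed state (running_pods) with the desired state
    (desired_replicas) and adds or removes pods to close the gap.
    """
    if desired_replicas < 0:
        raise ValueError("desired_replicas cannot be negative")
    pods = list(running_pods)
    while len(pods) < desired_replicas:
        pods.append(f"pod-{next(_pod_ids)}")  # start a missing pod
    while len(pods) > desired_replicas:
        pods.pop()  # stop a surplus pod
    return pods


if __name__ == "__main__":
    pods = reconcile(3, [])      # initial rollout creates three pods
    pods.remove(pods[0])         # simulate one pod crashing
    pods = reconcile(3, pods)    # the next pass replaces the lost pod
    print(len(pods), "pods running")
```

The point of the sketch is that nobody issues a "restart" command: the loop only ever compares what exists with what was asked for.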

Kubernetes is one of the most widely deployed software systems in the world, used by Google, Microsoft, Amazon, Apple, Meta, Nvidia, Reddit, Pinterest, and thousands of other companies.

Understanding the power of Kubernetes with the example of Tinder!

Let's explore how Tinder, a popular dating app, successfully utilized Kubernetes to handle its explosive growth and complex infrastructure needs.

Challenge:

Tinder's rapid growth presented significant technical challenges. As user traffic surged, Tinder's infrastructure struggled to keep up. Here were some of the key issues they faced:

Scalability: Tinder needed to scale from a few servers to potentially hundreds to handle peak loads.
Stability: Ensuring the application remained stable despite fluctuating traffic volumes.
Efficiency: Managing and deploying hundreds of services efficiently.

Solution:

To tackle these challenges, Tinder decided to migrate its services to Kubernetes. Here's how they did it:

Initial Setup: Tinder started by containerizing their applications using Docker. Each application and service ran inside its own container, which made it easier to manage dependencies and isolate processes.
Kubernetes Cluster Deployment: Tinder set up a Kubernetes cluster consisting of 1,000 nodes. A node is a worker machine in Kubernetes, and a cluster is a set of nodes that run containerized applications managed by Kubernetes.
Service Migration: The engineering team migrated 200 services to this Kubernetes cluster. Each service ran in its own set of pods. Pods are the smallest deployable units in Kubernetes, each containing one or more containers.
Load Management: To handle high traffic volumes, Kubernetes managed the distribution of pods across nodes. When traffic spiked, Kubernetes automatically scaled the number of pods up or down based on demand. This is known as auto-scaling.
DNS Management: Tinder's Kubernetes cluster ran a DNS service that handled 250,000 requests per second. Kubernetes ensured high availability and load balancing for this service, distributing traffic efficiently across the cluster.

Outcome:

The migration to Kubernetes had a transformative impact on Tinder's infrastructure:

Improved Scalability: Tinder could seamlessly scale from a few servers to thousands, handling massive traffic spikes with ease.
Enhanced Stability: The application remained stable and responsive even during peak usage times.
Operational Efficiency: Kubernetes automated many of the complex tasks involved in managing containerized applications, freeing up Tinder's engineers to focus on development rather than infrastructure.

Real-World Example:

To better understand the impact, let's consider a specific scenario:

Imagine it's Valentine's Day, one of the busiest days of the year for Tinder. Traffic to the app increases tenfold as millions of users log in simultaneously. Without Kubernetes, scaling up to meet this demand would have required significant manual intervention, leading to potential downtime and a poor user experience.

With Kubernetes, however, the process is seamless. As traffic begins to spike, Kubernetes' auto-scaling feature kicks in. It detects the increased load and automatically spins up additional pods to handle the traffic.

These pods are distributed across the available nodes in the cluster, ensuring no single node is overwhelmed. DNS requests are managed efficiently, and users experience smooth, uninterrupted service.

Kubernetes also monitors the health of the pods. If a container in a pod fails, Kubernetes restarts it, and if a whole pod is lost, a replacement is created, ensuring continuous availability. Once the traffic surge subsides, Kubernetes scales down the number of pods, optimizing resource usage and reducing costs.
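How does Kubernetes decide how many pods to run during a spike like this? Here is a small sketch of the scaling rule documented for the Horizontal Pod Autoscaler: the desired replica count is the current count multiplied by the ratio of the observed metric to the target metric, rounded up. This is a simplification (it ignores the tolerance band and stabilization windows), and the numbers are made up for illustration; the source does not describe Tinder's actual autoscaling settings.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 100) -> int:
    """Replica count suggested by the basic autoscaling ratio.

    desired = ceil(current_replicas * current_metric / target_metric),
    then clamped to the [min_replicas, max_replicas] range.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Spike: 10 pods averaging 90% CPU against a 50% target.
    print("scale up to", desired_replicas(10, 90, 50))
    # Later: 18 pods averaging 20% CPU against the same target.
    print("scale down to", desired_replicas(18, 20, 50))
```

The same formula covers both directions: a ratio above 1 adds pods, and a ratio below 1 removes them once the surge is over.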

Kubernetes Architecture

What is Kubernetes? Why is it called k8s? What makes it so popular? Let’s take a look. Kubernetes is an open-source container orchestration platform. It automates the deployment, scaling, and management of containerized applications.

Kubernetes can be traced back to Google's internal container orchestration system, Borg, which managed the deployment of thousands of applications within Google. In 2014, Google open-sourced Kubernetes, a new system built on the lessons learned from running Borg.

Why is it called k8s? This is a somewhat nerdy way of abbreviating long words. The number 8 in k8s refers to the 8 letters between the first letter “k” and the last letter “s” in the word Kubernetes. Other examples are i18n for internationalization and l10n for localization.
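The abbreviation rule is simple enough to write down in a few lines. The `numeronym` helper below is a name made up for this sketch: it keeps the first and last letters and replaces everything in between with a count.

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + count of middle letters + last letter."""
    if len(word) < 3:
        return word  # nothing in the middle to count
    return f"{word[0]}{len(word) - 2}{word[-1]}"


if __name__ == "__main__":
    for word in ("kubernetes", "internationalization", "localization"):
        print(word, "->", numeronym(word))
```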

A Kubernetes cluster is a set of machines, called nodes, that are used to run containerized applications. There are two core pieces in a Kubernetes cluster.

The first is the control plane. It is responsible for managing the state of the cluster. In production environments, the control plane usually runs on multiple nodes that span across several data center zones.
The second is a set of worker nodes. These nodes run the containerized application workloads. The containerized applications run in a Pod. Pods are the smallest deployable units in Kubernetes. A pod hosts one or more containers and provides shared storage and networking for those containers. Pods are created and managed by the Kubernetes control plane. They are the basic building blocks of Kubernetes applications.
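To see what "one or more containers with shared storage" looks like in practice, here is a sketch that builds a Pod manifest as a Python dictionary and prints it as JSON (Kubernetes manifests are usually written in YAML, but JSON is accepted as well). The pod name, container names, images, and the shell command are placeholders chosen for this example.

```python
import json


def two_container_pod(name: str) -> dict:
    """Build a Pod manifest whose two containers share one volume."""
    shared_mount = {"name": "shared-data", "mountPath": "/data"}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # An emptyDir volume lives as long as the pod and is visible
            # to every container that mounts it.
            "volumes": [{"name": "shared-data", "emptyDir": {}}],
            "containers": [
                {
                    "name": "web",
                    "image": "nginx:1.25",  # placeholder image
                    "volumeMounts": [shared_mount],
                },
                {
                    "name": "writer",
                    "image": "busybox:1.36",  # placeholder image
                    "command": ["sh", "-c",
                                "while true; do date >> /data/log.txt; sleep 5; done"],
                    "volumeMounts": [shared_mount],
                },
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(two_container_pod("demo-pod"), indent=2))
```

Both containers are scheduled together on the same node, and both see the same `/data` directory.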

Now let’s dive a bit deeper into the control plane. It consists of a number of core components. They are the API server, etcd, scheduler, and controller manager.

The API server is the primary interface between the control plane and the rest of the cluster. It exposes a RESTful API that allows clients to interact with the control plane and submit requests to manage the cluster.
etcd is a distributed key-value store. It stores the cluster's persistent state. It is used by the API server and other components of the control plane to store and retrieve information about the cluster.
The scheduler is responsible for scheduling pods onto the worker nodes in the cluster. It uses information about the resources required by the pods and the available resources on the worker nodes to make placement decisions.
The controller manager is responsible for running controllers that manage the state of the cluster. Some examples include the replication controller, which ensures that the desired number of replicas of a pod are running, and the deployment controller, which manages the rolling update and rollback of deployments.
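The scheduler's placement decision can be pictured as two steps: filter out nodes that cannot fit the pod, then score the rest and pick the best. The sketch below is a heavily simplified assumption of that idea. The `Node` class, the `pick_node` function, and the "most free resources wins" scoring are invented for illustration; the real scheduler weighs many more factors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    free_cpu_millicores: int
    free_memory_mib: int


def pick_node(nodes: list[Node], cpu_request: int,
              memory_request: int) -> Optional[str]:
    """Return the name of the node chosen for a pod, or None if none fits."""
    # Step 1: filter. Keep only nodes with enough free CPU and memory.
    feasible = [
        n for n in nodes
        if n.free_cpu_millicores >= cpu_request
        and n.free_memory_mib >= memory_request
    ]
    if not feasible:
        return None  # the pod would stay pending until resources free up
    # Step 2: score. Prefer the node left with the most free CPU, then memory.
    best = max(
        feasible,
        key=lambda n: (n.free_cpu_millicores - cpu_request,
                       n.free_memory_mib - memory_request),
    )
    return best.name


if __name__ == "__main__":
    cluster = [
        Node("node-a", 500, 1024),
        Node("node-b", 2000, 4096),
        Node("node-c", 1000, 512),
    ]
    print(pick_node(cluster, cpu_request=750, memory_request=1024))
```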

Next, let’s dive deeper into the worker nodes. The core components of Kubernetes that run on the worker nodes include the kubelet, the container runtime, and kube-proxy.

The kubelet is a daemon that runs on each worker node. It is responsible for communicating with the control plane. It receives instructions from the control plane about which pods to run on the node, and ensures that the desired state of the pods is maintained.
The container runtime runs the containers on the worker nodes. It is responsible for pulling the container images from a registry, starting and stopping the containers, and managing the containers' resources.
The kube-proxy is a network proxy that runs on each worker node. It is responsible for routing traffic to the correct pods. It also provides load balancing for the pods and ensures that traffic is distributed evenly across the pods.
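The load-balancing idea behind kube-proxy can be sketched as rotating through the pods that back a service. Treat this as an illustration only: kube-proxy normally programs kernel-level rules rather than handling each request in application code, the actual selection policy depends on the proxy mode, and the IP addresses below are made up.

```python
import itertools
from collections import Counter
from typing import Iterator


def round_robin(endpoints: list[str]) -> Iterator[str]:
    """Yield pod endpoints in turn, starting over after the last one."""
    if not endpoints:
        raise ValueError("a service needs at least one ready pod")
    return itertools.cycle(endpoints)


if __name__ == "__main__":
    pod_ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical pod IPs
    picker = round_robin(pod_ips)
    handled = Counter(next(picker) for _ in range(6))
    print(handled)  # each pod receives the same number of requests
```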

So when should we use Kubernetes? As with many things in software engineering, this is all about tradeoffs.

Let’s look at the upsides first.

Kubernetes is scalable and highly available.
It provides features like self-healing, automatic rollbacks, and horizontal scaling. It makes it easy to scale our applications up and down as needed, allowing us to respond to changes in demand quickly.
Kubernetes is portable. It helps us deploy and manage applications in a consistent and reliable way regardless of the underlying infrastructure.
It runs on-premise, in a public cloud, or in a hybrid environment.
It provides a uniform way to package, deploy, and manage applications.

Now how about the downsides?

The number one drawback is complexity.
Kubernetes is complex to set up and operate. The upfront cost is high, especially for organizations new to container orchestration. It requires a high level of expertise and resources to set up and manage a production Kubernetes environment.
The second drawback is cost. Kubernetes requires a certain minimum level of resources to run in order to support all the features we mentioned above. It is likely overkill for many smaller organizations.

One popular option that strikes a reasonable balance is to offload the management of the control plane to a managed Kubernetes service. Managed Kubernetes services are provided by cloud providers. Some popular ones are Amazon EKS, GKE on Google Cloud, and AKS on Azure. These services allow organizations to run Kubernetes applications without having to worry about the underlying infrastructure. They take care of tasks that require deep expertise, like setting up and configuring the control plane, scaling the cluster, and providing ongoing maintenance and support.

This is a reasonable option for a mid-size organization to test out Kubernetes. For a small organization, you ain’t gonna need it.

Now that you know both Docker and Kubernetes, you might be curious about the differences between the two. Don’t worry, we’ve got you covered!

Docker is focused on containerization, while Kubernetes is focused on orchestration and management.

Kubernetes and Docker — Better Together:

Kubernetes enhances Docker's capabilities by managing container resources effectively, keeping your app online even during node failures, and making it easy to scale to handle increased load.

Working together, Docker provides container packaging and distribution, while Kubernetes orchestrates and manages these containers. Integrating both ensures robust infrastructure and streamlined app deployment.

When to Use Kubernetes or Docker:

Docker: Ideal for container development, editing, and management using Docker Desktop.
Kubernetes: Suited for running production-grade applications at scale.

Kubernetes Vs Docker

Kubernetes	Docker
Kubernetes is an open-source platform for deploying and maintaining groups of containers.	Docker is a tool for packaging and deploying applications in lightweight containers so that they run consistently across different environments.
In practice, Kubernetes is most commonly used alongside Docker for better control and implementation of containerized applications.	With Docker, multiple containers run on the same hardware much more efficiently than in a VM environment, which makes Docker highly productive.
Applications are deployed as a combination of pods, Deployments, and Services.	Apps are deployed in the form of services.
It supports auto-scaling of containers in a cluster.	Docker does not support auto-scaling.
Health checks are of two kinds: liveness and readiness.	Health checks are limited to services.
Hard to set up and configure.	Docker’s setup and installation are easy.
Its documentation is less extensive than Docker's, but it covers everything from installation to deployment.	Docker's documentation is more extensive. It covers everything from installation to deployment and includes quick-start instructions as well as more detailed tutorials.
Kubernetes is harder to install than Docker, and its commands are more complex.	Docker is easier to install: a few commands are enough to set it up on a virtual machine or in the cloud.
Azure, Buffer, Intel, Evernote, and Shopify use Kubernetes.	Google, Amazon, ADP, Visa, Citizens Bank, and MetLife use Docker.

By now, you should have a clear idea of Kubernetes, from what it means to how it works. In a nutshell, Kubernetes is a tool that automates managing and scaling containerized applications across multiple machines. So, as you navigate through the vast expanse of the internet, stay informed, stay secure, and embrace the adventure of the digital realm!

Congratulations! You've just advanced another step in your tech journey. Keep progressing!

Job Openings

FAANG:

SDE @Amazon:
https://www.linkedin.com/jobs/view/3925334528

Software Engineer @Microsoft:
https://www.linkedin.com/jobs/view/3910634070

Software Engineer, Payments @Google:
https://www.linkedin.com/jobs/view/3835891212

Others:

Software Engineer @LinkedIn:
https://www.linkedin.com/jobs/view/3926238541

Software Engineer @PayPal:
https://www.linkedin.com/jobs/view/3926853002

Software Engineer, Frontend @Coinbase:
https://www.linkedin.com/jobs/view/3904630884