<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Niket's Dev Diary]]></title><description><![CDATA[I explore system architecture and databases through reading, experimenting, and writing. This blog documents my learning journey and technical insights on moder]]></description><link>https://blogs.niket.pro</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1746370306005/c6cfd665-9c3b-4324-95c8-0d05925fe45c.png</url><title>Niket&apos;s Dev Diary</title><link>https://blogs.niket.pro</link></image><generator>RSS for Node</generator><lastBuildDate>Wed, 15 Apr 2026 17:45:43 GMT</lastBuildDate><atom:link href="https://blogs.niket.pro/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[3. Parallel Query Retrieval (Fan Out)]]></title><description><![CDATA[You might have come across this popular reel where Virat Kohli talks about Rohit Sharma’s lazy communication style.

  
  


I will describe this in English so that non Hindi speaking audience can understand. Fair warning, my mediocre English can’t j...]]></description><link>https://blogs.niket.pro/rag-parallel-query-retrieval</link><guid isPermaLink="true">https://blogs.niket.pro/rag-parallel-query-retrieval</guid><category><![CDATA[RAG ]]></category><category><![CDATA[ParallelQueryRetrieval]]></category><category><![CDATA[langchain]]></category><category><![CDATA[qdrant]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[advanced rag]]></category><dc:creator><![CDATA[Aniket Mahangare]]></dc:creator><pubDate>Tue, 20 May 2025 18:30:18 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1747765677203/c4acdff6-677c-4fed-a146-0f09f8656d1e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You might have come across this popular reel where <strong>Virat Kohli</strong> talks about <strong>Rohit Sharma’s</strong> lazy communication style.</p>
<center>
  <iframe width="315" height="560" src="https://www.youtube.com/embed/kGMMqOQSXIs">
  </iframe>
</center>

<p>I will describe this in English so that the non-Hindi-speaking audience can understand. Fair warning, my mediocre English can’t do justice to the humor here. When you have to say, “there is a lot of traffic in Lokhandwala (a place in Mumbai)“, Rohit Sharma will say the same thing as “that place has a lot of this“. Now it’s your responsibility to figure out “what place“ and “has a lot of what“.</p>
<p>My point here is that we humans are lazy. Google has exposed us to so much convenience for so long that we generally don’t care about what we type in the search bar. We just expect Google to bring us the right results. And if you want your RAG application to get popular, then you have to make it very good at understanding what the user wants to ask.</p>
<p>In this &amp; the next couple of articles, we will try to solve this exact problem of making your RAG application understand the user’s queries better, so that it can generate better results.</p>
<p>Before we dive deep into the topic of this article, I highly recommend reading my previous articles in the RAG series. We are diving into advanced RAG topics now, so you should have your basics clear first.</p>
<ol>
<li><p><a target="_blank" href="https://blogs.niket.pro/rag-intro">Introduction to RAG</a></p>
</li>
<li><p><a target="_blank" href="https://blogs.niket.pro/implementing-rag">Implementing RAG</a></p>
</li>
</ol>
<h1 id="heading-parallel-query-retrieval">Parallel Query Retrieval</h1>
<p>So the problem at hand is: we want our RAG application to understand what the user wants to ask, given that most of the time humans are going to give bad input. You may have heard the phrase, “Garbage In, Garbage Out“. It applies perfectly to LLMs. If you give bad input to an LLM, then you will most likely get bad output from it. That means you want to improve the input you are giving to the LLM to make your RAG application “usable“ to <code>normal</code> users.</p>
<p>The Parallel Query Retrieval technique tries to generate better LLM input for the user’s queries. It does so by asking the LLM to generate multiple refined queries for any given user query. It then processes all the LLM-generated queries along with the user’s query to generate a comprehensive output. The following diagram will help you understand this better.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747578509513/6274e253-a09e-40f2-9bbf-14c170ab6123.png" alt class="image--center mx-auto" /></p>
<p>For example, let’s say you create a RAG application capable of answering programming-related questions &amp; you have ingested relevant data into your vector database (using the ingestion phase defined in the previous article). If the user asks the query “implemend goroutines golang“ (notice the spelling mistake in “implement“), then your RAG application will ask the LLM to generate queries similar to the user’s query. Let’s say the LLM returns the following queries:</p>
<ol>
<li><p>How to implement Goroutines in GoLang?</p>
</li>
<li><p>What are the various concurrency patterns in GoLang?</p>
</li>
<li><p>How to take care of thread-safety while using Goroutines in GoLang?</p>
</li>
</ol>
<p>As described in the above diagram, you:</p>
<ol>
<li><p>Generate Vector Embeddings for all the LLM generated queries &amp; the user’s query</p>
</li>
<li><p>Fetch relevant documents from your vector database using similarity search</p>
</li>
<li><p>Aggregate unique data points from similarity search results across multiple queries</p>
</li>
<li><p>Pass the user’s query along with the aggregated data points to LLM</p>
</li>
</ol>
<p>After following these steps, the response from the LLM will most likely be better than the response from the basic RAG that we coded in the previous article.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">In classic system design, the Fan-Out Pattern refers to <strong>sending a single message or event to multiple services or consumers at once</strong>. I hope you understand now why the technique we are discussing in this article comes under the Fan-Out pattern.</div>
</div>
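<p>As an aside, the classic fan-out pattern from the callout can be sketched in a few lines of Python. This is just a toy illustration (the consumer names are made up, and it is not part of our RAG code): one event is dispatched to every consumer at once, and the results are gathered back.</p>

```python
import asyncio


async def consumer(name: str, event: str) -> str:
    # each consumer processes the same event independently
    await asyncio.sleep(0)  # stand-in for real work (I/O, an API call, etc.)
    return f"{name} handled {event!r}"


async def fan_out(event: str) -> list[str]:
    # send the single event to all consumers concurrently
    consumers = ["search-service", "logging-service", "analytics-service"]
    return await asyncio.gather(*(consumer(c, event) for c in consumers))


results = asyncio.run(fan_out("user-signed-up"))
print(results)
```

<p>In our case, the “event“ is the user’s query and the “consumers“ are the similarity searches for each generated query.</p>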

<h1 id="heading-implementation-in-python">Implementation in Python</h1>
<p>Enough with the theory, let’s code this thing. As discussed before, this RAG differs from the basic RAG we built in the previous article in the <code>QUERY</code> phase. Hence, I will be reusing some components from my basic RAG implementation article. If you haven’t read it already, I highly recommend reading it first.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://blogs.niket.pro/implementing-rag">https://blogs.niket.pro/implementing-rag</a></div>
<p> </p>
<p>Let’s assume that you have ingested a PDF document about GoLang into your RAG application. Now we will discuss the changes in the query flow.</p>
<h3 id="heading-step-1-generate-multiple-queries-given-users-query">Step 1: Generate Multiple Queries Given User’s Query</h3>
<p>Our goal in this step is to use the LLM to generate multiple queries that are similar to the user’s query. At a high level, there are two ways to achieve this.</p>
<ol>
<li><p>You make multiple requests to your LLM, each one asking it to generate a query similar to the user’s query. But this is more time-consuming &amp;, most importantly, it will cost you more.</p>
</li>
<li><p>The second way is to ask the LLM to generate multiple queries within the same response. But there is a problem here. When you ask an LLM a question, it gives you the response as plain text. How do you extract queries from a plain-text response? This is where a concept called “Structured Output“ helps. Basically, modern LLMs can respond in a specific format that you define before making requests.</p>
</li>
</ol>
<p>Let’s see structured output in action using LangChain.</p>
<p><strong>Define Output Format</strong></p>
<p>We will use <code>BaseModel</code> from the <code>pydantic</code> library to create a class <code>MultipleQueries</code> that defines the output structure we are expecting from the LLM.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pydantic <span class="hljs-keyword">import</span> BaseModel

<span class="hljs-comment"># model for multiple queries</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MultipleQueries</span>(<span class="hljs-params">BaseModel</span>):</span>
    queries: list[str]
</code></pre>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">You can watch <a target="_self" href="https://www.youtube.com/watch?v=XIdQ6gO3Anc">this YouTube video</a> to learn more about Pydantic.</div>
</div>

<p><strong>Instruct LLM Model to Respond in Output Format</strong></p>
<p>LangChain makes it very easy to instruct the LLM models to respond in specific format.</p>
<pre><code class="lang-python"><span class="hljs-comment"># create LLM</span>
llm = ChatOpenAI(
    model=<span class="hljs-string">"gpt-4.1"</span>,
)

<span class="hljs-comment"># llm for query generation</span>
llm_for_query_gen = llm.with_structured_output(MultipleQueries)
</code></pre>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">You can read more about Structured Output from LangChain in <a target="_self" href="https://python.langchain.com/docs/concepts/structured_outputs/">this</a> tutorial. OpenAI SDK also offers a similar functionality to specify the output format directly. You can read more about OpenAI structured outputs <a target="_self" href="https://platform.openai.com/docs/guides/structured-outputs?api-mode=responses">here</a>.</div>
</div>

<p><strong>Generate Multiple Queries for a Given User Query</strong></p>
<pre><code class="lang-python">SYSTEM_PROMPT_QUERY_GEN = <span class="hljs-string">"""
You are a helpful assistant. Your job is to generate 3 queries that are similar to the user's query.
You need to give the response in the required format. 

Example:
user_query: implement goroutines in golang

response:
[
    "how to implement goroutines in golang",
    "what is goroutine in golang",
    "how to use goroutines in golang"
]
"""</span>

<span class="hljs-comment"># generate 3 queries similar to the user's query</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">generate_queries</span>(<span class="hljs-params">query: str</span>) -&gt; list[str]:</span>
    <span class="hljs-comment"># 1. use LLM to generate 3 queries similar to the user's query</span>
    messages = [
        {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: SYSTEM_PROMPT_QUERY_GEN},
        {<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: query},
    ]

    response = llm_for_query_gen.invoke(messages)
    <span class="hljs-keyword">if</span> isinstance(response, MultipleQueries):
        result = response.queries
        print(<span class="hljs-string">f"🌀🌀🌀 Generated <span class="hljs-subst">{len(result)}</span> queries"</span>)
        <span class="hljs-keyword">for</span> i, q <span class="hljs-keyword">in</span> enumerate(result):
            print(<span class="hljs-string">f"🌀🌀🌀 <span class="hljs-subst">{i+<span class="hljs-number">1</span>}</span>. <span class="hljs-subst">{q}</span>"</span>)
        <span class="hljs-keyword">return</span> result
    <span class="hljs-keyword">else</span>:
        <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">"Invalid response from LLM"</span>)
</code></pre>
<h3 id="heading-step-2-fetch-relevant-documents-from-vector-db-for-each-query">Step 2: Fetch Relevant Documents from Vector DB for Each Query</h3>
<p>Here, we will use the method <code>get_vector_store()</code> which we have defined in the previous article.</p>
<pre><code class="lang-python">COLLECTION_NAME = <span class="hljs-string">"golang-docs"</span>
SIMILARITY_THRESHOLD = <span class="hljs-number">0.5</span>  <span class="hljs-comment"># same threshold we used in the previous article</span>

<span class="hljs-comment"># fetch the relevant documents for the query</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fetch_relevant_documents_for_query</span>(<span class="hljs-params">query: str</span>) -&gt; list[Document]:</span>
    <span class="hljs-comment"># 1. check if collection exists</span>
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> collection_exists(COLLECTION_NAME):
        <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">"Collection does not exist"</span>)

    <span class="hljs-comment"># 2. get the vector store for the collection</span>
    vector_store = get_vector_store(COLLECTION_NAME)

    <span class="hljs-comment"># 3. fetch the relevant documents using similarity search</span>
    docs = vector_store.similarity_search_with_score(query, k=<span class="hljs-number">5</span>)

    <span class="hljs-comment"># 4. filter the documents based on the similarity threshold</span>
    filtered_docs = [doc <span class="hljs-keyword">for</span> doc, score <span class="hljs-keyword">in</span> docs <span class="hljs-keyword">if</span> score &gt;= SIMILARITY_THRESHOLD]

    print(<span class="hljs-string">f"🌀🌀🌀 QUERY: <span class="hljs-subst">{query}</span>. FOUND: <span class="hljs-subst">{len(filtered_docs)}</span> documents"</span>)

    <span class="hljs-keyword">return</span> filtered_docs
</code></pre>
<h3 id="heading-step-3-aggregate-unique-documents-across-queries">Step 3: Aggregate Unique Documents Across Queries</h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> langchain_core.documents <span class="hljs-keyword">import</span> Document
<span class="hljs-comment"># aggregate the relevant documents</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">aggregate_relevant_documents</span>(<span class="hljs-params">queries: list[str]</span>) -&gt; list[Document]:</span>
    <span class="hljs-comment"># 1. fetch the relevant documents for each query</span>
    docs = [fetch_relevant_documents_for_query(query) <span class="hljs-keyword">for</span> query <span class="hljs-keyword">in</span> queries]

    <span class="hljs-comment"># 2. flatten the list of lists and get unique documents</span>
    flattened_docs = [doc <span class="hljs-keyword">for</span> sublist <span class="hljs-keyword">in</span> docs <span class="hljs-keyword">for</span> doc <span class="hljs-keyword">in</span> sublist]
    unique_docs = list({doc.page_content: doc <span class="hljs-keyword">for</span> doc <span class="hljs-keyword">in</span> flattened_docs}.values())

    print(<span class="hljs-string">f"🌀🌀🌀 Found <span class="hljs-subst">{len(unique_docs)}</span> unique documents across all the queries"</span>)

    <span class="hljs-keyword">return</span> unique_docs
</code></pre>
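<p>Note that the list comprehension above fetches documents for each query one after another. Since the similarity searches are independent, you could also fan them out concurrently, e.g. with a thread pool. Here is a self-contained sketch of that idea, where <code>fetch_one</code> is a stand-in for <code>fetch_relevant_documents_for_query</code> (it just returns fake document strings so the example runs on its own):</p>

```python
from concurrent.futures import ThreadPoolExecutor


# Stand-in for fetch_relevant_documents_for_query: the real function performs
# a similarity search against qdrant; this one returns fake "documents" so
# the sketch is self-contained.
def fetch_one(query: str) -> list[str]:
    return [f"doc-for:{query}", "shared-doc"]


def aggregate_concurrently(queries: list[str]) -> list[str]:
    # fan the independent similarity searches out across a thread pool
    with ThreadPoolExecutor(max_workers=4) as pool:
        per_query = list(pool.map(fetch_one, queries))

    # flatten the lists & deduplicate while preserving order (same dict trick as above)
    flattened = [doc for sublist in per_query for doc in sublist]
    return list({doc: doc for doc in flattened}.values())


docs = aggregate_concurrently(["query a", "query b"])
print(docs)  # "shared-doc" appears only once
```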
<h3 id="heading-step-4-query-llm-using-aggregated-documents">Step 4: Query LLM using Aggregated Documents</h3>
<pre><code class="lang-python">SYSTEM_PROMPT_ANSWER_GEN = <span class="hljs-string">"""
You are a helpful assistant. Your job is to generate an answer for the user's query based on the relevant documents provided.
"""</span>

<span class="hljs-comment"># generate the answer for the user's query</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">generate_answer</span>(<span class="hljs-params">query: str, docs: list[Document]</span>) -&gt; str:</span>
    <span class="hljs-comment"># 1. use LLM to generate the answer for the user's query based on the relevant documents</span>
    system_prompt = SYSTEM_PROMPT_ANSWER_GEN
    <span class="hljs-keyword">for</span> doc <span class="hljs-keyword">in</span> docs:
        system_prompt += <span class="hljs-string">f"""
        Document: <span class="hljs-subst">{doc.page_content}</span>
        """</span>
    messages = [
        {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: system_prompt},
        {<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: query},
    ]
    response = llm.invoke(messages)
    <span class="hljs-keyword">return</span> response.content
</code></pre>
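<p>Putting it all together, the query flow is just a composition of the functions we defined above. The sketch below shows the wiring with lightweight stand-ins for the LLM- and vector-store-backed functions, so that it runs on its own; in the real application you would use the implementations from the previous steps.</p>

```python
# Stand-ins for the article's LLM/vector-store-backed functions,
# so the pipeline wiring below is runnable on its own.
def generate_queries(query: str) -> list[str]:
    return [f"variant 1 of {query}", f"variant 2 of {query}"]


def aggregate_relevant_documents(queries: list[str]) -> list[str]:
    return [f"doc for {q}" for q in queries]


def generate_answer(query: str, docs: list[str]) -> str:
    return f"answer to {query!r} using {len(docs)} documents"


def answer_with_parallel_query_retrieval(user_query: str) -> str:
    # 1. fan out: generate similar queries, and keep the user's query too
    queries = generate_queries(user_query) + [user_query]
    # 2 & 3. fetch relevant documents per query & aggregate unique ones
    docs = aggregate_relevant_documents(queries)
    # 4. answer the user's query using the aggregated context
    return generate_answer(user_query, docs)


print(answer_with_parallel_query_retrieval("implemend goroutines golang"))
```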
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">FYI, LangChain provides <a target="_self" href="https://python.langchain.com/docs/how_to/MultiQueryRetriever/"><strong>MultiQueryRetriever</strong></a><strong> </strong>which combines steps 1 to 3 above in a single line of code 🤖. However, in my opinion, LangChain does too much abstraction, which kind of takes away the fun of building stuff.</div>
</div>

<p>As you can see below, even though I asked a question with a spelling mistake (to make the input even worse), my RAG application was able to answer it well.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747764193264/479593e7-2b1f-401f-92bd-e0b33add4b81.png" alt class="image--center mx-auto" /></p>
<hr />
<p>And that’s it, that’s how easy it is to implement <strong>Parallel Query Retrieval</strong>. In future articles in this series, I will discuss more techniques used in advanced RAG applications. Stay tuned.</p>
<p>Hope you liked this article. If you have questions/comments, please feel free to leave a comment.</p>
<p>Source Code: <a target="_blank" href="https://github.com/Niket1997/rag-tutorial/tree/main/2_parallel_query_retrieval">GitHub</a></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746365342314/04ddb40d-f9f2-4471-83b2-14b4d097075d.jpeg?auto=compress,format&amp;format=webp" alt /></p>
]]></content:encoded></item><item><title><![CDATA[2. Implementing RAG]]></title><description><![CDATA[This is a second article in my series, RAG Deep Dive. The goal of this series is to dive deep into the world of RAG & understand it from the first principles by actually implementing a scalable, production ready RAG system.
In the previous article, I...]]></description><link>https://blogs.niket.pro/implementing-rag</link><guid isPermaLink="true">https://blogs.niket.pro/implementing-rag</guid><category><![CDATA[RAG ]]></category><category><![CDATA[langchain]]></category><category><![CDATA[openai]]></category><category><![CDATA[pypdf]]></category><dc:creator><![CDATA[Aniket Mahangare]]></dc:creator><pubDate>Sun, 11 May 2025 17:42:59 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1746985239520/46dc0d1f-e435-4dfd-9f11-28b4ee7f6c22.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is the second article in my series, <a target="_blank" href="https://blogs.niket.pro/series/rag-deep-dive">RAG Deep Dive</a>. The goal of this series is to dive deep into the world of RAG &amp; understand it from first principles by actually implementing a scalable, production-ready RAG system.</p>
<p>In the previous article, <a target="_blank" href="https://blogs.niket.pro/rag-intro">Introduction to RAG</a>, we discussed what RAG is &amp; how it works. In this article, we will implement the most basic &amp; simplest RAG. The goal of this article is to show you how easy it is to build a basic RAG.</p>
<h2 id="heading-set-up">Set Up</h2>
<p><strong>Python</strong></p>
<p>Make sure you have Python installed locally, preferably the latest version.</p>
<p><strong>OpenAI</strong></p>
<p>You need to create an account with OpenAI &amp; generate an API key for testing. We will store this API key in a <code>.env</code> file to be used in the code. You can refer to <a target="_blank" href="https://www.youtube.com/watch?v=gBSh9JI28UQ">this short YouTube video</a> to learn how to generate an OpenAI API key.</p>
<p><strong>Clone GitHub Repository</strong></p>
<p>GitHub Repository: <a target="_blank" href="https://github.com/Niket1997/rag-tutorial">https://github.com/Niket1997/rag-tutorial</a></p>
<p><strong>Install Dependencies</strong></p>
<p>You also need to install the required dependencies. Open the cloned repository in the IDE of your choice &amp; run the following commands to install them.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># installing uv on mac</span>
brew install uv 

<span class="hljs-comment"># install dependencies</span>
uv pip install .
<span class="hljs-comment">## or alternatively, uv pip install -r pyproject.toml</span>
</code></pre>
<p><strong>Install Docker</strong></p>
<p>We will be using Docker to set up the vector database <code>qdrant</code> locally, hence you need to install Docker on your machine. Just Google it.</p>
<p><strong>Run</strong> <code>qdrant</code> <strong>locally using Docker</strong></p>
<p>To set up <code>qdrant</code> using Docker, we will use the following <code>docker-compose.yml</code> file.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">services:</span>
  <span class="hljs-attr">qdrant:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">qdrant/qdrant</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"6333:6333"</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">qdrant_data:/qdrant/storage</span>

<span class="hljs-attr">volumes:</span>
  <span class="hljs-attr">qdrant_data:</span>
</code></pre>
<p>You can start the <code>qdrant</code> docker container using following command.</p>
<pre><code class="lang-bash">docker compose -f docker-compose.yml up -d
</code></pre>
<p><strong>Create</strong> <code>.env</code> <strong>file</strong></p>
<p>Create a new file in the cloned repository with the name <code>.env</code> &amp; add the following contents to it.</p>
<pre><code class="lang-bash">OPENAI_API_KEY=<span class="hljs-string">"&lt;your-openai-api-key&gt;"</span>
QDRANT_URL=<span class="hljs-string">"http://localhost:6333"</span>
</code></pre>
<p>As mentioned in the previous article, a RAG system has two phases, the ingestion phase &amp; the query phase. Let’s code them one by one.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">We will be using the LangChain framework in this tutorial to build our basic RAG. LangChain is a widely used open-source framework for building applications on top of Large Language Models (LLMs). You can read more about LangChain <a target="_self" href="https://python.langchain.com/docs/introduction/">here</a>.</div>
</div>

<h2 id="heading-ingestion-phase">Ingestion Phase</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746972564577/8ca73ec5-b339-4b0b-836d-8d9bcdbdd9e3.png" alt class="image--center mx-auto" /></p>
<p>As mentioned in the Introduction to RAG article, the ingestion phase has the following steps. We will implement them one by one.</p>
<ol>
<li><p>Load Data</p>
</li>
<li><p>Chunk Data</p>
</li>
<li><p>Generate Vector Embeddings for Individual Chunks</p>
</li>
<li><p>Store Vector Embeddings for Chunks in Vector Database</p>
</li>
</ol>
<h3 id="heading-load-data">Load Data</h3>
<p>LangChain provides loaders for different types of data as described in the documentation <a target="_blank" href="https://python.langchain.com/docs/integrations/document_loaders/">here</a>. In our example, we want to load PDF data into our RAG system, hence we will be using <code>PyPDFLoader</code>. You can find its documentation <a target="_blank" href="https://python.langchain.com/docs/integrations/document_loaders/pypdfloader/">here</a>. You need the packages <code>langchain_community</code> &amp; <code>pypdf</code> for this.</p>
<p>The <code>docs</code> variable here will hold an array of pages. Each element in this array contains the contents of a particular page, in order.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> langchain_community.document_loaders <span class="hljs-keyword">import</span> PyPDFLoader

file_path = <span class="hljs-string">"./demo.pdf"</span>
loader = PyPDFLoader(file_path)
docs = loader.load()
</code></pre>
<h3 id="heading-chunk-data">Chunk Data</h3>
<p>A single page can contain a large amount of data, hence we need to chunk the data in <code>docs</code>. This can be achieved using text splitters. In our case, we will be using <code>RecursiveCharacterTextSplitter</code>. You can read more about it <a target="_blank" href="https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/recursive_text_splitter/">here</a>.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> langchain_text_splitters <span class="hljs-keyword">import</span> RecursiveCharacterTextSplitter

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_text_splitter</span>():</span>
    <span class="hljs-keyword">return</span> RecursiveCharacterTextSplitter(
        chunk_size=<span class="hljs-number">1000</span>,
        chunk_overlap=<span class="hljs-number">200</span>,
    )

text_splitter = get_text_splitter()
chunks = text_splitter.split_documents(docs)
</code></pre>
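<p>To build intuition for what <code>chunk_size</code> &amp; <code>chunk_overlap</code> mean, here is a simplified, dependency-free sketch of overlap-based chunking. Note that this is not how <code>RecursiveCharacterTextSplitter</code> works internally: the real splitter recursively tries separators like paragraph breaks, newlines &amp; spaces so that chunks end on natural boundaries. This only illustrates the sliding window the two parameters describe.</p>

```python
# A naive sliding-window chunker: NOT the real RecursiveCharacterTextSplitter,
# just an illustration of chunk_size & chunk_overlap.
def naive_chunk(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


chunks = naive_chunk("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

<p>The overlap ensures that a sentence cut at a chunk boundary still appears intact in the neighboring chunk, which helps similarity search later.</p>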
<h3 id="heading-generate-amp-store-vector-embeddings">Generate &amp; Store Vector Embeddings</h3>
<p>We need to generate vector embeddings for each chunk. We will use OpenAI’s <code>text-embedding-3-small</code> embedding model. Refer to the previous article in this series to learn more about vector embeddings. You need the package <code>langchain-openai</code> for this.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> langchain_openai <span class="hljs-keyword">import</span> OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model=<span class="hljs-string">"text-embedding-3-small"</span>,
)
</code></pre>
<p>We need to define certain functions &amp; variables that we will use to interact with <code>qdrant</code>. You need the packages <code>langchain-qdrant</code> &amp; <code>qdrant-client</code> for this.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os

<span class="hljs-keyword">from</span> qdrant_client <span class="hljs-keyword">import</span> QdrantClient
<span class="hljs-keyword">from</span> qdrant_client.models <span class="hljs-keyword">import</span> Distance, VectorParams
<span class="hljs-keyword">from</span> langchain_qdrant <span class="hljs-keyword">import</span> QdrantVectorStore

<span class="hljs-comment"># create qdrant client</span>
qdrant_client = QdrantClient(
    url=os.getenv(<span class="hljs-string">"QDRANT_URL"</span>),
)

<span class="hljs-comment"># create a collection if it doesn't exist</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_collection_if_not_exists</span>(<span class="hljs-params">collection_name: str</span>):</span>
    <span class="hljs-comment"># check if collection exists</span>
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> collection_exists(collection_name):
        <span class="hljs-comment"># create the collection if it doesn't exist</span>
        <span class="hljs-comment"># Note, here the dimensions 1536 is corresponding to the embedding model we chose</span>
        <span class="hljs-comment"># which is text-embedding-3-small</span>
        qdrant_client.create_collection(
            collection_name=collection_name,
            vectors_config=VectorParams(size=<span class="hljs-number">1536</span>, distance=Distance.COSINE),
        )
        print(<span class="hljs-string">f"Collection <span class="hljs-subst">{collection_name}</span> created"</span>)
    <span class="hljs-keyword">else</span>:
        print(<span class="hljs-string">f"Collection <span class="hljs-subst">{collection_name}</span> already exists"</span>)

<span class="hljs-comment"># check if collection exists</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">collection_exists</span>(<span class="hljs-params">collection_name: str</span>):</span>
    <span class="hljs-keyword">return</span> qdrant_client.collection_exists(collection_name)

<span class="hljs-comment"># get the qdrant vector store for collection</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_vector_store</span>(<span class="hljs-params">collection_name: str</span>):</span>
    <span class="hljs-keyword">return</span> QdrantVectorStore(
        collection_name=collection_name,
        client=qdrant_client,
        embedding=embeddings,
    )

<span class="hljs-comment"># get the collection name</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_collection_name</span>(<span class="hljs-params">file_name: str</span>):</span>
    <span class="hljs-keyword">return</span> <span class="hljs-string">f"rag_collection_<span class="hljs-subst">{file_name.split(<span class="hljs-string">'/'</span>)[<span class="hljs-number">-1</span>].split(<span class="hljs-string">'.'</span>)[<span class="hljs-number">0</span>]}</span>"</span>
</code></pre>
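<p>For example, <code>get_collection_name</code> simply derives a deterministic collection name from the file path, so that re-ingesting the same PDF reuses the same collection. The helper below is the same function as above, repeated so the example runs on its own:</p>

```python
# Same helper as above: derive a qdrant collection name from a file path
# by taking the file name without its extension.
def get_collection_name(file_name: str) -> str:
    return f"rag_collection_{file_name.split('/')[-1].split('.')[0]}"


print(get_collection_name("./docs/golang-book.pdf"))  # rag_collection_golang-book
```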
<p>We will use these methods &amp; above code to generate &amp; store vector embeddings for the PDF document.</p>
<pre><code class="lang-python"><span class="hljs-comment"># get the name of the collection in qdrant db based on the file</span>
collection_name = get_collection_name(pdf_path)

<span class="hljs-comment"># create the collection in qdrant db if it does not exists</span>
create_collection_if_not_exists(collection_name=collection_name)

<span class="hljs-comment"># this will create a vector store &amp; assign the OpenAI embeddings to it</span>
vector_store = get_vector_store(collection_name=collection_name)

<span class="hljs-comment"># this will generate the embeddings for the chunks &amp; add them to the vector store</span>
vector_store.add_documents(documents=chunks)
</code></pre>
<h2 id="heading-query-phase">Query Phase</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746368612179/b60337dd-d50a-49d5-8ac0-04966244b2fc.png" alt class="image--center mx-auto" /></p>
<p>Now that we have ingested the PDF document into our <code>qdrant</code> vector database, let’s see how we can utilize it to get the relevant chunks of data using <code>SimilaritySearch</code>, or as defined in the Introduction to RAG article, <code>SemanticSearch</code>.</p>
<h3 id="heading-generate-vector-embeddings-for-query">Generate Vector Embeddings for Query</h3>
<p>Let’s begin by writing a system prompt that we will use to provide instructions to the LLM, in our case OpenAI’s <code>gpt-4.1</code> model.</p>
<pre><code class="lang-python">system_prompt = <span class="hljs-string">"""
    You are a helpful AI assistant that can answer user's questions based on the documents provided.
    If there aren't any related documents, or if the user's query is not related to the documents, then you can provide the answer based on your knowledge.
    Think carefully before answering the user's question.
    """</span>
</code></pre>
<p>Now, we will generate vector embeddings for the user’s query &amp; try to find the chunks of documents relevant to it in our vector database. Here, we first check if the collection exists in our vector database &amp; if it does, we find the chunks of data that have a similarity score of at least 0.5 out of 1 (i.e. 50%) &amp; add them to our system prompt.</p>
<pre><code class="lang-python"><span class="hljs-comment"># get only the chunks that have a similarity score of at least 0.5 out of 1</span>
SIMILARITY_THRESHOLD = <span class="hljs-number">0.5</span>

collection_name = get_collection_name(file_name)
<span class="hljs-keyword">if</span> collection_exists(collection_name):
    vector_store = get_vector_store(collection_name)
    <span class="hljs-comment"># Get documents with their similarity scores</span>
    docs = vector_store.similarity_search_with_score(query, k=<span class="hljs-number">5</span>)

    <span class="hljs-keyword">for</span> doc, score <span class="hljs-keyword">in</span> docs:
        <span class="hljs-keyword">if</span> score &gt;= SIMILARITY_THRESHOLD:
            system_prompt += <span class="hljs-string">f"""
             Document: <span class="hljs-subst">{doc.page_content}</span>
             """</span>
</code></pre>
<p>Now we will define an LLM client that communicates with OpenAI &amp; uses the above system prompt, which now contains the context most relevant to the user’s query, along with the query itself to get a more refined &amp; relevant answer.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> langchain_openai <span class="hljs-keyword">import</span> ChatOpenAI

llm = ChatOpenAI(
    model=<span class="hljs-string">"gpt-4.1"</span>,
)

messages = [(<span class="hljs-string">"system"</span>, system_prompt), (<span class="hljs-string">"user"</span>, query)]

response = llm.invoke(messages)

print(<span class="hljs-string">f"response: <span class="hljs-subst">{response.content}</span>"</span>)
</code></pre>
<p>And that’s all, we just built our first RAG from scratch. Just run the <code>main.py</code> file in the <code>1_implementing_basic_rag</code> directory and you can interact with the RAG.</p>
<p>I am attaching a screenshot of one run of our basic RAG application.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746984478665/4572c0b9-4d97-451b-8e49-41ed0ec6d16d.png" alt class="image--center mx-auto" /></p>
<hr />
<p>So that’s it for this one. Hope you liked this article on implementing a basic RAG from scratch! In the next set of articles, we will discuss how to optimize our RAG application to make it production-ready. There are various techniques used in production-ready RAG applications to make them performant &amp; efficient at scale. Stay tuned to learn more about them.</p>
<p>If you have questions/comments, then please feel free to comment on this article.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746365342314/04ddb40d-f9f2-4471-83b2-14b4d097075d.jpeg" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[1. Introduction to RAG]]></title><description><![CDATA[You may have observed recently that this new buzz word RAG is sprinkled all over your LinkedIn feed. Frustrated with constant bombarding of this word on my feed, I caved in and decided to understand what this word means. What I found was quite intere...]]></description><link>https://blogs.niket.pro/rag-intro</link><guid isPermaLink="true">https://blogs.niket.pro/rag-intro</guid><category><![CDATA[RAG ]]></category><category><![CDATA[rag chatbot]]></category><dc:creator><![CDATA[Aniket Mahangare]]></dc:creator><pubDate>Sun, 04 May 2025 14:28:34 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1746358584397/2dd46aca-0b39-42ca-a371-4722799fc03d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You may have observed recently that this new buzzword RAG is sprinkled all over your LinkedIn feed. Frustrated with the constant bombardment of this word on my feed, I caved in and decided to understand what it means. What I found was quite interesting, hence I decided to write a series of articles on this topic. This one is the first in the series and will introduce you to the world of RAG.</p>
<h2 id="heading-what-is-rag">What is RAG?</h2>
<p>RAG stands for Retrieval Augmented Generation. A terrifying set of words, isn’t it? Don’t worry, we will break them down in this section. For now, all you need to understand is that it’s a framework built to pass better &amp; more relevant context to large language models to get better responses. If you have used tools like ChatGPT or Google Gemini, then you know that the quality of answers from these tools improves drastically when you pass more relevant pieces of information.</p>
<p>Now, let’s break down those words.</p>
<ul>
<li><p>Retrieval → It refers to the process of retrieving/fetching the relevant pieces of information. How and from where? We will discuss that later in this article.</p>
</li>
<li><p>Augmented → In this context, Augmented means enhancing large language models by enriching them with information more relevant to the users’ queries.</p>
</li>
<li><p>Generation → This is the core capability of LLMs. Given an input prompt, generate a relevant piece of data such as answers, explanations, summaries, etc.</p>
</li>
</ul>
<h2 id="heading-semantic-search">Semantic Search</h2>
<p>Before we get into the implementation details, we must understand Semantic Search, the core principle on which RAG systems work. <strong>Semantic search</strong> is a way of finding information based on <strong>meaning</strong> rather than just matching exact words. In simple words, semantic search finds what you mean, not just what you type.</p>
<h3 id="heading-heres-how-semantic-search-works">Here’s how semantic search works:</h3>
<ol>
<li><p>Turning text into meaning vectors: A piece of text can be passed to a pre-trained model (like Sentence-BERT or OpenAI’s text embeddings) that maps the text into vectors that capture its meaning. The model converts the text into a fixed-length list of numbers (e.g. a 768-dimensional vector). Those numbers encode the text’s meaning in a high-dimensional “semantic space.”</p>
</li>
<li><p>Indexing for faster lookup: These vector embeddings are stored in a vector database. The database builds an index so it can quickly find which vectors lie closest to any given point in that space.</p>
</li>
<li><p>Querying with meaning: When you type a search query (“why is life so hard? 😔”), the system also turns it into its own vector. It then asks the vector database, “Which stored vectors are most similar to this query vector?”. If your RAG has previously stored data that can handle such queries, then the LLM’s response will be much better.</p>
</li>
</ol>
<p>The key benefit of semantic search is that even if a document doesn’t literally say “why is life so hard? 😔”, it might use synonyms (“What makes life so challenging?”, “Why do I face so many obstacles in life?”) and still be retrieved, because its vector sits near your query’s vector in the space.</p>
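<p>To make “nearness in the semantic space” concrete, here is a toy sketch using cosine similarity on hand-written 3-dimensional vectors (the numbers are made up for illustration; real embeddings have hundreds of dimensions and come from a trained model):</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Imaginary 3-d embeddings; a real model would produce e.g. 768 dimensions.
query      = [0.9, 0.1, 0.0]   # "why is life so hard?"
paraphrase = [0.8, 0.2, 0.1]   # "what makes life so challenging?"
unrelated  = [0.0, 0.1, 0.9]   # "how to bake sourdough bread"

print(cosine_similarity(query, paraphrase))  # high, despite no shared words
print(cosine_similarity(query, unrelated))   # low
```

<p>The paraphrase scores close to 1.0 against the query while the unrelated text scores near 0, which is exactly the property the vector database exploits.</p>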
<p>Semantic search works on different types of data such as text, video, audio, images, etc. As long as you have a model that maps your data (text, pixels, audio waveforms, code tokens…) into real-valued vectors that capture “meaning” in that domain, you can perform semantic search.</p>
<p><strong>Spotify</strong> uses audio embeddings to power “Fans also like” and “Discover Weekly” by finding tracks whose embeddings cluster together.</p>
<p>You can watch the following video to understand semantic search &amp; vector databases better.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/gl1r1XV0SLw?si=paNNqhkEHfzGHKnw">https://youtu.be/gl1r1XV0SLw?si=paNNqhkEHfzGHKnw</a></div>
<h2 id="heading-phases-of-rag">Phases of RAG</h2>
<p>RAG in its most basic form has two phases. Let’s understand these phases with an example. Say you have a big PDF document &amp; you want to get answers to some questions based on that document.</p>
<h3 id="heading-ingestion-phase"><strong>Ingestion Phase</strong></h3>
<p>This refers to ingesting into the RAG system the data that will be utilized to pass better context to the LLM. In our example, we upload our PDF document to the RAG, which indexes the document and stores it in such a way that it’s easy to fetch relevant information from it.</p>
<p>This phase has the following steps:</p>
<ol>
<li><p>Load Data: The first step in ingestion is loading the data. The data can be uploaded by users, or we may already have certain data on which we want to build a specialized RAG system.</p>
</li>
<li><p>Chunk Data: In this step the loaded data is split into smaller pieces called chunks. Chunking splits large documents into smaller passages that fit within the model’s context window, since the retrieved context can’t exceed that window. This also ensures that we don’t pass the whole document, in a nutshell too much context, to the LLM.</p>
</li>
<li><p>Generate Vector Embeddings: As discussed before, in this step we generate the vector embeddings for each chunk of the data. We rely on vector embedding models for this step.</p>
</li>
<li><p>Store Vector Embeddings: In this step, we store the vector embeddings of the chunks in a vector database such as Pinecone for fast &amp; efficient semantic search.</p>
</li>
</ol>
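<p>As a rough illustration of the chunking step, here is a minimal character-based splitter with overlap (a sketch with hypothetical parameters; real pipelines, such as LangChain’s text splitters, also respect sentence and paragraph boundaries):</p>

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping chunks of at most chunk_size characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some overlap
    return chunks

doc = "RAG systems retrieve relevant context before generation. " * 20
for i, chunk in enumerate(chunk_text(doc)):
    print(i, len(chunk))
```

<p>The overlap keeps a sentence that straddles a chunk boundary from being cut off in both neighboring chunks.</p>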
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746972529783/c869611f-3041-4d3d-991d-2ec6b2c8461f.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-query-phase">Query Phase</h3>
<p>This refers to fetching the data most relevant to the user’s query, which is then passed to the LLM. In our example, say you have a question about your document &amp; you ask the RAG system. It looks at the stored information and fetches the most relevant pieces of data, which are passed to the LLM to answer your question.</p>
<p>The query phase has the following steps:</p>
<ol>
<li><p>Generate Vector Embeddings for Query: In this step we generate vector embeddings for the user’s query using the same embedding model used during ingestion.</p>
</li>
<li><p>Semantic Search: In this step, we use the vector embeddings generated for the user’s query to do a similarity search on a vector database. This step returns the most relevant chunks of data corresponding to the user’s query.</p>
</li>
<li><p>Generate Response: In this step, we take the information retrieved from the vector database &amp; pass it to the LLM. Since the LLM now has the most relevant context for the user’s query, it will be able to generate good results.</p>
</li>
</ol>
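<p>The three steps above can be sketched end to end with a toy in-memory store (all names here are hypothetical; a word-overlap “embedding” stands in for a real embedding model, and a linear scan stands in for a vector database index):</p>

```python
import string

def embed(text):
    """Toy 'embedding': a set of lowercase words. A real model returns a dense vector."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())

def similarity(a, b):
    """Jaccard word overlap, standing in for cosine similarity."""
    return len(a & b) / len(a | b) if a | b else 0.0

store = []  # (embedding, chunk) pairs; a real system uses a vector database

def ingest(chunk):
    store.append((embed(chunk), chunk))

def retrieve(query, k=2):
    # Step 1: embed the query. Step 2: semantic search over the store.
    q = embed(query)
    ranked = sorted(store, key=lambda item: similarity(q, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

ingest("the event loop handles thousands of clients on one thread")
ingest("chunking splits documents to fit the context window")
ingest("vector databases index embeddings for fast lookup")

# Step 3 would pass these retrieved chunks plus the query to the LLM.
print(retrieve("how do vector databases store embeddings?"))
```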
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Steps 1 &amp; 2 here combined are called the Retrieval Phase.</div>
</div>

<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746368612179/b60337dd-d50a-49d5-8ac0-04966244b2fc.png" alt class="image--center mx-auto" /></p>
<hr />
<p>So that’s it for this one. Hope you liked this introductory article on RAG! In the next article, we will build a simple RAG system: we will upload a PDF to it &amp; ask the system questions about the PDF. The system will integrate with a vector database &amp; the OpenAI APIs. Stay tuned for the next one!</p>
<p>If you have questions/comments, then please feel free to comment on this article.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746365342314/04ddb40d-f9f2-4471-83b2-14b4d097075d.jpeg" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[Implementing Event Loops in Go: A Practical Approach]]></title><description><![CDATA[Ever wondered how single threaded applications, like Redis, are able to handle thousands of clients concurrently (“perceived” concurrency)? The answer is “Event Loops“. In this article, we will dive deep into how event loops work & their implementati...]]></description><link>https://blogs.niket.pro/event-loops-go</link><guid isPermaLink="true">https://blogs.niket.pro/event-loops-go</guid><category><![CDATA[Event Loop]]></category><category><![CDATA[golang]]></category><category><![CDATA[Redis]]></category><category><![CDATA[single-threaded]]></category><category><![CDATA[programming]]></category><category><![CDATA[operating system]]></category><dc:creator><![CDATA[Aniket Mahangare]]></dc:creator><pubDate>Mon, 14 Oct 2024 19:22:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1728933619707/a5980cf2-6f1e-44fa-8baf-b10dcab59f3d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ever wondered how single-threaded applications, like Redis, are able to handle thousands of clients concurrently (“perceived” concurrency)? The answer is “Event Loops”. In this article, we will dive deep into how event loops work &amp; their implementation in GoLang.</p>
<h3 id="heading-event-loops"><strong>Event Loops</strong></h3>
<p>An event loop is a system that continuously listens for events (like user actions or messages) and handles each one sequentially. This allows programs to manage multiple tasks smoothly and efficiently using just a single thread.</p>
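<p>Stripped of all kernel machinery, an event loop is just a loop that dequeues events and dispatches a handler for each one. Here is a toy sketch (in Python for brevity; an in-memory queue stands in for the kernel notifications the rest of this article is about):</p>

```python
from collections import deque

handled = []

def on_connect(name):
    handled.append(f"connected: {name}")

def on_data(name):
    handled.append(f"data from: {name}")

# In a real server these events would come from the kernel (kqueue/epoll).
events = deque([
    ("client-1", on_connect),
    ("client-2", on_connect),
    ("client-1", on_data),
])

# The event loop: take events one by one and handle each sequentially,
# all on a single thread.
while events:
    name, handler = events.popleft()
    handler(name)

print(handled)
```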
<h3 id="heading-concurrency-models-for-server-architecture"><strong>Concurrency Models for Server Architecture</strong></h3>
<p>There are two basic concurrency models for server architecture:</p>
<ol>
<li><p>Thread-Per-Request: This model uses a separate thread to handle each incoming client request. When a new request arrives, the server creates a new thread (or utilizes one from a thread pool) to process it independently.</p>
</li>
<li><p>I/O Multiplexing: This model allows a single thread (or a limited number of threads) to monitor and manage multiple I/O streams (like network sockets, files, or pipes) simultaneously. Instead of dedicating a thread to each request, the server uses mechanisms to detect when I/O operations (like reading or writing data) are ready to be performed. This thread then takes actions as per events on these streams.</p>
</li>
</ol>
<p>The key challenge of the Thread-Per-Request model is that the application needs to be thread-safe, which requires locking mechanisms; these in turn increase code complexity &amp; slow down execution, as multiple threads can compete to acquire the lock for a critical section.</p>
<p>Single-threaded programs don’t need to handle thread safety, so the CPU time allocated to them can be utilized more efficiently. Single-threaded applications usually rely on I/O multiplexing to implement event loops, so that they can serve clients concurrently.</p>
<h3 id="heading-key-concepts">Key Concepts</h3>
<p><strong>User Space vs. Kernel Space</strong></p>
<ul>
<li><p><strong>User Space</strong>: User space is the environment where user-facing applications run. This includes applications such as web servers, Chrome, text editors, and command utilities. User space applications cannot directly access the system’s hardware resources. They must make system calls to the kernel to request access to these resources.</p>
</li>
<li><p><strong>Kernel Space</strong>: Kernel space is where the core of the operating system, the kernel, operates. The kernel is responsible for managing the system’s resources, such as the CPU, memory, and storage. It also provides system calls, which are interfaces that allow user space applications to interact with the kernel. The kernel has unrestricted access to the system’s hardware resources. This is necessary for the kernel to perform its essential tasks, such as scheduling processes, managing memory, and handling interrupts.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728928543895/7194f6cd-12a3-44b5-b0cc-a196281c360c.png" alt class="image--center mx-auto" /></p>
  <div data-node-type="callout">
  <div data-node-type="callout-emoji">💡</div>
  <div data-node-type="callout-text">You can read more about the User-space, Kernel-space, and System Calls <a target="_blank" href="https://www.codeinsideout.com/blog/linux/system-call/">here</a>.</div>
  </div>


</li>
</ul>
<p><strong>Kernel Buffers</strong></p>
<ul>
<li><p><strong>Receive Buffer</strong>: When data arrives from a network or other I/O source, it's stored in a kernel-managed buffer until the application reads it.</p>
</li>
<li><p><strong>Send Buffer</strong>: Data that an application wants to send is stored in a kernel buffer before being transmitted over the network or I/O device.</p>
</li>
</ul>
<p><strong>File Descriptors (FDs)</strong></p>
<ul>
<li><p><strong>Definition</strong>: Integers that uniquely identify an open file, socket, or other I/O resource within the operating system.</p>
</li>
<li><p><strong>Usage</strong>: Applications use FDs to perform read/write operations on these resources.</p>
</li>
</ul>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">I highly recommend you watch these YouTube videos on <a target="_blank" href="https://www.youtube.com/watch?v=-gP58pozNuM">File Descriptors</a> &amp; <a target="_blank" href="https://www.youtube.com/watch?v=gYpWkbm6K98">System Calls</a> in Linux. TL/DR, everything in unix/linux is a file &amp; the OS provides system calls to interact with resources.</div>
</div>

<h2 id="heading-io-multiplexing-mechanisms">I/O Multiplexing Mechanisms</h2>
<p>kqueue (on macOS) and epoll (on Linux) are kernel system calls that provide scalable I/O event notification. In simple words, you subscribe to certain kernel events and get notified when any of those events occur. These system calls are designed for scalable situations such as a web server handling thousands of concurrent connections.</p>
<p>In this article, I will focus on using <code>kqueue</code>, however, I will share the GitHub repo with code for implementation using <code>epoll</code>.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">You can read more on these system calls <a target="_blank" href="https://nima101.github.io/io_multiplexing">here</a>.</div>
</div>
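<p>For a quick feel of these mechanisms before we drop down to raw <code>kqueue</code> in Go, Python’s standard <code>selectors</code> module wraps exactly them (kqueue on macOS, epoll on Linux). This sketch multiplexes one end of a connected socket pair:</p>

```python
import selectors
import socket

sel = selectors.DefaultSelector()  # kqueue on macOS, epoll on Linux

# A connected socket pair stands in for a real client connection.
server_side, client_side = socket.socketpair()
server_side.setblocking(False)

# Subscribe to "readable" events on the server-side socket.
sel.register(server_side, selectors.EVENT_READ)

client_side.sendall(b"ping")

# Block until the kernel reports at least one ready file descriptor.
for key, _mask in sel.select(timeout=1):
    data = key.fileobj.recv(1024)
    print("received:", data)

sel.unregister(server_side)
server_side.close()
client_side.close()
```

<p>The Go implementation below does the same subscribe-wait-handle dance, just directly against the system calls.</p>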

<h2 id="heading-implementation-of-io-multiplexing-in-golang">Implementation of I/O Multiplexing in GoLang</h2>
<p>In Go, we can use the <a target="_blank" href="http://golang.org/x/sys/unix"><code>golang.org/x/sys/unix</code></a> package to access low-level system calls like <code>kqueue</code> on Unix-like systems.</p>
<h3 id="heading-step-1-define-server-configuration"><strong>Step 1: Define Server Configuration</strong></h3>
<p>Create a configuration struct or use variables to hold server parameters.</p>
<pre><code class="lang-go"><span class="hljs-keyword">var</span> (
    host       = <span class="hljs-string">"127.0.0.1"</span> <span class="hljs-comment">// Server IP address</span>
    port       = <span class="hljs-number">8080</span>        <span class="hljs-comment">// Server port</span>
    maxClients = <span class="hljs-number">20000</span>       <span class="hljs-comment">// Maximum number of concurrent clients</span>
)
</code></pre>
<h3 id="heading-step-2-create-the-server-socket"><strong>Step 2: Create the Server Socket</strong></h3>
<p>A new socket can be created using <code>unix.Socket</code> method. A socket can be thought of as an endpoint in a two-way communication channel. Socket routines create the communication channel, and the channel is used to send data between application programs either locally or over networks. Each socket within the network has a unique name associated with it called a socket descriptor—a full-word integer that designates a socket and allows application programs to refer to it when needed.</p>
<p>In simpler terms, a socket is like a door through which data enters and exits a program over the network. It enables inter-process communication, either on the same machine or across different machines connected via a network. Each of these sockets is assigned a file descriptor when they are created.</p>
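<p>As a quick aside, you can see that “a socket is just a file descriptor” from any language; in Python, for instance:</p>

```python
import socket

# A TCP/IPv4 socket, analogous to
# unix.Socket(unix.AF_INET, unix.SOCK_STREAM, unix.IPPROTO_TCP) in Go.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# The OS hands back a small non-negative integer: the file descriptor.
fd = sock.fileno()
print("socket file descriptor:", fd)

sock.close()
```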
<pre><code class="lang-go">serverFD, err := unix.Socket(unix.AF_INET, unix.SOCK_STREAM, unix.IPPROTO_TCP)
<span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
    <span class="hljs-keyword">return</span> fmt.Errorf(<span class="hljs-string">"socket creation failed: %v"</span>, err)
}
<span class="hljs-keyword">defer</span> unix.Close(serverFD)
</code></pre>
<ul>
<li><p><strong>unix.AF_INET</strong>: This option specifies that the socket will use IPv4 Internet protocol.</p>
</li>
<li><p><strong>unix.SOCK_STREAM</strong>: This option provides reliable, ordered, and error-checked delivery of a stream of bytes, typically using TCP.</p>
</li>
<li><p><strong>unix.IPPROTO_TCP</strong>: This option specifies that the socket will use the TCP protocol for communication, ensuring reliable data transmission.</p>
</li>
</ul>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Check <a target="_blank" href="https://www.ibm.com/docs/en/zos/3.1.0?topic=ncuuss-what-is-socket">this</a> documentation by IBM to read more on sockets.</div>
</div>

<h3 id="heading-step-3-set-socket-options"><strong>Step 3: Set Socket Options</strong></h3>
<h4 id="heading-set-non-blocking-mode"><strong>Set Non-blocking Mode</strong></h4>
<p>Setting a socket to <strong>non-blocking mode</strong> ensures that I/O operations return immediately without waiting. When a socket operates in this mode:</p>
<ul>
<li><p><code>accept</code>: If there are no incoming connections, it immediately returns an error (<code>EAGAIN</code> or <code>EWOULDBLOCK</code>) instead of waiting.</p>
</li>
<li><p><code>read</code>/<code>recv</code>: If there's no data to read, it immediately returns an error instead of blocking.</p>
</li>
<li><p><code>write</code>/<code>send</code>: If the socket's buffer is full and can't accept more data, it immediately returns an error instead of waiting.</p>
</li>
</ul>
<pre><code class="lang-go"><span class="hljs-keyword">if</span> err := unix.SetNonblock(serverFD, <span class="hljs-literal">true</span>); err != <span class="hljs-literal">nil</span> {
    <span class="hljs-keyword">return</span> fmt.Errorf(<span class="hljs-string">"failed to set non-blocking mode: %v"</span>, err)
}
</code></pre>
<h4 id="heading-allow-address-reuse"><strong>Allow Address Reuse</strong></h4>
<p>This is particularly useful in scenarios where you need to restart a server quickly without waiting for the operating system to release the port.</p>
<pre><code class="lang-go"><span class="hljs-keyword">if</span> err := unix.SetsockoptInt(serverFD, unix.SOL_SOCKET, unix.SO_REUSEADDR, <span class="hljs-number">1</span>); err != <span class="hljs-literal">nil</span> {
    <span class="hljs-keyword">return</span> fmt.Errorf(<span class="hljs-string">"failed to set SO_REUSEADDR: %v"</span>, err)
}
</code></pre>
<h3 id="heading-step-4-bind-and-listen"><strong>Step 4: Bind and Listen</strong></h3>
<h4 id="heading-bind-the-socket"><strong>Bind the Socket</strong></h4>
<p>Socket binding involves linking a socket to a specific local address and port on your computer. Essentially, it tells the operating system, "Hey, my application is ready to handle any network traffic that comes to this address and port."</p>
<p>When you're setting up a server in network programming, binding is a crucial first step. Before your server can start accepting connections or receiving data, it needs to bind its socket to a chosen address and port. This connection point is where clients will reach out to connect or send information.</p>
<pre><code class="lang-go">addr := &amp;unix.SockaddrInet4{Port: port}
<span class="hljs-built_in">copy</span>(addr.Addr[:], net.ParseIP(host).To4())

<span class="hljs-keyword">if</span> err := unix.Bind(serverFD, addr); err != <span class="hljs-literal">nil</span> {
    <span class="hljs-keyword">return</span> fmt.Errorf(<span class="hljs-string">"failed to bind socket: %v"</span>, err)
}
</code></pre>
<p><strong>unix.SockaddrInet4</strong>: This <code>struct</code> holds the IP/host address (IPv4) &amp; the port of your server.</p>
<h4 id="heading-start-listening"><strong>Start Listening</strong></h4>
<pre><code class="lang-go"><span class="hljs-keyword">if</span> err := unix.Listen(serverFD, maxClients); err != <span class="hljs-literal">nil</span> {
    <span class="hljs-keyword">return</span> fmt.Errorf(<span class="hljs-string">"failed to listen on socket: %v"</span>, err)
}
</code></pre>
<h3 id="heading-step-5-initialize-kqueue"><strong>Step 5: Initialize kqueue</strong></h3>
<pre><code class="lang-go">kq, err := unix.Kqueue()
<span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
    <span class="hljs-keyword">return</span> fmt.Errorf(<span class="hljs-string">"failed to create kqueue: %v"</span>, err)
}
<span class="hljs-keyword">defer</span> unix.Close(kq)
</code></pre>
<ul>
<li><strong>unix.Kqueue()</strong>: Creates a new kernel event queue and returns a file descriptor associated with this <code>kqueue</code>.</li>
</ul>
<h3 id="heading-step-6-register-server-fd-with-kqueue"><strong>Step 6: Register Server FD with kqueue</strong></h3>
<p>Register the file descriptor associated with the server socket to monitor for incoming connections. Just to reiterate, everything in Linux/Unix is a file. Basically, when clients want to establish a connection with our server, <code>kqueue</code> monitors these events &amp; notifies our application to take action accordingly.</p>
<pre><code class="lang-go">kev := unix.Kevent_t{
    Ident:  <span class="hljs-keyword">uint64</span>(serverFD),
    Filter: unix.EVFILT_READ,
    Flags:  unix.EV_ADD,
}

<span class="hljs-keyword">if</span> _, err := unix.Kevent(kq, []unix.Kevent_t{kev}, <span class="hljs-literal">nil</span>, <span class="hljs-literal">nil</span>); err != <span class="hljs-literal">nil</span> {
    <span class="hljs-keyword">return</span> fmt.Errorf(<span class="hljs-string">"failed to register server FD with kqueue: %v"</span>, err)
}
</code></pre>
<ul>
<li><p><strong>Ident</strong>: The identifier (file descriptor) to watch, in this case we want to watch the file descriptor associated with our server.</p>
</li>
<li><p><strong>Filter</strong>: The type of event to watch (<code>unix.EVFILT_READ</code> for read events).</p>
</li>
<li><p><strong>Flags</strong>: Actions to perform (<code>unix.EV_ADD</code> to add the event).</p>
</li>
</ul>
<p>The <code>Kevent</code> method here is used to perform certain actions on the kernel event queue we created before. It accepts the following parameters:</p>
<ul>
<li><p>The file descriptor associated with kqueue</p>
</li>
<li><p>A slice of <code>Kevent_t</code> structs. This slice tells kqueue what changes you want to make. Here, you're adding a new event (like monitoring a socket for incoming connections).</p>
</li>
<li><p>An event list, this would be a slice where kqueue writes back any events that have occurred. We will see this in the next section.</p>
</li>
<li><p>A timeout, that defines how long <code>kevent</code> should wait for events.</p>
</li>
</ul>
<h3 id="heading-step-7-enter-the-event-loop"><strong>Step 7: Enter the Event Loop</strong></h3>
<p>Create a loop to wait for events and handle them.</p>
<pre><code class="lang-go">events := <span class="hljs-built_in">make</span>([]unix.Kevent_t, maxClients)

<span class="hljs-keyword">for</span> {
    nevents, err := unix.Kevent(kq, <span class="hljs-literal">nil</span>, events, <span class="hljs-literal">nil</span>)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">if</span> err == unix.EINTR {
            <span class="hljs-keyword">continue</span> <span class="hljs-comment">// Interrupted system call, retry</span>
        }
        <span class="hljs-keyword">return</span> fmt.Errorf(<span class="hljs-string">"kevent error: %v"</span>, err)
    }

    <span class="hljs-keyword">for</span> i := <span class="hljs-number">0</span>; i &lt; nevents; i++ {
        ev := events[i]
        fd := <span class="hljs-keyword">int</span>(ev.Ident)

        <span class="hljs-keyword">if</span> fd == serverFD {
            <span class="hljs-comment">// Handle new incoming connection</span>
        } <span class="hljs-keyword">else</span> {
            <span class="hljs-comment">// Handle client I/O</span>
        }
    }
}
</code></pre>
<ul>
<li><strong>unix.Kevent</strong>: This is a blocking call that waits for events until it times out (the timeout is optional). This method is used both to wait for events from <code>kqueue</code> and to alter the events monitored by <code>kqueue</code>.</li>
</ul>
<h3 id="heading-step-8-accept-new-connections"><strong>Step 8: Accept New Connections</strong></h3>
<p>As we are monitoring the file descriptor associated with the server socket, <code>kqueue</code> returns events such as a new client connection request. When the server socket is ready &amp; a client requests to connect, we accept the connection.</p>
<pre><code class="lang-go">nfd, sa, err := unix.Accept(serverFD)
<span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
    log.Printf(<span class="hljs-string">"failed to accept connection: %v"</span>, err)
    <span class="hljs-keyword">continue</span>
}
<span class="hljs-comment">// Note: do not defer-close nfd here; the client FD must stay open for</span>
<span class="hljs-comment">// future events and is closed explicitly on errors or on disconnect.</span>

<span class="hljs-comment">// Set the new socket to non-blocking mode</span>
<span class="hljs-keyword">if</span> err := unix.SetNonblock(nfd, <span class="hljs-literal">true</span>); err != <span class="hljs-literal">nil</span> {
    log.Printf(<span class="hljs-string">"failed to set non-blocking mode on client FD: %v"</span>, err)
    unix.Close(nfd)
    <span class="hljs-keyword">continue</span>
}

<span class="hljs-comment">// Register the new client FD with kqueue</span>
clientKev := unix.Kevent_t{
    Ident:  <span class="hljs-keyword">uint64</span>(nfd),
    Filter: unix.EVFILT_READ,
    Flags:  unix.EV_ADD,
}

<span class="hljs-keyword">if</span> _, err := unix.Kevent(kq, []unix.Kevent_t{clientKev}, <span class="hljs-literal">nil</span>, <span class="hljs-literal">nil</span>); err != <span class="hljs-literal">nil</span> {
    log.Printf(<span class="hljs-string">"failed to register client FD with kqueue: %v"</span>, err)
    unix.Close(nfd)
    <span class="hljs-keyword">continue</span>
}

log.Printf(<span class="hljs-string">"accepted new connection from %v"</span>, sa)
</code></pre>
<ul>
<li><p><strong>unix.Accept</strong>: Method to accept new incoming connection from clients.</p>
</li>
<li><p><strong>nfd</strong>: When the server accepts a new connection, it creates a new socket for that client. <code>nfd</code> is the file descriptor associated with that client socket.</p>
</li>
<li><p><strong>sa</strong>: This is the socket address of the connecting client.</p>
</li>
<li><p><strong>Register the client FD</strong>: When the server accepts a connection from a client, we register the file descriptor associated with the client socket in <code>kqueue</code>, so that we can monitor events from the client such as <code>new data sent</code>, <code>connection terminated</code>, etc.</p>
</li>
</ul>
<h3 id="heading-step-9-handle-client-io"><strong>Step 9: Handle Client I/O</strong></h3>
<p>When clients send data to our server, <code>kqueue</code> notifies our application, and we take action accordingly.</p>
<pre><code class="lang-go">buf := <span class="hljs-built_in">make</span>([]<span class="hljs-keyword">byte</span>, <span class="hljs-number">1024</span>)
n, err := unix.Read(fd, buf)
<span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
    <span class="hljs-keyword">if</span> err == unix.EAGAIN || err == unix.EWOULDBLOCK {
        <span class="hljs-comment">// No data available right now</span>
        <span class="hljs-keyword">continue</span>
    }
    log.Printf(<span class="hljs-string">"failed to read from client FD %d: %v"</span>, fd, err)
    <span class="hljs-comment">// Remove the FD from kqueue and close it</span>
    kev := unix.Kevent_t{
        Ident:  <span class="hljs-keyword">uint64</span>(fd),
        Filter: unix.EVFILT_READ,
        Flags:  unix.EV_DELETE,
    }
    unix.Kevent(kq, []unix.Kevent_t{kev}, <span class="hljs-literal">nil</span>, <span class="hljs-literal">nil</span>)
    unix.Close(fd)
    <span class="hljs-keyword">continue</span>
}

<span class="hljs-keyword">if</span> n == <span class="hljs-number">0</span> {
    <span class="hljs-comment">// Connection closed by client</span>
    kev := unix.Kevent_t{
        Ident:  <span class="hljs-keyword">uint64</span>(fd),
        Filter: unix.EVFILT_READ,
        Flags:  unix.EV_DELETE,
    }
    unix.Kevent(kq, []unix.Kevent_t{kev}, <span class="hljs-literal">nil</span>, <span class="hljs-literal">nil</span>)
    unix.Close(fd)
    <span class="hljs-keyword">continue</span>
}

<span class="hljs-comment">// Process the data received</span>
data := buf[:n]
log.Printf(<span class="hljs-string">"received data from client FD %d: %s"</span>, fd, <span class="hljs-keyword">string</span>(data))

<span class="hljs-comment">// Echo the data back to the client (optional)</span>
<span class="hljs-keyword">if</span> _, err := unix.Write(fd, data); err != <span class="hljs-literal">nil</span> {
    log.Printf(<span class="hljs-string">"failed to write to client FD %d: %v"</span>, fd, err)
    <span class="hljs-comment">// Handle write error if necessary</span>
}
</code></pre>
<ul>
<li><p><strong>unix.Read</strong>: Reads data from the file descriptor associated with the client's socket.</p>
</li>
<li><p><strong>Handling errors</strong>: <strong>EAGAIN</strong> / <strong>EWOULDBLOCK</strong> mean no data is available right now; in non-blocking mode this is normal. You might assume that if <code>kqueue</code> says there is data to read, a subsequent read must succeed, but in rare cases that isn't true, so it's recommended to handle these errors explicitly. You can read more about this in <a target="_blank" href="https://beej.us/guide/bgnet/html/index-wide.html#:~:text=Quick%20note%20to%20all,socket%20to%20non%2Dblocking.">Beej's Guide to Network Programming</a>; here is the relevant quote from the book.</p>
</li>
</ul>
<blockquote>
<p>Quick note to all you Linux fans out there: sometimes, in rare circumstances, Linux’s <code>select()</code> can return “ready-to-read” and then not actually be ready to read! This means it will block on the <code>read()</code> after the <code>select()</code> says it won’t! Why you little—! Anyway, the workaround solution is to set the <code>O_NONBLOCK</code> flag on the receiving socket so it errors with <code>EWOULDBLOCK</code> (which you can just safely ignore if it occurs).</p>
</blockquote>
<ul>
<li><strong>n == 0</strong>: A read of zero bytes means the client closed the connection; deregister the FD from <code>kqueue</code> and close it.</li>
<li><strong>Other errors</strong>: Deregister the FD from <code>kqueue</code> and close the connection.</li>
</ul>
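<p>The quote above can be demonstrated directly. The sketch below uses the standard library's <code>syscall</code> package (whose constants mirror those in <code>golang.org/x/sys/unix</code>, and which also runs on Linux) to put a pipe's read end into non-blocking mode and read it while empty: instead of blocking, the read fails immediately with <code>EAGAIN</code>, which is exactly the recoverable condition our event loop skips over.</p>
<pre><code class="lang-go">package main

import (
    "fmt"
    "syscall"
)

func main() {
    // Create a pipe: fds[0] is the read end, fds[1] the write end.
    fds := make([]int, 2)
    if err := syscall.Pipe(fds); err != nil {
        panic(err)
    }
    defer syscall.Close(fds[0])
    defer syscall.Close(fds[1])

    // Put the read end into non-blocking mode, as we do for client sockets.
    if err := syscall.SetNonblock(fds[0], true); err != nil {
        panic(err)
    }

    // Nothing has been written yet, so the read fails with EAGAIN
    // instead of blocking the thread.
    buf := make([]byte, 16)
    _, err := syscall.Read(fds[0], buf)
    fmt.Println(err == syscall.EAGAIN || err == syscall.EWOULDBLOCK) // prints "true"
}
</code></pre>
<p>Treating <code>EAGAIN</code> as "try again later" rather than a failure is what lets a single-threaded event loop keep serving other clients.</p>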
<h3 id="heading-step-10-clean-up-resources"><strong>Step 10: Clean Up Resources</strong></h3>
<p>Ensure that all file descriptors are properly closed when they are no longer needed.</p>
<ul>
<li><p><strong>Closing Client FDs</strong>: As shown in previous steps, remove the FD from kqueue and close it.</p>
</li>
<li><p><strong>Closing Server FD and kqueue FD</strong>: Use <code>defer</code> statements to ensure they are closed when the function exits.</p>
</li>
</ul>
<pre><code class="lang-go"><span class="hljs-keyword">defer</span> unix.Close(serverFD)
<span class="hljs-keyword">defer</span> unix.Close(kq)
</code></pre>
<h2 id="heading-complete-refactored-code"><strong>Complete Refactored Code</strong></h2>
<p>Here's the complete code; you can also find it in the GitHub repository linked below.</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"fmt"</span>
    <span class="hljs-string">"golang.org/x/sys/unix"</span>
    <span class="hljs-string">"log"</span>
    <span class="hljs-string">"net"</span>
)

<span class="hljs-keyword">var</span> (
    host       = <span class="hljs-string">"127.0.0.1"</span> <span class="hljs-comment">// Server IP address</span>
    port       = <span class="hljs-number">8080</span>        <span class="hljs-comment">// Server port</span>
    maxClients = <span class="hljs-number">20000</span>       <span class="hljs-comment">// Maximum number of concurrent clients</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">RunAsyncTCPServerUnix</span><span class="hljs-params">()</span> <span class="hljs-title">error</span></span> {
    log.Printf(<span class="hljs-string">"starting an asynchronous TCP server on %s:%d"</span>, host, port)

    <span class="hljs-comment">// Create kqueue event objects to hold events</span>
    events := <span class="hljs-built_in">make</span>([]unix.Kevent_t, maxClients)

    <span class="hljs-comment">// Create a socket</span>
    serverFD, err := unix.Socket(unix.AF_INET, unix.SOCK_STREAM, unix.IPPROTO_TCP)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> fmt.Errorf(<span class="hljs-string">"socket creation failed: %v"</span>, err)
    }
    <span class="hljs-keyword">defer</span> unix.Close(serverFD)

    <span class="hljs-comment">// Set the socket to non-blocking mode</span>
    <span class="hljs-keyword">if</span> err := unix.SetNonblock(serverFD, <span class="hljs-literal">true</span>); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> fmt.Errorf(<span class="hljs-string">"failed to set non-blocking mode: %v"</span>, err)
    }

    <span class="hljs-comment">// Allow address reuse</span>
    <span class="hljs-keyword">if</span> err := unix.SetsockoptInt(serverFD, unix.SOL_SOCKET, unix.SO_REUSEADDR, <span class="hljs-number">1</span>); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> fmt.Errorf(<span class="hljs-string">"failed to set SO_REUSEADDR: %v"</span>, err)
    }

    <span class="hljs-comment">// Bind the IP &amp; the port</span>
    addr := &amp;unix.SockaddrInet4{Port: port}
    <span class="hljs-built_in">copy</span>(addr.Addr[:], net.ParseIP(host).To4())
    <span class="hljs-keyword">if</span> err := unix.Bind(serverFD, addr); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> fmt.Errorf(<span class="hljs-string">"failed to bind socket: %v"</span>, err)
    }

    <span class="hljs-comment">// Start listening</span>
    <span class="hljs-keyword">if</span> err := unix.Listen(serverFD, maxClients); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> fmt.Errorf(<span class="hljs-string">"failed to listen on socket: %v"</span>, err)
    }

    <span class="hljs-comment">// Create kqueue instance</span>
    kq, err := unix.Kqueue()
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> fmt.Errorf(<span class="hljs-string">"failed to create kqueue: %v"</span>, err)
    }
    <span class="hljs-keyword">defer</span> unix.Close(kq)

    <span class="hljs-comment">// Register the serverFD with kqueue</span>
    kev := unix.Kevent_t{
        Ident:  <span class="hljs-keyword">uint64</span>(serverFD),
        Filter: unix.EVFILT_READ,
        Flags:  unix.EV_ADD,
    }

    <span class="hljs-keyword">if</span> _, err := unix.Kevent(kq, []unix.Kevent_t{kev}, <span class="hljs-literal">nil</span>, <span class="hljs-literal">nil</span>); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> fmt.Errorf(<span class="hljs-string">"failed to register server FD with kqueue: %v"</span>, err)
    }

    <span class="hljs-comment">// Event loop</span>
    <span class="hljs-keyword">for</span> {
        nevents, err := unix.Kevent(kq, <span class="hljs-literal">nil</span>, events, <span class="hljs-literal">nil</span>)
        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">if</span> err == unix.EINTR {
                <span class="hljs-keyword">continue</span> <span class="hljs-comment">// Interrupted system call, retry</span>
            }
            <span class="hljs-keyword">return</span> fmt.Errorf(<span class="hljs-string">"kevent error: %v"</span>, err)
        }

        <span class="hljs-keyword">for</span> i := <span class="hljs-number">0</span>; i &lt; nevents; i++ {
            ev := events[i]
            fd := <span class="hljs-keyword">int</span>(ev.Ident)

            <span class="hljs-keyword">if</span> fd == serverFD {
                <span class="hljs-comment">// Accept the incoming connection from client</span>
                nfd, sa, err := unix.Accept(serverFD)
                <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
                    log.Printf(<span class="hljs-string">"failed to accept connection: %v"</span>, err)
                    <span class="hljs-keyword">continue</span>
                }

                <span class="hljs-comment">// Set the new socket to non-blocking mode</span>
                <span class="hljs-keyword">if</span> err := unix.SetNonblock(nfd, <span class="hljs-literal">true</span>); err != <span class="hljs-literal">nil</span> {
                    log.Printf(<span class="hljs-string">"failed to set non-blocking mode on client FD: %v"</span>, err)
                    unix.Close(nfd)
                    <span class="hljs-keyword">continue</span>
                }

                <span class="hljs-comment">// Register the new client FD with kqueue</span>
                clientKev := unix.Kevent_t{
                    Ident:  <span class="hljs-keyword">uint64</span>(nfd),
                    Filter: unix.EVFILT_READ,
                    Flags:  unix.EV_ADD,
                }

                <span class="hljs-keyword">if</span> _, err := unix.Kevent(kq, []unix.Kevent_t{clientKev}, <span class="hljs-literal">nil</span>, <span class="hljs-literal">nil</span>); err != <span class="hljs-literal">nil</span> {
                    log.Printf(<span class="hljs-string">"failed to register client FD with kqueue: %v"</span>, err)
                    unix.Close(nfd)
                    <span class="hljs-keyword">continue</span>
                }

                log.Printf(<span class="hljs-string">"accepted new connection from %v"</span>, sa)
            } <span class="hljs-keyword">else</span> {
                <span class="hljs-comment">// Handle client I/O</span>
                buf := <span class="hljs-built_in">make</span>([]<span class="hljs-keyword">byte</span>, <span class="hljs-number">1024</span>)
                n, err := unix.Read(fd, buf)
                <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
                    <span class="hljs-keyword">if</span> err == unix.EAGAIN || err == unix.EWOULDBLOCK {
                        <span class="hljs-keyword">continue</span> <span class="hljs-comment">// No data available right now</span>
                    }
                    log.Printf(<span class="hljs-string">"failed to read from client FD %d: %v"</span>, fd, err)
                    <span class="hljs-comment">// Remove the FD from kqueue and close it</span>
                    kev := unix.Kevent_t{
                        Ident:  <span class="hljs-keyword">uint64</span>(fd),
                        Filter: unix.EVFILT_READ,
                        Flags:  unix.EV_DELETE,
                    }
                    unix.Kevent(kq, []unix.Kevent_t{kev}, <span class="hljs-literal">nil</span>, <span class="hljs-literal">nil</span>)
                    unix.Close(fd)
                    <span class="hljs-keyword">continue</span>
                }

                <span class="hljs-keyword">if</span> n == <span class="hljs-number">0</span> {
                    <span class="hljs-comment">// Connection closed by client</span>
                    log.Printf(<span class="hljs-string">"client FD %d closed the connection"</span>, fd)
                    kev := unix.Kevent_t{
                        Ident:  <span class="hljs-keyword">uint64</span>(fd),
                        Filter: unix.EVFILT_READ,
                        Flags:  unix.EV_DELETE,
                    }
                    unix.Kevent(kq, []unix.Kevent_t{kev}, <span class="hljs-literal">nil</span>, <span class="hljs-literal">nil</span>)
                    unix.Close(fd)
                    <span class="hljs-keyword">continue</span>
                }

                <span class="hljs-comment">// Process the data received</span>
                data := buf[:n]
                log.Printf(<span class="hljs-string">"received data from client FD %d: %s"</span>, fd, <span class="hljs-keyword">string</span>(data))

                <span class="hljs-comment">// Echo the data back to the client (optional)</span>
                <span class="hljs-keyword">if</span> _, err := unix.Write(fd, data); err != <span class="hljs-literal">nil</span> {
                    log.Printf(<span class="hljs-string">"failed to write to client FD %d: %v"</span>, fd, err)
                    <span class="hljs-comment">// Handle write error if necessary</span>
                }
            }
        }
    }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
    <span class="hljs-keyword">if</span> err := RunAsyncTCPServerUnix(); err != <span class="hljs-literal">nil</span> {
        log.Fatalf(<span class="hljs-string">"server error: %v"</span>, err)
    }
}
</code></pre>
<h2 id="heading-how-to-test-above-code">How to test the above code?</h2>
<h3 id="heading-netcat"><strong>netcat</strong></h3>
<p>netcat is a computer networking utility for reading from and writing to network connections using TCP or UDP.</p>
<ol>
<li><p>Open up two or more terminal windows.</p>
</li>
<li><p>Type <code>nc localhost 8080</code>, then type a message and hit Enter; the server will echo it back.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728931162296/b43c70aa-a780-4aa4-9213-375c8cd4ec47.png" alt class="image--center mx-auto" /></p>
 <div data-node-type="callout">
 <div data-node-type="callout-emoji">💡</div>
 <div data-node-type="callout-text">You can refer to <a target="_blank" href="https://www.varonis.com/blog/netcat-commands">this</a> article to learn more on netcat.</div>
 </div>

<h3 id="heading-go-client-code">Go Client Code</h3>
<pre><code class="lang-go"> <span class="hljs-keyword">package</span> main

 <span class="hljs-keyword">import</span> (
     <span class="hljs-string">"bufio"</span>
     <span class="hljs-string">"fmt"</span>
     <span class="hljs-string">"log"</span>
     <span class="hljs-string">"net"</span>
     <span class="hljs-string">"sync"</span>
     <span class="hljs-string">"time"</span>
 )

 <span class="hljs-keyword">const</span> (
     serverAddress = <span class="hljs-string">"127.0.0.1:8080"</span> <span class="hljs-comment">// Server address</span>
     numClients    = <span class="hljs-number">100</span>              <span class="hljs-comment">// Number of concurrent clients to simulate</span>
 )

 <span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
     <span class="hljs-keyword">var</span> wg sync.WaitGroup

     <span class="hljs-keyword">for</span> i := <span class="hljs-number">0</span>; i &lt; numClients; i++ {
         wg.Add(<span class="hljs-number">1</span>)
         <span class="hljs-keyword">go</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">(clientID <span class="hljs-keyword">int</span>)</span></span> {
             <span class="hljs-keyword">defer</span> wg.Done()
             err := runClient(clientID)
             <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
                 log.Printf(<span class="hljs-string">"Client %d error: %v"</span>, clientID, err)
             }
         }(i)
         <span class="hljs-comment">// Optional: Sleep to stagger client connections</span>
         time.Sleep(<span class="hljs-number">100</span> * time.Millisecond)
     }

     wg.Wait()
 }

 <span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">runClient</span><span class="hljs-params">(clientID <span class="hljs-keyword">int</span>)</span> <span class="hljs-title">error</span></span> {
     <span class="hljs-comment">// Connect to the server</span>
     conn, err := net.Dial(<span class="hljs-string">"tcp"</span>, serverAddress)
     <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
         <span class="hljs-keyword">return</span> fmt.Errorf(<span class="hljs-string">"failed to connect: %v"</span>, err)
     }
     <span class="hljs-keyword">defer</span> conn.Close()

     log.Printf(<span class="hljs-string">"Client %d connected to %s"</span>, clientID, serverAddress)

     <span class="hljs-comment">// Send a message to the server</span>
     message := fmt.Sprintf(<span class="hljs-string">"Hello from client %d"</span>, clientID)
     _, err = fmt.Fprintf(conn, message+<span class="hljs-string">"\n"</span>)
     <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
         <span class="hljs-keyword">return</span> fmt.Errorf(<span class="hljs-string">"failed to send data: %v"</span>, err)
     }

     time.Sleep(<span class="hljs-number">100</span> * time.Millisecond)

     <span class="hljs-comment">// Receive a response from the server</span>
     reply, err := bufio.NewReader(conn).ReadString(<span class="hljs-string">'\n'</span>)
     <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
         <span class="hljs-keyword">return</span> fmt.Errorf(<span class="hljs-string">"failed to read response: %v"</span>, err)
     }

     log.Printf(<span class="hljs-string">"Client %d received: %s"</span>, clientID, reply)

     <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
 }
</code></pre>
</li>
</ol>
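<p>If you just want a quick, self-contained sanity check of the echo behaviour, the round trip can also be sketched with the standard <code>net</code> package. Note that the listener here is only a stand-in for our kqueue server (so this snippet runs anywhere, including Linux); it is not the server itself:</p>
<pre><code class="lang-go">package main

import (
    "bufio"
    "fmt"
    "net"
)

func main() {
    // Stand-in echo listener on an ephemeral port (NOT the kqueue server).
    ln, err := net.Listen("tcp", "127.0.0.1:0")
    if err != nil {
        panic(err)
    }
    defer ln.Close()

    go func() {
        conn, err := ln.Accept()
        if err != nil {
            return
        }
        defer conn.Close()
        // Echo one line back, mimicking the server's behaviour.
        line, _ := bufio.NewReader(conn).ReadString('\n')
        conn.Write([]byte(line))
    }()

    // Client side: the same logic as runClient above.
    conn, err := net.Dial("tcp", ln.Addr().String())
    if err != nil {
        panic(err)
    }
    defer conn.Close()
    fmt.Fprintf(conn, "Hello from client\n")

    reply, err := bufio.NewReader(conn).ReadString('\n')
    if err != nil {
        panic(err)
    }
    fmt.Print(reply) // prints "Hello from client"
}
</code></pre>
<p>Swapping the stand-in listener for the kqueue server on port 8080 exercises the real code path with the same client logic.</p>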
<p>So that’s it for this one. I hope you liked this article! If you have questions or comments, please feel free to leave a comment.</p>
<p>You can find the implementation for <code>kqueue</code>, <code>epoll</code> &amp; the Go client in <a target="_blank" href="https://github.com/Niket1997/event-loops-go">this GitHub repository</a>. Stay tuned for the next one!</p>
<p><strong>Disclaimer:</strong> The opinions expressed here are my own and do not represent the views of my employer. This blog is intended for informational purposes only.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729005099345/1cba868e-476f-430e-b8d2-9e4f32a182af.jpeg" alt class="image--center mx-auto" /></p>
]]></content:encoded></item></channel></rss>