Talk to Your Docs: Custom RAG Chatbot (Work in Progress)

Click here for the GitHub repository

Purpose of the Project

This project was built first and foremost as a learning exercise. The goal is not to ship a production-ready chatbot, but to deeply understand the core building blocks of modern NLP systems, including Retrieval-Augmented Generation (RAG), transformers, embedding models, and how LLMs can be used effectively with private, unstructured data like PDFs.

Rather than relying entirely on high-level libraries like LangChain, this project breaks down each component of the RAG pipeline and re-implements it where possible using PyTorch, HuggingFace Transformers, and lower-level NLP tools. This hands-on approach allows for a much clearer understanding of how these systems actually work under the hood.

The learning focus includes:

  • How transformer-based models process and embed text

  • How RAG pipelines retrieve relevant context from long documents

  • How to build and evaluate embedding-based search using cosine similarity and FAISS

  • How to process both English and Arabic documents, including challenges with OCR and tokenization

  • How to progressively move from API-based tools (OpenAI) to local, fully controlled models (HuggingFace)
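To make the embedding-search bullet above concrete: once every chunk is embedded as a vector, retrieval reduces to cosine similarity between the query vector and each chunk vector. A minimal numpy sketch of that idea (the 3-dimensional vectors here are toy stand-ins for real model embeddings, not output from the project's models):

```python
import numpy as np

def cosine_top_k(query_vec, chunk_vecs, k=2):
    """Return (indices, scores) of the k chunks most similar to the query.

    L2-normalizing both sides turns cosine similarity into a plain dot
    product, which is also the trick that lets an inner-product FAISS
    index serve as a cosine-similarity index.
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q                       # one cosine score per chunk
    return np.argsort(-scores)[:k], scores

# Toy "embeddings": chunk 0 matches the query exactly, chunk 2 is close.
chunks = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.9, 0.1, 0.0]])
idx, scores = cosine_top_k(np.array([1.0, 0.0, 0.0]), chunks)
print(idx)  # [0 2]: the exact match ranks first, the near match second
```

The same ranking comes out of FAISS with an inner-product index over normalized vectors; doing it by hand first makes clear that the index is just a fast approximation of this loop.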

Problem Context

LLMs such as Gemini 2.5 or GPT-4o do not have access to your private files. If you are working with confidential documents or internal company data, uploading them to the cloud can raise serious concerns around privacy, compliance, or performance. This project explores how to build a local, secure question-answering system over your own documents without depending entirely on third-party APIs.

Overview

"Talk to Your Docs" is a modular RAG-based chatbot that ingests PDF documents and answers natural language queries over them. It supports both English and Arabic text, and is designed to gradually replace high-level dependencies with custom implementations to support deeper technical learning.

Development Process

The system was developed in progressive versions, each one targeting a specific learning milestone:

  • Version 1: LangChain-based MVP with OpenAI and FAISS

  • Version 2: Replaced embedding logic with custom PyTorch-based BERT embeddings

  • Version 3: Replaced LangChain chunking with manual sentence chunking using NLTK

  • Version 4 (in progress): Arabic support via AraBERT and OCR

Each step was designed to isolate a component, replace it manually, and study how it fits into the overall pipeline.
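Version 3's manual chunking step can be sketched as follows. In the project the sentence list comes from NLTK's sentence tokenizer; here a pre-split list stands in so the overlap logic itself is the focus, and the chunk size and overlap values are illustrative defaults, not the project's actual settings:

```python
def chunk_sentences(sentences, chunk_size=3, overlap=1):
    """Group sentences into chunks of `chunk_size` sentences, repeating
    `overlap` sentences between consecutive chunks so that context is
    not cut off at a chunk boundary."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + chunk_size]))
        if start + chunk_size >= len(sentences):
            break
    return chunks

# In the project: sentences = nltk.sent_tokenize(page_text)
sents = ["S1.", "S2.", "S3.", "S4.", "S5."]
print(chunk_sentences(sents))
# ['S1. S2. S3.', 'S3. S4. S5.']  -- S3 appears in both chunks (the overlap)
```

Tuning `chunk_size` and `overlap` trades retrieval precision (small chunks) against having enough surrounding context for the LLM to answer from (large chunks).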

Technical Components

  • PDF ingestion and text extraction (pdfplumber)

  • Sentence chunking and overlap tuning (NLTK)

  • Embedding generation using BERT and PyTorch (with manual mean pooling and normalization)

  • Vector search using FAISS or direct cosine similarity

  • Answer generation using OpenAI GPT or HuggingFace models
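The "manual mean pooling and normalization" bullet above comes down to averaging BERT's per-token vectors while ignoring padding, then L2-normalizing the result. In the project the token embeddings come from a HuggingFace model's `last_hidden_state`; here small hand-written arrays stand in so the pooling math itself is visible and testable:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Mask-aware mean pooling over the sequence axis.

    token_embeddings: (seq_len, hidden) token vectors, e.g. one input's
    last_hidden_state from a BERT model.
    attention_mask: (seq_len,) with 1 for real tokens, 0 for padding.
    Padding positions are excluded from the average; the pooled vector
    is then L2-normalized so dot products equal cosine similarity.
    """
    mask = attention_mask[:, None].astype(float)     # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)
    counts = np.clip(mask.sum(), 1e-9, None)         # avoid divide-by-zero
    pooled = summed / counts
    return pooled / np.linalg.norm(pooled)

# Two real tokens plus one padding token; padding must not affect the mean.
toks = np.array([[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]])
mask = np.array([1, 1, 0])
vec = mean_pool(toks, mask)
print(vec)  # direction of [0.5, 0.5], i.e. approx [0.7071, 0.7071]
```

Skipping the mask and averaging all positions is a classic bug: the large padding row would dominate the embedding. Doing the pooling by hand, instead of taking a library's sentence embedding, is exactly the kind of component isolation this project is after.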

Next Steps

Future improvements will continue to follow this learning-driven approach:

  • Add Arabic OCR and embeddings to support Arabic files

  • Replace API-based Question-Answering with a local HuggingFace LLM

  • Build a reranker to improve retrieval accuracy
