Build Your Own ChatGPT for PDFs: A Complete Guide to AI-Powered Document Intelligence

Glxy
9 min read · Nov 12, 2024


Introduction

Ever wanted to build a system that can intelligently answer questions about your PDF documents? In this comprehensive guide, we’ll create a secure, scalable PDF Question-Answering system that combines vector search capabilities with the power of large language models. This system isn’t just about searching text — it’s about understanding documents and providing relevant, accurate answers while maintaining strict security boundaries between different teams and organizations.

What We’re Building

Our system provides:

  • Secure, team-isolated document processing and storage
  • Intelligent question answering using OpenAI’s language models
  • Efficient vector search with Qdrant
  • Authorization-scoped access to documents and queries
  • RESTful API interface for easy integration

Prerequisites

Before starting, ensure you have:

  • Python 3.8 or higher installed
  • Access to the OpenAI API (this guide uses Azure OpenAI, but the standard endpoint works with minor changes)
  • Qdrant vector database (running locally or in cloud)
  • Basic understanding of Flask and REST APIs
  • Understanding of authentication and authorization principles

System Architecture Overview

Authorization Design

Our system implements a multi-tenant architecture using team_id as the primary authorization scope (a filter sketch follows this list):

  • Each team has an isolated document space
  • All operations (uploads, queries) are scoped to specific teams
  • Cross-team access is prevented by design
  • Team-level rate limiting and access control
  • Complete data isolation between different organizations
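
Every item above reduces to one mechanism: each vector point is stored with a team_id payload field, and every query carries a mandatory filter on that field. As a preview of the pattern used throughout the implementation below, here is the qdrant-client filter that enforces it (the team id shown is hypothetical):

from qdrant_client import models

# Every search the system performs is wrapped in a filter like this,
# so results can never cross team boundaries
team_filter = models.Filter(
    must=[
        models.FieldCondition(
            key="team_id",
            match=models.MatchValue(value="team-123")  # hypothetical team id
        )
    ]
)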

Key Components

PDF Processing Layer:

  • Text extraction and chunking
  • Metadata preservation
  • Team-scoped document storage

Vector Search Layer:

  • Semantic embedding generation
  • Efficient similarity search
  • Team-isolated vector spaces

Answer Generation Layer:

  • Context retrieval within team scope
  • AI-powered answer generation
  • Source attribution

API Layer:

  • Secure endpoints
  • Authorization middleware
  • Rate limiting and monitoring

Detailed Implementation

1. OpenAI and Embedding Setup

First, let’s set up our AI components with proper configuration:

# src/ai_service.py
from openai import AzureOpenAI
from sentence_transformers import SentenceTransformer
import os


class AIConfig:
    """
    Configuration manager for AI services with security considerations
    """
    def __init__(self):
        # Load configuration from environment variables for security
        self.openai_client = AzureOpenAI(
            azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
            api_key=os.getenv("AZURE_OPENAI_API_KEY"),
            api_version=os.getenv("AZURE_API_VERSION", "2024-02-01"),
            azure_deployment=os.getenv("AZURE_DEPLOYMENT_NAME")
        )

        # Initialize the local embedding model
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

        # Warm up the model to prevent cold starts
        self._warmup()

    def _warmup(self):
        """Warm up the embedding model"""
        _ = self.embedding_model.encode("Warm up text")

    def get_embedding(self, text: str) -> list:
        """Generate embeddings with error handling"""
        try:
            return self.embedding_model.encode(text).tolist()
        except Exception as e:
            print(f"Error generating embedding: {str(e)}")
            raise


# Initialize global AI configuration
ai_config = AIConfig()
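
Since everything above is read from environment variables, a sample .env is useful for reference. The Azure and server variable names below are exactly the ones the code reads (here and in later sections); the Qdrant names assume the src/config module maps them one-to-one, and all values are placeholders:

AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your-api-key
AZURE_API_VERSION=2024-02-01
AZURE_DEPLOYMENT_NAME=your-deployment
OPENAI_MODEL_DEPLOYMENT=your-chat-deployment
QDRANT_HOST=localhost
QDRANT_PORT=6333
QDRANT_API_KEY=your-qdrant-key
QDRANT_HTTPS=false
PORT=8000
DEBUG=false
ENABLE_HTTPS=false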

2. Vector Database Setup

Next, let’s configure our vector database with proper security measures:

# src/vector_store.py
from qdrant_client import QdrantClient
from qdrant_client.http.models import NamedVector
from qdrant_client.models import PointStruct, Distance, VectorParams, models
from typing import List
from src.config import config
import logging

logger = logging.getLogger(__name__)


class VectorStore:
    """Secure vector storage management with team isolation"""

    def __init__(self):
        self.client = QdrantClient(
            host=config.QDRANT_HOST,
            port=config.QDRANT_PORT,
            api_key=config.QDRANT_API_KEY,
            https=config.QDRANT_HTTPS
        )
        self.collection_name = "pdf_embeddings"
        self.embedding_dim = 384  # Dimension for all-MiniLM-L6-v2

    def setup_collection(self) -> bool:
        """Create or recreate the vector collection.

        WARNING: this drops any existing collection and its data;
        call it during provisioning, not on every application start.
        """
        try:
            # Remove existing collection if it exists
            collections = self.client.get_collections().collections
            if any(collection.name == self.collection_name for collection in collections):
                self.client.delete_collection(self.collection_name)
                logger.info(f"Deleted existing collection '{self.collection_name}'")

            # Create new collection with a named vector
            self.client.create_collection(
                collection_name=self.collection_name,
                vectors_config={
                    "custom_vector": VectorParams(
                        size=self.embedding_dim,
                        distance=Distance.COSINE
                    )
                }
            )
            logger.info(f"Created collection '{self.collection_name}'")
            return True

        except Exception as e:
            logger.error(f"Error setting up collection: {str(e)}")
            raise

    def search_vectors(
        self,
        team_id: str,
        query_vector: List[float],
        limit: int = 10
    ) -> List[PointStruct]:
        """Search vectors within the team's authorization scope"""
        try:
            # Hard filter: only return points belonging to this team
            team_filter = models.Filter(
                must=[
                    models.FieldCondition(
                        key="team_id",
                        match=models.MatchValue(value=team_id)
                    )
                ]
            )

            # Use NamedVector because the collection defines a named vector
            return self.client.search(
                collection_name=self.collection_name,
                query_vector=NamedVector(
                    name="custom_vector",
                    vector=query_vector
                ),
                query_filter=team_filter,
                limit=limit
            )

        except Exception as e:
            logger.error(f"Error searching vectors: {str(e)}")
            raise

    def upsert_points(self, points: List[PointStruct]) -> bool:
        """Insert or update points in the Qdrant collection"""
        try:
            self.client.upsert(
                collection_name=self.collection_name,
                points=points
            )
            logger.info(f"Successfully upserted {len(points)} points")
            return True

        except Exception as e:
            logger.error(f"Error upserting points: {str(e)}")
            raise


# Initialize global vector store
vector_store = VectorStore()
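
One optional addition worth making here, not shown in the listing above: since every search filters on team_id, creating a payload index on that field lets Qdrant apply the team filter efficiently as collections grow. A minimal sketch using qdrant-client:

from qdrant_client.models import PayloadSchemaType

# Index the team_id payload field so filtered searches stay fast
vector_store.client.create_payload_index(
    collection_name=vector_store.collection_name,
    field_name="team_id",
    field_schema=PayloadSchemaType.KEYWORD
)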

3. Secure PDF Processing Pipeline

Let’s implement our PDF processing with proper team isolation:

# src/document_processor.py
import pdfplumber
from typing import List, Any
from werkzeug.utils import secure_filename
from src.ai_service import ai_config
from src.vector_store import vector_store
from qdrant_client.models import PointStruct
import uuid
import logging

logger = logging.getLogger(__name__)


class DocumentProcessor:
    """Secure document processing with team isolation"""

    def process_pdf(
        self,
        pdf_file: Any,
        team_id: str,
        doc_name: str,
        document_id: str,
        chunk_size: int = 500
    ) -> List[PointStruct]:
        """Process a PDF with team-scoped authorization"""
        try:
            points = []
            doc_name = secure_filename(doc_name)

            with pdfplumber.open(pdf_file) as pdf:
                for page_num, page in enumerate(pdf.pages, start=1):
                    text = page.extract_text()
                    if not text:
                        continue

                    # Split the page text into overlapping chunks
                    chunks = self._create_chunks(text, chunk_size)

                    # Generate one embedding per chunk
                    embeddings = [ai_config.get_embedding(chunk) for chunk in chunks]

                    # Build team-tagged points for vector storage
                    points.extend(self._create_points(
                        chunks, embeddings, team_id, doc_name,
                        document_id, page_num
                    ))

            # Store vectors
            if points:
                vector_store.upsert_points(points)

            return points

        except Exception as e:
            logger.error(f"Error processing PDF: {str(e)}")
            raise

    def _create_chunks(self, text: str, chunk_size: int) -> List[str]:
        """Create overlapping chunks: each chunk after the first starts
        50 characters before the point where the previous one ended,
        preserving context across chunk boundaries."""
        chunks = []
        for i in range(0, len(text), chunk_size):
            chunk = text[max(0, i - 50):i + chunk_size]
            chunks.append(chunk)
        return chunks

    def _create_points(
        self,
        chunks: List[str],
        embeddings: List[List[float]],
        team_id: str,
        doc_name: str,
        document_id: str,
        page_num: int
    ) -> List[PointStruct]:
        """Create points for vector storage"""
        points = []
        for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
            point = PointStruct(
                id=str(uuid.uuid4()),
                vector={"custom_vector": embedding},
                payload={
                    "team_id": team_id,
                    "doc_name": doc_name,
                    "document_id": document_id,
                    "page_number": page_num,
                    "chunk_index": i,
                    "text": chunk,
                    "embedding_model": "all-MiniLM-L6-v2"
                }
            )
            points.append(point)
        return points


# Initialize global document processor
document_processor = DocumentProcessor()

4. Answer Generation with Authorization

Implement secure answer generation with team isolation:

import os
from typing import Any, Dict

# Assuming the module layout used in the earlier sections
from src.ai_service import AIConfig, ai_config
from src.vector_store import VectorStore, vector_store


class AnswerGenerator:
    """
    Generate answers within team authorization scope
    """
    def __init__(self, ai_config: AIConfig, vector_store: VectorStore):
        self.ai_config = ai_config
        self.vector_store = vector_store

    def generate_answer(self, team_id: str, question: str) -> Dict[str, Any]:
        """
        Generate answers using only team-authorized documents
        """
        # Validate authorization
        if not self._validate_team_id(team_id):
            return {
                "answer": "Unauthorized access",
                "sources": [],
                "status": "error"
            }

        try:
            # Generate question embedding
            query_vector = self.ai_config.get_embedding(question)

            # Get relevant documents within team scope
            points = self.vector_store.search_vectors(team_id, query_vector, limit=15)

            if not points:
                return {
                    "answer": "No relevant documents found",
                    "sources": [],
                    "status": "no_context"
                }

            # Prepare context with source tracking
            context_parts = []
            sources = set()
            seen_text = set()

            for point in points:
                if point.payload:
                    text = point.payload.get('text', '').strip()
                    # Skip exact-duplicate chunks
                    if text in seen_text:
                        continue

                    doc_info = f"[Document: {point.payload.get('doc_name')}, Page: {point.payload.get('page_number')}]"
                    context_parts.append(f"{doc_info}\n{text}")
                    sources.add((
                        point.payload.get('doc_name'),
                        point.payload.get('page_number')
                    ))
                    seen_text.add(text)

            context = "\n\n".join(context_parts)

            # Generate answer using the chat model
            messages = [
                {
                    "role": "system",
                    "content": (
                        "You are a helpful assistant that provides accurate, "
                        "comprehensive answers based on the given context. "
                        "Always cite your sources using [Document: X, Page: Y] format."
                    )
                },
                {
                    "role": "user",
                    "content": (
                        "Answer this question using only the context provided. "
                        "If you cannot answer based on the context, say so.\n\n"
                        f"Context:\n{context}\n\n"
                        f"Question: {question}"
                    )
                }
            ]

            response = self.ai_config.openai_client.chat.completions.create(
                model=os.getenv("OPENAI_MODEL_DEPLOYMENT"),
                messages=messages,
                max_tokens=1000,
                temperature=0.2
            )

            return {
                "answer": response.choices[0].message.content.strip(),
                "sources": list(sources),
                "status": "success"
            }

        except Exception as e:
            return {
                "answer": "Error generating answer",
                "sources": [],
                "status": "error",
                "error": str(e)
            }

    def _validate_team_id(self, team_id: str) -> bool:
        """
        Validate team_id authorization.
        Implementation depends on your authentication system.
        """
        # Add your team validation logic here
        return bool(team_id and isinstance(team_id, str))


# Initialize global answer generator
answer_generator = AnswerGenerator(ai_config, vector_store)
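
A quick direct-call example, useful for testing before the API layer exists (the team id and question are hypothetical):

result = answer_generator.generate_answer(
    team_id="team-123",
    question="What does the contract say about termination?"
)
print(result["status"])
print(result["answer"])
for doc_name, page in result["sources"]:
    print(f"- {doc_name}, page {page}")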

5. Secure API Layer

Implement the API with proper authorization and security:

from flask import Flask, request, jsonify
from functools import wraps
from werkzeug.utils import secure_filename
from qdrant_client.models import Filter, FieldCondition, MatchValue
import os
import time

# Assuming the module layout used in the earlier sections
from src.ai_service import ai_config
from src.vector_store import vector_store
from src.document_processor import document_processor
from src.answer_generator import answer_generator

app = Flask(__name__)
app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024  # 16MB max file size

# Rate limiting configuration
RATE_LIMIT = {
    "window": 60,        # seconds
    "max_requests": 100  # requests per window
}


class RateLimiter:
    """Simple in-memory rate limiter (single-process only;
    use Redis or similar for multi-instance deployments)"""
    def __init__(self):
        self.requests = {}

    def is_allowed(self, team_id: str) -> bool:
        now = time.time()
        team_requests = self.requests.get(team_id, [])

        # Drop requests that fall outside the current window
        team_requests = [req_time for req_time in team_requests
                         if now - req_time < RATE_LIMIT["window"]]

        if len(team_requests) >= RATE_LIMIT["max_requests"]:
            return False

        team_requests.append(now)
        self.requests[team_id] = team_requests
        return True


rate_limiter = RateLimiter()


def require_team_auth(f):
    """Authorization middleware"""
    @wraps(f)
    def decorated_function(*args, **kwargs):
        # Look for team_id in form data, JSON body, or query string
        payload = request.get_json(silent=True) or {}
        team_id = (request.form.get('team_id')
                   or payload.get('team_id')
                   or request.args.get('team_id'))

        if not team_id:
            return jsonify({"error": "team_id is required"}), 401

        # Check rate limit
        if not rate_limiter.is_allowed(team_id):
            return jsonify({"error": "Rate limit exceeded"}), 429

        # Add your additional authorization checks here
        # For example, validating JWT tokens, checking team membership, etc.

        return f(*args, **kwargs)
    return decorated_function


@app.post('/answer')
@require_team_auth
def get_answer():
    """
    Generate answer for a question within team scope
    """
    try:
        data = request.get_json(silent=True) or {}
        team_id = data.get('team_id')
        question = data.get('question')

        if not question:
            return jsonify({"error": "Question is required"}), 400

        response = answer_generator.generate_answer(team_id, question)
        return jsonify(response)

    except Exception as e:
        return jsonify({"error": str(e)}), 500

@app.route("/upload", methods=['POST'])
@require_team_auth
def upload_file():
    """
    Upload and process a PDF within team scope
    """
    try:
        # Validate request
        if 'file' not in request.files:
            return jsonify({"error": "No file part"}), 400

        file = request.files['file']
        if file.filename == '':
            return jsonify({"error": "No selected file"}), 400

        if not file.filename.lower().endswith('.pdf'):
            return jsonify({"error": "Only PDF files are allowed"}), 400

        team_id = request.form.get('team_id')
        document_id = request.form.get('document_id')
        if not document_id:
            return jsonify({"error": "document_id is required"}), 400

        # Process file with team scope
        chunks = document_processor.process_pdf(
            pdf_file=file,
            team_id=team_id,
            doc_name=secure_filename(file.filename),
            document_id=document_id
        )

        return jsonify({
            "status": "success",
            "chunks_processed": len(chunks),
            "document_id": document_id
        })

    except Exception as e:
        return jsonify({
            "error": str(e),
            "status": "error"
        }), 500


@app.route("/documents", methods=['GET'])
@require_team_auth
def list_documents():
    """
    List documents available to a team
    """
    team_id = request.args.get('team_id')

    try:
        # Query the vector store for this team's points only
        filter_query = Filter(
            must=[
                FieldCondition(
                    key="team_id",
                    match=MatchValue(value=team_id)
                )
            ]
        )

        # Scroll through the team's points (first page only; paginate
        # via the returned offset if a team has more than 1000 chunks)
        points, _next_offset = vector_store.client.scroll(
            collection_name=vector_store.collection_name,
            scroll_filter=filter_query,
            limit=1000  # Adjust based on your needs
        )

        # Extract unique document information
        documents = set()
        for point in points:
            if point.payload:
                documents.add((
                    point.payload.get('document_id'),
                    point.payload.get('doc_name')
                ))

        return jsonify({
            "status": "success",
            "documents": [
                {"id": doc_id, "name": doc_name}
                for doc_id, doc_name in documents
            ]
        })

    except Exception as e:
        return jsonify({
            "error": str(e),
            "status": "error"
        }), 500


# Application entry point
if __name__ == "__main__":
    # NOTE: setup_collection() drops any existing collection and its data;
    # in production, run it once during provisioning, not on every start
    vector_store.setup_collection()

    # Configure server ('adhoc' SSL requires the cryptography package)
    app.run(
        host='0.0.0.0',
        port=int(os.getenv('PORT', 8000)),
        debug=os.getenv('DEBUG', 'False').lower() == 'true',
        ssl_context='adhoc' if os.getenv('ENABLE_HTTPS', 'False').lower() == 'true' else None
    )
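
The decorator above deliberately leaves a hook for stronger checks ("validating JWT tokens, checking team membership, etc."). As one possible way to fill that hook, here is a sketch using the PyJWT package; the JWT_SECRET_KEY variable and the team_id claim name are assumptions, not part of the original design:

import os
import jwt  # PyJWT

def validate_team_token(auth_header: str, team_id: str) -> bool:
    """Verify a Bearer token and confirm it is scoped to team_id.
    Hypothetical helper; adapt claim names to your identity provider."""
    if not auth_header or not auth_header.startswith("Bearer "):
        return False
    token = auth_header.split(" ", 1)[1]
    try:
        claims = jwt.decode(
            token,
            os.getenv("JWT_SECRET_KEY"),
            algorithms=["HS256"]
        )
    except jwt.InvalidTokenError:
        return False
    # Token is valid; make sure it grants access to the requested team
    return claims.get("team_id") == team_id

Inside require_team_auth you would call validate_team_token(request.headers.get('Authorization'), team_id) and return 403 when it fails.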

System Usage Examples

1. Upload a Document

import requests

def upload_document(file_path: str, team_id: str, document_id: str):
    """Example: Upload a PDF document"""
    with open(file_path, 'rb') as file:
        response = requests.post(
            'http://localhost:8000/upload',
            files={'file': file},
            data={
                'team_id': team_id,
                'document_id': document_id
            }
        )
    return response.json()

2. Ask Questions

def ask_question(team_id: str, question: str):
    """Example: Ask a question about uploaded documents"""
    response = requests.post(
        'http://localhost:8000/answer',
        json={
            'team_id': team_id,
            'question': question
        }
    )
    return response.json()

Deployment Guide

Docker Setup

Create a Dockerfile for the application:

FROM python:3.9-slim

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        build-essential \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements first for better caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1

# Run the application
CMD ["python", "app.py"]
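
The Dockerfile copies a requirements.txt that this guide never lists. Based on the imports used throughout, a plausible starting point looks like this (pin versions for reproducible builds, and add PyJWT if you use the token sketch above):

flask
flask-talisman
openai
sentence-transformers
qdrant-client
pdfplumber
requests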

Docker Compose Configuration

version: '3.8'

services:
  app:
    build: .
    ports:
      - "8000:8000"
    environment:
      - AZURE_OPENAI_ENDPOINT=${AZURE_OPENAI_ENDPOINT}
      - AZURE_OPENAI_API_KEY=${AZURE_OPENAI_API_KEY}
      - AZURE_API_VERSION=${AZURE_API_VERSION}
      - AZURE_DEPLOYMENT_NAME=${AZURE_DEPLOYMENT_NAME}
      - QDRANT_HOST=qdrant
      - QDRANT_PORT=6333
      - ENABLE_HTTPS=false
    depends_on:
      - qdrant
    networks:
      - app-network

  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage
    networks:
      - app-network

networks:
  app-network:
    driver: bridge

volumes:
  qdrant_data:
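
With both files in place, the whole stack (API plus Qdrant) comes up with a single command, assuming Docker Compose v2:

docker compose up --build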

Security Headers

Implement secure headers middleware:

from flask_talisman import Talisman

# Initialize Talisman with security headers
Talisman(
    app,
    force_https=True,
    strict_transport_security=True,
    session_cookie_secure=True,
    content_security_policy={
        'default-src': "'self'",
        'img-src': '*',
        'script-src': "'self'"
    }
)
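
One caveat: force_https makes Talisman redirect plain HTTP requests. If TLS terminates at a reverse proxy rather than at the app itself, make sure the proxy forwards X-Forwarded-Proto and the app trusts it (for example via Werkzeug's ProxyFix middleware), or every request will be caught in a redirect loop.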

Monitoring and Maintenance

1. Health Check Implementation

@app.route("/health", methods=['GET'])
def health_check():
"""System health check endpoint"""
try:
# Check components
health_status = {
"vector_store": "healthy",
"openai": "healthy",
"timestamp": time.time()
}

# Test vector store
vector_store.client.get_collections()

# Test OpenAI connection
ai_config.get_embedding("test")

return jsonify(health_status)

except Exception as e:
return jsonify({
"status": "unhealthy",
"error": str(e)
}), 500

2. Logging Configuration

import logging.config

# Configure logging
logging.config.dictConfig({
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'standard': {
            'format': '%(asctime)s [%(levelname)s] %(name)s: %(message)s'
        },
    },
    'handlers': {
        'default': {
            'level': 'INFO',
            'formatter': 'standard',
            'class': 'logging.StreamHandler',
        },
        'file': {
            'level': 'INFO',
            'formatter': 'standard',
            'class': 'logging.FileHandler',
            'filename': 'app.log',
            'mode': 'a',
        },
    },
    'loggers': {
        '': {
            'handlers': ['default', 'file'],
            'level': 'INFO',
            'propagate': True
        }
    }
})

logger = logging.getLogger(__name__)

Conclusion

This implementation provides a robust, secure foundation for building a PDF question-answering system. Key features include:

Security

  • Team-based isolation
  • Rate limiting
  • Secure file handling
  • Authorization middleware

Scalability

  • Docker containerization
  • Efficient vector search
  • Modular design

Maintainability

  • Comprehensive logging
  • Health monitoring
  • Clear documentation

Remember to:

  • Keep dependencies updated
  • Monitor system performance
  • Regularly back up vector data (a snapshot sketch follows this list)
  • Review security configurations
  • Test thoroughly before deployment
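
On the backup point specifically: Qdrant has built-in snapshot support, which makes backing up vector data a one-liner. A minimal sketch against the vector_store global from earlier (snapshots are written inside Qdrant's storage directory, so back up that volume as well):

# Create a point-in-time snapshot of the embeddings collection
snapshot = vector_store.client.create_snapshot(
    collection_name=vector_store.collection_name
)
print(f"Snapshot created: {snapshot.name}")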

This guide provides a foundation for building intelligent document systems. Ready to implement it in your organization or need help with a custom solution? I’m available for select consulting projects and technical advisory roles, focusing on production-grade AI systems. Let’s discuss your implementation: me@arif.sh

Full Source Code: github.com/doganarif/pdf-gpt-vectordb-qa

Star ⭐️ the repository if you found this guide helpful!

⚡ Happy Building!
