I modified the script to properly create paragraph-sized chunks for semantic search, and I modified the script that uses the data to properly return RAG enhancements to a prompt.
Here's a Python script to create a RAG database from text and PDF documents. This script will:
- Process text and PDF files
- Create embeddings for the content
- Build a FAISS index for efficient retrieval
Here's the script:
import os
import re

import fitz  # PyMuPDF
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# readme
# add these libraries:
#
# pip install PyMuPDF faiss-cpu numpy sentence-transformers


def read_text_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()


def read_pdf_file(file_path):
    doc = fitz.open(file_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text


def process_documents(directory):
    documents = []
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        if filename.endswith('.txt'):
            text = read_text_file(file_path)
        elif filename.endswith('.pdf'):
            text = read_pdf_file(file_path)
        else:
            continue
        # Split into paragraph-sized chunks: break after sentence-ending
        # punctuation (., !, ?) that is followed by one or more newlines.
        pattern = r'(?<=[.!?])\s*\n+'
        paragraphs = re.split(pattern, text)
        # Add non-empty paragraphs
        documents.extend([para.strip() for para in paragraphs if para.strip()])
    return documents


def create_rag_database(directory):
    # Process documents
    documents = process_documents(directory)

    # Create embeddings
    embedder = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = embedder.encode(documents)

    # Create FAISS index
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(embeddings.astype('float32'))

    return index, documents


# Usage
directory = "docs/"
index, documents = create_rag_database(directory)

# Save the index and documents for later use
faiss.write_index(index, "rag_index.faiss")
np.save("document_chunks.npy", documents)

print(f"RAG database created with {len(documents)} document chunks.")
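To see what the chunking pattern actually does, here is a small self-contained check; the sample text is invented for illustration:

import re

# Sample text made up for illustration: three paragraphs, one of which
# wraps across a line break without ending in punctuation.
sample = ("First paragraph ends here.\n\n"
          "Second paragraph continues\nacross a line break and ends here!\n\n"
          "Third paragraph.\n")

# Same pattern as the script: split after ., !, or ? followed by newline(s)
pattern = r'(?<=[.!?])\s*\n+'
chunks = [c.strip() for c in re.split(pattern, sample) if c.strip()]

for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
# Prints three chunks; the mid-paragraph line break does not cause a split,
# because the lookbehind requires sentence-ending punctuation before it.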
To use this script:
- Install required libraries:
pip install PyMuPDF faiss-cpu numpy sentence-transformers
- Replace "docs/" with the actual path to the directory containing your text and PDF files.
- Run the script. It will create two files:
  - rag_index.faiss: the FAISS index for efficient similarity search
  - document_chunks.npy: a NumPy array containing the document chunks
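Before wiring these files into a workflow, a quick load-and-inspect is a reasonable sanity check (this just reloads what the script above saved):

import faiss
import numpy as np

# Reload the saved artifacts and confirm the counts line up.
index = faiss.read_index("rag_index.faiss")
chunks = np.load("document_chunks.npy", allow_pickle=True)

print(f"Index vectors: {index.ntotal}, saved chunks: {len(chunks)}")
print("First chunk preview:", str(chunks[0])[:200])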
Now you can load these files in your RAG-enhanced workflow:
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer


def rag_annotate(prompt: str, path: str, output: list, k: int = 5) -> tuple[str, dict]:
    k = int(k)
    # Check if data is already loaded (cached as attributes on the function)
    if not hasattr(rag_annotate, 'index'):
        # Load data for the first time
        rag_annotate.index = faiss.read_index(f"{path}/rag_index.faiss")
        rag_annotate.document_chunks = np.load(f"{path}/document_chunks.npy",
                                               allow_pickle=True)
        rag_annotate.embedder = SentenceTransformer('all-MiniLM-L6-v2')

    # Embed the prompt and retrieve the k nearest chunks
    prompt_embedding = rag_annotate.embedder.encode([prompt])
    _, I = rag_annotate.index.search(prompt_embedding.astype('float32'), k)
    retrieved_chunks = [rag_annotate.document_chunks[i] for i in I[0]]
    context = "\n".join(retrieved_chunks)

    return (f"Context:\n{context}\n\nQuestion: {prompt}\nAnswer:",
            {"status": {"value": 0, "reason": "Success"}})
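Here's a usage sketch; the question string is made up, and the unused output argument just gets an empty list to satisfy the signature:

# Assumes rag_index.faiss and document_chunks.npy live in the current directory.
enhanced_prompt, result = rag_annotate(
    "What topics do my documents cover?",  # example question, invented
    path=".",
    output=[],
    k=3,
)

if result["status"]["value"] == 0:
    print(enhanced_prompt)  # context block followed by the original question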
This setup allows you to create a RAG database from your documents and then use it in your workflow to enhance prompts before sending them to an LLM. The database creation is done separately, so you only need to run it when you want to update your document set.
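The final hop into the model depends on whichever client your workflow uses, so treat this as a sketch only: llm_complete below is a hypothetical stand-in, not a real API.

def llm_complete(prompt: str) -> str:
    # Hypothetical stand-in: replace with your actual LLM client call
    # (OpenAI, a local llama.cpp server, etc.).
    raise NotImplementedError

enhanced_prompt, result = rag_annotate("Summarize my documents.", path=".", output=[])
if result["status"]["value"] == 0:
    answer = llm_complete(enhanced_prompt)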