Building a Custom PDF Parser with PyPDF and LangChain

Photo by Author | Canva
PDF files are everywhere. You have probably seen them in many different contexts, such as college papers, electricity bills, office contracts, product brochures, and more. They are very familiar, but working with them is not as easy as it looks. Suppose you want to extract useful information from a PDF, such as reading the text, splitting it into sections, or producing a quick summary. This may sound simple, but you will find it's not so straightforward once you try.
Unlike Word documents or HTML, PDFs do not store content in a tidy, readable way. Instead, they are designed to look good to humans, not to be read by programs. Text can span multiple areas, be divided into odd blocks, be scattered across the page, or be mixed in with tables and images. This makes it difficult to extract clean, organized data.
In this article, we will build something that can manage this mess. We will build a custom PDF parser that can:
- Extract and clean text from PDFs at the page level, optionally preserving layout for better formatting
- Handle image metadata fields from each page
- Remove unwanted headers and footers by detecting lines repeated across pages, reducing noise
- Enrich the parsed document with page-level metadata, such as author, title, creation date, rotation, and page size
- Chunk content into logical pieces for downstream NLP or LLM processing
Let's get started.
Folder structure
Before you start, it is good to organize your project files for clarity and easy navigation.
custom_pdf_parser/
│
├── parser.py
├── langchain_loader.py
├── pipeline.py
├── example.py
├── requirements.txt # Dependencies list
└── __init__.py # (Optional) to mark directory as Python package
You can leave __init__.py as an empty file, since its main purpose is to indicate that this directory should be treated as a Python package. I will explain the purpose of each remaining file step by step.
Tools required (requirements.txt)
The required libraries are:
- pypdf: A pure-Python library for reading and writing PDF files. It will be used to extract text from PDF files.
- langchain: A framework for building applications on top of large language models. It will be used to process and structure the extracted text into documents.
Install them:
pip install pypdf langchain
If you want to manage dependencies neatly, create a requirements.txt file with:

pypdf
langchain

and run:
pip install -r requirements.txt
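
If you want a quick sanity check that both libraries are importable (an optional step, not part of the original setup), print their versions:

# Optional: confirm the installation; exact version numbers will vary.
import pypdf
import langchain

print(pypdf.__version__)
print(langchain.__version__)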
Step 1: Set Up the PDF Parser (parser.py)
The core class CustomPDFParser uses pypdf to extract text and metadata from each page of a PDF. It includes text cleaning, optional image metadata extraction, and removal of repeated headers or footers that are common to every page.
- Supports layout-preserving text extraction
- Extracts page metadata such as page number, rotation, and media box size
- Can filter out pages with very little content
- Text cleaning removes excess whitespace while preserving paragraph breaks
The logic behind all this is:
import os
import logging
from pathlib import Path
from typing import List, Dict, Any

import pypdf
from pypdf import PdfReader

# Configure logging to show info and above messages
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class CustomPDFParser:
    def __init__(
        self,
        extract_images: bool = False,
        preserve_layout: bool = True,
        remove_headers_footers: bool = True,
        min_text_length: int = 10
    ):
        """
        Initialize the parser with options to extract images, preserve layout, remove repeated headers/footers, and minimum text length for pages.

        Args:
            extract_images: Whether to extract image info from pages
            preserve_layout: Whether to keep layout spacing in text extraction
            remove_headers_footers: Whether to detect and remove headers/footers
            min_text_length: Minimum length of text for a page to be considered valid
        """
        self.extract_images = extract_images
        self.preserve_layout = preserve_layout
        self.remove_headers_footers = remove_headers_footers
        self.min_text_length = min_text_length

    def extract_text_from_page(self, page: pypdf.PageObject, page_num: int) -> Dict[str, Any]:
        """
        Extract text and metadata from a single PDF page.

        Args:
            page: PyPDF page object
            page_num: zero-based page number

        Returns:
            dict with keys:
            - 'text': extracted and cleaned text string,
            - 'metadata': page metadata dict,
            - 'word_count': number of words in extracted text
        """
        try:
            # Extract text, optionally preserving the layout for better formatting
            if self.preserve_layout:
                text = page.extract_text(extraction_mode="layout")
            else:
                text = page.extract_text()

            # Clean text: remove extra whitespace and normalize paragraphs
            text = self._clean_text(text)

            # Gather page metadata (page number, rotation angle, mediabox)
            metadata = {
                "page_number": page_num + 1,  # 1-based numbering
                "rotation": getattr(page, "rotation", 0),
                "mediabox": str(getattr(page, "mediabox", None)),
            }

            # Optionally, extract image info from page if requested
            if self.extract_images:
                metadata["images"] = self._extract_image_info(page)

            # Return dictionary with text and metadata for this page
            return {
                "text": text,
                "metadata": metadata,
                "word_count": len(text.split()) if text else 0
            }
        except Exception as e:
            # Log error and return empty data for problematic pages
            logger.error(f"Error extracting page {page_num}: {e}")
            return {
                "text": "",
                "metadata": {"page_number": page_num + 1, "error": str(e)},
                "word_count": 0
            }
    def _clean_text(self, text: str) -> str:
        """
        Clean and normalize extracted text, preserving paragraph breaks.

        Args:
            text: raw text extracted from PDF page

        Returns:
            cleaned text string
        """
        if not text:
            return ""
        lines = text.split('\n')
        cleaned_lines = []
        for line in lines:
            line = line.strip()  # Remove leading/trailing whitespace
            if line:
                # Non-empty line; keep it
                cleaned_lines.append(line)
            elif cleaned_lines and cleaned_lines[-1]:
                # Preserve paragraph break by keeping empty line only if previous line exists
                cleaned_lines.append("")
        cleaned_text = '\n'.join(cleaned_lines)
        # Reduce any instances of more than two consecutive newlines to two
        while '\n\n\n' in cleaned_text:
            cleaned_text = cleaned_text.replace('\n\n\n', '\n\n')
        return cleaned_text.strip()

    def _extract_image_info(self, page: pypdf.PageObject) -> List[Dict[str, Any]]:
        """
        Extract basic image metadata from page, if available.

        Args:
            page: PyPDF page object

        Returns:
            List of dictionaries with image info (index, name, width, height)
        """
        images = []
        try:
            # PyPDF pages can have an 'images' attribute listing embedded images
            if hasattr(page, 'images'):
                for i, image in enumerate(page.images):
                    images.append({
                        "image_index": i,
                        "name": getattr(image, 'name', f"image_{i}"),
                        "width": getattr(image, 'width', None),
                        "height": getattr(image, 'height', None)
                    })
        except Exception as e:
            logger.warning(f"Image extraction failed: {e}")
        return images
    def _remove_headers_footers(self, pages_data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """
        Remove repeated headers and footers that appear on many pages.

        This is done by identifying lines appearing on over 50% of pages
        at the start or end of page text, then removing those lines.

        Args:
            pages_data: List of dictionaries representing each page's extracted data.

        Returns:
            Updated list of pages with headers/footers removed
        """
        # Only attempt removal if enough pages and option enabled
        if len(pages_data) < 3 or not self.remove_headers_footers:
            return pages_data

        # Collect first and last lines from each page's text for analysis
        first_lines = [page["text"].split('\n')[0] if page["text"] else "" for page in pages_data]
        last_lines = [page["text"].split('\n')[-1] if page["text"] else "" for page in pages_data]
        threshold = len(pages_data) * 0.5  # More than 50% of pages

        # Identify candidate headers and footers appearing frequently
        potential_headers = [line for line in set(first_lines)
                             if first_lines.count(line) > threshold and line.strip()]
        potential_footers = [line for line in set(last_lines)
                             if last_lines.count(line) > threshold and line.strip()]

        # Remove identified headers and footers from each page's text
        for page_data in pages_data:
            lines = page_data["text"].split('\n')
            # Remove header if it matches a frequent header
            if lines and potential_headers:
                for header in potential_headers:
                    if lines[0].strip() == header.strip():
                        lines = lines[1:]
                        break
            # Remove footer if it matches a frequent footer
            if lines and potential_footers:
                for footer in potential_footers:
                    if lines[-1].strip() == footer.strip():
                        lines = lines[:-1]
                        break
            page_data["text"] = '\n'.join(lines).strip()
        return pages_data

    def _extract_document_metadata(self, pdf_reader: PdfReader, pdf_path: str) -> Dict[str, Any]:
        """
        Extract metadata from the PDF document itself.

        Args:
            pdf_reader: PyPDF PdfReader instance
            pdf_path: path to PDF file

        Returns:
            Dictionary of metadata including file info and PDF document metadata
        """
        metadata = {
            "file_path": pdf_path,
            "file_name": Path(pdf_path).name,
            "file_size": os.path.getsize(pdf_path) if os.path.exists(pdf_path) else None,
        }
        try:
            if pdf_reader.metadata:
                # Extract common PDF metadata keys if available
                metadata.update({
                    "title": pdf_reader.metadata.get('/Title', ''),
                    "author": pdf_reader.metadata.get('/Author', ''),
                    "subject": pdf_reader.metadata.get('/Subject', ''),
                    "creator": pdf_reader.metadata.get('/Creator', ''),
                    "producer": pdf_reader.metadata.get('/Producer', ''),
                    "creation_date": str(pdf_reader.metadata.get('/CreationDate', '')),
                    "modification_date": str(pdf_reader.metadata.get('/ModDate', '')),
                })
        except Exception as e:
            logger.warning(f"Metadata extraction failed: {e}")
        return metadata
    def parse_pdf(self, pdf_path: str) -> Dict[str, Any]:
        """
        Parse the entire PDF file. Opens the file, extracts text and metadata page by page, removes headers/footers if configured, and aggregates results.

        Args:
            pdf_path: Path to the PDF file

        Returns:
            Dictionary with keys:
            - 'full_text': combined text from all pages,
            - 'pages': list of page-wise dicts with text and metadata,
            - 'document_metadata': file and PDF metadata,
            - 'total_pages': total pages in PDF,
            - 'processed_pages': number of pages kept after filtering,
            - 'total_words': total word count of parsed text
        """
        try:
            with open(pdf_path, 'rb') as file:
                pdf_reader = PdfReader(file)
                doc_metadata = self._extract_document_metadata(pdf_reader, pdf_path)
                pages_data = []
                # Iterate over all pages and extract data
                for i, page in enumerate(pdf_reader.pages):
                    page_data = self.extract_text_from_page(page, i)
                    # Only keep pages with sufficient text length
                    if len(page_data["text"]) >= self.min_text_length:
                        pages_data.append(page_data)

                # Remove repeated headers and footers
                pages_data = self._remove_headers_footers(pages_data)

                # Combine all page texts with a double newline as a separator
                full_text = "\n\n".join(page["text"] for page in pages_data if page["text"])

                # Return final structured data
                return {
                    "full_text": full_text,
                    "pages": pages_data,
                    "document_metadata": doc_metadata,
                    "total_pages": len(pdf_reader.pages),
                    "processed_pages": len(pages_data),
                    "total_words": sum(page["word_count"] for page in pages_data)
                }
        except Exception as e:
            logger.error(f"Failed to parse PDF {pdf_path}: {e}")
            raise
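
Before wiring the parser into LangChain, here is a minimal standalone usage sketch. The file name sample.pdf is a placeholder, not a file shipped with this guide:

# A minimal sketch: parse one PDF and inspect the aggregate results.
from parser import CustomPDFParser

parser = CustomPDFParser(preserve_layout=False, remove_headers_footers=True)
result = parser.parse_pdf("sample.pdf")  # hypothetical path

print(result["processed_pages"], "of", result["total_pages"], "pages kept")
print(result["total_words"], "words extracted")
print(result["full_text"][:200])  # preview the first 200 characters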
Step 2: Integrate with LangChain (langchain_loader.py)
The LangChainPDFLoader class wraps the custom parser and converts the parsed pages into LangChain Document objects, the building blocks of any LangChain pipeline.
- Allows splitting documents into smaller chunks using LangChain's RecursiveCharacterTextSplitter
- Lets you customize the chunk size and overlap to suit downstream LLMs
- Provides a clean bridge between raw PDF content and LangChain Document output
The logic behind this is:
from typing import List, Optional, Dict, Any

from langchain.schema import Document
from langchain.document_loaders.base import BaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

from parser import CustomPDFParser  # import the parser defined above


class LangChainPDFLoader(BaseLoader):
    def __init__(
        self,
        file_path: str,
        parser_config: Optional[Dict[str, Any]] = None,
        chunk_size: int = 500,
        chunk_overlap: int = 50
    ):
        """
        Initialize the loader with the PDF file path, parser configuration, and chunking parameters.

        Args:
            file_path: path to PDF file
            parser_config: dictionary of parser options
            chunk_size: chunk size for splitting long texts
            chunk_overlap: chunk overlap for splitting
        """
        self.file_path = file_path
        self.parser_config = parser_config or {}
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.parser = CustomPDFParser(**self.parser_config)

    def load(self) -> List[Document]:
        """
        Load PDF, parse pages, and convert each page to a LangChain Document.

        Returns:
            List of Document objects with page text and combined metadata.
        """
        parsed_data = self.parser.parse_pdf(self.file_path)
        documents = []
        # Convert each page dict to a LangChain Document
        for page_data in parsed_data["pages"]:
            if page_data["text"]:
                # Merge document-level and page-level metadata
                metadata = {**parsed_data["document_metadata"], **page_data["metadata"]}
                doc = Document(page_content=page_data["text"], metadata=metadata)
                documents.append(doc)
        return documents

    def load_and_split(self) -> List[Document]:
        """
        Load the PDF and split large documents into smaller chunks.

        Returns:
            List of Document objects after splitting large texts.
        """
        documents = self.load()
        # Initialize a text splitter with the desired chunk size and overlap
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            separators=["\n\n", "\n", " ", ""]  # hierarchical splitting
        )
        # Split documents into smaller chunks
        split_docs = text_splitter.split_documents(documents)
        return split_docs
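
Here is a quick usage sketch for the loader; the chunk sizes and file path are illustrative, not prescriptive:

# A minimal sketch: load a PDF as Documents, then split into overlapping chunks.
from langchain_loader import LangChainPDFLoader

loader = LangChainPDFLoader("sample.pdf", chunk_size=500, chunk_overlap=50)  # hypothetical path
docs = loader.load()              # one Document per page
chunks = loader.load_and_split()  # smaller chunks for embedding or prompting

print(len(docs), "page documents ->", len(chunks), "chunks")
print(chunks[0].metadata)  # merged document-level and page-level metadata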
Step 3: Create a Processing Pipeline (pipeline.py)
The PDFProcessingPipeline class provides a high-level interface for:
- Processing a single PDF
- Selecting the output format (raw dict, LangChain documents, or plain text)
- Enabling or disabling chunking, with configurable chunk size and overlap
- Handling errors and logging
This design allows simple integration into larger programs or workflows. The logic behind it is:
from typing import List, Optional, Dict, Any
import logging

from langchain.schema import Document

from parser import CustomPDFParser
from langchain_loader import LangChainPDFLoader

logger = logging.getLogger(__name__)


class PDFProcessingPipeline:
    def __init__(self, parser_config: Optional[Dict[str, Any]] = None):
        """
        Args:
            parser_config: dictionary of options passed to CustomPDFParser
        """
        self.parser_config = parser_config or {}

    def process_single_pdf(
        self,
        pdf_path: str,
        output_format: str = "langchain",
        chunk_documents: bool = True,
        chunk_size: int = 500,
        chunk_overlap: int = 50
    ) -> Any:
        """
        Args:
            pdf_path: path to PDF file
            output_format: "raw" (dict), "langchain" (Documents), or "text" (string)
            chunk_documents: whether to split LangChain documents into chunks
            chunk_size: chunk size for splitting
            chunk_overlap: chunk overlap for splitting

        Returns:
            Parsed content in the requested format
        """
        if output_format == "raw":
            # Use raw CustomPDFParser output
            parser = CustomPDFParser(**self.parser_config)
            return parser.parse_pdf(pdf_path)
        elif output_format == "langchain":
            # Use LangChain loader, optionally chunked
            loader = LangChainPDFLoader(pdf_path, self.parser_config, chunk_size, chunk_overlap)
            if chunk_documents:
                return loader.load_and_split()
            else:
                return loader.load()
        elif output_format == "text":
            # Return combined plain text only
            parser = CustomPDFParser(**self.parser_config)
            parsed_data = parser.parse_pdf(pdf_path)
            return parsed_data.get("full_text", "")
        else:
            raise ValueError(f"Unknown output_format: {output_format}")
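
To make the three output formats concrete, here is a minimal sketch of how they map to calls (the path is again a placeholder):

# A minimal sketch: one pipeline, three output formats.
from pipeline import PDFProcessingPipeline

pipeline = PDFProcessingPipeline({"remove_headers_footers": True})

raw = pipeline.process_single_pdf("sample.pdf", output_format="raw")           # dict of pages + metadata
chunks = pipeline.process_single_pdf("sample.pdf", output_format="langchain")  # chunked Documents
text = pipeline.process_single_pdf("sample.pdf", output_format="text")         # one plain string

print(raw["processed_pages"], len(chunks), len(text))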
Step 4: Test the Parser (example.py)
Let's test the parser like this:
from pathlib import Path

from pipeline import PDFProcessingPipeline  # the pipeline defined in Step 3


def main():
    print("👋 Welcome to the Custom PDF Parser!")
    print("What would you like to do?")
    print("1. View full parsed raw data")
    print("2. Extract full plain text")
    print("3. Get LangChain documents (no chunking)")
    print("4. Get LangChain documents (with chunking)")
    print("5. Show document metadata")
    print("6. Show per-page metadata")
    print("7. Show cleaned page text (header/footer removed)")
    print("8. Show extracted image metadata")

    choice = input("Enter the number of your choice: ").strip()
    if choice not in {'1', '2', '3', '4', '5', '6', '7', '8'}:
        print("❌ Invalid option.")
        return

    file_path = input("Enter the path to your PDF file: ").strip()
    if not Path(file_path).exists():
        print("❌ File not found.")
        return

    # Initialize pipeline
    pipeline = PDFProcessingPipeline({
        "preserve_layout": False,
        "remove_headers_footers": True,
        "extract_images": True,
        "min_text_length": 20
    })

    # Raw data is needed for most options
    parsed = pipeline.process_single_pdf(file_path, output_format="raw")

    if choice == '1':
        print("\nFull Raw Parsed Output:")
        for k, v in parsed.items():
            print(f"{k}: {str(v)[:300]}...")
    elif choice == '2':
        print("\nFull Cleaned Text (truncated preview):")
        print("Previewing the first 1000 characters:\n" + parsed["full_text"][:1000], "...")
    elif choice == '3':
        docs = pipeline.process_single_pdf(file_path, output_format="langchain", chunk_documents=False)
        print(f"\nLangChain Documents: {len(docs)}")
        print("Previewing the first 500 characters:\n", docs[0].page_content[:500], "...")
    elif choice == '4':
        docs = pipeline.process_single_pdf(file_path, output_format="langchain", chunk_documents=True)
        print(f"\nLangChain Chunks: {len(docs)}")
        print("Sample chunk content (first 500 chars):")
        print(docs[0].page_content[:500], "...")
    elif choice == '5':
        print("\nDocument Metadata:")
        for key, value in parsed["document_metadata"].items():
            print(f"{key}: {value}")
    elif choice == '6':
        print("\nPer-page Metadata:")
        for i, page in enumerate(parsed["pages"]):
            print(f"Page {i+1}: {page['metadata']}")
    elif choice == '7':
        print("\nCleaned Text After Header/Footer Removal.")
        print("Showing the first 3 pages and first 500 characters of the text from each page.")
        for i, page in enumerate(parsed["pages"][:3]):  # First 3 pages
            print(f"\n--- Page {i+1} ---")
            print(page["text"][:500], "...")
    elif choice == '8':
        print("\nExtracted Image Metadata (if available):")
        found = False
        for i, page in enumerate(parsed["pages"]):
            images = page["metadata"].get("images", [])
            if images:
                found = True
                print(f"\n--- Page {i+1} ---")
                for img in images:
                    print(img)
        if not found:
            print("No image metadata found.")


if __name__ == "__main__":
    main()
Run this, and you will be prompted to enter your choice and the path to your PDF. The PDF I use is publicly available; you can download it from the link.
👋 Welcome to the Custom PDF Parser!
What would you like to do?
1. View full parsed raw data
2. Extract full plain text
3. Get LangChain documents (no chunking)
4. Get LangChain documents (with chunking)
5. Show document metadata
6. Show per-page metadata
7. Show cleaned page text (header/footer removed)
8. Show extracted image metadata
Enter the number of your choice: 4
Enter the path to your PDF file: /content/articles.pdf
Output:
LangChain Chunks: 16
Sample chunk content (first 500 chars):
San José State University Writing Center
www.sjsu.edu/writingcenter
Written by Ben Aldridge
Articles (a/an/the), Spring 2014. 1 of 4
Articles (a/an/the)
There are three articles in the English language: a, an, and the. They are placed before nouns
and show whether a given noun is general or specific.
Examples of Articles
Store
Wrapping Up
In this guide, you have learned to create a flexible and powerful PDF pipeline using only open-source tools. Because it is modular, you can easily extend it: perhaps add a search interface using Streamlit, or store the chunks in a vector database such as FAISS (see the sketch below). You don't have to rebuild anything; you just connect the next piece. PDFs don't have to feel like locked boxes. With this approach, you can turn any PDF into something you can read, search, and understand on your own terms.
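
As a pointer in that direction, here is a minimal sketch of indexing the chunks in FAISS. It assumes two extra packages (faiss-cpu and sentence-transformers) that this guide itself does not require, and the embedding model name is just one common choice:

# A minimal sketch: embed the chunks and index them for similarity search.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from pipeline import PDFProcessingPipeline

pipeline = PDFProcessingPipeline()
chunks = pipeline.process_single_pdf("sample.pdf", output_format="langchain")  # hypothetical path

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = FAISS.from_documents(chunks, embeddings)

for doc in store.similarity_search("What is an article?", k=2):
    print(doc.page_content[:120])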
Kanwal Mehreen is a machine learning engineer and technical writer with a deep passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, a Mitacs Globalink Research Scholar, and a Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.


