
5 Useful Python Scripts for Automating Boring PDF Functions


# Introduction

PDF files are widely used in many applications. You may need to merge reports, split large files, extract text or tables, add watermarks, or redact sensitive content. These are all common tasks, but handling them manually across multiple files is slow and error-prone. These five Python scripts automate the process. They run from the command line, support batch processing, and are easy to configure.

You can find all the scripts on GitHub.

# 1. Merging and Splitting PDF Files

// Pain Point

Merging multiple PDF files into one, or splitting a large PDF into separate files by page range, are among the most common PDF tasks. Both are tedious to do manually, especially when dealing with many files or high page counts.

// What the Script Does

Combines a folder of PDF files into a single output file in a configurable order, or splits a single PDF into separate files by fixed page ranges, every N pages, or a list of specific page numbers. Both functions are handled by the same script through a mode flag.

// How It Works

The script uses pypdf for all page-level operations. In merge mode, it reads all the PDFs in the input folder, sorts them by file name (or by a custom order defined in a text file), and writes them in sequence into one PDF. In split mode, it accepts a list of page ranges, a fixed chunk size, or a list of page numbers to split on. Each partition is written to a numbered output file. In merge mode, the metadata from the original input file is preserved.
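The merge and fixed-chunk split modes described above can be sketched roughly as follows. This is a minimal illustration, not the script from the repository: the function names (`chunk_ranges`, `merge_folder`, `split_every_n`) and the output naming scheme are assumptions made for the example, and custom ordering and metadata handling are left out.

```python
from pathlib import Path


def chunk_ranges(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return 0-based, end-exclusive (start, stop) ranges of at most chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def merge_folder(input_dir: str, output_path: str) -> int:
    """Merge every PDF in input_dir, sorted by file name; return the merged page count."""
    # Imported here so chunk_ranges stays usable without pypdf installed.
    from pypdf import PdfWriter  # third-party: pip install pypdf

    writer = PdfWriter()
    for pdf_path in sorted(Path(input_dir).glob("*.pdf")):
        writer.append(str(pdf_path))  # appends all pages of this file
    writer.write(output_path)
    return len(writer.pages)


def split_every_n(input_path: str, output_dir: str, chunk_size: int) -> list[Path]:
    """Split one PDF into numbered files of chunk_size pages each (hypothetical naming)."""
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(input_path)
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    ranges = chunk_ranges(len(reader.pages), chunk_size)
    for index, (start, stop) in enumerate(ranges, start=1):
        writer = PdfWriter()
        for page_number in range(start, stop):
            writer.add_page(reader.pages[page_number])
        out_path = out_dir / f"{Path(input_path).stem}_part{index:03d}.pdf"
        writer.write(str(out_path))
        written.append(out_path)
    return written
```

Keeping the range calculation in its own pure function makes the split boundaries easy to check before any file is written. For example, `chunk_ranges(10, 4)` plans three output files covering pages 0 to 3, 4 to 7, and 8 to 9.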

⏩ Get the PDF merge and split script

# 2. Extracting Text and Tables from PDFs

// Pain Point

Getting usable data out of a PDF, whether it's text from a report or tabular data from a statement, usually has to happen before any further processing. Copy-pasting from a PDF viewer doesn't scale beyond a few pages, and the output is rarely clean.

// What the Script Does

Extracts text and tables from one or more PDF files and writes the results to structured output files. Text is written to plain text or Markdown files. Tables are written to CSV or Excel, with one sheet per table. It supports both text-based PDFs and files with more complex layouts.

// How It Works

The script uses pypdf for basic text extraction and pdfplumber for layout-aware extraction and table detection. For each input file, it goes page by page, extracting blocks of text and locating table regions with pdfplumber's table finder. Extracted tables are normalized (empty rows removed, headers detected) and written to separate output files. A summary report lists how many pages and tables were found in each file, and flags any pages where extraction yielded nothing.
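A minimal sketch of the page-by-page loop described above, assuming pdfplumber alone handles both text and tables (the pypdf fallback is omitted). The function names `clean_table` and `extract_pdf`, the output file names, and the shape of the summary dictionary are illustrative choices, not taken from the original script. Header detection is not attempted here.

```python
import csv
from pathlib import Path


def clean_table(rows):
    """Strip every cell to a plain string and drop rows that are entirely empty."""
    cleaned = []
    for row in rows:
        cells = [(cell or "").strip() for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def extract_pdf(pdf_path: str, output_dir: str) -> dict:
    """Write one .txt file plus one CSV per table; return a small per-file summary."""
    # Imported here so clean_table stays usable without pdfplumber installed.
    import pdfplumber  # third-party: pip install pdfplumber

    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem
    summary = {"pages": 0, "tables": 0, "empty_pages": []}
    page_texts = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            summary["pages"] += 1
            text = page.extract_text() or ""
            tables = [clean_table(table) for table in page.extract_tables()]
            tables = [table for table in tables if table]
            # Flag pages where neither text nor tables were found.
            if not text.strip() and not tables:
                summary["empty_pages"].append(page_number)
            page_texts.append(text)
            for table in tables:
                summary["tables"] += 1
                csv_path = out_dir / f"{stem}_table{summary['tables']:03d}.csv"
                with open(csv_path, "w", newline="", encoding="utf-8") as handle:
                    csv.writer(handle).writerows(table)

    (out_dir / f"{stem}.txt").write_text("\n\n".join(page_texts), encoding="utf-8")
    return summary
```

Pages listed in `empty_pages` are often scanned images, which would need OCR rather than text extraction.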

⏩ Get the PDF text and table extractor script

# 3. Stamping, Watermarking, and Adding Page Numbers

// Pain Point

Adding a watermark, stamp, or page numbers to a collection of PDFs before distributing them is straightforward in concept but slow to do one file at a time in a graphical user interface (GUI). If the batch is large or the requirement is recurring, it needs to be automated.

// What the Script Does

Applies a text or image stamp to each page of one or more PDF files. It supports diagonal watermarks, header/footer text, page numbers, and image overlays. Position, font size, opacity, and color are all configurable. It processes entire folders in batch.

// How It Works

The script uses pypdf for page manipulation and reportlab to generate the stamp layer. For each input PDF, it creates a one-page stamp PDF in memory using reportlab, either rendering text at a specified position, angle, font, and opacity, or placing an image at specified coordinates. This stamp page is then merged onto every page of the source PDF using pypdf's page merge. The result is written to a new output file, leaving the original unchanged. Page numbers are treated as a special case: a unique stamp is generated for each page.
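The stamp-and-merge approach above might look like the sketch below. It combines a diagonal watermark with a page-number footer, so it builds a fresh stamp for every page (the page-number case). The names `page_label`, `build_stamp`, and `stamp_pdf`, the fonts, and the fixed sizes and positions are assumptions for illustration. Image stamps and rotated pages are not handled.

```python
import io


def page_label(number: int, total: int, template: str = "Page {n} of {total}") -> str:
    """Format the footer text for one page."""
    return template.format(n=number, total=total)


def build_stamp(width: float, height: float, watermark: str, footer: str,
                opacity: float = 0.3) -> bytes:
    """Render a one-page stamp PDF in memory and return its bytes."""
    # Imported here so page_label stays usable without reportlab installed.
    from reportlab.pdfgen import canvas  # third-party: pip install reportlab

    buffer = io.BytesIO()
    c = canvas.Canvas(buffer, pagesize=(width, height))

    # Diagonal, semi-transparent watermark centred on the page.
    c.saveState()
    c.setFont("Helvetica-Bold", 48)
    c.setFillColorRGB(0.5, 0.5, 0.5)
    c.setFillAlpha(opacity)
    c.translate(width / 2, height / 2)
    c.rotate(45)
    c.drawCentredString(0, 0, watermark)
    c.restoreState()

    # Opaque footer text, 20 points above the bottom edge.
    c.setFont("Helvetica", 10)
    c.drawCentredString(width / 2, 20, footer)
    c.save()
    return buffer.getvalue()


def stamp_pdf(input_path: str, output_path: str, watermark: str) -> None:
    """Overlay the watermark and a page number on every page; write a new file."""
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    reader = PdfReader(input_path)
    writer = PdfWriter()
    total = len(reader.pages)
    for number, page in enumerate(reader.pages, start=1):
        width = float(page.mediabox.width)
        height = float(page.mediabox.height)
        stamp_bytes = build_stamp(width, height, watermark, page_label(number, total))
        stamp_page = PdfReader(io.BytesIO(stamp_bytes)).pages[0]
        page.merge_page(stamp_page)  # changes the in-memory page, not the input file
        writer.add_page(page)
    with open(output_path, "wb") as handle:
        writer.write(handle)
```

When the stamp is identical on every page (a watermark with no page number), building it once and reusing it for all pages avoids regenerating the same PDF repeatedly.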

⏩ Get the PDF stamping script

# 4. Redacting Sensitive Content

// Pain Point

Before sharing a PDF externally, sensitive content (such as names, reference numbers, financial figures, and addresses) usually needs to be removed. Drawing black boxes over text in a PDF editor works, but not all tools remove the underlying text, and it isn't practical for more than a few pages.

// What the Script Does

Scans PDF pages for text matching patterns you define (regex patterns, specific strings, or predefined categories like email addresses and phone numbers) and permanently redacts the matching content by replacing it with black rectangles. It outputs a new PDF with the underlying text removed, not just hidden.

// How It Works

The script uses PyMuPDF, which provides both text search that returns bounding-box coordinates and the ability to draw redaction annotations that permanently remove the underlying content when applied. On each page, the script searches for all matches of each configured pattern, marks their bounding rectangles as redaction annotations, and applies them, which removes the text from the page's content stream. A report is written listing all redactions made, including the page number, the matched text (before redaction), and the pattern that triggered it.
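The search, annotate, and apply sequence described above can be sketched as follows. The two regular expressions are simplified examples, and `PATTERNS`, `find_matches`, and `redact_pdf` are hypothetical names; the original script's pattern categories and report format may differ.

```python
import re

# Simplified example patterns; real-world formats vary and need tuning.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def find_matches(text: str, patterns: dict) -> list[tuple[str, str]]:
    """Return sorted, de-duplicated (pattern_name, matched_text) pairs found in text."""
    found = set()
    for name, pattern in patterns.items():
        for match in pattern.finditer(text):
            found.add((name, match.group(0)))
    return sorted(found)


def redact_pdf(input_path: str, output_path: str, patterns: dict = PATTERNS) -> list[dict]:
    """Redact every pattern match and return a report of what was removed."""
    # Imported here so find_matches stays usable without PyMuPDF installed.
    import fitz  # PyMuPDF, third-party: pip install pymupdf

    report = []
    doc = fitz.open(input_path)
    try:
        for page_number, page in enumerate(doc, start=1):
            for name, matched in find_matches(page.get_text(), patterns):
                # search_for returns one rectangle per occurrence on the page.
                for rect in page.search_for(matched):
                    page.add_redact_annot(rect, fill=(0, 0, 0))
                    report.append({"page": page_number, "pattern": name, "text": matched})
            # Applying the annotations removes the covered text from the page.
            page.apply_redactions()
        doc.save(output_path)
    finally:
        doc.close()
    return report
```

Two caveats apply to this sketch: a match that wraps across lines may not be located by `search_for`, and `search_for` matches substrings case-insensitively, so it can cover more occurrences than the regex itself found. Because the report contains the text that was removed, it should be stored as carefully as the original file.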

⏩ Get the PDF redaction script

# 5. Extracting Metadata and Generating a PDF Inventory

// Pain Point

When working with a large collection of PDF files, it's often helpful to know the basic facts about each one: page count, file size, creation date, author, whether it's encrypted, and whether it contains text or scanned images. Checking each file in a viewer doesn't work at scale.

// What the Script Does

It scans a folder of PDF files and extracts metadata from each one, including page count, file size, creation and modification dates, author, producer, encryption status, and whether the document appears to contain searchable text or scanned images. It writes everything to a single CSV or Excel file.

// How It Works

The script uses pypdf to read document metadata from the PDF information dictionary and pdfplumber to sample pages for text content. For each file, it tries to open the PDF and read the standard metadata fields. It samples the first few pages to determine whether the file contains extractable text as opposed to scanned image pages. Encrypted files that cannot be opened are flagged rather than silently skipped. The output table includes one row per file with all extracted fields, and a summary row at the bottom with totals and averages.
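A rough sketch of the inventory pass described above. For brevity it samples page text with pypdf's own `extract_text` instead of pdfplumber, and it omits the summary row. The column list in `FIELDS`, the 20-character threshold in `classify_text`, and the function names are assumptions for illustration.

```python
import csv
from pathlib import Path

FIELDS = ["file", "size_bytes", "pages", "author", "producer",
          "created", "encrypted", "has_text", "error"]


def classify_text(sample_texts, min_chars: int = 20) -> bool:
    """True when the sampled pages together hold at least min_chars non-space characters."""
    total = sum(len("".join(text.split())) for text in sample_texts if text)
    return total >= min_chars


def inspect_pdf(path: Path, sample_pages: int = 3) -> dict:
    """Collect one inventory row for a single PDF; problems are recorded, not raised."""
    # Imported here so classify_text stays usable without pypdf installed.
    from pypdf import PdfReader  # third-party: pip install pypdf

    row = dict.fromkeys(FIELDS, "")
    row["file"] = path.name
    row["size_bytes"] = path.stat().st_size
    try:
        reader = PdfReader(str(path))
        row["encrypted"] = reader.is_encrypted
        if reader.is_encrypted:
            # Flag the file instead of silently skipping it.
            row["error"] = "encrypted, not opened"
            return row
        row["pages"] = len(reader.pages)
        info = reader.metadata
        if info is not None:
            row["author"] = info.author or ""
            row["producer"] = info.producer or ""
            row["created"] = str(info.get("/CreationDate", ""))
        sample_count = min(sample_pages, len(reader.pages))
        samples = [reader.pages[i].extract_text() for i in range(sample_count)]
        row["has_text"] = classify_text(samples)
    except Exception as exc:  # unreadable or corrupt file
        row["error"] = f"{type(exc).__name__}: {exc}"
    return row


def write_inventory(input_dir: str, csv_path: str) -> int:
    """Write one CSV row per PDF in input_dir; return the number of files listed."""
    rows = [inspect_pdf(path) for path in sorted(Path(input_dir).glob("*.pdf"))]
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The text check is only a heuristic: a scanned document with an OCR layer will be reported as containing text, and a text PDF whose first pages are cover images may be reported as scanned.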

⏩ Get the PDF inventory script

# Wrapping up

These five Python scripts handle PDF tasks that often turn into repetitive manual work: splitting files, extracting content, processing batches, and cleaning up the document workflow. Each script is designed to work safely on single files or entire folders while generating new output instead of modifying the original.

Start with a small batch, verify the output, and scale up to larger folders once everything looks good. Most of the setup involves only installing the listed dependencies and editing the configuration section with your file paths and settings.

Bala Priya C is an engineer and technical writer from India. She loves working at the intersection of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she is working on learning and sharing her knowledge with the engineering community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and code tutorials.


