Merge PDF Files with Python

The need to merge PDF files arises frequently in both personal and professional contexts. Whether you're consolidating invoices, combining chapters of a book, or assembling reports, merging PDFs streamlines document management. However, many online tools offering this service raise valid privacy concerns, as your sensitive documents are uploaded to external servers. This is where Python, with its powerful libraries, offers a secure and flexible alternative for merging PDFs locally on your own machine. As an alternative to coding, BreezePDF provides a straightforward, secure solution for merging your PDF documents without ever sending them to a server.

Why Use Python for Merging PDFs?

Using Python to merge PDFs gives you fine-grained control over the entire process. You can tailor the merge to your needs, such as specifying page ranges or handling encrypted files, and you can automate it with scripts that take care of repetitive tasks. Because everything runs locally, you also avoid the limitations and potential privacy risks of online PDF merging tools: your documents never leave your machine.

Libraries for Merging PDFs with Python

Several robust Python libraries cater to PDF manipulation, including merging. Among the most popular are `pypdf` (formerly PyPDF2), `PyMuPDF`, `pdfrw`, and `pikepdf`. Additionally, the `pdfmerge` command-line utility provides a quick solution for simple merging tasks.

pypdf (formerly PyPDF2): Known for its simplicity and ease of use, making it ideal for basic merging tasks. However, it may struggle with more complex PDFs or encrypted files.
PyMuPDF: Offers a comprehensive feature set and excellent performance, suitable for handling various PDF complexities. However, it might have a steeper learning curve compared to `pypdf`.
pdfrw: Provides low-level control over PDF files, making it powerful but requiring a deeper understanding of the PDF format. It's an excellent choice for specialized tasks.
pikepdf: An actively maintained library built on the qpdf C++ library, with a focus on security and modern PDF features. It also supports repairing corrupted PDF files.

When choosing a library, consider its maintenance status, processing speed, available features, and how complex your PDF merging requirements are.

Using pypdf (formerly PyPDF2) to Merge PDFs

`pypdf` (formerly `PyPDF2`) is a beginner-friendly library for basic PDF merging tasks. You can install it using pip: `pip install pypdf`.

Basic Example: Merging Multiple PDF Files

from pypdf import PdfWriter

merger = PdfWriter()

pdfs = ['file1.pdf', 'file2.pdf', 'file3.pdf']

for pdf in pdfs:
    merger.append(pdf)

merger.write('merged_file.pdf')
merger.close()

This code snippet imports the `PdfWriter` class, creates an instance, and then iterates through a list of PDF filenames. For each PDF, the `append()` method adds all pages to the merger object. Finally, the `write()` method outputs the merged PDF to a new file. It's a straightforward approach for simple PDF concatenation.

Fine-grained Control with `merge()`

from pypdf import PdfReader, PdfWriter

merger = PdfWriter()

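# PdfReader accepts a file path directly, so there are no file handles to manage
input1 = PdfReader('file1.pdf')
input2 = PdfReader('file2.pdf')

merger.append(input1)    # all pages of file1.pdf
merger.merge(1, input2)  # insert file2.pdf starting at page index 1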

merger.write('merged_file.pdf')
merger.close()

The `merge()` method allows you to specify an insertion point within the pages already added. In this example, the pages of `file2.pdf` are inserted starting at index 1, that is, directly after the first page of `file1.pdf`; the remaining pages of `file1.pdf` follow them. This gives you precise control over page order in the final document and lets you interleave several PDFs.
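One way to reason about the `position` argument is to model the pages as a plain Python list. The sketch below is only an analogy for the resulting page order, not pypdf's actual implementation; the `insert_at` helper and the page labels are made up for illustration.

```python
def insert_at(existing, new_pages, position):
    """Return a new page order with new_pages placed at index `position`."""
    result = list(existing)                # copy so the input is not modified
    result[position:position] = new_pages  # slice assignment inserts the pages
    return result


file1 = ["A1", "A2", "A3"]  # stand-ins for the pages of file1.pdf
file2 = ["B1", "B2"]        # stand-ins for the pages of file2.pdf

print(insert_at(file1, file2, 1))  # ['A1', 'B1', 'B2', 'A2', 'A3']
```

With position 1, the inserted pages begin at index 1, directly after the first existing page.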

Specifying Page Ranges

from pypdf import PdfReader, PdfWriter

merger = PdfWriter()

input1 = PdfReader('file1.pdf')  # PdfReader accepts a file path directly
merger.append(input1, pages=(0, 2))  # pages from index 0 up to, but not including, index 2

merger.write('merged_file.pdf')
merger.close()

The `pages` keyword argument allows you to specify a range of pages to include from a source PDF. The example includes only the first two pages (from index 0 up to, but not including, index 2). This is useful when you need only specific sections of a document in the merged output.
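Because the tuple is zero-based and excludes its end index, it is easy to be off by one when starting from the page numbers shown in a PDF viewer. A minimal sketch of a conversion helper follows; `to_page_range` is a hypothetical name, not part of pypdf.

```python
def to_page_range(first_page, last_page):
    """Convert 1-based, inclusive page numbers to a zero-based (start, stop) tuple."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    # The zero-based start is one less; stop is exclusive, so last_page is used as-is
    return (first_page - 1, last_page)


print(to_page_range(1, 2))  # (0, 2)
print(to_page_range(3, 5))  # (2, 5)
```

The result can be passed straight to the merger, for example `merger.append('file1.pdf', pages=to_page_range(3, 5))` to take pages 3 through 5 as a viewer would number them.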

Handling Deprecation of `PdfMerger`

In newer versions of `pypdf`, `PdfMerger` was deprecated in favor of `PdfWriter` and has since been removed. Update older code by replacing `PdfMerger()` with `PdfWriter()` to stay compatible and avoid deprecation warnings or import errors. The functionality remains largely the same, so the transition should be straightforward.

Merging all PDF files in a directory

import os
from pypdf import PdfWriter

merger = PdfWriter()

directory = './pdfs'

# os.listdir() returns names in arbitrary order, so sort for a predictable result
for filename in sorted(os.listdir(directory)):
    # Match the extension case-insensitively ('.pdf' and '.PDF')
    if filename.lower().endswith('.pdf'):
        filepath = os.path.join(directory, filename)
        merger.append(filepath)

merger.write('merged_directory.pdf')
merger.close()

This snippet uses `os.listdir()` to list the files in the specified directory and sorts the names, because `os.listdir()` returns them in arbitrary order. For each name ending in '.pdf' (in any letter case), it constructs the full file path and appends the PDF to the merger object. This combines all PDF documents within a folder into a single file, automating what could otherwise be a tedious manual process.
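The order of the merged pages follows the order in which the filenames are processed. When names contain numbers, plain alphabetical ordering places `file10.pdf` before `file2.pdf`. The following sketch shows one possible "natural" ordering; `natural_key` and `list_pdfs` are illustrative helpers, not part of any PDF library.

```python
import re
from pathlib import Path


def natural_key(name):
    """Split a name into text and number chunks so 'file2' sorts before 'file10'."""
    parts = re.split(r"(\d+)", name)
    # Odd positions hold the captured digit runs; compare those as integers
    return [int(part) if i % 2 else part.lower() for i, part in enumerate(parts)]


def list_pdfs(directory):
    """Return the PDF paths in `directory`, in natural filename order."""
    pdfs = [p for p in Path(directory).iterdir()
            if p.is_file() and p.suffix.lower() == ".pdf"]
    return sorted(pdfs, key=lambda p: natural_key(p.name))


names = ["file10.pdf", "file2.pdf", "File1.pdf"]
print(sorted(names, key=natural_key))  # ['File1.pdf', 'file2.pdf', 'file10.pdf']
```

Each path returned by `list_pdfs('./pdfs')` can then be passed to `merger.append()` in turn.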

Using PyMuPDF (fitz) to Merge PDFs

PyMuPDF, traditionally imported as `fitz` (recent releases also accept `import pymupdf`), is another powerful Python library for PDF manipulation, offering a broader range of features and often better performance than `pypdf`. Install it using: `pip install pymupdf`.

Merging from the command line:

python -m fitz join -o result.pdf file1.pdf file2.pdf file3.pdf

PyMuPDF provides a convenient command-line interface for merging PDFs. This command joins `file1.pdf`, `file2.pdf`, and `file3.pdf` into a single file named `result.pdf`. The command-line approach is a quick way to perform simple merges without writing any Python code.

Merging from code:

import fitz

doc = fitz.open()

for filename in ['file1.pdf', 'file2.pdf', 'file3.pdf']:
    with fitz.open(filename) as mfile:
        doc.insert_pdf(mfile)

doc.save('merged_pymupdf.pdf')

This code creates a new, empty PDF document and iterates through a list of filenames. For each file, it opens the PDF using `fitz.open()` and then inserts its pages into the main document using `insert_pdf()`. This is a flexible and efficient way to merge multiple PDFs programmatically.

Use `insert_pdf` rather than the older, deprecated `insertPDF` spelling to stay compatible with current versions of PyMuPDF.

PyMuPDF also handles other kinds of input, including scanned PDFs, images, and other file formats, which it can convert to PDF before merging. The merged file can keep the tables of contents of its sources as well, so the result stays navigable; depending on the method you use, you may need to combine the source outlines yourself with `get_toc()` and `set_toc()`.
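A minimal sketch of carrying the tables of contents over into the merged file is shown below. It assumes each source has a well-formed outline and rebuilds a combined one with `get_toc()` and `set_toc()`; the helper names `shift_toc` and `merge_with_toc` are hypothetical, and this is one possible approach rather than PyMuPDF's own recipe.

```python
def shift_toc(toc, page_offset):
    """Return a copy of a simple TOC with page numbers moved by page_offset.

    Each entry is [level, title, page] with 1-based page numbers, the shape
    returned by PyMuPDF's Document.get_toc().
    """
    shifted = []
    for level, title, page in toc:
        # Entries without a valid target page (page < 1) are left unchanged
        new_page = page + page_offset if page > 0 else page
        shifted.append([level, title, new_page])
    return shifted


def merge_with_toc(filenames, output_path):
    """Merge PDFs and rebuild a combined table of contents."""
    import fitz  # PyMuPDF; imported here so shift_toc() works without it

    merged = fitz.open()
    combined_toc = []
    for filename in filenames:
        with fitz.open(filename) as src:
            # Pages already in `merged` give the offset for this file's entries
            combined_toc.extend(shift_toc(src.get_toc(), merged.page_count))
            merged.insert_pdf(src)
    merged.set_toc(combined_toc)
    merged.save(output_path)
    merged.close()
```

A call such as `merge_with_toc(['file1.pdf', 'file2.pdf'], 'merged_with_toc.pdf')` would then write the merged document with a single outline covering both sources.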

Alternative Libraries and Methods

While `pypdf` and `PyMuPDF` are popular, other libraries like `pdfrw` and `pikepdf` offer alternative approaches to merging PDFs.

pdfrw:

from pdfrw import PdfReader, PdfWriter

pdfs = ['file1.pdf', 'file2.pdf', 'file3.pdf']

writer = PdfWriter()

for pdf in pdfs:
    reader = PdfReader(pdf)
    writer.addpages(reader.pages)

writer.write('merged_pdfrw.pdf')

This code snippet reads each PDF file and adds its pages to a `PdfWriter` object. The pages are then written to a new merged PDF file. The `pdfrw` project also includes a `subset.py` example that shows how to select only specific pages, which is useful when you do not want to merge whole documents.

pikepdf:

Installation: `pip install pikepdf`

import pikepdf
import glob

output = pikepdf.Pdf.new()

# glob returns paths in arbitrary order, so sort for a predictable result
for src in sorted(glob.glob('pdf_files/*.pdf')):
    with pikepdf.open(src) as pdf:
        output.pages.extend(pdf.pages)

output.save('merged_pikepdf.pdf')

This code uses `glob` to find all PDF files in the 'pdf_files' directory and sorts the paths. It then opens each PDF and extends the output PDF with its pages. `pikepdf` is under active maintenance, which makes it a good choice when long-term support matters.

Handling Specific Scenarios

Merging PDFs can present challenges in specific situations, such as handling rotated pages or encrypted files.

Merging PDFs with rotated pages: Some libraries provide functions to normalize orientation; in pypdf, for example, a page's `transfer_rotation_to_content()` method applies the rotation to the page content itself. This helps keep orientation consistent across all merged documents.
Merging PDFs that might be encrypted: You may need to decrypt a file before merging, depending on the library; in pypdf, for example, you can check `reader.is_encrypted` and call `reader.decrypt(password)`. Include error handling so that password-protected PDFs are skipped or reported gracefully.
Time comparison of different libraries: Benchmarking the libraries on your specific dataset can help determine the most efficient one for your needs. Consider the trade-offs between speed, memory usage, and feature set.
Handling forms: When merging PDFs with forms, be aware of potential field name collisions and implement strategies for grouping fields to avoid data loss. Renaming duplicate field names or grouping them under unique identifiers ensures data integrity.
Merging PDFs using pdfunite on Linux through `subprocess`: `pdfunite` is a command-line utility from the Poppler tools that is available on Linux systems. Calling it through Python's `subprocess` module should be approached with caution: pass the arguments as a list rather than building a shell command string, and validate any file names that come from untrusted input to prevent command injection.
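The `pdfunite` item above can be sketched as follows. This is a minimal example, assuming `pdfunite` is installed and on the PATH; the helper names `build_pdfunite_command` and `merge_with_pdfunite` are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_pdfunite_command(inputs, output):
    """Build an argument list for pdfunite: input files first, output file last."""
    if not inputs:
        raise ValueError("at least one input PDF is required")
    # Absolute paths cannot start with '-', so a file name is never
    # mistaken for a command-line option
    args = [str(Path(p).resolve()) for p in inputs]
    return ["pdfunite", *args, str(Path(output).resolve())]


def merge_with_pdfunite(inputs, output):
    """Run pdfunite without a shell; raises CalledProcessError on failure."""
    command = build_pdfunite_command(inputs, output)
    # With an argument list and the default shell=False, file names are
    # passed through verbatim and never interpreted by a shell
    subprocess.run(command, check=True)
```

Calling `merge_with_pdfunite(['file1.pdf', 'file2.pdf'], 'merged_pdfunite.pdf')` would run the tool once with all inputs. Keeping the command as a list is the main safeguard here, because spaces, quotes, or semicolons in a file name are then treated as plain characters.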

BreezePDF: A Simpler Alternative

For those seeking a simpler alternative to coding, BreezePDF offers an intuitive and user-friendly solution. BreezePDF simplifies the PDF merging process with its drag-and-drop interface and comprehensive feature set. You can easily merge PDFs, add input boxes, type on the PDF, sign, add images, password protect, and delete PDF pages, all within your browser. Best of all, BreezePDF operates entirely on your device, ensuring your documents are never sent to a server, guaranteeing 100% privacy. No signup or download is required.

Conclusion

Using Python to merge PDFs is a powerful and flexible approach, offering control and automation that online tools cannot match. `pypdf` provides ease of use, `PyMuPDF` offers comprehensive features, `pdfrw` delivers low-level control, and `pikepdf` adds repair of damaged files. When you want the simplest solution with complete privacy, BreezePDF is a user-friendly option you can use right in your browser, with no code required. You may also want to look at our other guides, such as how to create a fillable DocuSign form, to get more out of your PDFs.