---
name: pdf-toolkit
description: Comprehensive PDF manipulation toolkit for text extraction, table processing, document creation, merging, splitting, and form handling using Python libraries.
---

# PDF Processing Toolkit

This skill covers PDF manipulation in Python: text extraction, table processing, document creation, merging, splitting, and form handling.

## Core Libraries

| Library | Best For |
|---------|----------|
| **pypdf** | Merging, splitting, metadata, rotation |
| **pdfplumber** | Text and table extraction with layout preservation |
| **reportlab** | PDF creation from scratch |
| **pytesseract** | OCR for scanned documents |

## Command-Line Tools

- **pdftotext**: Fast text extraction
- **qpdf**: Merging, splitting, decryption
- **pdftk**: Alternative toolkit for manipulation
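
The command-line tools above can also be driven from Python. The following is a minimal sketch, assuming `pdftotext` and `qpdf` are installed and on `PATH`; the helper names (`pdftotext_command`, `qpdf_merge_command`, `run`) are hypothetical and only illustrate how the argument lists are assembled.

```python
import subprocess
from typing import List, Sequence


def pdftotext_command(pdf_path: str, txt_path: str, layout: bool = True) -> List[str]:
    """Build a pdftotext invocation; -layout keeps the physical page layout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def qpdf_merge_command(inputs: Sequence[str], output: str) -> List[str]:
    """Build a qpdf invocation that concatenates the inputs into one file."""
    # --empty starts from a blank document; --pages ... -- lists the sources.
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def run(cmd: Sequence[str]) -> None:
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(list(cmd), check=True)


print(qpdf_merge_command(["doc1.pdf", "doc2.pdf"], "merged.pdf"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces. Note that `qpdf` reports "succeeded with warnings" through a non-zero exit code, so `check=True` may be stricter than needed for some inputs.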

## Quick Reference

| Task | Recommended Tool |
|------|------------------|
| Extract text | pdfplumber or pdftotext |
| Extract tables | pdfplumber |
| Merge PDFs | pypdf or qpdf |
| Split PDF | pypdf or qpdf |
| Create PDF | reportlab |
| Fill forms | pypdf |
| OCR scanned PDFs | pytesseract + pdf2image |
| Add watermark | pypdf |
| Compress | qpdf |

## Text Extraction

### Basic Text Extraction

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        # Pages without a text layer (e.g. scans) yield no text
        text = page.extract_text() or ""
        print(text)
```

### Layout-Aware Extraction

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Extract with layout preservation
    text = page.extract_text(layout=True)

    # Get words with positions
    words = page.extract_words()
    for word in words:
        print(f"{word['text']} at ({word['x0']}, {word['top']})")
```
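
Word positions can be used to rebuild lines when the default text order is not what is needed. The sketch below is one possible approach, not pdfplumber's own algorithm: it groups word dicts (with the `text`, `x0`, and `top` keys shown above) whose `top` values lie within a tolerance of each other. The function name `group_words_into_lines` and the 3-point tolerance are assumptions chosen for illustration.

```python
from typing import Dict, List


def group_words_into_lines(words: List[Dict], tolerance: float = 3.0) -> List[str]:
    """Rebuild text lines from word dicts carrying 'text', 'x0', and 'top'."""
    lines: List[List[Dict]] = []
    # Sort top-to-bottom first so words on the same line end up adjacent.
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order words left-to-right before joining.
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]


# Hand-written sample in the shape returned by page.extract_words()
sample_words = [
    {"text": "World", "x0": 60.0, "top": 100.4},
    {"text": "Hello", "x0": 10.0, "top": 100.0},
    {"text": "Second", "x0": 10.0, "top": 120.0},
    {"text": "line", "x0": 55.0, "top": 120.2},
]
print(group_words_into_lines(sample_words))  # ['Hello World', 'Second line']
```

The tolerance probably needs tuning per document, since font size and line spacing determine how far apart `top` values on one line can drift.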

## Table Extraction

```python
import pdfplumber
import pandas as pd

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]

    # Extract tables
    tables = page.extract_tables()

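    for i, table in enumerate(tables):
        if not table:
            continue  # Skip empty tables
        # Treat the first row as the header; empty cells come back as None
        df = pd.DataFrame(table[1:], columns=table[0])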
        print(f"Table {i}:")
        print(df)
```
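
Extracted tables are plain lists of rows, and empty cells typically arrive as `None`, so some cleanup before building a DataFrame is often useful. The following is a minimal sketch under that assumption; `clean_table` and the `col_N` placeholder header names are hypothetical choices, not part of pdfplumber.

```python
from typing import List, Optional, Tuple

Row = List[Optional[str]]


def clean_table(table: List[Row]) -> Tuple[List[str], List[List[str]]]:
    """Split a raw table into (header, rows) with None cells replaced by ''."""
    if not table:
        return [], []

    def clean(cell: Optional[str]) -> str:
        # Collapse newlines and repeated whitespace that come from wrapped cells.
        return " ".join(cell.split()) if cell else ""

    # Give blank header cells a placeholder name so every column has a name.
    header = [clean(c) or f"col_{i}" for i, c in enumerate(table[0])]
    rows = [[clean(c) for c in row] for row in table[1:]]
    # Drop rows that are entirely empty after cleaning.
    rows = [row for row in rows if any(row)]
    return header, rows


# Hand-written sample in the shape returned by page.extract_tables()
raw = [
    ["Name", None, "Total\nUSD"],
    ["Widget", "3", "9.50"],
    [None, None, None],
    ["Gadget", None, "4.00"],
]
header, rows = clean_table(raw)
print(header)  # ['Name', 'col_1', 'Total USD']
print(rows)    # [['Widget', '3', '9.50'], ['Gadget', '', '4.00']]
```

The cleaned result can then be passed to pandas as `pd.DataFrame(rows, columns=header)`.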

## Document Manipulation

### Merge PDFs

```python
from pypdf import PdfWriter

# PdfMerger is deprecated and was removed in pypdf 5.0; PdfWriter.append replaces it
writer = PdfWriter()
for path in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    writer.append(path)
writer.write("merged.pdf")
writer.close()
```

### Split PDF

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("large_document.pdf")

# Extract pages 5-10
writer = PdfWriter()
for page_num in range(4, 10):  # 0-indexed
    writer.add_page(reader.pages[page_num])

writer.write("pages_5_to_10.pdf")
```
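
The off-by-one between 1-based page numbers and 0-based indices is easy to get wrong when the ranges come from user input. As a sketch, a small parser can turn a spec such as `"5-10"` into the indices used in the loop above; `parse_page_ranges` and its spec syntax are hypothetical, not a pypdf feature.

```python
from typing import List


def parse_page_ranges(spec: str, total_pages: int) -> List[int]:
    """Turn a 1-based spec such as '1-3,7' into 0-based page indices."""
    indices: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, _, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if end_text else start
        if not 1 <= start <= end <= total_pages:
            raise ValueError(f"range {part!r} is outside 1..{total_pages}")
        # range() excludes its stop value, so start-1 .. end-1 needs stop=end.
        indices.extend(range(start - 1, end))
    return indices


print(parse_page_ranges("5-10", total_pages=12))  # [4, 5, 6, 7, 8, 9]
```

With a reader open, the result can feed the writer directly: `for i in parse_page_ranges("5-10", len(reader.pages)): writer.add_page(reader.pages[i])`.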

### Rotate Pages

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.rotate(90)  # Rotate 90 degrees clockwise
    writer.add_page(page)

writer.write("rotated.pdf")
```

## PDF Creation

### Using ReportLab Canvas

```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("output.pdf", pagesize=letter)
width, height = letter

# Add text
c.setFont("Helvetica-Bold", 24)
c.drawString(100, height - 100, "Hello, PDF!")

c.setFont("Helvetica", 12)
c.drawString(100, height - 150, "This is a paragraph of text.")

# Add rectangle
c.rect(100, height - 250, 200, 50, fill=0)

c.save()
```
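
The `height - 100` expressions above are needed because the ReportLab canvas places its origin at the bottom-left corner and measures in points (1/72 inch). The sketch below shows the conversion from a "distance from the top-left corner, in inches" position to canvas coordinates; `from_top_left` is a hypothetical helper, and the letter height is hard-coded so the block runs without ReportLab installed.

```python
from typing import Tuple

POINTS_PER_INCH = 72.0
LETTER_HEIGHT = 11 * POINTS_PER_INCH  # 792 points, the height of a letter page


def from_top_left(
    x_inches: float, y_inches: float, page_height: float = LETTER_HEIGHT
) -> Tuple[float, float]:
    """Convert inches measured from the top-left corner to canvas points."""
    x = x_inches * POINTS_PER_INCH
    # The canvas y axis grows upward, so subtract the offset from the page height.
    y = page_height - y_inches * POINTS_PER_INCH
    return x, y


print(from_top_left(1.0, 1.5))  # (72.0, 684.0)
```

The returned pair can be unpacked into drawing calls, for example `c.drawString(*from_top_left(1.0, 1.5), "Hello, PDF!")`.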

### Using Platypus (Document Templates)

```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# Title
story.append(Paragraph("Monthly Report", styles['Heading1']))
story.append(Spacer(1, 12))

# Content
story.append(Paragraph("This is the introduction paragraph.", styles['Normal']))
story.append(Spacer(1, 12))

story.append(Paragraph("Key Findings", styles['Heading2']))
story.append(Paragraph("Finding 1: Lorem ipsum dolor sit amet.", styles['Normal']))

doc.build(story)
```

## Watermarks and Overlays

```python
from pypdf import PdfReader, PdfWriter

# Read the original and watermark PDFs
original = PdfReader("document.pdf")
watermark = PdfReader("watermark.pdf")

writer = PdfWriter()

for page in original.pages:
    # merge_page draws the watermark on top of the existing page content
    page.merge_page(watermark.pages[0])
    writer.add_page(page)

writer.write("watermarked.pdf")
```

## OCR for Scanned Documents

```python
import pytesseract
from pdf2image import convert_from_path

# Convert PDF pages to images
images = convert_from_path("scanned.pdf", dpi=300)

# OCR each page
full_text = ""
for i, image in enumerate(images):
    text = pytesseract.image_to_string(image)
    full_text += f"\n--- Page {i+1} ---\n{text}"

print(full_text)
```
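
OCR at 300 dpi is slow, so it can be worth running it only on pages that lack a usable text layer. The following is a heuristic sketch, assuming the per-page text has already been extracted (for example with `page.extract_text()`); `needs_ocr`, `pages_needing_ocr`, and the 20-character threshold are hypothetical and should be tuned to the documents at hand.

```python
from typing import List, Optional


def needs_ocr(extracted_text: Optional[str], min_chars: int = 20) -> bool:
    """Guess whether a page is image-only from how little text it yielded."""
    # Ignore whitespace so pages with only stray spaces or newlines count as empty.
    stripped = "".join((extracted_text or "").split())
    return len(stripped) < min_chars


def pages_needing_ocr(page_texts: List[Optional[str]], min_chars: int = 20) -> List[int]:
    """Return the 0-based indices of pages that look like scans."""
    return [i for i, text in enumerate(page_texts) if needs_ocr(text, min_chars)]


# Hand-written sample: one text page followed by three near-empty pages
sample_texts = [
    "This page has a real text layer with plenty of characters.",
    "",
    None,
    "  \n 3 ",
]
print(pages_needing_ocr(sample_texts))  # [1, 2, 3]
```

The selected pages can then be rendered one at a time through the `first_page` and `last_page` arguments of `convert_from_path`, which are 1-based, so index `i` corresponds to page `i + 1`.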

## Password Protection

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

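# Add password protection. pypdf defaults to RC4; passing algorithm="AES-256"
# gives stronger encryption but needs the cryptography package installed.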
writer.encrypt(
    user_password="viewonly",  # Required to open
    owner_password="fullaccess",  # Required to edit/print
)

writer.write("protected.pdf")
```

## Image Extraction

```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")

for page_num, page in enumerate(reader.pages):
    for img_num, image in enumerate(page.images):
        # image.name ends with the extension of the embedded format, e.g. "Im0.jpg"
        extension = image.name.split(".")[-1]
        with open(f"page{page_num}_img{img_num}.{extension}", "wb") as f:
            f.write(image.data)
```

## Tips

- Use pdfplumber for layout-aware text extraction
- pypdf is best for document manipulation (merge, split, rotate)
- ReportLab excels at creating new PDFs programmatically
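- Scanned documents have no text layer, so text extraction returns nothing useful; use OCR (pytesseract) instead
- Check whether a PDF is encrypted before processing it (`PdfReader.is_encrypted` in pypdf)
- Process large PDFs in chunks of pages to limit memory use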
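
The last two tips can be sketched as follows. This is a minimal illustration, not a definitive implementation: `ensure_decrypted` and `chunk_ranges` are hypothetical helpers, and the first one assumes a pypdf-style reader that exposes `is_encrypted` and a `decrypt(password)` method returning a falsy value on a wrong password.

```python
from typing import Iterator, Optional, Tuple


def ensure_decrypted(reader, password: Optional[str] = None) -> None:
    """Decrypt a pypdf-style reader in place, or raise if that is not possible."""
    if not reader.is_encrypted:
        return
    if password is None:
        raise ValueError("PDF is encrypted and no password was supplied")
    # Assumption: decrypt() returns a falsy value when the password does not match.
    if not reader.decrypt(password):
        raise ValueError("wrong password for encrypted PDF")


def chunk_ranges(total_pages: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) index pairs that cover all pages in fixed-size chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, total_pages, chunk_size):
        # The final chunk may be shorter than chunk_size.
        yield start, min(start + chunk_size, total_pages)


print(list(chunk_ranges(total_pages=10, chunk_size=4)))  # [(0, 4), (4, 8), (8, 10)]
```

A typical use would be to call `ensure_decrypted(reader, password)` right after `PdfReader(path)`, then loop over `chunk_ranges(len(reader.pages), 50)` and handle `reader.pages[start:stop]` one chunk at a time, writing out results before moving to the next chunk.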
