AI-Powered PDF Translation Tool with OCR

A Python automation tool that translates PDF documents using a MarianMT model (via Hugging Face Transformers) and Tesseract OCR, then rebuilds each file with its content structure preserved and the original and translated text side by side.

Overview

This project is an automated tool for translating PDF documents while preserving their content structure. It translates text with a MarianMT neural machine translation model, currently targeting English-to-Turkish workflows, and falls back to Optical Character Recognition (OCR) to extract text from images embedded within the PDFs.
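
As a rough illustration of the extraction step, the sketch below reads each page's native text with PyMuPDF and runs Tesseract on the images embedded in that page. It is not the project's actual code: the names clean_blocks and extract_page_texts are hypothetical, and running OCR on every embedded image (rather than only when a page has no native text) is an assumption. It also assumes the Tesseract OCR engine is installed on the system.

```python
import io


def clean_blocks(blocks):
    """Strip surrounding whitespace and drop empty or missing text blocks."""
    return [b.strip() for b in blocks if b and b.strip()]


def extract_page_texts(pdf_path):
    """Yield, for each page, a list of text blocks: native text first, then OCR text.

    Hypothetical helper; third-party imports are local so clean_blocks
    stays usable without them.
    """
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    with fitz.open(pdf_path) as doc:
        for page in doc:
            blocks = [page.get_text()]
            # Each entry from get_images() starts with the image's xref number.
            for image_info in page.get_images(full=True):
                xref = image_info[0]
                image_bytes = doc.extract_image(xref)["image"]
                image = Image.open(io.BytesIO(image_bytes))
                blocks.append(pytesseract.image_to_string(image))
            yield clean_blocks(blocks)
```

Calling list(extract_page_texts("sample.pdf")) would then give one list of text blocks per page, ready to be passed to the translation step.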

Key Features

  • Intelligent Text Extraction: Utilizes PyMuPDF for standard text and Tesseract OCR for extracting text from images within the PDF.
  • Neural Machine Translation: Powered by the Helsinki-NLP/opus-mt-tc-big-en-tr MarianMT model via Hugging Face Transformers for high-quality translations.
  • PDF Reconstruction: Recreates the PDF document featuring both the original and translated text side-by-side for easy comparison.
  • Smart Caching: Includes logic to detect and skip files that have already been translated to optimize processing time.
  • Hardware Acceleration: Automatically detects CUDA-enabled GPUs to accelerate the inference process with PyTorch.
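
The translation and hardware acceleration features above could be wired together roughly as follows. This is a minimal sketch under stated assumptions, not the repository's implementation: the helper names chunk_text and load_translator are hypothetical, and the 400-character chunk size is an arbitrary illustrative value (the model's real limit is measured in tokens, which is why truncation is also enabled).

```python
def chunk_text(text, max_chars=400):
    """Split text on whitespace into chunks of at most max_chars characters.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def load_translator(model_name="Helsinki-NLP/opus-mt-tc-big-en-tr"):
    """Load the model once and return a function that translates a string."""
    import torch
    from transformers import MarianMTModel, MarianTokenizer

    # Use the GPU when PyTorch reports a CUDA device, otherwise the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(device)
    model.eval()

    def translate(text):
        chunks = chunk_text(text)
        if not chunks:
            return ""
        batch = tokenizer(
            chunks, return_tensors="pt", padding=True, truncation=True
        ).to(device)
        with torch.no_grad():
            output_ids = model.generate(**batch)
        return " ".join(tokenizer.batch_decode(output_ids, skip_special_tokens=True))

    return translate
```

Loading the model once and reusing the returned function avoids paying the model start-up cost for every page or file.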

Tech Stack

  • Language: Python
  • ML & NLP: PyTorch, Transformers (Hugging Face)
  • PDF & Image Processing: PyMuPDF (fitz), FPDF, Pillow, pytesseract
  • System: Tesseract OCR engine

Development Notes

This tool was originally developed to assist with translating academic course materials. To demonstrate functionality publicly without copyright infringement, the repository includes a script to generate random sample PDFs and utilizes public domain documents (e.g., "The Declaration of Independence") for testing.

Performance Note: GPU-based multiprocessing was explored, but for stability the tool currently processes files sequentially, one at a time, at approximately 15 seconds per file.
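
A sequential loop of this kind, combined with the skip logic described under Smart Caching, might look like the sketch below. The names output_path_for and process_folder, the "_translated.pdf" suffix, and the translate_file callback are all hypothetical; the repository may detect already-translated files differently.

```python
import time
from pathlib import Path


def output_path_for(pdf_path, out_dir):
    """Return the assumed output location for a given source PDF."""
    return Path(out_dir) / f"{Path(pdf_path).stem}_translated.pdf"


def process_folder(in_dir, out_dir, translate_file):
    """Translate PDFs one at a time, skipping files whose output already exists.

    translate_file(source_path, target_path) is a caller-supplied function
    that performs the actual translation. Returns seconds spent per file.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    timings = {}
    for pdf_path in sorted(Path(in_dir).glob("*.pdf")):
        target = output_path_for(pdf_path, out_dir)
        if target.exists():
            continue  # already translated on a previous run
        start = time.perf_counter()
        translate_file(pdf_path, target)
        timings[pdf_path.name] = time.perf_counter() - start
    return timings
```

The returned timings make it easy to check the per-file cost on a given machine, and rerunning the function on the same folders does no work for files that already have an output.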