AI Document Translator & OCR Tool
A Python automation pipeline that translates PDF documents from English to Turkish using a MarianMT model from Hugging Face, with Tesseract OCR for text embedded in images, and generates a side-by-side comparison of original and translated text.
Overview
This tool automates the translation of PDF documents by combining Optical Character Recognition (OCR) with Neural Machine Translation (NMT). Text is extracted from each document, translated, and written to a reconstructed PDF that displays the original text alongside the translation, so a reader can verify each passage against its source. It is suited to academic and professional use.
Key Features
- Hybrid Text Extraction: Intelligent pipeline that extracts standard text via PyMuPDF and automatically falls back to Tesseract OCR for images with embedded text.
- State-of-the-Art Translation: Leverages the Helsinki-NLP/opus-mt-tc-big-en-tr MarianMT model from Hugging Face for high-quality English-to-Turkish translations.
- Side-by-Side Layout: Unique output format that places the source text and translated text adjacent to each other on the same page.
- Smart Optimization: Features logic to skip previously processed files and automatically utilizes CUDA-enabled GPUs for accelerated inference.
- Robust Logging: Detailed tracking of the translation process via translation_log.txt to monitor progress and catch errors.
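
The extraction fallback, skip logic, and logging described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the tool's actual code: the helper names (`setup_logging`, `pick_text`, `extract_pages`, `pending_pdfs`), the `min_chars` threshold, the `pytesseract` wrapper, whole-page rendering for OCR, and the rule "a file is done if the output folder holds a PDF with the same name" are all hypothetical choices.

```python
import io
import logging
from pathlib import Path
from typing import Callable, List

logger = logging.getLogger("translator")


def setup_logging(log_path: str = "translation_log.txt") -> None:
    """Write progress and error messages to the log file named in the feature list."""
    logging.basicConfig(
        filename=log_path,
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",  # format is an assumption
    )


def pick_text(native_text: str, run_ocr: Callable[[], str], min_chars: int = 20) -> str:
    """Prefer the PDF's own text layer; call OCR only when it is (nearly) empty."""
    cleaned = native_text.strip()
    if len(cleaned) >= min_chars:
        return cleaned
    logger.info("Text layer too short (%d chars), falling back to OCR", len(cleaned))
    return run_ocr().strip()


def extract_pages(pdf_path: Path) -> List[str]:
    """Return one string per page, using Tesseract where PyMuPDF finds no text."""
    # Third-party imports are local so the helpers above work without them.
    import fitz  # PyMuPDF
    import pytesseract  # assumed wrapper; requires the Tesseract binary
    from PIL import Image

    pages = []
    with fitz.open(str(pdf_path)) as doc:
        for page in doc:
            def run_ocr(page=page) -> str:
                # Assumption: render the whole page and OCR it. The tool may
                # instead OCR individual embedded images.
                pixmap = page.get_pixmap(dpi=300)
                image = Image.open(io.BytesIO(pixmap.tobytes("png")))
                return pytesseract.image_to_string(image, lang="eng")

            pages.append(pick_text(page.get_text(), run_ocr))
    return pages


def pending_pdfs(input_dir: Path, output_dir: Path) -> List[Path]:
    """List input PDFs that have no same-named file in output_dir yet."""
    done = {p.name for p in output_dir.glob("*.pdf")}
    return sorted(p for p in input_dir.glob("*.pdf") if p.name not in done)
```

In this sketch, a driver would call `setup_logging()` once, loop over `pending_pdfs(...)`, and wrap each `extract_pages` call in a `try`/`except` that records failures with `logger.exception`, so one bad file does not stop the batch.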
Tech Stack
- Core Logic: Python
- AI & ML: PyTorch, Transformers (Hugging Face)
- PDF & Image Processing: PyMuPDF (imported as fitz), FPDF, Pillow
- OCR Engine: Tesseract
- Hardware Support: CUDA (GPU Acceleration)
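
As a rough illustration of how the AI & ML and Hardware Support entries fit together, the sketch below loads the MarianMT model through Transformers and moves it to a CUDA GPU when PyTorch reports one. The function names, the `chunk_text` helper, and its `max_chars` limit are assumptions made for this example; the tool may split and batch text differently.

```python
from typing import List

MODEL_NAME = "Helsinki-NLP/opus-mt-tc-big-en-tr"  # model named in the feature list


def chunk_text(text: str, max_chars: int = 400) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_chars.

    Hypothetical helper: a single word longer than max_chars is kept whole.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def translate_chunks(chunks: List[str], model_name: str = MODEL_NAME) -> List[str]:
    """Translate English chunks to Turkish, on the GPU when one is available."""
    if not chunks:
        return []
    # Heavy imports are local so chunk_text works without PyTorch installed.
    import torch
    from transformers import MarianMTModel, MarianTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(device)
    model.eval()

    # truncation=True guards against chunks that exceed the model's input limit.
    batch = tokenizer(chunks, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        output_ids = model.generate(**batch)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

A real pipeline would probably load the tokenizer and model once and reuse them across pages, since reloading them on every call is slow.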