Overview
The OCR PDF endpoint extracts text from PDF documents of any type: those with selectable text, scanned images, or a combination of both. The API automatically detects the PDF type and uses the appropriate extraction method, so you can process diverse document collections without manual preprocessing.
Intelligent Processing:
- Text-based PDFs - Fast direct extraction using PDF text layer (sub-second per page)
- Scanned PDFs - Full OCR processing with Mistral Document AI (2-5 seconds per page)
- Mixed PDFs - Intelligently processes each page with the appropriate method
- Automatic detection - No need to specify document type
Key Features:
- Multi-page support - Process entire documents in one request
- Page-by-page results - Access text from individual pages
- Layout preservation - Maintains document structure and formatting
- High accuracy - State-of-the-art OCR for scanned documents
- Batch processing - Handle multiple pages efficiently
- Error resilience - Continues processing even if individual pages fail
Use Cases:
- Contract analysis - Extract text from legal documents
- Invoice processing - Automate accounts payable workflows
- Document management - Digitize paper archives
- Compliance - Make scanned documents searchable
- Data extraction - Pull structured data from forms and reports
- Research - Extract text from academic papers
- Record keeping - Convert physical records to digital text
- Accessibility - Make scanned documents screen-reader compatible
How PDF Processing Works
The API uses an intelligent multi-step process:
- PDF Analysis - Document is analyzed to determine content type
- Method Selection - Each page is categorized as text-based or image-based
- Text Extraction - For text PDFs, direct extraction from text layer
- OCR Processing - For scanned pages, Mistral Document AI performs OCR
- Page Aggregation - Results from all pages are combined
- Output Generation - Returns full document text plus per-page breakdowns
Processing Speed:
- Text-based PDFs: < 1 second per page
- Scanned PDFs: 2-5 seconds per page
- Total time depends on page count and document type
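As a rough planning aid, the per-page figures above can be turned into a time estimate. This is a minimal sketch: the function name is hypothetical, and it assumes per-page times simply add up, which ignores upload and network overhead.

```python
def estimate_seconds(text_pages: int, scanned_pages: int) -> tuple[float, float]:
    """Return a (low, high) estimate in seconds from the documented per-page figures.

    Assumption: per-page times add up linearly. Text-based pages are documented
    as taking under 1 second each, so they contribute 0 to 1 second per page;
    scanned pages contribute 2 to 5 seconds per page.
    """
    if text_pages < 0 or scanned_pages < 0:
        raise ValueError("page counts must be non-negative")
    low = scanned_pages * 2.0
    high = text_pages * 1.0 + scanned_pages * 5.0
    return low, high


# A 30-page mixed PDF with 20 text-based pages and 10 scanned pages
low, high = estimate_seconds(text_pages=20, scanned_pages=10)
print(f"roughly {low:.0f} to {high:.0f} seconds")
```

An estimate like this can help pick a request timeout or decide whether to process a document in the background.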
Text-based vs Scanned PDFs
Understanding the difference helps you set expectations:
Text-based PDFs (Digital PDFs):
- Created digitally (Word, Google Docs, etc.)
- Contain a selectable text layer
- Fast to process (no OCR needed)
- Text is read directly from the text layer, so there are no OCR recognition errors
- Examples: Digital forms, exported documents, e-books
Scanned PDFs (Image PDFs):
- Created from physical documents via scanner
- Contain images of pages, not text
- Requires OCR processing
- Accuracy depends on scan quality
- Examples: Scanned contracts, historical documents, faxes
Mixed PDFs:
- Some pages digital, some scanned
- Common in compiled documents
- Each page processed optimally
- Examples: Reports with scanned attachments, annotated documents
Authentication
All OCR endpoints require authentication via a Bearer token in the Authorization header.
Request
You can submit PDFs via either file upload or base64-encoded JSON.
Method 1: File Upload (multipart/form-data)
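A file upload request could be assembled as in the sketch below, using only the Python standard library. The endpoint URL and API key are placeholders, and the form field name `file` is taken from the Request Parameters table; treat the rest as an assumption rather than the documented client.

```python
import urllib.request
import uuid

# Hypothetical endpoint URL and key: substitute the values for your account.
OCR_PDF_URL = "https://api.example.com/ocr/pdf"
API_KEY = "YOUR_API_KEY"


def build_upload_request(pdf_bytes: bytes, filename: str = "document.pdf") -> urllib.request.Request:
    """Build a multipart/form-data POST with the PDF in a form field named "file"."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {
        # Bearer token authentication, as required by all OCR endpoints
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(OCR_PDF_URL, data=body, headers=headers, method="POST")


request = build_upload_request(b"%PDF-1.4 placeholder bytes")
# To send it: urllib.request.urlopen(request, timeout=300)
```

The request is built but not sent, so the sketch runs without network access; HTTP client libraries with built-in multipart support would shorten this considerably.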
Method 2: Base64 JSON
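For the JSON variant, a minimal sketch follows. The endpoint URL and API key are placeholders, and the JSON key `pdf` is an assumption based on the Request Parameters table.

```python
import base64
import json
import urllib.request

# Hypothetical endpoint URL and key: substitute the values for your account.
OCR_PDF_URL = "https://api.example.com/ocr/pdf"
API_KEY = "YOUR_API_KEY"


def build_base64_request(pdf_bytes: bytes) -> urllib.request.Request:
    """Build a JSON POST carrying the PDF as a base64 string under the "pdf" key."""
    payload = {"pdf": base64.b64encode(pdf_bytes).decode("ascii")}
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(
        OCR_PDF_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_base64_request(b"%PDF-1.4 placeholder bytes")
# To send it: urllib.request.urlopen(request, timeout=300)
```

Base64 encoding increases the payload size by about a third, which is worth keeping in mind for large files.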
Request Parameters
| Parameter | Type | Description |
|---|---|---|
| `file` or `pdf` | file/base64 | PDF file or base64-encoded PDF data |
Responses
Success Response (OCR Processing)
Success Response (Text Extraction)
Responses for text-based PDFs (`text_extraction` method) don't include `raw_response` since they don't use OCR processing.
Field Reference
Top-Level Fields
- `success` (boolean): Whether text extraction succeeded overall.
- `text` (string): Concatenated text from all processed pages (pages separated by `\n\n`).
- `method` (string): Extraction method: `"mistral_document_ai"` (OCR) or `"text_extraction"` (direct).
- `pages_processed` (integer): Number of pages actually processed.
- `total_pages` (integer): Total number of pages in the PDF.
- `pages` (array): Per-page results (see below).
Per-Page Fields
- `page_number` (integer): 1-indexed page number.
- `text` (string): Extracted text from this page.
- `success` (boolean): Whether extraction succeeded for this page.
- `error` (string, optional): Error message if extraction failed for this page.
- `raw_response` (object, optional): Complete raw response from Mistral Document AI for this page (only for OCR-processed pages).
Raw Response Object
The `raw_response` field contains the complete, unprocessed response from Mistral Document AI:
- All fields returned by the API (not just `markdown`)
- Original structure and formatting
- Metadata and additional information
- Useful for advanced processing or debugging
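The field reference above can be captured as type definitions for client code. This is a sketch only: the class names are hypothetical, the sample values are illustrative rather than real API output, and the contents of `raw_response` are left untyped because they are not specified here.

```python
from typing import Any, TypedDict


class _PageRequired(TypedDict):
    page_number: int  # 1-indexed
    text: str
    success: bool


class PageResult(_PageRequired, total=False):
    error: str                    # present only when the page failed
    raw_response: dict[str, Any]  # present only for OCR-processed pages


class OcrPdfResponse(TypedDict):
    success: bool
    text: str
    method: str  # "mistral_document_ai" or "text_extraction"
    pages_processed: int
    total_pages: int
    pages: list[PageResult]


# Illustrative values only, not real API output.
example: OcrPdfResponse = {
    "success": True,
    "text": "Page one text\n\nPage two text",
    "method": "text_extraction",
    "pages_processed": 2,
    "total_pages": 2,
    "pages": [
        {"page_number": 1, "text": "Page one text", "success": True},
        {"page_number": 2, "text": "Page two text", "success": True},
    ],
}

# The top-level text is the per-page text joined with "\n\n".
assert example["text"] == "\n\n".join(page["text"] for page in example["pages"])
```

Typed definitions like these let a static type checker catch misspelled field names before a request is ever made.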
Page Management
How Pages Are Processed
The OCR API processes PDF pages sequentially (one at a time):
- Text-based PDFs: Pages are extracted directly using fast text extraction
- Scanned PDFs: Each page is converted to an image (at specified DPI) and processed through OCR individually
Accessing Per-Page Results
Use the `pages` array to access individual page results programmatically.
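A minimal sketch of working with per-page results, assuming the response has already been parsed into a dictionary with the fields described in the Field Reference. The helper names and the sample response are illustrative.

```python
def failed_pages(result: dict) -> list[tuple[int, str]]:
    """Return (page_number, error) for every page whose extraction failed."""
    return [
        (page["page_number"], page.get("error", "unknown error"))
        for page in result.get("pages", [])
        if not page.get("success", False)
    ]


def text_by_page(result: dict) -> dict[int, str]:
    """Map 1-indexed page numbers to extracted text, skipping failed pages."""
    return {
        page["page_number"]: page["text"]
        for page in result.get("pages", [])
        if page.get("success", False)
    }


# Illustrative response, not real API output.
result = {
    "success": True,
    "pages": [
        {"page_number": 1, "text": "Invoice 1001", "success": True},
        {"page_number": 2, "text": "", "success": False, "error": "OCR failed"},
        {"page_number": 3, "text": "Total due", "success": True},
    ],
}

print(text_by_page(result)[3])  # Total due
print(failed_pages(result))     # [(2, 'OCR failed')]
```

Keeping failed pages in a separate list makes it straightforward to retry only those pages later.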
Error Responses
Authentication Required
401 Unauthorized
Invalid API Key
503 Service Unavailable
Processing Error
422 Unprocessable Entity
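Client code typically branches on the status code. The policy sketched below is an assumption based on general HTTP semantics, not documented behavior of this API, and the names are hypothetical; inspect the error body as well, since the cause behind a given code may differ from what its status text suggests.

```python
from enum import Enum


class Action(Enum):
    USE_RESULT = "use_result"
    FIX_CREDENTIALS = "fix_credentials"
    RETRY_LATER = "retry_later"
    INSPECT_DOCUMENT = "inspect_document"
    RAISE = "raise"


def action_for_status(status: int) -> Action:
    """Map an HTTP status code to a client-side action (an assumed policy)."""
    if 200 <= status < 300:
        return Action.USE_RESULT
    if status == 401:
        # Credentials missing or rejected: resending the same request will not help.
        return Action.FIX_CREDENTIALS
    if status == 503:
        # Treated here as transient: retry after a delay.
        return Action.RETRY_LATER
    if status == 422:
        # The document could not be processed: check the PDF itself.
        return Action.INSPECT_DOCUMENT
    return Action.RAISE


print(action_for_status(422).value)  # inspect_document
```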
Best Practices
Document Preparation:
- Use high-quality scans (300+ DPI for optimal results)
- Ensure pages are properly oriented before upload
- Remove password protection before processing
- Keep individual PDF files under 50MB for best performance
- For very large documents, consider splitting into sections
Performance Optimization:
- Text-based PDFs process much faster than scanned PDFs
- Process large documents asynchronously with progress tracking
- Implement caching for frequently accessed documents
- Consider parallel processing for document batches
- Monitor processing time and adjust based on document type
Quality Assurance:
- Always check the `success` field before using extracted text
- Review `pages_processed` vs `total_pages` to detect failures
- Implement retry logic for failed pages
- Validate critical data extracted from documents
- Maintain original PDFs for reference
Cost Optimization:
- Cache OCR results to avoid reprocessing
- Prioritize text-based PDF conversion when creating documents
- Batch similar documents together for efficiency
- Monitor API usage and optimize based on patterns
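One way to cache OCR results is to key them by a hash of the PDF content, so an identical file is never sent twice. The sketch below keeps results in memory; the class name and the injected `extract` function (whatever code actually calls the API) are assumptions for illustration.

```python
import hashlib
from typing import Callable


class OcrCache:
    """In-memory cache keyed by the SHA-256 digest of the PDF bytes."""

    def __init__(self, extract: Callable[[bytes], dict]) -> None:
        self._extract = extract  # function that actually calls the API
        self._results: dict[str, dict] = {}

    def get(self, pdf_bytes: bytes) -> dict:
        key = hashlib.sha256(pdf_bytes).hexdigest()
        if key in self._results:
            return self._results[key]
        result = self._extract(pdf_bytes)
        # Cache only successful results so failures can be retried later.
        if result.get("success", False):
            self._results[key] = result
        return result


# Usage with a stand-in extractor that records how often it is called
calls = []


def fake_extract(pdf_bytes: bytes) -> dict:
    calls.append(pdf_bytes)
    return {"success": True, "text": "hello"}


cache = OcrCache(fake_extract)
cache.get(b"same bytes")
cache.get(b"same bytes")
print(len(calls))  # 1
```

For production use, the dictionary would usually be replaced with a database or key-value store so results survive restarts.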
Integration Patterns
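Document Management System: Extract text when a document is ingested, store it alongside the original PDF, and index it for search.
Common Issues and Solutions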
Issue: Some pages failed to process
- Solution: Check individual page errors in the `pages` array
- Solution: Retry failed pages separately
- Solution: Verify PDF is not corrupted
- Solution: Check if pages are blank or contain only images
Issue: Extracted text is inaccurate
- Solution: Verify PDF quality (for scanned documents, rescan at higher DPI)
- Solution: Check if pages are properly oriented
- Solution: Ensure text is clearly visible in original
- Solution: Try preprocessing to improve contrast
Issue: Processing is slow
- Solution: Scanned PDFs take longer than text-based PDFs (expected)
- Solution: Process large documents asynchronously
- Solution: Consider splitting very large documents
- Solution: Implement parallel processing for batches
Issue: Request times out
- Solution: Increase request timeout (recommend 5+ minutes for large documents)
- Solution: Split PDF into smaller chunks
- Solution: Implement retry logic with exponential backoff
- Solution: Process pages in batches
Issue: No text extracted
- Solution: Check if PDF is password protected
- Solution: Verify PDF is not corrupted
- Solution: Try opening in PDF reader to confirm text layer exists
- Solution: Check PDF version compatibility
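Several of the solutions above mention retry logic with exponential backoff. A minimal, generic sketch follows; the function name and default values are assumptions, and real code would normally catch only errors known to be transient rather than every exception.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delays of base_delay * 1, 2, 4, ... plus up to 10% random jitter
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")
```

A call might look like `retry_with_backoff(lambda: send_pdf(path))`, where `send_pdf` is a hypothetical function that performs the request and raises on a transient failure. The `sleep` parameter exists so tests can substitute a function that records delays instead of waiting.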
Performance Guidelines
Document Size Recommendations:
- Small (1-10 pages): Process synchronously, < 30 seconds
- Medium (11-50 pages): Consider async, 1-5 minutes
- Large (51-200 pages): Always async, 5-20 minutes
- Very Large (200+ pages): Split or batch process, 20+ minutes
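The size recommendations can be encoded as a small helper that picks a processing strategy from the page count. The function name and returned labels are hypothetical; a document of exactly 200 pages is treated as Large here, since the ranges above overlap at that value.

```python
def processing_strategy(page_count: int) -> str:
    """Pick a processing strategy from the page count, following the size tiers above."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if page_count <= 10:
        return "sync"            # Small: process synchronously
    if page_count <= 50:
        return "async_optional"  # Medium: consider async
    if page_count <= 200:
        return "async"           # Large: always async
    return "split"               # Very Large: split or batch process


print(processing_strategy(8))    # sync
print(processing_strategy(120))  # async
```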
Concurrent Processing:
- Safe to process multiple small PDFs in parallel
- Limit concurrent requests based on your rate limits
- Use threading/async for batch operations
- Monitor API response times and adjust concurrency
Optimization Tips:
- Cache results for frequently accessed documents
- Store extracted text in database for quick retrieval
- Use CDN or fast storage for original PDFs
- Implement smart retry logic for transient failures
- Monitor costs and optimize based on document types
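Bounded parallelism for a batch of documents can be sketched with a thread pool. The `extract` callable stands in for whatever function sends one PDF to the API, and the default of four workers is an arbitrary assumption; choose a value that fits your rate limits.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def process_batch(
    pdfs: dict[str, bytes],
    extract: Callable[[bytes], dict],
    max_workers: int = 4,
) -> dict[str, dict]:
    """Run extract over several PDFs with at most max_workers requests in flight."""
    results: dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract, data): name for name, data in pdfs.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # Record the failure instead of aborting the whole batch.
                results[name] = {"success": False, "error": str(exc)}
    return results
```

Threads suit this workload because each worker spends most of its time waiting on the network rather than using the CPU.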
Advanced Use Cases
Multi-language Documents: The OCR engine supports multiple languages automatically. No special configuration needed.
Table Extraction: Tables are preserved in the markdown output (in `raw_response`). Parse the markdown for structured table data.
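Parsing tables out of markdown can be sketched as follows. The parser takes a markdown string as input; where exactly that string sits inside `raw_response` is not specified here, so locating it is left to the caller. It handles only simple pipe-delimited tables and does not support escaped pipes inside cells.

```python
def _split_row(line: str) -> list[str]:
    """Split one pipe-delimited row into trimmed cell strings."""
    stripped = line.strip()
    if stripped.startswith("|"):
        stripped = stripped[1:]
    if stripped.endswith("|"):
        stripped = stripped[:-1]
    return [cell.strip() for cell in stripped.split("|")]


def _is_separator(cells: list[str]) -> bool:
    """True for a header separator row such as |---|:---:|."""
    return all(cell and set(cell) <= set("-: ") for cell in cells)


def _rows_from_block(block: list[str]) -> list[dict[str, str]]:
    header = _split_row(block[0])
    if not _is_separator(_split_row(block[1])):
        return []  # not a table: second line is not a separator row
    return [dict(zip(header, _split_row(line))) for line in block[2:]]


def parse_markdown_tables(markdown: str) -> list[list[dict[str, str]]]:
    """Return one list of row dictionaries per table found in the markdown."""
    tables, block = [], []
    for line in markdown.splitlines() + [""]:  # trailing "" flushes the last block
        if line.strip().startswith("|"):
            block.append(line)
            continue
        if len(block) >= 2:
            tables.append(_rows_from_block(block))
        block = []
    return [table for table in tables if table]


sample = "Items\n\n| Item | Qty |\n|---|---|\n| Widget | 2 |\n| Gadget | 5 |\n\nEnd"
print(parse_markdown_tables(sample))
# [[{'Item': 'Widget', 'Qty': '2'}, {'Item': 'Gadget', 'Qty': '5'}]]
```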
Form Processing:
Extract form fields by combining OCR with structured output:
- Extract text with OCR
- Use Answer API with schema to structure form data
- Validate extracted values
Audit Trails:
Maintain an audit trail by:
- Storing `raw_response` for detailed metadata
- Logging processing timestamps and methods
- Tracking which pages used OCR vs text extraction
- Archiving original PDFs alongside extracted text
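An audit record along these lines could be built from a parsed response. The record layout and function name are hypothetical. Because `raw_response` is documented as present only for OCR-processed pages, its presence is used here to infer which pages went through OCR; that inference is an assumption.

```python
from datetime import datetime, timezone


def build_audit_record(document_id: str, result: dict) -> dict:
    """Summarize how a document was processed, for logging or archival."""
    pages = result.get("pages", [])
    return {
        "document_id": document_id,
        "processed_at": datetime.now(timezone.utc).isoformat(),
        "method": result.get("method"),
        "pages_processed": result.get("pages_processed"),
        "total_pages": result.get("total_pages"),
        # Pages carrying raw_response are assumed to have gone through OCR.
        "ocr_pages": [p["page_number"] for p in pages if "raw_response" in p],
        # Successful pages without raw_response are assumed to be direct extraction.
        "text_extraction_pages": [
            p["page_number"]
            for p in pages
            if p.get("success") and "raw_response" not in p
        ],
        "failed_pages": [p["page_number"] for p in pages if not p.get("success")],
    }


# Illustrative response, not real API output.
sample_result = {
    "success": True,
    "method": "mistral_document_ai",
    "pages_processed": 2,
    "total_pages": 2,
    "pages": [
        {"page_number": 1, "text": "Cover letter", "success": True},
        {"page_number": 2, "text": "Scanned form", "success": True, "raw_response": {}},
    ],
}
record = build_audit_record("doc-001", sample_result)
print(record["ocr_pages"], record["text_extraction_pages"])  # [2] [1]
```

The record contains only JSON-serializable values, so it can be written to a log or stored next to the original PDF.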
