OCR PDF

Overview

The OCR PDF endpoint extracts text from PDF documents of any type, whether they contain selectable text, scanned images, or a combination of both. The API automatically detects the PDF type and uses the appropriate extraction method, so you can process diverse document collections without manual preprocessing.

Intelligent Processing:

Text-based PDFs - Fast direct extraction using PDF text layer (sub-second per page)
Scanned PDFs - Full OCR processing with Mistral Document AI (2-5 seconds per page)
Mixed PDFs - Intelligently processes each page with the appropriate method
Automatic detection - No need to specify document type

Key capabilities:

Multi-page support - Process entire documents in one request
Page-by-page results - Access text from individual pages
Layout preservation - Maintains document structure and formatting
High accuracy - State-of-the-art OCR for scanned documents
Batch processing - Handle multiple pages efficiently
Error resilience - Continues processing even if individual pages fail

Real-world applications:

Contract analysis - Extract text from legal documents
Invoice processing - Automate accounts payable workflows
Document management - Digitize paper archives
Compliance - Make scanned documents searchable
Data extraction - Pull structured data from forms and reports
Research - Extract text from academic papers
Record keeping - Convert physical records to digital text
Accessibility - Make scanned documents screen-reader compatible

How PDF Processing Works

The API uses an intelligent multi-step process:

PDF Analysis - Document is analyzed to determine content type
Method Selection - Each page is categorized as text-based or image-based
Text Extraction - For text PDFs, direct extraction from text layer
OCR Processing - For scanned pages, Mistral Document AI performs OCR
Page Aggregation - Results from all pages are combined
Output Generation - Returns full document text plus per-page breakdowns

Processing time:

Text-based PDFs: < 1 second per page
Scanned PDFs: 2-5 seconds per page
Total time depends on page count and document type
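
The per-page figures above can be turned into a rough planning estimate. The following is a minimal sketch, not part of the API: the function name `estimate_processing_seconds` is hypothetical, and it assumes you already know how many pages are text-based and how many are scanned. Because text pages are only given an upper bound (under 1 second), the low estimate counts them as zero.

```python
def estimate_processing_seconds(text_pages: int, scanned_pages: int) -> tuple[float, float]:
    """Return a rough (low, high) estimate in seconds for one document."""
    if text_pages < 0 or scanned_pages < 0:
        raise ValueError("page counts must be non-negative")
    # Documented figures: under 1 s per text page, 2-5 s per scanned page.
    low = scanned_pages * 2.0
    high = text_pages * 1.0 + scanned_pages * 5.0
    return low, high


low, high = estimate_processing_seconds(text_pages=8, scanned_pages=4)
print(f"Expect roughly {low:.0f}-{high:.0f} seconds")
```

The range is a guide for choosing timeouts, not a guarantee; network transfer and queueing are not included.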

Why automatic detection matters: Many document collections contain both types of PDFs. Manual sorting is time-consuming and error-prone. This endpoint handles everything automatically, optimizing performance while ensuring accuracy.

Text-based vs Scanned PDFs

Understanding the difference helps you set expectations:

Text-based PDFs (Digital PDFs):

Created digitally (Word, Google Docs, etc.)
Contain selectable text layer
Fast to process (no OCR needed)
100% accurate text extraction
Examples: Digital forms, exported documents, e-books

Scanned PDFs (Image PDFs):

Created from physical documents via scanner
Contain images of pages, not text
Requires OCR processing
Accuracy depends on scan quality
Examples: Scanned contracts, historical documents, faxes

Mixed PDFs:

Some pages digital, some scanned
Common in compiled documents
Each page processed optimally
Examples: Reports with scanned attachments, annotated documents

The API automatically handles all three types transparently.

Authentication

All OCR endpoints require authentication via Bearer token in the Authorization header.

Authorization: Bearer ik_your_api_key_here

Request

You can submit PDFs via either file upload or base64-encoded JSON.

Method 1: File Upload (multipart/form-data)

curl -X POST "https://api.incredible.one/ocr/pdf" \
  -H "Authorization: Bearer ik_your_api_key_here" \
  -F "file=@/path/to/document.pdf"

Method 2: Base64 JSON

curl -X POST "https://api.incredible.one/ocr/pdf" \
  -H "Authorization: Bearer ik_your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf": "JVBERi0xLjcKCjEgMCBvYmoKPDwvVHlwZS..."
  }'
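
For reference, the same base64 request could be assembled in Python. This is a sketch using only the standard library; the helper name `build_base64_request` is hypothetical, and it assumes the `pdf` field takes plain standard base64 (no data-URI prefix), as the sample value above suggests. The function builds the request without sending it.

```python
import base64
import json
import urllib.request

OCR_PDF_URL = "https://api.incredible.one/ocr/pdf"


def build_base64_request(pdf_bytes: bytes, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a JSON request carrying the PDF as base64."""
    payload = {"pdf": base64.b64encode(pdf_bytes).decode("ascii")}
    return urllib.request.Request(
        OCR_PDF_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_base64_request(b"%PDF-1.7\n", "ik_your_api_key_here")
# To send it: urllib.request.urlopen(request, timeout=300)
```

Base64 inflates the payload by about a third, so file upload is usually the lighter option for large PDFs.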

Request Parameters

Parameter	Type	Description
`file` or `pdf`	file/base64	PDF file or base64-encoded PDF data

Responses

Success Response (OCR Processing)

{
  "success": true,
  "text": "Extracted text from page 1...\n\nExtracted text from page 2...",
  "method": "mistral_document_ai",
  "pages_processed": 2,
  "total_pages": 10,
  "pages": [
    {
      "page_number": 1,
      "text": "Extracted text from page 1...",
      "success": true,
      "raw_response": {
        "pages": [
          {
            "index": 0,
            "markdown": "...",
            "text": "..."
          }
        ],
        "model": "mistral-document-ai-2505"
      }
    },
    {
      "page_number": 2,
      "text": "Extracted text from page 2...",
      "success": true,
      "raw_response": {
        "pages": [...]
      }
    }
  ]
}

Success Response (Text Extraction)

{
  "success": true,
  "text": "Extracted text from all pages...",
  "method": "text_extraction",
  "pages_processed": 5,
  "total_pages": 5,
  "pages": [
    {
      "page_number": 1,
      "text": "Extracted text from page 1...",
      "success": true
    },
    {
      "page_number": 2,
      "text": "Extracted text from page 2...",
      "success": true
    }
  ]
}

Note: Text-based PDFs (using text_extraction method) don’t include raw_response since they don’t use OCR processing.

Field Reference

Top-Level Fields

success boolean — Whether text extraction succeeded overall.
text string — Concatenated text from all processed pages (pages separated by \n\n).
method string — Extraction method: "mistral_document_ai" (OCR) or "text_extraction" (direct).
pages_processed integer — Number of pages actually processed.
total_pages integer — Total number of pages in the PDF.
pages array — Per-page results (see below).

Per-Page Fields

page_number integer — 1-indexed page number.
text string — Extracted text from this page.
success boolean — Whether extraction succeeded for this page.
error string (optional) — Error message if extraction failed for this page.
raw_response object (optional) — Complete raw response from Mistral Document AI for this page (only for OCR-processed pages).
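
The fields above are enough to tell a complete extraction from a partial one. The sketch below is one possible way to summarize a response; `summarize_result` is a hypothetical helper, and it relies only on the field names documented in this section.

```python
def summarize_result(result: dict) -> dict:
    """Summarize an OCR PDF response using the documented fields."""
    processed = result.get("pages_processed")
    total = result.get("total_pages")
    failed = [
        {"page_number": page.get("page_number"),
         "error": page.get("error", "unknown error")}
        for page in result.get("pages", [])
        if not page.get("success")
    ]
    return {
        "ok": bool(result.get("success")),
        "method": result.get("method"),
        # Complete only when every page in the PDF was processed.
        "complete": processed is not None and processed == total,
        "failed_pages": failed,
    }


sample = {
    "success": True,
    "method": "mistral_document_ai",
    "pages_processed": 2,
    "total_pages": 3,
    "pages": [
        {"page_number": 1, "text": "First page", "success": True},
        {"page_number": 2, "text": "", "success": False, "error": "OCR failed"},
    ],
}
summary = summarize_result(sample)
print(summary["complete"], summary["failed_pages"])
```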

Raw Response Object

The raw_response field contains the complete, unprocessed response from Mistral Document AI:

All fields returned by the API (not just markdown)
Original structure and formatting
Metadata and additional information
Useful for advanced processing or debugging
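
As an illustration of that kind of advanced processing, the sketch below collects the `markdown` strings from `raw_response` and pulls out any pipe-delimited tables. Both helper names are hypothetical. It assumes `raw_response` follows the shape shown in the sample OCR response (`pages[].markdown`) and that tables use the common `| a | b |` markdown syntax; escaped pipes inside cells are not handled.

```python
def collect_markdown(result: dict) -> list[str]:
    """Gather markdown strings from raw_response on OCR-processed pages."""
    chunks = []
    for page in result.get("pages", []):
        raw = page.get("raw_response") or {}
        for raw_page in raw.get("pages", []):
            markdown = raw_page.get("markdown")
            if markdown:
                chunks.append(markdown)
    return chunks


def parse_markdown_tables(markdown: str) -> list[list[list[str]]]:
    """Parse pipe-delimited tables into lists of rows (header row first)."""
    tables, current = [], []
    for line in markdown.splitlines():
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            cells = [cell.strip() for cell in stripped[1:-1].split("|")]
            # Skip the separator row, e.g. | --- | ---: |
            if all(cell and set(cell) <= set("-: ") for cell in cells):
                continue
            current.append(cells)
        elif current:
            # A non-table line ends the table collected so far.
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables


md = "Intro\n\n| Item | Amount |\n| --- | ---: |\n| Paper | 12.50 |\n| Ink | 30.00 |\n\nDone"
print(parse_markdown_tables(md))
```

Text-extraction pages carry no `raw_response`, so `collect_markdown` simply skips them.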

Page Management

How Pages Are Processed

The OCR API processes PDF pages sequentially (one at a time):

Text-based PDFs: Pages are extracted directly using fast text extraction
Scanned PDFs: Each page is converted to an image (at specified DPI) and processed through OCR individually

Accessing Per-Page Results

Use the pages array to access individual page results programmatically:

for page in result["pages"]:
    print(f"Page {page['page_number']}: {page['text']}")

Error Responses

Authentication Required

{
  "error": "Authentication required",
  "message": "API key must be provided in Authorization header as 'Bearer ik_your_api_key'"
}

Status Code: 401 Unauthorized

Invalid API Key

{
  "error": "Invalid API key",
  "message": "The provided API key is invalid, inactive, or expired"
}

Status Code: 503 Service Unavailable

Processing Error

{
  "success": false,
  "error": "Missing 'pdf' or 'file' field in request body"
}

Status Code: 422 Unprocessable Entity
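
Note that the authentication error bodies above have no `success` field, so client code should not index `result["success"]` directly. The following is a minimal sketch of one way to normalize these responses into a single exception; `OcrError` and `raise_for_ocr_error` are hypothetical names, not part of any SDK.

```python
class OcrError(Exception):
    """Raised when the OCR PDF endpoint reports a failure."""


def raise_for_ocr_error(status_code: int, body: dict) -> dict:
    """Return the body on success, otherwise raise OcrError with details."""
    if body.get("success") is True:
        return body
    error = body.get("error", "unknown error")
    message = body.get("message")
    text = f"{error}: {message}" if message else error
    raise OcrError(f"HTTP {status_code}: {text}")


try:
    raise_for_ocr_error(422, {"success": False,
                              "error": "Missing 'pdf' or 'file' field in request body"})
except OcrError as exc:
    print(exc)
```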

Best Practices

Document Preparation:

Use high-quality scans (300+ DPI for optimal results)
Ensure pages are properly oriented before upload
Remove password protection before processing
Keep individual PDF files under 50MB for best performance
For very large documents, consider splitting into sections

Performance Optimization:

Text-based PDFs process much faster than scanned PDFs
Process large documents asynchronously with progress tracking
Implement caching for frequently accessed documents
Consider parallel processing for document batches
Monitor processing time and adjust based on document type

Quality Assurance:

Always check the success field before using extracted text
Review pages_processed vs total_pages to detect failures
Implement retry logic for failed pages
Validate critical data extracted from documents
Maintain original PDFs for reference

Cost Management:

Cache OCR results to avoid reprocessing
Prioritize text-based PDF conversion when creating documents
Batch similar documents together for efficiency
Monitor API usage and optimize based on patterns
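
One simple way to avoid reprocessing is to key cached results on a hash of the PDF bytes. The sketch below is an assumption about how you might structure this, not a feature of the API: `cached_ocr` is a hypothetical helper, and `run_ocr` stands for whatever function in your code actually calls the endpoint and returns the parsed JSON.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_ocr(pdf_bytes: bytes, cache_dir: Path,
               run_ocr: Callable[[bytes], dict]) -> dict:
    """Return a cached result for identical PDF bytes; call run_ocr on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(pdf_bytes).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = run_ocr(pdf_bytes)
    # Only cache successful results so failures can be retried later.
    if result.get("success"):
        cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Hashing the content rather than the filename means a renamed copy of the same document still hits the cache.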

Integration Patterns

Document Management System:

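import os

import requests

# Assumed environment variable name; use whatever your deployment provides.
API_KEY = os.environ["INCREDIBLE_API_KEY"]

def process_uploaded_pdf(pdf_file):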
    """Process uploaded PDF and store extracted text"""
    response = requests.post(
        "https://api.incredible.one/ocr/pdf",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": pdf_file}
    )
    
    result = response.json()
    
    if result.get("success"):
        # Store full document text
        document_text = result["text"]
        
        # Store per-page text for navigation
        pages = {
            page["page_number"]: page["text"]
            for page in result["pages"]
            if page.get("success")
        }
        
        # Save to database
        save_document(
            text=document_text,
            pages=pages,
            method=result["method"],
            page_count=result["pages_processed"]
        )
        
        return document_text
    else:
        raise Exception(f"PDF processing failed: {result.get('error', 'unknown error')}")

Invoice Processing Pipeline:

def extract_invoice_data(pdf_path):
    """Extract and structure invoice data from PDF"""
    # Step 1: Extract text with OCR
    with open(pdf_path, "rb") as f:
        ocr_result = requests.post(
            "https://api.incredible.one/ocr/pdf",
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f}
        ).json()
    
    if not ocr_result.get("success"):
        return None
    
    # Step 2: Structure the extracted text
    invoice_schema = {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "date": {"type": "string"},
            "vendor": {"type": "string"},
            "total": {"type": "number"},
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "amount": {"type": "number"}
                    }
                }
            }
        }
    }
    
    structured_data = client.answer(
        query="Extract invoice details",
        response_format=invoice_schema,
        context=ocr_result["text"]
    )
    
    return structured_data

Batch Document Processing:

import concurrent.futures

import requests

def process_pdf_batch(pdf_paths, max_workers=5):
    """Process multiple PDFs in parallel"""
    def process_single(path):
        with open(path, "rb") as f:
            response = requests.post(
                "https://api.incredible.one/ocr/pdf",
                headers={"Authorization": f"Bearer {API_KEY}"},
                files={"file": f}
            )
        return path, response.json()
    
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(process_single, path) for path in pdf_paths]
        results = {}
        
        for future in concurrent.futures.as_completed(futures):
            path, result = future.result()
            results[path] = result
    
    return results

Searchable Archive Creation:

def create_searchable_archive(pdf_directory):
    """Convert PDF archive to searchable text database"""
    from pathlib import Path
    
    documents = []
    
    for pdf_file in Path(pdf_directory).glob("*.pdf"):
        print(f"Processing: {pdf_file.name}")
        
        with open(pdf_file, "rb") as f:
            result = requests.post(
                "https://api.incredible.one/ocr/pdf",
                headers={"Authorization": f"Bearer {API_KEY}"},
                files={"file": f}
            ).json()
        
        if result.get("success"):
            documents.append({
                "filename": pdf_file.name,
                "text": result["text"],
                "pages": result["pages_processed"],
                "method": result["method"],
                "page_data": result["pages"]
            })
            print(f"✓ Extracted {result['pages_processed']} pages")
        else:
            print(f"✗ Failed: {result.get('error', 'unknown error')}")
    
    # Save to searchable database or index
    index_documents(documents)
    return documents

Common Issues and Solutions

Issue: Some pages failed to process

Solution: Check individual page errors in the pages array
Solution: Retry failed pages separately
Solution: Verify PDF is not corrupted
Solution: Check if pages are blank or contain only images

Issue: Extracted text is garbled or incorrect

Solution: Verify PDF quality (for scanned documents, rescan at higher DPI)
Solution: Check if pages are properly oriented
Solution: Ensure text is clearly visible in original
Solution: Try preprocessing to improve contrast

Issue: Processing is slow

Solution: Scanned PDFs take longer than text-based PDFs (expected)
Solution: Process large documents asynchronously
Solution: Consider splitting very large documents
Solution: Implement parallel processing for batches

Issue: Timeout errors for large documents

Solution: Increase request timeout (recommend 5+ minutes for large documents)
Solution: Split PDF into smaller chunks
Solution: Implement retry logic with exponential backoff
Solution: Process pages in batches
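
The retry-with-backoff suggestion above could look like the sketch below. `call_with_backoff` is a hypothetical helper; `send` is any zero-argument function in your code that performs the request and returns the parsed response. The `sleep` parameter exists only so the delays can be observed or skipped in tests.

```python
import time
from typing import Callable


def call_with_backoff(
    send: Callable[[], dict],
    max_attempts: int = 4,
    base_delay: float = 2.0,
    retry_on: tuple = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call send() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return send()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            sleep(base_delay * (2 ** attempt))
    raise ValueError("max_attempts must be at least 1")
```

When using the `requests` library, pass `retry_on=(requests.exceptions.Timeout, requests.exceptions.ConnectionError)`, since those do not derive from the built-in exceptions used as defaults here, and set a generous `timeout` on the request itself.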

Issue: No text extracted from known text-based PDF

Solution: Check if PDF is password protected
Solution: Verify PDF is not corrupted
Solution: Try opening in PDF reader to confirm text layer exists
Solution: Check PDF version compatibility

Performance Guidelines

Document Size Recommendations:

Small (1-10 pages): Process synchronously, < 30 seconds
Medium (11-50 pages): Consider async, 1-5 minutes
Large (51-200 pages): Always async, 5-20 minutes
Very Large (200+ pages): Split or batch process, 20+ minutes
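
These size bands can be encoded as a small routing function. This is only a sketch: `recommend_strategy` and its return labels are hypothetical, and because the list above places 200 pages in both the Large and Very Large bands, the code treats exactly 200 pages as Large.

```python
def recommend_strategy(page_count: int) -> str:
    """Map a page count to the processing approach suggested above."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if page_count <= 10:
        return "sync"            # Small: process synchronously
    if page_count <= 50:
        return "consider_async"  # Medium: consider async
    if page_count <= 200:
        return "async"           # Large: always async
    return "split_or_batch"      # Very large: split or batch process


for pages in (3, 40, 120, 500):
    print(pages, recommend_strategy(pages))
```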

Concurrent Processing:

Safe to process multiple small PDFs in parallel
Limit concurrent requests based on your rate limits
Use threading/async for batch operations
Monitor API response times and adjust concurrency

Optimization Tips:

Cache results for frequently accessed documents
Store extracted text in database for quick retrieval
Use CDN or fast storage for original PDFs
Implement smart retry logic for transient failures
Monitor costs and optimize based on document types

Advanced Use Cases

Multi-language Documents: The OCR engine supports multiple languages automatically. No special configuration needed.

Table Extraction: Tables are preserved in the markdown output (in raw_response). Parse markdown for structured table data.

Form Processing: Extract form fields by combining OCR with structured output:

Extract text with OCR
Use Answer API with schema to structure form data
Validate extracted values

Compliance & Audit: Maintain audit trails by:

Storing raw_response for detailed metadata
Logging processing timestamps and methods
Tracking which pages used OCR vs text extraction
Archiving original PDFs alongside extracted text
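
An audit record covering the points above might be assembled as in the sketch below. `build_audit_record` is a hypothetical helper. It infers which pages used OCR from the presence of `raw_response`, which this page documents as appearing only on OCR-processed pages; treat that inference as an assumption.

```python
from datetime import datetime, timezone


def build_audit_record(filename: str, result: dict) -> dict:
    """Build a log-friendly record of how a document was processed."""
    pages = result.get("pages", [])
    return {
        "filename": filename,
        "processed_at": datetime.now(timezone.utc).isoformat(),
        "method": result.get("method"),
        "pages_processed": result.get("pages_processed"),
        "total_pages": result.get("total_pages"),
        # raw_response is documented as present only on OCR-processed pages.
        "ocr_pages": [p.get("page_number") for p in pages
                      if p.get("success") and "raw_response" in p],
        "direct_pages": [p.get("page_number") for p in pages
                         if p.get("success") and "raw_response" not in p],
        "failed_pages": [p.get("page_number") for p in pages
                         if not p.get("success")],
    }


record = build_audit_record("contract.pdf", {
    "success": True,
    "method": "mistral_document_ai",
    "pages_processed": 3,
    "total_pages": 3,
    "pages": [
        {"page_number": 1, "text": "a", "success": True, "raw_response": {"pages": []}},
        {"page_number": 2, "text": "b", "success": True},
        {"page_number": 3, "text": "", "success": False, "error": "OCR failed"},
    ],
})
print(record["ocr_pages"], record["direct_pages"], record["failed_pages"])
```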
