Rails Claude Vision API: Extracting Data from PDFs, Receipts and Screenshots with Anthropic

Roger Heykoop
AI in Rails, Ruby on Rails
Rails Claude Vision API guide: extract structured data from PDFs, receipts and screenshots with Anthropic, validate JSON output and ship to production.

A bookkeeping firm called me in March because their three accountants were spending half of every Friday turning a shared Dropbox folder of receipts into rows in a Postgres database. Eight thousand expense reports a month. Photos of crumpled gas station receipts, scanned PDFs of hotel invoices, screenshots of email confirmations, and the occasional iPhone photo of a whiteboard. Their previous attempt was a Tesseract pipeline that worked beautifully for the eleven percent of receipts that were flat, well-lit, and printed in a normal font. For the other eighty-nine percent, a human had to retype the totals. They wanted to know if Claude could do better.

After nineteen years of Rails I have built a lot of OCR pipelines, and the answer was yes — but not the way they expected. The Rails Claude Vision API is not a replacement for OCR. It is a replacement for the human who looks at a messy document and says “the total is forty-seven euros and the merchant is the petrol station near the office.” This post is the playbook I gave that team: how to wire Claude’s vision capabilities into a Rails app, extract structured data from PDFs and images, validate the output, and survive production.

What the Rails Claude Vision API Actually Does

Every Claude 3 and 4 model is multimodal. You send an image (or a PDF) inside a message, you ask a question about it, and you get text back. There is no separate vision endpoint. The same messages API that handles text handles documents and images, and the same model that can write a Rails migration can read a German VAT receipt and tell you the line items.

What this means in practice for a Rails team:

  1. PDFs work directly. You no longer need to rasterize a PDF to PNGs and send each page. Claude accepts PDFs up to 32MB and 100 pages and processes both the text and the rendered visuals.
  2. Image quality matters less than you think. A blurry phone photo of a receipt that Tesseract cannot touch will often parse cleanly with Claude.
  3. You write a prompt, not a parser. There is no template. You describe what you want and validate the JSON that comes back.

The Rails Claude Vision API is the right tool when the document is variable, semi-structured, or messy. It is the wrong tool when you have a million identical forms — for that you want a fixed extraction pipeline. For everything in between, which is most real businesses, it is the cheapest senior accountant you will ever hire.

Setting Up the Anthropic Ruby Client

The official anthropic gem is the cleanest way in. Add it and configure your client:

# Gemfile
gem "anthropic", "~> 1.0"

# config/initializers/anthropic.rb
require "anthropic"

ANTHROPIC = Anthropic::Client.new(
  api_key: Rails.application.credentials.dig(:anthropic, :api_key)
)

Store the key in encrypted credentials, never in .env checked into git. The post on Rails credentials and secrets management covers the full setup if you have not already moved off dotenv.

A bare-minimum vision call against an image looks like this:

require "base64"
require "marcel"

class ReceiptParser
  MODEL = "claude-sonnet-4-6"

  def initialize(image_path)
    @image_path = image_path
  end

  def call
    base64 = Base64.strict_encode64(File.binread(@image_path))
    media_type = Marcel::MimeType.for(Pathname.new(@image_path))

    ANTHROPIC.messages.create(
      model: MODEL,
      max_tokens: 1024,
      messages: [{
        role: "user",
        content: [
          {
            type: "image",
            source: {
              type: "base64",
              media_type: media_type,
              data: base64
            }
          },
          {
            type: "text",
            text: "Extract the merchant, date, total amount and currency from this receipt. Reply with JSON only."
          }
        ]
      }]
    )
  end
end

Three things to notice. First, image content blocks live alongside text blocks in the same content array — order matters, and putting the image before the question gives the model the full visual context first. Second, media_type must match the actual file: send a PNG with image/jpeg and you will get a 400. Third, “JSON only” in the prompt is a request, not a guarantee — we will fix that in the next section.
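
The second point is easy to check before any request goes out. The sketch below sniffs the four image formats the API accepts (JPEG, PNG, GIF, WebP) from their leading bytes using only the standard library. In a Rails app Marcel already does this for you, so treat `sniff_image_media_type` as a hypothetical stand-in that shows what "matches the actual file" means.

```ruby
# Hypothetical helper: returns the media type implied by the file's leading
# bytes, or nil when it is not one of the four supported image formats.
def sniff_image_media_type(bytes)
  head = bytes.byteslice(0, 12).to_s.b
  return "image/png"  if head.start_with?("\x89PNG\r\n\x1A\n".b)
  return "image/jpeg" if head.start_with?("\xFF\xD8\xFF".b)
  return "image/gif"  if head.start_with?("GIF87a".b, "GIF89a".b)
  return "image/webp" if head.byteslice(0, 4) == "RIFF".b && head.byteslice(8, 4) == "WEBP".b
  nil
end

# A file named receipt.jpg that is really a PNG:
declared = "image/jpeg"
actual   = sniff_image_media_type("\x89PNG\r\n\x1A\n".b + "...".b)
puts "mismatch: declared #{declared}, file is #{actual}" if actual != declared
```

Catching the mismatch locally turns an opaque API error into a validation message you control.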

PDFs Without Rasterizing

This is the part that surprises Rails teams who have only used the older vision APIs. Claude reads PDFs natively:

class InvoiceExtractor
  MODEL = "claude-sonnet-4-6"

  def call(pdf_path)
    base64 = Base64.strict_encode64(File.binread(pdf_path))

    ANTHROPIC.messages.create(
      model: MODEL,
      max_tokens: 4096,
      messages: [{
        role: "user",
        content: [
          {
            type: "document",
            source: {
              type: "base64",
              media_type: "application/pdf",
              data: base64
            }
          },
          {
            type: "text",
            text: <<~PROMPT
              Extract the invoice header and line items.
              Return strict JSON matching this shape:
              {
                "invoice_number": string,
                "issue_date": "YYYY-MM-DD",
                "due_date": "YYYY-MM-DD" | null,
                "vendor": { "name": string, "vat_id": string | null },
                "currency": "EUR" | "USD" | "GBP" | string,
                "subtotal_cents": integer,
                "tax_cents": integer,
                "total_cents": integer,
                "line_items": [
                  { "description": string, "quantity": number,
                    "unit_price_cents": integer, "total_cents": integer }
                ]
              }
              All money values must be integer cents. No prose, no markdown.
            PROMPT
          }
        ]
      }]
    )
  end
end

A few production-relevant facts. PDFs are billed for both their extracted text and a rendered image of each page. Anthropic's documentation puts the text at roughly 1,500 to 3,000 tokens per page depending on density, with the image tokens for each rendered page on top. A ten-page invoice therefore runs you at least fifteen to thirty thousand input tokens. Multiply by your volume before you celebrate.
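
To make "multiply by your volume" concrete, here is a back-of-the-envelope sketch using the per-page range quoted above. The method names are mine, and the price per million tokens is deliberately a parameter: the 3.0 in the usage line is a placeholder, not a quoted price.

```ruby
# Per-page range taken from the paragraph above; a rough planning figure,
# not a billing guarantee.
TOKENS_PER_PAGE = (1_500..3_000)

# Estimated input-token range for a PDF with `pages` pages.
def estimated_pdf_tokens(pages)
  (pages * TOKENS_PER_PAGE.begin)..(pages * TOKENS_PER_PAGE.end)
end

# Estimated monthly input cost range in dollars.
# `usd_per_million_tokens` comes from your current price sheet.
def estimated_monthly_cost(pages_per_doc:, docs_per_month:, usd_per_million_tokens:)
  tokens = estimated_pdf_tokens(pages_per_doc)
  low  = tokens.begin * docs_per_month * usd_per_million_tokens / 1_000_000.0
  high = tokens.end   * docs_per_month * usd_per_million_tokens / 1_000_000.0
  low..high
end

p estimated_pdf_tokens(10) # => 15000..30000
p estimated_monthly_cost(pages_per_doc: 10, docs_per_month: 8_000, usd_per_million_tokens: 3.0)
```

Run it with your own volume and price before committing to a design. Output tokens are not included here.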

For PDFs above 32MB, or above 100 pages, you have to split them yourself. The combine_pdf gem or Ghostscript will both do it. Treat each chunk as a separate extraction and merge results in Ruby — do not try to be clever with multi-message conversations to “remember” earlier pages.
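
The bookkeeping around the split is simple enough to sketch. The code below only computes which page ranges go into which chunk and concatenates the per-chunk results. The actual PDF splitting is left to combine_pdf or Ghostscript, and the result shape (a Hash with a "line_items" array per chunk) is an assumption for illustration.

```ruby
MAX_PAGES_PER_REQUEST = 100

# Splits 1..total_pages into consecutive ranges of at most `limit` pages.
def page_chunks(total_pages, limit = MAX_PAGES_PER_REQUEST)
  (1..total_pages).each_slice(limit).map { |pages| pages.first..pages.last }
end

# Merges per-chunk extractions by concatenating line items in chunk order.
# Each element is assumed to be a Hash with a "line_items" array.
def merge_line_items(chunk_results)
  chunk_results.flat_map { |result| result.fetch("line_items", []) }
end

p page_chunks(250) # => [1..100, 101..200, 201..250]
```

Because each chunk is an independent request, a failed chunk can be retried on its own without re-sending the rest of the document.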

Forcing Structured JSON Output

The single most common production failure mode of the Rails Claude Vision API is the model returning prose like “Sure! Here is the invoice data:” before the JSON. You do not want to write a regex to clean that up. You want the API to give you JSON.

The cleanest pattern is the tool_use API. You define a tool whose only job is to receive structured data, you set tool_choice to force its use, and Claude returns a structured tool_use block instead of raw text:

INVOICE_TOOL = {
  name: "record_invoice",
  description: "Record the structured invoice data extracted from the document.",
  input_schema: {
    type: "object",
    properties: {
      invoice_number: { type: "string" },
      issue_date:     { type: "string", pattern: "^\\d{4}-\\d{2}-\\d{2}$" },
      due_date:       { type: ["string", "null"], pattern: "^\\d{4}-\\d{2}-\\d{2}$" },
      currency:       { type: "string", minLength: 3, maxLength: 3 },
      total_cents:    { type: "integer", minimum: 0 },
      line_items: {
        type: "array",
        items: {
          type: "object",
          properties: {
            description:      { type: "string" },
            quantity:         { type: "number" },
            unit_price_cents: { type: "integer" },
            total_cents:      { type: "integer" }
          },
          required: ["description", "total_cents"]
        }
      }
    },
    required: ["invoice_number", "issue_date", "currency", "total_cents", "line_items"]
  }
}

response = ANTHROPIC.messages.create(
  model: "claude-sonnet-4-6",
  max_tokens: 4096,
  tools: [INVOICE_TOOL],
  tool_choice: { type: "tool", name: "record_invoice" },
  messages: [{
    role: "user",
    content: [
      { type: "document", source: { type: "base64", media_type: "application/pdf", data: pdf_base64 } },
      { type: "text", text: "Extract the invoice using the record_invoice tool." }
    ]
  }]
)

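# to_s covers SDK versions that return the block type as a Symbol.
tool_call = response.content.find { |b| b.type.to_s == "tool_use" }
raise "Claude returned no tool_use block" if tool_call.nil?
invoice_data = tool_call.input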

The schema does double duty: it tells the model what shape to produce and gives you a contract to validate against on your side. Forcing the tool makes Claude call it, but that alone does not guarantee every field satisfies the schema, so pair it with the json-schema gem and reject any extraction that does not match. Never write model output into your database unvalidated.
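
Schema validation checks shape, not meaning: the date pattern above happily accepts 2024-13-45. A second pass with plain Ruby catches that class of error. The sketch below is a hypothetical `semantic_errors` helper, assuming the extraction arrives as a Hash with string keys. The line-item arithmetic check is my own suggestion for flagging rows for human review, not something the API provides.

```ruby
require "date"

# Hypothetical second-pass check to run after JSON Schema validation.
# Returns an array of human-readable error messages (empty when clean).
def semantic_errors(invoice)
  errors = []

  # A string can match \d{4}-\d{2}-\d{2} and still not be a calendar date.
  %w[issue_date due_date].each do |key|
    value = invoice[key]
    next if value.nil?
    begin
      Date.iso8601(value)
    rescue ArgumentError, TypeError
      errors << "#{key} is not a real calendar date: #{value.inspect}"
    end
  end

  # Flag line items whose total does not equal quantity x unit price.
  (invoice["line_items"] || []).each_with_index do |item, index|
    quantity = item["quantity"]
    unit     = item["unit_price_cents"]
    next if quantity.nil? || unit.nil?
    expected = (quantity * unit).round
    next if expected == item["total_cents"]
    errors << "line_items[#{index}] total #{item['total_cents']} != quantity x unit price (#{expected})"
  end

  errors
end

puts semantic_errors("issue_date" => "2024-13-45", "line_items" => [])
```

Discounts and rounding can legitimately break the line-item check, so treat its messages as review flags rather than hard rejections.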

The post on LLM function calling in Rails goes deeper into the tool-use API for cases beyond extraction.

Cost Optimization That Actually Matters

Vision tokens are not cheap, and a small Rails app processing receipts can run up a thousand-dollar bill if you do not pay attention. Three patterns I always reach for.

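Resize before you send. Anthropic’s recommendation is no side longer than 1568px. A modern phone photo is 4032x3024. Per Anthropic’s documentation, larger images are scaled down on their side before the model sees them, so sending the raw photo buys no extraction quality. It only costs you upload size, latency, and a good chance of hitting the 5MB image limit: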

require "image_processing/vips"

def prepare_image(path)
  ImageProcessing::Vips
    .source(path)
    .resize_to_limit(1568, 1568)
    .convert("jpeg")
    .saver(quality: 85)
    .call
end
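
To see what that resize does to a phone photo, here is the arithmetic as a stdlib-only sketch. `fit_within` approximates the dimension logic of `resize_to_limit` (libvips may round a pixel differently). `estimated_image_tokens` uses the width × height / 750 rule of thumb from Anthropic's documentation, which also notes that images may be scaled further on their side, so read the figure as a planning ceiling.

```ruby
# Approximates resize_to_limit: shrink to fit inside limit x limit,
# keep the aspect ratio, never enlarge.
def fit_within(width, height, limit = 1568)
  scale = [limit.fdiv(width), limit.fdiv(height), 1.0].min
  [(width * scale).round, (height * scale).round]
end

# Rough token estimate for an image of the given pixel dimensions.
def estimated_image_tokens(width, height)
  (width * height / 750.0).ceil
end

resized = fit_within(4032, 3024)
p resized # => [1568, 1176]
p estimated_image_tokens(*resized)
```

An image that is already within the limit passes through with its dimensions unchanged, which is why `resize_to_limit` is safe to apply to every upload.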

Cache the system prompt. If you are running the same extraction over thousands of documents, the instructions are identical every time. Anthropic’s prompt caching reuses that prefix at a tenth of the cost. The post on Anthropic prompt caching in Rails shows the full setup; the short version is to mark the system block with cache_control: { type: "ephemeral" }.
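
For orientation, this is roughly what the request body looks like on the wire with a cacheable system block. It is a sketch that builds the JSON payload as a plain Hash. The method name is hypothetical, and the keyword your SDK version expects for the system prompt may differ from the wire name, so check the gem's documentation before copying it into a `messages.create` call.

```ruby
require "json"

# Builds the JSON body for a Messages API request whose system prompt is
# marked cacheable. Everything up to and including the block that carries
# cache_control is the prefix that gets reused across requests.
def cached_extraction_body(model:, instructions:, user_content:, max_tokens: 1024)
  {
    model: model,
    max_tokens: max_tokens,
    system: [
      { type: "text", text: instructions, cache_control: { type: "ephemeral" } }
    ],
    messages: [{ role: "user", content: user_content }]
  }
end

body = cached_extraction_body(
  model: "claude-sonnet-4-6",
  instructions: "Extract merchant, date, total and currency from the receipt.",
  user_content: [{ type: "text", text: "(image block and question go here)" }]
)
puts JSON.pretty_generate(body)
```

The per-document image stays in the user message, after the cached prefix. Very short instruction blocks fall below the documented minimum cacheable length, so caching pays off mainly when the instructions and schema are substantial.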

Batch the non-urgent stuff. The Anthropic Message Batches API gives you a fifty percent discount in exchange for a results-within-24-hours SLA. Anything that is not user-facing — overnight ingestion, monthly compliance scans, historical backfills — should run through batches. See Anthropic message batches in Rails for the integration pattern.
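
A batch is a list of ordinary message requests, each tagged with a `custom_id` you choose. The sketch below builds that list as plain Hashes. The input shape (an array of hashes with `:id`, `:media_type` and `:base64`) and the method name are assumptions for illustration, and the call that submits the batch is left out because it depends on your SDK version.

```ruby
# Builds the `requests` array for a Message Batches submission.
# `receipts` is assumed to be an array of hashes with :id, :media_type
# and :base64 keys.
def batch_requests(receipts, model:, prompt:, max_tokens: 1024)
  receipts.map do |receipt|
    {
      custom_id: "receipt-#{receipt[:id]}",
      params: {
        model: model,
        max_tokens: max_tokens,
        messages: [{
          role: "user",
          content: [
            { type: "image",
              source: { type: "base64", media_type: receipt[:media_type], data: receipt[:base64] } },
            { type: "text", text: prompt }
          ]
        }]
      }
    }
  end
end

requests = batch_requests(
  [{ id: 42, media_type: "image/jpeg", base64: "..." }],
  model: "claude-sonnet-4-6",
  prompt: "Extract the merchant, date, total amount and currency."
)
p requests.map { |r| r[:custom_id] } # => ["receipt-42"]
```

Results are matched back by `custom_id` rather than by position, so derive it from a stable record identifier.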

For the bookkeeping client, those three changes took monthly spend from $4,200 to $610 with no quality loss.

Wiring It Into an Active Job Pipeline

A real production pipeline almost never calls the Rails Claude Vision API synchronously from a controller. The user uploads a receipt, you enqueue a job, the job extracts, validates, and writes — and the UI updates over Turbo Streams when the row appears. Here is the skeleton:

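class Receipts::ExtractJob < ApplicationJob
  # dom_id is a view helper; jobs need it included explicitly.
  include ActionView::RecordIdentifier

  # Raised when the extracted data fails schema validation.
  class ExtractionError < StandardError; end

  queue_as :ai

  # Error class names vary between SDK versions; check them against the
  # gem you have installed.
  retry_on Anthropic::Errors::OverloadedError,
           Anthropic::Errors::RateLimitError,
           wait: :polynomially_longer,
           attempts: 8

  discard_on ActiveJob::DeserializationError

  def perform(receipt_id)
    receipt = Receipt.find(receipt_id)
    return if receipt.extracted?

    image_blob = receipt.image.download
    Tempfile.create(["receipt", ".jpg"], binmode: true) do |f|
      f.write(image_blob)
      f.flush

      # Assumes ReceiptParser#call returns the extracted fields as a Hash with
      # string keys (for example the forced tool_use input), not the raw API
      # response, and that prepare_image is the resize helper shown earlier.
      data = ReceiptParser.new(prepare_image(f.path)).call
      validate!(data)

      receipt.update!(
        merchant: data["merchant"],
        purchased_at: data["purchased_at"],
        total_cents: data["total_cents"],
        currency: data["currency"],
        extracted_at: Time.current,
        extraction_model: ReceiptParser::MODEL
      )
    end

    # Push the updated row to the user's open receipts page.
    Turbo::StreamsChannel.broadcast_replace_to(
      "user_#{receipt.user_id}_receipts",
      target: dom_id(receipt),
      partial: "receipts/receipt",
      locals: { receipt: receipt }
    )
  end

  private

  def validate!(data)
    # ReceiptParser::SCHEMA is assumed to hold the JSON Schema for a receipt.
    schema = ReceiptParser::SCHEMA
    errors = JSON::Validator.fully_validate(schema, data)
    raise ExtractionError, errors.join("; ") if errors.any?
  end
end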

Three things this does that toy examples miss. It retries transient API errors with a steeply increasing backoff; the post on Rails Active Job retries with exponential backoff explains why polynomially_longer is the right default. It records extraction_model so that when you later upgrade to a newer Claude version, you can spot drift in your data. And it broadcasts the result over Turbo Streams so the user does not have to refresh.
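
It helps to know how long those eight attempts actually span. Rails documents the base delay for `wait: :polynomially_longer` as `(executions ** 4) + 2` seconds, plus random jitter. The sketch below reproduces the schedule without the jitter. `polynomial_delays` is a helper name of my own, not part of Active Job.

```ruby
# Base delay before each retry for wait: :polynomially_longer, ignoring
# jitter. With `attempts` total tries there are attempts - 1 retries.
def polynomial_delays(attempts)
  (1...attempts).map { |executions| (executions**4) + 2 }
end

delays = polynomial_delays(8)
p delays     # => [3, 18, 83, 258, 627, 1298, 2403]
p delays.sum # => 4690
```

So a receipt that keeps hitting rate limits is retried over roughly 78 minutes before the job gives up, which is worth knowing when you decide what the UI shows in the meantime.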

Handling Documents That Should Not Be Sent

Not every uploaded file should reach Anthropic. Three checks belong before the API call:

  1. Size. Reject anything above 32MB for PDFs or 5MB for images at the controller layer, with a clear error.
  2. Mime type. Use Marcel to verify the actual content, not the filename. Users upload .pdf files that are actually JPEGs, ZIPs, and Word docs.
  3. Sensitivity. A receipt-extraction app is not a place to send a passport scan. If your users may upload identity documents or medical records, run a small classifier first — Claude itself can do this in a 50-token call — and refuse extraction with a clear message rather than silently sending PII to a third party.
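
The first two checks reduce to a small lookup. In the sketch below, `detected_type` is whatever Marcel reports for the upload, the limits are the ones listed above, and `rejection_reason` is a hypothetical helper. The sensitivity check is left out because it needs a model call.

```ruby
MEGABYTE = 1024 * 1024

# Size limits per detected media type, as listed above.
MAX_BYTES = {
  "application/pdf" => 32 * MEGABYTE,
  "image/jpeg"      => 5 * MEGABYTE,
  "image/png"       => 5 * MEGABYTE,
  "image/gif"       => 5 * MEGABYTE,
  "image/webp"      => 5 * MEGABYTE
}.freeze

# Returns nil when the upload may proceed, or a message explaining why not.
def rejection_reason(byte_size:, detected_type:)
  limit = MAX_BYTES[detected_type]
  return "unsupported file type: #{detected_type.inspect}" if limit.nil?
  return "file too large: #{byte_size} bytes (limit #{limit})" if byte_size > limit
  nil
end

puts rejection_reason(byte_size: 6 * MEGABYTE, detected_type: "image/jpeg")
puts rejection_reason(byte_size: 1_000, detected_type: "application/zip")
```

Returning a reason string instead of a boolean gives the controller something specific to show the user.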

You also want a kill switch. A feature flag that lets you pause Claude calls in five seconds when Anthropic has an incident is worth its weight. The post on Rails feature flags with Flipper covers the patterns.

When Not to Use the Rails Claude Vision API

I would be doing you a disservice if I did not list the cases where this is the wrong tool.

  • High-volume, identical forms. A million tax forms with the same layout? A specialized OCR service or a fine-tuned layout model will be ten times cheaper at that scale.
  • Real-time, sub-second latency. Vision calls take two to ten seconds. Use them in async jobs, not on the request path.
  • Regulated extraction with audit requirements. Some compliance regimes require deterministic extraction with explainable provenance. Claude can be part of the workflow but cannot be the only line of evidence.
  • Tiny budgets at scale. If you process a hundred thousand documents a day on a budget of fifty dollars, the math does not work. Resize, cache, batch — but at some volume you need a different tool.
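
The last bullet is worth working through once. The sketch below turns a daily budget into the number of input tokens each document may consume. The price per million tokens is a parameter, and the 3.0 in the example is a placeholder rather than a quoted price.

```ruby
# Input tokens each document can consume under a fixed daily budget.
# `usd_per_million_tokens` comes from your current price sheet.
def affordable_tokens_per_document(daily_budget_usd:, documents_per_day:, usd_per_million_tokens:)
  budget_per_document = daily_budget_usd.fdiv(documents_per_day)
  (budget_per_document / usd_per_million_tokens * 1_000_000).floor
end

p affordable_tokens_per_document(
  daily_budget_usd: 50,
  documents_per_day: 100_000,
  usd_per_million_tokens: 3.0
) # => 166
```

At that placeholder price the budget allows about 166 input tokens per document, a small fraction of what a single image or PDF page consumes, which is the sense in which the math does not work.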

For the bookkeeping client none of those applied, and we shipped the integration in two weeks. Their accountants now spend Friday afternoons reviewing flagged extractions instead of typing every line by hand. The receipts flow through Active Storage, an Active Job runs the extraction, and a small Stimulus controller lets the human approve, edit, or reject. The Vision API is the engine; the pipeline around it is what makes it production-grade.

FAQ

Can the Rails Claude Vision API extract handwritten text?

Yes, and far better than traditional OCR. Claude handles cursive, mixed print and handwriting, multilingual notes, and even partial occlusion. Quality drops on extreme low-light photos and very small handwriting, so resize sensibly and consider a re-take prompt for users when the model returns low-confidence fields. For pure handwriting recognition at extreme scale, a specialized HTR service may still win on cost.

How do I extract data from a multi-page PDF in Rails?

Send the entire PDF as a document content block — Claude reads up to 100 pages and 32MB in a single request. For larger documents, split with combine_pdf or Ghostscript, run extractions in parallel via Active Job, and merge in Ruby. Always validate per-chunk output against a JSON schema before merging, because per-page errors compound quickly.

Is the Anthropic Vision API GDPR-safe for European customers?

Anthropic offers an EU data residency option and a zero data retention agreement on enterprise plans. For a Rails app processing PII for EU customers, you almost certainly want both. Talk to your DPO before you ship and document the data flow in your processing register. Do not assume the default plan is sufficient.

How accurate is Claude on receipts compared to Tesseract?

In my benchmarks on a thousand real-world receipts (mixed phone photos, scans, and PDFs), Claude Sonnet correctly extracted merchant, date, and total on 94% of inputs at the first try. Tesseract on the same set hit 41%. The gap closes on flat, high-quality scans and widens on photos. For mixed real-world inputs, Claude wins decisively, but pair it with confidence checks and human review for anything that drives a financial entry.

Need help wiring Claude into your Rails app — vision, RAG, agents or anything else? TTB Software builds production-grade AI integrations on Rails. We have been doing Rails for nineteen years and AI integrations since the API was a research preview.

#rails-claude-vision #anthropic-vision-api #claude-pdf-extraction #rails-document-ai #rails-llm-integration #ruby-on-rails

About the Author

Roger Heykoop is a senior Ruby on Rails developer with 19+ years of Rails experience and 35+ years in software development. He specializes in Rails modernization, performance optimization, and AI-assisted development.
