AutoChunk API

AutoChunk API Documentation

A single API endpoint that turns business text into AI-retrieval-ready chunks with department, access level, source URL, and per-principal permissions baked into every chunk.

Quickstart

Want to see output before you commit to integration? Try the playground — paste any text, see chunks come back, no API key required.

Ready to integrate? You'll need an API key (request access at hello@autochunk.ai; access is invite-only at this stage). Save it as an environment variable named AUTOCHUNK_KEY, or substitute your key for $AUTOCHUNK_KEY in the examples below.

Send any text payload of up to 2,000,000 characters (roughly 2MB):

curl -X POST https://autochunk.ai/api/v1/chunk \
  -H "x-api-key: $AUTOCHUNK_KEY" \
  -H "content-type: application/json" \
  -d '{
    "source": { "type": "text", "department": "finance" },
    "content": "Payment terms are net 30. Late payments incur a 2% fee per month."
  }'

You'll get back the chunked output:

{
  "source_id": "src_4f2a...",
  "chunk_count": 1,
  "total_tokens": 18,
  "chunks": [{
    "chunk_id": "chk_9b1c...",
    "source_id": "src_4f2a...",
    "chunk_text": "Payment terms are net 30. Late payments incur a 2% fee per month.",
    "summary": null,
    "department": "finance",
    "access_level": null,
    "source_url": null,
    "token_count": 18,
    "embedding_ready": true
  }]
}

Each chunk inherits department, access_level, and source_url from the parent source, plus any per-principal permissions you supplied. Use those tags at retrieval time to enforce access boundaries downstream.

Authentication

Every request to POST /api/v1/chunk must include an x-api-key header with your raw key. Keys start with rh_live_ followed by 48 hex characters. Never commit them to source control or expose them in browser code; keys are server-side credentials.

If your key is compromised, email us and we'll rotate it within an hour.

Security model

AutoChunk is the data layer for permission-aware retrieval. We do not authorize end-user access at runtime; we provide the metadata your retrieval layer needs to enforce its own boundaries. This page explains exactly where that line is.

Data flow

  1. You POST a source with optional access metadata: department, access_level, and a permissions[] array of per-principal ACL entries.
  2. We chunk the content. Every chunk inherits department, access_level, and source_url from the source row. These are denormalized onto each chunk so retrieval-time filters never need a join.
  3. Each entry in source.permissions[] is duplicated into chunk_permissions, one row per chunk × permission pair. So a source with 3 permissions chunked into 7 pieces produces 21 ACL rows.
  4. Your retrieval pipeline reads chunks (typically from a vector DB) and filters on those tags before passing context to your LLM.
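
The fan-out in step 3 can be sketched in a few lines. This illustrates the row count only and is not AutoChunk's storage code; the helper name and the row layout (a chunk_id plus the three permission fields from the request) are assumptions made for the example.

```python
from itertools import product


def fan_out_permissions(chunk_ids, permissions):
    """Return one ACL row per (chunk, permission) pair.

    Hypothetical helper: the row layout is assumed for illustration.
    """
    return [
        {"chunk_id": chunk_id, **perm}
        for chunk_id, perm in product(chunk_ids, permissions)
    ]


chunk_ids = [f"chk_{i}" for i in range(7)]
permissions = [
    {"principal_type": "user", "principal_id": "alice", "permission": "read"},
    {"principal_type": "group", "principal_id": "finance-lead", "permission": "read"},
    {"principal_type": "role", "principal_id": "auditor", "permission": "read"},
]

rows = fan_out_permissions(chunk_ids, permissions)
print(len(rows))  # 7 chunks x 3 permissions = 21 rows
```

Because the row count grows as chunks times permissions, a source with many ACL entries and a small chunk_size can produce a large chunk_permissions table.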

Where enforcement happens

Authorization is your retrieval layer's job. Common patterns:

  • Vector DB metadata filter — Pinecone's filter: { department: "finance" }, Weaviate's where clause, pgvector's WHERE. Most vector DBs let you embed metadata alongside vectors and filter at query time — fastest path.
  • Post-retrieval scrub — after your top-k retrieval, drop any chunk whose access_level the requesting user lacks. Slower than a metadata filter but works on any vector store, including ones without metadata-filter support.
  • Per-principal ACL join — after retrieval, JOIN against chunk_permissions where principal_id matches the user's identity in your IDP. Use this for fine-grained per-user access on top of department/access_level coarse filtering.
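
As a rough sketch of the second and third patterns combined, the function below scrubs a retrieved list in memory. Several things here are assumptions rather than documented behavior: the ordering of access levels from least to most sensitive, the choice to treat a missing access_level as confidential (fail closed), the rule that a chunk with no ACL rows is governed by access_level alone, and the in-memory acl_rows list standing in for a query against chunk_permissions.

```python
# Assumed ordering, least to most sensitive. The API lists the values but
# does not define how they rank against each other.
ACCESS_ORDER = ["public", "internal", "restricted", "confidential"]


def allowed_chunk_ids(acl_rows, user_principals):
    """Chunk ids for which any of the user's principals holds an ACL row."""
    return {
        row["chunk_id"]
        for row in acl_rows
        if (row["principal_type"], row["principal_id"]) in user_principals
    }


def scrub(chunks, user_clearance, acl_rows, user_principals):
    """Drop chunks the user may not see, keeping retrieval order."""
    max_rank = ACCESS_ORDER.index(user_clearance)
    restricted_ids = {row["chunk_id"] for row in acl_rows}
    granted_ids = allowed_chunk_ids(acl_rows, user_principals)
    kept = []
    for chunk in chunks:
        # Untagged chunks are treated as the most sensitive level.
        level = chunk.get("access_level") or "confidential"
        if ACCESS_ORDER.index(level) > max_rank:
            continue
        chunk_id = chunk["chunk_id"]
        if chunk_id in restricted_ids and chunk_id not in granted_ids:
            continue
        kept.append(chunk)
    return kept


chunks = [
    {"chunk_id": "chk_a", "access_level": "internal"},
    {"chunk_id": "chk_b", "access_level": "confidential"},
    {"chunk_id": "chk_c", "access_level": "internal"},
]
acl_rows = [
    {"chunk_id": "chk_c", "principal_type": "group",
     "principal_id": "finance-lead", "permission": "read"},
]

visible = scrub(chunks, "restricted", acl_rows, {("user", "alice")})
print([c["chunk_id"] for c in visible])  # ['chk_a']
```

In production the ACL lookup would more likely be a SQL join than a Python set, but the decision logic stays the same: coarse filter first, per-principal check second.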

What we don't do

  • No IDP integration. We don't know who your users are. principal_id values are arbitrary strings that you map to your own identity provider.
  • No chunk-time access control. If you POST a confidential source, anyone with your API key can chunk it. We tag the output; we don't gate the input. Protect your key like a database password.
  • No encryption at rest beyond Postgres defaults. Chunk text is stored in plaintext in your Supabase project, by design — you need to read it back to embed it. Use Supabase's standard encryption and access controls for that layer.
  • No prompt-injection defense. If a chunk's text contains adversarial instructions, your LLM's context will receive them. Tag-based filtering doesn't protect against malicious content INSIDE a permitted chunk.

What this means in practice

AutoChunk's promise: if you tag your sources accurately and filter on those tags at retrieval time, your AI assistant will never surface a chunk to a user who shouldn't see it. We make the metadata stick. You enforce the boundaries.

For compliance-driven engagements (SOC 2, HIPAA, GDPR), document this division of responsibility in your security review. AutoChunk handles data shape and lineage; your retrieval layer handles authorization. The split is by design and verifiable in the schema (chunks, chunk_permissions) and your retrieval code.

Endpoint reference

POST /api/v1/chunk

Full request shape with every supported field:

{
  "source": {
    "type": "pdf" | "webpage" | "crm" | "sop" | "ticket" | "transcript" | "text" | "other",
    "url": "https://example.com/doc.pdf",
    "title": "MSA — Acme Corp",
    "department": "finance",
    "access_level": "public" | "internal" | "restricted" | "confidential",
    "permissions": [
      {
        "principal_type": "user" | "group" | "role" | "department",
        "principal_id": "finance-lead",
        "permission": "read" | "write"
      }
    ],
    "metadata": { "customer_id": "acme" }
  },
  "content": "...the actual text to chunk...",
  "options": {
    "chunk_size": 512,
    "chunk_overlap": 50,
    "summarize": false
  }
}

Required fields

  • source.type — one of the eight enum values above
  • content — UTF-8 text, 1 to 2,000,000 characters. AutoChunk does not extract text from binary formats; if you have a PDF or HTML page, extract first (e.g. with pdf-parse or cheerio) and send the text.

Optional fields

  • source.url — must be a valid URL. Stored on the source and denormalized onto every chunk for retrieval-time filtering.
  • source.department — free-form string up to 64 chars (suggested: finance, legal, hr, sales, ops, engineering).
  • source.access_level — one of public, internal, restricted, confidential.
  • source.permissions — up to 256 ACL entries, each with a principal type, principal id (your IDP's identifier), and permission. Each entry is duplicated onto every chunk produced from this source.
  • source.metadata — arbitrary JSON object stored alongside the source row.
  • options.chunk_size — target tokens per chunk, 64 to 4096. Default 512.
  • options.chunk_overlap — overlap tokens between consecutive chunks, 0 to 1024 (must be less than chunk_size). Default 50.
  • options.summarize — boolean, default false. Reserved for a future LLM summarization feature; currently a no-op.
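
The limits listed above can be checked client-side before spending a request. The sketch below is a hypothetical pre-flight helper, not part of any AutoChunk SDK; the server's Zod validation remains the source of truth, and character counts may differ slightly between Python and the server's runtime.

```python
SOURCE_TYPES = {"pdf", "webpage", "crm", "sop", "ticket", "transcript", "text", "other"}
ACCESS_LEVELS = {"public", "internal", "restricted", "confidential"}


def validate_request(body):
    """Return a list of problems; an empty list means the body looks valid."""
    problems = []
    source = body.get("source") or {}
    options = body.get("options") or {}
    content = body.get("content") or ""

    if source.get("type") not in SOURCE_TYPES:
        problems.append("source.type must be one of the eight enum values")
    if not 1 <= len(content) <= 2_000_000:
        problems.append("content must be 1 to 2,000,000 characters")
    if len(source.get("department") or "") > 64:
        problems.append("source.department must be at most 64 characters")
    level = source.get("access_level")
    if level is not None and level not in ACCESS_LEVELS:
        problems.append("source.access_level is not a known level")
    if len(source.get("permissions") or []) > 256:
        problems.append("source.permissions allows at most 256 entries")

    size = options.get("chunk_size", 512)       # documented default
    overlap = options.get("chunk_overlap", 50)  # documented default
    if not 64 <= size <= 4096:
        problems.append("options.chunk_size must be 64 to 4096")
    if not 0 <= overlap <= 1024 or overlap >= size:
        problems.append("options.chunk_overlap must be 0 to 1024 and less than chunk_size")
    return problems


ok = {"source": {"type": "text", "department": "finance"},
      "content": "Payment terms are net 30."}
bad = {"source": {"type": "text"}, "content": "x",
       "options": {"chunk_size": 64, "chunk_overlap": 64}}

print(validate_request(ok))   # []
print(validate_request(bad))  # one problem: overlap is not less than chunk_size
```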

Response

200 OK with the structure shown in Quickstart. Chunks are returned in document order and tokenized with gpt-tokenizer (cl100k_base). When a hard token boundary would fall mid-text, the chunker looks back within the last 40% of the window and cuts at the nearest paragraph break (\n\n), falling back to a sentence end (. ), then a line break (\n), so chunks rarely end mid-sentence.
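
To make the boundary behavior concrete, here is a simplified sketch of the look-back rule. It is not AutoChunk's implementation: it counts characters instead of tokens, ignores chunk_overlap, and the function names are invented for the example. Only the separator priority and the 40% look-back come from the description above.

```python
SEPARATORS = ("\n\n", ". ", "\n")  # paragraph, sentence, line, in priority order
LOOKBACK = 0.4                     # fraction of the window searched for a separator


def find_cut(text, start, window):
    """Pick the end index for the chunk that begins at start."""
    hard = min(start + window, len(text))
    if hard == len(text):
        return hard  # the rest of the text fits in one chunk
    earliest = hard - int(window * LOOKBACK)
    for sep in SEPARATORS:
        idx = text.rfind(sep, earliest, hard)
        if idx != -1:
            return idx + len(sep)  # cut just after the separator
    return hard  # no separator in range, so hard-cut


def chunk_text(text, window):
    chunks, start = [], 0
    while start < len(text):
        cut = find_cut(text, start, window)
        chunks.append(text[start:cut])
        start = cut
    return chunks


text = "First sentence here. Second sentence follows. Third one ends."
for piece in chunk_text(text, window=30):
    print(repr(piece))
# 'First sentence here. '
# 'Second sentence follows. '
# 'Third one ends.'
```

A run of text with no separator inside the look-back range falls through to the hard cut, which is how very long paragraphs end up split mid-sentence.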

Extraction (PDF, HTML, DOCX)

POST /api/v1/extract turns binary documents into clean UTF-8 text suitable for chunking. Compose with /api/v1/chunk to go from "raw file" to "tagged chunks" in two API calls. Same authentication, same monthly quota, same per-minute rate limit as /api/v1/chunk.

Request

Send the file as multipart/form-data with a file field. An optional format field (one of pdf, html, docx, text) overrides autodetection when you already know the type.

Supported formats

  • PDF — extracted via unpdf. Returns metadata.pages. Image-only / scanned PDFs return 422 empty_output — OCR is not supported in v1.
  • HTML — extracted via cheerio. <script>, <style>, <iframe>, and other non-content elements are stripped. Prefers <main> or <article> if present, falls back to <body>.
  • DOCX — extracted via mammoth. Modern .docx only; legacy .doc, .pptx, and .xlsx aren't supported and return 422 extraction_failed (.pptx and .xlsx are ZIP-based like .docx but use different schemas).
  • Plain text — UTF-8 passthrough. .txt, .md, .markdown.

Limits

  • 10MB file size cap; larger files return 413 payload_too_large
  • ~25 second processing timeout (Vercel function limit)
  • Counts as 1 request against your monthly quota
  • The per-minute rate limit (60 req/min) applies, the same as for /api/v1/chunk

Response

{
  "source_id": "extracted-lzkx2k7s",
  "format": "pdf",
  "text": "Master Services Agreement\n\nThis Master Services Agreement (the Agreement)...",
  "metadata": {
    "pages": 7,
    "characters": 12453,
    "words": 2189,
    "extraction_method": "unpdf",
    "tokens": 3142
  }
}

Curl: extract → chunk in one pipeline

# 1. Extract text from a PDF (or HTML, DOCX, plain text) and capture the text field.
#    One extract call is enough; each call counts against your quota.
TEXT=$(curl -s -X POST https://autochunk.ai/api/v1/extract \
  -H "x-api-key: $AUTOCHUNK_KEY" \
  -F "file=@contract.pdf" | jq -r .text)

# 2. Send the extracted text to /chunk
curl -X POST https://autochunk.ai/api/v1/chunk \
  -H "x-api-key: $AUTOCHUNK_KEY" \
  -H "content-type: application/json" \
  -d "$(jq -n --arg c "$TEXT" '{source: {type: "pdf"}, content: $c}')"

Error codes

All errors return JSON with the shape { "error": { "code": "...", "message": "..." } }. On 400 invalid_request there's also an issues array with the specific Zod failures.

HTTP  error.code              When it fires
400   invalid_json            Body wasn't parseable JSON.
400   invalid_request         Body parsed but failed Zod validation. Check the issues array.
400   empty_content           content was empty or whitespace-only after cleanup.
401   missing_api_key         No x-api-key header on the request.
401   invalid_api_key         Header present but key not recognized.
401   key_disabled            Key was explicitly disabled (compromise, suspended account).
429   monthly_quota_exceeded  Your monthly request limit was reached. Resets the 1st of next month UTC.
429   rate_limit_exceeded     More than 60 requests in the last 60 seconds. Retry-After: 60 header included.
500   internal_error          Unhandled error inside the handler. Will appear in our Sentry; rare.
500   auth_unavailable        Couldn't reach Supabase to verify your key. Retry with backoff.
400   invalid_content_type    /api/v1/extract: request was not multipart/form-data.
400   missing_file            /api/v1/extract: multipart body had no 'file' field.
413   payload_too_large       /api/v1/extract: file exceeds 10MB.
415   unsupported_format      /api/v1/extract: file format couldn't be detected (not PDF/HTML/DOCX/text).
422   extraction_failed       /api/v1/extract: file matched a format but parser failed (corrupt, encrypted, unsupported variant).
422   empty_output            /api/v1/extract: file parsed but produced no text. Often a scanned/image-only PDF.
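
One way a client might act on these codes is to separate transient failures from ones that need a fix on the caller's side. The sketch below is an assumption-laden example rather than official guidance: it treats only rate_limit_exceeded and auth_unavailable as retryable (the two codes the table pairs with Retry-After or backoff), and the helper names and backoff constants are invented.

```python
# Codes the documentation describes as transient. Treating only these two as
# retryable is an assumption of this sketch.
RETRYABLE_CODES = {"rate_limit_exceeded", "auth_unavailable"}


def error_code(payload):
    """Pull error.code out of a parsed error body, or None if it is missing."""
    return (payload.get("error") or {}).get("code")


def retry_delay(code, attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retrying, or None if a retry will not help.

    attempt is zero-based; retry_after is the Retry-After header value, if any.
    """
    if code not in RETRYABLE_CODES:
        return None
    if retry_after is not None:
        return float(retry_after)         # the server told us how long to wait
    return min(cap, base * 2 ** attempt)  # exponential backoff with a ceiling


payload = {"error": {"code": "rate_limit_exceeded", "message": "slow down"}}
print(retry_delay(error_code(payload), attempt=0, retry_after="60"))  # 60.0
print(retry_delay("auth_unavailable", attempt=3))                      # 8.0
print(retry_delay("invalid_request", attempt=0))                       # None
```

monthly_quota_exceeded is deliberately left out of the retryable set: it shares the 429 status but will not clear until the quota resets or is raised.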

Rate limits

Two ceilings, both per API key:

  • Per-minute burst — 60 requests per rolling 60-second window by default. Exceeding this returns 429 rate_limit_exceeded with a Retry-After: 60 header. Heavier integrations (data backfills, batch jobs) can have this raised on a per-key basis; email us with your expected sustained throughput and we'll bump it.
  • Monthly quota — set per key when you're onboarded. Default is 1,000 successful requests per month. Exceeding returns 429 monthly_quota_exceeded. Email us if you're approaching the cap and we'll raise it.

Only successful (200) responses count toward your monthly quota. Failed requests (4xx, 5xx) and rate-limited requests do not. Concurrent in-flight bursts can briefly exceed the per-minute ceiling by a few requests — that's a known characteristic of the count-based limiter, acceptable for our scale.
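
A client can stay under the per-minute ceiling by tracking its own send times. The class below is a minimal single-threaded sketch of a rolling-window limiter; the class name and the injectable clock are choices made for the example, and it only approximates the server-side counter, so a 429 is still possible and should be handled.

```python
import time
from collections import deque


class RollingWindowLimiter:
    """Track send times and report how long to wait before the next request."""

    def __init__(self, max_requests=60, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock   # injectable so the limiter can be tested without sleeping
        self.sent = deque()  # timestamps of requests inside the current window

    def wait_time(self):
        """Seconds until another request fits in the window (0.0 if it fits now)."""
        now = self.clock()
        while self.sent and now - self.sent[0] >= self.window_seconds:
            self.sent.popleft()  # forget requests that left the window
        if len(self.sent) < self.max_requests:
            return 0.0
        return self.window_seconds - (now - self.sent[0])

    def record(self):
        """Call once for every request actually sent."""
        self.sent.append(self.clock())


# Demonstration with a fake clock and a tiny limit of 2 requests per window.
fake_now = [0.0]
limiter = RollingWindowLimiter(max_requests=2, clock=lambda: fake_now[0])
limiter.record()
limiter.record()
print(limiter.wait_time())  # 60.0
fake_now[0] = 45.0
print(limiter.wait_time())  # 15.0
fake_now[0] = 60.0
print(limiter.wait_time())  # 0.0
```

In a backfill loop you would call wait_time(), sleep for that long if it is non-zero, then record() and send. Setting max_requests to about 50 leaves headroom below the default limit of 60.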

Code samples

JavaScript / Node.js

const res = await fetch("https://autochunk.ai/api/v1/chunk", {
  method: "POST",
  headers: {
    "x-api-key": process.env.AUTOCHUNK_KEY,
    "content-type": "application/json",
  },
  body: JSON.stringify({
    source: { type: "text", department: "finance", access_level: "restricted" },
    content: documentText,
  }),
});

if (!res.ok) {
  const { error } = await res.json();
  throw new Error(`AutoChunk ${res.status}: ${error.code} - ${error.message}`);
}

const { chunks } = await res.json();
// chunks is now ready to embed and store in your vector DB

Python

import os
import requests

resp = requests.post(
    "https://autochunk.ai/api/v1/chunk",
    headers={
        "x-api-key": os.environ["AUTOCHUNK_KEY"],
        "content-type": "application/json",
    },
    json={
        "source": {"type": "text", "department": "finance", "access_level": "restricted"},
        "content": document_text,
    },
    timeout=30,  # requests has no default timeout; 30 seconds is an arbitrary choice
)
resp.raise_for_status()
chunks = resp.json()["chunks"]

Current limitations

  • No OCR. Image-only or scanned PDFs return 422 empty_output from /api/v1/extract. Run them through your own OCR (Tesseract, AWS Textract, Google Document AI) and POST the resulting text to /api/v1/chunk directly.
  • Binary extraction is PDF/HTML/DOCX only. Legacy .doc, .pptx, .xlsx, .rtf, and image formats aren't supported. Email if your workflow needs one of these and we'll prioritize.
  • No streaming. Each request is a single round-trip with the full chunked response. For documents over ~500KB, consider splitting your upload into multiple calls.
  • No per-call summarization. options.summarize is reserved for a future feature. Don't depend on it returning anything but null today.
  • English-tuned tokenizer. Uses gpt-tokenizer (cl100k_base). Works for non-English text but token counts may diverge from production LLMs you're embedding with.
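
For the no-streaming limitation, one possible way to split a large document before upload is to cut on paragraph breaks and keep each part under a byte budget. This is a sketch under assumptions: the helper name is invented, the 500,000-byte default mirrors the ~500KB guidance, and a single paragraph larger than the budget is sent as its own oversized part instead of being cut.

```python
def split_for_upload(text, max_bytes=500_000):
    """Group paragraphs into parts whose UTF-8 size stays near max_bytes."""
    parts, current, current_size = [], [], 0
    for paragraph in text.split("\n\n"):
        size = len(paragraph.encode("utf-8")) + 2  # +2 for the "\n\n" joiner
        if current and current_size + size > max_bytes:
            parts.append("\n\n".join(current))
            current, current_size = [], 0
        current.append(paragraph)
        current_size += size
    if current:
        parts.append("\n\n".join(current))
    return parts


document = "\n\n".join(["x" * 40] * 5)  # five 40-byte paragraphs
parts = split_for_upload(document, max_bytes=100)
print([len(p) for p in parts])           # [82, 82, 40]
print("\n\n".join(parts) == document)    # True
```

Each part would be sent as its own POST to /api/v1/chunk and would presumably come back with its own source_id, so keep a mapping from parts to the original document if you need lineage across calls.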

Troubleshooting

I'm getting 401 invalid_api_key but I just got my key emailed to me.

Confirm you're sending the raw key in the x-api-key header, not the SHA-256 hash. The key starts with rh_live_ followed by 48 hex characters. Watch for trailing whitespace from copy-paste.
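
A quick local check can catch both problems before a request is sent. The helper below is hypothetical; it assumes the 48 hex characters may be upper or lower case, which the documentation does not specify.

```python
import re

# rh_live_ followed by exactly 48 hex characters (case is an assumption).
KEY_PATTERN = re.compile(r"rh_live_[0-9a-fA-F]{48}")


def clean_key(raw):
    """Strip stray whitespace and confirm the value looks like a raw key."""
    key = raw.strip()
    if not KEY_PATTERN.fullmatch(key):
        raise ValueError("value does not look like a raw rh_live_ key")
    return key


pasted = "rh_live_" + "ab" * 24 + " \n"  # fake key with trailing whitespace
print(len(clean_key(pasted)))            # 56: the 8-character prefix plus 48 hex characters
```

A bare SHA-256 hash is 64 hex characters with no prefix, so it fails this check.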

Every request returns 500 auth_unavailable.

Supabase is temporarily unreachable from our side. Retry with exponential backoff. If it persists more than ~60 seconds, email us — we get paged on this.

Chunks contain mid-sentence breaks I didn't expect.

The chunker only retreats up to 40% of the window when looking for a paragraph/sentence boundary. If your document has very long paragraphs (>2x your chunk_size), some chunks will hard-cut. Either raise chunk_size or pre-segment your text on natural boundaries.

My token_count adds up to more than total_tokens in the response.

With chunk_overlap > 0, neighboring chunks share tokens by design. total_tokens reports the unique source tokens; chunk token_counts can sum higher because of overlap.
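
The arithmetic can be worked through with an idealized model. The sketch below assumes hard cuts at exactly chunk_size tokens with exactly chunk_overlap shared tokens between neighbors; the real chunker also retreats to paragraph and sentence boundaries, so actual counts will differ, but the reason the sum exceeds total_tokens is the same.

```python
def ideal_chunk_counts(total_tokens, chunk_size=512, chunk_overlap=50):
    """Token count of each chunk under idealized hard cuts (total_tokens > 0)."""
    counts, start = [], 0
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    while True:
        end = min(start + chunk_size, total_tokens)
        counts.append(end - start)
        if end == total_tokens:
            break
        start += step
    return counts


counts = ideal_chunk_counts(1000)
print(counts)                                # [512, 512, 76]
print(sum(counts))                           # 1100
print(sum(counts) - 50 * (len(counts) - 1))  # 1000, the unique source tokens
```

Three chunks share two overlap regions of 50 tokens each, which accounts for the extra 100 tokens in the sum.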

I'm hitting the per-minute rate limit during a backfill.

60 req/min is intended for steady-state workloads. For backfills, throttle to ~50 req/min or email us — we can raise the limit on a per-key basis if you have a legitimate batch job.

Support

One person reads every email at hello@autochunk.ai. Reasonable response time: same business day for paying customers, within 48 hours for prospects. Include your key_prefix (first 8 characters of your key) and any error response body when reporting bugs.