What does AutoChunk do?

AutoChunk is a single-purpose REST API that turns business documents (PDF, HTML, DOCX, plain text) into AI-retrieval-ready chunks. Each chunk preserves source lineage, department ownership, access level, source URL, and per-principal permissions, so retrieval-layer filtering can enforce who-sees-what without leaking restricted content into LLM context.

What file formats does AutoChunk support?

POST /api/v1/extract accepts PDF, HTML, DOCX, and plain text (.txt, .md). Files up to 10MB. Image-only / scanned PDFs return 422 empty_output — OCR is not supported in v1.

How does AutoChunk handle access control?

AutoChunk is the data layer, not the enforcement layer. Every chunk we return carries department, access_level, source_url, and per-principal ACL rows. Your retrieval pipeline filters on those tags at query time (Pinecone metadata filter, pgvector WHERE clause, post-retrieval scrub). AutoChunk makes the metadata stick; your retrieval layer enforces the boundaries.

What are the rate limits?

Two ceilings per API key. Per-minute burst: 60 requests per rolling 60-second window by default, raisable per-key on request. Monthly quota: 1,000 successful requests by default, raisable per-key on request. Failed requests do not count toward the monthly quota.

How do I get an API key?

AutoChunk is invite-only. Email hello@autochunk.ai with your use case or submit the form at autochunk.ai. We reply within 48 hours.

Where does AutoChunk fit in a RAG stack?

AutoChunk handles the parsing + chunking + permission-tagging layer. Customers compose with their preferred embedding model (Voyage, OpenAI, Cohere), vector database (Pinecone, Weaviate, Supabase pgvector, Qdrant), and LLM (Anthropic, OpenAI). The parsing+chunking output is designed to be embedding-ready and metadata-rich for retrieval-time access control.

How is this different from LangChain text splitters or LlamaIndex?

LangChain and LlamaIndex are SDKs you install and run yourself. AutoChunk is a managed API with permission tagging baked into the data shape. The differentiator is per-chunk access metadata — department, access_level, per-principal ACL rows — that ride through your embedding pipeline into your vector DB and back out at retrieval. Use LangChain/LlamaIndex if you don't need access control at retrieval; use AutoChunk if you do.

AutoChunk API Documentation

An API that turns business text into AI-retrieval-ready chunks, with department, access level, source URL, and per-principal permissions baked into every chunk.

Quickstart

Want to see output before you commit to integration? Try the playground — paste any text, see chunks come back, no API key required.

Ready to integrate? You'll need an API key (request access at hello@autochunk.ai; access is invite-only at this stage). Save it as the AUTOCHUNK_KEY environment variable, or replace $AUTOCHUNK_KEY below with the key itself.

Send any text payload up to ~2MB:

curl -X POST https://autochunk.ai/api/v1/chunk \
  -H "x-api-key: $AUTOCHUNK_KEY" \
  -H "content-type: application/json" \
  -d '{
    "source": { "type": "text", "department": "finance" },
    "content": "Payment terms are net 30. Late payments incur a 2% fee per month."
  }'

You'll get back the chunked output:

{
  "source_id": "src_4f2a...",
  "chunk_count": 1,
  "total_tokens": 18,
  "chunks": [{
    "chunk_id": "chk_9b1c...",
    "source_id": "src_4f2a...",
    "chunk_text": "Payment terms are net 30. Late payments incur a 2% fee per month.",
    "summary": null,
    "department": "finance",
    "access_level": null,
    "source_url": null,
    "token_count": 18,
    "embedding_ready": true
  }]
}

Each chunk inherits department, access_level, and source_url from the parent source, plus any per-principal permissions you supplied. Use those tags at retrieval time to enforce access boundaries downstream.

Authentication

Every request to POST /api/v1/chunk and POST /api/v1/extract must include an x-api-key header with your raw key. Keys start with rh_live_ followed by 48 hex characters. Keys are server-side credentials: never commit them to source control or expose them in browser code.

If your key is compromised, email us and we'll rotate it within an hour.

Security model

AutoChunk is the data layer for permission-aware retrieval. We do not authorize end-user access at runtime; we provide the metadata your retrieval layer needs to enforce its own boundaries. This page explains exactly where that line is.

Data flow

You POST a source with optional access metadata: department, access_level, and a permissions[] array of per-principal ACL entries.
We chunk the content. Every chunk inherits department, access_level, and source_url from the source row. These are denormalized onto each chunk so retrieval-time filters never need a join.
Each entry in source.permissions[] is duplicated into chunk_permissions, one row per chunk × permission pair. So a source with 3 permissions chunked into 7 pieces produces 21 ACL rows.
Your retrieval pipeline reads chunks (typically from a vector DB) and filters on those tags before passing context to your LLM.
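The permission fan-out in the steps above can be sketched in a few lines. This is an illustration of the row count only, not AutoChunk's implementation; the chunk IDs, principals, and the expand_permissions helper are hypothetical.

```python
from itertools import product


def expand_permissions(chunk_ids, permissions):
    """Return one ACL row per (chunk, permission) pair."""
    return [
        {"chunk_id": chunk_id, **perm}
        for chunk_id, perm in product(chunk_ids, permissions)
    ]


# Hypothetical IDs and principals, for illustration only.
chunk_ids = [f"chk_{i}" for i in range(7)]
permissions = [
    {"principal_type": "group", "principal_id": "finance-lead", "permission": "read"},
    {"principal_type": "user", "principal_id": "alice", "permission": "read"},
    {"principal_type": "role", "principal_id": "auditor", "permission": "read"},
]

acl_rows = expand_permissions(chunk_ids, permissions)
print(len(acl_rows))  # 7 chunks x 3 permissions = 21
```

Because the row count is chunks times permissions, long documents with many ACL entries grow the chunk_permissions table quickly; smaller permission lists (groups rather than individual users) keep it manageable.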

Where enforcement happens

Authorization is your retrieval layer's job. Common patterns:

Vector DB metadata filter — Pinecone's filter: { department: "finance" }, Weaviate's where clause, pgvector's WHERE. Most vector DBs let you store metadata alongside vectors and filter at query time. This is usually the fastest path.
Post-retrieval scrub — after your top-k retrieval, drop any chunk whose access_level the requesting user lacks. Slower than a metadata filter but works on any vector store, including ones without metadata-filter support.
Per-principal ACL join — after retrieval, JOIN against chunk_permissions where principal_id matches the user's identity in your IDP. Use this for fine-grained per-user access on top of department/access_level coarse filtering.
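A post-retrieval scrub combined with a per-principal check might look like the sketch below. Several things here are assumptions rather than documented behavior: the ranking of access levels from least to most sensitive, denying chunks that carry no access_level (fail closed), and treating any ACL row (read or write) as granting visibility. The scrub function and sample data are hypothetical.

```python
# Assumed ordering, least to most sensitive. The API defines the values, not a ranking.
ACCESS_RANK = {"public": 0, "internal": 1, "restricted": 2, "confidential": 3}


def scrub(chunks, user_clearance, user_principals=frozenset(), acl_rows=()):
    """Drop retrieved chunks that the requesting user may not see.

    chunks: dicts with chunk_id and access_level, as returned by /api/v1/chunk.
    user_clearance: the highest access_level the user holds.
    user_principals: set of (principal_type, principal_id) pairs for the user.
    acl_rows: chunk_permissions-style rows; a chunk that has rows requires a match.
    """
    max_rank = ACCESS_RANK[user_clearance]
    principals_by_chunk = {}
    for row in acl_rows:
        key = (row["principal_type"], row["principal_id"])
        principals_by_chunk.setdefault(row["chunk_id"], set()).add(key)

    allowed = []
    for chunk in chunks:
        rank = ACCESS_RANK.get(chunk.get("access_level"))
        if rank is None or rank > max_rank:
            continue  # untagged or above clearance: fail closed
        required = principals_by_chunk.get(chunk["chunk_id"])
        if required is not None and not (required & set(user_principals)):
            continue  # chunk has ACL rows and none match this user
        allowed.append(chunk)
    return allowed


chunks = [
    {"chunk_id": "a", "access_level": "public"},
    {"chunk_id": "b", "access_level": "confidential"},
    {"chunk_id": "c", "access_level": None},
    {"chunk_id": "d", "access_level": "internal"},
]
acl = [{"chunk_id": "d", "principal_type": "group",
        "principal_id": "finance-lead", "permission": "read"}]

visible = scrub(chunks, "internal", {("group", "finance-lead")}, acl)
print([c["chunk_id"] for c in visible])
```

Run the scrub before any chunk text is placed into the LLM prompt, so filtered content never reaches the model's context.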

What we don't do

No IDP integration. We don't know who your users are. principal_id values are arbitrary strings that you map to your own identity provider.
No chunk-time access control. If you POST a confidential source, anyone with your API key can chunk it. We tag the output; we don't gate the input. Protect your key like a database password.
No encryption at rest beyond Postgres defaults. Chunk text is stored in plaintext in your Supabase project, by design — you need to read it back to embed it. Use Supabase's standard encryption and access controls for that layer.
No prompt-injection defense. If a chunk's text contains adversarial instructions, your LLM's context will receive them. Tag-based filtering doesn't protect against malicious content INSIDE a permitted chunk.

What this means in practice

AutoChunk's promise: if you tag your sources accurately and filter on those tags at retrieval time, your AI assistant will never surface a chunk to a user who shouldn't see it. We make the metadata stick. You enforce the boundaries.

For compliance-driven engagements (SOC 2, HIPAA, GDPR), document this division of responsibility in your security review. AutoChunk handles data shape and lineage; your retrieval layer handles authorization. The split is by design and verifiable in the schema (chunks, chunk_permissions) and your retrieval code.

Endpoint reference

POST /api/v1/chunk

Full request shape with every supported field:

{
  "source": {
    "type": "pdf" | "webpage" | "crm" | "sop" | "ticket" | "transcript" | "text" | "other",
    "url": "https://example.com/doc.pdf",
    "title": "MSA — Acme Corp",
    "department": "finance",
    "access_level": "public" | "internal" | "restricted" | "confidential",
    "permissions": [
      {
        "principal_type": "user" | "group" | "role" | "department",
        "principal_id": "finance-lead",
        "permission": "read" | "write"
      }
    ],
    "metadata": { "customer_id": "acme" }
  },
  "content": "...the actual text to chunk...",
  "options": {
    "chunk_size": 512,
    "chunk_overlap": 50,
    "summarize": false
  }
}

Required fields

source.type — one of the eight enum values above
content — UTF-8 text, 1 to 2,000,000 characters. /api/v1/chunk does not extract text from binary formats; if you have a PDF, HTML page, or DOCX file, extract the text first with POST /api/v1/extract (or your own tooling, e.g. pdf-parse or cheerio) and send the result.

Optional fields

source.url — must be a valid URL. Stored on the source and denormalized onto every chunk for retrieval-time filtering.
source.department — free-form string up to 64 chars (suggested: finance, legal, hr, sales, ops, engineering).
source.access_level — one of public, internal, restricted, confidential.
source.permissions — up to 256 ACL entries, each with a principal type, principal id (your IDP's identifier), and permission. Each entry is duplicated onto every chunk produced from this source.
source.metadata — arbitrary JSON object stored alongside the source row.
options.chunk_size — target tokens per chunk, 64 to 4096. Default 512.
options.chunk_overlap — overlap tokens between consecutive chunks, 0 to 1024 (must be less than chunk_size). Default 50.
options.summarize — boolean, default false. Reserved for a future LLM summarization feature; currently a no-op.
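The documented ranges for the chunking options can be checked client-side before a request is sent, which avoids a round trip that would end in 400 invalid_request. The sketch below mirrors only the limits listed above; validate_options is a hypothetical helper, and the server's own validation remains authoritative.

```python
def validate_options(chunk_size=512, chunk_overlap=50):
    """Return a list of problems; an empty list means the options fit the documented ranges."""
    problems = []
    if not 64 <= chunk_size <= 4096:
        problems.append("chunk_size must be between 64 and 4096")
    if not 0 <= chunk_overlap <= 1024:
        problems.append("chunk_overlap must be between 0 and 1024")
    if chunk_overlap >= chunk_size:
        problems.append("chunk_overlap must be less than chunk_size")
    return problems


print(validate_options())          # defaults pass
print(validate_options(128, 128))  # overlap equal to size is rejected
```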

Response

200 OK with the structure shown in Quickstart. Chunks are returned in document order and tokenized using the GPT tokenizer. The chunker retreats from a hard token boundary to the nearest paragraph break (\n\n), then sentence (. ), then line break (\n) within the last 40% of the window, so chunks rarely end mid-sentence.
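The boundary retreat described above can be approximated as follows. This is a simplified sketch, not the service's chunker: it measures the window in characters rather than tokens to stay dependency-free, it ignores overlap, and the find_cut name and retreat_pct parameter are hypothetical.

```python
def find_cut(text, start, window, retreat_pct=40):
    """Pick the end index for a chunk that begins at `start`.

    Looks backwards from the hard boundary, within the last `retreat_pct`
    percent of the window, for a paragraph break, then a sentence end,
    then a line break. Falls back to the hard boundary if none is found.
    """
    hard_end = min(start + window, len(text))
    if hard_end == len(text):
        return hard_end  # final chunk: nothing to retreat from
    earliest = hard_end - (window * retreat_pct) // 100
    for sep in ("\n\n", ". ", "\n"):  # paragraph, then sentence, then line
        idx = text.rfind(sep, earliest, hard_end)
        if idx != -1:
            return idx + len(sep)  # the separator stays with the chunk that ends
    return hard_end  # no boundary in range: hard cut


sample = "aaaa bbbb cccc.\n\ndd. eeee ffff gggg"
cut = find_cut(sample, 0, 24)
print(repr(sample[:cut]))  # ends at the paragraph break, not the later sentence end
```

The fallback branch is the reason very long paragraphs can still produce a mid-sentence cut: when no separator falls inside the retreat range, the hard boundary is used.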

Extraction (PDF, HTML, DOCX)

POST /api/v1/extract turns binary documents into clean UTF-8 text suitable for chunking. Compose with /api/v1/chunk to go from "raw file" to "tagged chunks" in two API calls. Same authentication, same monthly quota, same per-minute rate limit as /api/v1/chunk.

Request

Send the file as multipart/form-data with a file field. Optional format field (one of pdf, html, docx, text) overrides autodetection if you know the type.
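For readers who want to see the request shape, here is a sketch that builds (but does not send) the multipart body using only the Python standard library. The file and format field names, the x-api-key header, and the URL come from this page; the build_extract_request helper, the application/octet-stream part type, and reading "10MB" as 10,000,000 bytes are assumptions. In practice an HTTP client such as requests builds this body for you.

```python
import urllib.request
import uuid

EXTRACT_URL = "https://autochunk.ai/api/v1/extract"
MAX_BYTES = 10 * 1000 * 1000  # assumed reading of the documented 10MB cap


def build_extract_request(api_key, filename, data, fmt=None):
    """Build a multipart/form-data POST for /api/v1/extract without sending it."""
    if len(data) > MAX_BYTES:
        raise ValueError("file exceeds the 10MB cap; the API would return 413")
    boundary = uuid.uuid4().hex
    parts = []
    if fmt is not None:  # optional override of format autodetection
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="format"'
            f"\r\n\r\n{fmt}\r\n".encode()
        )
    parts.append(
        (
            f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
            f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'
        ).encode()
        + data
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        EXTRACT_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "x-api-key": api_key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Placeholder key and bytes, for illustration only.
req = build_extract_request("rh_live_example", "contract.pdf", b"%PDF-1.7 ...", fmt="pdf")
# To send it: urllib.request.urlopen(req, timeout=30)
```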

Supported formats

PDF — extracted via unpdf. Returns metadata.pages. Image-only / scanned PDFs return 422 empty_output — OCR is not supported in v1.
HTML — extracted via cheerio. <script>, <style>, <iframe>, and other non-content elements are stripped. Prefers <main> or <article> if present, falls back to <body>.
DOCX — extracted via mammoth. Modern .docx only; legacy .doc, .pptx, and .xlsx aren't supported and return 422 extraction_failed (.pptx and .xlsx are ZIP-based like .docx but use different schemas; legacy .doc is a separate binary format).
Plain text — UTF-8 passthrough. .txt, .md, .markdown.

Limits

10MB file size cap (returns 413 payload_too_large above)
~25 second processing timeout (Vercel function limit)
Counts as 1 request against your monthly quota
Per-minute rate limit (60 req/min) applies same as /api/v1/chunk

Response

{
  "source_id": "extracted-lzkx2k7s",
  "format": "pdf",
  "text": "Master Services Agreement\n\nThis Master Services Agreement (the Agreement)...",
  "metadata": {
    "pages": 7,
    "characters": 12453,
    "words": 2189,
    "extraction_method": "unpdf",
    "tokens": 3142
  }
}

Curl: extract → chunk in one pipeline

# 1. Extract text from a PDF (or HTML, DOCX, plain text) and keep the .text field
TEXT=$(curl -s -X POST https://autochunk.ai/api/v1/extract \
  -H "x-api-key: $AUTOCHUNK_KEY" \
  -F "file=@contract.pdf" | jq -r .text)

# 2. Wrap the extracted text in a JSON body and send it to /chunk
curl -X POST https://autochunk.ai/api/v1/chunk \
  -H "x-api-key: $AUTOCHUNK_KEY" \
  -H "content-type: application/json" \
  -d "$(jq -n --arg c "$TEXT" '{source: {type: "pdf"}, content: $c}')"

Error codes

All errors return JSON with the shape { "error": { "code": "...", "message": "..." } }. On 400 invalid_request there's also an issues array with the specific Zod failures.

HTTP	error.code	When it fires
400	invalid_json	Body wasn't parseable JSON.
400	invalid_request	Body parsed but failed Zod validation. Check the issues array.
400	empty_content	content was empty or whitespace-only after cleanup.
401	missing_api_key	No x-api-key header on the request.
401	invalid_api_key	Header present but key not recognized.
401	key_disabled	Key was explicitly disabled (compromise, suspended account).
429	monthly_quota_exceeded	Your monthly request limit was reached. Resets the 1st of next month UTC.
429	rate_limit_exceeded	More than 60 requests in the last 60 seconds. Retry-After: 60 header included.
500	internal_error	Unhandled error inside the handler. Will appear in our Sentry; rare.
500	auth_unavailable	Couldn't reach Supabase to verify your key. Retry with backoff.
400	invalid_content_type	/api/v1/extract: request was not multipart/form-data.
400	missing_file	/api/v1/extract: multipart body had no 'file' field.
413	payload_too_large	/api/v1/extract: file exceeds 10MB.
415	unsupported_format	/api/v1/extract: file format couldn't be detected (not PDF/HTML/DOCX/text).
422	extraction_failed	/api/v1/extract: file matched a format but parser failed (corrupt, encrypted, unsupported variant).
422	empty_output	/api/v1/extract: file parsed but produced no text. Often a scanned/image-only PDF.
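One way to act on the table above is to separate errors worth retrying from errors that need a fix on the caller's side. The sketch below retries only rate_limit_exceeded and auth_unavailable, the two codes this page describes as transient. Treating every other code (including internal_error) as non-retryable is an assumption, as are the retry_delay helper and its backoff parameters.

```python
# (HTTP status, error.code) pairs treated as transient in this sketch.
RETRYABLE = {(429, "rate_limit_exceeded"), (500, "auth_unavailable")}


def retry_delay(status, code, headers, attempt, base=1.0, cap=60.0):
    """Return seconds to wait before retrying, or None if a retry will not help.

    headers: mapping of response headers. A plain dict is case-sensitive, so
    pass a case-insensitive mapping (requests provides one) or normalize keys.
    attempt: zero-based count of retries already made.
    """
    if (status, code) not in RETRYABLE:
        return None
    if status == 429:
        retry_after = headers.get("Retry-After")
        if retry_after is not None:
            return float(retry_after)  # honor the server's hint
    return min(cap, base * 2 ** attempt)  # exponential backoff with a ceiling


print(retry_delay(429, "rate_limit_exceeded", {"Retry-After": "60"}, 0))
print(retry_delay(500, "auth_unavailable", {}, 3))
print(retry_delay(429, "monthly_quota_exceeded", {}, 0))
```

A monthly_quota_exceeded response shares the 429 status with rate_limit_exceeded, so the check has to look at error.code rather than the status alone.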

Rate limits

Two ceilings, both per API key:

Per-minute burst — 60 requests per rolling 60-second window by default. Exceeding this returns 429 rate_limit_exceeded with a Retry-After: 60 header. Heavier integrations (data backfills, batch jobs) can have this raised on a per-key basis; email us with your expected sustained throughput and we'll bump it.
Monthly quota — set per key when you're onboarded. Default is 1,000 successful requests per month. Exceeding returns 429 monthly_quota_exceeded. Email us if you're approaching the cap and we'll raise it.

Only successful (200) responses count toward your monthly quota. Failed requests (4xx, 5xx) and rate-limited requests do not. Concurrent in-flight bursts can briefly exceed the per-minute ceiling by a few requests — that's a known characteristic of the count-based limiter, acceptable for our scale.

Code samples

JavaScript / Node.js

const res = await fetch("https://autochunk.ai/api/v1/chunk", {
  method: "POST",
  headers: {
    "x-api-key": process.env.AUTOCHUNK_KEY,
    "content-type": "application/json",
  },
  body: JSON.stringify({
    source: { type: "text", department: "finance", access_level: "restricted" },
    content: documentText,
  }),
});

if (!res.ok) {
  const { error } = await res.json();
  throw new Error(`AutoChunk ${res.status}: ${error.code} - ${error.message}`);
}

const { chunks } = await res.json();
// chunks is now ready to embed and store in your vector DB

Python

import os
import requests

resp = requests.post(
    "https://autochunk.ai/api/v1/chunk",
    headers={
        "x-api-key": os.environ["AUTOCHUNK_KEY"],
        "content-type": "application/json",
    },
    json={
        "source": {"type": "text", "department": "finance", "access_level": "restricted"},
        "content": document_text,
    },
    timeout=30,  # avoid hanging forever if the connection stalls
)
resp.raise_for_status()
chunks = resp.json()["chunks"]

Current limitations

No OCR. Image-only or scanned PDFs return 422 empty_output from /api/v1/extract. Run them through your own OCR (Tesseract, AWS Textract, Google Document AI) and POST the resulting text to /api/v1/chunk directly.
Binary extraction is PDF/HTML/DOCX only. Legacy .doc, .pptx, .xlsx, .rtf, and image formats aren't supported. Email if your workflow needs one of these and we'll prioritize.
No streaming. Each request is a single round-trip with the full chunked response. For documents over ~500KB consider chunking your *upload* into multiple calls.
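Splitting a large document into several calls could be done on paragraph boundaries, as in the sketch below. The split_for_upload helper and the 500,000-byte default are assumptions based on the rough figure above. Note that each part is submitted as its own request, so keeping the parts associated with one document (for example through source.metadata) is left to the caller.

```python
def split_for_upload(text, max_bytes=500_000):
    """Group paragraphs into parts of roughly max_bytes UTF-8 bytes each.

    A single paragraph larger than max_bytes becomes its own oversized part;
    this sketch does not split inside paragraphs.
    """
    parts, current, size = [], [], 0
    for para in text.split("\n\n"):
        para_bytes = len(para.encode("utf-8")) + 2  # +2 for the rejoining "\n\n"
        if current and size + para_bytes > max_bytes:
            parts.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += para_bytes
    if current:
        parts.append("\n\n".join(current))
    return parts


document = "\n\n".join(["x" * 10] * 5)
parts = split_for_upload(document, max_bytes=25)
print([len(p) for p in parts])
```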
No per-call summarization. options.summarize is reserved for a future feature. Don't depend on it returning anything but null today.
English-tuned tokenizer. Uses gpt-tokenizer (cl100k_base). Works for non-English text, but token counts may diverge from those of the embedding model or LLM you use downstream.

Troubleshooting

“I'm getting 401 invalid_api_key but I just got my key emailed to me.”

Confirm you're sending the raw key in the x-api-key header, not the SHA-256 hash. The key starts with rh_live_ followed by 48 hex characters. Watch for trailing whitespace from copy-paste.
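A quick local check can catch the most common causes before a request goes out. The sketch below tests only the documented shape (rh_live_ plus 48 hex characters); accepting upper-case hex digits is an assumption, and diagnose_key is a hypothetical helper that cannot tell whether a well-formed key is actually valid.

```python
import re

KEY_PATTERN = re.compile(r"rh_live_[0-9a-fA-F]{48}")


def diagnose_key(raw):
    """Return a list of likely problems with a key string; empty means it looks well-formed."""
    problems = []
    if raw != raw.strip():
        problems.append("leading or trailing whitespace")
    key = raw.strip()
    if not key.startswith("rh_live_"):
        # A 64-character hex string here may be a SHA-256 hash rather than the raw key.
        problems.append("missing rh_live_ prefix")
    elif not KEY_PATTERN.fullmatch(key):
        problems.append("expected 48 hex characters after rh_live_")
    return problems


example = "rh_live_" + "ab12" * 12  # made-up key with the documented shape
print(diagnose_key(example))
print(diagnose_key(example + "\n"))
```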

“Every request returns 500 auth_unavailable.”

Supabase is temporarily unreachable from our side. Retry with exponential backoff. If it persists more than ~60 seconds, email us — we get paged on this.

“Chunks contain mid-sentence breaks I didn't expect.”

The chunker only retreats up to 40% of the window when looking for a paragraph/sentence boundary. If your document has very long paragraphs (>2x your chunk_size), some chunks will hard-cut. Either raise chunk_size or pre-segment your text on natural boundaries.

“My token_count adds up to more than total_tokens in the response.”

With chunk_overlap > 0, neighboring chunks share tokens by design. total_tokens reports the unique source tokens; chunk token_counts can sum higher because of overlap.
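The arithmetic can be seen with fixed windows. The sketch below assumes each chunk starts chunk_size - chunk_overlap tokens after the previous one and ignores the boundary retreat, so real chunk sizes will differ; window_token_counts is a hypothetical helper used only to show why the per-chunk sum exceeds the source total.

```python
def window_token_counts(total_tokens, chunk_size=512, chunk_overlap=50):
    """Token counts of fixed windows, each starting chunk_size - chunk_overlap after the last."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be less than chunk_size")
    stride = chunk_size - chunk_overlap
    counts = []
    start = 0
    while True:
        end = min(start + chunk_size, total_tokens)
        counts.append(end - start)
        if end == total_tokens:
            break
        start += stride
    return counts


counts = window_token_counts(1000)
print(counts, sum(counts))  # [512, 512, 76] 1100
```

Here a 1,000-token source yields three chunks whose counts sum to 1,100: the extra 100 tokens are the two 50-token overlaps, each counted in both neighbors.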

“I'm hitting the per-minute rate limit during a backfill.”

60 req/min is intended for steady-state workloads. For backfills, throttle to ~50 req/min or email us — we can raise the limit on a per-key basis if you have a legitimate batch job.

Support

One person reads every email at hello@autochunk.ai. Typical response time: same business day for paying customers, within 48 hours for prospects. Include your key_prefix (first 8 characters of your key) and any error response body when reporting bugs.