AutoChunk API Documentation
A single API endpoint that turns business text into AI-retrieval-ready chunks with department, access level, source URL, and per-principal permissions baked into every chunk.
Quickstart
Want to see output before you commit to integration? Try the playground — paste any text, see chunks come back, no API key required.
Ready to integrate? You'll need an API key (request access at hello@autochunk.ai; access is invite-only at this stage). Save it as an env var named AUTOCHUNK_KEY, or replace $AUTOCHUNK_KEY below with the key itself.
Send any text payload up to ~2MB:
curl -X POST https://autochunk.ai/api/v1/chunk \
-H "x-api-key: $AUTOCHUNK_KEY" \
-H "content-type: application/json" \
-d '{
"source": { "type": "text", "department": "finance" },
"content": "Payment terms are net 30. Late payments incur a 2% fee per month."
}'
You'll get back the chunked output:
{
"source_id": "src_4f2a...",
"chunk_count": 1,
"total_tokens": 18,
"chunks": [{
"chunk_id": "chk_9b1c...",
"source_id": "src_4f2a...",
"chunk_text": "Payment terms are net 30. Late payments incur a 2% fee per month.",
"summary": null,
"department": "finance",
"access_level": null,
"source_url": null,
"token_count": 18,
"embedding_ready": true
}]
}
Each chunk inherits department, access_level, and source_url from the parent source, plus any per-principal permissions you supplied. Use those tags at retrieval time to enforce access boundaries downstream.
Authentication
Every request to POST /api/v1/chunk must include an x-api-key header with your raw key. Keys look like rh_live_ followed by 48 hex characters. Never commit them to source control or expose them in browser code; keys are server-side credentials.
If your key is compromised, email us and we'll rotate it within an hour.
Security model
AutoChunk is the data layer for permission-aware retrieval. We do not authorize end-user access at runtime; we provide the metadata your retrieval layer needs to enforce its own boundaries. This page explains exactly where that line is.
Data flow
- You POST a source with optional access metadata: department, access_level, and a permissions[] array of per-principal ACL entries.
- We chunk the content. Every chunk inherits department, access_level, and source_url from the source row. These are denormalized onto each chunk so retrieval-time filters never need a join.
- Each entry in source.permissions[] is duplicated into chunk_permissions, one row per chunk × permission pair. So a source with 3 permissions chunked into 7 pieces produces 21 ACL rows.
- Your retrieval pipeline reads chunks (typically from a vector DB) and filters on those tags before passing context to your LLM.
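The fan-out in the third step can be sketched in a few lines. This is only an illustration of the row count, not AutoChunk's implementation; the function name expand_permissions and the exact row layout are assumptions, with only chunk_id and the three permission fields taken from this page.

```python
from itertools import product


def expand_permissions(chunk_ids, permissions):
    """Return one ACL row per (chunk, permission) pair."""
    return [
        {"chunk_id": chunk_id, **perm}
        for chunk_id, perm in product(chunk_ids, permissions)
    ]


# Hypothetical data: 7 chunks and 3 permission entries.
chunk_ids = [f"chk_{i}" for i in range(7)]
permissions = [
    {"principal_type": "group", "principal_id": "finance-lead", "permission": "read"},
    {"principal_type": "user", "principal_id": "alice", "permission": "read"},
    {"principal_type": "role", "principal_id": "auditor", "permission": "read"},
]

rows = expand_permissions(chunk_ids, permissions)
print(len(rows))  # 21
```

Row count grows as chunks × permissions, so a smaller chunk_size on a source with many ACL entries multiplies the rows you store.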
Where enforcement happens
Authorization is your retrieval layer's job. Common patterns:
- Vector DB metadata filter. Pinecone's filter: { department: "finance" }, Weaviate's where clause, pgvector's WHERE. Most vector DBs let you store metadata alongside vectors and filter at query time, which is the fastest path.
- Post-retrieval scrub. After your top-k retrieval, drop any chunk whose access_level the requesting user lacks. Slower than a metadata filter but works on any vector store, including ones without metadata-filter support.
- Per-principal ACL join. After retrieval, JOIN against chunk_permissions where principal_id matches the user's identity in your IDP. Use this for fine-grained per-user access on top of department/access_level coarse filtering.
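As a rough sketch of the second and third patterns combined, the filter below runs in application code after retrieval. Everything about the policy is an assumption for illustration: the ordering of access levels, the shape of the user record (clearance, departments, principals), treating untagged chunks as confidential, and the in-memory acl_rows list standing in for a chunk_permissions query.

```python
# Assumed ordering from least to most sensitive; adjust to your own policy.
ACCESS_ORDER = ["public", "internal", "restricted", "confidential"]


def can_read(chunk, user, acl_rows):
    """Illustrative policy: coarse tag checks first, then per-principal ACL rows."""
    # Fail closed: an untagged chunk is treated as the most sensitive level.
    level = chunk.get("access_level") or "confidential"
    if ACCESS_ORDER.index(level) > ACCESS_ORDER.index(user["clearance"]):
        return False
    department = chunk.get("department")
    if department is not None and department not in user["departments"]:
        return False
    # Fine-grained check: if the chunk has ACL rows, one of them must grant
    # "read" to a principal the user maps to.
    rows = [r for r in acl_rows if r["chunk_id"] == chunk["chunk_id"]]
    if not rows:
        return True
    return any(
        r["permission"] == "read" and r["principal_id"] in user["principals"]
        for r in rows
    )


def scrub(chunks, user, acl_rows):
    """Drop every retrieved chunk the user may not read, keeping order."""
    return [c for c in chunks if can_read(c, user, acl_rows)]


user = {
    "clearance": "restricted",
    "departments": {"finance"},
    "principals": {"alice", "finance-lead"},
}
retrieved = [
    {"chunk_id": "c1", "department": "finance", "access_level": "internal"},
    {"chunk_id": "c2", "department": "finance", "access_level": "confidential"},
    {"chunk_id": "c3", "department": "finance", "access_level": "internal"},
    {"chunk_id": "c4", "department": "finance", "access_level": None},
    {"chunk_id": "c5", "department": "finance", "access_level": "restricted"},
]
acl_rows = [
    {"chunk_id": "c3", "principal_id": "bob", "permission": "read"},
    {"chunk_id": "c5", "principal_id": "finance-lead", "permission": "read"},
]

allowed = scrub(retrieved, user, acl_rows)
print([c["chunk_id"] for c in allowed])  # ['c1', 'c5']
```

Run the scrub before any chunk text reaches the prompt, so a denied chunk never enters the LLM's context.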
What we don't do
- No IDP integration. We don't know who your users are. principal_id values are arbitrary strings that you map to your own identity provider.
- No chunk-time access control. If you POST a confidential source, anyone with your API key can chunk it. We tag the output; we don't gate the input. Protect your key like a database password.
- No encryption at rest beyond Postgres defaults. Chunk text is stored in plaintext in your Supabase project, by design: you need to read it back to embed it. Use Supabase's standard encryption and access controls for that layer.
- No prompt-injection defense. If a chunk's text contains adversarial instructions, your LLM's context will receive them. Tag-based filtering doesn't protect against malicious content INSIDE a permitted chunk.
What this means in practice
AutoChunk's promise: if you tag your sources accurately and filter on those tags at retrieval time, your AI assistant will never surface a chunk to a user who shouldn't see it. We make the metadata stick. You enforce the boundaries.
For compliance-driven engagements (SOC 2, HIPAA, GDPR), document this division of responsibility in your security review. AutoChunk handles data shape and lineage; your retrieval layer handles authorization. The split is by design and verifiable in the schema (chunks, chunk_permissions) and your retrieval code.
Endpoint reference
POST /api/v1/chunk
Full request shape with every supported field:
{
"source": {
"type": "pdf" | "webpage" | "crm" | "sop" | "ticket" | "transcript" | "text" | "other",
"url": "https://example.com/doc.pdf",
"title": "MSA — Acme Corp",
"department": "finance",
"access_level": "public" | "internal" | "restricted" | "confidential",
"permissions": [
{
"principal_type": "user" | "group" | "role" | "department",
"principal_id": "finance-lead",
"permission": "read" | "write"
}
],
"metadata": { "customer_id": "acme" }
},
"content": "...the actual text to chunk...",
"options": {
"chunk_size": 512,
"chunk_overlap": 50,
"summarize": false
}
}
Required fields
- source.type: one of the eight enum values above.
- content: UTF-8 text, 1 to 2,000,000 characters. This endpoint does not extract text from binary formats; if you have a PDF or HTML page, extract the text first (e.g. with /api/v1/extract, pdf-parse, or cheerio) and send the result.
Optional fields
- source.url: must be a valid URL. Stored on the source and denormalized onto every chunk for retrieval-time filtering.
- source.department: free-form string up to 64 chars (suggested: finance, legal, hr, sales, ops, engineering).
- source.access_level: one of public, internal, restricted, confidential.
- source.permissions: up to 256 ACL entries, each with a principal type, principal id (your IDP's identifier), and permission. Each entry is duplicated onto every chunk produced from this source.
- source.metadata: arbitrary JSON object stored alongside the source row.
- options.chunk_size: target tokens per chunk, 64 to 4096. Default 512.
- options.chunk_overlap: overlap tokens between consecutive chunks, 0 to 1024 (must be less than chunk_size). Default 50.
- options.summarize: boolean, default false. Reserved for a future LLM summarization feature; currently a no-op.
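If you want to catch a 400 invalid_request before spending a request, the documented ranges can be checked locally. The sketch below mirrors only the limits listed on this page; the function name validate_chunk_request is hypothetical, and the server's own validation remains authoritative and may check more than this.

```python
SOURCE_TYPES = {"pdf", "webpage", "crm", "sop", "ticket", "transcript", "text", "other"}
ACCESS_LEVELS = {"public", "internal", "restricted", "confidential"}


def validate_chunk_request(body):
    """Return a list of problems; an empty list means these local checks pass."""
    problems = []
    source = body.get("source") or {}
    if source.get("type") not in SOURCE_TYPES:
        problems.append("source.type must be one of the eight enum values")
    content = body.get("content") or ""
    if not 1 <= len(content) <= 2_000_000:
        problems.append("content must be 1 to 2,000,000 characters")
    department = source.get("department")
    if department is not None and len(department) > 64:
        problems.append("source.department must be at most 64 characters")
    level = source.get("access_level")
    if level is not None and level not in ACCESS_LEVELS:
        problems.append("source.access_level is not a known level")
    if len(source.get("permissions") or []) > 256:
        problems.append("source.permissions allows at most 256 entries")
    options = body.get("options") or {}
    size = options.get("chunk_size", 512)        # documented default
    overlap = options.get("chunk_overlap", 50)   # documented default
    if not 64 <= size <= 4096:
        problems.append("options.chunk_size must be 64 to 4096")
    if not 0 <= overlap <= 1024 or overlap >= size:
        problems.append("options.chunk_overlap must be 0 to 1024 and less than chunk_size")
    return problems


ok = {"source": {"type": "text", "department": "finance"}, "content": "Payment terms are net 30."}
bad = {"source": {"type": "text"}, "content": "x", "options": {"chunk_size": 512, "chunk_overlap": 600}}
print(validate_chunk_request(ok))   # []
print(validate_chunk_request(bad))  # one problem, about chunk_overlap
```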
Response
200 OK with the structure shown in Quickstart. Chunks are returned in document order, and token counts come from gpt-tokenizer (cl100k_base). When a chunk would end at a hard token boundary, the chunker retreats to the nearest paragraph break (\n\n), then sentence end (. ), then line break (\n) within the last 40% of the window, so chunks rarely end mid-sentence.
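The boundary retreat described above can be approximated as follows. This is a simplified sketch, not the service's code: it works on characters instead of tokens so it runs without a tokenizer, and the function name find_cut and its parameters are made up for illustration.

```python
def find_cut(text, start, window, retreat=0.4):
    """Pick the end index for a chunk that starts at `start`.

    Prefers a paragraph break, then a sentence end, then a line break,
    but only looks inside the last `retreat` fraction of the window.
    """
    hard_end = min(start + window, len(text))
    if hard_end == len(text):
        return hard_end  # last chunk: nothing to retreat from
    floor = hard_end - int(window * retreat)
    for separator in ("\n\n", ". ", "\n"):
        # rfind returns the occurrence closest to the hard boundary.
        position = text.rfind(separator, floor, hard_end)
        if position != -1:
            return position + len(separator)
    return hard_end  # no natural boundary in range: hard cut


sample = "Some intro text here\n\nOk. And more words follow after this"
cut = find_cut(sample, start=0, window=30)
print(repr(sample[:cut]))  # 'Some intro text here\n\n'
```

In the sample both a paragraph break and a sentence end fall inside the retreat range, and the paragraph break wins because it is tried first.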
Extraction (PDF, HTML, DOCX)
POST /api/v1/extract turns binary documents into clean UTF-8 text suitable for chunking. Compose with /api/v1/chunk to go from "raw file" to "tagged chunks" in two API calls. Same authentication, same monthly quota, same per-minute rate limit as /api/v1/chunk.
Request
Send the file as multipart/form-data with a file field. An optional format field (one of pdf, html, docx, text) overrides autodetection if you know the type.
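For readers who are not using curl, here is one way the multipart request might be assembled with only the Python standard library. The field names file and format and the x-api-key header come from this page; the helper name build_extract_request, the application/octet-stream part type, and the hand-built boundary are assumptions of this sketch. An HTTP client library with multipart support would do the same job in fewer lines.

```python
import urllib.request
import uuid

EXTRACT_URL = "https://autochunk.ai/api/v1/extract"


def build_extract_request(filename, data, api_key, fmt=None):
    """Build (but do not send) a multipart/form-data POST for /api/v1/extract."""
    boundary = uuid.uuid4().hex
    parts = []
    if fmt is not None:
        # Optional "format" field: one of pdf, html, docx, text.
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="format"\r\n\r\n'
            f'{fmt}\r\n'.encode()
        )
    file_header = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'
    ).encode()
    parts.append(file_header + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        EXTRACT_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "x-api-key": api_key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Placeholder bytes and key; read your real file and key instead.
request = build_extract_request("contract.pdf", b"%PDF-1.7 ...", "rh_live_example", fmt="pdf")
print(request.get_method(), request.full_url)
```

To send it, pass the request to urllib.request.urlopen(request, timeout=30) and parse the JSON body of the response.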
Supported formats
- PDF: extracted via unpdf. Returns metadata.pages. Image-only / scanned PDFs return 422 empty_output; OCR is not supported in v1.
- HTML: extracted via cheerio. <script>, <style>, <iframe>, and other non-content elements are stripped. Prefers <main> or <article> if present, falls back to <body>.
- DOCX: extracted via mammoth. Modern .docx only; legacy .doc, .pptx, and .xlsx aren't supported and return 422 extraction_failed (.pptx and .xlsx are ZIP-based like .docx but use different schemas; legacy .doc is an older binary format).
- Plain text: UTF-8 passthrough for .txt, .md, .markdown.
Limits
- 10MB file size cap (returns 413 payload_too_large above)
- ~25 second processing timeout (Vercel function limit)
- Counts as 1 request against your monthly quota
- Per-minute rate limit (60 req/min) applies same as /api/v1/chunk
Response
{
"source_id": "extracted-lzkx2k7s",
"format": "pdf",
"text": "Master Services Agreement\n\nThis Master Services Agreement (the Agreement)...",
"metadata": {
"pages": 7,
"characters": 12453,
"words": 2189,
"extraction_method": "unpdf",
"tokens": 3142
}
}
Curl: extract → chunk in one pipeline
# 1. Extract text from a PDF (or HTML, DOCX, plain text)
curl -X POST https://autochunk.ai/api/v1/extract \
-H "x-api-key: $AUTOCHUNK_KEY" \
-F "file=@contract.pdf"
# 2. Pipe the extracted text into /chunk
TEXT=$(curl -s -X POST https://autochunk.ai/api/v1/extract \
-H "x-api-key: $AUTOCHUNK_KEY" \
-F "file=@contract.pdf" | jq -r .text)
curl -X POST https://autochunk.ai/api/v1/chunk \
-H "x-api-key: $AUTOCHUNK_KEY" \
-H "content-type: application/json" \
-d "$(jq -n --arg c "$TEXT" '{source: {type: "pdf"}, content: $c}')"
Error codes
All errors return JSON with the shape { "error": { "code": "...", "message": "..." } }. On 400 invalid_request there's also an issues array with the specific Zod failures.
| HTTP | error.code | When it fires |
|---|---|---|
| 400 | invalid_json | Body wasn't parseable JSON. |
| 400 | invalid_request | Body parsed but failed Zod validation. Check the issues array. |
| 400 | empty_content | content was empty or whitespace-only after cleanup. |
| 401 | missing_api_key | No x-api-key header on the request. |
| 401 | invalid_api_key | Header present but key not recognized. |
| 401 | key_disabled | Key was explicitly disabled (compromise, suspended account). |
| 429 | monthly_quota_exceeded | Your monthly request limit was reached. Resets the 1st of next month UTC. |
| 429 | rate_limit_exceeded | More than 60 requests in the last 60 seconds. Retry-After: 60 header included. |
| 500 | internal_error | Unhandled error inside the handler. Will appear in our Sentry; rare. |
| 500 | auth_unavailable | Couldn't reach Supabase to verify your key. Retry with backoff. |
| 400 | invalid_content_type | /api/v1/extract: request was not multipart/form-data. |
| 400 | missing_file | /api/v1/extract: multipart body had no 'file' field. |
| 413 | payload_too_large | /api/v1/extract: file exceeds 10MB. |
| 415 | unsupported_format | /api/v1/extract: file format couldn't be detected (not PDF/HTML/DOCX/text). |
| 422 | extraction_failed | /api/v1/extract: file matched a format but parser failed (corrupt, encrypted, unsupported variant). |
| 422 | empty_output | /api/v1/extract: file parsed but produced no text. Often a scanned/image-only PDF. |
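One way a client might act on this table is to retry only the codes the docs describe as transient. The sketch below is an assumption about client policy, not part of the API: the function name retry_delay, the exponential schedule, and the 60-second cap are all illustrative. It retries rate_limit_exceeded using the Retry-After header and auth_unavailable with backoff, and treats everything else as not retryable.

```python
def retry_delay(body, headers, attempt):
    """Return seconds to wait before retrying, or None if a retry won't help.

    `body` is the parsed error JSON, `headers` a dict of response headers,
    and `attempt` the zero-based count of retries already made.
    """
    code = body.get("error", {}).get("code")
    if code == "rate_limit_exceeded":
        # The API sends Retry-After: 60; fall back to 60 if it is missing.
        return float(headers.get("Retry-After", 60))
    if code == "auth_unavailable":
        # Exponential backoff: 1s, 2s, 4s, ... capped at 60s (assumed cap).
        return float(min(2 ** attempt, 60))
    # Validation, auth, and quota errors need a change on your side first.
    return None


rate_limited = {"error": {"code": "rate_limit_exceeded", "message": "Too many requests"}}
print(retry_delay(rate_limited, {"Retry-After": "60"}, attempt=0))  # 60.0
```

Codes such as invalid_request or monthly_quota_exceeded return None here because resending the same request would fail the same way.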
Rate limits
Two ceilings, both per API key:
- Per-minute burst: 60 requests per rolling 60-second window by default. Exceeding this returns 429 rate_limit_exceeded with a Retry-After: 60 header. Heavier integrations (data backfills, batch jobs) can have this raised on a per-key basis; email us with your expected sustained throughput and we'll bump it.
- Monthly quota: set per key when you're onboarded. Default is 1,000 successful requests per month. Exceeding it returns 429 monthly_quota_exceeded. Email us if you're approaching the cap and we'll raise it.
Only successful (200) responses count toward your monthly quota. Failed requests (4xx, 5xx) and rate-limited requests do not. Concurrent in-flight bursts can briefly exceed the per-minute ceiling by a few requests — that's a known characteristic of the count-based limiter, acceptable for our scale.
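To stay under the per-minute ceiling during a batch job, a client can keep its own rolling window. The class below is a minimal sketch of that idea; the name RollingWindowThrottle is hypothetical, the default of 50 requests leaves headroom below the documented 60, and the injectable clock exists only so the behavior can be checked without sleeping.

```python
import time
from collections import deque


class RollingWindowThrottle:
    """Client-side limiter: at most max_requests per rolling window."""

    def __init__(self, max_requests=50, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._sent = deque()  # timestamps of recent requests, oldest first

    def wait_time(self):
        """Seconds to wait before the next request may be sent (0.0 if none)."""
        now = self.clock()
        # Forget requests that have left the rolling window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            return 0.0
        return self.window_seconds - (now - self._sent[0])

    def record(self):
        """Call right after sending a request."""
        self._sent.append(self.clock())


# Demonstration with a fake clock and a tiny limit of 2 requests per 60s.
fake_now = [0.0]
throttle = RollingWindowThrottle(max_requests=2, window_seconds=60.0, clock=lambda: fake_now[0])
throttle.record()
throttle.record()
print(throttle.wait_time())  # 60.0
fake_now[0] = 45.0
print(throttle.wait_time())  # 15.0
```

In a real loop you would call time.sleep(throttle.wait_time()) before each request and throttle.record() after it.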
Code samples
JavaScript / Node.js
const res = await fetch("https://autochunk.ai/api/v1/chunk", {
method: "POST",
headers: {
"x-api-key": process.env.AUTOCHUNK_KEY,
"content-type": "application/json",
},
body: JSON.stringify({
source: { type: "text", department: "finance", access_level: "restricted" },
content: documentText,
}),
});
if (!res.ok) {
const { error } = await res.json();
throw new Error(`AutoChunk ${res.status}: ${error.code} - ${error.message}`);
}
const { chunks } = await res.json();
// chunks is now ready to embed and store in your vector DB
Python
import os
import requests
resp = requests.post(
"https://autochunk.ai/api/v1/chunk",
headers={
"x-api-key": os.environ["AUTOCHUNK_KEY"],
"content-type": "application/json",
},
json={
"source": {"type": "text", "department": "finance", "access_level": "restricted"},
"content": document_text,
},
)
resp.raise_for_status()
chunks = resp.json()["chunks"]
Current limitations
- No OCR. Image-only or scanned PDFs return 422 empty_output from /api/v1/extract. Run them through your own OCR (Tesseract, AWS Textract, Google Document AI) and POST the resulting text to /api/v1/chunk directly.
- Binary extraction is PDF/HTML/DOCX only. Legacy .doc, .pptx, .xlsx, .rtf, and image formats aren't supported. Email if your workflow needs one of these and we'll prioritize.
- No streaming. Each request is a single round-trip with the full chunked response. For documents over ~500KB consider splitting your upload across multiple calls.
- No per-call summarization. options.summarize is reserved for a future feature. Don't depend on it returning anything but null today.
- English-tuned tokenizer. Uses gpt-tokenizer (cl100k_base). Works for non-English text but token counts may diverge from production LLMs you're embedding with.
Troubleshooting
“I'm getting 401 invalid_api_key but I just got my key emailed to me.”
Confirm you're sending the raw key in the x-api-key header, not the SHA-256 hash. The key starts with rh_live_ followed by 48 hex characters. Watch for trailing whitespace from copy-paste.
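A quick local check can rule out the two most common causes before you email support. This sketch only tests the documented shape (rh_live_ plus 48 hex characters); the function name check_key is hypothetical, accepting uppercase hex is an assumption, and passing the check does not mean the key is valid on the server.

```python
import re

# Documented shape: "rh_live_" followed by 48 hex characters.
KEY_PATTERN = re.compile(r"rh_live_[0-9a-fA-F]{48}")


def check_key(raw):
    """Return a list of likely problems with a key string (empty if none found)."""
    problems = []
    if raw != raw.strip():
        problems.append("leading or trailing whitespace")
    if not KEY_PATTERN.fullmatch(raw.strip()):
        problems.append("does not match rh_live_ + 48 hex characters")
    return problems


example_key = "rh_live_" + "ab12" * 12   # made-up key with the right shape
print(check_key(example_key))            # []
print(check_key(example_key + "\n"))     # ['leading or trailing whitespace']
print(check_key("a" * 64))               # a bare SHA-256-style hash fails the shape check
```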
“Every request returns 500 auth_unavailable.”
Supabase is temporarily unreachable from our side. Retry with exponential backoff. If it persists more than ~60 seconds, email us — we get paged on this.
“Chunks contain mid-sentence breaks I didn't expect.”
The chunker only retreats up to 40% of the window when looking for a paragraph/sentence boundary. If your document has very long paragraphs (>2x your chunk_size), some chunks will hard-cut. Either raise chunk_size or pre-segment your text on natural boundaries.
“My token_count adds up to more than total_tokens in the response.”
With chunk_overlap > 0, neighboring chunks share tokens by design. total_tokens reports the unique source tokens; chunk token_counts can sum higher because of overlap.
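A small worked example makes the arithmetic concrete. It assumes hard cuts at exactly chunk_size with no boundary retreat, so real responses will differ slightly; the function name hard_cut_spans is made up for this illustration.

```python
def hard_cut_spans(total_tokens, chunk_size=512, chunk_overlap=50):
    """Token index ranges for each chunk, assuming hard cuts and no retreat."""
    stride = chunk_size - chunk_overlap  # how far each new chunk advances
    spans = []
    start = 0
    while True:
        end = min(start + chunk_size, total_tokens)
        spans.append((start, end))
        if end == total_tokens:
            break
        start += stride
    return spans


spans = hard_cut_spans(1200)
token_counts = [end - start for start, end in spans]
print(spans)              # [(0, 512), (462, 974), (924, 1200)]
print(sum(token_counts))  # 1300, versus total_tokens of 1200
```

The extra 100 tokens are the two 50-token overlaps between the three chunks.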
“I'm hitting the per-minute rate limit during a backfill.”
60 req/min is intended for steady-state workloads. For backfills, throttle to ~50 req/min or email us — we can raise the limit on a per-key basis if you have a legitimate batch job.
Support
One person reads every email at hello@autochunk.ai. Expected response time: same business day for paying customers, within 48 hours for prospects. When reporting bugs, include your key_prefix (the first 8 characters of your key) and the body of any error response.