The Astra Docs Chat corpus was loaded once. DataStax ships doc updates regularly; v1 does not auto-refresh. This post outlines a repeatable re-ingest strategy when the markdown export changes: what to detect, how to use existing resume tooling, and when full rebuild beats incremental upsert.
Series: Building Astra Docs Chat · Batch ingest script · Chunking and embedding
Try the chat: Astra Docs Chat
When manual re-ingest is enough
For a personal reference tool, running ingest after major doc releases (or when answers feel stale) may be sufficient. No cron, no diff pipeline, just:
- Re-run your crawl/extract step to refresh the local pages/ markdown export
- Re-run the Langflow batch ingest script with appropriate flags
- Spot-check five questions on Astra Docs Chat
That matches how v1 shipped: one batch load, manual refresh when I notice drift.
Automation is optional until stale answers become painful. Docs-only guardrails reduce harm from stale vectors but do not replace fresh content.
Two state files, two jobs
I track crawl and ingest in separate local state files:
| File | Tracks | Phase |
|---|---|---|
| page_state.json | Crawl/extract per URL (hash, filename) | Markdown export |
| ingest_state.json | Langflow ingest per file path | Vector load |
Crawl state (page_state.json) tracks upstream doc URLs and content hashes. Ingest state (ingest_state.json) tracks which local files have already been uploaded and embedded.
Crawl state uses SHA-256 hashes:
import hashlib

def compute_hash(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()
During extract, unchanged pages can skip re-writing markdown. After a crawl, compare what changed before you spend embedding API credits.
Ingest state is path-based only in v1:
{
  "pages/administration_audit-log.md": {
    "status": "ingested",
    "uploaded_path": "7b90824f-.../administration_audit-log.md"
  }
}
Gap: if markdown content changes but the path is unchanged, the batch ingest script skips the file because status is already ingested. Re-ingest requires deliberate state edits or a script enhancement.
Detecting what changed
After a fresh crawl/export:
- New files in pages/: the batch script picks them up (not in ingest_state.json)
- Changed files: remove their keys from ingest_state.json, or extend the ingest script to compare compute_hash() against a stored hash and re-run on mismatch
- Deleted upstream pages: vectors for removed topics may linger until full collection rebuild or explicit deletion by metadata
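The three cases above can be sketched as a diff between two crawl-state snapshots. This is a minimal sketch, not the project's actual tooling: it assumes each snapshot maps a URL to an object with `hash` and `filename` keys (the real page_state.json layout may differ), and `diff_page_state` and `load_state` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read a JSON state file, returning an empty dict if it does not exist."""
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}


def diff_page_state(old: dict, new: dict) -> dict:
    """Classify URLs as new, changed, or deleted by comparing stored hashes.

    Assumed layout: URL -> {"hash": ..., "filename": ...}.
    """
    added = sorted(url for url in new if url not in old)
    deleted = sorted(url for url in old if url not in new)
    changed = sorted(
        url for url in new
        if url in old and new[url].get("hash") != old[url].get("hash")
    )
    return {"new": added, "changed": changed, "deleted": deleted}
```

To use it, keep a copy of the crawl state from before the refresh (for example a hypothetical page_state.prev.json) and diff it against the fresh page_state.json. The `deleted` list is the one to watch, since those pages still have vectors in the collection.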
Recommended incremental workflow today:
# 1. Refresh markdown export (your crawl/extract tooling)
# ...
# 2. Clear ingest state for files you know changed (or all keys for a major release)
# edit ingest_state.json manually or script it
# 3. Re-ingest via Langflow API (see batch ingest post for script flags)
python ingest_langflow.py --retry-failed
Check ingest_failed.log after every run.
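Step 2 can be scripted instead of edited by hand. The sketch below assumes ingest_state.json is a flat JSON object keyed by file path, as in the example earlier; `clear_ingest_entries` is a hypothetical helper, not part of the batch ingest script.

```python
import json
from pathlib import Path


def clear_ingest_entries(state_path: Path, changed_files: list[str]) -> list[str]:
    """Drop the given keys from the ingest state file; return the keys actually removed."""
    state = json.loads(state_path.read_text(encoding="utf-8"))
    # pop() with a default ignores paths that were never ingested.
    removed = [key for key in changed_files if state.pop(key, None) is not None]
    state_path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return removed
```

Calling `clear_ingest_entries(Path("ingest_state.json"), ["pages/administration_audit-log.md"])` would make the next ingest run treat that file as new. Back up the state file first if you are clearing many keys at once.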
Incremental vs full rebuild
| Strategy | Pros | Cons |
|---|---|---|
| Append / upsert per file | Fast, resumable | Orphan vectors if pages removed; duplicate chunks if not deduped |
| Truncate collection + full ingest | Clean slate | 2-4 hours of embedding time (see the batch post for timing) |
| Collection per version | Easy rollback | More ops complexity |
For 271 pages, full rebuild is painful but simple: truncate datastax_astra_docs, delete ingest_state.json, run a full batch ingest pass (batch post).
Incremental with hash-aware ingest state is the sweet spot for repeat runs. A minimal enhancement:
# pseudocode: on each file before skip
if state[key].get("hash") != compute_hash(file.read_text()):
    del state[key]  # force re-ingest
Store hash alongside ingested status when saving state after success.
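A slightly fuller sketch of the same idea, covering both the skip decision and the save after success. It assumes the state entry shape shown earlier plus a new `hash` field; `needs_ingest` and `record_success` are hypothetical names, and `compute_hash` is repeated here only to keep the example self-contained.

```python
import hashlib
from pathlib import Path


def compute_hash(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def needs_ingest(state: dict, key: str, file_path: Path) -> bool:
    """Return True when the file is new, not yet ingested, or changed on disk."""
    entry = state.get(key)
    if entry is None or entry.get("status") != "ingested":
        return True
    # Entries saved before this enhancement have no "hash", so they
    # compare unequal and are re-ingested once.
    return entry.get("hash") != compute_hash(file_path.read_text(encoding="utf-8"))


def record_success(state: dict, key: str, file_path: Path, uploaded_path: str) -> None:
    """Store the content hash next to the ingested status after a successful run."""
    state[key] = {
        "status": "ingested",
        "uploaded_path": uploaded_path,
        "hash": compute_hash(file_path.read_text(encoding="utf-8")),
    }
```

One consequence of this design is that the first run after adding the hash field re-ingests everything, because no existing entry has a stored hash. If that is too expensive, backfill hashes into the state file before switching over.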
Langflow/Astra upsert behaviour depends on component settings (deletion_field, document ids). v1 appends; duplicates can inflate retrieval noise until you rebuild.
Using existing resume tooling
The batch ingest script (described in the previous post) already supports resume flags, for example:
# After fresh export, retry failures only
python ingest_langflow.py --retry-failed
# Force single file (delete its ingest_state.json entry first)
python ingest_langflow.py --limit 1
# Smoke test
python ingest_langflow.py --limit 3
Each file: upload → run ingest endpoint datastax-astra-ingest → atomic state save. Ctrl+C safe.
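One common way to get an atomic state save is to write to a temporary file and rename it over the target. The sketch below shows that pattern; it is an illustration under that assumption, not a copy of the batch script, and `save_state_atomic` is a hypothetical name.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state_atomic(state: dict, state_path: Path) -> None:
    """Write state to a temp file in the same directory, then rename it over the target."""
    fd, tmp_name = tempfile.mkstemp(
        dir=state_path.parent, prefix=state_path.name, suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before the rename
        # The rename is atomic on POSIX when both paths are on the same filesystem.
        os.replace(tmp_name, state_path)
    except BaseException:
        # Remove the temp file if anything, including Ctrl+C, interrupts the write.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Because the rename either happens or does not, an interrupted run leaves the previous state file intact and the next run resumes from the last file that completed.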
See the Langflow ingest flow post for graph details, and the chunking post if you change split settings during a refresh (that usually warrants a full rebuild).
Automating later
If manual runs become tedious:
- Scheduled job on your machine or CI: export → ingest on a cron you control
- Alert when a crawl discovers N new/changed pages (diff page_state.json hashes)
- Optional webhook from docs pipeline (unlikely for third-party docs you do not control)
Keep Langflow and Astra credentials in CI secrets: same as local LANGFLOW_API_KEY. Ingest runs from CI or your laptop, not from public visitors.
Do not expose /api/v2/files/ to the open internet without network restrictions (self-hosting post).
Post-refresh validation
Same as initial load:
- Five spot-check questions in Langflow Playground
- Same five on Astra Docs Chat
- Pay attention to renamed API fields: stale vectors plus confident LLM answers are the worst combination
Good test questions from the live UI starters:
- Collection creation steps
- PCU groups definition
- Hybrid search behaviour
If answers reference removed features, prefer full rebuild over incremental patch.
Cost note
Re-ingest spend is mostly OpenAI embedding API calls, not Astra storage (Astra vector store post). Budget for a full 271-file run before you schedule weekly refreshes.
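A rough way to budget a full run is to estimate tokens from the size of the export. This sketch assumes about four characters per token (a common rule of thumb for English text, not an exact count) and takes the price as a parameter so you can plug in the current rate for your embedding model; `estimate_embedding_cost` is a hypothetical helper.

```python
from pathlib import Path


def estimate_embedding_cost(
    pages_dir: Path,
    usd_per_million_tokens: float,
    chars_per_token: float = 4.0,  # assumption: rough average for English prose
) -> tuple[int, float]:
    """Return (estimated_tokens, estimated_usd) for embedding every .md file once."""
    total_chars = sum(
        len(path.read_text(encoding="utf-8")) for path in pages_dir.glob("*.md")
    )
    tokens = int(total_chars / chars_per_token)
    return tokens, tokens / 1_000_000 * usd_per_million_tokens
```

Treat the result as a lower bound: chunk overlap means some text is embedded more than once, and retries after failures add to the total.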
Status
Process documented; automation not built in v1. The batch ingest pattern and state files are what I use for manual refresh; hash-aware incremental ingest is the obvious next improvement.
Next in the series
- Chunking and embedding technical docs for RAG: when refresh is a good time to revisit split settings
- Docs-only guardrails: stale vectors plus confident answers are the worst combination
Series index: Building Astra Docs Chat
Open Astra Docs Chat after your next re-ingest and compare answers to the live docs site.