Adding source citations to Astra Docs Chat


When I shipped Astra Docs Chat, I was honest about what it could not do. Top of that list: source citations with links back to the doc pages. An answer you cannot check is an answer you have to take on faith, and for technical documentation that is not good enough. If the model tells you how to create a collection, you want to click straight through to the page it read.

This post is the follow-up. Every answer now shows the IBM DataStax Astra DB Serverless pages it was built from, as clickable links underneath the response.

I work at IBM, and this assistant runs end to end on our own stack: IBM DataStax Astra DB as the vector store and Langflow for the retrieval and chat orchestration. That is the main reason adding citations was a short hop rather than a rebuild. The information I needed was already sitting in the Langflow flow and in the Astra DB metadata; I just had to surface it.


You ask a question, the answer streams in exactly as before, and then a Sources row appears underneath it. One chip per doc page the retrieval actually used, each one a link to the real page on docs.datastax.com.

Ask about private links and you get the AWS, Azure, and Google Cloud pages. Ask how to create a collection and you get the collection methods page. The chips reflect what the retrieval step pulled for that specific question, not a fixed footer.


The retrieval step already knew the answer to “which pages did this come from”. Each chunk stored in Astra DB carries the filename it came from in its metadata, the slug I created back when I was chunking and embedding the docs. The problem was that none of it reached the browser. The chat flow only returned the model’s text, and the proxy only forwarded tokens.

So the work split into three small pieces: get the flow to emit the filenames, turn those filenames into real links, and render them under the answer.


The RAG flow retrieves chunks, builds a prompt, and calls the model. I tapped the retriever’s output into a second small parser that prints one filename per line, wired to a second chat output. Now the run’s end event carries two things instead of one: the answer, and the list of source filenames in retrieval order.
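
To make the second output concrete, here is a minimal TypeScript sketch of how a consumer might turn that one-filename-per-line text into an ordered list. The names (`parseSourceFilenames`, `RunResult`, `buildRunResult`) are hypothetical, and how the two outputs are pulled out of the run's end event is deliberately left out, since that payload shape is not shown in this post.

```typescript
// Assumed shape for what the proxy keeps from a finished run.
// Illustrative only; this is not Langflow's actual event schema.
interface RunResult {
  answer: string;
  sourceFilenames: string[]; // retrieval order, duplicates still present
}

// Split the parser's text output ("one filename per line") into a list,
// tolerating Windows line endings, stray whitespace, and blank lines.
function parseSourceFilenames(sourcesText: string): string[] {
  return sourcesText
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Pair the model's answer with the parsed filenames.
function buildRunResult(answerText: string, sourcesText: string): RunResult {
  return {
    answer: answerText,
    sourceFilenames: parseSourceFilenames(sourcesText),
  };
}
```

Order and duplicates are kept on purpose at this stage. Several chunks can come from the same page, and collapsing them is a job for the resolution step that follows.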

The original answer path is untouched. The model still streams token by token, and the old behaviour is unchanged, so the live site kept working while I built the rest.


A filename like api-reference_collection-methods_create-collection.md is not a URL. I keep a small manifest that maps each filename to its public URL and page title, and the Cloudflare Pages proxy resolves the retrieved filenames against it, drops duplicates, and hands the browser a clean numbered list.
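
As an illustration of that resolution step, the sketch below looks up each retrieved filename in a manifest, skips anything unmapped, drops duplicates, and numbers what is left. The manifest shape, the function name `resolveSources`, the second filename, the titles, and the `docs.example.com` URLs are all placeholders I made up for the example, not the real manifest.

```typescript
interface ManifestEntry {
  url: string;
  title: string;
}

interface NumberedSource extends ManifestEntry {
  index: number; // 1-based position in the list shown to the reader
}

// Placeholder manifest: filename -> public URL and page title.
const exampleManifest = new Map<string, ManifestEntry>([
  [
    "api-reference_collection-methods_create-collection.md",
    {
      url: "https://docs.example.com/api-reference/collection-methods/create-collection",
      title: "Create a collection",
    },
  ],
  [
    "placeholder_second-page.md",
    { url: "https://docs.example.com/placeholder/second-page", title: "Second page" },
  ],
]);

// Resolve filenames in retrieval order, skipping unknown files and repeats.
function resolveSources(
  filenames: string[],
  manifest: Map<string, ManifestEntry>,
): NumberedSource[] {
  const seenUrls = new Set<string>();
  const sources: NumberedSource[] = [];
  for (const filename of filenames) {
    const entry = manifest.get(filename);
    if (entry === undefined) continue; // not a known doc page
    if (seenUrls.has(entry.url)) continue; // another chunk from the same page
    seenUrls.add(entry.url);
    sources.push({ index: sources.length + 1, url: entry.url, title: entry.title });
  }
  return sources;
}
```

Deduplicating on the URL rather than the filename means two filenames that point at the same page would also collapse into one entry.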

Building that manifest had a twist worth mentioning. The state file from the original batch ingest was long gone, so I had no ready-made map of filename to URL. I rebuilt it from what I still had. The filenames were created by swapping the slashes in a URL for underscores, so I could reverse that, then fetch each candidate page to confirm it resolved and to read its real title. Of the 272 filenames, 271 mapped to a live page. The one that did not was a stray JSON asset that had been picked up during ingest, not a real doc page, so it gets dropped.
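
The reversal itself is tiny. Here is a rough TypeScript sketch of it, under stated assumptions: `ASSUMED_DOCS_BASE` is a placeholder rather than the real docs prefix, `candidateUrlFromFilename` is a name invented for the example, and whether the live pages need a suffix such as `.html` is not covered here, so the sketch leaves it off.

```typescript
// Placeholder base URL; the real prefix on the docs site is not shown in this post.
const ASSUMED_DOCS_BASE = "https://docs.example.com/";

// Undo the ingest-time naming: strip a trailing ".md", then turn each
// underscore back into a path separator and resolve against the base.
function candidateUrlFromFilename(
  filename: string,
  base: string = ASSUMED_DOCS_BASE,
): string {
  const stem = filename.replace(/\.md$/, "");
  const path = stem.split("_").join("/");
  return new URL(path, base).toString();
}
```

The result is only a candidate. A URL segment that originally contained an underscore would be split in the wrong place, and a non-page asset would still produce a plausible-looking address, which is why each candidate has to be fetched and checked before it is trusted.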


The obvious alternative was to write each page’s URL straight into its chunk metadata in IBM DataStax Astra DB and read it back at query time. I went with the proxy-side manifest instead, for a few reasons.

The filename was already there. Astra DB stores each chunk with its source filename in metadata, so the URL was something I could derive from data I already had, not something I needed to persist a second time.

Metadata is written at ingest time. To stamp a URL onto the 271 pages already in the collection, I would have had to re-ingest and re-embed the whole corpus. That is real cost and downtime for a value that does not change how retrieval works.

The manifest is easier to live with. It sits next to the proxy, so if a title reads badly or a URL moves, I edit one file and redeploy. No re-embedding, and nothing touched in the vector store. Astra DB stays the vector store, the manifest stays a small lookup table, and the two jobs stay cleanly separated.


The streaming chat UI already renders markdown as it arrives. When the sources event lands, it builds a small row of pill links under the finished answer. No framework, no iframe, the same vanilla JavaScript as the rest of the page. URLs are checked before they go anywhere near the DOM, so only genuine http(s) doc links render.
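
For the URL check, a minimal TypeScript sketch might look like the following. It uses the standard `URL` constructor; the function name `isRenderableDocUrl` and the host allowlist are assumptions for the example (the allowlist simply reflects that every chip points at docs.datastax.com), not a copy of the code on the site.

```typescript
// Accept only absolute http(s) URLs on an allowed docs host.
function isRenderableDocUrl(
  raw: string,
  allowedHosts: readonly string[] = ["docs.datastax.com"],
): boolean {
  let parsed: URL;
  try {
    parsed = new URL(raw); // throws on relative or malformed input
  } catch {
    return false;
  }
  const isHttp = parsed.protocol === "http:" || parsed.protocol === "https:";
  return isHttp && allowedHosts.includes(parsed.hostname);
}
```

Checking the parsed protocol rather than the raw string is what keeps values such as `javascript:` URLs out of an `href`. Building each chip with `document.createElement` and setting its label through `textContent`, rather than concatenating HTML, keeps page titles from being interpreted as markup as well.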


The chips are deduplicated and ordered, but the answer text does not yet drop inline markers like [1] next to each individual claim. The reason is mechanical rather than philosophical. The parser I use to format the retrieved context has no per-row index, so I cannot number the chunks in a way that stays lined up with the chips without writing a custom component. I would rather show no inline marker than show a [1] that points at the wrong source. The chips are the useful part, and they are accurate. Numbered inline markers are the next iteration.


The whole point of building this thing was to ask a question and trust the answer enough to act on it. Citations close that loop. If something looks off, or you just want the full detail, the source page is one click away. It also quietly keeps the tool honest: if the retrieval pulled the wrong pages, you can see that now, instead of reading a confident answer with no way to check it.


If you want the authoritative sources behind all of this:


Open Astra Docs Chat and ask something you would otherwise go digging through the docs for, then follow the chips to the pages it used. If you are doing something similar, putting citations on top of a RAG setup, I would be glad to compare notes on LinkedIn.
