After setting up my blog on Astro and Cloudflare Workers recently, I started wondering what else I could tinker with now that I have full control over the stack 😆. I knew Cloudflare offers a free daily quota for AI models, so I thought it would be fun to build a little AI assistant directly into the blog posts - something that could summarize the post and answer questions about it, all powered by Workers AI.
This post walks through how I added an AI summarizer and Q&A panel directly into each blog post using Cloudflare Workers AI, and how I used AI Gateway to make sure I never get an unexpected bill.
## What it does
Scroll to the bottom of any post on this blog and you’ll see a small collapsible panel - ❓ “Ask AI about this post”. Click it and you can:
- Summarize the post - one click to get a 3-5 sentence summary
- Ask follow-up questions - dig into anything from the post for more context or details
The assistant only answers based on the post content. I intentionally scoped it this way so it stays on-topic and doesn’t start making things up.
## How it’s wired together
There are two pieces - a client-side Astro component for the UI, and a server-side API route that calls Workers AI.
```
Browser
 └── AiPostAssistant.astro (Astro component - UI + fetch logic)
        │ POST /api/ai-chat
        ▼
Cloudflare Worker
 └── src/pages/api/ai-chat.ts (API route)
        │ env.AI.run() with gateway option
        ▼
AI Gateway (rate limiting, caching, observability)
        │
        ▼
Workers AI (@cf/meta/llama-3.1-8b-instruct-fp8)
```
Responses are streamed back as Server-Sent Events (SSE) so the text appears token-by-token, which gives that nice “typing” feel.
## The API endpoint
Since the blog runs on Cloudflare Workers, I get access to the `env.AI` binding for free - no API keys needed, no separate service to authenticate against.
```ts
// src/pages/api/ai-chat.ts
import type { APIRoute } from "astro";
import { env } from "cloudflare:workers";

export const POST: APIRoute = async ({ request }) => {
  const { content, messages } = await request.json();

  const systemMessage = {
    role: "system",
    content: `You are a helpful assistant for a tech blog.
Answer questions based only on the blog post content below.
If asked for a summary, provide a structured 3-5 sentence summary.

Blog post content:
---
${content.slice(0, 8_000)}
---`,
  };

  const stream = await env.AI.run(
    "@cf/meta/llama-3.1-8b-instruct-fp8",
    { messages: [systemMessage, ...messages], stream: true, max_tokens: 1024 },
  );

  return new Response(stream, {
    headers: {
      "content-type": "text/event-stream",
      "cache-control": "no-cache",
    },
  });
};
```
A few things worth noting here:
- I trim the post content to 8,000 characters before sending it. The model supports up to 32K tokens, but most blog posts are way under that. Trimming keeps the per-request token cost low.
- `stream: true` gives us a `ReadableStream` back, which is what powers the token-by-token UI effect.
- The system prompt explicitly tells the assistant to answer only from the post. Without this, the model happily wanders off into general-knowledge territory.
## The UI component
The Astro component sits at the bottom of every post, collapsed by default so it doesn’t get in the way of reading.
```js
// Excerpt from AiPostAssistant.astro <script>

// Grab the full post text from the DOM
const articleEl = document.querySelector("article");
const postContent = articleEl?.innerText ?? "";

async function sendMessage(userMessage) {
  const res = await fetch("/api/ai-chat", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({
      content: postContent,
      messages: conversationHistory,
    }),
  });

  // Stream the SSE response token-by-token
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    const chunk = decoder.decode(value);
    // Parse SSE data lines and append tokens to the UI
    appendTokenToUI(chunk);
  }
}
```
One nice thing here: the post content is read directly from the rendered DOM (the `article` element), so I don’t need to pass anything extra as props. The component is completely self-contained.
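The excerpt above hand-waves the "parse SSE data lines" step. Here is a minimal sketch of what that parsing could look like, assuming Workers AI's streaming format of `data: {"response":"…"}` events terminated by a `data: [DONE]` sentinel (the function name and buffering note are illustrative, not from the repo):

```typescript
// Minimal SSE-chunk parser sketch. Assumes Workers AI's
// `data: {"response":"..."}` event lines, ending with `data: [DONE]`.
function parseSseChunk(chunk: string, onToken: (token: string) => void): void {
  for (const line of chunk.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("data:")) continue; // skip blank/comment lines

    const payload = trimmed.slice("data:".length).trim();
    if (payload === "[DONE]") return; // end-of-stream sentinel

    try {
      const event = JSON.parse(payload) as { response?: string };
      if (event.response) onToken(event.response);
    } catch {
      // A network chunk can end mid-line; a production version would
      // buffer the partial line and prepend it to the next chunk.
    }
  }
}
```

In the component, `appendTokenToUI` would then receive clean text tokens instead of raw SSE framing.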
## Keeping it free with AI Gateway
Workers AI gives you 10,000 free neurons per day. A typical request to the llama model costs around 60 neurons, which works out to roughly 166 free requests before charges kick in. For a personal blog that’s fine - until a post gets picked up somewhere and suddenly 500 people hit the summarize button in an hour. At that point you’re paying.
I wanted to avoid that scenario entirely without writing any custom quota-tracking code. Turns out Cloudflare has a native solution for this called AI Gateway.
### What is AI Gateway?
AI Gateway is a proxy layer that sits between your Worker and the AI provider. For Workers AI it gives you:
- Rate limiting - set a daily request cap in the dashboard, no code required
- Response caching - repeated identical requests return a cached response for free
- Observability - a dashboard showing request counts, token usage, latency, and estimated cost
The really nice part is that connecting it to your Worker is literally one extra argument to `env.AI.run()`.
### Setting it up
#### Step 1 - Create the gateway

In the Cloudflare Dashboard, go to **AI → AI Gateway → Create Gateway**. Give it a name - I used `my-blog`.
#### Step 2 - Add the gateway ID to `wrangler.jsonc`
```jsonc
// wrangler.jsonc
{
  "vars": {
    "AI_MODEL": "@cf/meta/llama-3.1-8b-instruct-fp8",
    "AI_GATEWAY_ID": "my-blog"
  }
}
```
#### Step 3 - Pass the gateway option to `env.AI.run()`
```ts
// src/pages/api/ai-chat.ts - the only code change needed
const gatewayId = env.AI_GATEWAY_ID ?? "";
const gatewayOptions = gatewayId ? { gateway: { id: gatewayId } } : {};

const stream = await env.AI.run(
  model,
  { messages: allMessages, stream: true, max_tokens: 1024 },
  gatewayOptions,
);
```
That’s really it. No API token, no changing the request URL - the Workers AI binding handles authentication automatically when you use it this way. See the Cloudflare AI Gateway - Workers Binding docs for more detail.
#### Step 4 - Set a rate limit in the dashboard
In the gateway settings, enable Rate Limiting. The dashboard only supports up to a 1-hour window (no daily option), so set it per hour and keep the daily math in mind:
| Setting | Value |
|---|---|
| Requests | 6 |
| Window | 1 Hour |
| Type | Fixed |
6 requests/hour × 24 hours = 144 requests/day × ~60 neurons = ~8,640 neurons/day, which stays comfortably under the 10,000 free tier. When the limit is hit, the gateway returns a 429 response and stops forwarding requests.
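The budget math above can be sanity-checked with a few lines. This is just back-of-envelope arithmetic using the numbers from this post; the ~60-neuron per-request figure is a rough average, not an exact cost:

```typescript
// Free-tier budget math (numbers from the post; the per-request
// neuron cost is a rough average, not an exact figure).
const FREE_NEURONS_PER_DAY = 10_000;
const NEURONS_PER_REQUEST = 60;
const HOURS_PER_DAY = 24;

// Highest hourly cap that still fits inside the daily free tier:
const maxReqPerHour = Math.floor(
  FREE_NEURONS_PER_DAY / NEURONS_PER_REQUEST / HOURS_PER_DAY,
);

// Worst-case daily neuron usage for a given hourly cap:
const dailyNeurons = (reqPerHour: number) =>
  reqPerHour * HOURS_PER_DAY * NEURONS_PER_REQUEST;

console.log(maxReqPerHour);    // 6
console.log(dailyNeurons(6));  // 8640 - under the 10,000 free tier
console.log(dailyNeurons(15)); // 21600 - over, but only by a little
```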
P.S. Even if you want to stay strictly under the free tier, I recommend bumping the limit up a bit (e.g. 15 req/hour) to avoid accidentally blocking real users when traffic spikes. The cost of a few extra requests is negligible, and it gives you some breathing room.
### Handling the 429
When the gateway blocks a request, the Worker catches the error and returns something readable:
```ts
// In the catch block of the API route
const errMsg = e instanceof Error ? e.message : String(e);
if (errMsg.includes("429") || errMsg.toLowerCase().includes("rate limit")) {
  return new Response(
    JSON.stringify({ error: "Daily AI usage limit reached. Please try again tomorrow." }),
    { status: 429 },
  );
}
```
The frontend checks for status 429 and shows the message directly - no ugly stack traces showing up in the UI.
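On the frontend side, that check can be as simple as the following sketch. The helper name and fallback strings here are illustrative, not from the repo:

```typescript
// Hypothetical frontend helper - maps an error response to a
// user-friendly message instead of surfacing a raw error in the UI.
async function readAssistantError(res: Response): Promise<string | null> {
  if (res.ok) return null; // success: the streaming path takes over

  if (res.status === 429) {
    try {
      const body = (await res.json()) as { error?: string };
      return body.error ?? "AI usage limit reached. Please try again later.";
    } catch {
      return "AI usage limit reached. Please try again later.";
    }
  }
  return "Something went wrong. Please try again.";
}
```

If this returns a string, the component renders it as the assistant's reply; otherwise it proceeds to stream the response.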
### Caching is a nice bonus
AI Gateway also caches responses. If 10 different visitors all ask “summarize this post” on the same day, only the first request actually hits Workers AI. The rest get the cached response instantly, for free. This helps stretch that ~144-request daily budget even further.
## What does it actually cost?
| Scenario | Neurons used | Cost |
|---|---|---|
| 144 requests/day × 60 neurons (6/hr × 24hr) | ~8,640 | $0.00 (under free tier) |
| Cache hit (repeated question) | 0 | $0.00 |
| Over limit (rate-limited by gateway) | 0 | $0.00 |
For a personal blog, roughly 150 requests a day is plenty. And once you’ve set the rate limit, you can forget about it.
## Seeing it in the dashboard
After deploying, the AI Gateway dashboard started filling in with real data - request counts, token breakdown (input vs output), calculated cost, cache hit rate. It was the first time I could see exactly what an AI feature was costing me in real time, right down to the sub-cent level.
Honestly that visibility alone made the AI Gateway integration worth it, even if the rate limiting wasn’t needed.
## Source code
The full implementation is open source:
**hossains-dev-bytes** - Astro blogging site hosted using Cloudflare Workers.
If you want to dig into the details, the main files are:
- `src/pages/api/ai-chat.ts` - the Worker API endpoint
- `src/components/AiPostAssistant.astro` - the UI component
- `wrangler.jsonc` - where `AI_MODEL` and `AI_GATEWAY_ID` are configured
If you’re already running a blog on Cloudflare Workers, adding this is pretty low effort - an AI coding agent can scaffold most of it. Give it a try 👍🏽