After setting up my blog on Astro and Cloudflare Workers recently, I started wondering what else I could tinker with now that I have full control over the stack 😆. I knew Cloudflare offers a free daily quota for AI models, so I thought it would be fun to build a little AI assistant directly into the blog posts - something that could summarize the post and answer questions about it, all powered by Workers AI.
This post walks through how I added an AI summarizer and Q&A panel directly into each blog post using Cloudflare Workers AI, and how I used AI Gateway to make sure I never get an unexpected bill.
## What it does
Scroll to the bottom of any post on this blog and you’ll see a small collapsible panel - ❓ “Ask AI about this post”. Click it and you can:
- Summarize the post - one click to get a 3-5 sentence summary
- Ask follow-up questions - dig into anything from the post for more context or details
The assistant only answers based on the post content. I intentionally scoped it this way so it stays on-topic and doesn’t start making things up.
## How it’s wired together
There are two pieces - a client-side Astro component for the UI, and a server-side API route that calls Workers AI.
```
Browser
 └── AiPostAssistant.astro (Astro component - UI + fetch logic)
        │ POST /api/ai-chat
        ▼
Cloudflare Worker
 └── src/pages/api/ai-chat.ts (API route)
        │ env.AI.run() with gateway option
        ▼
AI Gateway (rate limiting, caching, observability)
        │
        ▼
Workers AI (@cf/meta/llama-3.1-8b-instruct-fp8)
```
Responses are streamed back as Server-Sent Events (SSE) so the text appears token-by-token, which gives that nice “typing” feel.
## The API endpoint
Since the blog runs on Cloudflare Workers, I get access to the `env.AI` binding for free - no API keys needed, no separate service to authenticate against.
```ts
// src/pages/api/ai-chat.ts
import type { APIRoute } from "astro";
import { env } from "cloudflare:workers";

export const POST: APIRoute = async ({ request }) => {
  const { content, messages } = await request.json();

  const systemMessage = {
    role: "system",
    content: `You are a helpful assistant for a tech blog.
Answer questions based only on the blog post content below.
If asked for a summary, provide a structured 3-5 sentence summary.

Blog post content:
---
${content.slice(0, 8_000)}
---`,
  };

  const stream = await env.AI.run(
    "@cf/meta/llama-3.1-8b-instruct-fp8",
    { messages: [systemMessage, ...messages], stream: true, max_tokens: 1024 },
  );

  return new Response(stream, {
    headers: {
      "content-type": "text/event-stream",
      "cache-control": "no-cache",
    },
  });
};
```
A few things worth noting here:
- I trim the post content to 8,000 characters before sending it. The model supports up to 32K tokens, but most blog posts are way under that. Trimming keeps the per-request token cost low.
- `stream: true` gives us a `ReadableStream` back, which is what powers the token-by-token UI effect.
- The system prompt explicitly tells the assistant to answer only from the post. Without this, the model happily wanders off into general-knowledge territory.
## The UI component
The Astro component sits at the bottom of every post, collapsed by default so it doesn’t get in the way of reading.
```js
// Excerpt from AiPostAssistant.astro <script>

// Grab the full post text from the DOM
const articleEl = document.querySelector("article");
const postContent = articleEl?.innerText ?? "";

async function sendMessage(userMessage) {
  const res = await fetch("/api/ai-chat", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({
      content: postContent,
      messages: conversationHistory,
    }),
  });

  // Stream the SSE response token-by-token
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    const chunk = decoder.decode(value);
    // Parse SSE data lines and append tokens to the UI
    appendTokenToUI(chunk);
  }
}
```
One nice thing here: the post content is read directly from the rendered DOM (the `article` element), so I don’t need to pass anything extra as props. The component is completely self-contained.
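The excerpt above hand-waves the "parse SSE data lines" step. Here is a minimal sketch of what that parsing could look like, assuming Workers AI's streaming format of `data: {"response":"…"}` events terminated by a `data: [DONE]` sentinel (the function name and buffering note are illustrative, not from the repo):

```typescript
// Minimal SSE-chunk parser sketch. Assumes Workers AI's
// `data: {"response":"..."}` event lines, ending with `data: [DONE]`.
function parseSseChunk(chunk: string, onToken: (token: string) => void): void {
  for (const line of chunk.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("data:")) continue; // skip blank/comment lines

    const payload = trimmed.slice("data:".length).trim();
    if (payload === "[DONE]") return; // end-of-stream sentinel

    try {
      const event = JSON.parse(payload) as { response?: string };
      if (event.response) onToken(event.response);
    } catch {
      // A network chunk can end mid-line; a production version would
      // buffer the partial line and prepend it to the next chunk.
    }
  }
}
```

In the component, `appendTokenToUI` would then receive clean text tokens instead of raw SSE framing.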
## Keeping it free with AI Gateway
Workers AI gives you 10,000 free neurons per day. A typical request to the llama model costs around 60 neurons, which works out to roughly 166 free requests before charges kick in. For a personal blog that’s fine - until a post gets picked up somewhere and suddenly 500 people hit the summarize button in an hour. At that point you’re paying.
I wanted to avoid that scenario entirely without writing any custom quota-tracking code. Turns out Cloudflare has a native solution for this called AI Gateway.
### What is AI Gateway?
AI Gateway is a proxy layer that sits between your Worker and the AI provider. For Workers AI it gives you:
- Rate limiting - set a daily request cap in the dashboard, no code required
- Response caching - repeated identical requests return a cached response for free
- Observability - a dashboard showing request counts, token usage, latency, and estimated cost
The really nice part is that connecting it to your Worker is literally one extra argument to `env.AI.run()`.
### Setting it up
#### Step 1 - Create the gateway

In the Cloudflare Dashboard, go to **AI → AI Gateway → Create Gateway**. Give it a name - I used `my-blog`.
#### Step 2 - Add the gateway ID to `wrangler.jsonc`
```jsonc
// wrangler.jsonc
{
  "vars": {
    "AI_MODEL": "@cf/meta/llama-3.1-8b-instruct-fp8",
    "AI_GATEWAY_ID": "my-blog"
  }
}
```
#### Step 3 - Pass the gateway option to `env.AI.run()`
```ts
// src/pages/api/ai-chat.ts - the only code change needed
const gatewayId = env.AI_GATEWAY_ID ?? "";
const gatewayOptions = gatewayId ? { gateway: { id: gatewayId } } : {};

const stream = await env.AI.run(
  model,
  { messages: allMessages, stream: true, max_tokens: 1024 },
  gatewayOptions,
);
```
That’s really it. No API token, no changing the request URL - the Workers AI binding handles authentication automatically when you use it this way. See the Cloudflare AI Gateway - Workers Binding docs for more detail.
#### Step 4 - Set a rate limit in the dashboard
In the gateway settings, enable Rate Limiting. The dashboard only supports up to a 1-hour window (no daily option), so set it per hour and keep the daily math in mind:
| Setting | Value |
|---|---|
| Requests | 6 |
| Window | 1 Hour |
| Type | Fixed |
6 requests/hour × 24 hours = 144 requests/day × ~60 neurons = ~8,640 neurons/day, which stays comfortably under the 10,000 free tier. When the limit is hit, the gateway returns a 429 response and stops forwarding requests.
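The budget math above can be sanity-checked with a few lines. This is just back-of-envelope arithmetic using the numbers from this post; the ~60-neuron per-request figure is a rough average, not an exact cost:

```typescript
// Free-tier budget math (numbers from the post; the per-request
// neuron cost is a rough average, not an exact figure).
const FREE_NEURONS_PER_DAY = 10_000;
const NEURONS_PER_REQUEST = 60;
const HOURS_PER_DAY = 24;

// Highest hourly cap that still fits inside the daily free tier:
const maxReqPerHour = Math.floor(
  FREE_NEURONS_PER_DAY / NEURONS_PER_REQUEST / HOURS_PER_DAY,
);

// Worst-case daily neuron usage for a given hourly cap:
const dailyNeurons = (reqPerHour: number) =>
  reqPerHour * HOURS_PER_DAY * NEURONS_PER_REQUEST;

console.log(maxReqPerHour);    // 6
console.log(dailyNeurons(6));  // 8640 - under the 10,000 free tier
console.log(dailyNeurons(15)); // 21600 - over, but only by a little
```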
P.S. Even if you want to stay strictly under the free tier, I recommend bumping the limit up a bit (e.g. 15 req/hour) to avoid accidentally blocking real users when traffic spikes. The cost of a few extra requests is negligible, and it gives you some breathing room.
### Handling the 429
When the gateway blocks a request, the Worker catches the error and returns something readable:
```ts
// In the catch block of the API route
const errMsg = e instanceof Error ? e.message : String(e);
if (errMsg.includes("429") || errMsg.toLowerCase().includes("rate limit")) {
  return new Response(
    JSON.stringify({ error: "Daily AI usage limit reached. Please try again tomorrow." }),
    { status: 429 },
  );
}
```
The frontend checks for status 429 and shows the message directly - no ugly stack traces showing up in the UI.
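On the frontend side, that check can be as simple as the following sketch. The helper name and fallback strings here are illustrative, not from the repo:

```typescript
// Hypothetical frontend helper - maps an error response to a
// user-friendly message instead of surfacing a raw error in the UI.
async function readAssistantError(res: Response): Promise<string | null> {
  if (res.ok) return null; // success: the streaming path takes over

  if (res.status === 429) {
    try {
      const body = (await res.json()) as { error?: string };
      return body.error ?? "AI usage limit reached. Please try again later.";
    } catch {
      return "AI usage limit reached. Please try again later.";
    }
  }
  return "Something went wrong. Please try again.";
}
```

If this returns a string, the component renders it as the assistant's reply; otherwise it proceeds to stream the response.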
### Caching is a nice bonus
AI Gateway also caches responses. If 10 different visitors all ask “summarize this post” on the same day, only the first request actually hits Workers AI. The rest get the cached response instantly, for free. This helps stretch that ~144-request daily budget even further.
## What does it actually cost?
| Scenario | Neurons used | Cost |
|---|---|---|
| 144 requests/day × 60 neurons (6/hr × 24hr) | ~8,640 | $0.00 (under free tier) |
| Cache hit (repeated question) | 0 | $0.00 |
| Over limit (rate-limited by gateway) | 0 | $0.00 |
For a personal blog, roughly 150 requests a day is plenty. And once you’ve set the rate limit, you can forget about it.
## Seeing it in the dashboard
After deploying, the AI Gateway dashboard started filling in with real data - request counts, token breakdown (input vs output), calculated cost, cache hit rate. It was the first time I could see exactly what an AI feature was costing me in real time, right down to the sub-cent level.
Honestly that visibility alone made the AI Gateway integration worth it, even if the rate limiting wasn’t needed.
## Source code
The full implementation is open source:
**hossains-dev-bytes** - Astro blogging site hosted using Cloudflare Workers.
If you want to dig into the details, the main files are:
- `src/pages/api/ai-chat.ts` - the Worker API endpoint
- `src/components/AiPostAssistant.astro` - the UI component
- `wrangler.jsonc` - where `AI_MODEL` and `AI_GATEWAY_ID` are configured
If you’re already running a blog on Cloudflare Workers, adding this is pretty low effort - an AI coding agent can scaffold most of it. Give it a try 👍🏽