@casoon/astro-site-files
Astro integration that generates all standard site meta-files from typed configuration at build time.
What it does
- Generates
robots.txt— crawl rules with per-agent overrides and automatic sitemap reference - Generates
llms.txt— AI model discovery file following the llmstxt.org specification - Generates
sitemap.xml— built-in, enabled by default, with dynamic sources, i18n hreflang and sitemap-index support - Generates
rss.xml— RSS 2.0 feed with CDATA escaping, custom namespaces and per-item hooks - Generates
/.well-known/security.txt— vulnerability disclosure contact per RFC 9116 - Generates
humans.txt— team and technology credits per humanstxt.org
All files are written to the build output directory when astro build runs.
Successor package. This integration replaces @casoon/astro-crawler-policy (robots.txt + llms.txt) and @casoon/astro-sitemap (sitemap.xml + rss.xml). Both predecessor packages are no longer actively maintained.
Requirements
- Node.js ≥ 22.12.0 (aligned with Astro 6)
- Astro ≥ 6.0.0 (peer dependency, optional for programmatic usage)
Installation
npm install @casoon/astro-site-filesQuick start
// astro.config.ts
import { defineConfig } from 'astro/config'
import siteFiles from '@casoon/astro-site-files'
export default defineConfig({
site: 'https://example.com',
integrations: [
siteFiles({
robots: {
preset: 'seoOnly', // blocks AI training and archives; search engines stay allowed
disallow: ['/admin'], // additional path rules on top of the preset
},
llms: { title: 'Example', description: 'An example website.' },
security: { contact: 'mailto:info@casoon.de' },
humans: {
team: [{ name: 'Alice', role: 'Development' }],
technology: ['Astro', 'TypeScript']
}
})
]
})robots.txt and sitemap.xml are enabled by default. The other three files are generated only when their option is configured.
robots.txt
The recommended approach is to start with a preset and add path rules on top:
siteFiles({
robots: {
preset: 'seoOnly', // blocks AI training, archiving; search engines allowed
disallow: ['/admin'], // additional paths for User-agent: *
}
})For fine-grained control without a preset:
siteFiles({
robots: {
disallow: ['/admin', '/private/'],
allow: ['/admin/public/'],
crawlDelay: 2,
sitemap: true, // auto-derive from astro.config site URL (default)
agents: [
{ userAgent: 'Googlebot', crawlDelay: 1 }
]
}
})Option reference:
| Option | Type | Default | Description |
|---|---|---|---|
disallow |
string[] |
[] |
Paths to disallow for User-agent: * |
allow |
string[] |
[] |
Paths to explicitly allow for User-agent: * |
crawlDelay |
number |
— | Crawl-delay for User-agent: * |
sitemap |
boolean | string |
true |
true = derive URL from astro.config.site, string = explicit URL, false = omit |
preset |
Preset |
— | Named preset — see Presets below |
bots |
Record<string, BotAction> |
{} |
Per-bot overrides — keyed by bot id, take precedence over groups and preset |
groups |
Groups |
{} |
Group-level action controls — see sub-table below |
extraBots |
RegistryBot[] |
[] |
Additional bots to merge into the built-in registry |
agents |
AgentRule[] |
[] |
Explicit per-agent rule blocks — appended after registry-derived rules |
BotAction values: 'allow' (emit Allow: /), 'disallow' (emit Disallow: /), 'inherit' (no rule emitted; User-agent: * applies).
Groups fields:
| Group key | Covers |
|---|---|
searchEngines |
Googlebot, Bingbot, DuckDuckBot |
verifiedAi |
Verified AI bots — OpenAI, Anthropic, Google, Perplexity, You.com, Amazon, Apple, Meta, ByteDance |
unknownAi |
Unverified or uncategorized scrapers (Diffbot, Omgilibot) |
seoScanners |
SEO analytics tools — AhrefsBot, SemrushBot, MJ12bot, DotBot |
archives |
Web archiving bots — ia_archiver, archive.org_bot |
Each entry in agents:
| Field | Type | Description |
|---|---|---|
userAgent |
string | string[] |
User-agent value(s) |
allow |
string[] |
Paths to allow |
disallow |
string[] |
Paths to disallow |
crawlDelay |
number |
Crawl-delay for this agent |
Disable: robots: false
Generated output:
User-agent: *
Disallow: /admin
Disallow: /private/
Allow: /admin/public/
Crawl-delay: 2
User-agent: Googlebot
Crawl-delay: 1
Sitemap: https://example.com/sitemap.xml
Presets
A preset configures group defaults and known-bot rules in one step. Individual bots and groups options override the preset.
siteFiles({
robots: { preset: 'seoOnly' }
})| Preset | searchEngines | verifiedAi | unknownAi | seoScanners | archives | Notes |
|---|---|---|---|---|---|---|
seoOnly |
allow | disallow | disallow | inherit | disallow | All AI training and archiving blocked; search engines stay |
citationFriendly |
allow | allow | disallow | inherit | inherit | AI may read and cite; training crawlers overridden via bots |
openToAi |
allow | allow | allow | inherit | allow | Everything allowed |
blockTraining |
allow | allow | disallow | inherit | disallow | AI input/search allowed; training bots overridden via bots |
lockdown |
disallow | disallow | disallow | disallow | disallow | Everything blocked |
inherit means no rule is emitted for that group — User-agent: * applies. citationFriendly and blockTraining additionally override specific training bots via bots regardless of the verifiedAi group setting.
Group overrides let you adjust one category without changing the preset for others:
siteFiles({
robots: {
preset: 'seoOnly',
groups: { seoScanners: 'disallow' } // also block SEO scanners
}
})Per-bot overrides take the highest precedence:
siteFiles({
robots: {
preset: 'blockTraining',
bots: { PerplexityBot: 'disallow' } // also block AI search
}
})Adding custom bots:
siteFiles({
robots: {
preset: 'seoOnly',
extraBots: [
{ id: 'MyBot', provider: 'Example', userAgents: ['MyBot/1.0'], categories: ['ai-training'], verified: false }
]
}
})Blocking AI crawlers and web archives
robots.txt is voluntary — compliant bots respect it, aggressive scrapers often do not. For most sites the pragmatic approach is a layered "soft block": signal your preferences clearly while keeping search engines working normally.
Known bots to consider blocking
| User-agent | Origin |
|---|---|
ia_archiver |
Internet Archive / Wayback Machine |
archive.org_bot |
Internet Archive (secondary agent) |
GPTBot |
OpenAI training crawler |
ChatGPT-User |
OpenAI — when ChatGPT fetches URLs on behalf of a user |
ClaudeBot |
Anthropic |
Claude-Web |
Anthropic |
anthropic-ai |
Anthropic |
Google-Extended |
Google — Gemini / AI Overviews training |
CCBot |
Common Crawl — the base dataset behind many models |
PerplexityBot |
Perplexity AI |
YouBot |
You.com AI search |
Amazonbot |
Amazon — Alexa / Rufus training |
Applebot-Extended |
Apple AI features |
Bytespider |
ByteDance / TikTok ecosystem |
OAI-SearchBot |
OpenAI — search and browsing |
meta-externalagent |
Meta AI |
Diffbot |
Automated data extraction (unverified) |
Omgilibot |
Social media data aggregator, used in training sets (unverified) |
AhrefsBot |
Ahrefs SEO scanner |
SemrushBot |
Semrush SEO scanner |
MJ12bot |
Majestic SEO scanner |
DotBot |
OpenLinkProfiler SEO scanner |
Important: block CCBot. Many models are not trained directly from your site but via datasets derived from Common Crawl. Blocking only GPTBot while leaving CCBot open still lets your content reach training pipelines indirectly.
Variant 1 — Pragmatic / SEO-safe
Good for company sites, blogs, agencies. Normal search engines keep working; AI training and archiving are restricted.
siteFiles({
robots: { preset: 'seoOnly' }
})Variant 2 — Content-focused / block training
For publishers, premium content, or media sites. AI may read and cite content; training crawlers and archives are blocked.
siteFiles({
robots: { preset: 'blockTraining' }
})Variant 3 — Maximum restriction
Block everything including SEO scanners. Use with caution — this also prevents you from using SEO tools on your own site.
siteFiles({
robots: {
preset: 'seoOnly',
groups: { seoScanners: 'disallow' }
}
})Note on SEO scanners: Blocking AhrefsBot, SemrushBot, and similar tools prevents competitors from analysing your backlink profile or content, but also prevents you from using those tools on your own site. Evaluate the trade-off before adding them.
Meta tag and HTTP header
<meta name="robots" content="noarchive">X-Robots-Tag: noarchive
This helps against search engine caches and snapshots (e.g. Google Cache). It does not protect against active scrapers, training data dumps, or content already copied.
What robots.txt cannot do
Since 2025–2026, many AI scrapers no longer identify themselves as bots. They use residential IPs, headless browsers with standard headers, and distributed request patterns that are indistinguishable from normal traffic. robots.txt cannot stop them.
Effective countermeasures require infrastructure:
- Rate limiting
- Bot detection (e.g. Cloudflare Bot Fight Mode)
- JS challenges for suspicious traffic
- IP reputation filtering and login walls
If you use Cloudflare, a WAF rule can block unverified bots while allowing legitimate search crawlers:
(cf.client.bot and not cf.verified_bot_category in {"Search Engine Crawler"})
→ Challenge / JS Challenge / Block
robots.txt is a declaration of intent, not an enforcement mechanism.
llms.txt
Follows the llmstxt.org specification. Provides structured metadata for AI models discovering what your site is about.
siteFiles({
llms: {
title: 'Example',
description: 'An example website focused on TypeScript tooling.',
details: 'This site documents internal tools and workflows.',
sections: [
{
title: 'Documentation',
links: [
{ title: 'Getting started', url: '/docs/start', description: 'Setup guide' },
{ title: 'API reference', url: '/docs/api' }
]
}
]
}
})Use sources to generate sections from code — for example from a content collection — instead of maintaining them manually:
siteFiles({
llms: {
title: 'Example',
description: 'An example website.',
sources: [
async () => {
const posts = await getCollection('blog')
return {
title: 'Blog',
links: posts.map(p => ({ title: p.data.title, url: `/blog/${p.id}/` })),
}
},
],
},
})Sections from sources are appended after any manually defined sections.
Option reference:
| Option | Type | Description |
|---|---|---|
title |
string |
Required. Site or project name |
description |
string |
Short description rendered as a blockquote |
details |
string |
Additional plain-text context |
sections |
LlmsSection[] |
Named sections with link lists (static) |
sources |
LlmsSource[] |
Async functions that return additional sections |
Each entry in sections:
| Field | Type | Description |
|---|---|---|
title |
string |
Section heading |
links |
Link[] |
Optional list of links |
Each entry in links:
| Field | Type | Description |
|---|---|---|
title |
string |
Link label |
url |
string |
Absolute or relative URL |
description |
string |
Optional inline description after the link |
Disable: Omit the option or set llms: false
Generated output:
# Example
> An example website focused on TypeScript tooling.
This site documents internal tools and workflows.
## Documentation
- [Getting started](/docs/start): Setup guide
- [API reference](/docs/api)sitemap.xml
Sitemap generation is built-in and enabled by default. Static pages are discovered automatically from Astro's build output. Dynamic URLs can be added via sources.
siteFiles({
sitemap: {
exclude: ['/landing/'],
priority: [{ pattern: '/blog/', priority: 0.9 }],
sources: [
async () => {
const posts = await getCollection('blog')
return posts.map(p => ({ loc: `/blog/${p.id}/`, lastmod: p.data.date }))
}
]
}
})Automatic HTML Metadata Extraction (Opt-in)
The sitemap builder automatically scans your built HTML files for page-specific metadata. If a page contains a JSON-LD <script type="application/ld+json"> tag with data-sitemap-changefreq or data-sitemap-priority attributes (such as those generated by @casoon/astro-structured-data), the generator will parse and apply them directly to that page's entry in sitemap.xml.
This integration is completely decoupled and optional: if a page does not contain these custom script attributes, the sitemap generator falls back gracefully to your global configuration rules and built-in path-based defaults.
Option reference:
| Option | Type | Description |
|---|---|---|
siteUrl |
string |
Override the site URL (auto-detected from astro.config.site) |
sources |
SitemapSource[] |
Async functions returning additional SitemapEntry[] |
exclude |
(string | RegExp)[] |
URL paths or patterns to exclude |
filter |
(url: string) => boolean |
Custom filter on the full absolute URL |
priority |
PriorityRule[] |
Pattern-based priority overrides (first match wins) |
changefreq |
ChangefreqRule[] |
Pattern-based changefreq overrides (first match wins) |
serialize |
(entry) => entry | undefined |
Per-item transform or filter hook |
i18n |
{ defaultLocale, locales } |
Generates <xhtml:link rel="alternate"> hreflang entries |
rss |
RssConfig |
Generate an RSS 2.0 feed at build time — see RSS feed below |
output.mode |
'single' | 'index' |
index splits into numbered chunks (auto when > maxUrls). In index mode the index file is always sitemap-index.xml and chunks are sitemap-1.xml, sitemap-2.xml, … |
output.maxUrls |
number |
Max URLs per file in index mode — default 50 000 |
output.filename |
string |
Output filename in single-file mode — default sitemap.xml. Ignored in index mode. |
audit.warnOnEmpty |
boolean |
Warn when sitemap has zero entries — default true |
audit.errorOnDuplicates |
boolean |
Emit error instead of warning for duplicate URLs — default false |
Built-in exclusions (always applied): /404, /500, /_*, /api/, /landing/, /drafts/, sitemap.xml, robots.txt, llms.txt, rss.xml, and any page whose HTML starts with <meta http-equiv="refresh"> (meta-refresh redirect pages).
Built-in priority defaults: / → 1.0, depth 1 → 0.9, depth 2 → 0.8, depth 3+ → 0.7
Built-in changefreq defaults: / and content paths (/blog/, /artikel/, etc.) → weekly, everything else → monthly
Disable: sitemap: false
RSS feed
Configure sitemap.rss to generate an rss.xml at build time alongside the sitemap. getItems runs in astro:build:done — use filesystem reads rather than getCollection(), which is only available in Astro's SSR context.
siteFiles({
sitemap: {
rss: {
title: 'My Blog',
description: 'Latest articles about TypeScript and Astro.',
language: 'en',
getItems: async (siteUrl) => {
const { readdirSync, readFileSync } = await import('node:fs')
const matter = (await import('gray-matter')).default
const dir = './src/content/blog'
return readdirSync(dir)
.filter(f => f.endsWith('.mdx'))
.map(file => {
const { data } = matter(readFileSync(`${dir}/${file}`, 'utf-8'))
if (data.draft) return null
return {
title: data.title,
pubDate: data.date,
link: `${siteUrl}/blog/${file.replace(/\.mdx$/, '')}/`,
description: data.description,
}
})
.filter(Boolean)
.sort((a, b) => new Date(b.pubDate) - new Date(a.pubDate))
},
},
},
})rss option reference:
| Option | Type | Description |
|---|---|---|
title |
string |
Required. Feed title |
description |
string |
Required. Feed description |
getItems |
(siteUrl: string) => RssItem[] |
Required. Returns the feed items |
filename |
string |
Output filename — default rss.xml |
feedUrl |
string |
Self-link URL — defaults to {siteUrl}/{filename} |
language |
string |
BCP 47 language code, e.g. 'de-DE' |
copyright |
string |
Copyright notice |
managingEditor |
string |
RFC 822 format: email@domain.com (Name) |
feedCustomData |
string |
Raw XML injected inside <channel> |
xmlns |
Record<string, string> |
Additional namespace declarations on <rss> |
Each object returned by getItems:
| Field | Type | Description |
|---|---|---|
title |
string |
Required. Item title |
pubDate |
Date | string |
Required. Publication date |
link |
string |
Required. Full URL or root-relative path |
description |
string |
Short summary |
author |
string |
Author name or email |
categories |
string[] |
Category tags |
customData |
string |
Raw XML injected inside <item> (e.g. for custom namespaced elements) |
RSS API route (/rss sub-path)
For a live feed served at a URL — useful in development or for SSR builds — use createRssRoute from the /rss sub-path. This helper runs inside Astro's SSR context, so getCollection() is available:
// src/pages/rss.xml.ts
import { createRssRoute } from '@casoon/astro-site-files/rss'
import { getCollection } from 'astro:content'
export const GET = createRssRoute({
title: 'My Blog',
description: 'Latest posts',
language: 'de-DE',
getItems: async (siteUrl) => {
const posts = await getCollection('blog', ({ data }) => !data.draft)
return posts
.sort((a, b) => b.data.date.getTime() - a.data.date.getTime())
.map(p => ({
title: p.data.title,
pubDate: p.data.date,
link: `${siteUrl}/blog/${p.id}/`,
description: p.data.description,
}))
},
})Both approaches can coexist: build-time sitemap.rss for static deploys, API route for development previewing.
security.txt
Generated at /.well-known/security.txt per RFC 9116. The contact field is required by the specification.
siteFiles({
security: {
contact: 'mailto:info@casoon.de',
policy: 'https://www.casoon.de/security-policy',
acknowledgments: 'https://www.casoon.de/hall-of-fame',
preferredLanguages: ['en', 'de'],
expires: '2027-01-01T00:00:00.000Z',
hiring: 'https://www.casoon.de/jobs'
}
})Option reference:
| Option | Type | Description |
|---|---|---|
contact |
string | string[] |
Required. mailto: or https: URI for reporting vulnerabilities |
policy |
string |
URL of the security policy |
acknowledgments |
string |
URL of the acknowledgments or hall-of-fame page |
preferredLanguages |
string[] |
BCP 47 language tags, e.g. ['en', 'de'] |
expires |
string | Date |
ISO 8601 expiry date — when to renew the file |
encryption |
string |
URL of the PGP public key |
canonical |
string |
Canonical URL of this security.txt file |
hiring |
string |
URL of a security-focused jobs page |
Disable: Omit the option or set security: false
Generated output:
Contact: mailto:info@casoon.de
Expires: 2027-01-01T00:00:00.000Z
Acknowledgments: https://www.casoon.de/hall-of-fame
Preferred-Languages: en, de
Policy: https://www.casoon.de/security-policy
Hiring: https://www.casoon.de/jobs
humans.txt
Follows the humanstxt.org convention.
siteFiles({
humans: {
team: [
{ name: 'Alice', role: 'Development', location: 'Berlin' },
{ name: 'Bob', role: 'Design', twitter: '@bob' }
],
thanks: ['Open Source Community', 'Our early users'],
technology: ['Astro', 'TypeScript', 'Tailwind CSS'],
note: 'Built with care.'
}
})Option reference:
| Option | Type | Description |
|---|---|---|
team |
TeamMember[] |
List of team members |
thanks |
string[] |
Acknowledgment entries |
technology |
string[] |
Technologies used — rendered as a comma-separated list |
note |
string |
Free-form note |
lastUpdate |
string | Date |
Defaults to the build date |
Each entry in team:
| Field | Type | Description |
|---|---|---|
name |
string |
Required. Full name |
role |
string |
Job title or role |
twitter |
string |
Twitter / X handle |
location |
string |
City or country |
email |
string |
Contact email |
Disable: Omit the option or set humans: false
Generated output:
/* TEAM */
Name: Alice
Role: Development
Location: Berlin
/* SITE LAST UPDATED */
2026-05-06
/* TECHNOLOGY COLOPHON */
Astro, TypeScript, Tailwind CSS
Build-time audit hints
The integration emits build-time hints when configuration looks incomplete or incorrect. Each hint has a rule ID, a level (info / warn), and a help message.
All rule IDs:
| Rule ID | Level | Triggered when |
|---|---|---|
robots/legal-pages-blocked |
warn | A legal page (/privacy, /terms, /impressum, …) is in disallow |
llms/no-description |
info | llms has no description |
llms/no-sections |
info | llms has no sections or sources |
llms/sections-without-links |
info | Sections exist but none have links (and no sources configured) |
security/no-expires |
warn | security has no expires date (required by RFC 9116) |
security/no-policy |
info | security has no policy URL |
humans/no-team |
info | humans has no team entries |
humans/no-technology |
info | humans has no technology entries |
sitemap/no-site-url |
warn | No site URL is configured — <loc> entries will be relative |
sitemap/empty-sitemap |
warn | Sitemap has no entries after all sources are resolved |
sitemap/duplicate-urls |
warn/error | Duplicate URLs detected before deduplication (last wins) |
sitemap/invalid-priority |
warn | One or more entries have priority outside [0, 1] |
Disable all hints:
siteFiles({ audit: false })Suppress specific rules:
siteFiles({
audit: {
disable: [
'llms/no-description',
'security/no-expires',
],
},
})audit option reference:
| Option | Type | Description |
|---|---|---|
enabled |
boolean |
Set to false to silence all hints |
disable |
string[] |
Rule IDs to suppress individually |
Passing audit: false is equivalent to audit: { enabled: false }.
Option defaults
| Option | Default behavior |
|---|---|
robots |
Enabled — generates robots.txt that allows all crawlers by default |
llms |
Disabled — requires { title } |
sitemap |
Enabled — built-in sitemap generation from Astro's build output |
sitemap.rss |
Disabled — requires { title, description, getItems } |
security |
Disabled — requires { contact } |
humans |
Disabled — generates when any option is provided |
audit |
Enabled — emits build-time hints for all generated files |
Programmatic usage
The renderer functions are exported for use outside of the Astro integration:
import {
renderRobotsTxt,
renderLlmsTxt,
renderSecurityTxt,
renderHumansTxt,
renderSitemapXml,
renderSitemapIndex,
renderRssFeed,
resolveEntry,
deduplicateEntries,
auditSitemap,
auditRobots,
auditLlms,
auditSecurity,
auditHumans,
filterIssues,
defaultRegistry,
REGISTRY_VERSION,
} from '@casoon/astro-site-files'
import type {
AuditOptions,
AuditIssue,
RssConfig,
RssItem,
BotAction,
BotCategory,
RegistryBot,
Preset,
} from '@casoon/astro-site-files'
// With preset
const robots = renderRobotsTxt({ preset: 'seoOnly' }, 'https://example.com')
// With manual overrides on top of a preset
const robots2 = renderRobotsTxt(
{ preset: 'blockTraining', bots: { PerplexityBot: 'disallow' }, disallow: ['/admin'] },
'https://example.com',
)
// Without preset (manual)
const robots3 = renderRobotsTxt({ disallow: ['/admin'] }, 'https://example.com')
const llms = renderLlmsTxt({ title: 'My Site', description: 'A site.' })
const security = renderSecurityTxt({ contact: 'mailto:info@casoon.de' })
const humans = renderHumansTxt({ team: [{ name: 'Alice' }], technology: ['Astro'] })
const entries = [{ loc: '/blog/post/' }].map(e => resolveEntry(e, {}, 'https://example.com'))
const xml = renderSitemapXml(deduplicateEntries(entries))
const rss = renderRssFeed(
{ title: 'My Blog', description: 'Latest posts', language: 'en' },
'https://example.com',
[{ title: 'Hello', pubDate: new Date(), link: '/blog/hello/' }],
)defaultRegistry exposes the full built-in bot list. REGISTRY_VERSION is an ISO date string of the last registry update — useful for debugging or displaying in tooling.
The createRssRoute helper is available from the /rss sub-path (see RSS API route above):
import { createRssRoute } from '@casoon/astro-site-files/rss'
import type { CreateRssRouteOptions } from '@casoon/astro-site-files/rss'This package covers static file generation. Actual crawl enforcement depends on whether bots respect these files — many do not.