@casoon/astro-site-files

Version 0.4.2 (published 6d ago) • Licence MIT • Size 95 kB

Astro integration that generates all standard site meta-files from typed configuration at build time.

What it does

Generates robots.txt — crawl rules with per-agent overrides and automatic sitemap reference
Generates llms.txt — AI model discovery file following the llmstxt.org specification
Generates sitemap.xml — built-in, enabled by default, with dynamic sources, i18n hreflang and sitemap-index support
Generates rss.xml — RSS 2.0 feed with CDATA escaping, custom namespaces and per-item hooks
Generates /.well-known/security.txt — vulnerability disclosure contact per RFC 9116
Generates humans.txt — team and technology credits per humanstxt.org

All files are written to the build output directory when astro build runs.

Successor package. This integration replaces @casoon/astro-crawler-policy (robots.txt + llms.txt) and @casoon/astro-sitemap (sitemap.xml + rss.xml). Both predecessor packages are no longer actively maintained.

Requirements

Node.js ≥ 22.12.0 (aligned with Astro 6)
Astro ≥ 6.0.0 (peer dependency, optional for programmatic usage)

Installation

npm install @casoon/astro-site-files

Quick start

// astro.config.ts
import { defineConfig } from 'astro/config'
import siteFiles from '@casoon/astro-site-files'

export default defineConfig({
  site: 'https://example.com',
  integrations: [
    siteFiles({
      robots: {
        preset: 'seoOnly',      // blocks AI training and archives; search engines stay allowed
        disallow: ['/admin'],   // additional path rules on top of the preset
      },
      llms: { title: 'Example', description: 'An example website.' },
      security: { contact: 'mailto:info@casoon.de' },
      humans: {
        team: [{ name: 'Alice', role: 'Development' }],
        technology: ['Astro', 'TypeScript']
      }
    })
  ]
})

robots.txt and sitemap.xml are enabled by default. The remaining files (llms.txt, rss.xml, security.txt and humans.txt) are generated only when their option is configured.

robots.txt

The recommended approach is to start with a preset and add path rules on top:

siteFiles({
  robots: {
    preset: 'seoOnly',           // blocks AI training, archiving; search engines allowed
    disallow: ['/admin'],        // additional paths for User-agent: *
  }
})

For fine-grained control without a preset:

siteFiles({
  robots: {
    disallow: ['/admin', '/private/'],
    allow: ['/admin/public/'],
    crawlDelay: 2,
    sitemap: true,               // auto-derive from astro.config site URL (default)
    agents: [
      { userAgent: 'Googlebot', crawlDelay: 1 }
    ]
  }
})

Option reference:

Option	Type	Default	Description
`disallow`	`string[]`	`[]`	Paths to disallow for `User-agent: *`
`allow`	`string[]`	`[]`	Paths to explicitly allow for `User-agent: *`
`crawlDelay`	`number`	—	Crawl-delay for `User-agent: *`
`sitemap`	`boolean \| string`	`true`	`true` = derive URL from `astro.config.site`, `string` = explicit URL, `false` = omit
`preset`	`Preset`	—	Named preset — see Presets below
`bots`	`Record<string, BotAction>`	`{}`	Per-bot overrides — keyed by bot id, take precedence over groups and preset
`groups`	`Groups`	`{}`	Group-level action controls — see sub-table below
`extraBots`	`RegistryBot[]`	`[]`	Additional bots to merge into the built-in registry
`agents`	`AgentRule[]`	`[]`	Explicit per-agent rule blocks — appended after registry-derived rules

BotAction values: 'allow' (emit Allow: /), 'disallow' (emit Disallow: /), 'inherit' (no rule emitted; User-agent: * applies).

Groups fields:

Group key	Covers
`searchEngines`	Googlebot, Bingbot, DuckDuckBot
`verifiedAi`	Verified AI bots — OpenAI, Anthropic, Google, Perplexity, You.com, Amazon, Apple, Meta, ByteDance
`unknownAi`	Unverified or uncategorized scrapers (Diffbot, Omgilibot)
`seoScanners`	SEO analytics tools — AhrefsBot, SemrushBot, MJ12bot, DotBot
`archives`	Web archiving bots — ia_archiver, archive.org_bot

Each entry in agents:

Field	Type	Description
`userAgent`	`string \| string[]`	User-agent value(s)
`allow`	`string[]`	Paths to allow
`disallow`	`string[]`	Paths to disallow
`crawlDelay`	`number`	Crawl-delay for this agent

Disable: robots: false

Generated output for the fine-grained example above:

User-agent: *
Disallow: /admin
Disallow: /private/
Allow: /admin/public/
Crawl-delay: 2

User-agent: Googlebot
Crawl-delay: 1

Sitemap: https://example.com/sitemap.xml
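
To make the mapping from options to output concrete, here is a minimal sketch of how such a file could be assembled. It is not the package's implementation: `sketchRobotsTxt`, `renderBlock` and the option shape (including `sitemapUrl`) are hypothetical names for illustration, and the line order simply mirrors the output shown above.

```typescript
// Hypothetical types for illustration; the package's real option types may differ.
interface AgentRuleSketch {
  userAgent: string | string[];
  allow?: string[];
  disallow?: string[];
  crawlDelay?: number;
}

interface RobotsSketchOptions {
  allow?: string[];
  disallow?: string[];
  crawlDelay?: number;
  agents?: AgentRuleSketch[];
  sitemapUrl?: string; // assumed to be already resolved to an absolute URL
}

// Render one block: User-agent lines first, then Disallow, Allow, Crawl-delay.
function renderBlock(rule: AgentRuleSketch): string {
  const agents = Array.isArray(rule.userAgent) ? rule.userAgent : [rule.userAgent];
  const lines = [
    ...agents.map((ua) => `User-agent: ${ua}`),
    ...(rule.disallow ?? []).map((path) => `Disallow: ${path}`),
    ...(rule.allow ?? []).map((path) => `Allow: ${path}`),
  ];
  if (rule.crawlDelay !== undefined) lines.push(`Crawl-delay: ${rule.crawlDelay}`);
  return lines.join("\n");
}

// The wildcard block comes first, explicit agent blocks follow, the sitemap line is last.
function sketchRobotsTxt(options: RobotsSketchOptions): string {
  const blocks = [
    renderBlock({
      userAgent: "*",
      allow: options.allow,
      disallow: options.disallow,
      crawlDelay: options.crawlDelay,
    }),
    ...(options.agents ?? []).map(renderBlock),
  ];
  if (options.sitemapUrl) blocks.push(`Sitemap: ${options.sitemapUrl}`);
  return blocks.join("\n\n") + "\n";
}

const robotsSketch = sketchRobotsTxt({
  disallow: ["/admin", "/private/"],
  allow: ["/admin/public/"],
  crawlDelay: 2,
  agents: [{ userAgent: "Googlebot", crawlDelay: 1 }],
  sitemapUrl: "https://example.com/sitemap.xml",
});
```

With these inputs the sketch reproduces the sample output above; preset- and registry-derived blocks are deliberately left out.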

Presets

A preset configures group defaults and known-bot rules in one step. Individual bots and groups options override the preset.

siteFiles({
  robots: { preset: 'seoOnly' }
})

Preset	searchEngines	verifiedAi	unknownAi	seoScanners	archives	Notes
`seoOnly`	allow	disallow	disallow	inherit	disallow	All AI training and archiving blocked; search engines stay
`citationFriendly`	allow	allow	disallow	inherit	inherit	AI may read and cite; training crawlers overridden via `bots`
`openToAi`	allow	allow	allow	inherit	allow	Everything allowed
`blockTraining`	allow	allow	disallow	inherit	disallow	AI input/search allowed; training bots overridden via `bots`
`lockdown`	disallow	disallow	disallow	disallow	disallow	Everything blocked

inherit means no rule is emitted for that group — User-agent: * applies. citationFriendly and blockTraining additionally override specific training bots via bots regardless of the verifiedAi group setting.

Group overrides let you adjust one category without changing the preset for others:

siteFiles({
  robots: {
    preset: 'seoOnly',
    groups: { seoScanners: 'disallow' }  // also block SEO scanners
  }
})

Per-bot overrides take the highest precedence:

siteFiles({
  robots: {
    preset: 'blockTraining',
    bots: { PerplexityBot: 'disallow' }  // also block AI search
  }
})
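
The precedence described here (per-bot override, then group override, then preset default) can be sketched as a simple fallback chain. This is an illustration only, not the package's code: `resolveBotAction` and `presetGroups` are hypothetical names, only two presets are modelled, and the extra training-bot overrides that some presets apply internally are omitted.

```typescript
type BotAction = "allow" | "disallow" | "inherit";
type GroupKey = "searchEngines" | "verifiedAi" | "unknownAi" | "seoScanners" | "archives";
type PresetName = "seoOnly" | "blockTraining";

// Group defaults copied from the preset table above (two presets only).
const presetGroups: Record<PresetName, Record<GroupKey, BotAction>> = {
  seoOnly: {
    searchEngines: "allow",
    verifiedAi: "disallow",
    unknownAi: "disallow",
    seoScanners: "inherit",
    archives: "disallow",
  },
  blockTraining: {
    searchEngines: "allow",
    verifiedAi: "allow",
    unknownAi: "disallow",
    seoScanners: "inherit",
    archives: "disallow",
  },
};

interface PrecedenceInput {
  preset: PresetName;
  groups?: Partial<Record<GroupKey, BotAction>>;
  bots?: Record<string, BotAction>;
}

// Highest precedence first: bots, then groups, then the preset's group default.
function resolveBotAction(botId: string, group: GroupKey, config: PrecedenceInput): BotAction {
  return config.bots?.[botId] ?? config.groups?.[group] ?? presetGroups[config.preset][group];
}

// PerplexityBot belongs to verifiedAi, which blockTraining allows; the bots entry wins.
const perplexityAction = resolveBotAction("PerplexityBot", "verifiedAi", {
  preset: "blockTraining",
  bots: { PerplexityBot: "disallow" },
});
```

An `inherit` result means no dedicated block is emitted for that bot, so the `User-agent: *` rules apply.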

Adding custom bots:

siteFiles({
  robots: {
    preset: 'seoOnly',
    extraBots: [
      { id: 'MyBot', provider: 'Example', userAgents: ['MyBot/1.0'], categories: ['ai-training'], verified: false }
    ]
  }
})

Blocking AI crawlers and web archives

robots.txt is voluntary — compliant bots respect it, aggressive scrapers often do not. For most sites the pragmatic approach is a layered "soft block": signal your preferences clearly while keeping search engines working normally.

Known bots to consider blocking

User-agent	Origin
`ia_archiver`	Internet Archive / Wayback Machine
`archive.org_bot`	Internet Archive (secondary agent)
`GPTBot`	OpenAI training crawler
`ChatGPT-User`	OpenAI — when ChatGPT fetches URLs on behalf of a user
`ClaudeBot`	Anthropic
`Claude-Web`	Anthropic
`anthropic-ai`	Anthropic
`Google-Extended`	Google: Gemini training and grounding token (does not affect Google Search)
`CCBot`	Common Crawl — the base dataset behind many models
`PerplexityBot`	Perplexity AI
`YouBot`	You.com AI search
`Amazonbot`	Amazon — Alexa / Rufus training
`Applebot-Extended`	Apple AI features
`Bytespider`	ByteDance / TikTok ecosystem
`OAI-SearchBot`	OpenAI — search and browsing
`meta-externalagent`	Meta AI
`Diffbot`	Automated data extraction (unverified)
`Omgilibot`	Social media data aggregator, used in training sets (unverified)
`AhrefsBot`	Ahrefs SEO scanner
`SemrushBot`	Semrush SEO scanner
`MJ12bot`	Majestic SEO scanner
`DotBot`	Moz SEO scanner

Important: block CCBot. Many models are not trained directly from your site but via datasets derived from Common Crawl. Blocking only GPTBot while leaving CCBot open still lets your content reach training pipelines indirectly.

Variant 1 — Pragmatic / SEO-safe

Good for company sites, blogs, agencies. Normal search engines keep working; AI training and archiving are restricted.

siteFiles({
  robots: { preset: 'seoOnly' }
})

Variant 2 — Content-focused / block training

For publishers, premium content, or media sites. AI may read and cite content; training crawlers and archives are blocked.

siteFiles({
  robots: { preset: 'blockTraining' }
})

Variant 3 — Maximum restriction

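Blocks AI crawlers, archives and SEO scanners while search engines stay allowed; to block search engines as well, use the lockdown preset. Use with caution: blocking SEO scanners also prevents you from using SEO tools on your own site.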

siteFiles({
  robots: {
    preset: 'seoOnly',
    groups: { seoScanners: 'disallow' }
  }
})

Note on SEO scanners: Blocking AhrefsBot, SemrushBot, and similar tools prevents competitors from analysing your backlink profile or content, but also prevents you from using those tools on your own site. Evaluate the trade-off before adding them.

Meta tag and HTTP header

<meta name="robots" content="noarchive">

X-Robots-Tag: noarchive

This helps against search engine caches and snapshots (e.g. Google Cache). It does not protect against active scrapers, training data dumps, or content already copied.

What robots.txt cannot do

Since 2025–2026, many AI scrapers no longer identify themselves as bots. They use residential IPs, headless browsers with standard headers, and distributed request patterns that are indistinguishable from normal traffic. robots.txt cannot stop them.

Effective countermeasures require infrastructure:

Rate limiting
Bot detection (e.g. Cloudflare Bot Fight Mode)
JS challenges for suspicious traffic
IP reputation filtering and login walls

If you use Cloudflare, a WAF custom rule can challenge or block known bots that are not search engine crawlers, while leaving search crawlers untouched:

(cf.client.bot and not cf.verified_bot_category in {"Search Engine Crawler"})
→ Challenge / JS Challenge / Block

robots.txt is a declaration of intent, not an enforcement mechanism.

llms.txt

Follows the llmstxt.org specification. Provides structured metadata for AI models discovering what your site is about.

siteFiles({
  llms: {
    title: 'Example',
    description: 'An example website focused on TypeScript tooling.',
    details: 'This site documents internal tools and workflows.',
    sections: [
      {
        title: 'Documentation',
        links: [
          { title: 'Getting started', url: '/docs/start', description: 'Setup guide' },
          { title: 'API reference', url: '/docs/api' }
        ]
      }
    ]
  }
})

Use sources to generate sections from code — for example from a content collection — instead of maintaining them manually:

siteFiles({
  llms: {
    title: 'Example',
    description: 'An example website.',
    sources: [
      async () => {
        const posts = await getCollection('blog')
        return {
          title: 'Blog',
          links: posts.map(p => ({ title: p.data.title, url: `/blog/${p.id}/` })),
        }
      },
    ],
  },
})

Sections from sources are appended after any manually defined sections.

Option reference:

Option	Type	Description
`title`	`string`	Required. Site or project name
`description`	`string`	Short description rendered as a blockquote
`details`	`string`	Additional plain-text context
`sections`	`LlmsSection[]`	Named sections with link lists (static)
`sources`	`LlmsSource[]`	Async functions that return additional sections

Each entry in sections:

Field	Type	Description
`title`	`string`	Section heading
`links`	`Link[]`	Optional list of links

Each entry in links:

Field	Type	Description
`title`	`string`	Link label
`url`	`string`	Absolute or relative URL
`description`	`string`	Optional inline description after the link

Disable: Omit the option or set llms: false

Generated output:

# Example

> An example website focused on TypeScript tooling.

This site documents internal tools and workflows.

## Documentation

- [Getting started](/docs/start): Setup guide
- [API reference](/docs/api)
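
As a rough illustration of how the configuration maps onto this Markdown layout, the sketch below renders the static fields only. `sketchLlmsTxt` and its interfaces are hypothetical names, not the package's `renderLlmsTxt`; the exact spacing the package emits may differ.

```typescript
// Hypothetical shapes mirroring the option tables above.
interface LlmsLinkSketch {
  title: string;
  url: string;
  description?: string;
}

interface LlmsSectionSketch {
  title: string;
  links?: LlmsLinkSketch[];
}

interface LlmsSketchConfig {
  title: string;
  description?: string;
  details?: string;
  sections?: LlmsSectionSketch[];
}

function sketchLlmsTxt(config: LlmsSketchConfig): string {
  // Each part is separated from the next by one blank line.
  const parts: string[] = [`# ${config.title}`];
  if (config.description) parts.push(`> ${config.description}`);
  if (config.details) parts.push(config.details);

  for (const section of config.sections ?? []) {
    const links = (section.links ?? []).map((link) =>
      link.description
        ? `- [${link.title}](${link.url}): ${link.description}`
        : `- [${link.title}](${link.url})`,
    );
    const heading = `## ${section.title}`;
    parts.push(links.length > 0 ? `${heading}\n\n${links.join("\n")}` : heading);
  }
  return parts.join("\n\n") + "\n";
}

const llmsSketch = sketchLlmsTxt({
  title: "Example",
  description: "An example website focused on TypeScript tooling.",
  details: "This site documents internal tools and workflows.",
  sections: [
    {
      title: "Documentation",
      links: [
        { title: "Getting started", url: "/docs/start", description: "Setup guide" },
        { title: "API reference", url: "/docs/api" },
      ],
    },
  ],
});
```

Sections returned by `sources` would be appended to the `sections` array before rendering; that asynchronous step is left out here.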

sitemap.xml

Sitemap generation is built-in and enabled by default. Static pages are discovered automatically from Astro's build output. Dynamic URLs can be added via sources.

siteFiles({
  sitemap: {
    exclude: ['/landing/'],
    priority: [{ pattern: '/blog/', priority: 0.9 }],
    sources: [
      async () => {
        const posts = await getCollection('blog')
        return posts.map(p => ({ loc: `/blog/${p.id}/`, lastmod: p.data.date }))
      }
    ]
  }
})

Automatic HTML Metadata Extraction (Opt-in)

The sitemap builder automatically scans your built HTML files for page-specific metadata. If a page contains a JSON-LD <script type="application/ld+json"> tag with data-sitemap-changefreq or data-sitemap-priority attributes (such as those generated by @casoon/astro-structured-data), the generator will parse and apply them directly to that page's entry in sitemap.xml.

This integration is completely decoupled and optional: if a page does not contain these custom script attributes, the sitemap generator falls back gracefully to your global configuration rules and built-in path-based defaults.
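
The extraction step might look roughly like the sketch below. It is a regex-based approximation under stated assumptions: `extractSitemapHints` is a hypothetical helper, the attributes are assumed to sit on the opening `<script>` tag with quoted values, and out-of-range priorities are simply ignored. The package's actual parser may work differently.

```typescript
interface PageSitemapHints {
  changefreq?: string;
  priority?: number;
}

function extractSitemapHints(html: string): PageSitemapHints {
  const hints: PageSitemapHints = {};
  // Collect the opening tags of all JSON-LD script elements on the page.
  const tags = html.match(/<script\b[^>]*type=["']application\/ld\+json["'][^>]*>/gi) ?? [];

  for (const tag of tags) {
    const changefreq = tag.match(/data-sitemap-changefreq=["']([^"']+)["']/i)?.[1];
    const priority = tag.match(/data-sitemap-priority=["']([^"']+)["']/i)?.[1];

    // The first tag that carries a value wins; later tags do not overwrite it.
    if (changefreq !== undefined && hints.changefreq === undefined) {
      hints.changefreq = changefreq;
    }
    if (priority !== undefined && hints.priority === undefined) {
      const value = Number(priority);
      if (Number.isFinite(value) && value >= 0 && value <= 1) hints.priority = value;
    }
  }
  return hints;
}

const pageHints = extractSitemapHints(
  '<head><script type="application/ld+json" data-sitemap-changefreq="daily" data-sitemap-priority="0.8">{}</script></head>',
);
```

A page without these attributes yields an empty object, which corresponds to the fallback to global rules and path-based defaults described above.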

Option reference:

Option	Type	Description
`siteUrl`	`string`	Override the site URL (auto-detected from `astro.config.site`)
`sources`	`SitemapSource[]`	Async functions returning additional `SitemapEntry[]`
`exclude`	`(string \| RegExp)[]`	URL paths or patterns to exclude
`filter`	`(url: string) => boolean`	Custom filter on the full absolute URL
`priority`	`PriorityRule[]`	Pattern-based priority overrides (first match wins)
`changefreq`	`ChangefreqRule[]`	Pattern-based changefreq overrides (first match wins)
`serialize`	`(entry) => entry \| undefined`	Per-item transform or filter hook
`i18n`	`{ defaultLocale, locales }`	Generates `<xhtml:link rel="alternate">` hreflang entries
`rss`	`RssConfig`	Generate an RSS 2.0 feed at build time — see RSS feed below
`output.mode`	`'single' \| 'index'`	`index` splits into numbered chunks (auto when > `maxUrls`). In index mode the index file is always `sitemap-index.xml` and chunks are `sitemap-1.xml`, `sitemap-2.xml`, …
`output.maxUrls`	`number`	Max URLs per file in index mode; default `50000`
`output.filename`	`string`	Output filename in single-file mode — default `sitemap.xml`. Ignored in index mode.
`audit.warnOnEmpty`	`boolean`	Warn when sitemap has zero entries — default `true`
`audit.errorOnDuplicates`	`boolean`	Emit error instead of warning for duplicate URLs — default `false`
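
The file naming described in the `output.mode` and `output.maxUrls` rows can be illustrated with a small planning function. This is a sketch, not the package's logic: `planSitemapFiles` is a hypothetical name, and it only models the automatic switch to index mode when the URL count exceeds `maxUrls`.

```typescript
interface SitemapFilePlan {
  filename: string;
  urls: string[];
}

interface SitemapPlan {
  index: string | null; // name of the index file, or null in single-file mode
  files: SitemapFilePlan[];
}

function planSitemapFiles(
  urls: string[],
  maxUrls = 50000,
  singleFilename = "sitemap.xml",
): SitemapPlan {
  if (maxUrls < 1) throw new RangeError("maxUrls must be at least 1");

  // Everything fits into one file: no index is needed.
  if (urls.length <= maxUrls) {
    return { index: null, files: [{ filename: singleFilename, urls }] };
  }

  // Otherwise split into numbered chunks and reference them from sitemap-index.xml.
  const files: SitemapFilePlan[] = [];
  for (let start = 0; start < urls.length; start += maxUrls) {
    files.push({
      filename: `sitemap-${files.length + 1}.xml`,
      urls: urls.slice(start, start + maxUrls),
    });
  }
  return { index: "sitemap-index.xml", files };
}

// Five URLs with a limit of two per file yield three chunks plus an index.
const chunkPlan = planSitemapFiles(["/a/", "/b/", "/c/", "/d/", "/e/"], 2);
```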

Built-in exclusions (always applied): /404, /500, /_*, /api/, /landing/, /drafts/, sitemap.xml, robots.txt, llms.txt, rss.xml, and any page whose HTML starts with <meta http-equiv="refresh"> (meta-refresh redirect pages).

Built-in priority defaults: / → 1.0, depth 1 → 0.9, depth 2 → 0.8, depth 3+ → 0.7

Built-in changefreq defaults: / and content paths (/blog/, /artikel/, etc.) → weekly, everything else → monthly
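
The depth-based priority defaults combined with "first match wins" rules could be expressed as in the sketch below. The names `defaultPriority` and `resolvePriority` are hypothetical, depth is assumed to mean the number of non-empty path segments, and string patterns are assumed to match as substrings; the package may match patterns differently.

```typescript
interface PriorityRuleSketch {
  pattern: string;
  priority: number;
}

// "/" has depth 0, "/about/" depth 1, "/blog/post/" depth 2, and so on.
function defaultPriority(path: string): number {
  const depth = path.split("/").filter((segment) => segment.length > 0).length;
  if (depth === 0) return 1.0;
  if (depth === 1) return 0.9;
  if (depth === 2) return 0.8;
  return 0.7;
}

// Rules are checked in order; the first matching rule wins, otherwise the depth default applies.
function resolvePriority(path: string, rules: PriorityRuleSketch[] = []): number {
  const match = rules.find((rule) => path.includes(rule.pattern));
  return match ? match.priority : defaultPriority(path);
}

const blogPriority = resolvePriority("/blog/post/", [
  { pattern: "/blog/", priority: 0.9 },
  { pattern: "/blog/post/", priority: 0.5 }, // never reached for this path
]);
```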

Disable: sitemap: false

RSS feed

Configure sitemap.rss to generate an rss.xml at build time alongside the sitemap. getItems runs in astro:build:done — use filesystem reads rather than getCollection(), which is only available in Astro's SSR context.

siteFiles({
  sitemap: {
    rss: {
      title: 'My Blog',
      description: 'Latest articles about TypeScript and Astro.',
      language: 'en',
      getItems: async (siteUrl) => {
        const { readdirSync, readFileSync } = await import('node:fs')
        const matter = (await import('gray-matter')).default
        const dir = './src/content/blog'
        return readdirSync(dir)
          .filter(f => f.endsWith('.mdx'))
          .map(file => {
            const { data } = matter(readFileSync(`${dir}/${file}`, 'utf-8'))
            if (data.draft) return null
            return {
              title: data.title,
              pubDate: data.date,
              link: `${siteUrl}/blog/${file.replace(/\.mdx$/, '')}/`,
              description: data.description,
            }
          })
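          // Drop drafts with a type guard so the sort callback never sees null
          .filter((item): item is NonNullable<typeof item> => item !== null)
          // Newest first; compare timestamps, since Date objects cannot be subtracted in TypeScript
          .sort((a, b) => new Date(b.pubDate).getTime() - new Date(a.pubDate).getTime())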
      },
    },
  },
})

rss option reference:

Option	Type	Description
`title`	`string`	Required. Feed title
`description`	`string`	Required. Feed description
`getItems`	`(siteUrl: string) => RssItem[] \| Promise<RssItem[]>`	Required. Returns the feed items; may be async, as in the example above
`filename`	`string`	Output filename — default `rss.xml`
`feedUrl`	`string`	Self-link URL — defaults to `{siteUrl}/{filename}`
`language`	`string`	BCP 47 language code, e.g. `'de-DE'`
`copyright`	`string`	Copyright notice
`managingEditor`	`string`	RFC 822 format: `email@domain.com (Name)`
`feedCustomData`	`string`	Raw XML injected inside `<channel>`
`xmlns`	`Record<string, string>`	Additional namespace declarations on `<rss>`

Each object returned by getItems:

Field	Type	Description
`title`	`string`	Required. Item title
`pubDate`	`Date \| string`	Required. Publication date
`link`	`string`	Required. Full URL or root-relative path
`description`	`string`	Short summary
`author`	`string`	Author name or email
`categories`	`string[]`	Category tags
`customData`	`string`	Raw XML injected inside `<item>` (e.g. for custom namespaced elements)
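
To show what CDATA escaping and link resolution involve for a single item, here is a minimal sketch. It is not the package's `renderRssFeed`: `sketchRssItem`, `cdata` and `escapeXml` are hypothetical helpers, and the element order is an assumption.

```typescript
interface RssItemSketch {
  title: string;
  pubDate: Date | string;
  link: string;
  description?: string;
}

// Wrap text in CDATA. A literal "]]>" would end the section early, so it is split
// across two adjacent CDATA sections.
function cdata(text: string): string {
  return `<![CDATA[${text.split("]]>").join("]]]]><![CDATA[>")}]]>`;
}

// Minimal escaping for text placed directly inside an element.
function escapeXml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function sketchRssItem(item: RssItemSketch, siteUrl: string): string {
  // Root-relative links are resolved against the site URL; full URLs pass through.
  const link = item.link.startsWith("/")
    ? siteUrl.replace(/\/+$/, "") + item.link
    : item.link;

  const lines = [
    "<item>",
    `  <title>${cdata(item.title)}</title>`,
    `  <link>${escapeXml(link)}</link>`,
    // toUTCString() produces the day/month-name date format that RSS readers expect.
    `  <pubDate>${new Date(item.pubDate).toUTCString()}</pubDate>`,
  ];
  if (item.description !== undefined) {
    lines.push(`  <description>${cdata(item.description)}</description>`);
  }
  lines.push("</item>");
  return lines.join("\n");
}

const itemXml = sketchRssItem(
  { title: "Hello & <world>", pubDate: "2024-01-01T00:00:00Z", link: "/blog/hello/" },
  "https://example.com/",
);
```

Because the title sits inside CDATA, characters such as `&` and `<` need no entity escaping there.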

RSS API route (`/rss` sub-path)

For a live feed served at a URL — useful in development or for SSR builds — use createRssRoute from the /rss sub-path. This helper runs inside Astro's SSR context, so getCollection() is available:

// src/pages/rss.xml.ts
import { createRssRoute } from '@casoon/astro-site-files/rss'
import { getCollection } from 'astro:content'

export const GET = createRssRoute({
  title: 'My Blog',
  description: 'Latest posts',
  language: 'de-DE',
  getItems: async (siteUrl) => {
    const posts = await getCollection('blog', ({ data }) => !data.draft)
    return posts
      .sort((a, b) => b.data.date.getTime() - a.data.date.getTime())
      .map(p => ({
        title: p.data.title,
        pubDate: p.data.date,
        link: `${siteUrl}/blog/${p.id}/`,
        description: p.data.description,
      }))
  },
})

Both approaches can coexist: build-time sitemap.rss for static deploys, and the API route for previewing the feed during development.

security.txt

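Generated at /.well-known/security.txt per RFC 9116. The specification requires both the Contact and Expires fields: contact is mandatory in the configuration, and a missing expires triggers the security/no-expires audit hint.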

siteFiles({
  security: {
    contact: 'mailto:info@casoon.de',
    policy: 'https://www.casoon.de/security-policy',
    acknowledgments: 'https://www.casoon.de/hall-of-fame',
    preferredLanguages: ['en', 'de'],
    expires: '2027-01-01T00:00:00.000Z',
    hiring: 'https://www.casoon.de/jobs'
  }
})

Option reference:

Option	Type	Description
`contact`	`string \| string[]`	Required. `mailto:` or `https:` URI for reporting vulnerabilities
`policy`	`string`	URL of the security policy
`acknowledgments`	`string`	URL of the acknowledgments or hall-of-fame page
`preferredLanguages`	`string[]`	BCP 47 language tags, e.g. `['en', 'de']`
`expires`	`string \| Date`	ISO 8601 expiry date — when to renew the file
`encryption`	`string`	URL of the PGP public key
`canonical`	`string`	Canonical URL of this `security.txt` file
`hiring`	`string`	URL of a security-focused jobs page

Disable: Omit the option or set security: false

Generated output:

Contact: mailto:info@casoon.de
Expires: 2027-01-01T00:00:00.000Z
Acknowledgments: https://www.casoon.de/hall-of-fame
Preferred-Languages: en, de
Policy: https://www.casoon.de/security-policy
Hiring: https://www.casoon.de/jobs
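
The mapping from options to field lines can be sketched as follows. This is an illustration under assumptions, not the package's `renderSecurityTxt`: `sketchSecurityTxt` is a hypothetical name, the field order copies the sample output above, and `encryption` and `canonical` are omitted.

```typescript
interface SecuritySketchConfig {
  contact: string | string[];
  expires?: string | Date;
  acknowledgments?: string;
  preferredLanguages?: string[];
  policy?: string;
  hiring?: string;
}

function sketchSecurityTxt(config: SecuritySketchConfig): string {
  // One Contact line per entry.
  const contacts = Array.isArray(config.contact) ? config.contact : [config.contact];
  const lines = contacts.map((contact) => `Contact: ${contact}`);

  // Normalise strings and Date objects to one ISO 8601 representation.
  if (config.expires !== undefined) {
    lines.push(`Expires: ${new Date(config.expires).toISOString()}`);
  }
  if (config.acknowledgments) lines.push(`Acknowledgments: ${config.acknowledgments}`);
  if (config.preferredLanguages && config.preferredLanguages.length > 0) {
    lines.push(`Preferred-Languages: ${config.preferredLanguages.join(", ")}`);
  }
  if (config.policy) lines.push(`Policy: ${config.policy}`);
  if (config.hiring) lines.push(`Hiring: ${config.hiring}`);
  return lines.join("\n") + "\n";
}

const securitySketch = sketchSecurityTxt({
  contact: "mailto:info@casoon.de",
  policy: "https://www.casoon.de/security-policy",
  acknowledgments: "https://www.casoon.de/hall-of-fame",
  preferredLanguages: ["en", "de"],
  expires: "2027-01-01T00:00:00.000Z",
  hiring: "https://www.casoon.de/jobs",
});
```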

humans.txt

Follows the humanstxt.org convention.

siteFiles({
  humans: {
    team: [
      { name: 'Alice', role: 'Development', location: 'Berlin' },
      { name: 'Bob', role: 'Design', twitter: '@bob' }
    ],
    thanks: ['Open Source Community', 'Our early users'],
    technology: ['Astro', 'TypeScript', 'Tailwind CSS'],
    note: 'Built with care.'
  }
})

Option reference:

Option	Type	Description
`team`	`TeamMember[]`	List of team members
`thanks`	`string[]`	Acknowledgment entries
`technology`	`string[]`	Technologies used — rendered as a comma-separated list
`note`	`string`	Free-form note
`lastUpdate`	`string \| Date`	Defaults to the build date

Each entry in team:

Field	Type	Description
`name`	`string`	Required. Full name
`role`	`string`	Job title or role
`twitter`	`string`	Twitter / X handle
`location`	`string`	City or country
`email`	`string`	Contact email

Disable: Omit the option or set humans: false

Generated output (abridged):

/* TEAM */
    Name: Alice
    Role: Development
    Location: Berlin

/* SITE LAST UPDATED */
    2026-05-06

/* TECHNOLOGY COLOPHON */
    Astro, TypeScript, Tailwind CSS

Build-time audit hints

The integration emits build-time hints when configuration looks incomplete or incorrect. Each hint has a rule ID, a level (info or warn; sitemap/duplicate-urls can be raised to error), and a help message.

All rule IDs:

Rule ID	Level	Triggered when
`robots/legal-pages-blocked`	warn	A legal page (`/privacy`, `/terms`, `/impressum`, …) is in `disallow`
`llms/no-description`	info	`llms` has no `description`
`llms/no-sections`	info	`llms` has no `sections` or `sources`
`llms/sections-without-links`	info	Sections exist but none have `links` (and no `sources` configured)
`security/no-expires`	warn	`security` has no `expires` date (required by RFC 9116)
`security/no-policy`	info	`security` has no `policy` URL
`humans/no-team`	info	`humans` has no `team` entries
`humans/no-technology`	info	`humans` has no `technology` entries
`sitemap/no-site-url`	warn	No site URL is configured — `<loc>` entries will be relative
`sitemap/empty-sitemap`	warn	Sitemap has no entries after all sources are resolved
`sitemap/duplicate-urls`	warn/error	Duplicate URLs detected before deduplication (last wins)
`sitemap/invalid-priority`	warn	One or more entries have `priority` outside `[0, 1]`

Disable all hints:

siteFiles({ audit: false })

Suppress specific rules:

siteFiles({
  audit: {
    disable: [
      'llms/no-description',
      'security/no-expires',
    ],
  },
})

audit option reference:

Option	Type	Description
`enabled`	`boolean`	Set to `false` to silence all hints
`disable`	`string[]`	Rule IDs to suppress individually

Passing audit: false is equivalent to audit: { enabled: false }.
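
The two audit switches can be pictured as a filter over the collected hints. The sketch below is only an illustration: `applyAuditOptions` and the `AuditIssueSketch` shape are hypothetical and say nothing about the signature of the exported `filterIssues`.

```typescript
interface AuditIssueSketch {
  rule: string;
  level: "info" | "warn" | "error";
  message: string;
}

type AuditOptionSketch = false | { enabled?: boolean; disable?: string[] };

function applyAuditOptions(
  issues: AuditIssueSketch[],
  audit: AuditOptionSketch = {},
): AuditIssueSketch[] {
  // audit: false is treated the same as audit: { enabled: false }.
  const options: { enabled?: boolean; disable?: string[] } =
    audit === false ? { enabled: false } : audit;
  if (options.enabled === false) return [];

  // Otherwise drop only the rule IDs listed in disable.
  const disabled = new Set(options.disable ?? []);
  return issues.filter((issue) => !disabled.has(issue.rule));
}

const sampleIssues: AuditIssueSketch[] = [
  { rule: "llms/no-description", level: "info", message: "llms has no description" },
  { rule: "security/no-expires", level: "warn", message: "security has no expires date" },
];

const remainingIssues = applyAuditOptions(sampleIssues, { disable: ["llms/no-description"] });
```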

Option defaults

Option	Default behavior
`robots`	Enabled — generates `robots.txt` that allows all crawlers by default
`llms`	Disabled — requires `{ title }`
`sitemap`	Enabled — built-in sitemap generation from Astro's build output
`sitemap.rss`	Disabled — requires `{ title, description, getItems }`
`security`	Disabled — requires `{ contact }`
`humans`	Disabled — generates when any option is provided
`audit`	Enabled — emits build-time hints for all generated files

Programmatic usage

The renderer functions are exported for use outside of the Astro integration:

import {
  renderRobotsTxt,
  renderLlmsTxt,
  renderSecurityTxt,
  renderHumansTxt,
  renderSitemapXml,
  renderSitemapIndex,
  renderRssFeed,
  resolveEntry,
  deduplicateEntries,
  auditSitemap,
  auditRobots,
  auditLlms,
  auditSecurity,
  auditHumans,
  filterIssues,
  defaultRegistry,
  REGISTRY_VERSION,
} from '@casoon/astro-site-files'
import type {
  AuditOptions,
  AuditIssue,
  RssConfig,
  RssItem,
  BotAction,
  BotCategory,
  RegistryBot,
  Preset,
} from '@casoon/astro-site-files'

// With preset
const robots = renderRobotsTxt({ preset: 'seoOnly' }, 'https://example.com')

// With manual overrides on top of a preset
const robots2 = renderRobotsTxt(
  { preset: 'blockTraining', bots: { PerplexityBot: 'disallow' }, disallow: ['/admin'] },
  'https://example.com',
)

// Without preset (manual)
const robots3 = renderRobotsTxt({ disallow: ['/admin'] }, 'https://example.com')
const llms = renderLlmsTxt({ title: 'My Site', description: 'A site.' })
const security = renderSecurityTxt({ contact: 'mailto:info@casoon.de' })
const humans = renderHumansTxt({ team: [{ name: 'Alice' }], technology: ['Astro'] })

const entries = [{ loc: '/blog/post/' }].map(e => resolveEntry(e, {}, 'https://example.com'))
const xml = renderSitemapXml(deduplicateEntries(entries))

const rss = renderRssFeed(
  { title: 'My Blog', description: 'Latest posts', language: 'en' },
  'https://example.com',
  [{ title: 'Hello', pubDate: new Date(), link: '/blog/hello/' }],
)

defaultRegistry exposes the full built-in bot list. REGISTRY_VERSION is an ISO date string of the last registry update — useful for debugging or displaying in tooling.

The createRssRoute helper is available from the /rss sub-path (see RSS API route above):

import { createRssRoute } from '@casoon/astro-site-files/rss'
import type { CreateRssRouteOptions } from '@casoon/astro-site-files/rss'

This package covers static file generation. Actual crawl enforcement depends on whether bots respect these files — many do not.

Keywords

astro astro-integration withastro robots.txt llms.txt sitemap security.txt humans.txt seo