@casoon/astro-site-files

Licence: MIT
Version: 0.4.2
Dependencies: 0
Size: 95 kB
Vulnerabilities: 0
Weekly downloads: 76

Astro integration that generates all standard site meta-files from typed configuration at build time.

What it does

  • Generates robots.txt — crawl rules with per-agent overrides and automatic sitemap reference
  • Generates llms.txt — AI model discovery file following the llmstxt.org specification
  • Generates sitemap.xml — built-in, enabled by default, with dynamic sources, i18n hreflang and sitemap-index support
  • Generates rss.xml — RSS 2.0 feed with CDATA escaping, custom namespaces and per-item hooks
  • Generates /.well-known/security.txt — vulnerability disclosure contact per RFC 9116
  • Generates humans.txt — team and technology credits per humanstxt.org

All files are written to the build output directory when astro build runs.

Successor package. This integration replaces @casoon/astro-crawler-policy (robots.txt + llms.txt) and @casoon/astro-sitemap (sitemap.xml + rss.xml). Both predecessor packages are no longer actively maintained.

Requirements

  • Node.js ≥ 22.12.0 (aligned with Astro 6)
  • Astro ≥ 6.0.0 (peer dependency, optional for programmatic usage)

Installation

npm install @casoon/astro-site-files

Quick start

// astro.config.ts
import { defineConfig } from 'astro/config'
import siteFiles from '@casoon/astro-site-files'

export default defineConfig({
  site: 'https://example.com',
  integrations: [
    siteFiles({
      robots: {
        preset: 'seoOnly',      // blocks AI training and archives; search engines stay allowed
        disallow: ['/admin'],   // additional path rules on top of the preset
      },
      llms: { title: 'Example', description: 'An example website.' },
      security: { contact: 'mailto:info@casoon.de' },
      humans: {
        team: [{ name: 'Alice', role: 'Development' }],
        technology: ['Astro', 'TypeScript']
      }
    })
  ]
})

robots.txt and sitemap.xml are enabled by default. The other files (llms.txt, security.txt, humans.txt and rss.xml) are generated only when their option is configured.

robots.txt

The recommended approach is to start with a preset and add path rules on top:

siteFiles({
  robots: {
    preset: 'seoOnly',           // blocks AI training, archiving; search engines allowed
    disallow: ['/admin'],        // additional paths for User-agent: *
  }
})

For fine-grained control without a preset:

siteFiles({
  robots: {
    disallow: ['/admin', '/private/'],
    allow: ['/admin/public/'],
    crawlDelay: 2,
    sitemap: true,               // auto-derive from astro.config site URL (default)
    agents: [
      { userAgent: 'Googlebot', crawlDelay: 1 }
    ]
  }
})

Option reference:

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| disallow | string[] | [] | Paths to disallow for User-agent: * |
| allow | string[] | [] | Paths to explicitly allow for User-agent: * |
| crawlDelay | number | | Crawl-delay for User-agent: * |
| sitemap | boolean \| string | true | true = derive URL from astro.config.site, string = explicit URL, false = omit |
| preset | Preset | | Named preset; see Presets below |
| bots | Record<string, BotAction> | {} | Per-bot overrides, keyed by bot id; take precedence over groups and preset |
| groups | Groups | {} | Group-level action controls; see the Groups table below |
| extraBots | RegistryBot[] | [] | Additional bots to merge into the built-in registry |
| agents | AgentRule[] | [] | Explicit per-agent rule blocks, appended after registry-derived rules |

BotAction values: 'allow' (emit Allow: /), 'disallow' (emit Disallow: /), 'inherit' (no rule emitted; User-agent: * applies).

Groups fields:

| Group key | Covers |
| --- | --- |
| searchEngines | Googlebot, Bingbot, DuckDuckBot |
| verifiedAi | Verified AI bots: OpenAI, Anthropic, Google, Perplexity, You.com, Amazon, Apple, Meta, ByteDance |
| unknownAi | Unverified or uncategorized scrapers (Diffbot, Omgilibot) |
| seoScanners | SEO analytics tools: AhrefsBot, SemrushBot, MJ12bot, DotBot |
| archives | Web archiving bots: ia_archiver, archive.org_bot |

Each entry in agents:

| Field | Type | Description |
| --- | --- | --- |
| userAgent | string \| string[] | User-agent value(s) |
| allow | string[] | Paths to allow |
| disallow | string[] | Paths to disallow |
| crawlDelay | number | Crawl-delay for this agent |

Disable: robots: false

Generated output for the fine-grained example above:

User-agent: *
Disallow: /admin
Disallow: /private/
Allow: /admin/public/
Crawl-delay: 2

User-agent: Googlebot
Crawl-delay: 1

Sitemap: https://example.com/sitemap.xml
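
The mapping from options to output can be sketched as below. This is a minimal illustration, not the package's renderer: the type and function names (RobotsSketchOptions, renderRobotsSketch) are hypothetical, presets and the bot registry are left out, and the line order simply mirrors the generated output shown above.

```typescript
// Hypothetical types for illustration; the package's real types may differ.
interface AgentRuleSketch {
  userAgent: string | string[]
  allow?: string[]
  disallow?: string[]
  crawlDelay?: number
}

interface RobotsSketchOptions {
  allow?: string[]
  disallow?: string[]
  crawlDelay?: number
  sitemap?: boolean | string
  agents?: AgentRuleSketch[]
}

// Render one block: User-agent lines, then Disallow, Allow and Crawl-delay.
function renderBlockSketch(rule: AgentRuleSketch): string {
  const agents = Array.isArray(rule.userAgent) ? rule.userAgent : [rule.userAgent]
  const lines = agents.map(a => `User-agent: ${a}`)
  for (const p of rule.disallow ?? []) lines.push(`Disallow: ${p}`)
  for (const p of rule.allow ?? []) lines.push(`Allow: ${p}`)
  if (rule.crawlDelay !== undefined) lines.push(`Crawl-delay: ${rule.crawlDelay}`)
  return lines.join('\n')
}

function renderRobotsSketch(opts: RobotsSketchOptions, siteUrl?: string): string {
  const blocks = [
    // Top-level allow/disallow/crawlDelay apply to User-agent: *
    renderBlockSketch({
      userAgent: '*',
      allow: opts.allow,
      disallow: opts.disallow,
      crawlDelay: opts.crawlDelay,
    }),
    // Explicit per-agent blocks follow.
    ...(opts.agents ?? []).map(renderBlockSketch),
  ]
  // sitemap: true derives the URL from the site, a string is used as-is, false omits the line.
  const sitemap = opts.sitemap ?? true
  if (typeof sitemap === 'string') {
    blocks.push(`Sitemap: ${sitemap}`)
  } else if (sitemap && siteUrl) {
    blocks.push(`Sitemap: ${siteUrl.replace(/\/$/, '')}/sitemap.xml`)
  }
  return blocks.join('\n\n') + '\n'
}
```

Calling renderRobotsSketch with the fine-grained options above and 'https://example.com' reproduces the listing shown in this section.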

Presets

A preset configures group defaults and known-bot rules in one step. Individual bots and groups options override the preset.

siteFiles({
  robots: { preset: 'seoOnly' }
})

| Preset | searchEngines | verifiedAi | unknownAi | seoScanners | archives | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| seoOnly | allow | disallow | disallow | inherit | disallow | All AI training and archiving blocked; search engines stay |
| citationFriendly | allow | allow | disallow | inherit | inherit | AI may read and cite; training crawlers overridden via bots |
| openToAi | allow | allow | allow | inherit | allow | Everything allowed |
| blockTraining | allow | allow | disallow | inherit | disallow | AI input/search allowed; training bots overridden via bots |
| lockdown | disallow | disallow | disallow | disallow | disallow | Everything blocked |

inherit means no rule is emitted for that group — User-agent: * applies. citationFriendly and blockTraining additionally override specific training bots via bots regardless of the verifiedAi group setting.

Group overrides let you adjust one category without changing the preset for others:

siteFiles({
  robots: {
    preset: 'seoOnly',
    groups: { seoScanners: 'disallow' }  // also block SEO scanners
  }
})

Per-bot overrides take the highest precedence:

siteFiles({
  robots: {
    preset: 'blockTraining',
    bots: { PerplexityBot: 'disallow' }  // also block AI search
  }
})
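
The precedence described here (bots over groups over preset) can be sketched as a simple lookup chain. This is an illustration under stated assumptions, not the package's resolver: resolveActionSketch and toRuleSketch are hypothetical names, the caller supplies the bot's group, and the extra per-bot overrides that citationFriendly and blockTraining apply are not modelled.

```typescript
type BotActionSketch = 'allow' | 'disallow' | 'inherit'
type GroupKeySketch = 'searchEngines' | 'verifiedAi' | 'unknownAi' | 'seoScanners' | 'archives'

// Group defaults of the seoOnly preset, copied from the preset table.
const seoOnlySketch: Record<GroupKeySketch, BotActionSketch> = {
  searchEngines: 'allow',
  verifiedAi: 'disallow',
  unknownAi: 'disallow',
  seoScanners: 'inherit',
  archives: 'disallow',
}

interface OverridesSketch {
  groups?: Partial<Record<GroupKeySketch, BotActionSketch>>
  bots?: Record<string, BotActionSketch>
}

// Per-bot override first, then group override, then the preset default.
function resolveActionSketch(
  botId: string,
  group: GroupKeySketch,
  preset: Record<GroupKeySketch, BotActionSketch>,
  overrides: OverridesSketch = {},
): BotActionSketch {
  return overrides.bots?.[botId] ?? overrides.groups?.[group] ?? preset[group]
}

// Map an action to the rule line described for BotAction; inherit emits nothing.
function toRuleSketch(action: BotActionSketch): string | null {
  if (action === 'allow') return 'Allow: /'
  if (action === 'disallow') return 'Disallow: /'
  return null
}
```

With the seoOnly defaults, AhrefsBot resolves to inherit; adding groups: { seoScanners: 'disallow' } turns that into disallow, and a bots entry for AhrefsBot would override both.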

Adding custom bots:

siteFiles({
  robots: {
    preset: 'seoOnly',
    extraBots: [
      { id: 'MyBot', provider: 'Example', userAgents: ['MyBot/1.0'], categories: ['ai-training'], verified: false }
    ]
  }
})

Blocking AI crawlers and web archives

robots.txt is voluntary — compliant bots respect it, aggressive scrapers often do not. For most sites the pragmatic approach is a layered "soft block": signal your preferences clearly while keeping search engines working normally.

Known bots to consider blocking

| User-agent | Origin |
| --- | --- |
| ia_archiver | Internet Archive / Wayback Machine |
| archive.org_bot | Internet Archive (secondary agent) |
| GPTBot | OpenAI training crawler |
| ChatGPT-User | OpenAI: used when ChatGPT fetches URLs on behalf of a user |
| ClaudeBot | Anthropic |
| Claude-Web | Anthropic |
| anthropic-ai | Anthropic |
| Google-Extended | Google: Gemini / Vertex AI training |
| CCBot | Common Crawl: the base dataset behind many models |
| PerplexityBot | Perplexity AI |
| YouBot | You.com AI search |
| Amazonbot | Amazon: Alexa / Rufus training |
| Applebot-Extended | Apple AI features |
| Bytespider | ByteDance / TikTok ecosystem |
| OAI-SearchBot | OpenAI: search and browsing |
| meta-externalagent | Meta AI |
| Diffbot | Automated data extraction (unverified) |
| Omgilibot | Social media data aggregator, used in training sets (unverified) |
| AhrefsBot | Ahrefs SEO scanner |
| SemrushBot | Semrush SEO scanner |
| MJ12bot | Majestic SEO scanner |
| DotBot | Moz SEO scanner |

Important: block CCBot. Many models are not trained directly from your site but via datasets derived from Common Crawl. Blocking only GPTBot while leaving CCBot open still lets your content reach training pipelines indirectly.

Variant 1 — Pragmatic / SEO-safe

Good for company sites, blogs, agencies. Normal search engines keep working; AI training and archiving are restricted.

siteFiles({
  robots: { preset: 'seoOnly' }
})

Variant 2 — Content-focused / block training

For publishers, premium content, or media sites. AI may read and cite content; training crawlers and archives are blocked.

siteFiles({
  robots: { preset: 'blockTraining' }
})

Variant 3 — Maximum restriction

Blocks every bot group except search engines, including SEO scanners. Use with caution: this also prevents you from using SEO tools on your own site. To block search engines as well, use the lockdown preset.

siteFiles({
  robots: {
    preset: 'seoOnly',
    groups: { seoScanners: 'disallow' }
  }
})

Note on SEO scanners: Blocking AhrefsBot, SemrushBot, and similar tools prevents competitors from analysing your backlink profile or content, but also prevents you from using those tools on your own site. Evaluate the trade-off before adding them.

Meta tag and HTTP header

<meta name="robots" content="noarchive">
X-Robots-Tag: noarchive

This helps against search engine caches and snapshots (e.g. Google Cache). It does not protect against active scrapers, training data dumps, or content already copied.

What robots.txt cannot do

Since 2025–2026, many AI scrapers no longer identify themselves as bots. They use residential IPs, headless browsers with standard headers, and distributed request patterns that are indistinguishable from normal traffic. robots.txt cannot stop them.

Effective countermeasures require infrastructure:

  • Rate limiting
  • Bot detection (e.g. Cloudflare Bot Fight Mode)
  • JS challenges for suspicious traffic
  • IP reputation filtering and login walls

If you use Cloudflare, a WAF custom rule can challenge or block known bots that are not search engine crawlers, while leaving legitimate search crawlers untouched:

(cf.client.bot and not cf.verified_bot_category in {"Search Engine Crawler"})
→ Challenge / JS Challenge / Block

robots.txt is a declaration of intent, not an enforcement mechanism.

llms.txt

Follows the llmstxt.org specification. Provides structured metadata for AI models discovering what your site is about.

siteFiles({
  llms: {
    title: 'Example',
    description: 'An example website focused on TypeScript tooling.',
    details: 'This site documents internal tools and workflows.',
    sections: [
      {
        title: 'Documentation',
        links: [
          { title: 'Getting started', url: '/docs/start', description: 'Setup guide' },
          { title: 'API reference', url: '/docs/api' }
        ]
      }
    ]
  }
})

Use sources to generate sections from code — for example from a content collection — instead of maintaining them manually:

siteFiles({
  llms: {
    title: 'Example',
    description: 'An example website.',
    sources: [
      async () => {
        const posts = await getCollection('blog')
        return {
          title: 'Blog',
          links: posts.map(p => ({ title: p.data.title, url: `/blog/${p.id}/` })),
        }
      },
    ],
  },
})

Sections from sources are appended after any manually defined sections.

Option reference:

| Option | Type | Description |
| --- | --- | --- |
| title | string | Required. Site or project name |
| description | string | Short description rendered as a blockquote |
| details | string | Additional plain-text context |
| sections | LlmsSection[] | Named sections with link lists (static) |
| sources | LlmsSource[] | Async functions that return additional sections |

Each entry in sections:

| Field | Type | Description |
| --- | --- | --- |
| title | string | Section heading |
| links | Link[] | Optional list of links |

Each entry in links:

| Field | Type | Description |
| --- | --- | --- |
| title | string | Link label |
| url | string | Absolute or relative URL |
| description | string | Optional inline description after the link |

Disable: Omit the option or set llms: false

Generated output:

# Example

> An example website focused on TypeScript tooling.

This site documents internal tools and workflows.

## Documentation

- [Getting started](/docs/start): Setup guide
- [API reference](/docs/api)
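
How the options map onto this output can be sketched in a few lines. This is a minimal illustration, not the package's renderLlmsTxt: the names with a Sketch suffix are hypothetical, and sources are assumed to have been resolved into plain sections beforehand.

```typescript
interface LlmsLinkSketch { title: string; url: string; description?: string }
interface LlmsSectionSketch { title: string; links?: LlmsLinkSketch[] }
interface LlmsSketchOptions {
  title: string
  description?: string
  details?: string
  sections?: LlmsSectionSketch[]
}

function renderLlmsSketch(o: LlmsSketchOptions): string {
  // Title as a level-1 heading, description as a blockquote, details as plain text.
  const parts: string[] = [`# ${o.title}`]
  if (o.description) parts.push(`> ${o.description}`)
  if (o.details) parts.push(o.details)
  for (const s of o.sections ?? []) {
    // Each link becomes a list item; the description follows after a colon.
    const links = (s.links ?? []).map(
      l => `- [${l.title}](${l.url})${l.description ? `: ${l.description}` : ''}`,
    )
    const heading = `## ${s.title}`
    parts.push(links.length > 0 ? `${heading}\n\n${links.join('\n')}` : heading)
  }
  // Blocks are separated by one blank line.
  return parts.join('\n\n') + '\n'
}
```

Feeding the first configuration of this section into renderLlmsSketch yields the listing above.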

sitemap.xml

Sitemap generation is built-in and enabled by default. Static pages are discovered automatically from Astro's build output. Dynamic URLs can be added via sources.

siteFiles({
  sitemap: {
    exclude: ['/landing/'],
    priority: [{ pattern: '/blog/', priority: 0.9 }],
    sources: [
      async () => {
        const posts = await getCollection('blog')
        return posts.map(p => ({ loc: `/blog/${p.id}/`, lastmod: p.data.date }))
      }
    ]
  }
})

Automatic HTML Metadata Extraction (Opt-in)

The sitemap builder automatically scans your built HTML files for page-specific metadata. If a page contains a JSON-LD <script type="application/ld+json"> tag with data-sitemap-changefreq or data-sitemap-priority attributes (such as those generated by @casoon/astro-structured-data), the generator will parse and apply them directly to that page's entry in sitemap.xml.

This integration is completely decoupled and optional: if a page does not contain these custom script attributes, the sitemap generator falls back gracefully to your global configuration rules and built-in path-based defaults.
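
As a rough sketch of what such a scan could look like, the function below pulls the two attributes from the first JSON-LD script tag of a page. It is an assumption-laden illustration, not the package's parser: extractSitemapHintsSketch is a hypothetical name, a regular expression stands in for real HTML parsing, and only quoted attribute values are handled.

```typescript
interface PageHintsSketch { changefreq?: string; priority?: number }

function extractSitemapHintsSketch(html: string): PageHintsSketch {
  // Find the opening tag of the first JSON-LD script element.
  const tag = html.match(/<script\b[^>]*type=["']application\/ld\+json["'][^>]*>/i)
  if (!tag) return {}
  const openingTag = tag[0]

  // Read a quoted attribute value from the opening tag, if present.
  const attr = (name: string): string | undefined => {
    const m = openingTag.match(new RegExp(`${name}=["']([^"']*)["']`, 'i'))
    return m ? m[1] : undefined
  }

  const hints: PageHintsSketch = {}
  const changefreq = attr('data-sitemap-changefreq')
  if (changefreq) hints.changefreq = changefreq

  // Keep the priority only when it parses to a number inside [0, 1].
  const priority = parseFloat(attr('data-sitemap-priority') ?? '')
  if (priority >= 0 && priority <= 1) hints.priority = priority
  return hints
}
```

A page without these attributes returns an empty object, which corresponds to the fallback to global rules and path-based defaults described above.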

Option reference:

| Option | Type | Description |
| --- | --- | --- |
| siteUrl | string | Override the site URL (auto-detected from astro.config.site) |
| sources | SitemapSource[] | Async functions returning additional SitemapEntry[] |
| exclude | (string \| RegExp)[] | URL paths or patterns to exclude |
| filter | (url: string) => boolean | Custom filter on the full absolute URL |
| priority | PriorityRule[] | Pattern-based priority overrides (first match wins) |
| changefreq | ChangefreqRule[] | Pattern-based changefreq overrides (first match wins) |
| serialize | (entry) => entry \| undefined | Per-item transform or filter hook |
| i18n | { defaultLocale, locales } | Generates <xhtml:link rel="alternate"> hreflang entries |
| rss | RssConfig | Generate an RSS 2.0 feed at build time; see RSS feed below |
| output.mode | 'single' \| 'index' | index splits into numbered chunks (auto when > maxUrls). In index mode the index file is always sitemap-index.xml and chunks are sitemap-1.xml, sitemap-2.xml, … |
| output.maxUrls | number | Max URLs per file in index mode (default 50 000) |
| output.filename | string | Output filename in single-file mode (default sitemap.xml). Ignored in index mode. |
| audit.warnOnEmpty | boolean | Warn when sitemap has zero entries (default true) |
| audit.errorOnDuplicates | boolean | Emit error instead of warning for duplicate URLs (default false) |
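
The file layout implied by output.mode and output.maxUrls can be sketched as follows. This is an illustration only: planSitemapFilesSketch is a hypothetical helper, it models just the automatic switch to index mode when the URL count exceeds maxUrls, and it plans filenames without writing any XML.

```typescript
interface SitemapFileSketch { filename: string; urls: string[] }
interface SitemapPlanSketch { index?: string; files: SitemapFileSketch[] }

function planSitemapFilesSketch(
  urls: string[],
  maxUrls = 50000,
  filename = 'sitemap.xml',
): SitemapPlanSketch {
  if (maxUrls < 1) throw new RangeError('maxUrls must be at least 1')

  // Single-file mode: everything fits into one sitemap.
  if (urls.length <= maxUrls) return { files: [{ filename, urls }] }

  // Index mode: numbered chunks plus a fixed index filename.
  const files: SitemapFileSketch[] = []
  for (let i = 0; i < urls.length; i += maxUrls) {
    files.push({
      filename: `sitemap-${files.length + 1}.xml`,
      urls: urls.slice(i, i + maxUrls),
    })
  }
  return { index: 'sitemap-index.xml', files }
}
```

For example, five URLs with maxUrls set to 2 produce sitemap-1.xml, sitemap-2.xml and sitemap-3.xml plus sitemap-index.xml, while two URLs stay in a single sitemap.xml.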

Built-in exclusions (always applied): /404, /500, /_*, /api/, /landing/, /drafts/, sitemap.xml, robots.txt, llms.txt, rss.xml, and any page whose HTML starts with <meta http-equiv="refresh"> (meta-refresh redirect pages).

Built-in priority defaults: / → 1.0, depth 1 → 0.9, depth 2 → 0.8, depth 3+ → 0.7

Built-in changefreq defaults: / and content paths (/blog/, /artikel/, etc.) → weekly, everything else → monthly
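
The priority defaults and the "first match wins" rule can be sketched together. This is a hedged illustration rather than the package's logic: the function names are hypothetical, patterns are assumed to be plain strings matched by substring, and RegExp patterns and changefreq are left out.

```typescript
interface PriorityRuleSketch { pattern: string; priority: number }

// Depth-based default: / is 1.0, one segment 0.9, two segments 0.8, deeper 0.7.
function defaultPrioritySketch(path: string): number {
  const depth = path.split('/').filter(segment => segment.length > 0).length
  if (depth === 0) return 1.0
  if (depth === 1) return 0.9
  if (depth === 2) return 0.8
  return 0.7
}

function resolvePrioritySketch(path: string, rules: PriorityRuleSketch[] = []): number {
  // Rules are checked in order; the first matching pattern decides.
  for (const rule of rules) {
    if (path.indexOf(rule.pattern) !== -1) return rule.priority
  }
  return defaultPrioritySketch(path)
}
```

With the rule from the example above, { pattern: '/blog/', priority: 0.9 }, a post at /blog/post/ gets 0.9 instead of the depth-2 default of 0.8.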

Disable: sitemap: false

RSS feed

Configure sitemap.rss to generate an rss.xml at build time alongside the sitemap. getItems runs in astro:build:done — use filesystem reads rather than getCollection(), which is only available in Astro's SSR context.

siteFiles({
  sitemap: {
    rss: {
      title: 'My Blog',
      description: 'Latest articles about TypeScript and Astro.',
      language: 'en',
      getItems: async (siteUrl) => {
        const { readdirSync, readFileSync } = await import('node:fs')
        const matter = (await import('gray-matter')).default
        const dir = './src/content/blog'
        return readdirSync(dir)
          .filter(f => f.endsWith('.mdx'))
          .map(file => {
            const { data } = matter(readFileSync(`${dir}/${file}`, 'utf-8'))
            if (data.draft) return null
            return {
              title: data.title,
              pubDate: data.date,
              link: `${siteUrl}/blog/${file.replace(/\.mdx$/, '')}/`,
              description: data.description,
            }
          })
          .filter(Boolean)
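          // Newest first; compare timestamps so the subtraction type-checks in TypeScript
          .sort((a, b) => new Date(b.pubDate).getTime() - new Date(a.pubDate).getTime())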
      },
    },
  },
})
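
The feed is described as using CDATA escaping, and item links may be full URLs or root-relative paths. The sketch below shows one way a single item could be serialized under those two rules. It is not the package's renderRssFeed: all names are hypothetical, and the choice of which elements get CDATA is an assumption.

```typescript
interface RssItemSketch {
  title: string
  pubDate: Date | string
  link: string
  description?: string
}

// "]]>" cannot appear inside a CDATA section, so it is split across two sections.
function cdataSketch(text: string): string {
  return `<![CDATA[${text.split(']]>').join(']]]]><![CDATA[>')}]]>`
}

// Minimal escaping for text placed directly into an element.
function escapeXmlSketch(text: string): string {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')
}

function renderItemSketch(item: RssItemSketch, siteUrl: string): string {
  // Root-relative links are resolved against the site URL.
  const link = item.link.charAt(0) === '/'
    ? siteUrl.replace(/\/$/, '') + item.link
    : item.link
  const date = typeof item.pubDate === 'string' ? new Date(item.pubDate) : item.pubDate
  const lines = [
    '<item>',
    `  <title>${cdataSketch(item.title)}</title>`,
    `  <link>${escapeXmlSketch(link)}</link>`,
    // toUTCString() produces the day, date and GMT time layout RSS readers expect.
    `  <pubDate>${date.toUTCString()}</pubDate>`,
  ]
  if (item.description) {
    lines.push(`  <description>${cdataSketch(item.description)}</description>`)
  }
  lines.push('</item>')
  return lines.join('\n')
}
```

Splitting "]]>" keeps the original text intact after parsing, because adjacent CDATA sections are concatenated by XML parsers.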

rss option reference:

| Option | Type | Description |
| --- | --- | --- |
| title | string | Required. Feed title |
| description | string | Required. Feed description |
| getItems | (siteUrl: string) => RssItem[] | Required. Returns the feed items; the examples pass an async function |
| filename | string | Output filename (default rss.xml) |
| feedUrl | string | Self-link URL; defaults to {siteUrl}/{filename} |
| language | string | BCP 47 language code, e.g. 'de-DE' |
| copyright | string | Copyright notice |
| managingEditor | string | RFC 822 format: email@domain.com (Name) |
| feedCustomData | string | Raw XML injected inside <channel> |
| xmlns | Record<string, string> | Additional namespace declarations on <rss> |

Each object returned by getItems:

| Field | Type | Description |
| --- | --- | --- |
| title | string | Required. Item title |
| pubDate | Date \| string | Required. Publication date |
| link | string | Required. Full URL or root-relative path |
| description | string | Short summary |
| author | string | Author name or email |
| categories | string[] | Category tags |
| customData | string | Raw XML injected inside <item> (e.g. for custom namespaced elements) |

RSS API route (/rss sub-path)

For a live feed served at a URL — useful in development or for SSR builds — use createRssRoute from the /rss sub-path. This helper runs inside Astro's SSR context, so getCollection() is available:

// src/pages/rss.xml.ts
import { createRssRoute } from '@casoon/astro-site-files/rss'
import { getCollection } from 'astro:content'

export const GET = createRssRoute({
  title: 'My Blog',
  description: 'Latest posts',
  language: 'de-DE',
  getItems: async (siteUrl) => {
    const posts = await getCollection('blog', ({ data }) => !data.draft)
    return posts
      .sort((a, b) => b.data.date.getTime() - a.data.date.getTime())
      .map(p => ({
        title: p.data.title,
        pubDate: p.data.date,
        link: `${siteUrl}/blog/${p.id}/`,
        description: p.data.description,
      }))
  },
})

Both approaches can coexist: build-time sitemap.rss for static deploys, API route for development previewing.

security.txt

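Generated at /.well-known/security.txt per RFC 9116. The specification requires both the Contact and Expires fields; the integration requires contact and emits a security/no-expires hint when expires is missing.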

siteFiles({
  security: {
    contact: 'mailto:info@casoon.de',
    policy: 'https://www.casoon.de/security-policy',
    acknowledgments: 'https://www.casoon.de/hall-of-fame',
    preferredLanguages: ['en', 'de'],
    expires: '2027-01-01T00:00:00.000Z',
    hiring: 'https://www.casoon.de/jobs'
  }
})

Option reference:

| Option | Type | Description |
| --- | --- | --- |
| contact | string \| string[] | Required. mailto: or https: URI for reporting vulnerabilities |
| policy | string | URL of the security policy |
| acknowledgments | string | URL of the acknowledgments or hall-of-fame page |
| preferredLanguages | string[] | BCP 47 language tags, e.g. ['en', 'de'] |
| expires | string \| Date | ISO 8601 expiry date: when to renew the file |
| encryption | string | URL of the PGP public key |
| canonical | string | Canonical URL of this security.txt file |
| hiring | string | URL of a security-focused jobs page |

Disable: Omit the option or set security: false

Generated output:

Contact: mailto:info@casoon.de
Expires: 2027-01-01T00:00:00.000Z
Acknowledgments: https://www.casoon.de/hall-of-fame
Preferred-Languages: en, de
Policy: https://www.casoon.de/security-policy
Hiring: https://www.casoon.de/jobs
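
The option-to-field mapping behind this output can be sketched as below. It is a minimal illustration, not the package's renderSecurityTxt: the names are hypothetical, the field order is copied from the listing above, and the encryption and canonical options are omitted for brevity.

```typescript
interface SecuritySketchOptions {
  contact: string | string[]
  expires?: string | Date
  acknowledgments?: string
  preferredLanguages?: string[]
  policy?: string
  hiring?: string
}

function renderSecuritySketch(o: SecuritySketchOptions): string {
  const lines: string[] = []
  // One Contact line per entry.
  for (const c of Array.isArray(o.contact) ? o.contact : [o.contact]) {
    lines.push(`Contact: ${c}`)
  }
  if (o.expires) {
    // Normalize strings and Date objects to an ISO 8601 timestamp.
    const date = typeof o.expires === 'string' ? new Date(o.expires) : o.expires
    lines.push(`Expires: ${date.toISOString()}`)
  }
  if (o.acknowledgments) lines.push(`Acknowledgments: ${o.acknowledgments}`)
  if (o.preferredLanguages && o.preferredLanguages.length > 0) {
    lines.push(`Preferred-Languages: ${o.preferredLanguages.join(', ')}`)
  }
  if (o.policy) lines.push(`Policy: ${o.policy}`)
  if (o.hiring) lines.push(`Hiring: ${o.hiring}`)
  return lines.join('\n') + '\n'
}
```

Passing the configuration from this section reproduces the six lines shown above.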

humans.txt

Follows the humanstxt.org convention.

siteFiles({
  humans: {
    team: [
      { name: 'Alice', role: 'Development', location: 'Berlin' },
      { name: 'Bob', role: 'Design', twitter: '@bob' }
    ],
    thanks: ['Open Source Community', 'Our early users'],
    technology: ['Astro', 'TypeScript', 'Tailwind CSS'],
    note: 'Built with care.'
  }
})

Option reference:

| Option | Type | Description |
| --- | --- | --- |
| team | TeamMember[] | List of team members |
| thanks | string[] | Acknowledgment entries |
| technology | string[] | Technologies used, rendered as a comma-separated list |
| note | string | Free-form note |
| lastUpdate | string \| Date | Defaults to the build date |

Each entry in team:

| Field | Type | Description |
| --- | --- | --- |
| name | string | Required. Full name |
| role | string | Job title or role |
| twitter | string | Twitter / X handle |
| location | string | City or country |
| email | string | Contact email |

Disable: Omit the option or set humans: false

Generated output (excerpt):

/* TEAM */
    Name: Alice
    Role: Development
    Location: Berlin

/* SITE LAST UPDATED */
    2026-05-06

/* TECHNOLOGY COLOPHON */
    Astro, TypeScript, Tailwind CSS

Build-time audit hints

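The integration emits build-time hints when configuration looks incomplete or incorrect. Each hint has a rule ID, a level (info or warn; sitemap/duplicate-urls is raised to error when the sitemap option audit.errorOnDuplicates is true), and a help message.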

All rule IDs:

| Rule ID | Level | Triggered when |
| --- | --- | --- |
| robots/legal-pages-blocked | warn | A legal page (/privacy, /terms, /impressum, …) is in disallow |
| llms/no-description | info | llms has no description |
| llms/no-sections | info | llms has no sections or sources |
| llms/sections-without-links | info | Sections exist but none have links (and no sources configured) |
| security/no-expires | warn | security has no expires date (required by RFC 9116) |
| security/no-policy | info | security has no policy URL |
| humans/no-team | info | humans has no team entries |
| humans/no-technology | info | humans has no technology entries |
| sitemap/no-site-url | warn | No site URL is configured; <loc> entries will be relative |
| sitemap/empty-sitemap | warn | Sitemap has no entries after all sources are resolved |
| sitemap/duplicate-urls | warn/error | Duplicate URLs detected before deduplication (last wins) |
| sitemap/invalid-priority | warn | One or more entries have priority outside [0, 1] |
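
The "last wins" behaviour noted for sitemap/duplicate-urls can be sketched with a Map keyed by URL. This is an illustration, not the package's deduplicateEntries: dedupeLastWinsSketch and its return shape are hypothetical.

```typescript
interface EntrySketch { loc: string; lastmod?: string }

function dedupeLastWinsSketch(
  entries: EntrySketch[],
): { entries: EntrySketch[]; duplicates: string[] } {
  const byLoc = new Map<string, EntrySketch>()
  const duplicates = new Set<string>()
  for (const entry of entries) {
    // Remember which URLs were seen more than once, for the audit hint.
    if (byLoc.has(entry.loc)) duplicates.add(entry.loc)
    // A later entry replaces an earlier one with the same loc.
    byLoc.set(entry.loc, entry)
  }
  return { entries: Array.from(byLoc.values()), duplicates: Array.from(duplicates) }
}
```

The duplicates list is what an audit step would report, while the deduplicated entries go on to rendering.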

Disable all hints:

siteFiles({ audit: false })

Suppress specific rules:

siteFiles({
  audit: {
    disable: [
      'llms/no-description',
      'security/no-expires',
    ],
  },
})

audit option reference:

| Option | Type | Description |
| --- | --- | --- |
| enabled | boolean | Set to false to silence all hints |
| disable | string[] | Rule IDs to suppress individually |

Passing audit: false is equivalent to audit: { enabled: false }.
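
How these settings narrow the list of hints can be sketched as a small filter. This is an illustration of the documented behaviour, not the exported filterIssues function: the types and the name filterIssuesSketch are hypothetical.

```typescript
type LevelSketch = 'info' | 'warn' | 'error'
interface IssueSketch { rule: string; level: LevelSketch; message: string }
type AuditSettingSketch = false | { enabled?: boolean; disable?: string[] }

function filterIssuesSketch(
  issues: IssueSketch[],
  audit: AuditSettingSketch = {},
): IssueSketch[] {
  // audit: false and audit: { enabled: false } both silence everything.
  if (audit === false || audit.enabled === false) return []
  // Otherwise drop only the rule IDs listed in disable.
  const disabled = new Set(audit.disable ?? [])
  return issues.filter(issue => !disabled.has(issue.rule))
}
```

With disable: ['llms/no-description'], a security/no-policy hint still passes through while the llms hint is dropped.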

Option defaults

| Option | Default behavior |
| --- | --- |
| robots | Enabled: generates robots.txt that allows all crawlers by default |
| llms | Disabled: requires { title } |
| sitemap | Enabled: built-in sitemap generation from Astro's build output |
| sitemap.rss | Disabled: requires { title, description, getItems } |
| security | Disabled: requires { contact } |
| humans | Disabled: generates when any option is provided |
| audit | Enabled: emits build-time hints for all generated files |

Programmatic usage

The renderer functions are exported for use outside of the Astro integration:

import {
  renderRobotsTxt,
  renderLlmsTxt,
  renderSecurityTxt,
  renderHumansTxt,
  renderSitemapXml,
  renderSitemapIndex,
  renderRssFeed,
  resolveEntry,
  deduplicateEntries,
  auditSitemap,
  auditRobots,
  auditLlms,
  auditSecurity,
  auditHumans,
  filterIssues,
  defaultRegistry,
  REGISTRY_VERSION,
} from '@casoon/astro-site-files'
import type {
  AuditOptions,
  AuditIssue,
  RssConfig,
  RssItem,
  BotAction,
  BotCategory,
  RegistryBot,
  Preset,
} from '@casoon/astro-site-files'

// With preset
const robots = renderRobotsTxt({ preset: 'seoOnly' }, 'https://example.com')

// With manual overrides on top of a preset
const robots2 = renderRobotsTxt(
  { preset: 'blockTraining', bots: { PerplexityBot: 'disallow' }, disallow: ['/admin'] },
  'https://example.com',
)

// Without preset (manual)
const robots3 = renderRobotsTxt({ disallow: ['/admin'] }, 'https://example.com')
const llms = renderLlmsTxt({ title: 'My Site', description: 'A site.' })
const security = renderSecurityTxt({ contact: 'mailto:info@casoon.de' })
const humans = renderHumansTxt({ team: [{ name: 'Alice' }], technology: ['Astro'] })

const entries = [{ loc: '/blog/post/' }].map(e => resolveEntry(e, {}, 'https://example.com'))
const xml = renderSitemapXml(deduplicateEntries(entries))

const rss = renderRssFeed(
  { title: 'My Blog', description: 'Latest posts', language: 'en' },
  'https://example.com',
  [{ title: 'Hello', pubDate: new Date(), link: '/blog/hello/' }],
)

defaultRegistry exposes the full built-in bot list. REGISTRY_VERSION is an ISO date string of the last registry update — useful for debugging or displaying in tooling.

The createRssRoute helper is available from the /rss sub-path (see RSS API route above):

import { createRssRoute } from '@casoon/astro-site-files/rss'
import type { CreateRssRouteOptions } from '@casoon/astro-site-files/rss'

This package covers static file generation. Actual crawl enforcement depends on whether bots respect these files — many do not.
