5.6.1 • Published 13h ago

url-metadata

Licence: MIT • Version: 5.6.1 • Deps: 3 • Size: 45 kB • Vulns: 0 • Weekly: 27.2K

Fetch a URL and scrape its metadata using Node.js or the browser. It can optionally parse metadata from HTML strings or Response objects instead. Configurable security options are built in. Content is returned raw by design: you sanitize the output yourself, which keeps the processing cost under your control.


Looking for a quick hosted solution? Minifetch is an SEO toolkit built on top of this package by the same author/maintainer. To get started for free, run: npm install minifetch-api

Extracts:

  • redirects
  • response headers
  • performance metrics
  • meta tags
  • hreflang
  • favicons
  • citations, per the Google Scholar spec
  • Open Graph Protocol (og:) Tags
  • Twitter Card Tags
  • JSON-LD
  • h1-h6 tags
  • img tags
  • automatic charset detection & decoding (optional)
  • the full response body as a string of html (optional)
  • x402 errors return payment requirements

Security - v5.1.0+ Protects against:

  • Infinite redirect loops: maxRedirects option defaults to 10.
  • SSRF attacks via request-filtering-agent in Node.js v18+ (custom options also available)
  • Memory-exhaustion attacks (gzip bombs, oversized responses): set size option. Pair with timeout to prevent slow/connection-holding responses.
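
As a rough sketch of how the three protections above might be combined when fetching untrusted URLs: the helper name `hardenedOptions` and the specific limits (5 redirects, 8 seconds, 2 MB) are illustrative assumptions, not package defaults or maintainer recommendations. Only the option names `maxRedirects`, `timeout`, and `size` come from this document, and `size` is documented as Node.js only.

```javascript
// Hypothetical helper: builds an options object for fetching untrusted URLs.
// Only `maxRedirects`, `timeout`, and `size` are url-metadata options;
// the limits chosen here are assumptions to tune for your workload.
function hardenedOptions(overrides = {}) {
  return {
    maxRedirects: 5,        // stop redirect loops early (package default is 10)
    timeout: 8000,          // milliseconds; bounds slow responses
    size: 2 * 1024 * 1024,  // bytes, decompressed; 0 (the default) means no limit
    ...overrides            // caller-supplied values win over the sketch's limits
  };
}

// Usage (not executed here; requires the package and network access):
// const metadata = await urlMetadata('https://example.com', hardenedOptions());
```

Spreading `overrides` last lets a caller loosen or tighten a single limit without restating the others.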

More details below. To report a bug or request a feature please open an issue or pull request in GitHub. Please read the Troubleshooting section below before filing a bug.

Install

Works with Node.js versions >=6.0.0 or in the browser when bundled. Example build configs available in the GitHub repo /example-* dirs: Next.js, Vite and Webpack (see /example-typescript).

npm install url-metadata --save

Usage

In your project file:

// Use 'import' in .mjs/.ts files or if your package.json
// has "type": "module", otherwise use 'require'.
// This package supports both:
import urlMetadata from 'url-metadata';
// const urlMetadata = require('url-metadata');

async function getMetadata(url) {
  try {
    return await urlMetadata(url); // pass options as 2nd argument
  } catch (err) {
    console.error(err);
  }
}

const metadata = await getMetadata('https://en.wikipedia.org/wiki/WHATWG');

Options & Defaults

To override the default options, pass in a second options argument. The default options are the values below.

const options = {

  // Customize the default request headers:
  requestHeaders: {
    'User-Agent': 'url-metadata (+https://www.npmjs.com/package/url-metadata)',
    From: 'example@example.com'
  },

  // Alternate use-case: pass `Response` object to be parsed
  // See example usage below
  parseResponseObject: undefined,

  // (Node.js v18+ only)
  // To prevent SSRF attacks, the default option below blocks
  // requests to private network & reserved IP addresses via:
  // https://www.npmjs.com/package/request-filtering-agent
  // Browser security policies prevent SSRF automatically.
  requestFilteringAgentOptions: undefined,

  // (Node.js v6+ only)
  // Pass in your own custom `agent` to override the
  // built-in request filtering agent above
  // https://www.npmjs.com/package/node-fetch/v/2.7.0#custom-agent
  agent: undefined,

  // Maximum redirects in request chain, defaults to 10
  maxRedirects: 10,

  // `fetch` timeout in milliseconds, default is 10 seconds.
  // Time-bounds slow and connection-holding (Slowloris-class) responses.
  timeout: 10000,

  // (Node.js v6+ only) max size of response in bytes (decompressed).
  // Aborts before limit is exceeded so oversized upstreams can't
  // exhaust process memory. Pair with `timeout` option above.
  // Default set to 0 disables max size:
  size: 0,

  // (Node.js v6+ only) compression defaults to true
  // Support gzip/deflate content encoding, set `false` to disable
  compress: true,

  // Force-rewrites favicon and img url strings returned to use https://,
  // valid for images with absolute & protocol-relative urls.
  // Relative urls pass thru untouched:
  ensureSecureImageRequest: true,

  // Charset to decode response with (ex: 'auto', 'utf-8', 'EUC-JP')
  // defaults to auto-detect in `Content-Type` header or meta tag
  // if none found, default `auto` option falls back to `utf-8`
  // override by passing in charset here (ex: 'windows-1251'):
  decode: 'auto',

  // (Browser only) `fetch` API cache setting
  cache: 'no-cache',

  // (Browser only) `fetch` API mode (ex: 'cors', 'same-origin', etc)
  mode: 'cors',

  // Number of characters to truncate description to
  descriptionLength: 750,

  // Include raw response body as string
  includeResponseBody: false
};

// Options usage:
const metadata = await urlMetadata(url, options);

// Alternately, parse a Response object instead:
try {
  // fetch the url in your own code
  const response = await fetch('https://en.wikipedia.org/wiki/WHATWG');
  // ...do your own thing
  // then pass the `response` object to be parsed for metadata:
  const metadata = await urlMetadata(null, {
    parseResponseObject: response
  });
  console.log(metadata);
} catch (err) {
  console.error(err);
}

// Similarly, if you have a string of html you can create
// a Response object and pass the html string into it:
const html = `
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Metadata page</title>
    <meta name="author" content="foobar">
    <meta name="keywords" content="HTML, CSS, JavaScript">
  </head>
  <body>
    <h1>Metadata page</h1>
  </body>
</html>
`;
const response = new Response(html, {
  headers: {
    'Content-Type': 'text/html'
  }
});
// (named `htmlMetadata` so it doesn't redeclare `metadata` from above)
const htmlMetadata = await urlMetadata(null, {
  parseResponseObject: response
});
console.log(htmlMetadata);

Returns

Returns a promise resolved with a JSON object. Note that the returned url field will be the last hop in the request chain if there are redirects.
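
To illustrate the note about redirects, here is a minimal sketch that summarizes how a request was redirected. It assumes the `redirects` shape described later in this section (`count` plus a `chain` of `{ order, url, statusCode }` entries); the helper name `describeRedirects` and the sample object in the usage note are made up for illustration.

```javascript
// Hypothetical helper: summarizes the redirect chain of a result object.
// Assumes the documented shape: metadata.url is the final hop and
// metadata.redirects is { count, chain: [{ order, url, statusCode }] }.
function describeRedirects(metadata) {
  const redirects = metadata.redirects || { count: 0, chain: [] };
  const chain = Array.isArray(redirects.chain) ? redirects.chain : [];
  const hops = chain
    .slice()                               // copy so the input is not reordered
    .sort((a, b) => a.order - b.order)     // list hops in request order
    .map((hop) => `${hop.statusCode} ${hop.url}`);
  return {
    redirected: redirects.count > 0,
    finalUrl: metadata.url,
    hops
  };
}

// Usage with a made-up result object:
// describeRedirects({
//   url: 'https://example.com/new',
//   redirects: { count: 1, chain: [{ order: 1, url: 'http://example.com/old', statusCode: 301 }] }
// });
```

If you store results keyed by URL, the `finalUrl` value is usually the safer key, since several starting URLs can resolve to the same last hop.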

A basic template for the object can be found in lib/metadata-fields.js. Any additional meta tags found on the page are appended as new fields to the object. Extractor details live in /lib/extract-*.

Content is returned raw, by design, as found on the page without sanitizing it. Only you know its output context, so sanitization is your call (e.g. DOMPurify before rendering to a DOM; parameterized queries when storing to SQL). Treat all extracted values as untrusted.
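
As one possible illustration of treating extracted values as untrusted, the sketch below escapes a value before placing it in HTML text. The helper names `escapeHtml` and `renderTitle` are hypothetical, and the `title` field stands in for any extracted string. Hand-rolled escaping like this only covers plain text and quoted attribute values; it is not a substitute for a sanitizer such as DOMPurify when you render extracted HTML.

```javascript
// Minimal escaping for untrusted strings placed in HTML text or quoted attributes.
// `&` is replaced first so the entities added afterwards are not double-escaped.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Hypothetical rendering step: `metadata.title` stands in for any extracted field.
function renderTitle(metadata) {
  return `<h2>${escapeHtml(metadata.title || '')}</h2>`;
}
```

A different output context (SQL, shell, URL, JSON) needs its own encoding or parameterization; this helper is only appropriate for HTML.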

The object consists of key/value pairs as strings, with exceptions:

  • redirects is an object with count (number) and chain (array of { order, url, statusCode })
  • hreflang, favicons, and responseHeaders are each an array of objects containing key/value pairs of strings
  • jsonld is an array of objects
  • all meta tags that begin with citation_ (ex: citation_author) return with the key as a string and the value as an array of strings. This conforms to the Google Scholar spec, which allows multiple citation meta tags with different content values. So if the html contains:
<meta name="citation_author" content="Arlitsch, Kenning">
<meta name="citation_author" content="OBrien, Patrick">

... it will return as:

'citation_author': ["Arlitsch, Kenning", "OBrien, Patrick"],
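
A short sketch of how the citation fields might be consumed: the helper below gathers every `citation_*` key from a result object. The name `collectCitations` is hypothetical, and the array-wrapping is a defensive assumption rather than documented behavior.

```javascript
// Hypothetical helper: collects every `citation_*` field from a result object.
// Per the text above, each such key holds an array of strings.
function collectCitations(metadata) {
  const citations = {};
  for (const [key, value] of Object.entries(metadata)) {
    if (key.startsWith('citation_')) {
      // Defensive: wrap a lone value so callers can always iterate.
      citations[key] = Array.isArray(value) ? value : [value];
    }
  }
  return citations;
}

// Usage with the example values from above:
// collectCitations({ citation_author: ['Arlitsch, Kenning', 'OBrien, Patrick'] });
```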

Troubleshooting

Issue: Request returns 404, 403 errors or a CAPTCHA form. Your request may have been blocked by the server because it suspects you are a bot or scraper. Check this list to ensure you're not triggering a block.
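
One thing that may be worth trying, sketched below, is sending request headers that identify your client through the `requestHeaders` option. The header names mirror the defaults shown under Options & Defaults; the helper name `identifyingHeaders`, the app name, and the contact address are placeholders. Whether this avoids a block depends entirely on the server.

```javascript
// Hypothetical helper: builds a `requestHeaders` option that identifies your client.
// 'User-Agent' and 'From' are the same header names the package sets by default.
function identifyingHeaders(appName, contactEmail) {
  return {
    requestHeaders: {
      'User-Agent': `${appName} (+https://www.npmjs.com/package/url-metadata)`,
      From: contactEmail
    }
  };
}

// Usage (not executed here; requires the package and network access):
// const metadata = await urlMetadata(url, identifyingHeaders('my-crawler', 'ops@example.com'));
```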

Issue: No fetch implementation found. You're in either an older browser that doesn't have the native fetch API or a Node.js environment that doesn't support node-fetch (Node.js < v6). File a GitHub issue or try downgrading to url-metadata version 2.5.0, which uses the now-deprecated request module.

Issue: DNS lookup errors. By default, this package's SSRF filtering agent blocks calls to private IP addresses, link-local addresses, and reserved IP addresses. To change or disable this behavior, pass custom requestFilteringAgentOptions. More info here.
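
A hedged sketch of passing custom `requestFilteringAgentOptions`, relaxing the filter only for local development: the helper name `filteringOptionsFor` is hypothetical, and `allowPrivateIPAddress` is assumed to be an option of the request-filtering-agent package. Check that package's documentation for the exact option names in your installed version.

```javascript
// Hypothetical helper: relaxes SSRF filtering for local development only.
// `allowPrivateIPAddress` is an assumed request-filtering-agent option name;
// verify it against that package's documentation before relying on it.
function filteringOptionsFor(environment) {
  if (environment === 'development') {
    return { requestFilteringAgentOptions: { allowPrivateIPAddress: true } };
  }
  // Leaving the option undefined keeps the package's default blocking behavior.
  return { requestFilteringAgentOptions: undefined };
}

// Usage (not executed here; requires the package):
// const metadata = await urlMetadata('http://localhost:3000', filteringOptionsFor(process.env.NODE_ENV));
```

Keeping the relaxed setting behind an explicit environment check reduces the chance of shipping it to production, where the default filtering is what protects against SSRF.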

Issue: Response status code 0 or CORS errors. The fetch request failed at either the network or protocol level. Possible causes:

  • CORS errors. Try changing the mode option (ex: cors, same-origin, etc) or setting the Access-Control-Allow-Origin header on the server response from the url you are requesting if you have access to it.

  • Trying to access an https resource that has invalid certificate, or trying to access an http resource from a page with an https origin.

  • A browser plugin such as an ad-blocker or privacy protector.

You may also want to try the hosted version of this package: Minifetch.
