url-metadata
Fetch a URL and scrape its metadata using Node.js or the browser. Has optional mode to parse metadata from HTML strings or Response objects instead. First-class configurable security options available. Content is returned raw by design; sanitize your own output (so you control processing efficiency).
Extracts:
- redirects
- response headers
- performance metrics
- meta tags
- hreflang
- favicons
- citations, per the Google Scholar spec
- Open Graph Protocol (og:) Tags
- Twitter Card Tags
- JSON-LD
- h1-h6 tags
- img tags
- automatic charset detection & decoding (optional)
- the full response body as a string of html (optional)
- x402 errors return payment requirements
Security - v5.1.0+ Protects against:
- Infinite redirect loops: maxRedirects option defaults to 10.
- SSRF attacks via request-filtering-agent in Node.js v18+ (custom options also available)
- Memory-exhaustion attacks (gzip bombs, oversized responses): set size option. Pair with timeout to prevent slow/connection-holding responses.
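As a rough illustration of pairing the size and timeout options, the sketch below builds an options object with explicit limits. The helper name `buildGuardedOptions` and the specific numbers (3 redirects, 5 MB, 5 seconds) are assumptions chosen for the example, not recommendations from this package; only the option names `maxRedirects`, `size` and `timeout` come from this README.

```javascript
// Hypothetical helper: merges caller overrides onto conservative limits.
// The limit values are illustrative assumptions, not package defaults.
function buildGuardedOptions(overrides = {}) {
  const guarded = {
    maxRedirects: 3,        // stop long redirect chains early (package default is 10)
    size: 5 * 1024 * 1024,  // cap the decompressed body at 5 MB (Node.js only; default 0 = no cap)
    timeout: 5000           // give up after 5 seconds (package default is 10000 ms)
  };
  // Caller-supplied values win over the guarded defaults.
  return Object.assign({}, guarded, overrides);
}

const guardedOptions = buildGuardedOptions({ timeout: 8000 });
// Pass as the second argument: await urlMetadata(url, guardedOptions);
```

Keeping the limits in one helper means every call site gets the same ceiling unless it deliberately overrides one.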
More details below. To report a bug or request a feature please open an issue or pull request in GitHub. Please read the Troubleshooting section below before filing a bug.
Install
Works with Node.js versions >=6.0.0 or in the browser when bundled. Example build configs available in the GitHub repo /example-* dirs: Next.js, Vite and Webpack (see /example-typescript).
npm install url-metadata --save
Usage
In your project file:
// Use 'import' in .mjs/ .ts files or if your package.json
// has "type": "module", otherwise use 'require'.
// This package supports both:
import urlMetadata from 'url-metadata';
// const urlMetadata = require('url-metadata');
async function getMetadata(url) {
try {
return await urlMetadata(url); // pass options as 2nd argument
} catch (err) {
console.error(err);
}
}
const metadata = await getMetadata('https://en.wikipedia.org/wiki/WHATWG');
Options & Defaults
To override the default options, pass in a second options argument. The default options are the values below.
const options = {
// Customize the default request headers:
requestHeaders: {
'User-Agent': 'url-metadata (+https://www.npmjs.com/package/url-metadata)',
From: 'example@example.com'
},
// Alternate use-case: pass `Response` object to be parsed
// See example usage below
parseResponseObject: undefined,
// (Node.js v18+ only)
// To prevent SSRF attacks, the default option below blocks
// requests to private network & reserved IP addresses via:
// https://www.npmjs.com/package/request-filtering-agent
// Browser security policies prevent SSRF automatically.
requestFilteringAgentOptions: undefined,
// (Node.js v6+ only)
// Pass in your own custom `agent` to override the
// built-in request filtering agent above
// https://www.npmjs.com/package/node-fetch/v/2.7.0#custom-agent
agent: undefined,
// Maximum redirects in request chain, defaults to 10
maxRedirects: 10,
// `fetch` timeout in milliseconds, default is 10 seconds.
// Time-bounds slow and connection-holding (Slowloris-class) responses.
timeout: 10000,
// (Node.js v6+ only) max size of response in bytes (decompressed).
// Aborts before limit is exceeded so oversized upstreams can't
// exhaust process memory. Pair with `timeout` option above.
// Default set to 0 disables max size:
size: 0,
// (Node.js v6+ only) compression defaults to true
// Support gzip/deflate content encoding, set `false` to disable
compress: true,
// Force-rewrites favicon and img url strings returned to use https://,
// valid for images with absolute & protocol-relative urls.
// Relative urls pass thru untouched:
ensureSecureImageRequest: true,
// Charset to decode response with (ex: 'auto', 'utf-8', 'EUC-JP')
// defaults to auto-detect in `Content-Type` header or meta tag
// if none found, default `auto` option falls back to `utf-8`
// override by passing in charset here (ex: 'windows-1251'):
decode: 'auto',
// (Browser only) `fetch` API cache setting
cache: 'no-cache',
// (Browser only) `fetch` API mode (ex: 'cors', 'same-origin', etc)
mode: 'cors',
// Number of characters to truncate description to
descriptionLength: 750,
// Include raw response body as string
includeResponseBody: false
};
// Options usage:
const metadata = await urlMetadata(url, options);
// Alternately, parse a Response object instead:
try {
// fetch the url in your own code
const response = await fetch('https://en.wikipedia.org/wiki/WHATWG');
// ...do your own thing
// then pass the `response` object to be parsed for metadata:
const metadata = await urlMetadata(null, {
parseResponseObject: response
});
console.log(metadata);
} catch (err) {
console.error(err);
}
// Similarly, if you have a string of html you can create
// a Response object and pass the html string into it:
const html = `
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Metadata page</title>
<meta name="author" content="foobar">
<meta name="keywords" content="HTML, CSS, JavaScript">
</head>
<body>
<h1>Metadata page</h1>
</body>
</html>
`;
const response = new Response(html, {
headers: {
'Content-Type': 'text/html'
}
});
const metadata = await urlMetadata(null, {
parseResponseObject: response
});
console.log(metadata);
Returns
Returns a promise resolved with a JSON object. Note that the returned url field will be the last hop in the request chain if there are redirects.
A basic template for the object can be found in lib/metadata-fields.js. Any additional meta tags found on the page are appended as new fields to the object. Extractor details live in /lib/extract-*.
Content is returned raw, by design, as found on the page without sanitizing it. Only you know its output context, so sanitization is your call (e.g. DOMPurify before rendering to a DOM; parameterized queries when storing to SQL). Treat all extracted values as untrusted.
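As one minimal sketch of that idea, the helper below escapes a value before it is placed in HTML element text or a quoted attribute. `escapeHtml` is a hypothetical name, and this is not a substitute for DOMPurify when you need to render markup; it only assumes you want to display an extracted string as plain text.

```javascript
// Hypothetical helper: escape the five HTML-significant characters so an
// extracted string can be placed in element text or a quoted attribute.
function escapeHtml(value) {
  const replacements = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#39;'
  };
  return String(value).replace(/[&<>"']/g, (ch) => replacements[ch]);
}

// Example with a hostile title such as a scraped page might supply:
const untrustedTitle = '<img src=x onerror="alert(1)">';
const safeTitle = escapeHtml(untrustedTitle);
// safeTitle === '&lt;img src=x onerror=&quot;alert(1)&quot;&gt;'
```

Other output contexts (URLs, SQL, shell arguments) need their own encoding, so a helper like this covers only the HTML text case.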
The object consists of key/value pairs as strings, with exceptions:
- redirects is an object with count (number) and chain (array of { order, url, statusCode })
- hreflang, favicons, and responseHeaders are arrays of objects containing key/value pairs of strings
- jsonld is an array of objects
- all meta tags that begin with citation_ (ex: citation_author) return with keys as strings and values that are an array of strings, conforming to the Google Scholar spec, which allows multiple citation meta tags with different content values. So if the html contains:
<meta name="citation_author" content="Arlitsch, Kenning">
<meta name="citation_author" content="OBrien, Patrick">
... it will return as:
'citation_author': ["Arlitsch, Kenning", "OBrien, Patrick"],
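To make the non-string fields above concrete, the sketch below reads `redirects` and a `citation_` field from a hand-written sample object. The sample values and the helper names `describeRedirects` and `citationValues` are assumptions for illustration; the shapes follow the description in this section, so check them against real output before relying on them.

```javascript
// Hand-written sample shaped like the description above (not real output).
const sample = {
  url: 'https://example.com/final',
  redirects: {
    count: 1,
    chain: [{ order: 1, url: 'http://example.com/start', statusCode: 301 }]
  },
  citation_author: ['Arlitsch, Kenning', 'OBrien, Patrick']
};

// Hypothetical helper: one formatted line per hop, sorted by `order`.
function describeRedirects(metadata) {
  const chain = (metadata.redirects && metadata.redirects.chain) || [];
  return chain
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => a.order - b.order)
    .map((hop) => `${hop.order}. ${hop.statusCode} ${hop.url}`);
}

// Hypothetical helper: always returns an array, even if the tag is absent.
function citationValues(metadata, name) {
  const value = metadata[`citation_${name}`];
  if (value === undefined || value === null || value === '') return [];
  return Array.isArray(value) ? value : [value];
}

const hops = describeRedirects(sample);
const authors = citationValues(sample, 'author');
```

Normalizing to an array in one place keeps downstream code from branching on whether a page supplied zero, one, or several citation tags.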
Troubleshooting
Issue: Request returns 404, 403 errors or a CAPTCHA form. Your request may have been blocked by the server because it suspects you are a bot or scraper. Check this list to ensure you're not triggering a block.
Issue: No fetch implementation found. You're in either an older browser that doesn't have the native fetch API or a Node.js environment that doesn't support node-fetch (Node.js < v6). File a GitHub issue or try downgrading to url-metadata version 2.5.0, which uses the now-deprecated request module.
Issue: DNS Lookup errors. The SSRF filtering agent defaults on this package prevent calls to private ip addresses, link-local addresses and reserved ip addresses. To change or disable this feature you need to pass custom requestFilteringAgentOptions. More info here.
Issue: Response status code 0 or CORS errors. The fetch request failed at either the network or protocol level. Possible causes:
- CORS errors. Try changing the mode option (ex: cors, same-origin, etc) or setting the Access-Control-Allow-Origin header on the server response from the url you are requesting, if you have access to it.
- Trying to access an https resource that has an invalid certificate, or trying to access an http resource from a page with an https origin.
- A browser plugin such as an ad-blocker or privacy protector.
You may also want to try the hosted version of this package: Minifetch.