@lexiprep/core

Version 0.7.0 · Licence: MIT · Size: 80 kB · Weekly: 1.1K

Reusable, framework-agnostic pipeline that turns an EPUB or PDF into a frequency-sorted word list — the language-processing engine behind lexiprep, usable on its own in any Node/TypeScript project. Published on npm as @lexiprep/core.

Pure TypeScript, no server or database dependencies. Runs locally, in a worker, in tests, or (via jszip) potentially in the browser.

Install

pnpm add @lexiprep/core

Usage

import { readFile } from "node:fs/promises";
import { readEpub, analyzeBook } from "@lexiprep/core";

const book = await readEpub(await readFile("book.epub"));

const analysis = analyzeBook(book, {
  excludeStopwords: true,   // drop the/a/on/this...
  from: 0,                  // optional section range (EPUB chapters / PDF pages)
  to: 5,
});

console.log(analysis.metadata.title, analysis.totalTokens);
for (const { word, count } of analysis.frequencies.slice(0, 20)) {
  console.log(count, word);
}
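Once you have a frequency list, a common next step is to ask how much of the text the top N words cover. The sketch below is one way to compute that from entries shaped like `{ word, count }`, as in the loop above. The `WordCount` type and the `topCoverage` helper are hypothetical names introduced here for illustration; they are not exports of @lexiprep/core, and the sample data is made up.

```typescript
// Hypothetical helper for illustration; not part of @lexiprep/core.
interface WordCount {
  word: string;
  count: number;
}

// Fraction of all counted occurrences covered by the first `n` entries.
// Assumes `frequencies` is already sorted by descending count.
function topCoverage(frequencies: WordCount[], n: number): number {
  const total = frequencies.reduce((sum, f) => sum + f.count, 0);
  if (total === 0) return 0;
  const top = frequencies.slice(0, n).reduce((sum, f) => sum + f.count, 0);
  return top / total;
}

// Made-up sample data standing in for `analysis.frequencies`.
const sample: WordCount[] = [
  { word: "whale", count: 6 },
  { word: "sea", count: 3 },
  { word: "ship", count: 1 },
];

console.log(topCoverage(sample, 2)); // 0.9
```

The helper sums the listed counts rather than relying on `analysis.totalTokens`, since whether that total includes filtered-out stopwords is not stated here.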

PDFs work the same way: readPdf returns the same Book, with each page as a section:

import { readFile } from "node:fs/promises";
import { readPdf } from "@lexiprep/core";

const pdfBook = await readPdf(await readFile("book.pdf"));
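Because both readers return the same Book, calling code only needs to choose a reader once. The following is a minimal sketch of picking one by file extension. `detectFormat` and `BookFormat` are hypothetical names, not library exports, and checking the extension is an assumption made for brevity (the library may or may not sniff file contents itself).

```typescript
import { extname } from "node:path";

// Hypothetical helper for illustration; not part of @lexiprep/core.
type BookFormat = "epub" | "pdf";

// Decide which reader to use from the file extension (case-insensitive).
function detectFormat(filename: string): BookFormat {
  const ext = extname(filename).toLowerCase();
  if (ext === ".epub") return "epub";
  if (ext === ".pdf") return "pdf";
  throw new Error(`Unsupported book format: ${filename}`);
}

// Possible use with the readers shown above:
//   const read = detectFormat(path) === "pdf" ? readPdf : readEpub;
//   const book = await read(await readFile(path));
```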

API

A Book is { metadata, sections[] }. A section is one EPUB content document or one PDF page, in reading order — so the section range is a chapter range for EPUB and a page range for PDF.
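To make the section-range idea concrete, here is a rough sketch of a Book-like shape and a range selection over its sections. Everything in it is an assumption for illustration: the `BookSketch` and `SectionSketch` names, the `text` field, and in particular the choice to treat `from` and `to` as 0-based and inclusive. Check the real types and the specs for the actual semantics.

```typescript
// Hypothetical shapes for illustration; the real Book type may differ.
interface SectionSketch {
  text: string; // assumed field name
}

interface BookSketch {
  metadata: { title?: string };
  sections: SectionSketch[];
}

// Select sections from..to, treating both bounds as 0-based and inclusive
// (an assumption, not confirmed library behavior).
function selectSections(
  book: BookSketch,
  from = 0,
  to = book.sections.length - 1,
): SectionSketch[] {
  return book.sections.slice(Math.max(0, from), to + 1);
}

// For an EPUB each entry would be a chapter-like content document;
// for a PDF each entry would be a page.
const demo: BookSketch = {
  metadata: { title: "Demo" },
  sections: [{ text: "a" }, { text: "b" }, { text: "c" }, { text: "d" }],
};

console.log(selectSections(demo, 1, 2).map((s) => s.text)); // [ 'b', 'c' ]
```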

| Export | Purpose |
| --- | --- |
| `readEpub(input)` | Parse an EPUB (2/3) into a `Book`. Sections are content documents in reading order, with clean text. |
| `readPdf(input)` | Parse a PDF into a `Book`. One section per page, in page order, with clean text. Runs in Node and the browser. |
| `analyzeBook(book, options?)` | Frequency report over a `Book` (EPUB or PDF), with section-range (`from`/`to`), stopword, and min-length filters. |
| `countWords(text, options?)` | Frequency count for a raw string. |
| `tokenize(text)` / `normalizeWord(raw)` | Locale-agnostic word tokenization and normalization (English today; ready for accents and Spanish). |
| `htmlToText(html)` | Strip XHTML to clean text. |
| `ENGLISH_STOPWORDS` | Default English function-word set. |
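For readers who want a feel for what a tokenize, normalize, filter, and count pipeline involves, here is a self-contained sketch. It is not the library's implementation: the `*Sketch` names, the tiny stopword set, the regular expression, and the option names are all stand-ins chosen for illustration, and the real functions may behave differently.

```typescript
// Illustrative stand-ins; these are NOT the library's implementations.
const STOPWORDS_SKETCH = new Set(["the", "a", "on", "this"]);

// Split text into words: runs of Unicode letters, allowing inner apostrophes.
function tokenizeSketch(text: string): string[] {
  return text.match(/\p{L}+(?:['’]\p{L}+)*/gu) ?? [];
}

// Normalize a raw token so that "The" and "the" count as the same word.
function normalizeSketch(raw: string): string {
  return raw.normalize("NFC").toLowerCase();
}

interface CountOptionsSketch {
  excludeStopwords?: boolean;
  minLength?: number;
}

// Count words and return entries sorted by descending count, then alphabetically.
function countWordsSketch(
  text: string,
  opts: CountOptionsSketch = {},
): { word: string; count: number }[] {
  const counts = new Map<string, number>();
  for (const raw of tokenizeSketch(text)) {
    const word = normalizeSketch(raw);
    if (opts.excludeStopwords && STOPWORDS_SKETCH.has(word)) continue;
    if (word.length < (opts.minLength ?? 1)) continue;
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return Array.from(counts, ([word, count]) => ({ word, count })).sort(
    (a, b) => b.count - a.count || (a.word < b.word ? -1 : a.word > b.word ? 1 : 0),
  );
}

console.log(countWordsSketch("The cat sat on the mat. The cat!", { excludeStopwords: true }));
// [ { word: 'cat', count: 2 }, { word: 'mat', count: 1 }, { word: 'sat', count: 1 } ]
```

Matching on `\p{L}` rather than `[a-z]` is what lets a tokenizer like this handle accented words such as "café" without extra configuration.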

Full behavior and design notes: see the lexiprep specs.

CLI

pnpm analyze book.epub --top 30      # or: pnpm analyze book.pdf
# options: --from N --to N --min-length N --keep-stopwords --json

Develop

No Docker, no services — just Node + pnpm:

pnpm install
pnpm test          # vitest
pnpm typecheck
pnpm build         # emits dist/ (tsc)

License

MIT — reusable in any project, open or closed.

Keywords

epub pdf ebook tokenize word-frequency vocabulary language-learning nlp