web-content-extractor

A small and fast library for extracting content from HTML.

It is an implementation of the paper "DOM Based Content Extraction via Text Density".
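
As a rough illustration of the idea behind the paper: text density can be thought of as the number of text characters in a subtree divided by the number of tags in it, so text-heavy blocks score high and link-heavy navigation scores low. The sketch below assumes a hypothetical `SketchNode` shape and hypothetical helper names. It is not this library's internal code, and the paper's full method includes refinements that are not shown here.

```typescript
// Hypothetical minimal node shape, for illustration only (not the library's types).
interface SketchNode {
  tag: string;
  text: string; // text directly inside this node, excluding children
  children: SketchNode[];
}

// Total number of text characters in the subtree rooted at `node`.
function charCount(node: SketchNode): number {
  return node.children.reduce((sum, c) => sum + charCount(c), node.text.length);
}

// Number of descendant tags in the subtree (the node itself is not counted).
function tagCount(node: SketchNode): number {
  return node.children.reduce((sum, c) => sum + 1 + tagCount(c), 0);
}

// Text density: characters per tag. A tag count of 0 is treated as 1
// so that leaf nodes do not divide by zero.
function textDensity(node: SketchNode): number {
  const tags = tagCount(node);
  return charCount(node) / (tags === 0 ? 1 : tags);
}
```

Under this sketch, an article wrapper holding one long paragraph scores much higher than a navigation block holding many short links, which is the signal used to separate content from page chrome.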

Installation

To install via NPM:

npm i @wrtnlabs/web-content-extractor

Usage

import { extractContent } from "@wrtnlabs/web-content-extractor";

const { title, description, content, contentHtmls, links } =
  extractContent(html);

console.log("title", title);
console.log("description", description);

console.log("content", content); // The content of the page; string

for (const fragment of contentHtmls) {
  console.log("fragment", fragment); // The fragment of the content; string
}

for (const link of links) {
  console.log("url", link.url); // The URL of the link
  console.log("content", link.content); // The content of the link
}
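
If the extracted links need to be followed, it may help to resolve them against the URL of the page first. The sketch below assumes that `link.url` can be a relative URL, which this README does not state. `resolveLinks` and `ExtractedLink` are hypothetical names, not part of the package; only the `url` and `content` fields come from the usage example above.

```typescript
// Shape of a link as used in the usage example above: a URL and its text.
interface ExtractedLink {
  url: string;
  content: string;
}

// Hypothetical helper: resolve each link against the page URL.
// Links that cannot be parsed as URLs are skipped.
function resolveLinks(links: ExtractedLink[], pageUrl: string): ExtractedLink[] {
  const resolved: ExtractedLink[] = [];
  for (const link of links) {
    try {
      // The standard URL constructor resolves relative inputs against the base.
      resolved.push({ ...link, url: new URL(link.url, pageUrl).href });
    } catch {
      // new URL throws a TypeError on input it cannot parse; drop that link.
    }
  }
  return resolved;
}
```

Absolute URLs pass through unchanged, so the helper is safe to apply even if the links are already absolute.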

Note

It strips tags that are generally considered non-content, including:

script
noscript
style
nav
header
footer
img
svg
video
audio
form
label
input
select
option
button
object
embed
iframe
canvas
map
area
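
When preparing input or debugging missing output, it can be handy to check whether a given tag is on this list. The sketch below copies the tag names above into a set. `NON_CONTENT_TAGS` and `isNonContentTag` are hypothetical names, not exports of the package, and the library's actual list may contain more tags than the ones shown here.

```typescript
// Tag names copied from the list above.
const NON_CONTENT_TAGS: ReadonlySet<string> = new Set([
  "script", "noscript", "style", "nav", "header", "footer", "img", "svg",
  "video", "audio", "form", "label", "input", "select", "option", "button",
  "object", "embed", "iframe", "canvas", "map", "area",
]);

// Hypothetical helper: true if the tag name is on the list (case-insensitive).
function isNonContentTag(tagName: string): boolean {
  return NON_CONTENT_TAGS.has(tagName.toLowerCase());
}
```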
