An LLM-free library for extracting main content from HTML strings via text density analysis

wrtnlabs/web-content-extractor

web-content-extractor

A small and fast library for extracting content from HTML.

It is one implementation of the paper "DOM Based Content Extraction via Text Density".
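
To give a feel for the idea behind the paper, here is a minimal sketch of the basic text density measure: the number of text characters in a subtree divided by the number of tags in it. The `ElementNode` type and the `textDensity` function are hypothetical names used only for illustration. They are not part of this library's API and do not reflect its internals. A tag count of zero is treated as one here to avoid dividing by zero.

```typescript
// Hypothetical minimal DOM model for illustration; not the library's types.
interface ElementNode {
  tag: string;
  children: Array<ElementNode | string>; // plain strings stand for text nodes
}

interface DensityStats {
  chars: number;   // text characters found in the subtree
  tags: number;    // descendant tags found in the subtree
  density: number; // chars divided by tags (tags clamped to at least 1)
}

// Recursively count characters and descendant tags, then take their ratio.
function textDensity(node: ElementNode): DensityStats {
  let chars = 0;
  let tags = 0;
  for (const child of node.children) {
    if (typeof child === "string") {
      chars += child.trim().length;
    } else {
      const sub = textDensity(child);
      chars += sub.chars;
      tags += 1 + sub.tags; // the child itself plus its descendants
    }
  }
  const density = chars / Math.max(tags, 1);
  return { chars, tags, density };
}

// A text-heavy block versus a link-heavy navigation block.
const articleNode: ElementNode = {
  tag: "article",
  children: [{ tag: "p", children: ["Hello world, this is the main text."] }],
};
const navNode: ElementNode = {
  tag: "nav",
  children: [
    { tag: "a", children: ["Home"] },
    { tag: "a", children: ["About"] },
    { tag: "a", children: ["Blog"] },
  ],
};

const articleStats = textDensity(articleNode);
const navStats = textDensity(navNode);
console.log("article density", articleStats.density);
console.log("nav density", navStats.density);
```

Blocks with long runs of text and few tags score high, while menus and link lists score low, which is what lets a density-based extractor separate main content from page chrome.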

Installation

To install via NPM:

npm i @wrtnlabs/web-content-extractor

Usage

import { extractContent } from "@wrtnlabs/web-content-extractor";

// `html` is the page's HTML source as a string.
const { title, description, content, contentHtmls, links } =
  extractContent(html);

console.log("title", title);
console.log("description", description);

console.log("content", content); // The content of the page; string

for (const fragment of contentHtmls) {
  console.log("fragment", fragment); // The fragment of the content; string
}

for (const link of links) {
  console.log("url", link.url); // The URL of the link
  console.log("content", link.content); // The content of the link
}
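
The returned links may need some post-processing before further use. The sketch below assumes each link has the `url` and `content` string fields shown above, and that `url` values might be relative, which this README does not confirm. `ExtractedLink` and `normalizeLinks` are hypothetical helper names, not part of the library.

```typescript
// Mirrors the two fields used in the usage example; an assumed shape.
interface ExtractedLink {
  url: string;
  content: string;
}

// Resolve each URL against a base, drop duplicates, and trim the link text.
function normalizeLinks(
  links: ExtractedLink[],
  baseUrl: string,
): ExtractedLink[] {
  const seen = new Set<string>();
  const result: ExtractedLink[] = [];
  for (const link of links) {
    let absolute: string;
    try {
      absolute = new URL(link.url, baseUrl).href;
    } catch {
      continue; // skip values that cannot be parsed as URLs
    }
    if (seen.has(absolute)) continue;
    seen.add(absolute);
    result.push({ url: absolute, content: link.content.trim() });
  }
  return result;
}

// Sample data standing in for the `links` array from the usage example.
const sampleLinks: ExtractedLink[] = [
  { url: "/docs", content: " Docs " },
  { url: "https://example.com/docs", content: "Documentation" },
  { url: "about", content: "About" },
];
const normalized = normalizeLinks(sampleLinks, "https://example.com/blog/post");
for (const link of normalized) {
  console.log(link.url, link.content);
}
```

The base URL would typically be the address the HTML was fetched from, since the extractor itself only receives an HTML string.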

Note

It strips tags that are considered non-content, including:

  • script
  • noscript
  • style
  • nav
  • header
  • footer
  • img
  • svg
  • video
  • audio
  • form
  • label
  • input
  • select
  • option
  • button
  • object
  • embed
  • iframe
  • canvas
  • map
  • area
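
To illustrate what stripping these tags amounts to, here is a minimal sketch over a simplified node tree. `SimpleNode` and `stripNonContent` are hypothetical names for illustration only. The library works on real HTML and its actual implementation may differ.

```typescript
// Hypothetical minimal DOM model for illustration; not the library's types.
interface SimpleNode {
  tag: string;
  children: Array<SimpleNode | string>; // plain strings stand for text nodes
}

// Tag names taken from the list above.
const NON_CONTENT_TAGS = new Set<string>([
  "script", "noscript", "style", "nav", "header", "footer",
  "img", "svg", "video", "audio", "form", "label", "input",
  "select", "option", "button", "object", "embed", "iframe",
  "canvas", "map", "area",
]);

// Return a copy of the tree with every non-content subtree removed.
function stripNonContent(node: SimpleNode): SimpleNode {
  const children: Array<SimpleNode | string> = [];
  for (const child of node.children) {
    if (typeof child === "string") {
      children.push(child);
    } else if (!NON_CONTENT_TAGS.has(child.tag.toLowerCase())) {
      children.push(stripNonContent(child));
    }
  }
  return { tag: node.tag, children };
}

const page: SimpleNode = {
  tag: "body",
  children: [
    { tag: "nav", children: [{ tag: "a", children: ["Home"] }] },
    {
      tag: "main",
      children: [
        { tag: "p", children: ["Main text"] },
        { tag: "script", children: ["track()"] },
      ],
    },
    { tag: "footer", children: ["(c)"] },
  ],
};

const cleaned = stripNonContent(page);
console.log(JSON.stringify(cleaned));
```

Note that a stripped tag takes its whole subtree with it, so text nested inside, for example, a `nav` or `footer` does not reach the extracted content in this sketch.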
