Web content extraction using machine learning

i-timur/learnhtml

 
 

LearnHtml

HTML web content extraction library that relies mostly on DOM features, supplemented by some textual features. It achieves a tag-level F1 score of 0.96 on the Dragnet dataset.
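For readers unfamiliar with the metric, here is a minimal sketch of how an F1 score can be computed from per-tag binary labels (1 = content, 0 = boilerplate). The function name and the toy labels are illustrative assumptions; this is not the project's evaluation code, and the numbers are unrelated to the Dragnet result.

```python
def f1_score(y_true, y_pred):
    """F1 for binary per-tag labels: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # content tags correctly kept
    fp = sum(1 for t, p in pairs if not t and p)    # boilerplate tags wrongly kept
    fn = sum(1 for t, p in pairs if t and not p)    # content tags wrongly dropped
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up toy labels for six tags (hypothetical, for illustration only).
toy_true = [1, 1, 1, 0, 0, 1]
toy_pred = [1, 1, 0, 0, 1, 1]
print(f1_score(toy_true, toy_pred))
```

With these toy labels there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75.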

Getting started

Prerequisites:

  • Python > 3.6 and < 3.8
  • pip 23.0+
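A quick way to check the interpreter before installing is sketched below. It assumes the version bounds are read at major.minor granularity, which leaves only 3.7.x; the helper name `is_supported` is hypothetical and not part of this project.

```python
import sys


def is_supported(version):
    """Return True if (major, minor) lies strictly between 3.6 and 3.8.

    Assumption: the documented bounds apply to major.minor only,
    so only 3.7.x passes.
    """
    major_minor = tuple(version[:2])
    return (3, 6) < major_minor < (3, 8)


if __name__ == "__main__":
    current = sys.version_info
    status = "supported" if is_supported(current) else "outside the documented range"
    print(f"Python {current.major}.{current.minor}.{current.micro}: {status}")
```

If the bounds are meant to admit 3.6 patch releases as well, adjust the comparison accordingly.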

Installation

First, install the dependencies. Binary dependencies:

  • Linux
    sudo apt-get install recode libxml2-dev libxslt1-dev unzip

Python dependencies:

pip install -r requirements.txt

Build the project and install it locally, setting LIBXML2_PATH to the location of your libxml2 installation:

LIBXML2_PATH=<PATH_TO_LIBXML2> pip install -e .
