HTML web content extraction library that relies mostly on DOM features, along with some textual features. It achieves a tag-level F1 score of 0.96
on the Dragnet dataset.
- Python > 3.6 and < 3.8
- pip 23.0+
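The version requirements above can be checked before installing anything. The following is a minimal sketch, not part of this project: the helper names `python_supported` and `pip_supported` are hypothetical, and it assumes the Python bounds are compared on major.minor only, which leaves the 3.7.x series. Relax the lower bound if 3.6.x patch releases are meant to be allowed.

```python
import sys

# Assumption: the bounds "Python > 3.6" and "Python < 3.8" are compared on
# (major, minor), which leaves only the 3.7.x series.
PYTHON_LOWER_EXCLUSIVE = (3, 6)
PYTHON_UPPER_EXCLUSIVE = (3, 8)
PIP_MINIMUM = (23, 0)


def python_supported(version_info):
    """Return True if (major, minor) lies strictly between the two bounds."""
    major_minor = (version_info[0], version_info[1])
    return PYTHON_LOWER_EXCLUSIVE < major_minor < PYTHON_UPPER_EXCLUSIVE


def pip_supported(version_string):
    """Return True if a pip version string such as '23.0.1' is at least 23.0.

    Assumes a plain numeric release string; pre-release suffixes such as
    'rc1' are not handled.
    """
    parts = version_string.split(".")
    major_minor = tuple(int(part) for part in parts[:2])
    return major_minor >= PIP_MINIMUM


# Report on the interpreter that is running this script.
print("Python OK:", python_supported(sys.version_info))
```

To check pip as well, pass the version number reported by `pip --version` to `pip_supported`, for example `pip_supported("23.0.1")`.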
First, install the dependencies. Binary dependencies:
- Linux
sudo apt-get install recode libxml2-dev libxslt1-dev unzip
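After running the command above, it may help to confirm that the command-line tools it installs are actually on `PATH`. This is a rough sketch under stated assumptions: the helper `missing_tools` is hypothetical and not part of this project, and it only looks for the `recode` and `unzip` executables. The `-dev` packages are not checked.

```python
import shutil

# Command-line tools installed by the apt-get line above. The -dev packages
# (libxml2-dev, libxslt1-dev) mainly provide headers and libraries, so they
# are not covered by this check.
REQUIRED_TOOLS = ["recode", "unzip"]


def missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required command-line tools were found.")
```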
Python dependencies:
pip install -r requirements.txt
Build the project and install it locally:
LIBXML2_PATH=<PATH_TO_LIBXML2> pip install -e .