📚 rust-doc-scraper

A fast, idiomatic Rust tool for scraping and converting official Rust documentation sites into Markdown — with automatic attribution headers and offline-friendly output.

Built for maintainers of AI agents, documentation tools, or GPTs that use content from rust-lang.org, docs.rs, or other community-authored Rust books and sites.

🚀 Features

🔍 Scrapes HTML pages from Rust ecosystem documentation sites
📄 Converts to Markdown using customizable rules
🖋️ Injects attribution headers automatically
📂 Outputs Markdown to structured folders
🦀 100% Rust-native, fast and parallelizable

📘 Supported Docs & Books

The following sources are currently scraped:

Doc Name	Source URL
The Rust Book	https://doc.rust-lang.org/book/
Rust by Example	https://doc.rust-lang.org/rust-by-example/
The Cargo Book	https://doc.rust-lang.org/cargo/
The Rustonomicon	https://doc.rust-lang.org/nomicon/
The Async Book	https://rust-lang.github.io/async-book/
The Clippy Book	https://rust-lang.github.io/rust-clippy/current/
Error Index	https://doc.rust-lang.org/error_codes/
Rust API Guidelines	https://rust-lang.github.io/api-guidelines/
The Rust and WebAssembly Book	https://rustwasm.github.io/book/
Tokio Documentation	https://docs.rs/tokio/latest/tokio/
Axum Documentation	https://docs.rs/axum/latest/axum/
Leptos Book	https://book.leptos.dev/
Embedded Rust Book	https://docs.rust-embedded.org/book/
The Little Book of Rust Macros	https://danielkeep.github.io/tlborm/book/
Too Many Linked Lists	https://rust-unofficial.github.io/too-many-lists/

⚙️ Customizing the Docs to Scrape

You can customize which documentation sites the scraper pulls from by editing the source list in:

src/targets.rs

Inside, you'll find a function like:

pub fn get_scrape_targets() -> HashMap<String, String> {
    HashMap::from([
        ("The Rust Programming Language Book".into(), "https://doc.rust-lang.org/book/".into()),
        ("Tokio Documentation".into(), "https://docs.rs/tokio/latest/tokio/".into()),
        // ...
    ])
}

You can:

✅ Add new entries to scrape new Rust documentation sites
❌ Remove entries if you don’t need certain sources
✏️ Rename entries (keys are just used for folder names)

Changes take effect next time you run the scraper.

🛠️ Usage

1. 📦 Build the tool

cargo build --release

2. 🧪 Run the scraper

To scrape all configured sources and output Markdown into output/:

cargo run --release

If you only want to run specific modules, you can comment out others in main.rs.

3. 📁 Output

Markdown files will be saved in folder: ./scraped_docs/
Attribution headers are prepended like:

<!--
Source: The Rust Book - https://doc.rust-lang.org/book/
License: MIT OR Apache-2.0
-->

📜 Attribution & License Info

All scraped content includes source URL and license attribution in each .md or .rs file.
All sources currently use dual MIT or Apache-2.0 licenses.
You can find complete references in ATTRIBUTION.md.

⚙️ Project Structure

src/
├── main.rs             # Entry point
├── scrape.rs           # Web scraping and HTML-to-Markdown logic
├── attribute_md.rs     # Attribution for .md files
├── attribute_rs.rs     # Attribution for .rs files
├── utils.rs            # Helper functions
output/                 # Final Markdown output

🧰 Requirements

Rust 1.72+ (tested)
OpenSSL (for crates using reqwest on some systems)

Linux (Ubuntu/Debian)

sudo apt install pkg-config libssl-dev

🧪 Testing & Linting

cargo fmt     # Format
cargo clippy  # Lint
cargo test    # (Future: Add test suite)

🪪 License

This project is dual-licensed under either:

MIT License (LICENSE-MIT)
Apache License, Version 2.0 (LICENSE-APACHE)

You may choose either license.

Scraped documentation content retains the license of its original source (typically MIT OR Apache-2.0).
See ATTRIBUTION.md for source-specific license references.

🙋 Contributing

PRs welcome — especially for:

New doc sources
Better markdown cleaning
Language-specific scraping (i18n)

Last updated: 2025-05-25

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
src		src
.gitignore		.gitignore
ATTRIBUTION.md		ATTRIBUTION.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE-APACHE		LICENSE-APACHE
LICENSE-MIT		LICENSE-MIT
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Licenses found

Uh oh!

Repository files navigation

📚 rust-doc-scraper

🚀 Features

📘 Supported Docs & Books

⚙️ Customizing the Docs to Scrape

🛠️ Usage

1. 📦 Build the tool

2. 🧪 Run the scraper

3. 📁 Output

📜 Attribution & License Info

⚙️ Project Structure

🧰 Requirements

Linux (Ubuntu/Debian)

🧪 Testing & Linting

🪪 License

🙋 Contributing

About

Licenses found

Uh oh!

Releases

Packages

Languages

License

Licenses found

fewrux/rust-doc-scraper

Folders and files

Latest commit

History

Repository files navigation

📚 rust-doc-scraper

🚀 Features

📘 Supported Docs & Books

⚙️ Customizing the Docs to Scrape

🛠️ Usage

1. 📦 Build the tool

2. 🧪 Run the scraper

3. 📁 Output

📜 Attribution & License Info

⚙️ Project Structure

🧰 Requirements

Linux (Ubuntu/Debian)

🧪 Testing & Linting

🪪 License

🙋 Contributing

About

Topics

Resources

License

Licenses found

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages