diff --git a/README.md b/README.md
index ace1b90..698bb39 100644
--- a/README.md
+++ b/README.md
@@ -1,136 +1,139 @@
 # Elastic Open Web Crawler
 
 This repository contains code for the Elastic Open Web Crawler.
-This is a tool to allow users to easily ingest content into Elasticsearch from the web.
+Open Crawler enables users to easily ingest web content into Elasticsearch.
 
-## How it works
+> [!IMPORTANT]
+> _The Open Crawler is currently in **tech-preview**_.
+> Tech-preview features are subject to change and are not covered by the support SLA of generally available (GA) features.
+> Elastic plans to promote this feature to GA in a future release.
 
-Crawler runs crawl jobs on command based on config files in the `config` directory.
-1 URL endpoint on a site will correlate with 1 result output.
+_Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._
 
-The crawl results can be output in 3 different modes:
+### User workflow
 
-- As docs to an Elasticsearch index
-- As files to a specified directory
-- Directly to the terminal
+Indexing web content with the Open Crawler requires:
 
-### Setup
-
-#### Running from Docker
-
-Crawler has a Dockerfile that can be built and run locally.
-
-1. Build the image `docker build -t crawler-image .`
-2. Run the container `docker run -i -d --name crawler crawler-image`
-  - `-i` allows the container to stay alive so CLI commands can be executed inside it
-  - `-d` allows the container to run "detached" so you don't have to dedicate a terminal window to it
-3. Confirm that Crawler commands are working `docker exec -it crawler bin/crawler version`
-4. Execute other CLI commands from outside of the container by prepending `docker exec -it crawler `.
-  - See [Crawling content](#crawling-content) for examples.
+1. Running an instance of Elasticsearch (on-prem, cloud, or serverless)
+2. Cloning the Open Crawler repository (see [Setup](#setup))
+3. Configuring a crawler config file (see [Configuring crawlers](#configuring-crawlers))
+4. Using the CLI to begin a crawl job (see [CLI commands](#cli-commands))
 
-#### Running from source
+### Execution logic
 
-Crawler uses both JRuby and Java.
-We recommend using version managers for both.
-When developing Crawler we use `rbenv` and `jenv`.
-There are instructions for setting up these env managers here:
+Open Crawler runs crawl jobs on command, based on config files in the `config` directory.
+Each URL endpoint found during a crawl results in one document being indexed into Elasticsearch.
 
-- [Official documentation for installing jenv](https://www.jenv.be/)
-- [Official documentation for installing rbenv](https://github.com/rbenv/rbenv?tab=readme-ov-file#installation)
+Open Crawler performs crawl jobs in a multithreaded environment, where one thread visits one URL endpoint at a time.
+The results of these visits are added to a pool of crawl results.
+Pooled results are indexed into Elasticsearch using the `_bulk` API once the pool reaches a configurable threshold.
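+
+For illustration, a minimal crawl config might look like the sketch below.
+The field names shown are illustrative only; see [CONFIG.md](docs/CONFIG.md) and the example files under `config/` for the authoritative schema.
+
+```yaml
+# config/my-crawler.yml (illustrative sketch, not a complete reference)
+domain_allowlist:            # domains the crawler is allowed to visit
+  - https://www.elastic.co
+output_sink: elasticsearch   # send crawl results to Elasticsearch via the _bulk API
+output_index: my-crawler-index-name
+```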
 
-Go to the root of the Crawler directory and check the expected Java and Ruby versions are being used:
+### Setup
 
-```bash
-# should output the same version as `.ruby-version`
-$ ruby --version
+#### Prerequisites
 
-# should output the same version as `.java-version`
-$ java --version
-```
+Open Crawler requires a running instance of Elasticsearch to index documents into.
+If you don't have this set up yet, you can sign up for an [Elastic Cloud free trial](https://www.elastic.co/cloud/cloud-trial-overview) or check out the [quickstart guide for Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/master/quickstart.html).
 
-If the versions seem correct, you can install dependencies:
+#### Connecting to Elasticsearch
 
-```bash
-$ make install
-```
+Open Crawler will attempt to use the `_bulk` API to index crawl results into Elasticsearch.
+To facilitate this connection, Open Crawler needs either an API key or a username/password configured for access to the Elasticsearch instance.
+If using an API key, ensure that the API key has read and write permissions for the index configured in `output_index`.
 
-You can also use the env variable `CRAWLER_MANAGE_ENV` to have the install script automatically check whether `rbenv` and `jenv` are installed, and that the correct versions are running on both:
-Doing this requires that you use both `rbenv` and `jenv` in your local setup.
+- See the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/security-api-create-api-key.html) for more details on managing API keys
+- See the [elasticsearch.yml.example](config/elasticsearch.yml.example) file for all of the available Elasticsearch configurations for Open Crawler
 
-```bash
-$ CRAWLER_MANAGE_ENV=true make install
-```
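+
+For illustration, the Elasticsearch connection settings might look like the sketch below.
+The field names are illustrative; [elasticsearch.yml.example](config/elasticsearch.yml.example) is the authoritative reference.
+
+```yaml
+# Illustrative sketch of Elasticsearch connection settings
+elasticsearch:
+  host: http://localhost
+  port: 9200
+  api_key: <encoded API key>   # either an API key...
+  # username: elastic          # ...or a username/password pair
+  # password: changeme
+```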
+
+<details>
+  <summary>Creating an API key</summary>
+
+  Here is an example of creating an API key with minimal permissions for Open Crawler.
+  This call will return a JSON response with an `encoded` key.
+  The value of `encoded` is what Open Crawler can use in its configuration.
+
+  ```bash
+  POST /_security/api_key
+  {
+    "name": "my-api-key",
+    "role_descriptors": {
+      "my-crawler-role": {
+        "cluster": ["all"],
+        "indices": [
+          {
+            "names": ["my-crawler-index-name"],
+            "privileges": ["all"]
+          }
+        ]
+      }
+    },
+    "metadata": {
+      "application": "my-crawler"
+    }
+  }
+  ```
+</details>
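+
+To sanity-check a new key before configuring Open Crawler, you can call Elasticsearch directly with it.
+The request below is illustrative; adjust the host to match your deployment.
+
+```bash
+# Verify the API key is accepted by your Elasticsearch instance
+curl -H "Authorization: ApiKey <encoded value from the response above>" \
+  "https://localhost:9200/"
+```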
 
-Crawler should now be functional.
-See [Configuring Crawlers](#configuring-crawlers) to begin crawling web content.
-
-### Configuring Crawlers
-
-See [CONFIG.md](docs/CONFIG.md) for in-depth details on Crawler configuration files.
+#### Running Open Crawler from Docker
 
-Once you have a Crawler configured, you can validate the domain(s) using the CLI.
+Open Crawler has a Dockerfile that can be built and run locally.
 
-```bash
-$ bin/crawler validate config/my-crawler.yml
-```
+1. Clone the repository: `git clone https://github.com/elastic/crawler.git`
+2. Build the image: `docker build -t crawler-image .`
+3. Run the container: `docker run -i -d --name crawler crawler-image`
+    - `-i` allows the container to stay alive so CLI commands can be executed inside it
+    - `-d` allows the container to run "detached" so you don't have to dedicate a terminal window to it
+4. Confirm that CLI commands are working: `docker exec -it crawler bin/crawler version`
+    - Execute other CLI commands from outside of the container by prepending `docker exec -it crawler `
+5. Create a config file for your crawler, then copy it into the container and run a crawl as sketched below.
 
 See [Configuring crawlers](#configuring-crawlers) for next steps.
 
-If you are running from docker, you will first need to copy the config file into the docker container.
-
-```bash
-# copy file (if you haven't already done so)
-$ docker cp /path/to/my-crawler.yml crawler:config/my-crawler.yml
-
-# run
-$ docker exec -it crawler bin/crawler validate config/my-crawler.yml
-```
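+
+For example, once you have a config file on your host machine, a typical flow looks like this (the file paths are illustrative):
+
+```bash
+# copy the config file into the container
+$ docker cp config/my-crawler.yml crawler:config/my-crawler.yml
+
+# validate the configured domain(s), then run the crawl
+$ docker exec -it crawler bin/crawler validate config/my-crawler.yml
+$ docker exec -it crawler bin/crawler crawl config/my-crawler.yml
+```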
 
+#### Running Open Crawler from source
 
+> [!TIP]
+> We recommend running from source only if you are actively developing Open Crawler.
 
+<details>
+  <summary>Instructions for running from source</summary>
+
+  ℹ️ Open Crawler uses both JRuby and Java.
+  We recommend using version managers for both.
+  When developing Open Crawler we use `rbenv` and `jenv`.
+  There are instructions for setting up these env managers here:
 
-See [Crawling content](#crawling-content).
 
+  - [Official documentation for installing jenv](https://www.jenv.be/)
+  - [Official documentation for installing rbenv](https://github.com/rbenv/rbenv?tab=readme-ov-file#installation)
 
-### Crawling content
 
+  1. Clone the repository: `git clone https://github.com/elastic/crawler.git`
+  2. Go to the root of the Open Crawler directory and check that the expected Java and Ruby versions are being used:
+  ```bash
+  # should output the same version as `.ruby-version`
+  $ ruby --version
 
-Use the following command to run a crawl based on the configuration provided.
 
+  # should output the same version as `.java-version`
+  $ java --version
+  ```
 
-```bash
-$ bin/crawler crawl config/my-crawler.yml
-```
 
+  3. If the versions seem correct, you can install dependencies:
+  ```bash
+  $ make install
+  ```
 
-And from Docker.
 
+  You can also use the env variable `CRAWLER_MANAGE_ENV` to have the install script automatically check whether `rbenv` and `jenv` are installed, and that the correct versions are running on both.
+  Doing this requires that you use both `rbenv` and `jenv` in your local setup.
 
-```bash
-$ docker exec -it crawler bin/crawler crawl config/my-crawler.yml
-```
 
+  ```bash
+  $ CRAWLER_MANAGE_ENV=true make install
+  ```
+
+</details>
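+
+Once dependencies are installed, you can verify the build and try a crawl against one of the bundled example configurations (the example path below is the one used in [CLI.md](docs/CLI.md)):
+
+```bash
+# confirm the CLI is functional
+$ bin/crawler version
+
+# run a crawl using a bundled example config
+$ bin/crawler crawl config/examples/parks-australia.yml
+```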
 
-### Connecting to Elasticsearch
+### Configuring Crawlers
 
-If you set the `output_sink` value to `elasticsearch`, Crawler will attempt to bulk index crawl results into Elasticsearch.
-To facilitate this connection, Crawler needs to have either an API key or a username/password configured to access the Elasticsearch instance.
-If using an API key, ensure that the API key has read and write permissions to access the index configured in `output_index`.
+See [CONFIG.md](docs/CONFIG.md) for in-depth details on Open Crawler configuration files.
 
-- [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/security-api-create-api-key.html) for managing API keys for more details
-- [elasticsearch.yml.example](config/elasticsearch.yml.example) file for all of the available Elasticsearch configurations for Crawler
+### CLI Commands
 
-Here is an example of creating an API key with minimal permissions for Crawler.
-This will return a JSON with an `encoded` key.
-The value of `encoded` is what Crawler can use in its configuration.
-
-```bash
-POST /_security/api_key
-{
-  "name": "my-api-key",
-  "role_descriptors": {
-    "my-crawler-role": {
-      "cluster": ["all"],
-      "indices": [
-        {
-          "names": ["my-crawler-index-name"],
-          "privileges": ["all"]
-        }
-      ]
-    }
-  },
-  "metadata": {
-    "application": "my-crawler"
-  }
-}
-```
+Open Crawler does not have a graphical user interface.
+All interactions with Open Crawler take place through the CLI.
+When given a command, Open Crawler will run until the process is finished.
+Open Crawler is not kept alive in any way between commands.
 
+See [CLI.md](docs/CLI.md) for a full list of CLI commands available for Open Crawler.
diff --git a/docs/CLI.md b/docs/CLI.md
new file mode 100644
index 0000000..b1f84c9
--- /dev/null
+++ b/docs/CLI.md
@@ -0,0 +1,99 @@
+# CLI
+
+Crawler CLI is a command-line interface for use in the terminal or scripts.
+This is the only user interface for interacting with Crawler.
+
+## Installation and Configuration
+
+Ensure you complete the [setup](../README.md#setup) before using the CLI.
+
+For instructions on configuring a Crawler, see [CONFIG.md](./CONFIG.md).
+
+### CLI in Docker
+
+If you are running a dockerized version of Crawler, you can run CLI commands in two ways:
+
+1. Exec into the docker container and execute commands directly using `docker exec -it <container name> bash`
+    - This requires no changes to CLI commands
+    ```bash
+    # exec into container
+    $ docker exec -it crawler bash
+
+    # move to crawler directory
+    $ cd crawler
+
+    # execute commands
+    $ bin/crawler version
+    ```
+2. Execute commands externally using `docker exec -it <container name> <command>`
+    ```bash
+    # execute command directly without entering docker container
+    $ docker exec -it crawler bin/crawler version
+    ```
+
+## Available commands
+
+### Getting help
+
+Use the `--help` or `-h` option with any command to get more information.
+
+For example:
+```bash
+$ bin/crawler --help
+
+> Commands:
+>   crawler crawl CRAWL_CONFIG       # Run a crawl of the site
+>   crawler validate CRAWL_CONFIG    # Validate crawler configuration
+>   crawler version                  # Print version
+```
+
+### Commands
+
+- [`crawler crawl`](#crawler-crawl)
+- [`crawler validate`](#crawler-validate)
+- [`crawler version`](#crawler-version)
+
+#### `crawler crawl`
+
+Crawls the configured domain(s) in the provided config file.
+Can optionally take a second configuration file for Elasticsearch settings.
+See [CONFIG.md](./CONFIG.md) for details on the configuration files.
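+
+The optional second file holds the Elasticsearch connection settings.
+A minimal sketch is shown below; the field names are illustrative, and [elasticsearch.yml.example](../config/elasticsearch.yml.example) is the authoritative reference.
+
+```yaml
+# config/es.yml (illustrative sketch of an --es-config file)
+elasticsearch:
+  host: http://localhost
+  port: 9200
+  api_key: <encoded API key>
+```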
+
+```bash
+# crawl using only crawler config
+$ bin/crawler crawl config/examples/parks-australia.yml
+```
+
+```bash
+# crawl using crawler config and optional --es-config
+$ bin/crawler crawl config/examples/parks-australia.yml --es-config=config/es.yml
+```
+
+#### `crawler validate`
+
+Checks the configured domains in `domain_allowlist` to see if they can be crawled.
+
+```bash
+# when valid
+$ bin/crawler validate path/to/crawler.yml
+
+> Domain https://www.elastic.co is valid
+```
+
+```bash
+# when invalid (e.g. has a redirect)
+$ bin/crawler validate path/to/invalid-crawler.yml
+
+> Domain https://elastic.co is invalid:
+> The web server at https://elastic.co redirected us to a different domain URL (https://www.elastic.co/).
+> If you want to crawl this site, please configure https://www.elastic.co as one of the domains.
+```
+
+#### `crawler version`
+
+Prints the product version of Open Crawler.
+
+```bash
+$ bin/crawler version
+
+> v0.2.0
+```