elastic · navarone-feekery · Jun 7, 2024
@@ -1,136 +1,139 @@
 # Elastic Open Web Crawler
 
 This repository contains code for the Elastic Open Web Crawler.
-This is a tool to allow users to easily ingest content into Elasticsearch from the web.
+Open Crawler enables users to easily ingest web content into Elasticsearch.
 
-## How it works
+> [!IMPORTANT]
+> _The Open Crawler is currently in **tech-preview**_.
+> Tech-preview features are subject to change and are not covered by the support SLA of generally available (GA) features.
+> Elastic plans to promote this feature to GA in a future release.
 
-Crawler runs crawl jobs on command based on config files in the `config` directory.
-1 URL endpoint on a site will correlate with 1 result output.
+_Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._
 
-The crawl results can be output in 3 different modes:
+### User workflow
 
-- As docs to an Elasticsearch index
-- As files to a specified directory
-- Directly to the terminal
+Indexing web content with the Open Crawler requires:
 
-### Setup
-
-#### Running from Docker
-
-Crawler has a Dockerfile that can be built and run locally.
-
-1. Build the image `docker build -t crawler-image .`
-2. Run the container `docker run -i -d --name crawler crawler-image`
-   - `-i` allows the container to stay alive so CLI commands can be executed inside it
-   - `-d` allows the container to run "detached" so you don't have to dedicate a terminal window to it
-3. Confirm that Crawler commands are working `docker exec -it crawler bin/crawler version`
-4. Execute other CLI commands from outside of the container by prepending `docker exec -it crawler <command>`.
-   - See [Crawling content](#crawling-content) for examples.
+1. Running an instance of Elasticsearch (on-prem, cloud, or serverless)
+2. Cloning the Open Crawler repository (see [Setup](#setup))
+3. Configuring a crawler config file (see [Configuring crawlers](#configuring-crawlers))
+4. Using the CLI to begin a crawl job (see [CLI commands](#cli-commands))
 
-#### Running from source
+### Execution logic
 
-Crawler uses both JRuby and Java.
-We recommend using version managers for both.
-When developing Crawler we use `rbenv` and `jenv`.
-There are instructions for setting up these env managers here:
+Open Crawler runs crawl jobs on command based on config files in the `config` directory.
+Each URL endpoint found during the crawl results in one document being indexed into Elasticsearch.
 
-- [Official documentation for installing jenv](https://www.jenv.be/)
-- [Official documentation for installing rbenv](https://github.com/rbenv/rbenv?tab=readme-ov-file#installation)
+Open Crawler performs crawl jobs in a multithreaded environment, in which each URL endpoint is visited by a single thread.
+The crawl results from these threads are added to a pool of results.
+The pooled results are indexed into Elasticsearch using the `_bulk` API once the pool reaches a configurable threshold.
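
The pooling behaviour described above can be pictured with a small shell sketch. It is not Open Crawler's implementation: the names `POOL_THRESHOLD`, `add_result`, and `flush_pool` are hypothetical, threading is left out entirely, and instead of sending anything the sketch only prints the newline-delimited body that a `_bulk` request expects.

```shell
# Illustrative only: models "pool results, flush at a threshold" with shell variables.
# POOL_THRESHOLD, add_result and flush_pool are hypothetical names, not crawler settings.

POOL_THRESHOLD=2                      # assumed value for the example
INDEX_NAME="my-crawler-index-name"    # placeholder index name
pool=""                               # buffered _bulk lines waiting to be sent
pool_size=0                           # number of documents currently buffered
flushed_batches=0                     # number of bulk bodies emitted so far
nl='
'

# Print the buffered bulk body (if any) and reset the pool.
flush_pool() {
  if [ "$pool_size" -eq 0 ]; then
    return 0
  fi
  printf '%s' "$pool"
  pool=""
  pool_size=0
  flushed_batches=$((flushed_batches + 1))
}

# Buffer one crawl result (a JSON document on a single line).
# Each document becomes two lines: an action line, then the document source.
add_result() {
  action=$(printf '{"index":{"_index":"%s"}}' "$INDEX_NAME")
  pool="${pool}${action}${nl}$1${nl}"
  pool_size=$((pool_size + 1))
  if [ "$pool_size" -ge "$POOL_THRESHOLD" ]; then
    flush_pool
  fi
}
```

With the threshold set to 2, the first `add_result` call prints nothing, and the second prints a four-line bulk body (two action lines and two documents) before the pool is emptied.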
 
-Go to the root of the Crawler directory and check the expected Java and Ruby versions are being used:
+### Setup
 
-```bash
-# should output the same version as `.ruby-version`
-$ ruby --version
+#### Prerequisites
 
-# should output the same version as `.java-version`
-$ java --version
-```
+Open Crawler indexes documents into Elasticsearch, so a running Elasticsearch instance is required.
+If you don't have this set up yet, you can sign up for an [Elastic Cloud free trial](https://www.elastic.co/cloud/cloud-trial-overview) or check out the [quickstart guide for Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/master/quickstart.html).
 
-If the versions seem correct, you can install dependencies:
+#### Connecting to Elasticsearch
 
-```bash
-$ make install
-```
+Open Crawler will attempt to use the `_bulk` API to index crawl results into Elasticsearch.
+To facilitate this connection, Open Crawler needs to have either an API key or a username/password configured to access the Elasticsearch instance.
+If using an API key, ensure that the API key has read and write permissions to access the index configured in `output_index`.
 
-You can also use the env variable `CRAWLER_MANAGE_ENV` to have the install script automatically check whether `rbenv` and `jenv` are installed, and that the correct versions are running on both:
-Doing this requires that you use both `rbenv` and `jenv` in your local setup.
+- See the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/security-api-create-api-key.html) for more details on managing API keys
+- See the [elasticsearch.yml.example](config/elasticsearch.yml.example) file for all of the available Elasticsearch configurations for Open Crawler
 
-```bash
-$ CRAWLER_MANAGE_ENV=true make install
-```
+<details>
+  <summary>Creating an API key</summary>
+  Here is an example of creating an API key for Open Crawler whose index privileges are limited to a single index.
+  The response is a JSON object containing an `encoded` field.
+  The value of `encoded` is what Open Crawler can use in its configuration.
+
+  ```bash
+  POST /_security/api_key
+  {
+    "name": "my-api-key",
+    "role_descriptors": { 
+      "my-crawler-role": {
+        "cluster": ["all"],
+        "indices": [
+          {
+            "names": ["my-crawler-index-name"],
+            "privileges": ["all"]
+          }
+        ]
+      }
+    },
+    "metadata": {
+      "application": "my-crawler"
+    }
+  }
+  ```
+</details>
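
Once you have the `encoded` value, one way to confirm that the key is accepted is to call the Elasticsearch `_security/_authenticate` endpoint with an `Authorization: ApiKey` header. The sketch below assumes `curl` is installed. `ES_HOST`, `ES_API_KEY`, `auth_header`, and `check_api_key` are placeholder names chosen for illustration, not Open Crawler settings.

```shell
# Placeholders: point these at your own cluster and key before running.
ES_HOST="${ES_HOST:-http://localhost:9200}"
ES_API_KEY="${ES_API_KEY:-replace-with-encoded-value}"

# Build the header Elasticsearch expects for API key authentication.
auth_header() {
  printf 'Authorization: ApiKey %s\n' "$ES_API_KEY"
}

# Ask Elasticsearch who the key authenticates as.
# --fail makes curl exit non-zero on an HTTP error such as 401.
check_api_key() {
  curl --silent --show-error --fail \
    --header "$(auth_header)" \
    "${ES_HOST}/_security/_authenticate"
}
```

Running `check_api_key` against a reachable cluster should print a JSON description of the authenticated key. A non-zero exit status suggests the key or host is wrong.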
 
-Crawler should now be functional.
-See [Configuring Crawlers](#configuring-crawlers) to begin crawling web content.
 
-### Configuring Crawlers
 
-See [CONFIG.md](docs/CONFIG.md) for in-depth details on Crawler configuration files.
+#### Running Open Crawler from Docker
 
-Once you have a Crawler configured, you can validate the domain(s) using the CLI.
+Open Crawler has a Dockerfile that can be built and run locally.
 
-```bash
-$ bin/crawler validate config/my-crawler.yml
-```
+1. Clone the repository: `git clone https://github.com/elastic/crawler.git`
+2. Build the image: `docker build -t crawler-image .`
+3. Run the container: `docker run -i -d --name crawler crawler-image`
+   - `-i` allows the container to stay alive so CLI commands can be executed inside it
+   - `-d` allows the container to run "detached" so you don't have to dedicate a terminal window to it
+4. Confirm that CLI commands are working: `docker exec -it crawler bin/crawler version`
+   - Execute other CLI commands from outside of the container by prepending `docker exec -it crawler <command>`
+5. Create a config file for your crawler. See [Configuring crawlers](#configuring-crawlers) for next steps.
 
-If you are running from docker, you will first need to copy the config file into the docker container.
+#### Running Open Crawler from source
 
-```bash
-# copy file (if you haven't already done so)
-$ docker cp /path/to/my-crawler.yml crawler:config/my-crawler.yml
+> [!TIP]
+> We recommend running from source only if you are actively developing Open Crawler.
 
-# run 
-$ docker exec -it crawler bin/crawler validate config/my-crawler.yml
-```
+<details>
+  <summary>Instructions for running from source</summary>
+  ℹ️ Open Crawler uses both JRuby and Java.
+  We recommend using version managers for both.
+  When developing Open Crawler we use <b>rbenv</b> and <b>jenv</b>.
+  There are instructions for setting up these env managers here:
 
-See [Crawling content](#crawling-content).
+  - [Official documentation for installing jenv](https://www.jenv.be/)
+  - [Official documentation for installing rbenv](https://github.com/rbenv/rbenv?tab=readme-ov-file#installation)
 
-### Crawling content
+  1. Clone the repository: `git clone https://github.com/elastic/crawler.git`
+  2. Go to the root of the Open Crawler directory and check that the expected Java and Ruby versions are being used:
+      ```bash
+      # should output the same version as `.ruby-version`
+      $ ruby --version
 
-Use the following command to run a crawl based on the configuration provided.
+      # should output the same version as `.java-version`
+      $ java --version
+      ```
 
-```bash
-$ bin/crawler crawl config/my-crawler.yml
-```
+  3. If the versions seem correct, you can install dependencies:
+      ```bash
+      $ make install
+      ```
 
-And from Docker.
+     You can also set the env variable `CRAWLER_MANAGE_ENV` to have the install script automatically check whether `rbenv` and `jenv` are installed, and that the correct versions are running on both.
+     Doing this requires that you use both `rbenv` and `jenv` in your local setup:
 
-```bash
-$ docker exec -it crawler bin/crawler crawl config/my-crawler.yml
-```
+      ```bash
+      $ CRAWLER_MANAGE_ENV=true make install
+      ```
+</details>
 
-### Connecting to Elasticsearch
+### Configuring Crawlers
 
-If you set the `output_sink` value to `elasticsearch`, Crawler will attempt to bulk index crawl results into Elasticsearch.
-To facilitate this connection, Crawler needs to have either an API key or a username/password configured to access the Elasticsearch instance.
-If using an API key, ensure that the API key has read and write permissions to access the index configured in `output_index`.
+See [CONFIG.md](docs/CONFIG.md) for in-depth details on Open Crawler configuration files.
 
-- [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/security-api-create-api-key.html) for managing API keys for more details
-- [elasticsearch.yml.example](config/elasticsearch.yml.example) file for all of the available Elasticsearch configurations for Crawler
+### CLI Commands
 
-Here is an example of creating an API key with minimal permissions for Crawler.
-This will return a JSON with an `encoded` key.
-The value of `encoded` is what Crawler can use in its configuration. 
-
-```bash
-POST /_security/api_key
-{
-  "name": "my-api-key",
-  "role_descriptors": { 
-    "my-crawler-role": {
-      "cluster": ["all"],
-      "indices": [
-        {
-          "names": ["my-crawler-index-name"],
-          "privileges": ["all"]
-        }
-      ]
-    }
-  },
-  "metadata": {
-    "application": "my-crawler"
-  }
-}
+Open Crawler does not have a graphical user interface.
+All interactions with Open Crawler take place through the CLI.
+When given a command, Open Crawler will run until the process is finished.
+Open Crawler is not kept alive in any way between commands.
 
-```
+See [CLI.md](docs/CLI.md) for a full list of CLI commands available for Crawler.
@@ -0,0 +1,99 @@
+# CLI
+
+Crawler CLI is a command-line interface for use in the terminal or scripts.
+This is the only user interface for interacting with Crawler.
+
+## Installation and Configuration
+
+Ensure you complete the [setup](../README.md#setup) before using the CLI.
+
+For instructions on configuring a Crawler, see [CONFIG.md](./CONFIG.md).
+
+### CLI in Docker
+
+If you are running a dockerized version of Crawler, you can run CLI commands in two ways:
+
+1. Exec into the docker container and execute commands directly using `docker exec -it <container name> bash`
+    - This requires no changes to CLI commands
+    ```bash
+    # exec into container
+    $ docker exec -it crawler bash
+
+    # move to crawler directory
+    $ cd crawler
+
+    # execute commands
+    $ bin/crawler version
+    ```
+2. Execute commands externally using `docker exec -it <container name> <command>`
+    ```bash
+    # execute command directly without entering docker container
+    $ docker exec -it crawler bin/crawler version
+    ```
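
If you use the second approach often, a small wrapper function can save typing. This is only a convenience sketch: `dcrawler`, `DOCKER`, and `CRAWLER_CONTAINER` are hypothetical names, and the container name defaults to `crawler` as in the examples above.

```shell
# Hypothetical helper: forwards its arguments to bin/crawler inside the container.
DOCKER="${DOCKER:-docker}"                          # docker binary to use
CRAWLER_CONTAINER="${CRAWLER_CONTAINER:-crawler}"   # name given to `docker run --name`

dcrawler() {
  "$DOCKER" exec -it "$CRAWLER_CONTAINER" bin/crawler "$@"
}
```

With this defined, `dcrawler version` runs `docker exec -it crawler bin/crawler version`.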
+
+## Available commands
+### Getting help
+Use the `--help` or `-h` option with any command to get more information.
+
+For example:
+```bash
+$ bin/crawler --help
+
+> Commands:
+>   crawler crawl CRAWL_CONFIG                   # Run a crawl of the site
+>   crawler validate CRAWL_CONFIG                # Validate crawler configuration
+>   crawler version                              # Print version
+```
+
+### Commands
+
+
+- [`crawler crawl`](#crawler-crawl)
+- [`crawler validate`](#crawler-validate)
+- [`crawler version`](#crawler-version)
+
+#### `crawler crawl`
+
+Crawls the domains configured in the provided config file.
+Can optionally take a second configuration file for Elasticsearch settings.
+See [CONFIG.md](./CONFIG.md) for details on the configuration files.
+
+```bash
+# crawl using only crawler config
+$ bin/crawler crawl config/examples/parks-australia.yml
+```
+
+```bash
+# crawl using crawler config and optional --es-config
+$ bin/crawler crawl config/examples/parks-australia.yml --es-config=config/es.yml
+```
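
In scripts, the optional second file can be handled with a small helper that adds `--es-config` only when a path is supplied. This is a sketch, not part of the CLI: `run_crawl` and `CRAWLER_BIN` are hypothetical names.

```shell
# Hypothetical helper around `bin/crawler crawl`.
CRAWLER_BIN="${CRAWLER_BIN:-bin/crawler}"

# $1: crawler config file (required)
# $2: Elasticsearch config file (optional)
run_crawl() {
  if [ -n "${2:-}" ]; then
    "$CRAWLER_BIN" crawl "$1" "--es-config=$2"
  else
    "$CRAWLER_BIN" crawl "$1"
  fi
}
```

For example, `run_crawl config/examples/parks-australia.yml config/es.yml` runs the second command shown above, while omitting the second argument runs the first.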
+
+#### `crawler validate`
+
+Checks the configured domains in `domain_allowlist` to see if they can be crawled.
+
+```bash
+# when valid
+$ bin/crawler validate path/to/crawler.yml
+
+> Domain https://www.elastic.co is valid
+```
+
+```bash
+# when invalid (e.g. has a redirect)
+$ bin/crawler validate path/to/invalid-crawler.yml
+
+> Domain https://elastic.co is invalid:
+> The web server at https://elastic.co redirected us to a different domain URL (https://www.elastic.co/).
+> If you want to crawl this site, please configure https://www.elastic.co as one of the domains.
+```
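
A script can run `validate` before `crawl` so that a misconfigured domain is caught early. The sketch below assumes that `crawler validate` exits with a non-zero status when a domain is invalid, which this document does not confirm. `validate_then_crawl` and `CRAWLER_BIN` are hypothetical names.

```shell
# Hypothetical helper: crawl only if validation succeeds.
# Assumes `validate` signals an invalid domain through its exit status.
CRAWLER_BIN="${CRAWLER_BIN:-bin/crawler}"

validate_then_crawl() {
  if "$CRAWLER_BIN" validate "$1"; then
    "$CRAWLER_BIN" crawl "$1"
  else
    echo "Validation failed for $1; skipping crawl" >&2
    return 1
  fi
}
```

If the exit-status assumption does not hold for your version, check the output of `validate` for the word `invalid` instead.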
+
+#### `crawler version`
+
+Prints the product version of Crawler.
+
+```bash
+$ bin/crawler version
+
+> v0.2.0
+```