elastic · navarone-feekery · Jun 7, 2024
@@ -1,104 +1,99 @@
 # Elastic Open Web Crawler
 
 This repository contains code for the Elastic Open Web Crawler.
-This is a tool to allow users to easily ingest content into Elasticsearch from the web.
+The crawler enables users to easily ingest web content into Elasticsearch.
 
-## How it works
+⚠️ _The Open Crawler is currently in **tech-preview**_.
+Tech-preview features are subject to change and are not covered by the support SLA of generally available (GA) features.
+Elastic plans to promote this feature to GA in a future release.
+
+_Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._
+
+### How it works
 
 Crawler runs crawl jobs on command based on config files in the `config` directory.
-1 URL endpoint on a site will correlate with 1 result output.
+Each URL endpoint found during the crawl results in one document indexed into Elasticsearch.
 
-The crawl results can be output in 3 different modes:
+Crawler runs crawl jobs in a multithreaded environment, where each thread visits one URL endpoint.
+The results of these visits are added to a pool.
+Once the pool reaches a configurable threshold, its contents are indexed into Elasticsearch using the `_bulk` API.
 
-- As docs to an Elasticsearch index
-- As files to a specified directory
-- Directly to the terminal
+The full process from setup to indexing requires:
+
+1. Running an instance of Elasticsearch (on-prem, cloud, or serverless)
+2. Cloning of the Open Crawler repository (see [Setup](#setup))
+3. Configuring a crawler config file (see [Configuring crawlers](#configuring-crawlers))
+4. Using the CLI to begin a crawl job (see [CLI commands](#cli-commands))
 
 ### Setup
 
+#### Prerequisites
+
+Crawler needs a running instance of Elasticsearch to index documents into.
+If you don't have this set up yet, you can sign up for an [Elastic Cloud free trial](https://www.elastic.co/cloud/cloud-trial-overview) or check out the [quickstart guide for Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/master/quickstart.html).
+
 #### Running from Docker
 
 Crawler has a Dockerfile that can be built and run locally.
 
-1. Build the image `docker build -t crawler-image .`
-2. Run the container `docker run -i -d --name crawler crawler-image`
+1. Clone the repository
+2. Build the image `docker build -t crawler-image .`
+3. Run the container `docker run -i -d --name crawler crawler-image`
    - `-i` allows the container to stay alive so CLI commands can be executed inside it
    - `-d` allows the container to run "detached" so you don't have to dedicate a terminal window to it
-3. Confirm that Crawler commands are working `docker exec -it crawler bin/crawler version`
-4. Execute other CLI commands from outside of the container by prepending `docker exec -it crawler <command>`.
-   - See [Crawling content](#crawling-content) for examples.
+4. Confirm that Crawler commands are working `docker exec -it crawler bin/crawler version`
+5. Execute other CLI commands from outside of the container by prepending `docker exec -it crawler <command>`.
+6. See [Configuring crawlers](#configuring-crawlers) for next steps.
 
 #### Running from source
 
-Crawler uses both JRuby and Java.
-We recommend using version managers for both.
-When developing Crawler we use `rbenv` and `jenv`.
-There are instructions for setting up these env managers here:
+To avoid complications caused by different operating systems and by managing Ruby and Java versions, we recommend running from source only if you are actively developing Open Crawler.
 
-- [Official documentation for installing jenv](https://www.jenv.be/)
-- [Official documentation for installing rbenv](https://github.com/rbenv/rbenv?tab=readme-ov-file#installation)
+<details>
+  <summary>Instructions for running from source</summary>
+  ℹ️ Crawler uses both JRuby and Java.
+  We recommend using version managers for both.
+  When developing Crawler we use <b>rbenv</b> and <b>jenv</b>.
+  There are instructions for setting up these env managers here:
 
-Go to the root of the Crawler directory and check the expected Java and Ruby versions are being used:
+  - [Official documentation for installing jenv](https://www.jenv.be/)
+  - [Official documentation for installing rbenv](https://github.com/rbenv/rbenv?tab=readme-ov-file#installation)
 
-```bash
-# should output the same version as `.ruby-version`
-$ ruby --version
+  1. Clone the repository
+  2. Go to the root of the Crawler directory and check the expected Java and Ruby versions are being used:
+      ```bash
+      # should output the same version as `.ruby-version`
+      $ ruby --version
 
-# should output the same version as `.java-version`
-$ java --version
-```
+      # should output the same version as `.java-version`
+      $ java --version
+      ```
 
-If the versions seem correct, you can install dependencies:
+  3. If the versions seem correct, you can install dependencies:
+      ```bash
+      $ make install
+      ```
 
-```bash
-$ make install
-```
-
-You can also use the env variable `CRAWLER_MANAGE_ENV` to have the install script automatically check whether `rbenv` and `jenv` are installed, and that the correct versions are running on both:
-Doing this requires that you use both `rbenv` and `jenv` in your local setup.
+     You can also use the env variable `CRAWLER_MANAGE_ENV` to have the install script automatically check whether `rbenv` and `jenv` are installed, and that the correct versions are running on both.
+     Doing this requires that you use both `rbenv` and `jenv` in your local setup:
 
-```bash
-$ CRAWLER_MANAGE_ENV=true make install
-```
-
-Crawler should now be functional.
-See [Configuring Crawlers](#configuring-crawlers) to begin crawling web content.
+      ```bash
+      $ CRAWLER_MANAGE_ENV=true make install
+      ```
+</details>
 
 ### Configuring Crawlers
 
 See [CONFIG.md](docs/CONFIG.md) for in-depth details on Crawler configuration files.
 
-Once you have a Crawler configured, you can validate the domain(s) using the CLI.
-
-```bash
-$ bin/crawler validate config/my-crawler.yml
-```
-
-If you are running from docker, you will first need to copy the config file into the docker container.
-
-```bash
-# copy file (if you haven't already done so)
-$ docker cp /path/to/my-crawler.yml crawler:config/my-crawler.yml
-
-# run 
-$ docker exec -it crawler bin/crawler validate config/my-crawler.yml
-```
-
-See [Crawling content](#crawling-content).
+### CLI Commands
 
-### Crawling content
+Open Crawler has no UI.
+All interactions with Crawler take place through the CLI.
+When given a command, Crawler will run until the process is finished.
+Crawler is not kept alive in any way between commands.
 
-Use the following command to run a crawl based on the configuration provided.
-
-```bash
-$ bin/crawler crawl config/my-crawler.yml
-```
-
-And from Docker.
-
-```bash
-$ docker exec -it crawler bin/crawler crawl config/my-crawler.yml
-```
+See [CLI.md](docs/CLI.md) for a full list of CLI commands available for Crawler.
 
 ### Connecting to Elasticsearch
 
@@ -132,5 +127,4 @@ POST /_security/api_key
     "application": "my-crawler"
   }
 }
-
 ```
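The Docker setup steps in the README changes above can be chained into one script. The following is a minimal sketch, not part of the repository: the `run` helper, the `DRY_RUN` switch, the `setup_crawler` function, and the `CONFIG` variable are hypothetical names introduced for illustration. The `docker` and `bin/crawler` commands are the ones quoted in the README and CLI docs; that the example config `config/examples/parks-australia.yml` is present inside the image is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: chains the Docker setup commands quoted in the README.
# `run`, DRY_RUN, CONFIG, and setup_crawler are hypothetical, not part of Crawler.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them
# Assumption: this example config exists inside the container image.
CONFIG="${CONFIG:-config/examples/parks-australia.yml}"

# Print the command; execute it only when DRY_RUN is 0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

setup_crawler() {
  run docker build -t crawler-image .
  run docker run -i -d --name crawler crawler-image
  run docker exec -it crawler bin/crawler version
  run docker exec -it crawler bin/crawler validate "$CONFIG"
}

setup_crawler
```

By default the script only prints the commands. Running it with `DRY_RUN=0` from the repository root executes them; because the commands keep the README's `-it` flags, this should be done from an interactive terminal.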
@@ -0,0 +1,99 @@
+# CLI
+
+Crawler CLI is a command-line interface for use in the terminal or scripts.
+This is the only user interface for interacting with Crawler.
+
+## Installation and Configuration
+
+Ensure you complete the [setup](../README.md#setup) before using the CLI.
+
+For instructions on configuring a Crawler, see [CONFIG.md](./CONFIG.md).
+
+### CLI in Docker
+
+If you are running a dockerized version of Crawler, you can run CLI commands in two ways:
+
+1. Exec into the docker container and execute commands directly using `docker exec -it <container name> bash`
+    - This requires no changes to CLI commands
+    ```bash
+    # exec into container
+    $ docker exec -it crawler bash
+
+    # move to crawler directory
+    $ cd crawler
+
+    # execute commands
+    $ bin/crawler version
+    ```
+2. Execute commands externally using `docker exec -it <container name> <command>`
+    ```bash
+    # execute command directly without entering docker container
+    $ docker exec -it crawler bin/crawler version
+    ```
+
+## Available commands
+### Getting help
+Use the `--help` or `-h` option with any command to get more information.
+
+For example:
+```bash
+$ bin/crawler --help
+
+> Commands:
+>   crawler crawl CRAWL_CONFIG                   # Run a crawl of the site
+>   crawler validate CRAWL_CONFIG                # Validate crawler configuration
+>   crawler version                              # Print version
+```
+
+### Commands
+
+
+- [`crawler crawl`](#crawler-crawl)
+- [`crawler validate`](#crawler-validate)
+- [`crawler version`](#crawler-version)
+
+#### `crawler crawl`
+
+Crawls the domains configured in the provided config file.
+Can optionally take a second configuration file for Elasticsearch settings.
+See [CONFIG.md](./CONFIG.md) for details on the configuration files.
+
+```bash
+# crawl using only crawler config
+$ bin/crawler crawl config/examples/parks-australia.yml
+```
+
+```bash
+# crawl using crawler config and optional --es-config
+$ bin/crawler crawl config/examples/parks-australia.yml --es-config=config/es.yml
+```
+
+#### `crawler validate`
+
+Checks the configured domains in `domain_allowlist` to see if they can be crawled.
+
+```bash
+# when valid
+$ bin/crawler validate path/to/crawler.yml
+
+> Domain https://www.elastic.co is valid
+```
+
+```bash
+# when invalid (e.g. has a redirect)
+$ bin/crawler validate path/to/invalid-crawler.yml
+
+> Domain https://elastic.co is invalid:
+> The web server at https://elastic.co redirected us to a different domain URL (https://www.elastic.co/).
+> If you want to crawl this site, please configure https://www.elastic.co as one of the domains.
+```
+
+#### `crawler version`
+
+Prints the product version of Crawler.
+
+```bash
+$ bin/crawler version
+
+> v0.2.0
+```
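The `validate` and `crawl` commands documented above are naturally run in sequence. The following is a minimal sketch of a wrapper that only starts a crawl when validation reports no invalid domain. It is not part of Crawler: `crawl_if_valid` and `CRAWLER_BIN` are hypothetical names, and detecting failure by the phrase "is invalid" is an assumption based on the sample output shown for `crawler validate`, not on a documented exit code.

```shell
#!/usr/bin/env bash
# Sketch only: validate a config, then crawl if no domain is reported invalid.
# CRAWLER_BIN and crawl_if_valid are hypothetical names, not part of Crawler.
set -euo pipefail

CRAWLER_BIN="${CRAWLER_BIN:-bin/crawler}"

crawl_if_valid() {
  local config="$1"
  local report
  # Capture the validate output; `|| true` lets us inspect the text
  # whatever exit status the command returns.
  report="$("$CRAWLER_BIN" validate "$config" 2>&1 || true)"
  echo "$report"
  # Assumption: invalid domains are reported with the phrase "is invalid",
  # as in the sample output in the CLI docs.
  if grep -q "is invalid" <<<"$report"; then
    echo "Skipping crawl: fix the domains in $config first." >&2
    return 1
  fi
  "$CRAWLER_BIN" crawl "$config"
}

# Usage: ./crawl_if_valid.sh config/my-crawler.yml
if [ "$#" -gt 0 ]; then
  crawl_if_valid "$1"
fi
```

When running from Docker, the same idea applies with each `bin/crawler` call prefixed by `docker exec -it crawler`, as described in the "CLI in Docker" section.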