Improve setup docs and add CLI docs #44

Merged (9 commits) on Jun 7, 2024

91 changes: 36 additions & 55 deletions README.md
@@ -3,6 +3,14 @@
This repository contains code for the Elastic Open Web Crawler.
This is a tool that allows users to easily ingest web content into Elasticsearch.

⚠️ _The Open Crawler is currently in **tech-preview**_.
Tech-preview features are subject to change and are not covered by the support SLA of generally available (GA) features.
Elastic plans to promote this feature to GA in a future release.

ℹ️ The Open Crawler requires a running instance of Elasticsearch to index documents into.
If you don't have this set up yet, check out the [quickstart guide for Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/master/quickstart.html) to get started.
_Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._

## How it works

Crawler runs crawl jobs on command based on config files in the `config` directory.
@@ -16,50 +24,52 @@ The crawl results can be output in 3 different modes:

### Setup

To index crawl results into an Elasticsearch instance, you must first have one up and running.

#### Running from Docker

Crawler has a Dockerfile that can be built and run locally.

1. Clone the repository
2. Build the image `docker build -t crawler-image .`
3. Run the container `docker run -i -d --name crawler crawler-image`
   - `-i` allows the container to stay alive so CLI commands can be executed inside it
   - `-d` allows the container to run "detached" so you don't have to dedicate a terminal window to it
4. Confirm that Crawler commands are working `docker exec -it crawler bin/crawler version`
5. Execute other CLI commands from outside of the container by prepending `docker exec -it crawler <command>`.
   - See [CLI.md](docs/CLI.md) for examples.
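
For convenience, here are the same steps as a single shell session (assuming the repository has already been cloned and you are in its root directory):

```bash
# build the image from the repository root
$ docker build -t crawler-image .

# run the container detached (-d), keeping STDIN open (-i) so CLI commands can be run inside it
$ docker run -i -d --name crawler crawler-image

# confirm that Crawler commands are working
$ docker exec -it crawler bin/crawler version
```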

#### Running from source

_Note: Crawler uses both JRuby and Java.
We recommend using version managers for both.
When developing Crawler we use `rbenv` and `jenv`.
There are instructions for setting up these env managers here:_

- [Official documentation for installing jenv](https://www.jenv.be/)
- [Official documentation for installing rbenv](https://github.com/rbenv/rbenv?tab=readme-ov-file#installation)

1. Clone the repository
2. Go to the root of the Crawler directory and check the expected Java and Ruby versions are being used:
   ```bash
   # should output the same version as `.ruby-version`
   $ ruby --version

   # should output the same version as `.java-version`
   $ java --version
   ```
3. If the versions seem correct, you can install dependencies:
   ```bash
   $ make install
   ```

You can also use the env variable `CRAWLER_MANAGE_ENV` to have the install script automatically check that `rbenv` and `jenv` are installed and that the correct versions are in use for both.
Doing this requires that you use both `rbenv` and `jenv` in your local setup.

```bash
$ CRAWLER_MANAGE_ENV=true make install
```

Crawler should now be functional.
See [Configuring Crawlers](#configuring-crawlers) to begin crawling web content.
@@ -68,37 +78,9 @@

See [CONFIG.md](docs/CONFIG.md) for in-depth details on Crawler configuration files.
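
For orientation only, a minimal configuration might look roughly like the sketch below; the field names here (other than `domain_allowlist`) are assumptions, so treat [CONFIG.md](docs/CONFIG.md) as the authoritative reference.

```yaml
# config/my-crawler.yml (hypothetical example; see docs/CONFIG.md for the real schema)
domain_allowlist:          # domains Crawler is allowed to crawl
  - https://www.elastic.co
seed_urls:                 # assumed field name for the starting URLs of a crawl
  - https://www.elastic.co/
output_sink: console       # assumed field name for selecting one of the 3 output modes
```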

### CLI Commands

See [CLI.md](docs/CLI.md) for a full list of CLI commands available for Crawler.

### Connecting to Elasticsearch

@@ -132,5 +114,4 @@ POST /_security/api_key
"application": "my-crawler"
}
}

```
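
When indexing into Elasticsearch, the connection details (including an API key such as the one created above) go into an Elasticsearch configuration file like the `config/es.yml` passed via `--es-config` (see [CLI.md](docs/CLI.md)). A rough sketch follows; the field names are assumptions rather than the documented schema, so check [CONFIG.md](docs/CONFIG.md).

```yaml
# config/es.yml (hypothetical example; field names are assumptions, see docs/CONFIG.md)
elasticsearch:
  host: http://localhost        # Elasticsearch endpoint
  port: 9200
  api_key: <encoded-api-key>    # e.g. the key returned by POST /_security/api_key
```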
98 changes: 98 additions & 0 deletions docs/CLI.md
@@ -0,0 +1,98 @@
# CLI

Crawler CLI is a command-line interface for use in the terminal or scripts.

## Installation and Configuration

Ensure you complete the [setup](../README.md#setup) before using the CLI.

For instructions on configuring a Crawler, see [CONFIG.md](./CONFIG.md).

### CLI in Docker

If you are running a dockerized version of Crawler, you can run CLI commands in two ways:

1. Exec into the docker container and execute commands directly using `docker exec -it <container name> bash`
   - This requires no changes to CLI commands
   ```bash
   # exec into container
   $ docker exec -it crawler bash

   # move to crawler directory
   $ cd crawler

   # execute commands
   $ bin/crawler version
   ```
2. Execute commands externally using `docker exec -it <container name> <command>`
   ```bash
   # execute command directly without entering docker container
   $ docker exec -it crawler bin/crawler version
   ```
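
In both cases, a configuration file that exists only on your host machine must first be copied into the container before Crawler can use it, for example:

```bash
# copy the config file into the container (if you haven't already done so)
$ docker cp /path/to/my-crawler.yml crawler:config/my-crawler.yml

# then run CLI commands against it from outside the container
$ docker exec -it crawler bin/crawler validate config/my-crawler.yml
```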

## Available commands
### Getting help
Crawler CLI provides a `--help`/`-h` argument that can be used with any command to get more information.

For example:
```bash
$ bin/crawler --help

> Commands:
> crawler crawl CRAWL_CONFIG # Run a crawl of the site
> crawler validate CRAWL_CONFIG # Validate crawler configuration
> crawler version # Print version
```

### Commands


- [`crawler crawl`](#crawler-crawl)
- [`crawler validate`](#crawler-validate)
- [`crawler version`](#crawler-version)

#### `crawler crawl`

Crawls the configured domain in the provided config file.
Can optionally take a second configuration file for Elasticsearch settings.
See [CONFIG.md](./CONFIG.md) for details on the configuration files.

```bash
# crawl using only crawler config
$ bin/crawler crawl config/examples/parks-australia.yml
```

```bash
# crawl using crawler config and optional --es-config
$ bin/crawler crawl config/examples/parks-australia.yml --es-config=config/es.yml
```

#### `crawler validate`

Checks the configured domains in `domain_allowlist` to see if they can be crawled.

```bash
# when valid
$ bin/crawler validate path/to/crawler.yml

> Domain https://www.elastic.co is valid
```

```bash
# when invalid (e.g. has a redirect)
$ bin/crawler validate path/to/invalid-crawler.yml

> Domain https://elastic.co is invalid:
> The web server at https://elastic.co redirected us to a different domain URL (https://www.elastic.co/).
> If you want to crawl this site, please configure https://www.elastic.co as one of the domains.
```

#### `crawler version`

Prints the product version of Crawler.

```bash
$ bin/crawler version

> v0.2.0
```