From 1fc37ff94591fa044efd9dbccfea29beec6625d6 Mon Sep 17 00:00:00 2001
From: Navarone Feekery <13634519+navarone-feekery@users.noreply.github.com>
Date: Fri, 7 Jun 2024 12:13:00 +0200
Subject: [PATCH 1/8] Improve setup docs and add CLI docs

---
 README.md   | 91 ++++++++++++++++++++-----------------------
 docs/CLI.md | 98 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 134 insertions(+), 55 deletions(-)
 create mode 100644 docs/CLI.md

diff --git a/README.md b/README.md
index ace1b90..28f1e1b 100644
--- a/README.md
+++ b/README.md
@@ -3,6 +3,14 @@
 This repository contains code for the Elastic Open Web Crawler.
 This is a tool to allow users to easily ingest content into Elasticsearch from the web.
 
+⚠️ _The Open Crawler is currently in **tech-preview**_.
+Tech-preview features are subject to change and are not covered by the support SLA of generally available (GA) features.
+Elastic plans to promote this feature to GA in a future release.
+
+ℹ️ The Open Crawler requires a running instance of Elasticsearch to index documents into.
+If you don't have this set up yet, check out the [quickstart guide for Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/master/quickstart.html) to get started.
+_Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._
+
 ## How it works
 
 Crawler runs crawl jobs on command based on config files in the `config` directory.
@@ -16,50 +24,52 @@ The crawl results can be output in 3 different modes:
 
 ### Setup
 
+In order to index crawl results into an Elasticsearch instance, you must first have one up and running.
+
 #### Running from Docker
 
 Crawler has a Dockerfile that can be built and run locally.
 
-1. Build the image `docker build -t crawler-image .`
-2. Run the container `docker run -i -d --name crawler crawler-image`
+1. Clone the repository
+2. Build the image `docker build -t crawler-image .`
+3. Run the container `docker run -i -d --name crawler crawler-image`
    - `-i` allows the container to stay alive so CLI commands can be executed inside it
    - `-d` allows the container to run "detached" so you don't have to dedicate a terminal window to it
-3. Confirm that Crawler commands are working `docker exec -it crawler bin/crawler version`
-4. Execute other CLI commands from outside of the container by prepending `docker exec -it crawler <command>`.
+4. Confirm that Crawler commands are working `docker exec -it crawler bin/crawler version`
+5. Execute other CLI commands from outside of the container by prepending `docker exec -it crawler <command>`.
   - See [Crawling content](#crawling-content) for examples.
 
 #### Running from source
 
-Crawler uses both JRuby and Java.
+_Note: Crawler uses both JRuby and Java.
 We recommend using version managers for both.
 When developing Crawler we use `rbenv` and `jenv`.
-There are instructions for setting up these env managers here:
+There are instructions for setting up these env managers here:_
 
 - [Official documentation for installing jenv](https://www.jenv.be/)
 - [Official documentation for installing rbenv](https://github.com/rbenv/rbenv?tab=readme-ov-file#installation)
 
-Go to the root of the Crawler directory and check the expected Java and Ruby versions are being used:
-
-```bash
-# should output the same version as `.ruby-version`
-$ ruby --version
-
-# should output the same version as `.java-version`
-$ java --version
-```
-
-If the versions seem correct, you can install dependencies:
-
-```bash
-$ make install
-```
-
-You can also use the env variable `CRAWLER_MANAGE_ENV` to have the install script automatically check whether `rbenv` and `jenv` are installed, and that the correct versions are running on both:
-Doing this requires that you use both `rbenv` and `jenv` in your local setup.
-
-```bash
-$ CRAWLER_MANAGE_ENV=true make install
-```
+1. Clone the repository
+2. Go to the root of the Crawler directory and check the expected Java and Ruby versions are being used:
+   ```bash
+   # should output the same version as `.ruby-version`
+   $ ruby --version
+
+   # should output the same version as `.java-version`
+   $ java --version
+   ```
+
+3. If the versions seem correct, you can install dependencies:
+   ```bash
+   $ make install
+   ```
+
+   You can also use the env variable `CRAWLER_MANAGE_ENV` to have the install script automatically check whether `rbenv` and `jenv` are installed and that the correct versions are active for both.
+   Doing this requires that you use both `rbenv` and `jenv` in your local setup.
+
+   ```bash
+   $ CRAWLER_MANAGE_ENV=true make install
+   ```
 
 Crawler should now be functional.
 See [Configuring Crawlers](#configuring-crawlers) to begin crawling web content.
@@ -68,37 +78,9 @@
 
 See [CONFIG.md](docs/CONFIG.md) for in-depth details on Crawler configuration files.
 
-Once you have a Crawler configured, you can validate the domain(s) using the CLI.
+### CLI Commands
 
-```bash
-$ bin/crawler validate config/my-crawler.yml
-```
-
-If you are running from docker, you will first need to copy the config file into the docker container.
-
-```bash
-# copy file (if you haven't already done so)
-$ docker cp /path/to/my-crawler.yml crawler:config/my-crawler.yml
-
-# run
-$ docker exec -it crawler bin/crawler validate config/my-crawler.yml
-```
-
-See [Crawling content](#crawling-content).
-
-### Crawling content
-
-Use the following command to run a crawl based on the configuration provided.
-
-```bash
-$ bin/crawler crawl config/my-crawler.yml
-```
-
-And from Docker.
-
-```bash
-$ docker exec -it crawler bin/crawler crawl config/my-crawler.yml
-```
+See [CLI.md](docs/CLI.md) for a full list of CLI commands available for Crawler.
 
 ### Connecting to Elasticsearch
 
@@ -132,5 +114,4 @@ POST /_security/api_key
     "application": "my-crawler"
   }
 }
-
 ```

diff --git a/docs/CLI.md b/docs/CLI.md
new file mode 100644
index 0000000..2e6af0f
--- /dev/null
+++ b/docs/CLI.md
@@ -0,0 +1,98 @@
+# CLI
+
+Crawler CLI is a command-line interface for use in the terminal or scripts.
+
+## Installation and Configuration
+
+Ensure you complete the [setup](../README.md#setup) before using the CLI.
+
+For instructions on configuring a Crawler, see [CONFIG.md](./CONFIG.md).
+
+### CLI in Docker
+
+If you are running a dockerized version of Crawler, you can run CLI commands in two ways:
+
+1. Exec into the docker container and execute commands directly using `docker exec -it crawler bash`
+   - This requires no changes to CLI commands
+   ```bash
+   # exec into container
+   $ docker exec -it crawler bash
+
+   # move to crawler directory
+   $ cd crawler
+
+   # execute commands
+   $ bin/crawler version
+   ```
+2. Execute commands externally using `docker exec -it crawler <command>`
+   ```bash
+   # execute command directly without entering docker container
+   $ docker exec -it crawler bin/crawler version
+   ```
+
+## Available commands
+### Getting help
+Crawler CLI provides a `--help`/`-h` argument that can be used with any command to get more information.
+
+For example:
+```bash
+$ bin/crawler --help
+
+> Commands:
+>   crawler crawl CRAWL_CONFIG     # Run a crawl of the site
+>   crawler validate CRAWL_CONFIG  # Validate crawler configuration
+>   crawler version                # Print version
+```
+
+### Commands
+
+- [`crawler crawl`](#crawler-crawl)
+- [`crawler validate`](#crawler-validate)
+- [`crawler version`](#crawler-version)
+
+#### `crawler crawl`
+
+Crawls the configured domain in the provided config file.
+Can optionally take a second configuration file for Elasticsearch settings.
+See [CONFIG.md](./CONFIG.md) for details on the configuration files.
+
+```bash
+# crawl using only crawler config
+$ bin/crawler crawl config/examples/parks-australia.yml
+```
+
+```bash
+# crawl using crawler config and optional --es-config
+$ bin/crawler crawl config/examples/parks-australia.yml --es-config=config/es.yml
+```
+
+#### `crawler validate`
+
+Checks the configured domains in `domain_allowlist` to see if they can be crawled.
+
+```bash
+# when valid
+$ bin/crawler validate path/to/crawler.yml
+
+> Domain https://www.elastic.co is valid
+```
+
+```bash
+# when invalid (e.g. has a redirect)
+$ bin/crawler validate path/to/invalid-crawler.yml
+
+> Domain https://elastic.co is invalid:
+> The web server at https://elastic.co redirected us to a different domain URL (https://www.elastic.co/).
+> If you want to crawl this site, please configure https://www.elastic.co as one of the domains.
+```
+
+#### `crawler version`
+
+Prints the product version of Crawler.
+
+```bash
+$ bin/crawler version
+
+> v0.2.0
+```

From 4fee2dfdb2234132a05d04c62686f2156cbb160f Mon Sep 17 00:00:00 2001
From: Navarone Feekery <13634519+navarone-feekery@users.noreply.github.com>
Date: Fri, 7 Jun 2024 13:27:47 +0200
Subject: [PATCH 2/8] Apply suggestions from code review

Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>
---
 README.md   | 4 ++--
 docs/CLI.md | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 28f1e1b..3f5ceb1 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
 # Elastic Open Web Crawler
 
 This repository contains code for the Elastic Open Web Crawler.
-This is a tool to allow users to easily ingest content into Elasticsearch from the web.
+The crawler enables users to easily ingest web content into Elasticsearch.
 
 ⚠️ _The Open Crawler is currently in **tech-preview**_.
 Tech-preview features are subject to change and are not covered by the support SLA of generally available (GA) features.
@@ -36,7 +36,7 @@ Crawler has a Dockerfile that can be built and run locally.
   - `-i` allows the container to stay alive so CLI commands can be executed inside it
   - `-d` allows the container to run "detached" so you don't have to dedicate a terminal window to it
 4. Confirm that Crawler commands are working `docker exec -it crawler bin/crawler version`
-5. Execute other CLI commands from outside of the container by prepending `docker exec -it crawler <command>`.
+5. Execute other CLI commands from outside of the container by prepending `docker exec -it crawler <command>`
   - See [Crawling content](#crawling-content) for examples.
 #### Running from source

diff --git a/docs/CLI.md b/docs/CLI.md
index 2e6af0f..c4c6651 100644
--- a/docs/CLI.md
+++ b/docs/CLI.md
@@ -32,7 +32,7 @@ If you are running a dockerized version of Crawler, you can run CLI commands in
 
 ## Available commands
 ### Getting help
-Crawler CLI provides a `--help`/`-h` argument that can be used with any command to get more information.
+Use the `--help` or `-h` option with any command to get more information.
 
 For example:

From b825ff208b507bb1ee76bbb4e3f16aa9fb76d395 Mon Sep 17 00:00:00 2001
From: Navarone Feekery <13634519+navarone-feekery@users.noreply.github.com>
Date: Fri, 7 Jun 2024 13:55:09 +0200
Subject: [PATCH 3/8] Small fixes

---
 README.md   | 74 +++++++++++++++++++++++++++++------------------
 docs/CLI.md |  1 +
 2 files changed, 42 insertions(+), 33 deletions(-)

diff --git a/README.md b/README.md
index 3f5ceb1..e1e1a89 100644
--- a/README.md
+++ b/README.md
@@ -7,10 +7,6 @@ The crawler enables users to easily ingest web content into Elasticsearch.
 Tech-preview features are subject to change and are not covered by the support SLA of generally available (GA) features.
 Elastic plans to promote this feature to GA in a future release.
 
-ℹ️ The Open Crawler requires a running instance of Elasticsearch to index documents into.
-If you don't have this set up yet, check out the [quickstart guide for Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/master/quickstart.html) to get started.
-_Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._
-
 ## How it works
 
 Crawler runs crawl jobs on command based on config files in the `config` directory.
@@ -22,9 +18,14 @@ The crawl results can be output in 3 different modes:
 - As files to a specified directory
 - Directly to the terminal
 
-### Setup
+## Prerequisites
 
-In order to index crawl results into an Elasticsearch instance, you must first have one up and running.
+If you are using the Crawler to index documents into Elasticsearch, you will require a running instance of Elasticsearch to index documents into.
+If you don't have this set up yet, you can sign up for an [Elastic Cloud free trial](https://www.elastic.co/cloud/cloud-trial-overview) or check out the [quickstart guide for Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/master/quickstart.html).
+
+_Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._
+
+### Setup
 
 #### Running from Docker
 
@@ -36,7 +37,7 @@ Crawler has a Dockerfile that can be built and run locally.
   - `-i` allows the container to stay alive so CLI commands can be executed inside it
   - `-d` allows the container to run "detached" so you don't have to dedicate a terminal window to it
 4. Confirm that Crawler commands are working `docker exec -it crawler bin/crawler version`
-5. Execute other CLI commands from outside of the container by prepending `docker exec -it crawler <command>`
-  - See [Crawling content](#crawling-content) for examples.
+5. Execute other CLI commands from outside of the container by prepending `docker exec -it crawler <command>`.
+6. See [Configuring crawlers](#configuring-crawlers) for next steps.
 
 #### Running from source
 
-_Note: Crawler uses both JRuby and Java.
-We recommend using version managers for both.
-When developing Crawler we use `rbenv` and `jenv`.
-There are instructions for setting up these env managers here:_
+To avoid complications caused by different operating systems and managing Ruby and Java versions, we recommend running from source only if you are actively developing Open Crawler.

-- [Official documentation for installing jenv](https://www.jenv.be/)
-- [Official documentation for installing rbenv](https://github.com/rbenv/rbenv?tab=readme-ov-file#installation)

<details>
  <summary>Instructions for running from source</summary>
  ℹ️ Crawler uses both JRuby and Java.
  We recommend using version managers for both.
  When developing Crawler we use `rbenv` and `jenv`.
  There are instructions for setting up these env managers here:

  - [Official documentation for installing jenv](https://www.jenv.be/)
  - [Official documentation for installing rbenv](https://github.com/rbenv/rbenv?tab=readme-ov-file#installation)

  1. Clone the repository
  2. Go to the root of the Crawler directory and check the expected Java and Ruby versions are being used:
     ```bash
     # should output the same version as `.ruby-version`
     $ ruby --version

     # should output the same version as `.java-version`
     $ java --version
     ```

  3. If the versions seem correct, you can install dependencies:
     ```bash
     $ make install
     ```

     You can also use the env variable `CRAWLER_MANAGE_ENV` to have the install script automatically check whether `rbenv` and `jenv` are installed and that the correct versions are active for both.
     Doing this requires that you use both `rbenv` and `jenv` in your local setup.

     ```bash
     $ CRAWLER_MANAGE_ENV=true make install
     ```
</details>
### Configuring Crawlers @@ -80,6 +83,11 @@ See [CONFIG.md](docs/CONFIG.md) for in-depth details on Crawler configuration fi ### CLI Commands +Open Crawler has no UI. +All interactions with Crawler take place through the CLI. +When given a command, Crawler will run until the process is finished. +Crawler is not kept alive in any way between commands. + See [CLI.md](docs/CLI.md) for a full list of CLI commands available for Crawler. ### Connecting to Elasticsearch diff --git a/docs/CLI.md b/docs/CLI.md index c4c6651..b1f84c9 100644 --- a/docs/CLI.md +++ b/docs/CLI.md @@ -1,6 +1,7 @@ # CLI Crawler CLI is a command-line interface for use in the terminal or scripts. +This is the only user interface for interacting with Crawler. ## Installation and Configuration From 36da94cf1ae6cbcc57fbb161cfdd64cbb051ce89 Mon Sep 17 00:00:00 2001 From: Navarone Feekery <13634519+navarone-feekery@users.noreply.github.com> Date: Fri, 7 Jun 2024 14:12:44 +0200 Subject: [PATCH 4/8] Clean up heading sizes --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index e1e1a89..afcfdc1 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,7 @@ The crawler enables users to easily ingest web content into Elasticsearch. Tech-preview features are subject to change and are not covered by the support SLA of generally available (GA) features. Elastic plans to promote this feature to GA in a future release. -## How it works +### How it works Crawler runs crawl jobs on command based on config files in the `config` directory. 1 URL endpoint on a site will correlate with 1 result output. @@ -18,15 +18,15 @@ The crawl results can be output in 3 different modes: - As files to a specified directory - Directly to the terminal -## Prerequisites +### Setup + +#### Prerequisites If you are using the Crawler to index documents into Elasticsearch, you will require a running instance of Elasticsearch to index documents into. If you don't have this set up yet, you can sign up for an [Elastic Cloud free trial](https://www.elastic.co/cloud/cloud-trial-overview) or check out the [quickstart guide for Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/master/quickstart.html). _Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._ -### Setup - #### Running from Docker Crawler has a Dockerfile that can be built and run locally. From b1ee54a47c743e8ea9c2cc341bdc5886cfa48d0c Mon Sep 17 00:00:00 2001 From: Navarone Feekery <13634519+navarone-feekery@users.noreply.github.com> Date: Fri, 7 Jun 2024 15:02:20 +0200 Subject: [PATCH 5/8] Expand how it works section --- README.md | 23 ++++++++++++++--------- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/README.md b/README.md index afcfdc1..197b707 100644 --- a/README.md +++ b/README.md @@ -7,26 +7,31 @@ The crawler enables users to easily ingest web content into Elasticsearch. Tech-preview features are subject to change and are not covered by the support SLA of generally available (GA) features. Elastic plans to promote this feature to GA in a future release. +_Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._ + ### How it works Crawler runs crawl jobs on command based on config files in the `config` directory. -1 URL endpoint on a site will correlate with 1 result output. +Each URL endpoint found during the crawl will result in one document to be indexed into Elasticsearch. 
+
+Crawler performs crawl jobs in a multithreaded environment, where one thread will be used to visit one URL endpoint.
+The crawl results from these are added to a pool of results.
+These are indexed into Elasticsearch using the `_bulk` API once the pool reaches a configurable threshold.
+
+The full process from setup to indexing requires:
+
+1. Running an instance of Elasticsearch (on-prem, cloud, or serverless)
+2. Cloning the Open Crawler repository (see [Setup](#setup))
+3. Configuring a crawler config file (see [Configuring crawlers](#configuring-crawlers))
+4. Using the CLI to begin a crawl job (see [CLI commands](#cli-commands))
 
 ### Setup
 
 #### Prerequisites
 
-If you are using the Crawler to index documents into Elasticsearch, you will require a running instance of Elasticsearch to index documents into.
+A running instance of Elasticsearch is required to index documents into.
 If you don't have this set up yet, you can sign up for an [Elastic Cloud free trial](https://www.elastic.co/cloud/cloud-trial-overview) or check out the [quickstart guide for Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/master/quickstart.html).
 
-_Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._
-
 #### Running from Docker

From 79bde1d5f461f577e0e317b88fb1453d1d0c7876 Mon Sep 17 00:00:00 2001
From: Navarone Feekery <13634519+navarone-feekery@users.noreply.github.com>
Date: Fri, 7 Jun 2024 15:04:31 +0200
Subject: [PATCH 6/8] Remove whitespace

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 197b707..a676915 100644
--- a/README.md
+++ b/README.md
@@ -85,7 +85,7 @@ To avoid complications caused by different operating systems and managing Ruby and Java versions, we recommend running from source only if you are actively developing Open Crawler.
 ### Configuring Crawlers
 
 See [CONFIG.md](docs/CONFIG.md) for in-depth details on Crawler configuration files.
-
+
 ### CLI Commands

From a28f54af518a412a529342d5d881d81d6176310c Mon Sep 17 00:00:00 2001
From: Navarone Feekery <13634519+navarone-feekery@users.noreply.github.com>
Date: Fri, 7 Jun 2024 16:18:19 +0200
Subject: [PATCH 7/8] More fixes

---
 README.md | 133 +++++++++++++++++++++++++++++------------------------
 1 file changed, 71 insertions(+), 62 deletions(-)

diff --git a/README.md b/README.md
index a676915..e4d7ae0 100644
--- a/README.md
+++ b/README.md
@@ -1,30 +1,33 @@
 # Elastic Open Web Crawler
 
 This repository contains code for the Elastic Open Web Crawler.
-The crawler enables users to easily ingest web content into Elasticsearch.
+Open Crawler enables users to easily ingest web content into Elasticsearch.
 
-⚠️ _The Open Crawler is currently in **tech-preview**_.
+> [!IMPORTANT]
+> _The Open Crawler is currently in **tech-preview**_.
 Tech-preview features are subject to change and are not covered by the support SLA of generally available (GA) features.
 Elastic plans to promote this feature to GA in a future release.

_Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._

### User workflow

The full process from setup to indexing requires:

1. Running an instance of Elasticsearch (on-prem, cloud, or serverless)
2. Cloning the Open Crawler repository (see [Setup](#setup))
3. Configuring a crawler config file (see [Configuring crawlers](#configuring-crawlers))
4. Using the CLI to begin a crawl job (see [CLI commands](#cli-commands))

### Execution logic

Open Crawler runs crawl jobs on command based on config files in the `config` directory.
Each URL endpoint found during the crawl results in one document indexed into Elasticsearch.

Open Crawler performs crawl jobs in a multithreaded environment, where one thread will be used to visit one URL endpoint.
The crawl results from these are added to a pool of results.
These are indexed into Elasticsearch using the `_bulk` API once the pool reaches a configurable threshold.

### Setup

#### Prerequisites

A running instance of Elasticsearch is required to index documents into.
If you don't have this set up yet, you can sign up for an [Elastic Cloud free trial](https://www.elastic.co/cloud/cloud-trial-overview) or check out the [quickstart guide for Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/master/quickstart.html).

#### Connecting to Elasticsearch

Open Crawler will attempt to use the `_bulk` API to index crawl results into Elasticsearch.
To facilitate this connection, Open Crawler needs to have either an API key or a username/password configured to access the Elasticsearch instance.
If using an API key, ensure that the API key has read and write permissions to access the index configured in `output_index`.

- See the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/security-api-create-api-key.html) for more details on managing API keys
- See the [elasticsearch.yml.example](config/elasticsearch.yml.example) file for all of the available Elasticsearch configurations for Crawler

<details>
  <summary>Creating an API key</summary>
  Here is an example of creating an API key with minimal permissions for Open Crawler.
  This will return a JSON response with an `encoded` key.
  The value of `encoded` is what Open Crawler can use in its configuration.

  ```bash
  POST /_security/api_key
  {
    "name": "my-api-key",
    "role_descriptors": {
      "my-crawler-role": {
        "cluster": ["all"],
        "indices": [
          {
            "names": ["my-crawler-index-name"],
            "privileges": ["all"]
          }
        ]
      }
    },
    "metadata": {
      "application": "my-crawler"
    }
  }
  ```
</details>
+ + + +#### Running Open Crawler from Docker + +Open Crawler has a Dockerfile that can be built and run locally. + +1. Clone the repository: `git clone https://github.com/elastic/crawler.git` 2. Build the image `docker build -t crawler-image .` 3. Run the container `docker run -i -d --name crawler crawler-image` - `-i` allows the container to stay alive so CLI commands can be executed inside it - `-d` allows the container to run "detached" so you don't have to dedicate a terminal window to it -4. Confirm that Crawler commands are working `docker exec -it crawler bin/crawler version` -5. Execute other CLI commands from outside of the container by prepending `docker exec -it crawler `. -6. See [Configuring crawlers](#configuring-crawlers) for next steps. +4. Confirm that CLI commands are working `docker exec -it crawler bin/crawler version` + - Execute other CLI commands from outside of the container by prepending `docker exec -it crawler ` +5. Create a config file for your crawler. See [Configuring crawlers](#configuring-crawlers) for next steps. See [Configuring crawlers](#configuring-crawlers) for next steps. -#### Running from source +#### Running Open Crawler from source -To avoid complications caused by different operating systems and managing ruby and java versions, we recommend running from source only if you are actively developing Open Crawler. +> [!TIP] +> We recommend running from source only if you are actively developing Open Crawler.
Instructions for running from source - ℹ️ Crawler uses both JRuby and Java. + ℹ️ Open Crawler uses both JRuby and Java. We recommend using version managers for both. - When developing Crawler we use rbenv and jenv. + When developing Open Crawler we use rbenv and jenv. There are instructions for setting up these env managers here: - [Official documentation for installing jenv](https://www.jenv.be/) - [Official documentation for installing rbenv](https://github.com/rbenv/rbenv?tab=readme-ov-file#installation) - 1. Clone the repository - 2. Go to the root of the Crawler directory and check the expected Java and Ruby versions are being used: + 1. Clone the repository: `git clone https://github.com/elastic/crawler.git` + 2. Go to the root of the Open Crawler directory and check the expected Java and Ruby versions are being used: ```bash # should output the same version as `.ruby-version` $ ruby --version @@ -84,47 +127,13 @@ To avoid complications caused by different operating systems and managing ruby a ### Configuring Crawlers -See [CONFIG.md](docs/CONFIG.md) for in-depth details on Crawler configuration files. +See [CONFIG.md](docs/CONFIG.md) for in-depth details on Open Crawler configuration files. ### CLI Commands -Open Crawler has no UI. -All interactions with Crawler take place through the CLI. -When given a command, Crawler will run until the process is finished. -Crawler is not kept alive in any way between commands. +Open Crawler does not have a graphical user interface. +All interactions with Open Crawler take place through the CLI. +When given a command, Open Crawler will run until the process is finished. +OpenCrawler is not kept alive in any way between commands. See [CLI.md](docs/CLI.md) for a full list of CLI commands available for Crawler. - -### Connecting to Elasticsearch - -If you set the `output_sink` value to `elasticsearch`, Crawler will attempt to bulk index crawl results into Elasticsearch. -To facilitate this connection, Crawler needs to have either an API key or a username/password configured to access the Elasticsearch instance. -If using an API key, ensure that the API key has read and write permissions to access the index configured in `output_index`. - -- [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/security-api-create-api-key.html) for managing API keys for more details -- [elasticsearch.yml.example](config/elasticsearch.yml.example) file for all of the available Elasticsearch configurations for Crawler - -Here is an example of creating an API key with minimal permissions for Crawler. -This will return a JSON with an `encoded` key. -The value of `encoded` is what Crawler can use in its configuration. 
- -```bash -POST /_security/api_key -{ - "name": "my-api-key", - "role_descriptors": { - "my-crawler-role": { - "cluster": ["all"], - "indices": [ - { - "names": ["my-crawler-index-name"], - "privileges": ["all"] - } - ] - } - }, - "metadata": { - "application": "my-crawler" - } -} -``` From 633ead803b043cfdd7e02264c2b69b5c60b1de4f Mon Sep 17 00:00:00 2001 From: Navarone Feekery <13634519+navarone-feekery@users.noreply.github.com> Date: Fri, 7 Jun 2024 16:40:57 +0200 Subject: [PATCH 8/8] Update README.md Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index e4d7ae0..698bb39 100644 --- a/README.md +++ b/README.md @@ -12,7 +12,7 @@ _Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` ### User workflow -The full process from setup to indexing requires: +Indexing web content with the Open Crawler requires: 1. Running an instance of Elasticsearch (on-prem, cloud, or serverless) 2. Cloning of the Open Crawler repository (see [Setup](#setup))