From 1fc37ff94591fa044efd9dbccfea29beec6625d6 Mon Sep 17 00:00:00 2001
From: Navarone Feekery <13634519+navarone-feekery@users.noreply.github.com>
Date: Fri, 7 Jun 2024 12:13:00 +0200
Subject: [PATCH 1/8] Improve setup docs and add CLI docs

---
 README.md   | 91 ++++++++++++++++++++-----------------------
 docs/CLI.md | 98 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 134 insertions(+), 55 deletions(-)
 create mode 100644 docs/CLI.md

diff --git a/README.md b/README.md
index ace1b90..28f1e1b 100644
--- a/README.md
+++ b/README.md
@@ -3,6 +3,14 @@
 This repository contains code for the Elastic Open Web Crawler.
 This is a tool to allow users to easily ingest content into Elasticsearch from the web.
 
+⚠️ _The Open Crawler is currently in **tech-preview**_.
+Tech-preview features are subject to change and are not covered by the support SLA of generally available (GA) features.
+Elastic plans to promote this feature to GA in a future release.
+
+ℹ️ The Open Crawler requires a running instance of Elasticsearch to index documents into.
+If you don't have this set up yet, check out the [quickstart guide for Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/master/quickstart.html) to get started.
+_Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._
+
 ## How it works
 
 Crawler runs crawl jobs on command based on config files in the `config` directory.
@@ -16,50 +24,52 @@ The crawl results can be output in 3 different modes:
 
 ### Setup
 
+In order to index crawl results into an Elasticsearch instance, you must first have one up and running.
+
 #### Running from Docker
 
 Crawler has a Dockerfile that can be built and run locally.
 
-1. Build the image `docker build -t crawler-image .`
-2. Run the container `docker run -i -d --name crawler crawler-image`
+1. Clone the repository
+2. Build the image `docker build -t crawler-image .`
+3. Run the container `docker run -i -d --name crawler crawler-image`
    - `-i` allows the container to stay alive so CLI commands can be executed inside it
    - `-d` allows the container to run "detached" so you don't have to dedicate a terminal window to it
-3. Confirm that Crawler commands are working `docker exec -it crawler bin/crawler version`
-4. Execute other CLI commands from outside of the container by prepending `docker exec -it crawler <command>`.
+4. Confirm that Crawler commands are working `docker exec -it crawler bin/crawler version`
+5. Execute other CLI commands from outside of the container by prepending `docker exec -it crawler <command>`.
   - See [Crawling content](#crawling-content) for examples.
 
 #### Running from source
 
-Crawler uses both JRuby and Java.
+_Note: Crawler uses both JRuby and Java.
 We recommend using version managers for both.
 When developing Crawler we use `rbenv` and `jenv`.
-There are instructions for setting up these env managers here:
+There are instructions for setting up these env managers here:_
 
 - [Official documentation for installing jenv](https://www.jenv.be/)
 - [Official documentation for installing rbenv](https://github.com/rbenv/rbenv?tab=readme-ov-file#installation)
 
-Go to the root of the Crawler directory and check the expected Java and Ruby versions are being used:
-
-```bash
-# should output the same version as `.ruby-version`
-$ ruby --version
-
-# should output the same version as `.java-version`
-$ java --version
-```
-
-If the versions seem correct, you can install dependencies:
-
-```bash
-$ make install
-```
-
-You can also use the env variable `CRAWLER_MANAGE_ENV` to have the install script automatically check whether `rbenv` and `jenv` are installed, and that the correct versions are running on both:
-Doing this requires that you use both `rbenv` and `jenv` in your local setup.
-
-```bash
-$ CRAWLER_MANAGE_ENV=true make install
-```
+1. Clone the repository
+2. Go to the root of the Crawler directory and check the expected Java and Ruby versions are being used:
+   ```bash
+   # should output the same version as `.ruby-version`
+   $ ruby --version
+
+   # should output the same version as `.java-version`
+   $ java --version
+   ```
+
+3. If the versions seem correct, you can install dependencies:
+   ```bash
+   $ make install
+   ```
+
+   You can also use the env variable `CRAWLER_MANAGE_ENV` to have the install script automatically check whether `rbenv` and `jenv` are installed and that the correct versions are active for both.
+   Doing this requires that you use both `rbenv` and `jenv` in your local setup.
+
+   ```bash
+   $ CRAWLER_MANAGE_ENV=true make install
+   ```
 
 Crawler should now be functional.
 See [Configuring Crawlers](#configuring-crawlers) to begin crawling web content.
@@ -68,37 +78,9 @@
 
 See [CONFIG.md](docs/CONFIG.md) for in-depth details on Crawler configuration files.
 
-Once you have a Crawler configured, you can validate the domain(s) using the CLI.
+### CLI Commands
 
-```bash
-$ bin/crawler validate config/my-crawler.yml
-```
-
-If you are running from docker, you will first need to copy the config file into the docker container.
-
-```bash
-# copy file (if you haven't already done so)
-$ docker cp /path/to/my-crawler.yml crawler:config/my-crawler.yml
-
-# run
-$ docker exec -it crawler bin/crawler validate config/my-crawler.yml
-```
-
-See [Crawling content](#crawling-content).
-
-### Crawling content
-
-Use the following command to run a crawl based on the configuration provided.
-
-```bash
-$ bin/crawler crawl config/my-crawler.yml
-```
-
-And from Docker.
-
-```bash
-$ docker exec -it crawler bin/crawler crawl config/my-crawler.yml
-```
+See [CLI.md](docs/CLI.md) for a full list of CLI commands available for Crawler.
 
 ### Connecting to Elasticsearch
 
@@ -132,5 +114,4 @@ POST /_security/api_key
     "application": "my-crawler"
   }
 }
-
 ```

diff --git a/docs/CLI.md b/docs/CLI.md
new file mode 100644
index 0000000..2e6af0f
--- /dev/null
+++ b/docs/CLI.md
@@ -0,0 +1,98 @@
+# CLI
+
+Crawler CLI is a command-line interface for use in the terminal or scripts.
+
+## Installation and Configuration
+
+Ensure you complete the [setup](../README.md#setup) before using the CLI.
+
+For instructions on configuring a Crawler, see [CONFIG.md](./CONFIG.md).
+
+### CLI in Docker
+
+If you are running a dockerized version of Crawler, you can run CLI commands in two ways:
+
+1. Exec into the docker container and execute commands directly using `docker exec -it crawler bash`
+   - This requires no changes to CLI commands
+   ```bash
+   # exec into container
+   $ docker exec -it crawler bash
+
+   # move to crawler directory
+   $ cd crawler
+
+   # execute commands
+   $ bin/crawler version
+   ```
+2. Execute commands externally using `docker exec -it crawler <command>`
+   ```bash
+   # execute command directly without entering docker container
+   $ docker exec -it crawler bin/crawler version
+   ```
+
+## Available commands
+### Getting help
+Crawler CLI provides a `--help`/`-h` argument that can be used with any command to get more information.
+
+For example:
+```bash
+$ bin/crawler --help
+
+> Commands:
+>   crawler crawl CRAWL_CONFIG     # Run a crawl of the site
+>   crawler validate CRAWL_CONFIG  # Validate crawler configuration
+>   crawler version                # Print version
+```
+
+### Commands
+
+- [`crawler crawl`](#crawler-crawl)
+- [`crawler validate`](#crawler-validate)
+- [`crawler version`](#crawler-version)
+
+#### `crawler crawl`
+
+Crawls the configured domain in the provided config file.
+Can optionally take a second configuration file for Elasticsearch settings.
+See [CONFIG.md](./CONFIG.md) for details on the configuration files.
+
+```bash
+# crawl using only crawler config
+$ bin/crawler crawl config/examples/parks-australia.yml
+```
+
+```bash
+# crawl using crawler config and optional --es-config
+$ bin/crawler crawl config/examples/parks-australia.yml --es-config=config/es.yml
+```
+
+#### `crawler validate`
+
+Checks the configured domains in `domain_allowlist` to see if they can be crawled.
+
+```bash
+# when valid
+$ bin/crawler validate path/to/crawler.yml
+
+> Domain https://www.elastic.co is valid
+```
+
+```bash
+# when invalid (e.g. has a redirect)
+$ bin/crawler validate path/to/invalid-crawler.yml
+
+> Domain https://elastic.co is invalid:
+> The web server at https://elastic.co redirected us to a different domain URL (https://www.elastic.co/).
+> If you want to crawl this site, please configure https://www.elastic.co as one of the domains.
+```
+
+#### `crawler version`
+
+Prints the product version of Crawler.
+
+```bash
+$ bin/crawler version
+
+> v0.2.0
+```

From 4fee2dfdb2234132a05d04c62686f2156cbb160f Mon Sep 17 00:00:00 2001
From: Navarone Feekery <13634519+navarone-feekery@users.noreply.github.com>
Date: Fri, 7 Jun 2024 13:27:47 +0200
Subject: [PATCH 2/8] Apply suggestions from code review

Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>
---
 README.md   | 4 ++--
 docs/CLI.md | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 28f1e1b..3f5ceb1 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
 # Elastic Open Web Crawler
 
 This repository contains code for the Elastic Open Web Crawler.
-This is a tool to allow users to easily ingest content into Elasticsearch from the web.
+The crawler enables users to easily ingest web content into Elasticsearch.
 
 ⚠️ _The Open Crawler is currently in **tech-preview**_.
 Tech-preview features are subject to change and are not covered by the support SLA of generally available (GA) features.
@@ -36,7 +36,7 @@ Crawler has a Dockerfile that can be built and run locally.
   - `-i` allows the container to stay alive so CLI commands can be executed inside it
   - `-d` allows the container to run "detached" so you don't have to dedicate a terminal window to it
 4. Confirm that Crawler commands are working `docker exec -it crawler bin/crawler version`
-5. Execute other CLI commands from outside of the container by prepending `docker exec -it crawler <command>`.
+5. Execute other CLI commands from outside of the container by prepending `docker exec -it crawler <command>`
   - See [Crawling content](#crawling-content) for examples.
 #### Running from source

diff --git a/docs/CLI.md b/docs/CLI.md
index 2e6af0f..c4c6651 100644
--- a/docs/CLI.md
+++ b/docs/CLI.md
@@ -32,7 +32,7 @@ If you are running a dockerized version of Crawler, you can run CLI commands in
 
 ## Available commands
 ### Getting help
-Crawler CLI provides a `--help`/`-h` argument that can be used with any command to get more information.
+Use the `--help` or `-h` option with any command to get more information.
 
 For example:

From b825ff208b507bb1ee76bbb4e3f16aa9fb76d395 Mon Sep 17 00:00:00 2001
From: Navarone Feekery <13634519+navarone-feekery@users.noreply.github.com>
Date: Fri, 7 Jun 2024 13:55:09 +0200
Subject: [PATCH 3/8] Small fixes

---
 README.md   | 74 +++++++++++++++++++++++++++++------------------
 docs/CLI.md |  1 +
 2 files changed, 42 insertions(+), 33 deletions(-)

diff --git a/README.md b/README.md
index 3f5ceb1..e1e1a89 100644
--- a/README.md
+++ b/README.md
@@ -7,10 +7,6 @@ The crawler enables users to easily ingest web content into Elasticsearch.
 Tech-preview features are subject to change and are not covered by the support SLA of generally available (GA) features.
 Elastic plans to promote this feature to GA in a future release.
 
-ℹ️ The Open Crawler requires a running instance of Elasticsearch to index documents into.
-If you don't have this set up yet, check out the [quickstart guide for Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/master/quickstart.html) to get started.
-_Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._
-
 ## How it works
 
 Crawler runs crawl jobs on command based on config files in the `config` directory.
@@ -22,9 +18,14 @@ The crawl results can be output in 3 different modes:
 - As files to a specified directory
 - Directly to the terminal
 
-### Setup
+## Prerequisites
 
-In order to index crawl results into an Elasticsearch instance, you must first have one up and running.
+If you are using the Crawler to index documents into Elasticsearch, you will require a running instance of Elasticsearch to index documents into.
+If you don't have this set up yet, you can sign up for an [Elastic Cloud free trial](https://www.elastic.co/cloud/cloud-trial-overview) or check out the [quickstart guide for Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/master/quickstart.html).
+
+_Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._
+
+### Setup
 
 #### Running from Docker
 
@@ -36,7 +37,7 @@ Crawler has a Dockerfile that can be built and run locally.
   - `-i` allows the container to stay alive so CLI commands can be executed inside it
   - `-d` allows the container to run "detached" so you don't have to dedicate a terminal window to it
 4. Confirm that Crawler commands are working `docker exec -it crawler bin/crawler version`
-5. Execute other CLI commands from outside of the container by prepending `docker exec -it crawler <command>`
-  - See [Crawling content](#crawling-content) for examples.
+5. Execute other CLI commands from outside of the container by prepending `docker exec -it crawler <command>`.
+6. See [Configuring crawlers](#configuring-crawlers) for next steps.
 
 #### Running from source
 
-_Note: Crawler uses both JRuby and Java.
-We recommend using version managers for both.
-When developing Crawler we use `rbenv` and `jenv`.
-There are instructions for setting up these env managers here:_
+To avoid complications caused by different operating systems and managing Ruby and Java versions, we recommend running from source only if you are actively developing Open Crawler.

-- [Official documentation for installing jenv](https://www.jenv.be/)
-- [Official documentation for installing rbenv](https://github.com/rbenv/rbenv?tab=readme-ov-file#installation)

<details>
  <summary>Instructions for running from source</summary>
  ℹ️ Crawler uses both JRuby and Java.
  We recommend using version managers for both.
  When developing Crawler we use `rbenv` and `jenv`.
  There are instructions for setting up these env managers here:

  - [Official documentation for installing jenv](https://www.jenv.be/)
  - [Official documentation for installing rbenv](https://github.com/rbenv/rbenv?tab=readme-ov-file#installation)

  1. Clone the repository
  2. Go to the root of the Crawler directory and check the expected Java and Ruby versions are being used:
     ```bash
     # should output the same version as `.ruby-version`
     $ ruby --version

     # should output the same version as `.java-version`
     $ java --version
     ```

  3. If the versions seem correct, you can install dependencies:
     ```bash
     $ make install
     ```

     You can also use the env variable `CRAWLER_MANAGE_ENV` to have the install script automatically check whether `rbenv` and `jenv` are installed and that the correct versions are active for both.
     Doing this requires that you use both `rbenv` and `jenv` in your local setup.

     ```bash
     $ CRAWLER_MANAGE_ENV=true make install
     ```
</details>
### Configuring Crawlers @@ -80,6 +83,11 @@ See [CONFIG.md](docs/CONFIG.md) for in-depth details on Crawler configuration fi ### CLI Commands +Open Crawler has no UI. +All interactions with Crawler take place through the CLI. +When given a command, Crawler will run until the process is finished. +Crawler is not kept alive in any way between commands. + See [CLI.md](docs/CLI.md) for a full list of CLI commands available for Crawler. ### Connecting to Elasticsearch diff --git a/docs/CLI.md b/docs/CLI.md index c4c6651..b1f84c9 100644 --- a/docs/CLI.md +++ b/docs/CLI.md @@ -1,6 +1,7 @@ # CLI Crawler CLI is a command-line interface for use in the terminal or scripts. +This is the only user interface for interacting with Crawler. ## Installation and Configuration From 36da94cf1ae6cbcc57fbb161cfdd64cbb051ce89 Mon Sep 17 00:00:00 2001 From: Navarone Feekery <13634519+navarone-feekery@users.noreply.github.com> Date: Fri, 7 Jun 2024 14:12:44 +0200 Subject: [PATCH 4/8] Clean up heading sizes --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index e1e1a89..afcfdc1 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,7 @@ The crawler enables users to easily ingest web content into Elasticsearch. Tech-preview features are subject to change and are not covered by the support SLA of generally available (GA) features. Elastic plans to promote this feature to GA in a future release. -## How it works +### How it works Crawler runs crawl jobs on command based on config files in the `config` directory. 1 URL endpoint on a site will correlate with 1 result output. @@ -18,15 +18,15 @@ The crawl results can be output in 3 different modes: - As files to a specified directory - Directly to the terminal -## Prerequisites +### Setup + +#### Prerequisites If you are using the Crawler to index documents into Elasticsearch, you will require a running instance of Elasticsearch to index documents into. If you don't have this set up yet, you can sign up for an [Elastic Cloud free trial](https://www.elastic.co/cloud/cloud-trial-overview) or check out the [quickstart guide for Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/master/quickstart.html). _Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._ -### Setup - #### Running from Docker Crawler has a Dockerfile that can be built and run locally. From b1ee54a47c743e8ea9c2cc341bdc5886cfa48d0c Mon Sep 17 00:00:00 2001 From: Navarone Feekery <13634519+navarone-feekery@users.noreply.github.com> Date: Fri, 7 Jun 2024 15:02:20 +0200 Subject: [PATCH 5/8] Expand how it works section --- README.md | 23 ++++++++++++++--------- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/README.md b/README.md index afcfdc1..197b707 100644 --- a/README.md +++ b/README.md @@ -7,26 +7,31 @@ The crawler enables users to easily ingest web content into Elasticsearch. Tech-preview features are subject to change and are not covered by the support SLA of generally available (GA) features. Elastic plans to promote this feature to GA in a future release. +_Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._ + ### How it works Crawler runs crawl jobs on command based on config files in the `config` directory. -1 URL endpoint on a site will correlate with 1 result output. +Each URL endpoint found during the crawl will result in one document to be indexed into Elasticsearch. 
+
+Crawler performs crawl jobs in a multithreaded environment, where one thread will be used to visit one URL endpoint.
+The crawl results from these are added to a pool of results.
+These are indexed into Elasticsearch using the `_bulk` API once the pool reaches a configurable threshold.
+
+The full process from setup to indexing requires:
+
+1. Running an instance of Elasticsearch (on-prem, cloud, or serverless)
+2. Cloning the Open Crawler repository (see [Setup](#setup))
+3. Configuring a crawler config file (see [Configuring crawlers](#configuring-crawlers))
+4. Using the CLI to begin a crawl job (see [CLI commands](#cli-commands))
 
 ### Setup
 
 #### Prerequisites
 
-If you are using the Crawler to index documents into Elasticsearch, you will require a running instance of Elasticsearch to index documents into.
+A running instance of Elasticsearch is required to index documents into.
 If you don't have this set up yet, you can sign up for an [Elastic Cloud free trial](https://www.elastic.co/cloud/cloud-trial-overview) or check out the [quickstart guide for Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/master/quickstart.html).
 
-_Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._
-
 #### Running from Docker

From 79bde1d5f461f577e0e317b88fb1453d1d0c7876 Mon Sep 17 00:00:00 2001
From: Navarone Feekery <13634519+navarone-feekery@users.noreply.github.com>
Date: Fri, 7 Jun 2024 15:04:31 +0200
Subject: [PATCH 6/8] Remove whitespace

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 197b707..a676915 100644
--- a/README.md
+++ b/README.md
@@ -85,7 +85,7 @@ To avoid complications caused by different operating systems and managing Ruby and Java versions, we recommend running from source only if you are actively developing Open Crawler.
 ### Configuring Crawlers
 
 See [CONFIG.md](docs/CONFIG.md) for in-depth details on Crawler configuration files.
-
+
 ### CLI Commands

From a28f54af518a412a529342d5d881d81d6176310c Mon Sep 17 00:00:00 2001
From: Navarone Feekery <13634519+navarone-feekery@users.noreply.github.com>
Date: Fri, 7 Jun 2024 16:18:19 +0200
Subject: [PATCH 7/8] More fixes

---
 README.md | 133 +++++++++++++++++++++++++++++------------------------
 1 file changed, 71 insertions(+), 62 deletions(-)

diff --git a/README.md b/README.md
index a676915..e4d7ae0 100644
--- a/README.md
+++ b/README.md
@@ -1,30 +1,33 @@
 # Elastic Open Web Crawler
 
 This repository contains code for the Elastic Open Web Crawler.
-The crawler enables users to easily ingest web content into Elasticsearch.
+Open Crawler enables users to easily ingest web content into Elasticsearch.
 
-⚠️ _The Open Crawler is currently in **tech-preview**_.
+> [!IMPORTANT]
+> _The Open Crawler is currently in **tech-preview**_.
 Tech-preview features are subject to change and are not covered by the support SLA of generally available (GA) features.
 Elastic plans to promote this feature to GA in a future release.

_Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._

### User workflow

The full process from setup to indexing requires:

1. Running an instance of Elasticsearch (on-prem, cloud, or serverless)
2. Cloning the Open Crawler repository (see [Setup](#setup))
3. Configuring a crawler config file (see [Configuring crawlers](#configuring-crawlers))
4. Using the CLI to begin a crawl job (see [CLI commands](#cli-commands))

### Execution logic

Open Crawler runs crawl jobs on command based on config files in the `config` directory.
Each URL endpoint found during the crawl results in one document indexed into Elasticsearch.

Open Crawler performs crawl jobs in a multithreaded environment, where one thread will be used to visit one URL endpoint.
The crawl results from these are added to a pool of results.
These are indexed into Elasticsearch using the `_bulk` API once the pool reaches a configurable threshold.

### Setup

#### Prerequisites

A running instance of Elasticsearch is required to index documents into.
If you don't have this set up yet, you can sign up for an [Elastic Cloud free trial](https://www.elastic.co/cloud/cloud-trial-overview) or check out the [quickstart guide for Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/master/quickstart.html).

#### Connecting to Elasticsearch

Open Crawler will attempt to use the `_bulk` API to index crawl results into Elasticsearch.
To facilitate this connection, Open Crawler needs to have either an API key or a username/password configured to access the Elasticsearch instance.
If using an API key, ensure that the API key has read and write permissions to access the index configured in `output_index`.

- See the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/security-api-create-api-key.html) for more details on managing API keys
- See the [elasticsearch.yml.example](config/elasticsearch.yml.example) file for all of the available Elasticsearch configurations for Crawler

<details>
  <summary>Creating an API key</summary>
  Here is an example of creating an API key with minimal permissions for Open Crawler.
  This will return a JSON response with an `encoded` key.
  The value of `encoded` is what Open Crawler can use in its configuration.

  ```bash
  POST /_security/api_key
  {
    "name": "my-api-key",
    "role_descriptors": {
      "my-crawler-role": {
        "cluster": ["all"],
        "indices": [
          {
            "names": ["my-crawler-index-name"],
            "privileges": ["all"]
          }
        ]
      }
    },
    "metadata": {
      "application": "my-crawler"
    }
  }
  ```
</details>
+ + + +#### Running Open Crawler from Docker + +Open Crawler has a Dockerfile that can be built and run locally. + +1. Clone the repository: `git clone https://github.com/elastic/crawler.git` 2. Build the image `docker build -t crawler-image .` 3. Run the container `docker run -i -d --name crawler crawler-image` - `-i` allows the container to stay alive so CLI commands can be executed inside it - `-d` allows the container to run "detached" so you don't have to dedicate a terminal window to it -4. Confirm that Crawler commands are working `docker exec -it crawler bin/crawler version` -5. Execute other CLI commands from outside of the container by prepending `docker exec -it crawler `. -6. See [Configuring crawlers](#configuring-crawlers) for next steps. +4. Confirm that CLI commands are working `docker exec -it crawler bin/crawler version` + - Execute other CLI commands from outside of the container by prepending `docker exec -it crawler ` +5. Create a config file for your crawler. See [Configuring crawlers](#configuring-crawlers) for next steps. See [Configuring crawlers](#configuring-crawlers) for next steps. -#### Running from source +#### Running Open Crawler from source -To avoid complications caused by different operating systems and managing ruby and java versions, we recommend running from source only if you are actively developing Open Crawler. +> [!TIP] +> We recommend running from source only if you are actively developing Open Crawler.
Instructions for running from source - ℹ️ Crawler uses both JRuby and Java. + ℹ️ Open Crawler uses both JRuby and Java. We recommend using version managers for both. - When developing Crawler we use rbenv and jenv. + When developing Open Crawler we use rbenv and jenv. There are instructions for setting up these env managers here: - [Official documentation for installing jenv](https://www.jenv.be/) - [Official documentation for installing rbenv](https://github.com/rbenv/rbenv?tab=readme-ov-file#installation) - 1. Clone the repository - 2. Go to the root of the Crawler directory and check the expected Java and Ruby versions are being used: + 1. Clone the repository: `git clone https://github.com/elastic/crawler.git` + 2. Go to the root of the Open Crawler directory and check the expected Java and Ruby versions are being used: ```bash # should output the same version as `.ruby-version` $ ruby --version @@ -84,47 +127,13 @@ To avoid complications caused by different operating systems and managing ruby a ### Configuring Crawlers -See [CONFIG.md](docs/CONFIG.md) for in-depth details on Crawler configuration files. +See [CONFIG.md](docs/CONFIG.md) for in-depth details on Open Crawler configuration files. ### CLI Commands -Open Crawler has no UI. -All interactions with Crawler take place through the CLI. -When given a command, Crawler will run until the process is finished. -Crawler is not kept alive in any way between commands. +Open Crawler does not have a graphical user interface. +All interactions with Open Crawler take place through the CLI. +When given a command, Open Crawler will run until the process is finished. +OpenCrawler is not kept alive in any way between commands. See [CLI.md](docs/CLI.md) for a full list of CLI commands available for Crawler. - -### Connecting to Elasticsearch - -If you set the `output_sink` value to `elasticsearch`, Crawler will attempt to bulk index crawl results into Elasticsearch. -To facilitate this connection, Crawler needs to have either an API key or a username/password configured to access the Elasticsearch instance. -If using an API key, ensure that the API key has read and write permissions to access the index configured in `output_index`. - -- [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/security-api-create-api-key.html) for managing API keys for more details -- [elasticsearch.yml.example](config/elasticsearch.yml.example) file for all of the available Elasticsearch configurations for Crawler - -Here is an example of creating an API key with minimal permissions for Crawler. -This will return a JSON with an `encoded` key. -The value of `encoded` is what Crawler can use in its configuration. 
- -```bash -POST /_security/api_key -{ - "name": "my-api-key", - "role_descriptors": { - "my-crawler-role": { - "cluster": ["all"], - "indices": [ - { - "names": ["my-crawler-index-name"], - "privileges": ["all"] - } - ] - } - }, - "metadata": { - "application": "my-crawler" - } -} -``` From 633ead803b043cfdd7e02264c2b69b5c60b1de4f Mon Sep 17 00:00:00 2001 From: Navarone Feekery <13634519+navarone-feekery@users.noreply.github.com> Date: Fri, 7 Jun 2024 16:40:57 +0200 Subject: [PATCH 8/8] Update README.md Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index e4d7ae0..698bb39 100644 --- a/README.md +++ b/README.md @@ -12,7 +12,7 @@ _Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` ### User workflow -The full process from setup to indexing requires: +Indexing web content with the Open Crawler requires: 1. Running an instance of Elasticsearch (on-prem, cloud, or serverless) 2. Cloning of the Open Crawler repository (see [Setup](#setup))