Improve setup docs and add CLI docs (#44)

- Add version compatibility
- Add tech-preview status
- Clean up Setup section
- Add CLI.md
- Move crawl instructions to CLI.md

Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>
1 parent ae66978, commit 4c6520f. Showing 2 changed files with 201 additions and 99 deletions.

# Elastic Open Web Crawler

This repository contains code for the Elastic Open Web Crawler.
Open Crawler enables users to easily ingest web content into Elasticsearch.

> [!IMPORTANT]
> _The Open Crawler is currently in **tech-preview**_.
> Tech-preview features are subject to change and are not covered by the support SLA of generally available (GA) features.
> Elastic plans to promote this feature to GA in a future release.

_Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._

### User workflow

Indexing web content with the Open Crawler requires:

1. Running an instance of Elasticsearch (on-prem, cloud, or serverless)
2. Cloning the Open Crawler repository (see [Setup](#setup))
3. Configuring a crawler config file (see [Configuring crawlers](#configuring-crawlers))
4. Using the CLI to begin a crawl job (see [CLI commands](#cli-commands))

### Execution logic

Open Crawler runs crawl jobs on command based on config files in the `config` directory.
Each URL endpoint found during the crawl will result in one document being indexed into Elasticsearch.

Open Crawler performs crawl jobs in a multithreaded environment, where one thread is used to visit one URL endpoint.
The crawl results from these threads are added to a pool of results.
These are indexed into Elasticsearch using the `_bulk` API once the pool reaches a configurable threshold.
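
For context, a `_bulk` request batches many index operations into a single Elasticsearch call. The sketch below only illustrates the shape of that API; the host, API key, index name, and document fields are placeholders, not Open Crawler's exact output.

```bash
# Illustration of the Elasticsearch _bulk API that Open Crawler relies on.
# The host, API key, index name, and document fields are placeholders.
curl -s -X POST "https://localhost:9200/_bulk" \
  -H "Authorization: ApiKey <encoded-api-key>" \
  -H "Content-Type: application/x-ndjson" \
  --data-binary @- <<'NDJSON'
{ "index": { "_index": "my-crawler-index-name" } }
{ "url": "https://example.com/page-1", "title": "Page 1" }
{ "index": { "_index": "my-crawler-index-name" } }
{ "url": "https://example.com/page-2", "title": "Page 2" }
NDJSON
```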

### Setup

#### Prerequisites

A running instance of Elasticsearch is required to index documents into.
If you don't have this set up yet, you can sign up for an [Elastic Cloud free trial](https://www.elastic.co/cloud/cloud-trial-overview) or check out the [quickstart guide for Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/master/quickstart.html).
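
If you want to sanity-check that the instance is reachable before continuing, you can query the cluster root endpoint; the URL and API key below are placeholders for your own deployment:

```bash
# Placeholder URL and credentials: replace with your own deployment details.
curl -H "Authorization: ApiKey <encoded-api-key>" "https://localhost:9200"
```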

#### Connecting to Elasticsearch

Open Crawler uses the `_bulk` API to index crawl results into Elasticsearch.
To facilitate this connection, Open Crawler needs either an API key or a username and password configured for access to the Elasticsearch instance.
If using an API key, ensure that the API key has read and write permissions for the index configured in `output_index`.

- See the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/security-api-create-api-key.html) for more details on managing API keys
- See the [elasticsearch.yml.example](config/elasticsearch.yml.example) file for all of the available Elasticsearch configurations for Open Crawler

<details>
<summary>Creating an API key</summary>

Here is an example of creating an API key with minimal permissions for Open Crawler.
This will return a JSON response with an `encoded` key.
The value of `encoded` is what Open Crawler can use in its configuration.

```bash
POST /_security/api_key
{
  "name": "my-api-key",
  "role_descriptors": {
    "my-crawler-role": {
      "cluster": ["all"],
      "indices": [
        {
          "names": ["my-crawler-index-name"],
          "privileges": ["all"]
        }
      ]
    }
  },
  "metadata": {
    "application": "my-crawler"
  }
}
```
</details>
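
As a rough sketch, the Elasticsearch-related settings in a crawler config file look something like the fragment below. The nested `elasticsearch` key names are assumptions for illustration; treat [elasticsearch.yml.example](config/elasticsearch.yml.example) as the authoritative reference.

```bash
# Hypothetical config fragment: the keys under `elasticsearch` are assumptions,
# check config/elasticsearch.yml.example for the real setting names.
cat > config/my-crawler.yml <<'YAML'
output_sink: elasticsearch
output_index: my-crawler-index-name
elasticsearch:
  host: https://localhost
  port: 9200
  api_key: <encoded-api-key>
YAML
```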

#### Running Open Crawler from Docker

Open Crawler has a Dockerfile that can be built and run locally.

1. Clone the repository: `git clone https://github.com/elastic/crawler.git`
2. Build the image: `docker build -t crawler-image .`
3. Run the container: `docker run -i -d --name crawler crawler-image`
    - `-i` allows the container to stay alive so CLI commands can be executed inside it
    - `-d` allows the container to run "detached" so you don't have to dedicate a terminal window to it
4. Confirm that CLI commands are working: `docker exec -it crawler bin/crawler version`
    - Execute other CLI commands from outside of the container by prepending `docker exec -it crawler <command>`
5. Create a config file for your crawler. See [Configuring crawlers](#configuring-crawlers) for next steps, and the combined example session below.
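
Putting those steps together, an end-to-end session might look like the following; the config file path and name are examples:

```bash
# Example session: build and start the container, verify the CLI,
# then copy a config file into the container and run a crawl.
git clone https://github.com/elastic/crawler.git
cd crawler
docker build -t crawler-image .
docker run -i -d --name crawler crawler-image
docker exec -it crawler bin/crawler version

# copy your crawler config into the container and start a crawl
docker cp /path/to/my-crawler.yml crawler:config/my-crawler.yml
docker exec -it crawler bin/crawler crawl config/my-crawler.yml
```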

#### Running Open Crawler from source

> [!TIP]
> We recommend running from source only if you are actively developing Open Crawler.

<details>
<summary>Instructions for running from source</summary>

ℹ️ Open Crawler uses both JRuby and Java.
We recommend using version managers for both.
When developing Open Crawler we use <b>rbenv</b> and <b>jenv</b>.
There are instructions for setting up these env managers here:

- [Official documentation for installing jenv](https://www.jenv.be/)
- [Official documentation for installing rbenv](https://github.com/rbenv/rbenv?tab=readme-ov-file#installation)

1. Clone the repository: `git clone https://github.com/elastic/crawler.git`
2. Go to the root of the Open Crawler directory and check that the expected Java and Ruby versions are being used:
    ```bash
    # should output the same version as `.ruby-version`
    $ ruby --version

    # should output the same version as `.java-version`
    $ java --version
    ```
3. If the versions seem correct, you can install dependencies:
    ```bash
    $ make install
    ```

    You can also use the env variable `CRAWLER_MANAGE_ENV` to have the install script automatically check whether `rbenv` and `jenv` are installed and that the correct versions are running on both.
    Doing this requires that you use both `rbenv` and `jenv` in your local setup:

    ```bash
    $ CRAWLER_MANAGE_ENV=true make install
    ```
</details>

### Configuring Crawlers

See [CONFIG.md](docs/CONFIG.md) for in-depth details on Open Crawler configuration files.

### CLI Commands

Open Crawler does not have a graphical user interface.
All interactions with Open Crawler take place through the CLI.
When given a command, Open Crawler will run until the process is finished.
Open Crawler is not kept alive in any way between commands.
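
For example, a crawl is a single run of the `crawl` command that exits once the job completes (the config path here is an example):

```bash
# Runs one crawl job in the foreground and exits when it finishes.
$ bin/crawler crawl config/my-crawler.yml
```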

See [CLI.md](docs/CLI.md) for a full list of CLI commands available for Open Crawler.

# CLI

Crawler CLI is a command-line interface for use in the terminal or scripts.
This is the only user interface for interacting with Crawler.

## Installation and Configuration

Ensure you complete the [setup](../README.md#setup) before using the CLI.

For instructions on configuring a Crawler, see [CONFIG.md](./CONFIG.md).

### CLI in Docker

If you are running a dockerized version of Crawler, you can run CLI commands in two ways:

1. Exec into the docker container and execute commands directly using `docker exec -it <container name> bash`
    - This requires no changes to CLI commands
    ```bash
    # exec into container
    $ docker exec -it crawler bash

    # move to crawler directory
    $ cd crawler

    # execute commands
    $ bin/crawler version
    ```
2. Execute commands externally using `docker exec -it <container name> <command>`
    ```bash
    # execute command directly without entering docker container
    $ docker exec -it crawler bin/crawler version
    ```

## Available commands

### Getting help

Use the `--help` or `-h` option with any command to get more information.

For example:
```bash
$ bin/crawler --help
> Commands:
>   crawler crawl CRAWL_CONFIG      # Run a crawl of the site
>   crawler validate CRAWL_CONFIG   # Validate crawler configuration
>   crawler version                 # Print version
```

### Commands

- [`crawler crawl`](#crawler-crawl)
- [`crawler validate`](#crawler-validate)
- [`crawler version`](#crawler-version)

#### `crawler crawl`

Crawls the configured domain in the provided config file.
Can optionally take a second configuration file for Elasticsearch settings.
See [CONFIG.md](./CONFIG.md) for details on the configuration files.

```bash
# crawl using only crawler config
$ bin/crawler crawl config/examples/parks-australia.yml
```

```bash
# crawl using crawler config and optional --es-config
$ bin/crawler crawl config/examples/parks-australia.yml --es-config=config/es.yml
```

#### `crawler validate`

Checks the configured domains in `domain_allowlist` to see if they can be crawled.

```bash
# when valid
$ bin/crawler validate path/to/crawler.yml
> Domain https://www.elastic.co is valid
```

```bash
# when invalid (e.g. has a redirect)
$ bin/crawler validate path/to/invalid-crawler.yml
> Domain https://elastic.co is invalid:
> The web server at https://elastic.co redirected us to a different domain URL (https://www.elastic.co/).
> If you want to crawl this site, please configure https://www.elastic.co as one of the domains.
```

#### `crawler version`

Checks the product version of Crawler.

```bash
$ bin/crawler version
> v0.2.0
```