Improve setup docs and add CLI docs (#44)
- Add version compatibility
- Add tech-preview status
- Clean up Setup section
- Add CLI.md
- Move crawl instructions to CLI.md

---------

Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>
2 people authored and elastic committed Jun 17, 2024
1 parent ae66978 commit 4c6520f
Showing 2 changed files with 201 additions and 99 deletions.
201 changes: 102 additions & 99 deletions README.md
@@ -1,136 +1,139 @@
# Elastic Open Web Crawler

This repository contains code for the Elastic Open Web Crawler.
Open Crawler enables users to easily ingest web content into Elasticsearch.

> [!IMPORTANT]
> _The Open Crawler is currently in **tech-preview**_.
> Tech-preview features are subject to change and are not covered by the support SLA of generally available (GA) features.
> Elastic plans to promote this feature to GA in a future release.

_Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._

### User workflow

Indexing web content with the Open Crawler requires:

1. Running an instance of Elasticsearch (on-prem, cloud, or serverless)
2. Cloning the Open Crawler repository (see [Setup](#setup))
3. Configuring a crawler config file (see [Configuring crawlers](#configuring-crawlers))
4. Using the CLI to begin a crawl job (see [CLI commands](#cli-commands))

### Execution logic

Open Crawler runs crawl jobs on command based on config files in the `config` directory.
Each URL endpoint found during the crawl will result in one document being indexed into Elasticsearch.

Open Crawler performs crawl jobs in a multithreaded environment, where one thread is used to visit one URL endpoint.
The crawl results from each visited endpoint are added to a pool of results.
These are indexed into Elasticsearch using the `_bulk` API once the pool reaches a configurable threshold.

### Setup

#### Prerequisites

Open Crawler requires a running instance of Elasticsearch to index documents into.
If you don't have this set up yet, you can sign up for an [Elastic Cloud free trial](https://www.elastic.co/cloud/cloud-trial-overview) or check out the [quickstart guide for Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/master/quickstart.html).
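
If you already have a deployment, you can quickly confirm that it is reachable before configuring Open Crawler. The command below is only an illustration; replace the URL and credentials with your own deployment's values.

```bash
# confirm Elasticsearch is reachable (placeholder URL and credentials)
$ curl -u elastic:<password> https://localhost:9200
```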
#### Connecting to Elasticsearch

Open Crawler will attempt to use the `_bulk` API to index crawl results into Elasticsearch.
To facilitate this connection, Open Crawler needs to have either an API key or a username/password configured to access the Elasticsearch instance.
If using an API key, ensure that the API key has read and write permissions to access the index configured in `output_index`.

- See the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/security-api-create-api-key.html) on managing API keys for more details
- See the [elasticsearch.yml.example](config/elasticsearch.yml.example) file for all of the available Elasticsearch configurations for Crawler

<details>
<summary>Creating an API key</summary>
Here is an example of creating an API key with minimal permissions for Open Crawler.
This will return a JSON response with an `encoded` key.
The value of `encoded` is what Open Crawler can use in its configuration.

```bash
POST /_security/api_key
{
  "name": "my-api-key",
  "role_descriptors": {
    "my-crawler-role": {
      "cluster": ["all"],
      "indices": [
        {
          "names": ["my-crawler-index-name"],
          "privileges": ["all"]
        }
      ]
    }
  },
  "metadata": {
    "application": "my-crawler"
  }
}
```
</details>
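
Once you have the `encoded` value, add it to your crawler configuration's Elasticsearch settings. The snippet below is an illustrative sketch only; the exact field names and structure are defined in [elasticsearch.yml.example](config/elasticsearch.yml.example) and [CONFIG.md](docs/CONFIG.md).

```yaml
# Illustrative sketch — check elasticsearch.yml.example and CONFIG.md for the authoritative keys
output_sink: elasticsearch
output_index: my-crawler-index-name
elasticsearch:
  host: http://localhost
  port: 9200
  api_key: <encoded value from the API key response>
```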

#### Running Open Crawler from Docker

Open Crawler has a Dockerfile that can be built and run locally.

1. Clone the repository: `git clone https://github.com/elastic/crawler.git`
2. Build the image `docker build -t crawler-image .`
3. Run the container `docker run -i -d --name crawler crawler-image`
- `-i` allows the container to stay alive so CLI commands can be executed inside it
- `-d` allows the container to run "detached" so you don't have to dedicate a terminal window to it
4. Confirm that CLI commands are working `docker exec -it crawler bin/crawler version`
- Execute other CLI commands from outside of the container by prepending `docker exec -it crawler <command>`
5. Create a config file for your crawler and copy it into the container (see the example below). See [Configuring crawlers](#configuring-crawlers) for next steps.
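
For example, to copy a local config file into the running container and start a crawl with it:

```bash
# copy the config file into the container (if you haven't already done so)
$ docker cp /path/to/my-crawler.yml crawler:config/my-crawler.yml

# run a crawl using that config
$ docker exec -it crawler bin/crawler crawl config/my-crawler.yml
```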

#### Running Open Crawler from source

> [!TIP]
> We recommend running from source only if you are actively developing Open Crawler.

<details>
<summary>Instructions for running from source</summary>
ℹ️ Open Crawler uses both JRuby and Java.
We recommend using version managers for both.
When developing Open Crawler we use <b>rbenv</b> and <b>jenv</b>.
There are instructions for setting up these env managers here:

- [Official documentation for installing jenv](https://www.jenv.be/)
- [Official documentation for installing rbenv](https://github.com/rbenv/rbenv?tab=readme-ov-file#installation)

1. Clone the repository: `git clone https://github.com/elastic/crawler.git`
2. Go to the root of the Open Crawler directory and check the expected Java and Ruby versions are being used:
   ```bash
   # should output the same version as `.ruby-version`
   $ ruby --version

   # should output the same version as `.java-version`
   $ java --version
   ```
3. If the versions seem correct, you can install dependencies:
   ```bash
   $ make install
   ```

You can also use the env variable `CRAWLER_MANAGE_ENV` to have the install script automatically check whether `rbenv` and `jenv` are installed and that the correct versions are running. Doing this requires that you use both `rbenv` and `jenv` in your local setup:

```bash
$ CRAWLER_MANAGE_ENV=true make install
```
</details>
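
Whether you run from Docker or from source, you can confirm that the CLI is responding by checking the version:

```bash
# from source, run in the repository root; from Docker, prefix with `docker exec -it crawler`
$ bin/crawler version
```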

### Configuring Crawlers

See [CONFIG.md](docs/CONFIG.md) for in-depth details on Open Crawler configuration files.
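
For orientation, a crawler config is a YAML file kept in the `config` directory. The sketch below is illustrative only; the authoritative field names and the full set of options are documented in [CONFIG.md](docs/CONFIG.md).

```yaml
# config/my-crawler.yml — illustrative sketch; see CONFIG.md for the real options
domain_allowlist:
  - https://www.elastic.co
output_sink: elasticsearch
output_index: my-crawler-index-name
```

Once you have a config file, you can check that its domains are crawlable with `bin/crawler validate config/my-crawler.yml` (see [CLI.md](docs/CLI.md)).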

### CLI Commands

Open Crawler does not have a graphical user interface.
All interactions with Open Crawler take place through the CLI.
When given a command, Open Crawler will run until the process is finished.
Open Crawler is not kept alive in any way between commands.

See [CLI.md](docs/CLI.md) for a full list of CLI commands available for Crawler.
99 changes: 99 additions & 0 deletions docs/CLI.md
@@ -0,0 +1,99 @@
# CLI

The Crawler CLI is a command-line interface for use in the terminal or in scripts.
It is the only user interface for interacting with Crawler.

## Installation and Configuration

Ensure you complete the [setup](../README.md#setup) before using the CLI.

For instructions on configuring a Crawler, see [CONFIG.md](./CONFIG.md).

### CLI in Docker

If you are running a dockerized version of Crawler, you can run CLI commands in two ways:

1. Exec into the docker container and execute commands directly using `docker exec -it <container name> bash`
- This requires no changes to CLI commands
```bash
# exec into container
$ docker exec -it crawler bash

# move to crawler directory
$ cd crawler

# execute commands
$ bin/crawler version
```
2. Execute commands externally using `docker exec -it <container name> <command>`
```bash
# execute command directly without entering docker container
$ docker exec -it crawler bin/crawler version
```

## Available commands
### Getting help
Use the `--help` or `-h` option with any command to get more information.

For example:
```bash
$ bin/crawler --help
> Commands:
> crawler crawl CRAWL_CONFIG # Run a crawl of the site
> crawler validate CRAWL_CONFIG # Validate crawler configuration
> crawler version # Print version
```

### Commands


- [`crawler crawl`](#crawler-crawl)
- [`crawler validate`](#crawler-validate)
- [`crawler version`](#crawler-version)

#### `crawler crawl`

Crawls the configured domains in the provided config file.
Can optionally take a second configuration file for Elasticsearch settings.
See [CONFIG.md](./CONFIG.md) for details on the configuration files.

```bash
# crawl using only crawler config
$ bin/crawler crawl config/examples/parks-australia.yml
```

```bash
# crawl using crawler config and optional --es-config
$ bin/crawler crawl config/examples/parks-australia.yml --es-config=config/es.yml
```

#### `crawler validate`

Checks the configured domains in `domain_allowlist` to see if they can be crawled.

```bash
# when valid
$ bin/crawler validate path/to/crawler.yml
> Domain https://www.elastic.co is valid
```

```bash
# when invalid (e.g. has a redirect)
$ bin/crawler validate path/to/invalid-crawler.yml
> Domain https://elastic.co is invalid:
> The web server at https://elastic.co redirected us to a different domain URL (https://www.elastic.co/).
> If you want to crawl this site, please configure https://www.elastic.co as one of the domains.
```

#### `crawler version`

Prints the product version of Crawler.

```bash
$ bin/crawler version
> v0.2.0
```
