Improve setup docs and add CLI docs #44

Merged (9 commits) on Jun 7, 2024

91 changes: 36 additions & 55 deletions README.md
@@ -3,6 +3,14 @@
This repository contains code for the Elastic Open Web Crawler.
This is a tool that allows users to easily ingest web content into Elasticsearch.

⚠️ _The Open Crawler is currently in **tech-preview**_.
Tech-preview features are subject to change and are not covered by the support SLA of generally available (GA) features.
Elastic plans to promote this feature to GA in a future release.

ℹ️ The Open Crawler requires a running instance of Elasticsearch to index documents into.
If you don't have this set up yet, check out the [quickstart guide for Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/master/quickstart.html) to get started.
_Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._

## How it works

Crawler runs crawl jobs on command based on config files in the `config` directory.
@@ -16,50 +24,52 @@ The crawl results can be output in 3 different modes:

### Setup

To index crawl results into an Elasticsearch instance, you must first have one up and running.

#### Running from Docker

Crawler has a Dockerfile that can be built and run locally.

1. Clone the repository
2. Build the image `docker build -t crawler-image .`
3. Run the container `docker run -i -d --name crawler crawler-image`
   - `-i` allows the container to stay alive so CLI commands can be executed inside it
   - `-d` allows the container to run "detached" so you don't have to dedicate a terminal window to it
4. Confirm that Crawler commands are working `docker exec -it crawler bin/crawler version`
5. Execute other CLI commands from outside of the container by prepending `docker exec -it crawler <command>`.
   - See [CLI.md](docs/CLI.md) for examples.
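
For convenience, here are the same steps as a single shell session (assuming the repository has already been cloned and you are in its root directory):

```bash
# build the image from the repository root
$ docker build -t crawler-image .

# run the container detached (-d), keeping STDIN open (-i) so CLI commands can be run inside it
$ docker run -i -d --name crawler crawler-image

# confirm that Crawler commands are working
$ docker exec -it crawler bin/crawler version
```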

#### Running from source

_Note: Crawler uses both JRuby and Java.
We recommend using version managers for both.
When developing Crawler we use `rbenv` and `jenv`.
There are instructions for setting up these env managers here:_

- [Official documentation for installing jenv](https://www.jenv.be/)
- [Official documentation for installing rbenv](https://github.com/rbenv/rbenv?tab=readme-ov-file#installation)

1. Clone the repository
2. Go to the root of the Crawler directory and check the expected Java and Ruby versions are being used:
   ```bash
   # should output the same version as `.ruby-version`
   $ ruby --version

   # should output the same version as `.java-version`
   $ java --version
   ```
3. If the versions seem correct, you can install dependencies:
   ```bash
   $ make install
   ```

You can also use the env variable `CRAWLER_MANAGE_ENV` to have the install script automatically check that `rbenv` and `jenv` are installed and that the correct versions are in use for both.
Doing this requires that you use both `rbenv` and `jenv` in your local setup.

```bash
$ CRAWLER_MANAGE_ENV=true make install
```

Crawler should now be functional.
See [Configuring Crawlers](#configuring-crawlers) to begin crawling web content.
@@ -68,37 +78,9 @@

See [CONFIG.md](docs/CONFIG.md) for in-depth details on Crawler configuration files.
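
For orientation only, a minimal configuration might look roughly like the sketch below; the field names here (other than `domain_allowlist`) are assumptions, so treat [CONFIG.md](docs/CONFIG.md) as the authoritative reference.

```yaml
# config/my-crawler.yml (hypothetical example; see docs/CONFIG.md for the real schema)
domain_allowlist:          # domains Crawler is allowed to crawl
  - https://www.elastic.co
seed_urls:                 # assumed field name for the starting URLs of a crawl
  - https://www.elastic.co/
output_sink: console       # assumed field name for selecting one of the 3 output modes
```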

### CLI Commands

See [CLI.md](docs/CLI.md) for a full list of CLI commands available for Crawler.

### Connecting to Elasticsearch

@@ -132,5 +114,4 @@ POST /_security/api_key
"application": "my-crawler"
}
}

```
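
When indexing into Elasticsearch, the connection details (including an API key such as the one created above) go into an Elasticsearch configuration file like the `config/es.yml` passed via `--es-config` (see [CLI.md](docs/CLI.md)). A rough sketch follows; the field names are assumptions rather than the documented schema, so check [CONFIG.md](docs/CONFIG.md).

```yaml
# config/es.yml (hypothetical example; field names are assumptions, see docs/CONFIG.md)
elasticsearch:
  host: http://localhost        # Elasticsearch endpoint
  port: 9200
  api_key: <encoded-api-key>    # e.g. the key returned by POST /_security/api_key
```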
98 changes: 98 additions & 0 deletions docs/CLI.md
@@ -0,0 +1,98 @@
# CLI

Crawler CLI is a command-line interface for use in the terminal or scripts.

## Installation and Configuration

Ensure you complete the [setup](../README.md#setup) before using the CLI.

For instructions on configuring a Crawler, see [CONFIG.md](./CONFIG.md).

### CLI in Docker

If you are running a dockerized version of Crawler, you can run CLI commands in two ways:

1. Exec into the docker container and execute commands directly using `docker exec -it <container name> bash`
   - This requires no changes to CLI commands
   ```bash
   # exec into container
   $ docker exec -it crawler bash

   # move to crawler directory
   $ cd crawler

   # execute commands
   $ bin/crawler version
   ```
2. Execute commands externally using `docker exec -it <container name> <command>`
   ```bash
   # execute command directly without entering docker container
   $ docker exec -it crawler bin/crawler version
   ```
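
In both cases, a configuration file that exists only on your host machine must first be copied into the container before Crawler can use it, for example:

```bash
# copy the config file into the container (if you haven't already done so)
$ docker cp /path/to/my-crawler.yml crawler:config/my-crawler.yml

# then run CLI commands against it from outside the container
$ docker exec -it crawler bin/crawler validate config/my-crawler.yml
```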

## Available commands
### Getting help
Crawler CLI provides a `--help`/`-h` argument that can be used with any command to get more information.

For example:
```bash
$ bin/crawler --help

> Commands:
> crawler crawl CRAWL_CONFIG # Run a crawl of the site
> crawler validate CRAWL_CONFIG # Validate crawler configuration
> crawler version # Print version
```

### Commands


- [`crawler crawl`](#crawler-crawl)
- [`crawler validate`](#crawler-validate)
- [`crawler version`](#crawler-version)

#### `crawler crawl`

Crawls the configured domain in the provided config file.
Can optionally take a second configuration file for Elasticsearch settings.
See [CONFIG.md](./CONFIG.md) for details on the configuration files.

```bash
# crawl using only crawler config
$ bin/crawler crawl config/examples/parks-australia.yml
```

```bash
# crawl using crawler config and optional --es-config
$ bin/crawler crawl config/examples/parks-australia.yml --es-config=config/es.yml
```

#### `crawler validate`

Checks the configured domains in `domain_allowlist` to see if they can be crawled.

```bash
# when valid
$ bin/crawler validate path/to/crawler.yml

> Domain https://www.elastic.co is valid
```

```bash
# when invalid (e.g. has a redirect)
$ bin/crawler validate path/to/invalid-crawler.yml

> Domain https://elastic.co is invalid:
> The web server at https://elastic.co redirected us to a different domain URL (https://www.elastic.co/).
> If you want to crawl this site, please configure https://www.elastic.co as one of the domains.
```

#### `crawler version`

Prints the product version of Crawler.

```bash
$ bin/crawler version

> v0.2.0
```