Add docs for running official docker image #132

Merged · 1 commit · Sep 5, 2024
49 changes: 26 additions & 23 deletions README.md
@@ -15,7 +15,7 @@ _Open Crawler `v0.2` is confirmed to be compatible with Elasticsearch `v8.13.0`
Indexing web content with the Open Crawler requires:

1. Running an instance of Elasticsearch (on-prem, cloud, or serverless)
2. Cloning of the Open Crawler repository (see [Setup](#setup))
2. Running the official Docker image (see [Setup](#setup))
3. Configuring a crawler config file (see [Configuring crawlers](#configuring-crawlers))
4. Using the CLI to begin a crawl job (see [CLI commands](#cli-commands))

@@ -95,33 +95,27 @@ If using an API key, ensure that the API key has read and write permissions to a
```
</details>
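
For reference, an API key scoped this way can be created through the Elasticsearch security API. The sketch below is illustrative and not part of this README; the endpoint, credentials, index name, and privileges are placeholders to adapt to your deployment.

```bash
# Illustrative only: create an API key restricted to the crawler's target index.
# Adjust the endpoint, credentials, index name, and privileges for your setup.
curl -u elastic:$ELASTIC_PASSWORD -X POST "http://localhost:9200/_security/api_key" \
  -H "Content-Type: application/json" \
  -d '{
        "name": "crawler-api-key",
        "role_descriptors": {
          "crawler-writer": {
            "indices": [
              {
                "names": ["my-crawl-index"],
                "privileges": ["read", "write", "create_index"]
              }
            ]
          }
        }
      }'
```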

#### Running Open Crawler from Docker
#### Running Open Crawler with Docker

> [!IMPORTANT]
> **Do not trigger multiple crawl jobs that reference the same index simultaneously.**
A single crawl execution can be thought of as a single crawler.
Even if two crawl executions share a configuration file, the two crawl processes will not communicate with each other.
Two crawlers simultaneously interacting with a single index can lead to data loss.

Open Crawler has a Dockerfile that can be built and run locally.

1. Clone the repository: `git clone https://github.com/elastic/crawler.git`
2. Create a docker network `docker network create elastic`
3. Build the image `docker build -t crawler-image .`
4. Run the container
```bash
docker run \
  -i -d \
  --network elastic \
  --name crawler \
  crawler-image
```

1. Run the official Docker image
```bash
docker run -i -d \
  --network elastic \
  --name crawler \
  docker.elastic.co/integrations/crawler:0.2.0
```
- `-i` allows the container to stay alive so CLI commands can be executed inside it
- `-d` runs the container "detached", so you don't have to dedicate a terminal window to it
- `--network` ensures that if Elasticsearch is running in another Docker container on the same machine, both containers are on the same network (see the network sketch after this list)
5. Confirm that CLI commands are working `docker exec -it crawler bin/crawler version`
- Execute other CLI commands from outside of the container by prepending `docker exec -it crawler <command>`
6. Create a config file for your crawler. See [Configuring crawlers](#configuring-crawlers) for next steps.
2. Confirm that CLI commands are working: `docker exec -it crawler bin/crawler version`
3. Create a config file for your crawler.
4. See [Configuring crawlers](#configuring-crawlers) for next steps.

#### Running Open Crawler from source

@@ -168,19 +162,28 @@ Crawler has template configuration files that contain every configuration available
- [config/crawler.yml.example](config/crawler.yml.example)
- [config/elasticsearch.yml.example](config/elasticsearch.yml.example)

To use these files, make a copy in the same directory without the `.example` suffix:
To use these files, make a copy locally without the `.example` suffix.
Then remove the `#` comment-out characters from the configurations that you need.

You can then copy the file into your running Docker container.

```bash
$ cp config/crawler.yml.example config/crawler.yml
$ docker cp config/my-crawler.yml crawler:app/config/my-crawler.yml
```

> **Reviewer (Member):** I wonder what our position is on copying configs vs mounting volumes? Connectors require mounting, while crawler uses copying.
>
> **Author (Collaborator):** I hadn't considered this. Let's discuss as a team, and if mounting is better we can create an issue to update the docs.
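
Putting these commands together, one possible workflow (a sketch; `my-crawler.yml` is a placeholder filename) is to copy a template, edit it locally, then copy the result into the running container:

```bash
# Start from the template and give the copy your own name (placeholder filename)
cp config/crawler.yml.example config/my-crawler.yml

# Uncomment and edit the settings you need
$EDITOR config/my-crawler.yml

# Copy the finished config into the running crawler container
docker cp config/my-crawler.yml crawler:app/config/my-crawler.yml
```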


Crawler can be configured using two config files, a Crawler configuration and an Elasticsearch configuration.
The Elasticsearch configuration file is optional.
It exists so that users running multiple crawlers need only a single shared Elasticsearch configuration.
See [CONFIG.md](docs/CONFIG.md) for more details on these files.
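
As a sketch of that split, you might keep one shared Elasticsearch config alongside a crawler config and copy both into the container. The `--es-config` argument shown below is an assumption based on CONFIG.md, not something stated in this README, so verify the exact CLI usage there.

```bash
# Sketch: one shared Elasticsearch config plus a crawler config
docker cp config/elasticsearch.yml crawler:app/config/elasticsearch.yml
docker cp config/my-crawler.yml crawler:app/config/my-crawler.yml

# Assumed invocation (verify against CONFIG.md): point a crawl at both files
docker exec -it crawler bin/crawler crawl config/my-crawler.yml --es-config config/elasticsearch.yml
```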

### Running a Crawl Job

Once everything is configured, you can run a crawl job using the CLI:

```bash
$ docker exec -it crawler bin/crawler crawl path/to/my-crawler.yml
```

### Scheduling Recurring Crawl Jobs

Crawl jobs can also be scheduled to recur.
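
A minimal sketch of a recurring run, assuming the recurrence itself is defined inside the crawler config file (see the rest of this section and CONFIG.md for the exact keys):

```bash
# Assumes my-crawler.yml defines its own schedule; see CONFIG.md for the keys
docker exec -it crawler bin/crawler schedule config/my-crawler.yml
```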