Clean up config docs
navarone-feekery committed Aug 29, 2024
1 parent 0c1dbbf commit 1f6fd28
Showing 4 changed files with 31 additions and 9 deletions.
20 changes: 17 additions & 3 deletions README.md
@@ -72,8 +72,6 @@ If using an API key, ensure that the API key has read and write permissions to a
```
</details>



#### Running Open Crawler from Docker

> [!IMPORTANT]
@@ -142,7 +140,23 @@ Open Crawler has a Dockerfile that can be built and run locally.

### Configuring Crawlers

See [CONFIG.md](docs/CONFIG.md) for in-depth details on Open Crawler configuration files.
Crawler has template configuration files that contain every available configuration option.

- [config/crawler.yml.example](config/crawler.yml.example)
- [config/elasticsearch.yml.example](config/elasticsearch.yml.example)

To use these files, make a copy in the same directory without the `.example` suffix:

```bash
$ cp config/crawler.yml.example config/crawler.yml
```

Then uncomment the configurations that you need by removing the leading `#` characters.
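
For example, uncommenting the `log_level` setting from `config/crawler.yml.example` changes the copied file from:

```yaml
## The log level for system logs. Defaults to `info`
#log_level: info
```

to:

```yaml
## The log level for system logs. Defaults to `info`
log_level: info
```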

Crawler can be configured using two config files: a Crawler configuration and an Elasticsearch configuration.
The Elasticsearch configuration file is optional; it exists so that users running multiple crawlers can share a single Elasticsearch configuration.
See [CONFIG.md](docs/CONFIG.md) for more details on these files.
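
For example (using placeholder file names), a crawl can be run with just a Crawler config, or with a shared Elasticsearch config passed via `--es-config`:

```shell
# Crawler configuration only
$ bin/crawler crawl config/my-crawler.yml

# Crawler configuration plus a shared Elasticsearch configuration
$ bin/crawler crawl config/my-crawler.yml --es-config config/elasticsearch.yml
```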

### Scheduling Recurring Crawl Jobs

Expand Down
5 changes: 3 additions & 2 deletions config/crawler.yml.example
@@ -79,13 +79,14 @@
#
## Authentication configurations.
## Only required if a site has some form of authentication.
#auth.domain: https://parksaustralia.gov.au
#auth.domain: https://my-auth-domain.com
#auth.type: basic
#auth.username: user
#auth.password: pass
#
## Whether document metadata from certain content types will be indexed or not.
## This does not allow binary content to be indexed from these files, only metadata.
## See docs/features/BINARY_CONTENT_EXTRACTION.md for more details.
#binary_content_extraction_enabled: true
#binary_content_extraction_mime_types:
# - application/pdf
@@ -96,7 +97,7 @@
#
## ------------------------------- Logging -------------------------------------
#
# The log level for system logs. Defaults to `info`
## The log level for system logs. Defaults to `info`
#log_level: info
#
# Whether or not event logging is enabled for output to the shell running Crawler.
5 changes: 5 additions & 0 deletions config/examples/parks-australia.yml
@@ -1,4 +1,9 @@
# This is a sample config file for crawling the parksaustralia.gov.au website, writing output to an ES index
#
# The configuration options in this example are not exhaustive. To see all possible configuration options,
# reference the config templates:
# - config/crawler.yml.example
# - config/elasticsearch.yml.example

# Domains allowed for the crawl
domains:
10 changes: 6 additions & 4 deletions docs/CONFIG.md
@@ -3,10 +3,12 @@
Configuration files live in the [config](../config) directory.
There are two kinds of configuration files:

1. Crawler configurations (provided as a positional argument)
2. Elasticsearch configurations (provided as an optional argument with `--es-config`)
- Crawler configuration - [config/crawler.yml.example](../config/crawler.yml.example)
- Elasticsearch configuration - [config/elasticsearch.yml.example](../config/elasticsearch.yml.example)

The Elasticsearch configuration file is optional; it exists so that users running multiple crawlers can share a single Elasticsearch configuration.

These two configuration file arguments allow crawl jobs to share an Elasticsearch instance configuration.
There is no enforced pathing or naming for these files.
They are differentiated only by how they are provided to the CLI when running a crawl.

@@ -49,7 +51,7 @@ When performing a crawl with only a crawl config:
$ bin/crawler crawl config/my-crawler.yml
```

When performing a crawl with only both a crawl config and an Elasticsearch config:
When performing a crawl with both a crawl config and an Elasticsearch config:

```shell
$ bin/crawler crawl config/my-crawler.yml --es-config config/elasticsearch.yml
