diff --git a/README.md b/README.md index 94ae1b15..711e123b 100644 --- a/README.md +++ b/README.md @@ -72,8 +72,6 @@ If using an API key, ensure that the API key has read and write permissions to a ``` - - #### Running Open Crawler from Docker > [!IMPORTANT] @@ -142,7 +140,23 @@ Open Crawler has a Dockerfile that can be built and run locally. ### Configuring Crawlers -See [CONFIG.md](docs/CONFIG.md) for in-depth details on Open Crawler configuration files. +Crawler provides template configuration files that contain every available configuration option. + +- [config/crawler.yml.example](config/crawler.yml.example) +- [config/elasticsearch.yml.example](config/elasticsearch.yml.example) + +To use these files, make a copy in the same directory without the `.example` suffix: + +```bash +$ cp config/crawler.yml.example config/crawler.yml +``` + +Then remove the `#` comment-out characters from the configurations that you need. + +Crawler can be configured using two config files: a Crawler configuration and an Elasticsearch configuration. +The Elasticsearch configuration file is optional. +It exists so that users running multiple crawlers can share a single Elasticsearch configuration. +See [CONFIG.md](docs/CONFIG.md) for more details on these files. ### Scheduling Recurring Crawl Jobs diff --git a/config/crawler.yml.example b/config/crawler.yml.example index e30aef99..cf2c2737 100644 --- a/config/crawler.yml.example +++ b/config/crawler.yml.example @@ -79,13 +79,14 @@ # ## Authentication configurations. ## Only required if a site has some form of authentication. -#auth.domain: https://parksaustralia.gov.au +#auth.domain: https://my-auth-domain.com #auth.type: basic #auth.username: user #auth.password: pass # ## Whether document metadata from certain content types will be indexed or not. ## This does not allow binary content to be indexed from these files, only metadata. +## See docs/features/BINARY_CONTENT_EXTRACTION.md for more details.
#binary_content_extraction_enabled: true #binary_content_extraction_mime_types: # - application/pdf @@ -96,7 +97,7 @@ # ## ------------------------------- Logging ------------------------------------- # -# The log level for system logs. Defaults to `info` +## The log level for system logs. Defaults to `info` #log_level: info # # Whether or not event logging is enabled for output to the shell running Crawler. diff --git a/config/examples/parks-australia.yml b/config/examples/parks-australia.yml index e47346b3..54de3c86 100644 --- a/config/examples/parks-australia.yml +++ b/config/examples/parks-australia.yml @@ -1,4 +1,9 @@ # This is a sample config file for crawling the parksaustralia.gov.au website writing output to an ES index +# +# The configuration options in this example are not exhaustive. To see all possible configuration options, +# reference the config templates: +# - config/crawler.yml.example +# - config/elasticsearch.yml.example # Domains allowed for the crawl domains: diff --git a/docs/CONFIG.md b/docs/CONFIG.md index 8d474fcd..0a70b37e 100644 --- a/docs/CONFIG.md +++ b/docs/CONFIG.md @@ -3,10 +3,12 @@ Configuration files live in the [config](../config) directory. There are two kinds of configuration files: -1. Crawler configurations (provided as a positional argument) -2. Elasticsearch configurations (provided as an optional argument with `--es-config`) +- Crawler configuration - [config/crawler.yml.example](../config/crawler.yml.example) +- Elasticsearch configuration - [config/elasticsearch.yml.example](../config/elasticsearch.yml.example) + +The Elasticsearch configuration file is optional. +It exists so that users running multiple crawlers can share a single Elasticsearch configuration. -There two configuration file arguments allow crawl jobs to share Elasticsearch instance configuration. There are no enforced pathing or naming for these files. They are differentiated only by how they are provided to the CLI when running a crawl.
@@ -49,7 +51,7 @@ When performing a crawl with only a crawl config: $ bin/crawler crawl config/my-crawler.yml ``` -When performing a crawl with only both a crawl config and an Elasticsearch config: +When performing a crawl with both a crawl config and an Elasticsearch config: ```shell $ bin/crawler crawl config/my-crawler.yml --es-config config/elasticsearch.yml
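
The workflow this PR documents (copy a template, uncomment what you need, then run a crawl) might look like the following minimal `config/crawler.yml` sketch. The `domains` key and `log_level` option appear in the files referenced above; the `url` subkey and its value are illustrative assumptions, so check `config/crawler.yml.example` for the authoritative option names:

```yaml
# Minimal crawler config sketch (hypothetical values).
# Copied from config/crawler.yml.example with the needed
# options uncommented; see the template for all options.

# Domains allowed for the crawl (subkey structure assumed here)
domains:
  - url: https://parksaustralia.gov.au

# The log level for system logs. Defaults to `info`
log_level: info
```

With an optional shared Elasticsearch configuration, the same crawl is run as `bin/crawler crawl config/crawler.yml --es-config config/elasticsearch.yml`, as shown in the CONFIG.md hunk above.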