From c72e5d5f4484e72733adaa6cc85d50617fdf0e08 Mon Sep 17 00:00:00 2001
From: Navarone Feekery <13634519+navarone-feekery@users.noreply.github.com>
Date: Tue, 3 Sep 2024 16:27:19 +0200
Subject: [PATCH] [0.2] Add CHANGELOG.md and upgrade to beta (#121) (#125)

# Backport

This will backport the following commits from `main` to `0.2`:
 - [Add CHANGELOG.md and upgrade to beta (#121)](https://github.com/elastic/crawler/pull/121)

### Questions ?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)
---
 README.md         |  6 +++---
 docs/CHANGELOG.md | 19 +++++++++++++++++++
 2 files changed, 22 insertions(+), 3 deletions(-)
 create mode 100644 docs/CHANGELOG.md

diff --git a/README.md b/README.md
index eb2294f..db5760e 100644
--- a/README.md
+++ b/README.md
@@ -4,11 +4,11 @@ This repository contains code for the Elastic Open Web Crawler.
 Open Crawler enables users to easily ingest web content into Elasticsearch.
 
 > [!IMPORTANT]
-> _The Open Crawler is currently in **tech-preview**_.
-Tech-preview features are subject to change and are not covered by the support SLA of generally available (GA) features.
+> _The Open Crawler is currently in **beta**_.
+Beta features are subject to change and are not covered by the support SLA of generally available (GA) features.
 Elastic plans to promote this feature to GA in a future release.
 
-_Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._
+_Open Crawler `v0.2` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._
 
 ### User workflow
 
diff --git a/docs/CHANGELOG.md b/docs/CHANGELOG.md
new file mode 100644
index 0000000..9b3e085
--- /dev/null
+++ b/docs/CHANGELOG.md
@@ -0,0 +1,19 @@
+# Open Crawler Changelog
+
+## Legend
+
+- 🚀 Feature
+- 🐛 Bugfix
+- 🔨 Refactor
+
+## `v0.2.0`
+
+- 🚀 Crawl jobs can now be scheduled using the CLI command `bin/crawler schedule`. See [CLI.md](./CLI.md#crawler-schedule).
+- 🚀 Crawler can now extract binary content from files. See [BINARY_CONTENT_EXTRACTION.md](./features/BINARY_CONTENT_EXTRACTION.md).
+- 🚀 Crawler will now purge outdated documents from the index at the end of the crawl. This is enabled by default. You can disable this by adding `purge_docs_enabled: false` to the crawler's yaml config file.
+- 🚀 Crawl rules can now be configured, allowing specified URLs to be allowed/denied. See [CRAWL_RULES.md](./features/CRAWL_RULES.md).
+- 🚀 Extraction rules using CSS, XPath, and URL selectors can now be applied to crawls. See [EXTRACTION_RULES.md](./features/EXTRACTION_RULES.md).
+- 🔨 The configuration field `content_extraction_enabled` is now `binary_content_extraction_enabled`.
+- 🔨 The configuration field `content_extraction_mime_types` is now `binary_content_extraction_mime_types`.
+- 🔨 The Elasticsearch document field `body_content` is now `body`.
+- 🔨 The format for config files has changed, so existing crawler configurations will not work. The new format can be referenced in the [crawler.yml.example](../config/crawler.yml.example) file.