[0.2] Add CHANGELOG.md and upgrade to beta (#121) (#125)
# Backport

This will backport the following commits from `main` to `0.2`:
- [Add CHANGELOG.md and upgrade to beta (#121)](#121)

<!--- Backport version: 9.4.3 -->

### Questions?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport).
navarone-feekery authored Sep 3, 2024
1 parent 3c0132d commit c72e5d5
Showing 2 changed files with 22 additions and 3 deletions.
6 changes: 3 additions & 3 deletions README.md
```diff
@@ -4,11 +4,11 @@ This repository contains code for the Elastic Open Web Crawler.
 Open Crawler enables users to easily ingest web content into Elasticsearch.
 
 > [!IMPORTANT]
-> _The Open Crawler is currently in **tech-preview**_.
-Tech-preview features are subject to change and are not covered by the support SLA of generally available (GA) features.
+> _The Open Crawler is currently in **beta**_.
+Beta features are subject to change and are not covered by the support SLA of generally available (GA) features.
 Elastic plans to promote this feature to GA in a future release.
 
-_Open Crawler `v0.1` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._
+_Open Crawler `v0.2` is confirmed to be compatible with Elasticsearch `v8.13.0` and above._
 
 ### User workflow
```
19 changes: 19 additions & 0 deletions docs/CHANGELOG.md
# Open Crawler Changelog

## Legend

- πŸš€ Feature
- πŸ› Bugfix
- πŸ”¨ Refactor

## `v0.2.0`

- πŸš€ Crawl jobs can now be scheduled using the CLI command `bin/crawler schedule`. See [CLI.md](./CLI.md#crawler-schedule).
- πŸš€ Crawler can now extract binary content from files. See [BINARY_CONTENT_EXTRACTION.md](./features/BINARY_CONTENT_EXTRACTION.md).
- πŸš€ Crawler will now purge outdated documents from the index at the end of the crawl. This is enabled by default. You can disable this by adding `purge_docs_enabled: false` to the crawler's yaml config file.
- πŸš€ Crawl rules can now be configured, allowing specified URLs to be allowed/denied. See [CRAWL_RULES.md](./features/CRAWL_RULES.md).
- πŸš€ Extraction rules using CSS, XPath, and URL selectors can now be applied to crawls. See [EXTRACTION_RULES.md](./features/EXTRACTION_RULES.md).
- πŸ”¨ The configuration field `content_extraction_enabled` is now `binary_content_extraction_enabled`.
- πŸ”¨ The configuration field `content_extraction_mime_types` is now `binary_content_extraction_mime_types`.
- πŸ”¨ The Elasticsearch document field `body_content` is now `body`.
- πŸ”¨ The format for config files has changed, so existing crawler configurations will not work. The new format can be referenced in the [crawler.yml.example](../config/crawler.yml.example) file; an illustrative sketch follows this list.
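
To make the configuration changes above concrete, here is a minimal sketch of a v0.2-style YAML config. Only `purge_docs_enabled`, `binary_content_extraction_enabled`, and `binary_content_extraction_mime_types` are taken from this changelog; every other key name (`domains`, `url`, `seed_urls`, `crawl_rules`, `schedule`) is an assumption for illustration, not the authoritative format — check [crawler.yml.example](../config/crawler.yml.example) for that.

```yaml
# Illustrative sketch only. Keys not named in the changelog above
# (domains, url, seed_urls, crawl_rules, schedule) are assumptions;
# the authoritative reference is config/crawler.yml.example.
domains:
  - url: https://www.example.com        # assumed: the site to crawl
    seed_urls:
      - https://www.example.com/blog    # assumed: where the crawl starts
    crawl_rules:                        # new in v0.2 (rule shape assumed; see CRAWL_RULES.md)
      - policy: deny
        type: begins
        pattern: /private

# New in v0.2: purge outdated documents after a crawl (enabled by default)
purge_docs_enabled: true

# Renamed in v0.2 from content_extraction_enabled / content_extraction_mime_types
binary_content_extraction_enabled: true
binary_content_extraction_mime_types:
  - application/pdf

# Assumed shape for the new scheduling feature (see CLI.md#crawler-schedule)
schedule:
  pattern: "0 6 * * *"                  # cron-style: every day at 06:00
```

A one-off crawl would then run with `bin/crawler crawl config/my-crawler.yml` (the `crawl` subcommand is an assumption here; only `bin/crawler schedule` is named in this changelog), and scheduled crawls with `bin/crawler schedule config/my-crawler.yml`. Note also that anything downstream reading the Elasticsearch `body_content` field must switch to `body`.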
