Merge branch 'main' into acrewdson/wolfi-fixes
acrewdson authored Aug 29, 2024
2 parents b7382d0 + 84f6f00 commit 83253dc
Showing 13 changed files with 233 additions and 32 deletions.
20 changes: 17 additions & 3 deletions README.md
@@ -72,8 +72,6 @@ If using an API key, ensure that the API key has read and write permissions to a
```
</details>



#### Running Open Crawler from Docker

> [!IMPORTANT]
@@ -142,7 +140,23 @@ Open Crawler has a Dockerfile that can be built and run locally.

### Configuring Crawlers

See [CONFIG.md](docs/CONFIG.md) for in-depth details on Open Crawler configuration files.
Crawler ships with template configuration files that contain every available configuration option.

- [config/crawler.yml.example](config/crawler.yml.example)
- [config/elasticsearch.yml.example](config/elasticsearch.yml.example)

To use these files, make a copy in the same directory without the `.example` suffix:

```bash
$ cp config/crawler.yml.example config/crawler.yml
```

Then uncomment (remove the leading `#`) the configuration options that you need.
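
For example (a sketch using the `log_level` setting that appears later in this template), an option stays inactive until its leading `#` is removed:

```yaml
## As shipped in config/crawler.yml.example (commented out, so the default applies):
#log_level: info

## After uncommenting and editing the value:
log_level: debug
```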

Crawler can be configured using two config files: a Crawler configuration and an Elasticsearch configuration.
The Elasticsearch configuration file is optional.
It exists so that users running multiple crawlers can share a single Elasticsearch configuration.
See [CONFIG.md](docs/CONFIG.md) for more details on these files.
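
As a rough sketch (the domain URL below is a placeholder and the values are illustrative, not recommendations), a minimal Crawler configuration needs little more than a domain list and an output sink:

```yaml
# Minimal crawler.yml sketch; see config/crawler.yml.example for the full set of options
domains:
  - url: https://www.example.com   # placeholder domain to crawl
output_sink: console               # `console` is the default; an Elasticsearch sink additionally needs ES connection settings
```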

### Scheduling Recurring Crawl Jobs

11 changes: 6 additions & 5 deletions config/crawler.yml.example
@@ -29,7 +29,7 @@
# pattern: /blog # the pattern string for the rule
#
# # An array of content extraction rules
# # See docs/features/CONTENT_EXTRACTION.md for more details on this feature
# # See docs/features/EXTRACTION_RULES.md for more details on this feature
# extraction_rulesets:
# - url_filters:
# - type: begins # Filter type, can be: begins | ends | contains | regex
@@ -79,15 +79,16 @@
#
## Authentication configurations.
## Only required if a site has some form of authentication.
#auth.domain: https://parksaustralia.gov.au
#auth.domain: https://my-auth-domain.com
#auth.type: basic
#auth.username: user
#auth.password: pass
#
## Whether document metadata from certain content types will be indexed or not.
## This does not allow binary content to be indexed from these files, only metadata.
#content_extraction_enabled: true
#content_extraction_mime_types:
## See docs/features/BINARY_CONTENT_EXTRACTION.md for more details.
#binary_content_extraction_enabled: true
#binary_content_extraction_mime_types:
# - application/pdf
# - application/msword
# - application/vnd.openxmlformats-officedocument.wordprocessingml.document
@@ -96,7 +97,7 @@
#
## ------------------------------- Logging -------------------------------------
#
# The log level for system logs. Defaults to `info`
## The log level for system logs. Defaults to `info`
#log_level: info
#
# Whether or not event logging is enabled for output to the shell running Crawler.
5 changes: 5 additions & 0 deletions config/examples/parks-australia.yml
@@ -1,4 +1,9 @@
# This is a sample config file for crawling the parksaustralia.gov.au website writing output to an ES index
#
# The configuration options in this example are not exhaustive. To see all possible configuration options,
# reference the config templates:
# - config/crawler.yml.example
# - config/elasticsearch.yml.example

# Domains allowed for the crawl
domains:
22 changes: 21 additions & 1 deletion docs/CLI.md
@@ -41,14 +41,15 @@ $ bin/crawler --help
> Commands:
> crawler crawl CRAWL_CONFIG # Run a crawl of the site
> crawler schedule CRAWL_CONFIG # Schedule a recurrent crawl of the site
> crawler validate CRAWL_CONFIG # Validate crawler configuration
> crawler version # Print version
```

### Commands


- [`crawler crawl`](#crawler-crawl)
- [`crawler schedule`](#crawler-schedule)
- [`crawler validate`](#crawler-validate)
- [`crawler version`](#crawler-version)

@@ -68,6 +69,25 @@ $ bin/crawler crawl config/examples/parks-australia.yml
$ bin/crawler crawl config/examples/parks-australia.yml --es-config=config/es.yml
```

#### `crawler schedule`

Creates a schedule to recurrently crawl the configured domain in the provided config file.
The scheduler uses a cron expression that is configured in the Crawler configuration file using the field `schedule.pattern`.
See [scheduling recurring crawl jobs](../README.md#scheduling-recurring-crawl-jobs) for details on scheduling.
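
As a sketch (the cron value is purely illustrative, and the dotted-key form simply mirrors the other settings shown in config/crawler.yml.example), the relevant configuration fragment might look like this:

```yaml
# Hypothetical scheduling fragment for a crawler config file
schedule.pattern: "0 2 * * *"   # example cron expression: run a crawl every day at 02:00
```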

Can optionally take a second configuration file for Elasticsearch settings.
See [CONFIG.md](./CONFIG.md) for details on the configuration files.

```bash
# schedule crawls using only crawler config
$ bin/crawler schedule config/examples/parks-australia.yml
```

```bash
# schedule crawls using crawler config and optional --es-config
$ bin/crawler schedule config/examples/parks-australia.yml --es-config=config/es.yml
```

#### `crawler validate`

Checks the configured domains in `domain_allowlist` to see if they can be crawled.
10 changes: 6 additions & 4 deletions docs/CONFIG.md
@@ -3,10 +3,12 @@
Configuration files live in the [config](../config) directory.
There are two kinds of configuration files:

1. Crawler configurations (provided as a positional argument)
2. Elasticsearch configurations (provided as an optional argument with `--es-config`)
- Crawler configuration - [config/crawler.yml.example](../config/crawler.yml.example)
- Elasticsearch configuration - [config/elasticsearch.yml.example](../config/elasticsearch.yml.example)

The Elasticsearch configuration file is optional.
It exists so that users running multiple crawlers can share a single Elasticsearch configuration.

These two configuration files allow crawl jobs to share a single Elasticsearch instance configuration.
There is no enforced path or naming convention for these files.
They are differentiated only by how they are provided to the CLI when running a crawl.
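
For illustration, a shared Elasticsearch configuration might carry only connection details. The field names and nesting below are assumptions for this sketch; config/elasticsearch.yml.example remains the authoritative reference:

```yaml
# Hypothetical config/elasticsearch.yml shared by several crawlers (field names assumed)
elasticsearch:
  host: http://localhost      # placeholder host
  port: 9200                  # placeholder port
  api_key: your-api-key-here  # placeholder credential
```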

@@ -49,7 +51,7 @@ When performing a crawl with only a crawl config:
$ bin/crawler crawl config/my-crawler.yml
```

When performing a crawl with only both a crawl config and an Elasticsearch config:
When performing a crawl with both a crawl config and an Elasticsearch config:

```shell
$ bin/crawler crawl config/my-crawler.yml --es-config config/elasticsearch.yml
2 changes: 1 addition & 1 deletion docs/DOCUMENT_SCHEMA.md
@@ -6,7 +6,7 @@ These documents have a predefined list of fields that are always included.
Crawler does not impose any mappings onto indices that it ingests docs into.
This means you are free to create whatever mappings you like for an index, so long as you create the mappings _before_ indexing any documents.

If any [content extraction](./features/CONTENT_EXTRACTION.md) rules have been configured, you can add more fields to the Elasticsearch documents.
If any [content extraction rules](./features/EXTRACTION_RULES.md) have been configured, you can add more fields to the Elasticsearch documents.
However, the predefined fields can never be changed or overwritten by content extraction rules.
If you are ingesting onto an index that has custom mappings, be sure that the mappings don't conflict with these predefined fields.

30 changes: 30 additions & 0 deletions docs/features/BINARY_CONTENT_EXTRACTION.md
@@ -0,0 +1,30 @@
# Binary Content Extraction

The web crawler can extract content from downloadable binary files, such as PDF and DOCX files.
Binary content is extracted by converting the file contents to base64 and including the output in the document to be indexed.
This value is picked up by an [Elasticsearch ingest pipeline](https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html), which converts the base64 content into plain text and stores it in the `body` field of the same document.

## Using this feature

1. Enable ingest pipelines in the Elasticsearch configuration
2. Enable binary content extraction in the Crawler configuration
3. Select which MIME types should have their contents extracted
- The MIME type is determined by the HTTP response’s `Content-Type` header when downloading a given file
- While intended primarily for PDF and Microsoft Office formats, you can use any of the formats supported by [Apache Tika](https://tika.apache.org/)
    - No default MIME types are defined, so at least one MIME type must be configured in order to extract non-HTML content
- The ingest attachment processor does not support compressed files, e.g., an archive file containing a set of PDFs

For example, the following configuration enables binary content extraction for PDF and DOCX files, using the default pipeline `ent-search-generic-ingestion`:

```yaml
binary_content_extraction_enabled: true
binary_content_extraction_mime_types:
- application/pdf
- application/msword

elasticsearch:
pipeline: ent-search-generic-ingestion
pipeline_enabled: true
```
Read more on ingest pipelines in Open Crawler [here](./INGEST_PIPELINES.md).
docs/features/CONTENT_EXTRACTION.md → docs/features/EXTRACTION_RULES.md (file renamed)
@@ -1,4 +1,9 @@
# Content Extraction
# Extraction Rules

This page explains the individual fields in the extraction ruleset configuration.
The last section provides [usage examples](#examples).

## Summary

Extraction rules enable you to customize how the crawler extracts content from webpages.
Extraction rules are configured in the Crawler config file.
@@ -113,3 +118,121 @@ Value can be anything except `null`.

The source that Crawler will try to extract content from.
Currently only `html` or `url` is supported.

## Examples

### Extracting from HTML

I have a simple website for an RPG.
A page describing cities in the RPG is hosted at `https://totally-real-rpg.com/cities`.
The HTML for this page looks like this:

```html
<!DOCTYPE html>
<html>
<body>
<div>Cities:</div>
<div class="city">Summerstay</div>
<div class="city">Drenchwell</div>
<div class="city">Mezzoterran</div>
</body>
</html>
```

I want to extract all of the cities as an array, but only from the webpage that ends with `/cities`.
First I must set the `url_filters` so that this extraction rule applies only to this URL.
Then I can define what the Crawler should do when it encounters this webpage.

```yaml
domains:
- url: https://totally-real-rpg.com
extraction_rulesets:
- url_filters:
- type: "ends"
pattern: "/cities"
rules:
- action: "extract"
field_name: "cities"
selector: ".city"
join_as: "array"
source: "html"
```
In this example, the output document will include the following field on top of the standard crawl result fields:
```json
{
"cities": ["Summerstay", "Drenchwell", "Mezzoterran"]
}
```

### Extracting from URLs

Now, I also have a blog on this website.
There are three posts on this blog, which fall under the following URLs:

- https://totally-real-rpg.com/blog/2023/12/25/beginners-guide
- https://totally-real-rpg.com/blog/2024/01/07/patch-1.0-changes
- https://totally-real-rpg.com/blog/2024/02/18/upcoming-server-maintenance

When these pages are crawled, I want to capture only the year that the blog post was published.
First I should define the `url_filters` so that this extraction only applies to blogs.
Then I can use a `regex` selector in the rule to fetch the year from the URL.

```yaml
domains:
- url: https://totally-real-rpg.com
extraction_rulesets:
- url_filters:
- type: "begins"
pattern: "/blog"
rules:
- action: "extract"
field_name: "publish_year"
selector: "blog\/([0-9]{4})"
join_as: "string"
source: "url"
```
In this example, the ingested documents will include the following fields on top of the standard crawl result fields:
- https://totally-real-rpg.com/blog/2023/12/25/beginners-guide
```json
{ "publish_year": "2023" }
```
- https://totally-real-rpg.com/blog/2024/01/07/patch-1.0-changes
```json
{ "publish_year": "2024" }
```
- https://totally-real-rpg.com/blog/2024/02/18/upcoming-server-maintenance
```json
{ "publish_year": "2024" }
```

### Multiple rulesets

There's no limit to the number of extraction rulesets that can be defined for a single Crawler.
Taking the above two examples, we can combine them into a single configuration.

```yaml
domains:
- url: https://totally-real-rpg.com
extraction_rulesets:
- url_filters:
- type: "ends"
pattern: "/cities"
rules:
- action: "extract"
field_name: "cities"
selector: ".city"
join_as: "array"
source: "html"
- url_filters:
- type: "begins"
pattern: "/blog"
rules:
- action: "extract"
field_name: "publish_year"
selector: "blog\/([0-9]{4})"
join_as: "string"
source: "url"
```
6 changes: 6 additions & 0 deletions docs/features/INGEST_PIPELINES.md
@@ -0,0 +1,6 @@
# Ingest Pipelines

Open Crawler uses an [Elasticsearch ingest pipeline](https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html) to power several content extraction features.
The default pipeline, `ent-search-generic-ingestion`, is automatically created when Elasticsearch first starts.
This pipeline performs pre-processing on documents that Open Crawler ingests, before they are stored in Elasticsearch.
See [Ingest pipelines for Search indices](https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest-pipeline-search.html) for more details on this pipeline.
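
For reference, a sketch of the pipeline-related crawler settings, based on the `elasticsearch.pipeline` fields shown in the binary content extraction example above:

```yaml
elasticsearch:
  pipeline: ent-search-generic-ingestion   # name of the ingest pipeline to apply
  pipeline_enabled: true                   # toggle ingest pipeline processing for crawled documents
```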
10 changes: 5 additions & 5 deletions lib/crawler/api/config.rb
@@ -110,9 +110,9 @@ class Config # rubocop:disable Metrics/ClassLength
:max_indexed_links_count, # Number of links to extract for indexing
:max_headings_count, # HTML heading tags count limit

# Content extraction (from files)
:content_extraction_enabled, # Enable content extraction of non-HTML files found during a crawl
:content_extraction_mime_types, # Extract files with the following MIME types
# Binary content extraction (from files)
:binary_content_extraction_enabled, # Enable content extraction of non-HTML files found during a crawl
:binary_content_extraction_mime_types, # Extract files with the following MIME types

# Other crawler tuning settings
:default_encoding, # Default encoding used for responses that do not specify a charset
@@ -166,8 +166,8 @@ class Config # rubocop:disable Metrics/ClassLength
max_indexed_links_count: 25,
max_headings_count: 25,

content_extraction_enabled: false,
content_extraction_mime_types: [],
binary_content_extraction_enabled: false,
binary_content_extraction_mime_types: [],

output_sink: :console,
url_queue: :memory_only,
2 changes: 1 addition & 1 deletion lib/crawler/http_executor.rb
@@ -330,7 +330,7 @@ def generate_content_extractable_file_crawl_result(crawl_task:, response:, respo

#-------------------------------------------------------------------------------------------------
def content_extractable_file_mime_types
config.content_extraction_enabled ? config.content_extraction_mime_types.map(&:downcase) : []
config.binary_content_extraction_enabled ? config.binary_content_extraction_mime_types.map(&:downcase) : []
end

#-------------------------------------------------------------------------------------------------