Merge branch 'main' into acrewdson/wolfi-fixes
acrewdson authored Aug 29, 2024
2 parents b7382d0 + 84f6f00 commit 83253dc
Showing 13 changed files with 233 additions and 32 deletions.
20 changes: 17 additions & 3 deletions README.md
@@ -72,8 +72,6 @@ If using an API key, ensure that the API key has read and write permissions to a
```
</details>



#### Running Open Crawler from Docker

> [!IMPORTANT]
@@ -142,7 +140,23 @@ Open Crawler has a Dockerfile that can be built and run locally.

### Configuring Crawlers

See [CONFIG.md](docs/CONFIG.md) for in-depth details on Open Crawler configuration files.
Crawler ships with template configuration files that contain every available configuration option.

- [config/crawler.yml.example](config/crawler.yml.example)
- [config/elasticsearch.yml.example](config/elasticsearch.yml.example)

To use these files, make a copy in the same directory without the `.example` suffix:

```bash
$ cp config/crawler.yml.example config/crawler.yml
```

Then uncomment (remove the leading `#`) the configuration options that you need.
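
For example (a sketch using the `log_level` setting that appears later in this template), an option stays inactive until its leading `#` is removed:

```yaml
## As shipped in config/crawler.yml.example (commented out, so the default applies):
#log_level: info

## After uncommenting and editing the value:
log_level: debug
```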

Crawler can be configured using two config files: a Crawler configuration and an Elasticsearch configuration.
The Elasticsearch configuration file is optional.
It exists so that users running multiple crawlers can share a single Elasticsearch configuration.
See [CONFIG.md](docs/CONFIG.md) for more details on these files.
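
As a rough sketch (the domain URL below is a placeholder and the values are illustrative, not recommendations), a minimal Crawler configuration needs little more than a domain list and an output sink:

```yaml
# Minimal crawler.yml sketch; see config/crawler.yml.example for the full set of options
domains:
  - url: https://www.example.com   # placeholder domain to crawl
output_sink: console               # `console` is the default; an Elasticsearch sink additionally needs ES connection settings
```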

### Scheduling Recurring Crawl Jobs

11 changes: 6 additions & 5 deletions config/crawler.yml.example
@@ -29,7 +29,7 @@
# pattern: /blog # the pattern string for the rule
#
# # An array of content extraction rules
# # See docs/features/CONTENT_EXTRACTION.md for more details on this feature
# # See docs/features/EXTRACTION_RULES.md for more details on this feature
# extraction_rulesets:
# - url_filters:
# - type: begins # Filter type, can be: begins | ends | contains | regex
@@ -79,15 +79,16 @@
#
## Authentication configurations.
## Only required if a site has some form of authentication.
#auth.domain: https://parksaustralia.gov.au
#auth.domain: https://my-auth-domain.com
#auth.type: basic
#auth.username: user
#auth.password: pass
#
## Whether document metadata from certain content types will be indexed or not.
## This does not allow binary content to be indexed from these files, only metadata.
#content_extraction_enabled: true
#content_extraction_mime_types:
## See docs/features/BINARY_CONTENT_EXTRACTION.md for more details.
#binary_content_extraction_enabled: true
#binary_content_extraction_mime_types:
# - application/pdf
# - application/msword
# - application/vnd.openxmlformats-officedocument.wordprocessingml.document
@@ -96,7 +97,7 @@
#
## ------------------------------- Logging -------------------------------------
#
# The log level for system logs. Defaults to `info`
## The log level for system logs. Defaults to `info`
#log_level: info
#
# Whether or not event logging is enabled for output to the shell running Crawler.
5 changes: 5 additions & 0 deletions config/examples/parks-australia.yml
@@ -1,4 +1,9 @@
# This is a sample config file for crawling the parksaustralia.gov.au website writing output to an ES index
#
# The configuration options in this example are not exhaustive. To see all possible configuration options,
# reference the config templates:
# - config/crawler.yml.example
# - config/elasticsearch.yml.example

# Domains allowed for the crawl
domains:
22 changes: 21 additions & 1 deletion docs/CLI.md
@@ -41,14 +41,15 @@ $ bin/crawler --help
> Commands:
> crawler crawl CRAWL_CONFIG # Run a crawl of the site
> crawler schedule CRAWL_CONFIG # Schedule a recurrent crawl of the site
> crawler validate CRAWL_CONFIG # Validate crawler configuration
> crawler version # Print version
```

### Commands


- [`crawler crawl`](#crawler-crawl)
- [`crawler schedule`](#crawler-schedule)
- [`crawler validate`](#crawler-validate)
- [`crawler version`](#crawler-version)

@@ -68,6 +69,25 @@ $ bin/crawler crawl config/examples/parks-australia.yml
$ bin/crawler crawl config/examples/parks-australia.yml --es-config=config/es.yml
```

#### `crawler schedule`

Creates a schedule to recurrently crawl the configured domain in the provided config file.
The scheduler uses a cron expression that is configured in the Crawler configuration file using the field `schedule.pattern`.
See [scheduling recurring crawl jobs](../README.md#scheduling-recurring-crawl-jobs) for details on scheduling.
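
As a sketch (the cron value is purely illustrative, and the dotted-key form simply mirrors the other settings shown in config/crawler.yml.example), the relevant configuration fragment might look like this:

```yaml
# Hypothetical scheduling fragment for a crawler config file
schedule.pattern: "0 2 * * *"   # example cron expression: run a crawl every day at 02:00
```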

Can optionally take a second configuration file for Elasticsearch settings.
See [CONFIG.md](./CONFIG.md) for details on the configuration files.

```bash
# schedule crawls using only crawler config
$ bin/crawler schedule config/examples/parks-australia.yml
```

```bash
# schedule crawls using crawler config and optional --es-config
$ bin/crawler schedule config/examples/parks-australia.yml --es-config=config/es.yml
```

#### `crawler validate`

Checks the configured domains in `domain_allowlist` to see if they can be crawled.
10 changes: 6 additions & 4 deletions docs/CONFIG.md
@@ -3,10 +3,12 @@
Configuration files live in the [config](../config) directory.
There are two kinds of configuration files:

1. Crawler configurations (provided as a positional argument)
2. Elasticsearch configurations (provided as an optional argument with `--es-config`)
- Crawler configuration - [config/crawler.yml.example](../config/crawler.yml.example)
- Elasticsearch configuration - [config/elasticsearch.yml.example](../config/elasticsearch.yml.example)

The Elasticsearch configuration file is optional.
It exists so that users running multiple crawlers can share a single Elasticsearch configuration.

These two configuration files allow crawl jobs to share a single Elasticsearch instance configuration.
There is no enforced path or naming convention for these files.
They are differentiated only by how they are provided to the CLI when running a crawl.
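
For illustration, a shared Elasticsearch configuration might carry only connection details. The field names and nesting below are assumptions for this sketch; config/elasticsearch.yml.example remains the authoritative reference:

```yaml
# Hypothetical config/elasticsearch.yml shared by several crawlers (field names assumed)
elasticsearch:
  host: http://localhost      # placeholder host
  port: 9200                  # placeholder port
  api_key: your-api-key-here  # placeholder credential
```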

@@ -49,7 +51,7 @@ When performing a crawl with only a crawl config:
$ bin/crawler crawl config/my-crawler.yml
```

When performing a crawl with only both a crawl config and an Elasticsearch config:
When performing a crawl with both a crawl config and an Elasticsearch config:

```shell
$ bin/crawler crawl config/my-crawler.yml --es-config config/elasticsearch.yml
2 changes: 1 addition & 1 deletion docs/DOCUMENT_SCHEMA.md
@@ -6,7 +6,7 @@ These documents have a predefined list of fields that are always included.
Crawler does not impose any mappings onto indices that it ingests docs into.
This means you are free to create whatever mappings you like for an index, so long as you create the mappings _before_ indexing any documents.

If any [content extraction](./features/CONTENT_EXTRACTION.md) rules have been configured, you can add more fields to the Elasticsearch documents.
If any [content extraction rules](./features/EXTRACTION_RULES.md) have been configured, you can add more fields to the Elasticsearch documents.
However, the predefined fields can never be changed or overwritten by content extraction rules.
If you are ingesting onto an index that has custom mappings, be sure that the mappings don't conflict with these predefined fields.

30 changes: 30 additions & 0 deletions docs/features/BINARY_CONTENT_EXTRACTION.md
@@ -0,0 +1,30 @@
# Binary Content Extraction

The web crawler can extract content from downloadable binary files, such as PDF and DOCX files.
Binary content is extracted by converting the file contents to base64 and including the output in the document to be indexed.
This value is picked up by an [Elasticsearch ingest pipeline](https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html), which converts the base64 content into plain text and stores it in the `body` field of the same document.

## Using this feature

1. Enable ingest pipelines in the Elasticsearch configuration
2. Enable binary content extraction in the Crawler configuration
3. Select which MIME types should have their contents extracted
- The MIME type is determined by the HTTP response’s `Content-Type` header when downloading a given file
- While intended primarily for PDF and Microsoft Office formats, you can use any of the formats supported by [Apache Tika](https://tika.apache.org/)
    - No default MIME types are defined, so at least one MIME type must be configured in order to extract non-HTML content
- The ingest attachment processor does not support compressed files, e.g., an archive file containing a set of PDFs

For example, the following configuration enables binary content extraction for PDF and DOCX files, using the default pipeline `ent-search-generic-ingestion`:

```yaml
binary_content_extraction_enabled: true
binary_content_extraction_mime_types:
- application/pdf
- application/msword

elasticsearch:
pipeline: ent-search-generic-ingestion
pipeline_enabled: true
```
Read more on ingest pipelines in Open Crawler [here](./INGEST_PIPELINES.md).
docs/features/CONTENT_EXTRACTION.md → docs/features/EXTRACTION_RULES.md (file renamed)
@@ -1,4 +1,9 @@
# Content Extraction
# Extraction Rules

This page explains the individual fields in the extraction ruleset configuration.
The last section provides [usage examples](#examples).

## Summary

Extraction rules enable you to customize how the crawler extracts content from webpages.
Extraction rules are configured in the Crawler config file.
@@ -113,3 +118,121 @@ Value can be anything except `null`.

The source that Crawler will try to extract content from.
Currently only `html` or `url` is supported.

## Examples

### Extracting from HTML

I have a simple website for an RPG.
A page describing cities in the RPG is hosted at `https://totally-real-rpg.com/cities`.
The HTML for this page looks like this:

```html
<!DOCTYPE html>
<html>
<body>
<div>Cities:</div>
<div class="city">Summerstay</div>
<div class="city">Drenchwell</div>
<div class="city">Mezzoterran</div>
</body>
</html>
```

I want to extract all of the cities as an array, but only from the webpage that ends with `/cities`.
First I must set the `url_filters` so that this extraction rule applies only to this URL.
Then I can define what the Crawler should do when it encounters this webpage.

```yaml
domains:
- url: https://totally-real-rpg.com
extraction_rulesets:
- url_filters:
- type: "ends"
pattern: "/cities"
rules:
- action: "extract"
field_name: "cities"
selector: ".city"
join_as: "array"
source: "html"
```
In this example, the output document will include the following field on top of the standard crawl result fields:
```json
{
"cities": ["Summerstay", "Drenchwell", "Mezzoterran"]
}
```

### Extracting from URLs

Now, I also have a blog on this website.
There are three posts on this blog, which fall under the following URLs:

- https://totally-real-rpg.com/blog/2023/12/25/beginners-guide
- https://totally-real-rpg.com/blog/2024/01/07/patch-1.0-changes
- https://totally-real-rpg.com/blog/2024/02/18/upcoming-server-maintenance

When these pages are crawled, I want to capture only the year that the blog post was published.
First I should define the `url_filters` so that this extraction only applies to blogs.
Then I can use a `regex` selector in the rule to fetch the year from the URL.

```yaml
domains:
- url: https://totally-real-rpg.com
extraction_rulesets:
- url_filters:
- type: "begins"
pattern: "/blog"
rules:
- action: "extract"
field_name: "publish_year"
selector: "blog\/([0-9]{4})"
join_as: "string"
source: "url"
```
In this example, the ingested documents will include the following fields on top of the standard crawl result fields:
- https://totally-real-rpg.com/blog/2023/12/25/beginners-guide
```json
{ "publish_year": "2023" }
```
- https://totally-real-rpg.com/blog/2024/01/07/patch-1.0-changes
```json
{ "publish_year": "2024" }
```
- https://totally-real-rpg.com/blog/2024/02/18/upcoming-server-maintenance
```json
{ "publish_year": "2024" }
```

### Multiple rulesets

There's no limit to the number of extraction rulesets that can be defined for a single Crawler.
Taking the above two examples, we can combine them into a single configuration.

```yaml
domains:
- url: https://totally-real-rpg.com
extraction_rulesets:
- url_filters:
- type: "ends"
pattern: "/cities"
rules:
- action: "extract"
field_name: "cities"
selector: ".city"
join_as: "array"
source: "html"
- url_filters:
- type: "begins"
pattern: "/blog"
rules:
- action: "extract"
field_name: "publish_year"
selector: "blog\/([0-9]{4})"
join_as: "string"
source: "url"
```
6 changes: 6 additions & 0 deletions docs/features/INGEST_PIPELINES.md
@@ -0,0 +1,6 @@
# Ingest Pipelines

Open Crawler uses an [Elasticsearch ingest pipeline](https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html) to power several content extraction features.
The default pipeline, `ent-search-generic-ingestion`, is automatically created when Elasticsearch first starts.
This pipeline performs pre-processing on documents that Open Crawler ingests, before they are stored in Elasticsearch.
See [Ingest pipelines for Search indices](https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest-pipeline-search.html) for more details on this pipeline.
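
For reference, a sketch of the pipeline-related crawler settings, based on the `elasticsearch.pipeline` fields shown in the binary content extraction example above:

```yaml
elasticsearch:
  pipeline: ent-search-generic-ingestion   # name of the ingest pipeline to apply
  pipeline_enabled: true                   # toggle ingest pipeline processing for crawled documents
```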
10 changes: 5 additions & 5 deletions lib/crawler/api/config.rb
@@ -110,9 +110,9 @@ class Config # rubocop:disable Metrics/ClassLength
:max_indexed_links_count, # Number of links to extract for indexing
:max_headings_count, # HTML heading tags count limit

# Content extraction (from files)
:content_extraction_enabled, # Enable content extraction of non-HTML files found during a crawl
:content_extraction_mime_types, # Extract files with the following MIME types
# Binary content extraction (from files)
:binary_content_extraction_enabled, # Enable content extraction of non-HTML files found during a crawl
:binary_content_extraction_mime_types, # Extract files with the following MIME types

# Other crawler tuning settings
:default_encoding, # Default encoding used for responses that do not specify a charset
@@ -166,8 +166,8 @@ class Config # rubocop:disable Metrics/ClassLength
max_indexed_links_count: 25,
max_headings_count: 25,

content_extraction_enabled: false,
content_extraction_mime_types: [],
binary_content_extraction_enabled: false,
binary_content_extraction_mime_types: [],

output_sink: :console,
url_queue: :memory_only,
2 changes: 1 addition & 1 deletion lib/crawler/http_executor.rb
@@ -330,7 +330,7 @@ def generate_content_extractable_file_crawl_result(crawl_task:, response:, respo

#-------------------------------------------------------------------------------------------------
def content_extractable_file_mime_types
config.content_extraction_enabled ? config.content_extraction_mime_types.map(&:downcase) : []
config.binary_content_extraction_enabled ? config.binary_content_extraction_mime_types.map(&:downcase) : []
end

#-------------------------------------------------------------------------------------------------