Adding ES verification step + explicit best-effort index creation during ES Sink initialization #192

mattnowzari · 2025-01-30T20:35:10Z

Closes #53 and #172

This is a continuation of issue #53 but also closes #172.
This PR will add the following steps during the initialization of the ES Sink:

- A verification step that checks whether Crawler can reach the Elasticsearch instance provided in the configs
- An explicit attempt to create the output_index should the index ping fail (the index ping step was added in #186, Adding check to ES sink to check if index is present before crawling)

Thus, the flow during init will be like:
verify ES connection --> if all good, verify the output_index --> if index does not exist, attempt to create the index --> if index creation fails, system exit
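
The flow above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FakeClient`, its methods, and `initialize_sink` are hypothetical stand-ins so the example runs without an Elasticsearch instance, and the two error classes only mirror the SystemExit-derived errors this PR touches in lib/errors.rb.

```ruby
# Hypothetical error classes, modeled on the SystemExit-derived errors in lib/errors.rb.
class ESConnectionError < SystemExit; end
class UnableToCreateIndex < SystemExit; end

# Stand-in for an Elasticsearch client (an assumption for illustration only;
# this is NOT the elasticsearch-ruby API).
class FakeClient
  def initialize(reachable:, index_exists:, can_create:)
    @reachable = reachable
    @index_exists = index_exists
    @can_create = can_create
  end

  def reachable?
    @reachable
  end

  def index_exists?(_name)
    @index_exists
  end

  def create_index(_name)
    @can_create
  end
end

# Runs the pre-flight steps in order and returns the log lines it produced.
def initialize_sink(client, output_index)
  log = []
  # Step 1: fail out immediately if Elasticsearch cannot be reached.
  raise ESConnectionError, 'Unable to reach Elasticsearch' unless client.reachable?

  log << 'Connected to Elasticsearch'
  # Step 2: nothing more to do if the output index already exists.
  if client.index_exists?(output_index)
    log << "Index #{output_index} found"
    return log
  end

  # Step 3: best-effort creation; exit before any crawl starts if it fails.
  log << "Index #{output_index} not found, attempting to create it"
  raise UnableToCreateIndex, "Failed to create #{output_index}" unless client.create_index(output_index)

  log << "Index #{output_index} created"
  log
end

client = FakeClient.new(reachable: true, index_exists: false, can_create: true)
puts initialize_sink(client, 'my-index')
```

Because both failure points sit before any crawl work, a misconfigured sink stops at initialization instead of at the first _bulk call.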

Additional background:
While working on this, I discovered that technically speaking, the _bulk command that Crawler uses to upsert documents is capable of auto-creating the index if it doesn't exist. However, this is dependent on the user having auto_configure, create_index, or manage index privileges.

Therefore, while we may not need an explicit index creation attempt, it is good to have because we can then explicitly log that it happened, and also provide a safe point to fail out at should something go wrong vs. waiting for _bulk to be called, at which point a crawl would have already begun.

Checklists

Pre-Review Checklist

This PR does NOT contain credentials of any kind, such as API keys or username/passwords (double check crawler.yml.example and elasticsearch.yml.example)
This PR has a meaningful title
This PR links to all relevant GitHub issues that it fixes or partially addresses
- If there is no GitHub issue, please create it. Each PR should have a link to an issue
This PR has a thorough description
Covered the changes with automated tests
Tested the changes locally
Added a label for each target release version (example: v0.1.0)
Considered corresponding documentation changes

Related Pull Requests

#186

artem-shelkovnikov · 2025-02-04T12:37:15Z

lib/crawler/output_sink/elasticsearch.rb

+      def create_index
+        raise Errors::UnableToCreateIndex, system_logger.info("Failed to create #{config.output_index}") unless
+          client.indices.create(index: config.output_index)


I would honestly inline this method, because it's a single line and the method name does not reflect that it will cause a system exit.

I agree, this would be cleaner and have less mental overhead to read and understand

Actually - I just realized (again) why I had done this. Rubocop complains about Assignment Branch Condition size when I inline the raise __ unless __ line. I'm still not familiar with this particular Ruby-ism, so I'll research and see how I can inline this while still being Ruby-friendly.

This seems like a pretty silly thing to get hung up on, but I can't seem to get the verify_output_index() method to do everything I want it to do without exceeding the ABC threshold 😭

As a compromise, I've renamed the helper method to be more indicative of what it actually does. I might give this another go tomorrow with a fresh mind, am 110% open to other ideas here!
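
One possible shape for the compromise described above, offered as a sketch and not as the code that was merged: keep `verify_output_index` small by delegating to a helper whose name states that it may exit, and log before raising instead of passing the logger call as the error message. The names `create_index_or_exit`, `SinkPreflight`, `StubClient`, and `StubIndices` are hypothetical; only `indices.exists` and `indices.create` come from the diff and comments in this PR, and the stub assumes `create` returns a falsy value on failure. Whether this stays under the project's Rubocop ABC threshold is untested.

```ruby
class UnableToCreateIndex < SystemExit; end

# Hypothetical stand-in exposing the two index calls referenced in this PR.
StubIndices = Struct.new(:existing, :creatable) do
  def exists(index:)
    existing.include?(index)
  end

  def create(index:)
    return false unless creatable

    existing << index
    true
  end
end
StubClient = Struct.new(:indices)

class SinkPreflight
  attr_reader :messages

  def initialize(client, output_index)
    @client = client
    @output_index = output_index
    @messages = [] # stands in for system_logger
  end

  def verify_output_index
    return @messages << "Index #{@output_index} exists" if @client.indices.exists(index: @output_index)

    create_index_or_exit
  end

  private

  # The name signals the side effect: a failed creation terminates the process.
  def create_index_or_exit
    unless @client.indices.create(index: @output_index)
      @messages << "Failed to create #{@output_index}"
      raise UnableToCreateIndex, "Failed to create #{@output_index}"
    end
    @messages << "Created #{@output_index}"
  end
end

preflight = SinkPreflight.new(StubClient.new(StubIndices.new([], true)), 'my-index')
preflight.verify_output_index
puts preflight.messages
```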

artem-shelkovnikov · 2025-02-04T12:38:19Z

lib/errors.rb

+  class ESConnectionError < SystemExit; end
+
  # Raised when the desired output index does not exist. This is specific for Elasticsearch
  # sink. During initialization of the Elasticsearch sink, it will call indices.exists()
  # against the output_index value, and will continue if the index is found.
  # If it is not found, this error will be raised, which causes a system exit to occur.
-  class IndexDoesNotExistError < SystemExit; end
+  class UnableToCreateIndex < SystemExit; end


Why do we have exception classes that derive from SystemExit? Do we catch them? Why not raise SystemExit directly with a message instead?

A distinct error class that inherits from SystemExit was in response to a suggestion @navarone-feekery had made in my first iteration on this issue here. In practice, this UnableToCreateIndex is not caught, it falls through and terminates Crawler.

On one hand, I see the value in having more precise error naming (my naming is not precise here TBH, as it still does not indicate a System Exit, so this needs a fix no matter what)

On the other, I can see the value in calling SystemExit directly - it's one less layer of abstraction and it indicates exactly what is going to happen when index creation fails.

Compromise - rename this to something like SystemExitOnIndexCreateFailure or something similar? It's a bit wordy but is more descriptive.

I missed this in the last PR review, sorry. I think this should just be derived from StandardError. I don't see a reason to use SystemExit specifically when raising.

After doing some experimenting, I would like to make an argument for deriving the error from SystemExit - it results in a cleaner-looking fail state. When raising a StandardError, the pre-flight check will fail with quite a long stack trace, to the point where I have to scroll up to actually get to our custom log messaging.

A SystemExit-derived error will cleanly terminate at the failure point after outputting our log messaging about what the issue was (in this case, ES not being reachable).

If we are OK with a large stack trace accompanying this failure point, then I am OK with deriving from StandardError - it's just a matter of how we want the output to look.

Those are good arguments. I think a quick and clear fail out for the preflight check makes sense. Thanks for testing both cases!
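
The difference in output described above can be reproduced with a small standalone experiment. This sketch is an illustration under stated assumptions, not part of the PR: it runs two throwaway scripts in child Ruby processes and captures what each writes to stderr. The error class name is reused from lib/errors.rb; the log message and the `run_script` and `preflight_script` helpers are made up.

```ruby
require 'open3'
require 'rbconfig'

# Runs a snippet in a child Ruby process and returns [stderr, exit status].
def run_script(code)
  _stdout, stderr, status = Open3.capture3(RbConfig.ruby, '-e', code)
  [stderr, status.exitstatus]
end

# Builds a tiny script that logs a message and then raises an uncaught error
# derived from the given parent class.
def preflight_script(parent_class)
  <<~RUBY
    class ESConnectionError < #{parent_class}; end
    warn 'Failed to reach ES instance'
    raise ESConnectionError, 'connection check failed'
  RUBY
end

standard_err, standard_status = run_script(preflight_script('StandardError'))
exit_err, exit_status = run_script(preflight_script('SystemExit'))

puts "StandardError parent: exit status #{standard_status}"
puts standard_err # log line followed by the error class, message, and backtrace
puts "SystemExit parent: exit status #{exit_status}"
puts exit_err     # only the log line
```

One detail worth keeping in mind with this approach: raising a SystemExit subclass with only a message leaves the process exit status at 0. Passing a status first, for example `raise ESConnectionError.new(1, 'connection check failed')`, is one way to report failure to the calling shell.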

artem-shelkovnikov

🚀 🌓

github-actions · 2025-02-06T10:47:09Z

💚 Backport PR(s) successfully created

Status	Branch	Result
✅	0.2	#207

This backport PR will be merged automatically after passing CI.

…on during ES Sink initialization (#192) (#207) Backports the following commits to 0.2: - Adding ES verification step + explicit best-effort index creation during ES Sink initialization (#192) Co-authored-by: Matt Nowzari <matt.nowzari@elastic.co>

mattnowzari added 2 commits January 30, 2025 15:17

Revisiting ES Sink index check work, adding more checks for ES + explicit creation of missing output index

6ad75cd

Amended tests + new test cases for ES verification

c8bc895

mattnowzari added enhancement v0.2.1 v0.2.2 and removed v0.2.1 labels Jan 30, 2025

Merge branch 'main' into es_sink_initcheck_part_deux

8bd4c88

mattnowzari mentioned this pull request Jan 31, 2025

Confirm ES connection before starting crawls #53

Closed

Merge branch 'main' into es_sink_initcheck_part_deux

c4ee882

mattnowzari marked this pull request as ready for review February 3, 2025 14:07

mattnowzari requested a review from a team as a code owner February 3, 2025 14:07

mattnowzari requested a review from navarone-feekery February 3, 2025 14:07

artem-shelkovnikov reviewed Feb 4, 2025

mattnowzari added 3 commits February 4, 2025 13:54

Renamed create_index + changed StandardError rescue to Elastic Transport Error rescue

da42588

spec file fixes + renamed SystemExit-derived errors to better indicate what they do

f9919f8

Merge branch 'main' into es_sink_initcheck_part_deux

1e8168c

mattnowzari requested a review from artem-shelkovnikov February 4, 2025 21:55

artem-shelkovnikov approved these changes Feb 5, 2025

Merge branch 'main' into es_sink_initcheck_part_deux

3fff23a

mattnowzari merged commit 458f575 into main Feb 5, 2025
2 checks passed

mattnowzari deleted the es_sink_initcheck_part_deux branch February 5, 2025 14:53

navarone-feekery added v0.2.1 auto-backport and removed v0.2.2 labels Feb 6, 2025

github-actions bot mentioned this pull request Feb 6, 2025

[0.2] Adding ES verification step + explicit best-effort index creation during ES Sink initialization (#192) #207

Merged
