Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding ES verification step + explicit best-effort index creation during ES Sink initialization #192

Merged
merged 8 commits into from
Feb 5, 2025

Conversation

mattnowzari
Copy link
Contributor

@mattnowzari mattnowzari commented Jan 30, 2025

Closes #53 and #172

This is a continuation of issue #53 but also closes #172.
This PR will add the following steps during the initialization of the ES Sink:

Thus, the flow during init will be like:
verify ES connection--> if all good, verify the output_index --> if index does not exist, attempt to create the index --> if index creation fails, system exit

Additional background:
While working on this, I discovered that technically speaking, the _bulk command that Crawler uses to upsert documents is capable of auto-creating the index if it doesn't exist. However, this is dependent on the user having auto_configure, create_index, or manage index privileges.

Therefore, while we may not need an explicit index creation attempt, it is good to have because we can then explicitly log that it happened, and also provide a safe point to fail out at should something go wrong vs. waiting for _bulk to be called, at which point a crawl would have already begun.

Checklists

Pre-Review Checklist

  • This PR does NOT contain credentials of any kind, such as API keys or username/passwords (double check crawler.yml.example and elasticsearch.yml.example)
  • This PR has a meaningful title
  • This PR links to all relevant GitHub issues that it fixes or partially addresses
    • If there is no GitHub issue, please create it. Each PR should have a link to an issue
  • this PR has a thorough description
  • Covered the changes with automated tests
  • Tested the changes locally
  • Added a label for each target release version (example: v0.1.0)
  • Considered corresponding documentation changes

Related Pull Requests

#186

@mattnowzari mattnowzari added enhancement New feature or request v0.2.1 v0.2.2 and removed v0.2.1 labels Jan 30, 2025
@mattnowzari mattnowzari marked this pull request as ready for review February 3, 2025 14:07
@mattnowzari mattnowzari requested a review from a team as a code owner February 3, 2025 14:07
Comment on lines 64 to 66
def create_index
raise Errors::UnableToCreateIndex, system_logger.info("Failed to create #{config.output_index}") unless
client.indices.create(index: config.output_index)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would honestly inline this method cause it's single-line + name of method does not reflect that it will system exit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this would be cleaner and have less mental overhead to read and understand

Copy link
Contributor Author

@mattnowzari mattnowzari Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually - I just realized (again) why I had done this. Rubocop complains about Assignment Branch Condition size when I inline the raise __ unless__ line. I'm still not familiar with this particular Ruby-ism so I'll research and see how I can inline this while still being Ruby-friendly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a pretty silly thing to get hung up on, but I can't seem to get the verify_output_index() method to do everything I want it to do without exceeding the ABC threshold 😭

As a compromise, I've renamed the helper method to be more indicative of what it actually does. I might give this another go tomorrow with a fresh mind, am 110% open to other ideas here!

lib/errors.rb Outdated
Comment on lines 25 to 31
class ESConnectionError < SystemExit; end

# Raised when the desired output index does not exist. This is specific for Elasticsearch
# sink. During initialization of the Elasticsearch sink, it will call indices.exists()
# against the output_index value, and will continue if the index is found.
# If it is not found, this error will be raised, which causes a system exit to occur.
class IndexDoesNotExistError < SystemExit; end
class UnableToCreateIndex < SystemExit; end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have exception classes that are deriving from SystemExit, do we catch them? Why not raise SystemExit directly instead with a message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A distinct error class that inherits from SystemExit was in response to a suggestion @navarone-feekery had made in my first iteration on this issue here. In practice, this UnableToCreateIndex is not caught, it falls through and terminates Crawler.

On one had, I see the value in having more precise error naming (my naming is not precise here TBH, as it still does not indicate a System Exit, so this needs a fix no matter what)

On the other, I can see the value in callingSystemExit directly - it's one less layer of abstraction and it indicates exactly what is going to happen when index creation fails.

Compromise - rename this to something like SystemExitOnIndexCreateFailure or something similar? It's a bit wordy but is more descriptive.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this in the last PR review, sorry. I think this should just be derived from StandardError. I don't see a reason to useSystemExit specifically when raising.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After doing some experimenting, I would like to make an argument for deriving the error from SystemExit - it results in a cleaner-looking fail state. When raising a StandardError, the pre-flight check will fail with quite a long stack trace, to the point where I have to scroll up to actually get to our custom log messaging.

A SystemExit-derived error will cleanly terminate at the failure point after outputting our log messaging about what the issue was (in this case, ES not being reachable).

If we are OK with a large stack trace accompanying this failure point, then I am OK with deriving from StandardError - it's just a matter of how we want the output to look.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are good arguments. I think a quick and clear fail out for the preflight check makes sense. Thanks for testing both cases!

Copy link
Member

@artem-shelkovnikov artem-shelkovnikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 🌓

@mattnowzari mattnowzari merged commit 458f575 into main Feb 5, 2025
2 checks passed
@mattnowzari mattnowzari deleted the es_sink_initcheck_part_deux branch February 5, 2025 14:53
github-actions bot pushed a commit that referenced this pull request Feb 6, 2025
…ing ES Sink initialization (#192)

### Closes #53 and
#172

This is a continuation of issue #53 but also closes #172.
This PR will add the following steps during the initialization of the ES
Sink:
- A verification step that checks if crawler can reach the Elasticsearch
instance provided in configs
- An explicit attempt to create the output_index should the index ping
fail (index ping step was added in #186 )

Thus, the flow during init will be like:
`verify ES connection`--> `if all good, verify the output_index `--> `if
index does not exist, attempt to create the index` --> `if index
creation fails, system exit`

Additional background:
While working on this, I discovered that _technically speaking_, the
_bulk command that Crawler uses to upsert documents is capable of
auto-creating the index if it doesn't exist. However, this is dependent
on the user having `auto_configure`, `create_index`, or `manage` index
privileges.

Therefore, while we may not _need_ an explicit index creation attempt,
it is good to have because we can then explicitly log that it happened,
and also provide a safe point to fail out at should something go wrong
vs. waiting for _bulk to be called, at which point a crawl would have
already begun.

### Checklists

#### Pre-Review Checklist
- [x] This PR does NOT contain credentials of any kind, such as API keys
or username/passwords (double check `crawler.yml.example` and
`elasticsearch.yml.example`)
- [x] This PR has a meaningful title
- [x] This PR links to all relevant GitHub issues that it fixes or
partially addresses
- If there is no GitHub issue, please create it. Each PR should have a
link to an issue
- [x] this PR has a thorough description
- [x] Covered the changes with automated tests
- [x] Tested the changes locally
- [x] Added a label for each target release version (example: `v0.1.0`)
- [x] Considered corresponding documentation changes

### Related Pull Requests
#186
Copy link

github-actions bot commented Feb 6, 2025

💚 Backport PR(s) successfully created

Status Branch Result
0.2 #207

This backport PR will be merged automatically after passing CI.

navarone-feekery pushed a commit that referenced this pull request Feb 6, 2025
…on during ES Sink initialization (#192) (#207)

Backports the following commits to 0.2:
- Adding ES verification step + explicit best-effort index creation
during ES Sink initialization (#192)

Co-authored-by: Matt Nowzari <matt.nowzari@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Crawler will not create a new index Confirm ES connection before starting crawls
3 participants