-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding ES verification step + explicit best-effort index creation during ES Sink initialization #192
Conversation
…icit creation of missing output index
def create_index | ||
raise Errors::UnableToCreateIndex, system_logger.info("Failed to create #{config.output_index}") unless | ||
client.indices.create(index: config.output_index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would honestly inline this method cause it's single-line + name of method does not reflect that it will system exit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, this would be cleaner and have less mental overhead to read and understand
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually - I just realized (again) why I had done this. Rubocop complains about Assignment Branch Condition size
when I inline the raise __ unless__
line. I'm still not familiar with this particular Ruby-ism so I'll research and see how I can inline this while still being Ruby-friendly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a pretty silly thing to get hung up on, but I can't seem to get the verify_output_index()
method to do everything I want it to do without exceeding the ABC threshold 😭
As a compromise, I've renamed the helper method to be more indicative of what it actually does. I might give this another go tomorrow with a fresh mind, am 110% open to other ideas here!
lib/errors.rb
Outdated
class ESConnectionError < SystemExit; end | ||
|
||
# Raised when the desired output index does not exist. This is specific for Elasticsearch | ||
# sink. During initialization of the Elasticsearch sink, it will call indices.exists() | ||
# against the output_index value, and will continue if the index is found. | ||
# If it is not found, this error will be raised, which causes a system exit to occur. | ||
class IndexDoesNotExistError < SystemExit; end | ||
class UnableToCreateIndex < SystemExit; end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have exception classes that are deriving from SystemExit, do we catch them? Why not raise SystemExit directly instead with a message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A distinct error class that inherits from SystemExit
was in response to a suggestion @navarone-feekery had made in my first iteration on this issue here. In practice, this UnableToCreateIndex
is not caught, it falls through and terminates Crawler.
On one had, I see the value in having more precise error naming (my naming is not precise here TBH, as it still does not indicate a System Exit, so this needs a fix no matter what)
On the other, I can see the value in callingSystemExit
directly - it's one less layer of abstraction and it indicates exactly what is going to happen when index creation fails.
Compromise - rename this to something like SystemExitOnIndexCreateFailure
or something similar? It's a bit wordy but is more descriptive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed this in the last PR review, sorry. I think this should just be derived from StandardError
. I don't see a reason to useSystemExit
specifically when raising.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After doing some experimenting, I would like to make an argument for deriving the error from SystemExit
- it results in a cleaner-looking fail state. When raising a StandardError
, the pre-flight check will fail with quite a long stack trace, to the point where I have to scroll up to actually get to our custom log messaging.
A SystemExit
-derived error will cleanly terminate at the failure point after outputting our log messaging about what the issue was (in this case, ES not being reachable).
If we are OK with a large stack trace accompanying this failure point, then I am OK with deriving from StandardError
- it's just a matter of how we want the output to look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are good arguments. I think a quick and clear fail out for the preflight check makes sense. Thanks for testing both cases!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀 🌓
…ing ES Sink initialization (#192) ### Closes #53 and #172 This is a continuation of issue #53 but also closes #172. This PR will add the following steps during the initialization of the ES Sink: - A verification step that checks if crawler can reach the Elasticsearch instance provided in configs - An explicit attempt to create the output_index should the index ping fail (index ping step was added in #186 ) Thus, the flow during init will be like: `verify ES connection`--> `if all good, verify the output_index `--> `if index does not exist, attempt to create the index` --> `if index creation fails, system exit` Additional background: While working on this, I discovered that _technically speaking_, the _bulk command that Crawler uses to upsert documents is capable of auto-creating the index if it doesn't exist. However, this is dependent on the user having `auto_configure`, `create_index`, or `manage` index privileges. Therefore, while we may not _need_ an explicit index creation attempt, it is good to have because we can then explicitly log that it happened, and also provide a safe point to fail out at should something go wrong vs. waiting for _bulk to be called, at which point a crawl would have already begun. ### Checklists #### Pre-Review Checklist - [x] This PR does NOT contain credentials of any kind, such as API keys or username/passwords (double check `crawler.yml.example` and `elasticsearch.yml.example`) - [x] This PR has a meaningful title - [x] This PR links to all relevant GitHub issues that it fixes or partially addresses - If there is no GitHub issue, please create it. Each PR should have a link to an issue - [x] this PR has a thorough description - [x] Covered the changes with automated tests - [x] Tested the changes locally - [x] Added a label for each target release version (example: `v0.1.0`) - [x] Considered corresponding documentation changes ### Related Pull Requests #186
Closes #53 and #172
This is a continuation of issue #53 but also closes #172.
This PR will add the following steps during the initialization of the ES Sink:
Thus, the flow during init will be like:
verify ES connection
-->if all good, verify the output_index
-->if index does not exist, attempt to create the index
-->if index creation fails, system exit
Additional background:
While working on this, I discovered that technically speaking, the _bulk command that Crawler uses to upsert documents is capable of auto-creating the index if it doesn't exist. However, this is dependent on the user having
auto_configure
,create_index
, ormanage
index privileges.Therefore, while we may not need an explicit index creation attempt, it is good to have because we can then explicitly log that it happened, and also provide a safe point to fail out at should something go wrong vs. waiting for _bulk to be called, at which point a crawl would have already begun.
Checklists
Pre-Review Checklist
crawler.yml.example
andelasticsearch.yml.example
)v0.1.0
)Related Pull Requests
#186