
Refactor how spark options are passed to Options/JetStreamOptions builders #30

Open
stoddabr opened this issue Oct 4, 2024 · 5 comments
Labels
proposal Enhancement idea or proposal

Comments

@stoddabr
Contributor

stoddabr commented Oct 4, 2024

What motivated this proposal?

I really like this library and want it to succeed. However, while trying to contribute I ran into significant technical debt in how options are passed through the code.

Currently the pattern for passing options to Options/JetStreamOptions is ad hoc, and I can't find a common naming convention for the spark options. This makes it difficult to develop on top of the v2 library and confusing for users. Ideally the option names would be self-descriptive and the nested string pattern would carry meaning.

For example, I'm working on a large feature for adding SSL/TLS support and there's no obvious option name for the many options I need to add. I could reuse some of them from v1 but that doesn't follow a pattern either.

The code itself is difficult to trace, which makes changes a struggle. For example, the "nats.host" spark option is passed to JetStreamConfig, which is passed to NatsConnectionConfig, which is used by the options builder. This makes adding a new option a task that spans multiple files (a classic case of technical debt). Ideally this structure would be flattened so that a new option could be added in one class with an intentional name.

I propose a new structure that standardizes how options are passed to the library and used in the code, reducing technical debt by leveraging abstractions already defined in nats.java to 1) standardize spark option names, and 2) reduce the number of internal classes used by the library.

What is the proposed change?

I would love to establish a new pattern that makes exposing builder parameters to spark follow a common option name and coding pattern based on the builders that nats.java uses.

Here is what I propose for the spark option names: "nats.[alias for builder class].[builder method]"

  • All options should follow a nested path starting with "nats."
  • Options names will continue with an alias for the builder class the option will be passed to.
    • ConsumerConfiguration = "consumer"
    • io.nats.client.Options = "connection"
    • io.nats.client.JetStreamOptions = "jetstream"
    • io.nats.client.PullSubscribeOptions (and args passed to fetch) = "pull"

The internal config classes should reflect this hierarchy directly instead of the arbitrary structure they have today.

For example, instead of the host option being passed from JetStreamConfig -> NatsConnectionConfig -> builder (which is tricky to track down and requires updating code in multiple places), the host would be passed from the spark option "nats.connection.server" -> NatsConnectionConfig -> builder.

This gives a clear way to use and contribute to this library that matches the nats.java documentation.

This could include a few quality-of-life exceptions where names differ (for example, using "nats.connection.port" and "nats.connection.host" instead of "nats.connection.server"). Following a standard by default will help new developers and users, and keep this library in sync with its dependency on nats.java.
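To make the proposed flattening concrete, here is a minimal sketch of how spark options could be routed to a single builder config by stripping a "nats.&lt;alias&gt;." prefix. The class and method names here are hypothetical illustrations, not part of the connector or of nats.java:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed flattening: every spark option lives
// under "nats.<alias>.", and one helper routes each option to the matching
// nats.java builder config, so adding a new option touches only one place.
public class NatsOptionRouter {
    /** Returns the options under "nats.<alias>." with the prefix stripped. */
    public static Map<String, String> optionsFor(String alias, Map<String, String> sparkOptions) {
        String prefix = "nats." + alias + ".";
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, String> entry : sparkOptions.entrySet()) {
            if (entry.getKey().startsWith(prefix)) {
                result.put(entry.getKey().substring(prefix.length()), entry.getValue());
            }
        }
        return result;
    }
}
```

With this shape, `optionsFor("connection", sparkOptions)` would yield only the keys destined for the io.nats.client.Options builder, with no intermediate config classes to update.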

Who benefits from this change?

Users, since the options will follow a pattern. They can lean on the nats.java docs to use this library.

Developers, since building on this library to add new features using the options will be easier.

What alternatives have you evaluated?

Maybe I'm missing something? Updating the documentation to clarify the current pattern for option names and describe how to contribute new ones would also work. However, I can't for the life of me figure out what the current pattern is. It feels arbitrary: many of the nats.java options are not exposed, and data is passed to builders across layers of seemingly arbitrary config classes.

@jnmoyne
Contributor

jnmoyne commented Oct 8, 2024

Yes, I agree the V1 options are not great, and there's definitely added complexity stemming from the original coding that got worse as I added features on top of it.
I am all in favor of this:

follow a common option name and coding pattern based on the builders that nats.java uses.

What would be good is to have the options passed to the connector be compatible with what nats.java uses, such that the connector code can simply pass them directly to the nats.java builder.

If we can do this, then I'm ok with forgoing any attempt at keeping the V1 option names (especially since I haven't actually properly released V2 yet, so I'm ok with doing that break and announcing it clearly in the V2 release notes).

@stoddabr
Contributor Author

I'm close to getting a working version of this.

  • Options -> client -- used when connecting to nats
  • PullSubscribeOptions -> pullsubscribe -- used when calling getConsumerContext or subscribe
  • JetStreamOptions -> jetstream -- used when calling connection.jetstream
  • other -> streaming -- vars used by the spark streaming code, generally not passed to nats (except for fetch size)

Here's an example of what it looks like:

val initDF = spark.readStream
    .format("org.apache.spark.sql.nats") // will look for org.apache.spark.sql.nats.DefaultSource
    .option("io.nats.client.url", s"nats://${nats_host}:${nats_port}")
    .option("io.nats.client.credential.bytes", credentialBytesString)
    .option("io.nats.client.keyStore", credentialBytesString)
    .option("io.nats.client.keyStorePassword", nats_trust_store)
    .option("io.nats.client.trustStore", credentialBytesString)
    .option("io.nats.client.trustStorePassword", nats_trust_store)
    .option("io.nats.client.ssl.context.factory.class", factoryClassName)
    .option("io.nats.client.tls.algorithm", tls_alg)
    .option("io.nats.pullsubscribe.stream", nats_subscriber_stream)
    .option("io.nats.pullsubscribe.name", nats_durable) // ...durable will also work
    .option("io.nats.jetstream.prefix", nats_api_prefix)
    .option("io.nats.streaming.payloadCompression", "none") // none or zlib
    .option("io.nats.streaming.maxBatch", "12000") // maximum messages in a minibatch
    .option("io.nats.streaming.maxFetchCount", "1000") // maximum messages per fetch call
    .option("io.nats.streaming.maxFetchWaitMs", "5000") // maximum wait per fetch, in milliseconds
    .load()

Any preliminary feedback before I finalize and open a PR?

@stoddabr
Contributor Author

I still need to add support for the nats sink and consumer create options; however, I do not have a good way to test in our cloud. Can nats be self-hosted in Docker? If so, setting up a compose file with nats and spark could be used to run integration tests for this library.

@Marcus-Rosti
Contributor

I like this pitch for parameter names!

@Marcus-Rosti
Contributor

and you should be able to very easily run nats in docker, at least I do
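For reference, NATS publishes an official Docker image, so a compose file along the lines the issue suggests might look like this. This is a minimal sketch; the image tags and the Spark service details are assumptions, not a tested setup:

```yaml
# docker-compose.yml sketch: a NATS server with JetStream enabled,
# reachable from a Spark container for integration tests.
services:
  nats:
    image: nats:latest
    command: ["-js"]        # enable JetStream
    ports:
      - "4222:4222"         # client connections
      - "8222:8222"         # HTTP monitoring
  spark:
    image: apache/spark:latest
    depends_on:
      - nats
```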
