
Refactor how spark options are passed to Options/JetStreamOptions builders #30

Open
stoddabr opened this issue Oct 4, 2024 · 5 comments
Labels
proposal Enhancement idea or proposal

Comments

@stoddabr
Contributor

stoddabr commented Oct 4, 2024

What motivated this proposal?

I really like this library and want it to succeed. However, while trying to contribute I ran into significant technical debt in how options are passed through the code.

Currently the pattern for passing options to Options/JetStreamOptions is ad hoc, and I can't find a common naming convention for the spark options. This makes it difficult to develop on top of the v2 library and confusing for users. Ideally the option names would be self-descriptive and the nested string pattern would carry meaning.

For example, I'm working on a large feature for adding SSL/TLS support and there's no obvious option name for the many options I need to add. I could reuse some of them from v1 but that doesn't follow a pattern either.

The code itself is difficult to trace, which makes changes a struggle. For example, the "nats.host" spark option is passed to JetStreamConfig, which is passed to NatsConnectionConfig, which is used by the options builder. This makes adding a new option a task that spans multiple files (a classic case of technical debt). Ideally this structure would be flattened so that a new option could be added in one class with an intentional name.

I propose a new structure that standardizes how options are passed to the library and used in the code, reducing technical debt by leveraging abstractions already defined in nats.java to 1) standardize spark option names, and 2) reduce the number of internal classes used by the library.

What is the proposed change?

I would love to establish a new pattern that makes exposing builder parameters to spark follow a common option name and coding pattern based on the builders that nats.java uses.

Here is what I propose for the spark option names: "nats.[alias for builder class].[builder method]"

  • All options should follow a nested path starting with "nats."
  • Options names will continue with an alias for the builder class the option will be passed to.
    • ConsumerConfiguration = "consumer"
    • io.nats.client.Options = "connection"
    • io.nats.client.JetStreamOptions = "jetstream"
    • io.nats.client.PullSubscribeOptions (and args passed to fetch) = "pull"

The internal config classes should reflect this hierarchy directly instead of the arbitrary structure they have today.

For example, instead of the host option being passed from JetStreamConfig -> NatsConnectionConfig -> builder (which is tricky to track down and requires updating code in multiple places), the host would be passed from the spark option "nats.connection.server" -> NatsConnectionConfig -> builder.

This gives a clear way to use and contribute to this library that matches the nats.java documentation.

This could include a few quality-of-life exceptions where names differ (for example, using "nats.connection.port" and "nats.connection.host" instead of "nats.connection.server"). Following a standard by default will help new developers and users, and keep this library in sync with its dependency on nats.java.
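To make the proposed flattening concrete, here is a minimal sketch of how spark options could be routed to a single builder config by stripping a "nats.&lt;alias&gt;." prefix. The class and method names here are hypothetical illustrations, not part of the connector or of nats.java:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed flattening: every spark option lives
// under "nats.<alias>.", and one helper routes each option to the matching
// nats.java builder config, so adding a new option touches only one place.
public class NatsOptionRouter {
    /** Returns the options under "nats.<alias>." with the prefix stripped. */
    public static Map<String, String> optionsFor(String alias, Map<String, String> sparkOptions) {
        String prefix = "nats." + alias + ".";
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, String> entry : sparkOptions.entrySet()) {
            if (entry.getKey().startsWith(prefix)) {
                result.put(entry.getKey().substring(prefix.length()), entry.getValue());
            }
        }
        return result;
    }
}
```

With this shape, `optionsFor("connection", sparkOptions)` would yield only the keys destined for the io.nats.client.Options builder, with no intermediate config classes to update.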

Who benefits from this change?

Users, since the options will follow a pattern. They can lean on the nats.java docs to use this library.

Developers, since building on this library to add new features using the options will be easier.

What alternatives have you evaluated?

Maybe I'm missing something? Updating the documentation to clarify the current pattern for option names and describe how to contribute new ones would also work. However, I can't for the life of me figure out what the current pattern is. It feels arbitrary: many of the nats.java options are not exposed, and data is passed to builders across layers of seemingly arbitrary config classes.

@jnmoyne
Contributor

jnmoyne commented Oct 8, 2024

Yes, I agree the V1 options are not great, and there's definitely added complexity stemming from the original coding that got worse as I added features on top of it.
I am all in favor of this:

follow a common option name and coding pattern based on the builders that nats.java uses.

What would be good is to have the options passed to the connector be compatible with what nats.java uses, such that the connector code can simply pass them directly to the nats.java builder.

If we can do this, then I'm ok with forgoing any attempt at keeping the V1 option names (especially since I haven't actually properly released V2 yet, so I'm ok with doing that break and announcing it clearly in the V2 release notes).

@stoddabr
Contributor Author

I'm close to getting a working version of this.

  • Options -> client -- used when connecting to nats
  • PullSubscribeOptions -> pullsubscribe -- used when calling getConsumerContext or subscribe
  • JetStreamOptions -> jetstream -- used when calling connection.jetstream
  • other -> streaming -- vars used by the spark streaming code, generally not passed to nats (except for fetch size)

Here's an example of what it looks like:

val initDF = spark.readStream
    .format("org.apache.spark.sql.nats") // will look for org.apache.spark.sql.nats.DefaultSource
    .option("io.nats.client.url", s"nats://${nats_host}:${nats_port}")
    .option("io.nats.client.credential.bytes", credentialBytesString)
    .option("io.nats.client.keyStore", credentialBytesString)
    .option("io.nats.client.keyStorePassword", nats_trust_store)
    .option("io.nats.client.trustStore", credentialBytesString)
    .option("io.nats.client.trustStorePassword", nats_trust_store)
    .option("io.nats.client.ssl.context.factory.class", factoryClassName)
    .option("io.nats.client.tls.algorithm", tls_alg)
    .option("io.nats.pullsubscribe.stream", nats_subscriber_stream)
    .option("io.nats.pullsubscribe.name", nats_durable) // ...durable will also work
    .option("io.nats.jetstream.prefix", nats_api_prefix)
    .option("io.nats.streaming.payloadCompression", "none") // none or zlib
    .option("io.nats.streaming.maxBatch", "12000") // maximum messages in a minibatch
    .option("io.nats.streaming.maxFetchCount", "1000") // maximum messages per fetch call
    .option("io.nats.streaming.maxFetchWaitMs", "5000") // maximum wait per fetch, in milliseconds
    .load()

Any preliminary feedback before I finalize and open a PR?

@stoddabr
Contributor Author

I still need to add support for the nats sink and consumer create options; however, I do not have a good way to test in our cloud. Can nats be self-hosted in Docker? If so, setting up a compose file with nats and spark could be used to run integration tests for this library.

@Marcus-Rosti
Contributor

I like this pitch for parameter names!

@Marcus-Rosti
Contributor

and you should be able to very easily run nats in docker, at least I do
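For reference, NATS publishes an official Docker image, so a compose file along the lines the issue suggests might look like this. This is a minimal sketch; the image tags and the Spark service details are assumptions, not a tested setup:

```yaml
# docker-compose.yml sketch: a NATS server with JetStream enabled,
# reachable from a Spark container for integration tests.
services:
  nats:
    image: nats:latest
    command: ["-js"]        # enable JetStream
    ports:
      - "4222:4222"         # client connections
      - "8222:8222"         # HTTP monitoring
  spark:
    image: apache/spark:latest
    depends_on:
      - nats
```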
