
OPENNLP-421 - Replace String.intern() in StringList #568

Merged (10 commits) on Dec 22, 2023

Conversation

rzo1
Contributor

@rzo1 rzo1 commented Dec 14, 2023

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically main)?

  • Is your initial contribution a single, squashed commit?

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

I found OPENNLP-421 in our issue tracker and added a simple JMH benchmark to look into the impact of interning strings and our multiple copy-operations in StringList.

As stated by the original issue opener, String interning was most likely used because of:

Presumably this is an attempt to reduce memory usage for duplicate tokens. Interned Strings are stored in the JVM's permanent generation, which has a small fixed size (seems to be about 83 MB on modern 64-bit JVMs: https://www.oracle.com/java/technologies/javase/vmoptions-jsp.html)

So if we have huge StringLists or Dictionaries, users need to increase the -XX:MaxPermSize= option to avoid an OutOfMemoryError.

There might also be room for improvement in the Dictionary implementation, as we mess around with StringLists, StringListWrappers, etc. - but that would be something to look into next.

Just want to have some thoughts on the interning removal here ;-)
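For context, the pattern under discussion boils down to something like the following sketch (this is illustrative, not the actual OpenNLP `StringList` code): the old constructor interned every token via the JVM string pool, while the proposed change simply takes a defensive copy.

```java
import java.util.Arrays;

// Illustrative sketch, not the actual OpenNLP implementation.
final class StringListSketch {

    private final String[] tokens;

    private StringListSketch(String[] tokens) {
        this.tokens = tokens;
    }

    // Old behaviour: canonicalize each token via the JVM string pool.
    static StringListSketch withInterning(String... tokens) {
        String[] copy = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            copy[i] = tokens[i].intern();
        }
        return new StringListSketch(copy);
    }

    // New behaviour: plain defensive copy, no interning.
    static StringListSketch withoutInterning(String... tokens) {
        return new StringListSketch(Arrays.copyOf(tokens, tokens.length));
    }

    String getToken(int index) {
        return tokens[index];
    }

    int size() {
        return tokens.length;
    }
}
```

The trade-off the benchmark measures: interning deduplicates repeated tokens (saving memory) but pays the cost of a lookup in the JVM-internal string table for every token.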

Member

@kinow kinow left a comment


If I understood the issue and read the benchmark gist results correctly, with the new code the JMH benchmark starts faster, then the old code catches up and plateaus around 1.2K ops/s. But overall the non-interned version still processes more ops/s, with the advantage of not having the JVM max memory issue.

Do you think we could look at memory usage running these tests too? Maybe through GC metrics in JMH or VisualVM, YourKit, etc.?

Thanks!

@rzo1
Contributor Author

rzo1 commented Dec 14, 2023

If I understood the issue and read the benchmark gist results correctly, with the new code the JMH benchmark starts faster, then the old code catches up and plateaus around 1.2K ops/s. But overall the non-interned version still processes more ops/s, with the advantage of not having the JVM max memory issue.

if we look at the mean values:

  • 1,211 ± 0,149 ops/s (old code)

vs.

  • 468,404 ± 140,482 ops/s (new code)

System.arraycopy(...) is the reason for the speed improvement from merely 1k ops/s to 328k ops/s (worst case).

So what we see is a huge improvement in terms of throughput, because we avoid copying and creating objects all the time ;-) - I agree that we could investigate memory consumption, too.

@rzo1
Contributor Author

rzo1 commented Dec 14, 2023

Added GC Profiling output to JMH. Here it is

Old Code

Benchmark                                                        Mode  Cnt         Score       Error   Units
StringListBenchmark.newWithArrayConstructor                     thrpt   10         1,104 ±     0,311   ops/s
StringListBenchmark.newWithArrayConstructor:gc.alloc.rate       thrpt   10        73,739 ±    14,797  MB/sec
StringListBenchmark.newWithArrayConstructor:gc.alloc.rate.norm  thrpt   10  92007982,187 ± 37715,232    B/op
StringListBenchmark.newWithArrayConstructor:gc.count            thrpt   10        68,000              counts
StringListBenchmark.newWithArrayConstructor:gc.time             thrpt   10      4396,000                  ms

New Code

Benchmark                                                        Mode  Cnt         Score     Error   Units
StringListBenchmark.newWithArrayConstructor                     thrpt   10       471,103 ± 170,077   ops/s
StringListBenchmark.newWithArrayConstructor:gc.alloc.rate       thrpt   10       443,527 ±  37,567  MB/sec
StringListBenchmark.newWithArrayConstructor:gc.alloc.rate.norm  thrpt   10  92000042,655 ±   0,953    B/op
StringListBenchmark.newWithArrayConstructor:gc.count            thrpt   10       277,000            counts
StringListBenchmark.newWithArrayConstructor:gc.time             thrpt   10     14432,000                ms

@kinow
Member

kinow commented Dec 14, 2023

Thanks for the GC numbers. The new code seems a bit heavier on the GC and memory allocations, but I think that was expected.

A bit late here, but I decided to give it a try. Excuse me if I say anything silly; I can check these results again tomorrow after having some ☕ . I opened your branch in IntelliJ, then made the jmh profile active by default (how do you tell IntelliJ to include the deps of a profile? anywho), invalidated the cache, etc.

Executed the tests just to make sure they were running, as you already provided the results. Then executed with the profiler to generate the .jfr file. Saved the jmh source folder, checked out main, then patched the pom.xml files, and executed the same test again to get another .jfr for the main branch (so the earlier file is for this branch, the later one for main).

Opened both in IntelliJ and selected the .jfr file from this branch to compare with the one from main. Not sure if they show a fair comparison, as the difference in throughput might exacerbate some numbers, but in case you find it useful, @rzo1 .

CPU samples

[flamegraph screenshot]

I think it just confirms the change of calls to intern functions.

[screenshot]

Also the increase in GC calls.

[screenshot]

And I think the Random calls are from JMH for having more/less samples due to higher throughput.

[screenshot]

Memory allocations

[flamegraph screenshot]

With a lot more Strings, as expected, as well as bytes due to the array copy calls.

[screenshot]

[screenshot]

Unfortunately I don't have a better or more practical test to compare memory (maybe a single run of a large example would be better for comparing memory, instead of multiple jmh runs…). But I don't think this should really be a problem. I wonder if that was some premature optimization or if at some point someone had memory issues. But it's a change that's easy to revert if needed, and this way users won't have to modify heap settings. So I am inclined to approve it! 🎉

Will wait until tomorrow so you, Martin, Jeff, others had some time to review 🛌

Thanks!

Bruno

@rzo1
Contributor Author

rzo1 commented Dec 15, 2023

@kinow thanks for the details and the flight recording :-)

I think the current use of String interning was an attempt to optimize string comparisons by using the string constant pool (until it explodes). I guess the idea was also to save memory.

There is a really nice blog article which summarizes the pros/cons: https://www.codecentric.de/wissens-hub/blog/save-memory-by-using-string-intern-in-java

They state that

Only use String.intern() on Strings you know are occurring multiple times, and only do it to save memory

I think that we can remove a lot of the current object creation related to the use of StringList, StringListWrapper, or the creation of StringList objects just for the sake of running a contains(...), which should make up for the overhead due to the removal of String interning.

Member

@kinow kinow left a comment


Code looks good, performance tradeoff also seems OK IMO. Thanks @rzo1 !

@rzo1
Contributor Author

rzo1 commented Dec 15, 2023

@mawiesne and myself just started an eval run from the office. Let's see if we see anything ;-)

@mawiesne
Contributor

mawiesne commented Dec 16, 2023

I think we could/should consider using a ConcurrentHashMap-based deduplicator approach to replace the intern() calls, as investigated and explained by Shipilev in his blog post here:

https://shipilev.net/jvm/anatomy-quarks/10-string-intern/

The removal of the String deduplication, as found in the code segments of this PR, could otherwise lead to more required RAM at runtime - as @kinow already expressed.

Wdyt? @jzonthemtn @rzo1

(A) Removal (PR as is now), or
(B) Replacement with concurrent-capable custom deduplicator?
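The ConcurrentHashMap-based approach from Shipilev's post is essentially the following sketch (names and details here are illustrative, not necessarily the final OpenNLP class): deduplicated strings live on the ordinary heap in a regular map, instead of in the JVM-internal string table.

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of option (B): a ConcurrentHashMap-backed interner in the style of
// Shipilev's CHMInterner.
final class CHMInterner {

    private final ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();

    // Returns the canonical instance for s; the first instance stored wins.
    public String intern(String s) {
        String existing = map.putIfAbsent(s, s);
        return (existing == null) ? s : existing;
    }
}
```

Unlike String.intern(), entries are only reclaimed when the map itself is discarded, but lookups avoid the JNI/string-table overhead and the map grows like any other heap structure.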

@rzo1
Contributor Author

rzo1 commented Dec 16, 2023

I am OK with A or B. Feel free to add B on-top of this PR (don't have time in the next days)

@rzo1 rzo1 changed the title OPENNLP-421 - Remove String interning from StringList Draft: OPENNLP-421 - Remove String interning from StringList Dec 16, 2023
@rzo1 rzo1 changed the title Draft: OPENNLP-421 - Remove String interning from StringList OPENNLP-421 - Remove String interning from StringList Dec 16, 2023
@rzo1
Contributor Author

rzo1 commented Dec 16, 2023

Let's make it a draft and add B. Saving memory is always a good thing.

@mawiesne mawiesne marked this pull request as draft December 16, 2023 08:00
@kinow
Member

kinow commented Dec 16, 2023

I think we could/should consider using a ConcurrentHashMap-based deduplicator approach to replace the intern() calls, as investigated and explained by Shipilev in his blog post here:

https://shipilev.net/jvm/anatomy-quarks/10-string-intern/

The removal of the String deduplication, as found in the code segments of this PR, could otherwise lead to more required RAM at runtime - as @kinow already expressed.

Wdyt? @jzonthemtn @rzo1

(A) Removal (PR as is now), or (B) Replacement with concurrent-capable custom deduplicator?

(B) looks interesting, +1 for trying that out, and for @rzo1 suggestion to add that on top of this one.

CHMInterner

First time I've seen interning implemented by hand in the code (intentionally like that). Sounds like a good idea! You do the interning without touching the JVM's internal string table. Not sure if it will perform as fast as String.intern, but I think it might work! Thanks @mawiesne

@mawiesne
Contributor

mawiesne commented Dec 16, 2023

I reproduced Shipilev's basic String deduplication benchmark (originally written for Java 8) on macOS with Java 17; the results are consistent with his findings:

Benchmark                             (size)   Mode  Cnt         Score         Error  Units
StringDeduplicationBenchmark.chm           1  thrpt   25  71093243,405 ± 1437503,007  ops/s
StringDeduplicationBenchmark.chm         100  thrpt   25    698295,053 ±   10420,119  ops/s
StringDeduplicationBenchmark.chm       10000  thrpt   25      3389,917 ±      56,170  ops/s
StringDeduplicationBenchmark.chm     1000000  thrpt   25        22,715 ±       0,120  ops/s
StringDeduplicationBenchmark.hm            1  thrpt   25  61916588,259 ± 1271732,213  ops/s
StringDeduplicationBenchmark.hm          100  thrpt   25    714613,401 ±   10472,821  ops/s
StringDeduplicationBenchmark.hm        10000  thrpt   25      3201,984 ±      18,741  ops/s
StringDeduplicationBenchmark.hm      1000000  thrpt   25        23,081 ±       0,122  ops/s
StringDeduplicationBenchmark.intern        1  thrpt   25  11033749,712 ±   35746,520  ops/s
StringDeduplicationBenchmark.intern      100  thrpt   25    109538,438 ±     193,911  ops/s
StringDeduplicationBenchmark.intern    10000  thrpt   25       997,533 ±       3,149  ops/s
StringDeduplicationBenchmark.intern  1000000  thrpt   25         6,024 ±       0,026  ops/s

@rzo1
Contributor Author

rzo1 commented Dec 16, 2023

Looks great, @mawiesne - good that it also applies to Java 17. I think we should give it a try and add our own CHMInterner. Do you want to take over here (in which case I would re-assign the Jira) or shall I add it in the next days? Just want to make sure no duplicate work is done here ;-)

@mawiesne
Contributor

Do you want to take over here (in which case I would re-assign the Jira) or shall I add it in the next days?

Feel free to proceed after I've pushed the basic string dedup benchmark, leave Jira as is.

@rzo1
Contributor Author

rzo1 commented Dec 16, 2023

Ack.

- adds StringDeduplicationBenchmark (jmh), adapted from A. Shipilëv for benchmarking on different operating systems / Java environments
- fixes inconsistent path in build-helper-maven-plugin config for jmh test addition
@rzo1 rzo1 changed the title OPENNLP-421 - Remove String interning from StringList OPENNLP-421 - Replace String.intern() in StringList Dec 18, 2023
OPENNLP-412 - Add some interner implementations based on the examples of
Aleksey Shipilëv. Introduces mechanismn similar to Hadoop for project wide usage.
OPENNLP-412 - Use our user provided StringInterner in StringList
OPENNLP-412 - Update Interner Benchmarker with new OpenNLP classes.
@rzo1
Contributor Author

rzo1 commented Dec 18, 2023

Hi all,

I added the related interner / dedup implementations, based on the Aleksey Shipilëv examples that @mawiesne provided.

I added the following interner impls:

  • CHMStringInterner (as provided by Aleksey), thread-safe
  • CHMStringDeduplicator (as provided by Aleksey in his talk), thread-safe -> relaxes the canonical requirement of interning; it is more or less probabilistic deduplication
  • HMStringInterner (as provided by Aleksey), not thread-safe
  • JvmStringInterner -> relies on String.intern() - can be used to get the previous OpenNLP behaviour ;)
  • NoOpStringInterner -> doesn't actually intern/dedup Strings.
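The probabilistic deduplicator mentioned above can be sketched as follows, following Shipilev's CHMDeduplicator idea (the class name and details here are illustrative, not necessarily the actual OpenNLP code): with probability (1 - prob) the input string is returned untouched, so hot paths skip the map lookup at the cost of occasionally keeping a duplicate.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch of a probabilistic string deduplicator.
final class CHMStringDeduplicatorSketch {

    private final double prob;
    private final ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();

    CHMStringDeduplicatorSketch(double prob) {
        this.prob = prob;
    }

    public String dedup(String s) {
        // Skip deduplication with probability (1 - prob).
        if (ThreadLocalRandom.current().nextDouble() > prob) {
            return s;
        }
        String existing = map.putIfAbsent(s, s);
        return (existing == null) ? s : existing;
    }
}
```

With prob = 1.0 this behaves like a full interner; with prob = 0.0 it is effectively a no-op.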

The implementation of the static StringInterners is based on how Hadoop does it [1]. The default interner used in OpenNLP is now: CHMStringInterner
I added a system property opennlp.interner.class, which can be used to specify the interner implementation that will be used at runtime:

  • If people want the old behaviour back: they can.
  • If people do not want interning at all: they can.
  • If people want probabilistic dedup: they can.

Currently, an updated benchmark of the different impls is running as well as a full eval build with the default.

Will update this PR with the JMH results in a few hours.
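The runtime selection described above could look roughly like the sketch below. Note the property name ("interner.class") and the classes here are hypothetical stand-ins for illustration; the actual OpenNLP property is opennlp.interner.class and its implementations live in opennlp.tools.util.jvm.

```java
// Hedged sketch of choosing an interner implementation via a system property.
interface StringInterner {
    String intern(String s);
}

// Fallback used when no property is set: doesn't intern/dedup at all.
final class NoOpStringInterner implements StringInterner {
    @Override
    public String intern(String s) {
        return s;
    }
}

final class StringInterners {

    // Resolve the configured implementation reflectively, falling back to no-op.
    static StringInterner load() {
        String clazz = System.getProperty("interner.class",
            NoOpStringInterner.class.getName());
        try {
            return (StringInterner) Class.forName(clazz)
                .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate interner: " + clazz, e);
        }
    }
}
```

Running with -Dinterner.class=some.other.Interner would then swap the behaviour without code changes, which is what gives users the old/none/probabilistic choices listed above.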

@rzo1
Contributor Author

rzo1 commented Dec 18, 2023

Here is the JMH result for the interner impls only:

Benchmark                             (size)   Mode  Cnt          Score         Error  Units
StringDeduplicationBenchmark.chm           1  thrpt   25   36070236,872 ± 1204122,063  ops/s
StringDeduplicationBenchmark.chm         100  thrpt   25     354445,896 ±    2893,432  ops/s
StringDeduplicationBenchmark.chm       10000  thrpt   25       2323,076 ±      25,596  ops/s
StringDeduplicationBenchmark.chm     1000000  thrpt   25         13,137 ±       0,199  ops/s
StringDeduplicationBenchmark.chmd05        1  thrpt   25   45289035,126 ±  143794,468  ops/s
StringDeduplicationBenchmark.chmd05      100  thrpt   25     433279,497 ±    3165,815  ops/s
StringDeduplicationBenchmark.chmd05    10000  thrpt   25       2692,779 ±       8,173  ops/s
StringDeduplicationBenchmark.chmd05  1000000  thrpt   25         13,413 ±       0,155  ops/s
StringDeduplicationBenchmark.hm            1  thrpt   25   35123958,776 ±  472485,649  ops/s
StringDeduplicationBenchmark.hm          100  thrpt   25     371997,311 ±    6780,622  ops/s
StringDeduplicationBenchmark.hm        10000  thrpt   25       2311,588 ±     115,117  ops/s
StringDeduplicationBenchmark.hm      1000000  thrpt   25         14,073 ±       0,068  ops/s
StringDeduplicationBenchmark.intern        1  thrpt   25   10040026,472 ±   77470,327  ops/s
StringDeduplicationBenchmark.intern      100  thrpt   25      87644,048 ±     844,053  ops/s
StringDeduplicationBenchmark.intern    10000  thrpt   25        764,752 ±      34,300  ops/s
StringDeduplicationBenchmark.intern  1000000  thrpt   25          2,956 ±       0,024  ops/s
StringDeduplicationBenchmark.noop          1  thrpt   25  148719353,102 ±  743493,703  ops/s
StringDeduplicationBenchmark.noop        100  thrpt   25    1491614,947 ±    1587,406  ops/s
StringDeduplicationBenchmark.noop      10000  thrpt   25       9848,732 ±      11,076  ops/s
StringDeduplicationBenchmark.noop    1000000  thrpt   25         78,726 ±       0,064  ops/s

https://gist.github.com/rzo1/754b15381041b28718cf44073f4986c5

@rzo1 rzo1 marked this pull request as ready for review December 18, 2023 11:17
@jzonthemtn
Contributor

Love it. Curious to see the impact on things like training.

@rzo1
Contributor Author

rzo1 commented Dec 18, 2023

Here is the other benchmark:

Benchmark                                                                 (internerClazz)   (size)   Mode  Cnt          Score          Error  Units
StringListBenchmark.newWithArrayConstructor  opennlp.tools.util.jvm.CHMStringDeduplicator        1  thrpt   25   45717169,968 ±  4894532,156  ops/s
StringListBenchmark.newWithArrayConstructor  opennlp.tools.util.jvm.CHMStringDeduplicator      100  thrpt   25     544627,570 ±     9900,671  ops/s
StringListBenchmark.newWithArrayConstructor  opennlp.tools.util.jvm.CHMStringDeduplicator    10000  thrpt   25       3876,808 ±      211,760  ops/s
StringListBenchmark.newWithArrayConstructor  opennlp.tools.util.jvm.CHMStringDeduplicator  1000000  thrpt   25         11,190 ±        1,009  ops/s
StringListBenchmark.newWithArrayConstructor      opennlp.tools.util.jvm.CHMStringInterner        1  thrpt   25   43774046,611 ± 11369890,960  ops/s
StringListBenchmark.newWithArrayConstructor      opennlp.tools.util.jvm.CHMStringInterner      100  thrpt   25     557563,692 ±    12776,109  ops/s
StringListBenchmark.newWithArrayConstructor      opennlp.tools.util.jvm.CHMStringInterner    10000  thrpt   25       2951,642 ±      189,985  ops/s
StringListBenchmark.newWithArrayConstructor      opennlp.tools.util.jvm.CHMStringInterner  1000000  thrpt   25          8,232 ±        0,847  ops/s
StringListBenchmark.newWithArrayConstructor       opennlp.tools.util.jvm.HMStringInterner        1  thrpt   25   63103971,334 ±  1958932,641  ops/s
StringListBenchmark.newWithArrayConstructor       opennlp.tools.util.jvm.HMStringInterner      100  thrpt   25     854268,631 ±    22261,811  ops/s
StringListBenchmark.newWithArrayConstructor       opennlp.tools.util.jvm.HMStringInterner    10000  thrpt   25       3191,138 ±      241,943  ops/s
StringListBenchmark.newWithArrayConstructor       opennlp.tools.util.jvm.HMStringInterner  1000000  thrpt   25          8,891 ±        1,040  ops/s
StringListBenchmark.newWithArrayConstructor      opennlp.tools.util.jvm.JvmStringInterner        1  thrpt   25    8881121,516 ±   108767,585  ops/s
StringListBenchmark.newWithArrayConstructor      opennlp.tools.util.jvm.JvmStringInterner      100  thrpt   25      81073,526 ±      342,867  ops/s
StringListBenchmark.newWithArrayConstructor      opennlp.tools.util.jvm.JvmStringInterner    10000  thrpt   25        729,416 ±        6,782  ops/s
StringListBenchmark.newWithArrayConstructor      opennlp.tools.util.jvm.JvmStringInterner  1000000  thrpt   25          2,346 ±        0,076  ops/s
StringListBenchmark.newWithArrayConstructor     opennlp.tools.util.jvm.NoOpStringInterner        1  thrpt   25  132530907,915 ±   368546,226  ops/s
StringListBenchmark.newWithArrayConstructor     opennlp.tools.util.jvm.NoOpStringInterner      100  thrpt   25    6446247,778 ±    14485,272  ops/s
StringListBenchmark.newWithArrayConstructor     opennlp.tools.util.jvm.NoOpStringInterner    10000  thrpt   25      53441,292 ±      107,629  ops/s
StringListBenchmark.newWithArrayConstructor     opennlp.tools.util.jvm.NoOpStringInterner  1000000  thrpt   25        124,158 ±        1,921  ops/s

https://gist.github.com/rzo1/1e85f65ff5d152e168c58e72b2de5375

@rzo1
Contributor Author

rzo1 commented Dec 18, 2023

Basically, it is now ready for review. Latest eval run also green: https://ci-builds.apache.org/job/OpenNLP/job/eval-tests-configurable/14/

@rzo1 rzo1 requested review from mawiesne and kinow December 18, 2023 14:29
Contributor

@mawiesne mawiesne left a comment


Thx @rzo1 for this substantial improvement, tackling a three-digit issue number!

I've left minor comments, which can easily be addressed.

@rzo1 rzo1 requested a review from mawiesne December 18, 2023 19:46
@rzo1
Contributor Author

rzo1 commented Dec 22, 2023

If this code is integrated, there is another branch based on this PR ( 7d55599 ) which will remove the unnecessary creation of StringListWrapper objects.

@kinow
Member

kinow commented Dec 22, 2023

I'm working until noon today, then I should have time to test it again 🎉 (sorry for the delay)

Trying the branch again after rebuilding it, but I keep getting

# Run progress: 0.74% complete, ETA 00:01:51
# Fork: 1 of 1
<failure>

java.lang.NoClassDefFoundError: opennlp/tools/jmh/ExecutionPlan
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:375)
	at org.openjdk.jmh.util.ClassUtils.loadClass(ClassUtils.java:73)
...

@rzo1
Contributor Author

rzo1 commented Dec 22, 2023

Did you activate the JMH profile, so it adds the related dependencies? Also note that ExecutionPlan was moved to opennlp-tools/src/jmh/java/opennlp/tools/util/jvm/jmh (so it might be a flawed left-over / old run config)

@kinow
Member

kinow commented Dec 22, 2023

I did, like the previous time. I invalidated the cache and restarted IntelliJ, deleted the Run configuration… that's really odd — although it's Friday afternoon, so my 🧠 is probably giving up on me already. Will try again later.

Member

@kinow kinow left a comment


I did a mvn clean install -DskipTests and that solved it.

Then I left it running, and after 40 minutes on my old Thinkpad:

# Run progress: 7.50% complete, ETA 05:13:04

😅 I think I will skip running the memory tests on my machine. Code looks OK, so no objection to merge it from me (we can revert if needed). Thanks

@rzo1
Contributor Author

rzo1 commented Dec 22, 2023

@kinow thx for trying. Both JMH benchmarks together took around 8h on my old tower. I will address your comments, squash it up and merge it. Stay tuned for the wrapper removal afterwards 😁

@rzo1 rzo1 merged commit 497a7f0 into main Dec 22, 2023
8 checks passed
@rzo1 rzo1 deleted the OPENNLP-421 branch December 22, 2023 18:35
@rzo1
Contributor Author

rzo1 commented Dec 22, 2023

Thx to all reviewers for the feedback & ideas! I think we achieved something good here for 2.3.2! Another PR for OPENNLP-421 is incoming in a few minutes.
