Skip to content

use rapidhash #57509

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 48 commits into from
May 10, 2025
Merged

use rapidhash #57509

merged 48 commits into from
May 10, 2025

Conversation

adienes
Copy link
Member

@adienes adienes commented Feb 23, 2025

closes #57235

todos:

  • are the test changes acceptable?
  • default seed going from 0 --> 0xbdd89aa982704029
  • lots and lots of benchmarking
  • I don't really understand the effects stuff (for the String hash)
  • how configurable should the secret be? to address randomize hash values per process #37166 should the seed/secret be an env var?
  • proper attribution to creators of rapidhash and also to reference implementations
  • I changed some h + seed to h ⊻ seed only because I don't really get why + was used in the first place
  • should instances of hash(x) - 3h instead be hash(x, h) ?

@adienes adienes added performance Must go faster hashing labels Feb 23, 2025
@adienes adienes requested a review from oscardssmith February 23, 2025 22:44
@oscardssmith
Copy link
Member

oscardssmith commented Feb 24, 2025

default seed going from 0 --> 0xbdd89aa982704029

yes

should instances of hash(x) - 3h instead be hash(x, h)

yes

I'm curious to see how the benchmarks look. It would be very cool if this was faster.

@nsajko
Copy link
Contributor

nsajko commented Feb 24, 2025

xref #52440

@nsajko
Copy link
Contributor

nsajko commented Feb 25, 2025

  • Why does this change the dispatch on the second argument to hash from UInt to UInt64 in some places?
  • I don't think this PR should add new arguments to hash, or change the meaning of existing argument positions. It's probably better to separate performance improvements and new features into separate PRs. You could use an internal function (_hash?) which accepts four, or however many necessary, arguments, and keep the dispatch for hash methods as-is. In this PR, at least.

@adienes
Copy link
Member Author

adienes commented Feb 27, 2025

should be in a more reviewable state now. I applied review comments so I (think) the hash methods are identical to master. benchmarks look pretty good locally --- enormously faster on String and about the same / slightly faster on everything else

note that I took some liberty with the hash(::UInt64) method to do a 2-round feistel design as suggested by Oscar on zulip. this has the downside that it's no longer "true" rapidhash as in the hash will not match a String with the same bitstring, but it has the upsides that it's quite a bit faster and also invertible. at least in terms of randomness used, true rapidhash on 8 bytes won't use the third secret term either

@nanosoldier runbenchmarks(ALL, vs=":master")

@adienes
Copy link
Member Author

adienes commented Feb 27, 2025

why didn't that trigger a benchmark run?

@jakobnissen
Copy link
Member

I believe you need special permissions to run Nanosoldier. Maybe ask someone on Slack to do it

@KristofferC
Copy link
Member

@nanosoldier runbenchmarks(ALL, vs=":master")

@nsajko
Copy link
Contributor

nsajko commented Feb 27, 2025

Question: objectid is implemented in C like this:

function objectid(@nospecialize(x))
@_total_meta
return ccall(:jl_object_id, UInt, (Any,), x)
end

julia/src/builtins.c

Lines 292 to 488 in 4a6ada6

// object_id ------------------------------------------------------------------
static uintptr_t bits_hash(const void *b, size_t sz) JL_NOTSAFEPOINT
{
switch (sz) {
case 1: return int32hash(*(const int8_t*)b);
case 2: return int32hash(jl_load_unaligned_i16(b));
case 4: return int32hash(jl_load_unaligned_i32(b));
#ifdef _P64
case 8: return int64hash(jl_load_unaligned_i64(b));
#else
case 8: return int64to32hash(jl_load_unaligned_i64(b));
#endif
default:
#ifdef _P64
return memhash((const char*)b, sz);
#else
return memhash32((const char*)b, sz);
#endif
}
}
static uintptr_t NOINLINE hash_svec(jl_svec_t *v) JL_NOTSAFEPOINT
{
uintptr_t h = 0;
size_t i, l = jl_svec_len(v);
for (i = 0; i < l; i++) {
jl_value_t *x = jl_svecref(v, i);
uintptr_t u = (x == NULL) ? 0 : jl_object_id(x);
h = bitmix(h, u);
}
return h;
}
static uintptr_t immut_id_(jl_datatype_t *dt, jl_value_t *v, uintptr_t h) JL_NOTSAFEPOINT;
typedef struct _varidx {
jl_tvar_t *var;
struct _varidx *prev;
} jl_varidx_t;
static uintptr_t type_object_id_(jl_value_t *v, jl_varidx_t *env) JL_NOTSAFEPOINT
{
if (v == NULL)
return 0;
jl_datatype_t *tv = (jl_datatype_t*)jl_typeof(v);
if (tv == jl_tvar_type) {
jl_varidx_t *pe = env;
int i = 0;
while (pe != NULL) {
if (pe->var == (jl_tvar_t*)v)
return (i<<8) + 42;
i++;
pe = pe->prev;
}
uintptr_t bits = jl_astaggedvalue(v)->header;
if (bits & GC_IN_IMAGE)
return ((uintptr_t*)v)[-2];
return inthash((uintptr_t)v);
}
if (tv == jl_uniontype_type) {
return bitmix(bitmix(jl_object_id((jl_value_t*)tv),
type_object_id_(((jl_uniontype_t*)v)->a, env)),
type_object_id_(((jl_uniontype_t*)v)->b, env));
}
if (tv == jl_unionall_type) {
jl_unionall_t *u = (jl_unionall_t*)v;
uintptr_t h = u->var->name->hash;
h = bitmix(h, type_object_id_(u->var->lb, env));
h = bitmix(h, type_object_id_(u->var->ub, env));
jl_varidx_t e = { u->var, env };
return bitmix(h, type_object_id_(u->body, &e));
}
if (tv == jl_datatype_type) {
jl_datatype_t *dtv = (jl_datatype_t*)v;
if (dtv->isconcretetype)
return dtv->hash;
uintptr_t h = ~dtv->name->hash;
size_t i, l = jl_nparams(v);
for (i = 0; i < l; i++) {
h = bitmix(h, type_object_id_(jl_tparam(v, i), env));
}
return h;
}
if (tv == jl_vararg_type) {
jl_vararg_t *vm = (jl_vararg_t*)v;
jl_value_t *t = vm->T ? vm->T : (jl_value_t*)jl_any_type;
jl_value_t *n = vm->N ? vm->N : jl_nothing;
return bitmix(type_object_id_(t, env),
type_object_id_(n, env));
}
if (tv == jl_symbol_type)
return ((jl_sym_t*)v)->hash;
if (tv == jl_module_type)
return ((jl_module_t*)v)->hash;
assert(!tv->name->mutabl);
return immut_id_(tv, v, tv->hash);
}
static uintptr_t immut_id_(jl_datatype_t *dt, jl_value_t *v, uintptr_t h) JL_NOTSAFEPOINT
{
size_t sz = jl_datatype_size(dt);
if (sz == 0)
return ~h;
size_t f, nf = jl_datatype_nfields(dt);
if (nf == 0 || (!dt->layout->flags.haspadding && dt->layout->flags.isbitsegal && dt->layout->npointers == 0)) {
// operate element-wise if there are unused bits inside,
// otherwise just take the whole data block at once
// a few select pointers (notably symbol) also have special hash values
// which may affect the stability of the objectid hash, even though
// they don't affect egal comparison
return bits_hash(v, sz) ^ h;
}
if (dt == jl_unionall_type)
return type_object_id_(v, NULL);
for (f = 0; f < nf; f++) {
size_t offs = jl_field_offset(dt, f);
char *vo = (char*)v + offs;
uintptr_t u;
if (jl_field_isptr(dt, f)) {
jl_value_t *f = *(jl_value_t**)vo;
u = (f == NULL) ? 0 : jl_object_id(f);
}
else {
jl_datatype_t *fieldtype = (jl_datatype_t*)jl_field_type_concrete(dt, f);
if (jl_is_uniontype(fieldtype)) {
uint8_t sel = ((uint8_t*)vo)[jl_field_size(dt, f) - 1];
fieldtype = (jl_datatype_t*)jl_nth_union_component((jl_value_t*)fieldtype, sel);
}
assert(jl_is_datatype(fieldtype) && !fieldtype->name->abstract && !fieldtype->name->mutabl);
int32_t first_ptr = fieldtype->layout->first_ptr;
if (first_ptr >= 0 && ((jl_value_t**)vo)[first_ptr] == NULL) {
// If the field is a inline immutable that can be can be undef
// we need to check to check for undef first since undef struct
// may have fields that are different but should still be treated as equal.
u = 0;
}
else {
u = immut_id_(fieldtype, (jl_value_t*)vo, 0);
}
}
h = bitmix(h, u);
}
return h;
}
static uintptr_t NOINLINE jl_object_id__cold(uintptr_t tv, jl_value_t *v) JL_NOTSAFEPOINT
{
jl_datatype_t *dt = (jl_datatype_t*)jl_to_typeof(tv);
if (dt->name->mutabl) {
if (dt == jl_string_type) {
#ifdef _P64
return memhash_seed(jl_string_data(v), jl_string_len(v), 0xedc3b677);
#else
return memhash32_seed(jl_string_data(v), jl_string_len(v), 0xedc3b677);
#endif
}
if (dt == jl_simplevector_type)
return hash_svec((jl_svec_t*)v);
if (dt == jl_datatype_type) {
jl_datatype_t *dtv = (jl_datatype_t*)v;
uintptr_t h = ~dtv->name->hash;
return bitmix(h, hash_svec(dtv->parameters));
}
if (dt == jl_module_type) {
jl_module_t *m = (jl_module_t*)v;
return m->hash;
}
uintptr_t bits = jl_astaggedvalue(v)->header;
if (bits & GC_IN_IMAGE)
return ((uintptr_t*)v)[-2];
return inthash((uintptr_t)v);
}
return immut_id_(dt, v, dt->hash);
}
JL_DLLEXPORT inline uintptr_t jl_object_id_(uintptr_t tv, jl_value_t *v) JL_NOTSAFEPOINT
{
if (tv == jl_symbol_tag << 4) {
return ((jl_sym_t*)v)->hash;
}
else if (tv == jl_datatype_tag << 4) {
jl_datatype_t *dtv = (jl_datatype_t*)v;
if (dtv->isconcretetype)
return dtv->hash;
}
else if (tv == (uintptr_t)jl_typename_type) {
return ((jl_typename_t*)v)->hash;
}
return jl_object_id__cold(tv, v);
}
JL_DLLEXPORT uintptr_t jl_object_id(jl_value_t *v) JL_NOTSAFEPOINT
{
return jl_object_id_(jl_typetagof(v), v);
}

Does objectid have to be kept in sync with hash?

@nanosoldier
Copy link
Collaborator

Your job failed.

@KristofferC
Copy link
Member

Ah, this branch doesn't build:


LoadError(at "sysimg.jl" line 5: LoadError(at "Base.jl" line 42: LoadError(at "float.jl" line 723: MethodError(f=Base.hash, args=(0x7ff8000000000000, 0xbdd89aa982704029), world=0x0000331f))))
--
  | jl_method_error_bare at [buildroot]/src/gf.c:2508
  | jl_method_error at [buildroot]/src/gf.c:2526
  | jl_f_throw_methoderror at [buildroot]/src/builtins.c:587 [inlined]
  | jl_f_throw_methoderror at [buildroot]/src/builtins.c:583
  | hash at ./hashing.jl:33

@adienes
Copy link
Member Author

adienes commented Feb 27, 2025

weird, it builds fine "on my machine" even after a make cleanall

is it possible the build success is architecture dependent?

Does objectid have to be kept in sync with hash?

I don't think so, besides the fact that some types use objectid as part of their hash. but happy to be corrected on that

@oscardssmith
Copy link
Member

@adienes the reason your nanosoldier invocation didn't trigger is that your membership in the JuliaLang org is currently set to private.

@nsajko
Copy link
Contributor

nsajko commented Feb 27, 2025

I didn't know JuliaLang is a military organization 😆

@KristofferC
Copy link
Member

I tried to update the branch with a rebase to see if that does something

@adienes
Copy link
Member Author

adienes commented Feb 27, 2025

thank you!

ERROR: LoadError: unable to verify download from https://pkg.julialang.org/

is this related to the PR? I had assumed it was not, but it's happened a few pushes consecutively now

@IanButterworth IanButterworth added merge me PR is reviewed. Merge when all tests are passing and removed merge me PR is reviewed. Merge when all tests are passing labels May 10, 2025
@IanButterworth
Copy link
Member

Can someone with admin rights merge this. The whitespace check passed the commit before merge with master but didn't run on the last commit for some reason. @DilumAluthge perhaps?

@DilumAluthge
Copy link
Member

Should be good to go now.

@oscardssmith oscardssmith merged commit 4a3d736 into JuliaLang:master May 10, 2025
9 checks passed
@IanButterworth
Copy link
Member

Oh ok. I expected that to rerun all CI

@oscardssmith
Copy link
Member

wait, was the ci checkmark lying to me?

@IanButterworth
Copy link
Member

No I meant I didn't want to rerun all CI. Seems like it just reran the whitespace and labels checks which is good.

@adienes adienes deleted the rapid_hash branch May 10, 2025 14:12
@giordano giordano removed the merge me PR is reviewed. Merge when all tests are passing label May 11, 2025
@DilumAluthge
Copy link
Member

If you close and reopen a PR, it will only rerun the GitHub actions jobs, such as the Whitespace job. It won't rerun the buildkite jobs.

charleskawczynski pushed a commit to charleskawczynski/julia that referenced this pull request May 12, 2025
closes JuliaLang#57235

todos:
* are the test changes acceptable?
* default seed going from `0` --> `0xbdd89aa982704029`
* lots and lots of benchmarking
* I don't really understand the effects stuff (for the `String` hash)
* how configurable should the secret be? to address
JuliaLang#37166 should the seed/secret
be an env var?
* proper attribution to creators of `rapidhash` and also to reference
implementations
* I changed some `h + seed` to `h ⊻ seed` only because I don't really
get why `+` was used in the first place
* should instances of `hash(x) - 3h` instead be `hash(x, h)` ?

---------

Co-authored-by: Jeff Bezanson <jeff.bezanson@gmail.com>
Co-authored-by: Oscar Smith <oscardssmith@gmail.com>
Co-authored-by: Mosè Giordano <765740+giordano@users.noreply.github.com>
Co-authored-by: Ian Butterworth <i.r.butterworth@gmail.com>
@oscardssmith
Copy link
Member

Might be time to update again: Nicoshev/rapidhash#16 The rapidhash author apparently has a new better algorithm

@adienes
Copy link
Member Author

adienes commented May 17, 2025

lol, I saw that 😭 lucky me

I did try it. it is indeed faster on long strings, but it seems quite a lot slower than v1 on short strings. what's interesting to me about that, is the short path logic itself is basically identical --- the only thing changed is the code size in the other branch.

that is:

if string_short
    short_string_logic
else
   # ... small block of code
end

vs

if string_short
    short_string_logic
else
    # ... large block of code
end

the first is faster even inside only the string_short path! I didn't know that was possible

but anyway given that Julia uses this as a general purpose hash and not just for big files, I think so far it looks like V1 strikes a better balance. anyway it should be eas(ier) to update in the future than this PR was, as it basically requires changing only hash_bytes in isolation

@oscardssmith
Copy link
Member

I wonder if outlining the # ... large block of code would fix the small string performance?

@ScottPJones
Copy link
Contributor

I noticed that there may be a licensing problem here, the rapidhash code is not MIT licensed, and I didn't see any changes to the Julia license files (I think it just needs the correct attribution added to THIRDPARTY.md)

@StefanKarpinski
Copy link
Member

We can't copy the code. We can use the algorithm though, but we do have to be careful not to copy the code.

@adienes
Copy link
Member Author

adienes commented May 20, 2025

I am no lawyer, nor open source license expert, but FWIW I would describe this as firmly "using the algorithm" but not "copying the code"

I actually made more use of the Rust port (https://github.com/hoxxep/rapidhash/blob/master/src/rapid_file.rs) than I did the original C++ source, as I found it easier to read. And the Rust port is licensed MIT and Apache-2.0 (and ofc references the C++ source as BSD 2-Clause)

I am happy to submit the appropriate PR to THIRDPARTY.md shortly

@StefanKarpinski
Copy link
Member

I don't think there's any third party license to add if you didn't copy the code. Using an MIT-licensed implementation as a reference is fine and doesn't require adding a license.

@ScottPJones
Copy link
Contributor

Ok, if it is actually taken from the Rust port (and that port was licensed correctly), it might be ok, but from what I know, translating a BSD 2-clause still needs the license attribution:

You are free to translate the code into another language and distribute it under the BSD 2-clause license.
Ensure that you retain the original copyright notice and disclaimer in all translations and distributions.
You can modify and distribute the translated code as well, adhering to the license's requirements.

@nsajko
Copy link
Contributor

nsajko commented May 21, 2025

translating a BSD 2-clause still needs the license attribution

This is all speculation of course, IANAL and IANYL. In any case, a translation is a derivative work in copyright terms (not just in the context of software). However I guess what happened here is not translation in that sense, rather presumably the situation is that the PR author:

  • consulted the copyrighted source code with the goal of understanding the algorithm
  • then proceeded to reimplement the algorithm in Julia

When it's two different people doing that it's known as https://en.wikipedia.org/wiki/Clean-room_design

@KristofferC
Copy link
Member

This is not a place to play lawyer. Feel free to get together, hire an IP lawyer and get an official statement if you feel strongly that something is wrong here. I'll lock this to stop further speculation.

@JuliaLang JuliaLang locked as off-topic and limited conversation to collaborators May 21, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

switch MurmurHash3 to rapidhash ?