hash_structure does not produce the same before & after ASE's write & read to XYZ #118

stenczelt · 2024-12-08T10:12:39Z

One of the derived parameters is hash_structure, which collects structural information and hashes it for matching of structures. I've noticed that this does not product the same for a structure before and after passing through ASE's extxyz write & read.

This seems to make this hash function somewhat useless, since we can reasonably expect use cases where structures are being read/written/calculated a lot of times.

As demonstration, see a failing test:

import io

import ase
import ase.io
import numpy as np
import pytest
from ase.calculators.lj import LennardJones

from abcd.model import AbstractModel


@pytest.fixture
def rng():
    return np.random.default_rng(seed=42)


def test_hash_structure(rng):
    # create atoms & add a calculator
    atoms = ase.Atoms(
        "H3",
        positions=rng.random(size=(3, 3)),
        pbc=True,
        cell=[2, 2, 2],
    )
    atoms.calc = LennardJones()
    atoms.calc.calculate(atoms)

    # dump to XYZ
    buffer = io.StringIO()
    ase.io.write(buffer, atoms, format="extxyz")

    # read back
    buffer.seek(0)
    atoms_read = ase.io.read(buffer, format="extxyz")

    # read in both of them
    abcd_data = AbstractModel.from_atoms(atoms)
    abcd_data_after_read = AbstractModel.from_atoms(atoms_read)

    assert abcd_data["hash_structure"] == abcd_data_after_read["hash_structure"]

which fails with

>       assert abcd_data["hash_structure"] == abcd_data_after_read["hash_structure"]
E       AssertionError: assert '8bb52aa8ab4be61a8ad0f76b32a0c810' == '4cf0d1cbde645832830e6f6d60f86c9a'
E         
E         - 4cf0d1cbde645832830e6f6d60f86c9a
E         + 8bb52aa8ab4be61a8ad0f76b32a0c810

The text was updated successfully, but these errors were encountered:

ElliottKasoar · 2025-01-03T14:26:08Z

I think this is related to the error that appears in #120, which has changed the behaviour of Hasher slightly.

The change I made for ruff was related to https://docs.astral.sh/ruff/rules/function-call-in-default-argument/, so currently we have the issue that something along the lines of

from abcd.model import Hasher

hasher_1 = Hasher()

init_hash = hasher_1()
hasher_1.update(b"Test value")
assert hasher_1() != init_hash

hasher_2 = Hasher()
assert hasher_2() == init_hash

fails, as hasher_2() is instead equal to hasher_1(), since the function isn't reinitialised.

This is fixed by the changes in #120, due to the change

-class Hasher(object):
-    def __init__(self, method=md5()):
-        self.method = method
+class Hasher:
+    def __init__(self, method=md5):
+        self.method = method()

The main other question is whether we actually want to separate hashers when we're saving data, or whether actually the current behaviour (the structure hash is updated with additional keys) is preferable, since otherwise the hash could be the same for two different structures.

ElliottKasoar · 2025-01-22T12:11:32Z

A few points to note, from talking to @alinelena, @oerc0122 and others:

Hashing is never going to be a perfect solution for comparing arrays of floats, similar to why we don't usually compare equality, so perhaps we can come up with better alternatives
This issue only arises if the position floats are higher precision than %16.8f, from the format map used when writing out
- If the extxyz file was written out by ASE, I think this should be quite stable. I think this errors only occur when the file starts with a higher precision than ASE would write
It's also unclear more generally how in future ase.io.write and extxyz.write may interact, which could change things

stenczelt mentioned this issue Dec 8, 2024

Fix data model #117

Merged

ElliottKasoar mentioned this issue Jan 10, 2025

Fix hashing #121

Merged

ElliottKasoar mentioned this issue Jan 22, 2025

Precision specification when writing to extxyz libAtoms/extxyz#22

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

hash_structure does not produce the same before & after ASE's write & read to XYZ #118

hash_structure does not produce the same before & after ASE's write & read to XYZ #118

stenczelt commented Dec 8, 2024

ElliottKasoar commented Jan 3, 2025

ElliottKasoar commented Jan 22, 2025

hash_structure does not produce the same before & after ASE's write & read to XYZ #118

hash_structure does not produce the same before & after ASE's write & read to XYZ #118

Comments

stenczelt commented Dec 8, 2024

ElliottKasoar commented Jan 3, 2025

ElliottKasoar commented Jan 22, 2025