Alphabetical ID (redo #1178) + Trajectory tweaks #1230

esoteric-ephemera · 2025-05-14T19:17:53Z

Alphabetical ID

Mistakenly closed #1178 by rebasing git history. This PR expands upon it.

Defines a new AlphaID class that could eventually replace / contain the current MPID class.

From internal discussions, the benefit of the MPID system was brevity when the system was relatively new. "mp-149" is easy to remember, whereas the numeric parts of the current batch of MPIDs exceed 3,000,000.

To replace / augment the current MPID system, we need an identifier that:

Can be sorted
Can mint an $N+1$ ID given that $N$ IDs have currently been assigned
Is easy to remember

From this, it was suggested that an alphabetical string could be used instead (to avoid clashes with the current MPIDs, no numbers can be used). The string would be read as the base-26 representation of an integer, i.e.:

"a" = $0 \times 26^0 = 0$
"bc" = $1 \times 26^1 + 2 \times 26^0 = 28$
"aaft" = $0 \times 26^3 + 0 \times 26^2 + 5 \times 26^1 + 19 \times 26^0 = 149$

The current implementation supports these features, as well as addition, subtraction, ==, <, > with other AlphaID, int, and the existing MPID. Includes convenience constructor from MPID to AlphaID.
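
The base-26 reading of the string described above can be sketched in a few lines of Python. This is only a minimal illustration, not the AlphaID implementation from this PR; the helper names `alpha_to_int` and `int_to_alpha` are hypothetical.

```python
from string import ascii_lowercase


def alpha_to_int(s: str) -> int:
    """Read a lowercase string as a base-26 integer ("a" = 0, ..., "z" = 25)."""
    value = 0
    for char in s:
        value = 26 * value + ascii_lowercase.index(char)
    return value


def int_to_alpha(n: int) -> str:
    """Inverse of alpha_to_int, without any padding (0 maps to "a")."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    chars = []
    while True:
        n, rem = divmod(n, 26)
        chars.append(ascii_lowercase[rem])
        if n == 0:
            break
    return "".join(reversed(chars))


print(alpha_to_int("bc"))    # 28
print(alpha_to_int("aaft"))  # 149
print(int_to_alpha(149))     # ft
```

Because "a" plays the role of zero, leading "a" characters do not change the integer value, which is why "aaft" and "ft" both correspond to 149.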

Tests included for these features.

To make these easy to remember, we may want to set the pad length (the number of leading zeroes or "a" characters) to be at least 6, which would give us $26^6 = 308,915,776$ total task IDs (minus the ~3,100,000 that have currently been assigned).
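
As a rough sketch of the padding idea, assuming padding simply means left-filling with "a" (the zero digit), the capacity for a given pad length and the padded form of a short string could look like this. The helper name `pad_alpha` is hypothetical and is not taken from the PR.

```python
PAD_LEN = 6  # pad length proposed in the description above


def pad_alpha(s: str, pad_len: int = PAD_LEN) -> str:
    """Left-pad with "a" (the zero digit) up to pad_len characters."""
    return s.rjust(pad_len, "a")


capacity = 26**PAD_LEN
print(capacity)              # 308915776
print(capacity - 3_100_000)  # 305815776 IDs left after the ~3.1M already assigned
print(pad_alpha("ft"))       # aaaaft
```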

Suggestions / discussion are welcome

Trajectory changes

Added run_type and task_type as optional top-level fields
Added ionic_step_idx to arrow de-/serialization to allow for sorting ionic steps even if preserve_index was not set while writing trajectory files
Trajectory.from_task_doc now parses sequential calculations of different CalcType as separate calculations. For example, task mp-1120260 has three calculations in calcs_reversed: a GGA static followed by two SCAN relaxations. Trajectory now parses one trajectory for the GGA static and one for the two SCAN relaxations.
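
The splitting behaviour can be pictured as grouping consecutive calculations that share a calc type. The sketch below uses `itertools.groupby` on plain dicts; it is not the code in `Trajectory.from_task_doc`, and the dict layout, the calc type strings, the ionic step counts, and the assumption that the list is already in chronological order are all illustrative.

```python
from itertools import groupby

# Hypothetical stand-in for a task's calculations, assumed to be in chronological order.
calcs = [
    {"calc_type": "GGA Static", "n_ionic_steps": 1},
    {"calc_type": "SCAN Structure Optimization", "n_ionic_steps": 12},
    {"calc_type": "SCAN Structure Optimization", "n_ionic_steps": 3},
]


def split_by_calc_type(calcs):
    """Group consecutive calculations sharing a calc_type into one block each."""
    return [
        (calc_type, list(group))
        for calc_type, group in groupby(calcs, key=lambda c: c["calc_type"])
    ]


blocks = split_by_calc_type(calcs)
for calc_type, group in blocks:
    # One trajectory per block: 1 step for the static, 15 for the two relaxations.
    print(calc_type, sum(c["n_ionic_steps"] for c in group))
```

`groupby` only merges neighbours, so two runs of the same calc type separated by a different one would stay separate.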

Misc

Closes #1234 by ensuring OSZICAR has the right leading directory name

codecov-commenter · 2025-05-14T19:22:40Z

Codecov Report

Attention: Patch coverage is 28.12500% with 115 lines in your changes missing coverage. Please review.

Project coverage is 70.03%. Comparing base (978c5fb) to head (66cdbcc).
Report is 4 commits behind head on main.

Files with missing lines	Patch %	Lines
emmet-core/emmet/core/mpid.py	28.48%	113 Missing ⚠️
emmet-core/emmet/core/tasks.py	0.00%	1 Missing ⚠️
emmet-core/emmet/core/vasp/calculation.py	0.00%	1 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (978c5fb) and HEAD (66cdbcc).

HEAD has 4 uploads less than BASE

Flag	BASE (978c5fb)	HEAD (66cdbcc)
	6	2

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #1230       +/-   ##
===========================================
- Coverage   89.89%   70.03%   -19.87%     
===========================================
  Files         149       78       -71     
  Lines       14891     5430     -9461     
===========================================
- Hits        13387     3803     -9584     
- Misses       1504     1627      +123

tsmathis · 2025-05-19T19:39:32Z

One more comment here, @esoteric-ephemera: adding AlphaID as a model field breaks the CI run on the pydantic branch:

pydantic.errors.PydanticSchemaGenerationError: Unable to generate pydantic-core schema for <class 'emmet.core.mpid.AlphaID'>. Set `arbitrary_types_allowed=True` in the model_config to ignore this error or implement `__get_pydantic_core_schema__` on your type to fully support it.

If you got this error by calling handler(<some type>) within `__get_pydantic_core_schema__` then you likely need to call `handler.generate_schema(<some type>)` since we do not call `__get_pydantic_core_schema__` on `<some type>` otherwise to avoid infinite recursion.

MPID deals with this in the three methods here:

emmet/emmet-core/emmet/core/mpid.py

Line 100 in f1f38ad

@classmethod

Something similar should be in AlphaID

esoteric-ephemera · 2025-05-20T00:08:16Z

@tsmathis Fixed the pydantic serialization and added tests for this. Also added a _calculation_to_props_dict method to Trajectory to convert an emmet.core.vasp.calculation.Calculation to the format Trajectory._from_dict expects

esoteric-ephemera · 2025-05-29T23:49:55Z

@tschaume this should be good to go - it's a bigger change but mostly independent of any existing structures in emmet-core. Do you want to look it over before merging?

tschaume

Excellent! It's basically ready to merge I think. Just have nit-picky comments for you to consider.

tschaume · 2025-05-30T00:21:36Z

emmet-core/emmet/core/mpid.py

+    from collections.abc import Callable
+    from typing import Any
+    from typing_extensions import Self
+

 # matches "mp-1234" or "1234" followed by and optional "-(Alphanumeric)"
 mpid_regex = re.compile(r"^([A-Za-z]*-)?(\d+)(-[A-Za-z0-9]+)*$")


Should the first group be [A-Za-z]+- (+ instead of *)?

Is the alphanumeric string before the first - in a molecule ID always the same length?

Yeah, that should be + for the legacy MPIDs. The current pattern allows MPID(-100), which is not good.

For the molecule IDs: the first prefix of every ID looks to be a BLAKE2 hash based on emmet.core.utils.get_molecule_id, so it should be the same length for all of them.
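
To make the `*` versus `+` point concrete, the snippet below compares the pattern quoted in the diff with the variant suggested in review. It only exercises Python's `re` module on strings; how MPID itself handles such input is not shown here.

```python
import re

# Pattern quoted in the diff: the prefix letters are optional ("*").
loose = re.compile(r"^([A-Za-z]*-)?(\d+)(-[A-Za-z0-9]+)*$")
# Variant suggested in review: at least one letter before the first "-".
strict = re.compile(r"^([A-Za-z]+-)?(\d+)(-[A-Za-z0-9]+)*$")

for candidate in ("mp-149", "149", "-100"):
    print(candidate, bool(loose.match(candidate)), bool(strict.match(candidate)))
# mp-149 True True
# 149 True True
# -100 True False
```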

tschaume · 2025-05-30T00:29:55Z

emmet-core/emmet/core/mpid.py

@@ -74,7 +83,7 @@ def __str__(self):
    def __repr__(self):
        return f"MPID({self})"

-    def __lt__(self, other: Union["MPID", int, str]):
+    def __lt__(self, other: MPID | int | str):
        other_parts = MPID(other).parts

        if self.parts[0] != "" and other_parts[0] != "":


Would this be equivalent to if self.parts[0] and other_parts[0]:?

That should be fine, but there's another weird thing about this comparison: in contrast to AlphaID, MPIDs with different prefixes are comparable, so you get this situation where:

MPID('mp-100') < MPID('mvc-100')

because mp is shorter than mvc. Not sure we want that behavior?

We probably do, to ensure that mvc-prefixed IDs always sort above mp-prefixed IDs, but I just want to check.

There's also a bug in MPID.__gt__ that I'll try to fix:

MPID('100') < MPID(100) >>> False
MPID('100') > MPID(100) >>> True
MPID(100) > MPID('100') >>> True
MPID('100') == MPID(100) >>> True
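
The failure mode reported above comes from defining `__gt__` as the negation of `__lt__`: "not less than" means "greater than or equal", so equal values compare as greater. A toy reproduction follows, together with one possible remedy using `functools.total_ordering`. The classes `NaiveID` and `OrderedID` are hypothetical, and this is not necessarily the fix adopted in the PR.

```python
from functools import total_ordering


class NaiveID:
    """Toy ID whose __gt__ simply negates __lt__ (the pattern under discussion)."""

    def __init__(self, value: int) -> None:
        self.value = value

    def __eq__(self, other: object) -> bool:
        return isinstance(other, NaiveID) and self.value == other.value

    def __hash__(self) -> int:
        return hash(self.value)

    def __lt__(self, other: "NaiveID") -> bool:
        return self.value < other.value

    def __gt__(self, other: "NaiveID") -> bool:
        # "not <" is ">=", so equal values wrongly compare as greater.
        return not self.__lt__(other)


@total_ordering
class OrderedID:
    """Same toy ID, with the other comparisons derived from __eq__ and __lt__."""

    def __init__(self, value: int) -> None:
        self.value = value

    def __eq__(self, other: object) -> bool:
        return isinstance(other, OrderedID) and self.value == other.value

    def __hash__(self) -> int:
        return hash(self.value)

    def __lt__(self, other: "OrderedID") -> bool:
        return self.value < other.value


print(NaiveID(100) > NaiveID(100))      # True (wrong)
print(OrderedID(100) > OrderedID(100))  # False
print(OrderedID(101) > OrderedID(100))  # True
```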

tschaume · 2025-05-30T00:33:39Z

emmet-core/emmet/core/mpid.py

@@ -91,7 +100,7 @@ def __lt__(self, other: Union["MPID", int, str]):
            # both are pure ints; normal comparison
            return self.parts[1] < other_parts[1]

-    def __gt__(self, other: Union["MPID", int, str]):
+    def __gt__(self, other: MPID | int | str):
        return not self.__lt__(other)

    def __hash__(self):


Use a global variable for the MPID regex pattern, shared by __get_pydantic_json_schema__ and mpid_regex, to avoid having to change the regex in two places if needed? Same for the molecule ID.

tschaume · 2025-05-30T00:39:45Z

emmet-core/emmet/core/mpid.py

+            ):
+                separator = list(non_alpha_char)[0]
+                prefix, identifier = identifier.split(separator)
+            elif len(non_alpha_char) > 1:


What if identifier is a str but doesn't contain any non-alphanumeric characters?

Good catch, will add an exception for this. Note that AlphaID('') is valid input and gives a zero-valued identifier

tschaume · 2025-05-30T00:45:50Z

emmet-core/emmet/core/mpid.py

+            prefix, identifier = identifier.parts
+            separator = "-"
+
+        if isinstance(identifier, str) and set(identifier).intersection(digits) > set():


set(identifier).intersection(digits) > set() is the same as just set(identifier).intersection(digits), right?
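
A quick check of the equivalence in question: for sets, `x > set()` is a proper-superset test against the empty set, which holds exactly when `x` is non-empty, so in a boolean context it agrees with the truthiness of `x`. The function names below are hypothetical.

```python
from string import digits


def has_digit_verbose(identifier: str) -> bool:
    # Proper-superset comparison against the empty set.
    return set(identifier).intersection(digits) > set()


def has_digit(identifier: str) -> bool:
    # Truthiness of the intersection: an empty set is falsy.
    return bool(set(identifier).intersection(digits))


for identifier in ("aaft", "mp-149", ""):
    print(repr(identifier), has_digit_verbose(identifier), has_digit(identifier))
# 'aaft' False False
# 'mp-149' True True
# '' False False
```

The one difference is the type: the comparison yields a bool, while the bare intersection yields a set, which only matters if the value is used outside an `if` or `and` expression.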

tschaume · 2025-05-30T17:44:13Z

emmet-core/emmet/core/mpid.py

+        will not add the two. Only checks the separator if prefixes are both non-null.
+
+        Args:
+            other (str or int) : the value to add to the current identifier.


other could also be an AlphaID, right?

Yeah, just an outdated docstring.

tschaume · 2025-05-30T17:46:42Z

emmet-core/emmet/core/mpid.py

+
+        Will not subtract two AlphaIDs if `prefix` and `separator` do not match.
+        """
+        if isinstance(other, MPID):


there's some code here that's identical to the one used in __add__. Would it make sense to factor it out? Maybe the checks are not necessary in __sub__ since __add__ will check -diff?

For __add__ / __sub__ and __gt__ / __lt__ it's not possible to just negate the output of whichever method is defined. Gave an example of how this breaks for MPID above, which uses __gt__ := not __lt__

But I can organize this into a staticmethod

tschaume · 2025-05-30T17:49:12Z

emmet-core/emmet/core/mpid.py

+        else:
+            test = other
+
+        if isinstance(test, AlphaID) and (


Does this test the same as in __lt__?

merged this into the staticmethod _coerce_value used across add, sub, lt, gt
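
For readers following along, here is a rough sketch of what sharing one coercion helper across the arithmetic and comparison methods can look like. Only the name `_coerce_value` comes from the comment above; its signature, the `ToyID` class, and the prefix check are assumptions made for illustration and do not reproduce the AlphaID code.

```python
class ToyID:
    """Toy integer-backed ID; not the AlphaID class from this PR."""

    def __init__(self, value: int, prefix: str = "") -> None:
        self.value = value
        self.prefix = prefix

    @staticmethod
    def _coerce_value(left: "ToyID", right: "ToyID | int") -> int:
        """Return right as an int, refusing IDs whose prefix differs from left's."""
        if isinstance(right, ToyID):
            if left.prefix != right.prefix:
                raise ValueError("prefixes do not match")
            return right.value
        if isinstance(right, int):
            return right
        raise TypeError(f"cannot interpret {type(right).__name__} as an ID")

    def __add__(self, other: "ToyID | int") -> "ToyID":
        return ToyID(self.value + self._coerce_value(self, other), self.prefix)

    def __sub__(self, other: "ToyID | int") -> "ToyID":
        return ToyID(self.value - self._coerce_value(self, other), self.prefix)

    def __lt__(self, other: "ToyID | int") -> bool:
        return self.value < self._coerce_value(self, other)

    def __gt__(self, other: "ToyID | int") -> bool:
        return self.value > self._coerce_value(self, other)


print((ToyID(5, "mp") + 3).value)               # 8
print((ToyID(5, "mp") - ToyID(2, "mp")).value)  # 3
print(ToyID(1) < 2, ToyID(2) > 2)               # True False
```

Each operator keeps its own explicit definition (no negation of a sibling method), while the validation lives in one place.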

tschaume · 2025-05-30T17:53:55Z

emmet-core/emmet/core/trajectory.py

+            rt = run_type(padded_params)
+            tt = task_type(vis)
+
+        props: dict[str, list[Any]] = {


Use defaultdict here?
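
As a sketch of the suggestion, `collections.defaultdict(list)` removes the need to pre-declare every key before appending per-step values. The step data and key names below are made up for illustration and are not the fields used in trajectory.py.

```python
from collections import defaultdict

# Hypothetical per-ionic-step records; the second step lacks "forces".
steps = [
    {"energy": -10.2, "forces": [[0.0, 0.0, 0.1]]},
    {"energy": -10.5},
]

props: defaultdict[str, list] = defaultdict(list)
for step in steps:
    for key, value in step.items():
        props[key].append(value)  # missing keys start as an empty list

print(dict(props))
# {'energy': [-10.2, -10.5], 'forces': [[[0.0, 0.0, 0.1]]]}
```

One trade-off: merely reading a missing key from a defaultdict inserts it, so a plain dict with explicit keys can be preferable when the set of keys is fixed.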

tschaume · 2025-05-30T17:56:17Z

emmet-core/emmet/core/trajectory.py

+                if icr > 0:
+                    trajs.append(cls._from_dict(props, **old_meta, **kwargs))  # type: ignore[arg-type]
+
+                props = new_props.copy()


Does this need deepcopy?

To be safe, probably yes, but I don't think there have been parsing issues so far using shallow copy
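
The difference matters only if the lists inside the dict are mutated after the copy is taken. A minimal demonstration with a made-up `new_props` dict (not the actual parsing code):

```python
from copy import deepcopy

new_props = {"energy": [-10.2]}

shallow = new_props.copy()   # new dict, but the same inner list object
deep = deepcopy(new_props)   # new dict and new inner list

new_props["energy"].append(-10.5)

print(shallow["energy"])  # [-10.2, -10.5]
print(deep["energy"])     # [-10.2]
```

If `new_props` is rebuilt from scratch for each calculation rather than appended to in place, the shallow copy is harmless, which would be consistent with no parsing issues having shown up so far.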

esoteric-ephemera changed the title ~~Alphabetical ID (redo #1178)~~ Alphabetical ID (redo #1178) + Trajectory tweaks May 15, 2025

esoteric-ephemera added 15 commits May 29, 2025 16:48

redraft alpha id

7c384d6

add gt / lt

50487b5

mypy/pcmt

544b5e7

add padding tests

ef5a783

hashing + correct add/sub/roundtrip

e33e422

add hashing, sorting tests

fd81760

tweak trajectory arrow serialization

a8c6721

pcmt

f23b27f

ensure traj parses non-sequential calcs of different calc_type separately

5c594a3

allow for printing AlphaID as legacy MPID for all currently minted MPIDs

446ed65

spruce up tests

af8e3c9

lint

f69fd8b

add convergence data to trajectory, make to_arrow put entire traj in one row of parquet-like dataset

a4bb958

make coords optional at each ionic step, add has_complete_output property / to_arrow field

e8d66c7

ensure oszicar is parsed relative to specified dir name

12f4a94

esoteric-ephemera force-pushed the alpha_id branch from f189d54 to 12f4a94 Compare May 29, 2025 23:48

tschaume self-requested a review May 30, 2025 00:18

tschaume approved these changes May 30, 2025

esoteric-ephemera added 2 commits June 2, 2025 12:25

review revisions

359bf3f

Make base trajectory class more flexible in handling molecules / structures

66cdbcc
