Alphabetical ID (redo #1178) + Trajectory tweaks #1230

Open
wants to merge 17 commits into
base: main
from

Conversation

esoteric-ephemera
Collaborator

@esoteric-ephemera esoteric-ephemera commented May 14, 2025

Alphabetical ID

Mistakenly closed #1178 by rebasing git history. This PR expands upon it.

Defines a new AlphaID class that could eventually replace / contain the current MPID class.

From internal discussions, the benefit of the MPID system was brevity when the system was relatively new. "mp-149" is easy to remember, whereas the current batch of MPIDs exceeds 3,000,000.

To replace / augment the current MPID system, we need an identifier that:

  • Can be sorted
  • Can mint an $N+1$ ID given that $N$ IDs have currently been assigned
  • Is easy to remember

From this, it was suggested that an alphabetical string could be used instead (no numbers can be used, to avoid clashes with the current MPIDs). The string would essentially be read as the base-26 representation of an integer, i.e.:

  • "a" = $0 \times 26^0 = 0$
  • "bc" = $1 \times 26^1 + 2 \times 26^0 = 28$
  • "aaft" = $0 \times 26^3 + 0 \times 26^2 + 5 \times 26^1 + 19 \times 26^0 = 149$
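The mapping above can be sketched in a few lines of Python. This is only an illustration of the base-26 arithmetic, not the `AlphaID` implementation in this PR; the function names `alpha_to_int` and `int_to_alpha` are made up for the example.

```python
from string import ascii_lowercase


def alpha_to_int(identifier: str) -> int:
    """Read a lowercase string as a base-26 integer ("a" = 0, ..., "z" = 25)."""
    value = 0
    for char in identifier:
        # ascii_lowercase.index raises ValueError for anything outside a-z
        value = value * 26 + ascii_lowercase.index(char)
    return value


def int_to_alpha(value: int, pad: int = 1) -> str:
    """Inverse of alpha_to_int, left-padded with "a" (the zero digit)."""
    if value < 0:
        raise ValueError("value must be non-negative")
    chars = []
    while value > 0:
        value, remainder = divmod(value, 26)
        chars.append(ascii_lowercase[remainder])
    return "".join(reversed(chars)).rjust(pad, "a")


print(alpha_to_int("a"))         # 0
print(alpha_to_int("bc"))        # 28
print(alpha_to_int("aaft"))      # 149
print(int_to_alpha(149, pad=4))  # aaft
```

Because "a" is the zero digit, leading "a" characters do not change the integer value, which is what makes padding possible.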

The current implementation supports these features, as well as addition, subtraction, ==, <, > with other AlphaID, int, and the existing MPID. Includes convenience constructor from MPID to AlphaID.

Tests included for these features.

To make these easy to remember, we may want to set the pad length (the number of leading zeroes or "a" characters) to be at least 6, which would give us $26^6 = 308,915,776$ total task IDs (minus the ~3,100,000 that have currently been assigned).
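As a quick check of the numbers quoted here, assuming a pad length of 6 and the approximate count of 3,100,000 assigned IDs from the paragraph above (the variable names are only for illustration):

```python
PAD_LENGTH = 6

total_ids = 26**PAD_LENGTH
assigned = 3_100_000  # approximate figure quoted in the description
remaining = total_ids - assigned

print(f"{total_ids:,}")  # 308,915,776
print(f"{remaining:,}")  # 305,815,776

# Padding only prepends the zero digit "a", so "ft" (149) becomes "aaaaft".
print("ft".rjust(PAD_LENGTH, "a"))  # aaaaft
```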

Suggestions / discussion are welcome

Trajectory changes

  • Added run_type and task_type as optional top-levels
  • Added ionic_step_idx to arrow de-/serialization to allow for sorting ionic steps even if preserve_index was not set while writing trajectory files
  • Trajectory.from_task_doc now parses sequential calculations of different CalcType as separate calculations. Ex: task mp-1120260 has three calcs in the calcs_reversed: a GGA static, followed by two SCAN relaxations. Trajectory now parses a single trajectory each for the GGA static and SCAN relaxations
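To illustrate the splitting described in the last bullet, here is a rough sketch that groups consecutive calculations with the same type. It is not the code in `Trajectory.from_task_doc`; the label strings are placeholders standing in for `CalcType` values, and the list is assumed to be in run order.

```python
from itertools import groupby

# Placeholder labels, ordered from the first calculation run to the last.
calc_types = ["GGA static", "SCAN relaxation", "SCAN relaxation"]

# groupby only merges *adjacent* equal items, so each run of identical
# calculation types becomes one segment (one trajectory in the PR's terms).
segments = [(calc_type, len(list(group))) for calc_type, group in groupby(calc_types)]

print(segments)  # [('GGA static', 1), ('SCAN relaxation', 2)]
```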

Misc

  • Closes #1234 by ensuring OSZICAR has the right leading directory name

@codecov-commenter

codecov-commenter commented May 14, 2025

Codecov Report

Attention: Patch coverage is 28.12500% with 115 lines in your changes missing coverage. Please review.

Project coverage is 70.03%. Comparing base (978c5fb) to head (66cdbcc).
Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| emmet-core/emmet/core/mpid.py | 28.48% | 113 Missing ⚠️ |
| emmet-core/emmet/core/tasks.py | 0.00% | 1 Missing ⚠️ |
| emmet-core/emmet/core/vasp/calculation.py | 0.00% | 1 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (978c5fb) and HEAD (66cdbcc).

HEAD has 4 uploads less than BASE

| Flag | BASE (978c5fb) | HEAD (66cdbcc) |
|---|---|---|
| | 6 | 2 |
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1230       +/-   ##
===========================================
- Coverage   89.89%   70.03%   -19.87%     
===========================================
  Files         149       78       -71     
  Lines       14891     5430     -9461     
===========================================
- Hits        13387     3803     -9584     
- Misses       1504     1627      +123     

@esoteric-ephemera esoteric-ephemera changed the title from "Alphabetical ID (redo #1178)" to "Alphabetical ID (redo #1178) + Trajectory tweaks" May 15, 2025
@tsmathis
Collaborator

tsmathis commented May 19, 2025

one more comment on here @esoteric-ephemera, adding AlphaID as a model field breaks pydantic branch ci run:

pydantic.errors.PydanticSchemaGenerationError: Unable to generate pydantic-core schema for <class 'emmet.core.mpid.AlphaID'>. Set `arbitrary_types_allowed=True` in the model_config to ignore this error or implement `__get_pydantic_core_schema__` on your type to fully support it.

If you got this error by calling handler(<some type>) within `__get_pydantic_core_schema__` then you likely need to call `handler.generate_schema(<some type>)` since we do not call `__get_pydantic_core_schema__` on `<some type>` otherwise to avoid infinite recursion.

MPID has dealt with this in the three methods here:

@classmethod

Something similar should be in AlphaID

@esoteric-ephemera
Collaborator Author

@tsmathis fixed the pydantic serialization and added tests for this. Also added a _calculation_to_props_dict method to Trajectory to convert an emmet.core.vasp.calculation.Calculation to the format Trajectory._from_dict expects

@esoteric-ephemera
Collaborator Author

@tschaume this should be good to go - it's a bigger change but mostly independent of any existing structures in emmet-core. Do you want to look it over before merging?

@tschaume tschaume self-requested a review May 30, 2025 00:18
Member

@tschaume tschaume left a comment

Excellent! It's basically ready to merge I think. Just have nit-picky comments for you to consider.

from collections.abc import Callable
from typing import Any
from typing_extensions import Self


# matches "mp-1234" or "1234" followed by an optional "-(Alphanumeric)"
mpid_regex = re.compile(r"^([A-Za-z]*-)?(\d+)(-[A-Za-z0-9]+)*$")
Member

Should the first group be [A-Za-z]+- (+ instead of *)?

Member

Is the alphanumeric string before the first - in a molecule ID always the same length?

Collaborator Author

Yeah that should be + in the legacy MPIDs, looks like the current arrangement allows for MPID(-100) which is not good

For the molecule IDs: the first prefix of every ID looks to be a BLAKE2 hash based on emmet.core.utils.get_molecule_id. Should be the same length for all of them
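A small check of the `*` vs `+` point, using only the pattern quoted above and the proposed one-character change. The variable names `current` and `stricter` are just for this example.

```python
import re

# Pattern as quoted in the diff above
current = re.compile(r"^([A-Za-z]*-)?(\d+)(-[A-Za-z0-9]+)*$")
# Same pattern with the first group requiring at least one letter
stricter = re.compile(r"^([A-Za-z]+-)?(\d+)(-[A-Za-z0-9]+)*$")

for candidate in ("mp-1234", "1234", "mp-1234-abc", "-100"):
    print(candidate, bool(current.match(candidate)), bool(stricter.match(candidate)))
# mp-1234 True True
# 1234 True True
# mp-1234-abc True True
# -100 True False
```

With `*`, the optional prefix group can match a bare "-", which is why a string like "-100" gets through; with `+` it no longer matches.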

@@ -74,7 +83,7 @@ def __str__(self):
def __repr__(self):
return f"MPID({self})"

- def __lt__(self, other: Union["MPID", int, str]):
+ def __lt__(self, other: MPID | int | str):
other_parts = MPID(other).parts

if self.parts[0] != "" and other_parts[0] != "":
Member

Would this be equivalent to if self.parts[0] and other_parts[0]:?

Collaborator Author

That should be fine, but there's another weird thing about this comparison: in contrast to AlphaID, MPIDs with different prefixes are comparable, so you get this situation where:

MPID('mp-100') < MPID('mvc-100')

because mp is shorter than mvc. Not sure we want that behavior?

We probably do, to ensure that mvc-prefixed IDs are always sorted above mp-prefixed IDs, but just want to check

Collaborator Author

@esoteric-ephemera esoteric-ephemera Jun 2, 2025

There's also a bug in MPID.__gt__ that I'll try to fix:

MPID('100') < MPID(100)
>>> False

MPID('100') > MPID(100)
>>> True

MPID(100) > MPID('100')
>>> True

MPID('100') == MPID(100)
>>> True
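The behavior above follows from defining `__gt__` as the negation of `__lt__`: for equal values, "not less than" is true, so `>` also returns true. The toy classes below reproduce this and show one possible fix using `functools.total_ordering`. They are stand-ins for illustration only, not the emmet `MPID` class, and this is not necessarily the fix that will go into the PR.

```python
from functools import total_ordering


class NaiveID:
    """Toy identifier where __gt__ is simply `not __lt__` (breaks on equality)."""

    def __init__(self, value: int) -> None:
        self.value = value

    def __eq__(self, other) -> bool:
        return self.value == other.value

    def __hash__(self) -> int:
        return hash(self.value)

    def __lt__(self, other) -> bool:
        return self.value < other.value

    def __gt__(self, other) -> bool:
        return not self.__lt__(other)  # also True when the values are equal


@total_ordering
class OrderedID:
    """Toy identifier where the other comparisons are derived from __eq__ and __lt__."""

    def __init__(self, value: int) -> None:
        self.value = value

    def __eq__(self, other) -> bool:
        return self.value == other.value

    def __hash__(self) -> int:
        return hash(self.value)

    def __lt__(self, other) -> bool:
        return self.value < other.value


print(NaiveID(100) > NaiveID(100))      # True (the bug)
print(OrderedID(100) > OrderedID(100))  # False
```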

@@ -91,7 +100,7 @@ def __lt__(self, other: Union["MPID", int, str]):
# both are pure ints; normal comparison
return self.parts[1] < other_parts[1]

- def __gt__(self, other: Union["MPID", int, str]):
+ def __gt__(self, other: MPID | int | str):
return not self.__lt__(other)

def __hash__(self):
Member

use a global variable for the regex pattern of MPID in __get_pydantic_json_schema__ and for mpid_regex (avoid having to change the regex in two places if needed)? Same for Molecule ID
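One way this suggestion could look, as a sketch: the constant name `MPID_PATTERN` and the helper `json_schema_fragment` are hypothetical, and the pattern is the one quoted earlier in this review rather than whatever ends up in the final code.

```python
import re

# Hypothetical module-level constant: the single source of truth for the pattern.
MPID_PATTERN = r"^([A-Za-z]*-)?(\d+)(-[A-Za-z0-9]+)*$"

# Compiled once for validation at runtime
mpid_regex = re.compile(MPID_PATTERN)


def json_schema_fragment() -> dict:
    """Illustrative stand-in for the dict a JSON-schema hook might return."""
    return {"type": "string", "pattern": MPID_PATTERN}


print(mpid_regex.pattern == json_schema_fragment()["pattern"])  # True
```

Changing the pattern then only requires touching the constant.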

):
separator = list(non_alpha_char)[0]
prefix, identifier = identifier.split(separator)
elif len(non_alpha_char) > 1:
Member

What if identifier is a str but doesn't contain any non-alphanumeric characters?

Collaborator Author

Good catch, will add an exception for this. Note that AlphaID('') is valid input and gives a zero-valued identifier

prefix, identifier = identifier.parts
separator = "-"

if isinstance(identifier, str) and set(identifier).intersection(digits) > set():
Member

set(identifier).intersection(digits) > set() is the same as just set(identifier).intersection(digits), right?
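A quick check of that equivalence, assuming `digits` is `string.digits` as the excerpt suggests. `> set()` is a proper-superset test, which is true exactly when the intersection is non-empty, so it agrees with plain truthiness. The two function names are made up for the comparison.

```python
from string import digits


def contains_digit_verbose(identifier: str) -> bool:
    # Form used in the excerpt: proper-superset comparison against the empty set
    return set(identifier).intersection(digits) > set()


def contains_digit(identifier: str) -> bool:
    # Suggested form: an empty set is falsy, a non-empty set is truthy
    return bool(set(identifier).intersection(digits))


for identifier in ("aaft", "mp149", ""):
    print(repr(identifier), contains_digit_verbose(identifier), contains_digit(identifier))
# 'aaft' False False
# 'mp149' True True
# '' False False
```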

will not add the two. Only checks the separator if prefixes are both non-null.

Args:
other (str or int) : the value to add to the current identifier.
Member

other could also be an AlphaID, right?

Collaborator Author

yeah just an outdated docstr


Will not subtract two AlphaIDs if `prefix` and `separator` do not match.
"""
if isinstance(other, MPID):
Member

there's some code here that's identical to the one used in __add__. Would it make sense to factor it out? Maybe the checks are not necessary in __sub__ since __add__ will check -diff?

Collaborator Author

@esoteric-ephemera esoteric-ephemera Jun 2, 2025

For __add__ / __sub__ and __gt__ / __lt__ it's not possible to just negate the output of whichever method is defined. Gave an example of how this breaks for MPID above, which uses __gt__ := not __lt__

But I can organize this into a staticmethod

else:
test = other

if isinstance(test, AlphaID) and (
Member

Does this test the same as in __lt__?

Collaborator Author

merged this into the staticmethod _coerce_value used across add, sub, lt, gt

rt = run_type(padded_params)
tt = task_type(vis)

props: dict[str, list[Any]] = {
Member

Use defaultdict here?
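For reference, a sketch of what the `defaultdict` variant might look like. The keys and values below are invented for the example, since the actual contents of `props` are not shown in the excerpt.

```python
from collections import defaultdict
from typing import Any

# Missing keys start as empty lists, so no up-front initialization is needed.
props: dict[str, list[Any]] = defaultdict(list)

# Hypothetical per-ionic-step records
for step in ({"energy": -1.0, "forces": [0.0]}, {"energy": -1.5, "forces": [0.1]}):
    for key, value in step.items():
        props[key].append(value)

print(dict(props))  # {'energy': [-1.0, -1.5], 'forces': [[0.0], [0.1]]}
```

One trade-off: reading a key that was never written silently creates an empty list, which may or may not be wanted when the dict is later passed on.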

if icr > 0:
trajs.append(cls._from_dict(props, **old_meta, **kwargs)) # type: ignore[arg-type]

props = new_props.copy()
Member

Does this need deepcopy?

Collaborator Author

To be safe, probably yes, but I don't think there have been parsing issues so far using shallow copy
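The difference only matters if the inner lists are mutated after the copy is taken. A minimal illustration with a made-up `new_props` dict:

```python
from copy import deepcopy

new_props = {"energy": [-1.0]}  # hypothetical contents

shallow = new_props.copy()   # new dict, but the inner list is shared
deep = deepcopy(new_props)   # new dict and new inner list

new_props["energy"].append(-2.0)

print(shallow["energy"])  # [-1.0, -2.0]
print(deep["energy"])     # [-1.0]
```

If `new_props` is rebuilt from scratch on each iteration rather than appended to in place, the shallow copy would behave the same as the deep one, which may be why no parsing issues have shown up.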

Development

Successfully merging this pull request may close these issues.

[Bug]: The path error of oszicar_file in the calculation.py
4 participants