-
Notifications
You must be signed in to change notification settings - Fork 72
Alphabetical ID (redo #1178) + Trajectory tweaks #1230
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1230 +/- ##
===========================================
- Coverage 89.89% 70.03% -19.87%
===========================================
Files 149 78 -71
Lines 14891 5430 -9461
===========================================
- Hits 13387 3803 -9584
- Misses 1504 1627 +123 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
one more comment on here @esoteric-ephemera, adding pydantic.errors.PydanticSchemaGenerationError: Unable to generate pydantic-core schema for <class 'emmet.core.mpid.AlphaID'>. Set `arbitrary_types_allowed=True` in the model_config to ignore this error or implement `__get_pydantic_core_schema__` on your type to fully support it.
If you got this error by calling handler(<some type>) within `__get_pydantic_core_schema__` then you likely need to call `handler.generate_schema(<some type>)` since we do not call `__get_pydantic_core_schema__` on `<some type>` otherwise to avoid infinite recursion.
emmet/emmet-core/emmet/core/mpid.py Line 100 in f1f38ad
Something similar should be in |
@tsmathis fixed the pydantic serialization and added tests for this. Also added a |
…one row of parquet-like dataset
…erty / to_arrow field
f189d54
to
12f4a94
Compare
@tschaume this should be good to go - it's a bigger change but mostly independent of any existing structures in emmet-core. Do you want to look it over before merging? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent! It's basically ready to merge I think. Just have nit-picky comments for you to consider.
emmet-core/emmet/core/mpid.py
Outdated
from collections.abc import Callable | ||
from typing import Any | ||
from typing_extensions import Self | ||
|
||
|
||
# matches "mp-1234" or "1234" followed by and optional "-(Alphanumeric)" | ||
mpid_regex = re.compile(r"^([A-Za-z]*-)?(\d+)(-[A-Za-z0-9]+)*$") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the first group be [A-Za-z]+-
(+
instead of *
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the alphanumeric string before the first -
in a molecule ID always the same length?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that should be +
in the legacy MPIDs, looks like the current arrangement allows for MPID(-100)
which is not good
For the molecule IDs: the first prefix of every ID looks to be a BLAKE2 hash based on emmet.core.utils.get_molecule_id
. Should be the same length for all of them
emmet-core/emmet/core/mpid.py
Outdated
@@ -74,7 +83,7 @@ def __str__(self): | |||
def __repr__(self): | |||
return f"MPID({self})" | |||
|
|||
def __lt__(self, other: Union["MPID", int, str]): | |||
def __lt__(self, other: MPID | int | str): | |||
other_parts = MPID(other).parts | |||
|
|||
if self.parts[0] != "" and other_parts[0] != "": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this be equivalent to if self.parts[0] and other_parts[0]:
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should be fine, but there's another weird thing about this comparison: in contrast to AlphaID
, MPID
s with different prefixes are comparable, so you get this situation where:
MPID('mp-100') < MPID('mvc-100')
because mp
is shorter than mvc
. Not sure we want that behavior?
We probably do to ensure that mvc
prefixed IDs are always sorted above mp
prefixed IDs, but just want to check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's also a bug in MPID.__gt__
that I'll try to fix:
MPID('100') < MPID(100)
>>> False
MPID('100') > MPID(100)
>>> True
MPID(100) > MPID('100')
>>> True
MPID('100') == MPID(100)
>>> True
@@ -91,7 +100,7 @@ def __lt__(self, other: Union["MPID", int, str]): | |||
# both are pure ints; normal comparison | |||
return self.parts[1] < other_parts[1] | |||
|
|||
def __gt__(self, other: Union["MPID", int, str]): | |||
def __gt__(self, other: MPID | int | str): | |||
return not self.__lt__(other) | |||
|
|||
def __hash__(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use a global variable for the regex pattern of MPID in __get_pydantic_json_schema__
and for mpid_regex
(avoid having to change the regex in two places if needed)? Same for Molecule ID
): | ||
separator = list(non_alpha_char)[0] | ||
prefix, identifier = identifier.split(separator) | ||
elif len(non_alpha_char) > 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if identifier is a str
but doesn't contain any non-alphanumeric characters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, will add an exception for this. Note that AlphaID('')
is valid input and gives a zero-valued identifier
emmet-core/emmet/core/mpid.py
Outdated
prefix, identifier = identifier.parts | ||
separator = "-" | ||
|
||
if isinstance(identifier, str) and set(identifier).intersection(digits) > set(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set(identifier).intersection(digits) > set()
is the same as just set(identifier).intersection(digits)
, right?
emmet-core/emmet/core/mpid.py
Outdated
will not add the two. Only checks the separator if prefixes are both non-null. | ||
|
||
Args: | ||
other (str or int) : the value to add to the current identifier. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other
could also be an AlphaID, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah just an outdated docstr
emmet-core/emmet/core/mpid.py
Outdated
|
||
Will not subtract two AlphaIDs if `prefix` and `separator` do not match. | ||
""" | ||
if isinstance(other, MPID): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's some code here that's identical to the one used in __add__
. Would it make sense to factor it out? Maybe the checks are not necessary in __sub__
since __add__
will check -diff
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For __add__
/ __sub__
and __gt__
/ __lt__
it's not possible to just negate the output of whichever method is defined. Gave an example of how this breaks for MPID
above, which uses __gt__ := not __lt__
But I can organize this into a staticmethod
emmet-core/emmet/core/mpid.py
Outdated
else: | ||
test = other | ||
|
||
if isinstance(test, AlphaID) and ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this test the same as in __lt__
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
merged this into the staticmethod _coerce_value
used across add, sub, lt, gt
emmet-core/emmet/core/trajectory.py
Outdated
rt = run_type(padded_params) | ||
tt = task_type(vis) | ||
|
||
props: dict[str, list[Any]] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use defaultdict
here?
emmet-core/emmet/core/trajectory.py
Outdated
if icr > 0: | ||
trajs.append(cls._from_dict(props, **old_meta, **kwargs)) # type: ignore[arg-type] | ||
|
||
props = new_props.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need deepcopy
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be safe, probably yes, but I don't think there have been parsing issues so far using shallow copy
Alphabetical ID
Mistakenly closed #1178 by rebasing git history. This PR expands upon it.
Defines a new
AlphaID
class that could eventually replace / contain the currentMPID
class.From internal discussions, the benefit of the
MPID
system was brevity when the system was relatively new. "mp-149" is easy to remember whereas the current batch of MPIDs are > 3,000,000.To replace / augment the current
MPID
system, we need an identifier that:From this, it was suggested that an alphabetical string (to avoid clashes with the current MPIDs, no numbers can be used) could be used instead. The integer value of this string would essentially be taken as base-26 representation, i.e.:
The current implementation supports these features, as well as addition, subtraction, ==, <, > with other AlphaID, int, and the existing MPID. Includes convenience constructor from MPID to AlphaID.
Tests included for these features.
To make these easy to remember, we may want to set the pad length (the number of leading zeroes or "a" characters) to be at least 6, which would give us$26^6 = 308,915,776$ total task IDs (minus the ~3,100,000 that have currently been assigned).
Suggestions / discussion are welcome
Trajectory changes
run_type
andtask_type
as optional top-levelsionic_step_idx
to arrow de-/serialization to allow for sorting ionic steps even ifpreserve_index
was not set while writing trajectory filesTrajectory.from_task_doc
now parses sequential calculations of differentCalcType
as separate calculations. Ex: taskmp-1120260
has three calcs in thecalcs_reversed
: a GGA static, followed by two SCAN relaxations.Trajectory
now parses a single trajectory each for the GGA static and SCAN relaxationsMisc