-
Notifications
You must be signed in to change notification settings - Fork 9
Fixes to schema v2 #255
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes to schema v2 #255
Conversation
Codecov Report
@@ Coverage Diff @@
## release-v0.13.0 #255 +/- ##
===================================================
+ Coverage 86.99% 93.32% +6.33%
===================================================
Files 17 17
Lines 1138 1199 +61
===================================================
+ Hits 990 1119 +129
+ Misses 148 80 -68 |
Note the tests are failing because of the tutorials - that will be addressed in #257 |
0259e01
to
1fba6b7
Compare
log("schema is valid", color='green') | ||
|
||
except SchemaError as e: | ||
log(str(e)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This prints out a pretty detailed message about what is invalid.
$ python -m clkhash validate-schema tests\testdata\bad-schema-v2.json
The schema is not valid.
{'identifier': 'DOB YYYY/MM/DD', 'format': {'type': 'date', 'description': 'Numbers separated by slashes, in the year, month, day order', 'format': '%Y/%m/%d'}, 'hashing': {'ngram': 1,
'positional': True, 'k': 30, 'hash': {'type': 'doubleHash'}}} is not valid under any of the given schemas
Failed validating 'oneOf' in schema['properties']['features']['items']:
{'oneOf': [{'$ref': '#/definitions/featureConfig'},
{'$ref': '#/definitions/ignoreFeature'}],
'type': 'object'}
On instance['features'][2]:
{'format': {'description': 'Numbers separated by slashes, in the year, '
'month, day order',
'format': '%Y/%m/%d',
'type': 'date'},
'hashing': {'hash': {'type': 'doubleHash'},
'k': 30,
'ngram': 1,
'positional': True},
'identifier': 'DOB YYYY/MM/DD'}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will it show all the errors or just the first one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jsonschema
just outputs the first error - although with a small tweak it could iterate over them. https://python-jsonschema.readthedocs.io/en/stable/validate/#jsonschema.IValidator.iter_errors
try: | ||
schema_object = clkhash.schema.from_json_file(schema_file=schema) | ||
except SchemaError as e: | ||
log(str(e)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As with the new validate-schema
command this now prints out quite a detailed error message based on the schema violation.
if not num_bits and not k: | ||
num_bits = 200 # default for v2 schema | ||
|
||
num_bits = hashing_strategy.get('numBits') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note if the user didn't provide a hashing strategy the default includes numBits=200
(from the master schema)
clkhash/field_formats.py
Outdated
type_str = json_dict['format']['type'] | ||
except KeyError as e: | ||
error = InvalidSchemaError("Missing attribute " + str(e)) | ||
error.json_field_spec = json_dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By storing the raw data that was invalid we can provide much better error messages to the user.
@@ -113,7 +129,8 @@ def convert_feature(f): | |||
if 'weight' in hashing: | |||
del hashing['weight'] | |||
|
|||
hashing['k'] = int(round(weight * k)) | |||
hashing['strategy'] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This strategy
key was missing from the converted schemas. The k
was ignored because extra attributes were allowed.
feature_errors = [] | ||
feature_configs = [] | ||
|
||
for i, feature_config in enumerate(dct['features']): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we can report which i
feature/s have problems.
msg = ('The master schema is not a valid JSON file. The schema cannot ' | ||
'be validated. Please file a bug report.') | ||
raise_from(MasterSchemaError(msg), e) | ||
master_schema = _get_master_schema(version) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved this bytes -> json handling into the helper function.
@@ -136,6 +141,7 @@ | |||
}, | |||
"hashingConfig": { | |||
"type": "object", | |||
"additionalProperties": false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would have caught the invalid schemas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or we could modify line 145 to "required": ["ngram", "strategy"],
as it doesn't really make sense to have a default here anyway.
Note the current published docs refers to v1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more work to be done
log("schema is valid", color='green') | ||
|
||
except SchemaError as e: | ||
log(str(e)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will it show all the errors or just the first one?
clkhash/field_formats.py
Outdated
:returns: An initialised instance of the appropriate FieldSpec | ||
subclass. | ||
""" | ||
if 'ignored' in json_dict: | ||
if 'ignored' in json_dict and json_dict['ignored']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that cannot happen if you validate against schema. We enforce True
with the const
thing.
@@ -136,6 +141,7 @@ | |||
}, | |||
"hashingConfig": { | |||
"type": "object", | |||
"additionalProperties": false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or we could modify line 145 to "required": ["ngram", "strategy"],
as it doesn't really make sense to have a default here anyway.
Thanks for the review Wilko. I've created a bunch of issues and tried not to adjust the schema itself in this PR. I wanted to mostly tighten the spec up and ensure all our examples/tests were compliant. Shame I didn't pick up on the strategy issue back here Only one thing you said I had an issue with:
I think that |
I see. Good point. |
as you very elegantly diverted all my concerns into separate issues, you could sent the PR again for me to sign off. |
requirements.txt
Outdated
@@ -5,7 +5,7 @@ cryptography==2.6.1 | |||
enum34==1.1.6; python_version < '3.4' | |||
future==0.17.1 | |||
futures==3.2.0; python_version < '3.2' | |||
jsonschema==2.6.0 | |||
jsonschema==3.0.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something, quite possibly this, has broken the pyinstaller - https://ci.appveyor.com/project/hardbyte/clkhash/history
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverting back to jsonschema 2.6 seems to have fixed the windows exe build - https://ci.appveyor.com/project/hardbyte/clkhash/builds/25527663
f49f0e1
to
29d8da7
Compare
Fixes #254 Another tweak to schema v2 Add a short note about the v2 schema in the docs.
Add cli command to validate a schema Avoid a divide by zero when ngrams is empty cli can throw nice errors when asked to hash with invalid schema schema types
Previously duplicated default information in schema and code.
Update to latest jsonschema version
* Save run of CLI tutorial against newest anonlink server * Include a notebook sanitizer so travis can test cells match. * Use the new version of anonlink in api tutorial. * fine-tuned schemas for jaw dropping performance demonstration * Schema id now includes version * Update author attribute
* Make the v2 schema stricter with additional properties * Fix the v2 testdata schemas to be compliant with the specification * Fix the hardcoded feature configs in the unit tests to be compliant with the spec * Bugfix in converting schemas from v1 to v2 * Update documentation regarding linkage schema v2 * Add a v2 schema with errors * Invalid schema exceptions keep more context * Add cli command to validate a schema and cli schema validation tests * Avoid a divide by zero when ngrams is empty * cli can throw nice errors when asked to hash with invalid schema * Travis: notebook execution runs in the integration test stage * Expose Schema at top level * Don't include a default hashing strategy in the schema * Update tutorial notebooks to use v2 schema (#257) * Include a notebook sanitizer so travis can test cells match. * Use the new version of anonlink in api tutorial. * fine-tuned schemas for jaw dropping performance demonstration * Schema id now includes version * Update author attribute * Note we don't update to latest jsonschema version as Windows exe built with PyInstaller failed. Closes #254
Schema v2 could have been better