Skip to content

Fixes to schema v2 #255

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 25 commits into from
Jun 26, 2019
Merged

Fixes to schema v2 #255

merged 25 commits into from
Jun 26, 2019

Conversation

hardbyte
Copy link
Collaborator

@hardbyte hardbyte commented Jun 7, 2019

Schema v2 could have been better

@hardbyte hardbyte requested a review from wilko77 June 7, 2019 07:50
@codecov
Copy link

codecov bot commented Jun 9, 2019

Codecov Report

Merging #255 into release-v0.13.0 will increase coverage by 6.33%.
The diff coverage is 73.97%.

@@                 Coverage Diff                 @@
##           release-v0.13.0     #255      +/-   ##
===================================================
+ Coverage            86.99%   93.32%   +6.33%     
===================================================
  Files                   17       17              
  Lines                 1138     1199      +61     
===================================================
+ Hits                   990     1119     +129     
+ Misses                 148       80      -68

@hardbyte
Copy link
Collaborator Author

hardbyte commented Jun 9, 2019

Note the tests are failing because of the tutorials - that will be addressed in #257

@hardbyte hardbyte force-pushed the fix-schema-v2 branch 2 times, most recently from 0259e01 to 1fba6b7 Compare June 11, 2019 01:07
log("schema is valid", color='green')

except SchemaError as e:
log(str(e))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This prints out a pretty detailed message about what is invalid.

$ python -m clkhash validate-schema tests\testdata\bad-schema-v2.json
The schema is not valid.

{'identifier': 'DOB YYYY/MM/DD', 'format': {'type': 'date', 'description': 'Numbers separated by slashes, in the year, month, day order', 'format': '%Y/%m/%d'}, 'hashing': {'ngram': 1,
 'positional': True, 'k': 30, 'hash': {'type': 'doubleHash'}}} is not valid under any of the given schemas

Failed validating 'oneOf' in schema['properties']['features']['items']:
    {'oneOf': [{'$ref': '#/definitions/featureConfig'},
               {'$ref': '#/definitions/ignoreFeature'}],
     'type': 'object'}

On instance['features'][2]:
    {'format': {'description': 'Numbers separated by slashes, in the year, '
                               'month, day order',
                'format': '%Y/%m/%d',
                'type': 'date'},
     'hashing': {'hash': {'type': 'doubleHash'},
                 'k': 30,
                 'ngram': 1,
                 'positional': True},
     'identifier': 'DOB YYYY/MM/DD'}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will it show all the errors or just the first one?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jsonschema just outputs the first error - although with a small tweak it could iterate over them. https://python-jsonschema.readthedocs.io/en/stable/validate/#jsonschema.IValidator.iter_errors

try:
schema_object = clkhash.schema.from_json_file(schema_file=schema)
except SchemaError as e:
log(str(e))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As with the new validate-schema command this now prints out quite a detailed error message based on the schema violation.

if not num_bits and not k:
num_bits = 200 # default for v2 schema

num_bits = hashing_strategy.get('numBits')
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note if the user didn't provide a hashing strategy the default includes numBits=200 (from the master schema)

type_str = json_dict['format']['type']
except KeyError as e:
error = InvalidSchemaError("Missing attribute " + str(e))
error.json_field_spec = json_dict
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By storing the raw data that was invalid we can provide much better error messages to the user.

@@ -113,7 +129,8 @@ def convert_feature(f):
if 'weight' in hashing:
del hashing['weight']

hashing['k'] = int(round(weight * k))
hashing['strategy'] = {}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This strategy key was missing from the converted schemas. The k was ignored because extra attributes were allowed.

feature_errors = []
feature_configs = []

for i, feature_config in enumerate(dct['features']):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we can report which i feature/s have problems.

msg = ('The master schema is not a valid JSON file. The schema cannot '
'be validated. Please file a bug report.')
raise_from(MasterSchemaError(msg), e)
master_schema = _get_master_schema(version)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this bytes -> json handling into the helper function.

@@ -136,6 +141,7 @@
},
"hashingConfig": {
"type": "object",
"additionalProperties": false,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would have caught the invalid schemas.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or we could modify line 145 to "required": ["ngram", "strategy"],
as it doesn't really make sense to have a default here anyway.

@hardbyte
Copy link
Collaborator Author

hardbyte commented Jun 11, 2019

I have a local branch which updates the docs further rebased and pushed that too.

Note the current published docs refers to v1

Copy link
Collaborator

@wilko77 wilko77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more work to be done

log("schema is valid", color='green')

except SchemaError as e:
log(str(e))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will it show all the errors or just the first one?

:returns: An initialised instance of the appropriate FieldSpec
subclass.
"""
if 'ignored' in json_dict:
if 'ignored' in json_dict and json_dict['ignored']:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that cannot happen if you validate against schema. We enforce True with the const thing.

@@ -136,6 +141,7 @@
},
"hashingConfig": {
"type": "object",
"additionalProperties": false,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or we could modify line 145 to "required": ["ngram", "strategy"],
as it doesn't really make sense to have a default here anyway.

@hardbyte
Copy link
Collaborator Author

hardbyte commented Jun 13, 2019

Thanks for the review Wilko. I've created a bunch of issues and tried not to adjust the schema itself in this PR. I wanted to mostly tighten the spec up and ensure all our examples/tests were compliant.

Shame I didn't pick up on the strategy issue back here

Only one thing you said I had an issue with:

that cannot happen if you validate against schema. We enforce True with the const thing.

I think that ignored=False should be valid for any feature so someone can easily toggle a feature on and off.

@wilko77
Copy link
Collaborator

wilko77 commented Jun 17, 2019

I see. Good point.
So an ignoreFeature section needs to have ignored: true, but any other featureConfig can have an optional ignored: true/false.
We should probably make this explicit in the featureConfig definition and also test for the correct behavior.

@wilko77
Copy link
Collaborator

wilko77 commented Jun 17, 2019

as you very elegantly diverted all my concerns into separate issues, you could sent the PR again for me to sign off.

@hardbyte hardbyte requested a review from wilko77 June 22, 2019 22:26
requirements.txt Outdated
@@ -5,7 +5,7 @@ cryptography==2.6.1
enum34==1.1.6; python_version < '3.4'
future==0.17.1
futures==3.2.0; python_version < '3.2'
jsonschema==2.6.0
jsonschema==3.0.1
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something, quite possibly this, has broken the pyinstaller - https://ci.appveyor.com/project/hardbyte/clkhash/history

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverting back to jsonschema 2.6 seems to have fixed the windows exe build - https://ci.appveyor.com/project/hardbyte/clkhash/builds/25527663

@hardbyte hardbyte self-assigned this Jun 26, 2019
@hardbyte hardbyte merged commit 2f9f9f1 into release-v0.13.0 Jun 26, 2019
@hardbyte hardbyte deleted the fix-schema-v2 branch June 26, 2019 04:08
hardbyte added a commit that referenced this pull request Jun 26, 2019
* Make the v2 schema stricter with additional properties
* Fix the v2 testdata schemas to be compliant with the specification
* Fix the hardcoded feature configs in the unit tests to be compliant with the spec
* Bugfix in converting schemas from v1 to v2
* Update documentation regarding linkage schema v2 
* Add a v2 schema with errors
* Invalid schema exceptions keep more context
* Add cli command to validate a schema and cli schema validation tests
* Avoid a divide by zero when ngrams is empty
* cli can throw nice errors when asked to hash with invalid schema
* Travis: notebook execution runs in the integration test stage
* Expose Schema at top level
* Don't include a default hashing strategy in the schema
* Update tutorial notebooks to use v2 schema (#257)
* Include a notebook sanitizer so travis can test cells match.
* Use the new version of anonlink in api tutorial.
* fine-tuned schemas for jaw dropping performance demonstration
* Schema id now includes version
* Update author attribute
* Note we don't update to latest jsonschema version as Windows exe built with PyInstaller failed.

Closes #254
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants