Fixes to schema v2 #255

Merged 25 commits on Jun 26, 2019
25 commits
a65c4f4
Tweak schema v2 to catch hash config errors additional properties
hardbyte Jun 7, 2019
2b75199
Fix the v2 testdata schemas to be compliant with the specification
hardbyte Jun 7, 2019
0340e80
Fix the hardcoded feature configs in the unit tests to be compliant w…
hardbyte Jun 7, 2019
69a8a6c
Only ignore if the ignored boolean is true...
hardbyte Jun 7, 2019
555a21b
Bugfix in converting schemas from v1 to v2
hardbyte Jun 7, 2019
a298d18
Add a v2 schema with errors
hardbyte Jun 7, 2019
806a3cd
Invalid schema exceptions keep more context
hardbyte Jun 7, 2019
a421c82
Schema updates
hardbyte Jun 7, 2019
b61362d
Expose Schema at top level
hardbyte Jun 8, 2019
5338104
Travis: notebook execution runs in the integration test stage
hardbyte Jun 7, 2019
d524a37
Add cli schema validation tests
hardbyte Jun 11, 2019
58047b0
_get_master_schema now returns a json objects instead of bytes
hardbyte Jun 11, 2019
2b2748f
Load default hashing strategy from master schema
hardbyte Jun 11, 2019
8fbc6b8
Minor docstring update
hardbyte Jun 11, 2019
9b90959
Update another not quite compliant example schema
hardbyte Jun 11, 2019
bcf8fc3
Make the v2 schema stricter with additional properties
hardbyte Jun 11, 2019
0f897e0
Linkage schema documentation update
hardbyte Jun 11, 2019
54f921b
Don't include a default hashing strategy in the schema
hardbyte Jun 13, 2019
69bf6a1
Update to latest jsonschema version
hardbyte Jun 13, 2019
349770d
Update docstring and require non optional strategy key
hardbyte Jun 13, 2019
cd5ce90
Minor changes to schema v2
hardbyte Jun 25, 2019
57fc7fe
Adjusted test and v1 to v2 conversion to comply with stricter schema
hardbyte Jun 25, 2019
b4a4a1a
Update tutorial notebook to use v2 schema (#257)
hardbyte Jun 26, 2019
b0575bc
Fix merge/rebase error
hardbyte Jun 26, 2019
61bf256
Store field spec on exception
hardbyte Jun 26, 2019
15 changes: 10 additions & 5 deletions .travis.yml
@@ -22,7 +22,7 @@ install:
- travis_retry pip install -e .

script:
- if [ "${INCLUDE_NB_TEST}" == "1" ]; then pytest --cov=clkhash --nbval-lax; else pytest --cov=clkhash; fi
- pytest --cov=clkhash
- codecov


@@ -48,28 +48,33 @@ jobs:
- python: '3.6'
env:
- INCLUDE_CLI=1
- INCLUDE_NB_TEST=1
- python: '2.7'
env:
- INCLUDE_CLI=1

# OSX + Python is officially supported by Travis CI as of April 2011
# https://docs.travis-ci.com/user/reference/osx/
- os: osx
osx_image: xcode8.3
python: "3.6-dev"

- stage: Integration
name: Test Notebooks
python: 3.7
before_install:
- travis_retry pip install -U -r docs/doc-requirements.txt
script:
- pytest --nbval docs -x --sanitize-with docs/tutorial_sanitize.cfg

- stage: Integration
python: '3.8-dev'
env:
- TEST_ENTITY_SERVICE=https://testing.es.data61.xyz
- INCLUDE_CLI=1
- stage: Integration
python: '3.6'
python: '3.7'
env:
- TEST_ENTITY_SERVICE=https://testing.es.data61.xyz
- INCLUDE_CLI=1
- INCLUDE_NB_TEST=1
- stage: Integration
python: '2.7'
env:
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -1,5 +1,6 @@
## 0.13.0

- Fix example and test linkage schemas using v2.
- Fix mismatch between double hash and blake hash key requirement.
- Update to use newer anonlink-entity-service api.
- Updates to dependencies.
3 changes: 2 additions & 1 deletion clkhash/__init__.py
@@ -1,10 +1,11 @@
import pkg_resources

from . import bloomfilter, field_formats, key_derivation, schema, randomnames, describe
from .schema import Schema

try:
__version__ = pkg_resources.get_distribution('clkhash').version
except pkg_resources.DistributionNotFound:
__version__ = "development"

__author__ = 'N1 Analytics'
__author__ = "Data61"
7 changes: 4 additions & 3 deletions clkhash/bloomfilter.py
@@ -319,9 +319,10 @@ def crypto_bloom_filter(record, # type: Sequence[Text]
if fhp:
ngrams = list(tokenize(field.format_value(entry)))
hash_function = hashing_function_from_properties(fhp)
bloomfilter |= hash_function(ngrams, key,
fhp.ks(len(ngrams)),
hash_l, fhp.encoding)
if ngrams:
bloomfilter |= hash_function(ngrams, key,
fhp.ks(len(ngrams)),
hash_l, fhp.encoding)

c1 = bloomfilter.count()
bloomfilter = fold_xor(bloomfilter, schema.xor_folds)
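The guard added above skips the hashing call when a field value tokenizes to no n-grams (an empty string, for instance). Here is a minimal Python 3 sketch of why such a guard can matter, assuming a strategy that spreads a fixed bit budget across the tokens. The names `insert_tokens` and `num_bits`, and the SHA-256 based indexing, are illustrative only and are not clkhash's implementation:

```python
import hashlib


def insert_tokens(bits, tokens, num_bits):
    """Set roughly `num_bits` positions in `bits`, shared across `tokens`.

    Hypothetical helper: `bits` is a list of 0/1 values standing in for a
    Bloom filter.
    """
    if not tokens:
        # Nothing to encode. Without this guard the division below
        # would raise ZeroDivisionError.
        return bits
    k = max(1, round(num_bits / len(tokens)))  # hash functions per token
    for token in tokens:
        for i in range(k):
            digest = hashlib.sha256("{}:{}".format(i, token).encode()).digest()
            bits[int.from_bytes(digest[:8], "big") % len(bits)] = 1
    return bits


empty = insert_tokens([0] * 64, [], num_bits=30)
filled = insert_tokens([0] * 64, ["ab", "bc"], num_bits=30)
print(sum(empty), sum(filled) > 0)
```

An empty token list leaves the filter untouched, which matches the intent of the `if ngrams:` check in the diff.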
36 changes: 31 additions & 5 deletions clkhash/cli.py
@@ -15,6 +15,7 @@
run_get_status, project_create, run_create,
server_get_status, ServiceError,
format_run_status, watch_run_status)
from clkhash.schema import SchemaError

DEFAULT_SERVICE_URL = 'https://es.data61.xyz'

@@ -68,8 +69,11 @@ def hash(pii_csv, keys, schema, clk_json, quiet, no_header, check_header, valida
Use "-" for CLK_JSON to write JSON to stdout.
"""

schema_object = clkhash.schema.from_json_file(schema_file=schema)
try:
schema_object = clkhash.schema.from_json_file(schema_file=schema)
except SchemaError as e:
log(str(e))
Collaborator Author

As with the new validate-schema command, this now prints a detailed error message describing the schema violation.

raise SystemExit(-1)
header = True
if not check_header:
header = 'ignore'
@@ -92,7 +96,7 @@ def hash(pii_csv, keys, schema, clk_json, quiet, no_header, check_header, valida
log("CLK data written to {}".format(clk_json.name))


@cli.command('status', short_help='Get status of entity service')
@cli.command('status', short_help='get status of entity service')
@click.option('--server', type=str, default=DEFAULT_SERVICE_URL, help="Server address including protocol")
@click.option('-o', '--output', type=click.File('w'), default='-')
@click.option('-v', '--verbose', default=False, is_flag=True, help="Script is more talkative")
@@ -141,7 +145,7 @@ def status(server, output, verbose):
@click.option('--name', type=str, help="Name to give this project")
@click.option('--parties', default=2, type=int,
help="Number of parties in the project")
@click.option('-o','--output', type=click.File('w'), default='-')
@click.option('-o', '--output', type=click.File('w'), default='-')
@click.option('-v', '--verbose', is_flag=True, help="Script is more talkative")
def create_project(type, schema, server, name, parties, output, verbose):
"""Create a new project on an entity matching server.
@@ -171,7 +175,7 @@ def create_project(type, schema, server, name, parties, output, verbose):
except ServiceError as e:
log("Unexpected response - {}".format(e.status_code))
log(e.text)
raise SystemExit
raise SystemExit(-1)
else:
log("Project created")
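The change from a bare `raise SystemExit` to `raise SystemExit(-1)` matters to callers: a bare `SystemExit` carries no code, so the interpreter exits with status 0, which a shell or CI job reads as success. A small sketch that checks this with the standard library only (the helper name `exit_status` is illustrative; on POSIX systems the -1 is typically reported as 255):

```python
import subprocess
import sys


def exit_status(snippet):
    """Run `snippet` in a fresh interpreter and return its exit status."""
    return subprocess.run([sys.executable, "-c", snippet]).returncode


bare = exit_status("raise SystemExit")
failing = exit_status("raise SystemExit(-1)")

print("bare SystemExit ->", bare)     # 0, indistinguishable from success
print("SystemExit(-1)  ->", failing)  # non-zero, reported as a failure
```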

@@ -318,6 +322,28 @@ def generate_default_schema(output):
shutil.copyfile(original_path, output)


@cli.command('validate-schema', short_help="validate linkage schema")
@click.argument('schema', type=click.File('r', lazy=True))
def validate_schema(schema):
"""Validate a linkage schema
Given a file containing a linkage schema, verify the schema is valid otherwise
print detailed errors.
"""

try:
clkhash.schema.from_json_file(
schema_file=schema,
validate=True
)

log("schema is valid", color='green')

except SchemaError as e:
log(str(e))
Collaborator Author

This prints a fairly detailed message about what is invalid:

$ python -m clkhash validate-schema tests\testdata\bad-schema-v2.json
The schema is not valid.

{'identifier': 'DOB YYYY/MM/DD', 'format': {'type': 'date', 'description': 'Numbers separated by slashes, in the year, month, day order', 'format': '%Y/%m/%d'}, 'hashing': {'ngram': 1,
 'positional': True, 'k': 30, 'hash': {'type': 'doubleHash'}}} is not valid under any of the given schemas

Failed validating 'oneOf' in schema['properties']['features']['items']:
    {'oneOf': [{'$ref': '#/definitions/featureConfig'},
               {'$ref': '#/definitions/ignoreFeature'}],
     'type': 'object'}

On instance['features'][2]:
    {'format': {'description': 'Numbers separated by slashes, in the year, '
                               'month, day order',
                'format': '%Y/%m/%d',
                'type': 'date'},
     'hashing': {'hash': {'type': 'doubleHash'},
                 'k': 30,
                 'ngram': 1,
                 'positional': True},
     'identifier': 'DOB YYYY/MM/DD'}

Collaborator

Will it show all the errors or just the first one?

Collaborator Author

jsonschema only outputs the first error, although with a small tweak it could iterate over all of them: https://python-jsonschema.readthedocs.io/en/stable/validate/#jsonschema.IValidator.iter_errors

raise SystemExit(-1)


if __name__ == "__main__":
freeze_support()
cli()
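The review thread above notes that validation reports a single error and that `iter_errors` could list every violation. A hedged sketch of that tweak, using the `jsonschema` package directly. The tiny schema and instance are invented for illustration and are not the clkhash master schema:

```python
from jsonschema import Draft7Validator

# Illustrative schema: each feature needs an identifier and an integer ngram.
schema = {
    "type": "object",
    "properties": {
        "features": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["identifier", "ngram"],
                "properties": {
                    "identifier": {"type": "string"},
                    "ngram": {"type": "integer"},
                },
            },
        }
    },
}

# Two separate problems: a missing 'ngram' and an 'ngram' of the wrong type.
instance = {
    "features": [
        {"identifier": "name"},
        {"identifier": "dob", "ngram": "1"},
    ]
}

validator = Draft7Validator(schema)
# iter_errors lazily yields every ValidationError instead of raising once.
errors = sorted(validator.iter_errors(instance),
                key=lambda e: list(e.absolute_path))
for error in errors:
    location = "/".join(str(part) for part in error.absolute_path)
    print("{}: {}".format(location, error.message))
```

Both problems are reported in one pass, which a `validate-schema` style command could print before exiting with a non-zero status.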
1 change: 0 additions & 1 deletion clkhash/data/randomnames-schema.json
@@ -1,4 +1,3 @@

{
"version": 1,
"clkConfig": {
33 changes: 17 additions & 16 deletions clkhash/field_formats.py
@@ -25,11 +25,13 @@ class InvalidEntryError(ValueError):


class InvalidSchemaError(ValueError):
""" The schema is not valid.
"""Raised if the schema of a field specification is invalid.
This exception is raised if, for example, a regular expression
included in the schema is not syntactically correct.
For example, a regular expression included in the schema is not
syntactically correct.
"""
json_field_spec = None # type: Optional[dict]
field_spec_index = None # type: Optional[int]


class MissingValueSpec(object):
@@ -161,19 +163,17 @@ def fhp_from_json_dict(
"""
Make a :class:`FieldHashingProperties` object from a dictionary.
:param dict json_dict:
The dictionary must have have an 'ngram' key
and one of k or num_bits. It may have
'positional' key; if missing a default is used.
The encoding is
always set to the default value.
:return: A :class:`FieldHashingProperties` instance.
:param dict json_dict:
Conforming to the `hashingConfig` definition
in the `v2` linkage schema.
:return: A :class:`FieldHashingProperties` instance.
"""
hashing_strategy = json_dict['strategy']
h = json_dict.get('hash', {'type': 'blakeHash'})
num_bits = json_dict.get('numBits')
k = json_dict.get('k')
if not num_bits and not k:
num_bits = 200 # default for v2 schema

num_bits = hashing_strategy.get('numBits')
Collaborator Author

Note: if the user didn't provide a hashing strategy, the default includes numBits=200 (from the master schema).

k = hashing_strategy.get('k')

return FieldHashingProperties(
ngram=json_dict['ngram'],
positional=json_dict.get(
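The rewritten function reads `numBits` or `k` from a nested `strategy` object instead of from the top level of the hashing config. A small sketch of that lookup, assuming exactly one of the two keys is expected. The function `read_strategy` and the sample config are illustrative; in clkhash itself the structure is enforced by the v2 linkage schema rather than by code like this:

```python
def read_strategy(hashing_config):
    """Return (num_bits, k) from a v2-style hashing config dictionary."""
    strategy = hashing_config["strategy"]  # required key; KeyError if absent
    num_bits = strategy.get("numBits")
    k = strategy.get("k")
    # Assumption for this sketch: exactly one of the two must be given.
    if (num_bits is None) == (k is None):
        raise ValueError("strategy needs exactly one of 'numBits' or 'k'")
    return num_bits, k


config = {
    "ngram": 2,
    "strategy": {"numBits": 200},
    "hash": {"type": "blakeHash"},
}
print(read_strategy(config))
```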
@@ -263,7 +263,6 @@ def validate(self, str_in):
e_new.field_spec = self
raise_from(e_new, err)


def is_missing_value(self, str_in):
# type: (Text) -> bool
""" tests if 'str_in' is the sentinel value for this field
@@ -441,6 +440,7 @@ def from_json_dict(cls,
except (SyntaxError, re.error) as e:
msg = "Invalid regular expression '{}.'".format(pattern)
e_new = InvalidSchemaError(msg)
e_new.json_field_spec = json_dict
raise_from(e_new, e)
result.regex_based = True
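Storing the offending field specification on the exception lets a caller report which part of the schema failed. A minimal Python 3 sketch of the pattern, using `raise ... from` where the diff uses `raise_from` for Python 2 compatibility. The function `compile_pattern` and the sample dictionary are hypothetical:

```python
import re
from typing import Optional


class InvalidSchemaError(ValueError):
    """Raised if the schema of a field specification is invalid."""
    json_field_spec = None  # type: Optional[dict]
    field_spec_index = None  # type: Optional[int]


def compile_pattern(json_dict):
    """Compile the regex in a (hypothetical) field spec dictionary."""
    pattern = json_dict["pattern"]
    try:
        return re.compile(pattern)
    except re.error as e:
        msg = "Invalid regular expression '{}'.".format(pattern)
        e_new = InvalidSchemaError(msg)
        e_new.json_field_spec = json_dict  # keep context for the caller
        raise e_new from e


spec = {"identifier": "postcode", "pattern": "[0-9"}
caught = None
try:
    compile_pattern(spec)
except InvalidSchemaError as err:
    caught = err
    print(err, "in field", err.json_field_spec["identifier"])
```

The original `re.error` stays reachable through `__cause__`, so no detail is lost by wrapping it.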

@@ -843,9 +843,10 @@ def spec_from_json_dict(
json_dict # type: Dict[str, Any]
):
# type: (...) -> FieldSpec
""" Turns a dictionary into the appropriate object.
""" Turns a dictionary into the appropriate FieldSpec object.
:param dict json_dict: A dictionary with properties.
:raises InvalidSchemaError:
:returns: An initialised instance of the appropriate FieldSpec
subclass.
"""