This repository contains scenarios, test cases and code for challenges to scale how we improve the quality of data on https://planning.data.gov.uk
Testing and assuring our data quality is intended to complement our work with i.AI to increase the availability of data by extracting data from documents.
We expect to incorporate issues identified by these tests into the feedback we give to data providers, to help them improve the availability and quality of their planning data.
The planning data platform collects data from local sources and makes them available as national datasets.
We link to an information page for each local data source, a webpage containing human-readable information about the data. This page MUST be on the authoritative website for the organisation.
As an example, most LPAs have a planning policy page listing their conservation areas:
- Barnet conservation areas
- Liverpool conservation areas
- Erewash conservation areas
- Dacorum conservation areas
- Lambeth conservation areas
We record a link to these source web pages as the documentation-url in our source configuration.
LPAs are also expected to provide their conservation areas as data, following our guidance.
Our endpoint configuration contains an endpoint-url from which we collect the data. Each night the platform makes a request to each endpoint; the log contains a record of the request, and the data downloaded is saved as a resource.
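The nightly collection described above can be sketched as follows. This is an illustrative model only, not the platform's actual code: the function names, log fields, and the use of a SHA-256 content hash as the resource identifier are assumptions for the sketch.

```python
# Hypothetical sketch of the nightly collection step: one log record per
# request against an endpoint-url, with the downloaded data saved as a
# resource keyed by the hash of its content. Field names are illustrative.
import hashlib
from datetime import datetime, timezone

def resource_hash(content: bytes) -> str:
    """Key resources by content hash, so an unchanged endpoint yields
    the same resource on consecutive nights."""
    return hashlib.sha256(content).hexdigest()

def log_entry(endpoint_url: str, status: int, content: bytes) -> dict:
    """One log record for a single nightly request to an endpoint-url."""
    return {
        "endpoint-url": endpoint_url,
        "entry-date": datetime.now(timezone.utc).isoformat(),
        "status": status,
        "resource": resource_hash(content) if status == 200 else None,
    }
```

Keying resources by content hash means repeated downloads of identical data collapse to a single resource, which is useful later when looking for changes.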
We expect the endpoint-url to be documented on, and linked to from, a webpage which is recorded as the documentation-url for the endpoint.
This may be the same webpage as the information source page, but may also be a page on another domain.
For example, Barnet open data.
In this case we expect there to be a hyperlink from the information source page to the endpoint documentation page, and from the endpoint documentation page to the endpoint.
We also expect there to be a statement about the copyright and licensing of the data, which we currently record against the source.
- provision contains the organisations expected to have information about each dataset.
- organisation dataset includes a link to each organisation's website.
- source contains existing source pages
- endpoint contains existing endpoints, the URLs we download data from
Our source and endpoint data is currently very messy: the source dataset contains placeholder entries with blank URLs, many of the documentation-urls are broken links or point to endpoint webpages, and endpoint documentation-urls have been recorded against the source. We need to migrate so that each source documentation-url links to the information page.
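A first pass over this mess could triage each source row before any network checks are made. The sketch below is a hypothetical classifier, assuming rows with "documentation-url" and "endpoint-url" columns as described above; the classification rules and file-extension list are illustrative assumptions, not agreed migration logic.

```python
# Hypothetical triage of a source row's documentation-url, flagging the
# failure modes described in the text: blank placeholders, and
# documentation-urls that are really endpoints or data files.
def triage_documentation_url(row: dict) -> str:
    doc = (row.get("documentation-url") or "").strip()
    endpoint = (row.get("endpoint-url") or "").strip()
    if not doc:
        return "blank"              # placeholder entry with no URL
    if doc == endpoint:
        return "endpoint-not-page"  # endpoint recorded as documentation
    if doc.rsplit(".", 1)[-1].lower() in {"csv", "json", "geojson", "zip"}:
        return "looks-like-data"    # likely a data file, not an information page
    return "candidate-page"         # still needs an automated link check
```

Rows classed as "candidate-page" would then go on to a slower HTTP check for broken links.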
We are adding datasets as we work through our backlog of planning considerations, many of which are devolved to LPAs (currently 311 organisations). Finding sources for a new dataset on these LPA websites is time-consuming.
Once we have source and endpoint data, we need to monitor the LPA sites for changes, in particular the publication of new endpoints and changes to licensing; this monitoring is a time-consuming and error-prone activity.
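One way to automate part of that monitoring is to compare the latest resource hash per endpoint with the previous night's. This is a sketch under the assumption (introduced above, not stated by the platform) that resources are keyed by content hash; the function name and categories are illustrative.

```python
# Hypothetical change detection between two nightly collections.
# Both arguments map an endpoint-url to its latest resource hash.
def detect_changes(previous: dict, latest: dict) -> dict:
    """Report endpoints that are new, have changed content, or have
    disappeared since the previous run."""
    return {
        "new": sorted(set(latest) - set(previous)),
        "changed": sorted(e for e in latest
                          if e in previous and latest[e] != previous[e]),
        "gone": sorted(set(previous) - set(latest)),
    }
```

Detecting licensing changes would need a similar diff over the copyright and licensing statements recorded against each source, which this sketch does not cover.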
The planning data platform is an index of material information provided by LPAs and other organisations.
Each entity includes a link to the documentation-url, the webpage with human-readable content describing the entity, and a document-url, usually a PDF containing the material information, including the name of the entity, a start-date (when the entity came into force) and where the entity applies. For example:
- The entity 6100046 represents an Article 4 Direction (PDF). The direction removes permitted development rights from a single area represented by the entity [7010002601](https://www.planning.data.gov.uk/entity/7010002601), found using a datasette query.
Can we highlight where the name, date and other information in our data differs or is missing from those in these webpages and documents?
For example, we manually reviewed Conservation areas in Barnet. Can we scale this approach to provide similar reporting for other LPAs and datasets?
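A scaled version of the manual review could start by flagging entities whose name or start-date does not appear in the text of their documentation page. The sketch below is a deliberately naive first cut: `page_text` is assumed to be plain text already extracted from the webpage or PDF (fetching and extraction are out of scope), and exact substring matching will miss reworded names and differently formatted dates.

```python
# Hypothetical check that an entity's recorded fields appear in the text
# of its documentation page or document. Returns a list of discrepancies
# suitable for a per-LPA report; matching rules are illustrative only.
def field_discrepancies(entity: dict, page_text: str) -> list[str]:
    issues = []
    text = page_text.lower()
    name = (entity.get("name") or "").strip()
    if name and name.lower() not in text:
        issues.append(f"name {name!r} not found on page")
    start = entity.get("start-date") or ""
    if start and start not in page_text:
        issues.append(f"start-date {start} not found on page")
    return issues
```

A report built this way would only highlight candidates for human review, since a non-match may simply mean the page states the same fact in a different form.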
Can we find notices and links to other official information about entities to help users with reconciliation and improve the trustworthiness of data on the platform?
- Knotty Ash is a conservation area
- planning-extract — our work with i.AI on extracting data from documents
- Data quality needs — notes on our data quality framework
- specification — the planning data model
- config — source for our pipeline configuration
- datasette — the output of our data pipeline
- quality — categories for scoring the quality of an entity, dataset and provision
The software in this project is open source and covered by the LICENSE file.
Individual datasets copied into this repository may have specific copyright and licensing, otherwise all content and data in this repository is © Crown copyright and available under the terms of the Open Government Licence v3.0.