This repository contains scenarios, test cases and code for challenges to scale how we improve the quality of data on https://planning.data.gov.uk
Testing and assuring our data quality is intended to complement our work with i.AI to increase the availability of data by extracting data from documents.
We expect to incorporate issues identified by these tests into the feedback we give to data providers, to help them improve the availability and quality of their planning data.
The planning data platform collects data from local sources and makes them available as national datasets.
We link to an information page for each local data source, a webpage containing human-readable information about the data. This page MUST be on the authoritative website for the organisation.
As an example, most LPAs have a planning policy page listing their conservation areas:
- Barnet conservation areas
- Liverpool conservation areas
- Erewash conservation areas
- Dacorum conservation areas
- Lambeth conservation areas
We record a link to these source web pages as the documentation-url in our source configuration.
LPAs are also expected to provide their conservation areas as data, following our guidance.
Our endpoint configuration contains an endpoint-url from which we collect the data. Each night the platform makes a request to each endpoint; the log contains a record of the request, and the data downloaded is saved as a resource.
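The nightly collection described above can be sketched as follows. This is an illustrative model only, not the platform's actual code: the function names, log fields, and the use of a SHA-256 content hash as the resource identifier are assumptions for the sketch.

```python
# Hypothetical sketch of the nightly collection step: one log record per
# request against an endpoint-url, with the downloaded data saved as a
# resource keyed by the hash of its content. Field names are illustrative.
import hashlib
from datetime import datetime, timezone

def resource_hash(content: bytes) -> str:
    """Key resources by content hash, so an unchanged endpoint yields
    the same resource on consecutive nights."""
    return hashlib.sha256(content).hexdigest()

def log_entry(endpoint_url: str, status: int, content: bytes) -> dict:
    """One log record for a single nightly request to an endpoint-url."""
    return {
        "endpoint-url": endpoint_url,
        "entry-date": datetime.now(timezone.utc).isoformat(),
        "status": status,
        "resource": resource_hash(content) if status == 200 else None,
    }
```

Keying resources by content hash means repeated downloads of identical data collapse to a single resource, which is useful later when looking for changes.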
We expect the endpoint-url to be documented on, and linked to from, a webpage which is recorded as the documentation-url for the endpoint.
This may be the same webpage as the information source page, but may also be a page on another domain.
For example, Barnet open data.
In this case we expect there to be a hyperlink from the information source page to the endpoint documentation page, and from the endpoint documentation page to the endpoint.
We also expect there to be a statement about the copyright and licensing of the data, which we currently record against the source.
- provision contains the organisations expected to have information about each dataset.
- organisation dataset includes a link to each organisation's website.
- source contains existing source pages
- endpoint contains existing endpoints, the URLs we download data from
Our source and endpoint data is currently very messy: the source dataset contains placeholder entries with blank URLs, many of the documentation-urls are broken links or point to endpoint webpages, and endpoint documentation-urls have been recorded against the source. We need to migrate so that each source documentation-url links to the information page.
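A first pass over this mess could triage each source row before any network checks are made. The sketch below is a hypothetical classifier, assuming rows with "documentation-url" and "endpoint-url" columns as described above; the classification rules and file-extension list are illustrative assumptions, not agreed migration logic.

```python
# Hypothetical triage of a source row's documentation-url, flagging the
# failure modes described in the text: blank placeholders, and
# documentation-urls that are really endpoints or data files.
def triage_documentation_url(row: dict) -> str:
    doc = (row.get("documentation-url") or "").strip()
    endpoint = (row.get("endpoint-url") or "").strip()
    if not doc:
        return "blank"              # placeholder entry with no URL
    if doc == endpoint:
        return "endpoint-not-page"  # endpoint recorded as documentation
    if doc.rsplit(".", 1)[-1].lower() in {"csv", "json", "geojson", "zip"}:
        return "looks-like-data"    # likely a data file, not an information page
    return "candidate-page"         # still needs an automated link check
```

Rows classed as "candidate-page" would then go on to a slower HTTP check for broken links.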
We are adding datasets as we work through our backlog of planning considerations, many of which are devolved to LPAs (currently 311 organisations). Finding sources for a new dataset on these LPA websites is time-consuming.
Once we have source and endpoint data, we need to monitor the LPA sites for changes, in particular the publication of new endpoints and changes to licensing; this monitoring is a time-consuming and error-prone activity.
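One way to automate part of that monitoring is to compare the latest resource hash per endpoint with the previous night's. This is a sketch under the assumption (introduced above, not stated by the platform) that resources are keyed by content hash; the function name and categories are illustrative.

```python
# Hypothetical change detection between two nightly collections.
# Both arguments map an endpoint-url to its latest resource hash.
def detect_changes(previous: dict, latest: dict) -> dict:
    """Report endpoints that are new, have changed content, or have
    disappeared since the previous run."""
    return {
        "new": sorted(set(latest) - set(previous)),
        "changed": sorted(e for e in latest
                          if e in previous and latest[e] != previous[e]),
        "gone": sorted(set(previous) - set(latest)),
    }
```

Detecting licensing changes would need a similar diff over the copyright and licensing statements recorded against each source, which this sketch does not cover.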
The planning data platform is an index of material information provided by LPAs and other organisations.
Each entity includes a link to the documentation-url, the webpage with human-readable content describing the entity, and a document-url, usually a PDF containing the material information, including the name of the entity, a start-date (when the entity came into force) and where the entity applies. For example:
- The entity 6100046 represents an Article 4 Direction (PDF). The direction removes permitted development rights from a single area represented by the entity [7010002601](https://www.planning.data.gov.uk/entity/7010002601), found using a datasette query.
Can we highlight where the name, date and other information in our data differs or is missing from those in these webpages and documents?
For example, we manually reviewed Conservation areas in Barnet. Can we scale this approach to provide similar reporting for other LPAs and datasets?
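A scaled version of the manual review could start by flagging entities whose name or start-date does not appear in the text of their documentation page. The sketch below is a deliberately naive first cut: `page_text` is assumed to be plain text already extracted from the webpage or PDF (fetching and extraction are out of scope), and exact substring matching will miss reworded names and differently formatted dates.

```python
# Hypothetical check that an entity's recorded fields appear in the text
# of its documentation page or document. Returns a list of discrepancies
# suitable for a per-LPA report; matching rules are illustrative only.
def field_discrepancies(entity: dict, page_text: str) -> list[str]:
    issues = []
    text = page_text.lower()
    name = (entity.get("name") or "").strip()
    if name and name.lower() not in text:
        issues.append(f"name {name!r} not found on page")
    start = entity.get("start-date") or ""
    if start and start not in page_text:
        issues.append(f"start-date {start} not found on page")
    return issues
```

A report built this way would only highlight candidates for human review, since a non-match may simply mean the page states the same fact in a different form.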
Can we find notices and links to other official information about entities to help users with reconciliation and improve the trustworthiness of data on the platform?
- Knotty Ash is a conservation area
- planning-extract — our work with i.AI on extracting data from documents
- Data quality needs — notes on our data quality framework
- specification — the planning data model
- config — source for our pipeline configuration
- datasette — the output of our data pipeline
- quality — categories for scoring the quality of an entity, dataset and provision
The software in this project is open source and covered by the LICENSE file.
Individual datasets copied into this repository may have specific copyright and licensing, otherwise all content and data in this repository is © Crown copyright and available under the terms of the Open Government Licence v3.0.