Data Analysis Tooling

Contents

This repository contains Python scripts and modules for various tooling purposes.

Requirements

Before running the scripts in this repository, make sure you have the following:

  • Python 3.11 or above
  • Virtual environment - included in Installation instructions below

Installation

Method 1 - to a local, project-specific Python virtual environment

I recommend using venv. The instructions are as follows:

  • Set up venv, preferably running this command in your data_analysis directory:
python3 -m venv .venv
  • Activate your environment:
source .venv/bin/activate
  • Install the requirements from requirements.txt into this project-specific virtual environment:
python3 -m pip install -r requirements.txt
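
To confirm that the setup above worked, a quick check can be run from the activated environment. This is a minimal sketch and not part of the repository: the function names are hypothetical, and it relies only on the standard library behaviour that sys.prefix differs from sys.base_prefix inside a venv.

```python
import sys


def meets_minimum(version_info=sys.version_info, minimum=(3, 11)) -> bool:
    """Return True if the interpreter is at least the required major.minor version."""
    return tuple(version_info[:2]) >= minimum


def in_virtual_environment() -> bool:
    """Return True when running inside a venv (prefix differs from the base install)."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print(f"Python 3.11 or above: {meets_minimum()}")
    print(f"Running inside a virtual environment: {in_virtual_environment()}")
```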

Method 2 - directly to your base Python distribution

If you would rather not use a virtual environment:

  • Use the provided requirements.txt file and install directly to your base Python environment:
pip3 install -r requirements.txt

Scripts

Retrieve Schemas and their fields from Prison API

retrieve_schema_fields.py

To run this script from within the tooling folder, based on your Python distribution:

python3 retrieve_schema_fields.py

Outputs:

  • A CSV file in the outputs directory containing the schemas and the fields within them
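
As an illustration of what listing schemas and their fields involves, here is a minimal sketch. It is not the code in retrieve_schema_fields.py: it assumes the api-docs follow the OpenAPI 3 layout (components, schemas, properties), and the function names, the CSV column names and the sample document are all hypothetical.

```python
import csv
import io


def schema_field_rows(api_docs: dict) -> list[tuple[str, str]]:
    """Return (schema, field) pairs from an OpenAPI 3 style document."""
    schemas = api_docs.get("components", {}).get("schemas", {})
    rows = []
    for schema_name, schema in sorted(schemas.items()):
        for field_name in schema.get("properties", {}):
            rows.append((schema_name, field_name))
    return rows


def rows_to_csv(rows) -> str:
    """Render the pairs as CSV text; the header names are an assumption."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Schema", "Field"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical, hand-written sample rather than real Prison API output.
sample_docs = {
    "components": {
        "schemas": {
            "SentenceCalcDates": {"properties": {"bookingId": {"type": "integer"}}},
            "AddressDto": {
                "properties": {
                    "street": {"type": "string"},
                    "town": {"type": "string"},
                }
            },
        }
    }
}

print(rows_to_csv(schema_field_rows(sample_docs)))
```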

Generate a Schema space diagram, output child-parent relations

generate_schema_diagram.py

There are several options for running this script:

  • Search for one schema
  • Search for multiple schemas
  • No search option - generate the full diagram

To run this script from within the tooling folder, based on your Python distribution:

  • Search for one schema, where the argument is a schema name in a string format
python3 generate_schema_diagram.py "AddressDto"
  • Search for multiple schemas, where the arguments are all strings of schema names:
python3 generate_schema_diagram.py "AddressDto" "SentenceCalcDates"
  • No search option, generating a full diagram:
python3 generate_schema_diagram.py

Outputs

  • A CSV file with the parent-child relations of the schemas.
  • A diagram of the schema relations in .dot format, renderable in PlantUML or a local Graphviz renderer
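
The two outputs above can be pictured with a minimal sketch. This is not the logic of generate_schema_diagram.py: it assumes child schemas are referenced through OpenAPI "$ref" entries inside a parent's properties, and the function names and the OffenderDto and AliasDto schemas are hypothetical.

```python
def find_refs(node):
    """Yield the schema name of every '$ref' found anywhere inside a fragment."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "$ref" and isinstance(value, str):
                # "#/components/schemas/AddressDto" -> "AddressDto"
                yield value.rsplit("/", 1)[-1]
            else:
                yield from find_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_refs(item)


def parent_child_relations(schemas: dict) -> list[tuple[str, str, str]]:
    """Return (parent, field, child) for each field that references another schema."""
    relations = []
    for parent, schema in schemas.items():
        for field, definition in schema.get("properties", {}).items():
            for child in find_refs(definition):
                relations.append((parent, field, child))
    return relations


def to_dot(relations) -> str:
    """Render the relations as a Graphviz DOT digraph, one labelled edge per field."""
    lines = ["digraph schemas {"]
    for parent, field, child in relations:
        lines.append(f'    "{parent}" -> "{child}" [label="{field}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical sample schemas, not taken from a real API.
sample_schemas = {
    "OffenderDto": {
        "properties": {
            "name": {"type": "string"},
            "address": {"$ref": "#/components/schemas/AddressDto"},
            "aliases": {
                "type": "array",
                "items": {"$ref": "#/components/schemas/AliasDto"},
            },
        }
    },
    "AddressDto": {"properties": {"town": {"type": "string"}}},
}

print(to_dot(parent_child_relations(sample_schemas)))
```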

Retrieve Endpoints for provided list of schemas

discover_schema_endpoints.py

This script:

  • Takes as input a CSV file of parent-child schema relations (defaults to the schema_parent_child.csv file generated by generate_schema_diagram.py)
    • Note that the input file MUST contain at least the following columns, in any order: Parent_Schema, Field, Child_Schema, Searched_bool
  • Returns a CSV table of the endpoints relevant to all parent and child schemas in the provided file. Only endpoints with successful response types (i.e. 2XX) are included
    • The output file is tabulated with the following columns: Path, HTTP_method, HTTP_response, Schema

To run this script from within the data_analysis folder:
python3 discover_schema_endpoints.py

Alternatively, you can manually specify a file to load:

python3 discover_schema_endpoints.py "outputs/some_other_file.csv"

Note that the file must contain the expected columns listed above.
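
The column check and the 2XX filtering described in this section can be sketched as follows. This is an illustration, not the code in discover_schema_endpoints.py: it assumes OpenAPI 3 style api-docs, and the function names and sample data are hypothetical. Only the column names come from the description above.

```python
import csv
import io

REQUIRED_COLUMNS = {"Parent_Schema", "Field", "Child_Schema", "Searched_bool"}
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def schemas_from_csv(csv_text: str) -> set[str]:
    """Check the required columns, then collect every parent and child schema name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Input file is missing columns: {sorted(missing)}")
    names = set()
    for row in reader:
        names.update(n for n in (row["Parent_Schema"], row["Child_Schema"]) if n)
    return names


def find_refs(node):
    """Yield the schema name of every '$ref' found anywhere inside a fragment."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "$ref" and isinstance(value, str):
                yield value.rsplit("/", 1)[-1]
            else:
                yield from find_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_refs(item)


def endpoints_for_schemas(api_docs: dict, wanted: set[str]) -> list[tuple]:
    """Return (Path, HTTP_method, HTTP_response, Schema) rows for 2XX responses only."""
    rows = []
    for path, operations in api_docs.get("paths", {}).items():
        for method, operation in operations.items():
            if method not in HTTP_METHODS:
                continue  # skip non-operation keys such as "parameters"
            for status, response in operation.get("responses", {}).items():
                if not str(status).startswith("2"):
                    continue  # keep successful responses only
                for schema in find_refs(response):
                    if schema in wanted:
                        rows.append((path, method.upper(), str(status), schema))
    return rows


# Hypothetical sample input and api-docs.
sample_csv = (
    "Parent_Schema,Field,Child_Schema,Searched_bool\n"
    "OffenderDto,address,AddressDto,True\n"
)
ref = "#/components/schemas/"
sample_docs = {
    "paths": {
        "/api/addresses/{id}": {
            "get": {
                "responses": {
                    "200": {"content": {"application/json": {"schema": {"$ref": ref + "AddressDto"}}}},
                    "404": {"content": {"application/json": {"schema": {"$ref": ref + "ErrorResponse"}}}},
                }
            }
        }
    }
}

print(endpoints_for_schemas(sample_docs, schemas_from_csv(sample_csv)))
```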

Search all published APIs for search phrase

search_apis_for_phrase.py

This script:

  • Takes a search term or phrase and searches every API listed in the published APIs of Structurizr
  • The search is insensitive to case and to space delimiters, and scans only the paths and schemas of the api-docs (which is where the relevant information will be).
  • It builds one in-memory data frame of results for the Schema search and one for the Path search, and outputs both as CSV tables.
python3 search_apis_for_phrase.py "search phrase"
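
One way to picture the matching is shown below. This is a minimal sketch, not the code in search_apis_for_phrase.py: it assumes that "space delimiter" insensitivity means ignoring spaces, underscores and hyphens, it uses plain lists where the script uses data frames, and the function names and sample api-docs are hypothetical.

```python
import re


def normalise(text: str) -> str:
    """Lower-case the text and drop spaces, underscores and hyphens."""
    return re.sub(r"[\s_\-]+", "", text).lower()


def matches(phrase: str, candidate: str) -> bool:
    """True if the normalised phrase occurs inside the normalised candidate."""
    return normalise(phrase) in normalise(candidate)


def search_api_docs(phrase: str, api_docs: dict) -> tuple[list[str], list[str]]:
    """Return (matching schema names, matching paths) for one api-docs document."""
    schemas = api_docs.get("components", {}).get("schemas", {})
    schema_hits = [name for name in schemas if matches(phrase, name)]
    path_hits = [path for path in api_docs.get("paths", {}) if matches(phrase, path)]
    return schema_hits, path_hits


# Hypothetical sample api-docs.
sample_docs = {
    "paths": {"/api/sentence-calc/dates": {}, "/api/addresses": {}},
    "components": {"schemas": {"SentenceCalcDates": {}, "AddressDto": {}}},
}

print(search_api_docs("Sentence Calc", sample_docs))
```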

Limitations:

  • The script will handle timeouts and other common API errors.
  • If a URL cannot be accessed from a non-MoJ device, this script cannot access it from that device either
    • As a result, the search only covers all of the links when it is run on an MoJ device
  • The URLs are currently hardcoded because there is no obvious way of retrieving the api-docs dynamically
    • A potential upgrade could be a web scraper, but I don't know if the added complexity is worth the effort

Modules

This repository contains the following Python modules:

Constants

There is a constants directory, initialised to be a module directory, containing constants and common objects used by the scripts. This allows you to edit constants in one location, without having to amend other scripts in the base data_analysis directory. Feel free to explore them for more functionality.

Contributing

Contributions to this tooling section are welcome, as long as they can be executed in Python. Autodocumentation is a possibility, and the aim is to keep that option open as this tooling section expands.