Skip to content

Commit

Permalink
Updated storage management (#829)
Browse files Browse the repository at this point in the history
* remove archive, separate api and data retention

* update docs

* commit cron update

* commit test.env updte

* commit example update

* Update data-retention.rst

* Update env.py

---------

Co-authored-by: Tom Kralidis <tomkralidis@gmail.com>
  • Loading branch information
maaikelimper and tomkralidis authored Jan 9, 2025
1 parent 5af06bf commit df57aa3
Show file tree
Hide file tree
Showing 10 changed files with 81 additions and 83 deletions.
2 changes: 1 addition & 1 deletion docs/source/reference/configuration.rst
Original file line number Diff line number Diff line change
Expand Up @@ -70,8 +70,8 @@ The following environment variables can be used to configure `WIS2BOX_STORAGE`.
WIS2BOX_STORAGE_PASSWORD=minio123 # password for the storage-layer
WIS2BOX_STORAGE_INCOMING=wis2box-incoming # name of the storage-bucket/folder for incoming files
WIS2BOX_STORAGE_PUBLIC=wis2box-public # name of the storage-bucket/folder for public files
WIS2BOX_STORAGE_ARCHIVE=wis2box-archive # name of the storage-bucket/folder for archived data
WIS2BOX_STORAGE_DATA_RETENTION_DAYS=7 # number of days to keep files in incoming and public
WIS2BOX_STORAGE_API_RETENTION_DAYS=7 # number of days to keep files in API backend
MinIO
Expand Down
25 changes: 15 additions & 10 deletions docs/source/reference/running/data-retention.rst
Original file line number Diff line number Diff line change
Expand Up @@ -5,12 +5,15 @@ Data retention
==============

wis2box is configured to set data retention according to your requirements. Data retention is managed
via the ``WIS2BOX_STORAGE_DATA_RETENTION_DAYS`` environment variable as part of configuring wis2box.
via the ``WIS2BOX_STORAGE_DATA_RETENTION_DAYS`` and ``WIS2BOX_STORAGE_API_RETENTION_DAYS`` environment variables as part of configuring wis2box.

Cleaning
--------
Once a day, at UTC midnight, wis2box will run the commands ``wis2box data clean`` and ``wis2box api clean`` to remove data older than the specified retention period
(cronjob defined in ``wis2box-management/docker/wis2box.cron``).

Cleaning applies to storage defined by ``WIS2BOX_STORAGE_PUBLIC`` and involves the deletion of files after set amount of time.
Cleaning (storage)
------------------

Cleaning applies to storage defined by ``WIS2BOX_STORAGE_PUBLIC`` and ``WIS2BOX_STORAGE_INCOMING`` and involves the deletion of files after set amount of time.

Cleaning is performed by default daily at 0Z by the system, and can also be run interactively with:

Expand All @@ -24,15 +27,17 @@ Cleaning is performed by default daily at 0Z by the system, and can also be run
wis2box data clean --days=30
Archiving
---------
Cleaning (API)
--------------

Archiving applies to storage defined by ``WIS2BOX_STORAGE_INCOMING`` and involves moving files to the storage defined by ``WIS2BOX_STORAGE_ARCHIVE``.
Cleaning applies to data in the API backend and involves the deletion of records after a set amount of time.

Archive is performed on incoming data by default daily at 1Z by the system, and can also be run interactively with:
Cleaning is performed by default daily at 0Z by the system, and can also be run interactively with:

.. code-block:: bash
wis2box data archive
# delete data older than WIS2BOX_STORAGE_API_RETENTION_DAYS by default
wis2box api clean
Only files with a timestamp older than one hour are considered for archiving.
# delete data older than --days (force override)
wis2box api clean --days=30
2 changes: 1 addition & 1 deletion tests/test.env
Original file line number Diff line number Diff line change
Expand Up @@ -31,9 +31,9 @@ WIS2BOX_BASEMAP_ATTRIBUTION=<a href="https://osm.org/copyright">OpenStreetMap</a
WIS2BOX_STORAGE_TYPE=S3
WIS2BOX_STORAGE_SOURCE=http://minio:9000
WIS2BOX_STORAGE_INCOMING=wis2box-incoming
WIS2BOX_STORAGE_ARCHIVE=wis2box-archive
WIS2BOX_STORAGE_PUBLIC=wis2box-public
WIS2BOX_STORAGE_DATA_RETENTION_DAYS=7
WIS2BOX_STORAGE_API_RETENTION_DAYS=7
WIS2BOX_STORAGE_USERNAME=wis2box
WIS2BOX_STORAGE_PASSWORD=minio123

Expand Down
2 changes: 1 addition & 1 deletion wis2box-create-config.py
Original file line number Diff line number Diff line change
Expand Up @@ -367,9 +367,9 @@ def create_wis2box_env(host_datadir: str) -> None:
fh.write('WIS2BOX_STORAGE_TYPE=S3\n')
fh.write('WIS2BOX_STORAGE_SOURCE=http://minio:9000\n')
fh.write('WIS2BOX_STORAGE_INCOMING=wis2box-incoming\n')
fh.write('WIS2BOX_STORAGE_ARCHIVE=wis2box-archive\n')
fh.write('WIS2BOX_STORAGE_PUBLIC=wis2box-public\n')
fh.write('WIS2BOX_STORAGE_DATA_RETENTION_DAYS=30\n')
fh.write('WIS2BOX_STORAGE_API_RETENTION_DAYS=100\n')
# use the default username wis2box for WIS2BOX_STORAGE_USERNAME
fh.write('WIS2BOX_STORAGE_USERNAME=wis2box\n')
# get password for WIS2BOX_STORAGE_PASSWORD and write it to wis2box.env
Expand Down
4 changes: 2 additions & 2 deletions wis2box-management/docker/wis2box.cron
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
0 0 * * * su wis2box -c "wis2box data clean --days=$WIS2BOX_STORAGE_DATA_RETENTION_DAYS" > /proc/1/fd/1 2>/proc/1/fd/2
0 1 * * * su wis2box -c "wis2box data archive" > /proc/1/fd/1 2>/proc/1/fd/2
0 0 * * * su wis2box -c "wis2box data clean > /proc/1/fd/1 2>/proc/1/fd/2
0 1 * * * su wis2box -c "wis2box api clean > /proc/1/fd/1 2>/proc/1/fd/2
0 15 * * * su wis2box -c "wis2box metadata discovery republish" > /proc/1/fd/1 2>/proc/1/fd/2
*/10 * * * * su wis2box -c "echo 'wis2box.cron is alive'" > /proc/1/fd/1 2>/proc/1/fd/2
28 changes: 27 additions & 1 deletion wis2box-management/wis2box/api/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,8 @@
from wis2box.api.config import load_config
from wis2box.data_mappings import get_plugins

from wis2box.env import (DOCKER_API_URL, API_URL)
from wis2box.env import (DOCKER_API_URL, API_URL, STORAGE_API_RETENTION_DAYS,
STORAGE_DATA_RETENTION_DAYS)

LOGGER = logging.getLogger(__name__)

Expand Down Expand Up @@ -302,6 +303,31 @@ def delete_collection(ctx, collection, verbosity):
click.echo('Collection deleted')


@click.command()
@click.pass_context
@click.option('--days', '-d', help='Number of days of data to keep in API-backend', type=int) # noqa
@cli_helpers.OPTION_VERBOSITY
def clean(ctx, days, verbosity):
"""Clean data from API backend older than X days"""

if days is not None:
click.echo(f'Using days={days}')
days_ = days
elif STORAGE_API_RETENTION_DAYS is not None:
click.echo(f'Using STORAGE_API_RETENTION_DAYS={STORAGE_API_RETENTION_DAYS}') # noqa
days_ = STORAGE_API_RETENTION_DAYS
else:
click.echo(f'Using STORAGE_DATA_RETENTION_DAYS={STORAGE_DATA_RETENTION_DAYS}') # noqa
days_ = STORAGE_DATA_RETENTION_DAYS

if days_ is None or days_ < 0:
click.echo('No api data retention set. Skipping')
else:
LOGGER.debug('Cleaning API data backend')
delete_collections_by_retention(days_)


api.add_command(setup)
api.add_command(add_collection)
api.add_command(delete_collection)
api.add_command(clean)
86 changes: 27 additions & 59 deletions wis2box-management/wis2box/data/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,53 +19,26 @@
#
###############################################################################

from datetime import datetime, timedelta, timezone
import logging
from datetime import datetime, timedelta, timezone
from typing import Union

import click

from wis2box import cli_helpers
from wis2box.api import (setup_collection, remove_collection,
delete_collections_by_retention,
reindex_collection)
from wis2box.data_mappings import get_data_mappings
from wis2box.env import (STORAGE_SOURCE, STORAGE_ARCHIVE, STORAGE_PUBLIC,
STORAGE_DATA_RETENTION_DAYS, STORAGE_INCOMING)
from wis2box.env import (STORAGE_SOURCE, STORAGE_PUBLIC, STORAGE_INCOMING,
STORAGE_DATA_RETENTION_DAYS)
from wis2box.handler import Handler
from wis2box.metadata.discovery import DiscoveryMetadata
from wis2box.storage import put_data, move_data, list_content, delete_data
from wis2box.util import older_than, walk_path
from wis2box.storage import put_data, list_content, delete_data
from wis2box.util import walk_path

LOGGER = logging.getLogger(__name__)


def archive_data(source_path: str, archive_path: str) -> None:
"""
Archive data based on today's date (YYYY-MM-DD)
:param source_path: `str` of base storage-path for source
:param arcive_path: `str` of base storage-path for archive
:returns: `None`
"""

today_dir = f"{archive_path}/{datetime.now().date().strftime('%Y-%m-%d')}"
LOGGER.debug(f'Archive directory={today_dir}')
datetime_now = datetime.now(timezone.utc)
LOGGER.debug(f'datetime_now={datetime_now}')
for obj in list_content(source_path):
storage_path = obj['fullpath']
archive_path = f"{today_dir}/{obj['filename']}"
LOGGER.debug(f"filename={obj['filename']}")
LOGGER.debug(f"last_modified={obj['last_modified']}")
if obj['last_modified'] < datetime_now - timedelta(hours=1):
LOGGER.debug(f'Moving {storage_path} to {archive_path}')
move_data(storage_path, archive_path)
else:
LOGGER.debug(f"{storage_path} created less than 1 h ago, skip")


def clean_data(source_path: str, days: int) -> None:
"""
Remove data older than n days from source_path and API indexes')
Expand All @@ -76,21 +49,24 @@ def clean_data(source_path: str, days: int) -> None:
:returns: `None`
"""

LOGGER.debug(f'Clean files in {source_path} older than {days} day(s)')
before = datetime.now(timezone.utc) - timedelta(days=days)
LOGGER.info(f'Deleting data older than {before} from {source_path}')
nfiles_deleted = 0
for obj in list_content(source_path):
if obj['basedir'] == 'metadata':
LOGGER.debug('Skipping metadata')
continue
# don't delete files in the base-directory
if obj['basedir'] == '' or obj['basedir'] == obj['filename']:
continue
storage_path = obj['fullpath']
if older_than(obj['basedir'], days):
LOGGER.debug(f"{obj['basedir']} is older than {days} days")
LOGGER.debug(f'Deleting {storage_path}')
LOGGER.debug(f"filename={obj['filename']}")
LOGGER.debug(f"last_modified={obj['last_modified']}")
if obj['last_modified'] < before:
LOGGER.debug(f"Deleting {storage_path}")
delete_data(storage_path)
else:
LOGGER.debug(f"{obj['basedir']} less than {days} days old")

LOGGER.debug('Cleaning API indexes')
delete_collections_by_retention(days)
nfiles_deleted += 1
LOGGER.info(f'Deleted {nfiles_deleted} files from {source_path}')


def gcm(mcf: Union[dict, str]) -> dict:
Expand Down Expand Up @@ -157,37 +133,30 @@ def data():
pass


@click.command()
@click.pass_context
@cli_helpers.OPTION_VERBOSITY
def archive(ctx, verbosity):
"""Move data from incoming storage to archive storage"""

source_path = f'{STORAGE_SOURCE}/{STORAGE_INCOMING}'
archive_path = f'{STORAGE_SOURCE}/{STORAGE_ARCHIVE}'

click.echo(f'Archiving data from {source_path} to {archive_path}')
archive_data(source_path, archive_path)


@click.command()
@click.pass_context
@click.option('--days', '-d', help='Number of days of data to keep', type=int)
@cli_helpers.OPTION_VERBOSITY
def clean(ctx, days, verbosity):
"""Clean data directories and API indexes"""
"""Clean data from storage older than X days"""

if days is not None:
click.echo(f'Using data retention days: {days}')
days_ = days
else:
click.echo(f'Using default data retention days: {STORAGE_DATA_RETENTION_DAYS}') # noqa
days_ = STORAGE_DATA_RETENTION_DAYS

if days_ is None or days_ < 0:
click.echo('No data retention set. Skipping')
else:
storage_path = f'{STORAGE_SOURCE}/{STORAGE_PUBLIC}'
click.echo(f'Deleting data > {days_} day(s) old from {storage_path}')
clean_data(storage_path, days_)
storage_path_public = f'{STORAGE_SOURCE}/{STORAGE_PUBLIC}'
click.echo(f'Deleting data > {days_} day(s) old from {storage_path_public}') # noqa
clean_data(storage_path_public, days_)
storage_path_incoming = f'{STORAGE_SOURCE}/{STORAGE_INCOMING}'
click.echo(f'Deleting data > {days_} day(s) old from {storage_path_incoming}') # noqa
clean_data(storage_path_incoming, days_)
click.echo('Done')


@click.command()
Expand Down Expand Up @@ -286,7 +255,6 @@ def add_collection_items(ctx, topic_hierarchy, path, recursive, verbosity):
click.echo('Done')


data.add_command(archive)
data.add_command(clean)
data.add_command(ingest)
data.add_command(add_collection)
Expand Down
7 changes: 5 additions & 2 deletions wis2box-management/wis2box/env.py
Original file line number Diff line number Diff line change
Expand Up @@ -59,14 +59,18 @@
STORAGE_USERNAME = os.environ.get('WIS2BOX_STORAGE_USERNAME', 'wis2box')
STORAGE_PASSWORD = os.environ.get('WIS2BOX_STORAGE_PASSWORD', 'minio123')
STORAGE_INCOMING = os.environ.get('WIS2BOX_STORAGE_INCOMING', 'wis2box-incoming') # noqa
STORAGE_ARCHIVE = os.environ.get('WIS2BOX_STORAGE_ARCHIVE', 'wis2box-archive')
STORAGE_PUBLIC = os.environ.get('WIS2BOX_STORAGE_PUBLIC', 'wis2box-public')

try:
STORAGE_DATA_RETENTION_DAYS = int(os.environ.get('WIS2BOX_STORAGE_DATA_RETENTION_DAYS')) # noqa
except TypeError:
STORAGE_DATA_RETENTION_DAYS = None

try:
STORAGE_API_RETENTION_DAYS = int(os.environ.get('WIS2BOX_STORAGE_API_RETENTION_DAYS', 100)) # noqa
except TypeError:
STORAGE_API_RETENTION_DAYS = None

LOGLEVEL = os.environ.get('WIS2BOX_LOGGING_LOGLEVEL', 'ERROR')
LOGFILE = os.environ.get('WIS2BOX_LOGGING_LOGFILE', 'stdout')

Expand Down Expand Up @@ -159,7 +163,6 @@ def create(ctx, verbosity):

storages = {
STORAGE_INCOMING: 'private',
STORAGE_ARCHIVE: 'private',
STORAGE_PUBLIC: 'readonly'
}
for key, value in storages.items():
Expand Down
6 changes: 1 addition & 5 deletions wis2box-management/wis2box/pubsub/subscribe.py
Original file line number Diff line number Diff line change
Expand Up @@ -38,8 +38,7 @@
from wis2box.data.message import MessageData

from wis2box.env import (DATADIR, DOCKER_BROKER,
STORAGE_SOURCE, STORAGE_ARCHIVE,
STORAGE_INCOMING)
STORAGE_SOURCE, STORAGE_INCOMING)
from wis2box.handler import Handler, NotHandledError
import wis2box.metadata.discovery as discovery_metadata
from wis2box.plugin import load_plugin, PLUGINS
Expand Down Expand Up @@ -156,9 +155,6 @@ def on_message_handler(self, client, userdata, msg):
LOGGER.info(f'Do not process directories: {key}')
return
filepath = f'{STORAGE_SOURCE}/{key}'
if key.startswith(STORAGE_ARCHIVE):
LOGGER.info(f'Do not process archived-data: {key}')
return
# start a new process to handle the received data
while len(mp.active_children()) == mp.cpu_count():
sleep(0.05)
Expand Down
2 changes: 1 addition & 1 deletion wis2box.env.example
Original file line number Diff line number Diff line change
Expand Up @@ -43,8 +43,8 @@ WIS2BOX_STORAGE_USERNAME=minio
WIS2BOX_STORAGE_PASSWORD=minio123
WIS2BOX_STORAGE_INCOMING=wis2box-incoming
WIS2BOX_STORAGE_PUBLIC=wis2box-public
WIS2BOX_STORAGE_ARCHIVE=wis2box-archive
WIS2BOX_STORAGE_DATA_RETENTION_DAYS=7
WIS2BOX_STORAGE_API_RETENTION_DAYS=7

# you should be okay from here

Expand Down

0 comments on commit df57aa3

Please sign in to comment.