Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated storage management #829

Merged
merged 7 commits into from
Jan 9, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion docs/source/reference/configuration.rst
Original file line number Diff line number Diff line change
Expand Up @@ -70,8 +70,8 @@ The following environment variables can be used to configure `WIS2BOX_STORAGE`.
WIS2BOX_STORAGE_PASSWORD=minio123 # password for the storage-layer
WIS2BOX_STORAGE_INCOMING=wis2box-incoming # name of the storage-bucket/folder for incoming files
WIS2BOX_STORAGE_PUBLIC=wis2box-public # name of the storage-bucket/folder for public files
WIS2BOX_STORAGE_ARCHIVE=wis2box-archive # name of the storage-bucket/folder for archived data
WIS2BOX_STORAGE_DATA_RETENTION_DAYS=7 # number of days to keep files in incoming and public
WIS2BOX_STORAGE_API_RETENTION_DAYS=7 # number of days to keep files in API backend


MinIO
Expand Down
25 changes: 15 additions & 10 deletions docs/source/reference/running/data-retention.rst
Original file line number Diff line number Diff line change
Expand Up @@ -5,12 +5,15 @@ Data retention
==============

wis2box is configured to set data retention according to your requirements. Data retention is managed
via the ``WIS2BOX_STORAGE_DATA_RETENTION_DAYS`` environment variable as part of configuring wis2box.
via the ``WIS2BOX_STORAGE_DATA_RETENTION_DAYS`` and ``WIS2BOX_STORAGE_API_RETENTION_DAYS`` environment variables as part of configuring wis2box.

Cleaning
--------
Once a day, at UTC midnight, wis2box will run the commands ``wis2box data clean`` and ``wis2box api clean`` to remove data older than the specified retention period
(cronjob defined in ``wis2box-management/docker/wis2box.cron``).

Cleaning applies to storage defined by ``WIS2BOX_STORAGE_PUBLIC`` and involves the deletion of files after set amount of time.
Cleaning (storage)
------------------

Cleaning applies to storage defined by ``WIS2BOX_STORAGE_PUBLIC`` and ``WIS2BOX_STORAGE_INCOMING`` and involves the deletion of files after set amount of time.

Cleaning is performed by default daily at 0Z by the system, and can also be run interactively with:

Expand All @@ -24,15 +27,17 @@ Cleaning is performed by default daily at 0Z by the system, and can also be run
wis2box data clean --days=30


Archiving
---------
Cleaning (API)
--------------

Archiving applies to storage defined by ``WIS2BOX_STORAGE_INCOMING`` and involves moving files to the storage defined by ``WIS2BOX_STORAGE_ARCHIVE``.
Cleaning applies to data in the API backend and involves the deletion of records after a set amount of time.

Archive is performed on incoming data by default daily at 1Z by the system, and can also be run interactively with:
Cleaning is performed by default daily at 0Z by the system, and can also be run interactively with:

.. code-block:: bash

wis2box data archive
# delete data older than WIS2BOX_STORAGE_API_RETENTION_DAYS by default
wis2box api clean

Only files with a timestamp older than one hour are considered for archiving.
# delete data older than --days (force override)
wis2box api clean --days=30
2 changes: 1 addition & 1 deletion tests/test.env
Original file line number Diff line number Diff line change
Expand Up @@ -31,9 +31,9 @@ WIS2BOX_BASEMAP_ATTRIBUTION=<a href="https://osm.org/copyright">OpenStreetMap</a
WIS2BOX_STORAGE_TYPE=S3
WIS2BOX_STORAGE_SOURCE=http://minio:9000
WIS2BOX_STORAGE_INCOMING=wis2box-incoming
WIS2BOX_STORAGE_ARCHIVE=wis2box-archive
WIS2BOX_STORAGE_PUBLIC=wis2box-public
WIS2BOX_STORAGE_DATA_RETENTION_DAYS=7
WIS2BOX_STORAGE_API_RETENTION_DAYS=7
WIS2BOX_STORAGE_USERNAME=wis2box
WIS2BOX_STORAGE_PASSWORD=minio123

Expand Down
2 changes: 1 addition & 1 deletion wis2box-create-config.py
Original file line number Diff line number Diff line change
Expand Up @@ -367,9 +367,9 @@ def create_wis2box_env(host_datadir: str) -> None:
fh.write('WIS2BOX_STORAGE_TYPE=S3\n')
fh.write('WIS2BOX_STORAGE_SOURCE=http://minio:9000\n')
fh.write('WIS2BOX_STORAGE_INCOMING=wis2box-incoming\n')
fh.write('WIS2BOX_STORAGE_ARCHIVE=wis2box-archive\n')
fh.write('WIS2BOX_STORAGE_PUBLIC=wis2box-public\n')
fh.write('WIS2BOX_STORAGE_DATA_RETENTION_DAYS=30\n')
fh.write('WIS2BOX_STORAGE_API_RETENTION_DAYS=100\n')
# use the default username wis2box for WIS2BOX_STORAGE_USERNAME
fh.write('WIS2BOX_STORAGE_USERNAME=wis2box\n')
# get password for WIS2BOX_STORAGE_PASSWORD and write it to wis2box.env
Expand Down
4 changes: 2 additions & 2 deletions wis2box-management/docker/wis2box.cron
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
0 0 * * * su wis2box -c "wis2box data clean --days=$WIS2BOX_STORAGE_DATA_RETENTION_DAYS" > /proc/1/fd/1 2>/proc/1/fd/2
0 1 * * * su wis2box -c "wis2box data archive" > /proc/1/fd/1 2>/proc/1/fd/2
0 0 * * * su wis2box -c "wis2box data clean > /proc/1/fd/1 2>/proc/1/fd/2
0 1 * * * su wis2box -c "wis2box api clean > /proc/1/fd/1 2>/proc/1/fd/2
0 15 * * * su wis2box -c "wis2box metadata discovery republish" > /proc/1/fd/1 2>/proc/1/fd/2
*/10 * * * * su wis2box -c "echo 'wis2box.cron is alive'" > /proc/1/fd/1 2>/proc/1/fd/2
28 changes: 27 additions & 1 deletion wis2box-management/wis2box/api/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,8 @@
from wis2box.api.config import load_config
from wis2box.data_mappings import get_plugins

from wis2box.env import (DOCKER_API_URL, API_URL)
from wis2box.env import (DOCKER_API_URL, API_URL, STORAGE_API_RETENTION_DAYS,
STORAGE_DATA_RETENTION_DAYS)

LOGGER = logging.getLogger(__name__)

Expand Down Expand Up @@ -302,6 +303,31 @@ def delete_collection(ctx, collection, verbosity):
click.echo('Collection deleted')


@click.command()
@click.pass_context
@click.option('--days', '-d', help='Number of days of data to keep in API-backend', type=int) # noqa
@cli_helpers.OPTION_VERBOSITY
def clean(ctx, days, verbosity):
"""Clean data from API backend older than X days"""

if days is not None:
click.echo(f'Using days={days}')
days_ = days
elif STORAGE_API_RETENTION_DAYS is not None:
click.echo(f'Using STORAGE_API_RETENTION_DAYS={STORAGE_API_RETENTION_DAYS}') # noqa
days_ = STORAGE_API_RETENTION_DAYS
else:
click.echo(f'Using STORAGE_DATA_RETENTION_DAYS={STORAGE_DATA_RETENTION_DAYS}') # noqa
days_ = STORAGE_DATA_RETENTION_DAYS

if days_ is None or days_ < 0:
click.echo('No api data retention set. Skipping')
else:
LOGGER.debug('Cleaning API data backend')
delete_collections_by_retention(days_)


api.add_command(setup)
api.add_command(add_collection)
api.add_command(delete_collection)
api.add_command(clean)
86 changes: 27 additions & 59 deletions wis2box-management/wis2box/data/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,53 +19,26 @@
#
###############################################################################

from datetime import datetime, timedelta, timezone
import logging
from datetime import datetime, timedelta, timezone
from typing import Union

import click

from wis2box import cli_helpers
from wis2box.api import (setup_collection, remove_collection,
delete_collections_by_retention,
reindex_collection)
from wis2box.data_mappings import get_data_mappings
from wis2box.env import (STORAGE_SOURCE, STORAGE_ARCHIVE, STORAGE_PUBLIC,
STORAGE_DATA_RETENTION_DAYS, STORAGE_INCOMING)
from wis2box.env import (STORAGE_SOURCE, STORAGE_PUBLIC, STORAGE_INCOMING,
STORAGE_DATA_RETENTION_DAYS)
from wis2box.handler import Handler
from wis2box.metadata.discovery import DiscoveryMetadata
from wis2box.storage import put_data, move_data, list_content, delete_data
from wis2box.util import older_than, walk_path
from wis2box.storage import put_data, list_content, delete_data
from wis2box.util import walk_path

LOGGER = logging.getLogger(__name__)


def archive_data(source_path: str, archive_path: str) -> None:
"""
Archive data based on today's date (YYYY-MM-DD)

:param source_path: `str` of base storage-path for source
:param arcive_path: `str` of base storage-path for archive

:returns: `None`
"""

today_dir = f"{archive_path}/{datetime.now().date().strftime('%Y-%m-%d')}"
LOGGER.debug(f'Archive directory={today_dir}')
datetime_now = datetime.now(timezone.utc)
LOGGER.debug(f'datetime_now={datetime_now}')
for obj in list_content(source_path):
storage_path = obj['fullpath']
archive_path = f"{today_dir}/{obj['filename']}"
LOGGER.debug(f"filename={obj['filename']}")
LOGGER.debug(f"last_modified={obj['last_modified']}")
if obj['last_modified'] < datetime_now - timedelta(hours=1):
LOGGER.debug(f'Moving {storage_path} to {archive_path}')
move_data(storage_path, archive_path)
else:
LOGGER.debug(f"{storage_path} created less than 1 h ago, skip")


def clean_data(source_path: str, days: int) -> None:
"""
Remove data older than n days from source_path and API indexes')
Expand All @@ -76,21 +49,24 @@ def clean_data(source_path: str, days: int) -> None:
:returns: `None`
"""

LOGGER.debug(f'Clean files in {source_path} older than {days} day(s)')
before = datetime.now(timezone.utc) - timedelta(days=days)
LOGGER.info(f'Deleting data older than {before} from {source_path}')
nfiles_deleted = 0
for obj in list_content(source_path):
if obj['basedir'] == 'metadata':
LOGGER.debug('Skipping metadata')
continue
# don't delete files in the base-directory
if obj['basedir'] == '' or obj['basedir'] == obj['filename']:
continue
storage_path = obj['fullpath']
if older_than(obj['basedir'], days):
LOGGER.debug(f"{obj['basedir']} is older than {days} days")
LOGGER.debug(f'Deleting {storage_path}')
LOGGER.debug(f"filename={obj['filename']}")
LOGGER.debug(f"last_modified={obj['last_modified']}")
if obj['last_modified'] < before:
LOGGER.debug(f"Deleting {storage_path}")
delete_data(storage_path)
else:
LOGGER.debug(f"{obj['basedir']} less than {days} days old")

LOGGER.debug('Cleaning API indexes')
delete_collections_by_retention(days)
nfiles_deleted += 1
LOGGER.info(f'Deleted {nfiles_deleted} files from {source_path}')


def gcm(mcf: Union[dict, str]) -> dict:
Expand Down Expand Up @@ -157,37 +133,30 @@ def data():
pass


@click.command()
@click.pass_context
@cli_helpers.OPTION_VERBOSITY
def archive(ctx, verbosity):
"""Move data from incoming storage to archive storage"""

source_path = f'{STORAGE_SOURCE}/{STORAGE_INCOMING}'
archive_path = f'{STORAGE_SOURCE}/{STORAGE_ARCHIVE}'

click.echo(f'Archiving data from {source_path} to {archive_path}')
archive_data(source_path, archive_path)


@click.command()
@click.pass_context
@click.option('--days', '-d', help='Number of days of data to keep', type=int)
@cli_helpers.OPTION_VERBOSITY
def clean(ctx, days, verbosity):
"""Clean data directories and API indexes"""
"""Clean data from storage older than X days"""

if days is not None:
click.echo(f'Using data retention days: {days}')
days_ = days
else:
click.echo(f'Using default data retention days: {STORAGE_DATA_RETENTION_DAYS}') # noqa
days_ = STORAGE_DATA_RETENTION_DAYS

if days_ is None or days_ < 0:
click.echo('No data retention set. Skipping')
else:
storage_path = f'{STORAGE_SOURCE}/{STORAGE_PUBLIC}'
click.echo(f'Deleting data > {days_} day(s) old from {storage_path}')
clean_data(storage_path, days_)
storage_path_public = f'{STORAGE_SOURCE}/{STORAGE_PUBLIC}'
click.echo(f'Deleting data > {days_} day(s) old from {storage_path_public}') # noqa
clean_data(storage_path_public, days_)
storage_path_incoming = f'{STORAGE_SOURCE}/{STORAGE_INCOMING}'
click.echo(f'Deleting data > {days_} day(s) old from {storage_path_incoming}') # noqa
clean_data(storage_path_incoming, days_)
click.echo('Done')


@click.command()
Expand Down Expand Up @@ -286,7 +255,6 @@ def add_collection_items(ctx, topic_hierarchy, path, recursive, verbosity):
click.echo('Done')


data.add_command(archive)
data.add_command(clean)
data.add_command(ingest)
data.add_command(add_collection)
Expand Down
7 changes: 5 additions & 2 deletions wis2box-management/wis2box/env.py
Original file line number Diff line number Diff line change
Expand Up @@ -59,14 +59,18 @@
STORAGE_USERNAME = os.environ.get('WIS2BOX_STORAGE_USERNAME', 'wis2box')
STORAGE_PASSWORD = os.environ.get('WIS2BOX_STORAGE_PASSWORD', 'minio123')
STORAGE_INCOMING = os.environ.get('WIS2BOX_STORAGE_INCOMING', 'wis2box-incoming') # noqa
STORAGE_ARCHIVE = os.environ.get('WIS2BOX_STORAGE_ARCHIVE', 'wis2box-archive')
STORAGE_PUBLIC = os.environ.get('WIS2BOX_STORAGE_PUBLIC', 'wis2box-public')

try:
STORAGE_DATA_RETENTION_DAYS = int(os.environ.get('WIS2BOX_STORAGE_DATA_RETENTION_DAYS')) # noqa
except TypeError:
STORAGE_DATA_RETENTION_DAYS = None

try:
STORAGE_API_RETENTION_DAYS = int(os.environ.get('WIS2BOX_STORAGE_API_RETENTION_DAYS', 100)) # noqa
except TypeError:
STORAGE_API_RETENTION_DAYS = None

LOGLEVEL = os.environ.get('WIS2BOX_LOGGING_LOGLEVEL', 'ERROR')
LOGFILE = os.environ.get('WIS2BOX_LOGGING_LOGFILE', 'stdout')

Expand Down Expand Up @@ -159,7 +163,6 @@ def create(ctx, verbosity):

storages = {
STORAGE_INCOMING: 'private',
STORAGE_ARCHIVE: 'private',
STORAGE_PUBLIC: 'readonly'
}
for key, value in storages.items():
Expand Down
6 changes: 1 addition & 5 deletions wis2box-management/wis2box/pubsub/subscribe.py
Original file line number Diff line number Diff line change
Expand Up @@ -38,8 +38,7 @@
from wis2box.data.message import MessageData

from wis2box.env import (DATADIR, DOCKER_BROKER,
STORAGE_SOURCE, STORAGE_ARCHIVE,
STORAGE_INCOMING)
STORAGE_SOURCE, STORAGE_INCOMING)
from wis2box.handler import Handler, NotHandledError
import wis2box.metadata.discovery as discovery_metadata
from wis2box.plugin import load_plugin, PLUGINS
Expand Down Expand Up @@ -156,9 +155,6 @@ def on_message_handler(self, client, userdata, msg):
LOGGER.info(f'Do not process directories: {key}')
return
filepath = f'{STORAGE_SOURCE}/{key}'
if key.startswith(STORAGE_ARCHIVE):
LOGGER.info(f'Do not process archived-data: {key}')
return
# start a new process to handle the received data
while len(mp.active_children()) == mp.cpu_count():
sleep(0.05)
Expand Down
2 changes: 1 addition & 1 deletion wis2box.env.example
Original file line number Diff line number Diff line change
Expand Up @@ -43,8 +43,8 @@ WIS2BOX_STORAGE_USERNAME=minio
WIS2BOX_STORAGE_PASSWORD=minio123
WIS2BOX_STORAGE_INCOMING=wis2box-incoming
WIS2BOX_STORAGE_PUBLIC=wis2box-public
WIS2BOX_STORAGE_ARCHIVE=wis2box-archive
WIS2BOX_STORAGE_DATA_RETENTION_DAYS=7
WIS2BOX_STORAGE_API_RETENTION_DAYS=7

# you should be okay from here

Expand Down
Loading