
Docs update for feature summary #11

Open · wants to merge 4 commits into base: master
Conversation

@mart-r commented Feb 20, 2025

While trying to run the core functionality of CMS, I found the documentation unclear about which features are enabled by which compose files.

For instance, if I follow the documentation and spin up a service for a SNOMED model, I can run inference and evaluation on the model. However, if I try to run training on it, it fails because it cannot access MLflow¹. When I then also run the services in docker-compose-mlflow.yml, training works.

As such, this PR adds a summary / overview of the available features and the extra services each of them requires.

PS: There may be prerequisites I'm not aware of; this was just me going back and trying to spin up a service that can run inference and training on a model. Perhaps there are ways to do this without MLflow, but the documentation does not make that clear.

¹ Exception when trying to train without MLflow (the root cause is Failed to resolve 'mlflow-ui'; the full traceback follows):
2025-02-19 15:51:28,198 INFO   [uvicorn.access$send] 172.19.0.1:49876 - "POST /train_unsupervised HTTP/1.1" 500
2025-02-19 15:51:28,198 ERROR  [uvicorn.error$run_asgi] Exception in ASGI application
  + Exception Group Traceback (most recent call last):
  |   File "/usr/local/lib/python3.10/site-packages/starlette/_utils.py", line 87, in collapse_excgroups
  |     yield
  |   File "/usr/local/lib/python3.10/site-packages/starlette/middleware/base.py", line 190, in __call__
  |     async with anyio.create_task_group() as task_group:
  |   File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 767, in __aexit__
  |     raise BaseExceptionGroup(
  | exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/usr/local/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 407, in run_asgi
    |     result = await app(  # type: ignore[func-returns-value]
    |   File "/usr/local/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    |     return await self.app(scope, receive, send)
    |   File "/usr/local/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    |     await super().__call__(scope, receive, send)
    |   File "/usr/local/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    |     await self.middleware_stack(scope, receive, send)
    |   File "/usr/local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    |     raise exc
    |   File "/usr/local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    |     await self.app(scope, receive, _send)
    |   File "/usr/local/lib/python3.10/site-packages/starlette/middleware/base.py", line 189, in __call__
    |     with collapse_excgroups():
    |   File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    |     self.gen.throw(typ, value, traceback)
    |   File "/usr/local/lib/python3.10/site-packages/starlette/_utils.py", line 93, in collapse_excgroups
    |     raise exc
    |   File "/usr/local/lib/python3.10/site-packages/starlette/middleware/base.py", line 191, in __call__
    |     response = await self.dispatch_func(request, call_next)
    |   File "/usr/local/lib/python3.10/site-packages/slowapi/middleware.py", line 136, in dispatch
    |     response = await call_next(request)
    |   File "/usr/local/lib/python3.10/site-packages/starlette/middleware/base.py", line 165, in call_next
    |     raise app_exc
    |   File "/usr/local/lib/python3.10/site-packages/starlette/middleware/base.py", line 151, in coro
    |     await self.app(scope, receive_or_disconnect, send_no_error)
    |   File "/usr/local/lib/python3.10/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 155, in __call__
    |     raise exc
    |   File "/usr/local/lib/python3.10/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 153, in __call__
    |     await self.app(scope, receive, send_wrapper)
    |   File "/usr/local/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
    |   File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    |     raise exc
    |   File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    |     await app(scope, receive, sender)
    |   File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    |     await self.middleware_stack(scope, receive, send)
    |   File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    |     await route.handle(scope, receive, send)
    |   File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    |     await self.app(scope, receive, send)
    |   File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
    |   File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    |     raise exc
    |   File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    |     await app(scope, receive, sender)
    |   File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    |     response = await func(request)
    |   File "/usr/local/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    |     raw_response = await run_endpoint_function(
    |   File "/usr/local/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    |     return await dependant.call(**values)
    |   File "/app/api/routers/unsupervised_training.py", line 68, in train_unsupervised
    |     training_response = model_service.train_unsupervised(data_file,
    |   File "/app/model_services/medcat_model.py", line 144, in train_unsupervised
    |     return self._unsupervised_trainer.train(data_file, epochs, log_frequency, training_id, input_file_name, raw_data_files, description, synchronised, **hyperparams)
    |   File "/app/trainers/base.py", line 242, in train
    |     return self.start_training(run=self.run,
    |   File "/app/trainers/base.py", line 73, in start_training
    |     self._experiment_id, self._run_id = self._tracker_client.start_tracking(
    |   File "/app/management/tracker_client.py", line 39, in start_tracking
    |     experiment_id = TrackerClient._get_experiment_id(experiment_name)
    |   File "/app/management/tracker_client.py", line 266, in _get_experiment_id
    |     experiment = mlflow.get_experiment_by_name(experiment_name)
    |   File "/usr/local/lib/python3.10/site-packages/mlflow/tracking/fluent.py", line 1616, in get_experiment_by_name
    |     return MlflowClient().get_experiment_by_name(name)
    |   File "/usr/local/lib/python3.10/site-packages/mlflow/tracking/client.py", line 1249, in get_experiment_by_name
    |     return self._tracking_client.get_experiment_by_name(name)
    |   File "/usr/local/lib/python3.10/site-packages/mlflow/tracking/_tracking_service/client.py", line 484, in get_experiment_by_name
    |     return self.store.get_experiment_by_name(name)
    |   File "/usr/local/lib/python3.10/site-packages/mlflow/store/tracking/rest_store.py", line 519, in get_experiment_by_name
    |     response_proto = self._call_endpoint(GetExperimentByName, req_body)
    |   File "/usr/local/lib/python3.10/site-packages/mlflow/store/tracking/rest_store.py", line 82, in _call_endpoint
    |     return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
    |   File "/usr/local/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 365, in call_endpoint
    |     response = http_request(**call_kwargs)
    |   File "/usr/local/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 212, in http_request
    |     raise MlflowException(f"API request to {url} failed with exception {e}")
    | mlflow.exceptions.MlflowException: API request to http://mlflow-ui:5000/api/2.0/mlflow/experiments/get-by-name failed with exception HTTPConnectionPool(host='mlflow-ui', port=5000): Max retries exceeded with url: /api/2.0/mlflow/experiments/get-by-name?experiment_name=SNOMED_MedCAT_model_unsupervised (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f9c3839efb0>: Failed to resolve 'mlflow-ui' ([Errno -3] Temporary failure in name resolution)"))
    +------------------------------------

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 407, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/site-packages/starlette/middleware/base.py", line 189, in __call__
    with collapse_excgroups():
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.10/site-packages/starlette/_utils.py", line 93, in collapse_excgroups
    raise exc
  File "/usr/local/lib/python3.10/site-packages/starlette/middleware/base.py", line 191, in __call__
    response = await self.dispatch_func(request, call_next)
  File "/usr/local/lib/python3.10/site-packages/slowapi/middleware.py", line 136, in dispatch
    response = await call_next(request)
  File "/usr/local/lib/python3.10/site-packages/starlette/middleware/base.py", line 165, in call_next
    raise app_exc
  File "/usr/local/lib/python3.10/site-packages/starlette/middleware/base.py", line 151, in coro
    await self.app(scope, receive_or_disconnect, send_no_error)
  File "/usr/local/lib/python3.10/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 155, in __call__
    raise exc
  File "/usr/local/lib/python3.10/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 153, in __call__
    await self.app(scope, receive, send_wrapper)
  File "/usr/local/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/app/api/routers/unsupervised_training.py", line 68, in train_unsupervised
    training_response = model_service.train_unsupervised(data_file,
  File "/app/model_services/medcat_model.py", line 144, in train_unsupervised
    return self._unsupervised_trainer.train(data_file, epochs, log_frequency, training_id, input_file_name, raw_data_files, description, synchronised, **hyperparams)
  File "/app/trainers/base.py", line 242, in train
    return self.start_training(run=self.run,
  File "/app/trainers/base.py", line 73, in start_training
    self._experiment_id, self._run_id = self._tracker_client.start_tracking(
  File "/app/management/tracker_client.py", line 39, in start_tracking
    experiment_id = TrackerClient._get_experiment_id(experiment_name)
  File "/app/management/tracker_client.py", line 266, in _get_experiment_id
    experiment = mlflow.get_experiment_by_name(experiment_name)
  File "/usr/local/lib/python3.10/site-packages/mlflow/tracking/fluent.py", line 1616, in get_experiment_by_name
    return MlflowClient().get_experiment_by_name(name)
  File "/usr/local/lib/python3.10/site-packages/mlflow/tracking/client.py", line 1249, in get_experiment_by_name
    return self._tracking_client.get_experiment_by_name(name)
  File "/usr/local/lib/python3.10/site-packages/mlflow/tracking/_tracking_service/client.py", line 484, in get_experiment_by_name
    return self.store.get_experiment_by_name(name)
  File "/usr/local/lib/python3.10/site-packages/mlflow/store/tracking/rest_store.py", line 519, in get_experiment_by_name
    response_proto = self._call_endpoint(GetExperimentByName, req_body)
  File "/usr/local/lib/python3.10/site-packages/mlflow/store/tracking/rest_store.py", line 82, in _call_endpoint
    return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
  File "/usr/local/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 365, in call_endpoint
    response = http_request(**call_kwargs)
  File "/usr/local/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 212, in http_request
    raise MlflowException(f"API request to {url} failed with exception {e}")
mlflow.exceptions.MlflowException: API request to http://mlflow-ui:5000/api/2.0/mlflow/experiments/get-by-name failed with exception HTTPConnectionPool(host='mlflow-ui', port=5000): Max retries exceeded with url: /api/2.0/mlflow/experiments/get-by-name?experiment_name=SNOMED_MedCAT_model_unsupervised (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f9c3839efb0>: Failed to resolve 'mlflow-ui' ([Errno -3] Temporary failure in name resolution)"))

@phoevos (Member) commented Feb 21, 2025

Thanks @mart-r, this summary was definitely missing.

Just one thing: what you're seeing around MLflow looks like a bug and presumably doesn't match the expected behaviour. I believe the intention there was to allow running the CMS services without the MLflow server, simply using the MLflow client to track runs with local files. Note, for instance, how we set the default value for MLFLOW_TRACKING_URI in the .env file we mount into the CMS containers:

MLFLOW_TRACKING_URI=file:/tmp/mlruns

On the other hand, the default value in the compose file is set to http://mlflow-ui:5000, which is apparently what ends up in your configuration when no environment variables are set.
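The clash described above can be pictured as follows. This is only a sketch: the service name `cms` and the exact compose layout are assumptions, while the two URI values come from this thread.

```yaml
# envs/.env mounted into the CMS container sets a local file store:
#   MLFLOW_TRACKING_URI=file:/tmp/mlruns
# ...but the compose file defaults the same variable to the server URL,
# so without an MLflow container the hostname cannot be resolved:
services:
  cms:  # hypothetical service name
    environment:
      - MLFLOW_TRACKING_URI=${MLFLOW_TRACKING_URI:-http://mlflow-ui:5000}
```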

Perhaps @baixiac has some more insight here.


Edit: If you set the default MLFLOW_TRACKING_URI in the Docker compose file to a local file URI, training (supervised or unsupervised) works, storing the resulting model to a path like this:

file:///tmp/mlruns/851750959833289465/690f44b1e0bd46c687b57360758d9856/artifacts/UMLS_MedCAT_model

We could remove the tracking URI from the envs/.env file, since the duplication only causes confusion (or have it read the environment variable, as we do for BASE_MODEL_FULL_PATH), and set the default value in the compose file to something like file:///tmp/mlruns. Or we could leave it empty and let MLflow use its internal default (i.e. ./mlruns) when no environment variable is provided on deployment.
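The "read the environment variable with a sane fallback" option could look like the sketch below. The helper name is hypothetical; the file:///tmp/mlruns fallback comes from the suggestion above.

```python
import os

# Hypothetical helper: prefer MLFLOW_TRACKING_URI from the environment,
# otherwise fall back to a local file store so training still works
# when no MLflow server is deployed.
def resolve_tracking_uri(default: str = "file:///tmp/mlruns") -> str:
    return os.environ.get("MLFLOW_TRACKING_URI") or default
```

With an empty default in the compose file, this degrades gracefully to local-file tracking instead of failing with a DNS error on mlflow-ui.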

@phoevos added the bug label and removed the help wanted label Feb 21, 2025
@baixiac (Member) left a comment

Yes, I have to say the docs are quite rough at the moment. Currently docker-compose.yml assumes all services are containerised, which makes "auxiliary" a somewhat deceptive word. docker-compose-dev.yml may be more appropriate for the use case raised by the OP? I can see four scenarios:

  1. CMS is running on the host OS and the MLflow client logs to the local file system (e.g., /tmp/mlruns);
  2. CMS is running on the host OS and the MLflow client logs to the remote DB;
  3. CMS is running in a container and the MLflow client logs inside the same container;
  4. Both CMS and the MLflow DB are running in containers, with no logging to local files.

New users will likely end up with scenario 1, while a full deployment requires scenario 4. The README is getting bloated, so a balance has to be struck between normal and advanced users. I'm starting to feel the README is not a good place for covering all the details, and a proper readthedocs site is needed, as some other CogStack projects have. I will log a ticket for this.

@mart-r (Author) commented Feb 21, 2025

Thanks for all the input. So the way I wanted to spin this up is entirely possible, just not well documented. Good to know.

And I agree, the README does look pretty big. But that doesn't necessarily mean you need a separate readthedocs site - you could always split the documentation across different markdown files and link between them. I suggest this because it may be a little less work (at least to begin with). Though if you end up needing to move to a docs site eventually anyway, it might mean migrating things twice... So the best option may not be immediately clear.

In any case, I think it would make sense for the main README to walk through the most common / expected use case and link to other documentation for more detailed setups (which mine clearly is).

PS:
I'll leave the PR up for now - mainly for discussion. But I'm fairly certain it can be closed later on.

@baixiac (Member) commented Feb 24, 2025

Hi, looking back, the scenario that best aligns with your need is scenario 3, i.e., the trained model and training metrics would be stored locally within your CMS container. In a production environment, this requires proper backup and housekeeping to prevent data loss and out-of-space errors (or maybe you deliberately want your CMS containers to be short-lived?). It would be difficult if not impossible to leverage the tracking UI in scenario 3, unless you can somehow proxy it to read the data buried inside a folder of the container. So to clarify your use case/workflow: do you want something like spinning up a CMS instance on demand, firing up the training, downloading the trained model and/or metrics, and finally terminating the instance to free up the resources on your machine?

To move this PR forward: would creating a separate docker-compose-core.yml, with the env vars defined in docker-compose-dev.yml, and linking it from your README update satisfy your use case? Or combining docker-compose.yml and docker-compose-mlflow.yml into one, as suggested by @phoevos? Or is the use case really about disabling the MLflow client and its logging?
