Commit 3b2d0bb

🚧 First design draft, still a lot to cover for this design.
Fixes theopenconversationkit#1707
1 parent a7505db commit 3b2d0bb

5 files changed

+87
-0
lines changed

docs/_data/tocs/en.yml

+4
@@ -35,6 +35,10 @@ content:
3535
url: 'en/dev/i18n'
3636
- title: 'API documentation'
3737
url: 'en/api'
38+
- title: 'Feature technical designs'
39+
children:
40+
- title: '#1706 Gen AI data ingestion workflow / pipeline (gitlab based)'
41+
url: 'en/dev/feature-technical-designs/1706-gen-ai-data-ingestion-pipeline-gitlab'
3842
- title: 'About'
3943
children:
4044
- title: 'Why Tock'
@@ -0,0 +1,81 @@
1+
---
2+
title: Gen AI Data Ingestion pipeline (gitlab) design
3+
---
4+
5+
# Gen AI Data Ingestion pipeline (gitlab) design #1706
6+
7+
- Proposal PR: [https://github.com/theopenconversationkit/tock/pull/TODO](https://github.com/theopenconversationkit/tock/pull/TODO)
8+
- GitHub issue for this feature: [#1706](https://github.com/theopenconversationkit/tock/issues/1706)
9+
10+
11+
## Introduction
12+
13+
Data ingestion is a complex task, and ingested documents need to be refreshed / renewed continuously. For now, this task can be performed using our basic Python tooling, available here: [tock-llm-indexing-tools](https://github.com/theopenconversationkit/tock/blob/tock-24.3.4/gen-ai/orchestrator-server/src/main/python/tock-llm-indexing-tools/README.md).
14+
15+
This is currently done manually. We are going to automate it a bit more and also include testing features based on Langfuse datasets.
16+
17+
Our approach will be based on GitLab pipelines. This solution is simple and will let us schedule data ingestion runs or trigger them through GitLab's API. We will also be able to track each ingestion job and its state in GitLab.
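
As a rough illustration of the "trigger through GitLab's API" option, here is a minimal sketch that builds a request for GitLab's pipeline trigger endpoint (`POST /projects/:id/trigger/pipeline`). The host, project path, token and the `NAMESPACE` pipeline variable are hypothetical placeholders, not decisions made by this design.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def build_trigger_request(base_url, project, token, ref="main", variables=None):
    """Build (without sending) a POST request for GitLab's pipeline trigger endpoint."""
    # The project can be a numeric ID or a path such as "group/project";
    # a path must be URL-encoded ("/" becomes "%2F").
    project_ref = quote(str(project), safe="")
    url = f"{base_url.rstrip('/')}/api/v4/projects/{project_ref}/trigger/pipeline"
    form = {"token": token, "ref": ref}
    # Pipeline variables are passed as variables[KEY]=value form fields.
    for key, value in (variables or {}).items():
        form[f"variables[{key}]"] = value
    return Request(url, data=urlencode(form).encode("utf-8"), method="POST")


def trigger_pipeline(request):
    """Send the request and return the raw response body (requires network access)."""
    with urlopen(request) as response:
        return response.read()


# Hypothetical values, for illustration only.
example = build_trigger_request(
    "https://gitlab.example.com",
    "my-group/ingestion",
    token="TRIGGER_TOKEN",
    ref="main",
    variables={"NAMESPACE": "demo"},
)
print(example.full_url)
```

Keeping request construction separate from sending makes the call easy to check without a GitLab instance. Scheduled runs would not need this code at all, since GitLab pipeline schedules are configured on the project itself.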
18+
19+
20+
## Overall pipeline
21+
22+
TODO make a clean workflow diagram using Mermaid.
23+
24+
[![Workflow - Data Ingestion Pipeline (excalidraw)](../../../img/feat-design-1706-data_ingestion_gitlab_pipeline_workflow.excalidraw.png)](../../../img/feat-design-1706-data_ingestion_gitlab_pipeline_workflow.excalidraw.png){:target="_blank"}
25+
26+
27+
### Gitlab repositories organisation
28+
29+
TODO: for each project, we assume that we will need some kind of scripting to fetch data, clean it, organize it, and keep important metadata.
30+
We need to think about how we will organize these repositories: pipeline dependencies? Session ID / naming convention if we have multiple repositories?
31+
32+
33+
## Architecture design
34+
35+
This design will illustrate two cloud-based architectures, for AWS and GCP (using Kubernetes).
36+
37+
### AWS deployment
38+
39+
[![Architecture AWS - Data Ingestion Pipeline (excalidraw)](../../../img/feat-design-1706-data_ingestion_gitlab_architecture_aws.excalidraw.png)](../../../img/feat-design-1706-data_ingestion_gitlab_architecture_aws.excalidraw.png){:target="_blank"}
40+
41+
*This file is editable with [Excalidraw](https://excalidraw.com/): simply import the PNG, which contains the scene data.*
42+
43+
### GCP deployment
44+
45+
TODO: Something using [Kubernetes Jobs](https://kubernetes.io/docs/concepts/workloads/controllers/job/) could be used?
46+
*Spike currently in progress.*
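
To make the Kubernetes Jobs idea more concrete while the spike is ongoing, the sketch below builds a minimal `batch/v1` Job manifest as a Python dictionary. It is only an assumption of what an ingestion job could look like: the job name, image, command and `NAMESPACE` variable are hypothetical and do not come from this design.

```python
import json


def build_ingestion_job(name, image, command, env=None, backoff_limit=0):
    """Return a minimal Kubernetes batch/v1 Job manifest as a dictionary."""
    container = {
        "name": "ingestion",
        "image": image,
        "command": list(command),
        # Kubernetes expects env values to be strings.
        "env": [{"name": k, "value": str(v)} for k, v in (env or {}).items()],
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Number of retries before the Job is marked as failed.
            "backoffLimit": backoff_limit,
            "template": {
                "spec": {
                    # Job pods must use "Never" or "OnFailure".
                    "restartPolicy": "Never",
                    "containers": [container],
                }
            },
        },
    }


# Hypothetical values, for illustration only.
manifest = build_ingestion_job(
    name="ingestion-demo",
    image="python:3.11",
    command=["python", "index.py"],
    env={"NAMESPACE": "demo"},
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`, since `kubectl` accepts JSON manifests as well as YAML.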
47+
48+
## Docker images ?
49+
50+
TODO: list the Docker images that will be used for this pipeline, maybe just one Python image with the necessary tools.
51+
52+
53+
### Normalization method `normalized(NAMESPACE)`
54+
55+
56+
## Environment variable settings
57+
58+
59+
### CI / CD settings
60+
61+
|Environment variable name | Default | Allowed values | Description |
62+
|--- |--- |--- |--- |
63+
| `sample`| `sample` | `sample` | sample |
64+
65+
66+
### Lambda environment variables ?
67+
68+
|Environment variable name | Default | Allowed values | Description |
69+
|--- |--- |--- |--- |
70+
| `sample`| `sample` | `sample` | sample |
71+
72+
73+
## Technical changes that should be made
74+
75+
### Breaking changes
76+
77+
* TODO
78+
79+
80+
### Other changes
81+
* TODO

docs/_en/toc.md

+2
@@ -37,6 +37,8 @@ title: Table of Contents
3737
* [Internationalization (_i18n_)](../dev/i18n)
3838
* [Tock APIs](../dev/api)
3939
* [Code samples](../dev/exemples-code)
40+
* _Feature technical designs_
41+
* [#1706 Gen AI data ingestion workflow / pipeline (gitlab based)](../dev/feature-technical-designs/1706-gen-ai-data-ingestion-pipeline-gitlab)
4042

4143
* About Tock :
4244
* [Why Tock](../about/why)
