DOC-753 | Graph ML UI #709


Open · wants to merge 5 commits into base: main
183 changes: 183 additions & 0 deletions site/content/3.13/data-science/arangographml/arangograph-ml.md
Contributor
This shouldn't be the same name twice, but I'm not settled on a particular name. Maybe just ui.md?

@@ -0,0 +1,183 @@
---
title: ArangoGraphML Web Interface
menuTitle: ArangoGraphML Web Interface
Contributor
Title to be discussed (we might rename it to just GraphML)

weight: 15
description: >-
Enterprise-ready, graph-powered machine learning as a cloud service or self-managed
aliases:
- getting-started-with-arangographml
---
Solve computationally intensive graph problems with Graph Machine Learning. Apply machine learning to a selected graph to predict connections, get better product recommendations, classify nodes, and compute node embeddings. Configure and run the whole machine learning workflow entirely in the web interface.
Contributor
We only have node classification and embeddings available as immediate options. If we mention something like link prediction, we should at least outline how to achieve that.

It would also be good to have a more technical explanation here about how GraphML works (GraphSAGE, using a depth-2 neighborhood, as mentioned in the Slack team channel).

Please also add an overview of the process instead of immediately starting with project creation etc.; users should first get an understanding of the hierarchy and the steps involved.


## Creating a GraphML Project

To create a new GraphML project using the ArangoDB Web Interface, follow these steps:

- **Select the Target Database** – From the **Database** dropdown in the left-hand sidebar, select the database where the project should reside.
Contributor
These are steps that should be followed in order, so use an ordered list here.
dropdown -> dropdown menu (or simply just write to select the database without mentioning the specific widget type)

- **Navigate to the Data Science Section** – In the left-hand navigation menu, click on Data Science to open the GraphML project management interface, then click on RunGraphML.
Contributor
Should we call it the Data Science Suite perhaps?
click on Data Science -> click **Data Science**
RunGraphML -> **Run GraphML**

![Navigate to Data Science](../../../images/datascience-intro.jpg)
Contributor
Missing indentation

- **Click "Add new project"** – In the **GraphML projects** view, click **Add new project**.
Contributor
The style seems to be list items with a concise description of the step in bold and the concrete actions following after a hyphen (non-bold). Here, this pattern is broken because a specific UI element is referenced in the first part. Please make this uniform. I'm not sure how much value this adds; perhaps we can replace the first part with an introduction above the text to outline the procedure.

- **Fill in Project Details** – A modal titled **Create ML project** will appear. Enter a **name** for your machine learning project.
Contributor
Use the exact UI labels, i.e. **Name** and not **name**.
Use present tense.

- **Create the Project** – Click the **Create project** button to finalize the creation.
- **Verify Project in the List** – After creation, the new project will appear in the list under **GraphML projects**. Click the project name to enter and begin creating ML jobs like Featurization, Training, Model Selection, Prediction.
Contributor
"jobs like" suggests that there are more than the mentioned ones, but there are not


## Featurization Phase

After clicking on a project name, you are taken to a screen where you can configure and start a new Featurization job. Follow these steps:
- **Select a Graph** – In the **Features** section, choose your target graph from the **Select a graph** dropdown (for example, `imdb`).
Contributor
If there is no screenshot nearby, then I don't see a reason to provide an example here, unless you explain what happens with the value and also show that, e.g. you could mention that the graph name is used as a prefix for some of the generated attributes.

- **Choose Vertex Collections** – Pick the vertex collections (for example, `movie` and `person`) that you want to include for feature extraction.
- **Select Attributes** – From the dropdown, choose the attributes from your vertex collection to convert into machine-understandable features.

{{< info >}}
The following attributes cannot be used because some of their values are lists or arrays: `imdb_feat_description`, `imdb_feat_genre`, `imdb_feat_homepage`, `imdb_feat_id`, `imdb_feat_imageUrl`, `imdb_feat_imdb_x_hash`, `imdb_feat_imdbId`, `imdb_feat_label`, `imdb_feat_language`, `imdb_feat_lastModified`, `imdb_feat_released`, `imdb_feat_releaseDate`, `imdb_feat_runtime`, `imdb_feat_studio`, `imdb_feat_tagline`, `imdb_feat_title`, `imdb_feat_trailer`, `imdb_feat_type`, `imdb_feat_version`, `imdb_x`, `imdb_y`, `prediction_model_output`.
{{< /info >}}
Contributor
It's fine to mention that certain attributes are not eligible for GraphML but there shouldn't be a list of attributes here that are specific to the dataset, graph, and GraphML project. Users will not have these on the first run, and they will be different based on the mentioned things.


- **Expand Configuration and Advanced Settings** – Optionally adjust parameters such as batch size, feature prefix, dimensionality reduction, and write behavior. These settings are also shown in JSON format on the right side of the screen for transparency (a sketch of such a configuration follows the screenshot below).

- **Batch size** – The number of documents to process in a single batch.
- **Run analysis checks** – Whether to run analysis checks to perform a high-level analysis of the data quality before proceeding. Default is `true`.
- **Skip labels** – Skip the featurization process for attributes marked as labels. Default is `false`.
- **Overwrite FS graph** – Whether to overwrite the Feature Store graph if features were previously generated. Default is `false`, so features are written to an existing graph.
- **Write to source graph** – Whether to store the generated features in the source graph. Default is `true`.
- **Use feature store** – Enable the use of the Feature Store database, which stores features separately from the source graph. Default is `false`, so features are written to the source graph.
Contributor
There should be a reasonable amount of additional explanation beyond the available labels and tooltips in the UI to add value.


- **Click "Begin Featurization"** – Once all selections are done, click the **Begin featurization** button. This will trigger a **node embedding-compatible featurization job**.Once the job status changes to **"Ready for training"**, you can start the **ML Training** step.

![Navigate to Featurization](../../../images/graph-ml-ui-featurization.png)
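
As mentioned above, the settings are mirrored as JSON in the panel. The exact key names depend on the product version, so the following Python sketch uses assumed names purely to illustrate the shape of such a configuration:

```python
import json

# Hypothetical featurization settings mirroring the UI options above.
# Key names are assumptions for illustration, not the exact JSON keys shown in the panel.
featurization_settings = {
    "batchSize": 32,              # documents processed per batch
    "runAnalysisChecks": True,    # high-level data-quality analysis before featurization
    "skipLabels": False,          # skip featurization for attributes marked as labels
    "overwriteFSGraph": False,    # overwrite an existing Feature Store graph
    "writeToSourceGraph": True,   # store the generated features in the source graph
    "useFeatureStore": False,     # write features to a separate Feature Store database
}

print(json.dumps(featurization_settings, indent=2))
```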

## Training Phase

This is the second step in the ML workflow after featurization. In the training phase, you configure and launch a machine learning training job on your graph data.

#### Select Type of Training Job
Contributor
This shouldn't be a heading, especially not at the same level as the GraphML tasks.


There are two types of training jobs available, depending on the use case:


#### Node Classification

Node Classification is used to categorize the nodes in your graph based on their features and structural connections within the graph.

**Use cases include:**
- Entity categorization (for example, movies into genres, users into segments)
- Fraud detection in transaction networks

**Configuration Parameters:**
- **Type of Training Job:** Node classification
- **Target Vertex Collection:** Choose the collection to classify (for example, `movie`)
- **Batch Size:** The number of documents processed in a single training iteration (for example, 256)
- **Data Load Batch Size:** The number of documents loaded from ArangoDB into memory in a single batch during the data loading phase (for example, 50000)
- **Data Load Parallelism:** The number of parallel processes used when loading data from ArangoDB into memory for training (for example, 10)

After setting these values, click the **Begin training** button to start the job.

![Node Classification](../../../images/ml-nodeclassification.png)
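
Before starting a classification job, it can help to verify that the target vertex collection actually contains labeled documents. The following is a minimal sketch using python-arango; the connection details and the label attribute name (`label`) are assumptions and need to be adjusted to your deployment:

```python
from arango import ArangoClient

# Assumed connection details; adjust to your deployment.
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("your_database", username="root", password="your_password")

# Count documents per label value in the target vertex collection (here: "movie").
cursor = db.aql.execute(
    """
    FOR doc IN @@collection
      COLLECT label = doc.label WITH COUNT INTO n
      RETURN { label: label, count: n }
    """,
    bind_vars={"@collection": "movie"},
)
for row in cursor:
    print(row)
```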

#### Node Embedding

Node Embedding is used to generate vector embeddings (dense numerical representations) of graph nodes that capture structural and feature-based information.

**Use cases include:**
- Similarity search (for example, finding similar products, users, or documents)
- Link prediction (for example, suggesting new connections)

**Configuration Parameters:**
- **Type of Training Job:** Node embeddings
- **Target Vertex Collection:** Select the collection to generate embeddings for (for example, `movie` or `person`)
- No label is required for training in this mode

Once the configuration is complete, click **Begin training** to launch the embedding job.

![Node Embeddings](../../../images/ml-node-embedding.png)
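
Once the embeddings have been written back to the database (see the Prediction phase), they can be used for similarity search directly in AQL. The sketch below assumes the embeddings are stored in an attribute called `embedding` on the `movie` collection and uses a made-up document key; the actual attribute name depends on your featurization and prediction configuration:

```python
from arango import ArangoClient

# Assumed connection details, collection, document key, and embedding attribute name.
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("your_database", username="root", password="your_password")

# Find the five movies most similar to a reference movie via AQL's COSINE_SIMILARITY().
cursor = db.aql.execute(
    """
    LET ref = DOCUMENT("movie/12")
    FOR doc IN movie
      FILTER doc._key != ref._key
      LET score = COSINE_SIMILARITY(ref.embedding, doc.embedding)
      SORT score DESC
      LIMIT 5
      RETURN { movie: doc.title, score: score }
    """
)
for row in cursor:
    print(row)
```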


After training is complete, the next step in the ArangoGraphML workflow is **Model Selection**.

## Model Selection Phase

Once the training is finished, the job status updates to **READY FOR MODEL SELECTION**. This means the model has been trained using the provided vertex and edge data and is now ready for evaluation.

**Understanding Vertex Collections:**

**X Vertex Collection:** These are the source nodes used during training. They represent the full set of nodes on which features were computed (for example, `person` and `movie`).

**Y Vertex Collection:** These are the target nodes that contain labeled data. The labels in this collection are used to supervise the training process and are the basis for evaluating prediction quality.

The target collection is where the model’s predictions will be stored once prediction is executed.

**Model Selection Interface:**

A list of trained models is displayed, along with performance metrics such as Accuracy, Precision, Recall, F1 score, and Loss.
Review the results of different model runs and configurations.

Select the best performing model suitable for your prediction task.

![Model Selection](../../../images/graph-ml-model.png)
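
For reference, Precision, Recall, and the F1 score are the standard classification metrics computed from confusion-matrix counts; this is general background rather than anything specific to ArangoGraphML:

```python
def classification_metrics(tp: int, fp: int, fn: int) -> dict:
    """Standard precision, recall, and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: 80 true positives, 10 false positives, 20 false negatives.
print(classification_metrics(80, 10, 20))
```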

## Prediction Phase

Once the best-performing model has been selected, the final step of the GraphML pipeline is to generate predictions for new or unlabeled data.
Contributor
As I explained, we don't have the capability to only process new/unlabeled data.


### Overview

The Prediction interface allows inference to be run using the selected model. It enables configuration of how predictions are executed, which collections are involved, and whether new or outdated documents should be automatically featurized before prediction.
Contributor
Should add a statement about effects on quality when featurizing new/outdated docs


![prediction phase](../../../images/graph-prediction.png)

### Configuration Options
The Prediction screen displays the following configuration options:

- **Selected Model:** Displays the model selected during the Model Selection phase. This model is used to perform inference.

- **Target Vertex Collection:** The vertex collection on which predictions are applied.

- **Prediction Type:** Depending on the training job (for example, classification or embedding), the prediction outputs class labels or updated embeddings.

### Featurization Settings
Two toggles are available to control automatic featurization during prediction:

**Featurize New Documents:**
This option controls whether newly added documents are automatically featurized. It is useful when new data arrives after training, allowing predictions to continue without requiring a full retraining process.

**Featurize Outdated Documents:**
Enable or disable the featurization of outdated documents. Outdated documents are those whose attributes (used during featurization) have changed since the last feature computation. This ensures prediction results are based on up-to-date information.

These options provide flexibility in handling dynamic graph data and keeping predictions relevant without repeating the entire ML workflow.

**Data load batch size** – Specifies the number of documents to load in a single batch (for example, 500000).

**Data load parallelism** – Number of parallel threads used to process the prediction workload (for example, 10).

**Prediction field** – The field in the documents where the predicted values will be stored (for example, `prediction`).
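
Taken together, the prediction options can be thought of as a single configuration object. The following sketch uses assumed key names purely to summarize the options described above; the actual JSON keys in the product may differ:

```python
# Hypothetical prediction settings; key names are assumptions for illustration only.
prediction_settings = {
    "featurizeNewDocuments": False,      # featurize documents added after training
    "featurizeOutdatedDocuments": True,  # re-featurize documents whose attributes changed
    "dataLoadBatchSize": 500000,         # documents loaded per batch
    "dataLoadParallelism": 10,           # parallel threads for the prediction workload
    "predictionField": "prediction",     # attribute that receives the predicted values
}
```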

### Enable Scheduling

You can configure automatic predictions using the **Enable scheduling** checkbox.

When scheduling is enabled, predictions will be executed automatically based on a specified **CRON expression**. This is useful for regularly updating prediction outputs as new data enters the system.

#### Schedule (CRON expression)

You can define a CRON expression that sets when the prediction job should run. For example, the expression `0 0 1 1 *` executes the prediction **every year on January 1st at 00:00**.
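
If you want to double-check what a CRON expression resolves to before saving it, you can preview the next run times locally, for example with the third-party `croniter` package. This is only an illustration; the scheduling itself is handled by ArangoGraphML:

```python
from datetime import datetime
from croniter import croniter  # pip install croniter

# CRON fields: minute, hour, day of month, month, day of week.
expression = "0 0 1 1 *"  # every year on January 1st at 00:00

it = croniter(expression, datetime(2025, 6, 1))
for _ in range(3):
    print(it.get_next(datetime))  # prints the next three scheduled run times
```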

Below the CRON field, a user-friendly scheduling interface helps translate it:

- **Period**: Options include *Hourly*, *Daily*, *Weekly*, *Monthly*, or *Yearly*.
- **Month**: *(for example, January)*
- **Day of Month**: *(for example, 1)*
- **Day of Week**: *(optional)*
- **Hours and Minutes**: Set the exact time for execution *(for example, 0:00)*


### Execute Prediction
After reviewing the configuration, click the **Run Prediction** button. ArangoGraphML will then:

- Perform featurization

- Run inference using the selected model

- Write prediction results into the target vertex collection or a specified output location

Once prediction is complete, you can analyze the results directly in the Web Interface or export them for downstream use.
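
To inspect the written results afterwards, you can query the prediction field directly. The sketch below uses python-arango and assumes the target vertex collection is `movie` and the prediction field is `prediction`; both depend on your configuration:

```python
from arango import ArangoClient

# Assumed connection details, collection, and prediction field name.
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("your_database", username="root", password="your_password")

# Return a small sample of documents that received a prediction.
cursor = db.aql.execute(
    """
    FOR doc IN movie
      FILTER doc.prediction != null
      LIMIT 10
      RETURN { key: doc._key, prediction: doc.prediction }
    """
)
for row in cursor:
    print(row)
```
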
Binary file added site/content/images/datascience-intro.jpg
3 changes: 3 additions & 0 deletions site/content/images/datascience-intro.jpgZone.Identifier
@@ -0,0 +1,3 @@
[ZoneTransfer]
ZoneId=3
HostUrl=https://squoosh.app/
Binary file added site/content/images/graph-ml-model.png
Binary file added site/content/images/graph-ml-ui-featurization.png
Binary file added site/content/images/graph-prediction.png
Binary file added site/content/images/ml-node-embedding.png
Binary file added site/content/images/ml-nodeclassification.png