Databricks on GCP data exfiltration protection workspace deployment #172


Open · wants to merge 11 commits into main

Conversation

micheledaddetta-databricks
Collaborator

The module still uses the CMv2 architecture. When the CMv1 architecture is released and supported by the Terraform provider, the implementation will be reviewed.

The commit contains the implementation for the workspace resource group. However, this change requires no longer using the local.rg_location variable, since its value is only known after apply, and this forces the replacement of all of the resources.
Most of the README files were already defined. TFDocs were updated in each of them.
@alexott
Collaborator

alexott commented Feb 18, 2025

@bhavink - wdyt?

@bhavink

bhavink commented Feb 19, 2025

@alexott I do not think we need a traditional hub/spoke-based architecture on GCP. A shared VPC-based deployment is a common and popular architecture, where one could use VPC firewall rules along with VPC SC to prevent data exfiltration. TF support for CMv1 will be available by early March 2025, so may I suggest that we wait for it to be released and then update the GCP-specific module?
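As a rough illustration of the shared-VPC approach mentioned above, an egress-deny firewall rule on the shared VPC could look like the following (a sketch only; the resource names, network reference, and priority are assumptions, not taken from this PR):

```hcl
# Hypothetical egress rule on the shared VPC: deny all egress at low priority,
# so that only explicitly allowed destinations (added as higher-priority
# allow rules) are reachable from the workspace subnets.
resource "google_compute_firewall" "deny_all_egress" {
  name      = "deny-all-egress"
  network   = google_compute_network.shared_vpc.id # assumed network resource
  direction = "EGRESS"
  priority  = 65000

  deny {
    protocol = "all"
  }

  destination_ranges = ["0.0.0.0/0"]
}
```

Combined with VPC SC perimeters around the Google APIs, such rules are one common way to constrain egress paths without a dedicated hub network.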

@alexott
Collaborator

alexott commented Feb 19, 2025

I agree about waiting for the CMv1 migration.

@alexott alexott requested a review from Copilot March 18, 2025 06:55

Copilot AI left a comment


Pull Request Overview

This PR adds documentation to support the deployment of Databricks on GCP with data exfiltration protection using a Hub & Spoke network architecture while still using the CMv2 architecture.

  • Added an example README for provisioning the workspace in the examples directory.
  • Introduced a module README that details resource outcomes and the network architecture for the deployment.

Reviewed Changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| examples/gcp-with-psc-exfiltration-protection/README.md | New documentation for workspace provisioning using hub & spoke architecture |
| modules/gcp-with-psc-exfiltration-protection/README.md | Detailed module documentation including architecture and resource listings |

Comment on lines 37 to 38
Most of the values are related to resources managed by Databricks. Values to use be found at: https://docs.gcp.databricks.com/en/resources/ip-domain-region.html


Copilot AI Mar 18, 2025


[nitpick] There appears to be a grammatical error. Consider rephrasing to something like 'Most values are related to resources managed by Databricks. The required values can be found at: https://docs.gcp.databricks.com/en/resources/ip-domain-region.html'.

Suggested change
Most of the values are related to resources managed by Databricks. Values to use be found at: https://docs.gcp.databricks.com/en/resources/ip-domain-region.html
Most values are related to resources managed by Databricks. The required values can be found at: https://docs.gcp.databricks.com/en/resources/ip-domain-region.html


Comment on lines 23 to 24
**REMARK THAT** the module does not contain the VPC SC implementation. This can be added to increase the security level in the Databricks deployment, providing detailed access level for ingress and egress traffic.


Copilot AI Mar 18, 2025


[nitpick] The phrasing 'REMARK THAT' can be softened for better readability. Consider using 'Note that' instead.

Suggested change
**REMARK THAT** the module does not contain the VPC SC implementation. This can be added to increase the security level in the Databricks deployment, providing detailed access level for ingress and egress traffic.
**Note that** the module does not contain the VPC SC implementation. This can be added to increase the security level in the Databricks deployment, providing detailed access level for ingress and egress traffic.


@alexott
Collaborator

alexott commented Apr 2, 2025

@micheledaddetta-databricks can you update the code to use provider >= 1.71? It includes the changes for CMv1 support.
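The provider constraint being requested would look roughly like this in the module's `terraform` block (a sketch; the exact file and existing constraints in this PR are not shown here):

```hcl
terraform {
  required_providers {
    databricks = {
      source  = "databricks/databricks"
      # 1.71 is the first version with CMv1 support, per the discussion above.
      version = ">= 1.71"
    }
  }
}
```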

@micheledaddetta-databricks
Collaborator Author

@alexott I'll update it next week.

Starting from provider version 1.71, CMv1 is supported for Databricks on GCP.
@micheledaddetta-databricks
Collaborator Author

@alexott here you can find the updated code.

This is an initial implementation. I will enhance it in future commits to include metastore admin assignment, workspace-metastore binding, catalog owner, and catalog-workspace binding.
If possible, the module could be built to be cloud agnostic.
Collaborator

@alexott alexott left a comment


Minor changes are required, e.g., updating the image.

Comment on lines +40 to +43
```hcl
depends_on = [
  databricks_storage_credential.this,
  databricks_external_location.this
]
```
Collaborator


Technically we don't need this if we'll use:

```hcl
storage_root = databricks_external_location.this.url
```
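Under that suggestion, the catalog would reference the external location's URL directly, and Terraform would infer the dependency from the expression, making an explicit `depends_on` unnecessary. A sketch, assuming a catalog resource shaped like the other resources in this module (`databricks_catalog.this` and the attribute layout are assumptions):

```hcl
# Hypothetical catalog definition: referencing the external location's url
# creates an implicit dependency, so no depends_on block is needed.
resource "databricks_catalog" "this" {
  provider     = databricks.workspace
  name         = "${var.prefix}-catalog"
  storage_root = databricks_external_location.this.url
}
```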

```hcl
resource "databricks_external_location" "this" {
  provider = databricks.workspace
  name     = "${var.prefix}-external-location"
  url      = "gs://${google_storage_bucket.ext_bucket.name}"
```
Collaborator


As I remember, the url and storage_root should end with a /, because the backend performs a normalization; a value without the trailing slash will lead to a permanent configuration drift.
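A sketch of the suggested fix, appending the trailing slash so the stored value matches the backend's normalized form (attribute names taken from the reviewed hunk; the closing brace is added here for completeness):

```hcl
resource "databricks_external_location" "this" {
  provider = databricks.workspace
  name     = "${var.prefix}-external-location"
  # Trailing slash matches the backend-normalized URL and avoids a
  # permanent diff on every plan.
  url      = "gs://${google_storage_bucket.ext_bucket.name}/"
}
```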

@@ -0,0 +1,84 @@
# Provisioning Databricks on GCP workspace with a Hub & Spoke network architecture for data exfiltration protection

This example is using the [gcp-with-psc-exfiltration-protection](../../modules/gcp-with-psc-exfiltration-protection) module.
Collaborator


We need to put a warning at the beginning that PSC isn't enabled by default and that the user should contact the Databricks team.

@@ -0,0 +1,126 @@
# Databricks on Google Cloud with Private Service Connect and Hub-Spoke network structure (data exfiltration protection).

Collaborator


We need to put a warning at the beginning that PSC isn't enabled by default and that the user should contact the Databricks team.

Collaborator


This picture still shows GKE; we need a version with GCE.
