hashr.go
+13 −13
@@ -40,18 +40,18 @@ import (
 )

 var (
-	processingWorkerCount = flag.Int("processing_worker_count", 2, "Number of processing workers.")
-	importersToRun = flag.String("importers", strings.Join([]string{}, ","), fmt.Sprintf("Importers to be run: %s,%s,%s,%s", gcp.RepoName, targz.RepoName, windows.RepoName, wsus.RepoName))
-	exportersToRun = flag.String("exporters", strings.Join([]string{}, ","), fmt.Sprintf("Exporters to be run: %s,%s", gcpExporter.Name, postgresExporter.Name))
-	jobStorage = flag.String("storage", "", "Storage that should be used for storing data about processing jobs, can have one of the two values: postgres, cloudspanner")
-	cacheDir = flag.String("cache_dir", "/tmp/", "Path to cache dir used to store local cache.")
-	export = flag.Bool("export", true, "Whether to export samples, otherwise, they'll be saved to disk")
-	exportPath = flag.String("export_path", "/tmp/hashr-uploads", "If export is set to false, this is the folder where samples will be saved.")
-	reprocess = flag.String("reprocess", "", "Sha256 of sources that should be reprocessed")
-	spannerDBPath = flag.String("spanner_db_path", "", "Path to spanner DB.")
-	uploadPayloads = flag.Bool("upload_payloads", false, "If true the content of the files will be uploaded using defined exporters.")
-	cloudSpannerWorkerCount = flag.Int("cloudspanner_worker_count", 100, "Number of workers/goroutines that will be used to upload data to Cloud Spanner.")
-	gcpExporterGCSbucket = flag.String("gcp_exporter_gcs_bucket", "", "Name of the GCS bucket which will be used by GCP exporter to store exported samples.")
+	processingWorkerCount = flag.Int("processing_worker_count", 2, "Number of processing workers.")
+	importersToRun = flag.String("importers", strings.Join([]string{}, ","), fmt.Sprintf("Importers to be run: %s,%s,%s,%s", gcp.RepoName, targz.RepoName, windows.RepoName, wsus.RepoName))
+	exportersToRun = flag.String("exporters", strings.Join([]string{}, ","), fmt.Sprintf("Exporters to be run: %s,%s", gcpExporter.Name, postgresExporter.Name))
+	jobStorage = flag.String("storage", "", "Storage that should be used for storing data about processing jobs, can have one of the two values: postgres, cloudspanner")
+	cacheDir = flag.String("cache_dir", "/tmp/", "Path to cache dir used to store local cache.")
+	export = flag.Bool("export", true, "Whether to export samples, otherwise, they'll be saved to disk")
+	exportPath = flag.String("export_path", "/tmp/hashr-uploads", "If export is set to false, this is the folder where samples will be saved.")
+	reprocess = flag.String("reprocess", "", "Sha256 of sources that should be reprocessed")
+	spannerDBPath = flag.String("spanner_db_path", "", "Path to spanner DB.")
+	uploadPayloads = flag.Bool("upload_payloads", false, "If true the content of the files will be uploaded using defined exporters.")
+	gcpExporterWorkerCount = flag.Int("gcp_exporter_worker_count", 100, "Number of workers/goroutines that will be used to upload data to Cloud Spanner.")
+	gcpExporterGCSbucket = flag.String("gcp_exporter_gcs_bucket", "", "Name of the GCS bucket which will be used by GCP exporter to store exported samples.")
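This hunk only renames the worker-count flag; its default and meaning are unchanged. As a minimal sketch (the binary name and the other flag values are illustrative, not taken from this diff), an invocation would change as follows:

```shell
# Before this change: worker pool size for Cloud Spanner uploads
./hashr -storage cloudspanner -exporters GCP -cloudspanner_worker_count 100

# After this change: the same setting, passed under the new flag name
./hashr -storage cloudspanner -exporters GCP -gcp_exporter_worker_count 100
```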
readme.md
+23 −9
@@ -21,7 +21,7 @@
 -[WSUS](#wsus)
 -[Setting up exporters](#setting-up-exporters)
 -[Setting up Postgres exporter](#setting-up-postgres-exporter)
--[Setting up Cloud Spanner exporter](#setting-up-cloud-spanner-exporter)
+-[Setting up GCP exporter](#setting-up-gcp-exporter)
 -[Additional flags](#additional-flags)

 ## About
@@ -369,27 +369,41 @@ If you didn't choose Postgres for processing job storage follow steps 1 & 2 from
 
 This is currently the default exporter, you don't need to explicitly enable it. By default the content of the actual files won't be uploaded to the PostgreSQL DB; if you wish to change that, use the `-upload_payloads true` flag.
 
-In order for the Postgres exporter to work you need to set the following flags: `-postgresHost <host> -postgresPort <port> -postgresUser <user> -postgresPassword <pass> -postgresDBName <db_name>`
+In order for the Postgres exporter to work you need to set the following flags: `-exporters postgres -postgresHost <host> -postgresPort <port> -postgresUser <user> -postgresPassword <pass> -postgresDBName <db_name>`
 
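For illustration, a full Postgres exporter run could look like the sketch below; the importer name, connection details and credentials are placeholders, not values prescribed by the readme.

```shell
# Hypothetical values; substitute your own importer and Postgres connection details.
./hashr -importers <importer_name> -storage postgres \
  -exporters postgres \
  -postgresHost localhost -postgresPort 5432 \
  -postgresUser hashr -postgresPassword hashr -postgresDBName hashr
```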
-#### Setting up Cloud Spanner exporter
+#### Setting up GCP exporter
 
-Cloud Spanner exporter allows sending of hashes, file metadata and the actual content of the file to a GCP Spanner instance. If you haven't set up Cloud Spanner for storing processing jobs, follow the steps in [Setting up Cloud Spanner](####setting-up-cloud-spanner) and instead of the last step run the following command to create necessary tables:
+GCP exporter allows sending of hashes and file metadata to a GCP Spanner instance. Optionally you can upload the extracted files to a GCS bucket. If you haven't set up Cloud Spanner for storing processing jobs, follow the steps in [Setting up Cloud Spanner](####setting-up-cloud-spanner) and instead of the last step run the following command to create necessary tables:
 If you have already set up Cloud Spanner for storing jobs data you just need to run the command above and you're ready to go.
 
+If you'd like to upload the extracted files to GCS you need to create a GCS bucket:
+
+Step 1: Create the GCS bucket:
+```shell
+gsutil mb -p <project_name> gs://<gcs_bucket_name>
+```
+
+Step 2: Make the service account admin of this bucket:
+```shell
+gsutil iam ch serviceAccount:hashr@<project_name>.iam.gserviceaccount.com:objectAdmin gs://<gcs_bucket_name>
+```
+
+To use this exporter you need to provide the following flags: `-exporters GCP -gcp_exporter_gcs_bucket <gcs_bucket_name>`
+
 
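As a companion sketch (all values below are placeholders, none prescribed by the diff), a run that stores jobs in Cloud Spanner and exports through the GCP exporter, including file contents to GCS, might be invoked roughly like this:

```shell
# Sketch only: fill in your own importer, Spanner database path and bucket name.
./hashr -importers <importer_name> \
  -storage cloudspanner -spanner_db_path <spanner_db_path> \
  -exporters GCP -gcp_exporter_gcs_bucket <gcs_bucket_name> \
  -gcp_exporter_worker_count 100 -upload_payloads=true
```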
### Additional flags
 
-1. `-processingWorkerCount`: This flag controls number of parallel processing workers. Processing is CPU and I/O heavy, during my testing I found that having 2 workers is the most optimal solution.
-1. `-cacheDir`: Location of local cache used for deduplication, it's advised to change that from `/tmp` to e.g. home directory of the user that will be running hashr.
+1. `-processing_worker_count`: This flag controls the number of parallel processing workers. Processing is CPU and I/O heavy; during my testing I found that 2 workers is the optimal setting.
+1. `-cache_dir`: Location of the local cache used for deduplication; it's advised to change that from `/tmp` to e.g. the home directory of the user that will be running hashr.
 1. `-export`: When set to false hashr will save the results to disk, bypassing the exporter.
-1. `-exportPath`: If export is set to false, this is the folder where samples will be saved.
+1. `-export_path`: If export is set to false, this is the folder where samples will be saved.
 1. `-reprocess`: Allows reprocessing of a given source (in case it e.g. errored out) based on the sha256 value stored in the jobs table.
-1. `-uploadPayloads`: Controls if the actual content of the file will be uploaded by defined exporters.
-2. `-cloudSpannerWorkerCount`: Number of workers/goroutines that will be used to upload data to Cloud Spanner.
+1. `-upload_payloads`: Controls if the actual content of the file will be uploaded by the defined exporters.
+2. `-gcp_exporter_worker_count`: Number of workers/goroutines that the GCP exporter will use to upload the data.
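To make the renamed flags above concrete, here is a sketch of a local run that overrides these defaults and writes samples to disk instead of exporting; the paths and importer choice are illustrative only.

```shell
# Hypothetical paths; adjust for your environment.
./hashr -importers <importer_name> -storage postgres \
  -processing_worker_count 2 \
  -cache_dir /home/hashr/cache \
  -export=false -export_path /data/hashr-uploads
```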
This is not an officially supported Google product.