docs/Developer-Guide.md (+44)
@@ -511,6 +511,50 @@ The defaults are:
SWIRL is configured to load English stopwords only. To change this, modify `SWIRL_DEFAULT_QUERY_LANGUAGE` in [swirl_settings/settings.py](https://github.com/swirlai/swirl-search/blob/main/swirl_server/settings.py) and change it to another [NLTK stopword language](https://stackoverflow.com/questions/54573853/nltk-available-languages-for-stopwords).

## Redact or Remove Personally Identifiable Information (PII) From Queries and/or Results

SWIRL supports the removal or redaction of PII entities using [Microsoft Presidio](https://microsoft.github.io/presidio/). There are three options available:

### `RemovePIIQueryProcessor`

This QueryProcessor removes PII entities from queries.

To use it, install it in the QueryProcessing pipeline for a given SearchProvider:

```
"query_processors": [
    "AdaptiveQueryProcessor",
    "RemovePIIQueryProcessor"
]
```
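
If SearchProvider configurations are kept as JSON, the change above can be scripted. The following is a minimal sketch, assuming the configuration loads as a plain dictionary with a `query_processors` list as shown; `add_query_processor` and the example provider are hypothetical and not part of SWIRL.

```python
import json


def add_query_processor(provider: dict, name: str) -> dict:
    """Return a copy of a SearchProvider config with `name` appended to
    its query_processors list, unless it is already present.

    Hypothetical helper for illustration; not part of SWIRL.
    """
    updated = dict(provider)
    processors = list(updated.get("query_processors", []))
    if name not in processors:
        processors.append(name)
    updated["query_processors"] = processors
    return updated


# Example configuration; the provider name is made up.
provider = {"name": "Example Provider", "query_processors": ["AdaptiveQueryProcessor"]}
updated = add_query_processor(provider, "RemovePIIQueryProcessor")
print(json.dumps(updated["query_processors"], indent=4))
```

Appending keeps the processor last in the list, matching the order shown above, and calling the helper twice does not add a duplicate entry.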

Or, install it in the PreQueryProcessing pipeline to remove PII from queries sent to all SearchProviders:

In `swirl/models.py`:

```
def getSearchPreQueryProcessorsDefault():
    return ["RemovePIIQueryProcessor"]
```

More information: [QueryProcessors](./Developer-Reference.md#query-processors)

### `RemovePIIResultProcessor`

This ResultProcessor redacts PII entities in results. For example, "James T. Kirk" is replaced by `<PERSON>`. To use it, install it in the ResultProcessing pipeline for a given SearchProvider:

```
"result_processors": [
    "MappingResultProcessor",
    "DateFinderResultProcessor",
    "CosineRelevancyResultProcessor",
    "RemovePIIResultProcessor"
]
```
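
The difference between redacting an entity and removing it can be illustrated without Presidio. The sketch below assumes an analyzer has already reported entity spans as `(entity_type, start, end)` tuples, which is a simplification of what a real analyzer returns. The span is supplied by hand, and `redact` and `remove` are hypothetical helpers, not SWIRL or Presidio functions.

```python
def redact(text, spans):
    """Replace each detected span with a tag such as <PERSON>."""
    # Work from the end of the string so earlier offsets stay valid.
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text


def remove(text, spans):
    """Delete each detected span and collapse the leftover whitespace."""
    for _, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        text = text[:start] + text[end:]
    return " ".join(text.split())


text = "James T. Kirk commands the Enterprise"
spans = [("PERSON", 0, 13)]  # hand-supplied; a real analyzer would detect this

print(redact(text, spans))  # tag replaces the name, as in results
print(remove(text, spans))  # name is dropped entirely, as in queries
```

Redaction keeps a visible marker of what was taken out, which suits results shown to a user; removal leaves no trace, which suits a query string that will be sent to a SearchProvider.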

More information: [ResultProcessors](./Developer-Reference.md#result-processors)

### `RemovePIIPostResultProcessor`
## Understand the Explain Structure
The [CosineRelevancyProcessor](Developer-Reference.html#cosinerelevancypostresultprocessor) outputs a JSON structure that explains the `swirl_score` for each result. It is displayed by default; to hide it, add `&explain=False` to any mixer URL.
docs/Developer-Reference.md (+6)
@@ -983,6 +983,7 @@ This table describes the query processors included in SWIRL:
| GenericQueryProcessor | Removes special characters from the query ||
| SpellcheckQueryProcessor | Uses [TextBlob](https://textblob.readthedocs.io/en/dev/quickstart.html#spelling-correction) to predict and fix spelling errors in `query_string`| Best deployed in a `SearchProvider.query_processor` for sources that need it; not recommended with Google PSEs |
| NoModQueryProcessor | Only removes leading SearchProvider Tags and does not modify the query terms in any way. | It is intended for repositories that allow non-search characters (such as brackets). |
| RemovePIIQueryProcessor | Removes PII entities from the query. It does not replace them. ||

## Result Processors
@@ -999,6 +1000,7 @@ The following table lists the Result Processors included with SWIRL:
| DateFinderResultProcessor | Looks for a date in any of a number of formats in the body field of each result item. Should it find one, and the `date_published` for that item is `'unknown'`, it replaces `date_published` with the date extracted from the body, and notes this in the `result.messages`. | This processor can detect the following date formats:<br/> `06/01/23`<br/>`06/01/2023`<br/>`06-01-23`<br/>`06-01-2023`<br/>`jun 1, 2023`<br/>`june 1, 2023`|
| AutomaticPayloadMapperResultProcessor | Profiles response data to find good strings for SWIRL's `title`, `body`, and `date_published` fields. It is intended for SearchProviders that would otherwise have few (or no) good `result_mappings` options. | It should be placed after the `MappingResultProcessor`. The `result_mappings` field should be blank, except for the optional DATASET directive, which will return only a single SWIRL response for each provider response, with the original response in the `payload` field under the `dataset` key. |
| RequireQueryStringInTitleResultProcessor | Drops results that do not contain the `query_string_to_provider` in the result `title` field. | It should be added after the `MappingResultProcessor` and is now included by default in the "LinkedIn - Google PSE" SearchProvider. |
| RemovePIIResultProcessor | Redacts PII entities in all result fields for configured SearchProviders, including payload string fields, replacing each with a generic tag showing the entity type. For example, "James T. Kirk" becomes `<PERSON>`. | This processor may be installed before or after the CosineRelevancyResultProcessor. If it runs before, query terms that are PII entities will not be used in relevancy ranking, since they will already be redacted. More information: [https://microsoft.github.io/presidio/](https://microsoft.github.io/presidio/)|

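
The ordering note for the `RemovePIIResultProcessor` in the table above can be illustrated with a toy score. This is a minimal sketch: `term_overlap` is a hypothetical stand-in that counts matching query terms and is not SWIRL's actual relevancy scoring, and the redacted text is written by hand.

```python
def term_overlap(query, text):
    """Toy relevancy: fraction of query terms found in the text.

    A stand-in for illustration only; not SWIRL's actual scoring.
    """
    query_terms = query.lower().split()
    text_terms = set(text.lower().split())
    return sum(term in text_terms for term in query_terms) / len(query_terms)


query = "kirk enterprise"
body = "Kirk commands the Enterprise"
redacted_body = "<PERSON> commands the Enterprise"  # hand-written redaction

print(term_overlap(query, body))           # relevancy runs before redaction
print(term_overlap(query, redacted_body))  # relevancy runs after redaction
```

Once the name has been replaced by a tag, the query term that matched it no longer contributes to the score, so placing the PII processor after relevancy ranking preserves the ranking signal.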
## Post Result Processors
@@ -1064,6 +1066,10 @@ The `DropIrrelevantPostResultProcessor` drops results with `swirl_score < settin
{: .highlight }
The Galaxy UI will not display the correct number of results if this ResultProcessor is deployed.
### `RemovePIIPostResultProcessor`

This processor is identical in most respects to the [RemovePIIResultProcessor](#result-processors), except that it operates on all results in a result set, not just those from a single SearchProvider.

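
The difference in scope can be sketched as follows. This is an illustration only: the provider names and result bodies are made up, and `redact_name` is a toy string replacement rather than the Presidio-based redaction the processors perform.

```python
def redact_name(text, name="James T. Kirk"):
    """Toy redaction: swap one known name for a <PERSON> tag."""
    return text.replace(name, "<PERSON>")


# Hypothetical result bodies, keyed by SearchProvider name.
results = {
    "Provider A": ["James T. Kirk, captain"],
    "Provider B": ["Report filed by James T. Kirk"],
}

# ResultProcessor style: only providers configured with the processor.
configured = {"Provider A"}
per_provider = {
    provider: [redact_name(body) if provider in configured else body for body in bodies]
    for provider, bodies in results.items()
}

# PostResultProcessor style: every result in the result set.
whole_set = {
    provider: [redact_name(body) for body in bodies]
    for provider, bodies in results.items()
}

print(per_provider["Provider B"])  # untouched: provider not configured
print(whole_set["Provider B"])     # redacted along with everything else
```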
# Mixers
The following table details the Result Mixers included with SWIRL: