Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename patched filters so the normalizer uses the default ones #409

Conversation

wgresshoff
Copy link
Contributor

@wgresshoff wgresshoff commented Oct 8, 2024

❤️ Thank you for your contribution!

Close #408

Description

The normalizer that was introduced in the funders index with v5.1.0 generates two tokens when used on documents that contain diacritics in the acronym. This behaviour occurs when in the filter definition "preserve_original": true is used. This is ok for analyzers, so this PR just renames the filters and takes care the analyzers use the new ones and the normailzer uses the original ones.

Checklist

Ticks in all boxes and 🟢 on all GitHub actions status checks are required to merge:

Frontend

Reminder

By using GitHub, you have already agreed to the GitHub’s Terms of Service including that:

  1. You license your contribution under the same terms as the current repository’s license.
  2. You agree that you have the right to license your contribution under the current repository’s license.

@wgresshoff wgresshoff force-pushed the bug/408-fix-acronyms-with-diacritics branch from 0e369eb to d3c20ca Compare October 8, 2024 12:51
@wgresshoff wgresshoff force-pushed the bug/408-fix-acronyms-with-diacritics branch from d6f3378 to ae3b4f2 Compare October 11, 2024 08:33
@@ -14,8 +14,8 @@
"type": "custom",
"char_filter": ["strip_special_chars"],
"filter": [
"lowercase",
"asciifolding",
"lowercasepreserveoriginal",
Copy link
Contributor

@kpsherva kpsherva Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgresshoff could I ask you to move all the changes to funder-v3.0.0.json files?
in case of breaking changes like these we need to upgrade the version of the mapping, otherwise instance upgrades are complex
so we need to (for both os-v1 and os-v2)

  • copy the mappings to funder-v3.0.0.json
  • move the changes you proposed here to funder-v3.0.0.json file
  • revert funders-v2.0.0.json changes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since in practice this doesn't break existing indices created with the old mapping but only affects search behavior/accuracy, I feel we can merge without bumping the mapping version.

This means that for deployment, one needs to either:

  • Do the usual recreation of indices (with downtime)
  • Do a live migration with some special handling since the old and new index name will be the same (only the timestamp suffix changes).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used the recreation of indices with downtime and the missing funders were reindexed without problem.

@wgresshoff wgresshoff force-pushed the bug/408-fix-acronyms-with-diacritics branch from ae3b4f2 to 6cbca0b Compare October 15, 2024 09:02
@kpsherva kpsherva merged commit dace099 into inveniosoftware:master Oct 16, 2024
4 checks passed
slint added a commit to 0einstein0/zenodo-rdm that referenced this pull request Nov 4, 2024
📁 invenio-accounts (5.1.2 -> 5.1.3 🐛)

    📦 release: v5.1.3
    UI: fix spacing on password reset form

📁 invenio-db (1.2.0 -> 1.3.0 🌈)

    📦 release: v1.3.0
    uow: add "on_exception" lifecycle hook

    - When an exception occurs and the unit of work is being rolled back, we
      want to add the possibility to perform some clean-up actions that can
      also be commited to the database.
    - The new `on_exception` method is added because we want to keep
      backwards compatibility and also introduce the same "triplet" of
      methods as we have for the "happy path" (`on_register`, `on_commit`,
      `on_post_commit`).

📁 invenio-indexer (2.3.0 -> 2.4.0 🌈)

    release: v2.4.0
    cli: run command accepts a new parameter 'chunk_size'
    docs: fix doctest for default index config
    config: remove `INDEXER_DEFAULT_INDEX` default value

📁 invenio-rdm-records (15.6.0 -> 15.7.0 🌈)

    📦 release: v15.7.0
    resources: make record error handlers configurable
    components: make content moderation configurable

    - Closes inveniosoftware/invenio-rdm-records#1861.
    - Adds a new `RRM_CONTENT_MODERATION_HANDLERS` config variable to allow
      for configuring multiple handlers for the different write actions.
    deposit: add missing fields to record deserializer
    user_moderation: use search for faster actions

    * Use search results to determine the user's list of records.
    * Use a TaskOp and Unit of Work to avoid sending Celery tasks immediately.
    * Add a cleanup task that will perform a more thorough check using the
      DB to lookup the user's records.
    refactor: remove showManualEntry argument
    refactor: naming
    UI/UX: add consistent suggestions display to affiliations
    refactor: extract logic to react-invenio-forms
    UI/UX: improve display of ROR information
    refactor: create makeSubheader function
    UI: remove unnecessary divs to fix spacing bug
    collections: move records search into service

    * Added resource for collections
    * Moved records search from community records to collections service
    collections: added task to compute num of records

    * added task to compute number of records for all the collections
    * added "collection_id" parameter to record search
    * added service methods to read collections (many, all)
    * added tests
    * collections: refactor 'resolve' to 'read'
    * collections: rename 'search_records' method
    * collections: update read method signature
    services: make file-service components configurable
    access notification: provide correct draft preview link

    closes inveniosoftware/invenio-app-rdm#2827

📁 invenio-vocabularies (6.4.1 -> 6.5.0 🌈)

    📦 release: v6.5.0
    subjects: euroscivoc: change default to latest version-less URL
    Rename patched filters so the normalizer uses the default ones (inveniosoftware/invenio-vocabularies#409)

    * rename patched filters so the normalizer uses the default ones
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

The newly introduced normalizer in the funders vocab generates two tokens when acronyms contain diacritics
4 participants