Incorrect column name ordering for Multi-Table Synthesizer #2280

Closed
R-Palazzo opened this issue Nov 6, 2024 · 0 comments · Fixed by #2295

Environment Details

  • SDV version: 1.17.1

Error Description

When generating synthetic data, the column order in the metadata should be the source of truth for the order of the output columns.
However, because of the following lines in BaseMultiTableSynthesizer, the multi-table synthesizers currently return the columns in the original order of the real data:

SDV/sdv/multi_table/base.py

Lines 526 to 529 in 315266f

table_columns = getattr(self, '_original_table_columns', {})
for table in sampled_data:
    if table in table_columns:
        sampled_data[table].columns = table_columns[table]

This makes the single-table and multi-table synthesizers behave inconsistently.
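One possible way to bring the sampled tables back into metadata order is sketched below using only pandas. It is an illustration, not necessarily the change made in #2295: `reorder_to_metadata` and the `metadata_columns` mapping (table name to the column names in metadata order) are hypothetical names introduced for this example.

```python
import pandas as pd


def reorder_to_metadata(sampled_data, metadata_columns):
    """Return the sampled tables with columns arranged in metadata order.

    `metadata_columns` is a hypothetical mapping of table name to the list
    of column names in the order the metadata declares them.
    """
    reordered = {}
    for table_name, table in sampled_data.items():
        order = metadata_columns.get(table_name)
        # Selecting by label moves each column's values together with its
        # name; assigning to `.columns` would only relabel the positions.
        reordered[table_name] = table[order] if order else table
    return reordered


# Stand-in for sampled output that follows the real data's column order.
sampled = {'table_1': pd.DataFrame({'col_1': [1], 'col_3': [7], 'col_2': [4]})}
fixed = reorder_to_metadata(sampled, {'table_1': ['col_1', 'col_2', 'col_3']})
print(list(fixed['table_1'].columns))  # ['col_1', 'col_2', 'col_3']
```

Selecting by label rather than assigning to `.columns` keeps every value attached to its column name, so only the arrangement changes.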

Steps to reproduce

The following code reproduces the issue with HMASynthesizer:

import pandas as pd
from sdv.metadata import Metadata
from sdv.multi_table import HMASynthesizer

table_1 = pd.DataFrame({
    'col_1': [1, 2, 3],
    'col_3': [7, 8, 9],
    'col_2': [4, 5, 6],
})
table_2 = pd.DataFrame({
    'col_A': ['a', 'b', 'c'],
    'col_B': ['d', 'e', 'f'],
    'col_C': ['g', 'h', 'i'],
})
metadata = Metadata.load_from_dict({
    'tables': {
        'table_1': {
            'columns': {
                'col_1': {'sdtype': 'numerical'},
                'col_2': {'sdtype': 'numerical'},
                'col_3': {'sdtype': 'numerical'},
            },
        },
        'table_2': {
            'columns': {
                'col_A': {'sdtype': 'categorical'},
                'col_B': {'sdtype': 'categorical'},
                'col_C': {'sdtype': 'categorical'},
            },
        },
    }
})
data = {
    'table_1': table_1,
    'table_2': table_2,
}

synthesizer = HMASynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample()
synthetic_data['table_1']

The printed output is:
[Screenshot: output of synthetic_data['table_1']]

For comparison, with the single-table GaussianCopulaSynthesizer:

from sdv.single_table import GaussianCopulaSynthesizer

metadata_single_table = Metadata.load_from_dict({
    'columns': {
        'col_1': {'sdtype': 'numerical'},
        'col_2': {'sdtype': 'numerical'},
        'col_3': {'sdtype': 'numerical'},
    }
})
synthesizer_single_table = GaussianCopulaSynthesizer(metadata_single_table)
synthesizer_single_table.fit(table_1)
synthetic_data_single_table = synthesizer_single_table.sample(num_rows=3)
synthetic_data_single_table

The printed output is:
[Screenshot: output of synthetic_data_single_table]
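Since the outputs above are only available as screenshots, a small check like the following could make the comparison explicit. It is a sketch under assumptions: `column_order_mismatches` is a hypothetical helper, it takes the plain metadata dictionary in the shape used in the reproduction code, and `table_1` from the real data stands in for a sampled table that kept the real data's order.

```python
import pandas as pd


def column_order_mismatches(tables, metadata_dict):
    """Map each table whose column order differs from the metadata
    to a tuple of (expected order, actual order)."""
    mismatches = {}
    for name, table_meta in metadata_dict['tables'].items():
        expected = list(table_meta['columns'])  # dicts keep insertion order
        actual = list(tables[name].columns)
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches


metadata_dict = {
    'tables': {
        'table_1': {
            'columns': {
                'col_1': {'sdtype': 'numerical'},
                'col_2': {'sdtype': 'numerical'},
                'col_3': {'sdtype': 'numerical'},
            },
        },
    }
}
# Stand-in for a sampled table that follows the real data's column order.
tables = {
    'table_1': pd.DataFrame({
        'col_1': [1, 2, 3],
        'col_3': [7, 8, 9],
        'col_2': [4, 5, 6],
    })
}
print(column_order_mismatches(tables, metadata_dict))
```

Run against the real output of `synthesizer.sample()`, an empty result would indicate that every table follows the metadata order.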

R-Palazzo added the bug label Nov 6, 2024
R-Palazzo added this to the 1.17.2 milestone Nov 13, 2024
R-Palazzo self-assigned this Nov 13, 2024