BUG: Memory leak when creating a df inside a loop #60897

Open
3 tasks done
Chuck321123 opened this issue Feb 9, 2025 · 4 comments
Labels
Bug, Constructors, Performance, Windows

Comments

@Chuck321123

Chuck321123 commented Feb 9, 2025

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import tracemalloc
import numpy as np
import time
import gc

# Start memory tracking
tracemalloc.start()

iteration = 0

Row_Number = 20000

while iteration < 1000:
    
    test_lst = [*range(12)]
    
    for i in range(12):
        
        # Create a DataFrame with Row_Number rows
        df = pd.DataFrame({
            "A": np.arange(Row_Number),  # Sequential integers from 0 to Row_Number - 1
            "B": np.random.rand(Row_Number),  # Random floats in [0, 1)
            "C": np.random.randint(0, 100, size=Row_Number),  # Random integers between 0 and 99
            "D": np.random.choice(["apple", "banana", "cherry"], size=Row_Number),  # Random categories
            "E": np.random.randn(Row_Number)  # Normally distributed random numbers
        })

        test_lst[i] = df  # The leak also appears without storing df in the list

        del df  # Deleting df at the end of the inner loop does not affect the leak

    del test_lst  # Deleting the list at the end of the outer loop does not affect the leak
        
    time.sleep(0.01)
    
    iteration += 1

    # Report memory usage on every iteration (raise the modulus to report less often)
    if iteration % 1 == 0:

        snapshot = tracemalloc.take_snapshot()

        # Group the allocation statistics by source line, without filtering
        top_stats = snapshot.statistics("lineno")
        
        print(f"\n[ Memory Snapshot at iteration {iteration} ]")
        for stat in top_stats[:5]:  # Show top memory-consuming locations
            print(stat)

Issue Description

Using tracemalloc (the standard-library tool for tracing memory allocations), I can see that pandas doesn't release memory when DataFrames are created inside a loop. The growth seems to come from pandas\core\internals\blocks.py, around line 228. It would be nice if anyone could find a fix for this.
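
One way to separate steady-state allocations from real growth is to compare two tracemalloc snapshots taken around a fixed number of iterations, after a few warm-up rounds and a gc.collect(). The sketch below is an illustration and not part of the original report: measure_growth, leaky, and clean are hypothetical names, and the stand-in workloads use bytearray so that the example runs without pandas. To apply it to this issue, one would pass a function that builds the DataFrame from the reproducer.

```python
import gc
import tracemalloc


def measure_growth(workload, warmup=3, rounds=5, limit=5):
    """Run `workload` repeatedly and return the largest per-line memory diffs.

    Hypothetical helper: snapshots are taken after `warmup` calls and again
    after `rounds` further calls, so one-off caches do not count as growth.
    """
    started_here = not tracemalloc.is_tracing()
    if started_here:
        tracemalloc.start()
    try:
        for _ in range(warmup):
            workload()
        gc.collect()  # drop unreachable cycles before measuring
        before = tracemalloc.take_snapshot()

        for _ in range(rounds):
            workload()
        gc.collect()
        after = tracemalloc.take_snapshot()
    finally:
        if started_here:
            tracemalloc.stop()

    # Exclude allocations made by tracemalloc's own Python code.
    filters = [tracemalloc.Filter(False, tracemalloc.__file__)]
    before = before.filter_traces(filters)
    after = after.filter_traces(filters)

    # compare_to sorts by absolute size difference, largest first.
    return after.compare_to(before, "lineno")[:limit]


_retained = []  # stand-in for a reference that is never dropped


def leaky():
    # Keeps every buffer alive, so memory grows on each call.
    _retained.append(bytearray(100_000))


def clean():
    # Allocates the same buffer but releases it before returning.
    data = bytearray(100_000)
    del data


leaky_top = measure_growth(leaky)
clean_top = measure_growth(clean)

for name, top in (("leaky", leaky_top), ("clean", clean_top)):
    print(f"[{name}]")
    for diff in top:
        print(diff)
```

A line whose size_diff keeps scaling with `rounds` is a candidate leak; a line that only shows up in the first snapshot is more likely a cache or import-time cost.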

Expected Behavior

Memory should not leak.

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.13.1
python-bits : 64
OS : Windows
OS-release : 11
Version : 10.0.22631
machine : AMD64
processor : Intel64 Family 6 Model 186 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : Norwegian Bokmål_Norway.1252

pandas : 2.2.3
numpy : 2.2.2
pytz : 2024.2
dateutil : 2.9.0.post0
pip : 24.2
Cython : None
sphinx : 8.1.3
IPython : 8.31.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.5
lxml.etree : None
matplotlib : 3.10.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 19.0.0
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.15.1
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.2
qtpy : 2.4.2
pyqt5 : None

Chuck321123 added the Bug and Needs Triage labels on Feb 9, 2025
@rhshadrach
Member

rhshadrach commented Feb 9, 2025

Thanks for the report. I cannot reproduce this on Linux. Can you include the stdout from your reproducer?

Further investigations are welcome!

rhshadrach added the Performance, Windows, and Constructors labels and removed the Needs Triage label on Feb 9, 2025
@narukaze132

narukaze132 commented Feb 9, 2025

I use Windows, so I tried running the reproducer. Here's an excerpt of what I got from stdout, since GitHub isn't letting me post the whole thing. (Note: I don't have pandas installed in the core Python path, so I manually redacted my user directory from the result.)

[ Memory Snapshot at iteration 1 ]
%user-directory%\Python311\site-packages\pandas\core\internals\blocks.py:228: size=2820 B, count=49, average=58 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:1778: size=2240 B, count=35, average=64 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:2215: size=1856 B, count=26, average=71 B
%user-directory%\Python311\site-packages\numpy\_core\fromnumeric.py:57: size=1384 B, count=12, average=115 B
<frozen abc>:123: size=896 B, count=8, average=112 B

[ Memory Snapshot at iteration 2 ]
%user-directory%\Python311\site-packages\pandas\core\internals\blocks.py:228: size=4906 B, count=85, average=58 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:1778: size=3072 B, count=48, average=64 B
%user-directory%\Python311\site-packages\numpy\_core\fromnumeric.py:57: size=2104 B, count=18, average=117 B
C:\Program Files\Python311\Lib\tracemalloc.py:505: size=1904 B, count=34, average=56 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:2215: size=1904 B, count=27, average=71 B

[ Memory Snapshot at iteration 3 ]
%user-directory%\Python311\site-packages\pandas\core\internals\blocks.py:228: size=7042 B, count=122, average=58 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:1778: size=3072 B, count=48, average=64 B
C:\Program Files\Python311\Lib\tracemalloc.py:505: size=2688 B, count=48, average=56 B
%user-directory%\Python311\site-packages\numpy\_core\fromnumeric.py:57: size=2584 B, count=22, average=117 B
C:\Program Files\Python311\Lib\tracemalloc.py:498: size=2304 B, count=48, average=48 B

[ Memory Snapshot at iteration 4 ]
%user-directory%\Python311\site-packages\pandas\core\internals\blocks.py:228: size=8488 B, count=147, average=58 B
C:\Program Files\Python311\Lib\tracemalloc.py:558: size=3200 B, count=62, average=52 B
C:\Program Files\Python311\Lib\encodings\cp1252.py:19: size=3117 B, count=36, average=87 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:1778: size=3072 B, count=48, average=64 B
%user-directory%\Python311\site-packages\numpy\_core\fromnumeric.py:57: size=2944 B, count=25, average=118 B

[ Memory Snapshot at iteration 5 ]
%user-directory%\Python311\site-packages\pandas\core\internals\blocks.py:228: size=9410 B, count=163, average=58 B
C:\Program Files\Python311\Lib\encodings\cp1252.py:19: size=4138 B, count=48, average=86 B
C:\Program Files\Python311\Lib\tracemalloc.py:558: size=3640 B, count=64, average=57 B
%user-directory%\Python311\site-packages\numpy\_core\fromnumeric.py:57: size=3184 B, count=27, average=118 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:1778: size=3072 B, count=48, average=64 B

[ Memory Snapshot at iteration 6 ]
%user-directory%\Python311\site-packages\pandas\core\internals\blocks.py:228: size=10.6 KiB, count=188, average=58 B
C:\Program Files\Python311\Lib\encodings\cp1252.py:19: size=5159 B, count=60, average=86 B
%user-directory%\Python311\site-packages\numpy\_core\fromnumeric.py:57: size=3304 B, count=28, average=118 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:1778: size=3072 B, count=48, average=64 B
C:\Program Files\Python311\Lib\tracemalloc.py:558: size=2904 B, count=53, average=55 B

[ Memory Snapshot at iteration 7 ]
%user-directory%\Python311\site-packages\pandas\core\internals\blocks.py:228: size=12.2 KiB, count=216, average=58 B
C:\Program Files\Python311\Lib\encodings\cp1252.py:19: size=6182 B, count=72, average=86 B
C:\Program Files\Python311\Lib\tracemalloc.py:558: size=5528 B, count=100, average=55 B
%user-directory%\Python311\site-packages\numpy\_core\fromnumeric.py:57: size=3544 B, count=30, average=118 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:1778: size=3072 B, count=48, average=64 B

[ Memory Snapshot at iteration 8 ]
%user-directory%\Python311\site-packages\pandas\core\internals\blocks.py:228: size=12.7 KiB, count=225, average=58 B
C:\Program Files\Python311\Lib\encodings\cp1252.py:19: size=7206 B, count=84, average=86 B
C:\Program Files\Python311\Lib\tracemalloc.py:558: size=4776 B, count=88, average=54 B
%user-directory%\Python311\site-packages\numpy\_core\fromnumeric.py:57: size=3664 B, count=31, average=118 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:1778: size=3072 B, count=48, average=64 B

[ Memory Snapshot at iteration 9 ]
%user-directory%\Python311\site-packages\pandas\core\internals\blocks.py:228: size=13.9 KiB, count=247, average=58 B
C:\Program Files\Python311\Lib\encodings\cp1252.py:19: size=8229 B, count=96, average=86 B
C:\Program Files\Python311\Lib\tracemalloc.py:558: size=5544 B, count=97, average=57 B
%user-directory%\Python311\site-packages\numpy\_core\fromnumeric.py:57: size=3664 B, count=31, average=118 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:1778: size=3072 B, count=48, average=64 B

[ Memory Snapshot at iteration 10 ]
%user-directory%\Python311\site-packages\pandas\core\internals\blocks.py:228: size=15.1 KiB, count=268, average=58 B
C:\Program Files\Python311\Lib\encodings\cp1252.py:19: size=9252 B, count=108, average=86 B
C:\Program Files\Python311\Lib\tracemalloc.py:558: size=6448 B, count=120, average=54 B
%user-directory%\Python311\site-packages\numpy\_core\fromnumeric.py:57: size=3664 B, count=31, average=118 B
%user-directory%\Python311\site-packages\pandas\core\internals\managers.py:1778: size=3072 B, count=48, average=64 B
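
To put a number on the trend in the excerpt above, the short calculation below takes the allocation counts reported for blocks.py:228 in the ten snapshots and works out how many blocks are added per iteration. The values are copied from the output; the variable names are only illustrative.

```python
# Allocation counts for blocks.py:228, copied from the ten snapshots above.
counts = [49, 85, 122, 147, 163, 188, 216, 225, 247, 268]
AVERAGE_BLOCK_SIZE = 58  # bytes, as reported on every blocks.py:228 line

# Change in the number of live blocks from one snapshot to the next.
deltas = [later - earlier for earlier, later in zip(counts, counts[1:])]

# Mean growth over the nine intervals between the ten snapshots.
mean_new_blocks = (counts[-1] - counts[0]) / (len(counts) - 1)
mean_new_bytes = mean_new_blocks * AVERAGE_BLOCK_SIZE

print("new blocks per iteration:", deltas)
print(f"mean: {mean_new_blocks:.1f} blocks, about {mean_new_bytes:.0f} B per iteration")
```

The count rises in every interval, by about 24 blocks (roughly 1.4 KiB) per iteration on average. For scale, a single 20,000-element float64 column occupies 160,000 bytes, so the growth traced to this line looks like small per-object records rather than column data. That reading is an inference from the excerpt, not something confirmed in the thread.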

@Chuck321123
Author

Chuck321123 commented Feb 9, 2025

@rhshadrach What Linux OS and architecture are you using? I'm using Ubuntu on an aarch Raspberry Pi, and I still get the memory leak.

@rhshadrach
Member

rhshadrach commented Feb 9, 2025

@narukaze132 - thanks, I neglected to see how many times the reproducer was looping. I've cut your output down to the first 10 iterations; this is more than enough already.

@Chuck321123 -

INSTALLED VERSIONS
------------------
commit                : 846b2b532dfe81855fed2148c77b0d57727306d7
python                : 3.12.3
python-bits           : 64
OS                    : Linux
OS-release            : 6.8.0-52-generic
Version               : #53-Ubuntu SMP PREEMPT_DYNAMIC Sat Jan 11 00:06:25 UTC 2025
machine               : x86_64
processor             : x86_64
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 3.0.0.dev0+1798.g846b2b532d
numpy                 : 2.2.1
pytz                  : 2024.2
dateutil              : 2.9.0.post0
pip                   : 24.2
Cython                : 3.0.11
sphinx                : 8.1.3
IPython               : 8.29.0
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
blosc                 : None
bottleneck            : 1.4.2
dataframe-api-compat  : None
fastparquet           : 2024.5.0
fsspec                : 2024.10.0
html5lib              : 1.1
hypothesis            : 6.115.5
gcsfs                 : 2024.10.0
jinja2                : 3.1.4
lxml.etree            : 5.3.0
matplotlib            : 3.9.2
numba                 : None
numexpr               : 2.10.1
odfpy                 : None
openpyxl              : 3.1.5
pandas_gbq            : None
psycopg2              : 2.9.10
pymysql               : 1.4.6
pyarrow               : 18.1.0
pyreadstat            : 1.2.8
pytest                : 8.3.3
python-calamine       : None
pyxlsb                : 1.0.10
s3fs                  : 2024.10.0
scipy                 : 1.14.1
sqlalchemy            : 2.0.36
tables                : 3.10.1
tabulate              : 0.9.0
xarray                : 2024.9.0
xlrd                  : 2.0.1
xlsxwriter            : 3.2.0
zstandard             : 0.23.0
tzdata                : 2024.2
qtpy                  : None
pyqt5                 : None
