I have a 25GB pickle of a dictionary of numpy arrays.
The dictionary looks like the following:

- 668,956 key-value pairs.
- The keys are strings, e.g. `109c3708-3b0c-4868-a647-b9feb306c886_1`.
- The values are numpy arrays of shape 200×23, dtype float64.
When I load the data with pickle repeatedly in a loop, the loading gets slower each iteration (see code and results below). What could be causing this?
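If you want to reproduce this, a scaled-down stand-in for the data can be generated like so (far fewer keys than the real 668,956, so it stays small; the path is a placeholder):

```python
import pickle
import uuid

import numpy as np

# Scaled-down stand-in: same key format and array shape/dtype as the
# real data, but only 1,000 entries.
toy = {
    f"{uuid.uuid4()}_1": np.random.rand(200, 23)  # float64, shape (200, 23)
    for _ in range(1_000)
}

with open("D:/data/batched/toy.pickle", "wb") as handle:
    pickle.dump(toy, handle, protocol=pickle.HIGHEST_PROTOCOL)
```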
Code:

```python
import pickle
import time

def load_pickle(file: int) -> dict:
    with open(f"D:/data/batched/{file}.pickle", "rb") as handle:
        return pickle.load(handle)

for i in range(0, 9):
    print(f"\nIteration {i}")

    start_time = time.time()
    file = None
    print(f"Unloaded file in {time.time() - start_time:.2f} seconds")

    start_time = time.time()
    file = load_pickle(0)
    print(f"Loaded file in {time.time() - start_time:.2f} seconds")
```
Result:

```
Iteration 0
Unloaded file in 0.00 seconds
Loaded file in 18.80 seconds

Iteration 1
Unloaded file in 14.78 seconds
Loaded file in 30.51 seconds

Iteration 2
Unloaded file in 28.67 seconds
Loaded file in 30.21 seconds

Iteration 3
Unloaded file in 35.38 seconds
Loaded file in 40.25 seconds

Iteration 4
Unloaded file in 39.91 seconds
Loaded file in 41.24 seconds

Iteration 5
Unloaded file in 43.25 seconds
Loaded file in 45.57 seconds

Iteration 6
Unloaded file in 46.94 seconds
Loaded file in 48.19 seconds

Iteration 7
Unloaded file in 51.67 seconds
Loaded file in 51.32 seconds

Iteration 8
Unloaded file in 55.25 seconds
Loaded file in 56.11 seconds
```
Notes:
While the loop runs, RAM usage ramps down (I assume this is the previous data in the `file` variable being dereferenced) before ramping up again. Both the unloading and loading parts seem to slow down over time. It surprises me how slowly the RAM decreases during the unloading part.

The total RAM usage it ramps up to stays roughly constant, so it doesn't seem like there's a memory leak.

I've tried including `del file` and `gc.collect()` in the loop, but this doesn't speed anything up.

If I change `return pickle.load(handle)` to `return handle.read()`, the unload time is consistently 0.45 s and the load time is consistently 4.85 s (that variant is shown below for clarity).
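For clarity, that raw-read variant looks like this:

```python
def load_bytes(file: int) -> bytes:
    # Same as load_pickle, but returns the raw bytes without
    # deserializing, so no Python objects are constructed.
    with open(f"D:/data/batched/{file}.pickle", "rb") as handle:
        return handle.read()
```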
I’m using Python 3.9.13 on Windows with SSD storage (Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:51:29) [MSC v.1929 64 bit (AMD64)]).
I have 64GB RAM and don’t seem to be maxing this out.
Why am I doing this? During training of an ML model, I have 10 files that are each 25GB big. I can't fit them all into memory simultaneously, so I have to load and unload them each epoch.
Any ideas? I’d be willing to move away from using pickle too if there’s an alternative that has similar read speed and doesn’t suffer from the above problem (I’m not worried about compression).
Edit:
I've run the above loading and unloading loop for different-sized pickles (the loop below, pointed at different files). The results showed the relative change in speed over time: for anything above 3 GB, the unload time starts to ramp significantly.
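For reference, the sweep is just the loop from above, parameterized over the file path:

```python
import pickle
import time

def time_load_unload(path: str, cycles: int = 9) -> None:
    # Same load/unload loop as above, pointed at an arbitrary pickle so
    # it can be run against files of different sizes.
    for i in range(cycles):
        print(f"\nIteration {i}")

        start_time = time.time()
        data = None  # unload the previous dict
        print(f"Unloaded file in {time.time() - start_time:.2f} seconds")

        start_time = time.time()
        with open(path, "rb") as handle:
            data = pickle.load(handle)
        print(f"Loaded file in {time.time() - start_time:.2f} seconds")
```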
The slowdown you're experiencing with repeated pickle loading is likely due to a combination of factors related to memory management, disk access patterns, and the internal workings of the pickle module. Let's break down the possible causes and then explore solutions.

**Understanding the Problem**

1. **Memory Fragmentation:** Even though you have ample RAM and `gc.collect()` doesn't seem to help, memory fragmentation could still be a factor. Repeatedly allocating and deallocating large blocks of memory (your NumPy arrays) can lead to fragmentation. While the OS can eventually find contiguous blocks, it might take increasingly longer to do so as the loop progresses. The observation that `del file` doesn't help suggests the memory isn't immediately released, possibly waiting for garbage collection or held by the underlying NumPy structures.

2. **Disk Caching and I/O:** While you're using an SSD, repeated reading of the *same* file might lead you to assume the data is cached. However, the OS's disk cache management isn't always predictable, especially with large files. The first read might be fast, but subsequent reads could still involve disk access or cache invalidation, especially as other processes compete for memory. The change in "unload time" suggests that something is happening with how the OS handles memory mapping or caching of the file contents.

3. **Pickle Overhead:** Pickle is relatively slow compared to more efficient serialization formats. Deserialization involves reconstructing Python objects, which is computationally expensive, especially with complex data structures like NumPy arrays. Each call to `pickle.load` has to rebuild the entire dictionary and all its NumPy array values from scratch.

**Possible Solutions**

Here's a prioritized list of solutions to try, moving from the simplest/most likely to succeed to more complex approaches.

1. **`mmap` (Memory Mapping): The Most Promising Solution**

   Memory mapping is the most likely path to significant improvement because it avoids repeatedly reading the file through ordinary file I/O (the deserialization still happens, but the bytes are served from the OS page cache). `mmap` lets you treat a file as if it were directly loaded into memory without *actually* loading the whole file into memory; the OS handles the caching and paging of the file's contents as needed. This should eliminate the slowdown you are seeing.

   ```python
   import mmap
   import pickle
   import time

   def load_pickle_mmap(file_path: str) -> dict:
       """Loads a pickled dictionary of NumPy arrays via a memory-mapped file."""
       with open(file_path, "rb") as f:
           with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
               # Deserialize the entire dictionary from the mmap object
               return pickle.load(mm)

   for i in range(9):
       print(f"\nIteration {i}")

       start_time = time.time()
       file = None
       print(f"Unloaded file in {time.time() - start_time:.4f} seconds")

       start_time = time.time()
       file = load_pickle_mmap("D:/data/batched/0.pickle")  # replace with your actual path
       print(f"Loaded file in {time.time() - start_time:.4f} seconds")
   ```

   * **Explanation:** `mmap.mmap()` creates a memory map of the file, and `pickle.load()` then deserializes the dictionary *from the memory map*. The key point is that the OS now manages the file's contents in memory, which should be much more efficient than repeatedly reading the file.
   The `access=mmap.ACCESS_READ` flag makes the mapping read-only.

   * **Important:** If you *modify* the NumPy arrays, `mmap` can be tricky (you'd need `mmap.ACCESS_WRITE` and careful synchronization). Since you're only loading the data for training, read-only access should be sufficient.

2. **Optimize Disk I/O (if `mmap` doesn't fully resolve it):**

   Even with an SSD, suboptimal I/O can hurt. Here's what to consider:

   * **Ensure Proper SSD Configuration:** Verify your SSD drivers are up to date and the drive is configured for optimal performance.

   * **Defragmentation (Yes, Even on SSDs):** While SSDs don't suffer from fragmentation as severely as HDDs, excessive file creation/deletion can still lead to fragmented free space. Periodically running an SSD optimization tool (usually provided by the manufacturer or the OS) can help.

   * **Consider Multiple Processes (Carefully):** If the bottleneck is purely I/O, you could explore using multiple processes to load data in parallel. *However*, this introduces significant complexity with memory management and inter-process communication. It's generally *not* recommended unless you've exhausted all other options.

3. **Alternative Serialization Formats:**

   While you mentioned wanting something as fast as pickle, it might be worthwhile to explore alternatives, especially if `mmap` isn't sufficient:

   * **`numpy.save` and `numpy.load`:** If your data is *exclusively* NumPy arrays, `numpy.save` and `numpy.load` can be significantly faster than pickle. You'd need to save each array individually and keep track of the keys, or consolidate everything into a single array. This eliminates the Python object construction overhead of pickle, at the cost of restructuring your data loading (a sketch follows this list).

   * **HDF5 (Hierarchical Data Format):** HDF5 is a binary format designed for storing large, heterogeneous datasets; libraries like `h5py` provide Python bindings. HDF5 lets you store your dictionary of NumPy arrays in a single file, with efficient read/write access to individual arrays or slices. This is a very good option if you need more complex data storage and access patterns in the future (a sketch appears at the end of this answer).

   * **Arrow/Parquet:** If you're dealing with tabular data and performance is paramount, consider Apache Arrow and Parquet. These are columnar formats optimized for analytics and fast data access, especially suitable if you need to load specific columns/features from your data. Again, restructuring your data storage would be necessary.
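As a concrete sketch of the NumPy route (paths and function names here are illustrative; this uses a plain stacked array plus a separate key list rather than a structured array):

```python
import pickle

import numpy as np

def convert(pickle_path: str, npy_path: str, keys_path: str) -> None:
    # One-time conversion: stack all (200, 23) arrays into a single
    # (N, 200, 23) array and save the key order separately.
    with open(pickle_path, "rb") as handle:
        data = pickle.load(handle)
    keys = list(data.keys())
    np.save(npy_path, np.stack([data[k] for k in keys]))
    with open(keys_path, "wb") as handle:
        pickle.dump(keys, handle)

def load(npy_path: str, keys_path: str):
    # np.load with mmap_mode returns almost immediately; pages are
    # read from disk lazily, on first access.
    arrays = np.load(npy_path, mmap_mode="r")
    with open(keys_path, "rb") as handle:
        keys = pickle.load(handle)
    index = {key: i for i, key in enumerate(keys)}
    return arrays, index  # arrays[index[key]] is the (200, 23) array
```

With `mmap_mode="r"` the "load" is essentially just opening the file, so the per-epoch load/unload cost should largely disappear; it's similar in spirit to the `mmap` solution above, but with no pickle deserialization of the array data at all.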
4. **Address Space Limits (Less Likely, but Possible):**

   On 32-bit systems, address space limits can become a problem. Even on 64-bit systems, it's *possible* that the Python process is running in a way that limits its address space (although this is less common). Ensure you're using a 64-bit Python distribution; you can verify this by checking `sys.maxsize` in your interpreter. If it's a very large number (2**63 - 1), you're running 64-bit Python.

5. **Garbage Collection Tuning (Least Likely):**

   While `gc.collect()` didn't help, you could experiment with more granular control over the garbage collector; see the `gc` module documentation for details. However, this is unlikely to be the primary cause of the slowdown.

**Revised Recommendation and Justification**

Given the information, the *best* answer is to use `mmap`. Here's why:

* **Avoids Repeated Reads:** `mmap` avoids re-reading the file from disk on every load by mapping it into memory and letting the OS handle caching.
* **Leverages OS Caching:** The OS is typically very efficient at managing file caches.
* **Minimal Code Changes:** The code changes required to use `mmap` are small.
* **Read-Only Optimization:** You're only reading the data, which makes `mmap` simpler to use.
* **Addresses the Root Cause:** The issue is the repeated loading and unloading, and `mmap` is a direct response to that problem.

If `mmap` *doesn't* completely solve the problem (unlikely but possible), then move on to investigating the I/O optimizations (SSD configuration, defragmentation) and the alternative serialization formats (NumPy save/load, HDF5). The other solutions are more complex and less likely to yield significant improvements in your specific scenario.
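For completeness, here's roughly what the HDF5 route could look like with `h5py` (one dataset per key; the paths and names are illustrative):

```python
import h5py

def convert_to_hdf5(data: dict, h5_path: str) -> None:
    # Sketch: store one HDF5 dataset per dictionary key.
    with h5py.File(h5_path, "w") as f:
        for key, array in data.items():
            f.create_dataset(key, data=array)

# Reads are lazy: opening the file is cheap, and an array is only pulled
# from disk when you index into its dataset.
with h5py.File("D:/data/batched/0.h5", "r") as f:
    some_key = next(iter(f.keys()))
    array = f[some_key][:]  # materializes one (200, 23) ndarray
```

One caveat: with ~670,000 keys, the per-dataset overhead in HDF5 is not negligible, so grouping the arrays into fewer, larger datasets (as in the NumPy sketch above) may end up faster.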