1 year ago
#384704
Xochozomatli
Python generator pipeline using all available memory
I'm working on a CLI application that searches through a disc image for byte strings that could be JPEGs. The core of the application is a pipeline of generators that opens the disc image file, reads blocks of data into a buffer, searches the buffer for JPEG-like byte strings, and saves the matches to the file system as .jpg files.
Since the pipeline is built exclusively out of generators, I expected memory usage to stay only slightly above the size of the buffer used to read the input file.
What actually happens is that the program starts running, devours all available RAM plus a sizable chunk of swap space, and is eventually killed. I've been reading and poking around trying to find the cause, but no luck after quite a while now, which tells me it's probably something obvious that I'm not seeing.
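For context, this is my mental model of how the pipeline should behave, in miniature (a toy sketch, not my real code; splitting on NUL bytes just stands in for the regex stage):

import io

def read_chunks(stream, size):
    # Yield the input in fixed-size chunks; only one chunk
    # should be alive at any moment.
    while True:
        chunk = stream.read(size)
        if not chunk:
            break
        yield chunk

def find_tokens(chunk):
    # Stand-in for the matching stage.
    yield from chunk.split(b"\x00")

fake_disk = io.BytesIO(b"abc\x00def\x00" * 1000)
for chunk in read_chunks(fake_disk, 64):
    for token in find_tokens(chunk):
        pass  # stand-in for the save-to-disk stage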
Here's some of the code, stripped down and concatenated, but still showing the same problem:
import os
import re
import sys
from typing import Iterator

JPG_HEADER_PREFIX = b"\xff\xd8"
JPG_EOF = b"\xff\xd9"
FTYPES = ['jpg']
FILE_BOUNDS = {"jpg": (JPG_HEADER_PREFIX, JPG_EOF)}
DEFAULT_FILE_NAME = "img4G.iso"
DEFAULT_FILE_TYPE = "jpg"
DEFAULT_BATCH_SIZE = 2**28  # 256 MB

dest_dir = "."

def buffer(filename: str, batch_size: int = DEFAULT_BATCH_SIZE) -> Iterator[bytes]:
    """
    Opens file "filename", reads bytes into a buffer, and yields them
    chunk by chunk.
    """
    with open(filename, 'rb') as f:
        while True:
            chunk = f.read(batch_size)
            if not chunk:
                break
            yield chunk

def lazy_match(chunk: bytes, file_type: str = DEFAULT_FILE_TYPE) -> Iterator[bytes]:
    """
    Takes a buffer-full of bytes, yields byte strings of the form
    "SOI....+EOI" with no intervening SOIs or EOIs.
    """
    header, eof = FILE_BOUNDS[file_type]
    # Tempered pattern: a header (SOI), then at least 1000 bytes, none of
    # which starts another SOI, matched lazily, then the closing EOI.
    file_pattern = b'%s(?:(?!%s)[\x00-\xff]){1000,}?%s' % (header, header, eof)
    matches = re.finditer(file_pattern, chunk)
    for m in matches:
        print("Size of m is: ", sys.getsizeof(m.group()))
        yield m.group()

def lazy_find_files(file: str = DEFAULT_FILE_NAME) -> Iterator[bytes]:
    for chunk in buffer(file):
        yield from lazy_match(chunk)

if __name__ == "__main__":
    from hashlib import md5
    import tracemalloc
    from pympler import muppy, summary

    tracemalloc.start(25)
    try:
        for f in lazy_find_files():
            dest_file = md5(f).hexdigest() + "." + DEFAULT_FILE_TYPE
            with open(os.path.join(dest_dir, dest_file), 'wb') as dest:
                dest.write(f)
    finally:
        mups = muppy.get_objects()
        snapshot = tracemalloc.take_snapshot()
        top_stats = snapshot.statistics('traceback')
        stat = top_stats[0]
        print("%s memory blocks: %.1f KiB" % (stat.count, stat.size / 1024))
        sumy = summary.summarize(mups)
        summary.print_(sumy)
Here's example pympler/tracemalloc output from a typical run (keyboard-interrupted):
4 memory blocks: 8341738.8 KiB
types | # objects | total size
============================ | =========== | ============
bytes | 108 | 256.01 MB
str | 15416 | 3.04 MB
dict | 4656 | 1.78 MB
code | 5592 | 966.24 KB
type | 934 | 754.97 KB
tuple | 4733 | 272.88 KB
wrapper_descriptor | 2231 | 156.87 KB
builtin_function_or_method | 1502 | 105.61 KB
set | 131 | 93.63 KB
list | 465 | 92.73 KB
method_descriptor | 1267 | 89.09 KB
weakref | 1260 | 88.59 KB
abc.ABCMeta | 87 | 85.66 KB
frozenset | 131 | 57.88 KB
getset_descriptor | 897 | 56.06 KB
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/.../pypenador/__main__.py", line 43, in <module>
    for f in lazy_find_files(input_file):
  File "/.../pypenador/scrounge.py", line 77, in lazy_find_files
    yield from lazy_match(chunk, file_type=DEFAULT_FILE_TYPE)
  File "/.../pypenador/scrounge.py", line 71, in lazy_match
    for m in matches:
KeyboardInterrupt
When printing tracemalloc's statistics by line number:
/home/.../example.py:36: size=10183 MiB, count=11, average=926 MiB
/home/.../example.py:22: size=256 MiB, count=1, average=256 MiB
where example.py:36 corresponds to the for-loop header in lazy_match:

for m in matches:

where matches = re.finditer(file_pattern, chunk), suggesting that the problem is related to reading from the finditer generator.
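In case it's useful for reproducing this, here's a self-contained way to exercise just the regex stage on synthetic in-memory data, with the file pipeline taken out of the picture (the fake-chunk construction and the scaled-down sizes are mine; resource is Unix-only):

import re
import resource  # ru_maxrss is KiB on Linux, bytes on macOS

SOI, EOI = b"\xff\xd8", b"\xff\xd9"
pattern = b"%s(?:(?!%s)[\x00-\xff]){1000,}?%s" % (SOI, SOI, EOI)

# One synthetic "chunk": filler bytes with a few fake JPEGs embedded.
# Sizes are scaled down here; my real batch size is 2**28.
fake_jpeg = SOI + b"\x00" * 5000 + EOI
chunk = (b"\xaa" * 2**20 + fake_jpeg) * 16

for m in re.finditer(pattern, chunk):
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print("match of %d bytes, ru_maxrss=%d" % (len(m.group()), peak))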
Thanks in advance.
python
regex
memory
generator
bytebuffer
0 Answers