
Python generator pipeline using all available memory

I'm working on a CLI application that searches through a disc image for byte strings that could be JPEGs. The core of the application is a pipeline of generators that opens the disc image file, reads blocks of data into a buffer, searches for JPEG-like byte strings, and saves them to the file system as .jpg files.

Since the pipeline is built exclusively out of generators, I expected memory usage to stay only slightly above the size of the buffer used for reading the input file.

What actually happens is that it starts running, devours all available RAM plus a sizable chunk of swap, and is eventually killed. I've been reading and poking around to find the cause, but no luck after quite a while now, which tells me it's probably something obvious that I'm not seeing.

Here's the code, stripped down and concatenated, but showing the same problem:

import os
import re
import sys
from typing import Iterator

JPG_HEADER_PREFIX = b"\xff\xd8"
JPG_EOF = b"\xff\xd9"

FTYPES = ['jpg']
FILE_BOUNDS = {"jpg": (JPG_HEADER_PREFIX, JPG_EOF)}

DEFAULT_FILE_NAME = "img4G.iso"
DEFAULT_FILE_TYPE = "jpg"
DEFAULT_BATCH_SIZE = 2**28  # 256 MB
dest_dir = "."

def buffer(filename: str, batch_size: int = DEFAULT_BATCH_SIZE) -> Iterator[bytes]:
    """
        opens file "filename", reads bytes into buffer and yields bytes
    """
    with open(filename, 'rb') as f:
        while True:
            chunk = f.read(batch_size)
            if not chunk:
                break
            yield chunk

def lazy_match(chunk: bytes, file_type: str = DEFAULT_FILE_TYPE) -> Iterator[bytes]:
    """
        Takes buffer-full of bytes, yields byte strings
        of the form "SOI....+EOI" with no intervening SOIs or EOIs
    """
    header, eof = FILE_BOUNDS[file_type]
    # lazy repeat: at least 1000 non-SOI bytes, then stop at the first EOI
    file_pattern = b'%s(?:(?!%s)[\x00-\xff]){1000,}?%s' % (header, header, eof)
    matches = re.finditer(file_pattern, chunk)
    for m in matches:
        print("Size of m is: ", sys.getsizeof(m.group()))
        yield m.group()

def lazy_find_files(file: str = DEFAULT_FILE_NAME) -> Iterator[bytes]:
    for chunk in buffer(file):
        yield from lazy_match(chunk)


if __name__ == "__main__":
    from hashlib import md5
    import tracemalloc
    from pympler import muppy, summary

    tracemalloc.start(25)

    try:
        for f in lazy_find_files():
            dest_file = md5(f).hexdigest() + "." + DEFAULT_FILE_TYPE
            with open(os.path.join(dest_dir, dest_file), 'wb') as dest:
                dest.write(f)

    finally:
        mups = muppy.get_objects()
        snapshot = tracemalloc.take_snapshot()
        top_stats = snapshot.statistics('traceback')
        stat = top_stats[0]
        print("%s memory blocks: %.1f KiB" % (stat.count, stat.size / 1024))
        sumy = summary.summarize(mups)
        summary.print_(sumy)
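
To take the file I/O out of the picture, a minimal script along these lines should show whether finditer alone reproduces the growth (a sketch only; the 64 MB size and the zero-byte filler are arbitrary choices, and resource.getrusage is Unix-only, reporting KiB on Linux):

import re
import resource

SOI = b"\xff\xd8"
EOI = b"\xff\xd9"

def peak_rss_mib() -> float:
    # ru_maxrss is KiB on Linux (bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# One SOI marker followed by filler containing no EOI marker, so the
# lazy quantifier has to expand across the whole chunk before failing.
chunk = SOI + b"\x00" * (2**26)  # ~64 MB

pattern = b'%s(?:(?!%s)[\x00-\xff]){1000,}?%s' % (SOI, SOI, EOI)

print("peak RSS before: %.0f MiB" % peak_rss_mib())
for m in re.finditer(pattern, chunk):
    pass  # no matches expected; the cost is the failed search itself
print("peak RSS after:  %.0f MiB" % peak_rss_mib())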

Here's example pympler/tracemalloc output from a typical run (interrupted with Ctrl-C):

4 memory blocks: 8341738.8 KiB
                       types |   # objects |   total size
============================ | =========== | ============
                       bytes |         108 |    256.01 MB
                         str |       15416 |      3.04 MB
                        dict |        4656 |      1.78 MB
                        code |        5592 |    966.24 KB
                        type |         934 |    754.97 KB
                       tuple |        4733 |    272.88 KB
          wrapper_descriptor |        2231 |    156.87 KB
  builtin_function_or_method |        1502 |    105.61 KB
                         set |         131 |     93.63 KB
                        list |         465 |     92.73 KB
           method_descriptor |        1267 |     89.09 KB
                     weakref |        1260 |     88.59 KB
                 abc.ABCMeta |          87 |     85.66 KB
                   frozenset |         131 |     57.88 KB
           getset_descriptor |         897 |     56.06 KB

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/.../pypenador/__main__.py", line 43, in <module>
    for f in lazy_find_files(input_file):
  File "/.../pypenador/scrounge.py", line 77, in lazy_find_files
    yield from lazy_match(chunk, file_type=DEFAULT_FILE_TYPE)
  File "/.../pypenador/scrounge.py", line 71, in lazy_match
    for m in matches:
KeyboardInterrupt

When printing tracemalloc's statistics by line number:

/home/.../example.py:36: size=10183 MiB, count=11, average=926 MiB
/home/.../example.py:22: size=256 MiB, count=1, average=256 MiB
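
That by-line view comes from switching the statistics key from 'traceback' to 'lineno', e.g.:

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('lineno')[:2]:
    print(stat)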

where example.py:36 corresponds to the for-loop header in lazy_match:

for m in matches:

with matches = re.finditer(file_pattern, chunk), suggesting that the problem is related to consuming the finditer iterator.
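
For comparison, I've sketched a plain bytes.find scan over the same markers, which keeps no per-byte matching state; this is a hypothetical replacement and I haven't verified it agrees with the regex on every edge case:

from typing import Iterator

def scan_matches(chunk: bytes,
                 header: bytes = b"\xff\xd8",
                 eof: bytes = b"\xff\xd9",
                 min_body: int = 1000) -> Iterator[bytes]:
    """Yield header...eof spans with no intervening header, like the regex."""
    start = chunk.find(header)
    while start != -1:
        next_soi = chunk.find(header, start + len(header))
        end = chunk.find(eof, start + len(header) + min_body)
        # An SOI before the EOI invalidates this span, mirroring the
        # (?!SOI) lookahead: restart the scan from that SOI instead.
        if next_soi != -1 and (end == -1 or next_soi < end):
            start = next_soi
            continue
        if end == -1:
            break
        yield chunk[start:end + len(eof)]
        start = chunk.find(header, end + len(eof))

Even if that sidesteps the symptom, I'd still like to understand why the finditer version balloons.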

Thanks in advance.

Tags: python, regex, memory, generator, bytebuffer
