More Detail =========== How does iteration work? ------------------------ Calling iter() on a tube, either explicitly, or implicitly (by a call to list() or for-loop) causes a number of things to happen: #. pytubes recurses over each tube, looking at its inputs to build directed dependency graph of each step #. This DAG is then partitioned into chains (see `Concepts`) of tubes that are iterated together #. The chains, and their contents are sorted so that each iter's input are processed in the correct order. #. Each tube is then asked to create an iter, converting all arguments and inputs into their relevant C values #. The resulting chain of Iters are given to a cython Iter wrapper object that has a ``__next__()`` method that produces python values for each iteration. A lot of the cost of loading data using pure python is typically centered around function call overhead and allocating/copying object data. Pytubes tackles these bottlenecks by using a number of strategies: - iterator hot-loops are pure c++ function calls - zero-copy views onto array data - strict epoch-based lifetime rules avoid reference counting or GC during iteration - where possible, zero allocations during iteration - avoiding creating python objects where possible These optimizations lead to significant performance improvements over pure python, despite offering complex loading functionality. Concepts -------- The idea of having an efficient c-based data loader library is quite straight-forward but the detail is quite subtle. The main issues are: - Python's pure dynamic typing is hard to reconcile with type information needed for c-level performant code - To be useful, data loaders have to handle a verywide variety of use-cases, which can easily lead to quadratic code complexity without careful abstraction management To implement a natural feeling interface to the library, the python-facing side tries to feel pythonic and generally tries to avoid deling with types explicitly. By the time the ``Tube`` is converted to an ``Iter``, lots of type information has been resolved, and optimally connected. To achieve this while keeping the implementation managable, pytubes internally has a number of concepts: Tube: A tube describes a number of steps to load data. Tubes typically refer to parent tubes recursively, and store any parameters that configure the behaviour of the resulting iterator. By themselves, they are pure cython classes, and do not deal with processing the actual data directly. Instead, calling :func:`iter` on a tube causes it to generate an Iter object which is a recursive C++ class that performs the actual data processing Iter: A set of C++ classes that implement the data processing implementation. They are entirely managed by cython wrappers and the Tube interface, so knowledge of them is not required for normal usage of the library. Internally, ``Iter`` s expose two methods: - ``get_slots()`` which returns a vector of `SlotPointers` (glorified, type-checked c-pointers) that allow other Iterators to consume the data produced by this iter. - ``next()`` causes the iter to process the next item, and update its internal slots so they reflect upstream changes Dtype: Each tube exposes a property ``Tube.dtype`` that defines the data type of each ``Slot`` in the iterator. Tubes can consume/produce multiple values per iteration, so Tube ``dtype`` s are tuples. Most tubes only have a single dtype entry, but some (e.g. :func:`Tube.enumerate`, :func:`Tube.multi`) can have many. Most tubes only look at/act on the first item of a dtype when getting values. The :func:`Tube.slot()` method can extract a single slot from a tube with many slots. Slot: Knowledge of slots is not required for normal usage of pytubes, but may be useful for a more in-depth understanding of how the library works. The underlying C++ Iter classes work by sharing pointers to iter-lifetime fixed memory locations. Each time ``next()`` is called on an Iter, the values at these memory location change, but the addresses do not. This can make iteration very efficient. The memory locations for each value being handled by an ``Iter`` is referred to both as a ``Slot``, and a ``SlotPointer`` in the code.