Remembering More Iterator Issues#

Because of a comment on my LinkedIn post about the iterable membership check blog post, I started testing more edge cases. One experiment looked at the difference between an Iterator and an Iterable when used as a function input. This reminded me that iterators can be modified by a function they are passed to, which can have dangerous consequences.

Generating Permutations#

Let’s take a look at a function that generates all the permutations of pairs from an iterable input, including repeated values, so an input of [0, 1, 2] yields (0, 0) as well. A quick check shows itertools.product with repeat=2 would be the easier way to do this, but this function is just an illustration. All of this code and the test cases are in generating_permutation_pairs.py on GitHub.

from collections.abc import Iterable


def generate_pairs_from_iterable[T](input_iterable: Iterable[T]) -> list[tuple[T, T]]:
    """
    Given an input iterable, returns pairs of all possible permutations.

    For example, an input of [0, 1, 2] returns:
    [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]

    Note that ordering of the output is not guaranteed.
    """
    output_list: list[tuple[T, T]] = []
    for val_1 in input_iterable:
        for val_2 in input_iterable:
            output_list.append((val_1, val_2))
    return output_list

Writing some unit tests shows this works as we would expect when a list is the input. Here are my tests:

class TestGeneratePairsFromIterable(unittest.TestCase):
    def test_empty_input(self) -> None:
        input_list: list[int] = []
        expected_output: list[tuple[int, int]] = []
        self.assertCountEqual(generate_pairs_from_iterable(input_list), expected_output)

    def test_single_item(self) -> None:
        input_list: list[int] = [0]
        expected_output: list[tuple[int, int]] = [(0, 0)]
        self.assertCountEqual(generate_pairs_from_iterable(input_list), expected_output)

    def test_two_items(self) -> None:
        input_list: list[int] = [0, 1]
        expected_output: list[tuple[int, int]] = [(0, 0), (0, 1), (1, 0), (1, 1)]
        self.assertCountEqual(generate_pairs_from_iterable(input_list), expected_output)

    def test_three_items(self) -> None:
        input_list: list[int] = [0, 1, 2]
        expected_output: list[tuple[int, int]] = [
            (0, 0),
            (1, 0),
            (2, 0),
            (0, 1),
            (1, 1),
            (2, 1),
            (0, 2),
            (1, 2),
            (2, 2),
        ]
        self.assertCountEqual(generate_pairs_from_iterable(input_list), expected_output)

As an aside, the assertCountEqual function is very poorly named: it does not just compare counts, it verifies that “first contains the same elements as second”, regardless of their order, per the documentation.
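
To see the behavior concretely, here is a minimal test method (hypothetical, not from the post’s test file) showing that the method ignores ordering but does respect duplicate counts:

    def test_assert_count_equal_behavior(self) -> None:
        # Passes: same elements with the same counts, order ignored
        self.assertCountEqual([1, 1, 2], [2, 1, 1])
        # Raises if the counts differ, even though the sets of values match
        with self.assertRaises(AssertionError):
            self.assertCountEqual([1, 2], [1, 1, 2])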

All of those tests pass, so my function seems to be working as expected. But what happens if we pass an iterator, instead of an iterable like a list, into our function?

Passing Iterators Into Functions#

Let’s look at a test with a single input:

    def test_single_item_iter(self) -> None:
        input_iter: Iterator[int] = iter([0])
        expected_output: list[tuple[int, int]] = [(0, 0)]
        self.assertCountEqual(generate_pairs_from_iterable(input_iter), expected_output)

This test fails, saying that (0, 0) is not present in the output. Let’s investigate.

When starting a for loop, Python internally calls the __iter__ function to get an iterator. For a concrete type (like a list or a dict), this returns a new iterator object. But __iter__ for an iterator object just returns that same iterator. For a list, this means two calls to iter do not return the same object:

>>> my_list = [0, 1, 2]
>>> iter(my_list) is iter(my_list)
False

But if we look at an iterator object, we will see that not only do both iter calls return the same object, but that object also matches the original.

>>> my_list = [0, 1, 2]
>>> my_list_iter = iter(my_list)
>>> iter(my_list_iter) is iter(my_list_iter)
True
>>> iter(my_list_iter) is my_list_iter
True

What this means is that we are iterating the same object in both our inner and outer loops. In the single-item test, the outer loop consumes the only value, so the inner loop immediately hits the end of the iteration and no pairs are ever appended.

To better visualize this, it may be easiest to think about how the for loop decomposes into a while loop:

# for loop implementation
for item in iterator_obj:
    pass

# while loop implementation; a sentinel is used to detect exhaustion, since
# a bare `while item := next(iterator_obj)` would also stop early on any
# falsy item, like 0
sentinel = object()
while (item := next(iterator_obj, sentinel)) is not sentinel:
    pass

Internally, a real for loop stops when next raises the StopIteration exception; the two-argument next(iterator_obj, sentinel) used above returns the sentinel instead of raising it. Either way, the outer loop calls next, and when the inner loop calls next again the iterator is already exhausted, so the inner loop breaks out immediately. Also, if you are unfamiliar with what := does (the “walrus operator”), there is some good info in the Python 3.8 what’s new for assignment expressions, or PEP 572 for full details.

Similarly, the two- and three-item cases will also fail. You can download generating_permutation_pairs.py on GitHub and run the file to inspect the other errors.
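
To make those failures concrete, here is what the function actually returns for each iterator input; the outer loop consumes one value and the inner loop drains the rest:

>>> generate_pairs_from_iterable(iter([0]))
[]
>>> generate_pairs_from_iterable(iter([0, 1]))
[(0, 1)]
>>> generate_pairs_from_iterable(iter([0, 1, 2]))
[(0, 1), (0, 2)]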

Practical Issues When Passing Iterators into Functions#

The most important thing to know is that if you pass an iterator into a function, you should consider that iterator invalid afterwards, unless you know the function’s internals (e.g., the type function does not touch it). For example, there are several gotchas that can happen when using iterators with the zip function.
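
As a quick illustration of that difference, type only inspects the object, while a consumer like list drains it:

>>> my_iter = iter([1, 2, 3])
>>> type(my_iter)
<class 'list_iterator'>
>>> list(my_iter)  # consumes the iterator
[1, 2, 3]
>>> list(my_iter)  # now exhausted
[]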

First, zip returns a lazy iterator (a zip object); it does not go through the input iterators until the zip object itself is iterated. It also holds references to the iterators that were passed in so it can consume them later. If you use those iterators after calling zip, you will actually change the results of the zip. For example, look at the following code:

# First, create a zip and print the results
iterator_1 = iter([1, 2, 3, 4])
iterator_2 = iter([-1, -2, -3, -4])
zip_result = zip(iterator_1, iterator_2)
list_result = list(zip_result)
print(f"Zip result directly after doing the zip: {list_result}")


# Do the same thing again, except call next on the first iterator after
# zip is called but before the result is turned into a list, thus
# advancing iterator_1 before it is consumed by the zip object
iterator_1 = iter([1, 2, 3, 4])
iterator_2 = iter([-1, -2, -3, -4])
zip_result: zip[tuple[int, int]] = zip(iterator_1, iterator_2)
_ = next(iterator_1)
list_result = list(zip_result)
print(f"Zip result if next is called on an iterator first: {list_result}")

Running this code produces the following output:

Zip result directly after doing the zip: [(1, -1), (2, -2), (3, -3), (4, -4)]
Zip result if next is called on an iterator first: [(2, -1), (3, -2), (4, -3)]

Since the first iterator was advanced once before the zip object was consumed, the first value is now 2 instead of 1. This means our zip result is now misaligned and only contains three records instead of four. For this reason, you should not touch the iterators after calling zip.

It seems logical that once the zip has been run to completion, reading the iterators would be safe. But there is another quirk of how zip works internally that can cause an issue.

Iterators are only required to implement the __next__ and __iter__ functions, as described in the iterator documentation. There is no way to peek at the next record, or even to determine whether there is another record, without advancing the iterator by calling next (see the quick demonstration below). This results in an edge case when the two iterators are different lengths: depending on whether the longer iterator is passed as the first or the second argument to zip, it will return different values from next after the zip is exhausted.
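
For example, even the common trick of calling next with a default value to check whether anything is left will consume a value when one exists:

>>> my_iter = iter([1, 2])
>>> sentinel = object()
>>> next(my_iter, sentinel) is sentinel  # "is there another record?"
False
>>> list(my_iter)  # the check itself consumed the 1
[2]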

To verify this, we can use zip twice, once with the longer iterator first and once with it second. An example of this in code is as follows:

# Create longer and shorter iterators
# and call zip with the longer iterator first
iterator_1 = iter([1, 2, 3, 4])
iterator_2 = iter([-1, -2])
zip_result = zip(iterator_1, iterator_2)
# Need to generate the list to consume the zip object
_ = list(zip_result)
print(f"When the longer iterator is passed first: {next(iterator_1)=}")

# Recreate the longer and shorter iterators
# This time call zip with the shorter iterator first
iterator_1 = iter([1, 2, 3, 4])
iterator_2 = iter([-1, -2])
zip_result = zip(iterator_2, iterator_1)
# Need to generate the list to consume the zip object
_ = list(zip_result)
print(f"When the longer iterator is passed second: {next(iterator_1)=}")

Running this code produces the following output:

When the longer iterator is passed first: next(iterator_1)=4
When the longer iterator is passed second: next(iterator_1)=3

This actually gives us some internal knowledge about how zip works: it calls next on the first argument before the second. Thus, if the first argument has more records, it gets advanced one extra time, since zip pulls from it before discovering that the second argument is at its end. zip and other functions make no promises about the state of the iterators after they are called, so you cannot rely on their values anymore.

Both of these examples are in using_zip_with_iterators.py on GitHub.
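
This behavior is easier to see in a simplified sketch of how a two-argument zip could work. This is my own hypothetical model, not CPython’s actual implementation:

from collections.abc import Iterable, Iterator


def my_zip[T, U](a: Iterable[T], b: Iterable[U]) -> Iterator[tuple[T, U]]:
    """A simplified model of two-argument zip."""
    iter_a, iter_b = iter(a), iter(b)
    while True:
        try:
            val_a = next(iter_a)  # the first argument is always advanced first
        except StopIteration:
            return
        try:
            val_b = next(iter_b)
        except StopIteration:
            # val_a was already pulled from iter_a and is now lost
            return
        yield (val_a, val_b)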

“Copying” an Iterator#

Directly copying an arbitrary iterator is not possible; as we saw with zip, an iterator cannot even report its size ahead of time without being consumed. Here are two ways you can effectively copy an iterator, each with its own advantages and disadvantages.

Manual Copying#

The first is to just call list, or some other function that iterates over the iterator and generates a concrete object. The main advantage is that it is easy, and you can then use copy.copy or copy.deepcopy as needed to create additional copies of the object. This also works for my example that iterates over the same input twice, since iter on a list returns a different iterator each time (see the demonstration at the end of this section). But this can come at the expense of a large amount of memory. As an example, I created a generator to yield the first n values of the Fibonacci sequence with the following:

from collections.abc import Generator


def fib_generator(n: int) -> Generator[int]:
    """
    Returns a generator for the first n Fibonacci values
    """
    if n >= 1:
        # If there is at least one return, the first value is 0
        yield 0

        # All other values we must count
        count: int = 1
        # By doing this weird logic, we get "1" twice to start
        cur_val: int = 0
        prev_val: int = 1
        while n > count:
            prev_val, cur_val = cur_val, prev_val + cur_val
            yield cur_val
            count += 1

In a main function, I then test with several sizes of n to see how large the generator object is compared to the list of the same data. Here is the code to do that:

import sys


def main() -> None:
    # Typing our reused variables
    fib_gen: Generator[int]
    fib_list: list[int]

    # Test with varying sizes of n to get size of the generator and the list
    for fib_n in [0, 10, 100, 1000]:
        fib_gen = fib_generator(fib_n)
        fib_list = list(fib_generator(fib_n))
        print(f"Size of generator for n={fib_n} is {sys.getsizeof(fib_gen)} bytes")
        print(f"Size of list for n={fib_n} is {sys.getsizeof(fib_list)} bytes")

This results in the following output:

Size of generator for n=0 is 216 bytes
Size of list for n=0 is 56 bytes
Size of generator for n=10 is 216 bytes
Size of list for n=10 is 184 bytes
Size of generator for n=100 is 216 bytes
Size of list for n=100 is 920 bytes
Size of generator for n=1000 is 216 bytes
Size of list for n=1000 is 8856 bytes

What we can see is that the generator always uses 216 bytes in this example, which for small inputs is actually less efficient than the list representation. The break-even point here is somewhere between 10 and 100, though this will vary depending on how much space the list preallocates. Above that, the list becomes significantly less efficient: by the time n is 100, it is almost 1KB, and almost 9KB when n=1000. By creating the list, we force the full computation to occur and end up with significantly larger objects.

This code is in generator_example.py on GitHub.
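
Putting the manual approach together with the original pair function: materialize the iterator once, then pass the list, since each for loop gets a fresh iterator from it.

>>> input_iter = iter([0, 1, 2])
>>> input_list = list(input_iter)  # input_iter is now exhausted; use the list
>>> generate_pairs_from_iterable(input_list)
[(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]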

Using itertools#

I could have avoided all of this by using itertools.product (though remember that an iterator input is still consumed by calling product). It turns out itertools has other very useful tools for handling iterators. Specifically, it has the tee function, which allows one to create multiple independent iterators from a single one.
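
For reference, the product-based version collapses to a one-liner, though product, like any consumer, still drains an iterator input:

>>> from itertools import product
>>> list(product([0, 1, 2], repeat=2))
[(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]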

The tee documentation does a better job than I ever could with all of the details. But there are a few key things to know.

First, the original iterator cannot be used again: advancing it bypasses tee’s buffer, so the tee’d iterators will never see those values (similar to how the zip inputs can’t be used afterwards). See the demonstration after these points.

Second, if one iterator will use most of the data before the other starts, it is faster to just use list. It also will not require significantly more memory, since tee would be buffering that data anyway.

Lastly, tee is not thread-safe and can raise a RuntimeError.
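
To demonstrate the first point, advancing the original iterator bypasses tee’s internal buffer entirely:

>>> from itertools import tee
>>> source = iter([1, 2, 3, 4])
>>> copy_a, copy_b = tee(source)
>>> next(source)  # advancing the original directly...
1
>>> list(copy_a)  # ...means the tee'd iterators never see the 1
[2, 3, 4]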

Assuming none of the above are deal breakers, tee will use less memory. It also provides the ability to do things like peek at the next value, which a plain iterator cannot do, by advancing one tee’d copy ahead of the other.
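
Here is a small sketch of peeking with tee; note that the original name is rebound, so the now-invalid source iterator cannot be reused by accident:

>>> from itertools import tee
>>> my_iter = iter([1, 2, 3])
>>> my_iter, lookahead = tee(my_iter)
>>> next(lookahead)  # peek at the next value
1
>>> list(my_iter)  # the main copy is unaffected; tee buffered the 1
[1, 2, 3]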

Conclusions#

One must be very cautious when iterators are passed as function arguments. Since they are passed by reference, the function can modify them.

The most important takeaway is that when an iterator is passed to a function, you should consider it invalid afterwards, unless you know the function does not modify it (e.g., type).

If you have to use an iterator in multiple places, either use list to create a concrete, copyable object, or itertools.tee to create multiple independent iterators. Just be mindful of the trade-offs of each and the possible increase in memory usage.