Iterable Membership Check

The Setup#

I had built a wrapper for interfacing with several of our databases at work. The database connection came from another library, so my wrappers took in a database connection, and any of the other data needed to run the command. It wasn’t the cleanest connection interface, but it worked for our purposes and was fast to get going.

The problem came when we had a new use case, where the connection to the database now had much higher latency and thus was much slower. Some of the users expressed to me that they already were caching the records in the database to get around this, but writes were still very slow. I quickly wrote a new wrapper for writing to the database that would take the cached data and the data to write as inputs. If the record wasn’t in the cache, it would write it to the database and inform the user. Quick and simple and got the job done, until I got reports from a user several weeks later of TypeError exceptions.

Here’s a stripped down version of what the code was doing. Note that in my recreated example I don’t have a DB connection and am not doing any logic if the record doesn’t exist. It was also more generalized for taking different types of database records, which were stored in dataclasses. This example only has one type of DB record. Both of these simplifications make the sample easier to read and also remove the extraneous logic and code.

from collections.abc import Iterable
from dataclasses import dataclass


@dataclass
class DBRecord:
    name: str
    age: int


def store_record_in_db_if_new(current_rows: Iterable[DBRecord], row: DBRecord) -> bool:
    """
    Given a list of the current rows, tests if the input row is already present.

    If row is already present, return False, indicated no changes to the DB.

    If row is not present, add to the database and return True.
    """
    if row in current_rows:
        return False

    # Value is not present in the current rows, so add to DB
    ############################################
    # This is where there would be a DB insert #
    ############################################
    return True

I then had a quick set of test cases that I wrote up:

###########
# Testing #
###########
class TestStoreRecordInDBIfNew(unittest.TestCase):
    def test_empty_list_adds_record(self) -> None:
        current_rows: list[DBRecord] = []
        row: DBRecord = DBRecord("Arthur Dent", 35)
        self.assertTrue(store_record_in_db_if_new(current_rows, row))

    def test_list_without_match_adds_record(self) -> None:
        current_rows: list[DBRecord] = [DBRecord("Ford Prefect", 200)]
        row: DBRecord = DBRecord("Arthur Dent", 35)
        self.assertTrue(store_record_in_db_if_new(current_rows, row))

    def test_list_with_match_does_not_add_record(self) -> None:
        current_rows: list[DBRecord] = [DBRecord("Arthur Dent", 35)]
        row: DBRecord = DBRecord("Arthur Dent", 35)
        self.assertFalse(store_record_in_db_if_new(current_rows, row))

    def test_empty_dict_adds_record(self):
        # NOTE: This test will fail, showing our error case
        current_rows: dict[DBRecord, str] = {}
        row: DBRecord = DBRecord("Arthur Dent", 35)
        self.assertTrue(store_record_in_db_if_new(current_rows, row))

Running this code, we get that all the test cases with the lists pass, but we get an error with the empty dictionary input. My static checker says that I have no errors though and the user calling my code sees no type check errors when calling my library. Clearly, there is something strange going on.

This code is in the original_function.py on GitHub.

The Failure Mode#

While most of my users were caching the database, this user had a few use cases where they didn’t have any cached data. As such, they were just calling my function with an empty dictionary for the list of current rows. This was yielding the TypeError exception. The full error is below:

======================================================================
ERROR: test_empty_dict_adds_record (__main__.TestStoreRecordInDBIfNew.test_empty_dict_adds_record)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/lon/Coding/puzzling_python/iterable_membership_check/original_function.py", line 58, in test_empty_dict_adds_record
    self.assertTrue(store_record_in_db_if_new(current_rows, row))
                    ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
  File "/Users/lon/Coding/puzzling_python/iterable_membership_check/original_function.py", line 25, in store_record_in_db_if_new
    if row in current_rows:
       ^^^^^^^^^^^^^^^^^^^
TypeError: unhashable type: 'DBRecord'

This clearly shows that the error is that our DBRecord is not hashable, which is then erroring out when it attempts to check if the value is in the dictionary.

What is `Iterable`?#

The collections.abc library is used provide abstract base classes (ABCs) to Python. This is very helpful for Python typing, since this allows users to use a variety of inputs that have the required methods, but may be different types. As in this example, you can iterate over both a list and a dictionary. Using abstract classes as the types is preferable for the inputs to functions since it also allows inheritance to work properly. For example, if the input is typed as a list, it must be a list, even something that inherits list will throw a type error.

Looking at the documentation of ABCs offered, the only abstract method that is required for an iterable is __iter__. From my perspective, this worked great for my function because the in operator can just iterate over input, see if the value exists and return the result I expect. My test cases with the list seemed to validate this behavior, but I had overlooked a more subtle behavior of the in operator.

How does `in` actually behave?#

If I had looked at the membership test operations documentation, I would have found that the in and not in operators actually have multiple methods to check for existence. While I thought __iter__ would be how in would verify the membership, it actually first checks if the class has a __contains__ method first. This is because there may be a more efficient method for determining membership than a simple iteration. Interestingly, if a function doesn’t have __iter__, it will default back to using __getitem__ until it goes off the end of the object as a simplistic iterator.

Since the Iterable ABC only requires the existence of __iter__, but allows other methods to be defined, you do not know which way in will be called when using that typing definition. Thus, our code was now calling the __contains__ function when a dictionary was passed in.

Why does this fail type checking?#

Since type checking is only verifying that the __iter__ method exists, a dictionary is a valid input, as long as it is holding DBRecord objects.

The next logical change would be to use something like a Collection from the collections.abc, since this requires a __contains__ to be present. Therefore, our type checking would then throw an error since it would know to check that __contains__ is valid. Pyright (and to my knowledge other Python type checkers) won’t be able to do this deep analysis about if an object is hashable or not. Therefore, this appears as a runtime error without a way to check it during static type checks.

Possible Solutions#

This section proposes a series of solutions. Depending on the application, some of these solutions may or may not be viable.

Using a different input type#

The easiest solution is to use a different input type to our function that eliminates the set (and other hashing classes like dict). The way I quickly fixed it was by using a Sequence instead. While this doesn’t eliminate there being a Sequence that uses hashing for __contains__, that is not the case for the base classes. Sets and dictionaries are not sequences. Our function definition is now simply:

def store_record_in_db_if_new(current_rows: Sequence[DBRecord], row: DBRecord) -> bool:

Another option is using a list. This is a stronger level of safety, since we know a list doesn’t have issues with hashing for __contains__ (unless Python changes this implementation). But it does mean that a user cannot call the function with a variety of valid inputs like a dict.keys() call, or a tuple. Thus, I avoided this option, but in a case where stronger type safety is favored over flexibility it may be a preferred solution.

def store_record_in_db_if_new(current_rows: list[DBRecord], row: DBRecord) -> bool:

To keep the blog short, you can look at the change_input_typing.py on GitHub for a full working function and test cases.

Changing the dataclass object#

Another possible solution is to make the dataclass object hashable. Looking through the dataclass documentation, there is the information about frozen instances. If the dataclass decorator is called with a frozen=True parameter, the object will not be able to be modified with setting attributes. In the default configuration, just setting this parameter will now let the dataclass generate a __hash__ function.

There are some caveats to when this __hash__ function is generated. First, all the items within the dataclass must be immutable for this to work. Using a list, or other mutable object will cause the same unhashable runtime errors we are trying to fix. Also, the user can override __eq__, __hash__, or set the unsafe_hash parameter. All of this will change the behavior from what I have specified above.

In this sample case, none of these caveats are applicable so we can simply use the frozen=True for our dataclass:

@dataclass(frozen=True)
class FrozenDBRecord:
    name: str
    age: int

For my actual code, I wasn’t able to do this because the objects had mutable data types, and needed to be mutable. Therefore, I opted for the earlier solutions.

Creating a new set of test cases for inputs with dictionaries validates that this works as intended. If you want to see samples of the test cases, see the frozen_dataclass.py on GitHub.

Conclusions and lessons learned#

Type checking in not infallible. Without understanding how some of the more complex double underscore (sometimes called dunder) methods work and are called, the code may not behave as originally thought. Even test cases don’t work if you aren’t testing the failing test cases. So even though Python is supposedly an “easy” language, there are subtle things that can bite you.

This also brings up the debate on whether or not using the abstract types is best for functions. I still prefer it for the ease to interface with different modules or use inheritance. Part of this also depends on the criticality of your infrastructure. While my system was important, if it went down it only impacted my org and was on a small number of machines, meaning a fix could be rolled out quickly.

Overall my fix to the problem was very fast, where I just changed an input type and used type checking to fix the broken call points. But it lead to a much deeper dive to truly understand how some of the hidden methods and operators function in Python.