Iterable Membership Check
The Setup#
I had built a wrapper for interfacing with several of our databases at work. The database connection came from another library, so my wrappers took in a database connection, and any of the other data needed to run the command. It wasn’t the cleanest connection interface, but it worked for our purposes and was fast to get going.
The problem came when we had a new use case, where the connection to the database now had much higher latency and thus was much slower. Some of the users expressed to me that they already were caching the records in the database to get around this, but writes were still very slow. I quickly wrote a new wrapper for writing to the database that would take the cached data and the data to write as inputs. If the record wasn’t in the cache, it would write it to the database and inform the user. Quick and simple and got the job done, until I got reports from a user several weeks later of TypeError
exceptions.
Here’s a stripped down version of what the code was doing. Note that in my recreated example I don’t have a DB connection and am not doing any logic if the record doesn’t exist. It was also more generalized for taking different types of database records, which were stored in dataclasses. This example only has one type of DB record. Both of these simplifications make the sample easier to read and also remove the extraneous logic and code.
from collections.abc import Iterable
from dataclasses import dataclass
@dataclass
class DBRecord:
name: str
age: int
def store_record_in_db_if_new(current_rows: Iterable[DBRecord], row: DBRecord) -> bool:
"""
Given a list of the current rows, tests if the input row is already present.
If row is already present, return False, indicated no changes to the DB.
If row is not present, add to the database and return True.
"""
if row in current_rows:
return False
# Value is not present in the current rows, so add to DB
############################################
# This is where there would be a DB insert #
############################################
return True
I then had a quick set of test cases that I wrote up. Here, I’m just run these as a script for purposes of the example:
###########
# Testing #
###########
@dataclass
class TestCase:
input_current_rows: list[DBRecord]
input_row: DBRecord
expected_result: bool
def main() -> None:
# Test cases to run
test_cases: list[TestCase] = [
# 1) Empty list should add record
TestCase([], DBRecord("Arthur Dent", 35), True),
# 2) List without match should add record
TestCase([DBRecord("Ford Prefect", 200)], DBRecord("Arthur Dent", 35), True),
# 3) List with match should not add record
TestCase([DBRecord("Arthur Dent", 35)], DBRecord("Arthur Dent", 35), False),
]
# All of the list test cases pass
for test_case in test_cases:
result: bool = store_record_in_db_if_new(
test_case.input_current_rows, test_case.input_row
)
if result != test_case.expected_result:
print(f"Test case ({test_case}) did not match expected result")
else:
print(f"Test case ({test_case}) passed!")
Running this code, we get that all the test cases pass. I also run my static type checker and see that I have no errors in my code. And the user calling my code sees no type check errors when calling my library. Clearly, there is something strange going on.
This code is in the original_function.py
on GitHub.
The Failure Mode#
While most of my users were caching the database, this user had a few use cases where they didn’t have any cached data. As such, they were just calling my function with an empty dictionary for the list of current rows. This was yielding the TypeError
exception. The full error is below:
Traceback (most recent call last):
File "/Users/lon/Coding/puzzling_python/iterable_contains/original_function.py", line 78, in <module>
main()
~~~~^^
File "/Users/lon/Coding/puzzling_python/iterable_contains/original_function.py", line 67, in main
result: bool = store_record_in_db_if_new({}, DBRecord("Arthur Dent", 35))
~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/lon/Coding/puzzling_python/iterable_contains/original_function.py", line 24, in store_record_in_db_if_new
if row in current_rows:
^^^^^^^^^^^^^^^^^^^
TypeError: unhashable type: 'DBRecord'
This clearly shows that the error is that our DBRecord
is not hashable, which is then erroring out when it attempts to check if the value is in the dictionary.
What is Iterable
?#
The collections.abc
library is used provide abstract base classes (ABCs) to Python. This is very helpful for Python typing, since this allows users to use a variety of inputs that have the required methods, but may be different types. As in this example, you can iterate over both a list and a dictionary. Using abstract classes as the types is preferable for the inputs to functions since it also allows inheritance to work properly. For example, if the input is typed as a list, it must be a list, even something that inherits list will throw a type error.
Looking at the documentation of ABCs offered, the only abstract method that is required for an iterable is __iter__
. From my perspective, this worked great for my function because the in
operator can just iterate over input, see if the value exists and return the result I expect. My test cases with the list seemed to validate this behavior, but I had overlooked a more subtle behavior of the in
operator.
How does in
actually behave?#
If I had looked at the Membership test operations documentation, I would have found that the in
and not in
operators actually have multiple methods to check for existence. While I thought __iter__
would be how in
would verify the membership, it actually first checks if the class has a __contains__
method first. This is because there may be a more efficient method for determining membership than a simple iteration. Interestingly, if a function doesn’t have __iter__
, it will default back to using __getitem__
until it goes off the end of the object as a simplistic iterator.
Since the Iterable
ABC only requires the existence of __iter__
, but allows other methods to be defined, you do not know which way in
will be called when using that typing definition. Thus, our code was now calling the __contains__
function when a dictionary was passed in.
Why does this fail type checking?#
Since type checking is only verifying that the __iter__
method exists, a dictionary is a valid input, as long as it is holding DBRecord
objects.
The next logical change would be to use something like a Collection
from the collections.abc
, since this requires a __contains__
to be present. Therefore, our type checking would then throw an error since it would know to check that __contains__
is valid. Pyright (and to my knowledge other Python type checkers) won’t be able to do this deep analysis about if an object is hashable or not. Therefore, this appears as a runtime error without a way to check it during static type checks.
Possible Solutions#
This section proposes a series of solutions. Depending on the application, some of these solutions may or may not be viable.
Using a different input type#
The easiest solution is to use a different input type to our function that eliminates the set
(and other hashing classes like dict
). The way I quickly fixed it was by using a Sequence
instead. While this doesn’t eliminate there being a Sequence
that uses hashing for __contains__
, that is not the case for the base classes. Sets and dictionaries are not sequences. Our function definition is now simply:
def store_record_in_db_if_new(current_rows: Sequence[DBRecord], row: DBRecord) -> bool:
Another option is using a list
. This is a stronger level of safety, since we know a list doesn’t have issues with hashing for __contains__
(unless Python changes this implementation). But it does mean that a user cannot call the function with a variety of valid inputs like a dict.keys()
call, or a tuple
. Thus, I avoided this option, but in a case where stronger type safety is favored over flexibility it may be a preferred solution.
def store_record_in_db_if_new(current_rows: list[DBRecord], row: DBRecord) -> bool:
To keep the blog short, you can look at the change_input_typing.py
on GitHub for a full working function and test cases.
Changing the dataclass object#
Another possible solution is to make the dataclass object hashable. Looking through the dataclass documentation, there is the information about frozen instances. If the dataclass decorator is called with a frozen=True
parameter, the object will not be able to be modified with setting attributes. In the default configuration, just setting this parameter will now let the dataclass generate a __hash__
function.
There are some caveats to when this __hash__
function is generated. First, all the items within the dataclass must be immutable for this to work. Using a list, or other mutable object will cause the same unhashable runtime errors we are trying to fix. Also, the user can override __eq__
, __hash__
, or set the unsafe_hash
parameter. All of this will change the behavior from what I have specified above.
In this sample case, none of these caveats are applicable so we can simply use the frozen=True
for our dataclass:
@dataclass(frozen=True)
class FrozenDBRecord:
name: str
age: int
For my actual code, I wasn’t able to do this because the objects had mutable data types, and needed to be mutable. Therefore, I opted for the earlier solutions.
Creating a new set of test cases for inputs with dictionaries validates that this works as intended. If you want to see samples of the test cases, see the frozen_dataclass.py
on GitHub.
Conclusions and lessons learned#
Type checking in not infallible. Without understanding how some of the more complex double underscore (sometimes called dunder) methods work and are called, the code may not behave as originally thought. Even test cases don’t work if you aren’t testing the failing test cases. So even though Python is supposedly an “easy” language, there are subtle things that can bite you.
This also brings up the debate on whether or not using the abstract types is best for functions. I still prefer it for the ease to interface with different modules or use inheritance. Part of this also depends on the criticality of your infrastructure. While my system was important, if it went down it only impacted my org and was on a small number of machines, meaning a fix could be rolled out quickly.
Overall my fix to the problem was very fast, where I just changed an input type and used type checking to fix the broken call points. But it lead to a much deeper dive to truly understand how some of the hidden methods and operators function in Python.