
Python Collections Module: The Forgotten Data Containers | by Diego Barba | Jun, 2022



If you are not using the container datatypes from the collections module, you should


When learning a programming language, it is not uncommon to develop our own hacks and tricks for specific tasks. As data scientists, we might end up with little recipes of our own making that let us manipulate data in particular ways. We tend to cling to these recipes in our treasured notebooks.

Sometimes these recipes use tools that are not the best suited for the task at hand. Still, we never bother looking for better tools because the job is trivial and our implementation works. Typical examples are implementing a counter or storing the last ten items of a live feed.

The odds are we use some combination of Python’s basic data structures to achieve our goal when there may be better-suited data containers. Python’s collections module is an example of this. The somewhat forgotten yet great module implements data containers that will simplify your life enormously if you give it a chance. Maybe a collections module container will replace one of those old recipes.

This story will review some of the most useful data containers from the collections module. For more info, check the official documentation.

The story is structured as follows:

  • defaultdict
  • deque
  • Counter
  • namedtuple
  • Final words

defaultdict

The first container on the list is my favorite container from the module, defaultdict. It will simplify your life in unimagined ways. This container is a dict subclass: when a missing key is accessed, its value is created by calling a default factory.

It is often the case that we want all the values of our dictionary to be lists, for example. It is painful and messy to initialize the lists for each key. Each time we want to do a list operation on a value, we have to check if the key exists; if not, initialize the list first, then proceed with the list operation. Major headache. Well, defaultdict solves this.

Here is an example of the list default factory:
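
A minimal sketch of such a snippet could look like this. The variable name d is an assumption; the key 'a' and the values 1 to 3 are taken from the output that follows:

```python
from collections import defaultdict

# list is the default factory: a missing key starts out as an empty list
d = defaultdict(list)

# no need to check whether 'a' exists before appending
d["a"].append(1)
d["a"].append(2)
d["a"].append(3)

print(d)
```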

defaultdict(<class 'list'>, {'a': [1, 2, 3]})

If we had not used defaultdict, we would have to do something like this:
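
As a rough sketch of that plain-dict recipe (the loop and variable names are assumptions made for illustration):

```python
d = {}

for value in (1, 2, 3):
    # check for the key and initialize the list by hand
    if "a" not in d:
        d["a"] = []
    d["a"].append(value)

print(d)
```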

{'a': [1, 2, 3]}

Looks familiar?

What about using sets as the default factory?
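
A possible sketch, again assuming the key 'a' and the values 1 to 3:

```python
from collections import defaultdict

# set is the default factory: a missing key starts out as an empty set
d = defaultdict(set)

d["a"].add(1)
d["a"].add(2)
d["a"].add(3)
d["a"].add(3)  # adding a duplicate has no effect, as with any set

print(d)
```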

defaultdict(<class 'set'>, {'a': {1, 2, 3}})

As you can see, all the keys hold values that are sets, and we can do set operations on them without having to initialize each one on its own.

Now we use int as the default factory; this is useful to count things:
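
A minimal sketch of counting with an int factory (the loop is an assumption, chosen so that the count of 'a' ends at 3):

```python
from collections import defaultdict

# int() returns 0, so every missing key starts counting from zero
d = defaultdict(int)

for _ in range(3):
    d["a"] += 1

print(d)
```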

defaultdict(<class 'int'>, {'a': 3})

Without defaultdict, we would have to do something like this each time we want to change the count:
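
Roughly, the plain-dict version needs an explicit check before every increment (a sketch with assumed names):

```python
d = {}

for _ in range(3):
    # initialize the count by hand the first time the key is seen
    if "a" not in d:
        d["a"] = 0
    d["a"] += 1

print(d)
```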

{'a': 3}

How not to love defaultdict? Next time you are about to use the plain old dict, evaluate whether defaultdict can better serve your purposes. If it does, it will simplify your life and code greatly.

deque

A double-ended queue, or deque, is a container that generalizes queues (FIFO) and stacks (LIFO). In many ways, this container is similar to a list; however, it supports fast appends and pops at both ends (approximately O(1)), whereas inserting or removing at the front of a list costs O(n). It also accepts an optional maximum length, something lists do not have.

Imagine a simple use case; we store tweets in real-time and want a container with the ten latest tweets. If we kept the tweets in a list, each time we appended one, we would have to check if the length was larger than ten; in that case, we would need to trim the list. With deques, that is no longer necessary; we can keep appending, and it will keep the predefined maximum length.

While deques can be compared to lists in some ways, they are much more; they can also be used as LIFO or FIFO queues. They are similar to the queues in the standard queue module (named Queue in Python 2). Those, however, are designed for communication between threads (locks, etc.), while deques are just a data structure.

As deque is not intended for thread intercommunication, it does not implement the put method, but implements append and extend. We can append from the right:
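
A sketch that would print the two lines below. The starting values and maxlen=3 are read off the output; that the False comes from a membership test for the evicted element is only a guess:

```python
from collections import deque

# a bounded deque: once full, appending on one side drops an item from the other
d = deque([1, 2, 3], maxlen=3)

d.append(4)  # 1 is discarded from the left to respect maxlen

print(d)
print(1 in d)  # the evicted element is no longer in the deque
```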

deque([2, 3, 4], maxlen=3)

False

or append from the left:
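
For instance, assuming we start from an empty, unbounded deque:

```python
from collections import deque

d = deque()

# each new item goes to the left end, so the order ends up reversed
d.appendleft(1)
d.appendleft(2)
d.appendleft(3)

print(d)
```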

deque([3, 2, 1])

We can make shallow copies and clear the deque:
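
A sketch under the assumption that the copy is taken before clearing (d_copy is a hypothetical name):

```python
from collections import deque

d = deque([3, 2, 1])

d_copy = d.copy()  # shallow copy: a new deque holding the same items
d.clear()          # empties the original only

print(d)
print(d_copy)
```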

deque([])

deque([3, 2, 1])

We can extend the deque with an iterable as if it were a list and count occurrences:
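
One way to arrive at the output below, assuming we start from deque([3, 2, 1]) and extend on both sides:

```python
from collections import deque

d = deque([3, 2, 1])

# extendleft adds items one by one on the left, so [5, 4] ends up as 4, 5
d.extendleft([5, 4])
# extend adds items on the right, like list.extend
d.extend([3, 4])

print(d)
print(d.count(3))  # 3 appears twice
print(d.count(1))  # 1 appears once
```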

deque([4, 5, 3, 2, 1, 3, 4])

2

1

We can get the index of elements in the whole deque or an index range:
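
A sketch with assumed arguments that reproduces the numbers below; index accepts optional start and stop positions, like list.index:

```python
from collections import deque

d = deque([4, 5, 3, 2, 1, 3, 4])

# first position of 3 and of 1 in the whole deque
print(d.index(3), d.index(1), sep=", ")

# first position of 3 when searching only positions 3 up to (not including) 6
print(d.index(3, 3, 6))
```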

2, 4

5

And, of course, we can pop elements; how could it be called a double-ended queue if we could not pop elements?
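
A sketch of popping from both ends (the names right and left are assumptions):

```python
from collections import deque

d = deque([4, 5, 3, 2, 1, 3, 4])

right = d.pop()  # removes and returns the rightmost item
print(right)
print(d)

left = d.popleft()  # removes and returns the leftmost item
print(left)
print(d)
```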

4

deque([4, 5, 3, 2, 1, 3])

4

deque([5, 3, 2, 1, 3])

We can remove elements:
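
Assuming the value removed is 3, a sketch could be:

```python
from collections import deque

d = deque([5, 3, 2, 1, 3])

# remove deletes only the first occurrence of the value
d.remove(3)

print(d)
```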

deque([5, 2, 1, 3])

We can reverse the deque in place:
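
A short sketch, continuing from the previous state of the deque:

```python
from collections import deque

d = deque([5, 2, 1, 3])

# reverse works in place and returns None
d.reverse()

print(d)
```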

deque([3, 1, 2, 5])

And very conveniently, we can also rotate the deque or shift it:
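
A sketch with assumed rotation steps that are consistent with the two outputs below; a positive argument rotates to the right, a negative one to the left:

```python
from collections import deque

d = deque([3, 1, 2, 5])

d.rotate(1)  # one step to the right: the last item moves to the front
print(d)

d.rotate(-2)  # two steps to the left: the first two items move to the end
print(d)
```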

deque([5, 3, 1, 2])

deque([1, 2, 5, 3])

The deque should be your go-to data structure when you want a queue and do not care about multithreading. Although deque's append and pop operations are thread-safe, the queues from the queue module are better suited for communication between threads.

Counter

Although we can use defaultdict with int as the default factory to count things, the collections module implements a counter in its own right. Counter can be created as an empty object and updated later:
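
A minimal sketch (the key 'a' is an assumption):

```python
from collections import Counter

c = Counter()

# looking up a missing key returns 0 instead of raising KeyError
print(c["a"])
```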

0

Notice that the count for keys not in the Counter is always zero.

We can increase each element’s count by directly updating the element or by calling update with an iterable (list, tuple, etc.). We can use the elements method to get all the elements in the Counter.
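
A sketch of both ways of updating, with keys and counts chosen to match the output below:

```python
from collections import Counter

c = Counter()

c["a"] += 1           # update a single element directly
c.update(["a", "b"])  # update from an iterable: counts are added

print(c["a"], c["b"])

# elements() repeats each key as many times as its count
print(list(c.elements()))
for element in c.elements():
    print(element)
```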

2 1

['a', 'a', 'b']

a

a

b

There are many other helpful methods in the object, such as ordering the most common elements, getting the total count, or, my personal favorite, subtracting counters:
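
A sketch of these methods, assuming the Counter from before holds 'a' twice and 'b' once. The total is computed with sum here; Python 3.10 and later also offer a total method:

```python
from collections import Counter

c = Counter(["a", "a", "b"])

print(c.most_common())   # all elements, most common first
print(c.most_common(1))  # only the single most common element
print(sum(c.values()))   # total count over all elements

# subtract works in place and keeps zero (or negative) counts
c.subtract(Counter(["a"]))
print(c)
c.subtract(Counter(["b"]))
print(c)
```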

[('a', 2), ('b', 1)]

[('a', 2)]

3

Counter({'a': 1, 'b': 1})

Counter({'a': 1, 'b': 0})

In the previous example, we started with an empty Counter. However, it is also possible to create a Counter from an iterable or a dictionary with the count for each key:
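
For example (c_iter and c_dict are hypothetical names):

```python
from collections import Counter

# from an iterable: every occurrence is counted
c_iter = Counter(["a", "a", "b"])
print(c_iter)

# from a dictionary that already maps each key to its count
c_dict = Counter({"a": 2, "b": 1})
print(c_dict)
```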

Counter({'a': 2, 'b': 1})

Counter({'a': 2, 'b': 1})

This counter implementation can save you significant headaches while implementing counters in less suitable data structures, like ordinary dictionaries or lists. Next time you need to count occurrences, give Counter from the collections module a spin.

namedtuple

Tuples in Python are great if we seek immutable data containers. They are fast and memory-efficient; what is not to love about them? Let me tell you: integer index data access is what I don’t love. And apparently, Python core developers do not love that either. There have been many iterations towards a tuple with name access. One such data container is the namedtuple.

To create a namedtuple, we first need to make the tuple class. We set the name of the class and the names of the attributes (elements, in the order we want the tuple to be indexed). Then we instantiate the class with the actual data:
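
A sketch matching the output below; the class name Person and the fields name and age are taken from that output, the variable p is an assumption:

```python
from collections import namedtuple

# first create the tuple class: class name, then field names in index order
Person = namedtuple("Person", ["name", "age"])

# then instantiate it with the actual data
p = Person(name="Diego", age=33)

print(p)
# every field is reachable by name and by integer index
print(p.name, p[0], sep=", ")
print(p.age, p[1], sep=", ")
```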

Person(name='Diego', age=33)

Diego, Diego

33, 33

While it is a good solution, I’m not completely sold. I think that the creation mechanism is somewhat bizarre. I would rather use NamedTuple from the typing module:
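
A sketch of the typed variant. The default of 0 for age is an assumption inferred from the output below, where only the name appears to have been supplied:

```python
from typing import NamedTuple


class Person(NamedTuple):
    # fields are declared with type annotations, in index order
    name: str
    age: int = 0  # assumed default value


p = Person("Diego")

print(p)
print(p.name, p[0], sep=", ")
print(p.age, p[1], sep=", ")
```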

Person(name='Diego', age=0)

Diego, Diego

0, 0

Check out this story if you want to know more about typed data structures in Python. It makes a thorough comparison of all the primary containers in Python, such as NamedTuple.

Final words

Next time you use Python’s basic data structures to implement a task that somehow feels like a hack, take a look at the collections module. You might be able to find the right tool for the job.

Don’t forget to check out the official documentation for more info.


