In python’s standard library, we have the following sequence types:
list. Heterogenous and mutable.tuple. Heterogenous and immutable.collections.deque. Heterogenous and mutable.str. Flat and immutable.bytes. Flat and immutable.bytearray. Flat and mutable.memoryview. Flat and mutable.array.array. Flat and mutable.A data structure is flat/heterogenous if it can only hold the same type or hold different types, and is mutable/immutable if its contents can be modified.
Listcomps are fairly simple, so I will just give a few examples of them here before we jump to genexps.
>>> letters = ['a','b','c']
>>> numbers = [1,2,3]
>>> product = [(letter, number) for letter in letters for number in numbers]
>>> product
[('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 2), ('b', 3), ('c', 1), ('c', 2), ('c', 3)]
>>> letters = ['a','b','c']
>>> filtered_letters = [letter for letter in letters if letter != 'c']
>>> filtered_letters
['a', 'b']
Generator expressions can be done to save space in the case that an entire list is not needed. We replace listcomp with genexp in the above cartesian product example:
>>> letters = ['a','b','c']
>>> numbers = [1,2,3]
>>> for element in ('%s %s' % (l, n) for l in letters for n in numbers):
... print(element)
...
a 1
a 2
a 3
b 1
b 2
b 3
c 1
c 2
c 3
Here, we did not create a cartesian product list and scan it to print. This is clearly more efficient.
tupleTuples are immutable lists which serve as records with no field names. In particular, the position of the item in the tuple may have a semantic meaning. We unpack a tuple like so:
>>> coordinates = (1,2)
>>> _, longitude = coordinates
>>> longitude
2
Another way to unpack is by using the star operator:
>>> coordinates = (1,2)
>>> f = lambda x, y: x+y
>>> f(*coordinates)
3
You can use these two in mixture:
>>> long_tup = (1,2,3,4,5,6,7)
>>> x,y,*z = long_tup
>>> x
1
>>> y
2
>>> z
[3, 4, 5, 6, 7]
namedtuplenamedtuples are a part of the collections package which basically, as the name implies, have names to each field. Just as we create tuples as nameless records, we have namedtuples for records with names. These should be used more often:
>>> from collections import namedtuple
>>> Snek = namedtuple('Snake', ['name', 'length'])
>>> an = Snek('anaconda', '4.511')
>>> py = Snek('python', '3.7')
>>> py
Snake(name='python', length='3.7')
>>> an
Snake(name='anaconda', length='4.511')
>>> py[1]
'3.7'
You can modify portions of a mutable sequence using slice, i.e. add, delete, etc:
>>> l = [1,2,3]
>>> l[1:1] = ['a','b','c']
>>> l
[1, 'a', 'b', 'c', 2, 3]
>>> l = [1,2,3]
>>> del l[1:]
>>> l
[1]
When you call seq[start:stop:step], python calls seq.__getitem__(slice(start, stop, step)), where slice is actually an object. Thus, you can actually create named constant slices for better readability, like so:
>>> NAME = slice(0,4)
>>> FIRST = slice(4,10)
>>> LAST = slice(10, None)
>>> items
['a 123 456', 'b 23 4567']
>>> for item in items:
... print("Name: {}, First: {}, Last: {}".format(item[NAME].strip(), item[FIRST].strip(), item[LAST].strip()))
...
Name: a, First: 123, Last: 456
Name: b, First: 23, Last: 4567
This makes your functions less “magic” and more readable for future developers.
When you use numpy’s multidimensional array class, np.ndarray, how does it handle multiple arguments?
When you call a[i,j], you are actually calling a.__getitem__((i, j)), so that the arguments inside of [] is packaged as a tuple. numpy defined custom __getitem__ so it can accept tuples like above, unlike built-in types.
When you use a[i, ...], it is another multi-argument slice, equivalent to a[i, :, :, :] if it was a 4-dimensional matrix. The ... is considered a special keyword, an alias to Ellipsis object, from the ellipsis class. Think of Ellipsis as True or False, and its class ellipsis like bool.
+, * Operators With Sequences*Never do this: [[]] * 3. When you evaluate it at first, you’ll see [[],[],[]]. Innocent looking enough, but they are references to the same list. Do this instead: [[] for _ in range(3)], so that each object is constructed individually.
__iadd__ and __imul__ With Immutable ObjectsWhen you try to augment a sequence using += or *=, it invokes the magic functions __iadd__ and __imul__. If the object does not have these functions, it falls back to __add__ and __mul__. For immutable objects like tuples, this sometimes spells trouble:
>>> l = (1,2,3)
>>> id(l)
4510772608
>>> l += (4,5,6)
>>> l
(1, 2, 3, 4, 5, 6)
>>> id(l)
4510680840
These are not the same objects. What’s happening underneath the hood is equivalent to l = l + (4,5,6), which means we created a new tuple! To be precise, the tuple container is replaced while the references to objects inside are the same before and after.
>>> t = ([],)
>>> t[0] += [1]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
>>> t
([1],)
Wait… So we modified the list inside the tuple, but it also gave us an error? Why is this true? Because this is not an atomic operation. Specifically, it looks like this:
>>> temp_t = t
>>> temp_idx = 0
>>> mutable_item = temp_t[temp_idx]
>>> mutable_item += [1]
>>> temp_t[temp_idx] = mutable_item
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
You can also disassemble python code and view the bytecode to see a similar process.
bisect For Sorted Sequencesbisect()There are 2 bisect functions, bisect_left and bisect_right. By default, bisect.bisect is an alias for bisect.bisect_right. These functions are basically binary searches for sequences with generic ordered types. The difference between the two is subtle - when you try to search an object and the object itself is present in the sequence, bisect_left returns the element’s index, and bisect_right returns the index plus one:
>>> from bisect import bisect_right, bisect_left
>>> l = [1,2,3]
>>> bisect_right(l, 2)
2
>>> bisect_left(l, 2)
1
One application is discretization. For example, from numerical scores on an exam to one’s final grade, which is in discrete intervals of F, D, C, B, A. Or, waist size to shirt sizes:
>>> def shirt_size(size, cutoff=[30,60], sizes=['Small', 'Medium', 'Large']):
... return sizes[bisect_right(cutoff, size)]
...
>>> shirt_size(20)
'Small'
>>> shirt_size(30)
'Medium'
>>> shirt_size(50)
'Medium'
>>> shirt_size(60)
'Large'
>>> shirt_size(100)
'Large'
Here, we use bisect_right because we would rather want the person to fit comfortably in a larger t-shirt in case their waist size was at the cutoff. If you want to look like you’re bigger than you actually are(like me), then you would use bisect_left.
insort()We use bisect.insort to add an element into a sorted sequence:
>>> import bisect
>>> l = [1,3,5,7]
>>> for i in range(4):
... bisect.insort(l, i*2)
... print(l)
...
[0, 1, 3, 5, 7]
[0, 1, 2, 3, 5, 7]
[0, 1, 2, 3, 4, 5, 7]
[0, 1, 2, 3, 4, 5, 6, 7]
insort has extra keyword arguments to insert in a sorted subsequence, and insort left or right like bisect. Once again, this can be applied to more than just lists. Any ordered collection will do, like arrays!
arrayArrays are underrated and underused. They are so, so much faster than lists. lists are the default, but don’t be lazy when you need the performance. arrays contain the bit & byte level representation of primitive data types, and it’s basically a C-style array. To create an array:
>>> from array import array
>>> import numpy as np
>>> nums = array('d', (np.random.random() for _ in range(10)))
>>> for num in nums:
... print(num)
...
0.2076318634616442
0.5052559930909137
0.26556051714640794
0.3538229563850064
0.24394891007765362
0.829697244498978
0.8050680531932854
0.7540974416748557
0.5157377814111441
0.6025949390048687
>>> with open('temp.bin', 'wb') as f:
... nums.tofile(f)
...
And then we see the saved binary file:
ls -la temp.bin
-rw-r--r-- 1 dev staff 80 Jun 27 16:18 temp.bin
Nice. It’s small and compact in binary format. numpy arrays do something similar. bytes and bytearray are simply specific types of array that will be discussed in detail later.
memoryviewmemoryview is like a slice of an array. There is no copying, everything is referenced and is usually mutable. Here we change the content of the first double in an array by casting the memoryview to unsigned 8-bit ints, and modifying it:
>>> from array import array
>>> nums = array('d', [1,2,3])
>>> mv = memoryview(nums)
>>> mvc = mv.cast('B')
>>> mvc.tolist()
[0, 0, 0, 0, 0, 0, 240, 63, 0, 0, 0, 0, 0, 0, 0, 64, 0, 0, 0, 0, 0, 0, 8, 64]
>>> mvc[0] = 10
>>> nums
array('d', [1.0000000000000022, 2.0, 3.0])
Most of the time, you probably don’t want to use this though.
dequecollections supplies with us a deque container which is heterogenous. It has the standard append and pop operations. On top of it, it has a couple cool functions like rotate, extend and extendleft.
rotate() takes in a single integer as argument, and rotates the deque in that direction. Why no rotateleft like extend? Because you can supply a negative number to rotate in the other direction. By default, rotate(k) moves the i-th element to the (i+k) % N-th place in the deque, where N is the size of the container.
>>> from collections import deque
>>> dq = deque("abcdefg", maxlen=4)
>>> dq
deque(['d', 'e', 'f', 'g'], maxlen=4)
>>> dq.rotate(2)
>>> dq
deque(['f', 'g', 'd', 'e'], maxlen=4)
>>> dq.rotate(-1)
>>> dq
deque(['g', 'd', 'e', 'f'], maxlen=4)
>>> dq.extend("abc")
>>> dq
deque(['f', 'a', 'b', 'c'], maxlen=4)
>>> dq.extendleft("gfe")
>>> dq
deque(['e', 'f', 'g', 'f'], maxlen=4)
There are a lot of different queue containers in python. You can technically use a deque as a queue itself too. However, for more complicated applications where asynchronicity is a key factor, insertion and popping from a queue may be tricky. Here, we use threadsafe queue library with its Queue, LifoQueue (which is literally a stack), and PriorityQueue. When you call pop on an empty default queue here, it will wait until an item has been inserted, rather than return an error message. multiprocessing and asyncio implements its own queues as well.