迭代器

在 Python 文档中，实现接口通常被称为遵守协议。因为 "弱类型" 和 "Duck Type" 的缘故，很多静态语言中繁复的模式被悄悄抹平。

迭代器

迭代器协议，仅需要 iter() 和 next() 两个方法。前者返回迭代器对象，后者依次返回数据，直到引发 StopIteration 异常结束。

最简单的做法是用内置函数 iter()，它返回常用类型的迭代器包装对象。问题是，序列类型已经可以被 for 处理，为何还要这么做？

>>> class Data(object):
...     def __init__(self):
...         self._data = []
...
...     def add(self, x):
...         self._data.append(x)
...
...     def data(self):
...         return iter(self._data)

>>> d = Data()

>>> d.add(1)
>>> d.add(2)
>>> d.add(3)

>>> for x in d.data(): print x
1
2
3

返回迭代器对象代替 self._data 列表，可避免对象状态被外部修改。或许你会尝试返回 tuple，但这需要复制整个列表，浪费更多的内存。

iter() 很方便，但无法让迭代中途停止，这需要自己动手实现迭代器对象。在设计原则上，通常会将迭代器从数据对象中分离出去。因为迭代器需要维持状态，且可能有多个迭代器在同时操控数据，这些不该成为数据对象的负担，无端提升了复杂度。

>>> class Data(object):
...     def __init__(self, *args):
...         self._data = list(args)
...
...     def __iter__(self):
...         return DataIter(self)

>>> class DataIter(object):
...     def __init__(self, data):
...         self._index = 0
...         self._data = data._data
...
...     def next(self):
...         if self._index >= len(self._data): raise StopIteration()
...         d = self._data[self._index]
...         self._index += 1
...         return d

>>> d = Data(1, 2, 3)

>>> for x in d: print x
1
2
3

Data 仅仅是数据容器，只需 iter 返回迭代器对象，而由 DataIter 提供 next 方法。

除了 for 循环，迭代器也可以直接用 next() 操控。

>>> d = Data(1, 2, 3)

>>> it = iter(d)
>>> it
<__main__.DataIter object at 0x10dafe850>

>>> next(it)
1
>>> next(it)
2
>>> next(it)
3

>>> next(it)
StopIteration

生成器

基于索引实现的迭代器有些丑陋，更合理的做法是用 yield 返回实现了迭代器协议的 Generator 对象。

>>> class Data(object):
...     def __init__(self, *args):
...         self._data = list(args)
...
...     def __iter__(self):
...         for x in self._data:
...             yield x

>>> d = Data(1, 2, 3)

>>> for x in d: print x
1
2
3

编译器魔法会将包含 yield 的方法 (或函数) 重新打包，使其返回 Generator 对象。这样一来，就无须废力气维护额外的迭代器类型了。

>>> d.__iter__()
<generator object __iter__ at 0x10db01280>

>>> iter(d).next()
1

协程

yield 为何能实现这样的魔法？这涉及到协程 (coroutine) 的工作原理。先看下面的例子。

>>> def coroutine():
...     print "coroutine start..."
...     result = None
...     while True:
...         s = yield result
...         result = s.split(",")

>>> c = coroutine()   # 函数返回协程对象。

>>> c.send(None)    # 使用 send(None) 或 next() 启动协程。
coroutine start...

>>> c.send("a,b")    # 向协程发送消息，使其恢复执行。
['a', 'b']

>>> c.send("c,d")
['c', 'd']

>>> c.close()    # 关闭协程，使其退出。或用 c.throw() 使其引发异常。
>>> c.send("e,f")    # 无法向已关闭的协程发送消息。
StopIteration

协程执行流程：

创建协程后对象，必须使用 send(None) 或 next() 启动。
协程在执行 yield result 后让出执行绪，等待消息。
调用方发送 send("a,b") 消息，协程恢复执行，将接收到的数据保存到 s，执行后续流程。
再次循环到 yeild，协程返回前面的处理结果，并再次让出执行绪。
直到关闭或被引发异常。

close() 引发协程 GeneratorExit 异常，使其正常退出。而 throw() 可以引发任何类型的异常，这需要在协程内部捕获。

虽然生成器 yield 能轻松实现协程机制，但离真正意义上的高并发还有不小的距离。可以考虑使用成熟的第三方库，比如 gevent/eventlet，或直接用 greenlet。

模式

善用迭代器，总会有意外的惊喜。

生产消费模型

利用 yield 协程特性，我们无需多线程就可以编写生产消费模型。

>>> def consumer():
...     while True:
...         d = yield
...         if not d: break
...         print "consumer:", d

>>> c = consumer()  # 创建消费者
>>> c.send(None)  # 启动消费者

>>> c.send(1)  # 生产数据，并提交给消费者。
consumer: 1

>>> c.send(2)
consumer: 2

>>> c.send(3)
consumer: 3

>>> c.send(None)  # 生产结束，通知消费者结束。
StopIteration

改进回调

回调函数是实现异步操作的常用手法，只不过代码规模一大，看上去就不那么舒服了。好好的逻辑被切分到两个函数里，维护也是个问题。有了 yield，完全可以用 blocking style 编写异步调用。

下面是 callback 版本的示例，其中 Framework 调用 logic，在完成某些操作或者接收到信号后，用 callback 返回异步结果。

>>> def framework(logic, callback):
...     s = logic()
...     print "[FX] logic: ", s
...     print "[FX] do something..."
...     callback("async:" + s)

>>> def logic():
...     s = "mylogic"
...     return s

>>> def callback(s):
...     print s

>>> framework(logic, callback)
[FX] logic: mylogic
[FX] do something...
async:mylogic

看看用 yield 改进的 blocking style 版本。

>>> def framework(logic):
...     try:
...         it = logic()
...         s = next(it)
...         print "[FX] logic: ", s
...         print "[FX] do something"
...         it.send("async:" + s)
...     except StopIteration:
...         pass

>>> def logic():

...     s = "mylogic"
...     r = yield s
...     print r

>>> framework(logic)
[FX] logic: mylogic
[FX] do something
async:mylogic

尽管 framework 变得复杂了一些，但却保持了 logic 的完整性。blocking style 样式的编码给逻辑维护带来的好处无需言说。

宝藏

标准库 itertools 模块是不应该忽视的宝藏。

chain

连接多个迭代器。

>>> it = chain(xrange(3), "abc")
>>> list(it)
[0, 1, 2, 'a', 'b', 'c']

combinations

返回指定长度的元素顺序组合序列。

>>> it = combinations("abcd", 2)
>>> list(it)
[('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]

>>> it = combinations(xrange(4), 2)
>>> list(it)
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

combinations_with_replacement 会额外返回同一元素的组合。

>>> it = combinations_with_replacement("abcd", 2)
>>> list(it)
[('a', 'a'), ('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'b'), ('b', 'c'), ('b', 'd'),
('c', 'c'), ('c', 'd'), ('d', 'd')]

compress

按条件表过滤迭代器元素。

>>> it = compress("abcde", [1, 0, 1, 1, 0])
>>> list(it)
['a', 'c', 'd']

条件列表可以是任何布尔列表。

count

从起点开始，"无限" 循环下去。

>>> for x in count(10, step = 2):
...     print x
...     if x > 17: break

10
12
14
16
18

cycle

迭代结束，再从头来过。

>>> for i, x in enumerate(cycle("abc")):
...     print x
...     if i > 7: break

a
b
c
a
b
c
a
b
c

dropwhile

跳过头部符合条件的元素。

>>> it = dropwhile(lambda i: i < 4, [2, 1, 4, 1, 3])
>>> list(it)
[4, 1, 3]

takewhile 则仅保留头部符合条件的元素。

>>> it = takewhile(lambda i: i < 4, [2, 1, 4, 1, 3])
>>> list(it)
[2, 1]

groupby

将连续出现的相同元素进行分组。

>>> [list(k) for k, g in groupby('AAAABBBCCDAABBCCDD')]
[['A'], ['B'], ['C'], ['D'], ['A'], ['B'], ['C'], ['D']]

>>> [list(g) for k, g in groupby('AAAABBBCCDAABBCCDD')]
[['A', 'A', 'A', 'A'], ['B', 'B', 'B'], ['C', 'C'], ['D'], ['A', 'A'], ['B', 'B'], ['C',
'C'], ['D', 'D']]

ifilter

与内置函数 filter() 类似，仅保留符合条件的元素。

>>> it = ifilter(lambda x: x % 2, xrange(10))
>>> list(it)
[1, 3, 5, 7, 9]

ifilterfalse 正好相反，保留不符合条件的元素。

>>> it = ifilterfalse(lambda x: x % 2, xrange(10))
>>> list(it)
[0, 2, 4, 6, 8]

imap

与内置函数 map() 类似。

>>> it = imap(lambda x, y: x + y, (2,3,10), (5,2,3))
>>> list(it)
[7, 5, 13]

islice

以切片的方式从迭代器获取元素。

>>> it = islice(xrange(10), 3)
>>> list(it)
[0, 1, 2]

>>> it = islice(xrange(10), 3, 5)
>>> list(it)
[3, 4]

>>> it = islice(xrange(10), 3, 9, 2)
>>> list(it)
[3, 5, 7]

izip

与内置函数 zip() 类似，多余元素会被抛弃。

>>> it = izip("abc", [1, 2])
>>> list(it)
[('a', 1), ('b', 2)]

要保留多余元素可以用 izip_longest，它提供了一个补缺参数。

>>> it = izip_longest("abc", [1, 2], fillvalue = 0)
>>> list(it)
[('a', 1), ('b', 2), ('c', 0)]

permutations

与 combinations 顺序组合不同，permutations 让每个元素都从头组合一遍。

>>> it = permutations("abc", 2)
>>> list(it)
[('a', 'b'), ('a', 'c'), ('b', 'a'), ('b', 'c'), ('c', 'a'), ('c', 'b')]

>>> it = combinations("abc", 2)
>>> list(it)
[('a', 'b'), ('a', 'c'), ('b', 'c')]

product

让每个元素都和后面的迭代器完整组合一遍。

>>> it = product("abc", [0, 1])
>>> list(it)
[('a', 0), ('a', 1), ('b', 0), ('b', 1), ('c', 0), ('c', 1)]

repeat

将一个对象重复 n 次。

>>> it = repeat("a", 3)
>>> list(it)
['a', 'a', 'a']

starmap

按顺序处理每组元素。

>>> it = starmap(lambda x, y: x + y, [(1, 2), (10, 20)])
>>> list(it)
[3, 30]

tee

复制迭代器。

>>> for it in tee(xrange(5), 3):
... print list(it)

[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]