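The ASPN Python Cookbook describes a program that uses the zlib library to identify which language a text is written in. Its core code is under 20 lines, and from what I have observed its accuracy is at least 95%. I modified it slightly to use the United Nations Universal Declaration of Human Rights as the corpus. Now, even when I grab an arbitrary Javanese article from Wikipedia, it identifies the language correctly nearly every time.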
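import zlib


class Entropy:
    def __init__(self):
        # One (name, corpus, compressed length) tuple per registered corpus.
        self.entro = []

    @staticmethod
    def _to_bytes(text):
        # zlib.compress() needs bytes on Python 3, so encode str input as UTF-8.
        return text.encode("utf-8") if isinstance(text, str) else bytes(text)

    def register(self, name, corpus):
        """
        Register a text as corpus for a language or author.
        <name> may also be a function or whatever you need
        to handle the result.
        """
        corpus = self._to_bytes(corpus)
        ziplen = len(zlib.compress(corpus))
        print(name, ziplen)
        self.entro.append((name, corpus, ziplen))

    def guess(self, part):
        """
        <part> is a text that will be compared with the registered
        corpora and the function will return what you defined as
        <name> in the registration process.
        """
        what = None
        diff = None  # smallest growth in compressed size seen so far
        part = self._to_bytes(part)
        for name, corpus, ziplen in self.entro:
            # Extra compressed bytes needed when <part> follows this corpus.
            nz = len(zlib.compress(corpus + part)) - ziplen
            if diff is None or nz < diff:
                what = name
                diff = nz
        return what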
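As a quick way to try the idea, here is a minimal sketch of the same technique in function form. The names `CORPORA`, `extra_bytes`, and `guess_language` are hypothetical and not part of the recipe. The three texts are Article 1 of the declaration, used as tiny stand-ins for the full translations.

```python
import zlib

# Tiny stand-in corpora (assumption: Article 1 only). A real run would
# register the full text of each translation instead.
CORPORA = {
    "English": "All human beings are born free and equal in dignity and "
               "rights. They are endowed with reason and conscience and "
               "should act towards one another in a spirit of brotherhood.",
    "German": "Alle Menschen sind frei und gleich an Würde und Rechten "
              "geboren. Sie sind mit Vernunft und Gewissen begabt und "
              "sollen einander im Geist der Brüderlichkeit begegnen.",
    "French": "Tous les êtres humains naissent libres et égaux en dignité "
              "et en droits. Ils sont doués de raison et de conscience et "
              "doivent agir les uns envers les autres dans un esprit de "
              "fraternité.",
}


def extra_bytes(corpus, part):
    """Compressed bytes that <part> adds on top of <corpus> alone."""
    base = corpus.encode("utf-8")
    both = base + part.encode("utf-8")
    return len(zlib.compress(both)) - len(zlib.compress(base))


def guess_language(part, corpora=CORPORA):
    """Return the name whose corpus absorbs <part> most cheaply."""
    return min(corpora, key=lambda name: extra_bytes(corpora[name], part))


if __name__ == "__main__":
    sample = "Everyone has the right to life, liberty and security of person."
    for name in CORPORA:
        print(name, extra_bytes(CORPORA[name], sample))
    print("guess:", guess_language(sample))
```

With corpora this small the result is noisy, so longer texts should give steadier guesses. Note that DEFLATE can only refer back about 32 KB, so for a very long corpus only its tail influences how cheaply the sample compresses.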
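I am posting the code first. When I have time, I will go into more detail on the clever uses of language models and information theory. Simple, compact models that solve seemingly unsolvable problems: that is the essence of artificial intelligence.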
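One way to read the trick in information-theoretic terms, offered as my own hedged sketch rather than the recipe's explanation: a compressor acts as an implicit language model, and the bytes it needs to encode a sample after having seen a corpus roughly estimate how surprising the sample is given that corpus. The helper `bits_per_byte` and the toy texts below are hypothetical.

```python
import zlib


def bits_per_byte(corpus, part):
    """Rough cost, in bits per input byte, of coding <part> after <corpus>."""
    base = corpus.encode("utf-8")
    tail = part.encode("utf-8")
    extra = len(zlib.compress(base + tail)) - len(zlib.compress(base))
    return 8.0 * extra / len(tail)


# Hypothetical toy corpora; any two texts in different languages would do.
english = "All human beings are born free and equal in dignity and rights. "
german = "Alle Menschen sind frei und gleich an Würde und Rechten geboren. "
sample = "free and equal in dignity and rights"

for name, corpus in (("English", english), ("German", german)):
    # A lower number means the corpus "predicts" the sample better.
    print(name, round(bits_per_byte(corpus, sample), 2))
```

The language whose corpus gives the lowest figure is the guess, which is the same comparison the `guess` method makes with raw byte counts.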