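The ASPN Python Cookbook has a recipe that uses the zlib library to identify which language a text is written in. The core code is under 20 lines, and from what I have observed its accuracy is at least 95%. I made a few small changes and used the UN Universal Declaration of Human Rights as the corpus. Now I can grab any Javanese article from Wikipedia and it is identified correctly almost every time.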
# Python 2 code: note the print statement and the plain str passed to zlib.compress.
import zlib

class Entropy:
    def __init__(self):
        # List of (name, corpus, compressed length of corpus) tuples.
        self.entro = []

    def register(self, name, corpus):
        """
        register a text as corpus for a language or author.
        <name> may also be a function or whatever you need
        to handle the result.
        """
        corpus = str(corpus)
        ziplen = len(zlib.compress(corpus))
        print name, ziplen
        self.entro.append((name, corpus, ziplen))

    def guess(self, part):
        """
        <part> is a text that will be compared with the registered
        corpora and the function will return what you defined as
        <name> in the registration process.
        """
        what = None
        diff = 0  # 0 means "no candidate seen yet"
        part = str(part)
        for name, corpus, ziplen in self.entro:
            # Extra compressed bytes needed once <part> is appended to the corpus.
            nz = len(zlib.compress(corpus + part)) - ziplen
            if diff == 0 or nz < diff:
                what = name
                diff = nz
        return what
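The class above is Python 2. For readers on Python 3, here is a minimal sketch of the same idea: append the sample to each corpus and pick the corpus whose compressed size grows the least, since a small increase means the compressor found many substrings it had already seen. The function names `extra_bytes` and `guess_language` are my own, not part of the recipe, and the two corpora are just Article 1 of the declaration in English and German, far smaller than the full text used in this post, so treat it as an illustration rather than a tuned classifier.

```python
import zlib


def extra_bytes(corpus: bytes, sample: bytes) -> int:
    """Compressed bytes added when `sample` is appended to `corpus`."""
    return len(zlib.compress(corpus + sample)) - len(zlib.compress(corpus))


def guess_language(corpora: dict, sample: str) -> str:
    """Return the name of the corpus that makes `sample` cheapest to compress."""
    data = sample.encode("utf-8")  # zlib works on bytes in Python 3
    return min(corpora, key=lambda name: extra_bytes(corpora[name], data))


# Tiny illustrative corpora (assumption: Article 1 only, not the full declaration).
corpora = {
    "english": (
        "All human beings are born free and equal in dignity and rights. "
        "They are endowed with reason and conscience and should act "
        "towards one another in a spirit of brotherhood."
    ).encode("utf-8"),
    "german": (
        "Alle Menschen sind frei und gleich an Würde und Rechten geboren. "
        "Sie sind mit Vernunft und Gewissen begabt und sollen einander "
        "im Geiste der Brüderlichkeit begegnen."
    ).encode("utf-8"),
}

print(guess_language(corpora, "born free and equal in dignity and rights"))
print(guess_language(corpora, "frei und gleich an Würde und Rechten geboren"))
```

With corpora this short, only samples that share long substrings with one corpus are separated reliably; registering the full declaration for each language, as described above, gives the compressor much more material to match against.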
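I am posting the code first; when I have time I will go into detail on the clever uses of language models and information theory. A simple, compact model that solves a seemingly unsolvable problem: that is the essence of artificial intelligence.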