Posts published during November, 2007

ASPN Python Cookbook 提到了一个使用 zlib 库识别文本用哪种语言写成的程序. 其核心代码不超过20行, 据我观察, 识别精度不低于95%. 我略做了一下修改, 把联合国联合国人权宣言作为语料库,目前从 wikipedia 上随便抓一篇爪哇文的文章下来, 都能识别得九不离十。

class Entropy:
    def __init__(self):
		self.entro = []

    def register(self, name, corpus):
        """
        register a text as corpus for a language or author.
        <name> may also be a function or whatever you need
        to handle the result.
        """
        corpus = str(corpus)
        ziplen = len(zlib.compress(corpus))
        print name, ziplen
	self.entro.append((name, corpus, ziplen))

    def guess(self, part):
        """
        <part> is a text that will be compared with the registered
        corpora and the function will return what you defined as
        <name> in the registration process.
        """
        what = None
        diff = 0
        part = str(part)

        for name, corpus, ziplen in self.entro:
		nz = len(zlib.compress(corpus+part)) - ziplen
		if diff==0 or nz<diff:
                	what = name
        		diff = nz
        return what

先贴代码, 有时间细讲一下语言模型和信息论的妙用. 简单而小巧的模型解决看上去不可解决的问题, 这就是人工智能的精华.

[所有文件打包下载(包含语料源文件10Mb). 代码本身其实只有50行]

Recently I have to get some historical data from one RSS feed. It seems that RSS can only output a limited number of recent items. Since in Google Reader, we can always roll to previous item [if there is one], my solution here is to use Google Reader as the feed processor.

Actually I am not the first one to do this. gPowered and  GoogleReaderAPI have already made it possible. I extract the necessary code here and omit other lines. As usual, it’s Python.[download]

""" wuFetcher
	Usage: python wufetcher.py

	Author: Eric You XU, Washington University
	[Free to use for whatever purpose, absolutely NO WARRANTY]
	Kudos to 	gPowered: 			http://blog.gpowered.net/2007/08/google-reader-api-functions.html
				GoogleReaderAPI		http://code.google.com/p/pyrfeed/wiki/GoogleReaderAPI
"""

import urllib
import urllib2
import re  

login = 'wufetcher@gmail.com'
password = 'wuFetcher2007'
source = 'wuFetcher'

google_url = 'http://www.google.com'
reader_url = google_url + '/reader'
login_url = 'https://www.google.com/accounts/ClientLogin'
get_feed_url = reader_url + '/atom/feed/'

def get_SID():
	header = {'User-agent' : source}
	post_data = urllib.urlencode({ 'Email': login, \
								'Passwd': password, \
								'service': 'reader', \
								'source': source, \
								'continue': google_url, })
	# @see GoogleReaderAPI: Identification

	request = urllib2.Request(login_url, post_data, header)
	try :
		f = urllib2.urlopen(request)
		result = f.read()
		print result
	except:
		print 'Error logging in'
		exit(-1)
	return re.search('SID=(\S*)', result).group(1)

def get_results(SID, url, number):
	header = {'User-agent' : source}
	header['Cookie']='Name=SID;SID=%s;Domain=.google.com;Path=/;Expires=160000000000' % SID
	post_data = urllib.urlencode({'n': str(number)})
	request = urllib2.Request(url+'?n='+str(number), None, header)
	try :
		f = urllib2.urlopen( request )
		return f.read()
	except:
		print 'Error getting data from %s' % url
	return None

if __name__ == "__main__":
	sid= get_SID()
	feed_url= "http://feeds.feedburner.com/xumathena"
	# replace this url with the rss feed you want to fetch

	number = 10
	# replace this number with number of items you want to fetch

	result = get_results(sid, get_feed_url+feed_url.encode('utf-8'), number)
	f = open(feed_url.split('/')[-1], 'w')
	f.write(result)
	f.close()