Nov 14, 2007 - A Graph Theory Problem


This evening I discussed a problem with a classmate. I found it quite interesting, so I'm posting it here.

Suppose we have a weighted directed graph. Finding the shortest path from A to B is obviously easy: as long as there are no negative weights, Dijkstra's algorithm finishes it in no time. Now the problem changes slightly. Suppose the directed graph itself is known, but the edge weights are not. Computing a weight is expensive, so if an edge cannot possibly lie on any shortest path from A to B, there is no need to compute its weight at all. The question is:

Given a directed graph and two vertices A and B, find a subgraph of the original graph such that, for every possible assignment of edge weights, every possible shortest path is contained in the subgraph. (Equivalently: remove every edge that cannot possibly lie on a shortest path.)

Think about it; I'll post our algorithm the day after tomorrow :)
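
For reference, the baseline case mentioned above (all weights known and non-negative) is just textbook Dijkstra. A minimal sketch, using an adjacency-list dict that I made up for illustration:

import heapq

def dijkstra(graph, source, target):
    # graph: dict mapping node -> list of (neighbor, weight) pairs
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float('inf')):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None  # target unreachable

g = {'A': [('C', 1), ('D', 4)], 'C': [('D', 1), ('B', 5)], 'D': [('B', 1)]}
assert dijkstra(g, 'A', 'B') == 3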

Nov 12, 2007 - Language Models and a Language Identification Program - 1


[A note up front: the idea behind the earlier language identification program came from the Python Cookbook; I only made minor changes and contributed nothing substantial. This series will probably run to three or four posts. Later on I may write another version using a naive Bayes classifier and compare the two.]

Back when I translated "How to Write a Spelling Corrector", I already mentioned the notion of a (probabilistic) language model. What is a language model? Simply put, within a language the probability with which each word appears follows a certain distribution, independent of other factors. If we want to write a Chinese-to-English translator, then in theory the probability of each word appearing in the translator's output should match the probability of that word in English. Consider "the", which appears all the time in English. If, for any Chinese input, our translator deliberately lowered or otherwise mis-set the probability of "the", so that it almost never showed up in the output, or showed up with some bizarre frequency, then no matter how you look at it, the result would not read like English. By the same reasoning, a spelling corrector that fixes arbitrary misspellings must respect the language model during correction. If "teh" could be corrected to either "the" or "tek", we should prefer "the", because "the" is used with far higher probability than "tek". (Statistical language modeling is a large field in its own right; to get a general feel for it, read the Mathematical Beauty (数学之美) series on Google 黑板报. Here I only need the language model itself, not conditional probabilities.)
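
As a toy illustration of that last point, here is a minimal sketch of ranking candidate corrections by unigram frequency; the tiny corpus string and the candidate list are placeholders I made up:

from collections import Counter
import re

corpus = "the cat sat on the mat and the dog saw the cat"  # placeholder corpus
counts = Counter(re.findall(r"[a-z]+", corpus.lower()))

def best_correction(candidates):
    # prefer the candidate that appears most often in the corpus
    return max(candidates, key=lambda w: counts[w])

print(best_correction(["the", "tek"]))  # -> 'the'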

Based on this assumption we naturally arrive at an idea: if the language models of two texts are similar, we declare them to be written in the same language. Fine, but how do we measure "similar"? Mathematically: define a "distance" between any two language models and pick the smallest. The distance can be defined almost however you like, say the angle between vectors or the Euclidean distance. Reality, however, is not that simple. First, if every word gets its own frequency component, the resulting vector is high-dimensional; unless you have a universal vocabulary covering every language, it is hard to even say how many dimensions the vector should have. Moreover, high-dimensional random vectors are, with high probability, nearly orthogonal. To keep the curse of dimensionality under control, a simple trick is to treat each byte as a "word" and count byte frequencies instead. The model looks unjustified and very "machine-oriented", but since a computer processes text byte by byte anyway, it is not unreasonable to try.
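
Here is a minimal sketch of that byte-frequency idea, with cosine similarity as the distance; the sample strings are placeholders of mine, and this is not the Cookbook recipe itself:

# -*- coding: utf-8 -*-
import math

def byte_freq(text, encoding='utf-8'):
    # 256-dimensional vector of relative byte frequencies
    data = bytearray(text.encode(encoding))
    vec = [0.0] * 256
    for b in data:
        vec[b] += 1
    total = float(len(data)) or 1.0
    return [v / total for v in vec]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(a * a for a in v))
    return dot / norm if norm else 0.0

english = byte_freq(u"the quick brown fox jumps over the lazy dog")
chinese = byte_freq(u"今天晚上和同学讨论了一个很有意思的问题")
sample  = byte_freq(u"a short English sentence about nothing in particular")

# the sample should be closer (higher cosine) to the English profile
print(cosine(sample, english))
print(cosine(sample, chinese))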

I have not personally tried the byte-frequency-plus-vector-distance approach, but I would expect it to perform well on UTF-8 text. One might worry about the following: if some Chinese character were encoded as two bytes that happened to coincide with the codes of two Western letters, then that character could, as far as the computer is concerned, just as well be written as a pair of letters. We could then no longer guarantee that a Chinese passage and, say, a Greek one do not share exactly the same byte distribution, since one could "cheat" by constructing a Chinese passage whose byte distribution matches the Greek one. The elegance of UTF-8 is that it separates the different variable-length encodings by distinct prefixes, and the different code ranges never overlap. Put another way, pick any single byte out of a UTF-8 stream and you can always tell what kind of encoding it belongs to; no byte of a multi-byte character can ever coincide with the encoding of a single-byte character. So after encoding, single-byte and multi-byte characters occupy disjoint byte ranges, and the resulting byte distributions cannot exhibit the "cheating" above. In stricter terms: the bytes contributed by one character are never added onto exactly the byte values that belong to another class of characters; there is always an independently distributed component left over. This is a necessary condition for the byte model to faithfully reflect the language model.
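
This property is easy to check directly; the throwaway snippet below is mine, not part of the recipe. Every byte of a multi-byte UTF-8 sequence (lead byte and continuation bytes alike) is 0x80 or above, while ASCII characters always encode to a single byte below 0x80, so the two ranges never mix:

# -*- coding: utf-8 -*-
for ch in [u"A", u"z", u"é", u"汉", u"语"]:
    print([hex(b) for b in bytearray(ch.encode('utf-8'))])

# no multi-byte character can masquerade as a run of single-byte letters
assert all(b >= 0x80
           for ch in [u"é", u"汉", u"语"]
           for b in bytearray(ch.encode('utf-8')))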

[To be continued…]

Nov 9, 2007 - My plan for the “Read The F Manual” Project


A little bit of history

Having been in academia for years, one tool I have been dreaming of is an online collaborative paper reader. You could call it YouTube for documents, or Digg with PDF files; the name doesn't really matter. The ultimate goal of such a system is to support collaborative document reading and organization for professors and Ph.D. students. Since we now have Ajax (more interactive than before) and Flash (as powerful as PDF), I would expect a lightweight solution. arXiv and CiteSeer do a really good job of storing and linking all the documents, but my ultimate goal is reading.

In the winter of 2006, I expanded my ambition into a larger project: read the f source code and read the f books (here, of course, the f word means fine :). I launched my domain name rtfsc.org [now redirected to Apache.org]. Our team thought rtfsc would be easier than rtfm because we would only have to deal with plain text instead of PDF files. However, we had set ourselves on the wrong track. It turned out that if we let users upload and update their code all the time, we would have to implement a Subversion or even a SourceForge. For us, an online code-reading community without code management and code search would be terrible, but we didn't have enough time to realize all of these ambitions. The other obstacle was my lack of experience with Ajax development. Anyway, in the end we gave up and stopped development after Christmas. That week, I was the Person of the Year 2006.

New motivation

I have been thinking about this project for nearly a year without taking any action. Plenty of similar projects appeared during that year, for example FlashPaper, edocr and Scribd, but none of them is the tool of my dreams. Looking back at our project, I realized that nobody else actually wanted to build it, because they didn't really understand our requirements. Their goal is usually to become the next DocTube-like thing, while mine is a handy system open for academic research. I also found some open source tools that were really inspiring. All of this got me interested in the project again.

Technical aspects:

  1. PDF to SWF is not that hard; there is an open source tool for it, PDF2SWF.

  2. I thought I couldn't control the converted SWF, since it exposes no API the way FlashPaper does, but I was wrong: ActionScript can control it directly. All I need to do is learn ActionScript, which is not that hard.

  3. I thought FlashPaper used some undocumented APIs that we could never learn about, but I got SWFmill today, and now I can study FlashPaper's code. [Disclaimer: I am not saying anything about how to decompile FlashPaper or violate its EULA.]

  4. I thought Ajax + Flash development was hard and had a very steep learning curve, but again I was wrong. While hacking on Google Finance these days, I found the Flash Ajax Integration Package, which is very handy for a newbie like me with no Ajax/Flash development experience.

Proof of concept

Here is a proof-of-concept product. I converted the PDF version of my CV to SWF and combined it with a controller I wrote, based on the default controller that ships with PDF2SWF. (The CV has two pages; you can go to the next page by clicking the red arrow.)

cv.swf

Final Word

What I need is a sophisticated viewer that is as powerful as FlashPaper and can communicate with other Ajax components on the page. So, developers: if you find this idea interesting and you are good at Ajax/Flash, feel free to take the idea and build it, since I may not have the time, or it would take me a long while to finish. I desperately need an Ajax-style, Flash-based document reader with inline comment support. If you can pull it off and build your own startup on the idea, my only request is that you let me be your user.

BTW, if there is one lesson from this project, I would say: pick handy tools before you get started.

Nov 8, 2007 - Identifying Which Language a Text Is Written In


The ASPN Python Cookbook has a recipe that uses the zlib library to identify which language a text is written in. The core code is under 20 lines, and from what I have observed its accuracy is no lower than 95%. I made a few small changes and used the UN Universal Declaration of Human Rights as the corpus; now I can grab a random article from Wikipedia, even one in Javanese, and it gets the language right almost every time.

import zlib

class Entropy:
    def __init__(self):
        self.entro = []

    def register(self, name, corpus):
        """
        Register a text as the corpus for a language or author.
        <name> may also be a function or whatever you need
        to handle the result.
        """
        corpus = str(corpus)
        ziplen = len(zlib.compress(corpus))
        print name, ziplen
        self.entro.append((name, corpus, ziplen))

    def guess(self, part):
        """
        <part> is a text that will be compared with the registered
        corpora; the method returns whatever you registered as
        <name> for the best-matching corpus.
        """
        what = None
        diff = None
        part = str(part)

        for name, corpus, ziplen in self.entro:
            # extra compressed bytes contributed by <part> on top of this
            # corpus; the smaller the increase, the more similar the texts
            nz = len(zlib.compress(corpus + part)) - ziplen
            if diff is None or nz < diff:
                what = name
                diff = nz
        return what
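
A quick usage sketch (the corpus file names below are placeholders of mine, not the actual files from the download):

guesser = Entropy()
# register one corpus per language
guesser.register('english', open('udhr_english.txt').read())
guesser.register('chinese', open('udhr_chinese.txt').read())
guesser.register('javanese', open('udhr_javanese.txt').read())

print(guesser.guess('The quick brown fox jumps over the lazy dog.'))
# expected: 'english'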

I'll post the code first; when I have time I will go into the language model and the clever use of information theory in more detail. A simple, tiny model cracking a seemingly unsolvable problem: that is the essence of artificial intelligence.

[Download everything as one package (it includes the corpus source files, about 10 MB); the code itself is only 50 lines.]

Nov 6, 2007 - A simple grabber to get an unlimited number of items from an RSS feed


Recently I needed to get some historical data from an RSS feed, but a feed usually only exposes a limited number of recent items. Since Google Reader always lets you scroll back to earlier items [if there are any], my solution is to use Google Reader as the feed processor.

Actually, I am not the first one to do this: gPowered and GoogleReaderAPI have already made it possible. I extracted the necessary code here and omitted the rest. As usual, it's Python. [download]

""" wuFetcher
	Usage: python wufetcher.py

	Author: Eric You XU, Washington University
	[Free to use for whatever purpose, absolutely NO WARRANTY]
	Kudos to 	gPowered: 			http://blog.gpowered.net/2007/08/google-reader-api-functions.html
				GoogleReaderAPI		http://code.google.com/p/pyrfeed/wiki/GoogleReaderAPI
"""

import urllib
import urllib2
import re
import sys

login = 'wufetcher@gmail.com'
password = 'wuFetcher2007'
source = 'wuFetcher'

google_url = 'http://www.google.com'
reader_url = google_url + '/reader'
login_url = 'https://www.google.com/accounts/ClientLogin'
get_feed_url = reader_url + '/atom/feed/'

def get_SID():
	header = {'User-agent' : source}
	post_data = urllib.urlencode({ 'Email': login, \
								'Passwd': password, \
								'service': 'reader', \
								'source': source, \
								'continue': google_url, })
	# @see GoogleReaderAPI: Identification

	request = urllib2.Request(login_url, post_data, header)
	try :
		f = urllib2.urlopen(request)
		result = f.read()
		print result
	except:
		print 'Error logging in'
		sys.exit(1)
	# the SID token returned by ClientLogin is used as an auth cookie below
	return re.search(r'SID=(\S*)', result).group(1)

def get_results(SID, url, number):
	header = {'User-agent' : source}
	header['Cookie'] = 'Name=SID;SID=%s;Domain=.google.com;Path=/;Expires=160000000000' % SID
	# the number of items is passed as the ?n= query parameter of a GET request
	request = urllib2.Request(url + '?n=' + str(number), None, header)
	try :
		f = urllib2.urlopen( request )
		return f.read()
	except:
		print 'Error getting data from %s' % url
	return None

if __name__ == "__main__":
	sid = get_SID()
	feed_url = "http://feeds.feedburner.com/xumathena"
	# replace this url with the rss feed you want to fetch

	number = 10
	# replace this number with the number of items you want to fetch

	result = get_results(sid, get_feed_url + feed_url.encode('utf-8'), number)
	if result is not None:
		# save the Atom output under the last component of the feed url
		f = open(feed_url.split('/')[-1], 'w')
		f.write(result)
		f.close()
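
Once the Atom file has been saved, something like the following can pull the item titles back out. This is a hedged sketch of mine, not part of the original script; it only assumes the standard Atom namespace:

import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'

def list_titles(path):
    # print the title of every <entry> in the saved Atom file
    tree = ET.parse(path)
    for entry in tree.getroot().findall(ATOM + 'entry'):
        title = entry.find(ATOM + 'title')
        if title is not None:
            print(title.text)

list_titles('xumathena')  # the file written above for the example feed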