Posts archived in CompSci

网上流行几条面试题, 都和经典的采样洗牌问题有关, 网上每过一段时间都有哥们拿出来问一下. 正好最近在翻的 TAoCP Vol 2 上有详细专门的讲解, 因此在此一并介绍.

这几道面试题是:

1.  Given a random number generator which can generate the number in range (1,5) uniformly. How can you use it to build a random number generator which can generate the number in range (1,7) uniformly?

(http://discuss.joelonsoftware.com/default.asp?interview.11.489564.8)

2. Generate a random permutation for a deck of cards.

3. Given a long log file (not sure how long), pick 1000 items evenly from them. (Google Interview)

其中, 第一条是随机数的生成问题. 第二条是生成一个序列的随机排列的问题, 第三条等价于生成一个序列的随机组合的问题. 实质上洗牌问题就是从集合中选取随机排列, 而采样问题就是从集合中选取随机组合.

第一条的解法要用到拒绝采样定理. 简单的说, 把 1-5 的随机数发生器用两次, 拼成一个5进制的数, 就是1-25. 将这 1-25 平均分配的25种情况映射到7种情况上, 问题就解决了. 因为21是7的倍数, 我们可以每三个映射到一个, 即1-3 映射到1, …, 19-21 映射到7. 可见, 这些情况之间的概率是一样的. 那么, 要是拼成的数字正好是 22-25 这四个呢? 有两种方法, 第一种是丢弃这个数字, 从头再来, 直到拼成的数字在1-21之间. 因为这个是个概率算法, 不能保证每次都能落在1-21, 所以采样的密度不高. 还有一种方法, 是说, 假如落到了 22-25, 那这次的采样结果就用上次的. 可以证明, 这看上去两个互相矛盾的算法, 结果都能均等的得到等概率的分布. (前者叫做 Reject Sampling, 后者叫做 Metropolis Algorithm, 都是数学物理模拟里面常用的方法)

第二条的解法很简单, 从后往前第k步的时候, 每次取一个1-52k (假如牌有52张) [感谢读者评论指出这个错误 不是1-52, 而是1-k] 的随机数 j, 交换 k 和 j. 这样, 每一位都是等概率的被交换, 最后的排列也是等概率的得到.

这个方法有一个趣闻. 我们知道, 计算机里的随机数发生器是伪随机数发生器. 伪随机数发生器的原理是用线性同余实现的, 即本次的随机数 乘以一个常数, 加上另一个常数, 再对一个大数c求同余. 我们可以把从一个随机数到下一个随机数的映射看成一个函数, 这个函数是必然形成循环的 (您把这个操作和集合整体当成循环群理解也是对的, 因为好的伪随机数发生器都设计得使得所有的数是各态遍历的). 这个循环的周期, 最长就是就是c. 所以, 如果您用伪随机数发生器来生成以上的排列的话, 只会得到最多 c 种不同的交换方案. 假设我们使用 32 位机器, 那么最大的可以用在做同余的c 就是 2^{32}. 可是 13! 就已经是这个数量级了, 更不要说是 52 的阶乘. 这就意味着, 用计算机里的随机数发生器, 只能生成总体可能排列中极其有限的一部分. 所以, 假如您在电脑上玩扑克牌, 而扑克牌的洗牌算法又正好是上面我们说的那种的话, 您可能一辈子也抓不到某种组合. 这个不是概率很小的问题, 而是完全不可能. (此例来自 TAoCP, Vol 2)

第三条也不难, 虽然听上去吓人. 本质上只需要知道在已经处理了前 t (t \ge 1000)个的情况下, 第 t +1 个被选中的概率. 实际上就是 1000/(t+1). 因此, 用这个概率去替换已经选中的1000个当中的某个. 可以证明, 任何时间停下来, 算法都是等概率的挑出了前面 t 个里面的1000 个. 这种任何时间能停下来还能获得正确结果的算法, 叫做 Anytime Algorithm, 在计算机科学领域随处可见, 尤其是 AI Search 方面.

第三条的算法在 TAoCP 第二卷里面交代得更加全面, 因为考虑了这 1000 个记录在内存里面放不下的情况. 这时候, 一个叫做 reservoir 的缓存被引入. 每次仅仅把选中的记录加入 reservoir, 而只保留记录在其中的序号. 在序号的数据结构上作替换, 最后, 按照最终留下来的那些序号, 再走一遍 reservoir, 就得到了要选的记录. 这个采样的方法在分析不定长网络日志的时候比较有用, 所以也是 Google 亲睐的面试题.

采样和洗牌的更多内容, 请参考 TAoCP Vol 2. 3.4.2.

Recently I have to get some historical data from one RSS feed. It seems that RSS can only output a limited number of recent items. Since in Google Reader, we can always roll to previous item [if there is one], my solution here is to use Google Reader as the feed processor.

Actually I am not the first one to do this. gPowered and  GoogleReaderAPI have already made it possible. I extract the necessary code here and omit other lines. As usual, it’s Python.[download]

""" wuFetcher
	Usage: python wufetcher.py

	Author: Eric You XU, Washington University
	[Free to use for whatever purpose, absolutely NO WARRANTY]
	Kudos to 	gPowered: 			http://blog.gpowered.net/2007/08/google-reader-api-functions.html
				GoogleReaderAPI		http://code.google.com/p/pyrfeed/wiki/GoogleReaderAPI
"""

import urllib
import urllib2
import re  

login = 'wufetcher@gmail.com'
password = 'wuFetcher2007'
source = 'wuFetcher'

google_url = 'http://www.google.com'
reader_url = google_url + '/reader'
login_url = 'https://www.google.com/accounts/ClientLogin'
get_feed_url = reader_url + '/atom/feed/'

def get_SID():
	header = {'User-agent' : source}
	post_data = urllib.urlencode({ 'Email': login, \
								'Passwd': password, \
								'service': 'reader', \
								'source': source, \
								'continue': google_url, })
	# @see GoogleReaderAPI: Identification

	request = urllib2.Request(login_url, post_data, header)
	try :
		f = urllib2.urlopen(request)
		result = f.read()
		print result
	except:
		print 'Error logging in'
		exit(-1)
	return re.search('SID=(\S*)', result).group(1)

def get_results(SID, url, number):
	header = {'User-agent' : source}
	header['Cookie']='Name=SID;SID=%s;Domain=.google.com;Path=/;Expires=160000000000' % SID
	post_data = urllib.urlencode({'n': str(number)})
	request = urllib2.Request(url+'?n='+str(number), None, header)
	try :
		f = urllib2.urlopen( request )
		return f.read()
	except:
		print 'Error getting data from %s' % url
	return None

if __name__ == "__main__":
	sid= get_SID()
	feed_url= "http://feeds.feedburner.com/xumathena"
	# replace this url with the rss feed you want to fetch

	number = 10
	# replace this number with number of items you want to fetch

	result = get_results(sid, get_feed_url+feed_url.encode('utf-8'), number)
	f = open(feed_url.split('/')[-1], 'w')
	f.write(result)
	f.close() 

我的Blog从4月份开张到现在, PageRank 一直是0. 今天PageRank 更新了, 变成  4 了. 我除了用了 mod_rewrite, 一点SEO都没有做过. 可见只要内容好, PageRank 就上去了. (当然和 AW 同学的 PageRank 还是不能比).

最近 Blog 更新慢了, 因为工作学习上的事情有点忙. 过了这几天就好了. 乘今天出来冒泡的机会向 Blog 读者先说个抱歉. 虽然写Blog 不是任务, 不过看到700多的订阅量和400多的PV (和 AW 同学比还是太小了), 感觉要是每天不加点新的内容, 真的对不住大家 :)

10 comments

几条面试题

1. 两个单向链表, 开头结点不一样, 在中间某处开始, 结点一样了, 比如
a-b- c-
       |- p q e d
r-t-y-u-

用最好的方法找出第一个公共节点(本题是 p)

2. 一个唱片, 可以被分成8个大小一样的小扇形. 要求给每个扇形涂上红色或者蓝色, 使得唱片旋转起来以后, 按照顺序读出颜色序列, 就能判断唱片是顺时针旋转还是逆时针.

3. 如何判断一个整数是不是完全平方数 (不允许开方)

4. 数一数一个无符号整数有多少 bit 长 (不允许使用 mask, 想最好的方法)

5. 几条根本等价的概率题

不知长度的日志文件, 如何等概率选出100条.

用两个数码 (2bit)的存储空间 估算我Blog的每天访问量. (假设只有0-500, 500-1000, 1000-1500和1500-2000四档)

第一个全答对的送小奖品 :)

研究资本主义面试题, 向党的十七大献礼

LOL. My second favourite song this month.

For lyrics

两个月前收 到Google 一个猎头的邀请我去谈谈的信后, 今天又收到另一个 Google 猎头的 Hello from Google. 其实只是被猎头找找聊聊, 这种事情也没啥值得拿出来炫耀, 只是我真的很奇怪, 不知道他们这些人是怎么用人肉搜索引擎发现我的邮箱的. 除了我的Blog 和几个个人主页, 我在网上从没留过邮箱. 我的Blog 和个人主页 PageRank 都是0, 怎么会被人发现呢?

面对 Google 猎头这些”套词”信件, 我真的对眼前这个巨头感到害怕起来.

1. 可以想象, 每年无数优秀的人会被这样套词. 也有很多的人就此选择和Google面试.

2. Google 是世界上最受欢迎的雇主, 拿了 offer 去的比例算是非常非常高的.

3. 听我在Google 的朋友说, Google 每个人的员工号都是唯一的, 就算实习生以后走了, 员工号也就空在那里. 据说李开复的员工号已经是10000+了. 我曾经打算要是博士毕业加盟Google, 指不定能搞到 32768 (2^15), 但据朋友说此号码已经分配了. 估计等到我毕业的时候, 65536 (2^16) 也没有了.

4. 据说如果 Google 猎头招到优秀的人, 是有奖励的. 这也是猎头不停发邮件的原因. 可问题是再这样无序的满地招人, 到处发信(我个人的感觉), 这个公司就不是求贤若渴, 而是圈人运动了. 如果一个公司的架构不能适应这样的招人速度, 越多的牛人进来, 问题只会越尖锐.

5. 如果Google 架构很好, 可以养这么多牛人, 他们要是不造 Google 牌航天飞机, 把总部搬到火星, 或者造宇宙终极计算机, 还真对不起他们的智商和Google 赚的钱.

总之, 一方面, 我觉得要是真的这样到处发信招人, 即使是求贤若渴, 也绝对不是健康发展的 Google; 另一方面, 如果Google 真的这样健康发展, 几年后Google 是不是真的该造出什么终极的让其他互联网公司根本无力抗衡的东西出

来? 无论怎样, 说实话, 这样的Google, 让我害怕.

现在, 我只想做个普通的学生. 希望等我毕业的时候, Google 还是如日中天, 并且我能顺利加入这家伟大的有趣的公司. 至于我的员工号, 我希望是111111(十进制).

猎头: 我现在对我从事的项目非常感兴趣,我感觉到我正在做一些了不起的研究和应用; 而且,我必须先完成我博士学业。所以,我暂时不会加盟其他公司。如果将来加盟其他公司,我的第一选择是 Google.

Note to recruiters: Please don’t offer me a job now. I am quite proud of my current research and project. Additionally, I have to finish my Ph.D. study first.

Google will be my first choice in the future.

另: 哪位朋友想加盟Google, 我或许可以代为推荐, 我在Google 也有能帮助推荐的朋友. 可以坦率的说, 推荐的分量比自己投简历要重要得多.


有人说, 等我有了钱, 养一只狗取名叫古狗,养一只鸽子叫谷鸽. 本文和这句话一样, 纯属ZB, 请用自己智力判别.

All the materials are cited from Google. I highlighted some important items that might be useful for our Chinese students.

>General information to include

To make it easier for us to determine where you might best fit within our organization we ask that you take a few simple steps to help us understand your qualifications. Following the guidelines below will ensure your resume/CV finds its way to the appropriate group more quickly, giving you a better opportunity to discuss your qualifications in person or via a phone screen.

* All resumes/CVs and supporting materials must be submitted electronically; no paper resumes will be accepted.
* PDF, HTML, or Microsoft Word documents or text formats are acceptable – or you can submit using plain text format.
* All resumes and related materials (transcripts, etc.) should be submitted in English.
* Pictures, images, or other graphics are not necessary – and are discouraged as they can slow evaluations.
* Only send essential personal information – be sure to include your name and how to contact you in the resume, not just your cover letter. Include your email, phone, and residence address. Do NOT include your gender, date of birth, age, family status, or personal identification numbers.
* It isn’t necessary to include military service you may have performed, unless it reflects some special achievements or accomplishments that you feel illustrate your qualifications for the job.
* To increase the accuracy of the information we have about you and the speed with which we’re able to reply to your submission, please keep your resume clean and simple. The use of special formatting, tables, images, multiple columns, etc., can decrease the ability to accurately review resumes. As we’ve found with Google itself, plain text works best!

>Submitting a resume – Educational background

Your resume/CV should reflect your academic achievements and accomplishments in these areas. In the education section of your resume, be sure to include the information outlined below.

* Your resume should show all post-secondary institutions attended, degrees conferred, and a cumulative grade point average (if available) for each degree received.
* Only report your educational history dating back to the university level; do NOT include elementary or secondary schooling. However, if you completed a “year abroad” program as part of your pre-university education, feel free to include this in your resume.
* Provide a brief description of any important projects you completed as part of your coursework, and indicate whether it was all your work or done as part of a team. If part of a team, indicate your own role and contributions to the effort.
* If you graduated from a university within the last five years, include a copy of your transcripts (unofficial is okay), a list showing individual coursework completed and grades received, as well as the overall grade average.

>Submitting a resume – Your Work Experience

You may be fresh out of a university, or have substantial work experience and a history of accomplishments. Either way, we want to know what skills you’ve acquired along the way. We’ll look closely at the work experience section of your resume, so the information you provide here is very important. Please follow the guidelines below carefully.

* List your experience – projects completed, accomplishments, etc. – by your position with each employer.
* Include more information than just the name of your employer and your job title. We also want to see concise, yet important, detail on your specific accomplishments and the impact your efforts had on your company.
* Rather than including all job responsibilities, only focus on those that you feel are relevant to the job for which you are applying at Google.
* If you worked while attending a university, either during the summer or concurrent with your course work, be sure to include a brief mention even if it isn’t specifically related to a potential job at Google.

>Submitting a resume – Additional Information

Here at Google, we value talent and intelligence, group spirit and diversity, creativity and idealism. Googlers range from former neurosurgeons and puzzle champions to alligator wrestlers and Lego maniacs. Tell us what makes you unique!

* Include the names and contact information of 2-3 references. These can include faculty advisors, co-workers, managers, or others who can talk knowledgeably about your skills and abilities
* Be sure to include any awards you’ve received, articles you’ve published, or conference presentations you’ve given.
* We don’t need to see copies of any awards or publications, just a reference to them.
* We don’t need copies of any written references you already have, just a mention of 2-3 individuals that can reflect on your most recent skills and experiences. Be sure to include their contact information. We will not contact your references until after we talk to you.

> What to expect from your interview

* While we’ll certainly do our best to make you feel comfortable during the interview process, we’re very interested in learning more about how you approach problem-solving. The questions you’ll be asked will be in-depth and will be intended to let us get a peek at how you think about complicated things. Many candidates find this challenging, but ultimately exhilarating. It’s your chance to show an appreciative audience exactly how much you’ve learned about your area of expertise.
* Interviews are always conducted in English and you should have a strong command of the language so you’ll be able to describe your ideas clearly. This is essential as all positions interact directly with engineers in the U.S. and other countries.
* Google’s phone screen and in-person interviews are highly technical in nature. You’ll be asked to write code during the interview itself and to speak to the technical details of your past designs and implementations . You should expect that your interviewers will have a great deal of curiosity about the specifics of your work and will ask questions about how you arrived at your conclusions. Our engineers admire and respect the work of others and are truly interested in learning more about what you’ve accomplished and how you did it.

今天digg 出了一个情况:digg 对用户的内容进行了审查,并禁止了部分用户发言。美国是没有什么反总统或者反党的顾虑的,那么为什么 Digg 要下这个重手呢? 原来,有的用户贴出了一串数字。这个128位的二进制串是所有蓝光和HD-DVD 加密算法的基石 HD-DVD 的一个设备码. 好莱坞以及无数家大公司这么多年的心血全部白费. (我以后专门写文章讲这个解密原理). [Update: 严格的说,只要这个设备码,就可以轻松解密和盗版所有现有的 HD-DVD。以后的影碟会继续更换加密方式。但是这个门一开,以后想关就难了]

美国数字千年法案上并没有关于一串数字是否涉及破坏版权的规定。但是美国曾经判过“非法素数” 的例子。因此,按照数字安全法案,这个号码被要求删除也说得过去。

不过 digg 的用户不干了,他们纷纷说,digg 已死, 强硬贴下那串数字后永远离开 digg。Digg 是黑客的乐园,可以想像如果走掉核心用户,就好比猫扑走掉核心用户一样, digg 就不是digg 了。不过美国人善于危机公关,我们拭目以待这次digg 如何收场。

数字千年法案已经让Google 出来要求blogger 用户删除在帖子中公开的号码了,可见这次的来头非常大。但是对于不受这个法案约束的其他国家,比如中国和英国的Blogger 用户,不知道Google 是否会以服务器在美国为由要求删除 。 YouTube 上有人把这个号码唱成了歌,如果Google 继续被要求删除,我们就会看到一个强硬的Google 还是一个被好莱坞和大公司“审查”的Google.

wikipedia 本来也有和这段号码相关的条目,现在已经被删除。所有相关条目都被保护不允许更改,可见这次的来头…

这次的事情我们可以从这几个角度看:

1. 到底是不是核心用户决定一个社区
2. 危机公关怎么做
3. Google 会不会向千年数字版权法妥协
4. 到底这串数字的其他形式,比如说唱形式,是否构成了对数字安全法的侵害?

Update1: 现在的digg 首页全讨论这个问题,人们用各种各样的方法在散布这个号码,比如写成图片,做成IP地址等等,来告诉digg, 这样审查是没有效果的。任何一篇和这个号码有关的文章都能几分钟变火。

Update2: 据说Digg 管理者收了钱Google 已经被要求删除相关内容.

Update3: 关于“非法素数”大概的意思是这样的: 一个牛B 的黑客 用C语言写了一个破解DVD的程序, 但他没法发布,因为发布是违法的。所以呢 他把编译好的2进制公布出来,再转成10进制,正好是一个素数。后来, 美国法律规定这个数违反了千年数字法案的,因此这个数字被称为“非法素数”, 至今仍然有人辩论是否可以把一个数字定为非法。[wiki

Update4: digg 创始人开始转变立场了。

We hear you, and effective immediately we won’t delete stories or comments containing the code and will deal with whatever the consequences might be.

If we lose, then what the hell, at least we died trying.

Digg on,

Kevin

翻译成中文就是: 用户,我们听到了。我们将不删除含有那个code 的文章并且准备应对由此造成的任何情况。 如果我们失败了,那也管不着了,至少我们尝试过了。 把这个文章digg 上去! Kevin 这一轮digg 大战,用户彻底胜利了. 可以想像这样会有更多的用户来digg 而不是更少。

Update 5: 按照我的理解,这次AACS必定会罢休,而且及其可能不会要挟Google删除搜索结果。 原因是这样的,在Google 的 所有DMCA条款上都有这样的一行字:

Please note that a copy of each legal notice we receive is sent to a third-party partner for publication and annotation. As such, your letter (with your personal information removed) will be forwarded to Chilling Effects (http://www.chillingeffects.org) for publication. You can see an example of such a publication at http://www.chillingeffects.org/dmca512/notice.cgi?NoticeID=861. A link to your published letter will be displayed in Google’s search results in place of the removed content.

啥意思呢,就是说, AACS给Google 发的信会被Google 转发给第三方,而Google 的相关搜索结果会被指向这个信。这个是Google 的条款,也就是说,AACS 无法阻止这个信的公开。不幸的是,这次的信中指出的一个侵权URL本身就含有那个码。也就是说,AACS 自己发的信,按照一切法律,Google 都照做了的话,用户就无论怎样还是会看到那个码,即使Google删除所有违反法律的内容,这个信是不可能被删除了。如果AACS要 Google 和第三方删除这个信,那就是自己打自己了,因为信是自己写的。这次这个事情好玩了。