Archive for CompSci

识别文本用哪种语言写成

ASPN Python Cookbook 提到了一个使用 zlib 库识别文本用哪种语言写成的程序. 其核心代码不超过20行, 据我观察, 识别精度不低于95%. 我略做了一下修改, 把联合国联合国人权宣言作为语料库,目前从 wikipedia 上随便抓一篇爪哇文的文章下来, 都能识别得九不离十。

class Entropy:
    def __init__(self):
		self.entro = []

    def register(self, name, corpus):
        “”"
        register a text as corpus for a language or author.
        <name> may also be a function or whatever you need
        to handle the result.
        “”"
        corpus = str(corpus)
        ziplen = len(zlib.compress(corpus))
        print name, ziplen
	self.entro.append((name, corpus, ziplen))

    def guess(self, part):
        “”"
        <part> is a text that will be compared with the registered
        corpora and the function will return what you defined as
        <name> in the registration process.
        “”"
        what = None
        diff = 0
        part = str(part)

        for name, corpus, ziplen in self.entro:
		nz = len(zlib.compress(corpus+part)) - ziplen
		if diff==0 or nz<diff:
                	what = name
        		diff = nz
        return what

先贴代码, 有时间细讲一下语言模型和信息论的妙用. 简单而小巧的模型解决看上去不可解决的问题, 这就是人工智能的精华.

[所有文件打包下载(包含语料源文件10Mb). 代码本身其实只有50行]

Comments (1)

关于读书

看到师弟蔡白银写的文章, 有所感, 写下几句不成文的心得体会.

大学读经典书, 能缩小自己和发达国家学生的差距. 我们国家的大学和发达国家相比, 差距大家也都能看到. 在理工科上, 或许中国大学唯一能和外国大学在同一起跑线的, 就是图书馆里面躺着的经典英文原版书了. 科学发展到今天, 每个领域都是枝丫茂盛, 都有经典的著作. 作为大学生, 如果没有读过(哪怕看看前言也行呀), 在我看来, 不算合格. 修身第一条, 就是读读经典.

要是上大学的时候只是全是靠课堂而不主动阅读, 这大学也和幼儿园没两样.

好书能帮你打通筋脉, 读了之后立即高屋建瓴, 建立起对一个领域的全盘认识. 我常常觉得, 好书大气磅礴, 就像武林名门正道功夫一样, 大开大阖, 读完以后人的气象都不一样. 读烂书, 或者上用烂教材的课, 会降低你的智商.不相信我这句话的人可以选择 TAoCP 读一下第一小节,再比较一下其他(本校)写数据结构和算法的书.

好的教材好比是心法, 提纲挈领. 先看心法是磨刀不误砍柴功的. 而可惜的是无论是市场还是实际, 砍柴的书都很受欢迎, 磨刀的书没人喜欢看. 很多做技术的都容易做到头, 就是因为没心法, 练不好72绝技的. 就计算机方面而言, 数据结构算法 (Intro. to Algorithm/TAoCP)看一本, 编程工具(TIJ/TIC++/SICP+Python)看一本, 离散数学(Concrete M/Discrete M)看一本, 足以秒杀大多数公司面试题了.

要善于利用中英对照, 好坏对照的比武看书法. 我并非崇洋媚外, 国内很多大学的教材, 存在很强的门户之见, 其教材主要内容囿于本校擅长的某些领域, 往往不能使人窥得一个体系的全貌. 因此最好的办法是找本英文的, 找本中文的. 英文书往往文字冗长, 实例繁多; 中文书往往讲空泛理论, 几乎无例子, 两者结合, 能够取长补短. 更有效的是想像两个作者在吵架, 并且将自己置于审稿人的地位, 跳出书本, 评点某处某书写得如何, 如果自己写该如何写等等. 如此一比武, 好书烂书高下立判, 而且能从一个更高的层次审视一本书. 一招一式学高手, 未必能成高手; 常看高手过招, 华山论剑, 你很快会变成高手.

书是要往厚处读的. 一本书, 哪怕只读一章, 也要保证从头读到尾. 如果没这个自信, 还是不要读了. 弱水三千, 只取一瓢, 与其说是走捷径, 不如说是纵容自己的浮躁和浅尝辄止. 这样读书, 基础不牢. 好比光看别人拳法打的好看, 自己不下功夫从头到尾把拳法演练几遍, 很快就会忘掉的. 我认识不少人, 书往往看得很多, 但考试面试或者实际运用的时候, 和没读过没两样. 结果如此, 那当年读了干啥?

–推而广之, 书上的程序和习题是要一条一条做的. 我认识一个高中朋友, 他说物理不好. 我给他出了个主意: 只做一本书上的习题, 而且一丝不苟, 从格式到步骤都要完美, 假想自己是写本能出版的习题解. 他尝试了几个星期, 单科成绩已经是年级第一了. 往往看上去最笨的方法, 实际上是最聪明的. 就我个人而言, 高年级小学生一笔一划写完初中平面几何只要一年半, 老师讲要三年; 完全不懂OO的大一新生一键一键敲完 Thinking in Java 上的所有程序只要半学期, 就能独步于10万行代码的中型项目. 而很多人学 4 年 Java 也不知道架构到底是什么样子的.

如果我没记错的话, 侯捷在 STL 源码剖析的序言中, 引用了老子的一句话: “天下大事,必作于细”. 我觉得, 能将这句心法读到, 又有什么不能剖析的源码呢. 纵使若干年后, STL 丢在历史某处, 你仍然能够在新的技术中”游刃有余”.

补充: 有很多辅助手段能帮人选书. 我的选法是: 互联网. 判断一本书(我特指数学或计算机方面的教材)是不是好书太简单了, 看看这本书作者如何, 这本书反馈如何. 如果有网页的话, 看看 Google 给这本书的 PageRank 是多少. 尽量选择国外一流大学的教材. 一般情况下, 国内出版的教材, 如果不是清华北大中科院或者这方面的杰出专家出的, 可直接略过. 少买几本庸书, 是为抵制全球变暖和保护森林做贡献.

当然, 说的这些, 都是一己之见. 我知道大家都很”快餐阅读”. 但愿看这篇文章的, 和能认真做到的比例, 能到10:1, 也就行了.

Comments (13)

Build VHDL simulation tool chain on Linux

I take Computer Architecture course this semester and study basic VHDL for digital design. The instructor recommends ModelSim with both Windows and Linux versions. But unfortunately, ModelSim for Linux can only be installed on RH Linux. I tried to install it on Ubuntu but obviously it didn’t work. It turned out that the only solution would be using ModelSim for Windows in our computer lab. But as a dead-head Linux fan, I simply want to find alternative software to get things done on Linux platform. After some googling and asking help from friends, I found two handy software packages that form a tool chain to do VHDL simulation on Linux: GHDL and GTKWave.

What

GHDL is an amazing package that it employ gcc and compile VHDL code to objective code. It can’t translate your design into a netlist but it’s sufficient for us to do the simulation.
GTKwave is another tool which can help you view the wave form dumped by GHDL simulation. There two, in principle, can help you handle all the VHDL simulation tasks involved in a one-semester-course like computer architecture.

How

Build the tool chain

If you are using a Debian family (Debian/Ubuntu, etc) Linux, an apt-get will save your life. Try:
apt-get install ghdl
and
apt-get install gtkwave.
[You might need sudo or run it as root.]

If you not, try to download the package and read the README. It’s easy. If you want to install it on Mac as what I’ve done, remember to install X11. The whole procedure should take you less than five minutes.

How to use it

It’s very simple. GHDL will compile the VHDL code for you to object code and convert it to executable. For example, if I have CPU.vhdl which is a simple computer description. Try to type:

ghdl -a cpu.vhdl

to compile it, and

ghdl -e cpu

to make executable file (do linking).

Finally you can use

./cpu --help

to see how to run it.

Some tips:
If you include ieee package, add this option: –ieee=synopsys
If you want to let the simulation run certain amount of time, use something like

./cpu --stop-time=100ns

If you want to dump a signal wave format for gtkwave, just add –vcd=vcd.file

After that, you can use

gtkwave vcd.file to see the wave. You might need to add all the signals to the wave window via “Search->Signal Search Tree”.

Why

You might ask why it’s handy than ModelSim. It’s obvious. For example, if you write a shell script that can issue all the compiling/dumping/viewing command in the right order, you only need two keyboard stoke to run a test, in comparing with in ModelSim, where you need to do at least 10 mouse click and maybe type these commands:

view wave
add wave -r /*
"set format, etc"

Finally, you can use GHDL as your VHDL code formatter and get a very nice HTML format for your neat code. You can now make it even neater, why not :)

Some really missing features for these small tools.

1. Non-standard signal watch.

If you have a signal which is an array, it’s not handy to view the wave as ghdl doesn’t actually output the signal changing information for this signal.

2. Automatic dependency solving.

Currently GHDL will complain about the dependency, but it doesn’t actually handle this like GNU make. Therefore, you have to either write your own makefile to use make to manage your code dependency, or recompile all the code every time.

My example code of a silly CPU:

    1 – CSE 560 Homework, a simple CPU
    2 – Eric You XU
    3 — version 0.05
    4
    5 library ieee;
    6 use ieee.std_logic_arith.all;
    7 use ieee.std_logic_signed.all;
    8 use ieee.std_logic_1164.all;
    9
   10 entity CPU is
   11         port (  IR      :       in std_logic_vector (31 downto 0);
   12                 READY   :       in bit;
   13                 CLK     :       in bit
   14                 );
   15 end entity CPU;
   16
   17 architecture behav of CPU is
   18     type reg_type is array(0 to 255) of std_logic_vector(31 downto 0);
   19     signal storage: reg_type;
   20
   21     signal d1, d2: std_logic_vector(31 downto 0);
   22     signal raw1, raw2: std_logic_vector(31 downto 0);
   23     signal s1_int, s2_int: integer;
   24     signal prod: std_logic_vector(63 downto 0);
   25     signal status:      std_logic_vector(1 downto 0);
   26     signal phase:       bit;
   27     signal index:       integer;
   28     signal opcode:      std_logic_vector(2 downto 0);
   29     signal needd2:      bit;
   30
   31 begin
   32
   33
   34
   35 process(clk)
   36    begin
   37     if (ready=‘0′) then
   38         phase <= ‘0′;
   39 – For test
40 storage(0) <= X”00000004″;
   41 storage(1) <= X”FFFFFFFD”;
   42 storage(2) <= X”01010103″;
   43 storage(3) <= X”0F0F0F08″;
   44 storage(4) <= X”F0F0F0F0″;
   45 storage(5) <= X”FFFFFFFF”;
   46 storage(6) <= X”00000000″;
   47 storage(7) <= X”0000000E”;
   48 storage(8) <= X”00000004″;
   49 storage(9) <= X”DEADBEEF”;
   50
   51      else
   52        if(clk’event) then
   53           if(clk = ‘1′) then                    – CLK _| 
   54               if(phase = ‘0′) then              – state 1, read instructions
   55
   56                   opcode        <=      IR(26 downto 24);
   57                   – 31 30 29 28 27 26 25 24: Lower 3 bit
   58                   raw1 <= storage(ieee.std_logic_unsigned.conv_integer( IR(23 downto 16) ));
   59                   raw2 <= storage(ieee.std_logic_unsigned.conv_integer( IR(15 downto 8) ));
   60                   – RAW Hazard 
   61                   –avoid to use s1_int = con_int(raw1)
   62                   s1_int        <=      ieee.std_logic_signed.conv_integer(storage(ieee.std_logic_unsigned.conv_integer( IR(23 downto 16) )));
   63                   s2_int        <=      ieee.std_logic_signed.conv_integer(storage(ieee.std_logic_unsigned.conv_integer( IR(15 downto 8) )));
   64                   index <= ieee.std_logic_unsigned.conv_integer(IR(7 downto 0));
   65
   66               else                              – state 3, write register files
   67                 case opcode is
   68                         when “000″ => – ADD
   69                                 storage(index) <= d1;
   70                         when “001″ => – SUB
   71                                 storage(index) <= d1;
   72                         when “011″ =>
   73                                 storage(index) <= d1;
   74                         when “100″ => – INC
   75                                 storage(index) <= d1;
   76                         when “101″ => – DEC	
   77                                 storage(index) <= d1;
   78                         when “010″ => – MULT
   79                                 storage(index) <= prod(31 downto 0);
   80                                 d2 <= prod(63 downto 32);
   81                                 d1 <= prod(31 downto 0);
   82                                 if (index = 255) then
   83                                          storage(0) <= prod(63 downto 32);
   84                                 else
   85                                          storage(index+1) <= prod(63 downto 32);
   86                                 end if;
   87                         when “110″ => – DIV
   88                                 storage(index) <= d1;
   89                                 if (index = 255) then
   90                                          storage(0) <= d2;
   91                                 else
   92                                          storage(index+1) <= d2;
   93                                 end if;
   94                         when others =>
   95                                 report “Invalid OpCode.” severity FAILURE;
   96                   end case;
   97              end if;
   98
   99            else                                 – CLK |_		
  100               if(phase = ‘0′) then              – state 2, ALU step
  101
  102                   case opcode is
  103                      when “000″ =>  – ADD
  104                                 d1 <= conv_std_logic_vector((s1_int + s2_int), 32);
  105                                 needd2 <= ‘0′;
  106                      when “001″ => – SUB
  107                                 d1 <= conv_std_logic_vector((s1_int - s2_int), 32);
  108                                 needd2 <= ‘0′;
  109                      when “011″ => – COMP
  110                                 d1 <= conv_std_logic_vector((-s1_int - 1), 32);
  111                                 needd2 <= ‘0′;
  112                      when “100″ => – INC
  113                                 d1 <=  conv_std_logic_vector((s1_int + 1), 32);
  114                                 needd2 <= ‘0′;
  115                      when “101″ => – DEC
  116                                 d1 <=  conv_std_logic_vector((s1_int - 1), 32);
  117                                 needd2 <= ‘0′;
  118                      when “010″ => – MULT
  119                                 – If I assign the prod value to d1 and d2 here, a RAW hazard will occur.			
  120                                 prod <= raw1 * raw2;
  121                                 needd2 <= ‘1′;
  122                      when “110″ => – DIV
  123                                 d1 <= conv_std_logic_vector((s1_int/s2_int), 32);
  124                                 d2 <= conv_std_logic_vector((s1_int rem s2_int), 32);
  125                                 needd2 <= ‘1′;
  126                      when others =>
  127                              report “Invalid OpCode.” severity FAILURE;
  128                   end case;
  129
  130
  131                   phase <= ‘1′;
  132                else                             – state 4: NOP, Increase PC, etc.
  133
  134                   – Set STATUS —
  135                   status <= “00″;
  136                   if(needd2 = ‘0′) then – only look at d1
  137                       if d1 = X”00000000″ then
  138                           status <= “01″;
  139                       end if;
  140                       if d1(31) = ‘1′ then
  141                           status <= “10″;
  142                       end if;
  143                       if d1 = X”FFFFFFFF” then
  144                           status <= “11″;
  145                       end if;
  146                   else – needd2 = 1
  147                      if d1 = X”00000000″ and d2 = X”00000000″ then
  148                          status <= “01″;
  149                      end if;
  150                      if d2(31) = ‘1′ and opcode = “010″ then            – high order bit of mult is 1
  151                         status <= “10″;
  152                      else
  153                         if d1(31) = ‘1′ and opcode = “110″ then         – high order bit of div is 1
  154                            status <= “10″;
  155                         end if;
  156                      end if;
  157                      if d1 = X”FFFFFFFF” and d2 = X”FFFFFFFF” then
  158                          status <= “11″;
  159                      end if;
  160                   end if;
  161                   ——————
  162
  163                   phase <= ‘0′;
  164               end if;
  165           end if;
  166        end if;
  167     end if;
  168
  169 end process;
  170 end architecture behav;

Comments (1)

A Practical Guide to the AIMA Python Code 2

Continued with the last post

Some readers reported that if using graph_search to solve the NQueensProblem provided in search.py, python will complain that:

TypeError: list objects are unhashable

Yes, you are doing perfectly right and the code itself is imperfect. As this point, you might want to make the List objective hashable. Inspired by the idea of Set and FrozenSet in Python, you would expect a frozen(x) function to make any mutable object immutable (and therefore hashable) [Or you can try to use tuple from beginning, but in the NQueensProblem code, it uses .index method, which is only suitable for List]. Unfortunately, this proposal (PEP-351) was rejected by Guido. Hence, we don’t have a unified way to convert any “state” to an immutable object and insert it in to “closed table”. One possible way to do this is to write your own State class and implement the __hashcode__ function. The other handy but awkward way is to convert the list to a tuple when insterting to the “closed table, i.e. edit Line 144 of search.py as:

if tuple(node.state) not in closed:
closed[tuple(node.state)] = True

Here is a toy forzen function.

“”"Toy function frozen which can convert mutable objects to immutable
	Author: Eric You XU.
	GPLv2
“”"

def frozen(x):
	“”"Return the immutable version of an object.
	“”"
	if type(x) == type([]):
		return tuple(x)
	elif type(x) == type([]):
		raise “I can’t froze dictionary object as they are very hot.”
	# fill your type here
	else:
		return x

You can put it in to search.py and edit the Line 144 as:

if frozen(node.state) not in closed:
colsed[
frozen(node.state)] = True

Now the code looks much better.

Comments

A practical guide to the AIMA Python code -1

Part -1 Search

Introduction

images.jpg AIMA (Artificial Intelligence: A Modern Approach) might be the most successful textbook about AI. I take this course with Prof. Yixin Chen this semester. I am going to go through all the source code (Python) provided on the website and use them to finish my homework. Therefore, both for my personal referral purpose and for someone who might be interested in using that code, I write a very simple “practical guide to AIMA python source code” series here. Enjoy the magic of AI and Python!
Step 1, build the environment.

In this section, you will need three python files:

http://aima-python.googlecode.com/svn/trunk/agents.py
http://aima-python.googlecode.com/svn/trunk/search.py
http://aima-python.googlecode.com/svn/trunk/utils.py

Type:
python agents.py
and
python search.py

to make sure that you have all the packages like Image and Tk with your python. If you don’t have them, just download them. The general guideline is to use apt-get if possible. If you are not using Debian family OS, download the Python package and then use

python setup.py

at the python package folder.

Step 2, study the code — tree search

In order to use the code, we have to get familiar with the interface. Let’s get start from search.py and start from depth_first_tree_search (Line 130). The line reads:

def depth_first_tree_search(problem):
return tree_search(problem, Stack())

We all know that depth-first search uses Stack. Now let’s omit the technical detail about the implementation of Stack in util.py and check what the “problem” is.

Let’s move to Line 16 of search.py. Here is the definition of Problem class. Here __init__ just assigns the initial state. All functions like successor and goal_test are abstract. So, let’s take a look a concrete example. Let’s jump to Line 458, NQueenProblem. It’s a subclass of Problem. It defines the initial state, successor function, and goal_test function. Node that the function conflicted and conflict are for internal use only, there are not the essential part of a problem.

Here, by this case study, we at least know that we have to define a problem with

1. self.initial field which is the starting point for a problem.
2. A successor function which generate the successor state for a given state. We will study it carefully later.
3. A goal_test function which returns either False or True.

The initial is really trivial, we can just assign a initial state to it. Note that Python is a dynamic type language, so we don’t actually care about what type of state is. We will exam the type of state later, but now let’s assume that we know the type of state.

The successor function is really tricky. A straight-forward way to design a successor function will return all possible states in a list. But if you do that, you will get an “Too many values to unpack” error. The design here is actually very neat: instead of put a state in to the list, you can put a (action, state) tuple in to a list. [Line 88 of search.py, in

function expand(problem)
...
for (act, next) in problem.successor(self.state)

read this line and you will figure out the date structure to store the successors. The reason is also simple: we can trace the path easily if you record the action down at every expanding step.

Since we moved our eyes to the Node class, let's exam it carefully.

The constructor of Node is fairly simple. The path function will backtrack all the path to the root. We can modify the code of __repr__ here to print the action out if needed. If any search finally return a result in the form of Node, we can actually get everything we need from this Node, e.g., the path to the node, the action series to the node.

Based on the knowledge above, you can write your own problem model based on tree search. Now let's move on to graph search.

Step 3. Graph search.

The only difference between tree search and graph search is that in graph search, we have a fringe that stores the closed notes. As you might know, the fringe is implemented in hash table. In Python, there is a tiny technical point: List is not hash-able as it's mutable, Tuple and String are hash-able. Therefore, if you implement your state as a list, make sure to convert it to a tuple or string or a constant. Otherwise graph search will not be happy with this and complain about it.

If you want to use A* search, another point arises: heuristic function. Note that Line 211 reads:

def f(n):
return max(getattr(n, 'f', -infinity), n.path_cost + h(n))

here n is a Node instead of a State. Hence, if you want to get the state information and implement h, you have to use n.state to get the state information.

Step 4. build your application

Perhaps the best part to study the code is to write your own code to test it. Here are my implementations on eightpuzzle and three-missionaries three cannibals problem. The code is ugly and only for study purpose only. Please don't directly copy it to your homework.

Eightpuzzle.py

#!/usr/bin/python

from utils import *;
from search import *;

class EightPuzzle(Problem):
	"""The problem of solving a eight puzzle problem """
	def __init__(self, state):
		self.initial = tuple(state)

	def successor(self, state):
		"""In the 8 puzzle problem, move the blank to other position."""
		result = []
		position = list(state).index(0)
		for action, newposition in self.possible_position(position).iteritems():
			new = list(state[:])
			new[newposition], new[position] = new[position], new[newposition]
			result.append((action, tuple(new)))
		return result

	def possible_position(self, position):
		“”"return the possible position”"”
		result = {};
		col = position % 3;
		row = position // 3;
		if row != 2:
			result['D'] = position +3
		if col != 0:
			result['L'] = position -1
		if row != 0:
			result['U'] = position -3
		if col != 2:
			result['R'] = position +1

		return result
	

	def goal_test(self, state):
		if state[0] != 0:
			return False
		elif state[1] !=1:
			return False
		elif state[2] !=2:
			return False
		elif state[3] !=3:
			return False
		elif state[4] !=4:
			return False
		elif state[5] !=5:
			return False
		elif state[6] !=6:
			return False
		elif state[7] !=7:
			return False
		elif state[8] !=8:
			return False
		else:
			return True 	

	def h(self, node):
		“”"Heuristic function”"”
		temp=list(node.state)
		sum=0
		for x in range(8):	# mahatten distance
			sum = sum + abs(temp.index(x)-x)
		return sum

	def h2(self, node):
		“”"bad h”"”
		sum = 0
		pos = 0
		for x in node.state:
			sum = sum + abs(x - pos)
			pos = pos + 1
		return sum
if (__name__==“__main__”):

#	state = [1, 2, 3, 4, 5, 6, 7, 8, 0]
	state = [7, 2, 4, 5, 0, 6, 8, 3, 1]
#	state = (2, 6, 4, 7, 0, 3, 5, 8, 1)
	instance = EightPuzzle(state)
#	breadth_first_tree_search(instance)
#	depth_first_tree_search(instance)
#	breadth_first_graph_search(instance)
#	print depth_first_graph_search(instance).path()
	print astar_search(instance, instance.h).path() 

Boat.py

#!/usr/bin/python

from utils import *;
from search import *;

“”"state is a tuple with (#m1, #m2, #s1, #s2, where_is_the_boat)”"”

class Boat(Problem):
	def __init__(self, state):
		self.initial = tuple(state)
		self.N = state[0]
		self.nodenum = 1

	def successor(self, state):
		result = []
		symbol = state[4]*2 -1; # 0 => -1
		action = 0
		for x in (0, 1, 2):
			for y in (0, 1, 2):
				if x + y >= 1 and x + y <= 2:
					new = (state[0] + symbol*x, state[1]-symbol*x, state[2] + symbol*y, state[3] - symbol*y, 1 - state[4])
					if self.check(new):
						result.append((action, new))
						action = action + 1
		self.nodenum = self.nodenum + action
		print result
		return result

	def check(self, state):
		for x in state[:-1]:
			if x<0:
				return False;

		if state[0]!=0 and state[0] < state[2]: return False;
		if state[1]!=0 and state[1] < state[3]: return False;

		return True

	def goal_test(self, state):
		if state[0] == 0 and state[1] == self.N and state[2] == 0 and state[3] == self.N and state[4] ==1:
			print state
			return True
		return False;

	def h(self, node):
		return 0
		state=node.state
		return state[0]

if (__name__==“__main__”):

	state=[3, 0, 3, 0, 0]

	instance = Boat(state)
#	breadth_first_tree_search(instance)
#	depth_first_tree_search(instance)
#	breadth_first_graph_search(instance)
#	depth_first_graph_search(instance)
	astar_search(instance, instance.h)
	print instance.nodenum

BTW, if you need to highlight your source code and convert it to HTML for example, one of the handy program is GNU source-highlight. I really love this.

Comments (1)

几条面试题

1. 两个单向链表, 开头结点不一样, 在中间某处开始, 结点一样了, 比如
a-b- c-
       |- p q e d
r-t-y-u-

用最好的方法找出第一个公共节点(本题是 p)

2. 一个唱片, 可以被分成8个大小一样的小扇形. 要求给每个扇形涂上红色或者蓝色, 使得唱片旋转起来以后, 按照顺序读出颜色序列, 就能判断唱片是顺时针旋转还是逆时针.

3. 如何判断一个整数是不是完全平方数 (不允许开方)

4. 数一数一个无符号整数有多少 bit 长 (不允许使用 mask, 想最好的方法)

5. 几条根本等价的概率题

不知长度的日志文件, 如何等概率选出100条.

用两个数码 (2bit)的存储空间 估算我Blog的每天访问量. (假设只有0-500, 500-1000, 1000-1500和1500-2000四档)

第一个全答对的送小奖品 :)

研究资本主义面试题, 向党的十七大献礼

Comments (10)

推荐几篇关于密码学的文章

最近一段时间, 争取经常推荐优秀的技术文章.

统统告…告诉我密码——DES

统统告…告诉我密码——RSA

统统告…告诉我密码——MD5

虽然网上这种入门文章非常多, 但往往缺少技术细节, 看完依然不知道实践中要掌握什么. 这几篇文章在入门和技术细节中达到了比较好的折衷(当然具体写程序的时候还是有很多技术细节要注意的).

对了, 作者是高中生.

Comments (3)

Computation and Storage Model for the Web

When I was in collage, I had two classes in computer science, namely, “Algorithm” and “Data Structure”. These two concepts are universal in both computer programs and software applications, whether on a rescued laptop or a million dollar mainframe. Nowadays, Web becomes tremendously popular, and of course, extends significantly in scalability. Therefore, are there still any general concepts like algorithm and data structure in modeling the web? Here are some incomplete thoughts of mine about the computation and storage model of the web.

1. Google.

For Google search engine, it treats the web as a sorted list based on different keywords. Thus, provided with keywords, the web is sorted by the relevance and PageRank system. Google does both the computation and the storage. For general users, these lists are sorted via some criteria extensively studied by Google; we just get the result out. It seems to me that this is the most successful model for user to access the web. However, sorted list requires both sophisticated sorting mechanism and advanced computational power. Although there are fairly amount of search engines in the world, for most of them, their “sorting quality” or “general coverage” are not as good as Google.

2. Del.icio.us/YouTube/Flickr/Google Base, etc.

For these and other similar sites, they basically organize the web or some special media files on the web on a tag-based system, or preciously, an n-to-n mapping structure. All these four in this category are the leading websites on their own fields. For instance, YouTube is the largest video-sharing website, etc. I would conclude that n-to-n mapping about the media sharing is suitable for socialized website. We can find other success models elsewhere.

3. Amazon, Facebook, Microsoft and Google’s Data Storage API.

According to R/WW, today Microsoft announced Windows Live SkyDrive. Facebook actually quietly released the Data Store API beta recently. Amazon has already had the famous S3 service for a while. They are all treating the web storage as Lookup Table. A closer look shows that all these four data storage API sets are trying to let the user to store heterogeneous media as “object” and support random accessing via keys. For some startups, this feature is critical as the storage scalability is usually an obstacle. Via using these four APIs, the scalability is hidden or moved to these big name companies. Treating the Web as objects will absolutely simplify the storage model and reduce lots of overhead in scalability. Similar to organizing all files under the same folder or on a disk using the file name as the key, sooner they might need some search tool/index/tag mechanism to get rid of the name space nightmare. Additionally, as meta-information should be stored in this system, a search will take twice the database access. Obviously developers have to do more than simply dump the data in. The bottom line is, data storage pools are not big trucks, developers has to maintain them. But I do see the dawn of gird computation here. Here, this is for storage task, but if we can later provide a similar interface to computational task, we will jump into the era of grid computation.

4. Yahoo! Pipe and Google Mashup Editor, FQL, etc.

There is an article about Yahoo! Pipe named “web as a database”. I would rather say that Pipe treads the web as an UNIX file with handy tools dealing with it. Later we got Google Mashup Editor, which is ugly but powerful (at least for me). GME is somehow like Yahoo! Pipe but more natural for programmers. They both tread the Web as a special file (or the concatenate of several files). They provide the “operation system” where you can run the services like sorting and filtering on that particular file. They are making some sort of operation system and binary applications on a Google Inside (or Yahoo! Inside) web. FQL is another funny thing worth mention. It models the Facebook data like people and groups as RDB, and FQL/Facebook platform is RDBMS. My conclusion in this section is Yahoo! Pipe might be a GUI for mashup editing and GME is like a console-based editor. It’s hard to tell which one is better now. Facebook is quite aggressive in reorganizing the web information; guys at Facebook are going to re-model the web in a Facebook manner via the F8 and several millions of users.

5. SUN/SETI@home

While SUN is selling the CPU power at 1$/h to general public, SETI@home utilizes all the idle CPU power around the world. Since I haven’t done much research on the computational model, I just list these two here. The previous one might become the future of grid computation. Actually SUN is very good at grid computation. The second one is the distributed computation. We also have other computation models like P2P computation or other decentralized computation model. While SUN is treating itself as the CPU for the web, SETI@home is treating the Web as the CPU. It’s hard to judge which is superior. SUN might be the first several companies who can develop the actually grid computation on the web, with SETI@home project might be the most powerful computer on the web.

Keep in mind that there is not silver bullet in modeling the web. I do believe that if one wants to setup a big company, he/she does need a big picture about how to model the web and how to setup the model.

Again, all ideas here are immature and need to be refined. You are more than welcome to leave comments and suggestions.

Comments (1)

你是你么

今天一个好朋友 phoenix 上线给我发消息, 商量一个比较重要的事情. 因为他开头几句话语气特别异常, 因此我怀疑不是他本人, 于是, 我开始了漫长的身份认证. 以下是除去重要信息后的聊天记录.

[第一回合, 个人信息下套验证]

P: 能把材料发过来么?
X: 好的 不过 我人在上海耶 而且 我真名是?
P: …我是phoenixinter… 我的msn没有被盗.. 你叫You XU..
X: 先确定一下再说
P: 不然我不会知道你叫Eric.. 你在上海?
X: 你说我在哪里, 我们怎么认识的
P: ……不是你刚才说你在上海么。
P: ……我们通过某某认识的(一个朋友的名字)…
P: 你在wustl.. 我真的是phoenixinter…
P: 你在哪里? 说实话… 上海? wustl?
X: 美国. 用Gtalk. 我先确认一下. 你是你
P: 难道你还在怀疑我不是phoenixinter…@_@
X: 大哥 你我所有的信息都是公开的亚

[第二回合 计算机问题认证]

P: ……那你问点私人信息好了。。
X: 给我讲讲并查集
P: …要怎么讲?
X: 或者 GCJ 的照片 你用的计算机是 HP 还还是DELL? 选择题
P: 。。。。这我真的没注意。。。 还是讲并查集吧。。
X: GCJ 是什么的缩写
P: Google Code Jam…
X: 不问了, 并查可以 Google 到

P: ..哦.. 宥哥 你猛。
P: 你可以找一道题, 我给你讲…
X: 我开始相信了
P: 怎么样@_@

[第三轮: Gtalk + Facebook 认证, 我们开始使用 Gtalk]

X: 最后一个问题, 问完结束
X: 我 Facebook 现在的 Status 是什么 F 的 S 的分别是什么
X: 这个我知道了 就行了. MSN 太容易被盗号了
X: 还有 用的电脑是什么你肯定知道的. 没理由不知道. 我从图片上都能看到…P: …我进facebook看…… 没注意。。。
P: F的是%#^* S 的是@^&^% 你的是 *%$** 相信了?
P: 大哥,你想点你觉得只有真的phoenixinter才会知道的东西吧

[第四轮, 终于找到公钥系统了]

Y: 我们得要一个公钥: S 同学最近追的女生是, 英文名是?
P: F, 英文是F..
P: 这个应该是他”仍然”在追的吧. 我不知道他是否放弃了。。
X: 好的 好, pass

然后我彻底相信了, 这时候, 我们开始攻辩小结.

X: 其实我本来想问动规的, 就是太简单了, 没敢问.
P: 动规是什么? 哈哈, 是不是开始怀疑了 ;)

X: 哈哈. 100% 相信了; 你问我贪心是啥 我都不怕了
P: 其实你可以找个题目给我做.. 然后我发现我不会… @_@
X: 哈哈, 那就搞笑了. 因为这种事情还常有
P: 。。。我用的本是什么牌子的? gcj上面? 我还真没注意。。。
X: IBM 呀. Goolge 从不给人其他牌子.
P: …我就说.. 你给个hp / dell的选择.. 也太bt了……不过我确实连ibm都想不起来..
X: 恩, 不搞点trick. 防止上当

X: 我是想验证你不是你的同学冒充的, 所以问动规没用, 问了一下 GCJ
X: 然后 问了一下并查 因为挫人不懂并查
X: 我的意思是 不做竞赛的
P: 你当时其实可以问, 并查集的复杂度推导给我看一下, 然后我就不知道了, 我是假的了.
X: 我想问 TAOCP 谁写的
P: ……这也太弱智了。。。
X: 哈哈 我原来想问 并查不加路经压缩 最差是多少. 但是这个也弱智
P: 这个还是要想想的.. 我应该会去翻算法导论。。。
P: 但是… 你这样只能判断出, 聊天的人至少是懂一点cs的
X: 对的 我慢慢来 先问这个, 缩小范围
P: ..太狠了你

X: 强命题我没这么容易构造. 我要慢慢想的
X: 还不能问通过你 Blog 知道的; 只能问大家都知道的
P: 对…

X: 零知识认证, 真麻烦. 非要借助第三方
P: 比如 B 同学……。。。。这个第三方太搞了。。。。
P: 因为B没有在任何网络上点名F. 这个认证显然可以把范围缩小到一个很小的范围
X: 对的. 这个游戏很好玩. 找到两个人共同的知识.

P: 先不谈正事了, 我们继续玩这个游戏吧
X: orz…

大家不妨和自己的好友闺密玩玩这个游戏, 看看你们到底有多少共同的秘密, 或者分享了多少只有你们两个知道的一些事情. 说不定, 能听到意想不到的八卦, 还能和朋友更加默契.

Comments (6)

关于Bloom Filter 补充说几句

今天谷歌黑板报上吴军研究员深入浅出的讲解了Bloom Filter. 因为前段时间我在拼写检查器的一点注记 当中也提到了Bloom Filter, 所以补充说几句.

1. 黑板报上说:

对于每一个电子邮件地址 X,我们用八个不同的随机数产生器(F1,F2, …,F8) 产生八个信息指纹(f1, f2, …, f8)。再用一个随机数产生器 G 把这八个信息指纹映射到 1 到十六亿中的八个自然数 g1, g2, …,g8。

其实这句话有点绕人, 本质上来说, 就是有8个不同的Hash 函数, 能把这个 X 映射到八个自然数. (实际上对于好的Hash 函数, 比如 MD5, 算一次, 截成八段, 就是八个很好的Hash 函数了, 不一定要8个随机数产生器.)

2. 黑板报上说:

布隆过滤器决不会漏掉任何一个在黑名单中的可疑地址。但是,它有一条不足之处。也就是它有极小的可能将一个不在黑名单中的电子邮件地址判定为在黑 名单中,因为有可能某个好的邮件地址正巧对应个八个都被设置成一的二进制位。好在这种可能性很小。我们把它称为误识概率。在上面的例子中,误识概率在万分 之一以下。

实际上, 这说的就是 Bloom filter 不会有Flase Negative, 可能有 False Positive. 我们来算一下概率. 假设Hash 函数是理想的, 也就是说, 函数值是均一分布的, Bloom Filter 长为m bits, 那么, 对于一个输入, 某一位没被设置的概率是 1-\frac{1}{m}, 而我们一共有 k 个独立不相关的 Hash 函数, 所以这一位保持为 0 的概率应该是 (1-\frac{1}{m})^k. 因此, 假如我们一直插入了 n 个元素进来, 某一位是 0 的概率就是 (1-\frac{1}{m})^{kn}. 用 1 减去它, 就是这一位是 1 的概率了. 那么, 如果我们这时候开始测试元素是否在集合中而发生了错误, 就是说, 明明元素不在集合里面, 可是Hash 过后每一位都是 1, 这个的概率就是 \left(1-\left(1-\frac{1}{m}\right)^{kn}\right)^k \approx \left(1-e^{-kn/m}\right)^k. 这个题目中, m=16G, n=1G, k=8, 我算出来的错误率是0.0005744 (Linux Bash: echo “(1-e(-8*1/16))^8″|bc -l), 大于万分之一了. [Wikipedia 和我的算法一样]

Comments (7)

« Previous entries · Next entries »