Archive for Linux

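A few handy one-liners

Small development teams working on a LAN constantly need to paste code to each other, pass files around, and share resources. Over years of use I have collected a few handy one-liners, and I am posting them here to share. (This post is not for Windows users.)

1. Piping command-line output through Gtalk

I often need to send a program's output or debugging results to a colleague over IM for diagnosis. These results usually sit in a text console, where copying them out is a hassle, so I wrote a small program, gpipe.py, that lets me attach Gtalk to the end of a program like a pipe. For example, to send the result of a build to Hao Peiqiang (tinyfool):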

make 2>&1 | gpipe tinyfool

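His Gtalk client then gets flooded with my output.
If you are interested, you can also chain through Gtalk: encode the data as base64 and have the receiver decode it. Gtalk then works just like a Linux pipe, feeding the output of a program on one machine into the input of another program on another machine. In practice, in a cross-platform environment this is far more efficient than running each side separately with intermediate files, and it cuts debugging time considerably.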
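The base64 idea can be sketched in a few lines of Python. This is only an illustration, not gpipe.py itself (which is not shown in this post): `encode_for_im` and `decode_from_im` are hypothetical helper names, and the 60-byte chunk size is an arbitrary assumption chosen to keep each chat message short.

```python
import base64

def encode_for_im(data, chunk_size=60):
    """Split raw bytes into short base64 text messages that survive a chat channel."""
    return [
        base64.b64encode(data[i:i + chunk_size]).decode("ascii")
        for i in range(0, len(data), chunk_size)
    ]

def decode_from_im(messages):
    """Rebuild the original bytes from the received messages, in order."""
    return b"".join(base64.b64decode(m) for m in messages)
```

On the sending side you would read the program's output from standard input and pass each encoded message to the IM client; the receiving side decodes the messages and writes the bytes to the next program's standard input.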

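2. Sending files as e-mail attachments

With the mutt client, one line does it:
echo "Content" | mutt -s "Subject" -a file -- email@address.demo

This is very effective for getting small files across quickly, especially source code. It also doubles as an archive and backup; Gmail is so big that not using it is a waste anyway. Lazy people can go one step further and wrap it in a script. On my machine, for example, I have a sendboss.sh that contains:
echo "Hi, These are the file(s), thanks. Eric" | mutt -s "File" -a "$@" -- myboss_email@wustl.edu

So every time, all I need is "sendboss.sh files". My boss is often amazed at how quickly I send him files.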

3. A one-line HTTP file server

python -m SimpleHTTPServer

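This turns the current directory into the root of an HTTP server on port 8000. (On Python 3 the module is called http.server, so the command becomes "python3 -m http.server".) On a LAN, when you need to temporarily share a fairly large file in the current directory, this is simple and safe; truly a must-have at home and on the road.

Also, when downloading, "wget -c" lets you resume an interrupted transfer. A lot of my buddies don't seem to know this little trick.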
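For readers who want the same file server from inside a script, here is a minimal sketch using Python 3's standard library. The function name `serve_directory` is hypothetical, and the sketch assumes Python 3.7 or later (for the `directory` argument and `ThreadingHTTPServer`).

```python
import functools
import http.server
import threading

def serve_directory(directory, port=8000, host=""):
    """Serve `directory` over HTTP in a background thread and return the server."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer((host, port), handler)
    # serve_forever() blocks, so run it in a daemon thread.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `serve_directory(".")` shares the current directory on port 8000, like the one-liner; call `shutdown()` on the returned server when you are done.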

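4. NFS shared folders

SVN and CVS do a fine job of managing code and documents, but every team has some PDF documents, or a copy of the movie Lust, Caution, that the whole team needs to share and that do not belong in version control. A shared folder is essential. The simplest approach is NFS, which is cross-platform and performs reliably. For the server setup you can see here; on the client, all it takes is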

mount nfs_server:/dir /mnt/share

and the folder is ready to use. For a team that includes eMule and BitTorrent fans, this is truly an essential remedy.

Comments (8)

How to give a program a fake system time so that you can use it forever (Linux)

[Disclaimer: It's evil; don't use it unless you are fighting some even more evil software.]

Short intro. [Skip it if you don't know much about operating systems or aren't interested in the technical details.]

As you might know, every program on a Linux system runs on top of the kernel rather than talking directly to the hardware. Modern operating systems such as Linux, BSD (Mac) and Windows use a mechanism called the system call: programs request system resources through the operating system, so the operating system keeps full control over all programs. In brief, when a user program calls a library function, e.g. printf in the C standard library, the library function (usually a lightweight wrapper around a system call) forwards the request to the operating system. Since the system is strictly layered, with the user program built on top of the libraries and the OS kernel, it is possible to insert a layer between the program and the OS to intercept the request. Don't panic about the nerdy name. This strategy is also commonly used by anti-virus software on Windows, which wants to monitor every program's use of system resources and block malicious requests.

Here, we simply want to intercept a system call named "time", so that every time a program requests the current time (to check whether its licence has expired), we feed it a fixed (fake) time. By fooling the program this way, you can literally use it forever. God, doesn't this mean I can use all software forever? The bad news is that on some OSs, such as Windows, system call interception is very hard: the low-level APIs are undocumented, software may have other ways to prevent it, and in practice only top developers and Microsoft partners know how to do it. The good news is that lots of software on Linux and Mac simply reads the system time. And since Linux is open source from the bottom up, there is no way to prevent this kind of interception (now you know why some software companies don't like Linux :).

Approach 

On Linux, there are tons of methods. Here I briefly introduce three of them, assuming you don't have the source code of the software. [Otherwise you can just modify the source code.]

Method 1: Intercept the library call at link time

Sometimes you have a library (A) that is used as part of your program, and you want to intercept calls into that library. The best and easiest way is to write a fake function and link it in at compile time. This method is totally harmless to your system and very neat. If you can modify the makefile, the procedure is completely transparent to both developer and user.

Method 2: Intercept the system call at run time

If you already have an executable, there is no way to intercept the call at compile time. To intercept the system call at run time, there are two ways. The first is to put the target program in a purpose-built container. Typically, the container forks and runs the target program as a child process. Since the parent process is allowed to inspect and control its child, it can intercept syscalls easily via ptrace. The second is to hack the kernel, namely to tell the Linux kernel to respond to a syscall in a certain way. Since Linux supports kernel modules, a very convenient approach is to compile the code as a kernel module and load it on the fly. However, this method is less flexible than the previous one, because all calls to that syscall are now intercepted, including those from other programs. [Sure, you can restrict the module to a certain process via a PID comparison, but then you need to feed the kernel module the PID, which is awkward.]

In my implementation, I use the ptrace/container method. I tried the kernel module method but failed, as there was not much well-written documentation on the Linux 2.6 kernel.

Here is my code; it's self-explanatory. Have fun hacking. [Download the C file]

/* Faketime wraps a user program and feed it with user-specified fake system time
   so that it can be used forever without any “licence expired” problem
	 
    Copyright (C) 2007 Eric You XU, Washington University ( youxu [@T] wustl.edu ) 
    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.
    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.
    You should have received a copy of the GNU General Public License
    along with this program; if not, write to the Free Software
    Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA 

*/

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/syscall.h>
#include <sys/user.h>
#include <unistd.h>
#include <stdio.h>

/*
The register layout is struct user_regs_struct, declared in
<sys/user.h>. On 32-bit x86, orig_eax stores the number of the
syscall and eax receives its return value. On x86-64 the same
fields are called orig_rax and rax.
*/
#if defined(__x86_64__)
#define SYSCALL_NUMBER(r) ((r).orig_rax)
#define SYSCALL_RESULT(r) ((r).rax)
#else
#define SYSCALL_NUMBER(r) ((r).orig_eax)
#define SYSCALL_RESULT(r) ((r).eax)
#endif

/* SYS_time is the machine-specific number of the time syscall
	(13 on 32-bit x86); it comes from <sys/syscall.h>.
	Caveat: on x86-64 with a recent C library, time() is usually
	served from the vDSO without entering the kernel, and then
	this tracer never sees the call. */

#define back_to_future 1175737392

/* Time is stored as a long integer in C, you can get
	current time via time(NULL). Thus, it's very easy to
	get a long integer denoting some time in the past.

	Python/Java can also be helpful in figuring this out

	If you don't know how, just keep in mind that
	Dec. 1, 2007 is about 1196476452.
	One day interval = 60*60*24 = 86400 [Time flies fast]
*/

char* host_program = "your program name here";
char* arglist = "your program fake list here";
/* Make modifications for these two lines, then
	compile it via
		gcc faketime.c -o faketime
	use it via
		./faketime
*/

int main(void)
{
    pid_t child;
    struct user_regs_struct regs;
    int status;
    int insyscall = 0;

    child = fork();
    if (child < 0) {
        perror("fork");
        return 1;
    }
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execl(host_program, arglist, (char *)NULL);
        perror("execl");   /* only reached if exec failed */
        _exit(1);
    }
    else {
        while (1) {
            wait(&status);
            if (WIFEXITED(status) || WIFSIGNALED(status))
                break;
            ptrace(PTRACE_GETREGS, child, 0, &regs);

            if (SYSCALL_NUMBER(regs) == SYS_time) { /* Intercept SYS_time syscall */
                if (insyscall == 0) { /* Syscall entry */
                    insyscall = 1;
                }
                else { /* Syscall exit */
                    insyscall = 0;
                    /* PTRACE_SETREGS always takes the whole register
                       struct, so we read, modify and write it back. */
                    SYSCALL_RESULT(regs) = back_to_future;
                    ptrace(PTRACE_SETREGS, child, 0, &regs);
                }
            } /* End if with SYS_time */
            ptrace(PTRACE_SYSCALL, child, NULL, NULL);
        }
    }
    return 0;
}
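The comment in the code suggests using Python to work out a value for back_to_future. A minimal sketch, assuming the date is meant as UTC; `fake_time_for` and `describe` are hypothetical helper names:

```python
import calendar
from datetime import datetime, timezone

def fake_time_for(year, month, day, hour=0, minute=0, second=0):
    """Return seconds since the Unix epoch for the given UTC date and time."""
    return calendar.timegm((year, month, day, hour, minute, second))

def describe(timestamp):
    """Turn an epoch value back into a readable UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)

print(fake_time_for(2007, 12, 1))  # midnight UTC on Dec. 1, 2007
print(describe(1175737392))        # the back_to_future value used in the C code
```

Pick a date that falls inside the licence period, and subtract multiples of 86400 to step back one day at a time.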

Comments (4)

Build VHDL simulation tool chain on Linux

I am taking a Computer Architecture course this semester and studying basic VHDL for digital design. The instructor recommends ModelSim, which has both Windows and Linux versions. Unfortunately, ModelSim for Linux can only be installed on RH Linux. I tried to install it on Ubuntu, but it didn't work. It seemed the only solution would be to use ModelSim for Windows in our computer lab. But as a die-hard Linux fan, I simply wanted to find alternative software to get things done on Linux. After some googling and asking friends for help, I found two handy software packages that form a tool chain for VHDL simulation on Linux: GHDL and GTKWave.

What

GHDL is an amazing package that uses GCC to compile VHDL code to object code. It can't translate your design into a netlist, but it is sufficient for simulation.
GTKWave is another tool, which lets you view the waveforms dumped by a GHDL simulation. These two can, in principle, handle all the VHDL simulation tasks in a one-semester course like computer architecture.

How

Build the tool chain

If you are using a Debian-family Linux (Debian, Ubuntu, etc.), apt-get will save your life. Try:
apt-get install ghdl
and
apt-get install gtkwave
[You might need sudo, or to run it as root.]

If you are not, download the package and read the README. It's easy. If you want to install it on a Mac, as I did, remember to install X11. The whole procedure should take less than five minutes.

How to use it

It's very simple. GHDL compiles your VHDL code to object code and then links it into an executable. For example, if you have cpu.vhdl, which describes a simple computer, type:

ghdl -a cpu.vhdl

to compile it, and

ghdl -e cpu

to make the executable file (linking).

Finally you can use

./cpu --help

to see how to run it.

Some tips:
If you use the ieee packages, add this option: --ieee=synopsys
If you want the simulation to run for a certain amount of time, use something like

./cpu --stop-time=100ns

If you want to dump the signal waveforms for gtkwave, just add --vcd=vcd.file

After that, you can use

gtkwave vcd.file

to see the waves. You might need to add all the signals to the wave window via "Search->Signal Search Tree".
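The steps above can be chained in a small script so that one command compiles, simulates and opens the viewer. This is a sketch only: the function names are hypothetical, it uses just the options mentioned in this post, and passing --ieee=synopsys to both the analysis and the elaboration step is an assumption.

```python
import subprocess

def ghdl_commands(source, unit, stop_time="100ns", vcd="vcd.file", ieee_synopsys=True):
    """Build the compile / link / run / view commands in the order described above."""
    ieee = ["--ieee=synopsys"] if ieee_synopsys else []
    return [
        ["ghdl", "-a", *ieee, source],                              # compile
        ["ghdl", "-e", *ieee, unit],                                # link
        ["./" + unit, "--stop-time=" + stop_time, "--vcd=" + vcd],  # simulate and dump
        ["gtkwave", vcd],                                           # view the waves
    ]

def run_test(source="cpu.vhdl", unit="cpu"):
    """Run each step in turn, stopping at the first one that fails."""
    for cmd in ghdl_commands(source, unit):
        subprocess.run(cmd, check=True)
```

Calling `run_test()` in a directory that contains cpu.vhdl walks through the whole cycle; `check=True` makes a compile error stop the run before the simulation starts.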

Why

You might ask why this is handier than ModelSim. It's obvious. For example, if you write a shell script that issues all the compiling/dumping/viewing commands in the right order, you only need two keystrokes to run a test. In ModelSim, by comparison, you need at least 10 mouse clicks and may have to type these commands:

view wave
add wave -r /*
"set format, etc"

Finally, you can use GHDL as your VHDL code formatter and get a very nice HTML rendering of your neat code. You can now make it even neater, why not :)

Some features these small tools really lack:

1. Watching non-standard signals.

If you have a signal that is an array, viewing its wave is not handy, as GHDL doesn't actually output the change information for such a signal.

2. Automatic dependency resolution.

Currently GHDL complains about dependencies, but it doesn't actually handle them the way GNU make does. Therefore, you have to either write your own makefile and let make manage the dependencies, or recompile all the code every time.

My example code of a silly CPU:

-- CSE 560 Homework, a simple CPU
-- Eric You XU
-- version 0.05

library ieee;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;
use ieee.std_logic_1164.all;

entity CPU is
        port (  IR      :       in std_logic_vector (31 downto 0);
                READY   :       in bit;
                CLK     :       in bit
                );
end entity CPU;

architecture behav of CPU is
    type reg_type is array(0 to 255) of std_logic_vector(31 downto 0);
    signal storage: reg_type;

    signal d1, d2: std_logic_vector(31 downto 0);
    signal raw1, raw2: std_logic_vector(31 downto 0);
    signal s1_int, s2_int: integer;
    signal prod: std_logic_vector(63 downto 0);
    signal status:      std_logic_vector(1 downto 0);
    signal phase:       bit;
    signal index:       integer;
    signal opcode:      std_logic_vector(2 downto 0);
    signal needd2:      bit;

begin

process(clk)
   begin
    if (ready='0') then
        phase <= '0';
        -- For test
        storage(0) <= X"00000004";
        storage(1) <= X"FFFFFFFD";
        storage(2) <= X"01010103";
        storage(3) <= X"0F0F0F08";
        storage(4) <= X"F0F0F0F0";
        storage(5) <= X"FFFFFFFF";
        storage(6) <= X"00000000";
        storage(7) <= X"0000000E";
        storage(8) <= X"00000004";
        storage(9) <= X"DEADBEEF";

     else
       if(clk'event) then
          if(clk = '1') then                    -- CLK _|
              if(phase = '0') then              -- state 1, read instructions

                  opcode        <=      IR(26 downto 24);
                  -- 31 30 29 28 27 26 25 24: lower 3 bits
                  raw1 <= storage(ieee.std_logic_unsigned.conv_integer( IR(23 downto 16) ));
                  raw2 <= storage(ieee.std_logic_unsigned.conv_integer( IR(15 downto 8) ));
                  -- RAW hazard
                  -- avoid using s1_int = conv_integer(raw1)
                  s1_int        <=      ieee.std_logic_signed.conv_integer(storage(ieee.std_logic_unsigned.conv_integer( IR(23 downto 16) )));
                  s2_int        <=      ieee.std_logic_signed.conv_integer(storage(ieee.std_logic_unsigned.conv_integer( IR(15 downto 8) )));
                  index <= ieee.std_logic_unsigned.conv_integer(IR(7 downto 0));

              else                              -- state 3, write register files
                case opcode is
                        when "000" => -- ADD
                                storage(index) <= d1;
                        when "001" => -- SUB
                                storage(index) <= d1;
                        when "011" =>
                                storage(index) <= d1;
                        when "100" => -- INC
                                storage(index) <= d1;
                        when "101" => -- DEC
                                storage(index) <= d1;
                        when "010" => -- MULT
                                storage(index) <= prod(31 downto 0);
                                d2 <= prod(63 downto 32);
                                d1 <= prod(31 downto 0);
                                if (index = 255) then
                                         storage(0) <= prod(63 downto 32);
                                else
                                         storage(index+1) <= prod(63 downto 32);
                                end if;
                        when "110" => -- DIV
                                storage(index) <= d1;
                                if (index = 255) then
                                         storage(0) <= d2;
                                else
                                         storage(index+1) <= d2;
                                end if;
                        when others =>
                                report "Invalid OpCode." severity FAILURE;
                  end case;
             end if;

           else                                 -- CLK |_
              if(phase = '0') then              -- state 2, ALU step

                  case opcode is
                     when "000" =>  -- ADD
                                d1 <= conv_std_logic_vector((s1_int + s2_int), 32);
                                needd2 <= '0';
                     when "001" => -- SUB
                                d1 <= conv_std_logic_vector((s1_int - s2_int), 32);
                                needd2 <= '0';
                     when "011" => -- COMP
                                d1 <= conv_std_logic_vector((-s1_int - 1), 32);
                                needd2 <= '0';
                     when "100" => -- INC
                                d1 <=  conv_std_logic_vector((s1_int + 1), 32);
                                needd2 <= '0';
                     when "101" => -- DEC
                                d1 <=  conv_std_logic_vector((s1_int - 1), 32);
                                needd2 <= '0';
                     when "010" => -- MULT
                                -- If I assign the prod value to d1 and d2 here, a RAW hazard will occur.
                                prod <= raw1 * raw2;
                                needd2 <= '1';
                     when "110" => -- DIV
                                d1 <= conv_std_logic_vector((s1_int/s2_int), 32);
                                d2 <= conv_std_logic_vector((s1_int rem s2_int), 32);
                                needd2 <= '1';
                     when others =>
                             report "Invalid OpCode." severity FAILURE;
                  end case;


                  phase <= '1';
               else                             -- state 4: NOP, Increase PC, etc.

                  -- Set STATUS --
                  status <= "00";
                  if(needd2 = '0') then -- only look at d1
                      if d1 = X"00000000" then
                          status <= "01";
                      end if;
                      if d1(31) = '1' then
                          status <= "10";
                      end if;
                      if d1 = X"FFFFFFFF" then
                          status <= "11";
                      end if;
                  else -- needd2 = 1
                     if d1 = X"00000000" and d2 = X"00000000" then
                         status <= "01";
                     end if;
                     if d2(31) = '1' and opcode = "010" then            -- high order bit of mult is 1
                        status <= "10";
                     else
                        if d1(31) = '1' and opcode = "110" then         -- high order bit of div is 1
                           status <= "10";
                        end if;
                     end if;
                     if d1 = X"FFFFFFFF" and d2 = X"FFFFFFFF" then
                         status <= "11";
                     end if;
                  end if;
                  ------------------

                  phase <= '0';
              end if;
          end if;
       end if;
    end if;

end process;
end architecture behav;

Comments (1)

Install Ubuntu on Dell Vostro 200

[Keywords for search engine: Ubuntu dell vostro 200 live cd can’t boot harddisk]

Make sure to add "irqpoll" as a kernel parameter, i.e.:

title           Ubuntu gutsy, kernel 2.6.22-12-generic (recovery mode)
root            (hd0,0)
kernel          /boot/vmlinuz-2.6.22-12-generic root=UUID=78dea514-8cc8-40a9-9e51-13c359bc681b ro  quiet splash irqpoll

initrd          /boot/initrd.img-2.6.22-12-generic

The steps below are quoted from here.

When the PC boots up, you will see the Grub countdown, which is set to 3 seconds by default. Press "Esc" to interrupt the countdown and enter the Grub menu. Then

  • Press ‘e’ to start editing.
  • Scroll down to the "kernel…" line. This is the line that tells Grub which kernel to boot; the parameters passed to the kernel at boot are placed at the end of this line.
  • Press ‘e’ again to edit this line.
  • Move to the end of the line. You will see any existing parameters and can add other new parameters to the end. [Add your irqpoll here]
  • Parameters are separated by spaces and are mostly either a single word (e.g. nolapic), or an equation (e.g. acpi=off).
  • Once you have added the parameter to the end of the line, press Enter to accept the editing.
  • Then press ‘b’ to boot using that kernel and those parameters.

Then, go here to see how you can make the parameter permanent.

Comments (1)

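For those who say the Linux GUI is ugly

iTunes-style window switching
Shift Window

Live thumbnails when switching windows with Tab
Tab Switch

Tiled desktops
screenshot-2.png

The desktop is a showcase too; remember to pair it with a picture of a pretty girl
Reflect

The Exposé effect from the Mac
screenshot-4.png

This is the dazzling environment I work in now :)

(Install Ubuntu and all of this is yours; why are you still hunting all over the place for a pirated copy of Vista?)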

Comments (7)

Totally random

In a Linux shell with GNU make:

>make love
make: *** No rule to make target 'love'. Stop.

Do you really get tired of this line? Try this:

>vim makefile

love:
	@echo "oh, yeah, oh oh oh yea, yes, yes, yes."

> make love

oh, yeah, oh oh oh yea, yes, yes, yes.

Em, it literally 'sounds' better. BTW, if you are using a Mac, try piping this to "say", the command-line interface to the built-in text-to-speech engine. Well, awesome!

Another one from xkcd:

Any new ideas for having fun with Linux?

Thanks to Tinyfool.

images.jpg

Comments (4)

Some useful tools for you to write English articles on Linux

As an ESL (English as a Second Language) student, I usually dread writing articles. Nevertheless, I have to write about one article per week, either to learn English or to record my ideas. For many people in China, the killer applications are Word and Kingsoft Ciba. They simply type a Chinese phrase into the electronic dictionary, copy and paste the English word, and run a grammar check in Word. Once Word stops reporting spelling and grammar errors, they feel a great sense of achievement. I used to be one of them.

Meanwhile, as a die-hard Linux fan, I have an emotional dislike of M$ products. It seemed that the only way out was AbiWord or OpenOffice. I've used both for a while, and I have to say they are helpful but not perfect. To use them I have to prepare a plain text file, which is inconvenient when I am working on a TeX file. On Mac OS X, I also have to install X11. Don't get me wrong: *nix is industrial-strength and designed to do everything with the shell alone. (Well, WoW is the last thing on my mind.)

After some painful Googling, I now have at least four tools that help with ESL writing.

1. GNU Aspell.

GNU Aspell is a free and open source spell checker. It can spell-check source code, script comments and TeX files as well as HTML pages and e-mail. Aspell offers both an interactive and a batch mode. It has several advanced features that are missing from both M$ Office and OO, such as user-defined dictionaries stored as text files and "sounds like" suggestions (e.g., know and no). GNU Aspell is definitely for literate programmers or Ph.D. students who want to write elegant code comments and academic articles.

2. GNU diction

GNU diction originated from the diction program on AT&T UNIX. It is a rule-based style checker. I've read the code thoroughly and found that almost every rule comes from the book "The Elements of Style" by William Strunk. That is to say, you now have "The Elements of Style" in your pocket. Please note that the simple grammar checker in Word has nothing to do with style checking. GNU diction is a charming complement to Word/OpenOffice if you insist on using them.

Because it is rule-based, it sometimes flags usage that is in fact correct. As D. E. Knuth remarks in "Mathematical Writing", the analysis done by diction is quite superficial. "However, said Don, these programs are kind of fun. And they do provide an excuse to read the document from another point of view. Even if the analysis is wrong it does prompt you to re-read your prose, and this has to be a good thing".

3. GNU Style

GNU style is included in the GNU diction package. It reports the readability of your article according to several well-known indexes. For native speakers, these are used to improve the readability of an article. For ESL students, however, the indexes can be read as a writing level, in terms of the school grade an average American needs to understand your article. In my opinion, we ESL students should avoid overly naive words and overly simple sentences in technical writing. Certainly, don't use a million-dollar word where a one-dollar word will do. Yet for ESL students, trying out some new and sophisticated words will eventually improve writing ability.
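To give a feel for what such an index measures, here is a rough sketch of one well-known formula, the Automated Readability Index, which maps characters per word and words per sentence to a US school grade. The tokenization below is a naive assumption of mine and will not reproduce the exact numbers that style prints.

```python
import re

def automated_readability_index(text):
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    # Count only letters and digits as characters.
    characters = sum(len(re.sub(r"[^A-Za-z0-9]", "", w)) for w in words)
    return 4.71 * characters / len(words) + 0.5 * len(words) / len(sentences) - 21.43
```

Longer words and longer sentences both push the score up, which is why the same index can be read either as a readability warning or as a rough writing level.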

4. LanguageTool (GPLed)

It is an open source, Java-based language checker for English and other languages. I began using it recently. It's better than the built-in grammar checker in OpenOffice. Moreover, it supports a CLI mode and a web mode. This is the missing grammar-checking tool on the Linux platform.

I remember that when I was a college student, I struggled to write English articles with M$ Word or OpenOffice. My personal experience with English writing and the M$ Word grammar checker taught me that we should never, ever rely on the f**king damn grammar checker for quality. As a rule of thumb, the best way to improve ESL writing skills is to write and to practice.

BTW: In preparing this article, I used vim, aspell, diction, style, LanguageTool and other tools on the Linux and Mac platforms.

Comments (3)

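A few notes on spell checkers

Spell checking is really just computing the set difference between the words in a text and the words in a dictionary. So with a few simple command-line tricks, you can find spelling mistakes.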
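The set-difference view translates almost literally into code. A minimal sketch, assuming plain English text and an in-memory word list; `misspelled` is a hypothetical name:

```python
import re

def misspelled(text, dictionary):
    """Return the words in `text` that are not in `dictionary` (text minus dictionary)."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    known = set(w.lower() for w in dictionary)
    return sorted(words - known)
```

For example, `misspelled("The quick brwon fox", ["the", "quick", "brown", "fox"])` leaves only "brwon".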

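Reality is not that simple, though, because the input may not be plain text. On Linux, for example, what I most need to spell-check are my web pages and my papers, that is, HTML files and TeX files. Fortunately, Linux has had tools for this for a long time. Aspell is one such powerful tool.

Word and other tools can also check spelling, but Gmail's simple spell checker is incredibly convenient; I strongly recommend it whenever you need one.

Speaking of Google, let's look at how Peter Norvig, AI heavyweight and Director of Research at Google, wrote a spell checker based on a probabilistic model in 20 lines of Python. I translated the article into Chinese in my spare time [here].

If you are developing on a memory-constrained system, such as an embedded system, Peter Norvig's approach is not very practical, because even storing the dictionary is a stretch. In that case a hash table is surely the better choice. A table indexed by a single hash has a fairly high error rate, though, so we can use a Bloom filter [wiki]; a simple implementation is provided here. That implementation is quite instructive because it takes word inflections into account, such as the -es and -ing suffixes. Peter Norvig mentions this problem in his article but does not solve it; interested readers can try rewriting his code themselves. I have re-commented the Bloom-filter spell checker code here.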
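To make the Bloom filter idea concrete, here is a minimal sketch. It is not the implementation linked above: the class, the default sizes, and the use of salted SHA-1 digests as the hash family are all assumptions made for illustration.

```python
import hashlib

class BloomFilter:
    """A fixed-size bit array plus k hash positions per word.

    Lookups can give false positives but never false negatives.
    """

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, word):
        # Derive k bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(("%d:%s" % (i, word)).encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word)
        )
```

With the defaults the whole filter takes 8 KB no matter how long the words are, which is the point on a small device. A word is reported as misspelled when `word not in bloom`; a real misspelling is occasionally let through as a false positive, and more hash functions or more bits reduce how often that happens.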

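A few more idle remarks:

1. Python is a concise and charming language; these 20 lines of code say it all :)

2. Time is something you squeeze out. I translated Peter Norvig's article bit by bit, typing a few lines before bed each night. Having finished it, my takeaway is that persistence is what counts.

3. I wrote this purely for fun. I am neither a natural language expert nor a Python guru, and my research area has nothing to do with any of this. So if anything here is shallow, or if I have missed something, let's talk.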

Comments (2)

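A simple way to compute set intersection and difference in the Linux shell

Che Dong, an old hand, wrote in a recent blog post:

Winter recently taught me a file comparison command: comm, a simpler way than diff to get the intersection/complement of two files. I used to think this required joining two tables; now it takes just a few options.

comm is a relative of diff. It requires both files to be sorted first and then walks through them in step. diff, by contrast, hashes each line to an integer internally and then runs a longest-common-subsequence dynamic program, so comparing an m-line file against an n-line file is fairly expensive. Interested readers can read this paper [PS]. I used comm myself in the past, but for reasons I never figured out, it did not work well for set difference. Even with the input sorted, the difference came out wrong for many files.

Finding the intersection and difference of two files is a very common task. A simple example: every week I scrape song titles from Google Music Trends, and afterwards I need to know which ones are new songs I have to download (a song can stay on the chart for several weeks, and there is no need to download it again). Suppose the songs stored on my machine are set B and the newly scraped ones are set A, both kept in files with one song title per line. The task is to compute A-B, that is, the songs that are in A but not in B. My long experience shows that comm does not compute the difference accurately.

After much head-scratching, I thought: if the need is real and there is no ready-made tool, then it can surely be written as a one-line script. A bit of set theory makes it obvious: just remove from A whatever appears in both A and B. So the trick is to put two copies of B and one copy of A together into one big file, and then pick the lines that occur exactly once in it. Any such line must be in A and not in B, because every line from B occurs at least twice (this assumes A itself has no repeated lines). Hence the method is: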

sort B B A | uniq -u

Likewise, the symmetric difference is:

sort A B | uniq -u
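The same counting trick can be written out in Python to check the reasoning. This is a sketch that mirrors the two one-liners rather than a replacement for them; the function names are made up, and like the shell version it assumes that neither input repeats a line.

```python
from collections import Counter

def uniq_u(lines):
    """Like `sort | uniq -u`: keep the lines that occur exactly once, sorted."""
    counts = Counter(lines)
    return sorted(line for line, n in counts.items() if n == 1)

def difference(a, b):
    """A - B, mirroring `sort B B A | uniq -u`."""
    return uniq_u(list(b) + list(b) + list(a))

def symmetric_difference(a, b):
    """Lines in exactly one of A and B, mirroring `sort A B | uniq -u`."""
    return uniq_u(list(a) + list(b))
```

With A = x, y, z and B = y, w, the doubled copy of B pushes y to three occurrences and w to two, so only x and z survive as A-B.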

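Why don't I like comm? Because comm operates on sorted lists; for sets, sort and uniq are all you need. Perhaps this is the "join two tables" that Che Dong mentioned, but I think this version is probably easier to write. After all, fewer people use comm than use sort and uniq. Besides, the complexity is low and the construction is clever; what better way to show off a solid grounding in math and computer science :P

Linux is not a toy; it is a tool of intelligence for getting work done faster. The more I use it, the more I feel that the only thing in short supply is imagination.

Update: on Linux there is never just one way to solve a problem. Today an American colleague showed me another way to compute the difference A-B: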

grep -F -f listb lista -v

Comments (2)

IE for Linux

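Xiaolai says that because the damned online banks flatly refuse to support Linux, he cannot switch his computer to pure Ubuntu. Actually, if it is China Merchants Bank, I have a way. (Others are probably similar.) I have known this method all along and just forgot to write it up, which is a bit unfair to the many Linux users out there.

Once we realize that the crux of the problem is IE, not Windows, it becomes easy. A year ago, when I saw that there was an IE for the Mac, I wondered whether there was one for Linux. Sure enough, prompted by a classmate, I found that a group of enthusiasts had actually brought Microsoft's IE to Linux (running under Wine, of course), and it can even install plug-ins and run ActiveX. My classmate also tried using this IE to tease Microsoft's automatic update system (Windows Update), which could not identify the system and threw a thoroughly innocent error. It was a joy to watch :)

Back to business. Visit this site and follow the instructions: http://www.tatanka.com.br/ies4linux/page/Main_Page

For those who get a headache at the sight of English, here is a rough outline:

1. You need cabextract and Wine. Both packages are available via apt-get (Debian family) or yum/rpm.

2. Type the following commands in a terminal: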

wget http://www.tatanka.com.br/ies4linux/downloads/ies4linux-latest.tar.gz

tar zxvf ies4linux-latest.tar.gz

cd ies4linux-*
./ies4linux

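The program then starts and asks which version of IE to install; it goes to Microsoft's site, downloads IE, and installs it by itself. In the installation directory you will find something like a shortcut. Just pay attention to the program's prompts; there is nothing technically difficult here. (Note: the third command contains a *. Don't type the * literally; just see what directory was created and cd into it. I recommend installing the Chinese version of IE 6, because in my tests it worked with China Merchants Bank right away.)

I just tried it with my China Merchants Bank account, and it works.

Xiaolai, give up Windows. The last reason to use Windows is gone :)

An update: after finishing a post I usually check whether anyone has written something similar, and if so I borrow a few of their ideas. It turns out many people have written this up, some better than I did. But I was disappointed to find some baffling people insisting yet again that Windows is more secure. I will state just one fact: all the banks in the US support Linux and Firefox. I doubt any bank in China can compare with the likes of Citigroup. I have always used online banking at Citi and Bank of America, either on a Mac with Firefox or on Linux with Firefox, and never had a problem.

Certain comrades should stop passing off their own ignorance as "the rules" to lecture others with, as if banks all over the world worked this way. Not knowing is fine; just don't mislead other people.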

Comments (11)
