Jul 11, 2007 - Some useful tools for you to write English articles on Linux


As an ESL (English as a Second Language) student, I usually have a fear of writing articles. Nevertheless, I have to write about one article per week, either for learning English or for recording my ideas. For many people in China, the killer applications are Word and Kingsoft Ciba. They simply type a Chinese phrase into the electronic dictionary, copy and paste the English word, and run a grammar check in Word. Once Word stops reporting spelling and grammar errors, they feel a grand sense of achievement. I was one of them before.

In the meantime, as a Linux die-hard, I dislike M$ products emotionally. It seems to me that the only way out is AbiWord or OpenOffice. I’ve used both for a while. Yet I have to say that they are helpful but not perfect. To use them, I have to prepare a plain text file, which is inconvenient when you are working on a TeX file. On Mac OS X, the other thing is that I have to install X11. Don’t get me wrong: *nix is industrial-strength and designed to do everything solely with the shell. (Well, WoW is the last thing on my mind.)

After some painful Googling, I now have at least four tools that help with ESL writing.

1. GNU Aspell

GNU Aspell is a Free and Open Source spell checker. It supports spell checking for source code, script comments, and TeX files as well as HTML pages and email. Aspell offers both interactive and batch modes. It has several advanced features that are missing in both M$ Office and OO, such as text-file-based user-defined dictionaries and “sounds like” suggestions (e.g., know and no). GNU Aspell is definitely for literate programmers or PhD students who want to write elegant code comments and academic articles.

2. GNU diction

GNU diction originated from the diction program on AT&T UNIX. It is actually a rule-based style checker. I’ve read the code thoroughly and found that almost every rule came from the book “The Elements of Style” by William Strunk. That is to say, you now have “The Elements of Style” in your pocket. Please note that the simple grammar checker in Word has nothing to do with style checking. GNU diction is a charming complement to Word/OpenOffice if you insist on using them.

Because it is rule based, it sometimes flags usage that is in fact correct. As D.E. Knuth mentioned in “Mathematical Writing”, the analysis of diction is quite superficial. “However, said Don, these programs are kind of fun. And they do provide an excuse to read the document from another point of view. Even if the analysis is wrong it does prompt you to re-read your prose, and this has to be a good thing”.

3. GNU Style

GNU style is contained in the GNU diction package. It reports the readability of your article using several well-known indexes. Native speakers use these to improve the readability of an article. For ESL students, however, these indexes can be read as a writing level, in terms of the “grade/school year an average American needs to understand your article”. In my opinion, we ESL students should avoid overly naive words and overly simple sentences in technical writing. Definitely don’t use a million-dollar word where a one-dollar word will do. Yet for ESL students, trying out some new and sophisticated words will eventually boost writing ability.
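To make the idea of a readability index a bit more concrete, here is a minimal Python sketch of one of them, the Automated Readability Index. The function name `automated_readability_index` and the naive way of splitting sentences and words are my own assumptions for illustration; GNU style counts sentences and words with its own rules, so its numbers will likely differ.

```python
import re


def automated_readability_index(text):
    """Rough ARI: 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43.

    Sentence and word splitting here are crude heuristics (assumptions),
    not the rules used by GNU style.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


simple = "The cat sat on the mat. It was a good day."
dense = ("Sophisticated terminology considerably complicates "
         "comprehension for inexperienced readers.")

# Short words and short sentences score lower than long ones.
print(round(automated_readability_index(simple), 2))
print(round(automated_readability_index(dense), 2))
```

A higher score means longer words and longer sentences, which is the "grade level" reading of the index mentioned above.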

4. LanguageTool (GPLed)

It is an open source, Java-based language checker for English and other languages. I began to use it recently. It’s better than the built-in grammar checker in OpenOffice. Moreover, it supports both a CLI mode and a web mode. This is the missing grammar-checking tool on the Linux platform.

I remember that when I was a college student, I struggled to write English articles with M$ Word or OpenOffice. My personal experience with English writing and the M$ Word grammar checker taught me that we should never, ever rely on the f**king damn grammar checker for quality. As a rule of thumb, the best way to improve ESL writing skill is to write and to practice.

BTW: In preparing this article, I employed vim, aspell, diction, style, languagetool and other tools on the Linux and Mac platforms.

Jul 10, 2007 - 10 reasons why we shouldn’t have the holy war towards others on coding/design style


I came across a story on Solidot (a Chinese version of Slashdot) this morning: one of the developers of the Fcitx project, one of the top open source (GPLed) Chinese input method projects on the *nix platform, has finally decided to terminate it. English-speaking users might not fully realize the importance of an IME. For Linux users living in East Asia, an IME is more or less equivalent to a keyboard. The IME is so critical to a software platform that Google has also developed a Chinese IME recently. So why did this developer decide to terminate such a significant (and well-known) project? According to the main page of the project, the core developer got pissed off by another developer who criticized the project for its “poor and ugly” coding style. So my question arises: can we developers in the open source community criticize others’ projects on the basis of “coding style”?

Further reading tells me that criticizing others’ coding style is very common in the open source community. I am not an expert in coding per se; however, I have at least 10 reasons why we shouldn’t wage a holy war on other developers over coding or design style.

1) In the open source community, code is for running or for fun, not (merely) for reading.

Traditionally, most people thought that the skills required to write one’s own software were so advanced that one could never hope to write one’s own code. However, advanced programming tools such as IDEs and high-level scripting languages have fundamentally changed the picture. Now, lots of beginners are willing to write some code to get things done, and they are as passionate as the gurus about putting their code in the public domain. Consequently, their ugly coding/design style gets criticized by others in the community as “not readable” or “not beautiful”. However, what is the purpose of the open source movement? I would say that the open source movement is about sharing and freedom: you can learn from others and do whatever you want. No one in the open source community aims to write “textbook” source code. We basically write code for a specific purpose. Therefore, people should not criticize from an aesthetic perspective. After all, code is for running or for personal fun. Coding for reading is not the purpose per se; at least it is not the original purpose. So we have to live with the fact that every developer has his own coding/design style, even if the style is sometimes goofy.

2) Style is highly restricted by the language features, or by the ‘native’ programming language the developer uses.

It has been quite a long time since PERL was first called the Pathologically Eclectic Rubbish Lister. However, as far as I know, lots of researchers around the world use PERL on a daily basis. In our department, many colleagues use PERL as their ‘native’ language even though PERL is somewhat write-only. Given this, it is unfair to judge others’ coding style, as it differs from language to language. For example, my ‘native’ programming language is Java. When I first read the book “Text Processing in Python”, I found the “map” operation amazing whenever I need to apply the same operation to every element of an iterable container. However, in practice, I still cannot help using a “for loop” instead of “map”. As syntactic sugar differs from language to language, it usually takes quite a lot of experience before one actually picks up the right coding style in a programming language other than one’s native one. Therefore, the holy war on coding style is similar to a holy war between English and Spanish, which is vulgar and intolerant.
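As a small illustration of the “for loop” versus “map” habit, here is a sketch in Python. The list `words` and the variable names are made up for the example; all three versions produce the same result, and which one reads best is exactly the kind of style question discussed here.

```python
words = ["aspell", "diction", "style"]

# Java-flavoured habit: an explicit loop that builds the result step by step.
lengths_loop = []
for word in words:
    lengths_loop.append(len(word))

# The "map" idiom: apply one function to every element of an iterable.
lengths_map = list(map(len, words))

# A list comprehension, which many Python programmers would reach for first.
lengths_comp = [len(word) for word in words]

# All three styles compute the same thing.
assert lengths_loop == lengths_map == lengths_comp
print(lengths_map)
```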

3) Design patterns and coding style reflect the underlying thinking, or different design purposes, which might differ from person to person.

If you Google “code style”, you will find tons of guidelines ranging from kernel programming to CSS coding style. This is because for a robust and collaborative open source project, a nice coding style significantly reduces the communication overhead as well as the time wasted on maintenance. However, what if someone is going to write a system or framework from scratch?

Basically, a coding/design style somehow reflects the underlying idea. For example, I guess I am not the only one who dislikes the wrapper design in the java.io package. In order to get a single Unicode string reader from a file, you have to wrap the FileInputStream in several objects, while in C++ or Python a single statement settles all the chaos. However, the idea of Java IO is to make the fewest assumptions about the IO stream and give the programmer the most flexibility. Therefore, merely criticizing code for using Java IO, or the design of Java IO itself, is gratuitous. Frankly speaking, I do not like the design style of Apache Struts, as it makes deploying a small system very complicated for me (that’s why we have RoR); however, Struts strictly implements the MVC Model 2 pattern and thereby makes itself powerful for large systems. Everyone can make up his/her own style at this point. Therefore, the holy war on coding/design style is like choosing between apples and pears: it’s not necessary to gauge which one is better, as they are just different fruits of different origins.
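For the Python side of that comparison, a minimal sketch looks like the following. The helper name `read_text` and the temporary sample file are assumptions made for the example; the Java chain is shown only as a comment for contrast.

```python
import os
import tempfile

# Java, for comparison (comment only): the reader is built by wrapping streams,
#   new BufferedReader(new InputStreamReader(new FileInputStream(path), "UTF-8"))


def read_text(path):
    """Read a whole UTF-8 text file; the open() call is the 'single statement'."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Write a small sample file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("输入法 IME\n")
    content = read_text(sample_path)

print(content)
```

The Java version is longer because each wrapper adds one decision (where the bytes come from, how they are decoded, whether they are buffered), which is the flexibility argument made above.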

4) Although there might be some rules of thumb for writing beautiful code, there is no single standard.

As I’ve mentioned before, there is no single standard in coding/design style, as you can always attack the same problem with different approaches. This holds not only for coding style, but also for XML configuration file style and other design-related stuff. For example, when I was learning Apache Ant, I wanted to call build.xml makefile.xml, or to have a separate XML file for each target, etc. However, it is hard to tell which design is better. In the IME case, the programmer simply uses Chinese for the tag names in the XML file. As you might know, as long as the XML is encoded in UTF-8, this is an issue for neither understanding nor program migration. However, this design was criticized as “not very i18n”. I would say that this judgment is goofy and absurd.

5) Stay foolish

As it is usually hard to define which style is better, the developers in the open source community should indeed stay foolish. I do agree that arguments and discussions about a particular project are quite helpful. However, keeping the discussion polite and elegant would be more productive.

6) Never judge people by their code style.

Because this doesn’t make any sense. In the book “12 habits that hold good people back”, the author mentions a kind of person who “sees the world in black and white”. Anyone who judges a person simply by coding style falls into this category. Coding style is not the whole of programming. Moreover, a poor coding style/design does not necessarily mean a lack of ability to develop the system.

7) Refactoring is a procedure, not a purpose.

Just like Feynman’s famous quote that “physics is like sex”, refactoring code is like sex too: it may give some practical results, but that’s not why we do it.

In short, refactoring is a procedure that makes the code more readable or easier to maintain. However, there is no reason why we should refactor code solely for aesthetic purposes. I can imagine that the GoF book kindles a passion, from the bottom of its readers’ hearts, to refactor every piece of code written by others. I was one of them. The GoF book will definitely help build a “sensitive nose” that can sniff out code smells. Applying a nice coding style or design pattern is probably good practice. Still, I insist that a nice coding style or a sophisticated design pattern is not the reason why we write code.

8) Efficiency or beauty is an issue, but making it work is the first priority.

When I interviewed with Google, one of the interviewers gave me a very good suggestion when I got stuck on a problem. He told me that the philosophy at Google is “first make it work, then improve it”. In the open source community, software usually exists to solve a real-world problem. Therefore, making it work is much more important than making it beautiful. I have to concede that there are some developers who can achieve both goals at the same time. Nevertheless, for most developers, the code is usually ugly and awkward at the very beginning. If I have to choose between a piece of working but ugly code and a piece of beautiful but malfunctioning code, I prefer the former. I guess everyone except the coding-style paranoid will choose the former. The truth is (or the 80-20 principle tells us) that an ugly but working prototype will cost 20% of the total time, while a pretty and not necessarily working prototype will cost about 80% of the total development time. As software is changing all the time, one cannot expect to have a “final” version that is both beautiful and working. Therefore, choosing working code over pretty code is wise.

9) Peer review is about finding (potential) bugs, not about the coding style.

I’ve heard that in many big-name companies such as Google, code peer review plays quite an important role. I was also once an intern at Siemens. There, before you check code into the repository, a colleague will usually go through your code to see if there’s something wrong. (Of course, you have to pass the unit tests before checking in.) In my experience, peer review is more about the good practices of extreme programming than about a code style exam.

In the open source community, the scenario changes: everyone can read your code and figure out what happens in it. While the developer should expect feedback from the community, it should come in the form of suggestions or patches instead of fierce criticism, especially about coding/design style. Again, I would emphasize that the open source community should always be polite to contributors while being cruel to malicious saboteurs.

10) Instead of saying something, why not do something?

The best way to contribute to the open source community is not to say something, but to do something. For instance, if you feel uncomfortable about a project, then instead of writing a letter to the author complaining about the poor coding style, why not just refactor the code and republish it? In my humble opinion, barking dogs seldom bite.

Jul 7, 2007 - Random scribbles


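  1. I wrote a script to run a program, and after four hours of waiting there is still no result, so I am using the time to write something.

  2. Seventy years ago, on July 7, 1937, the Marco Polo Bridge Incident broke out, and Japan formally launched a war aimed at annexing China. Over the following eight years, the Chinese nation stood united: Kuomintang and Communist forces fought the Japanese army on the frontal battlefield and behind enemy lines respectively, and aid from other countries, together with the support of people all over the world, including overseas Chinese, kept the Chinese nation from being conquered and wiped out. To the national heroes of that war, to the upright people who supported the resistance, and to all the allies who stood by China: salute!

  3. Today I went to see Transformers with friends. A blockbuster, dazzling, and very good. Spielberg as executive producer, DreamWorks special effects, Michael Bay directing; the ending song is Linkin Park's What I’ve Done. It has every blockbuster element I like, and I enjoyed it thoroughly. (Comrades who are put off by American values and the flaunting of the American spirit can skip it; a blockbuster like this just uses the Independence Day release slot to show off America's strength and freedom. The plot is not much different from Independence Day.)

  4. I finally bought White Rabbit milk candy. I had been craving it for almost half a year.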

Jul 5, 2007 - I am dying of envy


Enough said, straight to the pictures.

laoxu1.jpg

laoxu2.jpg

laoxu3.jpg

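[Copyright notice: all photos are reposted from Tiny's Flickr]

a. Tiny really is so tiny that it has swung round to the opposite extreme

b. The question of who will be the spokesperson for 银杏 (Ginkgo) is finally settled

c. Tiny should get an SLR; what kind of photos do you get from a phone

d. Next time I go to Beijing, I will try to become famous overnight; I want to be in <开啦> too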

Jul 3, 2007 - A few additional words on the Bloom Filter


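Today on the Google China Blog (谷歌黑板报), researcher Wu Jun gave an accessible explanation of the Bloom Filter. Since I also mentioned the Bloom Filter a while ago in my post "A few notes on spell checkers" (拼写检查器的一点注记), I would like to add a few words.

  1. The blog says:

For each email address X, we use eight different random number generators (F1, F2, …, F8) to produce eight information fingerprints (f1, f2, …, f8). Then a random number generator G maps these eight fingerprints to eight natural numbers g1, g2, …, g8 between 1 and 1.6 billion.

This sentence is actually a bit convoluted. In essence, there are 8 different hash functions that map X to eight natural numbers. (In fact, with a good hash function such as MD5, computing it once and cutting the result into eight pieces gives eight perfectly good hash functions; eight random number generators are not necessarily needed.)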
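A minimal Python sketch of the "compute MD5 once and cut it into eight pieces" idea follows. The names `md5_hashes` and `TinyBloomFilter` are hypothetical. Note one assumption that limits the sketch: an MD5 digest is 128 bits, so eight equal pieces are only 16 bits each and can address at most 65,536 positions. A filter as large as the one in the blog example would need wider pieces, for instance from a longer digest.

```python
import hashlib


def md5_hashes(item, m):
    """Derive eight positions in [0, m) by slicing one MD5 digest into 16-bit pieces."""
    if not 0 < m <= 1 << 16:
        raise ValueError("16-bit pieces can address at most 65536 positions")
    digest = hashlib.md5(item.encode("utf-8")).digest()  # 16 bytes = 128 bits
    return [int.from_bytes(digest[i:i + 2], "big") % m for i in range(0, 16, 2)]


class TinyBloomFilter:
    """Toy Bloom filter: one byte per bit for clarity, not for compactness."""

    def __init__(self, m=1 << 16):
        self.m = m
        self.bits = bytearray(m)

    def add(self, item):
        for pos in md5_hashes(item, self.m):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True means "possibly in the set"; False means "definitely not".
        return all(self.bits[pos] for pos in md5_hashes(item, self.m))


blacklist = TinyBloomFilter()
blacklist.add("spam@example.com")
print("spam@example.com" in blacklist)
```

An address that was added is always reported as present, because the same eight positions are recomputed and all of them were set.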

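  2. The blog says:

A Bloom filter will never miss a suspicious address that is on the blacklist. However, it has one drawback: there is a very small chance that it will judge an email address that is not on the blacklist to be on it, because a good email address may happen to correspond to eight bits that have all been set to one. Fortunately this probability is very small. We call it the false positive rate. In the example above, the false positive rate is below one in ten thousand.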

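In fact, this says that a Bloom filter has no false negatives but may have false positives. Let's compute the probability. Assume the hash functions are ideal, that is, their values are uniformly distributed, and the Bloom filter is \(m\) bits long. Then, for one input and a single hash function, the probability that a given bit is not set is \(1-\frac{1}{m}\). We have \(k\) independent hash functions in total, so the probability that this bit stays \(0\) is \((1-\frac{1}{m})^k\). Hence, after we have inserted \(n\) elements, the probability that a given bit is \(0\) is \((1-\frac{1}{m})^{kn}\). Subtracting this from \(1\) gives the probability that the bit is \(1\). Now, if we test whether an element is in the set and get it wrong, that is, the element is clearly not in the set but every bit it hashes to is \(1\), the probability of this is \(\left(1-\left(1-\frac{1}{m}\right)^{kn}\right)^k \approx \left(1-e^{-kn/m}\right)^k\). In this problem, \(m\)=16G, \(n\)=1G, \(k\)=8, and the error rate I get is 0.0005744 (Linux Bash: echo "(1-e(-8*1/16))^8" | bc -l), which is greater than one in ten thousand. [Wikipedia uses the same calculation as mine]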
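The same estimate can be checked with a few lines of Python. This is only a sketch of the formula above; the function name `false_positive_rate` and the `exact` switch are my own additions, and "16G" and "1G" are taken as 16e9 and 1e9 (only the ratio n/m matters for the approximation).

```python
import math


def false_positive_rate(m, n, k, exact=False):
    """False positive probability of a Bloom filter with m bits, n items, k hashes.

    exact=False uses the approximation (1 - e^(-kn/m))^k;
    exact=True uses (1 - (1 - 1/m)^(kn))^k, computed via log1p for accuracy.
    """
    if exact:
        zero_prob = math.exp(k * n * math.log1p(-1.0 / m))
    else:
        zero_prob = math.exp(-k * n / m)
    return (1.0 - zero_prob) ** k


approx = false_positive_rate(m=16e9, n=1e9, k=8)
exact = false_positive_rate(m=16e9, n=1e9, k=8, exact=True)

print(approx)          # about 5.7 in ten thousand
print(approx > 1e-4)   # larger than one in ten thousand
```

For a filter this large the exact and approximate forms agree to many digits, so the approximation is the convenient one to use.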