Archive for November, 2009

How to write a crawler program?

Writing a crawler program is actually not that bad. You can use existing tools, but writing your own lets you build in every function you want. While I cannot provide the code, I searched and found an algorithm for this. It’s an interesting program.

You’ll be reinventing the wheel, to be sure. But here are the basics:

* A list of unvisited URLs – seed this with one or more starting pages
* A list of visited URLs – so you don’t go around in circles
* A set of rules for URLs you’re not interested in – so you don’t index the whole Internet

Store these in a database, so you can stop and start the crawler without losing state.
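
To make the three pieces above concrete, here is a minimal Python sketch. It is my own illustration, not the algorithm this post goes on to describe. The names crawl, allowed, fetch and LinkParser, the crawl.db file, the single urls table and the "stay on the seed's host" rule are all assumptions chosen for the example.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Default fetcher: download a page and decode it leniently."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def allowed(url, host):
    """Example rule (an assumption): http(s) URLs on the seed's host only."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and parts.netloc == host


def crawl(seed, db_path="crawl.db", fetch=fetch, max_pages=100):
    """Visit up to max_pages URLs, keeping all state in SQLite."""
    host = urlparse(seed).netloc
    db = sqlite3.connect(db_path)
    # One table holds both lists: rows with visited = 0 are the unvisited queue.
    db.execute(
        "CREATE TABLE IF NOT EXISTS urls "
        "(url TEXT PRIMARY KEY, visited INTEGER NOT NULL DEFAULT 0)"
    )
    db.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (seed,))
    db.commit()
    for _ in range(max_pages):
        row = db.execute(
            "SELECT url FROM urls WHERE visited = 0 LIMIT 1"
        ).fetchone()
        if row is None:
            break  # nothing left to visit
        url = row[0]
        try:
            html = fetch(url)
        except Exception:
            html = ""  # unreachable page: still mark it visited and move on
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before storing.
            link = urldefrag(urljoin(url, href)).url
            if allowed(link, host):
                db.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (link,))
        db.execute("UPDATE urls SET visited = 1 WHERE url = ?", (url,))
        db.commit()  # committing per page lets the crawler stop and resume
    visited = [
        r[0]
        for r in db.execute("SELECT url FROM urls WHERE visited = 1 ORDER BY url")
    ]
    db.close()
    return visited
```

Running crawl again with the same db_path picks up where the last run stopped, because unvisited rows stay in the table. A real crawler would also want to respect robots.txt and pause between requests, which this sketch leaves out.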

Algorithm is

Continue »

Get all web pages from a website

I need a program to get all the web pages under a website. The website is in Chinese, and I want to pull out all the English words. Then I can extract the information I need.

On a Linux system, here is a solution:

Use, for example, wget -r http://site.to.copy.com to recursively retrieve all the web pages to your local machine (hopefully it’s not too big…). Then you can search or do whatever you like with the files afterward.
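
Once the pages are on disk, pulling out the English words could look roughly like the Python sketch below. It is only an illustration under some assumptions: the function names english_words and words_in_tree are made up for this example, "English word" is taken to mean a run of ASCII letters, and the saved files are assumed to end in .htm or .html and to be UTF-8 encoded (many Chinese sites use GBK or GB2312 instead, so pass the right encoding).

```python
import re
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Gathers visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


# A run of ASCII letters, optionally with one apostrophe part ("don't").
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def english_words(html):
    """Return the ASCII-letter words found in the visible text of a page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return WORD.findall(" ".join(parser.chunks))


def words_in_tree(root, encoding="utf-8"):
    """Collect words from every .htm/.html file under the download folder."""
    words = []
    for path in sorted(Path(root).rglob("*.htm*")):
        html = path.read_text(encoding=encoding, errors="replace")
        words.extend(english_words(html))
    return words
```

For example, words_in_tree("site.to.copy.com") would walk the folder that wget created for that host. Pages saved without an .htm or .html extension are skipped by this sketch.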


XOOM Review

I just used XOOM to send money to a friend in China. There are a lot of bad reviews of this service. I will update this post to report how it works out.

UPDATE:

The test result is good. My friend got a call the same day, and the money went into his account the next day. By the way, his account is with Agricultural Bank of China, and I used PayPal to pay.

CONCLUSION:

This service is terrific: it’s fast and it’s all online. So I would recommend it to friends. But be careful about the exchange rate.

Create a folder using a name that begins with a dot in Windows

While I was trying to install Globus on my Windows machine, I was required to create a folder with the name “.globus”. When I tried this in Windows Explorer on Windows XP, I got an error message saying “You have to enter a filename”. The only solution I have come up with is to open a command prompt (Start, Run, “CMD”, OK) and enter “mkdir .globus”.
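
If you would rather do this from a script than from the command prompt, a small Python sketch like the following should also work, since the restriction seems to come from Explorer and not from the file system. The function name make_dot_folder and its parent argument are made up for this example. Where Globus actually expects the folder is not covered here, so pass whatever parent directory your installation asks for.

```python
from pathlib import Path


def make_dot_folder(parent, name=".globus"):
    """Create a folder whose name begins with a dot inside parent."""
    folder = Path(parent) / name
    # exist_ok=True makes the call safe to repeat if the folder is already there.
    folder.mkdir(parents=True, exist_ok=True)
    return folder
```

Calling make_dot_folder(".") creates the folder in the current directory, the same as typing “mkdir .globus” there.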
