How to write a crawler program?
Writing a crawler program is actually not that hard. You can use existing tools, but writing your own lets you implement exactly the functions you want. While I cannot provide the code, I searched and found an algorithm for this. It’s an interesting program.
You’ll be reinventing the wheel, to be sure. But here are the basics:
* A list of unvisited URLs – seed this with one or more starting pages
* A list of visited URLs – so you don’t go around in circles
* A set of rules for URLs you’re not interested in – so you don’t index the whole Internet
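To make the three pieces above concrete, here is a minimal in-memory sketch in Python. It is only one way to arrange them, not a definitive crawler. The names `crawl`, `is_excluded`, `get_links`, `allowed_hosts` and `max_pages` are all assumptions made for illustration, and the single exclusion rule shown (stay on a fixed set of hosts) is just an example of a rule you might pick.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def is_excluded(url, allowed_hosts):
    """Example rule (an assumption): skip non-http(s) URLs and foreign hosts."""
    parts = urlparse(url)
    return parts.scheme not in ("http", "https") or parts.netloc not in allowed_hosts


def crawl(seeds, get_links, allowed_hosts, max_pages=100):
    """Breadth-first crawl.

    get_links(url) is a caller-supplied function (hypothetical here) that
    fetches the page and returns the href values found on it.
    """
    unvisited = deque(seeds)  # URLs still to fetch, seeded with starting pages
    visited = set()           # URLs already fetched, so we don't go in circles
    order = []                # pages in the order they were crawled
    while unvisited and len(visited) < max_pages:
        url = unvisited.popleft()
        if url in visited or is_excluded(url, allowed_hosts):
            continue
        visited.add(url)
        order.append(url)
        for href in get_links(url):
            link = urljoin(url, href)  # resolve relative links against the page
            if link not in visited:
                unvisited.append(link)
    return order
```

Passing `get_links` in as a parameter keeps the loop independent of how pages are downloaded and parsed, so you can try it against a small dictionary of fake pages before pointing it at a real site.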
Store these lists and rules in a database, so you can stop and start the crawler without losing state.
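As a rough sketch of what keeping that state in a database might look like, the example below uses Python's built-in `sqlite3` module. The table layout (one `urls` table with a `visited` flag) and the helper names `open_state`, `add_url`, `next_unvisited` and `mark_visited` are assumptions chosen for illustration; any database that survives a restart would do.

```python
import sqlite3


def open_state(path):
    """Open (or create) the crawler state database at the given path."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls ("
        "url TEXT PRIMARY KEY, visited INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def add_url(conn, url):
    """Queue a URL; an already-known URL (visited or not) is left unchanged."""
    conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
    conn.commit()


def next_unvisited(conn):
    """Return the oldest unvisited URL, or None when nothing is left."""
    row = conn.execute(
        "SELECT url FROM urls WHERE visited = 0 ORDER BY rowid LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def mark_visited(conn, url):
    """Record that a URL has been fetched so it is never handed out again."""
    conn.execute("UPDATE urls SET visited = 1 WHERE url = ?", (url,))
    conn.commit()
```

Because every change is committed immediately, stopping the crawler and calling `open_state` on the same file later picks up where it left off. Committing per URL is simple but slow; batching commits is a possible refinement.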
Algorithm is