Posted earlier on Search Nuggets.
To crawl most pages elegantly and easily, you need five pieces of information:
- Somewhere to start. Where do you want your crawler to start? You don’t have to specify the domain; we pick up the domain name from the page you’re visiting.
- Which links to follow. These are not necessarily the pages you want to crawl. Typically they are pages that list the pages you want to crawl.
- Which links not to follow. To keep the crawler from going wild, you set some boundaries. Often the same page is reachable through several URLs, and you only want to visit it once.
- Which links to crawl. These are the actual pages you’re looking for.
- Which links not to crawl.
A simple illustration of the above rules. Norch Fetch doesn’t have all these features yet, but they’re suggested as enhancements.
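To make the five pieces of information a bit more concrete, here is a minimal sketch in Python. It is not Norch Fetch’s configuration format or code: the names `CrawlRules`, `classify` and `start_url`, the example patterns, the `example.com` URLs, and the choice to express rules as regular expressions are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from urllib.parse import urljoin


@dataclass
class CrawlRules:
    """Hypothetical container for the five pieces of information."""
    start: str                                     # somewhere to start (path or full URL)
    follow: list = field(default_factory=list)     # which links to follow
    no_follow: list = field(default_factory=list)  # which links not to follow
    crawl: list = field(default_factory=list)      # which links to crawl
    no_crawl: list = field(default_factory=list)   # which links not to crawl


def _allowed(url, include, exclude):
    """True if url matches at least one include pattern and no exclude pattern."""
    if any(re.search(p, url) for p in exclude):
        return False
    return any(re.search(p, url) for p in include)


def classify(url, rules):
    """Return a (follow, crawl) pair of booleans for a single URL."""
    return (
        _allowed(url, rules.follow, rules.no_follow),
        _allowed(url, rules.crawl, rules.no_crawl),
    )


def start_url(current_page, rules):
    """Resolve the start point against the page being visited,
    so the domain does not have to be specified in the rules."""
    return urljoin(current_page, rules.start)


# Example rules for an imaginary blog (all patterns are assumptions).
rules = CrawlRules(
    start="/blog/",
    follow=[r"/blog/(page/\d+/)?$"],       # the list pages
    no_follow=[r"\?"],                     # skip URLs with query strings
    crawl=[r"/blog/\d{4}/\d{2}/[^/]+/$"],  # the actual posts
    no_crawl=[r"/feed/$"],                 # but not their feeds
)

print(start_url("https://example.com/about", rules))
print(classify("https://example.com/blog/page/2/", rules))
print(classify("https://example.com/blog/2014/05/my-post/", rules))
```

In this sketch the exclude patterns always win over the include patterns, which is one simple way to keep the "not to follow" and "not to crawl" boundaries predictable. A real implementation might well choose differently.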
To ensure you’re adding valid rules, it’s a good thing to test them first.
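One way such a test could look, sketched in Python: dry-run a rule against a few URLs you expect it to match and a few you expect it to leave alone. The helper `check_rule` is hypothetical, and it assumes rules are written as regular expressions, which the idea above doesn’t prescribe.

```python
import re


def check_rule(pattern, should_match, should_not_match):
    """Dry-run a rule against sample URLs.

    Returns a list of problems; an empty list means the rule
    behaved as expected on the samples given.
    """
    try:
        regex = re.compile(pattern)
    except re.error as exc:
        # The pattern itself is not a valid regular expression.
        return [f"invalid pattern: {exc}"]

    problems = []
    for url in should_match:
        if not regex.search(url):
            problems.append(f"expected a match: {url}")
    for url in should_not_match:
        if regex.search(url):
            problems.append(f"unexpected match: {url}")
    return problems


# Example: a rule meant to pick out blog posts (URLs are made up).
print(check_rule(
    r"/blog/\d{4}/",
    should_match=["https://example.com/blog/2014/post/"],
    should_not_match=["https://example.com/about/"],
))
```

Only rules that come back with an empty list would then be saved.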
The next tasks will be to make a clickable prototype in HTML/CSS and to read up on HTML5 local storage/web storage.
All comments on the idea are welcome!