CasperJS script to crawl a website (can be based on the existing https://github.com/seethroughtrees/status-crawler).
casperjs spider.js --start-url=http://www.proxymis.com
It should follow 301 redirects automatically.
ignored: a list of strings to skip; entries can be URL fragments or file extensions…
Ex: ignored = ['.css', '.js', '.ttf', 'index2']
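A minimal sketch of how the ignored list could be applied, assuming plain substring matching so that URL fragments and extensions can share one list. The helper name isIgnored is hypothetical and not part of the spec or of status-crawler.

```javascript
// The example list from the spec.
var ignored = ['.css', '.js', '.ttf', 'index2'];

// Hypothetical helper: returns true when the URL contains any ignored entry.
// Assumption: entries are matched as plain substrings, not as patterns.
function isIgnored(url, ignoredList) {
  return ignoredList.some(function (fragment) {
    return url.indexOf(fragment) !== -1;
  });
}

console.log(isIgnored('http://www.proxymis.com/style.css', ignored));
console.log(isIgnored('http://www.proxymis.com/about.html', ignored));
```

One consequence of substring matching is that '.js' also matches URLs ending in '.json'. If that is not wanted, the extension entries would need to be compared against the end of the URL path instead.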
setFollowMode (default = 2)
The following list explains the supported follow-modes:
0 - The crawler will follow EVERY link, even if the link leads to a different host or domain.
If you choose this mode, you really should set a limit on the crawling process (see limit-options),
otherwise the crawler may end up crawling the whole WWW!
1 - The crawler will only follow links that lead to the same domain as the one in the root-url.
E.g. if the root-url (setURL()) is "http://www.foo.com", the crawler will follow links to "http://www.foo.com/..."
and "http://bar.foo.com/...", but not to "http://www.another-domain.com/...".
2 - The crawler will only follow links that lead to the same host as the one in the root-url.
E.g. if the root-url (setURL()) is "http://www.foo.com", the crawler will ONLY follow links to "http://www.foo.com/...", but not
to "http://bar.foo.com/..." or "http://www.another-domain.com/...". This is the default mode.
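The three follow-modes above could be checked roughly as follows. This is a sketch under stated assumptions: the names hostOf, domainOf and shouldFollow are hypothetical, URLs are assumed to be absolute http(s) URLs without user info, and the "domain" is naively taken to be the last two labels of the host (which is wrong for suffixes such as .co.uk).

```javascript
// Extract the lower-cased host from an absolute http(s) URL.
// A regex is used instead of the URL class, which older engines may not provide.
function hostOf(url) {
  var match = /^https?:\/\/([^\/:?#]+)/i.exec(url);
  return match ? match[1].toLowerCase() : null;
}

// Naive assumption: the domain is the last two labels of the host.
function domainOf(host) {
  return host.split('.').slice(-2).join('.');
}

// Decide whether a link may be followed under the given follow-mode.
function shouldFollow(link, rootUrl, followMode) {
  var linkHost = hostOf(link);
  var rootHost = hostOf(rootUrl);
  if (linkHost === null || rootHost === null) {
    return false; // not an absolute http(s) URL
  }
  if (followMode === 0) {
    return true; // follow every link
  }
  if (followMode === 1) {
    return domainOf(linkHost) === domainOf(rootHost); // same domain
  }
  return linkHost === rootHost; // mode 2 (default): same host
}

console.log(shouldFollow('http://bar.foo.com/a', 'http://www.foo.com', 1));
console.log(shouldFollow('http://bar.foo.com/a', 'http://www.foo.com', 2));
```

With the examples from the list, "http://bar.foo.com/..." is accepted in mode 1 and rejected in mode 2, while "http://www.another-domain.com/..." is only accepted in mode 0.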
Link priorities: it should be possible to tell the crawler to follow links that contain the string "forum" first, then links that contain "blog", then links that contain "contact", and finally all other found links.
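One way the priority order could be implemented is to rank each found link by the first keyword it contains and sort the queue by that rank. The names priorities, priorityRank and sortByPriority are hypothetical, and substring matching is an assumption.

```javascript
// Keywords in descending priority, taken from the example in the spec.
var priorities = ['forum', 'blog', 'contact'];

// Rank of a URL: index of the first keyword it contains,
// or keywords.length when it contains none (lowest priority).
function priorityRank(url, keywords) {
  for (var i = 0; i < keywords.length; i++) {
    if (url.indexOf(keywords[i]) !== -1) {
      return i;
    }
  }
  return keywords.length;
}

// Return a new array sorted by rank. The original index is used as a
// tie-breaker so links with equal rank keep their discovery order.
function sortByPriority(links, keywords) {
  return links
    .map(function (url, index) {
      return { url: url, rank: priorityRank(url, keywords), index: index };
    })
    .sort(function (a, b) {
      return (a.rank - b.rank) || (a.index - b.index);
    })
    .map(function (item) {
      return item.url;
    });
}

console.log(sortByPriority([
  'http://www.foo.com/about',
  'http://www.foo.com/blog/1',
  'http://www.foo.com/contact',
  'http://www.foo.com/forum/x'
], priorities));
```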
Each time a page is loaded, the crawler should call the pageLoaded function:
function pageLoaded(PageInfo)
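The spec names the PageInfo parameter but does not list its fields. The sketch below therefore assumes a hypothetical shape with only url and status, and a hypothetical helper notifyPageLoaded that the crawler could call after each page load. Neither is prescribed by the spec.

```javascript
// Example callback. The fields url and status are assumptions,
// since the spec does not define what PageInfo contains.
function pageLoaded(pageInfo) {
  console.log(pageInfo.status + ' ' + pageInfo.url);
}

// Hypothetical helper: builds the PageInfo object and hands it to the callback.
function notifyPageLoaded(callback, url, status) {
  var pageInfo = { url: url, status: status };
  callback(pageInfo);
  return pageInfo;
}

notifyPageLoaded(pageLoaded, 'http://www.proxymis.com/', 200);
```

Passing the callback in as an argument keeps the crawl loop independent of what the caller does with each page (logging, collecting statuses, and so on).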