A few months ago I drastically changed how the urls on my site were built. I moved to using the ASP.NET
2.0 virtual path provider to make friendlier urls. See the discussions last April if you're
interested; there were several posts that month about it. One problem with a change like this is that
it can wreak havoc on your urls, especially your relative ones. Using the url rewriting features built
into ASP.NET 2.0 I redirected all the old urls to the new ones, but that didn't fix the relative urls in
the blog posts, because there were now more subdirectories to navigate. I have finally
gotten around to building something to make sure all my urls are good: a web crawler.
Just in case you don't know what a web crawler is: it is a program that fetches a page, extracts
all the links and various pieces of data from it, then follows each of those links and does the same
for the pages it finds there, and so on. This is how search engines, for example, get all their
data. They write crawlers.
And that is exactly what I needed: something to crawl my site to make sure all my links were good.
So I decided to write one, and I'm sharing it with you here. You can download it at the end. Between
here and there is a discussion of some of the more interesting bits of features and code in the
crawler.
Disclaimer
First, I'm not sharing this because I think it is the best crawler ever. My quality bar for this one
was "will it meet the needs for which I developed it?". The answer to that is "yes". It may not meet
yours. If not, change it yourself, use the code as a starting point for your own, or run away cursing
my insufficient code, ruing the day that I was brought into this cold, hard world.
Second, I have only tested this on a few of my own personal sites. It seems to work fine on all of them.
If it doesn't work completely on yours, see the first point. Third, this was not optimized for speed.
If you want to crawl the entire web with this thing, you'll probably find that it is not fast enough.
Sorry, but see the first point. Fourth, I did not build robots.txt support into the crawler, because
I only wanted this for my own sites. If you're going to use this on other people's sites, please add it.
It is the nice thing to do. Don't be evil.
Overview
Here are some notes on the basics of the crawler.
- It is a console app - It doesn't need a rich interface, so I figured a console application would
do. The output is done as an html file and the input (what site to view) is done through the app.config.
Making a windows app out of this seemed like overkill.
- The crawler is designed to only crawl the site it originally targets. It would be easy to change that
if you want to crawl more than just a single site, but a single site is all this little application is meant for.
- Originally the crawler was just written to find bad links. Just for fun I also had it collect
information on page and viewstate sizes. It will also list all non-html files and external urls, just
in case you care to see them.
- The results are shown in a rather minimalistic html report. This report is automatically opened in
Internet Explorer when the crawl is finished.
Getting the Text from an Html Page
The first crucial piece of building a crawler is the mechanism for going out and fetching the html
off of the web (or your local machine, if you have the site running locally). Like so much else, .NET
has classes for doing this very thing built into the framework.
    // Requires: using System.IO; using System.Net;
    private static string GetWebText(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = "A .NET Web Crawler";

        // Dispose the response, stream, and reader when done so connections are not leaked.
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }
The HttpWebRequest class can be used to request any page from the internet. The response (retrieved
through a call to GetResponse()) holds the data you want. Get the response stream, throw it in a
StreamReader, and read the text to get your html.
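One thing to keep in mind when hunting for bad links: GetResponse() throws a WebException when the server answers with an error status such as 404, so the fetch needs some way to report failure. Here is a minimal sketch of one way to do that. The names FetchSketch and TryGetWebText are made up for illustration and are not part of the download.

```csharp
using System;
using System.IO;
using System.Net;

public static class FetchSketch
{
    // Hypothetical helper: returns true and the page text on success,
    // or false and a short description of what went wrong.
    public static bool TryGetWebText(string url, out string text, out string error)
    {
        text = null;
        error = null;
        try
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.UserAgent = "A .NET Web Crawler";

            using (WebResponse response = request.GetResponse())
            using (Stream stream = response.GetResponseStream())
            using (StreamReader reader = new StreamReader(stream))
            {
                text = reader.ReadToEnd();
                return true;
            }
        }
        catch (WebException exc)
        {
            // Error responses (4xx/5xx) carry their status code on the attached response;
            // other failures (DNS, timeout) only have the WebExceptionStatus.
            HttpWebResponse failed = exc.Response as HttpWebResponse;
            error = failed != null
                ? "HTTP " + (int)failed.StatusCode + " for " + url
                : exc.Status + " for " + url;
            return false;
        }
        catch (UriFormatException)
        {
            // WebRequest.Create rejects strings that are not valid absolute uris.
            error = "Malformed url: " + url;
            return false;
        }
    }
}
```

A caller could add the error string to its list of bad urls whenever the method returns false.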
Hunting for Links
So how do you find those links? I used a regex to find the hrefs on the page, and then a little code to get the values. You'll also see in this sample how I did
a little link sorting. The most important thing, though, is the regex. Pretty simple, isn't it?
    // Requires: using System; using System.Text.RegularExpressions;
    private const string _LINK_REGEX = "href=\"[a-zA-Z./:&\\d_-]+\"";

    /// <summary>
    /// Parses a page looking for links.
    /// </summary>
    /// <param name="page">The page whose text is to be parsed.</param>
    /// <param name="sourceUrl">The source url of the page.</param>
    public void ParseLinks(Page page, string sourceUrl)
    {
        MatchCollection matches = Regex.Matches(page.Text, _LINK_REGEX);

        foreach (Match anchorMatch in matches)
        {
            if (anchorMatch.Value == String.Empty)
            {
                BadUrls.Add("Blank url value on page " + sourceUrl);
                continue;
            }

            string foundHref = null;
            try
            {
                // Strip the leading href=" and the trailing quote to leave the bare url.
                foundHref = anchorMatch.Value.Replace("href=\"", "");
                foundHref = foundHref.Substring(0, foundHref.IndexOf("\""));
            }
            catch (Exception exc)
            {
                Exceptions.Add("Error parsing matched href: " + exc.Message);
                continue; // foundHref is unusable, so skip the sorting below.
            }

            if (!GoodUrls.Contains(foundHref))
            {
                // Sort the link into a bucket: external site, non-html file, or crawlable page.
                if (IsExternalUrl(foundHref))
                {
                    _externalUrls.Add(foundHref);
                }
                else if (!IsAWebPage(foundHref))
                {
                    foundHref = Crawler.FixPath(sourceUrl, foundHref);
                    _otherUrls.Add(foundHref);
                }
                else
                {
                    GoodUrls.Add(foundHref);
                }
            }
        }
    }
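One limitation worth knowing about: the character class in _LINK_REGEX does not include characters such as ? or =, so hrefs with query strings are skipped. If that matters for your site, a looser pattern with a capture group might serve better. The sketch below is an alternative, not what the download does, and the names LinkSketch and ExtractHrefs are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkSketch
{
    // Assumed alternative to _LINK_REGEX: capture everything between the double quotes.
    private static readonly Regex HrefPattern =
        new Regex("href\\s*=\\s*\"([^\"]*)\"", RegexOptions.IgnoreCase);

    public static List<string> ExtractHrefs(string html)
    {
        List<string> hrefs = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            // Groups[1] is the text inside the quotes, so no Replace/Substring is needed.
            hrefs.Add(match.Groups[1].Value);
        }
        return hrefs;
    }
}
```

Because the group may match an empty string, an href="" on the page comes back as a blank value that the caller can report. Like the original pattern, this only handles double-quoted attributes.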
Is It a Link to an External Site?
Because I had no desire to crawl the entire web, I needed a way to tell whether a url points somewhere other than the site in question. Here's the code I wrote to do that:
    // Requires: using System; using System.Configuration;

    /// <summary>
    /// Is the url to an external site?
    /// </summary>
    /// <param name="url">The url whose externality of destination is in question.</param>
    /// <returns>Boolean indicating whether or not the url is to an external destination.</returns>
    private static bool IsExternalUrl(string url)
    {
        if (url.IndexOf(ConfigurationManager.AppSettings["authority"]) > -1)
        {
            return false;
        }

        // StartsWith is safe on short urls. Substring(0, 7) throws when the url has fewer
        // than 7 characters, and a 7-character substring can never equal "https://".
        return url.StartsWith("http://", StringComparison.Ordinal)
            || url.StartsWith("https://", StringComparison.Ordinal)
            || url.StartsWith("www", StringComparison.Ordinal);
    }
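The string checks above are good enough for my sites, but they treat a url as internal whenever the authority text appears anywhere in it. Another option is to let System.Uri do the work: resolve each href against the page it came from, then compare hosts. This is only a sketch under my own assumptions. UrlSketch, Resolve and IsExternal are illustrative names, and the site root is passed in rather than read from the app.config.

```csharp
using System;

public static class UrlSketch
{
    // Resolves an href against the page it was found on; returns null if it cannot be parsed.
    public static Uri Resolve(string pageUrl, string href)
    {
        Uri resolved;
        return Uri.TryCreate(new Uri(pageUrl), href, out resolved) ? resolved : null;
    }

    // A link counts as external when its host differs from the site's host.
    public static bool IsExternal(Uri siteRoot, Uri link)
    {
        return !string.Equals(siteRoot.Host, link.Host, StringComparison.OrdinalIgnoreCase);
    }
}
```

For example, resolving "../images/a.png" against "http://example.com/blog/2007/post.aspx" gives http://example.com/blog/images/a.png, which is the kind of relative-path fix that broke on my site. Note that a plain host comparison treats www.example.com and example.com as different sites.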
Conclusion (Writing a Crawler Is Not Rocket Surgery)
Writing a crawler really boils down to two tasks: fetching html and parsing it for links. It's really not all that hard. The code included in the download has
a little more to it, but not a lot. If you would find it useful, download it yourself and take a look. If you have any questions or comments, shoot me an email.
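To show how the two tasks fit together, here is a rough sketch of the loop that ties them up. It is not the code from the download: CrawlSketch and FindBadUrls are hypothetical names, and the fetching and link parsing are passed in as delegates so the loop can be tried without touching the network.

```csharp
using System;
using System.Collections.Generic;

public static class CrawlSketch
{
    // fetch: returns the html for a url, or null if the request failed.
    // extractLinks: given a page url and its html, returns the absolute urls found on it.
    // Returns the urls that could not be fetched.
    public static List<string> FindBadUrls(
        string startUrl,
        Func<string, string> fetch,
        Func<string, string, IEnumerable<string>> extractLinks)
    {
        List<string> badUrls = new List<string>();
        HashSet<string> seen = new HashSet<string>();
        Queue<string> pending = new Queue<string>();

        seen.Add(startUrl);
        pending.Enqueue(startUrl);

        while (pending.Count > 0)
        {
            string url = pending.Dequeue();
            string html = fetch(url);
            if (html == null)
            {
                badUrls.Add(url);
                continue;
            }

            foreach (string link in extractLinks(url, html))
            {
                // HashSet.Add returns false for urls already queued, which prevents loops.
                if (seen.Add(link))
                {
                    pending.Enqueue(link);
                }
            }
        }
        return badUrls;
    }
}
```

The loop only stays on one site if extractLinks leaves out external urls, so that is where a check like IsExternalUrl would go.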
And yes. I did find some bad links. Glad I wrote it!
Downloads
Source