Repost: Build a Web Crawler in Go
One of the basic tests I use to try out a new programming language is building a web crawler. I stole the idea from my colleague Mike Lewis and I love it because it uses all the principles necessary in internet engineering: A web crawler needs to parse semi-structured text, rely on 3rd-party APIs, manage its internal state, and perform some basic concurrency.
# Starting a new project with Go
This is a tutorial that is accessible to complete beginners to the Go programming language or to web programming. All the commands you need to run are provided, and all the code you need to type into the `.go` source code files is shown to you a piece at a time. Each time I introduce a new concept you’ll need to edit the same file, slowly building up functionality. When you run each version of the code Go will give you an exact error message with a line number, so you’ll be able to fix any mistakes you make. If you get totally lost you can just copy straight out of the page into a new file and pick back up. If you’re new to Go I highly recommend reading the excellent Go documentation to help you make sense of what’s on this page.
Since this example requires the Go programming language you should start by installing Go:
    sudo apt-get install golang    # (if you're on Linux)
# Creating the ‘main’ function
Go is so fast that it’s practically a scripting language. All you need to run it is a ‘main’ package and a ‘main’ function. On my machine the following tiny program takes only 0.04 seconds to compile. Let’s create a new file and try it out:
    //// Put this in a new file 'crawl.go'
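The smallest compilable Go program is nothing more than a `main` package containing an empty `main` function; a minimal sketch:

```go
// The smallest possible Go program: a main package with an empty main function.
package main

func main() {
}
```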
Now you’re a Go programmer. Hooray! Let’s make it just slightly more interesting by turning this into a Hello World program.
    //// file: crawl.go
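A hello-world sketch of `crawl.go`, printing with `fmt.Println` from the standard library (the exact greeting text here is my own choice):

```go
package main

import "fmt" // the standard formatted-printing package

func main() {
	fmt.Println("Hello, World!")
}
```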
And now let’s run it:
    go run crawl.go    # run this in your terminal, in whatever directory you created the file
You could have executed `go build crawl.go` instead of `go run crawl.go` and it would have just compiled the file for you. The “run” command both compiles and executes it, so you’ll find it turns Go into a usable scripting language (indeed, it’s faster than a lot of Ruby or Python projects).
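Side by side, the two workflows look roughly like this (assuming the file is named `crawl.go`, as above):

```
go build crawl.go   # compiles crawl.go into an executable named 'crawl' in this directory
./crawl             # runs the compiled binary
go run crawl.go     # compiles and runs in one step, leaving no binary behind
```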
# What’s a web crawler?
A web crawler is the portion of a search engine that scans web pages looking for links and then follows them. It would normally store the data it finds into some database where it can be useful but our version is just going to print the URLs to the screen and move along.
While not quite enterprise grade this’ll let us play with all the same concepts and see some really satisfying output as our little script wanders the web.
# Step 1. Starting from a specific page
You’ve got to start crawling from somewhere, and in our program we’ll let the person who executes the crawler specify what starting page to use. This involves reading a command line argument which, in most programming languages, will be stored in a variable named something like `ARGV`, and Go is no exception.
    //// file: crawl.go
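Here’s a sketch of `crawl.go` at this stage, reading the starting URL from `os.Args` (the error message wording and the final print are my own):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// os.Args[0] is the program name; everything after it is an argument.
	args := os.Args[1:]
	if len(args) < 1 {
		fmt.Println("Please specify a start page")
		os.Exit(1)
	}
	fmt.Println("Starting from:", args[0])
}
```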
Now run this file by typing `go run crawl.go` and see that it yells at you because you didn’t specify a starting web page. Then try running it with an argument provided (`go run crawl.go http://news.google.com`) and it won’t show that message anymore.
# Step 2. Retrieving a page from the internet
The next thing you need is to download the page your starting URL represents so you can scan it for links. In Go there is a great http package right in the standard library. You can use the primitives Go gives you to turn your URL string into a string representing the page body.
First I’ll show you what this looks like in isolation, then we can incorporate it into our crawler. Put the following in retrieve.go:
    //// file: retrieve.go
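A sketch built on the standard `net/http` package; the hard-coded URL mirrors the 6brand.com example mentioned below, and both error values are printed so you can confirm they are `nil`:

```go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	// Fetch the page. resp.Body is a stream of the raw HTML.
	resp, err := http.Get("http://6brand.com")
	fmt.Println("http transport error is:", err)
	if err != nil {
		return
	}
	defer resp.Body.Close()

	// Slurp the whole body into memory and print it.
	body, err := ioutil.ReadAll(resp.Body)
	fmt.Println("read error is:", err)
	fmt.Println(string(body))
}
```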
And run it with `go run retrieve.go`. You should see the body of 6brand.com printed to your screen. And if you scroll up to the top you’ll see that there were no errors (literally, `err` was `nil`) in either the transporting or reading of your http call. Yay, go you.
Let’s copy the new pieces of this code into our `crawl.go`:

    //// file: crawl.go
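Combining the argument handling with the HTTP retrieval, `crawl.go` might now look something like this sketch:

```go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"os"
)

func main() {
	args := os.Args[1:]
	if len(args) < 1 {
		fmt.Println("Please specify a start page")
		os.Exit(1)
	}

	// Open a connection to the host named in the URL and download the page.
	resp, err := http.Get(args[0])
	if err != nil {
		fmt.Println("ERROR: failed to retrieve", args[0])
		os.Exit(1)
	}
	defer resp.Body.Close()

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("ERROR: failed to read the response body")
		os.Exit(1)
	}
	fmt.Println(string(body))
}
```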
What we’ve got now is an excellent start to a web crawler. It’s able to boot, parse the URL you’ve given it, open a connection to the right remote host, and retrieve the html content.
# Step 3. Parsing hyperlinks from HTML
If you have a page of HTML you may want to use a regular expression to extract the links. Don’t. Like, really don’t. Regular expressions are not a good tool for that. The best way is, sadly, to walk through the tree structure of the page, finding anchor tags and extracting the `href` attributes from them. In most languages this is no fun, but in Go’s standard library it’s extra not fun (though not nearly as painful as Java) and I won’t subject you to it here. I’ve encapsulated the act of pulling links from a large HTML string in this project and you’re welcome to check out the code if you’re interested.
Now let’s modify our code to extract links from whatever page is provided and print those links to the screen:
    //// file: crawl.go
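The sketches from here on assume a small helper package for the tree-walking described above, something like `github.com/jackdanger/collectlinks` (which you’d `go get` first), whose `All` function takes an `io.Reader` and returns every `href` it finds as a `[]string`. If you use a different link extractor, swap in its call here:

```go
package main

import (
	"fmt"
	"net/http"
	"os"

	"github.com/jackdanger/collectlinks" // assumed link-extraction helper
)

func main() {
	args := os.Args[1:]
	if len(args) < 1 {
		fmt.Println("Please specify a start page")
		os.Exit(1)
	}

	resp, err := http.Get(args[0])
	if err != nil {
		fmt.Println("ERROR: failed to retrieve", args[0])
		os.Exit(1)
	}
	defer resp.Body.Close()

	// Walk the parsed HTML and pull out every href attribute.
	links := collectlinks.All(resp.Body)
	for _, link := range links {
		fmt.Println(link)
	}
}
```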
If you run this against a simple website this’ll work fine:
    go run crawl.go http://6brand.com
But if you try to run it against an https-secured site it may error out because it can’t validate the SSL cert. We don’t care about security in this toy app so let’s disable the SSL verification.
    //// file: crawl.go
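Skipping certificate verification means building our own `http.Client` with a custom transport instead of calling `http.Get` directly. A sketch, using the same assumed link helper as above:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"

	"github.com/jackdanger/collectlinks" // assumed link-extraction helper
)

func main() {
	args := os.Args[1:]
	if len(args) < 1 {
		fmt.Println("Please specify a start page")
		os.Exit(1)
	}

	// A client whose transport skips SSL certificate checks.
	// Fine for a toy crawler; never do this in production code.
	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	resp, err := client.Get(args[0])
	if err != nil {
		fmt.Println("ERROR: failed to retrieve", args[0])
		os.Exit(1)
	}
	defer resp.Body.Close()

	links := collectlinks.All(resp.Body)
	for _, link := range links {
		fmt.Println(link)
	}
}
```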
# Step 4. Concurrency
In programming interviews I’ve asked questions that required work similar to what we’re doing here, and the step candidates need the most help with is building some kind of queue. If you were to crawl the internet without queueing up the links you found, you’d just visit the top link of every page and rapidly, in a depth-first search across the whole web, exhaust your resources without ever visiting the second link on the first page. To avoid that we need to keep some kind of queue: we put the links we find at the back of it and visit the pages we pull off the front.
In C# we’d use a `ConcurrentQueue`, in Ruby we’d probably require the stdlib’s `Queue`, and in Java we’d use a `ConcurrentLinkedQueue` or something. Go gives us a great alternative: a channel. It abstracts away the concurrency primitives for you and behaves much like a thread-safe queue.
We’ll create a channel when we start the program and we’ll put our starting URL into it. Then we’ll begin to read URLs from the channel and whenever we find new URLs we’ll write them to the channel, effectively putting them into the back of our queue.
The queue will grow continuously over time because we’re putting things in faster than we take them out but we just don’t care.
    //// file: crawl.go
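A sketch of the channel-as-queue version. The `queue` and `enqueue` names match the description below; the goroutine wrappers keep writes to the unbuffered channel from blocking:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"

	"github.com/jackdanger/collectlinks" // assumed link-extraction helper
)

func main() {
	args := os.Args[1:]
	if len(args) < 1 {
		fmt.Println("Please specify a start page")
		os.Exit(1)
	}

	// The channel is our queue of URLs waiting to be crawled.
	queue := make(chan string)

	// Seed the queue with the starting URL. The write happens in a goroutine
	// because an unbuffered channel blocks until someone reads from it.
	go func() { queue <- args[0] }()

	for uri := range queue {
		enqueue(uri, queue)
	}
}

func enqueue(uri string, queue chan string) {
	fmt.Println("fetching", uri)

	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get(uri)
	if err != nil {
		return // skip pages we can't retrieve
	}
	defer resp.Body.Close()

	for _, link := range collectlinks.All(resp.Body) {
		// Push each discovered link onto the back of the queue without
		// blocking the loop in main that called us.
		go func(l string) { queue <- l }(link)
	}
}
```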
The flow now is that `main` has a `for` loop reading from the channel called `queue`, and `enqueue` does the HTTP retrieval and link parsing, putting the discovered links into the same queue used by `main`.
If you try running our code so far it’ll work but you’ll immediately discover two things: The World Wide Web is a messy place full of invalid links and most pages link to themselves.
So let’s add some sanity to our code.
# Step 5. Data sanitization
To properly explore the web let’s turn all of the relative links we find into absolute links.
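A sketch of the same program with a small helper (I’ve called it `fixUrl`; the name is mine) that resolves each href against the page it was found on using `net/url`’s `ResolveReference`. An already-absolute link comes back unchanged, and anything unparseable comes back as an empty string and gets dropped:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"net/url"
	"os"

	"github.com/jackdanger/collectlinks" // assumed link-extraction helper
)

func main() {
	args := os.Args[1:]
	if len(args) < 1 {
		fmt.Println("Please specify a start page")
		os.Exit(1)
	}

	queue := make(chan string)
	go func() { queue <- args[0] }()

	for uri := range queue {
		enqueue(uri, queue)
	}
}

func enqueue(uri string, queue chan string) {
	fmt.Println("fetching", uri)

	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get(uri)
	if err != nil {
		return
	}
	defer resp.Body.Close()

	for _, link := range collectlinks.All(resp.Body) {
		absolute := fixUrl(link, uri)
		if absolute == "" {
			continue // drop links we can't make sense of
		}
		go func(l string) { queue <- l }(absolute)
	}
}

// fixUrl resolves a possibly-relative href against the page it was found on,
// returning "" for anything that won't parse.
func fixUrl(href, base string) string {
	uri, err := url.Parse(href)
	if err != nil {
		return ""
	}
	baseUrl, err := url.Parse(base)
	if err != nil {
		return ""
	}
	return baseUrl.ResolveReference(uri).String()
}
```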
Now you’re cookin’. This program will pretty reliably walk around the web downloading and parsing pages. However, there’s still one thing we lack.
# Step 6. Avoiding Loops
Nothing so far prevents us from visiting a page that has one link pointing to itself and just looping on that single page forever. That’s dumb; let’s not fetch any page more than once.
The right data structure for keeping track of the presence or absence of things is a set. Go, like JavaScript, doesn’t have a native set type, so we need to use a map (a.k.a. a hash, a hashmap, or a dictionary) with URLs as keys to keep track of which pages we’ve visited.
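And a sketch of the final version, with a `visited` map standing in for a set. Because `enqueue` is only ever called from `main`’s loop, the map is read and written from a single goroutine and needs no locking:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"net/url"
	"os"

	"github.com/jackdanger/collectlinks" // assumed link-extraction helper
)

// visited records every URL we've already fetched so we never fetch it twice.
var visited = make(map[string]bool)

func main() {
	args := os.Args[1:]
	if len(args) < 1 {
		fmt.Println("Please specify a start page")
		os.Exit(1)
	}

	queue := make(chan string)
	go func() { queue <- args[0] }()

	for uri := range queue {
		enqueue(uri, queue)
	}
}

func enqueue(uri string, queue chan string) {
	if visited[uri] {
		return // we've already crawled this page
	}
	visited[uri] = true
	fmt.Println("fetching", uri)

	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get(uri)
	if err != nil {
		return
	}
	defer resp.Body.Close()

	for _, link := range collectlinks.All(resp.Body) {
		absolute := fixUrl(link, uri)
		if absolute != "" && !visited[absolute] {
			go func(l string) { queue <- l }(absolute)
		}
	}
}

func fixUrl(href, base string) string {
	uri, err := url.Parse(href)
	if err != nil {
		return ""
	}
	baseUrl, err := url.Parse(base)
	if err != nil {
		return ""
	}
	return baseUrl.ResolveReference(uri).String()
}
```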
And now you can start at any html page you like and slowly explore the entire world wide web.
    go run crawl.go http://6brand.com
Full source code is available on GitHub as well.