Build a Basic Web Scraper in Go

· 238 words · 2 minutes read

This is a single-page web scraper which uses the goquery library to parse the HTML and make it easy to query (much like jQuery). goquery provides a Find method we can use to query for classes and IDs in the same way as a CSS selector. In our example we use this to get the latest blog titles from golangcode.com.
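To see the selector syntax on its own, here is a small, self-contained sketch that parses an inline HTML snippet (the markup and element names below are made up purely for illustration, and it assumes goquery has been installed with go get github.com/PuerkitoBio/goquery):

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// A made-up HTML snippet, just to demonstrate the selectors
	html := `<div id="posts">
		<a class="post-link" href="/post-one">Post One</a>
		<a class="post-link" href="/post-two">Post Two</a>
	</div>`

	doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		log.Fatalln(err)
	}

	// Select by id, then by class - just like a CSS selector
	doc.Find("#posts a.post-link").Each(func(i int, s *goquery.Selection) {
		href, _ := s.Attr("href") // read an attribute, like jQuery's .attr()
		fmt.Println(s.Text(), "->", href)
	})
}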

If you needed to scrape an entire site rather than a single page, you could extend this by collecting each link's URL from the page and calling the scraper on those pages too (there's a sketch of this after the main example).

package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	blogTitles, err := GetLatestBlogTitles("https://golangcode.com")
	if err != nil {
		log.Fatalln(err)
	}
	fmt.Println("Blog Titles:")
	fmt.Print(blogTitles)
}

// GetLatestBlogTitles gets the latest blog title headings from the url
// given and returns them as a list.
func GetLatestBlogTitles(url string) (string, error) {

	// Get the HTML
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	// Convert HTML into goquery document
	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return "", err
	}

	// Save each .post-title as a list
	titles := ""
	doc.Find(".post-title").Each(func(i int, s *goquery.Selection) {
		titles += "- " + s.Text() + "\n"
	})
	return titles, nil
}
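
As mentioned above, if you wanted to scrape a whole site rather than a single page, one approach is to collect the href of every link on the page and visit any that point back to the same site. The sketch below shows one way that could look - the Crawl function, its parameters and the depth limit are my own additions for illustration, not part of the original example:

package main

import (
	"fmt"
	"log"
	"net/http"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

// Crawl visits a page, prints its post titles, then follows any internal
// links it finds, up to maxDepth levels deep. The visited map stops us
// requesting the same URL twice. (This is a sketch only.)
func Crawl(url, baseURL string, visited map[string]bool, depth, maxDepth int) {
	if depth > maxDepth || visited[url] {
		return
	}
	visited[url] = true

	resp, err := http.Get(url)
	if err != nil {
		log.Println(err)
		return
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Println(err)
		return
	}

	// Print the titles on this page, as in the main example
	doc.Find(".post-title").Each(func(i int, s *goquery.Selection) {
		fmt.Println("-", s.Text())
	})

	// Follow each link that points back to the same site
	doc.Find("a[href]").Each(func(i int, s *goquery.Selection) {
		href, _ := s.Attr("href")
		if strings.HasPrefix(href, "/") {
			href = baseURL + href
		}
		if strings.HasPrefix(href, baseURL) {
			Crawl(href, baseURL, visited, depth+1, maxDepth)
		}
	})
}

func main() {
	Crawl("https://golangcode.com", "https://golangcode.com", map[string]bool{}, 0, 1)
}

In a real crawler you would also want to respect robots.txt and rate-limit your requests so you don't overload the site you're scraping.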



Author:  Edd Turtle

Edd is the Lead Developer at Hoowla, a prop-tech startup, where he spends much of his time working on production-ready Go and PHP code. He loves coding, but also enjoys cycling and camping in his spare time.

See something which isn't right? You can contribute to this page on GitHub or just let us know in the comments below - Thanks for reading!