Build a Basic Web Scraper in Go

Oct 26, 2019 · 238 words · 2 minutes read

#web #scrape #http #html #query #classes #get #document #selection

This is a single-page web scraper. It uses the goquery library to parse the HTML so it can be queried easily, much like jQuery. The Find method accepts CSS selectors, so we can query for classes and ids in the same way as in a stylesheet. In our example we use it to get the latest blog titles from golangcode.

If you need to scrape an entire site, you could extend this to follow the links found on each page and repeat the query on every linked URL.
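
One way to follow links across a site is a breadth-first queue with a visited set. The sketch below shows only that bookkeeping, using the standard library. The names Crawl and LinkFetcher are hypothetical, and the fetcher is faked with an in-memory map; in a real scraper it would presumably download the page and collect href values, for example with goquery's Find("a[href]") and Attr("href").

```go
package main

import (
	"fmt"
	"log"
	"net/url"
)

// LinkFetcher (hypothetical) returns the raw href values found on a page.
type LinkFetcher func(pageURL string) ([]string, error)

// Crawl visits pages breadth-first from start, staying on the start
// page's host, and returns the URLs in the order they were visited.
// It stops once maxPages pages have been visited.
func Crawl(start string, maxPages int, fetch LinkFetcher) ([]string, error) {
	base, err := url.Parse(start)
	if err != nil {
		return nil, err
	}
	visited := map[string]bool{start: true}
	queue := []string{start}
	var order []string

	for len(queue) > 0 && len(order) < maxPages {
		page := queue[0]
		queue = queue[1:]
		order = append(order, page)

		hrefs, err := fetch(page)
		if err != nil {
			continue // skip pages that fail to load
		}
		pageURL, err := url.Parse(page)
		if err != nil {
			continue
		}
		for _, href := range hrefs {
			ref, err := url.Parse(href)
			if err != nil {
				continue
			}
			abs := pageURL.ResolveReference(ref) // make relative links absolute
			abs.Fragment = ""                    // "/a#top" and "/a" are the same page
			if abs.Host != base.Host {
				continue // do not leave the site
			}
			if key := abs.String(); !visited[key] {
				visited[key] = true
				queue = append(queue, key)
			}
		}
	}
	return order, nil
}

func main() {
	// A fake three-page site standing in for real HTTP requests.
	site := map[string][]string{
		"https://example.com/":  {"/a", "/b", "https://other.example/x"},
		"https://example.com/a": {"/b", "/"},
		"https://example.com/b": {"/a#top"},
	}
	fetch := func(p string) ([]string, error) { return site[p], nil }

	pages, err := Crawl("https://example.com/", 10, fetch)
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range pages {
		fmt.Println(p)
	}
}
```

The maxPages limit and the same-host check are there to keep the crawl bounded; a crawler for real use would also want rate limiting and respect for robots.txt.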

package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

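func main() {
	blogTitles, err := GetLatestBlogTitles("https://golangcode.com")
	if err != nil {
		// Stop here: there are no titles to print if the request failed
		log.Fatal(err)
	}
	fmt.Println("Blog Titles:")
	// Print, not Printf: the titles are data, not a format string
	fmt.Print(blogTitles)
}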

// GetLatestBlogTitles gets the latest blog title headings from the url
// given and returns them as a single string, one "- title" entry per line.
func GetLatestBlogTitles(url string) (string, error) {

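	// Get the HTML
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	// Always close the body so the connection can be reused
	defer resp.Body.Close()

	// Don't try to parse an error page as if it were the blog
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}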

	// Convert HTML into goquery document
	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return "", err
	}

	// Append each .post-title to a newline-separated list
	titles := ""
	doc.Find(".post-title").Each(func(i int, s *goquery.Selection) {
		titles += "- " + s.Text() + "\n"
	})
	return titles, nil
}

Author: Edd Turtle