Go Web Scraping Quick Start Guide

By: Smith

Overview of this book

Web scraping is the process of extracting information from the web using various tools that perform scraping and crawling. Go is emerging as the language of choice for scraping using a variety of libraries. This book will quickly explain how to scrape data from various websites using Go libraries such as Colly and Goquery. The book starts with an introduction to the use cases for building a web scraper and the main features of the Go programming language, along with setting up a Go environment. It then moves on to HTTP requests and responses and talks about how Go handles them. You will also learn about a number of basic web scraping etiquettes. You will be taught how to navigate through a website using breadth-first and depth-first searches, as well as how to find and follow links. You will get to know ways to track history in order to avoid loops and to protect your web scraper using proxies. Finally, the book will cover the Go concurrency model, how to run scrapers in parallel, and large-scale distributed web scraping.

Why do you need a web scraper?

There are many different use cases where you might need to build a web scraper. All of them center on the fact that information on the internet is often disparate, but can be very valuable when collected into a single package. Often, in these cases, the person collecting the information does not have a working or business relationship with the producers of the data, meaning they cannot request the information to be packaged and delivered to them. Without that relationship, whoever needs the data has to rely on their own means to gather it.

Search engines

One well-known use case for web scraping is indexing websites for the purpose of building a search engine. In this case, a web scraper would visit different websites and follow references to other websites in order to discover all of the content available on the internet. By collecting some of the content from the pages, you could respond to search queries by matching the terms to the contents of the pages you have collected. You could also suggest similar pages if you track how pages are linked together, and rank the most important pages by the number of connections they have to other sites.
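
To give a small taste of this link-following behavior in Go, here is a minimal crawler sketched with the Colly library, which this book introduces in later chapters. The domain, depth limit, and starting URL are placeholder assumptions, not part of any real search engine:

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// Restrict the crawl to a single placeholder domain so it does not
	// wander across the entire internet, and limit how deep it follows links.
	c := colly.NewCollector(
		colly.AllowedDomains("example.com"),
		colly.MaxDepth(2),
	)

	// For every anchor tag found, queue the linked page for a visit.
	// Colly remembers visited URLs, so the crawler avoids loops.
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	// Log each page as it is requested.
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("visiting", r.URL)
	})

	c.Visit("https://example.com/")
}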

Googlebot is the most famous example of a web scraper used to build a search engine. It is the first step in building the search engine, as it downloads, indexes, and ranks each page on a website. It will also follow links to other websites, which is how it is able to index a substantial portion of the internet. According to Googlebot's documentation, the scraper attempts to reach each web page every few seconds, which means it is estimated to crawl well into the billions of pages per day!

If your goal is to build a search engine, albeit on a much smaller scale, you will find enough tools in this book to collect the information you need. This book will not, however, cover indexing and ranking pages to provide relevant search results.

Price comparison

Another well-known use case is to find specific products or services sold through various websites and track their prices. You would be able to see who sells the item, who has the lowest price, or when it is most likely to be in stock. You might even be interested in similar products from different sources. Having a web scraper periodically visit websites to monitor these products and services would easily solve this problem. The same approach works for tracking prices for flights, hotels, and rental cars.

Sites like camelcamelcamel (https://camelcamelcamel.com/) build their business model around such a case. According to their blog post explaining how their system works, they actively collect pricing information from multiple retailers every half hour to every few hours, covering millions of products. This allows users to view pricing differences across multiple platforms, as well as get notified if the price of an item drops.

This type of web scraper requires very careful parsing of the web pages to extract only the content that is relevant. In later chapters, you will learn how to extract information from HTML pages in order to collect this information.
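
As a flavor of that parsing step, the sketch below fetches a page and pulls out only the elements matching a price selector, using the Goquery library covered later in this book. The URL and the ".price" selector are hypothetical; a real retailer's markup would need its own selector:

package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// Fetch a hypothetical product page.
	resp, err := http.Get("https://example.com/product/123")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Parse the HTML response into a queryable document.
	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Extract only the elements that carry the data we care about.
	doc.Find(".price").Each(func(i int, s *goquery.Selection) {
		fmt.Println("price found:", s.Text())
	})
}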

Building datasets

Data scientists often need hundreds of thousands of data points in order to build, train, and test machine learning models. In some cases, this data is already pre-packaged and ready for consumption. Most of the time, the scientist would need to venture out on their own and build a custom dataset. This is often done by building a web scraper to collect raw data from various sources of interest, and refining it so it can be processed later on. These web scrapers also need to periodically collect fresh data to update their predictive models with the most relevant information.

A common use case that data scientists run into is determining how people feel about a specific subject, known as sentiment analysis. Through this process, a company could look for discussions surrounding one of their products, or their overall presence, and gather a general consensus. In order to do this, the model must be trained on what a positive comment and a negative comment are, which could take thousands of individual comments in order to make a well-balanced training set. Building a web scraper to collect comments from relevant forums, reviews, and social media sites would be helpful in constructing such a dataset.
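
A rough sketch of such a collector might look like the following, again using Colly. The ".comment" selector and the forum URL are hypothetical placeholders for whatever source you sample; the output is one raw comment per CSV row, ready for later cleaning and labeling:

package main

import (
	"encoding/csv"
	"log"
	"os"

	"github.com/gocolly/colly"
)

func main() {
	// Output file for the raw dataset; one comment per row.
	f, err := os.Create("comments.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	w := csv.NewWriter(f)
	defer w.Flush()

	c := colly.NewCollector()

	// For every element matching the placeholder selector,
	// record the source URL and the comment text.
	c.OnHTML(".comment", func(e *colly.HTMLElement) {
		w.Write([]string{e.Request.URL.String(), e.Text})
	})

	c.Visit("https://example.com/forum/topic/1")
}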

These are just a few examples of web scrapers that drive large businesses such as Google, Mozenda, and Cheapflights.com. There are also companies that will scrape the web for whatever available data you need, for a fee. In order to run scrapers at such a large scale, you would need to use a language that is fast, scalable, and easy to maintain.
