In this recipe, we will learn how to collect data by web scraping, using a short shell script.
Besides having a Terminal open, you need to have basic knowledge of the grep and wget commands.
Now, we will write a script that scrapes links from imdb.com. The script uses wget to download the pages and grep to extract the links from them. Create a scrap_contents.sh script and write the following code in it:
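mkdir -p data
cd data
# Download imdb.com recursively, up to 5 levels deep (-l5), quietly (-q),
# keeping the site's directory structure on disk (-x).
wget -q -r -l5 -x https://imdb.com
cd ..
# Extract the value of every href attribute from the downloaded files.
grep -r -Po -h '(?<=href=")[^"]*' data/ > links.csv
# Keep only absolute links, remove duplicates, then clean up.
grep "^http" links.csv > links_filtered.csv
sort -u links_filtered.csv > links_final.csv
rm -rf data links.csv links_filtered.csv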
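The link-extraction steps of the script can be tried without crawling a live site. The following is a minimal sketch that runs the same grep and sort commands on a small hand-written page. The demo directory, the sample/index.html file, and the example.com links are made up for illustration and are not part of the recipe. It assumes GNU grep, since the -P option (Perl-compatible regular expressions) is not available in every grep implementation.

```shell
#!/usr/bin/env bash
# Sketch: run the recipe's extraction steps on a local sample page.
# The demo/ directory and its contents are hypothetical test data.
set -euo pipefail

mkdir -p demo/sample

# A tiny page with two distinct absolute links, one duplicate,
# and one relative link.
cat > demo/sample/index.html <<'EOF'
<html><body>
<a href="https://example.com/b">B</a>
<a href="/relative/path">relative</a>
<a href="https://example.com/a">A</a>
<a href="https://example.com/a">A again</a>
</body></html>
EOF

# Extract the value of every href attribute.
# -r: recurse into the directory, -P: Perl regex (needed for the lookbehind),
# -o: print only the matched part, -h: omit file names from the output.
grep -r -Po -h '(?<=href=")[^"]*' demo/sample/ > demo/links.csv

# Keep only absolute links (those starting with http).
grep "^http" demo/links.csv > demo/links_filtered.csv

# Sort the links and drop duplicates.
sort -u demo/links_filtered.csv > demo/links_final.csv

cat demo/links_final.csv
```

Running the sketch prints the two unique absolute links, https://example.com/a and https://example.com/b, one per line. The relative link is dropped by the "^http" filter and the repeated link is collapsed by sort -u.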