Getting started with Web Scraping

What is Web Scraping
Michelangelo, the Italian sculptor famous for his statue David, was once asked how he carved David from marble. He replied, “It is easy. You just chip away the stone that doesn’t look like David.” Web scraping is much the same: there is plenty of data available on the web, but we only need the part that is useful for our situation. Extracting the needed data from websites by analyzing their underlying HTML code is known as web scraping.

When to do Web Scraping

So when should we do web scraping? Imagine you need to collect data from a particular website, say ISRO's, looking for job vacancies: the job roles, application links, and job locations, all curated into a single .txt file. Surfing the web and copy-pasting everything by hand would be tedious. It might be manageable for a single website, but what if you are asked to do the same for many websites? This is where web scraping comes in. With a few lines of code, you can automate the whole process. All you need is a bit of patience: use the inspect tool of Chrome (or any other browser you prefer), understand the underlying HTML code, and scrape what you need.

How to do Scraping

Here I am using BeautifulSoup, a Python library, to start with web scraping.
The code editor used is Visual Studio Code.
How to start with:
1. Install BeautifulSoup library in your system by opening the terminal and writing the command: pip install beautifulsoup4
2. Install requests library using the command: pip install requests
3. Next, we want to parse our HTML into Python objects, and for that we need a parser. Here I am using the lxml parser. Other parsers are available, but I prefer lxml because it copes well with HTML that is not clean. Install it using the command: pip install lxml
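To confirm the setup works before touching a real website, you can parse a small HTML string locally. The snippet below is a minimal sketch (the HTML is made up for illustration); note how the lxml parser tolerates the deliberately unclosed tags and still builds a usable tree:

```python
from bs4 import BeautifulSoup

# A deliberately messy snippet: the <p> tags are never closed.
html = "<html><body><p class='msg'>Hello, scraper!<p>Second</body></html>"

# lxml repairs the markup and still lets us navigate it.
soup = BeautifulSoup(html, "lxml")
print(soup.find("p", class_="msg").text)  # prints: Hello, scraper!
```

If this runs without errors, BeautifulSoup, lxml, and your Python environment are all wired up correctly.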

Now our setup is ready and now we can move on with web scraping.

Basic understanding of Scraping

Now we are going to scrape the quotes from the website: https://www.goodreads.com/quotes

Python code

from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.goodreads.com/quotes').text
soup = BeautifulSoup(source, 'lxml')

for obj in soup.find_all('div', class_='quoteDetails'):
    quote = obj.div.text
    print(quote)
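The same loop can pull out more than the quote text. The sketch below runs against a small stand-in HTML string (so it works offline) that mimics the quoteDetails structure used above; the authorOrTitle class name is an assumption based on that page's typical markup, so inspect the live page to confirm it:

```python
from bs4 import BeautifulSoup

# Stand-in for the page, mirroring the quoteDetails structure above.
# The authorOrTitle class is an assumption for illustration.
html = """
<div class="quoteDetails">
  <div class="quoteText">"Be yourself; everyone else is already taken."</div>
  <span class="authorOrTitle">Oscar Wilde</span>
</div>
"""

soup = BeautifulSoup(html, "lxml")
for obj in soup.find_all("div", class_="quoteDetails"):
    quote = obj.div.text.strip()
    author = obj.find("span", class_="authorOrTitle").text.strip()
    print(f"{quote} -- {author}")
```

Swapping the stand-in string for the requests.get(...).text source from the snippet above would apply the same extraction to the live page.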

The output will be the text of each quote, printed one after another.

This is just a simple demonstration, and you can do much more. Here I am attaching the link to a project: an automated email alert that fires when the price of a desired product on Flipkart drops below a particular value. The name and price of the item are scraped using the BeautifulSoup library. The scraped price is then checked every 86400 seconds (i.e., one day), and when the price drops below the specified value, an automated email is sent to a specified email address with the help of Python's smtplib.
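The skeleton of such an alert can be sketched as below. This is not the project's actual code; the threshold, email addresses, and SMTP server are hypothetical placeholders, and the scraping part is stubbed out since it follows the same requests + BeautifulSoup pattern shown earlier:

```python
import smtplib
from email.message import EmailMessage

TARGET_PRICE = 50000  # hypothetical threshold


def should_alert(current_price, target_price=TARGET_PRICE):
    """Return True when the scraped price drops below the target."""
    return current_price < target_price


def send_alert(price, to_addr="you@example.com"):
    # Hypothetical SMTP details; replace with your own server and
    # credentials (e.g. a Gmail app password) before using.
    msg = EmailMessage()
    msg["Subject"] = f"Price dropped to {price}!"
    msg["From"] = "alerts@example.com"
    msg["To"] = to_addr
    msg.set_content("The product you are watching is below your target price.")
    with smtplib.SMTP_SSL("smtp.example.com", 465) as server:
        server.login("alerts@example.com", "app-password")
        server.send_message(msg)


# Main loop (commented out so the sketch stays runnable):
# scrape_price() would fetch and parse the product page as shown above.
# import time
# while True:
#     price = scrape_price()
#     if should_alert(price):
#         send_alert(price)
#         break
#     time.sleep(86400)  # check once a day
```

The daily sleep keeps the script polite to the site; a cron job or scheduled task is another common way to run the check once a day.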

GitHub link: https://github.com/Sradha-0805/flipkartEmailAlert

Resources and References

The spark for web scraping came from a video by Tintu Vlogger (a Malayalam tech YouTuber). Video link: https://youtu.be/GeqYKGZ63aI

I am very thankful to Alex the Analyst, a YouTuber, for the project idea he shared in his video. Video link: https://youtu.be/HiOtQMcI5wg

My concepts were cleared up through the videos of Corey Schafer.

BeautifulSoup Documentation : https://www.crummy.com/software/BeautifulSoup/bs4/doc/