A Beginner's Guide to Web Scraping in Python

July 21, 2020

Introduction

Web scraping is the programmatic extraction of information from online sources. It's useful for many tasks, such as gathering data for analysis, writing articles, or building datasets for Machine/Deep Learning.

Basics

You can broadly split web scraping into two parts: fetching the contents of the webpage, and parsing them. We can fetch the HTML of a page with a GET request using a library like requests, and then parse it with BeautifulSoup.

First, let's install the libraries we are going to use (requests, BeautifulSoup and pandas):

pip install requests pandas beautifulsoup4

Make a request:

import requests 
url = "https://in.reuters.com/"
data = requests.get(url)
print(data.content)

This will print the raw HTML content returned by that URL.
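Before parsing anything, it's worth checking that the request actually succeeded. A minimal sketch (status-code checking is standard requests usage):

import requests

url = "https://in.reuters.com/"
data = requests.get(url)

# A status code of 200 means the request succeeded.
if data.status_code == 200:
    print(data.content)
else:
    print("Request failed with status code", data.status_code)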

Parse the response

from bs4 import BeautifulSoup 
import requests 

url = "https://in.reuters.com/"
data = requests.get(url)

soup = BeautifulSoup(data.content, 'html.parser')  # Convert the raw data into a parsable form.
print(soup.find_all('div', class_="edition"))

The above code retrieves the data from the website, converts it into a parsable form, and extracts the edition we are currently viewing. This is done using the find_all method, passing in the HTML tag (in this case <div>) and the class name (edition). find_all returns a list of every matching element.

But how did we come to know the exact tag and class name? By inspecting the website with your browser's developer tools: in most browsers, right-click the text you want and choose "Inspect" to see its HTML.
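For example, once you know the tag and class, you can pull out just the text of the first match. A minimal sketch, assuming the page still uses the edition class:

from bs4 import BeautifulSoup
import requests

url = "https://in.reuters.com/"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# find returns the first matching element, or None if nothing matches.
edition = soup.find('div', class_="edition")
if edition is not None:
    print(edition.get_text(strip=True))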

Nice, we have all the basic building blocks to start scraping!

Building a real scraper

First, import the libraries:

import requests  
import pandas as pd
from bs4 import BeautifulSoup 

For a simple scraper, let’s focus only on one article. Let’s retrieve the title, content and time.

url = "https://in.reuters.com/article/apple-carbon/apple-to-remove-carbon-from-supply-chain-products-by-2030-idINKCN24M1LM" 
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
data = requests.get(url = url, headers = headers) 
soup = BeautifulSoup(data.content, 'html.parser') 

We know everything here except headers. Headers are additional information passed along with an HTTP request. They matter here because many websites block requests that don't appear to come from a real browser, and a browser-like User-Agent helps avoid that. You can find your own browser's User-Agent string by searching "what is my user agent" online.
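If you plan to make several requests with the same headers, a requests.Session saves you from repeating them on every call. A small sketch, reusing the header value from above:

import requests

session = requests.Session()
session.headers.update({'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'})

# Every request made through the session now carries the User-Agent header.
data = session.get("https://in.reuters.com/")
print(data.status_code)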

Now that we have the data in a parsable form, let's extract the relevant information.

headline = soup.find('h1', class_="ArticleHeader_headline").get_text()
body = soup.find('div', class_="StandardArticleBody_body").get_text()
date_and_time = soup.find('div', class_="ArticleHeader_date").get_text()
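These class names come straight from the article's HTML, and news sites change their markup often. One caveat: find returns None when no element matches, and calling get_text() on None raises an AttributeError. A defensive sketch:

# Guard against a missing or renamed element before extracting its text.
node = soup.find('h1', class_="ArticleHeader_headline")
headline = node.get_text() if node is not None else ""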

We use the get_text() method of BeautifulSoup to get only the text (without the accompanying HTML tags). Finally, we can save the data as a CSV file using the pandas library.

df_dic = {"Headline": headline, "Content": body, "Date/Time": date_and_time}
df = pd.DataFrame(df_dic, index=[0])
df.to_csv('Reuters.csv')

Here we first create a dictionary containing all of the values, convert it into a pandas DataFrame, and save it as a CSV file.
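If you later want more than one article, the same pattern scales: collect one dictionary per article and build a single DataFrame. A sketch under the assumption that every URL points to a Reuters article with the same class names (article_urls is a hypothetical list you supply):

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}

# Hypothetical list of article URLs; replace with the articles you want.
article_urls = [
    "https://in.reuters.com/article/apple-carbon/apple-to-remove-carbon-from-supply-chain-products-by-2030-idINKCN24M1LM",
]

rows = []
for url in article_urls:
    soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
    rows.append({
        "Headline": soup.find('h1', class_="ArticleHeader_headline").get_text(),
        "Content": soup.find('div', class_="StandardArticleBody_body").get_text(),
        "Date/Time": soup.find('div', class_="ArticleHeader_date").get_text(),
    })

# index=False drops the DataFrame's row index from the CSV.
pd.DataFrame(rows).to_csv('Reuters.csv', index=False)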

Congratulations! You have built your first web scraper! Check out part 2 for a more advanced and complete scraper.

Complete Code:

from bs4 import BeautifulSoup 
import pandas as pd 
import requests 

url = "https://in.reuters.com/article/apple-carbon/apple-to-remove-carbon-from-supply-chain-products-by-2030-idINKCN24M1LM" 
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}

data = requests.get(url = url, headers = headers) 
soup = BeautifulSoup(data.content, 'html.parser')

headline = soup.find('h1', class_="ArticleHeader_headline").get_text()
body = soup.find('div', class_="StandardArticleBody_body").get_text()
date_and_time = soup.find('div', class_="ArticleHeader_date").get_text()

df_dic = {"Headline": headline, "Content": body, "Date/Time": date_and_time}
df = pd.DataFrame(df_dic, index=[0])
df.to_csv('Reuters.csv')