Web Scraping in Python: the requests and BeautifulSoup modules

The requests library

The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python requests library. The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us. There are several different types of requests we can make using requests, of which GET is just one.

Let’s try downloading the page https://deexams.com/machine_learning_languages. We can download it using the requests.get method.

import requests
url = "https://deexams.com/machine_learning_languages"
page = requests.get(url)   # make a GET request and download the page
print(page)


After running our request, we get a Response object. This object has a status_code property, which indicates if the page was downloaded successfully.

print(page.status_code)

200

A status_code of 200 means that the page downloaded successfully.
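As a minimal sketch (reusing the page response from above), we can check the code before going any further, or let requests raise an exception for a failed request:

if page.status_code == 200:
    print("Page downloaded successfully")
else:
    print("Download failed with status", page.status_code)

page.raise_for_status()   # raises an HTTPError for 4xx/5xx responses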

We can print out the HTML content of the page using the content property:

print(page.content[:150]) # Only printing 150 characters

b'\n\n<!DOCTYPE HTML>\n\n<head>\n    <title>MACHINE LEARNING PROGRAMMING LANGUAGES - dEexams.com </title>\n    <script async src="//pagead2.go'
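Note that page.content is raw bytes (hence the leading b'). If we prefer a decoded string, requests also provides a text property; a quick sketch:

print(page.text[:150])   # the same HTML, decoded to a str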

Parsing a page with BeautifulSoup

So far, we have only downloaded the HTML document.

We can use the BeautifulSoup library to parse this document, and extract the text from the div tag. We first have to import the library, and create an instance of the BeautifulSoup class to parse our document:

from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')

We can now print out the HTML content of the page, formatted nicely, using the prettify method on the BeautifulSoup object:

print(soup.prettify())

As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the children property of soup. Note that children returns a generator, so we need to call the list function on it:

print(list(soup.children))

Now let’s see what the type of each element in the list is:

print([type(item) for item in list(soup.children)])

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

As you can see, all of the items are BeautifulSoup objects. The first is a Doctype object, which contains information about the type of the document. The second is a NavigableString, which represents text found in the HTML document. The final item is a Tag object, which contains other nested tags. The most important object type, and the one we’ll deal with most often, is the Tag object.

The Tag object allows us to navigate through an HTML document, and extract other tags and text.
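As a quick sketch (using the soup object we already created, and assuming the page has a title tag), a Tag can also be reached as an attribute and its text pulled out directly:

print(soup.title)             # the <title> Tag object
print(soup.title.get_text())  # just the text inside it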

We can now select the html tag and its children by taking the third item in the list:

html = list(soup.children)[2]

Each item in the list returned by the children property is also a BeautifulSoup object, so we can use the children property on html as well.

Now, we can find the children inside the html tag:

list(html.children)

['\n', <head> <title>MACHINE LEARNING PROGRAMMING LANGUAGES - dEexams.com </title>........</meta></meta></head>, '\n', <body>.... ]

As you can see above, there are two tags here: head and body. We want to extract the text inside the div tag, so we’ll dive into the body:

body = list(html.children)[3]

Now, we can get the div tag by finding the children of the body tag:

print(list(body.children))

['\n', <div id="fb-root"></div>, '\n', <script> ......</script>,'\n', <div class="container-fluid bg-dark"> ....</div>, '\n', <header>..</header>, '\n', <div class="slider-main ...<p>...<h3>...]

We can now isolate the section tag which has the div inside it:

section = list(body.children)[36]

print(list(section.children))


['\n', <div class="container-fluid">
<div class="row">
<div class="col-lg-1"></div>
<div class="col-md-12 col-lg-8 ptb-15 pr-30 pr-md-15">
<h3 class="mtb-10"><b>MACHINE LEARNING PROGRAMMING LANGUAGES</b></h3>
<div class="news_block">
<p  .....]

Now we can take the outer div:

div = list(section.children)[1]
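Picking children by numeric index like this is brittle if the page layout changes. As an alternative sketch (assuming the target div is the first container-fluid div inside the section), we could locate it by class instead:

div = section.find('div', class_='container-fluid')   # same div, found by class instead of index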

Once we’ve isolated the tag, we can use the get_text method to extract all of the text inside the tag:

print(div.get_text())
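The raw text usually contains a lot of blank lines and stray whitespace. get_text also accepts separator and strip arguments; a quick sketch:

print(div.get_text(separator=' ', strip=True))   # join nested strings with spaces and strip whitespace around each one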

Finding all instances of a tag at once

What we did above was useful for figuring out how to navigate a page, but it took a lot of commands to do something fairly simple. If we want to extract all the instances of a single tag, we can instead use the find_all method, which will find every instance of that tag on the page.

Suppose we want to take all the p tags:

soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')

Note that find_all returns a list, so we’ll have to loop through it or use list indexing to extract text. Let’s take the first <p> tag’s text:

soup.find_all('p')[0].get_text()

In 1959, Arthur Samuel mentioned the words machine learning out of the blue to investigate the development of algorithms that can be utilized to forecast on data by conquering static programming instructions entirely to settle on predictions and choices based on data. Machine learning is utilized today in various computing works where the utilization of unequivocal programming and designing algorithms isn’t practical like detection of a data breach by malevolent insiders or system intruders and so forth.
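To print the text of every paragraph rather than just the first one, we can simply loop over the list that find_all returns; a minimal sketch:

for p in soup.find_all('p'):
    print(p.get_text())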

If you instead only want to find the first instance of a tag, you can use the find method, which will return a single BeautifulSoup object. The same result as above can be obtained with:

print(soup.find('p').get_text())

Searching for tags by class and id

Suppose we want to find all the p tags which have the class "lang-img":

print(soup.find_all('p', class_="lang-img"))

And if we want to search for all the tags which have the class "lang-img":

print(soup.find_all(class_="lang-img"))

To search for a tag by id:

print(soup.find(id="signin"))

<button class="signin inaction" id="signin" type="button">Sign In</button>
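A found Tag also exposes its HTML attributes like a dictionary, so (as a sketch, reusing the button we just found) we can read them directly:

button = soup.find(id="signin")
print(button['class'])       # ['signin', 'inaction']
print(button.get('type'))    # 'button'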

Finding tags using CSS selectors

You can also search for items using CSS selectors. These selectors are how the CSS language allows developers to specify HTML tags to style.

Here are some examples:

  • p a — finds all a tags inside of a p tag.
  • body p a — finds all a tags inside of a p tag inside of a body tag.
  • html body — finds all body tags inside of an html tag.
  • p.outer-text — finds all p tags with a class of outer-text.
  • p#first — finds all p tags with an id of first.
  • body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.

BeautifulSoup objects support searching a page via CSS selectors using the select method. We can use CSS selectors to find all the p tags in our page that are inside of a div like this:

print(soup.select("div p"))

Note that the select method above returns a list of BeautifulSoup objects, just like find_all (whereas find returns a single object).
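As with find_all, we can loop over the list that select returns and call get_text on each element; a quick sketch:

for p in soup.select("div p"):
    print(p.get_text()[:80])   # first 80 characters of each matching paragraph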
