Round 1:
Most publishers allow programmers to crawl their websites to some extent. Put differently, a publisher usually wants only specific portions of a site to be crawled, so there must be a way to state which parts are open to crawlers and which are off limits. These rules are declared in a file called robots.txt.
robots.txt is a human-readable file that identifies the portions of a website that crawlers are allowed, and not allowed, to scrape. The rules follow a simple convention (the Robots Exclusion Protocol), and publishers can tailor them to their needs. You can view the robots.txt file for a particular website by appending /robots.txt to the site's root URL.
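A minimal sketch using Python's standard urllib.robotparser module to read a site's robots.txt and check whether a path may be crawled (example.com and the page path are placeholders):

from urllib.robotparser import RobotFileParser

# Placeholder site; substitute the website you actually intend to crawl.
robots_url = "https://www.example.com/robots.txt"

parser = RobotFileParser()
parser.set_url(robots_url)
parser.read()  # downloads and parses the robots.txt file

# Check whether a generic crawler ("*") may fetch a given page.
print(parser.can_fetch("*", "https://www.example.com/some/page.html"))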
Round 2:
Python's re module provides three methods for modifying strings, demonstrated in the sketch after this list:
split() – splits a given string into a list wherever the regex pattern matches.
sub() – finds all substrings where the regex pattern matches and replaces them with a different string.
subn() – similar to sub(), but it returns a tuple of the new string and the number of replacements made.
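A short example of all three methods (the sample strings are made up for illustration):

import re

text = "one, two,three ,four"

# split(): break the string on a comma with optional surrounding whitespace
print(re.split(r"\s*,\s*", text))                  # ['one', 'two', 'three', 'four']

# sub(): replace every run of digits with '#'
print(re.sub(r"\d+", "#", "room 101, floor 7"))    # 'room #, floor #'

# subn(): same replacement, but also report how many substitutions were made
print(re.subn(r"\d+", "#", "room 101, floor 7"))   # ('room #, floor #', 2)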
Round 3:
How do you load a file into a database if the delimiter is unknown?
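One common approach, sketched below under the assumptions that the file has a header row and that the header names are safe to use as SQL identifiers, is to let csv.Sniffer guess the delimiter and then load the rows into SQLite (the file name data.txt and table name data are hypothetical):

import csv
import sqlite3

with open("data.txt", newline="") as f:
    sample = f.read(4096)
    dialect = csv.Sniffer().sniff(sample)  # guesses ',', ';', '\t', '|', ...
    f.seek(0)
    rows = list(csv.reader(f, dialect))

header, records = rows[0], rows[1:]

conn = sqlite3.connect("example.db")
columns = ", ".join(header)
placeholders = ", ".join("?" for _ in header)
conn.execute(f"CREATE TABLE IF NOT EXISTS data ({columns})")
conn.executemany(f"INSERT INTO data ({columns}) VALUES ({placeholders})", records)
conn.commit()
conn.close()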
Round 4:
How do you do API integration in Python?
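A minimal sketch using the popular third-party requests library, assuming a hypothetical REST endpoint and bearer-token authentication (the URL, key, and payload fields are placeholders):

import requests

BASE_URL = "https://api.example.com/v1/users"
headers = {"Authorization": "Bearer YOUR_API_KEY"}

# GET: fetch data from the API
response = requests.get(BASE_URL, headers=headers, params={"page": 1}, timeout=10)
response.raise_for_status()   # raise an error for 4xx/5xx responses
users = response.json()       # parse the JSON body into Python objects

# POST: send data to the API
payload = {"name": "Deepak", "role": "developer"}
created = requests.post(BASE_URL, headers=headers, json=payload, timeout=10)
created.raise_for_status()
print(created.json())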