Web scraping of weather data

We'll extract the extended forecast data from the National Weather Service forecast page at forecast.weather.gov.

To see the HTML tags of the extended forecast area, inspect the page with your browser's developer tools.

We can then scroll up in the elements panel to find the "outermost" element that contains all of the text that corresponds to the extended forecasts.

In this case, it’s a div tag with the id seven-day-forecast.

If you click around in the elements panel and explore the div, you’ll discover that each forecast item (like "Tonight", "Thursday", and "Thursday Night") is contained in a div with the class tombstone-container, which in turn sits inside an li with the class forecast-tombstone.

Now, we can follow the steps below to parse the page:

  • Download the web page containing the forecast.
  • Create a BeautifulSoup object to parse the page.
  • Find the div with id seven-day-forecast, and assign it to seven_day.
  • Inside seven_day, find each individual forecast item.
  • Extract and print the first forecast item.
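
import requests
from bs4 import BeautifulSoup

url = "https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168#.XqMU2sgzZPY"

# Download the raw HTML of the forecast page.
page = requests.get(url).content

# Parse it with Python's built-in HTML parser.
soup = BeautifulSoup(page, "html.parser")

# The container that holds every extended-forecast item.
seven_day = soup.find(id="seven-day-forecast")

# Uncomment to inspect the direct children of the container.
# print(list(seven_day.children))

# Each forecast item is an element with the class forecast-tombstone.
forecast_items = seven_day.find_all(class_="forecast-tombstone")

today = forecast_items[0]
print(today.prettify())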

<li class="forecast-tombstone">
  <div class="tombstone-container" >
  <p class="period-name">
   Today
   <br/>
   <br/>
  </p>
  <p>
   <img alt="Today: Sunny, with a high near 75. West wind 5 to 13 mph. " class="forecast-icon" src="newimages/medium/few.png" title="Today: Sunny, with a high near 75. West wind 5 to 13 mph. "/>
  </p>
  <p class="short-desc">
   Sunny
  </p>
  <p class="temp temp-high">
   High: 75 °F
  </p>
 </div>
</li>
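
The download step above assumes the request always succeeds. A slightly more defensive version might look like the sketch below. The helper name fetch_html, the 10-second timeout, and the User-Agent string are assumptions made for illustration and are not part of the original script. raise_for_status() turns an HTTP error response into an exception, so an error page never reaches the parser.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Download a page and return its raw bytes, raising on HTTP errors.

    `session` is any object with a requests-style `get` method; it defaults
    to the `requests` module itself. The helper name, the timeout, and the
    User-Agent value are assumptions made for this sketch.
    """
    http = session if session is not None else requests
    response = http.get(
        url,
        timeout=timeout,
        headers={"User-Agent": "weather-tutorial-example"},
    )
    response.raise_for_status()  # raises an exception for 4xx/5xx responses
    return response.content
```

The returned bytes can be passed to BeautifulSoup exactly as page is in the script above.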

Extracting information from the page

As you can see, inside the forecast item today is all the information we want. There are 4 pieces of information we can extract:

  • The name of the forecast item (in this case, Today).
  • The description of the conditions, which is stored in the title attribute of the img tag.
  • A short description of the conditions (in this case, Sunny).
  • The high temperature (in this case, 75 degrees).

We’ll extract the name of the forecast item, the short description, and the temperature first, since they’re all similar:

period = today.find(class_="period-name").get_text()
print(period)
short_desc = today.find(class_="short-desc").get_text()
print(short_desc)
temp = today.find(class_="temp").get_text()
print(temp)

Today
Sunny
High: 75 °F
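
find returns None when nothing matches, so calling get_text() on the result fails if a forecast item lacks one of these elements. Below is a minimal sketch of a guard, run on hand-written sample markup rather than the live page. The helper name text_of and the sample itself are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written sample modelled on the prettified output above.
SAMPLE = """
<li class="forecast-tombstone">
  <div class="tombstone-container">
    <p class="period-name">Today<br/><br/></p>
    <p class="short-desc">Sunny</p>
    <p class="temp temp-high">High: 75 °F</p>
  </div>
</li>
"""


def text_of(item, css_class):
    """Return the stripped text of the first match, or None if absent."""
    tag = item.find(class_=css_class)
    return tag.get_text(strip=True) if tag is not None else None


item = BeautifulSoup(SAMPLE, "html.parser").find(class_="forecast-tombstone")
print(text_of(item, "period-name"))  # Today
print(text_of(item, "temp"))         # High: 75 °F
print(text_of(item, "wind"))         # None (no such element in the sample)
```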

Now, we can extract the title attribute from the img tag. To do this, we treat the Tag object that find returns like a dictionary, and pass in the attribute we want as a key:

img = today.find("img")
desc = img["title"]
print(desc)

Today: Sunny, with a high near 75. West wind 5 to 13 mph. 
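
Indexing with img["title"] raises a KeyError if the attribute is missing. Where that matters, Tag.get returns None (or a supplied default) instead. Here is a small sketch on hand-written markup; the two sample tags are assumptions, not taken from the live page.

```python
from bs4 import BeautifulSoup

html = (
    '<p><img alt="icon" title="Today: Sunny, with a high near 75." '
    'src="few.png"/></p>'
    '<p><img alt="no title here" src="other.png"/></p>'
)
with_title, without_title = BeautifulSoup(html, "html.parser").find_all("img")

print(with_title["title"])                # Today: Sunny, with a high near 75.
print(without_title.get("title"))         # None
print(without_title.get("title", "n/a"))  # n/a
print(sorted(with_title.attrs))           # ['alt', 'src', 'title']
```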

Extracting all the information from the page

Now, using CSS selectors and list comprehensions, we can extract everything at once.

The steps are:

  • Select all items with the class period-name inside an item with the class tombstone-container in seven_day.
  • Use a list comprehension to call the get_text method on each Tag object that the selector returns.

period_tag = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tag]
print(periods)

['Today', 'Tonight', 'Saturday', 'SaturdayNight', 'Sunday', 'SundayNight', 'Monday', 'MondayNight', 'Tuesday']
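
Names such as 'SaturdayNight' are run together because the two words are separated only by a <br/> tag, which get_text() drops. Passing a separator together with strip=True keeps them apart. The sketch below uses sample markup that is an assumption modelled on the output above.

```python
from bs4 import BeautifulSoup

html = """
<div id="seven-day-forecast">
  <div class="tombstone-container"><p class="period-name">Today<br/><br/></p></div>
  <div class="tombstone-container"><p class="period-name">Saturday<br/>Night</p></div>
</div>
"""
seven_day = BeautifulSoup(html, "html.parser").find(id="seven-day-forecast")
period_tags = seven_day.select(".tombstone-container .period-name")

# Default behaviour: text fragments are joined with nothing in between.
print([pt.get_text() for pt in period_tags])
# ['Today', 'SaturdayNight']

# With a separator, fragments on either side of <br/> stay apart.
print([pt.get_text(separator=" ", strip=True) for pt in period_tags])
# ['Today', 'Saturday Night']
```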

We can apply the same technique to get the other three fields:

short_desc_tag = seven_day.select(".tombstone-container .short-desc")
short_descs = [st.get_text() for st in short_desc_tag]
print(short_descs)

temp_tag = seven_day.select(".tombstone-container .temp")
temps = [tt.get_text() for tt in temp_tag]
print(temps)

img_tag = seven_day.select(".tombstone-container img")
descs = [it["title"] for it in img_tag]
print(descs)

['Sunny', 'Partly Cloudy', 'Partly Sunny', 'Mostly Cloudy', 'Mostly Sunny', 'Mostly Clear', 'Sunny', 'Mostly Clear', 'Sunny']
['High: 75 °F', 'Low: 56 °F', 'High: 72 °F', 'Low: 54 °F', 'High: 68 °F', 'Low: 53 °F', 'High: 70 °F', 'Low: 56 °F', 'High: 75 °F']
['Today: Sunny, with a high near 75. West wind 5 to 13 mph. ', 'Tonight: Partly cloudy, with a low around 56. West wind 7 to 11 mph. ', 'Saturday: Partly sunny, with a high near 72. West wind 6 to 13 mph. ', 'Saturday Night: Mostly cloudy, with a low around 54. West wind 10 to 15 mph, with gusts as high as 18 mph. ', 'Sunday: Mostly sunny, with a high near 68. West wind 9 to 15 mph, with gusts as high as 20 mph. ', 'Sunday Night: Mostly clear, with a low around 53.', 'Monday: Sunny, with a high near 70.', 'Monday Night: Mostly clear, with a low around 56.', 'Tuesday: Sunny, with a high near 75.']
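
Running four separate selectors assumes that every forecast item has all four elements. If one item lacked, say, a temperature, the lists would have different lengths and would no longer line up. One alternative is to loop over the items and build one record per item. The sketch below works under that assumption, using hand-written sample markup and a hypothetical helper named parse_item.

```python
from bs4 import BeautifulSoup

# Sample markup: the second item deliberately has no image and no temperature.
html = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Today</p>
    <p><img title="Today: Sunny, with a high near 75." src="few.png"/></p>
    <p class="short-desc">Sunny</p>
    <p class="temp temp-high">High: 75 °F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p class="short-desc">Partly Cloudy</p>
  </div>
</div>
"""


def parse_item(item):
    """Build one record per forecast item; missing fields become None."""
    def text(selector):
        tag = item.select_one(selector)
        return tag.get_text(" ", strip=True) if tag is not None else None

    img = item.select_one("img")
    return {
        "period": text(".period-name"),
        "short_desc": text(".short-desc"),
        "temp": text(".temp"),
        "desc": img.get("title") if img is not None else None,
    }


seven_day = BeautifulSoup(html, "html.parser").find(id="seven-day-forecast")
records = [parse_item(i) for i in seven_day.select(".tombstone-container")]
for record in records:
    print(record)
```

A list of dictionaries like records can be passed directly to pd.DataFrame, with each key becoming a column.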

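Combining our data into a pandas DataFrame

We can now combine the data into a pandas DataFrame and analyze it. A DataFrame is an object that stores tabular data, which makes data analysis easier. Because the forecast page is live, the outputs shown from here on come from later runs and do not match the lists above.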

In order to do this, we’ll call the DataFrame class, and pass in each list of items that we have. We pass them in as part of a dictionary. Each dictionary key will become a column in the DataFrame, and each list will become the values in the column:

import pandas as pd

weather = pd.DataFrame({
    "periods": periods,
    "short_descs": short_descs,
    "temps": temps,
    "descs": descs,
})
print(weather)

          periods  ...                                              descs
0       Overnight  ...  Overnight: Partly cloudy, with a low around 54...
1          Sunday  ...  Sunday: Sunny, with a high near 67. West wind ...
2     SundayNight  ...  Sunday Night: Mostly clear, with a low around ...
3          Monday  ...  Monday: Mostly sunny, with a high near 69. Wes...
4     MondayNight  ...  Monday Night: Mostly clear, with a low around ...
5         Tuesday  ...               Tuesday: Sunny, with a high near 73.
6    TuesdayNight  ...  Tuesday Night: Partly cloudy, with a low aroun...
7       Wednesday  ...      Wednesday: Partly sunny, with a high near 69.
8  WednesdayNight  ...  Wednesday Night: Mostly cloudy, with a low aro...

[9 rows x 4 columns]
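
The "..." in the printout only means that pandas has hidden the middle columns to fit the terminal width. To see every column, DataFrame.to_string() renders the whole frame. The sketch below uses a two-row stand-in frame with values copied from the lists earlier in the article, since the live data changes between runs.

```python
import pandas as pd

# Stand-in data: two rows copied from the example lists, not a live scrape.
weather = pd.DataFrame({
    "periods": ["Today", "Tonight"],
    "short_descs": ["Sunny", "Partly Cloudy"],
    "temps": ["High: 75 °F", "Low: 56 °F"],
    "descs": [
        "Today: Sunny, with a high near 75. West wind 5 to 13 mph. ",
        "Tonight: Partly cloudy, with a low around 56. West wind 7 to 11 mph. ",
    ],
})

print(weather.shape)        # (2, 4)
print(weather.to_string())  # every column is rendered, none hidden
```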

We can now do some analysis on the data. For example, we can use a regular expression and the Series.str.extract method to pull out the numeric temperature values:

# The named group (?P<temp>...) becomes the column name of the result.
temp = weather["temps"].str.extract(r"(?P<temp>[0-9]+)")
weather["temp_num"] = temp["temp"].astype(int)
print(temp)

  temp
0   66
1   54
2   70
3   56
4   72
5   55
6   68
7   55
8   67
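
To see what the pattern does in isolation, the same idea can be tried with the standard re module: [0-9]+ matches the first run of digits, and a named group makes the match retrievable by name. The group name temp is an assumption inferred from the column header in the output above.

```python
import re

# Named group "temp" captures the first run of digits in the text.
TEMP_PATTERN = re.compile(r"(?P<temp>[0-9]+)")

for text in ["High: 75 °F", "Low: 56 °F", "no reading"]:
    match = TEMP_PATTERN.search(text)
    value = int(match.group("temp")) if match else None
    print(repr(text), "->", value)
# 'High: 75 °F' -> 75
# 'Low: 56 °F' -> 56
# 'no reading' -> None
```

The pattern ignores a leading minus sign, so sub-zero readings would need something like -?[0-9]+ instead.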

We could then find the mean of all the high and low temperatures:

print(weather["temp_num"].mean())
62.55555555555556
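
A single mean blends daytime highs with overnight lows, so averaging them separately is often more informative. The sketch below rebuilds stand-in strings from the nine numbers printed above, with the High and Low labels assumed to alternate. The column name kind is introduced only for this example.

```python
import pandas as pd

# Stand-in strings: numbers from the printed output, labels assumed.
frame = pd.DataFrame({"temps": [
    "High: 66 °F", "Low: 54 °F", "High: 70 °F", "Low: 56 °F", "High: 72 °F",
    "Low: 55 °F", "High: 68 °F", "Low: 55 °F", "High: 67 °F",
]})

# expand=False returns a Series when the pattern has a single group.
frame["temp_num"] = frame["temps"].str.extract(r"([0-9]+)", expand=False).astype(int)
frame["kind"] = frame["temps"].str.extract(r"^(High|Low)", expand=False)

means = frame.groupby("kind")["temp_num"].mean()
print(means)  # High averages 68.6, Low averages 55.0
```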

We could also flag the rows that happen at night, using the fact that night-time forecasts report a low temperature:

is_night = weather["temps"].str.contains("Low")

weather["is_night"] = is_night

print(is_night)

0    False
1     True
2    False
3     True
4    False
5     True
6    False
7     True
8    False
Name: temps, dtype: bool

print(weather)

          periods    short_descs  ... temp_num is_night
0           Today          Sunny  ...       66    False
1         Tonight   Mostly Clear  ...       54     True
2          Monday   Mostly Sunny  ...       70    False
3     MondayNight   Mostly Clear  ...       56     True
4         Tuesday          Sunny  ...       72    False
5    TuesdayNight  Mostly Cloudy  ...       55     True
6       Wednesday  Mostly Cloudy  ...       68    False
7  WednesdayNight  Partly Cloudy  ...       55     True
8        Thursday          Sunny  ...       67    False

[9 rows x 6 columns]
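
The boolean column only marks the night rows. To actually select them, the mask is passed inside square brackets. The sketch below uses a three-row stand-in frame based on the printout above; the temps strings are assumed, since that column is hidden in the printout.

```python
import pandas as pd

# Stand-in frame: periods and temp_num from the printout, temps assumed.
weather = pd.DataFrame({
    "periods": ["Today", "Tonight", "Monday"],
    "temps": ["High: 66 °F", "Low: 54 °F", "High: 70 °F"],
    "temp_num": [66, 54, 70],
})
is_night = weather["temps"].str.contains("Low")

nights = weather[is_night]    # rows where the mask is True
days = weather[~is_night]     # ~ inverts the mask

print(nights["periods"].tolist())  # ['Tonight']
print(days["periods"].tolist())    # ['Today', 'Monday']
```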

Steve on Apr 26, 2020 at 07:04 am