We'll extract data about the extended forecast from this page.
To see the HTML tags of the extended forecast area, inspect the page using your browser's developer tools.
We can then scroll up in the elements panel to find the "outermost" element that contains all of the text that corresponds to the extended forecasts.
In this case, it’s a div tag with the id seven-day-forecast.
If you click around in the console and explore the div, you'll discover that each forecast item (like "Tonight", "Thursday", and "Thursday Night") is contained in a div with the class tombstone-container, which is itself wrapped in an li with the class forecast-tombstone.
Now, we can follow the steps below to parse the page:
1. Download the web page containing the forecast.
2. Create a BeautifulSoup object to parse the page.
3. Find the div with id seven-day-forecast, and assign it to seven_day.
4. Inside seven_day, find each individual forecast item.
5. Print the first forecast item.

    import requests
    from bs4 import BeautifulSoup

    url = "https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168#.XqMU2sgzZPY"
    page = requests.get(url).content
    soup = BeautifulSoup(page, 'html.parser')
    seven_day = soup.find(id="seven-day-forecast")
    #print(list(seven_day.children))
    forecast_items = seven_day.find_all(class_="forecast-tombstone")
    today = forecast_items[0]
    print(today.prettify())

Output:

    <li class="forecast-tombstone">
     <div class="tombstone-container">
      <p class="period-name">
       Today
       <br/>
       <br/>
      </p>
      <p>
       <img alt="Today: Sunny, with a high near 75. West wind 5 to 13 mph. " class="forecast-icon" src="newimages/medium/few.png" title="Today: Sunny, with a high near 75. West wind 5 to 13 mph. "/>
      </p>
      <p class="short-desc">
       Sunny
      </p>
      <p class="temp temp-high">
       High: 75 °F
      </p>
     </div>
    </li>
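The download and the parsing can be kept apart, so that the parsing step can be checked without a network connection. The following is a minimal sketch rather than the article's exact code: the helper name first_forecast_item and the SAMPLE_HTML string are assumptions made for illustration, and the markup only imitates the structure shown above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup that imitates the structure of the forecast page.
SAMPLE_HTML = """
<div id="seven-day-forecast">
  <ul>
    <li class="forecast-tombstone">
      <div class="tombstone-container">
        <p class="period-name">Today<br/><br/></p>
        <p class="short-desc">Sunny</p>
        <p class="temp temp-high">High: 75 °F</p>
      </div>
    </li>
    <li class="forecast-tombstone">
      <div class="tombstone-container">
        <p class="period-name">Tonight<br/><br/></p>
        <p class="short-desc">Partly Cloudy</p>
        <p class="temp temp-low">Low: 56 °F</p>
      </div>
    </li>
  </ul>
</div>
"""


def first_forecast_item(html):
    """Return the first forecast item, or None if the page layout differs."""
    soup = BeautifulSoup(html, "html.parser")
    seven_day = soup.find(id="seven-day-forecast")
    if seven_day is None:
        # find() returns None when nothing matches, so guard before chaining.
        return None
    items = seven_day.find_all(class_="forecast-tombstone")
    return items[0] if items else None


today = first_forecast_item(SAMPLE_HTML)
print(today.find(class_="period-name").get_text())
```

With a live page, the same helper could be called as first_forecast_item(requests.get(url, timeout=10).content); the timeout value here is an arbitrary choice.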
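As you can see, the forecast item today contains all the information we want. There are 4 pieces of information we can extract:
1. The name of the forecast item, in this case Today.
2. The description of the conditions, stored in the title property of img.
3. A short description of the conditions, in this case Sunny.
4. The temperature high, in this case 75 degrees.

We'll extract the name of the forecast item, the short description, and the temperature first, since they're all similar: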
    period = today.find(class_="period-name").get_text()
    print(period)
    short_desc = today.find(class_="short-desc").get_text()
    print(short_desc)
    temp = today.find(class_="temp").get_text()
    print(temp)

Output:

    Today
    Sunny
    High: 75 °F
Now, we can extract the title attribute from the img tag. To do this, we treat the tag object returned by find like a dictionary, and pass in the attribute we want as a key:
    img = today.find("img")
    desc = img["title"]
    print(desc)

Output:

    Today: Sunny, with a high near 75. West wind 5 to 13 mph.
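The dictionary-style access can be sketched on its own. The two one-tag documents below are hypothetical and exist only to show what happens when an attribute is present and when it is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag documents used only to demonstrate attribute access.
img = BeautifulSoup(
    '<img class="forecast-icon" src="few.png" title="Today: Sunny"/>',
    "html.parser",
).find("img")
no_title = BeautifulSoup('<img src="few.png"/>', "html.parser").find("img")

# Key access returns the attribute value and raises KeyError if it is absent.
print(img["title"])

# .get() behaves like dict.get(): it returns None for a missing attribute.
print(no_title.get("title"))

# Multi-valued attributes such as class come back as a list of strings.
print(img["class"])
```

Using .get() is a reasonable choice when some images on a page might lack the attribute, since it avoids an exception in the middle of a loop.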
Now, using CSS selectors and list comprehensions, we can extract everything at once:
1. Select all items with the class period-name inside an item with the class tombstone-container in seven_day.
2. Use a list comprehension to call the get_text method on each BeautifulSoup object.

    period_tag = seven_day.select(".tombstone-container .period-name")
    periods = [pt.get_text() for pt in period_tag]
    print(periods)

Output:

    ['Today', 'Tonight', 'Saturday', 'SaturdayNight', 'Sunday', 'SundayNight', 'Monday', 'MondayNight', 'Tuesday']
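Names such as 'SaturdayNight' in the output run together, presumably because the two words are separated only by a br tag in the page markup. That is an assumption, since the article shows the markup for Today only. Under that assumption, passing a separator to get_text restores the space. The HTML string below is a hypothetical stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the two-word period name is split by a <br/> tag.
html = """
<div id="seven-day-forecast">
  <div class="tombstone-container"><p class="period-name">Today<br/><br/></p></div>
  <div class="tombstone-container"><p class="period-name">Saturday<br/>Night</p></div>
</div>
"""
seven_day = BeautifulSoup(html, "html.parser").find(id="seven-day-forecast")
period_tags = seven_day.select(".tombstone-container .period-name")

# Default get_text() concatenates the text pieces with nothing in between.
joined = [pt.get_text() for pt in period_tags]

# separator=" " puts a space between pieces; strip=True trims each piece.
spaced = [pt.get_text(separator=" ", strip=True) for pt in period_tags]

print(joined)
print(spaced)
```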
We can apply the same technique to get the other 3 fields:
    short_desc_tag = seven_day.select(".tombstone-container .short-desc")
    short_descs = [st.get_text() for st in short_desc_tag]
    print(short_descs)
    temp_tag = seven_day.select(".tombstone-container .temp")
    temps = [tt.get_text() for tt in temp_tag]
    print(temps)
    img_tag = seven_day.select(".tombstone-container img")
    descs = [it['title'] for it in img_tag]
    print(descs)

Output:

    ['Sunny', 'Partly Cloudy', 'Partly Sunny', 'Mostly Cloudy', 'Mostly Sunny', 'Mostly Clear', 'Sunny', 'Mostly Clear', 'Sunny']
    ['High: 75 °F', 'Low: 56 °F', 'High: 72 °F', 'Low: 54 °F', 'High: 68 °F', 'Low: 53 °F', 'High: 70 °F', 'Low: 56 °F', 'High: 75 °F']
    ['Today: Sunny, with a high near 75. West wind 5 to 13 mph. ', 'Tonight: Partly cloudy, with a low around 56. West wind 7 to 11 mph. ', 'Saturday: Partly sunny, with a high near 72. West wind 6 to 13 mph. ', 'Saturday Night: Mostly cloudy, with a low around 54. West wind 10 to 15 mph, with gusts as high as 18 mph. ', 'Sunday: Mostly sunny, with a high near 68. West wind 9 to 15 mph, with gusts as high as 20 mph. ', 'Sunday Night: Mostly clear, with a low around 53.', 'Monday: Sunny, with a high near 70.', 'Monday Night: Mostly clear, with a low around 56.', 'Tuesday: Sunny, with a high near 75.']
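Selecting each field across the whole container yields four independent lists, which only line up if every forecast item has every field. An alternative sketch is to loop over the items and build one record per item, so a missing field becomes None instead of shifting the list. The helper text_of and the sample markup, in which the second item deliberately lacks a temperature and an image, are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second item has no .temp element and no img.
html = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Today</p>
    <p><img title="Today: Sunny, with a high near 75."/></p>
    <p class="short-desc">Sunny</p>
    <p class="temp temp-high">High: 75 °F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p class="short-desc">Partly Cloudy</p>
  </div>
</div>
"""
seven_day = BeautifulSoup(html, "html.parser").find(id="seven-day-forecast")


def text_of(item, selector):
    """Return the stripped text of the first match, or None if absent."""
    tag = item.select_one(selector)
    return tag.get_text(separator=" ", strip=True) if tag is not None else None


records = []
for item in seven_day.select(".tombstone-container"):
    img = item.find("img")
    records.append({
        "period": text_of(item, ".period-name"),
        "short_desc": text_of(item, ".short-desc"),
        "temp": text_of(item, ".temp"),
        "desc": img.get("title") if img is not None else None,
    })

for record in records:
    print(record)
```

A list of dictionaries like records can also be passed directly to pd.DataFrame, with the dictionary keys becoming the columns.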
We can now combine the data into a Pandas DataFrame and analyze it. A DataFrame is an object that can store tabular data, making data analysis easy.
In order to do this, we’ll call the DataFrame class, and pass in each list of items that we have. We pass them in as part of a dictionary. Each dictionary key will become a column in the DataFrame, and each list will become the values in the column:
    import pandas as pd

    weather = pd.DataFrame({"periods": periods, "short_descs": short_descs, "temps": temps, "descs": descs})
    print(weather)

Output:

              periods  ...                                              descs
    0       Overnight  ...  Overnight: Partly cloudy, with a low around 54...
    1          Sunday  ...  Sunday: Sunny, with a high near 67. West wind ...
    2     SundayNight  ...  Sunday Night: Mostly clear, with a low around ...
    3          Monday  ...  Monday: Mostly sunny, with a high near 69. Wes...
    4     MondayNight  ...  Monday Night: Mostly clear, with a low around ...
    5         Tuesday  ...               Tuesday: Sunny, with a high near 73.
    6    TuesdayNight  ...  Tuesday Night: Partly cloudy, with a low aroun...
    7       Wednesday  ...      Wednesday: Partly sunny, with a high near 69.
    8  WednesdayNight  ...  Wednesday Night: Mostly cloudy, with a low aro...

    [9 rows x 4 columns]
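The printed DataFrame hides the middle columns behind "..." and cuts long descriptions short. One way to see everything is the to_string method, sketched here on a two-row frame. The rows reuse values from the lists printed earlier; the variable name full_text is an assumption for illustration, and the behaviour described is that of recent pandas versions.

```python
import pandas as pd

# Two sample rows, using values from the lists printed earlier.
weather = pd.DataFrame({
    "periods": ["Today", "Tonight"],
    "short_descs": ["Sunny", "Partly Cloudy"],
    "temps": ["High: 75 °F", "Low: 56 °F"],
    "descs": [
        "Today: Sunny, with a high near 75. West wind 5 to 13 mph. ",
        "Tonight: Partly cloudy, with a low around 56. West wind 7 to 11 mph. ",
    ],
})

# to_string() renders every column and the full cell contents.
full_text = weather.to_string()
print(full_text)

# shape gives (rows, columns) without printing the data.
print(weather.shape)
```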
We can now do some analysis on the data. For example, we can use a regular expression and the Series.str.extract method to pull out the numeric temperature values:
    temp = weather["temps"].str.extract('(?P<temp>[0-9]+)')
    weather["temp_num"] = temp.astype(int)
    print(temp)

Output:

      temp
    0   66
    1   54
    2   70
    3   56
    4   72
    5   55
    6   68
    7   55
    8   67
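The extraction step can be tried in isolation on a small Series. This is a minimal sketch with sample strings in the same "High: 75 °F" format; passing expand=False is a choice made here so that a single capture group comes back as a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Sample temperature strings in the same format as the scraped values.
temps = pd.Series(["High: 75 °F", "Low: 56 °F", "High: 72 °F"], name="temps")

# The named group (?P<temp>...) captures the first run of digits in each string.
# expand=False returns a Series because there is only one capture group.
temp_num = temps.str.extract(r"(?P<temp>[0-9]+)", expand=False).astype(int)

print(temp_num.tolist())
```

A string with no digits would produce a missing value, and astype(int) would then fail, so real data may need a check for missing values before the conversion.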
We could then find the mean of all the high and low temperatures:
    print(weather["temp_num"].mean())

Output:

    62.55555555555556
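The single mean mixes daytime highs with overnight lows. As a sketch of one possible refinement, the two groups can be averaged separately with a boolean mask. The temperature strings below are the ones from the temps list printed earlier in the article; the names is_low, mean_high and mean_low are assumptions for illustration.

```python
import pandas as pd

# Temperature strings taken from the temps list printed earlier.
weather = pd.DataFrame({"temps": [
    "High: 75 °F", "Low: 56 °F", "High: 72 °F", "Low: 54 °F", "High: 68 °F",
    "Low: 53 °F", "High: 70 °F", "Low: 56 °F", "High: 75 °F",
]})
weather["temp_num"] = weather["temps"].str.extract(r"([0-9]+)", expand=False).astype(int)

# True for rows whose text contains "Low", False for the highs.
is_low = weather["temps"].str.contains("Low")

# ~is_low inverts the mask, selecting the highs.
mean_high = weather.loc[~is_low, "temp_num"].mean()
mean_low = weather.loc[is_low, "temp_num"].mean()

print(mean_high, mean_low)
```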
We could also only select the rows that happen at night:
    is_night = weather["temps"].str.contains("Low")
    weather["is_night"] = is_night
    print(is_night)

Output:

    0    False
    1     True
    2    False
    3     True
    4    False
    5     True
    6    False
    7     True
    8    False
    Name: temps, dtype: bool

    print(weather)

Output:

              periods    short_descs  ...  temp_num  is_night
    0           Today          Sunny  ...        66     False
    1         Tonight   Mostly Clear  ...        54      True
    2          Monday   Mostly Sunny  ...        70     False
    3     MondayNight   Mostly Clear  ...        56      True
    4         Tuesday          Sunny  ...        72     False
    5    TuesdayNight  Mostly Cloudy  ...        55      True
    6       Wednesday  Mostly Cloudy  ...        68     False
    7  WednesdayNight  Partly Cloudy  ...        55      True
    8        Thursday          Sunny  ...        67     False

    [9 rows x 6 columns]
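The is_night column marks the night rows but does not remove the others. To keep only the night rows, the boolean column can be used as a mask. The following is a minimal sketch on a four-row frame; the temperature strings are sample data written for this example, and the variable name night is an assumption.

```python
import pandas as pd

# Sample rows for illustration only.
weather = pd.DataFrame({
    "periods": ["Today", "Tonight", "Monday", "MondayNight"],
    "temps": ["High: 66 °F", "Low: 54 °F", "High: 70 °F", "Low: 56 °F"],
})
weather["is_night"] = weather["temps"].str.contains("Low")

# Indexing with a boolean Series keeps only the rows where it is True.
night = weather[weather["is_night"]]

print(night["periods"].tolist())
```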
This article is contributed by Steve.