Issue
I'm building a web scraper for testing/education purposes and I'm running into the following issue:
When I run the code the terminal displays the text, dates and links as expected, but when I export the info to a csv file the last column where I'm supposed to see the adte and time of the article when it was published I get the html
and tags.
I believe it's something either with the formatting or I'm not hitting the right method to remove the p and span tags (this is tricky I've tried a few methods and suggestions already discussed here). It works with 'get_text()', but doesn't export it correctly to csv.
My code:
import requests
from bs4 import BeautifulSoup
import sys
import pandas as pd
url = 'https://novini.bg'
request = requests.get('https://novini.bg')
soup = BeautifulSoup(request.text, 'html.parser')
articles = soup.find_all('article', class_= 'g-grid__item js-content')
art = []
for article in articles:
article_link = article.a.get('href')
article_title = article.find('h2', {'class', 'g-grid__item-title'}).text
article_date = article.find('p', class_ = 'g-grid__item--time')
print(article_title, end='\n')
print(article_link, end='\n'*2)
if article_date == None:
print('')
else:
print(article_date.text)
art.append({
'Headline': article_title,
'Link': article_link,
'Date Published': article_date
})
df = pd.DataFrame(art)
df.to_csv('News_Bulgaria.csv', index=False)
Screenshot below: p and span tags appear after export to csv
Any suggestions would be appreciated. Thanks!
Solution
Because you aren't invoking .text
property in date column
import requests
from bs4 import BeautifulSoup
import sys
import pandas as pd
url = 'https://novini.bg'
request = requests.get('https://novini.bg')
soup = BeautifulSoup(request.text, 'html.parser')
articles = soup.find_all('article', class_= 'g-grid__item js-content')
art = []
for article in articles:
article_link = article.a.get('href')
article_title = article.find('h2', {'class', 'g-grid__item-title'}).text
article_date = article.find('p', class_ = 'g-grid__item--time')
article_date = article_date.text if article_date else None
#print(article_title, end='\n')
#print(article_link, end='\n'*2)
# if article_date == None:
# print('')
# else:
# print(article_date.text)
art.append({
'Headline': article_title,
'Link': article_link,
'Date Published': article_date
})
#print(art)
df = pd.DataFrame(art)
df.to_csv('News_Bulgaria.csv', index=False)
Answered By - F.Hoque
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.