Saturday, September 17, 2022

[FIXED] When exporting to a csv file I get tags but in my terminal when I run the code I don't see them

September 17, 2022 beautifulsoup, python No comments

Issue

I'm building a web scraper for testing/education purposes and I'm running into the following issue:

When I run the code the terminal displays the text, dates and links as expected, but when I export the info to a csv file the last column where I'm supposed to see the adte and time of the article when it was published I get the html

and tags.

I believe it's something either with the formatting or I'm not hitting the right method to remove the p and span tags (this is tricky I've tried a few methods and suggestions already discussed here). It works with 'get_text()', but doesn't export it correctly to csv.

My code:

import requests 
from bs4 import BeautifulSoup
import sys
import pandas as pd



url = 'https://novini.bg'
request = requests.get('https://novini.bg')
soup = BeautifulSoup(request.text, 'html.parser')
articles = soup.find_all('article', class_= 'g-grid__item js-content')
art = []
for article in articles:
    article_link = article.a.get('href')
    article_title = article.find('h2', {'class', 'g-grid__item-title'}).text
    article_date = article.find('p', class_ = 'g-grid__item--time')
    
    print(article_title, end='\n')
    print(article_link, end='\n'*2)
    if article_date == None:
        print('')
    else:
        print(article_date.text)
    

    art.append({
        'Headline': article_title,
        'Link': article_link,
        'Date Published': article_date
    })

df = pd.DataFrame(art)
df.to_csv('News_Bulgaria.csv', index=False)

Screenshot below: p and span tags appear after export to csv

Any suggestions would be appreciated. Thanks!

Solution

Because you aren't invoking .text property in date column

import requests 
from bs4 import BeautifulSoup
import sys
import pandas as pd

url = 'https://novini.bg'
request = requests.get('https://novini.bg')
soup = BeautifulSoup(request.text, 'html.parser')
articles = soup.find_all('article', class_= 'g-grid__item js-content')
art = []
for article in articles:
    article_link = article.a.get('href')
    article_title = article.find('h2', {'class', 'g-grid__item-title'}).text
    article_date = article.find('p', class_ = 'g-grid__item--time')
    article_date = article_date.text if article_date else None
    
    #print(article_title, end='\n')
    #print(article_link, end='\n'*2)
    # if article_date == None:
    #     print('')
    # else:
    #     print(article_date.text)
    

    art.append({
        'Headline': article_title,
        'Link': article_link,
        'Date Published': article_date
    })

#print(art)

df = pd.DataFrame(art)
df.to_csv('News_Bulgaria.csv', index=False)

Answered By - F.Hoque

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Saturday, September 17, 2022

[FIXED] When exporting to a csv file I get tags but in my terminal when I run the code I don't see them

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels