Thursday, October 20, 2022

[FIXED] Beautiful Soup 4 returns one list per link (['link1'] ['link2'] ['link3']). How to get a single list instead (['link1', 'link2', 'link3'])?

beautifulsoup, python, python-3.x, web-scraping

Issue

Beautiful Soup 4 gives me output like ['link1']['link2']['link3']. How can I change it to the required format, ['link1', 'link2', 'link3']?

I am getting the output below.

['link1']
['link2']
['link3']

I need the output in the format below so that I can build a data frame from it. What do I need to change?

['link1', 'link2', 'link3']

An explanation with code is fine too. Please help me solve this issue; thanks in advance.

My code

import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:96.0) Gecko/20100101 Firefox/96.0'}
HOST = 'https://www.zocdoc.com'
#PAGE = 'gastroenterologists/2'
web_page = 'https://www.zocdoc.com/search?address=Houston%2C%20TX&insurance_carrier=&city=Houston&date_searched_for=&day_filter=AnyDay&filters=%7B%7D&gender=-1&language=-1&latitude=29.7604267&locationType=placemark&longitude=-95.3698028&offset=1&insurance_plan=-1&reason_visit=386&search_query=Gastroenterologist&searchType=specialty&sees_children=false&after_5pm=false&before_10am=false&sort_type=Default&dr_specialty=106&state=TX&visitType=inPersonVisit&&timesgridType='
with requests.Session() as session:
    (r := session.get(HOST, headers=headers)).raise_for_status()
    #(r := session.get(f'{HOST}/{PAGE}', headers=headers)).raise_for_status()
    (r := session.get(f'{web_page}', headers=headers)).raise_for_status()
    # process content from here
print(r.text)
# Parse the response body (the 'lxml' parser requires the lxml package)
soup = BeautifulSoup(r.text, 'lxml')
print(soup.prettify())

Code 1, which prints each link on its own line:

for item in soup.find_all('img'):
    images = []
    items = (item['src'])
    images = 'https:'+items
    print(images)

Code 2, which produces the one-list-per-link output shown above:

for item in soup.find_all('img'):
    c = []
    items = (item['src'])
    image = ('https:'+items)
    c.append(image)
    print(c)

Output: ['link1'] ... ['linkn'], one list per line

Solution

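Create the list outside the loop and append each URL to it. Creating it inside the loop resets it on every iteration, which is why each printed list contained a single link. With the list outside the loop, you get the structure you expect: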

images = []
for item in soup.find_all('img'):
    images.append('https:'+item['src'])
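To see why the position of the list matters, here is a minimal sketch that needs no network access. The `srcs` values are made-up stand-ins for the `src` attributes that `soup.find_all('img')` would yield; they are not taken from the real page.

```python
# Hypothetical protocol-relative paths standing in for item['src'] values.
srcs = ['//example.com/a.png', '//example.com/b.png', '//example.com/c.png']

# List created inside the loop: it is reset on every iteration,
# so it never holds more than one link.
for src in srcs:
    inside = []
    inside.append('https:' + src)

# List created once, before the loop: every iteration adds to the same list.
outside = []
for src in srcs:
    outside.append('https:' + src)

print(inside)   # only the last link survives
print(outside)  # all three links in one list
```

Printing inside the loop, as in the question, shows each one-element list in turn; printing after the loop shows the accumulated result.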

Alternatively, you can use a list comprehension:

images = ['https:'+item['src'] for item in soup.find_all('img')]
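Prefixing `'https:'` assumes every `src` is protocol-relative (starts with `//`) and that every `<img>` has a `src` at all. If that may not hold, one possible variation is to filter with `src=True` and resolve each value with `urllib.parse.urljoin`. The markup and the `BASE` URL below are invented for illustration, and the sketch uses the built-in `'html.parser'` so that it runs without lxml.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up markup for illustration; a real page would come from r.text.
html = """
<img src="//cdn.example.com/a.png">
<img src="/static/b.png">
<img alt="no source here">
<img src="https://cdn.example.com/c.png">
"""
BASE = 'https://www.example.com'  # assumed base URL of the page

soup = BeautifulSoup(html, 'html.parser')

# src=True keeps only <img> tags that actually have a src attribute;
# urljoin resolves protocol-relative and site-relative paths against BASE
# and leaves absolute URLs unchanged.
images = [urljoin(BASE, img['src']) for img in soup.find_all('img', src=True)]
print(images)
```

This avoids a `KeyError` on images without `src` and avoids producing strings like `https:https://...` when a URL is already absolute.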

Just a hint: avoid storing scraped information in a bunch of separate lists; use a more structured container, such as a list of dicts:

data = []
for item in soup.find_all('article'):
    data.append({
        'name':item.find('span',{'itemprop':'name'}).text,
        'image':'https:'+item.img['src'],
        'anyOtherInfo':'anyOtherInfo'
    })
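A caveat with the snippet above: `item.find(...)` returns `None` when nothing matches, so `.text` would raise an `AttributeError` on an article without a name span. The following is a rough sketch of a more defensive version. The HTML is invented for illustration (the real page structure is not shown here), and writing to an in-memory CSV is just one way to check that the rows are well formed.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up markup; the second article deliberately has no image.
html = """
<article><span itemprop="name">Dr. A</span><img src="//cdn.example.com/a.png"></article>
<article><span itemprop="name">Dr. B</span></article>
"""
soup = BeautifulSoup(html, 'html.parser')

data = []
for item in soup.find_all('article'):
    name = item.find('span', {'itemprop': 'name'})
    img = item.find('img', src=True)
    data.append({
        # Fall back to None instead of failing when an element is missing.
        'name': name.get_text(strip=True) if name else None,
        'image': 'https:' + img['src'] if img else None,
    })

# One dict per row maps directly onto csv.DictWriter.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['name', 'image'])
writer.writeheader()
writer.writerows(data)
print(buffer.getvalue())
```

Since the goal in the question is a data frame, the same list of dicts can also be passed to `pandas.DataFrame(data)` if pandas is installed.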

Answered By - HedgeHog

This answer was collected from Stack Overflow and tested by the PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
