Issue
I'm trying to scrape this website with Selenium.
What I need: to enter every child page, extract a lot of information, and do this for every company that is shown.
The code:
import time

import pandas as pd
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("window-size=1400,600")

ua = UserAgent()
user_agent = ua.random
print(user_agent)
options.add_argument(f'user-agent={user_agent}')

driver = webdriver.Chrome('chromedriver')
driver.get("https://startupbase.com.br/home/startups?q=&states=all&cities=all&segments=Construção%20Civil~Imobiliário&targets=all&phases=all&models=all&badges=all")
time.sleep(3)
cookies_button = driver.find_element_by_xpath("//button[contains(text(), 'Accept')]")
cookies_button.click()
time.sleep(3)

# Lists that we will append the scraped fields to
founder_name = []
name_company = []
site_url = []
local = []
mercado = []
publico_alvo = []
modelo_receita = []
momento = []
sobre = []
fundacao = []
tamanho_time = []
linkedin_company = []
linkedin_founder = []
atualizacao = []

while True:
    time.sleep(2)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.prettify())
    containers = soup.find_all("div", {"class": "search-body__item"})
    for container in containers:
        internal_page = container.find('a', href=True)
The code is still at an early stage because I'm trying to enter the child pages and I can't manage that.
I've already tried:
internal_page = driver.find_element_by_xpath("/html/body/app-root/ng-component/app-layout/div/div/div/div/div/app-layout-column/ng-component/div/ais-instantsearch/div/div/div/div[2]/section/ais-infinite-hits/div/div[2]/a")
internal_page.click()
Could someone shed some light on this, please?
Solution
You can use a different approach rather than simulating clicks on all the buttons.
If you check the link of each startup, it is https://startupbase.com.br/c/startup/
followed by the startup's name with spaces replaced by dashes.
So you can use a base url:
base_url = 'https://startupbase.com.br/c/startup/{}'
You can get the title of every startup using the following CSS selector: .org__title.sb-size-6
titles = ['-'.join(title.text.split()) for title in driver.find_elements_by_css_selector('.org__title.sb-size-6')]
After that you can iterate through all the titles and append each name to the base url, separated by dashes instead of spaces:
for title in titles:
    url = base_url.format(title)
Then run whatever request and parsing code you want against the url variable.
Code:
base_url = 'https://startupbase.com.br/c/startup/{}'
titles = ['-'.join(title.text.split()) for title in driver.find_elements_by_css_selector('.org__title.sb-size-6')]
for title in titles:
    url = base_url.format(title)
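To make the title-to-URL mapping concrete, here is a minimal, self-contained sketch of the slug-building step. The sample titles below are hypothetical stand-ins for the text that driver.find_elements_by_css_selector('.org__title.sb-size-6') would return on the live page; everything else follows the answer above.

```python
base_url = 'https://startupbase.com.br/c/startup/{}'

def build_slug(title):
    # Replace runs of whitespace in the title with single dashes,
    # matching the URL pattern the site uses for its startup pages.
    return '-'.join(title.split())

# Hypothetical titles standing in for the scraped .org__title elements
sample_titles = ['Construtech Exemplo', 'Imobi Startup Demo']

urls = [base_url.format(build_slug(t)) for t in sample_titles]
for url in urls:
    print(url)
# prints:
# https://startupbase.com.br/c/startup/Construtech-Exemplo
# https://startupbase.com.br/c/startup/Imobi-Startup-Demo
```

Each built url can then be opened with driver.get(url) (or fetched with a plain HTTP client) and parsed with BeautifulSoup to fill the lists declared in the question.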
Answered By - Jad Shaker