Issue
I'm trying to scrape this website with Selenium.
What I need: to enter every child page, extract a lot of information, and do this for every company that is shown.
The code:
import time

import pandas as pd
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("window-size=1400,600")

ua = UserAgent()
user_agent = ua.random
print(user_agent)
options.add_argument(f'user-agent={user_agent}')

driver = webdriver.Chrome('chromedriver')
driver.get("https://startupbase.com.br/home/startups?q=&states=all&cities=all&segments=Construção%20Civil~Imobiliário&targets=all&phases=all&models=all&badges=all")
time.sleep(3)
cookies_button = driver.find_element_by_xpath("//button[contains(text(), 'Accept')]")
cookies_button.click()
time.sleep(3)

# Lists that we will append the scraped fields to
founder_name = []
name_company = []
site_url = []
local = []
mercado = []
publico_alvo = []
modelo_receita = []
momento = []
sobre = []
fundacao = []
tamanho_time = []
linkedin_company = []
linkedin_founder = []
atualizacao = []

while True:
    time.sleep(2)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.prettify())
    containers = soup.find_all("div", {"class": "search-body__item"})
    for container in containers:
        internal_page = container.find('a', href=True)
The code is still at an early stage because I'm trying to enter the child pages and I can't manage that.
I've already tried:
internal_page = driver.find_element_by_xpath("/html/body/app-root/ng-component/app-layout/div/div/div/div/div/app-layout-column/ng-component/div/ais-instantsearch/div/div/div/div[2]/section/ais-infinite-hits/div/div[2]/a")
internal_page.click()
Could someone shed some light on this, please?
Solution
You can use a different approach rather than simulating clicks on all the buttons.
If you check the link of each startup, it is https://startupbase.com.br/c/startup/
followed by the startup's name with spaces replaced by dashes.
So you can use a base url:
base_url = 'https://startupbase.com.br/c/startup/{}'
You can get the title of every startup using the following CSS selector: .org__title.sb-size-6
titles = ['-'.join(title.text.split()) for title in driver.find_elements_by_css_selector('.org__title.sb-size-6')]
After that you can iterate through all the titles and append each name to the base url, separated by dashes instead of spaces:
for title in titles:
    url = base_url.format(title)
Then run whatever request and parsing code you want against the url variable.
Code:
base_url = 'https://startupbase.com.br/c/startup/{}'
titles = ['-'.join(title.text.split()) for title in driver.find_elements_by_css_selector('.org__title.sb-size-6')]
for title in titles:
    url = base_url.format(title)
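To make the title-to-URL mapping concrete, here is a minimal, self-contained sketch of the slug-building step. The sample titles below are hypothetical stand-ins for the text that driver.find_elements_by_css_selector('.org__title.sb-size-6') would return on the live page; everything else follows the answer above.

```python
base_url = 'https://startupbase.com.br/c/startup/{}'

def build_slug(title):
    # Replace runs of whitespace in the title with single dashes,
    # matching the URL pattern the site uses for its startup pages.
    return '-'.join(title.split())

# Hypothetical titles standing in for the scraped .org__title elements
sample_titles = ['Construtech Exemplo', 'Imobi Startup Demo']

urls = [base_url.format(build_slug(t)) for t in sample_titles]
for url in urls:
    print(url)
# prints:
# https://startupbase.com.br/c/startup/Construtech-Exemplo
# https://startupbase.com.br/c/startup/Imobi-Startup-Demo
```

Each built url can then be opened with driver.get(url) (or fetched with a plain HTTP client) and parsed with BeautifulSoup to fill the lists declared in the question.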
Answered By - Jad Shaker