Issue
Good day, fellow coders.
I'm having trouble getting this to run in Google Colab. I did some checks and tried using the cloudscraper module.
For example:
import cloudscraper
import pandas as pd
from bs4 import BeautifulSoup
from tabulate import tabulate
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS Xyz) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/113.0.0.0 Safari/537.36 Edg/xyz",
}
scraper = cloudscraper.create_scraper()
response = scraper.get("https://clutch.co/il/it-services", headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
company_names = soup.select(".directory-list div.provider-info--header .company_info a")
locations = soup.select(".locality")
company_names_list = [name.get_text(strip=True) for name in company_names]
locations_list = [location.get_text(strip=True) for location in locations]
data = {"Company Name": company_names_list, "Location": locations_list}
df = pd.DataFrame(data)
df.index += 1
print(tabulate(df, headers="keys", tablefmt="psql"))
df.to_csv("it_services_data.csv", index=False)
What I get back is this:
+----------------+------------+
| Company Name | Location |
|----------------+------------|
+----------------+------------+
Solution
The website you're trying to scrape probably has some form of anti-bot protection from Cloudflare or a similar service, which is why your scraper isn't extracting anything. You need to use Selenium with a headless browser such as headless Chrome. Selenium automates a real browser, which can navigate Cloudflare's anti-bot pages just like a human user.
Here's how you can use Selenium to imitate a real human browser interaction:
import pandas as pd
from bs4 import BeautifulSoup
from tabulate import tabulate
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome in headless mode (no visible browser window).
# `options.headless = True` is deprecated in recent Selenium releases,
# so the --headless=new argument is used instead.
options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

url = "https://clutch.co/il/it-services"
driver.get(url)

# Grab the fully rendered HTML and hand it to BeautifulSoup.
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")

# Your scraping logic goes here
company_names = soup.select(".directory-list div.provider-info--header .company_info a")
locations = soup.select(".locality")

company_names_list = [name.get_text(strip=True) for name in company_names]
locations_list = [location.get_text(strip=True) for location in locations]

data = {"Company Name": company_names_list, "Location": locations_list}
df = pd.DataFrame(data)
df.index += 1

print(tabulate(df, headers="keys", tablefmt="psql"))
df.to_csv("it_services_data.csv", index=False)

driver.quit()
Output:
+----+-----------------------------------------------------+--------------------------------+
| | Company Name | Location |
|----+-----------------------------------------------------+--------------------------------|
| 1 | Artelogic | L'viv, Ukraine |
| 2 | Iron Forge Development | Palm Beach Gardens, FL |
| 3 | Lionwood.software | L'viv, Ukraine |
| 4 | Greelow | Tel Aviv-Yafo, Israel |
| 5 | Ester Digital | Tel Aviv-Yafo, Israel |
| 6 | Nextly | Vitória, Brazil |
| 7 | Rootstack | Austin, TX |
| 8 | Opinov8 Technology Services | London, United Kingdom |
| 9 | Scalo | Tel Aviv-Yafo, Israel |
| 10 | TLVTech | Herzliya, Israel |
| 11 | Dofinity | Bnei Brak, Israel |
| 12 | PURPLE | Petah Tikva, Israel |
| 13 | Insitu S2 Tikshuv LTD | Haifa, Israel |
| 14 | Sogo Services | Tel Aviv-Yafo, Israel |
| 15 | Naviteq LTD | Tel Aviv-Yafo, Israel |
| 16 | BMT - Business Marketing Tools | Ra'anana, Israel |
| 17 | Profisea | Hod Hasharon, Israel |
| 18 | MeteorOps | Tel Aviv-Yafo, Israel |
| 19 | Trivium Solutions | Herzliya, Israel |
| 20 | Dynomind.tech | Jerusalem, Israel |
| 21 | Madeira Data Solutions | Kefar Sava, Israel |
| 22 | Titanium Blockchain | Tel Aviv-Yafo, Israel |
| 23 | Octopus Computer Solutions | Tel Aviv-Yafo, Israel |
| 24 | Reblaze | Tel Aviv-Yafo, Israel |
| 25 | ELPC Networks Ltd | Rosh Haayin, Israel |
| 26 | Taldor | Holon, Israel |
| 27 | Clarity | Petah Tikva, Israel |
| 28 | Opsfleet | Kfar Bin Nun, Israel |
| 29 | Hozek Technologies Ltd. | Petah Tikva, Israel |
| 30 | ERG Solutions | Ramat Gan, Israel |
| 31 | Komodo Consulting | Ra'anana, Israel |
| 32 | SCADAfence | Ramat Gan, Israel |
| 33 | Ness Technologies | נס טכנולוגיות | Tel Aviv-Yafo, Israel |
| 34 | Bynet Data Communications Bynet Data Communications | Tel Aviv-Yafo, Israel |
| 35 | Radware | Tel Aviv-Yafo, Israel |
| 36 | BigData Boutique | Rishon LeTsiyon, Israel |
| 37 | NetNUt | Tel Aviv-Yafo, Israel |
| 38 | Asperii | Petah Tikva, Israel |
| 39 | PractiProject | Ramat Gan, Israel |
| 40 | K8Support | Bnei Brak, Israel |
| 41 | Odix | Rosh Haayin, Israel |
| 42 | Panaya | Hod Hasharon, Israel |
| 43 | MazeBolt Technologies | Giv'atayim, Israel |
| 44 | Porat | Tel Aviv-Jaffa, Israel |
| 45 | MindU | Tel Aviv-Yafo, Israel |
| 46 | Valinor Ltd. | Petah Tikva, Israel |
| 47 | entrypoint | Modi'in-Maccabim-Re'ut, Israel |
| 48 | Adelante | Tel Aviv-Yafo, Israel |
| 49 | Code n' Roll | Haifa, Israel |
| 50 | Linnovate | Bnei Brak, Israel |
| 51 | Viceman Agency | Tel Aviv-Jaffa, Israel |
| 52 | develeap | Tel Aviv-Yafo, Israel |
| 53 | Chalir.com | Binyamina-Giv'at Ada, Israel |
| 54 | WolfCode | Rishon LeTsiyon, Israel |
| 55 | Penguin Strategies | Ra'anana, Israel |
| 56 | ANG Solutions | Tel Aviv-Yafo, Israel |
+----+-----------------------------------------------------+--------------------------------+
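Since the question mentions Google Colab: Selenium also needs a Chromium binary and a matching driver installed in the Colab runtime before the snippet above will run there. Below is a minimal setup sketch, assuming an Ubuntu-based Colab image where the chromium-chromedriver apt package is available; package names and required flags may vary between Colab releases.
# Run once in a Colab cell (assumes an Ubuntu-based image; adjust if the package is unavailable).
!pip install -q selenium
!apt-get update -qq
!apt-get install -y -qq chromium-chromedriver

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")            # Colab has no display
options.add_argument("--no-sandbox")              # Chrome won't start as root without this
options.add_argument("--disable-dev-shm-usage")   # work around the small /dev/shm in containers

driver = webdriver.Chrome(options=options)
driver.get("https://clutch.co/il/it-services")
print(driver.title)  # quick sanity check that the page actually loaded
driver.quit()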
Answered By - Musabbir Arrafi