Issue
I'm trying to gather data from the page https://clutch.co/il/it-services.
That said, I think there are probably several options to do that:
a. using bs4 and requests
b. using pandas
This first approach uses a.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://clutch.co/il/it-services"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
company_names = soup.find_all("h3", class_="company-name")
locations = soup.find_all("span", class_="locality")
company_names_list = [name.get_text(strip=True) for name in company_names]
locations_list = [location.get_text(strip=True) for location in locations]
data = {"Company Name": company_names_list, "Location": locations_list}
df = pd.DataFrame(data)
df.to_csv("it_services_data.csv", index=False)
This code should a. scrape the company names and locations from the specified webpage, b. store them in a Pandas DataFrame, and c. save the data to a CSV file named it_services_data.csv
in the current working directory.
But I ended up with an empty result file. In fact, the file is completely empty.
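A likely explanation for the empty file: when no elements match, find_all silently returns an empty list rather than raising an error, so the CSV gets written with headers only. A minimal sketch of this failure mode (the HTML snippet and class names below are made up for demonstration, not copied from the live page):

```python
from bs4 import BeautifulSoup

# Stand-in HTML for a page whose markup doesn't use the guessed class names
html = """
<div class="provider">
  <a class="company_title" href="/profile/acme">Acme Ltd</a>
  <span class="locality">Tel Aviv-Yafo, Israel</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# The guessed selector matches nothing -> empty list, no error raised
names = soup.find_all("h3", class_="company-name")
print(len(names))  # 0

# A selector that matches the actual markup
names = soup.find_all("a", class_="company_title")
print([n.get_text(strip=True) for n in names])  # ['Acme Ltd']
```

Printing the lengths of the scraped lists before building the DataFrame makes this failure mode visible immediately.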
What I did was the following:
1. Install the required packages:
pip install beautifulsoup4 requests pandas
2. Import the necessary libraries:
import requests
from bs4 import BeautifulSoup
import pandas as pd
3. Send a GET request to the webpage and retrieve the HTML content:
url = "https://clutch.co/il/it-services"
response = requests.get(url)
4. Create a BeautifulSoup object to parse the HTML content:
soup = BeautifulSoup(response.content, "html.parser")
5. Identify the HTML elements containing the data we want to scrape. Inspect the webpage's source code to find the relevant tags and attributes. For example, let's assume we want to extract the company names and their respective locations. In this case, the company names are contained in h3 tags with the class "company-name" and the locations in span tags with the class "locality":
company_names = soup.find_all("h3", class_="company-name")
locations = soup.find_all("span", class_="locality")
6. Extract the data from the HTML elements and store it in lists:
company_names_list = [name.get_text(strip=True) for name in company_names]
locations_list = [location.get_text(strip=True) for location in locations]
7. Create a Pandas DataFrame to organize the extracted data:
data = {"Company Name": company_names_list, "Location": locations_list}
df = pd.DataFrame(data)
8. Optionally, perform further data processing or analysis with the Pandas DataFrame, or export the data to a file. For example, to save the data to a CSV file:
df.to_csv("it_services_data.csv", index=False)
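One thing to watch for in step 7: if the two lists end up with different lengths (for example, when some listings lack a location), pd.DataFrame(data) raises a ValueError. A defensive sketch using zip_longest, with made-up sample data:

```python
from itertools import zip_longest

import pandas as pd

# Hypothetical scrape results of unequal length
company_names_list = ["Acme Ltd", "Beta Corp", "Gamma Inc"]
locations_list = ["Tel Aviv-Yafo, Israel", "Haifa, Israel"]  # one location missing

# zip_longest pads the shorter list with None instead of raising
rows = zip_longest(company_names_list, locations_list)
df = pd.DataFrame(rows, columns=["Company Name", "Location"])
print(df.shape)  # (3, 2)
```

Building rows pairwise like this keeps the DataFrame construction from failing on ragged input, at the cost of possibly misaligned pairs; matching name and location within each listing's container is the more robust fix.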
That's it! That was all I did. I thought that with this approach I would be able to scrape the company names and their locations from the specified webpage using Python with the Beautiful Soup, Requests, and Pandas packages.
Well, I also need the URL of each company, and if I could gather even a bit more data, that would be great.
Update: many thanks to baduker, awesome. I tried it out in Colab; after installing the cloudscraper package I ran the code and got back the following:
ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.
cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 Captcha challenge, This feature is not available in the opensource (free) version.
During handling of the above exception, another exception occurred:
AttributeError: 'CloudflareChallengeError' object has no attribute '_render_traceback_'
During handling of the above exception, another exception occurred:
TypeError: object of type 'NoneType' has no len()
During handling of the above exception, another exception occurred:
AttributeError: 'TypeError' object has no attribute '_render_traceback_'
During handling of the above exception, another exception occurred:
AssertionError
(the same CloudflareChallengeError chain repeats twice more)
Solution
The site returns an error that says you need JavaScript enabled. In other words, plain requests might not be enough.
However, you could try using the cloudscraper module.
For example:
import cloudscraper
import pandas as pd
from bs4 import BeautifulSoup
from tabulate import tabulate

# Spoof a regular browser; some sites block the default client user agent
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.57",
}

# cloudscraper handles the basic Cloudflare JavaScript challenge
scraper = cloudscraper.create_scraper()
response = scraper.get("https://clutch.co/il/it-services", headers=headers)
soup = BeautifulSoup(response.content, "html.parser")

# CSS selectors for the provider cards on the directory page
company_names = soup.select(".directory-list div.provider-info--header .company_info a")
locations = soup.select(".locality")

company_names_list = [name.get_text(strip=True) for name in company_names]
locations_list = [location.get_text(strip=True) for location in locations]

data = {"Company Name": company_names_list, "Location": locations_list}
df = pd.DataFrame(data)
df.index += 1  # 1-based row numbers in the printed table

print(tabulate(df, headers="keys", tablefmt="psql"))
df.to_csv("it_services_data.csv", index=False)
Output:
+----+-----------------------------------------------------+--------------------------------+
| | Company Name | Location |
|----+-----------------------------------------------------+--------------------------------|
| 1 | Brainhub | Gliwice, Poland |
| 2 | Vates | Atlanta, GA |
| 3 | UVIK Software | Tallinn, Estonia |
| 4 | TLVTech | Ramat Gan, Israel |
| 5 | Broscorp | Beersheba, Israel |
| 6 | Exoft | Vienna, VA |
| 7 | EchoGlobal | Tallinn, Estonia |
| 8 | Codup | Karachi, Pakistan |
| 9 | Dofinity | Bnei Brak, Israel |
| 10 | Insitu S2 Tikshuv LTD | Haifa, Israel |
| 11 | Sogo Services | Tel Aviv-Yafo, Israel |
| 12 | Naviteq LTD | Tel Aviv-Yafo, Israel |
| 13 | BMT - Business Marketing Tools | Ra'anana, Israel |
| 14 | Accedia | Sofia, Bulgaria |
| 15 | Profisea | Hod Hasharon, Israel |
| 16 | Trivium Solutions | Herzliya, Israel |
| 17 | Dynomind.tech | Jerusalem, Israel |
| 18 | Madeira Data Solutions | Kefar Sava, Israel |
| 19 | Titanium Blockchain | Tel Aviv-Yafo, Israel |
| 20 | Octopus Computer Solutions | Tel Aviv-Yafo, Israel |
| 21 | Reblaze | Tel Aviv-Yafo, Israel |
| 22 | ELPC Networks Ltd | Rosh Haayin, Israel |
| 23 | Taldor | Holon, Israel |
| 24 | Opsfleet | Kfar Bin Nun, Israel |
| 25 | Clarity | Petah Tikva, Israel |
| 26 | Hozek Technologies Ltd. | Petah Tikva, Israel |
| 27 | ERG Solutions | Ramat Gan, Israel |
| 28 | SCADAfence | Ramat Gan, Israel |
| 29 | Ness Technologies | נס טכנולוגיות | Tel Aviv-Yafo, Israel |
| 30 | Bynet Data Communications Bynet Data Communications | Tel Aviv-Yafo, Israel |
| 31 | Radware | Tel Aviv-Yafo, Israel |
| 32 | BigData Boutique | Rishon LeTsiyon, Israel |
| 33 | NetNUt | Tel Aviv-Yafo, Israel |
| 34 | Asperii | Petah Tikva, Israel |
| 35 | PractiProject | Ramat Gan, Israel |
| 36 | K8Support | Bnei Brak, Israel |
| 37 | Odix | Rosh Haayin, Israel |
| 38 | Adaptiq | Tel Aviv-Yafo, Israel |
| 39 | Israel IT | Tel Aviv-Yafo, Israel |
| 40 | Panaya | Hod Hasharon, Israel |
| 41 | MazeBolt Technologies | Giv'atayim, Israel |
| 42 | ActiveFence | Binyamina-Giv'at Ada, Israel |
| 43 | Komodo Consulting | Ra'anana, Israel |
| 44 | MindU | Tel Aviv-Yafo, Israel |
| 45 | Valinor Ltd. | Petah Tikva, Israel |
| 46 | entrypoint | Modi'in-Maccabim-Re'ut, Israel |
| 47 | Code n' Roll | Haifa, Israel |
| 48 | Linnovate | Bnei Brak, Israel |
| 49 | Adelante | Tel Aviv-Yafo, Israel |
| 50 | develeap | Tel Aviv-Yafo, Israel |
| 51 | Chalir.com | Binyamina-Giv'at Ada, Israel |
| 52 | Trinity Agency | Tel Aviv-Yafo, Israel |
| 53 | MeteorOps | Tel Aviv-Yafo, Israel |
| 54 | Penguin Strategies | Ra'anana, Israel |
| 55 | ANG Solutions | Tel Aviv-Yafo, Israel |
| 56 | Sanapix - Web & Media Services | Umm al-Fahm, Israel |
| 57 | Pen and Chip Consulting | Netanya, Israel |
+----+-----------------------------------------------------+--------------------------------+
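Since the question also asks for each company's URL: the name link in each provider card carries an href to the Clutch profile page, which can be joined against the site root. The sketch below iterates per card so that name, location, and URL stay aligned; the sample HTML is a made-up stand-in mimicking the selectors used above, not a verified copy of the live markup:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://clutch.co"

# Stand-in HTML mimicking the provider-card structure targeted above
html = """
<ul class="directory-list">
  <li><div class="provider-info--header"><div class="company_info">
    <a href="/profile/tlvtech">TLVTech</a></div></div>
    <span class="locality">Ramat Gan, Israel</span></li>
  <li><div class="provider-info--header"><div class="company_info">
    <a href="/profile/dofinity">Dofinity</a></div></div>
    <span class="locality">Bnei Brak, Israel</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for card in soup.select(".directory-list li"):
    link = card.select_one(".provider-info--header .company_info a")
    loc = card.select_one(".locality")
    rows.append({
        "Company Name": link.get_text(strip=True),
        "Location": loc.get_text(strip=True),
        # Relative hrefs become absolute profile URLs
        "Profile URL": urljoin(BASE, link["href"]),
    })
print(rows[0]["Profile URL"])  # https://clutch.co/profile/tlvtech
```

Scoping the inner selects to each card, rather than running two parallel page-wide selects, also avoids the misalignment problem when a card is missing a field.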
Answered By - baduker