Issue
I am only getting 308 rows in my CSV file, where I should be getting more than 900. I have written the code below. I tried changing the iteration over the links, but the result is the same: I get the same amount of data every time. Is it a problem with my DataFrame declaration, or something else?
from bs4 import BeautifulSoup
import requests
import pandas as ps

#list of dataframe
suppliers_name = []
suppliers_location = []
suppliers_type = []
suppliers_content = []
suppliers_est =[]
suppliers_income = []

def parse(url):
    web = requests.get(url)
    soup = BeautifulSoup(web.content, "html.parser")
    container = soup.find_all(class_ = "supplier-search-results__card profile-card profile-card profile-card--secondary supplier-tier-1")
    for cont in container:
        # getting the names
        name = cont.find("h2").text
        suppliers_name.append(name)
        #getting the locations
        location =cont.find(class_ = "profile-card__supplier-data").find("a").text[8:]
        if " " in location:
            suppliers_location.append(location.replace(" ",""))
        elif "Locations" in location:
            suppliers_location.append(location.replace("Locations", "None"))
        #suppliers type
        types = cont.find(class_ = "profile-card__supplier-data").find_all("span")[1].text[2:]
        suppliers_type.append(types.replace("*", ""))
        # suppliers content
        content = cont.find(class_ = "profile-card__body-text").find("p").text
        suppliers_content.append(content)
        # suppliers establishment
        years = cont.find(class_ = "profile-card__supplier-data").find_all("span", {"data-toggle":"popover"})
        if len(years) == 4:
            year = cont.find(class_ = "profile-card__supplier-data").find_all("span", {"data-toggle":"popover"})[2].text
            suppliers_est.append(year[5:])
        elif len(years) == 3:
            year = cont.find(class_ = "profile-card__supplier-data").find_all("span", {"data-toggle":"popover"})[1].text
            word =year[5:]
            if len(word) != 4:
                suppliers_est.append("None")
            else:
                suppliers_est.append(word)
        elif len(years) == 2:
            year = cont.find(class_ = "profile-card__supplier-data").find_all("span", {"data-toggle":"popover"})[1].text
            suppliers_est.append(year[5:])
        elif len(years)==1:
            suppliers_est.append("None")
        # suppliers income
        incomes = cont.find(class_ = "profile-card__supplier-data").find_all("span", {"data-toggle":"popover"})
        if len(incomes) == 4:
            income = cont.find(class_ = "profile-card__supplier-data").find_all("span", {"data-toggle":"popover"})[1].text
            suppliers_income.append(income[4:])
        elif len(incomes) == 3:
            income = cont.find(class_ = "profile-card__supplier-data").find_all("span", {"data-toggle":"popover"})[1].text
            word = income[4:]
            if len(word) != 5:
                suppliers_income.append(word)
            else:
                suppliers_income.append("None")
        elif len(incomes) == 2:
            suppliers_income.append("None")
        elif len(incomes) == 1:
            suppliers_income.append("None")

#iterate over links
number = 1
num =1
for i in range(43):
    urls = f'https://www.thomasnet.com/nsearch.html?_ga=2.53813992.1582589371.1586649402-45317423.1586649402&cov=NA&heading=97010359&pg={num}'
    parse(urls)
    num += 1
    print("\n" f'{number} - done')
    number += 1

#dataframe
covid = ps.DataFrame({
    "Name of the Suppliers": suppliers_name,
    "Location": suppliers_location,
    "Type of the suppliers": suppliers_type,
    "Establishment of the supplies": suppliers_est,
    "Motive": suppliers_content
})
covid.to_csv("E:/New folder/covid.csv", index=False)
print("File Creation Done")
The code runs without any error, but I am not getting all the data.
Solution
The class attributes change after a few pages, i.e. they go from "supplier-search-results__card profile-card profile-card profile-card--secondary supplier-tier-1" to "supplier-search-results__card profile-card profile-card profile-card--tertiary " (and that's just one that I noticed; there appear to be more).
So as an alternative, it looks to me like they all have an id attribute that starts with pc. You can try that (find all elements whose id attribute starts with pc).
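To illustrate the difference between the two lookups, here is a minimal sketch. The HTML is made up for the example (the ids pc101, pc102 and the supplier names are assumptions, and the class strings are shortened); only the id prefix pc and the class names come from the page discussed above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup imitating two result cards whose class strings differ.
html = """
<div id="pc101" class="supplier-search-results__card profile-card profile-card--secondary supplier-tier-1"><h2>Supplier A</h2></div>
<div id="pc102" class="supplier-search-results__card profile-card profile-card--tertiary"><h2>Supplier B</h2></div>
<div id="footer"><h2>Not a card</h2></div>
"""

soup = BeautifulSoup(html, "html.parser")

# A full multi-class string only matches elements whose class attribute
# is exactly that string, so the "tertiary" card is skipped.
exact = soup.find_all(
    class_="supplier-search-results__card profile-card profile-card--secondary supplier-tier-1"
)

# Matching on the id prefix picks up every card, whatever its classes are.
by_id = soup.find_all(id=re.compile("^pc"))

names_exact = [card.find("h2").text for card in exact]
names_by_id = [card.find("h2").text for card in by_id]

print(names_exact)
print(names_by_id)
```

On the real site this only helps as long as every card keeps an id beginning with pc, so it is worth spot-checking a few pages.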
There are also errors in the logic of your if statements, in that you don't catch all the scenarios. For example, at some point the location does not contain the double whitespace you check for in front of the city and state, and it doesn't have "Locations" in it either, so neither branch runs, nothing is appended, and you end up with a different number of locations in your list. It's best to just .strip() the whitespace; then you don't need an if for that case, or a replace, because strip takes care of the whitespace around the string.
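As a rough sketch of that idea, the helper below always returns exactly one value per input, so the location list cannot fall out of step with the other lists. The function name clean_location and the sample strings are assumptions made for illustration; they are not taken from the site.

```python
def clean_location(raw):
    """Return a cleaned location string for one card (hypothetical helper)."""
    # strip() removes any amount of leading and trailing whitespace,
    # so there is no need to test for one space versus two.
    text = raw.strip()
    # Mirror the original handling of cards that only show "Locations".
    return text.replace("Locations", "None")


# Made-up inputs with different amounts of leading whitespace.
samples = ["  Chicago, IL", " Detroit, MI", "Austin, TX", "Locations"]

cleaned = [clean_location(s) for s in samples]
print(cleaned)
```

Because there is no branch that can be skipped, every card contributes one entry.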
Finally, not a huge deal, but most pandas users I've seen import it as pd as opposed to ps.
I think this code gets what you want; in fact, it gives me 1065 rows:
from bs4 import BeautifulSoup
import re  # needed for re.compile in parse()
import requests
import pandas as ps

#list of dataframe
suppliers_name = []
suppliers_location = []
suppliers_type = []
suppliers_content = []
suppliers_est =[]
suppliers_income = []

def parse(urls):
    web = requests.get(urls, headers={})
    soup = BeautifulSoup(web.content, "html.parser")
    container = soup.find_all(id=re.compile("^pc"))  # <----- Fixed this line: find all id attributes that start with pc
    for cont in container:
        # getting the names
        name = cont.find("h2").text
        suppliers_name.append(name)
        #getting the locations
        location =cont.find(class_ = "profile-card__supplier-data").find("a").text[8:]
        if "Locations" in location:
            suppliers_location.append(location.replace("Locations", "None"))
        else:
            suppliers_location.append(location.strip())  # <----- Fixed this line
        #suppliers type
        types = cont.find(class_ = "profile-card__supplier-data").find_all("span")[1].text[2:]
        suppliers_type.append(types.replace("*", ""))
        # suppliers content
        content = cont.find(class_ = "profile-card__body-text").find("p").text
        suppliers_content.append(content)
        # suppliers establishment
        years = cont.find(class_ = "profile-card__supplier-data").find_all("span", {"data-toggle":"popover"})
        if len(years) == 4:
            year = cont.find(class_ = "profile-card__supplier-data").find_all("span", {"data-toggle":"popover"})[2].text
            suppliers_est.append(year[5:])
        elif len(years) == 3:
            year = cont.find(class_ = "profile-card__supplier-data").find_all("span", {"data-toggle":"popover"})[1].text
            word =year[5:]
            if len(word) != 4:
                suppliers_est.append("None")
            else:
                suppliers_est.append(word)
        elif len(years) == 2:
            year = cont.find(class_ = "profile-card__supplier-data").find_all("span", {"data-toggle":"popover"})[1].text
            suppliers_est.append(year[5:])
        elif len(years)==1:
            suppliers_est.append("None")
        # suppliers income
        incomes = cont.find(class_ = "profile-card__supplier-data").find_all("span", {"data-toggle":"popover"})
        if len(incomes) == 4:
            income = cont.find(class_ = "profile-card__supplier-data").find_all("span", {"data-toggle":"popover"})[1].text
            suppliers_income.append(income[4:])
        elif len(incomes) == 3:
            income = cont.find(class_ = "profile-card__supplier-data").find_all("span", {"data-toggle":"popover"})[1].text
            word = income[4:]
            if len(word) != 5:
                suppliers_income.append(word)
            else:
                suppliers_income.append("None")
        elif len(incomes) == 2:
            suppliers_income.append("None")
        elif len(incomes) == 1:
            suppliers_income.append("None")

#iterate over links
number = 1
num =1
for i in range(43):
    urls = f'https://www.thomasnet.com/nsearch.html?_ga=2.53813992.1582589371.1586649402-45317423.1586649402&cov=NA&heading=97010359&pg={num}'
    parse(urls)
    num += 1
    print("\n" f'{number} - done')
    number += 1

#dataframe
covid = ps.DataFrame({
    "Name of the Suppliers": suppliers_name,
    "Location": suppliers_location,
    "Type of the suppliers": suppliers_type,
    "Establishment of the supplies": suppliers_est,
    "Motive": suppliers_content
})
Output:
print (covid)
Name of the Suppliers ... Motive
0 All Metal Sales, Inc. ... Supplier providing services and products for C...
1 Ellsworth Adhesives ... We are a large distributor with the ability to...
2 Monroe Engineering Products ... Proud member of the Defense Industrial Base an...
3 New Process Fibre Company, Inc. ... We supply parts to help produce the following ...
4 Vanguard Products Corp. ... Custom manufacturing available for component p...
5 Techmetals, Inc. ... We are a certified metal plating facility, hol...
6 The Rodon Group ... We can design, mold and manufacturer plastic i...
7 Mardek, LLC ... Mardek LLC's core business is sourcing manufac...
8 Allstates Rubber & Tool Corp. ... Materials or component parts are available tha...
9 Estes Design & Manufacturing, Inc. ... We are a sheet metal fabricator still in opera...
10 Nadco Tapes and Labels, Inc. ... We are manufacturer of tapes and labels capabl...
11 NewAge Industries, Inc. ... We are a manufacturer and fabricator of plasti...
12 Associated Bag ... We are a nationwide supplier of packaging, shi...
13 3D Hubs ... Surgical masks - we have started a GoFundMe ra...
14 Tailored Label Products, Inc. ... In response to the COVID-19 crisis, we're prov...
15 Compressed Air Systems, Inc. ... We are specialists in compressed air, we can d...
16 A & S Mold & Die Corp. ... Custom manufacturing available for component p...
17 Vibromatic Co., Inc. ... We manufacture custom part handling systems. I...
18 Wyandotte Industries, Inc. ... Wyandotte Industries is a manufacturer of Spec...
19 MOCAP LLC ... We manufacture a full line of protective caps,...
20 Emco Industrial Plastics, Inc. ... Custom manufacturer of guards, divider and fac...
21 Bracalente Manufacturing Co., Inc. ... We specialize in complex and high volume turne...
22 Liberty Industries, Inc. ... Engineers, designs and builds cleanrooms, modu...
23 Waples Manufacturing ... We are using our expertise in CNC precision ma...
24 Griff Paper & Film ... In response to the COVID-19 crisis, we are cur...
25 The Hollaender Mfg. Co. ... Hollaender is a manufacturer supplying key inf...
26 IFM Efector, Inc. ... We offer durably tested and highly reliable se...
27 Precision Associates, Inc. ... We are ramping up our production of several es...
28 LBU, Inc. ... For the production of in demand COVID response...
29 EMC Precision Machining ... Available to custom manufacture component part...
... ... ...
1035 Chembio Diagnostic Systems, Inc. ... Manufacturer of rapid PCR test kits for aid in...
1036 Innuscience ... Manufacturer of biotechnology based, environme...
1037 Baumgartner Machine ... In response to the COVID-19 crisis, we can off...
1038 Resitech Industries LLC ... We can provide face masks and face shields for...
1039 Bean, L.L., Inc. ... We can supply face masks to assist during the ...
1040 Trinity Medical Devices Inc. ... Supplies and materials for respirators - Certa...
1041 Prent Corp. ... We can supply face shields that can be used du...
1042 GDC, Inc. ... Extensive range of machinery available to supp...
1043 Honeywell International, Inc. ... We can supply face masks that can be used duri...
1044 Scan Group, The ... We are adapting part of our production to manu...
1045 Advanced Sterilization Products ... We can supply face masks that can be used duri...
1046 International Wire Dies div. of DS Hai, LLC ... We have access to 3D printers so we can make s...
1047 Prima Strategic Group Inc. ... We are a Houston, TX based organization and wi...
1048 R+L Global Logistics ... We can ship anything, anytime, anywhere. We ar...
1049 Interiors by Maite Granda ... If you need materials or component parts, we h...
1050 Pioneer IWS ... Prototyping services and mass production offer...
1051 American Belleville ... Custom component part manufacturing available ...
1052 PSP Seals, LLC ... We have access to supplies of Chinese made KN9...
1053 Gulf States International, Inc. ... We manufacture hand sanitizer for COVID-19 res...
1054 Prochem Specialty Products, Inc. ... We manufacturer biodegradable and environmenta...
1055 Business & Industry Resource Solutions LLC, (B... ... During the COVID-19 crisis, we are supplying M...
1056 Boardman Molded Products ... Custom thermoplastic injection molding company...
1057 Machine Safety Management ... We are a manufacturing company with engineers ...
1058 JN White ... Manufacturing a proprietary all-in-one-piece f...
1059 New Balance ... We can supply face masks that can be used duri...
1060 Rhino Health ... One stop nitrile exam glove manufacturer. We h...
1061 Orchid International ... Able to custom manufacture rapid PCR test kits...
1062 MegaPlast United States ... We manufacture mobile hospital shelters, corps...
1063 Lion Apparel ... We can supply protective clothing that can be ...
1064 NuMa Group, Inc. ... Manufacturer of supplies and materials capable...
[1065 rows x 5 columns]
Answered By - chitown88