Issue
This was an attempt to first get all the links from the titles on the first page. This worked, but I want to write the links to a .txt file, and to get them for all available pages too.
from bs4 import BeautifulSoup
import requests

URL = "https://www.usaopps.com/government_contractors/naics-111110-Soybean-Farming.htm"
fixed_url = "https://www.usaopps.com"

page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="contractor-list")

# Each listing title sits in a div with class "lr-title"
contractor_elements = results.find_all("div", class_="lr-title")
for contractor_element in contractor_elements:
    for link in contractor_element.find_all("a"):
        link_url = link["href"]
        print(f"full link: {fixed_url}{link_url}\n")
After that, I got the contact person details and fax number with this code:
from bs4 import BeautifulSoup
import requests

url = "https://www.usaopps.com/government_contractors/contractor-5922555-BSL-GLOBAL-WATER-SOLUTION.htm"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")

results_info = soup.find(id="box-sideinfo")
info_elements = results_info.find_all("div", class_="info-gen-box clearfix")

# The fax number and contact person happen to be the 14th and 16th <dd> entries
Fax = soup.select("#box-sideinfo > div > dl > dd:nth-child(14)")
contact_person = soup.select("#box-sideinfo > div > dl > dd:nth-child(16)")
print(contact_person)
print(Fax)
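The nth-child selectors are positional and break as soon as the layout shifts; matching each <dd> by the text of its <dt> label is more robust. A sketch, assuming the labels read "Contact Person:" and "Fax:" (not verified against the page):

from bs4 import BeautifulSoup
import requests

url = "https://www.usaopps.com/government_contractors/contractor-5922555-BSL-GLOBAL-WATER-SOLUTION.htm"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
info_box = soup.select_one("#box-sideinfo")

def field(box, label):
    # Find the <dt> whose text matches the label, then take its sibling <dd>.
    # The label strings passed in below are assumptions, not verified.
    dt = box.find("dt", string=label)
    return dt.find_next_sibling("dd").get_text(strip=True) if dt else None

print(field(info_box, "Contact Person:"))
print(field(info_box, "Fax:"))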
I wanted the new URL to be each of the links from my first piece of code, and to have both pieces of code working together...
Solution
This is one way of obtaining that info, and displaying it in a meaningful way:
import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

big_list = []
for i in range(1, 2):  # page 1 only for the demo; extend the range for all 83 pages
    url = f"https://www.usaopps.com/government_contractors/naics-111110-Soybean-Farming.{i}.htm"
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    for x in soup.select('div.list-one')[:3]:  # first 3 companies only, for the demo
        det_url = 'https://www.usaopps.com' + x.select_one('a').get('href')
        req = requests.get(det_url, headers=headers)
        det_soup = BeautifulSoup(req.text, 'html.parser')
        info_box = det_soup.select_one('div.info-gen-box')
        # Look each value up by its <dt> label, then read the sibling <dd>
        c_name = info_box.find('dt', text='Company Name:').find_next_sibling('dd').text
        c_address = info_box.find('dt', text='Address:').find_next_sibling('dd').text
        c_phone = info_box.find('dt', text='Phone:').find_next_sibling('dd').text
        big_list.append((c_name, c_address, c_phone))

df = pd.DataFrame(big_list, columns=['Company', 'Address', 'Phone'])
print(df)
This will print in terminal:
| | Company | Address | Phone |
|---|---|---|---|
| 0 | BSL GLOBAL WATER SOLUTIONS, INC | 5020 Campus Dr | 949-296-7666 |
| 1 | JONES 3 CO. LLC | 4133 Fishcreek Rd Apt 401 | 360-279-8638 |
| 2 | Banneker Ventures, LLC | 5 Choke Cherry Road, Suite 378 | 301-990-4980 |
There are 83 pages of companies, so extending the range (and dropping the [:3] slice) to scrape everything will take some time.
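Since the original goal was a .txt file, the finished DataFrame can also be written straight to disk instead of (or as well as) printed; pandas' to_csv handles plain-text output:

# Write the scraped table to a tab-separated .txt file
# (the filename here is just an example)
df.to_csv("contractors.txt", sep="\t", index=False)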
Requests docs: https://requests.readthedocs.io/en/latest/
BeautifulSoup docs: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
And of course, pandas docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
Answered By - platipus_on_fire