Issue
I'm writing a Python script to extract the records of all people listed on a site using Selenium, BeautifulSoup and pandas. The difficulty is that the site only returns results after a search is performed, so for testing I pass a search value and submit the form via Selenium. When I run the code line by line in an IPython shell I get the desired results, but the same code throws an error when saved to a file and run with the python command.
Code
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
import pandas as pd
import re

br = webdriver.Chrome()          # the original snippet omitted the driver setup
url = 'https://osp.nckenya.com'  # site named in the answer below; exact search page assumed

br.get(url)
sleep(2)

# type the search value and submit the form
sName = br.find_element_by_xpath("/html/body/div[1]/div[2]/section/div[2]/div/div/div/div/div/div/div[2]/form/div[1]/div/div/input")
sName.send_keys("martin")
br.find_element_by_xpath("//*[@id='provider']/div[1]/div/div/div/button").click()
sleep(3)

# parse the page only after the results have loaded; grabbing page_source
# before the search returns a page that has no results table yet
soup = BeautifulSoup(br.page_source, 'lxml')
table = soup.find('table')
body = table.find_all('tr')      # header row plus data rows

# get column heads
head = body[0]
body_rows = body[1:]
headings = []
for item in head.find_all('th'):
    headings.append(item.text.rstrip("\n"))
print(headings)

# declare an empty list for holding all records
all_rows = []
# loop through all table rows to get all table data
for body_row in body_rows:
    row = []
    for row_item in body_row.find_all('td'):
        # strip non-breaking spaces, newlines and commas
        row.append(re.sub(r"(\xa0)|(\n)|,", "", row_item.text))
    all_rows.append(row)

# match each record to its field name
# cols = ['name', 'license', 'xxx', 'xxxx']
df = pd.DataFrame(data=all_rows, columns=headings)
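If the Selenium route is kept, an explicit wait is more reliable than fixed sleep calls. A minimal sketch, assuming Selenium 3 and the page structure above, that blocks until the results table exists before parsing:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for a results table to appear after the search
WebDriverWait(br, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'table'))
)
soup = BeautifulSoup(br.page_source, 'lxml')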
Solution
You don't need the overhead of a browser or to worry about waits. You can simply mimic the POST request the page makes:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# the same form fields the page submits when the search button is clicked
data = {'search_register': '1', 'search_text': 'Martin'}
r = requests.post('https://osp.nckenya.com/ajax/public', data=data)
soup = bs(r.content, 'lxml')
# read_html returns a list of DataFrames, one per table found
results = pd.read_html(str(soup.select_one('#datatable2')))
print(results)
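Since results is a list (one DataFrame per table matched), the table itself is results[0]. A short follow-up sketch (the CSV filename is just an example):

df = results[0]
print(df.head())
df.to_csv('results.csv', index=False)  # example output path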
Answered By - QHarr