Issue
I am trying to write a scraper for the "free-proxy.cz" website, but I am running into a problem.
I know my "port" section is wrong, but I don't know what the problem is or how to fix it.
Here is the code:
import requests
from bs4 import BeautifulSoup
import base64
urls = ['http://free-proxy.cz/en/proxylist/country/all/socks5/date/all',
        'http://free-proxy.cz/en/proxylist/country/all/socks5/date/all/2',
        'http://free-proxy.cz/en/proxylist/country/all/socks5/date/all/3',
        'http://free-proxy.cz/en/proxylist/country/all/socks5/date/all/4',
        'http://free-proxy.cz/en/proxylist/country/all/socks5/date/all/5',
        ]
for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    table = soup.find('table', {'id': 'proxy_list'})
    for row in table.find('tbody').find_all('tr'):
        for ip in row.find('script'):
            text=base64.b64decode(ip[29:-2:])
        for port in row.find('span', attrs='fport'):
            print(port.get_text())
            #ipadd=print(prt.decode('utf-8')+':'+ports)
I commented out the last line because the port grabber is not working correctly.
The result of running the above code is:
Traceback (most recent call last):
  File "LOCATION\main.py", line 22, in <module>
    for port in row.find('span', attrs='fport'):
TypeError: 'NoneType' object is not iterable
80
45554
1080
1080
What is the issue here?
Solution
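The issue is that row.find('span', attrs='fport') returns None for some of the rows in table.find('tbody').find_all('tr'), namely the rows that contain no span with the fport class. Iterating over None then raises the TypeError shown in your traceback. You should be able to overcome it by wrapping the per-row extraction in a try-except block, like this: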
import requests
import base64
from bs4 import BeautifulSoup
BASE_URL = 'http://free-proxy.cz/en/proxylist/country/all/socks5/date/all'
# Pages 1 to 5, as in the question; the first page has no page-number suffix
for i in range(1, 6):
    url = BASE_URL if i == 1 else f'{BASE_URL}/{i}'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    table = soup.find('table', {'id': 'proxy_list'})
    for row in table.find('tbody').find_all('tr'):
        try:
            # The IP is Base64-encoded inside the row's <script> tag
            ip = base64.b64decode(row.select_one('script').get_text()[29:-2]).decode('utf-8')
            port = row.select_one('span.fport').get_text()
            print(f'IP: {ip}')
            print(f'Port: {port}\n')
        except (AttributeError, ValueError):
            # Skip rows with no script or port element, or with data that cannot be decoded
            continue
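If you would rather not lean on try-except, an explicit None check on each lookup achieves the same thing and makes it obvious which rows get skipped. Here is a minimal sketch that runs on inline sample markup instead of the live site. SAMPLE_HTML and parse_rows are made up for illustration, and the real page markup may differ from this sample:

```python
import base64
from bs4 import BeautifulSoup

# Hypothetical sample: one proxy row, plus one row with no script or port span
SAMPLE_HTML = """
<table id="proxy_list"><tbody>
<tr><td><script type="text/javascript">document.write(Base64.decode("NTEuNzcuMTQxLjI5"))</script></td>
<td><span class="fport">1080</span></td></tr>
<tr><td colspan="2">advertisement</td></tr>
</tbody></table>
"""

def parse_rows(html):
    """Return 'ip:port' strings for every row that has both pieces."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', {'id': 'proxy_list'})
    proxies = []
    for row in table.find('tbody').find_all('tr'):
        script = row.find('script')
        port = row.find('span', class_='fport')
        # find() returns None when nothing matches, so check before using the result
        if script is None or port is None:
            continue
        # The Base64 payload is the text between the double quotes
        encoded = script.string.split('"')[1]
        ip = base64.b64decode(encoded).decode('utf-8')
        proxies.append(f'{ip}:{port.get_text()}')
    return proxies

print(parse_rows(SAMPLE_HTML))
```

Splitting on the double quote instead of slicing with [29:-2] is just one option; it avoids depending on the exact length of the document.write(...) wrapper.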
If you want to focus more on handling the data than on the scraping implementation itself, you could also give WebScrapingAPI a try. The service has an extract_rules feature that returns elements in JSON format, based on the CSS selectors you specify. For example:
curl https://api.webscrapingapi.com/v1\?api_key\=<YOUR_API_KEY>\&url\=http://free-proxy.cz/en/proxylist/country/all/socks5/date/all\&render_js\=1\&extract_rules\=%7B%22ports%22%3A%7B%22selector%22%3A%22span.fport%22%2C%22output%22%3A%22text%22%7D%7D
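The extract_rules value in the command above is URL-encoded JSON, which is hard to read and easy to mistype by hand. As a rough sketch using only the Python standard library (the variable names are my own), this is how that exact value can be produced from a plain dictionary:

```python
import json
from urllib.parse import quote, unquote

# Rule taken from the curl command above: collect the text of every span.fport
rules = {"ports": {"selector": "span.fport", "output": "text"}}

# Compact separators keep spaces out of the JSON before it is percent-encoded
encoded = quote(json.dumps(rules, separators=(",", ":")), safe="")
print(encoded)

# Decoding the value gives back the original rule
assert json.loads(unquote(encoded)) == rules
```

Note that this rule only asks for "ports". The response also contains an "ips" key, which would presumably need a rule of its own; that part of the request is not shown here.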
Response to the curl request:
{
"ips":[
"<script type=\"text/javascript\">document.write(Base64.decode(\"NTEuNzcuMTQxLjI5\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"MTcyLjEwNC4yNDEuMjk=\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"NDcuMjU0LjI0Ny4xOTI=\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"MTIyLjIyNC4yMjAuOTA=\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"OTYuMTI2LjEyNC4xOTc=\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"MTg1LjYuMTAuMjMx\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"NDcuOTEuMTA3LjI1MQ==\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"MTY1LjIyNy4xMDQuMTIy\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"MTg4LjEyNy4yMjQuNDY=\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"NTEuODMuMTQwLjcw\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"OTQuNzQuMTE3Ljkz\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"MTUyLjMyLjE2NC4yMg==\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"MTY3LjE3Mi4yMy4xOQ==\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"NzIuMjA2LjE4MS4xMDU=\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"MTIyLjI1Mi4yMzAuMTE3\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"OTguMTcwLjU3LjIzMQ==\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"NDcuNzQuNjYuNw==\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"MTYyLjI0My4xNDAuODI=\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"NDcuODguNi42Ng==\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"MTk0LjE5NS4yNDAuNjA=\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"NDMuMjQwLjExMy44OQ==\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"MTEzLjE3Ni4xMTguMTUw\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"MTkyLjI1Mi4yMTQuMjA=\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"NzIuMjIxLjIzMi4xNTU=\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"NDcuMjQwLjIyNi4xNzM=\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"NjcuMjA1LjE0NS40MA==\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"MTg0LjE4MS4yMTcuMjEw\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"MTM0LjIwOS4xMDUuMTYw\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"MTM4LjE5Ny4yMDMuODQ=\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"NDcuMjQzLjEzOC4yMDg=\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"MTE0Ljk1LjE1NC4yNw==\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"MTQ5LjIwMi4xNjQuNQ==\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"OTEuMTIxLjIxMC41Ng==\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"MTUyLjY5LjIyMS41NQ==\"))</script>",
"<script type=\"text/javascript\">document.write(Base64.decode(\"MTg0LjE4MS4yMTcuMTk0\"))</script>"
],
"ports":[
"1080",
"8011",
"4153",
"7300",
"81",
"52047",
"3128",
"29411",
"4617",
"8181",
"25268",
"21616",
"46858",
"64935",
"7497",
"4145",
"3000",
"17463",
"4672",
"3128",
"46859",
"1080",
"15864",
"4145",
"1080",
"10126",
"4145",
"22503",
"7497",
"64",
"23456",
"17562",
"41260",
"10114",
"4145"
]
}
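The entries under "ips" are still raw script tags with the address Base64-encoded, so they need one more step before they are usable. A small sketch of that step, using the first two entries of each list from the response above. The decode_ip helper is my own, and pairing ips with ports by position is an assumption about how the two lists line up:

```python
import base64
import json
import re

# First two entries of each list from the response above, shortened for the example
response_text = r'''
{
  "ips": [
    "<script type=\"text/javascript\">document.write(Base64.decode(\"NTEuNzcuMTQxLjI5\"))</script>",
    "<script type=\"text/javascript\">document.write(Base64.decode(\"MTcyLjEwNC4yNDEuMjk=\"))</script>"
  ],
  "ports": ["1080", "8011"]
}
'''

# Captures the Base64 payload passed to Base64.decode("...")
ENCODED_IP = re.compile(r'Base64\.decode\("([^"]+)"\)')

def decode_ip(script_html):
    """Return the IP hidden in one script snippet, or None if no payload is found."""
    match = ENCODED_IP.search(script_html)
    if match is None:
        return None
    return base64.b64decode(match.group(1)).decode("utf-8")

data = json.loads(response_text)

# Assumption: ips[i] and ports[i] describe the same proxy
proxies = [
    f"{ip}:{port}"
    for ip, port in zip(map(decode_ip, data["ips"]), data["ports"])
    if ip is not None
]
print(proxies)
```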
Answered By - Mihnea-Octavian Manolache