Issue
Help me please with my code.
relation_tables = char_soup.find('ul', class_='subNav').find_all('li')
like_page_url = url + relation_tables[2].find('a').get('href') # Get like page's url
dislike_page_url = url + relation_tables[3].find('a').get('href') # Get dislike page's url
like_r = requests.get(like_page_url) # Get source of page with users who liked/disliked
dislike_r = requests.get(dislike_page_url)
like_soup = BeautifulSoup(like_r.text, 'html.parser')
dislike_soup = BeautifulSoup(dislike_r.text, 'html.parser')
like_pages = int(like_soup.find('ul', class_='nav').find_all('li')[13].text)
dislike_pages = int(dislike_soup.find('ul', class_='nav').find_all('li')[13].text)
n = like_soup.find('table', class_='pure-table striped').find_all('tr') # WORKS
for i in range(0, like_pages):
    like_users_trs = like_soup.find('table', class_='pure-table striped').find_all('tr') # DON'T
    # Get all users names and extend them to a list
    curr_character_like_names.extend([f'{url}{tr.find("a").text}' for tr in like_users_trs])
    # Then find 'next' button and get next page's url
    like_page_url = url + like_soup.find('li', class_='next').find('a').get('href')
    # Get source of the next page
    like_r = requests.get(like_page_url)
    like_soup = BeautifulSoup(like_r.text, 'html.parser')
This code should collect the user names from two different pages: the users who liked a character and the users who disliked it. The problem is that one of two lines that do the same thing doesn't work:
n = like_soup.find('table', class_='pure-table striped').find_all('tr')
(that line is just for testing)
That one is outside the loop and works fine, but the identical line inside the loop (like_users_trs = like_soup.find('table', class_='pure-table striped').find_all('tr')) throws an error:
Traceback (most recent call last):
  File "/home/sekki/Documents/Pycharm/anime_planetDB/main.py", line 131, in <module>
    like_users_trs = like_soup.find('table', class_='pure-table striped').find_all('tr') # DON'T
AttributeError: 'NoneType' object has no attribute 'find_all'
Additional info:
- like_page_url = https://www.anime-planet.com/characters/armin-arlelt/loves
- dislike_page_url = https://www.anime-planet.com/characters/armin-arlelt/hates
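The AttributeError in the traceback means that find() returned None, which BeautifulSoup does whenever no element matches; it does not say why the table was missing from that particular response. As a minimal sketch (the helper name table_rows and the two inline HTML snippets are made up for illustration, they are not taken from the site), a guard like this turns the crash into an empty result that the loop can check:

```python
from bs4 import BeautifulSoup


def table_rows(soup):
    """Return the <tr> rows of the users table, or an empty list if it is missing."""
    table = soup.find('table', class_='pure-table striped')
    if table is None:
        # find() returns None when nothing matches; calling find_all on that
        # None is what raises the AttributeError shown in the traceback.
        return []
    return table.find_all('tr')


# Hypothetical HTML used only to exercise both branches.
with_table = BeautifulSoup(
    '<table class="pure-table striped"><tr><td><a>user1</a></td></tr></table>',
    'html.parser')
without_table = BeautifulSoup('<p>No table on this page</p>', 'html.parser')

print(len(table_rows(with_table)))
print(len(table_rows(without_table)))
```

Printing like_r.status_code and the start of like_r.text when the helper returns an empty list is one way to see what the server actually sent back for that request.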
Solution
It looks like you are overcomplicating this a bit. Looking at the pattern, the names start to repeat once you go past the last page, so just run a while True loop until that happens.
Secondly, let pandas parse that table for you:
import io

import pandas as pd
import requests


def get_date(url):
    df = pd.DataFrame(columns=[0])
    page = 1
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
    while True:
        url_page = f'{url}?page={page}'
        response = requests.get(url_page, headers=headers).text
        # Wrap the HTML in StringIO: recent pandas versions deprecate
        # passing a literal HTML string to read_html.
        temp_df = pd.read_html(io.StringIO(response))[0]
        # Past the last page the names repeat, so stop as soon as the
        # first name on this page has already been collected.
        if temp_df[0].iloc[0] in df[0].values:
            break
        print(f'Collected Page: {page}')
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
        df = pd.concat([df, temp_df])
        page += 1
    return df


dfLoves = get_date('https://www.anime-planet.com/characters/armin-arlelt/loves')
dfHates = get_date('https://www.anime-planet.com/characters/armin-arlelt/hates')
Output:
print(dfLoves)
                  0
0         atsumuboo
1         Ken0brien
2           Kabooom
3          xsleepyn
4    camoteconpapas
..              ...
21        SonSoneca
22  SayaSpringfield
23          Kurasan
24     HikaruTenshi
0     silvertail123
[15026 rows x 1 columns]
print(dfHates)
                 0
0           selvnq
1   LiveLaughLuffy
2      SixxTheGoat
3        IceWolfTO
4         Sam234io
..             ...
11     phoenix5793
12          Tyrano
13      SimplyTosh
14      KrystaChan
15     SHADOWORZA0
[2591 rows x 1 columns]
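The stopping rule used here (keep requesting pages until the first name on a page has already been seen) can be tried without touching the network. The following is only a sketch under stated assumptions: collect_until_repeat, fake_fetch and FAKE_PAGES are hypothetical names, and the fake site is assumed to serve its last page again for any page number past the end, which may not be exactly how the real site behaves.

```python
def collect_until_repeat(fetch_page):
    """Collect names page by page until a page starts with a name already seen."""
    names = []
    seen = set()
    page = 1
    while True:
        rows = fetch_page(page)
        # Stop on an empty page or when the first name is a repeat.
        if not rows or rows[0] in seen:
            break
        names.extend(rows)
        seen.update(rows)
        page += 1
    return names


# Hypothetical stand-in for the site: three pages of made-up user names.
FAKE_PAGES = [['alice', 'bob'], ['carol', 'dave'], ['erin']]


def fake_fetch(page):
    # Assumption: page numbers past the end return the last page again.
    index = min(page, len(FAKE_PAGES)) - 1
    return FAKE_PAGES[index]


print(collect_until_repeat(fake_fetch))
# → ['alice', 'bob', 'carol', 'dave', 'erin']
```

Using a set for the seen names keeps the membership check cheap even when thousands of names have been collected.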
Answered By - chitown88