Issue
What can I add to the end of my code to append a list of items to a dataframe as column 1 and the resulting webscrape data from a for loop to column 2 and match each of column 1's list items to the webscrape data? I'm trying to get -
col1 col2
url1 A Details
url2 B Details
The code I have is,
urls = ['url1','url2']
from bs4 import BeautifulSoup
import requests
from time import sleep
from random import randint
for link in urls:
sleep(randint(11,23))
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',}
req = requests.get(link, headers = headers)
soup = BeautifulSoup(req.content, features="lxml")
for script in soup(["script", "style"]):
script.extract()
text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
Solution
The easiest way to do this is to append your data to a list inside your loop and then create a dataframe from your two lists.
from bs4 import BeautifulSoup
import requests
from time import sleep
from random import randint
#initializes empty lists to append data
url_list = []
details_list = []
for link in urls:
sleep(randint(11,23))
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',}
req = requests.get(link, headers = headers)
soup = BeautifulSoup(req.content, features="lxml")
for script in soup(["script", "style"]):
script.extract()
text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
text = '\n'.join(chunk for chunk in chunks if chunk)
url_list.append(link) # appends links
if not text:
# if text is empty
details_list.append('none') # appends text
else:
details_list.append(text) # appends text
results_df = pd.DataFrame() # creates empty dataframe
results_df['col1'] = url_list # adds column to dataframe with URLs
results_df['col2'] = details_list # adds column to dataframe with details
Answered By - Raul Gonzalez
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.