Issue
Resubmitted for clarity.
I am trying to use Python to loop through a list of websites and extract information (locations, $$$ under management, partners, etc.) from each site in the form of a dataframe (i.e. each website will have its own dataframe).
However, when I place the code inside a for loop as shown below, it only extracts information from the first website in the list. I am not receiving any errors; the code simply terminates after the first iteration, and I am not sure why it doesn't move on to the second. I have tried moving driver.quit() inside and outside the loop, and neither worked.
Code below:
from bs4 import BeautifulSoup
import numpy as np
from time import sleep
from random import randint
from selenium import webdriver
import pandas as pd
import spacy
from spacy import displacy
import requests
import re
import en_core_web_sm
nlp = en_core_web_sm.load()
NER = spacy.load("en_core_web_sm")
final_list = ['https://www.google.com','https://www.bing.com', 'https://www.amazon.com']
pd.set_option("display.max_rows", None, "display.max_columns", None)
df = []
for i in range(0,2):
    driver = webdriver.Chrome("C:/Users/~~~/chromedriver.exe")
    url = final_list[i]
    driver.get(url)
    sleep(randint(5,15))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    body=soup.body.text
    body = ' '.join(body.split())
    text3= NER(body)
    displacy.render(text3,style="ent",jupyter=True)
    doc = NER(body)
    entities = [(e.label_,e.text) for e in doc.ents]
    df[i] = pd.DataFrame(entities, columns=['Entity','Identified'])
driver.quit()
Solution
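df starts out as an empty list, so assigning to df[i] raises an IndexError (list assignment index out of range) on the first iteration, and the loop stops there. Append each dataframe to the list instead. Change: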
df[i] = pd.DataFrame(entities, columns=['Entity','Identified'])
to:
df.append(pd.DataFrame(entities, columns=['Entity','Identified']))
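The difference between the two lines can be seen in a minimal sketch that leaves out Selenium, spaCy and pandas. Here extract_entities is a hypothetical stand-in for the scraping and NER step, and plain lists of tuples take the place of the dataframes; in the real loop each appended item would be the pd.DataFrame(...) from the answer. The sketch also shows that range(0,2) only covers the first two of the three URLs, so iterating over the list directly may be what is wanted.

```python
def extract_entities(url):
    # Hypothetical stand-in for the Selenium + spaCy step in the question;
    # it just returns one fake (label, text) pair per URL.
    return [("ORG", url)]

urls = ['https://www.google.com', 'https://www.bing.com', 'https://www.amazon.com']

# Assigning by index into an empty list fails on the very first iteration.
frames = []
try:
    frames[0] = extract_entities(urls[0])
    index_error_raised = False
except IndexError:
    index_error_raised = True

# range(0, 2) yields only 0 and 1, so the third URL would never be visited.
visited_by_range = [urls[i] for i in range(0, 2)]

# Appending grows the list, and looping over urls directly visits every site.
results = []
for url in urls:
    results.append(extract_entities(url))

print(index_error_raised)
print(len(visited_by_range), len(results))
```

After the loop, results[0] holds the entities for the first site, results[1] for the second, and so on, which is the "one dataframe per website" layout the question asks for.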
Answered By - keramat