Issue
I am using the following code to scrape information from several result pages of Google Scholar using Selenium and Beautiful Soup.
I can print all the scraped information, but I can't save the results into one DataFrame for export.
How do I save the results (Title, Author, Link, Abstract) for each search result?
# Initialize the dataframe
data = {
    "Titel": [],
    "Link": [],
    "Authoren": [],
    "Veröffentlichungsjahr": [],
    "Abstract": []
}
df = pd.DataFrame(data)
# Path where the chromedriver is stored (locally)
PATH = '/Applications/chromedriver'
driver = webdriver.Chrome(PATH)
# Open the URL
driver.get('https://scholar.google.de/')
time.sleep(5)
# Find the search bar and fill it in
search = driver.find_element_by_id('gs_hdr_tsi')
search.send_keys('"circular economy"AND "Dlt" AND "Germany" AND "Sweden"')
time.sleep(5)
search.send_keys(Keys.RETURN)
## Number of results --> divided by 10 gives the number of clicks on "Weiter" (next)
Anzahl = driver.find_element_by_id('gs_ab_md').text
x=re.findall(r'\d+', Anzahl)[0]
# Click "Weiter" y times
y = int(int(x)/10)+1
print("Seitenanzahl:", y)
i=0
for i in range(2): #y
    # Barrier so that Selenium pauses until the results are loaded
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'lxml')
    for item in soup.select('[data-lid]'):
        try:
            print('----------------------------------------')
            # print(item)
            print(item.select('h3')[0].get_text())
            title = item.select('h3')[0].get_text()
            print(item.select('a')[0]['href'])
            link = item.select('a')[0]['href']
            print(item.select('.gs_a')[0].get_text())
            author = item.select('.gs_a')[0].get_text()
            txt = item.select('.gs_a')[0].get_text()
            print(re.findall(r'\d+', txt)[0])
            year = re.findall(r'\d+', txt)[0]
            print(item.select('.gs_rs')[0].get_text())
            abstract = item.select('.gs_rs')[0].get_text()
            data_2 = {
                "Titel": title,
                "Link": link,
                "Authoren": author,
                "Veröffentlichungsjahr": year,
                "Abstract": abstract
            }
            df_new = pd.DataFrame(data_2)
            df = df.append(df_new, ignore_index=True)
            print('----------------------------------------')
        except Exception as e:
            #raise e
            print('---')
    # Random wait time (1-14 seconds) before opening the next page, to avoid IP blocks
    w = random.randint(1,14)
    time.sleep(w)
    try:
        driver.find_element_by_link_text('Weiter').click()
    except:
        driver.quit()
    i += 1
Solution
Don't build the dataframe inside the loop. The strategy is to collect the records into a list of dictionaries and, at the end, create the dataframe in a single step. (As a side note, pd.DataFrame(data_2) with all-scalar values raises an error unless you pass an index, which is why the append version never produces a frame; appending row by row would also be slow, because each append copies the whole dataframe.)
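Here is a minimal sketch of that pattern, with made-up values and no Selenium, just to show the shape of the data:

import pandas as pd

records = []                      # one dict per scraped result
for i in range(3):                # stand-in for the scraping loop
    records.append({
        "Titel": f"Title {i}",
        "Link": f"https://example.org/{i}",
        "Authoren": "Some Author",
        "Veröffentlichungsjahr": "2021",
        "Abstract": "..."
    })

df = pd.DataFrame(records)        # build the frame once, at the end

Building the frame once at the end also sidesteps DataFrame.append, which is deprecated in recent pandas versions.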
New code (search for # <- HERE):
# Imports needed by this snippet
import re
import time
import random
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Path where the chromedriver is stored (locally)
PATH = '/Applications/chromedriver'
driver = webdriver.Chrome(PATH)
# Open the URL
driver.get('https://scholar.google.de/')
time.sleep(5)
# Find the search bar and fill it in
search = driver.find_element_by_id('gs_hdr_tsi')
search.send_keys('"circular economy"AND "Dlt" AND "Germany" AND "Sweden"')
time.sleep(5)
search.send_keys(Keys.RETURN)
## Number of results --> divided by 10 gives the number of clicks on "Weiter" (next)
Anzahl = driver.find_element_by_id('gs_ab_md').text
x=re.findall(r'\d+', Anzahl)[0]
# Click "Weiter" y times
y = int(int(x)/10)+1
print("Seitenanzahl:", y)
records = [] # <- HERE
for i in range(2): #y
    # Barrier so that Selenium pauses until the results are loaded
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'lxml')
    for item in soup.select('[data-lid]'):
        try:
            print('----------------------------------------')
            # print(item)
            print(item.select('h3')[0].get_text())
            title = item.select('h3')[0].get_text()
            print(item.select('a')[0]['href'])
            link = item.select('a')[0]['href']
            print(item.select('.gs_a')[0].get_text())
            author = item.select('.gs_a')[0].get_text()
            txt = item.select('.gs_a')[0].get_text()
            print(re.findall(r'\d+', txt)[0])
            year = re.findall(r'\d+', txt)[0]
            print(item.select('.gs_rs')[0].get_text())
            abstract = item.select('.gs_rs')[0].get_text()
            records.append({  # <- HERE
                "Titel": title,
                "Link": link,
                "Authoren": author,
                "Veröffentlichungsjahr": year,
                "Abstract": abstract
            })
            print('----------------------------------------')
        except Exception as e:
            #raise e
            print('---')
    # Random wait time (1-14 seconds) before opening the next page, to avoid IP blocks
    w = random.randint(1,14)
    time.sleep(w)
    try:
        driver.find_element_by_link_text('Weiter').click()
    except:
        driver.quit()
df = pd.DataFrame(records) # <- HERE
Output:
>>> df
Titel ... Abstract
0 Shifting infrastructure landscapes in a circul... ... … [Google Scholar] [CrossRef]; Kirchherr, J.; ...
1 Demand-supply matching through auctioning for ... ... … 12, 76131 Karlsruhe, Germany cPolitecnico di...
2 Using internet of things and distributed ledge... ... … The authors were able to show how a combinat...
3 The impact of Blockchain Technology on the Tra... ... … In the broader sense, Blockchain is a Distri...
4 Assessing the role of triple helix system inte... ... … depends upon the successful diffusion of sev...
5 Circular Digital Built Environment: An Emergin... ... … For example, when searching for articles rel...
6 [PDF][PDF] Phillip Bendix (Wuppertal Institute... ... … Stadtreinigung Hamburg (Germany): AI image …...
7 Waste Management–A Case Study of Producer Resp... ... … A similar study in Germany reported an inter...
8 Blockchain in the built environment and constr... ... … changes in regulation can facilitate industr...
[9 rows x 5 columns]
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   Titel                  9 non-null      object
 1   Link                   9 non-null      object
 2   Authoren               9 non-null      object
 3   Veröffentlichungsjahr  9 non-null      object
 4   Abstract               9 non-null      object
dtypes: object(5)
memory usage: 488.0+ bytes
Now you can use df.to_csv(...) to export your data.
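For example (the filename and options here are only an illustration):

df.to_csv('scholar_results.csv', index=False, encoding='utf-8')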
Answered By - Corralien