Issue
I am using the following code to scrape information from several result pages of Google Scholar using Selenium and Beautiful Soup.
I can print all the scraped information, but I can't save the results into one DataFrame for export.
How do I save the results (Title, Author, Link, Abstract) for each search result?
# Initialize the dataframe
data = {
    "Titel": [],
    "Link": [],
    "Authoren": [],
    "Veröffentlichungsjahr": [],
    "Abstract": []
}
df = pd.DataFrame(data)
# Path where the chromedriver is stored (locally)
PATH = '/Applications/chromedriver'
driver = webdriver.Chrome(PATH)
# Open the URL
driver.get('https://scholar.google.de/')
time.sleep(5)
# Find the search bar and fill it in
search = driver.find_element_by_id('gs_hdr_tsi')
search.send_keys('"circular economy"AND "Dlt" AND "Germany" AND "Sweden"')
time.sleep(5)
search.send_keys(Keys.RETURN)
## Number of results --> divided by 10 gives the number of clicks on "Weiter" (next)
Anzahl = driver.find_element_by_id('gs_ab_md').text
x=re.findall(r'\d+', Anzahl)[0]
# Click "Weiter" y times
y = int(int(x)/10)+1
print("Seitenanzahl:", y)
i=0
for i in range(2): #y
    # Barrier so that Selenium pauses until the results are loaded
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'lxml')
    for item in soup.select('[data-lid]'):
        try:
            print('----------------------------------------')
            # print(item)
            print(item.select('h3')[0].get_text())
            title = item.select('h3')[0].get_text()
            print(item.select('a')[0]['href'])
            link = item.select('a')[0]['href']
            print(item.select('.gs_a')[0].get_text())
            author = item.select('.gs_a')[0].get_text()
            txt = item.select('.gs_a')[0].get_text()
            print(re.findall(r'\d+', txt)[0])
            year = re.findall(r'\d+', txt)[0]
            print(item.select('.gs_rs')[0].get_text())
            abstract = item.select('.gs_rs')[0].get_text()
            data_2 = {
                "Titel": title,
                "Link": link,
                "Authoren": author,
                "Veröffentlichungsjahr": year,
                "Abstract": abstract
            }
            df_new = pd.DataFrame(data_2)
            df = df.append(df_new, ignore_index=True)
            print('----------------------------------------')
        except Exception as e:
            #raise e
            print('---')
    # Random wait time (1-14 seconds) before opening the next page, to avoid IP blocks
    w = random.randint(1,14)
    time.sleep(w)
    try:
        driver.find_element_by_link_text('Weiter').click()
    except:
        driver.quit()
    i += 1
Solution
Don't build the dataframe inside the loop. The strategy is to collect the records into a list of dictionaries and, at the end, create the dataframe in a single step. (As a side note, pd.DataFrame(data_2) with all-scalar values raises an error unless you pass an index, which is why the append version never produces a frame; appending row by row would also be slow, because each append copies the whole dataframe.)
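Here is a minimal sketch of that pattern, with made-up values and no Selenium, just to show the shape of the data:

import pandas as pd

records = []                      # one dict per scraped result
for i in range(3):                # stand-in for the scraping loop
    records.append({
        "Titel": f"Title {i}",
        "Link": f"https://example.org/{i}",
        "Authoren": "Some Author",
        "Veröffentlichungsjahr": "2021",
        "Abstract": "..."
    })

df = pd.DataFrame(records)        # build the frame once, at the end

Building the frame once at the end also sidesteps DataFrame.append, which is deprecated in recent pandas versions.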
New code (search for # <- HERE):
# Imports needed by this snippet
import re
import time
import random
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Path where the chromedriver is stored (locally)
PATH = '/Applications/chromedriver'
driver = webdriver.Chrome(PATH)
# Open the URL
driver.get('https://scholar.google.de/')
time.sleep(5)
# Find the search bar and fill it in
search = driver.find_element_by_id('gs_hdr_tsi')
search.send_keys('"circular economy"AND "Dlt" AND "Germany" AND "Sweden"')
time.sleep(5)
search.send_keys(Keys.RETURN)
## Number of results --> divided by 10 gives the number of clicks on "Weiter" (next)
Anzahl = driver.find_element_by_id('gs_ab_md').text
x=re.findall(r'\d+', Anzahl)[0]
# Click "Weiter" y times
y = int(int(x)/10)+1
print("Seitenanzahl:", y)
records = [] # <- HERE
for i in range(2): #y
    # Barrier so that Selenium pauses until the results are loaded
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'lxml')
    for item in soup.select('[data-lid]'):
        try:
            print('----------------------------------------')
            # print(item)
            print(item.select('h3')[0].get_text())
            title = item.select('h3')[0].get_text()
            print(item.select('a')[0]['href'])
            link = item.select('a')[0]['href']
            print(item.select('.gs_a')[0].get_text())
            author = item.select('.gs_a')[0].get_text()
            txt = item.select('.gs_a')[0].get_text()
            print(re.findall(r'\d+', txt)[0])
            year = re.findall(r'\d+', txt)[0]
            print(item.select('.gs_rs')[0].get_text())
            abstract = item.select('.gs_rs')[0].get_text()
            records.append({  # <- HERE
                "Titel": title,
                "Link": link,
                "Authoren": author,
                "Veröffentlichungsjahr": year,
                "Abstract": abstract
            })
            print('----------------------------------------')
        except Exception as e:
            #raise e
            print('---')
    # Random wait time (1-14 seconds) before opening the next page, to avoid IP blocks
    w = random.randint(1,14)
    time.sleep(w)
    try:
        driver.find_element_by_link_text('Weiter').click()
    except:
        driver.quit()
df = pd.DataFrame(records) # <- HERE
Output:
>>> df
Titel ... Abstract
0 Shifting infrastructure landscapes in a circul... ... … [Google Scholar] [CrossRef]; Kirchherr, J.; ...
1 Demand-supply matching through auctioning for ... ... … 12, 76131 Karlsruhe, Germany cPolitecnico di...
2 Using internet of things and distributed ledge... ... … The authors were able to show how a combinat...
3 The impact of Blockchain Technology on the Tra... ... … In the broader sense, Blockchain is a Distri...
4 Assessing the role of triple helix system inte... ... … depends upon the successful diffusion of sev...
5 Circular Digital Built Environment: An Emergin... ... … For example, when searching for articles rel...
6 [PDF][PDF] Phillip Bendix (Wuppertal Institute... ... … Stadtreinigung Hamburg (Germany): AI image …...
7 Waste Management–A Case Study of Producer Resp... ... … A similar study in Germany reported an inter...
8 Blockchain in the built environment and constr... ... … changes in regulation can facilitate industr...
[9 rows x 5 columns]
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   Titel                  9 non-null      object
 1   Link                   9 non-null      object
 2   Authoren               9 non-null      object
 3   Veröffentlichungsjahr  9 non-null      object
 4   Abstract               9 non-null      object
dtypes: object(5)
memory usage: 488.0+ bytes
Now you can use df.to_csv(...) to export your data.
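For example (the filename and options here are only an illustration):

df.to_csv('scholar_results.csv', index=False, encoding='utf-8')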
Answered By - Corralien