Monday, December 11, 2023

[FIXED] Formatting of pandas dataframe and ranking per section rather than whole df

Labels: beautifulsoup, database, dataframe, pandas, python

Issue

I have a scraper for a sports site. Everything runs well, but I need help with outputting the data. A snippet of my current code is:

for item in course_racecard:
    for link in a_tag:
        race_link = times.find_previous('a')['href']
        full_url = urljoin(url, race_link)

        for item in entries:
            for horse_info in item.find_all(class_='horse_extended'):
                li = item.find_all("li")
                trainer = li[0].text.strip()
                jockey = li[1].text.strip()

                data.append({
                    "Date": date,
                    "Course": race_course,
                    "Times": times.text,
                    #"URL": full_url,
                    "Title": race_title_simple,
                    "Distance": distance,
                    "Runners": number_of_runners,
                    "Age Group": age_group,
                    "Class": race_class_simple,
                    "Going": going,
                    "Surface": surface,
                    "Horse": horse_name,
                    "Trainer": trainer,
                    "Jockey": jockey,
                    "Age": age_simple,
                    "Weight": weight,
                    "Official Rating": rating_simple,
                    "Form": simple_horse_form,
                    "Exchange Back Odds": odds_int,
                    "weight in stone": weight_in_stone,
                    "Timeform Favourites": "",
                })

                df = pd.DataFrame(data)

                df['Going'] = df['Going'].str.replace('Going: ', '')
                df['Surface'] = df['Surface'].str.replace('Surface: ', '')
                df['Distance'] = df['Distance'].str.replace('Distance:  ', '')
                df['Age Group'] = df['Age Group'].str.replace('Age:  ', '')
                df['Runners'] = df['Runners'].str.replace('Runners:  ', '')

                df['Odds Ranking'] = df['Exchange Back Odds'].rank(ascending=False)
                df['Official Rating Ranking'] = df['Official Rating'].rank(ascending=True)
                df['Weight Ranking'] = df['weight in stone'].rank(ascending=True)

The whole scraper takes all of the day's racing from multiple URLs and outputs to a CSV that looks like this:

[Image: CSV current]

However I would like an output like this:

[Image: CSV desired]

The difference is that after each race I would like a blank line, followed by the column / df headers repeated.

Also, in my code you will see that I rank some of the columns, and I want to do this per race. The code worked fine when it was running on one race at a time, but now that it is importing a large number of races it ranks the data across all of the rows rather than per race. For example, the horse with the lowest odds having the highest rank is supposed to be decided per race, but it is now decided across all races.

This is what it was like with an individual race:

[Image: Ranking on individual race]

And now with multiple races:

[Image: Ranking on multiple races]

With regard to the blank line and the repeated column titles, I have tried to create a new row like:

new_row = {
    "Title": "",
    "Distance": "",
    "Runners": "",
    "Age Group": "",
    "Class": "",
}

df2 = df.append(new_row, ignore_index=True)

print(df2)

in the hope that it would at least break the races up, but it just adds a gap at the very end of the df rather than at the end of each race. I have tried changing where this is placed and still had no luck: it just adds it to the end.

With regard to the ranking, I have tried moving the df statements around, including placing them inside the for loop and after it, to see if that changes anything, but no luck there either: it still ranks the whole data frame rather than individual races.

Any help would be much appreciated. Thank you.

Solution

Sorting the values by odds before grouping ensures that the rows are in the correct order within each group. Each group is then converted to a NumPy array, and np.vstack() combines everything together.

Mock data (note that the column names are a bit different to your pictures, but that should be straightforward for you to change):

import pandas as pd
import numpy as np

# mock data
horse_data = {
    'race_id': np.random.randint(1, 10, 100),
    'horse_id': [i for i in range(101, 201)],
    'horse_name': ['Horse '+chr(ord('A')+i) for i in np.random.randint(0, 26, 100)],
    'weight': np.random.randint(50, 65, 100),
    'age': np.random.randint(3, 7, 100),
    'odds': np.random.uniform(2, 5, 100).round(2)
}
df = pd.DataFrame(horse_data)

Combining output:

# if adding rankings across all groups, add these here:
# df = df.assign(ranking_col=df["col"].rank(ascending=False))

# initial placeholder row so np.vstack() has something to stack onto
arr = np.zeros((1, len(df.columns)+1))  # +1 because 1 additional ranking column is added in the loop.

# sort values by odds so each race is in order, then loop through each group (race id)
for i, j in df.sort_values("odds", ascending=False).groupby("race_id"):
    # add a ranking column
    j = j.assign(odds_ranking=j["odds"].rank(ascending=False))
    # additional rankings within each group/race:
    # j = j.assign(ranking_col_name=j["ranked_col"].rank(ascending=False))
    # append a row of 0s, then the column names, then the data to arr
    arr = np.vstack([arr, np.zeros((1, len(j.columns))),
                     np.array(j.columns), j.to_numpy()])

arr = arr[2:, :]  # drop the placeholder row and the first row of 0s
df1 = pd.DataFrame(arr)  # to DataFrame
# set the rows that are entirely 0 to None so they are 'blank';
# .all() rather than .any() so a data row that happens to contain a 0 is kept
df1.loc[df1.eq(0).all(axis=1)] = None

If you then use df1.to_clipboard(index=False, header=False), you can paste the output in Excel and see that there are blank rows and headers above each group.
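As a side note on the per-race rankings the question asks about, pandas can also restart a ranking within each group without an explicit loop. The following is a minimal sketch rather than part of the original answer: it assumes the mock column names used above (race_id, odds, weight), the values are made up, and the ranking column names are chosen only for illustration.

```python
import pandas as pd

# Small hand-written example; the values are made up for illustration.
races = pd.DataFrame({
    "race_id": [1, 1, 1, 2, 2],
    "horse_name": ["Horse A", "Horse B", "Horse C", "Horse D", "Horse E"],
    "odds": [2.5, 4.0, 3.0, 5.0, 2.0],
    "weight": [60, 55, 58, 52, 61],
})

# Calling rank() on a groupby restarts the ranking within each race_id,
# instead of ranking across the whole DataFrame.
races["odds_ranking"] = races.groupby("race_id")["odds"].rank(ascending=False)
races["weight_ranking"] = races.groupby("race_id")["weight"].rank(ascending=True)

print(races)
```

With the real scraper output, the grouping key would presumably need to identify a single race, for example a combination of columns such as "Course" and "Times"; that choice is an assumption and depends on the data.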

Exporting to a .csv:

df1.to_csv("racing_odds.csv", index=False, header=False)
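If the NumPy stacking feels heavy, a similar layout can probably be produced by writing each race to the same file handle in turn. This is a sketch under assumptions, not the answer's method: the helper name write_races and the sample values are hypothetical, and it relies only on groupby and on DataFrame.to_csv accepting an open text handle.

```python
import io

import pandas as pd


def write_races(df, handle, group_col="race_id"):
    """Write one block per race: a header row, then that race's rows,
    with a blank line separating consecutive races."""
    for n, (_, race) in enumerate(df.groupby(group_col)):
        if n:
            handle.write("\n")  # blank line before every race except the first
        race.to_csv(handle, index=False)  # header is repeated for each race


# Made-up sample data using the mock column names from above.
races = pd.DataFrame({
    "race_id": [1, 1, 1, 2, 2],
    "horse_name": ["Horse A", "Horse B", "Horse C", "Horse D", "Horse E"],
    "odds": [2.5, 4.0, 3.0, 5.0, 2.0],
})

# Sort by odds first so each race comes out in order, as in the answer above.
ordered = races.sort_values("odds", ascending=False)

buffer = io.StringIO()  # in-memory stand-in for a file
write_races(ordered, buffer)
csv_text = buffer.getvalue()
print(csv_text)

# To write to disk instead:
# with open("racing_odds.csv", "w", newline="") as fh:
#     write_races(ordered, fh)
```

Because each block is written with its own header, every column keeps its name in the output and no placeholder rows of zeros are needed.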

Answered By - Rawson

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
