I am a huge fan of combat sports, with boxing in particular being my favourite. As much as it may appear to be a purely physical sport where your sole objective is to either outbox or knock out your opponent, it is far more strategic than one would expect and incorporates an element of psychology. Like a chess game, each punch thrown has to be calculated: recklessly overextending yourself might leave you more vulnerable to a counterpunch, while being overly passive and defensive might swing the momentum in your opponent’s favour and not earn you enough points to win the fight. If you let self-doubt sink in or are intimidated by your opponent, you have already lost the battle. On top of all this, you need to remain respectful of the sport and the life-threatening dangers it presents. In the words of Sugar Ray Leonard, 'you don't play boxing'.
While it may not be a ‘gentleman's sport’ on the face of it, I am always impressed by how the majority of professional boxers conduct themselves with incredible sportsmanship. I find this analogous to life, with the opponent being the challenges of life.
In light of this interest in boxing, for one of my Capstone projects I decided to build a web app that would show the end user the probabilities of different fight outcomes depending on the fighters selected. However, prior to this, I wanted to observe the data and answer some lingering questions I had. I was particularly interested in win rates between different age groups and fighting stances. I also wanted to get a clear view of the number of wins the boxers in my dataset had attained, juxtaposed with the number of bouts fought. Lastly, I wanted to build a list of the top 10 boxers in each division, based on the outcomes of the bouts fought by the boxers and the caliber of the opponents they beat. To answer these questions I decided to build an interactive dashboard with visualizations related to my questions. The interactivity would allow me to use weight class/division and gender as filters, tailoring the visualizations to boxers in the selected weight class and gender. This would serve as the exploratory data analysis part of my project.
Getting into the data
For this project, I acquired a list of 3843 boxers, each with a URL to their profile. This link included a unique identifier associated with each boxer.
data['id'] = data['players_links'].str.extract(r'(\d+)', expand=False)
In order to enrich the data I had on each fighter, I needed to extract this numerical unique identifier. I found an API on GitHub, built on Node.js, with a method that takes a boxer's unique ID and pulls all the data related to the bouts fought by that fighter: the referee's scorecards, the length of each bout, the outcome and other data that could potentially prove valuable for my project.
Using this list, I then created a function in Node.js that would read each unique identifier extracted from each fighter's profile URL, pass the ID to the API method that returned my desired output, and write the output to a new CSV. You can view all the code on my GitHub.
const fs = require('fs');
const csv = require('csv-parser');
// `boxrec` (the API client) and `getCookieJar` are not shown in this excerpt.

async function writeData() {
  const results = [];
  fs.createReadStream('C:\\Users\\User\\Documents\\GitHub\\Springboard Capstone BoxingPredictionWebApp\\boxingdata\\readdata.csv')
    .pipe(csv())
    .on('data', (data) => results.push(data))
    .on('end', async () => {
      const cookieJar = await getCookieJar();
      // Request every boxer's data in parallel, using the id column from the CSV.
      const promises = results.map((data) => boxrec.getPersonById(cookieJar, data.id));
      const fighters = await Promise.all(promises);
      fighters.forEach((fighter) => {
        let data = '';
        for (const key in fighter.output) {
          const value = fighter.output[key];
          // Arrays and nested objects are serialized as JSON; primitives are written as-is.
          if (typeof value === 'object') {
            data += JSON.stringify(value) + ',';
          } else {
            data += value + ',';
          }
        }
        // Trim any leading or trailing comma, then end the row.
        data = data.replace(/(^,)|(,$)/g, "");
        data += '\n';
        fs.appendFile('C:\\Users\\User\\Documents\\datatest.csv', data, (err) => {
          if (err) throw err;
        });
      });
    });
}
Since the API returns its output in JSON format, I was left with data that was not clean and a bit confusing. The column separation was not consistent: for certain boxers the 7th column appeared to hold the boxer's birth date, while in other cases it held the date of the boxer's first fight.
Observing and understanding the data and the nature of the output revealed a pattern. Each fight started with a date key, with the value of the key being the date a given fight was fought. I used this logic to split the columns up based on unique bouts a given fighter had fought.
df = pd.DataFrame(file_full[file_full.columns[0:]].apply(lambda x: ' '.join(x.astype(str)),axis=1))
df = pd.DataFrame(df[0].str.replace('date','dateday'))
#split each fight into a separate column
df_split = pd.DataFrame(df[0].str.split('date',expand=True))
def func(df,col):
    return df[col].str.extract('day(?P<day>.*?)firstBoxerRating(?P<firstBoxerRating>.*?)firstBoxerWeight(?P<firstBoxerWeight>.*?)judges(?P<JudgeID>.*?)links(?P<Links>.*?)location(?P<location>.*?)metadata(?P<metadata>.*?)numberOfRounds(?P<numberofrounds>.*?)outcome(?P<outcome>.*?)rating(?P<rating>.*?)referee(?P<referee>.*?)secondBoxer(?P<secondBoxer>.*?)secondBoxerLast6(?P<secondBoxerLast6>.*?)secondBoxerRating(?P<secondBoxerRating>.*?)secondBoxerRecord(?P<secondBoxerRecord>.*?)secondBoxerWeight(?P<secondBoxerWeight>.*?)titles(?P<titles>.*?){')
for i in range(1,len(df_split.columns)-1):
    df_split[['date'+str(i), 'firstBoxerRating'+str(i), 'firstBoxerWeight'+str(i), 'JudgeID'+str(i), 'Links'+str(i),
              'location'+str(i), 'metadata'+str(i), 'numberofrounds'+str(i), 'outcome'+str(i), 'rating'+str(i),
              'referee'+str(i), 'secondBoxer'+str(i), 'secondBoxerLast6'+str(i), 'secondBoxerRating'+str(i),
              'secondBoxerRecord'+str(i), 'secondBoxerWeight'+str(i), 'titles'+str(i)]] = func(df_split,i)
I proceeded to concatenate the columns in my dataset into one column. I reasoned that having all the data in one column would make it easier to extract groups of data matching the character patterns I specified and then create columns from them. I iterated through the columns, extracting the different patterns of characters into multiple unique columns. For each fight I had the date of the bout, the boxer's rating and weight, information related to the judges (judge names and scorecards), the location of the bout, metadata including the length of the fight, boxer aliases etc., the number of rounds fought, the outcome of the bout, information pertaining to the referee, the opponent's name, the results of their last 6 bouts, their rating, their entire boxing record at that point in time (when the bout occurred), their weight and any titles held by either boxer. For each of these attributes I had 85 columns, presumably because that is the most bouts fought by any boxer in my dataset.
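To make the splitting trick concrete, here is a minimal sketch on a made-up string. The sample text and the three fields it contains are invented for illustration and are far shorter than the real API output; only the rename, split and extract idea is taken from the code above.

```python
import re

# Made-up, heavily shortened stand-in for one boxer's concatenated row.
row = 'name Joe date 2019-01-05 outcome win rating 80 date 2019-06-01 outcome loss rating 60'

# Rename the key so that splitting on 'date' leaves a 'day' marker
# at the start of every fight chunk.
marked = row.replace('date', 'dateday')
chunks = marked.split('date')

# chunks[0] holds everything before the first fight; the rest are fights.
pattern = re.compile(r'day(?P<day>.*?)outcome(?P<outcome>.*?)rating(?P<rating>.*)')
fights = []
for chunk in chunks[1:]:
    match = pattern.search(chunk)
    if match:
        # Named groups become the column names for this fight.
        fights.append({key: value.strip() for key, value in match.groupdict().items()})

print(fights)
```

Each dictionary in `fights` corresponds to one bout, which is the same shape the numbered columns (`outcome1`, `outcome2`, ...) capture in the wide DataFrame.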
Further cleanup involved two processes that I had to repeat multiple times. The first was removing a character and converting the data to either a string or a float:
#cleanup first boxer rating
def remove_colon(col_name):
    return merged[col_name].str.replace(':','')
#weights need to be converted to float
def firstweight(col_name):
    a = remove_colon(col_name)
    return pd.to_numeric(a,errors='coerce')
#update first boxer rating columns
boxer_var = list_var('firstBoxerRating')
for i in boxer_var:
    merged[i] = remove_colon(i)
#update weight
weight_var = list_var('firstBoxerWeight')
for i in weight_var:
    merged[i] = firstweight(i)
The second was extracting a pattern of characters from each column. For a full breakdown of the cleanup process you can view my full code.
def split_time(col):
    # keep only the mm:ss style time found in the metadata text
    return merged[col].str.extract(r'(\d*:\d+)',expand=True)
times = list_var('metadata')
for i in times:
    merged[i] = split_time(i)
In order to find the win percentages between fighting stances, I decided to build a heatmap. For this, I reshaped my data from wide to long, limiting the data to the columns I needed. To put it simply, for each fight I looked at the two boxers' fighting stances, counted the number of wins for each stance, counted the total fights fought between each combination of stances, and divided the two to find what I defined as the 'win rate' by stance.
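The calculation described above can be sketched roughly as follows. This is not the code from my repository: the column names (`stance`, `opponent_stance`, `won`) and the six sample rows are invented, and it assumes the data is already in long format with one row per boxer per fight.

```python
import pandas as pd

# Invented long-format data: one row per boxer per fight.
fights = pd.DataFrame({
    'stance':          ['orthodox', 'southpaw', 'orthodox', 'orthodox', 'southpaw', 'southpaw'],
    'opponent_stance': ['southpaw', 'orthodox', 'orthodox', 'southpaw', 'southpaw', 'orthodox'],
    'won':             [1, 0, 1, 0, 1, 1],
})

# Wins and total fights for every stance pairing, then the ratio of the two.
grouped = fights.groupby(['stance', 'opponent_stance'])['won'].agg(wins='sum', total='count')
grouped['win_rate'] = grouped['wins'] / grouped['total']

# Pivot into a matrix: rows are the boxer's stance, columns the opponent's stance.
heatmap_data = grouped['win_rate'].unstack('opponent_stance')
print(heatmap_data)
```

The resulting matrix can be passed directly to a heatmap trace, with the row and column labels used as the axis categories.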
I used the same logic to build the dataset for another heatmap comparing win rates between different age groups. For the age groups I split fighter ages into 5-year intervals, creating age groups such as 20-25, 25-30 etc. I strongly expected to see a substantially higher win rate in the 30-35 and possibly 35-40 age ranges for the male heavyweight division. Anecdotally, these appear to be the age ranges in which most elite boxers peak. The code I wrote is available on my GitHub repository.
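One way to create such 5-year groups is with `pd.cut`. The sketch below uses invented ages, and the handling of boundary ages is an assumption: here each group includes its lower edge, so a 25-year-old lands in 25-30.

```python
import pandas as pd

ages = pd.Series([22, 27, 31, 34, 38, 25])

# Bin edges every 5 years and matching labels such as '20-25'.
edges = list(range(20, 45, 5))                      # 20, 25, 30, 35, 40
labels = [f'{low}-{low + 5}' for low in edges[:-1]]

# right=False makes each bin include its lower edge and exclude its upper edge.
age_group = pd.cut(ages, bins=edges, labels=labels, right=False)
print(age_group.tolist())
```

Ages outside the edges (under 20 or 40 and over in this sketch) would come back as missing values, so the range of edges should be widened to cover the full dataset.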
Since I wanted to show the top ten boxers by division, I needed to figure out how my custom rating would work. While I could have focused purely on wins and draws, applying a reward for each, I believed that I also needed to penalize a fighter's losses. I gave each fight an X10 reward for a win, an X5 reward for a tie and a -X10 penalty for a loss. I also wanted to factor in the caliber of the opponents a given fighter had beaten. While beating a lot of opponents is an amazing feat, there is a difference between beating an elite boxer and beating a journeyman.
opp_names = ['secondBoxer'+str(i) for i in range(1,85)]
outcome_cols = ['outcome'+str(i) for i in range(1,85)]
#remove quotation marks from secondboxer name
data[opp_names] = data[opp_names].astype(str).apply(lambda x: x.str.replace('"',''))
#get points for each opponent, if negative convert to zero
data[opp_names]=data[opp_names].apply(lambda x : x.map(dict(zip(topten.name,topten.total_points))))
data[opp_names] = data[opp_names].fillna(0)
data[opp_names] = data[opp_names].mask(data[opp_names] < 0, 0)
data[outcome_cols] = data[outcome_cols].astype(str).apply(lambda x: x.str.replace('"',''))
#add opp points to total points if outcome was a win
topten['total_points'] = (data[opp_names].where(data[outcome_cols].eq('win ').values, 0).sum(axis=1)/5) + (topten['total_points'])
For each opponent I returned their total score, using the reward system I explained in the previous paragraph. For boxers where there was no score (possibly because of missing data or because the boxer has not yet had a fight) or a negative score (because the boxer has lost more fights than s/he has won), I replaced the number with a 0. I then summed the opponents' scores where the outcome of the fight was a win (i.e. where the string in the columns listed in outcome_cols equals win), divided the sum by 5 and added it to the fighter's total score.
I divided the score by 5 because I didn't want the opponent's score to outweigh the scores calculated from wins and losses. I didn't want to create a scenario where beating a boxer with a 500 score would literally double one's overall score. However, if a given fighter has a history of beating high-caliber opponents, the caliber of their opponents would still push up their rating quite significantly.
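As a worked example of the arithmetic, here is a small sketch of the rating. It assumes the rewards translate to 10 points per win, 5 per draw and -10 per loss, which is my reading of the weights above, and the function names and sample record are made up for illustration.

```python
# Assumed point values for a win, a draw and a loss.
WIN_POINTS, DRAW_POINTS, LOSS_POINTS = 10, 5, -10
OPPONENT_DIVISOR = 5

def base_score(wins, draws, losses):
    """Score from the boxer's own record only."""
    return wins * WIN_POINTS + draws * DRAW_POINTS + losses * LOSS_POINTS

def total_score(wins, draws, losses, beaten_opponent_scores):
    """Own record plus one fifth of the scores of the opponents beaten."""
    # Missing (None) and negative opponent scores count as 0.
    bonus = sum(max(score or 0, 0) for score in beaten_opponent_scores) / OPPONENT_DIVISOR
    return base_score(wins, draws, losses) + bonus

# 30 wins, 1 draw, 2 losses: 300 + 5 - 20 = 285
# Beaten opponents worth 500, -40 (treated as 0), unknown (0) and 120: 620 / 5 = 124
print(total_score(30, 1, 2, [500, -40, None, 120]))  # 409.0
```

In this example the single 500-point opponent adds 100 points rather than 500, which is the effect the division by 5 is meant to have.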
The problem with my custom ratings, however, is that there are quite a few cases where I had missing data for some boxers, including some elite boxers. I will be continuously improving this dashboard to address this. Another limitation of this rating is that the score doesn't depend on when the fight occurred or on the opponent's score at that point in time. I simply look at the opponent's current score based on his/her overall wins, losses and draws.
Building the interactive dashboard
I opted to use Plotly's Dash package to aid in building my dashboard. The package incorporates Flask and React.js elements, making it a very powerful and aesthetically pleasing package for building interactive dashboards.
For each visualization I declared the inputs and outputs through a callback decorator. The inputs are either filters, such as division and gender, or pictures for the image carousel I created to appear at the top of my dashboard.
@app.callback(
    dash.dependencies.Output('total-bouts-v-bouts-won', 'figure'),
    [dash.dependencies.Input('weight_class', 'value'),
     dash.dependencies.Input('gender','value')])
Below the decorator, I created functions to update the visualizations based on input selections made by the user.
def update_scatterplot(weight_class,gender):
    # fall back to every division / gender when nothing is selected
    if weight_class is None or weight_class == []:
        weight_class = WEIGHT_CLASS
    if gender is None or gender == []:
        gender = GENDER
    weight_df = data[(data['division'].isin(weight_class))]
    weight_df = weight_df[(weight_df['sex'].isin(gender))]
    return {
        'data': [
            go.Scatter(
                x=weight_df['bouts_fought'],
                y=weight_df['w'],
                text=weight_df['name'],
                mode='markers',
                opacity=0.5,
                marker={
                    'size': 14,
                    'line': {'width': 0.5, 'color': 'blue'}
                },
            )
        ],
        # layout settings are omitted from this excerpt
    }
For example, the scatterplot shows wins against total bouts fought for every fighter in the dataset; selecting a division and/or gender updates the graph to reflect the user's selection.
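The filtering step inside the callback can be tried on its own, without running a Dash app. The sketch below is a simplified stand-in: `filter_boxers` and the sample rows are hypothetical, and instead of the `WEIGHT_CLASS` and `GENDER` constants it falls back to whatever values are present in the data.

```python
import pandas as pd

def filter_boxers(df, weight_class=None, gender=None):
    """Return the rows matching the selected divisions and genders."""
    # An empty or missing selection means "no filter": fall back to every value present.
    if not weight_class:
        weight_class = list(df['division'].unique())
    if not gender:
        gender = list(df['sex'].unique())
    return df[df['division'].isin(weight_class) & df['sex'].isin(gender)]

# Invented sample rows using the column names from the callback above.
sample = pd.DataFrame({
    'name': ['Boxer A', 'Boxer B', 'Boxer C'],
    'division': ['heavyweight', 'welterweight', 'heavyweight'],
    'sex': ['male', 'female', 'female'],
    'bouts_fought': [40, 22, 15],
    'w': [36, 20, 9],
})

heavyweights = filter_boxers(sample, weight_class=['heavyweight'])
female_heavyweights = filter_boxers(sample, ['heavyweight'], ['female'])
everyone = filter_boxers(sample, [], None)
print(heavyweights['name'].tolist(), female_heavyweights['name'].tolist(), len(everyone))
```

Because a multi-select dropdown passes a list (or nothing at all), treating an empty selection as "show everything" keeps the chart populated before the user has picked a filter.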
Creating these filters involved using some of Dash's core components, specifically the multi-select dropdown component, to create drop-down menus that allow people to select one or multiple options.
dcc.Dropdown(
    id='weight_class',
    options=[{'label': i, 'value': i} for i in data['division'].unique()],
    multi=True
),
dcc.Dropdown(
    id='gender',
    options=[{'label': i, 'value':i} for i in fight_outcomes['sex'].unique()],
    multi=True
),
Dash also lets you use HTML components. I used a few of these to position my image carousel at the center of the top part of my dashboard and to control the pace at which the images change on the carousel.
html.Section(id='slideshow',children=[
    html.Div(style={'backgroundColor':colors['background'],
                    'textAlign':'center'},id='slideshow-container',children=[
        html.Div(id='image'),
        dcc.Interval(id='interval',interval=3000),
    ])
])
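The `dcc.Interval` component increments its `n_intervals` property every 3000 ms, and a callback can use that counter to decide which picture to show. The selection logic might look something like the sketch below; the file names and the `pick_image` helper are hypothetical, and the Dash callback wiring itself is left out.

```python
# Hypothetical image file names for the carousel.
IMAGES = ['fight1.jpg', 'fight2.jpg', 'fight3.jpg']

def pick_image(n_intervals, images=IMAGES):
    """Return the image to display for the current interval tick."""
    # Treat a missing counter (before the first tick) as the first image.
    if n_intervals is None:
        return images[0]
    # The modulo wraps the counter so the carousel loops forever.
    return images[n_intervals % len(images)]

print([pick_image(n) for n in range(5)])
```

Inside the dashboard, a callback taking the interval's `n_intervals` as input and the `image` Div's `children` as output could wrap the returned file name in an `html.Img`.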
A slight issue I encountered was the ordering of the axes on my heatmap. To ensure that the x and y axes were consistent and in the right order, I custom sorted them using categoryarray, which let me set the axis order from a list I defined. You can get a full view of the code I used to build these visualizations here.
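A minimal sketch of that axis setup is shown below, written as the plain dictionaries Plotly accepts for a figure layout. The age-group labels and the `ordered_axis` helper are illustrative assumptions rather than the exact code from my dashboard.

```python
# The order in which the categories should appear on both axes.
age_order = ['20-25', '25-30', '30-35', '35-40']

def ordered_axis(order):
    """Axis settings that force categories to follow the given list."""
    return {
        'type': 'category',
        'categoryorder': 'array',      # use the explicit list below instead of trace order
        'categoryarray': list(order),
    }

# Applying the same settings to both axes keeps the heatmap axes consistent.
layout = {'xaxis': ordered_axis(age_order), 'yaxis': ordered_axis(age_order)}
print(layout['xaxis'])
```

This `layout` dictionary can be returned alongside the heatmap trace in the figure a callback produces.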
In order to make my very minimally designed dashboard visible to all, I opted to deploy my dashboard on Heroku. Feel free to send me any feedback, suggestions or comments on Twitter @emmoemm