THE GOAL

For this project, I decided to challenge myself a little by not using a dataset that I found on Kaggle. To find a new dataset, I did the reasonable thing and began some frantic Google searches. What I found is that there is a plethora of well-formatted data out there; most of it just lives on web pages rather than in a nice and easy Excel file. With this project I hope to familiarize myself with the ins and outs of web scraping and clean up a dataset for exploration and visualization.

As this post was getting extremely long, I elected to split it into two parts. This post covers everything EXCEPT the visualizations of the data. I will be making a separate post for the visualization and interpretation of our data to keep things clean.

THE DATASET

For this assignment, the first order of business is to decide what to look at. I’m a big soccer fan so I figured I would take a look at something in relation to that. I’m sure that if you give me enough soccer data, I can find something that I would find interesting inside of it so let’s work on getting that data.

After some research, I decided to use the wonderful resource <sofifa.com>. This site is one big table that contains the FIFA stats for every football player currently in the FIFA system, which is a lot. The first thing we will do, as always, is prepare our standard visualization and cleaning packages.

import re
import json
import requests
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")

WEB-SCRAPING

Now we are going to import one more package with the goal of web scraping. requests and json were already imported above, so only BeautifulSoup is new.

from bs4 import BeautifulSoup

Now we want to familiarize ourselves with the website a little. Looking at the homepage, we can see that every player has some basic information on display: their name, age, overall rating, and potential, as well as their team and wage. When we click on an individual player, we get a much more thorough breakdown of their individual statistics. There is even a radar plot, which we will keep in mind for later.

It is important to note that this site only displays 60 players per page. If we wanted more players, we would have to scrape the next page as well. For this project, I simply want to create a sizable dataframe, and 60 players is plenty. I elected to filter the page to the Premier League and sort by overall rating, so we will populate our dataframe with the 60 highest-rated Premier League players. But we are getting ahead of ourselves.
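If more than 60 players were ever needed, one way to reach the later pages would be to change the offset value at the end of the listing URL used later in this post. Here is a minimal sketch, assuming that offset counts players and therefore grows in steps of 60 (the site's paging behavior is not confirmed here). The names page_urls and PLAYERS_PER_PAGE are hypothetical.

```python
from urllib.parse import urlencode

PLAYERS_PER_PAGE = 60  # page size stated in this post


def page_urls(n_pages, league_id=13):
    """Build listing URLs for the first n_pages pages.

    Assumption: the offset parameter counts players, so page k starts at k * 60.
    """
    urls = []
    for page in range(n_pages):
        query = urlencode({
            "type": "all",
            "lg[0]": league_id,  # urlencode escapes the brackets as %5B and %5D
            "offset": page * PLAYERS_PER_PAGE,
        })
        urls.append("https://sofifa.com/players?" + query)
    return urls


for url in page_urls(3):
    print(url)
```

The first URL produced is the same one this post scrapes; the others only differ in the offset.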

Some websites use anti-bot software that can deny our request. Running a plain requests.get() call here will result in our request being denied. To get around this, we will add a little extra code compared to a standard web-scraping request.

base_url = "https://sofifa.com/players?type=all&lg%5B0%5D=13&offset=0"

req = requests.Session()
page = req.get(base_url, headers = {
  "User-Agent": "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.3; Win64; x64)"
})
soup = BeautifulSoup(page.content, "html.parser")

The main things to note are the use of requests.Session(), which persists cookies across requests, and the headers argument in req.get(), which sends a browser-like User-Agent string so that the site treats the request as coming from a browser rather than a program.
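To see what the header actually changes, the sketch below prepares the same request twice without sending anything over the network and prints the User-Agent each version would carry. This only inspects the outgoing request; whether a given site accepts it is a separate question. The variable names are illustrative.

```python
import requests

url = "https://sofifa.com/players?type=all&lg%5B0%5D=13&offset=0"
browser_headers = {
    "User-Agent": "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.3; Win64; x64)"
}

session = requests.Session()

# prepare_request builds the outgoing request (merging in the session's default headers)
# but does not send it, so this runs offline
default_request = session.prepare_request(requests.Request("GET", url))
custom_request = session.prepare_request(
    requests.Request("GET", url, headers=browser_headers)
)

print(default_request.headers["User-Agent"])  # the library's default, "python-requests/<version>"
print(custom_request.headers["User-Agent"])   # the browser-like string we supplied
```

Without the custom header, the request openly identifies itself as coming from the requests library, which is an easy thing for a site to filter on.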

DATA STRUCTURE NAVIGATION

Now that we can actually scrape and get our data, we want to look at the page structure to determine where our player data lies. In this case, it is stored in the table body, a tag called tbody, so we want Python to grab that tag and give us everything it holds.

player_data = soup.tbody

I’ll save you the trouble of looking at the raw soup text, but looking through it we can see that it contains all the stats for each player that we could see on the main page; it’s just all jumbled up. The next step is to sort the data into a dictionary that is easy to navigate and interpret.

By looking through this, we can see that each player’s stats start and end with a table row tag, tr. We can use this to tell Python exactly when to stop looking at one player and move on to the next. We can automate this for every player in the list with a loop.
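Before the full loop, here is the same idea on a tiny scale. The HTML below is a made-up fragment, not the real page; only the tbody/tr layout and the "col col-ae" class name are borrowed from this post.

```python
from bs4 import BeautifulSoup

# Hypothetical table: a header row plus two player rows
html = """
<table>
  <thead><tr><th>Name</th><th>Age</th></tr></thead>
  <tbody>
    <tr><td class="col-name">Player A</td><td class="col col-ae">31</td></tr>
    <tr><td class="col-name">Player B</td><td class="col col-ae">21</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# soup.tbody is the first <tbody> tag; searching inside it skips the header row
rows = soup.tbody.find_all("tr")

# One <tr> per player, so each field is looked up inside its own row
ages = [row.find("td", class_="col col-ae").get_text() for row in rows]

print(len(rows))
print(ages)
```

The full loop for the real page applies this row-by-row lookup to every field we want to keep: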

player_dict={ #Create empty dictionary with all the stats to capture, will later be converted to a Pandas dataframe
  "last_name": [],
  "position": [],
  "age": [],
  "overall": [],
  "potential": [],
  "team": [],
  "contract": [],
  "value": [],
  "wage": [],
  "stats": [],
  "id": []
} 

player_list = player_data.find_all("tr") #Separate each player/stats into a different entry in a list
for player in player_list:
  player_dict["last_name"].append(player.find("div", class_="ellipsis").get_text())
  
  #Some players have multiple positions: find every tag whose class includes "pos", get the text of each, then join into one string
  positions = [tag.get_text(strip=True) for tag in player.find_all(class_=re.compile("pos"))]
  player_dict["position"].append(" ".join(positions))
  player_dict["age"].append(player.find("td", class_="col col-ae").get_text())
  player_dict["overall"].append(int(player.find("td", class_="col col-oa col-sort").get_text()))
  player_dict["potential"].append(player.find("td", class_="col col-pt").get_text())
  player_dict["team"].append(player.find(href=re.compile("/team/")).get_text())
  player_dict["contract"].append(player.find("div", class_="sub").get_text())
  player_dict["value"].append(player.find("td", class_="col col-vl").get_text())
  player_dict["wage"].append(player.find("td", class_="col col-wg").get_text())
  player_dict["stats"].append(player.find("span", class_="bp3-tag p").get_text()) #Not sure what this is for
  player_dict["id"].append(player.img["id"])

Essentially all we are doing here is creating a list where each entry is a player we have scraped data for, then iterating through the list and saving the data for each player in a dictionary. This dictionary will later be turned into a Pandas dataframe.

players_dataframe=pd.DataFrame(player_dict)
print(players_dataframe.head(5))

      last_name position age  overall potential               team  \
0  K. De Bruyne   CM CAM  31       91        91    Manchester City   
1    E. Haaland       ST  21       90        94    Manchester City   
2      M. Salah       RW  30       89        89          Liverpool   
3       Alisson       GK  29       89        90          Liverpool   
4      Casemiro      CDM  30       89        89  Manchester United   

        contract    value   wage stats      id  
0  \n2015 ~ 2025  €107.5M  €350K  2299  192985  
1  \n2022 ~ 2027  €176.5M  €240K  2144  239085  
2  \n2017 ~ 2025   €99.5M  €260K  2208  209331  
3  \n2018 ~ 2027     €79M  €190K  1437  212831  
4  \n2022 ~ 2026     €86M  €240K  2251  200145

This looks pretty good! There are obviously some things that are slightly wrong with our data frame though, so let’s move on to some cleaning!

DATA CLEANING

Now that we have some of our data in a nice-looking data frame, let’s work on cleaning it up a little, not just for the aesthetics, but also for the practicality of having a nice, clean data frame to work with.

#Code to clean up the names: drop the leading initial ("K. De Bruyne" -> "De Bruyne"); names without ". " are left unchanged
players_dataframe["last_name"] = players_dataframe["last_name"].str.split(". ", n=1, regex=False).str[-1]
  
#Code to clean up positions
players_dataframe[["position_1", "position_2", "position_3"]] = players_dataframe.position.str.split(" ", expand = True)
players_dataframe = players_dataframe.drop(columns="position")

#Code to clean up contract column; We don't (yet) have a way to determine contract specifics if a player is on loan, so loans get no special handling here
players_dataframe["contract"] = players_dataframe["contract"].str.removeprefix("\n")
players_dataframe[["contract_start", "contract_end"]] = players_dataframe["contract"].str.split(" ~ ", expand = True)
players_dataframe = players_dataframe.drop(columns="contract")

#Strip currency symbols & suffix from wage and value columns, convert to total amounts
def to_amount(text):
  text = text.removeprefix("€")
  if text.endswith("M"):
    #Multiply before converting to a whole number so that 107.5M becomes 107500000, not 107000000
    return round(float(text.removesuffix("M")) * 1000000)
  if text.endswith("K"):
    return round(float(text.removesuffix("K")) * 1000)
  return int(float(text)) #No suffix: assume the text is already a plain number

players_dataframe["value"] = players_dataframe["value"].apply(to_amount)
players_dataframe["wage"] = players_dataframe["wage"].apply(to_amount)

print(players_dataframe.head(5))

   last_name age  overall potential               team      value    wage  \
0  De Bruyne  31       91        91    Manchester City  107500000  350000   
1    Haaland  21       90        94    Manchester City  176500000  240000   
2      Salah  30       89        89          Liverpool   99500000  260000   
3    Alisson  29       89        90          Liverpool   79000000  190000   
4   Casemiro  30       89        89  Manchester United   86000000  240000   

  stats      id position_1 position_2 position_3 contract_start contract_end  
0  2299  192985         CM        CAM       None           2015         2025  
1  2144  239085         ST       None       None           2022         2027  
2  2208  209331         RW       None       None           2017         2025  
3  1437  212831         GK       None       None           2018         2027  
4  2251  200145        CDM       None       None           2022         2026
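One thing the printed frame hints at is that age and potential are still stored as text, since only overall was wrapped in int() during scraping (the same goes for stats). A minimal sketch of converting such columns, shown on a small stand-in frame rather than the real one:

```python
import pandas as pd

# Small stand-in frame with the same mix of text and integer columns as the scraped one
sample = pd.DataFrame({
    "last_name": ["De Bruyne", "Haaland"],
    "age": ["31", "21"],        # text, as returned by get_text()
    "overall": [91, 90],        # already integers
    "potential": ["91", "94"],  # text
})

# pd.to_numeric parses the strings into numbers; passing errors="coerce"
# would turn unparsable text into NaN instead of raising an error
for column in ["age", "potential"]:
    sample[column] = pd.to_numeric(sample[column])

print(sample.dtypes)
```

Doing this before any plotting avoids surprises such as ages being sorted alphabetically instead of numerically.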

Alright. So now we have some preliminary information on each player. But the real picture of their playstyle only shows up in the details. We want to get into the nitty-gritty for each of these players and see what picture their stats and abilities paint. Let’s see if we can use that unique identifier we saved earlier to access all the data for each player.

#Rescrape each player's separate stat page
fifa_stats = { #Create new dictionary to store the fifa values of players different stats, later merge with original df
  "crossing": [],
  "finishing": [],
  "heading_accuracy": [],
  "short_passing": [],
  "volleys": [],
  "dribbling": [],
  "curve": [],
  "fk_accuracy": [],
  "long_passing": [],
  "ball_control": [],
  "acceleration": [],
  "sprint_speed": [],
  "agility": [],
  "reactions": [],
  "balance": [],
  "shot_power": [],
  "jumping": [],
  "stamina": [],
  "strength": [],
  "long_shots": [],
  "aggression": [],
  "interceptions": [],
  "positioning": [],
  "vision": [],
  "penalties": [],
  "composure": [],
  "defensive_awareness": [],
  "standing_tackle": [],
  "sliding_tackle": [],
  "gk_diving": [],
  "gk_handling": [],
  "gk_kicking": [],
  "gk_positioning": [],
  "gk_reflexes": []
}
req = requests.Session() #Reuse one session for every player page
for player_id in players_dataframe.id:
  #Web-Scrape to get fifa stats from webpage
  base_url = "https://sofifa.com/player/" + player_id
  page = req.get(base_url, headers = {
    "User-Agent": "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.3; Win64; x64)"
  })
  soup = BeautifulSoup(page.content, "html.parser")
  
  #After successful web scrape, grab all stats from page, keep only the ones we need, append values to fifa dict
  player_stats = soup.find_all(class_=re.compile("bp3-tag p p-"))
  #Pages have 66 (or, for some players, 65) matching tags; either way the stats we want are the last 34, in the same order as the fifa_stats keys
  player_stats = player_stats[-len(fifa_stats):]
  for stat, tag in zip(fifa_stats, player_stats):
    fifa_stats[stat].append(int(tag.get_text()))

fifa_stats_df = pd.DataFrame(fifa_stats) #Build the dataframe once, after the loop has finished

#Now we can join together these two dataframes
player_stats_df = pd.concat([players_dataframe, fifa_stats_df], axis=1, join="inner")
print(player_stats_df.head(5))

   last_name age  overall potential               team      value    wage  \
0  De Bruyne  31       91        91    Manchester City  107500000  350000   
1    Haaland  21       90        94    Manchester City  176500000  240000   
2      Salah  30       89        89          Liverpool   99500000  260000   
3    Alisson  29       89        90          Liverpool   79000000  190000   
4   Casemiro  30       89        89  Manchester United   86000000  240000   

  stats      id position_1  ... penalties composure defensive_awareness  \
0  2299  192985         CM  ...        83        88                  66   
1  2144  239085         ST  ...        84        87                  41   
2  2208  209331         RW  ...        81        90                  38   
3  1437  212831         GK  ...        23        66                  15   
4  2251  200145        CDM  ...        66        85                  90   

  standing_tackle  sliding_tackle  gk_diving  gk_handling  gk_kicking  \
0              66              53         15           13           5   
1              53              29          7           14          13   
2              43              41         14           14           9   
3              19              16         86           85          85   
4              89              88         13           14          16   

   gk_positioning  gk_reflexes  
0              10           13  
1              11            7  
2              11           14  
3              90           89  
4              12           12  

[5 rows x 48 columns]
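The concat call lines the two frames up by their row index, which works here because both were built in the same player order with the default 0, 1, 2, ... index. The sketch below uses made-up frames with a deliberately mismatched index to show what join="inner" does when that assumption fails.

```python
import pandas as pd

# Hypothetical frames; the numbers are placeholders, not real stats
left = pd.DataFrame({"last_name": ["Player A", "Player B", "Player C"]})  # index 0, 1, 2
right = pd.DataFrame({"crossing": [10, 20, 30]}, index=[0, 1, 5])         # index 0, 1, 5

# axis=1 places the frames side by side; join="inner" keeps only index labels present in both
joined = pd.concat([left, right], axis=1, join="inner")

print(joined)
```

Only the rows labeled 0 and 1 survive. If either frame had been filtered or re-sorted before the join, calling reset_index(drop=True) on both first would be one way to keep the rows aligned.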

Looking good! Now we have a nicely sized dataframe that stores all the data and stats of our scraped players. Since it is cleaned and ready for use, we can now move on to our visualizations! Check my other post for part 2 of this process.