Dynamic EDA for Qatar World Cup Teams | by Destin Gong | Dec, 2022

By Jessie Hobb On Dec 14, 2022

How to Use Plotly for More Insightful and Interactive Data Explorations

Dynamic EDA for World Cup Teams (image by author)

This article introduces Plotly [1], a tool that takes data visualization and exploratory data analysis (EDA) to the next level. You can use this open source graphing library to make your notebook more aesthetic and interactive, whether you are a Python or R user. To install Plotly, use the command !pip install --upgrade plotly.

We will use the “Historical World Cup Win Loose Ratio Data [2]” to analyze the national teams that participated in the Qatar World Cup 2022. The dataset contains the win, loss and draw ratios for each “country1-country2” pair, as shown below. For example, the first row tells us that among the 7 games played between Argentina and Australia, Argentina's ratios of wins, losses and draws were 0.714286, 0.142857 and 0.142857 respectively.
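
As a quick sanity check on how these ratios relate to match counts, the sketch below assumes the 7 games split into 5 wins, 1 loss and 1 draw for Argentina. These counts are not given in the dataset description; they are simply the only whole-number split that reproduces the quoted ratios.

```python
# Assumed match counts (not stated in the dataset description): the only
# integer split of 7 games that reproduces the ratios quoted above.
games = 7
wins, losses, draws = 5, 1, 1

ratios = {
    "wins": wins / games,
    "losses": losses / games,
    "draws": draws / games,
}

for name, value in ratios.items():
    print(f"{name}: {value:.6f}")

# The three ratios of one row should sum to 1 (up to floating-point error).
total = sum(ratios.values())
```

The same check (wins + losses + draws close to 1) is a cheap way to spot malformed rows before plotting.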

import pandas as pd
import plotly.express as px

df = pd.read_csv('/kaggle/input/qatar2022worldcupschudule/historical_win-loose-draw_ratios_qatar2022_teams.csv')

In this exercise, we will use a box plot, bar chart, choropleth map and heatmap for data visualization and exploration. We will also introduce the Pandas functions that go hand in hand with these visualization techniques, including:

aggregation: df.groupby()
sorting: df.sort_values()
merging: df.merge()
pivoting: df.pivot()

Box Plot — Wins Ratio by Country (image by author)

The first exercise is to visualize each country's wins ratio when playing against other countries. To do this, we use a box plot to show the distribution of wins ratios for each country, colored by the country's continent. Hover over the data points to see detailed information, and zoom in on a box to see its max, q3, median, q1 and min values.
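
To make the five numbers behind each box concrete, here is a minimal sketch that computes them with Pandas for one hypothetical team. The values are toy data, and Pandas' default linear interpolation may not match the quartile rule Plotly uses, so the hover values on the real plot can differ slightly.

```python
import pandas as pd

# Toy wins ratios for one hypothetical team against five opponents.
wins = pd.Series([0.2, 0.4, 0.5, 0.7, 0.9], name="wins")

# The five summary statistics a box plot is built from.
summary = {
    "min": wins.min(),
    "q1": wins.quantile(0.25),
    "median": wins.median(),
    "q3": wins.quantile(0.75),
    "max": wins.max(),
}
print(summary)
```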

Let’s break down how we built the box plot step by step.

1. Get Continent Data

From the original dataset, we take the “wins” field grouped by “country1” to see how the wins ratio varies within a country compared with across countries. To explore whether the wins ratio differs by continent, we bring in the “continent” field from the Plotly built-in dataset px.data.gapminder().

geo_df = px.data.gapminder()

(Here I am using “continent” as an example. Feel free to play around with “lifeExp” and “gdpPercap” as well.)

Since only the continent information is needed, we select the “country” and “continent” columns and use drop_duplicates() to keep the distinct rows.

continent_df = geo_df[['country', 'continent']].drop_duplicates()
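
The effect of drop_duplicates() is easier to see on a small table. The sketch below uses a toy stand-in for the gapminder data (an assumption for illustration, with a made-up "year" column standing in for the repeated rows per country).

```python
import pandas as pd

# Toy stand-in for px.data.gapminder(): several rows per country.
geo_df = pd.DataFrame({
    "country": ["Argentina", "Argentina", "Australia", "Australia"],
    "continent": ["Americas", "Americas", "Oceania", "Oceania"],
    "year": [2002, 2007, 2002, 2007],
})

# Keep only the two columns of interest, then collapse repeated rows.
continent_df = geo_df[["country", "continent"]].drop_duplicates()
print(continent_df)

# One row per country keeps a later merge from multiplying rows.
is_unique = continent_df["country"].is_unique
```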

We then merge continent_df with the original dataset df to get the continent information. If you have used SQL before, you will be familiar with joining tables. df.merge() works the same way, matching rows on the common field, which is “country1” in df and “country” in continent_df.

continent_df = geo_df[['country', 'continent']].drop_duplicates()
merged_df = df.merge(continent_df, left_on='country1', right_on='country')
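
One thing to watch with this merge: by default it is an inner join, so any team whose name is spelled differently in the two tables is silently dropped. The sketch below uses toy data (including a made-up country name) to show one way to surface such rows with a left join and indicator=True.

```python
import pandas as pd

# Toy data; "Narnia" is a made-up name with no match in continent_df.
df = pd.DataFrame({
    "country1": ["Argentina", "Argentina", "Narnia"],
    "country2": ["Australia", "Belgium", "Argentina"],
    "wins": [0.71, 0.50, 0.25],
})
continent_df = pd.DataFrame({
    "country": ["Argentina", "Australia", "Belgium"],
    "continent": ["Americas", "Oceania", "Europe"],
})

# Default inner join: rows whose country1 has no match are dropped.
inner = df.merge(continent_df, left_on="country1", right_on="country")

# Left join with indicator=True keeps them and labels each row's origin.
checked = df.merge(continent_df, left_on="country1", right_on="country",
                   how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "country1"].unique().tolist()
print(unmatched)
```

If the list of unmatched names is not empty, renaming those entries in one of the tables before merging keeps the teams in the plot.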

2. Create Box Plot

We apply the px.box function and specify the following parameters, which describe the data fed into the box plot.

fig = px.box(merged_df,
             x='country1',
             y='wins',
             color='continent',
...

3. Format the Plot

The following parameters are optional, but they help format the plot and display more useful information in the visual.

fig = px.box(merged_df,
             x='country1',
             y='wins',
             color='continent',
             # formatting box plot
             points='all',
             hover_data=['country2'],
             color_discrete_sequence=px.colors.diverging.Earth,
             width=1300,
             height=600
             )
fig.update_traces(width=0.3)
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)')

points='all' means that all data points are shown beside the box plots. Hover over each data point to see the details.
hover_data=['country2'] adds “country2” to the hover box content.
color_discrete_sequence=px.colors.diverging.Earth specifies the color theme. Note that color_discrete_sequence applies when the field used for coloring holds discrete, categorical values, whereas color_continuous_scale applies when the field holds continuous, numeric values.
width=1300 and height=600 specify the width and height of the figure.
fig.update_traces(width=0.3) updates the width of each box.
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)') makes the figure background transparent.
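
The discrete versus continuous distinction above comes down to the data type of the column passed to color. As a rough sketch, the hypothetical helper below (color_kwarg_name is not part of Plotly) inspects a column with Pandas and returns the name of the palette argument that would apply, assuming numeric columns are treated as continuous and everything else as discrete.

```python
import pandas as pd

def color_kwarg_name(df: pd.DataFrame, column: str) -> str:
    """Hypothetical helper: name the palette argument matching the column type."""
    if pd.api.types.is_numeric_dtype(df[column]):
        return "color_continuous_scale"
    return "color_discrete_sequence"

# Toy data with one categorical and one numeric column.
toy = pd.DataFrame({
    "continent": ["Europe", "Americas"],
    "average wins": [0.45, 0.52],
})

discrete_name = color_kwarg_name(toy, "continent")
continuous_name = color_kwarg_name(toy, "average wins")
print(discrete_name, continuous_name)
```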

Bar Chart — Average Wins Ratio by Country (image by author)

The second exercise is to visualize the average wins ratio per country, sorted in descending order so that the top-performing countries are easy to see.

Firstly, we use the code below for data manipulation.

average_score = df.groupby(['country1'])['wins'].mean().sort_values(ascending=False)

df.groupby(['country1']): groups df by the field “country1”.
['wins'].mean(): takes the mean of the “wins” values in each group.
sort_values(ascending=False): sorts the values in descending order.

We then use pd.DataFrame() to convert average_score (which is a Series) into a table-like DataFrame.

average_score_df = pd.DataFrame({'country1':average_score.index, 'average wins':average_score.values})
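
As an alternative sketch, the aggregation and the conversion to a DataFrame can be chained in one expression with reset_index(name=...). This runs on toy data and is meant to produce the same two columns as the approach above.

```python
import pandas as pd

# Toy wins ratios for two teams (values are illustrative only).
df = pd.DataFrame({
    "country1": ["Argentina", "Argentina", "Belgium", "Belgium"],
    "wins": [0.6, 0.4, 0.9, 0.7],
})

# Group, average, sort, then turn the Series into a two-column DataFrame.
average_score_df = (
    df.groupby("country1")["wins"]
      .mean()
      .sort_values(ascending=False)
      .reset_index(name="average wins")
)
print(average_score_df)
```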

Feed average_score_df to the px.bar function, which follows the same syntax as px.box.

# bar chart of average wins per team, sorted in descending order
fig = px.bar(average_score_df,
             x='country1',
             y='average wins',
             color='average wins',
             text_auto=True,
             labels={'country1':'country', 'value':'average wins'},
             color_continuous_scale=px.colors.sequential.Teal,
             width=1000,
             height=600
             )
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)')

To take it a step further, we can also group the bars by continent to show the top-performing countries within each continent, using the code below.

Bar Chart — Average Wins Ratio by Country Grouped by Continent (image by author)

# merge average_score_df with geo_df to bring in "continent" and "iso_alpha"
geo_df = px.data.gapminder()
geo_df = geo_df[['country', 'continent', 'iso_alpha']].drop_duplicates()
merged_df = average_score_df.merge(geo_df, left_on='country1', right_on='country')
# create bar chart using merged_df, colored by "continent"
fig = px.bar(merged_df,
             x='country1',
             y='average wins',
             color='continent',
             text_auto=True,
             labels={'country1':'country', 'value':'average wins'},
             width=1000,
             height=600
             )
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)')
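
To read the top performer per continent as a table rather than off the chart, one option is to sort and then keep the first row of each group. The sketch below assumes a frame with the same column names as merged_df and uses toy values.

```python
import pandas as pd

# Toy frame with the same column names as merged_df (values are illustrative).
merged_df = pd.DataFrame({
    "country1": ["Argentina", "Brazil", "Belgium", "France"],
    "continent": ["Americas", "Americas", "Europe", "Europe"],
    "average wins": [0.50, 0.60, 0.40, 0.55],
})

# Sort by average wins, then keep the first (highest) row of each continent.
top_per_continent = (
    merged_df.sort_values("average wins", ascending=False)
             .groupby("continent")
             .head(1)
             .reset_index(drop=True)
)
print(top_per_continent)
```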

Choropleth Map — Average Wins Ratio by Geo Location (image by author)

The next visualization displays the average wins ratio of each country on a map. The diagram above gives a clearer view of which regions of the world performed relatively better, such as South America and Europe.

The ISO code is used to identify the location of each country. In the previous code snippet for the average wins ratio colored by continent, we merged geo_df with average_score_df to create merged_df, which has the fields “continent” and “iso_alpha”. We will keep using merged_df for this exercise (shown in the screenshot below).

We then use the px.choropleth function and set the locations parameter to “iso_alpha”.

fig = px.choropleth(merged_df,
                    locations='iso_alpha',
                    color='average wins',
                    hover_name='country',
                    color_continuous_scale=px.colors.sequential.Teal,
                    width=1000,
                    height=500,
                    )
fig.update_layout(margin={'r':0,'t':0,'l':0,'b':0})

Heatmap — Wins Ratio Between Country Pairs (image by author)

Lastly, we introduce the heatmap to visualize the wins ratio between each country pair, where darker cells show that the country on the y axis had a higher wins ratio against the country on the x axis. You can also hover over the cells to see the wins ratio interactively.

We need to use the df.pivot() function to reshape the dataframe. The code below sets “country1” as the rows of the pivot table and “country2” as the columns, and keeps “wins” as the pivoted value. As a result, the table on the left is transformed into the one on the right.

pivoted_df = df.pivot(index='country1', columns='country2', values='wins')
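
The reshaping is easier to follow on a tiny example. The sketch below pivots three toy rows and shows that pairs missing from the long table become NaN in the wide one. Note that pivot() raises a ValueError if the same (country1, country2) pair appears more than once.

```python
import pandas as pd

# Toy long-format data: one row per (country1, country2) pair.
df = pd.DataFrame({
    "country1": ["Argentina", "Argentina", "Belgium"],
    "country2": ["Belgium", "Brazil", "Argentina"],
    "wins": [0.5, 0.4, 0.25],
})

# Rows become country1, columns become country2, cells hold "wins".
pivoted_df = df.pivot(index="country1", columns="country2", values="wins")
print(pivoted_df)

# Pairs absent from the long table (e.g. Argentina vs Argentina) are NaN.
missing = pd.isna(pivoted_df.loc["Argentina", "Argentina"])
```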

We then pass pivoted_df to px.imshow to create the heatmap, using the code below.

# heatmap
fig = px.imshow(pivoted_df,
                text_auto=True,
                labels={'color':'wins'},
                color_continuous_scale=px.colors.sequential.Brwnyl,
                width=1000,
                height=1000
                )
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)')