The Office! What started as a British mockumentary series about office culture in 2001 has since spawned ten other variants across the world, including an Israeli version (2010-13), a Hindi version (2019-), and even a French Canadian variant (2006-2007). Of all these iterations (including the original), the American series has been the longest-running, spanning 201 episodes over nine seasons.
In this notebook, we will look at a dataset of The Office episodes and try to understand how the popularity and quality of the series varied over time. To do so, we will use the dataset datasets/office_episodes.csv, which was downloaded from Kaggle.
This dataset contains information on a variety of characteristics of each episode. In detail, these are:
# Import pandas and matplotlib.pyplot
import pandas as pd
import matplotlib.pyplot as plt
# Set the default figure size to display a larger plot
plt.rcParams['figure.figsize'] = [11, 7]
# Load the data from the CSV file into a pandas DataFrame.
office_episodes = pd.read_csv('datasets/office_episodes.csv')
# Explore the data: show the first few rows, then a summary of the DataFrame.
office_episodes.head(3)
office_episodes.info()
# Show the viewership trend over time
plt.scatter(office_episodes['episode_number'], office_episodes['viewership_mil'])
plt.xlabel('Episode number')
plt.ylabel('Viewers (in Millions)')
plt.title('Viewers Trends')
plt.show()
The objective of the analysis is to show how the popularity and quality of the series varied over time. The first scatter plot shows viewership decreasing over the run of the series. We also have a rating for each episode, so we can add that information to the scatter plot by coloring each point according to its rating.
# Assign a color to each episode based on its scaled rating and collect the colors in a list.
colors_rating = []
for ind, row in office_episodes.iterrows():
    if row['scaled_ratings'] < 0.25:
        colors_rating.append('red')
    elif row['scaled_ratings'] < 0.50:
        colors_rating.append('orange')
    elif row['scaled_ratings'] < 0.75:
        colors_rating.append('lightgreen')
    else:
        colors_rating.append('darkgreen')
# Show the new scatter plot
plt.scatter(office_episodes['episode_number'], office_episodes['viewership_mil'], c=colors_rating)
plt.xlabel('Episode number')
plt.ylabel('Viewers (in Millions)')
plt.title('Viewers Trends')
plt.show()
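The row-by-row loop above can also be written without `iterrows()`. The following is a minimal sketch using `pd.cut`, assuming the same thresholds as the loop. The `sample` DataFrame and its values are hypothetical stand-ins for `office_episodes`, used only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for office_episodes['scaled_ratings'] (not real data).
sample = pd.DataFrame({'scaled_ratings': [0.10, 0.25, 0.49, 0.50, 0.74, 0.75, 1.00]})

# Bin edges mirror the loop's thresholds; right=False makes each bin
# closed on the left, so 0.25 falls in the 'orange' bin, as in the loop.
bins = [-np.inf, 0.25, 0.50, 0.75, np.inf]
labels = ['red', 'orange', 'lightgreen', 'darkgreen']

binned = pd.cut(sample['scaled_ratings'], bins=bins, labels=labels, right=False)
colors_rating = binned.astype(str).tolist()
print(colors_rating)
```

One difference to keep in mind: a missing rating ends up as `'darkgreen'` in the loop (every comparison with NaN is false), while `pd.cut` leaves it unbinned, so the two versions only agree when the column has no missing values.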
# Assign a marker size to each episode: large if it has guest stars, small otherwise.
sizes = []
for ind, row in office_episodes.iterrows():
    if row['has_guests'] == True:
        sizes.append(250)
    else:
        sizes.append(25)
# Show the new scatter plot
plt.scatter(office_episodes['episode_number'], office_episodes['viewership_mil'], c=colors_rating, s=sizes)
plt.xlabel('Episode number')
plt.ylabel('Viewers (in Millions)')
plt.title('Popularity, Quality, and Guest Appearances on The Office')
plt.show()
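The marker sizes can likewise be computed in one step. This is a small sketch using `np.where`, assuming `has_guests` is a boolean column. The `sample` DataFrame is a hypothetical stand-in for `office_episodes`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for office_episodes['has_guests'] (not real data).
sample = pd.DataFrame({'has_guests': [True, False, False, True]})

# 250 where the episode has guest stars, 25 otherwise.
sizes = np.where(sample['has_guests'], 250, 25)
print(sizes.tolist())
```

The resulting array can be passed directly as the `s` argument of `plt.scatter`, in place of the list built by the loop.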
Episodes with guest stars appear to be among the most popular, and one episode drew far more viewers than any other. Which guest stars appeared in it?
# Select the guest stars of the episode with the highest viewership
top_star = office_episodes[office_episodes['viewership_mil'] == office_episodes['viewership_mil'].max()]['guest_stars']
print(top_star)
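The lookup above prints a Series, including its index and dtype. If only the value itself is wanted, one option is `idxmax()`, sketched here on a hypothetical `sample` DataFrame whose column names match the notebook but whose values are invented placeholders.

```python
import pandas as pd

# Hypothetical stand-in for office_episodes (placeholder values, not real data).
sample = pd.DataFrame({
    'episode_number': [0, 1, 2],
    'viewership_mil': [5.0, 12.5, 7.5],
    'guest_stars': ['Guest A', 'Guest B, Guest C', 'Guest D'],
})

# idxmax() returns the index label of the first row with the maximum value.
top_index = sample['viewership_mil'].idxmax()
top_star = sample.loc[top_index, 'guest_stars']
print(top_star)
```

Unlike the boolean mask, `idxmax()` returns only the first row when several episodes tie for the maximum, so the mask is the safer choice if ties are possible.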