This is the capstone project for the DataCamp 'Python intermediate' course.

Dec. 17, 2021

Investigating Popularity and Guest Stars in The Office

1. Welcome!

The Office! What started as a British mockumentary series about office culture in 2001 has since spawned ten other variants across the world, including an Israeli version (2010-13), a Hindi version (2019-), and even a French Canadian variant (2006-2007). Of all these iterations (including the original), the American series has been the longest-running, spanning 201 episodes over nine seasons.

In this notebook, we will look at a dataset of The Office episodes and try to understand how the popularity and quality of the series varied over time. To do so, we will use the dataset datasets/office_episodes.csv, which was downloaded from Kaggle.

This dataset contains information on a variety of characteristics of each episode. In detail, these are:

datasets/office_episodes.csv
  • episode_number: Canonical episode number.
  • season: Season in which the episode appeared.
  • episode_title: Title of the episode.
  • description: Description of the episode.
  • ratings: Average IMDB rating.
  • votes: Number of votes.
  • viewership_mil: Number of US viewers in millions.
  • duration: Duration in number of minutes.
  • release_date: Airdate.
  • guest_stars: Guest stars in the episode (if any).
  • director: Director of the episode.
  • writers: Writers of the episode.
  • has_guests: True/False column for whether the episode contained guest stars.
  • scaled_ratings: The ratings scaled from 0 (worst-reviewed) to 1 (best-reviewed).
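
The scaled_ratings column is described as running from 0 (worst-reviewed) to 1 (best-reviewed). One common way to build such a column is min-max scaling. The sketch below assumes that is how it was derived (the dataset description does not say) and uses a small made-up frame, not the real data; `toy` and `min_max_scale` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy ratings, not taken from the real dataset.
toy = pd.DataFrame({"ratings": [6.5, 7.5, 8.0, 9.5]})


def min_max_scale(series):
    """Rescale a numeric Series linearly so its minimum maps to 0 and its maximum to 1."""
    lo, hi = series.min(), series.max()
    return (series - lo) / (hi - lo)


toy["scaled_ratings"] = min_max_scale(toy["ratings"])
print(toy)
```

If the assumption holds, applying the same function to office_episodes['ratings'] should reproduce the scaled_ratings column up to rounding.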
In [9]:
# Import the pandas library and the matplotlib.pyplot module

import pandas as pd
import matplotlib.pyplot as plt

# Set the default figure size to get a larger plot
plt.rcParams['figure.figsize'] = [11, 7]

# Load the data from the CSV file into a pandas DataFrame.
office_episodes = pd.read_csv('datasets/office_episodes.csv')

# Explore the data: preview the first few rows and summarize the columns.
office_episodes.head(3)
office_episodes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188 entries, 0 to 187
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   episode_number  188 non-null    int64  
 1   season          188 non-null    int64  
 2   episode_title   188 non-null    object 
 3   description     188 non-null    object 
 4   ratings         188 non-null    float64
 5   votes           188 non-null    int64  
 6   viewership_mil  188 non-null    float64
 7   duration        188 non-null    int64  
 8   release_date    188 non-null    object 
 9   guest_stars     29 non-null     object 
 10  director        188 non-null    object 
 11  writers         188 non-null    object 
 12  has_guests      188 non-null    bool   
 13  scaled_ratings  188 non-null    float64
dtypes: bool(1), float64(3), int64(4), object(6)
memory usage: 19.4+ KB
In [10]:
# Show the viewership trend over time

plt.scatter(office_episodes['episode_number'], office_episodes['viewership_mil'])
plt.xlabel('Episode number')
plt.ylabel('Viewers (in Millions)')
plt.title('Viewers Trends')
plt.show()

The objective of this analysis is to show how the popularity and quality of the series varied over time. The first scatter plot shows viewership declining as the series goes on. We also have a rating for each episode, so we can add that information to the scatter plot with a few cosmetic changes to the previous plot.

In [11]:
# Assign a color to each episode based on its scaled rating and collect the colors in a list.

colors_rating = []

for ind, row in office_episodes.iterrows():
    if row['scaled_ratings'] < 0.25:
        colors_rating.append('red')
    elif row['scaled_ratings'] < 0.50:
        colors_rating.append('orange')
    elif row['scaled_ratings'] < 0.75:
        colors_rating.append('lightgreen')
    else:
        colors_rating.append('darkgreen')
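
The same rating-to-color mapping can also be expressed without an explicit loop. The following is a minimal sketch using `pd.cut`, offered as an alternative and not as the notebook's method; `scaled` is a hypothetical stand-in for office_episodes['scaled_ratings'], and `colors_rating_alt` is an illustrative name.

```python
import pandas as pd

# Hypothetical scaled ratings standing in for office_episodes['scaled_ratings'].
scaled = pd.Series([0.10, 0.25, 0.49, 0.50, 0.74, 0.75, 1.00])

# right=False makes each bin closed on the left, matching the `<` tests in the loop:
# (-inf, 0.25) -> red, [0.25, 0.50) -> orange, [0.50, 0.75) -> lightgreen, [0.75, inf) -> darkgreen.
bins = [float("-inf"), 0.25, 0.50, 0.75, float("inf")]
labels = ["red", "orange", "lightgreen", "darkgreen"]

# pd.cut returns a categorical Series; convert it to a plain list of color names.
colors_rating_alt = pd.cut(scaled, bins=bins, labels=labels, right=False).astype(str).tolist()
print(colors_rating_alt)
```

The resulting list can be passed to the `c` argument of `plt.scatter` in the same way as colors_rating.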
In [12]:
# Show the new scatter plot, with points colored by rating

plt.scatter(office_episodes['episode_number'], office_episodes['viewership_mil'], c=colors_rating)
plt.xlabel('Episode number')
plt.ylabel('Viewers (in Millions)')
plt.title('Viewers Trends')
plt.show()
In [13]:
# Use a larger marker for episodes with guest stars.
sizes = []

for ind, row in office_episodes.iterrows():
    if row['has_guests']:
        sizes.append(250)
    else:
        sizes.append(25)
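
Because has_guests is already a boolean column, the marker sizes can also be computed in a single vectorized step. This is a sketch with `np.where`, assuming NumPy is available; `has_guests` below is a hypothetical stand-in for office_episodes['has_guests'], and `sizes_alt` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical flags standing in for office_episodes['has_guests'].
has_guests = pd.Series([True, False, False, True])

# 250 for episodes with guest stars, 25 for all others.
sizes_alt = np.where(has_guests, 250, 25).tolist()
print(sizes_alt)
```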
In [14]:
# Show the new scatter plot, with marker size indicating guest appearances

plt.scatter(office_episodes['episode_number'], office_episodes['viewership_mil'], c=colors_rating, s=sizes)
plt.xlabel('Episode number')
plt.ylabel('Viewers (in Millions)')
plt.title('Popularity, Quality and Guest Appearances on The Office')
plt.show()

It seems that the most appreciated episodes were those with guest stars, but one episode was far more popular than all the others. Who were the guest stars in that episode?

In [15]:
# Select the guest stars of the episode with the highest viewership.
top_star = office_episodes[office_episodes['viewership_mil'] == office_episodes['viewership_mil'].max()]['guest_stars']
print(top_star)
77    Cloris Leachman, Jack Black, Jessica Alba
Name: guest_stars, dtype: object

The guest stars in the most-watched episode were Cloris Leachman, Jack Black, and Jessica Alba!
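
The lookup above compares every row against the maximum and returns a Series. When only the single top episode is needed, `idxmax` with `.loc` is a possible alternative. The sketch below uses a made-up miniature frame with the same column names as the dataset; its values, `toy`, `top_idx`, and `top_stars` are all hypothetical.

```python
import pandas as pd

# Hypothetical miniature frame reusing the dataset's column names.
toy = pd.DataFrame({
    "episode_title": ["A", "B", "C"],
    "viewership_mil": [7.1, 12.4, 9.0],
    "guest_stars": [None, "Guest One, Guest Two", None],
})

# idxmax returns the index label of the first row holding the maximum value.
top_idx = toy["viewership_mil"].idxmax()

# .loc with a row label and a column name returns a scalar, not a Series.
top_stars = toy.loc[top_idx, "guest_stars"]
print(top_stars)
```

One difference to keep in mind: the boolean mask returns every row tied for the maximum, while `idxmax` returns only the first one.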