This is an unguided project from the DataCamp course 'Data Scientist Professional with Python'.

1. Introduction


Mobile apps are everywhere. They are easy to create and can be very lucrative from a business standpoint. Android in particular keeps expanding as an operating system and has captured more than 74% of the total market[1].

The Google Play Store apps data has enormous potential to facilitate data-driven decisions and insights for businesses. In this notebook, we will analyze the Android app market by comparing ~10k apps in Google Play across different categories. We will also use the user reviews to draw a qualitative comparison between the apps.

The dataset you will use here was scraped from Google Play Store in September 2018 and was published on Kaggle. Here are the details:

datasets/apps.csv
This file contains all the details of the apps on Google Play. There are 9 features that describe a given app.
  • App: Name of the app
  • Category: Category of the app. Some examples are: ART_AND_DESIGN, FINANCE, COMICS, BEAUTY etc.
  • Rating: The current average rating (out of 5) of the app on Google Play
  • Reviews: Number of user reviews given on the app
  • Size: Size of the app in MB (megabytes)
  • Installs: Number of times the app was downloaded from Google Play
  • Type: Whether the app is paid or free
  • Price: Price of the app in US$
  • Last Updated: Date on which the app was last updated on Google Play
datasets/user_reviews.csv
This file contains a random sample of 100 [most helpful first](https://www.androidpolice.com/2019/01/21/google-play-stores-redesigned-ratings-and-reviews-section-lets-you-easily-filter-by-star-rating/) user reviews for each app. The text in each review has been pre-processed and passed through a sentiment analyzer.
  • App: Name of the app on which the user review was provided. Matches the `App` column of the `apps.csv` file
  • Review: The pre-processed user review text
  • Sentiment Category: Sentiment category of the user review - Positive, Negative or Neutral
  • Sentiment Score: Sentiment score of the user review. It lies in the range [-1, 1]. A higher score denotes a more positive sentiment.

From here on, it will be your task to explore and manipulate the data until you are able to answer the three questions described in the instructions panel.

The three questions are:

  1. Read the apps.csv file and clean the Installs column to convert it into integer data type. Save your answer as a DataFrame `apps`.

  2. Find the number of apps in each category, the average price, and the average rating. Save your answer as a DataFrame `app_category_info`. You should rename the four columns as: Category, Number of apps, Average price, Average rating.

  3. Find the top 10 free FINANCE apps having the highest average sentiment score. Save your answer as a DataFrame `top_10_user_feedback`. Your answer should have exactly 10 rows and two columns named: App and Sentiment Score, where the average Sentiment Score is sorted from highest to lowest.

In [293]:
#import pandas, then read and explore datasets/apps.csv
import pandas as pd

apps = pd.read_csv('datasets/apps.csv')
apps.head()
Out[293]:
   App                                                Category        Rating  Reviews  Size  Installs     Type  Price  Last Updated
0  Photo Editor & Candy Camera & Grid & ScrapBook     ART_AND_DESIGN  4.1     159      19.0  10,000+      Free  0.0    January 7, 2018
1  Coloring book moana                                ART_AND_DESIGN  3.9     967      14.0  500,000+     Free  0.0    January 15, 2018
2  U Launcher Lite – FREE Live Cool Themes, Hide ...  ART_AND_DESIGN  4.7     87510    8.7   5,000,000+   Free  0.0    August 1, 2018
3  Sketch - Draw & Paint                              ART_AND_DESIGN  4.5     215644   25.0  50,000,000+  Free  0.0    June 8, 2018
4  Pixel Draw - Number Art Coloring Book              ART_AND_DESIGN  4.3     967      2.8   100,000+     Free  0.0    June 20, 2018
In [294]:
#read and explore datasets/user_reviews.csv
user_reviews = pd.read_csv('datasets/user_reviews.csv')

user_reviews.head()
Out[294]:
   App                    Review                                             Sentiment Category  Sentiment Score
0  10 Best Foods for You  I like eat delicious food. That's I'm cooking ...  Positive            1.00
1  10 Best Foods for You  This help eating healthy exercise regular basis    Positive            0.25
2  10 Best Foods for You  NaN                                                NaN                 NaN
3  10 Best Foods for You  Works great especially going grocery store         Positive            0.40
4  10 Best Foods for You  Best idea us                                       Positive            1.00
In [295]:
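#Remove the non-numeric characters ',' and '+' from the 'Installs' column and convert it to integer data type.
#regex=False treats '+' as a literal character regardless of the pandas version.
apps['Installs'] = apps['Installs'].str.replace(',', '', regex=False).str.replace('+', '', regex=False)
apps['Installs'] = apps['Installs'].astype(int)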
apps.head()
Out[295]:
   App                                                Category        Rating  Reviews  Size  Installs  Type  Price  Last Updated
0  Photo Editor & Candy Camera & Grid & ScrapBook     ART_AND_DESIGN  4.1     159      19.0  10000     Free  0.0    January 7, 2018
1  Coloring book moana                                ART_AND_DESIGN  3.9     967      14.0  500000    Free  0.0    January 15, 2018
2  U Launcher Lite – FREE Live Cool Themes, Hide ...  ART_AND_DESIGN  4.7     87510    8.7   5000000   Free  0.0    August 1, 2018
3  Sketch - Draw & Paint                              ART_AND_DESIGN  4.5     215644   25.0  50000000  Free  0.0    June 8, 2018
4  Pixel Draw - Number Art Coloring Book              ART_AND_DESIGN  4.3     967      2.8   100000    Free  0.0    June 20, 2018
In [296]:
#check that the 'Installs' column is now an integer data type
apps['Installs'].dtype
Out[296]:
dtype('int64')
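
As a cross-check on the cleaning step, the sketch below applies the same idea to a small made-up Series (the values are illustrative, not read from apps.csv). It strips both characters in one pass with a regular-expression character class, which is one possible alternative to chaining two literal replacements.

```python
import pandas as pd

# Hypothetical sample mimicking the raw 'Installs' strings
raw_installs = pd.Series(['10,000+', '500,000+', '5,000,000+'])

# The character class [+,] matches either '+' or ','. regex=True is passed
# explicitly because the default of this argument differs across pandas versions.
cleaned = raw_installs.str.replace(r'[+,]', '', regex=True).astype(int)

print(cleaned.tolist())
```

Inside a character class the `+` is a literal character, so no escaping is needed there.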
In [297]:
#Create a DataFrame with the number of apps, average price and average rating per Category.
#Counting the 'App' column avoids aggregating the grouping key itself.

app_category_info = apps.groupby('Category').agg(
        {'App' : 'count',
          'Price' : 'mean',
          'Rating': 'mean'})

#rename the aggregated columns
app_category_info = app_category_info.rename(columns={
    'App': 'Number of apps',
    'Price': 'Average price',
    'Rating': 'Average rating'
}).reset_index()

#explore the first few rows of the new DataFrame
app_category_info.head()
Out[297]:
   Category             Number of apps  Average price  Average rating
0  ART_AND_DESIGN       64              0.093281       4.357377
1  AUTO_AND_VEHICLES    85              0.158471       4.190411
2  BEAUTY               53              0.000000       4.278571
3  BOOKS_AND_REFERENCE  222             0.539505       4.344970
4  BUSINESS             420             0.417357       4.098479
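
The count-and-mean summary per category can also be written with pandas named aggregation, which sets the output column names in the same call and makes a separate rename step unnecessary. The sketch below uses a made-up four-row table (`toy_apps` and its values are hypothetical), so it only illustrates the pattern and not the real results.

```python
import pandas as pd

# Hypothetical miniature version of the apps table
toy_apps = pd.DataFrame({
    'App': ['a', 'b', 'c', 'd'],
    'Category': ['FINANCE', 'FINANCE', 'BEAUTY', 'BEAUTY'],
    'Price': [0.0, 2.0, 0.0, 0.0],
    'Rating': [4.0, 3.0, 5.0, None],
})

# Named aggregation: each key is an output column, each value is
# (input column, function). Names containing spaces are passed by
# unpacking a dict, since they are not valid Python keywords.
toy_summary = toy_apps.groupby('Category').agg(**{
    'Number of apps': ('App', 'count'),
    'Average price': ('Price', 'mean'),
    'Average rating': ('Rating', 'mean'),
}).reset_index()

print(toy_summary)
```

Note that `mean` skips missing ratings, so a category's average rating is computed only over the apps that have one.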
In [298]:
#create a new DataFrame containing only the free FINANCE apps and explore it

free_finance_apps = apps.query('Category =="FINANCE" and Type=="Free"')
free_finance_apps.head()
Out[298]:
     App                Category  Rating  Reviews  Size  Installs  Type  Price  Last Updated
837  K PLUS             FINANCE   4.4     124424   NaN   10000000  Free  0.0    June 26, 2018
838  ING Banking        FINANCE   4.4     39041    NaN   1000000   Free  0.0    August 3, 2018
839  Citibanamex Movil  FINANCE   3.6     52306    42.0  5000000   Free  0.0    July 27, 2018
840  The postal bank    FINANCE   3.7     36718    NaN   5000000   Free  0.0    July 16, 2018
841  KTB Netbank        FINANCE   3.8     42644    19.0  5000000   Free  0.0    June 28, 2018
In [299]:
#merge free_finance_apps with user_reviews; how='left' keeps apps that have no reviews

free_finance_app_w_reviews = free_finance_apps.merge(user_reviews, on='App', how='left')
free_finance_app_w_reviews.head()
Out[299]:
   App                Category  Rating  Reviews  Size  Installs  Type  Price  Last Updated    Review                                             Sentiment Category  Sentiment Score
0  K PLUS             FINANCE   4.4     124424   NaN   10000000  Free  0.0    June 26, 2018   NaN                                                NaN                 NaN
1  ING Banking        FINANCE   4.4     39041    NaN   1000000   Free  0.0    August 3, 2018  NaN                                                NaN                 NaN
2  Citibanamex Movil  FINANCE   3.6     52306    42.0  5000000   Free  0.0    July 27, 2018   Forget paying app, designed make fail payments...  Negative            -0.50
3  Citibanamex Movil  FINANCE   3.6     52306    42.0  5000000   Free  0.0    July 27, 2018   It's working expected, talking best bank Mexic...  Positive            0.40
4  Citibanamex Movil  FINANCE   3.6     52306    42.0  5000000   Free  0.0    July 27, 2018   It has many problems with Android 8.1. You can...  Positive            0.25
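
The NaN rows for K PLUS and ING Banking come from the left merge: apps without a match in the reviews table are kept with missing review fields. A minimal sketch on made-up tables (all names and values below are hypothetical) shows how the `indicator` option of `merge` can list such apps.

```python
import pandas as pd

# Hypothetical toy tables: app 'B' has no reviews, app 'C' is not in the apps table
toy_apps = pd.DataFrame({'App': ['A', 'B'], 'Type': ['Free', 'Free']})
toy_reviews = pd.DataFrame({
    'App': ['A', 'A', 'C'],
    'Sentiment Score': [0.5, -0.1, 0.9],
})

# how='left' keeps every app; indicator=True adds a '_merge' column
# recording whether each row found a match in the reviews table
merged = toy_apps.merge(toy_reviews, on='App', how='left', indicator=True)

# Rows marked 'left_only' belong to apps that have no reviews at all
apps_without_reviews = merged.loc[merged['_merge'] == 'left_only', 'App'].tolist()
print(apps_without_reviews)
```

With `how='inner'` those unmatched apps would be dropped instead, which makes no difference to the ranking as long as enough apps with reviews remain.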
In [300]:
#find the top 10 free FINANCE apps with the highest average sentiment score

#Group free_finance_app_w_reviews by 'App' to get the mean 'Sentiment Score',
#sort that column in descending order, then keep the first 10 rows with head().
#Apps without any review get a NaN mean, which sort_values places last.
top_10_user_feedback = (
    free_finance_app_w_reviews.groupby('App')[['Sentiment Score']]
    .mean()
    .sort_values('Sentiment Score', ascending=False)
    .head(10)
)

top_10_user_feedback
Out[300]:
                                            Sentiment Score
App
BBVA Spain                                  0.515086
Associated Credit Union Mobile              0.388093
BankMobile Vibe App                         0.353455
A+ Mobile                                   0.329592
Current debit card and app made for teens   0.327258
BZWBK24 mobile                              0.326883
Even - organize your money, get paid early  0.283929
Credit Karma                                0.270052
Fortune City - A Finance App                0.266966
Branch                                      0.264230

So the top 10 free FINANCE apps by average sentiment score are the ones listed above.
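
The question asks for two columns named App and Sentiment Score, while a grouped mean keeps App as the index. The sketch below, on made-up data (`toy_merged` and its values are hypothetical), shows one way to drop apps without any score and turn the index into a regular column.

```python
import pandas as pd

# Hypothetical merged rows: app 'C' has no review, so its score is missing
toy_merged = pd.DataFrame({
    'App': ['A', 'A', 'B', 'B', 'C'],
    'Sentiment Score': [0.2, 0.4, 0.9, None, None],
})

# mean() skips missing values; an app with no scores at all averages to NaN
avg_scores = toy_merged.groupby('App')['Sentiment Score'].mean()

# Drop apps without any score, sort from highest to lowest, keep the top 2,
# and move 'App' from the index into a regular column
toy_top = (
    avg_scores.dropna()
    .sort_values(ascending=False)
    .head(2)
    .reset_index()
)

print(toy_top)
```

Here `head(2)` stands in for `head(10)` only because the toy table is so small.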

In [301]:
%%nose
# %%nose needs to be included at the beginning of every @tests cell

# https://instructor-support.datacamp.com/en/articles/4544008-writing-project-tests-guided-and-unguided-r-and-python
# The @solution should pass the tests
# The purpose of the tests is to try to catch common errors and
# to give the student a hint on how to resolve these errors

import numpy as np

correct_apps = pd.read_csv('datasets/apps.csv')
correct_reviews = pd.read_csv('datasets/user_reviews.csv')

# List of characters to remove
chars_to_remove = ['+', ',']
# Replace each character with an empty string
for char in chars_to_remove:
    correct_apps['Installs'] = correct_apps['Installs'].apply(lambda x: x.replace(char, ''))
# Convert col to int
correct_apps['Installs'] = correct_apps['Installs'].astype(int)
   

def test_pandas_loaded():
    assert ('pandas' in globals() or 'pd' in globals()), "pandas is not imported."

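def test_installs_plus():
    # `in` on a Series checks index labels, so inspect the values instead
    assert not apps['Installs'].astype(str).str.contains('+', regex=False).any(), \
    'The special character "+" has not been removed from the Installs column.'
    
def test_installs_comma():
    # `in` on a Series checks index labels, so inspect the values instead
    assert not apps['Installs'].astype(str).str.contains(',', regex=False).any(), \
    'The special character "," has not been removed from the Installs column.'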
    
def test_installs_numeric():
    assert isinstance(apps['Installs'][0], np.int64), \
    'The Installs column is not of numeric data type (int).'
    
def test_q1_app_category_info_columns():
    
    # when Category is the index rather than a regular column
    if 'BEAUTY' in app_category_info.index:
        assert all(x in app_category_info.columns for x in ['Number of apps', 'Average price', 'Average rating']), \
        "Some columns are missing or incorrectly named in your app_category_info DataFrame. Make sure there are 4 columns named: 'Category', 'Number of apps', 'Average price', 'Average rating'."
    else:
        # when Category is a regular column, all four names must be present
        assert all(x in app_category_info.columns for x in ['Category', 'Number of apps', 'Average price', 'Average rating']), \
        "Some columns are missing or incorrectly named in your app_category_info DataFrame. Make sure there are 4 columns named: 'Category', 'Number of apps', 'Average price', 'Average rating'."

def test_q1_app_category_info_app_count():
    
    if 'Number of apps' in app_category_info.reset_index().columns:
        correct_app_category_info = correct_apps.groupby(['Category']).agg({'App':'count', 'Price': 'mean', 'Rating': 'mean'}).reset_index()
        correct_app_category_info = correct_app_category_info.rename(columns={"App": "Number of apps", "Price": "Average price", "Rating": "Average rating"})
        correct_app_count = correct_app_category_info['Number of apps']

        # convert to single index and compare
        app_count = app_category_info.reset_index().sort_values(by='Category')['Number of apps']
        assert correct_app_count.equals(app_count),\
        "The aggregate function used to calculate \"Number of apps\" is incorrect."
    
    else:
        assert False, "\"Number of apps\" column is missing in your app_category_info DataFrame."

    
def test_q1_app_category_info_avg_price():

    if 'Average price' in app_category_info.reset_index().columns:
        correct_app_category_info = correct_apps.groupby(['Category']).agg({'App':'count', 'Price': 'mean', 'Rating': 'mean'}).reset_index()
        correct_app_category_info = correct_app_category_info.rename(columns={"App": "Number of apps", "Price": "Average price", "Rating": "Average rating"})
        correct_app_count = correct_app_category_info['Average price']

        # convert to single index and compare
        app_count = app_category_info.reset_index().sort_values(by='Category')['Average price']
        assert correct_app_count.equals(app_count),\
        "The aggregate function used to calculate \"Average price\" is incorrect."
    
    else:
        assert False, "\"Average price\" column is missing in your app_category_info DataFrame."

def test_q1_app_category_info_avg_rating():
    
    if 'Average rating' in app_category_info.reset_index().columns:
        correct_app_category_info = correct_apps.groupby('Category').agg({'App':'count', 'Price': 'mean', 'Rating': 'mean'}).reset_index()
        correct_app_category_info = correct_app_category_info.rename(columns={"App": "Number of apps", "Price": "Average price", "Rating": "Average rating"})
        correct_app_count = correct_app_category_info['Average rating']

        # convert to single index and compare
        app_count = app_category_info.reset_index().sort_values(by='Category')['Average rating']
        assert correct_app_count.equals(app_count),\
        "The aggregate function used to calculate \"Average rating\" is incorrect."
    
    else:
        assert False, "\"Average rating\" column is missing in your app_category_info DataFrame."

# def test_reviews_loaded():
#     assert (correct_reviews.equals(reviews)), "The dataset was not read correctly into reviews."

def test_q2_finance_apps():
    correct_finance_apps = correct_apps[(correct_apps['Type'] == 'Free') & (correct_apps['Category'] == 'FINANCE')]['App']
    
    # if App column is the index
    if top_10_user_feedback.index.name == 'App': 
        finance_apps = top_10_user_feedback.index
        assert(set(finance_apps).issubset(set(correct_finance_apps))),\
        "You have not selected the free finance apps correctly. Check your answer again."
    else:
        finance_apps = top_10_user_feedback['App']
        assert(set(finance_apps).issubset(set(correct_finance_apps))),\
        "You have not selected the free finance apps correctly. Check your answer again."


def test_q2_top_10():
    assert(len(top_10_user_feedback) == 10), "You have not selected exactly 10 apps. Please select only top 10 apps with highest average sentiment score."
    

def test_q2_sorted():
    correct_finance_apps = correct_apps[(correct_apps['Type'] == 'Free') & (correct_apps['Category'] == 'FINANCE')]  
    correct_merged_df = pd.merge(correct_finance_apps, correct_reviews, on = "App", how = "inner")
    
    correct_app_sentiment_score = correct_merged_df.groupby('App').agg({'Sentiment Score': 'mean'}).reset_index()
    correct_sorted_apps = correct_app_sentiment_score.sort_values(by = 'Sentiment Score', ascending = False)[:10]

    # if App column is the index
    if top_10_user_feedback.index.name == 'App': 
        sorted_apps = top_10_user_feedback.index
        assert(list(sorted_apps) == list(correct_sorted_apps['App'])),\
        "You have not sorted top_10_user_feedback correctly. Make sure to sort your DataFrame on Sentiment Score from highest to lowest (ie - in decreasing order)."
    else: 
        sorted_apps = top_10_user_feedback['App']
        assert(list(sorted_apps) == list(correct_sorted_apps['App'])),\
        "You have not sorted top_10_user_feedback correctly. Make sure to sort your DataFrame on Sentiment Score from highest to lowest (ie - in decreasing order)."


def test_q2():
    
    correct_finance_apps = correct_apps[(correct_apps['Type'] == 'Free') & (correct_apps['Category'] == 'FINANCE')]  
    correct_merged_df = pd.merge(correct_finance_apps, correct_reviews, on = "App", how = "inner")
    
    correct_app_sentiment_score = correct_merged_df.groupby('App').agg({'Sentiment Score': 'mean'}).reset_index()
    correct_top_10_user_feedback = correct_app_sentiment_score.sort_values(by = 'Sentiment Score', ascending = False).reset_index()[:10]

    correct_app_sentiment_score_multiindex = correct_merged_df.groupby('App').agg({'Sentiment Score': 'mean'})
    correct_top_10_user_feedback_multiindex = correct_app_sentiment_score_multiindex.sort_values(by = 'Sentiment Score', ascending = False)[:10]
    
    # if App column is the index
    if top_10_user_feedback.index.name == 'App':
        assert (correct_top_10_user_feedback_multiindex.equals(top_10_user_feedback)), "You have not computed top_10_user_feedback correctly. Some values are wrong."
    else:
        top_10_user_feedback_apps = top_10_user_feedback['App']
        top_10_user_feedback_sentiment_score = top_10_user_feedback['Sentiment Score']
        assert (list(top_10_user_feedback_apps) == list(correct_top_10_user_feedback['App']) and
               list(top_10_user_feedback_sentiment_score) == list(correct_top_10_user_feedback['Sentiment Score'])), "You have not computed top_10_user_feedback correctly. Some values are wrong."
Out[301]:
12/12 tests passed
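
One detail worth keeping in mind when testing a column for leftover characters: the `in` operator on a pandas Series looks at the index labels, not at the values. The sketch below demonstrates this on a made-up uncleaned Series (the values are hypothetical) and shows a value-based check as one possible alternative.

```python
import pandas as pd

# Hypothetical uncleaned column with the default integer index 0, 1
raw = pd.Series(['10,000+', '500,000+'])

# `in` tests the index labels, so a value is not found but a label is
print('10,000+' in raw)  # False
print(0 in raw)          # True

# Value-based check: does any element contain a literal '+'?
has_plus = raw.astype(str).str.contains('+', regex=False).any()
print(has_plus)          # True
```

The `astype(str)` call keeps the check usable after the column has been converted to integers.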