Monday, July 27, 2020

Should I Stay or Should I Go?



Sparkify: Should I stay or should I go?


Project Overview

Sparkify is Udacity's (on paper...so far) music streaming service that produces data similar to Spotify or Pandora. Users play and/or rate their favorite songs through such services on a daily basis. The business model is a tiered free and paid system in which free users generally have limited access and often see advertisements to offset the royalty fees paid to music artists. Users can upgrade their service (become a paying customer) or downgrade it (from paying to non-paying); users on either tier can cancel their service, and that loss of a customer is considered "customer churn".

All user interaction with Sparkify revolves around these activities:

* Playing a song
* Creating and updating playlists
* Rating a song with the thumbs up or thumbs down button
* Adding a friend
* Logging in or out
* Changing settings

Each of these activities generates an entry in the user logs. Naturally, log analytics are a key resource for the Sparkify marketing team and the executive staff. The team at Udacity has provided a test dataset for our analysis, although it comes with little documentation.

Business Understanding

Realistically, there is one main question we want to answer: Will either a paying or free user leave? That is called churn and apps like Sparkify live and die by the churn rate.
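To make the idea of churn concrete, here is a minimal sketch in plain Python (not the code from my repository). It assumes each log entry carries a user id, a subscription level, and the page or action that was logged; the field names `user_id`, `level`, and `page`, and the `"Cancel"` event name, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical log entries; the field names and the "Cancel" event name
# are assumptions for illustration, not the real dataset schema.
logs = [
    {"user_id": "u1", "level": "free", "page": "Play Song"},
    {"user_id": "u1", "level": "free", "page": "Thumbs Up"},
    {"user_id": "u2", "level": "paid", "page": "Play Song"},
    {"user_id": "u2", "level": "paid", "page": "Cancel"},
    {"user_id": "u3", "level": "paid", "page": "Add Friend"},
]


def label_churn(entries, cancel_page="Cancel"):
    """Return {user_id: bool}, True if the user ever logged a cancel event."""
    churned = defaultdict(bool)
    for entry in entries:
        # Every user seen in the log gets a label, even if it stays False.
        churned[entry["user_id"]] |= entry["page"] == cancel_page
    return dict(churned)


labels = label_churn(logs)
# Churn rate = churned users / all users seen in the log.
churn_rate = sum(labels.values()) / len(labels)

print(labels)
print(f"churn rate: {churn_rate:.2f}")
```

In this toy log, one user out of three cancels, regardless of whether they were on the free or the paid tier.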


Along the way, I'll load and examine the data and clean the dataset for further analysis. After this, EDA (Exploratory Data Analysis) will take place to better understand the data and prepare it for building several machine learning models that help predict which customers are likely to churn.

To pick the best model, I'll be looking for the highest "F1" score, which is the harmonic mean of a model's precision and recall. After that, I'll present my conclusions along with ideas for further improvement.
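As a rough illustration of how the F1 score is computed, here is a small plain-Python sketch. The labels and predictions below are made up (1 = churned, 0 = stayed), and the helper names `f1_score` and `accuracy` are my own for this example, not a library API.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up example: 3 churners among 8 users.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_model = [1, 0, 0, 0, 0, 1, 0, 1]   # a model that finds 2 of the 3 churners
y_nobody = [0] * len(y_true)         # a "model" that predicts nobody churns

print(accuracy(y_true, y_model), f1_score(y_true, y_model))
print(accuracy(y_true, y_nobody), f1_score(y_true, y_nobody))
```

Predicting that nobody churns still gets 5 of 8 labels right, yet its F1 is zero because it never identifies a churner. That is the kind of gap between plain accuracy and F1 that matters when churners are a minority of users.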

SPARK!

I'll be using Spark, which is an open source framework for distributed data processing. It lets a developer access and analyze potentially large datasets that are spread across many machines as if they were a single dataset. The data can scale to very large sizes, yet users can work with it without needing to be aware of how it is distributed.

Developers who use the Spark ecosystem can focus on the domain-specific data processing business case, while Spark handles the messy details of parallel computing, such as data partitioning, job scheduling, and fault tolerance. This provides the flexibility and scalability needed to handle massive volumes of data efficiently.

PySpark is the Python API for Spark, which allows all of the attendant benefits of Python and its libraries to be used with Spark. It's a happy marriage!

The folks at Data Flair have a nice infographic that shows the highlights of Spark.




My Python/PySpark code is here on GitHub: https://github.com/nameisunique/Capstone_Sparkify

Data Understanding and Data Cleaning

A few thoughts:
1) For my analysis, I took a "low level" approach with Python in addition to using a number of useful functions provided by PySpark. These can be very helpful tools for an initial understanding, and they are usually the first place I start when investigating data.
2) For the initial data loading and cleaning, I use a variety of Python functions to look for nulls, missing data, outliers, and data that needs to be removed. In some cases I may need to use statistical techniques to impute (estimate) missing values; for this data I did not impute any values.
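The kind of "low level" check described above can be sketched in plain Python as follows. This is only an illustration, not the code from my repository: the records and the column names (`user_id`, `song`, `length`) are hypothetical, and treating an empty `user_id` as a row to drop is an assumption made for this example.

```python
# Hypothetical records; column names are assumptions for illustration.
records = [
    {"user_id": "u1", "song": "Song A", "length": 210.5},
    {"user_id": "", "song": None, "length": None},
    {"user_id": "u2", "song": "Song B", "length": 185.0},
    {"user_id": "u3", "song": None, "length": None},
]


def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def count_missing(rows):
    """Count missing values per column across all rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + is_missing(value)
    return counts


missing = count_missing(records)

# Drop rows without a usable user id instead of imputing anything.
# Rows with no song are kept, assuming they are non-song events.
cleaned = [row for row in records if not is_missing(row["user_id"])]

print(missing)
print(len(records), "->", len(cleaned), "rows")
```

Counting first and removing second keeps the two decisions separate: the counts show where the gaps are, and only then do I decide which gaps justify dropping a row.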

