Introduction

Bellabeat is a high-tech company that manufactures health-focused smart products for women. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women around the world with knowledge about their own health and habits.

Bellabeat currently offers five products to their customers: Bellabeat membership, the Spring water bottle, the Time wellness watch, the Leaf classic wellness tracker, and the Bellabeat app. The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products. To advertise their products, Bellabeat focuses on digital marketing.

The aim of this report is to generate data-driven recommendations regarding marketing strategy for Bellabeat, focusing on one of Bellabeat’s products: the Bellabeat app. The recommendations in this report will be built upon observations from analyzing smart device data to gain insight into how consumers are using their smart devices.

This report will follow the six steps of the data analysis process: ask, prepare, process, analyze, share, and act.

1. Ask

Under this section, the business task and key stakeholders will be identified.

1.1. Identify business task

The business task is to analyze usage data from non-Bellabeat smart devices to gain insight into how people already use their smart devices and, based on the results, to deliver high-level recommendations for how these trends can inform Bellabeat’s marketing strategy.

1.2. Consider key stakeholders

The identified key stakeholders are:

  • Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
  • Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
  • Bellabeat marketing analytics team

2. Prepare

Under this section, metadata and data will be examined to find out more about the data and its reliability.

2.1. Data source

The data used for this analysis is the FitBit Fitness Tracker Data. The data set is stored on Kaggle and was made available through Mobius.

2.2. Data accessibility & privacy

The dataset is licensed CC0: Public Domain, which means the owner has dedicated the work to the public domain by waiving all of their rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.

This means the dataset is open-source and can be copied, modified, and distributed, even for commercial purposes, without asking permission.

2.3. Data content

The dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016 (March 12 to May 12, 2016). Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between output represents use of different types of Fitbit trackers and individual tracking behaviors / preferences.

2.4. Data organization

The data was available as 18 different CSV files. Each file contained different quantitative data generated from Fitbit health trackers in either long or wide format. The table below lists the available files.

File name | Type | Format | Description
dailyActivity_merged | CSV | Long | Daily activity over 31 days for 33 users. Includes: Steps, Distance, Intensities, Calories.
dailyCalories_merged | CSV | Long | Daily calories over 31 days for 33 users.
dailyIntensities_merged | CSV | Long | Daily intensities over 31 days for 33 users. Includes distance and time for 4 different intensity levels.
dailySteps_merged | CSV | Long | Daily steps over 31 days for 33 users.
heartrate_seconds_merged | CSV | Long | Heart rate in intervals of 5 seconds over 31 days for 14 users.
hourlyCalories_merged | CSV | Long | Hourly calories over 31 days for 33 users.
hourlyIntensities_merged | CSV | Long | Hourly intensities over 31 days for 33 users.
hourlySteps_merged | CSV | Long | Hourly steps over 31 days for 33 users.
minuteCaloriesNarrow_merged | CSV | Long | Minutely calories over 31 days for 33 users.
minuteCaloriesWide_merged | CSV | Wide | Minutely calories over 31 days for 33 users.
minuteIntensitiesNarrow_merged | CSV | Long | Minutely intensities over 31 days for 33 users. Binary scale.
minuteIntensitiesWide_merged | CSV | Wide | Minutely intensities over 31 days for 33 users. Binary scale.
minuteMETsNarrow_merged | CSV | Long | Minutely METs* over 31 days for 33 users.
minuteSleep_merged | CSV | Long | Minutely sleep score** over 31 days for 24 users.
minuteStepsNarrow_merged | CSV | Long | Minutely steps over 31 days for 33 users.
minuteStepsWide_merged | CSV | Wide | Minutely steps over 31 days for 33 users.
sleepDay_merged | CSV | Long | Daily sleep over 31 days for 24 users.
weightLogInfo_merged | CSV | Long | Weight information over 31 days for 8 users. Includes weight (kg and pounds), fat, BMI, and whether the report is manual or not.

*MET stands for metabolic equivalent of task and is the objective measure of the ratio of the rate at which a person expends energy, relative to the mass of that person, while performing some specific physical activity compared to a reference, set by convention at 3.5 mL of oxygen per kilogram per minute, which is roughly equivalent to the energy expended when sitting quietly.

**Sleep score is based on heart rate, the time spent awake or restless, and sleep stages.

The number of days and the number of distinct users were derived with pivot tables in Excel for the datasets containing daily data; for the minute- and second-level data, R was used. Below is an example of the R code used.

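# Load packages used below (lubridate for mdy_hms, dplyr for n_distinct)
library("lubridate")
library("dplyr")

# Import dataset
heartrate_seconds_merged <- read.csv("data/heartrate_seconds_merged.csv")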

# Transform Time from character to date-time 
heartrate_seconds_merged$Time <- mdy_hms(heartrate_seconds_merged$Time)

# Create new column storing the date
heartrate_seconds_merged$Date <- as.Date(heartrate_seconds_merged$Time)

# Calculate distinct number of days and user id:s 
n_distinct(heartrate_seconds_merged$Date)
## [1] 31
n_distinct(heartrate_seconds_merged$Id)
## [1] 14
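
The same check can be wrapped in a small helper so that it can be repeated for each file. The following is only a sketch in base R: the helper name `count_users_days`, its arguments, and the toy rows are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: count distinct users and distinct calendar days in one table
count_users_days <- function(df, id_col = "Id", time_col = "Time") {
  # as.Date() reads only the date part; trailing time characters are ignored
  dates <- as.Date(df[[time_col]], format = "%m/%d/%Y")
  c(users = length(unique(df[[id_col]])),
    days  = length(unique(dates)))
}

# Toy data standing in for a real file (values are made up)
toy <- data.frame(
  Id    = c(1, 1, 2, 2),
  Time  = c("4/12/2016 7:21:00 AM", "4/12/2016 7:21:05 AM",
            "4/12/2016 8:00:00 AM", "4/13/2016 8:00:05 AM"),
  Value = c(97, 102, 84, 85)
)

res <- count_users_days(toy)
res
```

For the real files, the column holding the timestamp differs between datasets, so `time_col` would have to be set per file.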

2.5. Data credibility & integrity

The datasets have a small sample size (33 users or fewer). Also, the gender and other demographic information of the users are unknown, so the dataset might include users who are not women, Bellabeat’s target customers. There might also be an unknown sampling bias in the data, partly because of the missing demographics and partly because the survey was only distributed via Amazon Mechanical Turk. Additionally, the survey was only open for two months, and the data is from 2016 and might no longer be relevant. For these reasons, the results of this analysis might not be representative, and this case study should be seen as an illustration of the approach.

3. Process

Under this section, the data will be processed. Due to the amount of data and the ease of creating visualizations to share with stakeholders, R will be used.

3.1. Packages

The following packages were installed and opened:

  • janitor
  • lubridate
  • Rcmdr
  • scales
  • tidyverse

# Load packages
library("janitor")
library("lubridate")
library("Rcmdr")
library("scales")
library("tidyverse")

3.2. Import datasets

According to the central limit theorem, for a sufficiently large sample drawn from a population with finite variance, the distribution of the sample mean is approximately normal and centered on the population mean. What counts as a sufficient sample size varies depending on industry and business, but a sample size of 30 or more is a common rule of thumb. This analysis only used datasets meeting that threshold in number of users, which means heart rate data (14 users), sleep data (24 users), and weight data (8 users) were not used.
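
To make the rule of thumb more concrete, the following base-R simulation sketches how means of samples of size 30 behave. The exponential population, the seed, and the number of repetitions are arbitrary choices for illustration and have nothing to do with the Fitbit data.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Skewed stand-in population: exponential with mean 1 and standard deviation 1
n <- 30

# Draw 5000 samples of size n and keep the mean of each sample
sample_means <- replicate(5000, mean(rexp(n, rate = 1)))

# The sample means centre on the population mean (1) ...
mean(sample_means)

# ... and their spread is close to sigma / sqrt(n)
sd(sample_means)
1 / sqrt(n)

# hist(sample_means) would show a roughly bell-shaped distribution
```

Note that the simulation is about the number of observations per sample; whether 30 users are enough for a given question still depends on how variable the users are.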

For this analysis, minute-level data would not provide more insight than hourly and/or daily data. Hence, datasets containing minute-level data were not used.

The dataset called dailyActivity_merged contains daily calories, intensities, and steps, which made the datasets dedicated specifically to those data redundant for this analysis.
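
A redundancy claim like this can be checked by joining the two tables on user and day and comparing the overlapping columns. The sketch below uses made-up toy rows in base R; the column names `ActivityDay` and `StepTotal` for dailySteps_merged are assumptions and should be verified against the actual file.

```r
# Toy stand-ins for dailyActivity_merged and dailySteps_merged (made-up rows;
# the dailySteps column names ActivityDay / StepTotal are assumed)
daily_activity_toy <- data.frame(
  Id           = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps   = c(13162, 10735, 0)
)
daily_steps_toy <- data.frame(
  Id          = c(1, 1, 2),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/12/2016"),
  StepTotal   = c(13162, 10735, 0)
)

# Match rows on user and day, then compare the two step columns
both <- merge(daily_activity_toy, daily_steps_toy,
              by.x = c("Id", "ActivityDate"),
              by.y = c("Id", "ActivityDay"))

nrow(both) == nrow(daily_activity_toy)   # every row found a partner
all(both$TotalSteps == both$StepTotal)   # step counts agree
```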

To summarise, this analysis used the following datasets:

  • dailyActivity_merged
  • hourlySteps_merged

# Import datasets
daily_activity <- read.csv("data/dailyActivity_merged.csv")
hourly_steps <- read.csv("data/hourlySteps_merged.csv")

3.3. Preview datasets

A preview allows for familiarization with the data.

# Preview datasets
glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps               <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes         <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories                 <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
glimpse(hourly_steps)
## Rows: 22,099
## Columns: 3
## $ Id           <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityHour <chr> "4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "4/12/20…
## $ StepTotal    <int> 373, 160, 151, 0, 0, 0, 0, 0, 250, 1864, 676, 360, 253, 2…

3.4. Clean and format datasets

Now that the data structures are known, it is time to look for errors and inconsistencies.

3.4.1. Verify number of users

The number of distinct users was double-checked.

# Check number of distinct users
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(hourly_steps$Id)
## [1] 33

3.4.2. Identify and remove potential duplicates

The datasets were checked for duplicate rows.

# Find potential duplicates
sum(duplicated(daily_activity))      # Returns the number of duplicate rows
## [1] 0
sum(duplicated(hourly_steps))
## [1] 0
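
No duplicates were found here, so nothing had to be removed. Had there been any, they could have been dropped as in the following base-R sketch, which uses a made-up toy table (`distinct()` from dplyr would be an alternative).

```r
# Toy table with one exact duplicate row (values are made up)
toy <- data.frame(
  Id           = c(1, 1, 1, 2),
  ActivityHour = c("4/12/2016 1:00:00 AM", "4/12/2016 2:00:00 AM",
                   "4/12/2016 2:00:00 AM", "4/12/2016 1:00:00 AM"),
  StepTotal    = c(160, 151, 151, 0)
)

# Number of rows that repeat an earlier row
sum(duplicated(toy))

# Keep only the first occurrence of each row
toy_clean <- toy[!duplicated(toy), ]
nrow(toy_clean)
```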

3.4.3. Clean and rename columns

To follow a consistent naming convention, the column names were converted to snake case, and the date columns were renamed.

# Format to snake_case
daily_activity <- clean_names(daily_activity)
hourly_steps <- clean_names(hourly_steps)
# Rename activity_date to date
daily_activity <- 
  daily_activity %>%
  rename(date = activity_date)
# Rename activity hour to date_time
hourly_steps <- 
  hourly_steps %>%
  rename(date_time = activity_hour)

3.4.4. Date and time

For the dataset daily_activity, the dates were stored as characters in the American format, MM/DD/YYYY. To transform them to the ISO standard, YYYY-MM-DD, the following code was used:

# Transform into date
daily_activity$date <- mdy(daily_activity$date)

A check confirmed that the dataset looked as desired.

glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ id                         <dbl> 1503960366, 1503960366, 1503960366, 1503960…
## $ date                       <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-0…
## $ total_steps                <int> 13162, 10735, 10460, 9762, 12669, 9705, 130…
## $ total_distance             <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ tracker_distance           <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ logged_activities_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3…
## $ moderately_active_distance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1…
## $ light_active_distance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5…
## $ sedentary_active_distance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_minutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66,…
## $ fairly_active_minutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, …
## $ lightly_active_minutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205…
## $ sedentary_minutes          <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 8…
## $ calories                   <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2…

For the dataset hourly_steps, both the date and the time were stored as characters in the American format. To transform them into date-time format, the following code was used.

# Transform into date-time
hourly_steps$date_time <- mdy_hms(hourly_steps$date_time)
glimpse(hourly_steps)
## Rows: 22,099
## Columns: 3
## $ id         <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366,…
## $ date_time  <dttm> 2016-04-12 00:00:00, 2016-04-12 01:00:00, 2016-04-12 02:00…
## $ step_total <int> 373, 160, 151, 0, 0, 0, 0, 0, 250, 1864, 676, 360, 253, 221…
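
For readers without lubridate, roughly the same conversions can be sketched in base R. The input strings below are toy values, and the `%p` (AM/PM) specifier depends on the locale, so this is an illustration rather than a drop-in replacement.

```r
# Date only: American MM/DD/YYYY string to an ISO-formatted Date
d <- as.Date("4/12/2016", format = "%m/%d/%Y")
d

# Date-time: %I is the hour on a 12-hour clock, %p the AM/PM marker
dt <- as.POSIXct("4/12/2016 1:00:00 PM",
                 format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
dt
```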

4. Analyze & Share

Under this section, the Fitbit datasets will be analyzed to discover trends that can be used to inform Bellabeat’s marketing strategy.

4.1. Summary statistics

To start the analysis, summary statistics were calculated.

# Numerical Summaries: daily_activity
numSummary(daily_activity[,c("total_steps", "total_distance", "tracker_distance", "logged_activities_distance", "very_active_distance", "moderately_active_distance", "light_active_distance", "sedentary_active_distance", "very_active_minutes", "fairly_active_minutes", "lightly_active_minutes", "sedentary_minutes", "calories"), drop=FALSE], 
           statistics=c("mean", "sd", "quantiles"), quantiles=c(0,.5,1))
##                                    mean           sd 0%      50%         100%
## total_steps                7.637911e+03 5.087151e+03  0 7405.500 36019.000000
## total_distance             5.489702e+00 3.924606e+00  0    5.245    28.030001
## tracker_distance           5.475351e+00 3.907276e+00  0    5.245    28.030001
## logged_activities_distance 1.081709e-01 6.198965e-01  0    0.000     4.942142
## very_active_distance       1.502681e+00 2.658941e+00  0    0.210    21.920000
## moderately_active_distance 5.675426e-01 8.835803e-01  0    0.240     6.480000
## light_active_distance      3.340819e+00 2.040655e+00  0    3.365    10.710000
## sedentary_active_distance  1.606383e-03 7.346176e-03  0    0.000     0.110000
## very_active_minutes        2.116489e+01 3.284480e+01  0    4.000   210.000000
## fairly_active_minutes      1.356489e+01 1.998740e+01  0    6.000   143.000000
## lightly_active_minutes     1.928128e+02 1.091747e+02  0  199.000   518.000000
## sedentary_minutes          9.912106e+02 3.012674e+02  0 1057.500  1440.000000
## calories                   2.303610e+03 7.181669e+02  0 2134.000  4900.000000
##                              n
## total_steps                940
## total_distance             940
## tracker_distance           940
## logged_activities_distance 940
## very_active_distance       940
## moderately_active_distance 940
## light_active_distance      940
## sedentary_active_distance  940
## very_active_minutes        940
## fairly_active_minutes      940
## lightly_active_minutes     940
## sedentary_minutes          940
## calories                   940

Initial thoughts regarding the dataset:

  • Mean total steps per day were approximately 7638 (SD = 5087); the median was slightly lower (MDN = 7406). The minimum recorded was 0 steps and the maximum was 36019 steps. It seems that some very active users raised the average, while some sedentary users balanced them out. Investigate further.
  • Mean very active minutes were 21 min (SD = 33), and mean fairly active minutes were 13 min (SD = 20). This means that, on average, users reach the recommended active minutes to gain significant health benefits. However, the standard deviations are large. Investigate further.
  • The activity minutes did not add up to 1440 min (24 hours × 60 min) on every row, which indicates that users did not wear their tracking devices for the whole day. Investigate further.

4.2. Lifestyle type based on daily steps

As there was no available demographic data, users were categorized into lifestyle types based on their daily steps. The levels are defined in the table below.

Lifestyle type | Steps per day
Sedentary | < 5000
Low active | 5000 - 7499
Somewhat active | 7500 - 9999
Active | 10000 - 12499
Highly active | ≥ 12500

The categorization was based on the article “How many steps/day are enough? Preliminary pedometer indices for public health” by Catrine Tudor-Locke and David R. Bassett Jr.
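
The thresholds in the table can also be expressed with base R's `cut()`, which makes the boundary handling explicit. The step values below are made up to exercise the edges of each interval; this is a sketch and not the code used in the analysis.

```r
# Hypothetical mean daily step counts for six users (made-up values)
mean_steps <- c(3200, 5000, 7499.5, 9999, 12500, 15800)

lifestyle <- cut(
  mean_steps,
  breaks = c(0, 5000, 7500, 10000, 12500, Inf),
  labels = c("Sedentary", "Low active", "Somewhat active",
             "Active", "Highly active"),
  right = FALSE  # intervals are [lower, upper), so 5000 counts as "Low active"
)

data.frame(mean_steps, lifestyle)
```

Because `cut()` returns an ordered set of factor levels, no separate reordering step is needed before plotting.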

Thereafter, the users were categorized.

# Calculate average steps per day for each user
mean_daily_activity <- 
  daily_activity %>%
    select(id, date, total_steps, very_active_minutes, fairly_active_minutes) %>%
    group_by(id) %>%
    summarise(mean_steps = mean(total_steps), 
              mean_very_active_minutes = mean(very_active_minutes),
              mean_fairly_active_minutes = mean(fairly_active_minutes))
# Categorize users lifestyle based on number of daily steps
mean_daily_activity <- 
  mean_daily_activity %>%
    mutate(lifestyle_type = case_when (mean_steps < 5000 ~ "Sedentary: 0 - 4999 steps",
                                       mean_steps < 7500 ~ "Low active: 5000 - 7499 steps",
                                       mean_steps < 10000 ~ "Somewhat active: 7500 - 9999 steps",
                                       mean_steps < 12500 ~ "Active: 10000 - 12499 steps",
                                       mean_steps >= 12500 ~ "Highly active: > 12500 steps"))
# Calculate number of users in each group
lifestyle_pct <- 
  mean_daily_activity %>%
    select(id, lifestyle_type) %>%
    group_by(lifestyle_type) %>%
    summarise(n = n()) 

# Calculate percentage
lifestyle_pct <-
  lifestyle_pct %>%
    mutate(dbl = (n/sum(n)),
           round = round(dbl, digits = 2),
           error = 1 - dbl/round)
lifestyle_pct
## # A tibble: 5 × 5
##   lifestyle_type                         n    dbl round   error
##   <chr>                              <int>  <dbl> <dbl>   <dbl>
## 1 Active: 10000 - 12499 steps            5 0.152   0.15 -0.0101
## 2 Highly active: > 12500 steps           2 0.0606  0.06 -0.0101
## 3 Low active: 5000 - 7499 steps          9 0.273   0.27 -0.0101
## 4 Sedentary: 0 - 4999 steps              8 0.242   0.24 -0.0101
## 5 Somewhat active: 7500 - 9999 steps     9 0.273   0.27 -0.0101

The sum of rounded decimals did not add up to 1 (100%). As rounding induced equal error for all lifestyle types, it was decided to randomly round up one of them. The lifestyle type that was rounded up was sedentary.

# Round up sedentary lifestyle
lifestyle_pct$round[4] <- lifestyle_pct$round[4] + 0.01
# Transform into percentage
lifestyle_pct <- 
  lifestyle_pct %>%
    mutate(pct = scales::percent(round)) %>%
    select(-c(dbl, error))
# Reorder lifestyle types after activity level by adding factor levels
lifestyle_pct$lifestyle_type <- factor(lifestyle_pct$lifestyle_type, levels=c("Sedentary: 0 - 4999 steps", "Low active: 5000 - 7499 steps", "Somewhat active: 7500 - 9999 steps", "Active: 10000 - 12499 steps", "Highly active: > 12500 steps"))
# Pie chart
ggplot(lifestyle_pct, aes(x=" ", y=round, fill=lifestyle_type)) +
  geom_bar(width=1, stat="identity") +
  coord_polar("y", start=0) +
  scale_fill_manual(values = c("#61E3FA", "#6376DB", "#B96CF6", "#DB65A2", "#FB9778")) +
  labs(title = "User Distribution based on Daily Steps",
       fill = "Lifestyle") +
  theme_bw() +
  theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  axis.ticks = element_blank(),
  axis.text.x=element_blank(),
  panel.border = element_blank(),
  panel.grid=element_blank(),
  plot.title=element_text(size=14, face="bold", hjust = 0.5)
  ) +
  geom_text(aes(label = pct),
            position = position_stack(vjust = 0.5))

As the pie chart shows, all lifestyle types were represented in the tracker data. However, most users had a somewhat active (27%), low active (27%), or sedentary lifestyle (25%). Only 21% of the users were considered to have an active (15%) or highly active (6%) lifestyle based on their total daily steps.
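
Instead of picking one group at random to round up, the rounding can be made deterministic with a largest-remainder rule: round every share down, then hand the missing percentage points to the groups with the largest fractional parts. The sketch below applies this to the user counts from the table above; it is an alternative to the manual adjustment, and with tied remainders it may bump a different group than the one chosen in this report.

```r
# User counts per lifestyle type, taken from the summary table above
n <- c(Sedentary = 8, `Low active` = 9, `Somewhat active` = 9,
       Active = 5, `Highly active` = 2)

raw   <- 100 * n / sum(n)   # exact percentages
pct   <- floor(raw)         # round everything down first
short <- 100 - sum(pct)     # percentage points still to hand out

# Give the missing points to the largest remainders
bump <- order(raw - pct, decreasing = TRUE)[seq_len(short)]
pct[bump] <- pct[bump] + 1

pct
sum(pct)
```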

4.3. Active minutes

The number of steps is not the only measure of activity level. According to Verywell Fit, the number of active minutes matters even more than the number of steps. The recommended activity level is a minimum of 150 min of moderate-intensity exercise or 75 min of vigorous-intensity exercise per week to reduce health risks such as heart disease and type 2 diabetes. For potentially greater health benefits, 300 min of moderate-intensity exercise or 150 min of vigorous-intensity exercise per week is recommended. Hence, it was investigated how many of the users reached these levels of activity.

It was assumed that moderate-intensity exercise corresponds to fairly active minutes in the dataset, and that vigorous-intensity exercise corresponds to very active minutes. The weekly recommendations were converted to daily thresholds by dividing by seven and rounding up (150/7 ≈ 22, 75/7 ≈ 11, 300/7 ≈ 43). Based on these thresholds, three levels of activity were defined according to the health benefits they entail.

Enough active minutes | Fairly active minutes | AND/OR | Very active minutes | Health benefits
Not enough | 0 - 21 | AND | 0 - 10 | Low health benefits
Enough | 22 - 42 | OR | 11 - 21 | Significant health benefits
More than enough | ≥ 43 | OR | ≥ 22 | High health benefits
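
The daily thresholds in the table are consistent with spreading the weekly recommendations evenly over seven days and rounding up to whole minutes. A minimal sketch of that arithmetic follows; the vector names are made up for illustration.

```r
# Weekly recommendations in minutes, as quoted in the text above
weekly <- c(moderate_enough = 150, vigorous_enough = 75,
            moderate_high   = 300, vigorous_high   = 150)

# Spread evenly over 7 days and round up to whole minutes
daily <- ceiling(weekly / 7)
daily
```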

Thereafter, the users were categorized into an activity level based on whether they had enough active minutes.

# Categorize users by activity level based on mean active minutes per day
mean_daily_activity <- 
  mean_daily_activity %>%
    mutate(activity_level = case_when (mean_fairly_active_minutes < 22 & 
                                         mean_very_active_minutes < 11 ~ 
                                         "Not enough",
                                       mean_fairly_active_minutes >= 43 | 
                                         mean_very_active_minutes >= 22 ~ 
                                         "More than enough",
                                       mean_fairly_active_minutes < 43 | 
                                         mean_very_active_minutes < 22 ~ 
                                         "Enough"))
# Reorder lifestyle types after activity level by adding factor levels
mean_daily_activity$lifestyle_type <- factor(mean_daily_activity$lifestyle_type, levels=c("Sedentary: 0 - 4999 steps", "Low active: 5000 - 7499 steps", "Somewhat active: 7500 - 9999 steps", "Active: 10000 - 12499 steps", "Highly active: > 12500 steps"))

# Reorder activity levels by adding factor levels
mean_daily_activity$activity_level <- factor(mean_daily_activity$activity_level, levels=c("Not enough", "Enough", "More than enough"))
# Bar chart
ggplot(mean_daily_activity, aes(x=activity_level, fill=lifestyle_type)) + 
  geom_bar() +
  scale_fill_manual(values = c("#61E3FA", "#6376DB", "#B96CF6", "#DB65A2", "#FB9778")) +
  labs(title = "Users Reaching Enough Activity Minutes",
       x = " ", y = "Count",
       fill = "Lifestyle") +
  theme_bw() +
  theme(plot.title=element_text(size=14, face="bold", hjust = 0.5))

Reaching the recommended active minutes per day to gain health benefits was achievable regardless of the user's lifestyle in terms of steps. However, users seem more likely to reach the active minutes needed for high health benefits when they have a somewhat active, active, or highly active lifestyle in terms of daily steps. Hence, encouraging sedentary and low active users to be more active during the day seems like a good idea. The emphasis should be on spending more time being fairly active and very active rather than just increasing the number of steps: for example, a brisk walk rather than a slow one.

4.4. Timing of steps

Next the timing of the users’ steps was analysed. First the timing based on weekday was analysed and then the timing based on time during the day.

# Add column with weekday name
daily_activity <-
  daily_activity %>%
  mutate(weekday = wday(date, label=TRUE, abbr=FALSE, locale="en_US"))
# Calculate average steps per weekday
weekday_mean_steps <- 
  daily_activity %>%
  group_by(weekday) %>%
  summarise(mean_steps = mean(total_steps))
# Reorder weekdays by adding factor levels
weekday_mean_steps$weekday <- factor(weekday_mean_steps$weekday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
# Barplot
ggplot(weekday_mean_steps, aes(x=weekday, y=mean_steps)) +
  geom_bar(stat="identity", fill="#583475") +
  labs(title = "User Mean Steps per Weekday",
       x = " ", y = "Number of steps") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45,vjust = 0.5, hjust = 1),
        plot.title=element_text(size=14, face="bold", hjust = 0.5))

On average, users took slightly more steps on Tuesdays and Saturdays, and slightly fewer steps on Sundays.

To find out the daily step distribution per lifestyle type, the following code was used.

# Select relevant columns from daily_activity 
weekday_steps <- 
  daily_activity %>%
  select(id, date, total_steps, weekday) 

# Select relevant columns from mean_daily_activity 
df_join <- 
  mean_daily_activity %>%
  select(id, lifestyle_type)

# Join dataframes
weekday_mean_steps_grouped <- left_join(weekday_steps, df_join, by="id")

# Calculate mean steps grouped by lifestyle and weekday
weekday_mean_steps_grouped <- 
  weekday_mean_steps_grouped %>%
  group_by(lifestyle_type, weekday) %>%
  summarise(mean_steps = mean(total_steps))
# Reorder weekdays by adding factor levels
weekday_mean_steps_grouped$weekday <- factor(weekday_mean_steps_grouped$weekday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
# Barplot
ggplot(weekday_mean_steps_grouped, aes(x=weekday, y=mean_steps, fill=lifestyle_type)) +
  geom_bar(stat="identity", position = position_dodge()) +
  scale_fill_manual(values = c("#61E3FA", "#6376DB", "#B96CF6", "#DB65A2", "#FB9778")) +
  labs(title = "User Mean Steps per Weekday",
       x = " ", y = "Number of steps",
       fill = "Lifestyle") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45,vjust = 0.5, hjust = 1),
        plot.title=element_text(size=14, face="bold", hjust = 0.5))

However, looking at the average steps per weekday grouped by lifestyle shows how the steps of the active and highly active users masked the low number of steps taken by the sedentary and low active users. Both low active and sedentary users seem to take the most steps on Saturdays. Somewhat active users seem to take the most steps at the beginning of the week (Monday and Tuesday) and fewer steps on Sundays. Maybe the Bellabeat app could customize activity reminders based on user lifestyle.

To find out at what time of day users are most active, the users' hourly steps were analysed.

# Separate date and time into two different columns.
# Formatting first keeps "00:00:00" for midnight; converting a date-time
# directly to text drops it, which leaves the time as NA for those rows
hourly_steps <- 
  hourly_steps %>%
    mutate(date_time = format(date_time, "%Y-%m-%d %H:%M:%S")) %>%
    separate(date_time, into = c("date", "time"), sep = " ")
# Calculate mean steps per hour
hourly_mean_steps <-
  hourly_steps %>%
  group_by(time) %>%
  summarise(mean_steps = mean(step_total))
# Barplot
ggplot(hourly_mean_steps, aes(x=time, y=mean_steps)) +
  geom_bar(stat="identity", fill="#583475") +
  labs(title = "User Mean Steps per Hour",
       x = " ", y = "Number of steps") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45,vjust = 0.5, hjust = 1),
        plot.title=element_text(size=14, face="bold", hjust = 0.5))

On average, users take the most steps around lunchtime (noon - 2 pm) and between 5 pm and 7 pm. This might indicate that most of the users have office jobs.
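
Grouping by hour of day relies on every row having a time string, and midnight needs some care: when a date-time is converted to text, the 00:00:00 part can be dropped. The following base-R sketch, with made-up timestamps, shows one way to build date and time strings that always include the time.

```r
# Toy hourly timestamps, including midnight (made-up values)
ts <- as.POSIXct(c("2016-04-12 00:00:00", "2016-04-12 13:00:00"), tz = "UTC")

# format() always writes the full pattern, so midnight stays "00:00:00"
parts <- data.frame(
  date = format(ts, "%Y-%m-%d"),
  time = format(ts, "%H:%M:%S")
)
parts
```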

# Join hourly_steps and df_join
hourly_mean_steps_grouped <- left_join(hourly_steps, df_join, by="id")

# Calculate average hourly steps per lifestyle type
hourly_mean_steps_grouped <-
  hourly_mean_steps_grouped %>%
  group_by(lifestyle_type, time) %>%
  summarise(mean_steps = mean(step_total))
# Barplots showing mean steps throughout the day for each lifestyle type
ggplot(hourly_mean_steps_grouped, aes(x=time, y=mean_steps, fill=lifestyle_type)) +
  geom_bar(stat = "identity") +
  facet_wrap(vars(lifestyle_type)) +
  scale_fill_manual(values = c("#61E3FA", "#6376DB", "#B96CF6", "#DB65A2", "#FB9778")) +
  labs(title = "User Mean Step Distribution Throughout the Day",
       x = " ", y = "Number of steps") +
  theme_bw() +
  theme(axis.text.x = element_blank(),
        legend.position = "none",
        plot.title=element_text(size=14, face="bold", hjust = 0.5))

The more active the user, the more the distribution deviated from a uniform distribution. It seems that most somewhat active, active, and highly active users have periods where they take a large number of steps followed by periods where they take fewer steps. Both active and highly active users have a peak at 2 pm; perhaps they are going for a midday walk or workout. Both groups also have a peak at 6 - 7 pm; perhaps they are walking home from work or going for a walk or workout. Active users also have a peak at 9 am; maybe they walk to work.

In general, it seems like a good idea to promote more activity throughout the whole day, for instance through a small reminder in the app to take an activity break and move a bit. Users could be encouraged to commute on foot or by bike if they are able, to take a lunchtime walk or workout, and to keep moving after work; that is, to integrate movement into their everyday life.

4.5. Tracker usage

While observing the summary statistics, it was noticed that users do not keep their tracking devices on all the time. First, it was investigated how many days, out of the 31 days, the users utilized their fitness tracker.

The users were categorized into four different groups based on their daily usage. The groups are shown in the table below.

User type | Days of usage
Sporadic | 0 - 9
Moderate | 10 - 19
Frequent | 20 - 30
Everyday | 31

To calculate the daily usage, the following code was used.

# Categorize based on usage
tracker_usage_days <-
  daily_activity %>%
  group_by(id) %>%
  summarise(n_days = n()) %>%
  mutate (usage_days = case_when(n_days < 10 ~ "Sporadic user: 0 - 9 days",
                            n_days < 20 ~ "Moderate user: 10 - 19 days",
                            n_days < 31 ~ "Frequent user: 20 - 30 days", 
                            n_days == 31 ~ "Everyday user: 31 days"))
# Calculate percentages for graph
tracker_usage_days_pct <-
  tracker_usage_days %>%
  group_by(usage_days) %>%
  summarise(n_users = n()) %>%
  mutate(dbl = n_users / sum(n_users),
         dbl = round(dbl, digits = 2),
         pct = percent(dbl)) 

tracker_usage_days_pct
## # A tibble: 4 × 4
##   usage_days                  n_users   dbl pct  
##   <chr>                         <int> <dbl> <chr>
## 1 Everyday user: 31 days           21  0.64 64%  
## 2 Frequent user: 20 - 30 days       9  0.27 27%  
## 3 Moderate user: 10 - 19 days       2  0.06 6%   
## 4 Sporadic user: 0 - 9 days         1  0.03 3%
# Reorder groups by adding factor levels
tracker_usage_days_pct$usage_days <- factor(tracker_usage_days_pct$usage_days, levels=c("Sporadic user: 0 - 9 days", "Moderate user: 10 - 19 days", "Frequent user: 20 - 30 days", "Everyday user: 31 days"))
# Pie chart
ggplot(tracker_usage_days_pct, aes(x=" ", y=dbl, fill=usage_days)) +
  geom_bar(width=1, stat="identity") +
  coord_polar("y", start=0) +
  scale_fill_manual(values = c("#6376DB", "#B96CF6", "#DB65A2", "#FB9778")) +
  labs(title = "Tracker Usage based on Days",
       fill = "User type") +
  theme_bw() +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.ticks = element_blank(),
    axis.text.x=element_blank(),
    panel.border = element_blank(),
    panel.grid=element_blank(),
    legend.position = "left",
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5)) +
  geom_text(aes(label = pct),
            position = position_stack(vjust = 0.5))

The users used their trackers often. A majority of users (64%) used their trackers on all 31 days, and more than a fourth (27%) used them on 20 to 30 days of the month.
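One caveat is that `n()` counts rows, so a day on which the device recorded no steps still counts as a day of usage if a row exists for it. The base R sketch below illustrates the difference on made-up data. The column name `total_steps` and the toy data frame are assumptions for illustration and are not taken from the analysis above.

```r
# Toy data standing in for daily_activity (hypothetical values).
# total_steps is an assumed column name.
toy_activity <- data.frame(
  id          = c(1, 1, 1, 2, 2, 2),
  total_steps = c(5200, 0, 8100, 0, 0, 300)
)

# Rows per user, which is what n() counts in the code above
rows_per_user <- tapply(toy_activity$total_steps, toy_activity$id, length)

# Days per user with at least one recorded step
worn_days <- tapply(toy_activity$total_steps > 0, toy_activity$id, sum)

# Side-by-side comparison of the two counts
data.frame(id          = names(rows_per_user),
           n_rows      = as.vector(rows_per_user),
           n_worn_days = as.vector(worn_days))
```

If the two counts differ noticeably in the real data, the user types could be based on the stricter count instead.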

Thereafter, it was investigated how many minutes a day the users used their fitness tracker. The usage was divided into four groups, based on the following reasoning: one day and night is 24 hours (1440 min), and people are recommended to sleep 8 hours (480 min) per night. In this analysis, the day is therefore defined as the remaining 16 hours (960 min), so half of the day is 480 min. The usage groups are shown in the table below.

Usage group                  Minutes of usage
Less than half of the day    0 - 479
More than half of the day    480 - 959
Most of the day and night    960 - 1439
All day and night            1440
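The thresholds in the table follow from simple arithmetic on the stated assumptions. As a minimal sketch, they can be derived rather than typed in as fixed numbers. The helper `usage_group()` and the variable names are hypothetical, and base R `cut()` is used here as an alternative to `case_when()`.

```r
# Thresholds derived from the assumptions stated above (hours to minutes)
minutes_per_hour <- 60
full_day   <- 24 * minutes_per_hour   # 1440: one day and night
sleep_time <- 8 * minutes_per_hour    # 480: recommended sleep
waking_day <- full_day - sleep_time   # 960: the "day" in this analysis

# Hypothetical helper: assign a usage group to a vector of tracked minutes
usage_group <- function(tracked_minutes) {
  cut(tracked_minutes,
      breaks = c(0, waking_day / 2, waking_day, full_day, Inf),
      labels = c("Less than half of the day",
                 "More than half of the day",
                 "Most of the day and night",
                 "All day and night"),
      right = FALSE)  # intervals are [lower, upper), matching the table
}

# Boundary values on both sides of each threshold
usage_group(c(0, 479, 480, 959, 960, 1439, 1440))
```

Note that in this sketch any value of 1440 or more falls into the last group, whereas the table defines it as exactly 1440.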

To calculate the usage in minutes, the following code was used.

# Categorize based on usage
tracker_usage_minutes <-
  daily_activity %>%
  mutate(tracked_minutes = very_active_minutes + fairly_active_minutes + lightly_active_minutes + sedentary_minutes) %>%
  select(id, tracked_minutes) %>%
  mutate(usage_minutes = case_when(tracked_minutes < 480 ~ "Less than half of the day: 0 - 479 min",
                                   tracked_minutes < 960 ~ "More than half of the day: 480 - 959 min",
                                   tracked_minutes < 1440 ~ "Most of the day and night: 960 - 1439 min",
                                   tracked_minutes == 1440 ~ "All day and night: 1440 min"))
# Calculate percentages for graph
# (each row is one user-day, so n_users counts daily records, not distinct users)
tracker_usage_minutes_pct <-
  tracker_usage_minutes %>%
  group_by(usage_minutes) %>%
  summarise(n_users = n()) %>%
  mutate(dbl = n_users / sum(n_users),
         dbl = round(dbl, digits = 2),
         pct = percent(dbl))

tracker_usage_minutes_pct
## # A tibble: 4 × 4
##   usage_minutes                             n_users   dbl pct  
##   <chr>                                       <int> <dbl> <chr>
## 1 All day and night: 1440 min                   478  0.51 51%  
## 2 Less than half of the day: 0 - 479 min         13  0.01 1%   
## 3 More than half of the day: 480 - 959 min      167  0.18 18%  
## 4 Most of the day and night: 960 - 1439 min     282  0.3  30%
# Reorder by adding factor levels
tracker_usage_minutes_pct$usage_minutes <- factor(tracker_usage_minutes_pct$usage_minutes, levels=c("All day and night: 1440 min", "Most of the day and night: 960 - 1439 min", "More than half of the day: 480 - 959 min", "Less than half of the day: 0 - 479 min"))
# Visualize results in pie chart
ggplot(tracker_usage_minutes_pct, aes(x=" ", y=dbl, fill=usage_minutes)) +
  geom_bar(width=1, stat="identity") +
  coord_polar("y", start=0) +
  scale_fill_manual(values = c("#FB9778", "#DB65A2", "#B96CF6", "#6376DB")) +
  labs(title = "Tracker Usage based on Minutes",
       fill = "Minutes of usage") +
  theme_bw() +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.ticks = element_blank(),
    axis.text.x=element_blank(),
    panel.border = element_blank(),
    panel.grid=element_blank(),
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5)) +
  geom_text(aes(label = pct),
            position = position_stack(vjust = 0.5))

Also in terms of minutes per day, the fitness trackers were used a lot. Note that the counts above are daily records (one per user and day), not individual users. In a majority of the daily records (51%), the device was used every single minute of the day and night. In 30% of the records, the tracker was used most of the day and night, and in almost a fifth (18%) it was used between 8 and 16 hours.
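Because the percentages are computed over daily records, a per-user view can give a different picture. The base R sketch below contrasts the two on made-up data. Classifying a user by their median day is an assumption chosen for illustration, not the method used in this report.

```r
# Toy data: one row per user and day, like tracker_usage_minutes
# (hypothetical values)
toy_minutes <- data.frame(
  id              = c(1, 1, 1, 2, 2, 3, 3),
  tracked_minutes = c(1440, 1440, 1000, 900, 950, 1440, 1440)
)

# Share of daily records with full-day tracking (record-level view)
share_of_records <- mean(toy_minutes$tracked_minutes == 1440)

# Per-user alternative: a user counts as "all day and night"
# if their median day has 1440 tracked minutes
median_per_user <- tapply(toy_minutes$tracked_minutes, toy_minutes$id, median)
share_of_users  <- mean(median_per_user == 1440)

c(records = share_of_records, users = share_of_users)
```

The two shares answer different questions: how typical a fully tracked day is, versus how many users typically track the full day.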

Lastly, the usage in minutes per user type was analyzed.

# Join tracker_usage_days and tracker_usage_minutes
tracker_usage <- left_join(tracker_usage_days, tracker_usage_minutes, by = "id")

# Group by usage in days and usage in minutes
# (.groups = "drop" returns an ungrouped tibble; n counts daily records)
tracker_usage <-
  tracker_usage %>%
  group_by(usage_days, usage_minutes) %>%
  summarise(n = n(), .groups = "drop")
# Reorder groups by adding factor levels
tracker_usage$usage_days <- factor(tracker_usage$usage_days, levels=c("Sporadic user: 0 - 9 days", "Moderate user: 10 - 19 days", "Frequent user: 20 - 30 days", "Everyday user: 31 days"))

# Reorder by adding factor levels
tracker_usage$usage_minutes <- factor(tracker_usage$usage_minutes, levels=c("Less than half of the day: 0 - 479 min", "More than half of the day: 480 - 959 min", "Most of the day and night: 960 - 1439 min","All day and night: 1440 min"))
# Bar plots based on user types
ggplot(tracker_usage, aes(x=" ", y=n, fill=usage_minutes)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  facet_grid(. ~ usage_days, labeller = label_wrap_gen(width=16)) +
  scale_fill_manual(values = c("#6376DB", "#B96CF6", "#DB65A2", "#FB9778")) +
  labs(title = "Fitness Tracker Utilization",
       x=" ", y="Number of entries",
       fill = "Minutes of usage") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14, hjust = 0.5))

As seen in the figure, the fitness trackers are used a lot in terms of both days and minutes. Users who use their trackers daily are also likely to use them all day and night or most of the day and night. This indicates that user engagement is high. However, to increase the usage of the fitness tracker even further, it might be a good idea to add a reminder in the app to wear the fitness tracker after longer periods of inactivity.
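Since the user types differ greatly in size, raw counts make the groups hard to compare. One way to check the pattern is to look at shares within each user type. The sketch below does this in base R on made-up data, with shortened, hypothetical group labels.

```r
# Toy data: one row per daily record with its two group labels
# (hypothetical values and shortened labels)
toy_usage <- data.frame(
  usage_days    = c("Everyday", "Everyday", "Everyday",
                    "Frequent", "Frequent", "Moderate"),
  usage_minutes = c("All day", "All day", "Most of day",
                    "All day", "Half day", "Half day")
)

# Cross-tabulate the counts, then convert each row to shares
counts     <- table(toy_usage$usage_days, toy_usage$usage_minutes)
row_shares <- prop.table(counts, margin = 1)  # each user type sums to 1

round(row_shares, 2)
```

Each row then shows how the records of one user type are distributed across the minute groups, regardless of how many users the type contains.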

4.6. Recommendations

This section will present a summary of the high-level recommendations to be used in Bellabeat’s marketing strategy.

  • Encourage sedentary and less active users to be more active during the day. Emphasize spending more time being fairly active and very active rather than just increasing the number of steps. For example, a brisk walk rather than a slow one.
  • Offer custom reminders to be active, based on the user’s lifestyle.
  • Promote more activity throughout the whole day, for example with a small reminder in the app to take an activity break and move a bit. Encourage users to commute by foot or bike if they are able, to take a lunchtime walk or workout, and to keep moving after work, i.e., to integrate movement into their everyday life.
  • Add a small reminder in the app to wear the fitness tracker if there are long periods of inactivity.
  • Repeat the analysis with Bellabeat’s own data. Since the demographics of the Fitbit users are unknown, it is uncertain whether the results of this analysis apply to Bellabeat’s customers. The sample is also very small. Hence, it is recommended to use Bellabeat’s own data to perform the same or a similar analysis.
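The caution about the small sample can be made concrete with a confidence interval. As a rough sketch, the code below computes an exact binomial interval for the share of everyday users, using the counts from the table in Section 4.5 (21 of 33 users). Treating the users as a random sample of the target population is an assumption that may not hold for this data.

```r
# Counts taken from the tracker usage table in Section 4.5
n_users    <- 33
n_everyday <- 21

# Exact (Clopper-Pearson) 95% confidence interval for the share of
# everyday users, assuming the users are a random sample
ci <- binom.test(n_everyday, n_users)$conf.int

round(c(estimate = n_everyday / n_users, lower = ci[1], upper = ci[2]), 2)
```

A wide interval around the 64% estimate would support repeating the analysis on a larger data set.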

5. Act

Now it is up to the Bellabeat marketing group to ACT!