Analysis of fitness tracker data
Data cleaning in Google Sheets
Data Analysis in SQL (in BigQuery)
Visualization in Tableau
Here what I am asked to do in this project :
What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?
I am ask to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices
The dataset I am suggested to analyze is the Fitbit Fitness Tracker Data which consists of the Fitbit data of about 30 users for a month during the year 2016. So we total around 1000 user.days in theory, if every user was consistent with wearing their tracker.
Let’s pause a moment to reflect on what we’re being asked : analyze usage, or gain insight into how people use their trackers. There’s as far as I can tell two ways to interpret this :
Taken literally : extract information regarding how the sample users are using their trackers :
when they decide to wear it vs not
do they use it for exercise vs walking vs sleep
do they use any manual logging function, like weight tracking, which has more to do with the app than the tracker itself.
What does the fitness tracker data reveal about how the sample users live their lives :
how much sleep do they get ?
how much walking do they do each day ?
how strong is the relationship between any two recorded variables ?
In a real world context, I would ask to clarify. Analysis n°1 would be the choice if the demand was taken literally, but it is so limited in terms of insight value, that we’ll be doing both : analysis 1 and 2. Let’s not lose track of the overarching goal : helping Bellabeat grow by providing actionable insights to the marketing team. So we might be left wondering how Analysis n°2 can possibly help with that : how can knowing how a sample of fitness tracker users live their lives help a marketing team ? The main way I can think of is by improving the value of the Bellabeat tracker/app combination for the pursuit of health goals. In other words, promoting positive change. How : 1) get statistics about the sample users for a particular variable 2) research the consensus recommended level or range for that variable 3) compare the two 4) In so far as they diverge, engineer features in the app meant to reduce that disparity :
provide general information to the users* provide custom information to the user, which adapts to the user’s recent data This information could be provided via the following means : informational notifications, reminders, goal setting, gamification (rewards and badges, maybe a social aspect), access to clear data visualization showing one’s evolution towards a goal…
Circling back to the growth of the company : more users believing their Bellabeat tracker helps them means better reviews and word of mouth, which in turns means a growing number of users.
After failing to import the dataset in BigQuery, I decided to look at it in Google Sheets. Once in Sheets, I decided to do some data Cleaning there.
One thing that was apparent is that there was some issue with the dates and date-times in all the datasets (spoiler alert : there wasn’t, my Google Sheets wasn’t set to the proper country)
As you can see, the dates weren’t all formatted in the same way. I could try and set the whole column to the same format, but it wouldn’t work.
I noticed that whatever I did, the dates that are centered left just wouldn’t be formatted into Date and Time by Sheets, they stuck to being strings.
I finally realized what the problem was. Dates with days 1 to 12 were behaving as dates as excepted, but days 13 and up wouldn’t.
This issue was therefore very likely an issue of DD/MM/YYY vs MM/DD/YYYY, and one that couldn’t be solved by a normal format.
My Google Sheets was looking for DD/MM while the dataset was in MM/DD, that explained why :
Days 1 to 12 were registered inaccurate dates. For example 04/12 was considered December 4th instead of April 12th,
Days 13 to 31 were registered as text.
No attempt at telling Google sheets the dates were in the MM/DD format worked.
I then went through an overly complicated way to “clean” the dates in all the tables.
I knew this might be useless because that problem of formatting was potentially specific to Sheets, and wouldn’t be an issue in SQL/BigQuery, but I did it just in case.
3 Full Clean date and time.png
The cleaning steps were the following :
Data > Split text to columns. Separator = space. This splits date from time.
All the following only refers to the date.
Data > Split text to columns. Separator = /
I now have three columns : month, day, year
I create a fourth column to recombine them in the only way my Google Sheets seems to understand : D/M/Y
=CONCATENATE(day cell,”/”,month cell,”/”,year cell)
Apply it to the whole column
I copy and “Paste as values” this column into a new one
And finally I can format it as a date in any format I wish, the dates are accurate.
I can now remove all the temporary columns
The Cleaning of the time followed roughly the same steps, except for also having to deal with “AM” and “PM”.
4 Final clean date and time.png
Learning from my mistakes.
Only once I had done that for most of the tables did I figure out what the proper solution was. I had to format the whole sheet document :
File > Settings > General > Locale > Switch from France (or UK) to United States.
If I did that before importing the csv file, or if I formatted the date-time column once I did that, then the dates were all accurate and didn’t need my convoluted cleaning steps.
But it was a good practice so I don’t mind the loss of time too much.
Misc routine cleaning
Here is some additionnal cleaning steps I went through :
checking for inconsistent data, for example out of range.
checking for duplicates
checking for empty cells
Data Analysis & Visualization
Note : Why are my SQL queries embedded from Pastebin ? That’s the most straightforward way I’ve found so far to display SQL code with proper syntax highlighting on Kaggle. Kaggle itself doesn’t offer this feature.
collapse the table of contents on the right to expand the width of the visualizations.
The visualizations are interactive, you can get additional information by hovering and clicking.
Please expand the visualizations by collapsing the table of contents on the right.
Observation 1 : Fitbit wearing
Only 40% of the total possible nights were tracked.
Informing users about the value and benefits of a more consistent use. I think notifications on the app could be annoying and counter-productive. Instead, informing them through articles on the benefits of a consistent use might be better. The more they wear it, the better the insights they’ll receive on the app.
Observation 2 : amount of sleep
Less than half of the recorded nights fall within the recommended of 7 to 9 hour range.
We’re using the total time actually asleep here. This is a metric you can only know with a tracker. You can’t possibly know it without one because it is inferior to the difference between bedtime (or even falling asleep time) and wakeup time. This is because we usually wake up multiple times during the nights, without recalling it.
44% of nights are under the minimum recommended 7 hours,
while 9% were over the maximum recommended of 9 hours.
Find ways to help the user get into the recommended range consistently. Tackling too little is probably more important than too much. It’s both more common and more problematic.
There is a lot of evidence when it comes to the health issues – short and long term – caused by not getting enough sleep. The upper limit of 9 hours isn’t as clear. It’s unlikely that it would cause problems, but it’s more that, unless you’re child or an athlete, getting that much sleep could be the symptom (not cause) of an underlying health issue.
Notification : instead of a notification “you should be fall asleep now”, I think it’d be more useful to have one at least 60 to 90 minutes prior that reminds us to start a wind-down routine :
dim the lights, especially blue (cold white) ones : both in the room and on your screens
stop screens, especially video games, news and social media
if you’re going to consume some type of content/media, favor the calm/boring and ban the stimulating/stressful.
journaling, next day to-do (flushing what’s on our mind to avoid overthinking and worrying once in bed)
Too much is only problematic when it’s consistent, as it can be a sign of an underlying issue.
I used a comparison between time asleep and time in bed as an indicator of trouble sleeping, the more advance stage being insomnia. For the most part, the sample users during that month of tracking didn’t seem to suffer from too much insomnia or more broadly trouble with falling asleep or staying asleep.
Note : This visualization doesn’t distinguish between spending time awake before falling asleep, and walking up in the middle of the night.
That doesn’t mean we should ignore people that do. If the Bellabeat app detects that a user consistently spends more that 20% of his time in bed awake, they could receive tips to reduce that number :
stress management strategy
wind down routine, avoiding overstimulation in the evening
tips regarding food, alcohol, caffeine…
maybe even CBT
limit alcohol and cannabis, especially in the evening
stop caffeine intake earlier in the day
The half life of caffeine is around 6 hours. To avoid having caffeine interfering with sleep, there are individual differences in when people should stop. Some people should stop as early as noon, while some others can handle caffeine intake up to 4PM. Very few people can consume caffeine in the evening and have it not affect their sleep in any way. All of this also depends of course on how strong a dose of caffeine your last intake is.
Does bedtime influence sleep duration ?
In an age of sleep procrastination, I wanted to see if the hour at which people fall asleep correlates with how long they end up sleeping. No causation would be proven, but a valuable investigation nonetheless.
Sleep procrastination refers to the behavior of delaying or putting off going to bed, or stopping any activity once in bed, even when you know that you should be sleeping. This can manifest in different ways, such as staying up late to watch TV, play video games, or scroll through social media, even when you know that you need to wake up early the next day.
Now before I even did the analysis, I knew there was a big confounder that was absent from the dataset : the use an alarm clock.
If your wake up time is fixed, then it is self evident that earlier sleep time means a longer sleep. This dataset lets us observe if that’s an issue with our sample users.
But even more interesting would have been to have the presence of an alarm clock as a boolean in our dataset, to filter the data and look at the relationship (bedtime vs. sleep duration) for people that do not use an alarm. This could have given us a small insight into, circadian rhythms and biology.
Here is the resulting Tableau scatter plot. I filtered the data to only keep sleep duration over 180 minutes, with falling asleep time between 8pm and 6am, to remove the noise of naps or outlier data. You can tweak the filters yourself on the right of the viz.
As you can see, and as you might have guessed, the trend is clear. On average, falling asleep earlier means sleeping longer.
One thing I could have done to supplement the scatter plot is a bar chart where I group the samples per hour they fell asleep, and plot the median sleep duration for each group.
This is perhaps an opportunity for Bellabeat to offer some options and suggestions that can help users setting up a better wind-down routine, to sleep earlier rather than later, avoiding sleep procrastination.
When would you like to wake up ?
How much sleep do you think you would benefit the most from ? Pick between 7 and 9 hours.
Fitbit data : how long do you usually take to fall asleep ?
Fitbit data : how much of your night is on average spent awake ?
Data points 2, 3 and 4 are combined into a go to bed time.
You can set a reminder before that time, you decide when (30, 60 or 90 minutes before bed time)
Does sleep influence calorie burning the next day ?
That’s an other potential relationship I was wondering about.
If you sleep long enough, do you burn more calories the next day ?
Perhaps you’re more likely to exercise after a long rather than a short night.
This time, no apparent relationship between the two metrics was revealed. That doesn’t mean it’s not worth publishing, an absence of correlation is (almost) as interesting as the presence of one.
One thing to note here is there could potentially be two opposite forces at play here :
You sleep more and have more energy and motivation to be active the next day
You sleep less and therefore spend more hours awake the next day. Anything you do awake burns more calories than lying in bed.
Sleeping: 30 to 60 calories per hour
Sitting: 60 to 90
Standing : 72 to 90
Walking: 180 to 300
Steps per hour : a week heatmap
I thought the weekly heatmap was a good idea initially, but it turns out to be limited in the insights we can extract from it, more can be done.
I decided to shift to only comparing three types of days : weekday, Saturday, and Sunday, and doing so via bar charts instead of a heatmap. That would make whatever differences might exist between them more apparent.
Steps per hour : weekday, Saturday and Sunday compared
For the sake of clarity, this was done in two steps (pun non intended).
1) Writing a first query from the cleaned up table
2) Exporting the result of the query as a new table
3) Writing a second query from the new table
Note : this was done in order to declutter the code a bit, but maybe it isn’t the best way to go about it.
Making days start at 4AM
Query n°1 has the task of shifting the definition of a day. Instead of going from midnight to 11PM, I thought that, in the context of step counting, it made more sense to have days start at 4AM and finish at 3:59.
Whoever walks between midnight and 3AM, is likely doing so at the end of their previous day. They’re very unlikely to be an early riser
For exemple : I wanted steps registered between midnight and 3AM on Sunday count towards Saturday.
To make the switch from SQL to a tableau visualization easier, I created a 24th, 25th, 26th and 27th hour.
April 15 hour 2 (2AM) would now be April 14 hour 26 (14PM)
The second query groups all days across the sample month (and 30+ users) in three bins : weekday, Saturday, Sunday. We then compute the average number of steps for each hour of the day.
Here is the result after some work in Tableau (see full version by clicking on the Tableau icon in the bottom left of the viz)
Users walked the most on Saturdays (8378 steps on average), then on weekdays (7759) and the least on Sundays (6796).
The number of steps peaks at 6PM on weekdays, and at 1PM on Saturdays. There isn’t any clear peak on Sundays.
On none of the days did the average user reach the recommended 10.000 steps, for basic cardiovascular fitness and calorie burning.
The average user would need to increase his walking by +30% to reach a goal 10.000 steps a day.
This is an average of all the users, so it’s likely some users do reach that goal. But the notifications would be tailored for each user depending on their personal data so we don’t need to worry about over-generalizing.
Maybe the Bellabeat tracker or app can remind the user of some steps goal at strategic moments of the day (ie not at 11PM). They could also receive badges and have access streak counter : how many days in a row have they reached their daily steps goal.
The Fitbit app allows users to log their weight.
Overall, weight logging was mostly ignored by the users.
Out of 33 users, only 8 logged their weight at least once, and only 2 did so consistently. Contrary to sleep, steps, or calories tracking, this has to be done manually by the users, and the tracker isn’t being used, hence the lower amount of data.
Another reason might be that it’s not obvious if or how logging ones weight can prove beneficial.
Bellabeat can think of the value of that feature. Do we believe it can be useful to some users ? If so, how can we encourage consistent logging ? Maybe we can only notify some users, and not others. For example maybe users can choose between a couple of goals :
gaining lean mass (calorie surplus)
losing fat (calorie deficit)
The issue with weight on the scale is that it is the sum of two very interesting numbers : lean mass and fat mass. But the scale just gives us the sum. The scales that do offer information regarding bodyfat aren’t particularly accurate, and BMI is close to useless at the individual level.
For two individuals of the same sex and height, sharing the exact same weight on the scale (and therefore BMI) can hide wide differences in body composition and health. One could be high in lean mass and low in bodyfat, while the opposite would be true for the other person.
For the users that do want to track their weight, some added features can be offered. For example, weight on the scale evolution day to day is actually a pretty noisy piece of data : the scales have a margin of error, and our water weight fluctuates a lot too. Therefore, displaying and tracking weekly average weight can be more valuable. This means the user should weigh themselves 4 to 7 times a week, at the same time of the day, maybe while fasted before breakfast, and not worry too much about the day to day fluctuations.
We took a dataset from Fitbit fitness trackers that included data from 30 users across 30 days. We explores data regarding both how they used their Fitbit (wearing consistency, manual logging), and what the tracker data revealed about their routine. We then suggested things Bellabeat could work on to improve the overall user experience of the Bellabeat trackers, with the focus on the following : how do we help a user set sensible health goals, and reach them successfully ?
It’s reasonable to assume that someone who buys a fitness tracker do so with a couple of goals in mind :
assessing where they currently stand, relying on hard data rather than intuition and rough guesses
set a goal
tweak their daily habits in the pursuit of that goal
The advantage of a fitness app linked to a tracker is that the suggestions and notifications can be tailored to the user’s current habits, and therefore be more useful and actionable than general guidelines one might read in a book.