As a data scientist, I'm crazy about data! I gather it whenever I can and I'm always checking whether it's possible to get the data through an API (application programming interface), Excel exports or, at the very least, web scraping. Not only do I love it, but I also keep in mind that every bit of data may help draw useful insights, and every missed piece of information may never be retrievable again (what a loss!).
This positive craziness drove me and my colleagues to one of the most joyful experiments of my work life - our very own geeks' version of the Big Brother game! We gathered our tools: a Raspberry Pi and six sensors - temperature, humidity, light, sound and two distance sensors. And if a lot of data is good, then even more data must be better! So we also started to capture pictures taken automatically from two positions in the office. We had tons of ideas about how to use this data, but only one was really close to my heart, only one accompanied me for months - solving the problem of booked yet unoccupied conference rooms.
History shows that some meetings end early and some never happen at all, for various reasons. Wouldn't it be great to see not only which conference rooms are booked, but also which are occupied right now? And, going one step further, wouldn't it help to know how many people attend meetings, so we could plan proper conference room sizes?
I felt in my gut that it was possible to predict whether and how many people are in a room based on the sensor data. You may say that installing an infrared camera in a room would be a simpler solution. For sure! But let's keep in mind that having a camera in a conference room may feel like a privacy violation. What's more, the richness of our sensor data enables not only occupancy detection but also other inferences. So goodbye to unreliable calendars that we all forget to update, and hello to a cheap, privacy-respecting (no personally identifiable data collected) and, for extra fun, built-from-scratch solution. What an amazing challenge!
As with all brilliant (:D) ideas, it got postponed. The Raspberry Pi fell off the desk, our light sensor broke, the disk filled up and the camera had to be used elsewhere… However, miracles do happen after all, and now the first prototype is ready!
I modelled room occupancy (whether a room was occupied or not) and the number of people in a room separately, since the latter is a significantly more difficult problem.
The first prototype for room occupancy uses a Random Forest (500 trees, with a maximum depth of 5). I split the data into train (80%) and test (20%) sets. The training set was used to choose the best Random Forest parameters and features via cross-validation. Even though the problem can be treated as a time series one, cross-validation was done randomly rather than with moving time windows - a conscious decision, as the data at my disposal came from a rather short time period. The probability cutoff was changed from 0.5 to 0.6 as it performed significantly better. The highest cross-validation accuracy is 0.94, while the final test accuracy is 0.93. This means that out of 100 meetings in our conference rooms, we would have correctly labelled the room as empty/busy in 93 cases.
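For the curious, this setup can be sketched in a few lines of scikit-learn. Everything here is illustrative: the data is synthetic, the column meanings are placeholders for our real sensor readings, and the 0.6 cutoff is applied manually to the predicted probabilities, since it's not a built-in parameter:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(42)
# Synthetic stand-in for the real sensor readings
# (hypothetical columns: temperature, humidity, light, sound)
X = rng.normal(size=(500, 4))
# "Occupied" is driven mostly by the sound column, plus noise
y = (X[:, 3] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Same shape as in the post: 500 trees, maximum depth 5
model = RandomForestClassifier(n_estimators=500, max_depth=5, random_state=42)
cv_acc = cross_val_score(model, X_train, y_train, cv=5).mean()
model.fit(X_train, y_train)

# Shift the probability cutoff from the default 0.5 to 0.6
proba = model.predict_proba(X_test)[:, 1]
pred = (proba >= 0.6).astype(int)
test_acc = (pred == y_test).mean()
```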
There are two error types:
- Saying the room is full while it’s actually empty
- Saying the room is empty while there is actually a meeting there
By coincidence, the error rate is the same for both.
The first one worries us less as somebody would probably hijack the room anyway ;)
The second one is more serious, as we may be forcing our colleagues to walk quite a long distance just to find out the room is taken anyway.
The models themselves, even tuned to the maximum with the cleverest automated hyperparameter search, are powerless without well-defined variables. Actually, that's one of the most beautiful parts of modelling, as it requires human thinking and can't be replaced with an automatic grid search. The features I used for modelling are a mixture of sensor data, weather data (temperature, humidity, pressure, etc.), some seasonalities and natural correlators such as the probability of taking a certain day off. However, I was quite surprised that a model based solely on the sensors could achieve accuracies of 0.85 (mean cross-validated accuracy) and 0.84 (final test accuracy).
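To give a flavour of what such features can look like, here is a hypothetical sketch in pandas: the sensor columns and day-off probabilities are made up, and in the real pipeline the weather columns would be joined in from an external source:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical sensor readings indexed by timestamp
idx = pd.date_range("2017-05-01", periods=48, freq="h")
df = pd.DataFrame({"sound": rng.random(48),
                   "temperature": 21 + rng.normal(size=48)}, index=idx)

# Seasonality features derived from the timestamp
df["hour"] = df.index.hour
df["dayofweek"] = df.index.dayofweek
df["is_weekend"] = (df["dayofweek"] >= 5).astype(int)

# A "natural correlator": a hand-crafted prior for taking the day off
# (purely illustrative numbers)
day_off_prob = {0: 0.02, 1: 0.02, 2: 0.03, 3: 0.05, 4: 0.10, 5: 1.0, 6: 1.0}
df["p_day_off"] = df["dayofweek"].map(day_off_prob)
```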
An interesting question is: what is the minimal number of sensors, and which sensor types improve the prediction the most (excluding the obvious ones, like a PIR occupancy sensor)? Among the types we explored, sound is the winner.
A model using only this one variable has a test accuracy of 0.79. Some people may suggest our team is quite talkative (and they say data scientists aren't sociable!) ;). Some team members may suggest the sensor was placed quite close to my desk… Analyzing the audio recordings could certainly help settle both hypotheses.
The raw sensor data itself was quite a challenge - and not only because we changed rooms three times this year. The sensors are quite temperamental and require some preliminary outlier detection and removal. Below you can see the temperature chart:
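One common way to do this kind of cleaning is a rolling-median filter - a minimal sketch on a synthetic temperature trace, with a deviation threshold picked purely for illustration (our actual procedure and thresholds differed):

```python
import numpy as np
import pandas as pd

# Synthetic temperature trace with two sensor glitches injected
temp = pd.Series(21 + np.sin(np.linspace(0, 6, 200)))
temp.iloc[[40, 120]] = [85.0, -10.0]  # spikes a flaky sensor might produce

# Flag points that stray far from the local (rolling) median, then mask them
local_median = temp.rolling(window=11, center=True, min_periods=1).median()
deviation = (temp - local_median).abs()
outliers = deviation > 5.0            # threshold in degrees, chosen by eye
clean = temp.mask(outliers)           # flagged readings become NaN
```

The rolling median is robust to the spikes themselves, so a single glitch doesn't drag the baseline along with it the way a rolling mean would.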
It's worth mentioning that calculating the predictors wasn't trivial either. The historical number of people in a room was estimated using a CNN (that's not the only approach we checked, but it will be the subject of a separate post).
In the following pictures, each dot represents one detected person. Its coordinates are the center of the rectangle around the detected person, and its size is proportional to the area of that rectangle. Some natural clusters form around our desks. You can also see that one of our colleagues likes to walk near the window while thinking. What's more, we had stand-ups just next to the comfortable chair. How cool is that? :D
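Turning detections into dots is straightforward: given the bounding box of each detected person, the dot sits at the box center and its size is the box area. A tiny sketch with made-up box coordinates in (x_min, y_min, x_max, y_max) form:

```python
import numpy as np

# Hypothetical bounding boxes, one row per detected person:
# (x_min, y_min, x_max, y_max) in pixels
boxes = np.array([[10, 20, 50, 120],
                  [200, 40, 260, 160]])

centers_x = (boxes[:, 0] + boxes[:, 2]) / 2   # dot x-coordinates
centers_y = (boxes[:, 1] + boxes[:, 3]) / 2   # dot y-coordinates
areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])  # dot sizes
```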
I can’t help but think about using the same approach in stores or in restaurants. It would be so amazing to see which areas attract more people, which less, it may help with reorganization… Possibilities are endless!
Apart from room occupancy, I also forecasted the number of people in a room at a certain hour on a certain day. It's a more difficult, yet equally fascinating problem. This time the results are worse than for classification, though the model shows real potential. Let's compare the median of the actual and predicted values by hour:
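The comparison itself is a simple groupby-median; a sketch on a made-up log of hourly observations (the real data and model output obviously differ):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical per-hour log of actual vs predicted headcount
ts = pd.date_range("2017-05-08 08:00", periods=100, freq="h")
df = pd.DataFrame({"timestamp": ts,
                   "actual": rng.poisson(3, size=100),
                   "predicted": rng.poisson(3, size=100)})

# Median headcount per hour of day, actual vs predicted
by_hour = df.groupby(df["timestamp"].dt.hour)[["actual", "predicted"]].median()
```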
It’s not perfect, but the model learned our typical hours of working as well as the fact that we eat lunch around 12 and have stand-ups around 10.
Actually, the cutest cases are the days with the highest prediction error.
One day, the model expected us to be at the office, but… we were visiting our boss. One teammate stayed behind, but he wasn't in the room the whole day either, because he had to participate in his new employees' induction:
And here is the team with our boss:
The next day, the model yet again expected us to be there. But our flight was in the morning and we didn't arrive before noon, so we disappointed him (him being the Random Forest model):
Next, we finally behaved like we were supposed to :D It’s truly fascinating how well data knows us!
Some may say we are crazy to risk our privacy and be willing to gather such detailed data about ourselves. And we are even thinking about audio! But let’s think - our privacy is compromised already in so many ways ...
- Googling ‘sweet cats’? BAM, search log has a new row.
- Shopping? BAM, the transaction goes on your credit card.
- Walking? BAM, your detailed location has just been turned into JSON on a server.
- Breathing? BAM, heart rate is saved on your tracker’s history.
We are in Big Brother anyway, whether we like it or not. We might as well embrace it and model ourselves before others do!
Of course, there are many ways to determine how many people are in a room and this experiment explored just one. And yet, I’m still amazed by models not failing me, still inspired by their power, exactly as I was 8 years ago when I heard about some of them for the very first time. When suddenly and with absolute certainty I knew what I truly wanted to do in my professional life :)
However, there was something I enjoyed even more than the modelling - working with the team. The modelling wouldn't have been possible without amazing collaboration and mutual motivation, without various complementary skills, without exceptional creativity. We went down a bumpy road together, and we solved a problem - because that's what we do!
(And we have fun doing it :D)