Predicting School Holidays from BBC iPlayer Viewing | by Matt Crooks

By Jessie Hobb On Dec 23, 2022

Can we tell whether kids are in school just from the viewing of certain TV shows on BBC iPlayer? Spoiler— yes! We can also identify when schools are closed due to snow!

This was the first project I worked on at the BBC, and remains one of the most fun to have delivered!

While the information on school holidays is available on the internet, it’s presented on local government websites which vary in format from council to council making web scraping impractical.

The BBC produces a lot of content for children and being able to effectively and appropriately deliver this content is crucial. Our current use cases include:

Reporting
Timely output of Children’s content
Recommending child-appropriate content
Evaluating the effectiveness of marketing campaigns
Audience Segmentation

Wouldn’t it be great to automatically personalise the BBC iPlayer homepage with a child-friendly version when it’s a school holiday?

While some of our use cases, such as audience segmentation, can be performed with historical data about school holidays, the most useful applications come from knowing school holidays in sufficient time to be able to make proactive decisions about BBC output.

We very cleverly constructed the model in such a way that we are able to identify school holidays by 9am the same day. Keep reading to find out how…

We used 2 data sets to build this model:

BBC events data for the streaming of children’s content on BBC iPlayer
Code-Point Open dataset from Ordnance Survey for postcode location data

Our aim is to identify whether or not a particular day is school holiday based purely on the viewing pattern of Children’s iPlayer content throughout the day. Fig. 1 below shows the average viewing patterns for a term-time day and a school holiday/weekend.

Figure 1. Distribution of viewing throughout the day. Viewing for each day is normalised so that each day effectively has 1 viewer, before being averaged across each class. (image author’s own)

There are no surprises in the viewing behaviours between the two classes:

Term-time is characterised by two peaks in traffic; one in the morning and one in the afternoon. Little viewing occurs during the school day.
School Holiday viewing begins slightly later and plateaus for most of the day.

While it may seem obvious in the above plot that viewing patterns are different and distinct on the two types of day, what I haven’t discussed is how do we label what is a term-time day and what is a holiday, and that is the challenge!

This turned into a study into semi-supervised learning — a rather lesser-known area of data science.

We’ve all heard of supervised and unsupervised learning, but what is semi-supervised learning?

Supervised Learning — each data point in the training set has features and a label, we can identify the label based on the features
Unsupervised Learning — we don’t have labels, but want to identify and group “similar” data points into clusters based on their features. While we may try to explain the defining characteristics of each cluster they are largely abstract.
Semi-supervised Learning — we have largely unlabelled data but want our clusters to mean something specific.

We are left with a bit of a paradox — we want a model to assign labels, but we need the labels to build the model. Bit of a chicken and egg situation, or more interestingly this:

I recently found out that the nougat in Snickers* is flavoured using melted Snickers bars. And the chocolate layer between the wafers in a KitKat contains melted KitKats!

*other chocolate based snacks are available

Semi-supervised learning is really a worst of both worlds but there are approaches that can help in tackling these problems using standard supervised and unsupervised techniques.

Unsupervised Learning Approach

Let’s first briefly discuss what we didn’t do.

Probably the biggest contributing factor that affects people’s streaming behaviour is the weather — if it’s raining then people stay inside and watch TV.

We see three times the volume of traffic to CBBC shows on iPlayer when it’s raining — suggesting that when it’s sunny and warm the kids are outside. Applying unsupervised learning could well yield two clusters that represent raining/not raining rather than holiday/term-time.

Supervised Leaning Approach

Our starting point for this is that we can label some of our data:

Holidays: Weekends are similar to holidays, and common school holidays such as early August and the week containing Christmas are uniform across the country.
Term-time: Periods clear of known holidays such as early December and June.

However, there is no guarantee that BBC iPlayer viewing on a Wednesday during a holiday is similar to a Saturday during term-time; we risk the model not generalising well if we stick to just the partially labelled dataset as training data.

The solution is to generalise our partially labelled data set using outlier detection algorithms…

The aim of outlier detection is to identify additional days among the unlabelled data that we can reasonably label as either holiday or term-time.

1-Class Support Vector Machines can perform outlier detection, although they may not perform optimally for irregular shaped clusters/classes. In this case, DBScan would be a great alternative.

We can use our partially labelled data to define each of our classes, and then use a 1-class SVM to identify other days that aren’t outliers to that class. This helps to generalise from our initial partially labelled dataset. Much like Nestle with their Snickers and KitKat, we can combine both our original partially labelled data and the new model output to retrain the model and predict a more generalised set of class examples. Repeated application should converge on a consistent set of non-outliers.

In Fig. 2 below we show the results of applying the term-time 1-class SVM to weekdays that are not in August. The 1-class SVM identifies additional term-time days from among all non-august weekdays by picking days with a pronounced drop in day-time viewing.

Figure 2. Viewing figures for all non-August weekdays (left) split down by newly labelled “Term-time” days (right) and outliers from this class (middle). Blue lines shown median and grey bars show 5th and 95th percentiles. (image author’s own)

Of course, a data point being an outlier to one class doesn’t mean that it belongs to the other class! We still need a more robust classifier but for now we can average across all data points in each class to define a characteristic behaviour for each of the classes — this is what is shown in Fig. 1.

Now that we have characteristic behaviours for each of our classes, we can begin to analyse whether a particular day looks more like a holiday or term-time.

If we express a day’s viewing as a vector in 24 dimensionsional space, we can assess the similarity between a particular day’s viewing and term-time/holiday viewing by calculating the angle between the two vectors. We use the definition of the Euclidean dot product to do this, which is very closely related to the more common metric cosine similarity.

Figure 3. Simple 2-d version of representing viewing data as vectors. Here we have simplied our 24 dimensions into just 2: “daytime” and “evening”, for demonstrative purposes. (image author’s own)

We can then combine our 24 hours of viewing data into a single metric, which we will call the classification metric:

Classification can then be defined by the simple relationship:

c ≥ 1/2 : Term-time
c < 1/2 : Holiday

We have provided some example code at the end of the article on how to calculate our classification metric.

Advantages

You may be wondering why we don’t just train an ML model using the 24 hour feature set?

The advantage of using the classification metric is that it can be evaluated at any time of day — we can take, say, the first 9 hours of viewing on a particular day, work out the cosine similarity relative to the first 9 hours of our characteristic behaviours, and then calculate the classification metric with the same cut-off value of 1/2. This property allows us to be able to use real-time viewing data to identify school holidays by 9am (more on that later!).

In addition, the classification metric reduces 24 dimensions (one for each hour of viewing) to just 1. This offers massive dimensionality reduction that allows us to be able to aggregate multiple classification metrics across different geographical regions. Each geographical region would otherwise have 24 dimensions and you can easily see that the number of dimensions could grow quite quickly. It also simplifies and improves the outlier detection algorithm previous discussed.

One of the main justifications for this project was that different schools have different holidays. This is especially true of Northern Ireland and Scotland’s summer holidays. Even within England, different regions have different half-terms and Easter and Christmas breaks.

We began to break the viewing down into smaller and smaller geographical regions. At an outcode level we get very specific predictions but in rural areas we lack enough data to reliably classify. On a regional level we get plenty of data but lose a lot of the geographical detail.

For each day, the classification metric is evaluated on viewing data aggregated over the following geograhic regions:

Outcode
10 closest outcodes (based on Euclidean distance between the center of each outcode according to latitude and longitude defined by Ordnance Survey)
Town
Region

A Random Forest Classifier was trained on these 4 classification metrics as features, together with the same metrics from the previous day, and the day of the week. Here the previous day’s data is important because holidays often occur in whole weeks so if Tuesday is a holiday then it’s more likely that Wednesday will be too. However, bank holidays are frequently Mondays and don’t tell us quite as much about whether Tuesday is a holiday.

The first qualitative evaluation that we performed on the model was the summer holidays. Give or take teacher training days that may be applied in different regions, the core summer holidays are consistent across the different nations of the UK. No information was provided to the model about time of year (only day of week) or when different nations take their summer holidays. The figure below shows the results from the model.

Figure 4. Variation in summer holidays across the different nations in the UK. Data shown is Mon-Fri average in each region. (image author’s own)

The model has correctly identified that Northern Ireland has an 8 week long summer holiday, compared to 6 weeks of the other Nations. Scotland’s summer holiday is also staggered a fortnight earlier than England and Wales.

While checking the results I noticed a rogue day in December 2017 that showed Gloucestershire was on holiday (not shown). After an initial panic, I decided to google news events on that day and discovered that a large snow storm had closed a lot of the schools in Gloucestershire on that particular day — an event that the model had correctly identified!

What ended up being an even more interesting event was the winter weather event nicknamed “The Beast from the East” that arrived in the UK at the end of Febraury 2018. The figure below shows the model’s output over this period.

Figure 5: In the winter of 2018 the UK was hit by a winter storm nicknamed “The Beast from the East”. Kent was first to be hit on Wednesday 28th February and by Friday 2nd March most of the UK was buried unders inches of snow, closing most of the UK’s schools. (image author’s own)

The Beast from the East first hit headlines on Wednesday 28th February when it hit Kent and closed the schools there. As the week progressed the storm moved up and west across the country causing increasing mass closures of schools — all of which our model was able to identify. On Friday 2nd March, only children in the Northwest of England were attending school in large numbers.

Hindcast data is useful for a lot of our applications, such as audience segmentation, reporting, marketing effectiveness. However, we ideally want to be able to make proactive decisions based on the model output.

I mentioned briefly above that we can apply cosine similarity at any point in the day just by using the dimensions that we have data for. We also trained our Random Forest using the classification metric, which we can calculate from the cosine similarity. As a result, we don’t need to wait for a full day’s data before we can begin to classify the day as a school holiday or term-time.

But just because we can make predictions doesn’t mean we should — how accurate is the model?

We can analyse how quickly the model converges on a stable prediction as we add more hours of data. Assuming the model at the end of the day is correct, if the model makes the same prediction at 10am then we’re doing well! The figure below shows the proportion of predictions that agree with the figure at the end of the day. This is aggregated across all days and locations in the UK.

Figure 6. Convergence of the model through the day. Blue line shows median, shaded blue region shows 5th to 95th percentiles. (image author’s own)

In the early hours of the morning we get a large amount of variability in the model (large shaded blue region). Around 9am there is a pronounced increase in model performance — in most cases the model doesn’t change it’s classification beyond 9am. Going back to Fig. 1, we can see the morning peak in viewing drop off around this time for term-time days — this appears to be all the model needs to be able to work.

By the start of the work day we can identify with >95% accuracy whether it is a school holiday or term-time.

Being able to make predictions to early in the day means we’re able to make proactive decisions about BBC output based on whether it is a school day or, and we can do this on a regional level!

Using user behavioural data from BBC iPlayer we are able to accurately predict whether a day is a school holiday or term-time. We can perform this both on historical data to analyse marketing effectiveness but also, by training a model using the classification metric, we’re able to use the same model to predict in real-time. We’ve also used Ordnance Survey data to allow us to aggregate regional data to make localised predictions.

Below is the code to calculate our classification metric. The main function call is calculate_classification _metric().

def magnitude_of_vector(vector):
"""
Function to calculate the magnitude of a vectorParameters
----------
vector: 1-d numpy array
vector we want to find the magnitude of
Returns
-------
Magnitude of vector as float
"""
return np.sqrt(np.sum(vector ** 2))
def dot_product(vector1, vector2):
"""
Function to calculate the dot product between 2 vectors.
The dot product is independent of the order of the input vectors.
Parameters
----------
vector1: 1-d numpy array
one of the vectors
vector2: 1-d numpy array
the other vector
Returns
-------
Dot product of the two input vectors as a float
"""
return np.sum(vector1 * vector2)
def angles_in_radians(vector1, vector2):
"""
Function to calculate the angle between 2 vectors.
The angle is independent of the order of the input vectors and 
is returned in radians.
Parameters
----------
vector1: 1-d numpy array
one of the vectors
vector2: 1-d numpy array
the other vectors
Returns
-------
Angle between the two input vectors measured in radians
"""
magnitude_x = magnitude_of_vector(vector1)
magnitude_y = magnitude_of_vector(vector2)
dot_product_x_and_y = dot_product(vector1, vector2)
angle_in_radians = np.arccos(dot_product_x_and_y / (magnitude_x * magnitude_y))
return angle_in_radians
def convert_radians_to_degrees(angle_in_radians):
"""
Function to convert an angle measured in radians into degrees.
360 degrees is equivalent to 2pi radians
Parameters
----------
angle_in_radians: float
angle measured in radians
Returns
-------
Angle measure in degrees between 0 and 360.
"""
angle_in_degrees = angle_in_radians / np.math.pi * 180
return angle_in_degrees
def calculate_acute_angle(angle):
"""
Function to convert an angle measured in degrees to an acute angle 
(0 < angle < 90 degrees)
For example, 2 vectors that form an angle 135 degrees will be converted
to the acute angle 45 degrees.
Parameters
----------
angle: float
angle measured in degrees
Returns
-------
Equivalent acute angle measured in degrees 
"""
acute_angle = (angle > 90) * (180 - angle) + (angle <= 90) * angle
return acute_angle
def calculate_acute_angle(vector1, vector2):
"""
Function to calculate the acute angle between 2 vectors 
(0 < angle < 90 degrees). 
Taking the acute angle means that we ignore the direction of the vectors
eg. Walking due north is equivalent to heading due south - all we care 
about is that your longitude remains constant
Parameters
----------
vector1: 1-d numpy array
one of the vectors
vector2: 1-d numpy array
the other vectors
Returns
-------
Acute angle between the two input vectors measured in radians
"""
angle_in_radians = angles_in_radians(vector1, vector2)
angle_in_degrees = convert_radians_to_degrees(angle_in_radians)
acute_angle = calculate_acute_angle(angle_in_degrees)
return acute_angle
def calculate_classification_metric(
holiday_vector,
termtime_vector,
viewing_vector
):
"""
Function to calculate our classification metric. This is based on the
acute angle that a day's viewing makes with each of our two class-defining 
vectors. 
If classification_metric < 1/2 -> Holiday
If classification_metric > 1/2 -> Term time
Parameters
----------
holiday_vector: 1-d numpy array
Average viewing during a school holiday
termtime_vector: 1-d numpy array
Average viewing during a school day
viewing_vector: 1-d numpy array
Viewing for the day that we want to classify
Returns
-------
Float between 0 and 1
"""
holiday_angle = calculate_acute_angle(viewing_vector, holiday_vector)
termtime_angle = calculate_acute_angle(viewing_vector, termtime_vector)
classification_metric = holiday_angle / (holiday_angle + termtime_angle)
return classification_metric