Consider the following use case: most televisions today are "smart", which primarily means they are connected to the Internet. That lets you use streaming services such as Netflix, Amazon Prime, and so on. In addition to these providers, the "normal" television channels such as Pro7 and RTL also offer streaming services that work as a kind of media library, and interactive features such as games are possible too. The streaming services of these channels naturally show advertisements just like linear television. Recently I watched a series via streaming and the same advertisement kept appearing, sometimes three times in a row during a single commercial break. It was a toilet paper ad in which a little girl feels like a princess when she uses this special toilet paper. After an hour I could recite the script. Once I realized what was actually happening, I switched channels in frustration.
What does this tell us? Personalized advertising may not be such a bad thing after all. Many of the big data and analytics efforts of the major providers (yes, including Facebook and Google) are ultimately about personalized advertising. The advertiser saves money (and can perhaps put it into a better product or higher salaries), and the end customer sees advertising that actually interests them. We are constantly being served advertising anyway, so why not for something that suits my personal preferences? What's so bad about that? Data protection advocates in the EU, the Chancellor among them, paint a doomsday picture and portray the companies that make money from users' data as villains and data octopuses. So what? If that bothers you, then don't use Facebook, or use something other than Gmail. The EU's data protection regulation is a completely over-regulated construct and a real blocker of innovation.
After this personal digression, let's get back to the technology. We thought about how log data could be used to group households and the programs they watch into clusters. The broadcaster could use this data to determine what interests my household and I have, make personalized offers, and avoid showing the same advertisement over and over. In the following, I describe the challenges and a technical pipeline for real-time cluster formation.
Description of the scenario:
(1) A family has a SmartTV and several devices that the members use to surf the Internet. We also assume that the television station has a network of websites belonging to its own company or to partner companies. Every view (SmartTV, cell phone, tablet, etc.) in the household is logged on the backend servers. The log files contain the IP address of the router, the MAC address of the device, the content that was viewed, and a timestamp (a sample record follows after this list).
(2) The log files are automatically retrieved from the backend servers and written to a Kafka cluster. Another service retrieves the data from Kafka, processes it, aggregates it and forms clusters. The results are written to a MongoDB for visualization.
(3) A Python Flask application, socket.io and d3.js are used for visualization.
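To give a concrete idea of what such a log entry could look like, here is a minimal sketch of a single "view" record as the POC could generate it. The field names (router_ip, device_mac, content_id, timestamp) are my own choice for illustration, not a fixed schema.

```python
# A minimal sketch of one "view" log record as a Python dict
# (field names are illustrative assumptions, not a fixed schema).
import json
from datetime import datetime, timezone

view_event = {
    "router_ip": "93.184.216.34",         # public IP of the household router
    "device_mac": "3c:22:fb:aa:12:01",    # MAC address of the viewing device
    "content_id": "cooking-show-s02e05",  # what was watched / which page was visited
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

print(json.dumps(view_event))
```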
Here is the overall architecture as a diagram:
And so that we have an idea of the end result:
This network graph is created and updated in real time, so I can see which household clusters have been formed. A cluster consists of one or more device MAC addresses (mobile phone, tablet, television, etc.) and one or more IP addresses. Why multiple IP addresses? Normally all devices go online with the same IP address, namely that of the household's Internet router (usually a FRITZ!Box). These IP addresses are not assigned permanently; the Internet provider hands out new ones at certain intervals. The IP address alone is therefore not a reliable criterion for assigning devices to a household; it only works within a certain time window, and our clustering algorithm has to take this into account. For the first household, I also added the content that the devices accessed. What could you do with these clusters now?
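Before going through the concrete cases, here is a minimal sketch of the time-windowed grouping described above. It assumes events shaped like the sample record earlier (router_ip, device_mac, ISO timestamp) and a fixed window length; the real logic, including proper handling of IP reassignment, lives in the Python script mentioned later.

```python
# Minimal sketch of time-windowed household clustering (illustrative only).
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(hours=6)  # assumption: an IP stays with one household for a few hours

def cluster_households(events):
    # Step 1: collect device MACs per (router IP, time window) bucket.
    buckets = defaultdict(set)
    for e in events:
        ts = datetime.fromisoformat(e["timestamp"])
        slot = int(ts.timestamp() // WINDOW.total_seconds())
        buckets[(e["router_ip"], slot)].add(e["device_mac"])

    # Step 2: merge buckets whose device sets overlap -- the same household
    # keeps its devices even after the provider assigns a new IP address.
    households = []
    for (ip, _), macs in buckets.items():
        for hh in households:
            if hh["macs"] & macs:
                hh["macs"] |= macs
                hh["ips"].add(ip)
                break
        else:
            households.append({"macs": set(macs), "ips": {ip}})
    return households
```

Merging buckets by overlapping device sets is what keeps a household together when its router gets a new IP address.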
Case 1: In household 1, Device1 watched Content1 and Content2 and Device2 watched Content2. If the content is categorized, I could classify the devices into age groups and even gender and thus display appropriate advertising. For example, Device1 looked at recipes on the Internet (Content1) and a cooking show (Content2).
Case 2: Building on Case 1 and the timestamps, I could take an advertisement that was shown during the cooking show (Content2) and display it five minutes later on the recipes page (Content1).
Case 3: Imagine Household1 and Household2 are neighbors or friends. How do you find that out? If, for example, a device (Device4 from Household2) usually appears in the Household2 cohort (based on the time information) but occasionally also shows up in Household1, you can assume there is a relationship between these households. In other words, you can build a social graph, which can additionally be enriched by content (a small sketch of this co-occurrence idea follows after the cases). This in turn lets the television station display personalized advertising (e.g. a cookbook, a Weber grill in summer, or a nice red wine to drink with friends).
Case 4: If a household has many devices in the cohort, it can be assumed that the household has a correspondingly high income, and advertising can be targeted accordingly.
Case 5: The television company could give its advertising customers the option of choosing what percentage of their advertising budget goes to mobile phones, tablets and televisions.
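To make Case 3 a bit more concrete, here is a minimal sketch of how the co-occurrence idea could be turned into edges of a social graph. The home_ratio threshold, the household ids and the helper name are assumptions for illustration, not the actual algorithm from the POC.

```python
# Illustrative sketch for Case 3: link households whose devices co-occur.
# Assumes household clusters like those from cluster_households() above,
# plus per-device counts of which household cohort each device appeared in.
from collections import Counter

def build_social_edges(device_appearances, home_ratio=0.7):
    """device_appearances maps a MAC to a Counter of household ids it was seen in.
    A device that mostly sits in its home household but occasionally shows up
    elsewhere creates an edge between the two households."""
    edges = set()
    for mac, counts in device_appearances.items():
        total = sum(counts.values())
        (home, home_count), *visits = counts.most_common()
        if home_count / total >= home_ratio:
            for visited, _ in visits:
                edges.add(tuple(sorted((home, visited))))
    return edges

# Example: Device4 lives in household 2 but was seen twice in household 1.
appearances = {"3c:22:fb:aa:12:04": Counter({"hh2": 18, "hh1": 2})}
print(build_social_edges(appearances))  # {('hh1', 'hh2')}
```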
I could go on with more options here, but I think it's clear which direction this is going in. As I said, this is about the efficiency and relevance of the advertising displayed and nothing more; not about spying on users or conspiracies against them. Oh, someone just knocked on the door. I assume it's the EU data protection authority, clubs in hand. If you don't hear from me again, please call someone for help.
OK, it was just the pizza delivery guy. On with the text. Since this is a POC we don't have any real data, so we generate it ourselves with a small Python script. The script runs indefinitely and produces "view" events, which are written directly to Kafka. The whole setup runs on my laptop and can of course be migrated to the cloud so that it scales. The actual clustering logic is in a Python script, which would have to be ported to Apache Spark in a production setup. After the clusters are formed in Python, the results are written to MongoDB. From there, the web application has access and pushes the results to the browser as the network graph via websockets.
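To give an impression of the generator side, here is a minimal sketch of how the endless stream of view events could be written to Kafka using the kafka-python client. The broker address, topic name, sample MACs, IPs and content ids are assumptions for this sketch; the consumer side (Kafka → clustering → MongoDB → Flask/socket.io) follows the same pattern.

```python
# Minimal sketch of the synthetic "view" generator writing to Kafka.
# Assumptions: a local Kafka broker on localhost:9092 and a topic named "views";
# the MACs, IPs and content ids are made-up sample data.
import json
import random
import time
from datetime import datetime, timezone

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

HOUSEHOLDS = {
    "93.184.216.34": ["3c:22:fb:aa:12:01", "3c:22:fb:aa:12:02"],
    "93.184.216.35": ["3c:22:fb:aa:12:03", "3c:22:fb:aa:12:04"],
}
CONTENT = ["cooking-show-s02e05", "recipes-page", "news", "quiz-game"]

while True:
    ip = random.choice(list(HOUSEHOLDS))
    event = {
        "router_ip": ip,
        "device_mac": random.choice(HOUSEHOLDS[ip]),
        "content_id": random.choice(CONTENT),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    producer.send("views", event)
    time.sleep(0.5)  # endless stream of view events, one every half second
```

Here is a video of the ongoing process: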
I hope I was able to demonstrate how something like this can work and which pipelines you can use for such a setup. If you have any further questions or want to see a live demo, write to me at [email protected]. Please send hate mail about data protection and the like to [email protected]. I will definitely answer.