Put simply, machine learning is a field of information technology in which mathematical algorithms are used to gain insights from data. It has a wide variety of applications and disciplines. One area that we want to examine in this article is so-called "topic modeling". Here, the algorithm tries to find out on its own which topics occur in one or many text documents.

A practical example:

A mobile phone/laptop repair service provider wants to generate new product and service ideas. To do this, it would be helpful to know which topics users are particularly interested in, or which topics users have the most problems with. Users exchange ideas about these topics in discussion forums. It therefore makes sense to use these discussions as a basis. An employee of the service provider could read through all of these discussions and identify the most common problems. This approach is not very efficient, as there are tens of thousands of discussions and many forums.

What if a machine could search through all these forums and, based on statistical models and machine learning, find out what most users are talking about and what problems they have? The result of this machine research would answer two important questions for the company:

– What problems do users have?

– Following from that: what product or service could the company offer to help them?

– As a byproduct, the result would also yield relevant keywords that the company could use to create content around the product or service.

Using this concrete example, I would like to show what a possible implementation could look like.

Our machine learning pipeline looks like this:

At http://all4phones.de/iphone-probleme/ users discuss their problems with the iPhone. We want to find out what the biggest problems are so that we can generate product, service, or content ideas.

The first necessary step is to extract and save the discussion texts from the website. In the second step, the data must be cleaned. Words like "I, he, have, and" (so-called stop words) carry no topical meaning and should be removed from the text corpus. This reduces the complexity of the data and improves the quality of the results. The third step is the interesting part. For topic modeling, we use the LDA algorithm (Latent Dirichlet Allocation). Very roughly speaking, LDA is a generative probabilistic model that describes each document as a mixture of topics and each topic as a distribution over words, and infers both from how often words occur together in the documents.
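To make the cleaning and modeling steps more concrete, here is a minimal sketch using only the Python standard library. It is not the tooling behind the results in this article: it swaps in a tiny collapsed Gibbs sampler to illustrate the idea of LDA, whereas a real project would more likely rely on a library such as gensim or scikit-learn. The function names `clean` and `fit_lda`, the short stop word list, the sample posts, and the values for `alpha`, `beta` and `iterations` are all assumptions made for illustration.

```python
import random
import re

# Assumption: a tiny hand-picked stop word list; a real project needs a fuller one.
STOP_WORDS = {"i", "he", "have", "and", "the", "my", "a", "an", "to", "is",
              "after", "for", "with"}

def clean(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns one {word: count} dict per topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]       # tokens of doc d in topic k
    topic_word = [dict() for _ in range(num_topics)]   # count of word w in topic k
    topic_total = [0] * num_topics                     # all tokens in topic k
    assignments = []
    # Start with a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] = topic_word[k].get(w, 0) + 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from all counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight of topic t is proportional to
                # (doc-topic count + alpha) * (topic-word count + beta)
                # / (topic total + vocab_size * beta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t].get(w, 0) + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] = topic_word[k].get(w, 0) + 1
                topic_total[k] += 1
    return topic_word

# Made-up forum posts, for illustration only.
posts = [
    "I have an iCloud backup and the restore fails",
    "Backup to iCloud stops after the iOS update",
    "Jailbreak for the new firmware version",
    "Apps crash after the jailbreak and firmware update",
]
docs = [clean(post) for post in posts]
topics = fit_lda(docs, num_topics=2)
```

Each entry in `topics` holds the word counts that the sampler assigned to one topic. On a corpus this small the split is not meaningful; the sketch only shows how the cleaned tokens flow into the model.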

Below is the result:

If you click on the first cluster in our example, the right-hand side shows the words that occur most frequently within this topic, ranked by their relative frequency. We look at the first six words:

backup, device, ios, icloud, update, restore

This topic is clearly about backing up and restoring iOS devices with iCloud. Since we know that this is an "iPhone problems" forum, we conclude that this is the biggest problem that users are discussing.
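Ranking the words of a topic by relative frequency can be sketched in a few lines. The helper `top_words` and the counts in `topic_counts` are hypothetical; the numbers are invented for illustration and are not taken from the forum analysis.

```python
from collections import Counter

def top_words(word_counts, n=6):
    """Return the n most frequent words with their share of the topic's tokens."""
    total = sum(word_counts.values())
    ranked = Counter(word_counts).most_common(n)
    return [(word, count / total) for word, count in ranked]

# Hypothetical word counts for a single topic (invented numbers).
topic_counts = {
    "backup": 40, "device": 25, "ios": 15, "icloud": 10,
    "update": 6, "restore": 3, "cable": 1,
}

for word, share in top_words(topic_counts):
    print(f"{word}: {share:.2f}")
```

Reading off the top few words in this way is what allows a person to give the topic a name, as done for the clusters in this article.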

This is the second cluster:

Here are the six most relevant words:

cell phone, jailbreak, apps, ios, firmware, version

This cluster is obviously about jailbreaking and iOS updates.

The company could now create topic pages (i.e. content) around these topics to generate organic traffic. Or it could create products and services around them:

  • A service for creating and restoring iPhone backups
  • A tutorial sold as paid content
  • A free eBook on iPhone backups, available for download once the user registers

A cross-check in Google Trends shows that the topic “Backup iCloud” is a growing trend, which supports the results of our text analysis.

Do you want to know how to use machine learning in your industry?

Contact us.