by Dr. Elena Michel and Marina Ehrenreich
the power of text mining – what it takes to understand your customers worldwide
Responding quickly and efficiently to complaints has become the true hallmark of global players. The problem is that only around one fifth of all data is available in structured form, for example as tables or Excel sheets. The other 80 percent consists of e-mails, Word files, PDF documents, PowerPoint slides and other text formats, along with large volumes of audio files, videos, voice memos and image files. An incredible treasure trove of data, largely unknown, lies beyond our perceptual horizon. Whoever unearths this treasure has a strategic advantage.
found in translation – the end of Babylon
Who said that globalization must always be easy? Although English serves as a world language, there are still around 6,500 languages. Chinese alone is the mother tongue of 1.3 billion people, Hindi of 525 million. Face to face, we can fall back on gestures to make ourselves understood. On the computer this is not possible, because only what is written down counts: complaints, suggestions, appointment requests. That is a challenge for global players. Nothing is as important for brands as being able to respond precisely to customer wishes, around the clock, in all languages and formats, from a simple call to the call center to a nasty comment on a comparison platform.
Gigabytes of information are generated every hour. But how can these floods of data be managed sensibly, and how can the right conclusions be drawn? Multilingual text mining is the answer. Over the next few years, the market for such automated text analysis will grow to several billion dollars, and the trend is rising.
Whoever unearths this treasure trove of data has a strategic advantage. The next few years will be about consumer data, says Dr. Horst Florian Jaeck, partner of the Data Analytics line at rpc: "Whoever uses this strategically will win."
lifting data treasures
First we have to look under the surface of the data iceberg. One key to this is to analyze and classify existing unstructured text files. But going through terabytes of data manually overwhelms even the most experienced service team. As a result, only a fraction of all complaints are evaluated, and call center agents evaluate them subjectively, sometimes arbitrarily. Adding staff only shifts the problem: more and more specialists at headquarters do nothing but categorize complaints. And who speaks Hindi when it is needed? Or Italian? Take this email, for example: "La mia auto è in officina da 3 settimane ormai, ma non so nemmeno cosa abbia causato il problema al cambio" (My car has been in the workshop for three weeks now, but I still don't know what caused the transmission problem.) Is it a complaint? Or just an observation? Is it about the gearbox or the duration of the repair? And is there perhaps a connection with this other email: "Rattle noise at low revs"?
Classifying texts ties up many resources. This is where automated text mining comes in: it categorizes texts meaningfully, i.e. it assigns comparable content to the same category. The data is first prepared in the following steps:
- The first step is to clean up the raw data (removing numbers, punctuation and superfluous spaces, and converting uppercase letters to lowercase).
- Tokenization splits sentences (or character strings) into units called tokens. Tokens can be words, expressions, or entire sentences. This opens the way for the further text mining steps.
- After a spell check, stop words are removed, i.e. words that add nothing to the information content of the statement.
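- Stemming attributes different word variants to a common root, such as "going" and "goes" to "go". Irregular forms such as "gone" and "went" cannot be reduced by stripping suffixes and require lemmatization, which looks up the dictionary form. Now the frequency of each stem per document is calculated, and with it its relevance. Weighting these frequencies with TF-IDF (Term Frequency-Inverse Document Frequency) yields the DTM (Document Term Matrix), which is something like the key to text comprehension.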
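The preparation steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tooling used by the authors: the stop-word list, the toy suffix stripper `stem`, and all function names are hypothetical, the cleaning step handles English text only, and the spell check is omitted. A production pipeline would use an established stemmer and full stop-word lists.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list (assumption); real lists are far longer.
STOP_WORDS = {"the", "a", "an", "is", "at", "in", "my", "has", "been", "for", "and", "of"}


def clean(text):
    """Lowercase, drop digits and punctuation, collapse whitespace (English only)."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split cleaned text into word tokens."""
    return text.split()


def stem(token):
    """Toy suffix stripper (assumption); not a real stemming algorithm."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Clean, tokenize, remove stop words, then stem."""
    return [stem(t) for t in tokenize(clean(text)) if t not in STOP_WORDS]


def tfidf_matrix(docs):
    """Return (vocabulary, matrix): one row of TF-IDF weights per document."""
    token_lists = [preprocess(d) for d in docs]
    vocab = sorted({t for tokens in token_lists for t in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    matrix = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1
        matrix.append(
            [(counts[t] / total) * math.log(n_docs / doc_freq[t]) for t in vocab]
        )
    return vocab, matrix
```

For example, `preprocess("Rattle noises at low revs")` returns `['rattle', 'noise', 'low', 'rev']`. In the matrix, a term that occurs in only one document receives a higher weight than a term shared by several documents, which is what makes the weights usable as a measure of relevance.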
The actual classification takes place after the preparation of the data shown above. Either a rule-based classification is performed or a typical classification model is used, for example Random Forest, C5.0, SVM (Support Vector Machine) or Neural Networks.
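The rule-based option can be illustrated with a small keyword lookup. The sketch below is hypothetical: the two labels are taken from the example emails in this article, while the keyword sets, the fallback label "Unclassified" and the function name `classify` are assumptions made for illustration. A model-based approach would replace the keyword lookup with a classifier trained on the document term matrix.

```python
import re

# Hypothetical keyword rules: the labels follow the article's example,
# the keywords themselves are assumptions.
RULES = {
    "Repair duration": {"weeks", "days", "waiting", "workshop", "still"},
    "Technical Problem": {"rattle", "noise", "noises", "revs", "leak"},
}


def classify(text, rules=RULES, default="Unclassified"):
    """Assign the label whose keyword set overlaps most with the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    scores = {label: len(tokens & keywords) for label, keywords in rules.items()}
    best_label, best_score = max(scores.items(), key=lambda item: item[1])
    # Fall back to the default label when no rule matches at all.
    return best_label if best_score > 0 else default
```

With these rules, `classify("Rattle noise at low revs")` returns `"Technical Problem"`. Rules like these are transparent and easy to audit, but they have to be maintained by hand for every language and market, which is one reason trained models are often preferred once enough labeled data is available.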
all for one?
But what if texts are available in different languages? This is where multilingual text mining is used. In the case of clear technical terms and easily understandable facts, it makes sense to bring different languages together under a "leading language", usually English, and only then process them. This requires terminology management software as well as excellent translation tools that cover all required languages and translate in good quality.
In the case of complex, ambiguous situations or texts with many technical terms, it is worth taking a language-specific approach, each language with its own rules and analysis resources. The language of each document must be identified at the latest after tokenization, because stop-word removal and stemming are language-specific. It is recommended that all further steps, including the training of the classification model, be language-specific (and market-specific). The results can in turn be used to enrich structured data and are thus available for further analysis.
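Why the language has to be known before stop words are removed can be shown with a simple sketch. It guesses the language from the overlap with small stop-word lists and then applies the matching list. Everything here is an assumption for illustration: the three tiny stop-word lists, the language codes and the function names are hypothetical, and a real system would rely on a dedicated language identification component, since this heuristic is unreliable for short texts.

```python
import re

# Tiny illustrative stop-word lists (assumptions); real lists are much longer.
STOP_WORDS = {
    "en": {"the", "my", "has", "been", "in", "for", "but", "and", "is", "at", "what"},
    "it": {"la", "il", "mia", "è", "in", "da", "ma", "non", "so", "cosa", "al", "che"},
    "de": {"der", "die", "das", "mein", "ist", "seit", "und", "nicht", "aber", "in", "was"},
}


def tokenize(text):
    """Lowercase and extract runs of letters, including accented ones."""
    return re.findall(r"[^\W\d_]+", text.lower())


def detect_language(tokens):
    """Pick the language whose stop words occur most often in the tokens."""
    overlap = {
        lang: sum(t in words for t in tokens) for lang, words in STOP_WORDS.items()
    }
    return max(overlap, key=overlap.get)


def remove_stop_words(text):
    """Return (language, tokens) after language-specific stop-word removal."""
    tokens = tokenize(text)
    lang = detect_language(tokens)
    return lang, [t for t in tokens if t not in STOP_WORDS[lang]]
```

Applied to the beginning of the Italian email, `remove_stop_words("La mia auto è in officina da 3 settimane ormai")` returns `("it", ["auto", "officina", "settimane", "ormai"])`. Using the English list on the same text would have removed only "in" and left all Italian function words in place.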
training is worth it
Back to our two complaint emails. The first one, "La mia auto è ...", is not about the gearbox, but about the duration of the repair. The analysis tool therefore assigns the label "Repair duration", and not "Technical Problem" as for the second email ("Rattle noise at low revs"). Both complaints can now be answered specifically. For example, a friendly email goes to the first owner, and a second email asks the garage to call the owner. Speech recognition and classification by relevant topic ensure that the most diverse wishes and complaints are dealt with quickly and precisely and answered "personally".
The basic prerequisites for successful text mining are, of course, good document data quality and sufficient data volume. If only a few documents are available, it is advisable to translate the texts into a main language, as otherwise there is not enough training and test data available for the classification model.
the end of Babylon
Text mining can be used in many different ways. Up to 95 percent of the existing text files in companies could be evaluated automatically, but so far companies have analyzed only a fraction of them, and some long-term observations do not take place at all. Automating the categorization can save up to 80 percent of the time compared to manual work. This leads to considerable cost savings in post-processing. More importantly, companies recognize possible breakdowns and complaints much earlier and can head off a public backlash on social media.
In the future, it will be possible to categorize texts in all languages fully automatically. Service staff will then concentrate entirely on the wishes of their customers and take care of real problems. Text mining even enables worldwide social media monitoring: What is being said about the company worldwide, and how? What are the main topics? This helps to better connect customers and brands worldwide with little effort. Because nothing is as destructive as customers who don't feel taken seriously. And nothing is as valuable as satisfied users.
This text is the English translation of our article originally published on marktforschung.de (German only).