The data treasure hunter

21 Feb 2022

How we can learn more from what we know about ourselves: Statistician Frauke Kreuter wants to improve the quality of Big Data and is using artificial intelligence to do so.

"Water, water everywhere / Nor any drop to drink."

For Frauke Kreuter, these lines of poetry by Samuel Taylor Coleridge beautifully express a feeling she experiences time and again in her job as a statistics professor: “Data, data everywhere — but what to do with it?”

Prof. Dr. Frauke Kreuter

Frauke Kreuter has no doubt that the enormous amounts of data collected in a wide variety of situations can be used to improve people's lives.


Frauke Kreuter does actually have a few ideas about what she might find amid this sea of data with the help of algorithms. Beyond the poetry, the Professor of Statistics and Data Science describes her own work simply: “I am interested in quality of data.” But talk to her for any length of time and you realize that her research is far more expansive, and has more implications for every aspect of society, than something as seemingly prosaic as data quality might suggest. She’s investigating how to keep people in the active workforce. How to fight poverty. How to defeat disease. And how automated decision-making affects sentences handed down in the American criminal justice system.

To get across her view of what this “quality” is all about, she tells a story. It’s often used as a reminder to scientists in her field that the data you don’t have can be more relevant than the data you do have. “Being aware of that can be the difference between life and death,” says Kreuter. The story centers on Abraham Wald, a mathematician and statistician born in 1903. When engineers were trying to make Allied bomber aircraft as combat-resilient as possible during World War II, Wald told them they needed to do a complete rethink. The engineers had spent some time looking at all the places planes returning from combat operations had suffered damage, and they had added extra armor plating in those places. Wald pointed out the error in their thinking: If damage didn’t prevent a plane from returning, the damaged area wasn’t critical, so there was no point reinforcing those spots. More important, he said, was to better protect the other places, correctly reasoning that the planes hit in those places had been brought down by the resulting damage. Those planes, of course, could not be examined, because they had not returned. “The really important information was missing,” says Kreuter.

The statistician also uses this example to comment on a legal rule passed last year to allow “voluntary data donation” through our electronic patient files from 2023 onwards: “People who are really sick or in need of serious care are not sitting there thinking about donating data. So, while we get a lot of data, they are not the data we really need. The interesting cases are missing.”

Training for the algorithms

Kreuter has no doubt that the enormous amounts of data that have been collected over many years in various places can be used to improve people’s lives. She wants to use the latest artificial intelligence (AI) methods to do this and train the algorithms with the best possible data. At the same time, she is aware of the concerns the field of data science can evoke. When she talks about the smartphone, she calls it the “monitoring device we all carry around with us all the time.” To her it's an excellent example of where the big opportunities—and risks—of data science lie.

One opportunity lies in researching more closely how people’s lives change when they lose their job, believes Kreuter. In conjunction with colleagues from the German Federal Employment Agency’s Institute for Employment Research (IAB), she has developed an app built on the basis of a classic sociology study from the early 1930s: the Marienthal study. A group of scientists in the Austrian town of Marienthal used extensive surveys and observations to explore how the closure of a factory, the town’s most important employer, affected the social fabric and the individual people of the town.

Kreuter and her team are pursuing the same goal with a smartphone app they have developed. Study participants can input which career advice services they have taken advantage of, for example. Given the appropriate permissions, the app also records how people move, whether they reduce or expand their radius of movement after losing their job, whether they limit or intensify social contacts. And which apps they use on their smartphones.

“Meanwhile it is noon,” wrote one of the participants of the Marienthal study in the early 1930s in response to being asked how he spends the morning. In sociology, this sentence has become the embodiment of the empty daily structure of the unemployed, explains Kreuter. A lack of structure that in turn reduces the person’s chances of finding a job. A smartphone app lends itself to exploring this phenomenon, known in the jargon as employability, she says, “like a little researcher sitting in the phone and recording behavioral data.”

Artificial intelligence helps to analyze large amounts of data in a meaningful way. | © monsitj/fotolia

Hearing signals through the data noise

But the project is also a good example of how analyzing large amounts of data is never a trivial matter, says Kreuter. You get what she calls a “tsunami of data” when you extract so much information from a smartphone for a six-month period. Even the excellent artificial intelligence methods we now have for searching for usable data in this tsunami can’t do anything to change that, she says: “You don’t get much of a signal with all that noise.” That’s why collecting even more data isn’t always the right answer: “Often you have to refine the questions so that you’re only getting data that are really helpful.” And that’s where algorithms can be a useful tool. In that sense, AI is really just math and statistics, Kreuter says. “We just have to try it out.”

What she hopes to contribute to is evidence-based policy. In health care, it is now accepted and expected that the effectiveness of treatments needs to be proven. The concept of evidence-based medicine means that different groups are tested to see whether a drug achieves its goal: A group that receives the drug is compared against a group that receives only a placebo. If necessary, they are also compared against a group that receives neither the drug nor a placebo. Kreuter says it is also possible to collect a similar kind of data on sociopolitical issues, and cites universal basic income and other forms of basic social security as examples.

The researcher therefore sees the 2019 Nobel Prize in Economics as an important signal. It was awarded to the US-based research group consisting of Esther Duflo, Abhijit Banerjee, and Michael Kremer. They designed their research based on the principles of evidence-based medicine: randomized controlled trials of groups with a statistically comparable composition. “In their experiments they looked at what works in development aid,” says Kreuter, explaining their approach. Specifically, the Nobel laureates compared different groups of parents in India who were being encouraged to have their children vaccinated. Some were told they would also receive a kilo of lentils when the children were vaccinated, while the others were given no such incentive. The data showed that giving the lentil rations significantly increased the vaccination rate. That’s a clear statement.

Sometimes, researchers don’t even have to intervene themselves to investigate certain questions, adds Frauke Kreuter. The 2021 Nobel Prize winners David Card, Joshua Angrist, and Guido Imbens, for example, study so-called natural experiments. They used data from schoolchildren in the United States to examine the relationship between education and income. Each year, children in a given jurisdiction all start school on the same day, but they can often drop out as soon as they turn 16. From the fact that students thus attend school for different lengths of time depending on their date of birth, the researchers were able to derive a method for calculating what an extra year of schooling means for a person’s subsequent earnings.

Outdoor bars on the street, Glockenbachviertel, Munich. The Corona warning app, which was invented to reduce the risk of infection, is also intended to provide more security and a more normal life. But some people are skeptical about what happens to their sensitive health data in the process. Frauke Kreuter offers reassurance: “It’s fantastic in terms of data protection.”

© Stephan Rumpf/SZ-Photo/Picture Alliance

New insights thanks to data linking

Kreuter says there has been a lot of progress in recent years in the collection and analysis of good-quality data. However, she shakes her head at the ever-increasing number of surveys that purport to have something to do with social or political issues. Headlines like “Majority of Germans support sanctions for breaking Hartz IV rules” trigger her skepticism: “Many surveys are nothing more than infotainment.” She says you would have to look closely to determine whether the survey was actually conducted in such a way as to make the statement reliable. At the same time, Frauke Kreuter is sure that many relevant findings could even be obtained without the need to collect any new data. Better linking and correlating data that are already available could yield a great deal of knowledge in many areas with very little effort.

Kreuter knows that many people flatly reject such ideas. But there is no single, perfect solution to the problem of how to protect people’s privacy. It is not a solution, she says, “to pass it over to each individual and say: You decide!” She is referring to the rules of the General Data Protection Regulation, which ensure that we are constantly confronted with messages asking us to accept or reject certain forms of data use on the internet.

A better way to deal properly with the need for privacy is with a principle that US information scientist Helen Nissenbaum calls contextual integrity, says Kreuter. In other words, it’s about what context the use and disclosure of your data is appropriate in. An example would be, “If you have to show the bouncer your ID at the door to a club so they can see if you’re of legal age, then we agree that that’s okay. If the bouncer looks at your address, memorizes it, and shows up at your door one night, that’s not.” When asked if contextual integrity doesn’t contradict her advocacy of data linking, Kreuter points to the many ways data can be anonymized. But she has seen in her own research how subjective, case-by-case decisions always play a big role in how people handle their personal data.

For example, a study on the sharing of medical data that she conducted with fellow scientists before and after the outbreak of the coronavirus pandemic showed that Germans are very hesitant to give their health data to public institutions. They are far more willing to transmit health data to private companies, say through a smartwatch, especially if they are offered support on health issues in return. She also says that the skepticism shown by many people around the security of data collected through the Robert Koch Institute’s coronavirus warning app does not really have a rational explanation. “The app offers fantastic data privacy.” But with her background as a sociologist, she knows that human behavior is not always rational.


What’s not helpful in reaching a fair verdict

She admits that she herself has had an emotional reaction to certain developments connected with her scientific field. When she sees how courts in the United States, for example, are using automated decision making (ADM) to help them reach a verdict, it sends a bit of a shiver down her spine, she says. The basic idea behind it is that a judge has before them, say, a 38-year-old defendant accused of aggravated robbery who already has two dozen entries on their criminal record for the same or similar offenses. The ADM tool will then use artificial intelligence — relying on algorithms — to inform the judge how other judges have ruled in similar cases. “There’s clearly a risk of confirmation bias there,” says Kreuter — a risk that the judge will be subconsciously steered towards always imposing a sentence similar to ones imposed in previous verdicts. So, a tool that is intended to standardize the administration of justice, and thus make decisions fairer, could have unintended consequences.

Such biases can occur simply because algorithms learn from historical data, i.e., the computer is fed older examples to train on. This can mean that the AI is not up to date with social developments — which can be a source of unfairness. If online media automatically adjust the ads they show to the profile of the reader, for example, this can lead to tangible discrimination in the case of, say, job postings. “Women may well systematically see fewer advertisements for certain kinds of jobs — simply because women have clicked less on such jobs overall in the past,” explains Kreuter.

In order to stimulate debate on questions like these and to enable as large a section of society as possible to benefit from the findings of data science, Kreuter has teamed up with other researchers in the United States to launch the Coleridge Initiative. Fifteen US states and twelve American universities are now involved in the activities of the initiative, which is named after Samuel Taylor Coleridge, the English poet quoted at the start of this article. But non-American institutions such as the GESIS Leibniz Institute for the Social Sciences in Mannheim and the consulting firm Capgemini are also involved. According to the not-for-profit organization, the goal is to work with governments to ensure that data are used more effectively for public decision-making.

Just as Coleridge’s mariner saw “Water, water everywhere,” there is often no shortage of data, says Frauke Kreuter. But where the mariner could make no use of the salt water that surrounded him, Frauke Kreuter is not willing to give up on putting the masses of data in our modern world to productive use for decisions that serve the common good.

Text: Nikolaus Nützel

© Fotostudio klassisch-modern

Prof. Dr. Frauke Kreuter holds the Chair of Statistics and Data Science in Social Sciences and the Humanities at LMU Munich. Kreuter studied sociology at the University of Mannheim. She received her doctorate from the University of Konstanz before going as a postdoc to the Department of Statistics at the University of California, Los Angeles (UCLA), USA. After working at the University of Maryland, College Park, and the University of Michigan in Ann Arbor, USA, she was a statistics professor at LMU Munich from 2010 to 2014, during which time she headed the statistical methods group at the Institute for Employment Research (IAB) in Nuremberg, before moving to the University of Mannheim. In 2020, she returned to LMU following periods of research at Facebook, Stanford, and the University of California, Berkeley, USA. She is also Co-Director of the Social Data Science Center and a faculty member in the Joint Program in Survey Methodology at the University of Maryland, and Co-Director of the Mannheim Data Science Center.

Read more articles from the current issue of Insights. The research magazine in the online section.
