DURHAM, N.C. -- Since it began in 2011, the Syrian Civil War has left hundreds of thousands of people dead. But as the conflict continues, many monitoring groups say they are starting to lose count of the bodies.
Realities on the ground make it increasingly hard to pinpoint precise numbers -- so much so that in 2014 the United Nations announced it would stop updating its death toll due to accuracy concerns.
But some researchers are determined to do the best they can to keep Syria’s death count -- which may have reached half a million -- from getting lost in the fog of war. Duke statistician Rebecca Steorts is one of them.
“This is probably the most important thing I’ve worked on,” Steorts said.
Steorts and colleagues published an analysis in June 2018 which concluded that in the first three years of the conflict, between 190,102 and 193,646 named victims were reported.
For the past five years, Steorts has been developing state-of-the-art statistics and machine learning techniques to help human rights groups take on the grim task of tallying Syria’s war dead.
To show how their methods work, Steorts and colleagues Beidi Chen and Anshumali Shrivastava at Rice University analyzed roughly 354,000 Syrian death records for the period between March 2011 and April 2014.
Provided by the non-profit Human Rights Data Analysis Group (HRDAG), the death records data consisted of overlapping casualty lists collected by different groups, each with access to a different snapshot of the violence. Each victim is identified by name, gender, and the date and location of their death.
But combining these records is complicated by the fact that some victims were recorded more than once. The news of someone's death may come from family members or witnesses, but also from hospital or morgue records, or the information may be obtained from social media.
It sounds like an easy problem. To make sure deaths aren't double-counted, just go through the reports and weed out duplicates. But some of the names are misspelled; prefixes, suffixes and nicknames are inconsistent; dates and locations are inexact.
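The article does not say which string-similarity measure the researchers used, but the kind of fuzzy comparison the problem demands can be illustrated with Python's standard `difflib`. The names below are invented:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Rough similarity score in [0, 1] between two name strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical records: the same person reported with a spelling
# variant and a dropped prefix -- an exact-match check would miss it.
close = name_similarity("Mohammed al-Halabi", "Muhammad Halabi")
far = name_similarity("Mohammed al-Halabi", "Sara Khoury")
assert close > far
```

A real de-duplication pipeline would combine a score like this with the date and location fields before declaring two records a match.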
There’s also a scaling issue. Checking every possible pairing of the 354,000 records in the Syrian data set to determine whether they refer to the same person or not would mean comparing roughly 63 billion pairs.
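That figure is just the all-pairs count: with n records there are n(n - 1)/2 distinct pairs to check. A one-line check of the arithmetic:

```python
# Distinct unordered pairs among n records: n * (n - 1) / 2
n = 354_000
pairs = n * (n - 1) // 2
print(pairs)  # 62,657,823,000 -- roughly 63 billion comparisons
```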
So Steorts and her team came up with a different approach. To reduce the number of comparisons, they relied on a technique called “locality sensitive hashing.” Records with similar names, locations, and dates of death were grouped together, and only records within the same group -- those with a reasonable chance of being duplicates -- were compared.
Out of 63 billion possible pairs, they only dealt with 450,000 -- more than 99 percent fewer pairs.
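The article does not reproduce the authors' hashing scheme, but the idea behind locality sensitive hashing can be sketched with a toy min-hash implementation: records whose names share many character bigrams tend to land in the same bucket, and only bucketed pairs are ever compared. Everything below (the records, the parameters, the salted-CRC32 hash family) is illustrative, not the published method:

```python
import zlib
from itertools import combinations

def shingles(name, k=2):
    """Character k-grams of a name, ignoring case and punctuation."""
    s = "".join(ch for ch in name.lower() if ch.isalpha())
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def signature(name, n_hashes=8):
    """Min-hash signature: one minimum per salted hash function.
    Names with many shared shingles tend to share signature entries."""
    grams = shingles(name)
    return tuple(min(zlib.crc32(f"{i}:{g}".encode()) for g in grams)
                 for i in range(n_hashes))

def candidate_pairs(records, n_bands=4):
    """Band the signatures; records sharing any band share a bucket,
    and only records in a shared bucket become candidate pairs."""
    buckets = {}
    for rid, name in records:
        sig = signature(name)
        rows = len(sig) // n_bands
        for b in range(n_bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets.setdefault(key, []).append(rid)
    return {pair for ids in buckets.values()
            for pair in combinations(sorted(ids), 2)}

# Hypothetical records: 0 and 1 are plausibly the same person;
# identical names (2 and 3) are guaranteed to share every bucket.
records = [(0, "Mohammed al-Halabi"), (1, "Muhammad Halabi"),
           (2, "Sara Khoury"), (3, "Sara Khoury")]
print(candidate_pairs(records))
```

The payoff is exactly the reduction described above: instead of scoring every pair in the dataset, the expensive comparison runs only on pairs that fell into a shared bucket.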
The researchers presented their work at the 2018 Joint Statistical Meetings in Vancouver in July.
With their method, they estimate that the number of identified victims killed between March 2011 and April 2014 was 191,874, plus or minus 1,772 (a 95 percent confidence interval) -- which closely matches HRDAG’s estimate of 191,369.
The estimate produced by HRDAG relied on human experts to review the pairs and decide which were matches -- a task that took months. The researchers’ machine learning model, by contrast, delivered results in as little as two minutes.
Unfortunately, both estimates are likely to be undercounts, the researchers say. Many violent incidents go unreported, and many bodies are never identified.
What’s more, their work takes the Syrian death toll only to 2014, but the conflict continues. What about the four-plus years since?
In March 2018, the Syrian Observatory for Human Rights said the war had killed more than half a million people.
Numbers matter. Putting a figure on the human price of war can help fuel support for humanitarian assistance, drive political action and hold perpetrators accountable, Steorts says.
Steorts’ research won’t make it any easier for monitoring groups to collect Syrian casualty data on the ground, where the security situation makes certain areas difficult to access. But as Syria’s death toll continues to rise, she hopes their work will help these groups make sense of the messy data they do have, and update Syria’s casualty count more efficiently and with greater certainty than previous methods.
“It’s amazing what they’re doing on the ground,” Steorts said. “The groups that are there are doing a phenomenal job. They’re risking their lives.”
“As time goes on, and the conflict becomes more challenging, I think that the standard error will very likely go up,” Steorts said. “But that’s not the same as saying it’s impossible to estimate how many people have died. It’s a difficult problem, but I don’t think it’s impossible. And we want to get it right.”
For her de-duplication research, Steorts was awarded a five-year CAREER grant from the National Science Foundation in 2017. She also won a three-year NSF grant in 2015, and was named one of the world's top 35 innovators under the age of 35 by MIT Technology Review magazine in 2015, among other honors.
Steorts is also applying her expertise in merging large, noisy datasets to make sure that nobody is counted twice in the upcoming U.S. Census. She and her team led two workshops on the topic last year.
This work was made possible through a collaboration with the Human Rights Data Analysis Group.
CITATION: "Unique Entity Estimation With Application to the Syrian Conflict," Beidi Chen, Anshumali Shrivastava and Rebecca Steorts. Annals of Applied Statistics, June 2018. https://doi.org/10.1214/18-AOAS1163