DataSafes

Statistical analysis and machine learning across multiple datasets without compromising privacy


Status: Full program plan in development

A great paradox of the information age is that personal information is the most valuable data for addressing societal problems and simultaneously the most dangerous data for individual privacy. Within banks of data are insights about how to better educate our children, avoid the diseases that loom ahead for millions, reduce crime and incarceration, improve public services, and much more. But the tradeoff between gaining the value of data and protecting privacy seems immutable today, and that limits access to rich educational, health, administrative, and business datasets.

An array of new technologies is just starting to make it feasible to ease the tension between data and privacy. Computation on encrypted data is now possible. The field of differential privacy has developed methods to track privacy leakage. New techniques can allow researchers to clean, link, and model multiple datasets without having to see the raw data. However, these technologies are nascent and limited, so today they are used in only a few instances.
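To make the idea of tracking privacy leakage concrete, here is a minimal sketch of one standard differential privacy pattern: a counting query answered with Laplace noise, with each query charged against a fixed privacy budget (epsilon) under basic sequential composition. It is illustrative only and is not the DataSafes design; the names `PrivacyBudget`, `private_count`, and `laplace_noise`, and the sample records, are assumptions made up for this example.

```python
import random


class PrivacyBudget:
    """Hypothetical accountant: sums epsilon across queries (basic composition)."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Refuse any query that would push cumulative leakage past the budget.
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, budget, rng):
    """Return a noisy count of matching records, charging `epsilon` to the budget."""
    budget.charge(epsilon)
    true_count = sum(1 for record in records if predicate(record))
    # A counting query changes by at most 1 when one person is added or
    # removed (sensitivity 1), so the Laplace scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Illustrative usage with made-up records.
records = [{"age": 34}, {"age": 61}, {"age": 47}, {"age": 29}]
budget = PrivacyBudget(total_epsilon=1.0)
rng = random.Random(0)

noisy_over_40 = private_count(records, lambda r: r["age"] > 40, 0.5, budget, rng)
noisy_under_40 = private_count(records, lambda r: r["age"] <= 40, 0.5, budget, rng)
# The budget is now fully spent; a third query would raise RuntimeError.
```

The analyst never sees the exact counts, and the accountant gives the data holder a running total of how much has been revealed. Practical systems typically use tighter composition accounting than the simple sum shown here.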

DataSafes will be an end-to-end framework that rigorously protects privacy throughout the entire cycle of statistical analysis, from data cleaning and linkage, to model discovery, to the generation of results from the discovered models. If successful, this new framework would make personal data analyzable and private at the same time. It would allow individuals, companies, and agencies to provide access to more data, and to more valuable data, with confidence that it will remain private while helping to solve major challenges.

The Alfred P. Sloan Foundation has generously supported the design of the DataSafes program.

Actuate lead: Wade Shen, Chief Program Officer

For more information

DataSafes paper