The National Science Foundation has awarded a $2.9 million grant for Information Technology research to a Cornell-led consortium studying new techniques for collecting social science data while better protecting confidentiality and anonymity.
The new grant is the most recent addition to the funding that began with the $4.1 million NSF Social Data Infrastructure grant, which was won in 1999 and is currently in its last year. Since 1999, the study has received more than $25 million in research support from the Census Bureau, the National Institutes of Health, the NSF, the Sloan Foundation and other sources.
“This latest $2.9 million grant is a confirmation of the very substantial contribution that this research program has made to the social science research community,” stated John M. Abowd, the Edmund Ezra Day Professor of Industrial and Labor Relations. Abowd is also the lead principal investigator for the study, and the only Cornell faculty member on the research team. The research activities will take place at the Cornell Institute for Social and Economic Research (CISER), which is also directed by Abowd.
Abowd is responsible for coordinating the activities of all the involved research teams from around the country. However, the individual teams are responsible for the design and execution of their own studies. There are four other co-principal investigators, as well as a team of senior scientists from universities around the country and the U.S. Census Bureau.
At CISER, senior scientists, graduate students, professional staff and programmers will work on the research under the direction of Abowd. While the study is Cornell-led, the NSF grant is multi-institutional. The other institutions working on the grant are Carnegie Mellon University, Duke University, the University of Michigan, UCLA, UC Berkeley, the University of Maryland, Argonne National Laboratory and the U.S. Census Bureau.
The goal of this study is to improve ways of protecting individuals’ privacy when social science data are used in research. Current methods often fail to do this effectively: even when names and addresses are removed from the data, individuals in publicly accessible databases can often be identified from combinations of income level, occupation, geographic area and age.
With the advent of modern technology, information that was once more easily kept private is now available in great detail via the Internet. Some types of databases, such as geospatial databases, can associate a street address with its exact latitude and longitude, and even with a household’s electric bill payments.
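The re-identification risk described above can be illustrated with a minimal sketch. The records, field names and the `unique_records` helper are hypothetical and are not drawn from the study; the sketch only counts how many records have a combination of income level, occupation, geographic area and age that no other record shares.

```python
from collections import Counter

# Hypothetical records with names and addresses already removed.
RECORDS = [
    {"income": "50-75k", "occupation": "teacher", "area": "14850", "age": 34},
    {"income": "50-75k", "occupation": "teacher", "area": "14850", "age": 34},
    {"income": "75-100k", "occupation": "nurse", "area": "14850", "age": 41},
    {"income": "25-50k", "occupation": "clerk", "area": "13045", "age": 29},
    {"income": "100k+", "occupation": "surgeon", "area": "13045", "age": 58},
]

# Assumed set of attributes an outsider might already know about a person.
QUASI_IDENTIFIERS = ("income", "occupation", "area", "age")


def unique_records(rows, keys=QUASI_IDENTIFIERS):
    """Return the rows whose combination of `keys` appears exactly once."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if counts[tuple(row[k] for k in keys)] == 1]


if __name__ == "__main__":
    at_risk = unique_records(RECORDS)
    print(f"{len(at_risk)} of {len(RECORDS)} records are unique on {QUASI_IDENTIFIERS}")
```

Any record that is unique on those four attributes could, in principle, be matched to a person by someone who knows those facts about them, even though no name or address appears in the data.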
“We expect to develop a new, and very exciting, collection of public use data products known as ‘synthetic data.’ These products permit analysts to study a wide variety of business and household models using microdata but without compromising the confidentiality of the data that were originally provided to the Census Bureau,” Abowd stated.
Social scientists are still uncertain about whether the use of “virtual households” and synthetic data will produce the same results as actual data.
Much of the research to be done will focus on certifying the validity of these methods, as well as supporting the research being done at a network of nine Census Research Data Centers. The RDCs use actual, confidential, carefully encrypted microdata in projects that have been approved by the Census Bureau.
Another technique that is being investigated using the grant funding is “coarsening,” in which small groups of households or businesses are combined into a single record. The Quarterly Workforce Indicators Online is an example of this concept.
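As a rough illustration of coarsening, the sketch below combines individual business records into one record per county and industry. The data, field names and the `coarsen` function are hypothetical; this is not how the Quarterly Workforce Indicators are actually produced, only an example of replacing small groups of records with a single aggregate.

```python
from collections import defaultdict

# Hypothetical business-level microdata.
BUSINESSES = [
    {"county": "Tompkins", "industry": "retail", "employees": 12},
    {"county": "Tompkins", "industry": "retail", "employees": 30},
    {"county": "Tompkins", "industry": "retail", "employees": 8},
    {"county": "Cortland", "industry": "retail", "employees": 5},
    {"county": "Cortland", "industry": "retail", "employees": 20},
]


def coarsen(rows, group_keys=("county", "industry"), value_key="employees"):
    """Combine all rows that share `group_keys` into one aggregate record."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in group_keys)].append(row[value_key])

    combined = []
    for key, values in sorted(groups.items()):
        record = dict(zip(group_keys, key))
        record["businesses"] = len(values)  # how many rows were merged
        record[value_key] = sum(values)     # total across the group
        combined.append(record)
    return combined


if __name__ == "__main__":
    for record in coarsen(BUSINESSES):
        print(record)
```

The published output reports only group totals, so no single business’s employment figure appears on its own.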
Abowd and the research team expect that the new NSF support will enhance the workings of the RDCs and uphold the confidentiality standards expected of researchers.
Archived article by Jennifer Murabito
Sun Staff Writer