Monday, November 28, 2011

Privacy and Cloud Databases

Greetings from the Australian National University, where Professor Chris Clifton, from Purdue University, is speaking on "Freeing Cloud Databases from Privacy Constraints". This is topical, as Microsoft recently criticized the Australian Government because the Australian Government Cloud Computing Guidelines effectively require Australian citizens' records to be stored in Australia. Also I raised concerns about storing student data in the cloud.

Professor Clifton is working with Qatar University on techniques of atomization (or fragmentation) to separate the person identifiers from the data about them (see papers). He claimed that this way the bulk of the data can be stored in the cloud and processed, without revealing private information. To further protect the information, the list of personal identifiers can be encrypted. Some shortcuts can be used, such as using hashing to check if records match, without having to use actual encryption.
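To make the idea of separating identifiers from data concrete, here is a minimal sketch in Python. It is my own illustration, not the method from Professor Clifton's papers. The names `SECRET_KEY`, `link_token` and `fragment`, and the sample records, are all hypothetical. It assumes a keyed hash (HMAC-SHA256) is used as the matching token, with the key and the identifier list kept off the cloud.

```python
import hashlib
import hmac

# Hypothetical key: assumed to stay on the data owner's own system.
SECRET_KEY = b"keep-this-off-the-cloud"


def link_token(person_id, key=SECRET_KEY):
    """Return a keyed hash of the identifier, used only to match records."""
    return hmac.new(key, person_id.encode("utf-8"), hashlib.sha256).hexdigest()


def fragment(records, id_field="name"):
    """Split records into a private identifier list and cloud-safe rows."""
    identifiers = {}  # token -> identifier, kept private (could be encrypted)
    cloud_rows = []   # rows with the identifier replaced by the token
    for rec in records:
        token = link_token(rec[id_field])
        identifiers[token] = rec[id_field]
        row = {k: v for k, v in rec.items() if k != id_field}
        row["token"] = token
        cloud_rows.append(row)
    return identifiers, cloud_rows


# Made-up sample data for illustration only.
records = [
    {"name": "Alice Example", "postcode": "2600", "diagnosis": "flu"},
    {"name": "Bob Example", "postcode": "2601", "diagnosis": "asthma"},
    {"name": "Alice Example", "postcode": "2600", "diagnosis": "cold"},
]
identifiers, cloud_rows = fragment(records)

# Two rows belong to the same person if their tokens match; no name is needed.
same_person = hmac.compare_digest(cloud_rows[0]["token"], cloud_rows[2]["token"])
```

Note that fields such as the postcode remain in the cloud rows. Removing the name alone does not stop someone matching those remaining fields against other data.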

Professor Brad Malin, Vanderbilt University, talked on "Risk-based Privacy: What are we Afraid Of?" at ANU, 14 November 2011. He pointed out that by matching records with other information available on-line, it is possible to re-identify them.

Professor Clifton admitted there was a risk of re-identification. But one issue I see is the risk of what will happen in the future. While the data may be anonymous now, new data on-line or more capable computers may make it possible to identify individuals later. Professor Clifton mentioned there are techniques to address this but they also reduce the value of the data, as they deliberately introduce some errors into the data.

The techniques discussed by Professor Clifton involve a compromise between privacy and efficient processing. The data can be placed on a public server and partly encrypted. The difficulty is to decide when the data is sufficiently private. The technical terms are k-anonymity, l-diversity and t-closeness. Some data has to be suppressed completely to protect privacy, but Professor Clifton has found that this will typically be a small amount. These techniques will be added to the UT Dallas Data Security and Privacy Lab Anonymization ToolBox. Recent work has been on the Hive open source data warehouse system. Another option is to use a secure coprocessor.
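As a rough illustration of the first of these terms: a table is k-anonymous if every combination of quasi-identifier values (such as postcode and age band) is shared by at least k records. The sketch below is my own and is not taken from the Anonymization ToolBox. The function names and sample rows are hypothetical. It measures k for a small table and suppresses the records in groups that are too small.

```python
from collections import Counter


def group_key(row, quasi_identifiers):
    """Combination of quasi-identifier values that defines a row's group."""
    return tuple(row[q] for q in quasi_identifiers)


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifiers."""
    counts = Counter(group_key(r, quasi_identifiers) for r in rows)
    return min(counts.values()) if counts else 0


def suppress_small_groups(rows, quasi_identifiers, k):
    """Drop rows whose group has fewer than k members."""
    counts = Counter(group_key(r, quasi_identifiers) for r in rows)
    return [r for r in rows if counts[group_key(r, quasi_identifiers)] >= k]


# Made-up sample data for illustration only.
rows = [
    {"postcode": "2600", "age_band": "30-39", "diagnosis": "flu"},
    {"postcode": "2600", "age_band": "30-39", "diagnosis": "asthma"},
    {"postcode": "2601", "age_band": "40-49", "diagnosis": "flu"},
    {"postcode": "2601", "age_band": "40-49", "diagnosis": "cold"},
    {"postcode": "2602", "age_band": "50-59", "diagnosis": "flu"},
]
qi = ["postcode", "age_band"]

k_before = k_anonymity(rows, qi)          # one person is alone in their group
kept = suppress_small_groups(rows, qi, 2)
k_after = k_anonymity(kept, qi)
```

Here one record out of five has to be suppressed to reach k = 2. This sketch says nothing about l-diversity or t-closeness, which also look at how the sensitive values are spread within each group.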

Apart from risks in the data, Professor Clifton pointed out there may also be risky queries, where the aim is to confirm a suspicion about the data, rather than just a random query. This may require some queries to be rejected. Which queries would have to be rejected is a topic for future research. Also it occurs to me that the fact a query was rejected may in itself reveal information.

Apart from the risk of a change in the future, there is the more practical consideration of basic security rules not being followed for protecting data. There is no point in creating a sophisticated system if the operating system of the server has not had the latest security patches applied, if strong passwords are not used, or if the staff are bribed to supply copies of the data. It may be that in practice a well run cloud server is more secure than the average small server at a company or government agency, simply because more care can be given to running a large server properly.

Professor Clifton is speaking on "Privacy-Preserving Data Mining at 10: What's Next?" at the Ninth Australasian Data Mining Conference: AusDM 2011, 1 to 2 December 2011 at University of Ballarat, Mt. Helen Campus, Ballarat.
