Reading time: 11 minutes

Written by Nanda Min Htin | Edited by Josh Lee Kok Thong

We’re all law and tech scholars now, says every law and tech sceptic. That is only half-right. Law and technology is about law, but it is also about technology. This is not obvious in many so-called law and technology pieces which tend to focus exclusively on the law. No doubt this draws on what Judge Easterbrook famously said about three decades ago, to paraphrase: “lawyers will never fully understand tech so we might as well not try”.

In open defiance of this narrative, LawTech.Asia is proud to announce a collaboration with the Singapore Management University Yong Pung How School of Law’s LAW4032 Law and Technology class. This collaborative special series is a collection featuring selected essays from students of the class. Ranging across a broad range of technology law and policy topics, the collaboration is aimed at encouraging law students to think about where the law is and what it should be vis-a-vis technology.

This piece, written by Nanda Min Htin, examines the value of differential privacy in establishing an intermediate legal standard for anonymisation in Singapore’s data protection landscape. Singapore’s data protection framework treats privacy-protected data that can still be re-identified as anonymised data, so long as there is no serious possibility that re-identification would occur. As a result, such data are not considered personal data and fall outside the protection of Singapore law. In contrast, major foreign legislation such as the GDPR in Europe sets a clearer and stricter standard for anonymised data by requiring re-identification to be impossible; anything less is considered pseudonymised data and subjects the data controller to legal obligations. The lack of a similar intermediate standard in Singapore risks depriving reversibly de-identified data of legal protection. One key example is differential privacy, a popular privacy standard for a class of data de-identification techniques. It prevents individuals from being re-identified with high confidence by adding random noise to computational results queried from the data. However, like many other data anonymisation techniques, it does not completely prevent re-identification. This article first highlights the value of differential privacy in exposing the need for an intermediate legal standard for anonymisation under Singapore data protection law. It then explains how differential privacy’s technical characteristics would help establish regulatory standards for privacy by design and help organisations fulfil data breach notification obligations.

Introduction

The Personal Data Protection Act (the “PDPA”), [1] accompanied by a host of advisory guidelines, serves as the principal safeguard for an individual’s right to personal data in Singapore. Pertinently, anonymised data is not considered personal data and falls outside the ambit of the PDPA.[2] However, while foreign counterparts such as the General Data Protection Regulation (the “GDPR”) require anonymised data to be completely non-re-identifiable,[3] the Singapore definition of anonymisation falls far short of this standard by embracing data that can be re-identified, albeit without a “serious possibility” of re-identification.[4] Hence, many types of data treated with reliable but non-absolute de-identification techniques are not protected by the PDPA. Data treated with differential privacy (“DP”) algorithms is one such example that will be the focus of this article.

DP, touted as one of the 10 Breakthrough Technologies of 2020,[5] has been hurriedly adopted by states and big tech alike, from the US Census Bureau to Facebook in the wake of the Cambridge Analytica scandal.[6] In essence, DP gives data subjects plausible deniability that their information is part of the original dataset whose statistical computation result has been disclosed.[7] The third-party data processor who receives a group computation result (e.g. population statistics) would not be able to identify any individual person at a significant level of confidence. However, DP does not prevent attempts at re-identification from, for instance, combining that computation result with other auxiliary information already available to an attacker.[8] Broadly speaking, DP is a technical standard according to which many non-absolute de-identification mechanisms are designed. Thus, it is a prime reference for designing an intermediate legal standard for anonymisation.

This article urges a more facilitative data protection environment in Singapore for technological advancements in privacy, namely DP. To this end, two issues are discussed: first, a more granular standard for anonymisation is required to protect DP-protected data which is reversibly de-identified; second, this standard will aid organisations in their efforts towards meeting regulatory standards for privacy by design and fulfilling data breach notification obligations under the PDPA.  

The problem with anonymised data in Singapore’s data protection framework

The overly broad ambit of anonymisation in the PDPA

Data that is anonymised as defined under the PDPA is exempt from being caught under its main provisions (i.e. Parts 3 to 6A). However, there are two forms of technical anonymisation: reversible, where the data processor remains able to identify data subjects in the original data; and irreversible, where this ability is permanently forfeited. While the latter clearly fulfils the legal definition of anonymised data, the former would not fulfil this if there is a “serious possibility that an individual could be re-identified”.[9] The ‘motivated intruder’ test is prescribed as a “starting point” to determine this possibility—if a motivated, reasonably competent attacker with access to standard resources (e.g. the Internet and public information) cannot re-identify a particular data subject, the data would be considered anonymised under the PDPA.[10] As will be explained in the following section, DP-protected data could easily be an example of this.

The position under the GDPR is different. Data that is no longer attributable to an individual without the help of additional information[11] is classified as pseudonymised data under the GDPR and falls within its ambit. Once there is a mere possibility of re-identification, the data fails to meet the GDPR standard for anonymisation.[12] The GDPR provision for pseudonymisation consequently sets a formal legal standard for a plethora of widespread privacy techniques (e.g. DP) which do not, and need not, promise absolute de-identification.

Unlike the GDPR, the Singapore framework makes no express distinction between different legal standards for de-identification.[13] Although the PDPC recognises technical examples of irreversible and reversible anonymisation,[14] it does not prescribe any legal standards to accommodate the varying extents of reversibility provided by essential techniques such as DP.

The DP guarantee and its legal shortcomings  

DP guarantees that if any single user’s data is modified within a dataset (D), the probability distribution of outcomes for any processing done on the modified dataset (D’) would not be significantly different from that done on the original. This requires introducing random noise to the statistics computed from D. Thus, DP can provide population-level information without allowing inferences about individual-level information to be made with confidence. While DP computations primarily de-identify the outputs queried from a dataset, they are also capable of generating differentially private datasets themselves. For simplicity, this article will consider DP as a general anonymisation function that guarantees a low possibility of re-identification without inquiring into its myriad mechanisms. 
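For readers who want the formal statement, the standard ε-differential privacy definition from the academic literature (not reproduced in the PDPC materials) can be sketched as follows, where M is the randomised computation, D and D′ are datasets differing in one individual’s record, S is any set of possible outputs, and ε is the privacy-loss parameter (smaller ε means stronger privacy):

```latex
% Standard \varepsilon-differential privacy (after the academic literature):
% a randomised mechanism M satisfies \varepsilon-DP if, for all neighbouring
% datasets D, D' and every set of possible outputs S,
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon}\,\Pr[\,M(D') \in S\,]
```

In plain terms, whether or not any one person’s data is in the dataset changes the probability of any given output by at most a factor of e^ε, which is what gives data subjects the plausible deniability described above.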

For illustration, a DP-treated output could be analysed to reveal that students who study computer science score better in law school, but it guarantees that this finding is not attributable to “Jerrold”, a particular student who scored an A-grade. Assume Query X returns that 100 students scored an A-grade out of a 1000-student cohort. After Jerrold drops out, Query Y returns that 99 students scored an A-grade out of a 999-student cohort. Had the exact counts been released, comparing the two queries would immediately reveal that the departing student, Jerrold, scored an A-grade. However, because the figures 100 and 99 are noisy figures returned by a DP computation, not exact ones, the risk of confidently inferring that Jerrold scored an A-grade is greatly reduced.
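To make the Jerrold illustration concrete, below is a minimal Python sketch of the Laplace mechanism, the most common way DP noise is added to counting queries. The cohort figures mirror the example above; the ε value and function names are hypothetical choices for illustration, not a reference implementation:

```python
import numpy as np

def dp_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a noisy count via the Laplace mechanism.

    A counting query changes by at most 1 when one person's record is added
    or removed, so its sensitivity is 1. The noise scale is sensitivity/epsilon:
    smaller epsilon (stronger privacy) means more noise.
    """
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

epsilon = 0.5  # hypothetical privacy budget per query

# Query X: 1000-student cohort, 100 true A-grades (Jerrold included)
query_x = dp_count(true_count=100, epsilon=epsilon)

# Query Y: Jerrold has dropped out, leaving 999 students and 99 true A-grades
query_y = dp_count(true_count=99, epsilon=epsilon)

# Because both figures are noisy, their difference no longer reveals with
# confidence whether the departing student scored an A-grade.
print(f"Query X ~ {query_x:.1f}, Query Y ~ {query_y:.1f}, difference ~ {query_x - query_y:.1f}")
```

Run repeatedly, the difference between the two noisy counts scatters around 1 with a spread of a few units at this ε, so an attacker comparing the two releases cannot confidently attribute the A-grade to Jerrold.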

However, it must be recognised that DP alone is not a sufficient basis on which to build an absolute de-identification standard. DP does not eliminate more fundamental privacy breaches, such as unauthorised access to the collected datasets themselves, nor does it prevent DP-treated datasets from being combined with information publicised by individuals themselves.[15] Also, DP provides a strong privacy guarantee at the expense of utility: the greater the noise introduced by a DP algorithm, the less privacy is lost but the less useful the output becomes. This trade-off is especially acute in small datasets and record-level data, where even a small gain in commercial utility is likely to produce a larger loss in privacy.[16] Since there are no accepted guidelines for how much noise is to be introduced,[17] organisations are afforded plenty of leeway in setting the parameter and passing off insufficiently noisy outputs as DP-protected datasets that fall far short of reasonable legal standards.
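A short sketch (with purely hypothetical ε values) illustrates this leeway: as ε grows, the Laplace noise scale shrinks rapidly, so an organisation could set ε very high, add almost no noise, and still describe the release as differentially private.

```python
import numpy as np

# For a counting query the Laplace noise scale is sensitivity / epsilon.
# With no accepted guideline for epsilon, two "DP-protected" releases can
# offer vastly different levels of protection.
sensitivity = 1.0
for epsilon in (0.1, 1.0, 10.0, 50.0):
    scale = sensitivity / epsilon
    noise = np.random.laplace(0.0, scale, size=100_000)
    print(f"epsilon={epsilon:>5}: noise scale={scale:.3f}, "
          f"95% of noise lies within +/-{np.quantile(np.abs(noise), 0.95):.2f}")
```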

Singapore’s data protection framework does not address whether the use of DP-based tools meets its legal standards for anonymisation. While DP-protected data would be unlikely to meet the strict definition of anonymised data under the GDPR, it could well be treated as anonymised under the PDPC’s "serious possibility" of re-identification test, meaning that the PDPA would not protect such data at all. Thus, it is crucial for the PDPA to recognise a category of reversibly de-identified data that can be legally protected instead of being shoehorned under anonymised data.

The value of protecting reversibly de-identified (DP-protected) data under the PDPA

Fulfilling privacy by design (“PbD”)  

It has been suggested that under the GDPR, pseudonymisation helps data controllers meet the “data protection by design and by default” obligations.[18] These obligations require data controllers to implement data protection measures at the organisational level. While PbD is not statutorily required by the PDPA, it is nevertheless strongly encouraged by the PDPC. DP embodies one of the seven foundational principles underlying PbD under the GDPR and the PDPA alike — to be proactive and preventive.[19] This entails identifying and preventing data protection risks before the fact, not after; it is concerned not with remedying privacy breaches once they occur, but with instituting mechanisms to anticipate them in advance.

Crucially, an effective implementation of DP can serve as a technical benchmark for PbD under Singapore’s data protection framework: it offers a measurable way of determining whether a data processing method keeps the variation between processed and unprocessed results within a negligible, regulator-set threshold. This is because DP builds on principles similar to counterfactual testing (“CF”), which is prescribed as a test for repeatability under Singapore’s Model AI Governance Framework (the “Model Framework”).[20]

The Model Framework highlights the aim of repeatability in AI design – counterfactually fair AI systems ought to “consistently perform an action or make a decision, given the same scenario”.[21] For such AI systems, in a counterfactual scenario where a protected attribute such as education or income level is modified, the same predictions will be made by the AI across all instances of that attribute. The causal effect of other attributes on that protected attribute, such as whether ethnicity affects education or income level, would not change the counterfactual prediction. 

CF is founded upon two essential elements: fairness and causality.[22] While fairness entails negating the effect of a certain variable, causality examines the effect of that variable on other variables and the output.[23] Likewise, DP ensures that modifications to an individual’s data do not produce a significantly different statistical output from the original, notwithstanding that individual’s causal relationships with other individuals’ data (e.g. it does not matter whether “Jerrold’s” exam scores are the same as “Jarrold’s” because they copied from each other).

While DP may not necessarily be mathematically equivalent to CF, it does build upon both elements of CF, and can similarly be appropriated as a technical benchmark to help determine a legal standard. On the one hand, CF testing is used to determine the repeatability for “AI solutions at scale” in Singapore.[24] On the other hand, the effectiveness of a DP algorithm (i.e. as measured by the masking effect of noise added) could be used to determine whether a privacy design meets the legal standard for pseudonymisation.

Fulfilling data breach notification obligations

A data breach entails the unauthorised access and use of personal data. Pseudonymisation has been said to help data controllers meet data breach notification obligations under the GDPR. The requirements for notification depend on the level of “risk [caused] to the rights and freedoms of natural persons”,[25] which in turn depends on the level of pseudonymisation. The 2021 PDPA amendments have mirrored the GDPR in establishing obligations for notifying data breaches. An organisation is obliged to notify if the data breach meets either of two requirements: (1) the breach causes significant harm to affected individuals; or (2) the breach is of significant scale (i.e. 500 or more individuals are affected).[26] However, it remains unclear how breaches involving pseudonymised (or DP-protected) data would be treated in Singapore.

The threshold for significant harm is satisfied if the data breach relates to, for instance, an individual’s full name, alias or account identifier with an organisation.[27] As previously illustrated, techniques such as DP are used precisely to protect such information. As for the significant scale requirement, DP-protected data would likely satisfy it because the very strength of DP lies in the size of the dataset: the larger it is, the smaller the DP-induced loss in accuracy relative to other sources of error such as statistical sampling error.[28] Unsurprisingly, notable applications of DP include the US 2020 Decennial Census and The Opportunity Atlas, which studied over 20 million children and their parents.[29]

It must be noted, however, that DP-protected data are not immune to data breaches, especially where practical implementation is concerned. Many published DP algorithms violate their theoretical guarantees simply because DP design is “subtle and error-prone”.[30] Furthermore, reliable means of testing for bugs in DP implementations are far from established and remain the subject of “ongoing research”.[31] Take, for example, a bug that causes the wrong amount of noise to be added by a DP algorithm. The correct and the buggy algorithms could return the same output on any single run, even though the latter drew that output from an incorrect output distribution. Unlike traditional testing, where simply checking input/output pairs would uncover a bug, DP requires customised debugging mechanisms such as sophisticated statistical testing or programme analysis.[32] This is because DP algorithms produce randomised output distributions; observing single values is not helpful.
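A toy sketch, loosely in the spirit of the counterexample-search testing described in the work cited above, shows why. The query, ε value, chosen event and "bug" are all hypothetical: the sketch repeatedly samples a correct and a buggy Laplace mechanism and estimates the probability ratio that ε-DP is supposed to bound.

```python
import numpy as np

EPSILON = 1.0      # claimed privacy budget
SENSITIVITY = 1.0  # sensitivity of a counting query

def correct_mechanism(true_count: float) -> float:
    # Correct Laplace mechanism: noise scale = sensitivity / epsilon
    return true_count + np.random.laplace(0.0, SENSITIVITY / EPSILON)

def buggy_mechanism(true_count: float) -> float:
    # Hypothetical bug: noise scale accidentally halved, silently weakening
    # the real guarantee to roughly 2 * epsilon
    return true_count + np.random.laplace(0.0, SENSITIVITY / (2 * EPSILON))

def estimated_ratio(mechanism, n: int = 100_000) -> float:
    """Estimate Pr[M(D) in S] / Pr[M(D') in S] for neighbouring counts 100 and 99
    and the event S = {output >= 99.5}. Under epsilon-DP this ratio should not
    exceed e**epsilon; any single output from either mechanism looks equally
    plausible, so only repeated sampling reveals the difference."""
    hits_d = sum(mechanism(100.0) >= 99.5 for _ in range(n))
    hits_d_prime = sum(mechanism(99.0) >= 99.5 for _ in range(n))
    return hits_d / hits_d_prime

print(f"bound e^epsilon   : {np.exp(EPSILON):.2f}")
print(f"correct mechanism : {estimated_ratio(correct_mechanism):.2f}")  # ~2.3, within bound
print(f"buggy mechanism   : {estimated_ratio(buggy_mechanism):.2f}")    # ~4.4, violates bound
```

On any single run the buggy mechanism returns a perfectly plausible count; only the ratio estimated over many runs exceeds the e^ε bound and flags the violation.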

Since practical vulnerabilities in reversible de-identification techniques may be discovered only in the event of a breach, not during their design, there is great danger in classifying such “anonymised” data as falling outside the scope of the PDPA. It would also compound compliance issues for Singapore businesses already faced with entirely new notification obligations under the PDPA.

Conclusion

It is timely that the first aim of the 2021 PDPA amendments is to strengthen consumer trust through organisational accountability.[33] An accountability-based approach expects data protection by design, which coheres with this paper’s ethos that the regulation of developer behaviour should precede the conversation on regulating user behaviour. However, particularly in the realm of data de-identification, the law cannot expect so much of businesses without first providing a clear standard for how far they should go to anonymize data, and the consequent obligations (or lack thereof) that follow. Accordingly, two broad observations have been made: that DP exposes the glaring lack of an intermediate standard for reversibly de-identified data, and that the Singapore data protection framework ought to bring it within the PDPA’s ambit especially because organisations that employ DP would have an easier time adhering to this new intermediate standard.  

This piece was published as part of LawTech.Asia’s collaboration with the LAW4032 Law and Technology module of the Singapore Management University’s Yong Pung How School of Law. The views articulated herein belong solely to the original author, and should not be attributed to LawTech.Asia or any other entity.


[1] Personal Data Protection Act (Act 26 of 2012)  

[2] Personal Data Protection Commission, 2018, ‘Guide To Basic Data Anonymisation Techniques’ at para 2.4, available at https://www.pdpc.gov.sg/-/media/Files/PDPC/PDF-Files/Other-Guides/Guide-to-Anonymisation_v1(250118).pdf. (the “PDPC Anonymisation Guide”)

[3] EU General Data Protection Regulation (GDPR): Regulation (EU) 2016/679 of the European Parliament 

[4] Personal Data Protection Commission, 2021, ‘Advisory Guidelines on the PDPA for Selected Topics’ at para 3.3, available at https://www.pdpc.gov.sg/-/media/Files/PDPC/PDF-Files/Advisory-Guidelines/AG-on-Selected-Topics/Advisory-Guidelines-on-the-PDPA-for-Selected-Topics-4-Oct-2021.pdf?la=en. (the “PDPC Selected Topics”)

[5] Massachusetts Institute of Technology, ‘10 Breakthrough Technologies 2020’ in MIT Technology Review, 26 Feb 2020, available at https://www.technologyreview.com/10-breakthrough-technologies/2020/.   

[6] Chaya Nayak, ‘New privacy-protected Facebook data for independent research on social media’s impact on democracy’ in Facebook Research, 13 Feb 2020, available at https://research.fb.com/blog/2020/02/new-privacyprotected-facebook-data-for-independent-research-on-social-medias-impact-on-democracy/.   

[7] Ibid.

[8] Wood, Alexandra, Micah Altman, Kobbi Nissim, and Salil Vadhan. 2020. “Designing Access with Differential Privacy.” In: Cole, Dhaliwal, Sautmann, and Vilhuber (eds), Handbook on Using Administrative Data for Research and Evidence-based Policy (the “Administrative Data Handbook”) at Section 6.3. Accessed at https://admindatahandbook.mit.edu/book/latest/diffpriv.html on 2022-01-31.

[9] The PDPC Selected Topics at para 3.3.

[10] The PDPC Selected Topics at para 3.26.

[11] Art 4(5) of the GDPR.

[12] Tess Blair, Patrick Campbell Jr. and Vincent Cantanzaro, ‘The eData Guide to GDPR: Anonymization and Pseudonymization Under the GDPR’ in JDSupra, 9 Dec 2019, available at https://www.jdsupra.com/legalnews/the-edata-guide-to-gdpr-anonymization-95239/#_ftn4.  

[13] Section 48F of the PDPA. 

[14] PDPC Anonymisation Guide at para 2.2

[15] The “Administrative Data Handbook” at Section 6.3.

[16] Ibid.

[17] Raina, G, and Amritha, J, 2020. ‘Differential Privacy’ in Belfer Center for Science and International Affairs Tech Factsheets for Policymakers p. 1, available at https://www.belfercenter.org/sites/default/files/files/publication/diffprivacy-3.pdf

[18] Mike, H, and Khaled, E, 2017. ‘Comparing the Benefits of Pseudonymization and Anonymization Under the GDPR’ in Privacy Analytics White Paper, p. 7, available at https://iapp.org/media/pdf/resource_center/PA_WP2Anonymous-pseudonymous-comparison.pdf

[19] Personal Data Protection Commission, 2019, ‘Guide To Data Protection by Design For ICT Systems’ at p. 6, available at https://www.pdpc.gov.sg/-/media/Files/PDPC/PDF-Files/Other-Guides/Guide-to-Data-Protection-byDesign-for-ICT-Systems-(310519).pdf.

[20] Personal Data Protection Commission, 2020, ‘Model Artificial Intelligence Governance Framework Second Edition’ at para 3.30, available at https://www.pdpc.gov.sg/-/media/files/pdpc/pdf-files/resource-fororganisation/ai/sgmodelaigovframework2.pdf. (“MAF 2020”)

[21] Ibid.

[22] Matt, K, et al, 2018, ‘Counterfactual Fairness’ in 31st Conference on Neural Information Processing Systems, available at https://arxiv.org/abs/1703.06856.

[23] Ibid.

[24] MAF 2020 at para 2.2.

[25] Herbert Smith Freehills, ‘Mandatory data breach notification has been introduced in Singapore, with more changes to follow’ in HSF Data Notes, 19 Mar 2021, available at https://sites-herbertsmithfreehills.vuturevx.com/208/24757/march-2021/mandatory-data-breach-notification-has-been-introduced-in-singapore–with-more-changes-to-follow.asp?sid=1a8dbc62-248f-4889-a466-71ae79a9a409.   

[26] Sections 4 and 5 of the Personal Data Protection (Notification of Data Breaches) Regulations 2021

[27] Ibid.

[28] The Administrative Data Handbook at Section 6.4.

[29] Garfinkel, Simson L. 2017. “Modernizing Disclosure Avoidance: Report on the 2020 Disclosure Avoidance Subsystem as Implemented for the 2018 End-to-End Test (Continued),” September. https://www2.census.gov/cac/sac/meetings/2017-09/garfinkel-modernizing-disclosure-avoidance.pdf.; Chetty, Raj, John N. Friedman, Nathaniel Hendren, Maggie R. Jones, and Sonya R. Porter. 2018. “The Opportunity Atlas: Mapping the Childhood Roots of Social Mobility.” Working Paper 25147. National Bureau of Economic Research. https://doi.org/10.3386/w25147.

[30] Zeyu, D, et al, 2019, ‘Detecting Violations of Differential Privacy’ in 2018 ACM SIGSAC Conference on Computer and Communications Security (CCS ’18), available at https://arxiv.org/pdf/1805.10277.pdf.

[31] Joseph Near and David Darais, ‘Differential Privacy Bugs and Why They’re Hard to Find’ in NIST Cybersecurity Insights, 25 May 2021, available at https://www.nist.gov/blogs/cybersecurity-insights/differential-privacy-bugs-and-why-theyre-hard-find

[32] Ibid.

[33] Opening Speech by Mr S Iswaran, Minister for Communications and Information, at the Second Reading of the Personal Data Protection (Amendment) Bill 2020 on 2 November 2020.