Type: Research Essays
Sample donated: Deborah Jackson
Last updated: December 22, 2019
There is always a tension between privacy and the utility of the data. If data are generally useful, one of their possible uses is the identification of records. Researchers need to decide which weighs more: the threats or the opportunities.

Inferred data, however, are not considered personal data, so the resulting information is not covered by the protection of the GDPR.

3.1 Data minimization and purpose limitation

Researchers need to take responsibility during data collection. Data subjects need to be aware that their personal data will be used and for what purposes.
Researchers should only collect the personal data that they actually need to achieve their research purposes. These purposes should be clearly stated in the project description, and the data cannot be used for other, incompatible purposes. Data should be processed proportionally to the need. Not retaining data for longer than necessary in relation to the purposes for which they were collected, or for which they are further processed, is key to ensuring fair processing.
According to the GDPR, personal data may be stored for longer periods for archiving purposes in the public interest, or for scientific, historical, or statistical purposes, in accordance with Art. 89(1) and subject to the implementation of appropriate safeguards.

3.2 k-anonymity

If collected data need to be released, the data holder has to ensure the k-anonymity of the dataset. This means that for any combination of quasi-identifier values present in the set, at least k respondents with that combination can be found. The value of k needs to be kept at 30 or higher [6]. If the condition is not satisfied, different mechanisms need to be used. Among the most popular techniques, generalization and suppression need to be mentioned.
The advantage of these techniques in comparison to others, such as scrambling or swapping, is that the anonymized data are not destroyed and still contain useful, true information.

In particular, for some attributes data holders may want to consider, instead of releasing the values themselves, merging them into broader categories and releasing data in which the values of those attributes are generalized (generalization) [4]. If a limited number of outliers forces too high a degree of generalization, the suppression technique can be useful. In this method, certain attribute values are replaced by an asterisk or an NA value [5].
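The k-anonymity condition, generalization and suppression described above can be illustrated with a minimal sketch. The toy records, the attribute names (age, zip, diagnosis), the 10-year age bands and the value k = 2 are assumptions made only for illustration; a real release would use the threshold recommended in the text.

```python
from collections import Counter

def generalize_age(age, width=10):
    """Generalization: replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination that occurs is shared by at least k records."""
    return all(n >= k for n in class_sizes(records, quasi_identifiers).values())

def suppress_small_classes(records, quasi_identifiers, k):
    """Suppression: replace the quasi-identifiers of outlier rows by '*'.

    The suppressed rows form a class of their own, so the result has to be
    checked again; if fewer than k rows were suppressed, they should be dropped.
    """
    sizes = class_sizes(records, quasi_identifiers)
    released = []
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        if sizes[key] < k:
            r = {**r, **{q: "*" for q in quasi_identifiers}}
        released.append(r)
    return released

# Hypothetical toy data, not taken from any real study.
raw = [
    {"age": 31, "zip": "1011", "diagnosis": "flu"},
    {"age": 34, "zip": "1011", "diagnosis": "asthma"},
    {"age": 38, "zip": "1011", "diagnosis": "flu"},
    {"age": 52, "zip": "1012", "diagnosis": "diabetes"},
    {"age": 57, "zip": "1012", "diagnosis": "flu"},
    {"age": 45, "zip": "1013", "diagnosis": "asthma"},
    {"age": 71, "zip": "1014", "diagnosis": "flu"},
]
QUASI_IDENTIFIERS = ("age", "zip")
K = 2  # toy value for the example only

generalized = [{**r, "age": generalize_age(r["age"])} for r in raw]
released = suppress_small_classes(generalized, QUASI_IDENTIFIERS, K)

print(is_k_anonymous(raw, QUASI_IDENTIFIERS, K))
print(is_k_anonymous(generalized, QUASI_IDENTIFIERS, K))
print(is_k_anonymous(released, QUASI_IDENTIFIERS, K))
```

In this toy table generalization alone leaves two outlier rows in classes of size one, which is the situation in which suppression becomes useful.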
Special attention needs to be paid if the same databases are released at different times. Previously released attributes can be used as quasi-identifiers. New releases must take into account that any previously released information has joined the external information.

3.3 l-diversity and t-closeness

Providing k-anonymity, however, is not enough to protect sensitive information. For databases with little diversity in the sensitive attributes, it can happen that k rows with the same quasi-identifiers have the same or a similar attribute of interest. In this case, it is not necessary to identify the exact row of interest, because the value in the column of interest becomes known (homogeneity attack).
Another problem is that it can be difficult to predict the information available to an adversary, and different adversaries may have different information. Therefore it may happen that k-anonymity needs to be provided for more features, which is not always feasible.

Therefore, if on top of k-anonymity the data holder ensures that each equivalence class (a set of at least k data points with the same quasi-identifiers) contains at least l well-represented values of the sensitive attribute, it is possible to rely on stronger privacy guarantees (l-diversity) [12]. The notion of t-closeness offers even more reliable anonymisation and includes the requirement that the distribution of a sensitive attribute in any equivalence class is close to the distribution of the attribute in the overall data set (i.e., the distance between the two distributions should be no more than a threshold t) [13]. This can be achieved by removing outliers and smoothing the distributions in the equivalence classes.

The drawback of the k-anonymity approach is that the identification of quasi-identifiers can be a nontrivial task. Moreover, the adversary acquires more and more information over time, and new features may become quasi-identifiers. Enforcing l-diversity and t-closeness may lead to a serious loss of data utility. This means that researchers need to find the balance between the amount of distortion introduced into the database to increase privacy protection and the amount of information that can be retrieved from the database.
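The two conditions of this section can be checked with a short sketch. It makes several assumptions that are not in the text: l-diversity is taken in its simplest "distinct values" form, the distance between distributions is the total variation distance (which is what the Earth Mover's Distance reduces to for a categorical attribute with equal ground distances), and the table, attribute names and thresholds are hypothetical.

```python
from collections import Counter, defaultdict

def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[q] for q in quasi_identifiers)].append(r)
    return classes

def distribution(values):
    """Relative frequency of each value."""
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}

def total_variation(p, q):
    """Half of the L1 distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in keys)

def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """Distinct l-diversity: each class holds at least l different sensitive values."""
    classes = equivalence_classes(records, quasi_identifiers)
    return all(len({r[sensitive] for r in rows}) >= l for rows in classes.values())

def max_class_distance(records, quasi_identifiers, sensitive):
    """Largest distance between a class distribution and the overall distribution."""
    overall = distribution([r[sensitive] for r in records])
    classes = equivalence_classes(records, quasi_identifiers)
    return max(
        total_variation(distribution([r[sensitive] for r in rows]), overall)
        for rows in classes.values()
    )

# Hypothetical, already generalized table: 3-anonymous, but one class is homogeneous.
table = [
    {"age": "30-39", "zip": "1011", "diagnosis": "flu"},
    {"age": "30-39", "zip": "1011", "diagnosis": "flu"},
    {"age": "30-39", "zip": "1011", "diagnosis": "flu"},
    {"age": "50-59", "zip": "1012", "diagnosis": "flu"},
    {"age": "50-59", "zip": "1012", "diagnosis": "asthma"},
    {"age": "50-59", "zip": "1012", "diagnosis": "diabetes"},
]
QI = ("age", "zip")
T = 0.2  # illustrative threshold t

print(is_l_diverse(table, QI, "diagnosis", l=2))
print(max_class_distance(table, QI, "diagnosis") <= T)
```

The first class is open to the homogeneity attack: anyone known to be in it has the same diagnosis, so the table fails the l-diversity check even though it is k-anonymous.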
3.4 Differential privacy

Published data are still vulnerable to de-anonymization attacks even if privacy-preserving algorithms are applied. Therefore researchers may want to consider a privacy-preserving data mining approach to sharing databases with the community. The differential privacy criterion can then be used to provide proper privacy protection. This criterion ensures that the removal or addition of a single database item does not (substantially) affect the outcome of any analysis. Therefore the probability of producing a given model from a training dataset that includes a particular record is close to the probability of producing the same model when this record is not included.
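As an illustration of the criterion, the sketch below adds Laplace noise to a counting query, which is one common way to obtain differential privacy. The privacy parameter epsilon, the function names and the toy records are assumptions introduced here, not part of the original text. A count changes by at most 1 when a single record is added or removed, so noise with scale 1/epsilon is used.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Noisy count of the records that satisfy `predicate`.

    Adding or removing one record changes the true count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical toy data and query.
records = [{"age": a} for a in (23, 35, 41, 58, 62, 67)]

def over_60(record):
    return record["age"] > 60

rng = random.Random(0)  # fixed seed only to make the example repeatable
print(private_count(records, over_60, epsilon=0.5, rng=rng))
```

A smaller epsilon gives a larger noise scale and therefore stronger privacy but a less accurate answer.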
Several approaches have been proposed to achieve differential privacy: geometric, Laplacian, randomized response [14], differential PCA and differential LDA [15], or a combination of different approaches.

However, the trade-off between functionality and data privacy needs to be kept. The data still need to be useful: one cannot introduce so much noise into the database that it can no longer be used for statistical or machine learning purposes.

3.5 Proper machine learning models

Research institutes need to take responsibility for the proper design of their models to avoid the possibility of inferring information about the people whose data were used to build the models.

One of the mitigation strategies goes along with correct model design itself and corresponds to the generalization property of the model.
The notion of overfitting, a very important model characteristic, refers to the difference in prediction accuracy between the training and test data sets. If the prediction for the training set is much better than for the test set, the model has generalization problems and overfits. Models with a high degree of overfitting can easily leak membership information. The usual techniques to improve generalization and avoid overfitting need to be used to avoid membership inference, for example dropout, regularization, early stopping, and pruning.
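The difference in accuracy described above can be computed directly and monitored as a rough warning sign. The sketch below assumes that predictions and true labels are available as plain lists; the toy values and the warning threshold MAX_GAP are hypothetical and would have to be chosen per project.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and of equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def generalization_gap(train_pred, train_labels, test_pred, test_labels):
    """Training accuracy minus test accuracy; a large gap indicates overfitting."""
    return accuracy(train_pred, train_labels) - accuracy(test_pred, test_labels)

# Hypothetical outputs of some classifier on its training and test sets.
train_labels = [0, 1, 1, 0, 1]
train_pred = [0, 1, 1, 0, 1]   # all correct on the training set
test_labels = [1, 0, 1, 1, 0]
test_pred = [1, 0, 0, 1, 1]    # three of five correct on the test set

MAX_GAP = 0.1  # illustrative threshold, not a recommended value

gap = generalization_gap(train_pred, train_labels, test_pred, test_labels)
if gap > MAX_GAP:
    print("large train/test gap: the model may leak membership information")
```

A small gap does not prove that a model is safe against membership inference; it only removes one of the known causes of leakage.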
Overfitting, however, is not the only reason why machine learning models leak information about their training datasets. A model predicting many classes is at high risk of revealing information about the members of its training set. This is because the more classes there are in the data, the fewer members each class has. Therefore, if the model correctly predicts a data point as a member of a minor class, the probability that this point was in the training set increases. To avoid this problem, researchers may think of designing the model to predict only a small set of highly populated classes.

3.6 Regular check for possible motivated intrusion

A useful question that the data holder might want to answer on a regular basis is whether a “motivated intruder” would be able to successfully re-identify data records.
The “motivated intruder” has legitimate access to public data sources such as the internet, libraries and documents, but cannot hack the system and gain access to data that are kept securely [6]. The intruder's goals could be curiosity, political purposes, revealing information about a public figure, and so on. Such a test may include the following steps: web searches and checks of electoral registers, church records and local libraries to discover the available quasi-identifiers; news mining to see whether it is possible to link names with database records; and social network mining to check the possibility of connecting users’ profiles with anonymised data.

It is reasonable, however, to rely on the ethical conduct of professionals, for example doctors, and to exclude them from the “motivated intruders” group.

3.7 Privacy engineering

Since the REDCap application can be used for data collection and research in many different ways, here I just list several points that researchers need to keep in mind when they plan a project.

3.8 Check list for planning a new data processing activity

- Define the research goals and objectives and the data needed to fulfill these objectives.
- Document the consent.
- Inform prospective research participants about the research.
- Limit the collection of data.
- Safeguard personal data.
- Set reasonable time limits on keeping personal data.
- Ensure the transparency and accountability of personal data.
- Ensure control and disclosure of personal data.
A GDPR check list is publicly available.

Ethical guidelines are based on the 1979 Belmont Report for biomedical and behavioural sciences. The fundamental principle when handling data about individuals is informed consent.