Privacy protection in the era of big data


This article introduces the efforts of academia and industry to protect users’ privacy. It focuses on k-anonymity, l-diversity, t-closeness, and ε-differential privacy, and analyzes their strengths and weaknesses.

Data vs privacy

In the era of big data, data has become the cornerstone of scientific research. While we enjoy the convenience of intelligent algorithms such as recommendation, speech recognition, image recognition, and autonomous driving, it is data that drives those algorithms to iterate and improve. Scientific research, product development, and data publication all require collecting and using user data, and that data is inevitably exposed in the process. There have been many cases in which publicly released data exposed users’ privacy.

AOL is a US Internet service company and one of the largest Internet providers in the United States. In August 2006, AOL published anonymized search logs for academic research, covering 650,000 users and about 20 million queries. In this data, each user’s name was replaced with an anonymous ID, but the New York Times used the search records to identify the real person behind ID 4417749. That user’s search history included queries about “60-year-olds”, places in “Lilburn”, and people with the surname “Arnold”. From these clues, the New York Times found that only 14 people in Lilburn were named Arnold, contacted them directly, and confirmed that ID 4417749 was a 62-year-old grandmother named Thelma Arnold. AOL hastily took the data down and issued an apology, but it was too late: AOL was sued for the privacy breach and ultimately paid the affected users a total of $5 million in compensation.

Also in 2006, Netflix, one of the largest film and television companies in the United States, launched the Netflix Prize, a competition that asked participants to predict users’ movie ratings from a published dataset. Netflix removed the information that uniquely identified each user and believed this would protect users’ privacy. But in 2007, two researchers from The University of Texas at Austin showed that they could identify anonymous users by linking the data published by Netflix with records published on the IMDb (Internet Movie Database) website. Three years later, in 2010, Netflix finally announced that it was halting the competition for privacy reasons; it was also heavily penalized, paying a total of $9 million in compensation.
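To make the idea of linking two datasets concrete, here is a minimal sketch of a linkage attack in Python. It is not the researchers' actual method, which the text above does not describe. It simply counts exact (movie, rating) matches between an "anonymized" release and a public review list. All records, names, the `link` function, and the `min_matches` threshold are hypothetical and invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Record = Tuple[str, str, int]  # (user label, movie, rating)

# Hypothetical "anonymized" release: real names replaced by pseudonyms.
released: List[Record] = [
    ("u1", "Movie A", 5),
    ("u1", "Movie B", 2),
    ("u1", "Movie C", 4),
    ("u2", "Movie A", 3),
    ("u2", "Movie D", 4),
]

# Hypothetical public reviews posted under real names.
public: List[Record] = [
    ("Alice", "Movie A", 5),
    ("Alice", "Movie C", 4),
    ("Bob", "Movie D", 1),
]


def profiles(records: List[Record]) -> Dict[str, Set[Tuple[str, int]]]:
    """Group records into one set of (movie, rating) pairs per user label."""
    grouped: Dict[str, Set[Tuple[str, int]]] = defaultdict(set)
    for user, movie, rating in records:
        grouped[user].add((movie, rating))
    return dict(grouped)


def link(released: List[Record], public: List[Record],
         min_matches: int = 2) -> Dict[str, str]:
    """Map each pseudonym to the public name sharing the most ratings.

    A pseudonym is reported only if the best candidate shares at least
    `min_matches` identical (movie, rating) pairs (an assumed threshold).
    """
    anon = profiles(released)
    known = profiles(public)
    matches: Dict[str, str] = {}
    for pseudonym, ratings in anon.items():
        best_name, best_overlap = None, 0
        for name, known_ratings in known.items():
            overlap = len(ratings & known_ratings)
            if overlap > best_overlap:
                best_name, best_overlap = name, overlap
        if best_name is not None and best_overlap >= min_matches:
            matches[pseudonym] = best_name
    return matches


if __name__ == "__main__":
    print(link(released, public))
```

In this toy data, pseudonym `u1` shares two ratings with Alice and is re-identified, while `u2` shares none with anyone and stays anonymous. Removing the name column alone does not help when the remaining attributes also appear in another public source.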

In recent years, major companies have continued to pay attention to user privacy. For example, at the WWDC conference in June 2016, Apple announced that it was adopting a technique called Differential Privacy. Apple claims that it can use the data to compute the behavior patterns of its user population as a whole without being able to obtain the data of any individual user. So how does differential privacy work?

In the era of big data, how can we guarantee our privacy? To answer this question, we must first know what privacy is.
