Oiwan Lam | Global Voices
It could be the biggest data leak in history
On June 30, 2022, an anonymous hacker using the name “ChinaDan” listed a 23.88-terabyte (TB) dataset allegedly containing one billion Chinese residents’ personal data for sale for 10 bitcoins, approximately USD 200,000. The hacker posted the advertisement on Breach Forums, an online hacking community, claiming to have procured the data from the Shanghai National Police server.
While the Chinese authorities have neither confirmed nor denied the data leak, some reporters, including Rachel Cheung from VICE World News and Karen Hao from the Wall Street Journal, reached out to individuals listed in a sample of the dataset to verify the data. All of those who answered the phone calls confirmed that the details in the dataset were correct.
Karen Hao 郝珂灵
A hacker is selling an alleged 1 billion Chinese citizens’ information stolen from Shanghai police. @rachelliang5602 & I downloaded the sample the hacker provided and called dozens of people listed. Nine picked up & confirmed exactly what the data said.
According to the Wall Street Journal, the data leak might have come from a cloud server run by Aliyun, the cloud computing subsidiary of Alibaba Group. The company is currently investigating the incident.
ChinaDan claimed that the dataset contained one billion Chinese residents’ personal data and several billion crime and police case records, including names, addresses, birthplace and birthdates, national ID numbers, mobile numbers, and more.
The data seller also released a sample of 750,000 data items so that potential buyers could validate the data. If the 23.88 TB dataset is genuine, it would be the biggest data leak in history.
According to the data tables posted on Breach Forums, the dataset comes from seven data indexes. The 750,000 sample data items are drawn from three main indexes: 250,000 items of individual data, 250,000 police investigation records, and 250,000 records from commercial and public administration platforms.
As the news spread on social media, some Chinese internet users started testing the released sample. However, their discussions and findings were quickly censored on social media.
Content index of the dataset
One deleted analysis from Zhihu, a mainland Chinese question-and-answer website, listed the contents of the sample dataset. It found that the personal data files include an individual’s ID card number, name, address, place of birth, educational background, marital status, military service background, height and weight, occupation, religious belief, political orientation, criminal record (such as types of crimes), and photo files such as ID card, resident card, driver’s license, and passport. Some records even contained information about individuals’ movements, such as immigration checkpoints, hotel check-ins, internet cafe log-ins, detention centers, detention facilities, and more.
The criminal investigation data file includes information about the crime location, time, type of crime, investigation status, case details, personal data of the case reporter such as ID card number, name, and mobile number, and case handler (police station and work ID number).
Data sourced from commercial platforms included users’ shopping orders, delivery orders, payments, food orders, tickets, and travel records. Data sourced from public administration platforms included technology forensics, economic investigation records, immigration records, entertainment venue information, traffic control directives, and details of the individuals classified as the seven targeted groups (7類重點人員) (which commonly refers to potential terrorists, activists, criminals, drug dealers, fugitives, mentally disordered individuals and petitioners). It also included information about cyber investigations, pawn shops, detention centers and facilities, addiction treatment centers, property owner records, resident registration records, household registration records, regular population, actual population, medical records, fuel usage records, and more.
A further look into the data: population and stability control
Some users have already begun conducting initial research on the sample to draw conclusions about population demographics. Yi Fuxian, a demographer at the University of Wisconsin-Madison, was appalled by the 250,000 items of demographic data, as they came from almost every county in China, including those with a population under 10,000. After digging into the age distribution of the population data sample, as well as statistics on the use of the Bacillus Calmette-Guérin vaccine (for tuberculosis), which is compulsory for newborn babies, Yi concluded that the population crisis in China has been underestimated, as even fewer babies are being born in China than the official 2020 census suggested.
The sample dataset also reflects the scale of stability control in China. Tech blogger William Long, for example, found that among the 250,000 individual records in the sample, 166 people appear on the seven targeted groups list. Scaled up to a dataset of 1 billion residents, this implies that more than 660,000 individuals might be listed as stability control targets in China. The blogger also noted that, among the 250,000 crime cases in the sample, two are linked to speech-related crimes on Twitter. If the full set of case records numbers 1 billion, there could be up to 8,000 Twitter-related cases.
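The extrapolation behind these figures is a simple proportional scaling. The short Python sketch below reproduces the arithmetic; the function name `extrapolate` is hypothetical, and the calculation assumes the released sample is representative of the full dataset and that the full dataset really holds 1 billion records, neither of which has been verified.

```python
def extrapolate(sample_hits: int, sample_size: int, total_size: int) -> int:
    """Scale a count seen in a sample up to the claimed full dataset.

    Assumes the sample is representative of the whole, which is a
    strong and unverified assumption for a seller-provided sample.
    """
    return sample_hits * total_size // sample_size


SAMPLE_SIZE = 250_000          # records per index in the released sample
CLAIMED_TOTAL = 1_000_000_000  # size claimed by the seller, unverified

# 166 people in the sample appear on the seven targeted groups list
stability_targets = extrapolate(166, SAMPLE_SIZE, CLAIMED_TOTAL)

# 2 crime cases in the sample are linked to speech on Twitter
twitter_cases = extrapolate(2, SAMPLE_SIZE, CLAIMED_TOTAL)

print(stability_targets)  # 664000
print(twitter_cases)      # 8000
```

Each sample record stands in for 4,000 records in the claimed total, so small counts such as the two Twitter cases carry a wide margin of error.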
Implications: individual security and political crisis
Breaches of private data have long been a serious issue in China, where personal data has previously been sold in secret chat rooms. For example, in 2020, authorities in Zhuhai city arrested a man who was in possession of 120 gigabytes of data files containing 1 billion pieces of citizens’ personal information. And in 2021, police in Jiangsu province cracked down on a 200,000-member chat room that served as a marketplace for digging up specific individuals’ backgrounds. The administrator of the chat room was also in possession of more than 1 billion pieces of personal information.
However, if verified, the current data breach, in terms of scale, would be like a nuclear bomb compared with previous, relatively micro-level leaks.
Citizens online are already reckoning with its impact. Some on Weibo are worried about a surge in scams and blackmail, as the data includes personal information and criminal records. Among overseas Chinese dissidents on Twitter, many believe that the data will undermine the legitimacy of the Chinese Communist Party, as further data mining could expose the party’s policy failures. One of the most insightful remarks comes from prominent Chinese software architect Isaac Mao:
…the data breach is a fresh new case of a dictator’s dilemma: the more you concentrate, the more you lose control.