Published: September 25th, 2024
Last updated: May 28th, 2025
China has a large amount of data, but it lags behind the US in terms of data quality, diversity, and availability for AI training. Opening up more datasets in both the public and private sectors could help ease the crunch.
In addition to high-performance AI chips, high-quality data in the Chinese language is another critical resource that Chinese AI companies are struggling to obtain. The lack of access to high-quality training data has started to strain the development of LLMs in the Chinese language.
There is generally less data available in Chinese than in English. Chinese accounts for only 5.2 percent of the data in Common Crawl, a widely used open repository of web data for AI training, while English makes up 43.2 percent.
More importantly, in terms of quality, the value systems reflected in the data clearly lack diversity and depth, according to a recent Alibaba white paper on LLM training data. Party ideology prevails in the press and the public sphere, which narrows the spectrum of training data available for LLMs.
Despite recent efforts to release datasets, administrative and public data remain largely closed to AI training under China’s strict data security regime. Earlier this year, the authorities limited access to court documents to personnel within the judicial system. Similarly, access to public health data remains highly restricted.
Companies are also reluctant to share their data because of concerns about business interests and violations of intellectual property rights (IPR). While big tech companies like Tencent and ByteDance can rely on data generated by their own social media networks for AI training, small AI startups face greater difficulties in accessing data.
To cope with the data shortage, businesses are calling on the government to open up more high-quality data resources, such as science and research data, for training AI models. Big tech companies have started to address the issue in various ways. ByteDance is reportedly creating training data by paying people to have conversations guided by pre-set prompts. Baidu has set up data-processing bases in small cities with lower labor costs. Alibaba is experimenting with synthetic data as training material, that is, training models on content that models themselves generate.
Sinolytics is a research-based business consultancy entirely focused on China. It advises European companies on their strategic orientation and specific business activities in the People’s Republic.