Editor's note: The IAPP is policy neutral. We publish contributed opinion and analysis pieces to enable our members to hear a broad spectrum of views in our domains.
When a civic-tech research collective set out to build a multilingual artificial intelligence model capable of answering questions drawn from debates in India's lower house of Parliament, the Lok Sabha, and the historic Constituent Assembly debates of 1946–50, which framed the Indian Constitution, it quickly ran into a challenge.
To make the system inclusive across Indian languages, the team would have to draw on diverse, publicly available text sources like parliamentary transcripts, official statements, archival broadcasts, and recorded lectures in regional languages.
None of what they intended to do invaded private inboxes or evaded paywalls. All of the information can be found by anyone on the open web. Yet, under India's privacy laws, even the reuse of such information can raise compliance questions. The effort to make knowledge more accessible exposes a puzzle at the heart of India's privacy law — a puzzle with a missing piece.
The puzzle
India's Digital Personal Data Protection Act sets out clear obligations for processing personal data. But when it comes to information that is already public, the statute and government's responses send mixed signals.
Section 3(c)(ii) states the DPDPA does not apply to "personal data that is made or caused to be made publicly available" by the data principal, to whom the data relates, or by "any other person who is under an obligation under any law for the time being in force in India to make such personal data publicly available." This would mean transcripts, debates and discussions from the Lok Sabha, court records, and even self-published social media posts could fall outside the act's scope.
However, in parliamentary responses, the government has insisted that organizations scraping or processing publicly available data must still comply with consent and other obligations under the DPDPA. In August 2024, the Minister of State for Electronics and Information Technology, Shri Jitin Prasada, told the upper house of Parliament, the Rajya Sabha, that the scraping of public user data is covered by the Information Technology Act, the Information Technology Rules and the DPDPA, which together require intermediaries to obtain consent, ensure transparency and respect individual rights.
The contradiction is plain: Section 3(c)(ii) suggests such data may be exempt, yet official interpretation insists on compliance. The statute and the government's stance simply do not align.
Where the minister's answer falls short
Prasada's reply cited three pillars: penalties for unauthorized access and possible criminal liability under Sections 43 and 66 of the IT Act; obligations for intermediaries to prevent illegal content and implement safeguards under the IT Rules; and DPDPA obligations to obtain consent, lawfully process data and comply with oversight by the Data Protection Board of India (DPBI), which can impose penalties of up to 250 crore rupees.
While these references sound comprehensive, they leave a few gaps. To prevent misuse of data scraping, reliance on Section 43 of the IT Act is misplaced: the section penalizes unauthorized access, but scraping openly visible sites typically involves no unauthorized access at all.
When talking about transparency and informed consent, the DPDPA and Rule 3 of the Draft Digital Personal Data Protection Rules require clear, itemized notices before processing personal data. But how can a data scraper realistically provide such notice when collecting data at scale? Prasada's answer was silent on this operational dilemma.
Further, ethical AI risks — like bias, misuse and context collapse — extend beyond legality and cannot be addressed by the DPBI set up under the DPDPA.
Why public data matters for AI
Public data is often the fuel of innovation; AI models, in particular, require large quantities of data. Without low-cost access to such datasets, innovation risks becoming the preserve of large firms with the resources to license proprietary data.
A literal consent requirement for every piece of public data is unworkable. But equally, treating all public data as a free-for-all undermines the privacy principles — purpose limitation, minimization and fairness — that the DPDPA, like other privacy legislation, is enacted to protect.
Guardrails, not blank checks
So where does this leave policymakers? One option is to ban the use of publicly available personal data for AI development, eliminating privacy risk; the other is to exempt it altogether in the name of innovation. Neither extreme is sustainable. A balanced approach would build guardrails within the DPDPA and its rules to enable responsible, transparent use of public data.
The law should clearly define what counts as publicly available under Section 3(c)(ii) and prescribe restrictions and use requirements. Through Section 17(5), startups and researchers should receive time-bound relief from certain obligations, provided they adhere to baseline conditions, such as reasonable security safeguards and purpose limitations. This is not a sandbox in the regulatory sense, but a way to ease compliance burdens while the law matures.
Those using public data should be required to disclose the categories of data used in AI training, especially when individual notice is impractical. They should also be encouraged to develop voluntary standards for data scraping, history tracking and fairness in AI.
The policy opportunity
The contradictions between Section 3(c)(ii)'s exemption, the government's insistence on consent, and the overlay of the DPDPA with the IT Act and Rules create more confusion than clarity. If left unresolved, these gaps risk pushing innovation into legal gray zones.
India does not need to mimic the deregulatory approach of the U.S. or the tight controls employed in China. Instead, the country can shape a rights-based, innovation-friendly middle path: permit public data scraping in defined contexts; back it with accountability measures; and use exemptions judiciously to avoid stifling startups.
The multilingual-model project captures the dilemma perfectly. Its data comes from democratic archives meant to inform citizens, yet its creators face uncertainty about whether their work constitutes lawful innovation or potential non-compliance.
Ultimately, the question is not whether public data should be used, but under what conditions its use is permissible. In the end, public does not mean permissionless.
Ayush Sahay is an associate at Arnava Legal.
