One of the defining characteristics of open data is that it is free to use and re-use. Legal claims of copyright or the sui generis database right over datasets make re-use difficult. Open data licenses allow dataset creators to authorise re-use of their datasets in advance. This helps contribute more open data to the ecosystem.
There are many types of open data licenses today. Some are created by government bodies and applied to public sector information. An example is France’s Licence Ouverte, created by Etalab (the French department that manages France’s national open data portal). Some open data licenses are created by non-profit or advocacy organisations. The Community Data License Agreement managed by the Linux Foundation is one example. The Open Data Commons licenses managed by the Open Knowledge Foundation are another. And then there are licenses that were originally devised for creative content but can be applied to datasets as well. The Creative Commons licenses (with the exception of the non-commercial CC-NC variants) are an example.
The history of open data licenses is intimately connected to the open source and open science movements. Advocates of Free and Open Source Software (FOSS) such as Richard Stallman championed four freedoms that open source licenses sought to protect: the freedom to run a software program for any purpose, the freedom to study the program (by being provided access to the source code), the freedom to redistribute copies, and the freedom to distribute copies of modified versions. These ‘freedoms’ carried over to open data licenses as well.
From this perspective, one point of tension in open data licenses is the concept of use restrictions. For example, a CC BY-NC license does not allow re-use of the licensed material for commercial purposes.
With the increasing re-use of copyrighted material as training data for Large Language Models, a new licensing framework known as ‘Responsible AI Licensing’ (RAIL) has emerged. RAIL licenses impose certain ethical use restrictions on datasets, software and models in the AI context. These restrictions target harmful uses of AI models, such as generating personal data, harming minors, engaging in fully automated decision-making that has adverse effects on an individual’s legal rights, or exploiting the vulnerability of a particular group of people.
Most RAIL licenses were developed for software and AI models. The AI2 ImpACT licenses developed by the Allen Institute for AI extend this ethical licensing framework to training datasets as well. In fact, the Allen Institute released a training dataset known as Dolma, containing 3 trillion tokens of mostly web data, under an AI2 ImpACT license in 2023 (but changed the license to ODC-BY in 2024).
Strict adherents of the open data movement would argue that use restrictions detract from the very essence of openness, as they limit particular types of re-use. On the other hand, certain uses of open data can have harmful effects on individuals and communities.
As artists around the world argue in legal claims against GenAI, creative content released under an open license can be used to create a GenAI model that produces output very similar to the artists’ own works. This raises economic as well as ethical challenges for the artists. It has led some proponents of open data and open source to rally around Responsible AI Licenses, which contain some ethical use restrictions. So where do we draw the line? Should open data licenses be revamped to include some kinds of use restrictions? Or is this against the fundamental idea of openness?
Some suggestions for future research and reflection on these questions include:
- Critically interrogating the history and objectives of the open data movement
There is growing scholarship on how transparency was understood by the open data movement in its early days as part of open government initiatives, and how this has changed over time. This scholarship also engages with liberal and neoliberal conceptions of transparency. Engaging with this historical literature can help us situate open data licenses within the specific context in which they were created, and then evaluate whether this context has changed and whether the licenses therefore need to be revamped.
Open data can serve as a new source of input for AI applications. At the same time, technologies like GenAI can aid in making open datasets more discoverable and in improving their quality. But these technologies can also be used for unethical purposes. Understanding these use cases, and familiarizing ourselves with the ethical concerns involved, can also aid in resolving the question of use restrictions in open data licenses.