Copyright and innovation in the life sciences (Publishers, licensing & innovation)
Presentation at a September 2020 panel organized by the US Patent & Trademark Office and the US Department of Justice (Antitrust Division)
We live at the confluence of content/data and technology. This can be seen in the power of supercomputing to analyze and categorize billions of data points (as in mapping the human genome), or in the ability of new AI applications to surface relevant and unexpected analytical insights from disparate content. But some long-term constants remain: informational content, particularly scientific research content, is most valuable when it is organized, standardized, updated, and indexed.
Scholarly communication is largely conducted through scholarly journals, and the journal article has become a well-organized vehicle for conveying research information. Articles follow an almost universal structure: an abstract; a description of the research methods employed; the discussion or paper itself (results and conclusions), including charts, graphs, and other data; and, of course, the reference list. Publishers evolved this structure, and although some authors chafe at its confines, researchers themselves highly value this organization of information, as it improves their efficiency in reviewing the large number of articles that might be relevant to their projects.
Publishers have in recent decades moved this content online, "retro-digitizing" earlier journal issues and incorporating such online innovations as reference linking (CrossRef) and standards for terminology, representations of chemical structures, and the display of formulas. Researchers themselves helped launch many of these innovations, but publishers made them consistent and universal, to the benefit of society (see Kent Anderson's well-known Scholarly Kitchen post on the things that journal publishers do). Authors contribute articles to journals on a royalty-free basis (unlike in book publishing), as part of their general work at universities, research institutions, or research-intensive industries such as the life sciences. Even so, the costs are considerable: supporting these innovations; managing the large volume of publications (some three million articles published per year in the international literature, more if humanities publications are more fully represented); running submission processes that handle many millions more papers (data from the STM Report); and maintaining the archives and platforms where such content is accessed.
[Image omitted. Picture courtesy of artlibrarycrawl.com.]
Copyright is fundamental to the business of journal publishing, as the vast majority of articles are still published under a subscription model. Although author-pays (or funder/institution-pays) Gold Open Access is growing significantly (a recent Plan S position paper suggests as much as 20%), the economy supporting journal publishing will likely remain a mixed one for some time. In terms of supportive US government action, in my view the most useful step for the scholarly communication system would be to ensure that research funding also covers publication costs, as is true in many European countries. This would enable a sustainable Gold OA future for government-funded research.
We hear that data is the new currency, and life sciences innovation and the urgency of Covid-19 research certainly demand further work to enable computational research on published articles ("text and data mining", or TDM) and on research data itself: the raw research results before they are analyzed, reviewed, and condensed to fit into a journal article. Patents are also sources for data mining. Publishers have established tools for TDM through the STM association, notably the 2013 Declaration supporting non-commercial TDM, signed by more than 20 publishers (including all the major publishers), and by offering collective licensing options for TDM applications through CrossRef and the Copyright Clearance Center. These programs offer "normalization" methodologies that provide a more consistent database against which to apply computational queries. EU law now permits non-commercial TDM as a copyright exception in any event, although there are more limits with respect to commercial activities. Publishers supported the initiative organized by the STM association to open Covid-19 content for use by researchers, and as of this summer we have seen as many as 150 million downloads of those articles.
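As an illustrative sketch only (the actual normalization pipelines used by publishers, CrossRef, and the Copyright Clearance Center are not described here), "normalization" for TDM generally means reducing article text to a consistent form, stripping residual markup, case-folding, and unifying variant terminology, so that computational queries behave consistently across differently edited sources. The synonym map below is hypothetical:

```python
import re

# Hypothetical synonym map: fold variant terms to one canonical form
# so a single query matches consistently across publishers.
SYNONYMS = {
    "sars-cov-2": "covid-19 virus",
    "2019-ncov": "covid-19 virus",
}

def normalize(text: str) -> str:
    """Reduce article text to a consistent, query-friendly form."""
    text = re.sub(r"<[^>]+>", " ", text)      # strip residual XML/HTML markup
    text = text.lower()                       # case-fold
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    for variant, canonical in SYNONYMS.items():
        text = text.replace(variant, canonical)
    return text

print(normalize("<p>Studies of  SARS-CoV-2\nand 2019-nCoV.</p>"))
```

Real pipelines are far richer (entity recognition, chemical-structure standardization, and so on), but the principle is the same: the more uniform the corpus, the more reliable the mining.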
Publishers particularly active in the life sciences, such as Wolters Kluwer and my former employer Elsevier, are also using these analytic and TDM technologies to support drug discovery and development. These companies provide data about existing drugs and about potential reactions, relying on chemical structure information and the literature. Their products combine published content, including patents, with technical mining and analytics. Technology companies such as IBM (Watson) are also actively innovating in this space (for example, the recently announced RoboRXN). These new tools support the drug pipeline by focusing on data such as adverse events and reaction data, and they are intended to reduce the need for actual trials of candidate drugs that might ultimately prove ineffective or even harmful. What is probably obvious in this discussion is the complexity of research publishing in the life sciences, especially given the mix of public data and public emergencies with private data and commercial motivations in developing new solutions and therapeutics. One aspect of this complexity is that commercial players are not always motivated to publish all their data (indeed, even scholarly researchers are sometimes reluctant), and society as a whole needs to push to have more data, such as negative trial results, made public. The urgency of Covid-19 has encouraged greater collaboration and openness in sharing research data.
The STM association's major initiative for 2020 was launching the "Research Data Year" and establishing collaborations with organizations such as the Research Data Alliance (RDA). The RDA collaboration involves new standards on data availability, linking from publications to data repositories, and principles for managing data repositories. We are beginning to see here an expansion of the traditional publisher role into standardization for data curation, building on earlier experiments, both commercial and scholarly, such as figshare, Mendeley, and protocols.io (a scholarly project supporting the deposit of methods). Government support for research data management projects would be extremely helpful, going beyond merely mandating data-posting requirements for funded projects to actually funding such projects.
A major impediment that I see at this point of confluence is a commercial one: companies with strong content assets are unlikely to accept that the value of new analytic services for the life sciences lies mostly in the technology, while companies with strong technology assets are unlikely to accept that content assets bring the greatest value to the equation. One consequence is that content companies are developing their own technology assets and platforms, while technology companies are trying to develop or identify cheaper content assets. In reality, both types of initiative are important, and probably equally so. A similar issue arises in AI and machine learning: large data sets are useful, but structured and well-organized data is equally useful.