Daniel Hook - Digital Science
https://www.digital-science.com/people/daniel-hook/

Barcelona: A beautiful horizon
https://www.digital-science.com/blog/2024/05/barcelona-a-beautiful-horizon/
Thu, 02 May 2024

Digital Science CEO Daniel Hook explores how the Barcelona Declaration will push forward openness and transparency, as well as innovation to benefit the scholarly record.

The post Barcelona: A beautiful horizon appeared first on Digital Science.

How the Barcelona Declaration promotes openness, transparency and innovation

Digital Science welcomes the Barcelona Declaration as a force to continue pushing forward not only openness and transparency but also innovation in and around the scholarly record. Following the launch of this important initiative, we reflect on Digital Science’s path and historical contributions, the economics of maintaining the scholarly record, and its future.

Dimensions is built around open data

In many senses, Dimensions is a demonstration of what can be done when data are made freely and openly available. It would not have been possible to build and maintain Dimensions without the work of initiatives such as I4OC, and the data made available by Crossref, DataCite, PubMed, ORCID, arXiv and many others. Many pieces of the Dimensions data system leverage public sources, and we believe that it is only right and proper to have a version of our product that is available to the community for research purposes at no cost – hence, the free version of Dimensions that we have maintained since 2018, and which we will continue to maintain into the future.

However, service to the community was not the only reason to create a free version of Dimensions back in 2018; it was also about ensuring that researchers had access to search the scholarly record for free and about ensuring that, in an era of increasing research evaluation and increasing research on research, there would be a platform where anyone could go to check results from an analysis or evaluation exercise. At that time, we wrote a paper stating our rationale and principles behind the development of Dimensions and wrote a follow-up piece announcing and committing to continued access for academic research.

In summary, and relevant to current developments, we believe that:

  • Researchers have a fundamental right to access research metadata to further their research;
  • Research into bibliometrics and scientometrics, and related fields, needs to have a basis for reproducibility and we seek to participate in that ecosystem to ensure that any analysis carried out using Dimensions data is reproducible;
  • Data that are used to evaluate academics or institutions should be made available in a way that allows those being evaluated to have an insight into the data on which they are being evaluated.

There is, however, an important additional component that goes beyond these principles – innovation.

A more complex picture

Before we talk about innovation, it is important to acknowledge that Dimensions is not solely built on open data. Indeed, it is a mixed environment with data of different types describing different research objects using different sources. This leads to significant complexity in the data pipeline and in the work that needs to be done to provide “analytics-ready” data. However, for the purposes of the current discussion, it is helpful to understand a little about the nature of the sources used in our data products. These include open data from open sources. When data are published under a CC0 licence (as Digital Science did with its GRID dataset in 2017), it is unambiguous that the data may be used in any context, commercial or non-commercial, and that they may be merged with other datasets for the purposes of creating new and better things. It is an interesting question whether a Digital Science “mirror” of these open datasets helps to make the research infrastructure more robust and easier to access.

Our products also make use of licensed data. These are data for which we have an agreement that restricts their use. Examples range from research articles and grant data from funders to patent documents. They can also include data licensed into products such as Altmetric, which includes data from news providers and social media platforms such as Twitter (X). These data can be expensive to acquire and can only be used and made available in our products within certain limits, even where they are already in the public domain.

All these data, and data derived from them, even if already freely and openly available, can require substantial resources to compile and process. Examples of such derived data include funder details, details of ethics statements, conflicts of interest, data availability statements, and so on, that Digital Science has transformed, enriched and contextualised. All are activities that take significant investment and add significant value for those who use the data. We expect that these types of data will increasingly become part of the open dataset as the research ecosystem matures. Yet, as we innovate, these are also the data that cost Digital Science the largest investment to produce and maintain, including where this may be done in an automated manner. The infrastructure behind Dimensions is not simply a platform that takes data from open sources and serves them to users to consume; rather, it is a complex and expensive mechanism for compiling, refining and improving data so that they can be discoverable, useful and analytics-ready.

Taking author contribution statements as an example, the Dimensions team has invested in the creation and curation of AIs that identify author contribution statements across the research literature. These AIs operate at a level of accuracy that still needs improvement, and hence further investment. Neither the scholarly community, nor publishers, nor standards organisations have defined or accepted a standardised data format that makes author contribution statements widely available. As such, there is a significant cost to data processing. On top of this, innovations such as the CRediT taxonomy are neither universally nor evenly applied. The use of CRediT would be of significant value to sociologists who study the research community, as well as to the evaluation community and anyone involved in tenure and promotion processes. And yet, there is no accepted structured data format that makes these data easily available. As such, the Dimensions team is working on the development of a CRediT data structure and the creation of these data at a level of quality where they can be trusted and used in these important use cases.

As the research ecosystem matures, what should the path from algorithmically generated information back to openly available data with a defined provenance be? One option is to provide enhanced metadata back to publishers to enhance the scholarly record where gaps exist. Arguably, it is not enough for data only to be open – it should be owned by the community that created it, which includes ensuring the context and provenance of the data are maintained. This process has happened many times before, most notably during the application of DOIs to the historical scholarly record.

A model for thinking about innovation

To make sense of this complex landscape we have a mental model that we use to think about the developing world of open research metadata.

The area outside the outer circle (or horizon) can be thought of as all unpublished articles and all articles as yet unprocessed. With time, the outer circle expands, encompassing both more detail about the existing published literature (new fields, greater accuracy) and detail about newly published work. At the horizon of the circle the data are mined and fall inside the circle. The fact that the circle expands is important in this model: the effort required to derive the data increases, but not in proportion to the volume of data refined. The horizon represents the ongoing investment in innovation that is required to derive and improve data from raw, unstructured formats. In practical terms, some cases require humans to identify data from texts; in other cases humans write and train AIs to create annotations and make them available.

The inner circle (or the “beautiful” horizon) can be thought of as open data, or data that have become so inexpensive to make available, through efficiency gains in the production process, that they are completely commoditised. These are data that either cost little to provide or are already refined to the point where little or no innovation is required to make them available. Examples include article title, journal name, page number, DOI and, most recently as a result of I4OC and I4OA, citations and abstracts.

The area between these two circles is where the friction at the heart of the Barcelona Declaration exists. A few years ago, it might have been argued that there was no inner circle and yet, over the last 20 years, projects including PubMed, Crossref, I4OC, I4OA and pre-print servers such as RePEc and arXiv have slowly created a space for open data, either through community action or technological progress. Among the contributors to this effort are some notable players, including the Microsoft Academic Search project, 1science from the team at Science-Metrix, and others.

Such a model is not unusual in other contexts, nor is it surprising that it is the natural point of friction. Determining the time for which an innovation should be profitable, and the level of profit, is not a trivial problem – it is sometimes left to market forces and sometimes is the result of legislation. In the context of copyright law, which was originally developed to protect creativity, the distance between the circles is determined by law to be 70 years after the death of the author in many geographies, although there are variances. Perhaps closer to home are less legal (but nonetheless social-contract-style) agreements, such as those covering humanities PhD theses, which often have an agreed two-year embargo period during which the student has the opportunity to develop and publish a book or otherwise build on top of their work.

There are other non-legislative mechanisms that also determine the distance between analogous horizons in other contexts. One might argue that the creation of a new patented invention is like the innovation horizon of the outer circle, whereas the beautiful horizon of the inner circle is the creation of parallel developments that seek to achieve the same ends as the original invention via different mechanisms. Typically, competitors may take several years to duplicate an approach. At some point the patent will expire, but it may already have been rendered useless by the innovations of others.

Perhaps uniquely in the research information sector, Digital Science has pushed both horizons – the innovation horizon as well as the open data horizon.

Taking a pragmatic position suggests that the annulus needs to be determined dynamically rather than systematically. If an individual or a company invests in pushing the innovation horizon, then they are taking a chance on improving the data that researchers and other stakeholders use to make better decisions, gain deeper insights or be more efficient, and there should be an incentive to continue to invest in innovation. If the innovation is incremental or easy to replicate, then the returns will be small, as others can easily duplicate it. If the innovation is significant, then it will be harder for others to reproduce and hence it will take longer before competitive forces come to bear.

A step change in technology can upset the equilibrium and change both the current competitive dynamics and the future focus of innovation. Machine learning was one of the key technologies that allowed the Dimensions team to push resources into innovation over the last few years, and enhancements in the AI landscape with large language models (LLMs) will continue to fuel these developments.

At Digital Science, our belief is that by taking risks, being innovative and pushing boundaries, so that clients gain real value and significant benefit from our offerings, there should be an opportunity for an appropriate return on investment. We believe that the chance to profit is naturally kept in check by competition, which typically pushes the outer circle, by initiatives such as the Barcelona Declaration, which often advance the inner circle, and by our own mission as Digital Science to support and serve research and the community around it, where we have clearly demonstrated the ability and the will to move both circles.

The future

Using the model above, it made sense in the past that scholarly information would be closed. In the 1950s, when Eugene Garfield started the Institute for Scientific Information, the investment required to construct the Science Citation Index was significant. Indeed, it was Garfield’s realisation that 80% of citations related to 20% of the literature, which turned the problem of citation tracking into one that was tractable with technology contemporary to the era.

The investment that needed to be made to “mine” publication and citation information, given the level and nature of scholarly information infrastructure at the time, was vast. Hence, it is unsurprising that the Science Citation Index was, in essence, the only such index for almost 50 years. With the digitisation of the scholarly record towards the end of the 20th century, the barrier to entry was lowered and PubMed, Crossref, Google Scholar and Scopus were all innovators, introducing competition and, ultimately, creating the open data horizon.

In 2018, Dimensions made use of successive innovations from the community, such as I4OC, together with machine learning to lessen the distance between the two circles.

In the next 10 years, with technological advances in how we write and publish scholarly output, we see a world in which much of the metadata is simply available at the point of production as open data – a true realisation of the Barcelona Declaration. At this point, the distance between the two circles will be zero, with the innovation horizon and the open data horizon coinciding. The effective cost of production of the data will be zero.

So, what will be beyond Barcelona? There are still many challenges in research information. There will probably be a further period, beyond the Barcelona Declaration’s aims, in which we invest more heavily – as we have already begun to do – in information provenance, the integrity of research information, and in understanding sentiment and bias in the research literature. Our focus will shift to ensuring that we can trust information that will be increasingly important not only in decision-making but in forming the basis of AI curricula in the future.

I have confidence that in an innovative field such as research, innovation will continue to be expected of those who seek to serve the space. While Barcelona defines a beautiful horizon, that is still compatible with an endless frontier.

From initials to full names: How transparency and diversity emerged in author bylines
https://www.digital-science.com/blog/2024/04/the-initial-transformation/
Wed, 10 Apr 2024

Discover the unremarked yet significant transformation in academic publishing: the shift from initials to full first names in author records. This change reflects on transparency, diversity, and the interplay of technology and culture in scholarly publishing.

The post From initials to full names: How transparency and diversity emerged in author bylines appeared first on Digital Science.

How naming conventions evolved from initials to full first names in scholarly publishing

In this ongoing investigation into Research Transformation, we seek to celebrate the art of change. How does change happen in research? What influences our behaviour? How do all of the different systems in research influence each other?

We begin our reflection on transformation with perhaps one of the most unremarked upon, yet most pervasive, changes in research – the switch between initials and full first names in author records. As we will see, the shift from the formal to the familiar has been in flux since the start of scholarly publishing. However, particularly in the last 80 years, we can trace the influence of countries, fields of research, publishers and journal submission technology, funders and scholarly knowledge graphs on author name behaviours. In more recent history, we can observe that the shift towards full names has also been gendered, particularly in medicine, with men shifting towards full names earlier than women.

Why does it matter? The increase in transparency afforded by full first names is not simply a curiosity. First names, in the ethnicities and genders that they suggest, provide an (albeit imperfect) high-level reflection of the diversity of experiences that are brought to research. It is just as important to see ourselves reflected in the outputs of the research careers that we choose to pursue as in the voices that represent us on panels at conferences. Framed this way, the progress towards the use of first names is part of the story of inclusion in research. The ‘Initial Transformation’ is also an initial problem.

Fortunately, the use of initials as part of author names has been in steady, if gradual, decline. The full details of the “The Rise and Fall of the Initial Era” can be found in our recent paper on arXiv: https://arxiv.org/abs/2404.06500.

Below are six observations from the paper:

The transformation from initials to full first names is part of the broader transformation of the journal article as technology

The form of the research article is itself a technology used to encode the global norms of science. As a key building block of shared knowledge, the form of a research article must evolve slowly enough to allow the discoveries of the past to be understood today, yet flexibly enough to codify new patterns of behaviour (such as researcher identifiers like ORCID, funding statements, conflicts of interest, author contribution statements and other trust markers).

Over time, not only has the structure of the content of a research article evolved, the way that authors are represented has also changed. From 1945 through to 1980, we identify a period of name formalism (referring to authors by first initial and surname). This is the only period in the history of publishing where initials are used in preference to full first names. We call this period the ‘Initial Era’.

In the ‘Initial Era’, we suggest that accommodating a growing number of authors per paper on a constrained physical page encouraged the formalism towards initials. From 1980, full names begin to be used more commonly than initials, marking the beginning of the ‘Modern Era’. Within the ‘Modern Era’, name formalism continues a gradual decline through to the 1990s. In the period between 1990 and 2003 – a period of significant digital transformation in which the research article was recast as a digital object – name formalism drops steeply. After 2003, the decline in name formalism is less steep, but steadily trends toward zero.
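The measurement of name formalism itself can be made concrete. As a rough illustration of the kind of classification involved (a hypothetical sketch of our own, not the method actually used in the paper), a byline’s given-name string can be flagged as initials with a simple heuristic:

```python
import re

def uses_initials(given_names: str) -> bool:
    """Heuristic: 'J. R.' or 'J.R.' count as initials ('Initial Era' style);
    'Jane' or 'Jane R.' count as a full first name ('Modern Era' style)."""
    first_token = given_names.replace(".", ". ").split()[0]
    # A lone capital letter, with or without a trailing dot, is an initial.
    return bool(re.fullmatch(r"[A-Z]\.?", first_token))

bylines = ["J. R.", "J.R.", "Jane R.", "Wei"]
flags = [uses_initials(b) for b in bylines]
# → [True, True, False, False]
```

Real bibliometric pipelines must of course also handle hyphenated names, non-Latin scripts and transliteration, which is part of why the country-level differences discussed below matter.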

The story of the initial transformation is one of different research cultures becoming homogenised

The US is the first country to shift towards the familiar, followed reasonably quickly by other Western countries, with France perhaps holding out the longest. Slavic countries remain more formal for longer but also increasingly shift towards familiar names. At the bottom of the graph (see below), in green, are three countries in the Asia-Pacific region – Japan, South Korea and China. For these countries there is no concept of a first initial, and where names have been anglicised, full names were preferred.

The story of initial transformation highlights a discipline separation in research culture

How we name ourselves on papers has nothing to do with the type of research that we conduct, yet there are very clear differences between disciplines in the rate of the shift away from name formalism. Research does not change at a single pace; local cultures can impact change regardless of their relationship to the change itself.

Technology influenced our name formalism

The choice to use first names or initials has not always resided with researchers themselves. Below we present an analysis of three journals that all went live with online journal systems in 1995–96. From the mid-70s through to 1995, journals still mostly employed typesetting houses that set the style of the journal. Even before the onset of online submission systems, journal styles influenced the way that first names were represented. From the mid-70s these three journals take different approaches: Tetrahedron shifts away from a majority-initials approach, whereas The BMJ and the Journal of Biological Chemistry switch to typesetting that favours initials. With the emergence of the internet in 1995, research articles began to be recast as discoverable landing pages; here the Journal of Biological Chemistry switches all at once to a system that enforces full names, and The BMJ to a system that allows choice. In all cases where author choice is allowed, the trend away from formal names continues.

Changes in infrastructure can affect how we understand the past as well as the present

Between 2003 and 2010, the DOI infrastructure run by Crossref was adopted by the majority of publishers. As part of the Crossref metadata schema, a separate field for given names was assumed. Critically, during this transition most journals chose to register their back catalogues, including full names where possible. We owe our ability to view full name data from the past to these infrastructure changes in the first decade of the 2000s.

How were publishers able to communicate first names to the Crossref DOI standard? At a layer below DOIs was another language for describing the digital structure of papers. The Journal Article Tag Suite (JATS XML), now a common standard used to describe the digital form of a journal article – aiding both the presentation and preservation of digital content – was first released in 2003, and reflected over a decade of prior work in the industry to re-express the journal article as a digital object. Within this standard full names were also codified, and the requirement on publishers to preserve all digital content meant that there was an imperative to apply this standard (or at least compatible earlier versions) to their complete catalogues.
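To make the point concrete, JATS encodes surnames and given names in separate elements, so downstream systems never need to guess where a first name ends. The fragment below is a minimal hand-written illustration (real JATS records carry far more structure), parsed with Python’s standard library:

```python
import xml.etree.ElementTree as ET

# Minimal hand-written JATS fragment; illustrative only.
jats = """
<contrib-group>
  <contrib contrib-type="author">
    <name><surname>Smith</surname><given-names>Jane R.</given-names></name>
  </contrib>
  <contrib contrib-type="author">
    <name><surname>Jones</surname><given-names>P.</given-names></name>
  </contrib>
</contrib-group>
"""

root = ET.fromstring(jats)
authors = [
    (c.findtext("name/surname"), c.findtext("name/given-names"))
    for c in root.findall("contrib")
]
# → [('Smith', 'Jane R.'), ('Jones', 'P.')]
```

Note that `given-names` is free text: the same element can hold “Jane R.” or just “P.”, so the standard preserves whichever form the publisher supplied – which is why the historical record retains initials even in fully structured metadata.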

Although the communication of first names seems to have occurred reasonably seamlessly to DOI metadata, the transition of first names to the scholarly knowledge graphs of the time was slower.

MEDLINE (and, by extension, PubMed) only began adding full names to its metadata records in 2002. Journals that relied on MEDLINE records for discovery (and chose not to implement DOIs) did not benefit from retrospective updates.

The difference in the adoption of first names between Crossref and MEDLINE/PubMed also highlights a risk in adopting scholarly knowledge graphs as infrastructure. Scholarly knowledge graphs have their own infrastructure constraints, and make their own decisions about what is sustainable to present. Although enormously valuable, they are a point of disconnection from the sources of truth they present. We can see this split starkly if we look at publications from those journals that chose not to create DOIs for their articles, relying instead on the services provided by MEDLINE.

The shift to full names happened at different rates for men and women, and, at least for publications associated with PubMed, technology influenced the practice

With the benefit of gender-guessing technology, we note that progress towards first names has occurred at different rates for men and women. This is particularly stark for publications in PubMed.

Why is there a jump in 2002? As mentioned above, 2002 was the year in which author first names became something you could interact with, as PubMed and MEDLINE incorporated them into their search. Although we cannot draw a direct causal connection, it is tempting to argue that this subtle shift in critical technology used by almost all medical researchers had a small but important impact on making research more inclusive. When we look at articles that have both a PubMed ID and a DOI, we can see that in 2002 the average number of first names on papers associated with women rose by 17%, and by 13% for men. This jump is not present in publications that have not been indexed by PubMed.

For medical disciplines associated with papers in PubMed, after 2002 there is also a distinct difference between men and women in the rate of the first name transformation. The rate of change for men is less than half that of women, rising only 5% in 20 years, compared to 12%. For some disciplines, then, this raises a methodological challenge in gender studies, as (at least based on author records) changes in the participation rates of women in science must be disentangled from changes in the visibility of women in science.

Embracing initial transformation

Finally, the transition from initials to first names has happened slowly and without advocacy. Whilst this has been to our advantage in identifying some of the axes along which research transformation occurs, an argument could be made that, if first names help provide us (imperfect) access to the diversity of experiences that are brought to research, then the pace of change has not been fast enough. For instance, could more have been made of ORCID to facilitate the shift to first names, so that older works by a researcher identified by an initial-based moniker could be linked to newer works that use the researcher’s full first name?

The transformation away from name formalism of course does not stop at author bylines. Name formalism is also embraced in reference formats. It could be argued that even within a paper, this formalism suppresses the diversity signal in the research that we encounter. Reference styles were defined in a different era with physical space constraints. Is it time to reconsider these conventions?
Within contribution statements that use the CRediT taxonomy, initials are also commonly employed to refer to authors. This convention creates disambiguation issues when two authors share the same surname and first initials. As the digital structure of a paper continues to evolve, we should be careful not to unquestioningly embed the naming conventions of a different era into our evolving metadata standards.
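The collision is easy to demonstrate. Using hypothetical author names, the sketch below abbreviates each byline the way many CRediT statements do, and shows how two distinct authors can collapse to the same key:

```python
def credit_key(full_name: str) -> str:
    """Abbreviate 'Jane R. Smith' -> 'JRS', as CRediT statements often do."""
    return "".join(part[0] for part in full_name.replace(".", " ").split())

authors = ["Jun Wang", "Jing Wang", "Maria L. Garcia"]
keys = [credit_key(a) for a in authors]
# 'Jun Wang' and 'Jing Wang' both abbreviate to 'JW', so a statement such
# as "JW: methodology" cannot be attributed unambiguously.
collisions = {k for k in keys if keys.count(k) > 1}
# → {'JW'}
```

A structured contribution format that referenced authors by a persistent identifier rather than by initials would remove this ambiguity entirely.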

Launching a new way to interact with scientific content on OpenAI’s ChatGPT platform
https://www.digital-science.com/blog/2024/02/launching-a-new-way-to-interact-with-scientific-content-on-openais-chatgpt-platform/
Wed, 28 Feb 2024

Digital Science releases its first custom GPT on OpenAI’s ChatGPT platform – Dimensions Research GPT – as a free version based on Open Access content and an enterprise version calling on all the diverse content types in Dimensions from publications, grants, patents and clinical trials. In alignment with our goals to be responsible about the AIs that we introduce, we explore some of the steps that we’ve taken in its development, explain our key principles in developing these tools, and make the context of these tools clear for the community that we intend them to serve.

The post Launching a new way to interact with scientific content on OpenAI’s ChatGPT platform appeared first on Digital Science.

Dimensions Research GPT: Trustworthy research exploration via ChatGPT

Today, Digital Science releases its first custom GPT on OpenAI’s ChatGPT platform – Dimensions Research GPT – as a free version based on Open Access content and an enterprise version calling on all the diverse content types in Dimensions from publications, grants, patents and clinical trials. In alignment with our goals to be responsible about the AIs that we introduce, we explore below some of the steps that we’ve taken in its development, explain our key principles in developing these tools, and make the context of these tools clear for the community that we intend them to serve.

For any software development company, there is an implicit responsibility to the user communities that it serves. Typically, this commitment might extend to being conscientious about how the software is developed: ensuring, to the greatest extent possible, that the software is secure, free of bugs, and functions as described to the client would seem to be among the basic requirements.

The rise of AI should raise the value that systems can bring to users, but it also raises the bar in the relationship between developer and user, especially with large language models (LLMs). Users need to understand how the data that they submit to the system are being used, and they also need to understand the limitations of the responses that they receive. Developers need to understand and minimise biases in the tools they create, as well as understand complex concepts such as hallucination and work out how to educate users about how they should think about trusting different types of output from their software.

All these problems are magnified tenfold when it comes to supporting researchers or the broader research enterprise. The research system is so fundamental to how society functions and progresses that we cannot afford for new technologies to undermine the trust that we have in it.

At Digital Science we believe that research is the single most powerful tool that humanity possesses for the positive transformation of society and, as such, we have a responsibility to provide software that does not damage research. Although that sounds simple, it is tremendously difficult. In an era of paper mills and p-hacking, providing information tools that support research requires deeper thinking before releasing a product to users.

Beyond all the requirements that we have listed above, to support researchers and the research community, we believe that we need to:

  • ensure that researchers understand what uses of the system are valid and which aren’t;
  • sensitise users to the fact that this technology is in its early stage of development and that it cannot be completely trusted;
  • provide users with the ability to contextualise the output that they get so that they don’t have to trust without verification;
  • ensure that no groups of researchers are disenfranchised or excluded from accessing this type of technology, whether artificially or through commercial approaches.

Many of these features have been built into the offering that we launch today: this blog attempts to address some of the points above; we are working to ensure equitable access by creating a free version; and we have made specific functionality choices to try to address our concerns with where this technology can lead. Overall, it is with some pride and much excitement that we launch Dimensions Research GPT today!

The Road to Dimensions Research GPT

Our free offering Dimensions Research GPT and its more powerful counterpart Dimensions Research GPT Enterprise are the result of a long period of testing and feedback from the community. We started developing this type of functionality in late 2022, but by summer 2023 it had reached a phase where we needed more understanding from the sector. Thus, in August 2023 we launched the Dimensions AI Assistant as a beta concept. We quickly learned that “question answering” can be challenging not just from a technical perspective (for example, providing a low-to-no-hallucination experience) but also in terms of providing users with an interface that continues to be engaging and which fuels curiosity.

In addition, we found that there is a certain “fuzziness” in querying through an LLM that doesn’t sit comfortably in an environment that involves highly structured data, such as Dimensions.  That realisation led us to make certain design decisions that you’ll see informing the way that we develop both the products launched today and Dimensions in the future. 

For better or worse, since the beginning of modern search in the mid-1990s we have become used to searching the web and seeing pages of search results – some of which are more relevant to our search, and some of which appear less relevant. With most LLMs, the information experience is different to a standard internet search: we ask a question and we get an answer. What’s more, we get an answer that typically does not equivocate or sound anything less than completely confident. It does not encourage us to read around a field or notice interesting articles that might be only tangentially relevant – it focuses us on the answer rather than encouraging curiosity about all the things around the answer. Launching a tool with those characteristics in a research context is not only potentially irresponsible but dangerous. We have used that concern as a guiding principle for how we have built Dimensions Research GPT.

What is Dimensions Research GPT?

Dimensions Research GPT and Dimensions Research GPT Enterprise both bring together the language capabilities of OpenAI’s ChatGPT and the content across the different facets of Dimensions. In the case of Dimensions Research GPT, data related to research articles from the open access corpus contained in Dimensions are used to provide context for the user’s question and to help them discover more. This free tool gives users the ability to interact with the world’s openly accessible scholarly content via an interface that ensures that answers refer back to the research that underlies them. This provides two important features: firstly, the ability to verify any assertions made by Dimensions Research GPT; and secondly, the ability to see references to a set of articles that may be relevant to the question, so that users continue to be inquisitive and read around a field. Basing this free tool on content that is free to read provides the greatest chance for equity and impact.

Dimensions Research GPT Enterprise runs the same engine and approach as Dimensions Research GPT, but it extends the scope of the content that it can access to include data from the full Dimensions database of 350 million records, covering research articles, grant information, clinical trials, and patents – a truly fascinating dataset to explore in this new way.
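The grounding approach described above – retrieve relevant records first, then ask the model to answer with explicit numbered references – can be sketched roughly as follows. Everything here is an illustrative assumption: the function names, the naive keyword scoring, and the tiny corpus are ours, not Dimensions’ actual implementation.

```python
# A minimal sketch of reference-grounded question answering.
# The retrieval step and prompt format are illustrative assumptions only.

def search_corpus(question, corpus, top_k=2):
    """Rank documents by naive keyword overlap with the question."""
    terms = set(question.lower().split())
    scored = []
    for doc in corpus:
        overlap = len(terms & set(doc["abstract"].lower().split()))
        scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def build_grounded_prompt(question, docs):
    """Ask the model to answer *only* from numbered references, so every
    assertion can be traced back to a citable source."""
    refs = "\n".join(
        f"[{i + 1}] {d['title']}: {d['abstract']}" for i, d in enumerate(docs)
    )
    return (
        "Answer the question using only the references below, citing them "
        f"by number.\n\nReferences:\n{refs}\n\nQuestion: {question}"
    )

corpus = [
    {"title": "PT-symmetric lasers",
     "abstract": "PT symmetry enables single-mode laser operation"},
    {"title": "Banana genomics",
     "abstract": "Genome sequencing of the banana"},
]
question = "How does PT symmetry help laser design?"
docs = search_corpus(question, corpus)
prompt = build_grounded_prompt(question, docs)
```

The resulting prompt would then be sent to the language model; because the answer must cite `[1]`, `[2]`, and so on, a reader can follow each citation back to the underlying article rather than trusting the generated text on its own.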

Before we explore further what Dimensions Research GPT is, and the kinds of things that you can do, it is worth taking a moment to be clear about what it is not. Put simply, it is not intended for analytics. While many users are familiar with Dimensions as an analytics tool, Dimensions Research GPT is not a tool for asking evaluative or quantitative questions. Thus, asking Dimensions Research GPT to calculate your H-index or rank the people in your field by their attention will be a fruitless task. Similarly, the system is designed to help you explore knowledge, not people; hence, if you ask Dimensions Research GPT to summarise your own work, provide rankings, or tell you who the most prolific people are in your field, you will be disappointed. Many of these use cases, with the exception of those involving the H-index (Digital Science is a signatory to DORA), are already covered by Dimensions Analytics.

An example of how to use Dimensions Research GPT

We’ve covered at a high level the principles behind building a tool like Dimensions Research GPT, and we’ve also explained what it is and is not, so now we really should show you how to think about using the tool.

Below, we show a brief conversation with Dimensions Research GPT about a research area known to one of the co-authors of this blog. We encourage readers to carry out the same queries in ChatGPT or Dimensions Research GPT Enterprise and compare the answers that they receive.

Our first prompt introduces the area of interest…

Summarise three of the most important, recent applications of PT-symmetric quantum theory to real-world technologies

The references link to Dimensions to give full contextualised details of the articles, and connect to the source versions so that you can read further. Maybe we’re not from the field and we want to understand that response in simpler terms. That might look like:

Rewrite your last response at the level which a high-school student can understand and highlight the potential for application in the real world

With this query, we’ve just begun to explore the base functionality that ChatGPT provides underneath Dimensions Research GPT. This is just scratching the surface of the open-ended possibilities implied here.

Finally, we ask Dimensions Research GPT to speculate:

Please speculate on the potential applications of PT symmetry to medical device development, providing references to appropriate supporting literature

Again, the tool shows references that back up these speculations about these exciting potential advances.

We fully realise that this is not a panacea, but at the same time, we think that this approach is worthy of exploration and pursuit in a way that can help the research community benefit from new AI technologies in a responsible way.  We’re sure that we won’t get everything right on the first attempt – but we aim to learn.  On that note, we hope that you will be part of our experiment – please do tell us how you use this platform to inform and accelerate your own research.  Like us, we’re sure you’ll find that with this technology there are always possibilities. 

If you want to try Dimensions Research GPT, you can do so as a ChatGPT Plus or Enterprise user, by going to your OpenAI/ChatGPT environment and looking for Dimensions Research GPT under Explore GPTs.

The post Launching a new way to interact with scientific content on OpenAI’s ChatGPT platform appeared first on Digital Science.

The lone banana problem. Or, the new programming: “speaking” AI https://www.digital-science.com/blog/2023/06/the-lone-banana-problem-or-the-new-programming-speaking-ai/ Tue, 27 Jun 2023 09:20:21 +0000

The subtle biases of LLM training are difficult to detect but can manifest themselves in unexpected places. I call this the ‘Lone Banana Problem’ of AI.
Daniel Hook
CEO, Digital Science

Think of all the tools that the information age has brought us…

Think of all the people whose diseases have become manageable or have been cured. Think of all the economic benefits that have been generated through increased productivity. Think of all the routes that we have to express ourselves and share our creativity. Think of all the skills that were needed to power that revolution – the hardware engineering and the software engineering – the development of whole new fields of knowledge and understanding.

Then think of all the wealth disparity that has been introduced into our world.  Think of the social anxiety of always being online.  Think of the undermining of our democratic institutions.

The effects of new technology are seldom either solely positive or solely negative and, as such, there is a responsibility that sits with those who develop technology to consider how it will be used. Underlying that responsibility sits a need to understand technology deeply. The recent rise of Large Language Models (LLMs) and their rapid adoption raises many questions about whether we understand the technology and whether we understand how it will impact us at a cultural, societal, or economic level.

It is clear that the experience that we have of programming and existing technologies will give way to different skills with this new technology. Command of a programmer mindset and skill with languages such as C++ and Python will give way to the need to understand a dynamic meta-language that is drawn from the patterns of online human interaction. The new skill for the present technology is “speaking” language in the way that the AI determines that language to be spoken from the inputs that it has consumed.

In that context, it is important to know that LLMs are not producing something new or creative; rather, they are producing (or reproducing) the statistical average of the inputs they have consumed, in the context of the question they have been asked. Understanding that AIs do not understand us the way we think they do is an important step in taking responsibility when building new tools on top of these technologies.

What follows attempts to illustrate this point.

Going bananas

When I have written before on this topic I have noted that it is not the large, obvious biases that concern me; rather, it is the subtle biases that are difficult to detect that should be of more concern. I recently found what I regard to be an excellent illustration of this, for which I have coined a phrase: I call it the ‘Lone Banana Problem’.

Bananas are attractive fruits: they have a jolly colour and they taste great. I have had a running joke for several years with a friend of mine that, given his love of bananas, he should feature them more prominently in the branding of his business. When I signed up to Midjourney I saw the perfect opportunity to generate an idealised banana image for him to use. I started with a simple prompt: “A single banana casting a shadow on a grey background”. The result is shown in Figure 1.

Figure 1: Four initial outputs generated by Midjourney in response to the prompt “A single banana casting a shadow on a grey background”.

Now, the more astute amongst you may notice an issue with the outputs that Midjourney has produced in Figure 1. While the bananas are beautiful and look extremely tasty in their artistic pose casting their shadow on the grey background, you may notice that I asked for a single solitary banana, on its own, but none of the variants that I received contained just one banana. Of course, I thought, the error must be mine, I clearly must not have been sufficiently precise in my prompt. So, I tried variants – from “a perfect ripe banana on a pure grey background casting a light shadow, hyperrealistic”, to the more specific “a single perfect ripe banana alone on a pure grey background casting a light shadow, hyperrealistic photographic”, and even to the emphatic (even pleading) “ONE perfect banana alone on a uniform light grey surface, shot from above, hyperrealistic photographic”.

The invisible monkey problem?

I mentioned this challenge to a friend of mine who has much more of a programmer’s brain than I do. He asked me if I had tried getting Midjourney to render a monkey with a banana and then asking it to imagine that the monkey was invisible. (You see what I mean about a programmer’s brain?) He (my friend, not the monkey) was surmising that the data around monkeys holding bananas, or bananas in a different context, might yield different results. The depressing result is included in Figure 2.

Figure 2: One of the outputs of  the experimental prompt: “An invisible monkey with a single banana”.

You are quite right, that monkey (in Figure 2) should look sheepish! Firstly, he should be invisible and is conspicuous by well…his conspicuousness.  Secondly, he is holding not one but two bananas!  The results were the same with aliens holding bananas and other animals.  Slightly bizarrely, several of the monkeys ended up wearing bananas or being banana-coloured.

Every image that Midjourney produced contained two (or more) bananas, seemingly no matter how I asked.

I began to suspect that bananas, like quarks in the Standard Model of physics, might not naturally occur on their own as through some obscure binding principle they might only occur in pairs. I checked the kitchen. Experimental evidence suggested that bananas can definitely appear individually. Phew! But, the fact remained that I couldn’t get an individual banana as an output from the AI. So, what was going on?

Bias training

One of the problems of generative AI is that understanding what is going on inside the machine’s brain is almost impossible. There are interesting approaches such as TCAV that attempt to give us more of an insight, but as with a human brain, we don’t fully understand the process that goes on inside a deep learning algorithm.  

Testing and understanding the outputs for a given input is critical when deciding how this type of technology can be applied in real-world applications. Anyone who has studied chaos theory, or who has heard of the butterfly effect, will know that naturally complex systems are often highly sensitive to initial conditions. The LLMs that we are building have highly complex inputs in the form of the vast amounts of data used to train them. However, as with the Mandelbrot set and other complex-looking fractals, the rules that are then applied to those data during training are deceptively simple.
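That sensitivity to initial conditions is easy to demonstrate with a classic toy system, the logistic map – a deceptively simple one-line rule that becomes chaotic. This sketch is a standalone illustration of the chaos-theory point, not a model of how an LLM works:

```python
# The logistic map x -> r*x*(1-x): for r = 4 it is chaotic, so two
# trajectories that start almost identically end up completely different.

def logistic_trajectory(x0, r=4.0, steps=50):
    """Iterate the logistic map from x0 and return the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = logistic_trajectory(0.200000)
b = logistic_trajectory(0.200001)  # perturbed by five parts in a million

# Early on the trajectories agree closely; within a few dozen steps the
# tiny perturbation has been amplified to an order-one difference.
max_gap = max(abs(x - y) for x, y in zip(a, b))
```

A change in the fifth decimal place of the input completely changes the later trajectory, even though the rule itself is a single multiplication.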

The bias to two bananas in a picture is, I believe, an example of a subtle bias (OK, it’s not that subtle, but it is more subtle than many of the more concerning news-grabbing biases that we regularly read about). A naïve explanation may be that the training dataset contains many pictures of pairs of bananas that have been labelled “banana” but not “two bananas”. It may also be that Midjourney has never seen an individual banana, so it doesn’t know that a single banana is possible.

The danger here is that due to the convincing nature of our interactions with AIs, we begin to believe that they understand the world in the way that we do.  They don’t.  AIs, at their current level of development, don’t perceive objects in the way that we do – they understand commonly occurring patterns.  Their reality is fundamentally different to ours – it is not born in the physical world but in a logical world. Certainly, as successive generations of AI develop, it is easy for us to have interactions with them that suggest that they do understand. Some of the results of textual analysis that I’ve done with ChatGPT definitely give the impression of understanding. And yet, without a sense of the physical world, an AI has a problem with the concept of a single banana.

Of course, the point of this article is not to be vexed about the lack of individual bananas in the AI’s virtual world. It is to point out that even though this technology is developing rapidly and the output is impressive, there are still gaps and, while they are not always immediately noticeable, they are not small. Ethical and responsible use of AI is easy to forget when faced with the speed of innovation and the constant press hype around AI. The lone banana problem is, in some sense, a less scary version of HAL, the AI in Arthur C. Clarke’s 2001: instead of killing the crew of the Discovery, I have merely discovered a virtual universe in which bananas only appear in pairs.

While humans are amazing pattern matchers, that skill is augmented by common sense (in many but not all cases), by context, and by an evolved and subtle understanding of the physical world around us. AIs don’t yet have those augmentations – they are pure pattern-matching power. Hence, they are only as good as the data that we put into the training set, and can be no more than the statistical average of those inputs. In the lone banana problem, the statistics suggested that bananas only appear in twos (or more), and so the AI could not imagine a single banana: the data and parametric tuning behind it didn’t allow it to consider that possibility, on average.

Existential questions

But this line of thinking does raise certain uncomfortable questions. Is human intelligence just the result of pattern matching in the context of an enriched relationship with a physical world? Is human morality simply the result of pattern matching coupled with a sense of its own mortality? Or is there something deeper going on? Is inspiration or intuition something that an AI will be able to master, either through its learnt experiences or through a richer relationship with the physical world? In a philosophical sense, what is creativity? And does human creativity differ fundamentally from machine creativity?

Certainly human creativity has limits (and, perhaps paradoxically, appears to become more limited with age – precisely as we have been exposed to more experiences). Some of those limits are, for example, the amount of data that we can perceive and process. Others appear in the extent to which we can imagine beyond our everyday experiences. Machine creativity appears to have different limits – not ones of data processing, but rather limits on what can be perceived to be relevant or important, and similar to human experience, on imagining beyond experience. Despite the alluring implication of the command “imagine” used to initiate a new prompt when getting Midjourney to create a set of images, to what extent is the AI able to imagine beyond the patterns that have been fed to it?

It also points to a deeper issue of the underlying nature of this technology. When programming became a more mainstream job in the 1980s and people started studying for computer science degrees, it was noted that programming required a certain way of thinking.  It is the same for prompt engineering, but the type of thinking behind prompt engineering is not the same as for programming. Prompt engineering requires a deep understanding of language and not only that, a deep understanding of how a large language model understands language. In the same way that deep learning algorithms have a deep understanding of chess and Go and consequently make surprising moves due to their perception of the game, the same will be true of large language models and their perception of the world. Their use of language and interpretation will be considerably more nuanced than ours, loaded with a myriad of cultural references that none of us can possibly have assimilated.

While I appreciate that AIs are a good deal more complex than my simple example shows, and indeed my example may even just be a facet of clumsy prompt engineering, it still demonstrates that you have to be incredibly careful about your assumptions when using AI. What it tells you may not be what you think it is telling you and it may not be for the reasons that you think either.

At Digital Science, we believe that we have a responsibility to ensure that the technologies that we release are well tested and well understood. The use cases where we deploy AI have to be appropriate for the level at which we know the AI can perform and any functionality needs to come with a “health warning” so that people know what they need to look for – when they can trust an AI and when they shouldn’t.

Postscript

After two weeks in despair of ever finding a lone banana, I tried a different style of prompt in Midjourney (either it has learnt some new tricks, or I might be getting better at “prompt thinking”). The prompt “A single banana on its own casting a shadow on a grey background” yielded Figure 3.

This shows two things. Firstly, the output can be highly sensitive to initial conditions – by which I mean the prompt given: the difference between the phrases “A single banana casting a shadow on a grey background” and “A single banana on its own casting a shadow on a grey background” is not large, either in semantics or in the words used, yet the outcome is significantly different in its level of accuracy. Secondly, even with this improved prompt formulation, coupled with whatever upgrades may have been implemented by the Midjourney team in the prior two weeks, there is still one output that contains two bananas – and if you look closely, one attempted banana is trying to split itself in two!

Figure 3: Four initial outputs generated by Midjourney in response to the prompt “A single banana on its own casting a shadow on a grey background”.

Tinker, researcher, prompter, wizard https://www.digital-science.com/blog/2023/05/tinker-researcher-prompter-wizard/ Tue, 09 May 2023 08:51:34 +0000

Will mastery of the witchcraft of AI be good or bad for research?
via Bing Image Creator. Prompts: “A (female) sci-fi wizard invoking all of his (her) mysterious energy to create the most powerful spell imaginable, drawn in comic-book art style”.

Until six months ago most of us probably hadn’t placed the words “prompt” and “engineer” in close proximity (except possibly for anyone involved in a construction project where a colleague had arrived to work consistently on time). Today, a “prompt engineer” is one of a new class of emerging jobs in a Large Language Model (LLM)-fueled world. Paid in the “telephone-number”-salary region, a prompt engineer is a modern day programmer-cum-wizard who understands how to make an AI do their bidding.

The dark art of LLMs

Getting an AI to produce what you want is something of a dark art: as a user you must learn to write “a prompt”. A prompt is a form of words that translates that which you desire into something the AI can understand. Because a prompt takes the form of a human-understandable command, such as “write me a poem” or “tell me how nuclear energy works”, it appears accessible. However, as anyone who has played with ChatGPT or another AI technology will tell you, the results, while amazing, are often not quite what you asked for. And, this is where the dark art comes in. The prompts that we use to access ChatGPT-like interfaces lack the precision of a programming language, but give a deft user access to a whole range of tools that have the potential to significantly expand our cognitive reach and, because of the natural language element of the interface, have the potential to fit more neatly into our daily lives and workflows than a programming interface.


Mastering these new tools, just as was the case when computer programming became the tool of the 1970s, requires a whole new way of thinking – the incantation or spell that you need to cast is not obvious unless you can get inside the mind of the AI. Indeed, this strange new world is one in which words have a new power that they didn’t have just a few months ago. A well-crafted prompt can conjure sonnets and stories, artworks and answers to questions across many fields. With new plug-in functionalities, companies are able to build on the LLM frameworks and extend this toolset even further.

The classic UNIX Magic poster by Gary Overacre hints at the possibilities open to its users if they can find the right combination of “ingredients”, which are all UNIX related. Prompt engineers are likewise attempting to master the necessary ingredients to become wizard-like users of LLMs. For more discussion on the poster and its contents, see e.g. this post on HackerNews.

Problematic patterns

What can seem like magic is actually an application of statistics – at their hearts Large Language Models (LLMs) have two central qualities: i) the ability to take a question and work out what patterns need to be matched to answer the question from a vast sea of data; ii) the ability to take a vast sea of data and “reverse” the pattern-matching process to become a pattern creation process. Both of these qualities are statistical in nature, which means that there is a certain chance the engine will not understand your question in the right way, and there is another separate probability that the response it returns is fictitious (an effect commonly referred to as “hallucination”).
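To make the pattern-matching/pattern-creation distinction concrete, here is a toy “language model” that is vastly simpler than any real LLM: it learns by counting which word follows which, and generates by reversing those counts into the statistically most likely continuation. The training text and function names are purely illustrative.

```python
from collections import Counter, defaultdict

# A toy bigram "model": quality (i) is tallying which patterns follow
# which words; quality (ii) reverses those counts into generation.
training_text = (
    "the cat sat on the mat . the cat ate the fish . "
    "the dog sat on the rug ."
).split()

follows = defaultdict(Counter)
for w1, w2 in zip(training_text, training_text[1:]):
    follows[w1][w2] += 1  # pattern matching: count observed continuations

def most_likely_next(word):
    """Pattern creation: emit the statistically most common continuation."""
    return follows[word].most_common(1)[0][0]

# "the" was followed by cat (twice), mat, fish, dog and rug, so the model
# can only ever answer with the statistical consensus of its training data.
```

Real LLMs condition on far longer contexts using learned representations rather than raw counts, but the same principle holds: an answer is the most statistically plausible continuation, which is only statistically tied to the truth.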

Hallucination is one of a number of unintended consequences of AI – whereas a good search on Google or a Wikipedia article brings you into contact with material that may be relevant to what you want to know, it is most likely that you will have to read a whole piece of writing to understand the context of your query and, as a result, you will read, at least somewhat, around the subject. However, in the case of a specific ChatGPT prompt, it is possible that you only bring back confirmatory information or at the very least information that has been synthesized in specific response to your question, often without context.

Another issue is that of racism and other biases. AIs and LLMs are statistical windows on the data they contain. Thus, if the training data are biased then the outcome will be biased. Unfortunately, while many are working on the high-level biases in AI and also trying to understand how one might equip AIs with an analogue of common sense, this is not yet mainstream research. More generally, I am quite hopeful that, since the issue of bias is becoming better articulated with time, we will conquer the obvious biases that take place in interactions. However, what is more concerning to me are the cumulative effects of the subtle biases that AIs adopt. The whole point of AIs as tools is that they do one thing better than humans: they are able to perceive factual information more completely and with greater depth than we can. This is why we can no longer beat AIs at chess or Go (except by discovering critical flaws!), and why we may not be able to perceive biases that are not simple. In 2001, the famous Stanley Kubrick–Arthur C. Clarke collaboration, the AI HAL interpreted conflicting instructions in an unpredictable way. In fact, the rationale for HAL’s psychosis is obvious by modern standards, but it serves as a prescient tale – the outcome of a confluence of a large number of complex instructions could be a lot more challenging to identify.

Fear and loathing in AI

Twenty years ago, we would not have recognised “social media influencer” as a job type and in twenty years from now, we may not recognise taxi driver as a job type.  Prompt engineers may form just one new class of jobs, but how many of us will be prompt engineers in the future? Will prompt engineers be the social media superstars of tomorrow? Will they be the programmers that fuel future tools? Or, like using a word processor or search engine, will we all be required to learn some level of witchcraft? 

Figure 1: Quantifying the time lag of the impact of industrial revolutions on GDP per person.  In all industrial revolutions to date, GDP per person has been impacted and takes several decades to recover and exceed previous peak levels.  Reproduced by kind permission of The Economist: The Third Great Wave (2014).

Previous industrial revolutions have been responsible for massive wealth creation as sections of society have significantly increased their productivity. But, at the same time, revolutions have also led to destitution in the parts of society that have either lacked access to new technologies or lacked the capacity to adapt. As can be seen in Figure 1, following the first industrial revolution in the UK, average GDP per person fell between the 1760s and the 1820s – it was not until the 1830s (50 years after the introduction of the steam engine, and a full 80 years after the beginning of industrialisation) that GDP per person regained and eventually eclipsed its 1750 level. The same trend took place from the late 1860s, when GDP per person fell until the early 1900s; by 1925 it was well on the way back to its previous high, and it eclipsed that prior peak by 1950. We are currently in a third cycle, initiated by the advent of the PC, followed by the World Wide Web and further fuelled by the rise of AI, in which GDP per person is now lower than its 1975 high.

Viewed as a continuum since the dawn of the information age with the widespread introduction of the PC, we stand in the midst of an exponential revolution where the technologies we have built are immediately instrumental in designing and building their successors, superseding themselves over shorter and shorter timescales.

Moore’s Law, which predicted the growth in the power of computer chips, may have held through the end of the last millennium. While chip power has determined the power and capability of AI up until this point, data are now the dominant factor in how the power of AI will develop. And, as data limitations come into play, the power of quantum computing may take us further into the future.


Despite Elon Musk’s prophecies of doom, and even the training of evil AIs or those which create deadly toxins, most people just want to know if their livelihoods are under threat in the near future. While world domination by an AI is not likely in our imminent future, it is almost a certainty that jobs will change. However, if we view AI as the tool that it is, and think about how it can complement our work, we begin to position ourselves for the new world that is emerging. Progressive researchers and businesses alike will be investing not just in learning how the current technology will revolutionise our world but in continual professional development for their team members as, in an exponential revolution, we can expect to need to retrain not just once or twice in a career, but potentially all the time.

While Web 2.0 in the early 2000s gave us massive open online courses (MOOCs) with the potential to retrain and develop ourselves, the MOOC has largely developed as a tool for companies to deliver information such as safety and compliance training, and as a way for us to learn new hobbies. Importantly, it has broadened access to education by, for example, making college- and university-level courses available for free or at low cost. For the majority of people, however, MOOCs and learning platforms have not become a regular part of life. In an AI-fuelled revolution, our capacity to learn may become critical to remaining employable – continuous professional development may be the new norm, with your education even being tailored to your needs by an AI.

Who prompts the prompters?

ChatGPT and similar technologies show great promise in a research context. They have the capacity to extend our perception, detect subtle patterns, and enhance our ability to consume material; their existing ability to summarize and contextualize published work is already impressive. Several authors have attempted to add ChatGPT as a co-author, prompting some publishers to issue guidelines stating that ChatGPT may not be a co-author, since it cannot take responsibility for its contribution. At the same time there are concerns over ChatGPT’s tendency to “hallucinate” answers to questions. LLMs are essentially statistical in nature – their answers correspond to the statistically most likely responses to a prompt – which means that their answers are only statistically grounded in fact, and there is always a chance of untruth. A literature review written by an AI may be produced quickly, but it could, at some level, be both uninnovative and incorrect, since it consists of the statements that others are statistically most likely to make about the subject.
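That statistical character can be caricatured in a few lines of code – a toy “language model” that always emits the most probable next token, whether or not the result is true. Everything here (the probability table, the function names) is invented purely for illustration; it bears no resemblance to how a production LLM is actually built or decoded:

```python
# Toy illustration: a "language model" reduced to a lookup table of
# next-token probabilities. Greedy decoding picks the most likely
# continuation - plausible-sounding, but with no notion of truth.

NEXT_TOKEN_PROBS = {
    ("the", "moon"): {"landing": 0.6, "is": 0.3, "cheese": 0.1},
    ("moon", "landing"): {"was": 0.7, "happened": 0.3},
}

def greedy_next(context):
    """Return the statistically most likely next token, or None."""
    probs = NEXT_TOKEN_PROBS.get(context, {})
    return max(probs, key=probs.get) if probs else None
```

Sampling rather than always taking the maximum (as real systems do) only changes the odds: an unlikely and wrong continuation can still be emitted, which is one simple way to think about hallucination.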

Such approaches can already be observed as publishers seek to address an onslaught of papers from papermills

This leads to two obvious challenges (and many more besides): one is the biasing of research toward particular points of view, reinforced by repetition in the literature; the second is that AIs are clearly already capable of producing papers that look sufficiently “real” to pass peer review. Depending on who is prompting the AI, such an AI-written paper may be designed to contain fake news or factually questionable information, or otherwise to damage the scholarly record. Such approaches can already be observed as publishers seek to address an onslaught of papers from papermills.

On the one hand, LLM technology may be able to perceive patterns beyond the capability of humans and hence point to things that humans have not thought of; on the other hand, LLMs do not generate true understanding-based intellectual insight, although they can appear to do so. This leaves AIs at an uncomfortable intersection of capabilities until they are truly understood by their users.

But, as Harari argues, the effect of AI tools on how we interact with each other and how we think has the potential to be profound. In Harari’s example, it is narrative storytelling that is the critical capability that can undo society. In research, we need to preserve young researchers’ ability to formulate narrative arcs to describe and relate their research to other researchers – while machine-readable research has been a long-held aim of those of us in the infrastructure community, human-readability must not go away. Asking an AI to write a summary or narrative for you using the current technology is a way to potentially miss key patterns, key results and realisations. For a mathematician this is analogous to using a symbolic mathematics program to perform algebraic manipulations rather than having a sense of the calculation oneself: if one lacks calculational facility then it is impossible to know whether the result really is true or whether the machine has made an inappropriate assumption.

In spite of these challenges, AI tools, used as a set of intellectual augmentation capabilities, hold intriguing possibilities for researchers. AIs already play significant roles in the solution of certain types of differential equations, in calculating optimal geometries for plasma fusion, and in detecting cancer – essentially anything where human pattern matching is not subtle enough or where one has to carry too many variables in a single mind.

Is the real gain to come from AIs being prompt engineers for us?

We started this blog by considering humans as prompt engineers for AIs – but is the real gain to come from AIs being prompt engineers for us? When we stare at a blank page trying to write an article, an AI does an excellent job of producing something for us to react to. Most of us are better editors than we are authors, so is one possible future one in which the AI helps with our first draft? Or one in which it challenges our thinking when we turn out to be the ones who are biased? Is AI the tool that ensures our logical argument hangs together? Or the tool that takes our badly written work and turns it into something that can be read by someone with a college-level education?

When writing a recent talk, I used ChatGPT as a sounding board – “Would you consider the current industrial revolution to be the third or fourth?” I asked. And then, “Is it generally agreed that the phrase ‘the exponential revolution’ is synonymous with the fourth industrial revolution?” Later I prompted, “Please make suggestions for an image that might be used to represent the concept of community.” The responses were helpful, as though I had an office colleague to bounce ideas off. Not all of them were accurate, but they gave me a starting point from which to jump off into other searches on other platforms, or into the books I had available to me.

Figure 2: Proportion of internationally collaborative journal-article-based research for the world’s six highest output research economies from 1971 to 2022. Note that volume is small for China before 1990 hence volatility is high. Source: Dimensions.

One perhaps scary notion is that AI may herald a reversal of the recent trend towards team research, back to the days of the lone researcher. Over the last half century the world has been moving forward with an increasingly collaborative research agenda (see Fig. 2) – collaboration has become critical in many areas. This need for collaboration often has its genesis in the need for resources, such as equipment too costly for a single institution or single country to hold. But collaborations also often arise from wanting to work with colleagues with particular skills or perspectives. In this latter case, LLMs can be a helpful tool. I have written before that it is a conceit to believe that you can be the best writer, idea generator, experimentalist, data analyst and interpreter of data for your research. But with LLMs, you may need to be capable of only a few of these in order to work alone once again.

Regardless of the actual future of AI and its role in research, it is clear that we are living in a world where AIs will take a role in our lives. Arthur C Clarke once commented that “Any sufficiently advanced technology is indistinguishable from magic”, and that is certainly how the outputs of LLMs appear to many of us – even those of us in the technology industry are impressed. But there is another side to witchcraft. In John le Carré’s Tinker, Tailor, Soldier, Spy, Operation Witchcraft takes over the Circus (the centre of the British intelligence service), leaving it completely vulnerable to false information from its enemies. In what is coming, it seems that we have no choice but to master witchcraft.

The post Tinker, researcher, prompter, wizard appeared first on Digital Science.

The importance of adding context https://www.digital-science.com/blog/2023/05/the-importance-of-adding-context/ Wed, 03 May 2023 12:59:07 +0000 https://www.digital-science.com/?post_type=tldr_article&p=62408 Adding context is one of the most important things that we do at Digital Science, whether that be through the tools that we make available to the research community while they’re making decisions, carrying out research or communicating their findings, or through our direct outreach and engagement with the community such as on our blogs and in our reports.
TL;DR is not just a place to write our thoughts – it is a conversation.  It is at once a place for us to test ideas and to showcase methods and techniques, and at the same it is a place to connect with a community, to receive feedback and so to better understand our context.

The post The importance of adding context appeared first on Digital Science.


We hope that it will be a friendly, safe environment where we can encourage diverse but respectful opinions and where we can challenge ourselves to ensure that we become the best supporter of the community that we seek to serve.

I’m looking forward to contributing my own articles to TL;DR in the near future, and in the meantime you might enjoy one of my more light-hearted looks at how a certain figure in popular literature and film makes a regular appearance in research.

Or for a more in-depth analysis of a more serious topic, here are three examples where the Dimensions database has provided useful context to one of the global challenges facing the world today:

For scholars’ eyes only? https://www.digital-science.com/blog/2023/04/for-scholars-eyes-only/ Thu, 13 Apr 2023 09:00:10 +0000 https://www.digital-science.com/?p=61942 Celebrating the 70th anniversary of James Bond, Digital Science CEO Daniel Hook investigates the character’s global impact on research.

The post For scholars’ eyes only? appeared first on Digital Science.

On the 70th anniversary of the publication of Ian Fleming’s first James Bond novel, Casino Royale, we ask the question: Why does James Bond have such a large footprint in scholarly literature? Our analysis reveals that Bond, James Bond, is about more than just espionage, vodka martinis and cinema studies.

Every so often a fictional character is so well drawn that even though they often embody the ideals or sensibilities of a non-contemporary era, with all the challenges that can present, they transcend their original zeitgeist to be constantly reinvented, renewed and, to use a modern term, rebooted for new generations.

Unravelling the academic impact of 007

The first edition of Ian Fleming’s novel Casino Royale (inset) was published 70 years ago on 13 April 1953. Daniel Craig (pictured) portrayed James Bond in the 2006 film adaptation of the book. James Bond remains the property of Eon Productions and Ian Fleming Publications.

In science fiction and fantasy, this is a familiar trope, with Doctor Who, Superman and Spider-Man all being prime examples of characters who receive frequent updates for contemporary audiences. Outside science fiction, you will be hard put to call to mind a character with the same enduring appeal and knack for self-reinvention.

The almost sole example of such a character is one Commander James Bond of the British Secret Service – a character who so thoroughly embodies the Britishness (even Englishness) of a certain style and period that it seems almost at odds with his longevity. And yet, this month he celebrates 70 years since first jumping off the page of Casino Royale, Ian Fleming’s 1953 novel that introduced the world to the suave sophistication of Cold War international espionage.

This first novel introduced readers to Bond’s car, a 1930 Blower Bentley (it was not until Fleming’s 1959 novel Goldfinger that Bond got his Aston Martin DB Mark III), the .25 Beretta (the Walther PPK was introduced in the 1958 novel Dr No), and the Vesper Martini (a vodka martini of the shaken rather than stirred variety that Fleming invented and named for Bond’s love interest, Vesper Lynd).

Despite the misogyny of Bond as originally written, and references to race that the publisher says are being revised, he has become much loved around the world, and lays claim to one of the most successful film franchises in the history of cinema. A major cultural export for the UK, Bond films have featured and established icons of the British music scene, including singer Dame Shirley Bassey and composer John Barry. In addition, the films have highlighted both British and non-British brands – pioneering brand positioning in movies – while making Q (no, not the one from Star Trek) a household name.

Bond has come to embody a certain brand of Britishness, a fact clearly acknowledged when Daniel Craig was chosen to appear as Bond escorting Her Majesty the Queen to the London Olympics in a short film prepared for the 2012 opening ceremony. And, as life sometimes imitates art (and perhaps also gives an insight into the wry sense of humour of a particular member of the Royal family), a decade later Daniel Craig was awarded a CMG (Companion of the Most Distinguished Order of St Michael and St George) in recognition of his services to theatre and cinema in the Queen’s 2022 Birthday Honours – the same honour given to the fictional Bond by Fleming in the 1957 book From Russia with Love.

Thousands of scholarly articles have been written about James Bond since his inception – but how do we know this, and what are they about?

A simple Dimensions search limited to titles and abstracts yields 674 references to James Bond, including the descriptively titled 2022 article “No Mr Bond, we expected you to die”: a medical and psychotherapeutic analysis of trauma depiction in the James Bond films, and A Psychological Study of the Modern Hero: The Case of James Bond. Arguably these are the articles in which Bond is a central focus of the work, but even at this level a quick look at the ANZSRC article classifications (recently updated to the new ANZSRC Field of Research (FoR) 2020 codes, as described in our recent paper) is revealing: code 36 – Creative Arts and Writing (with 3605 – Screen and Digital Media accounting for much of the two-digit-level assignment) accounts for only around 30% of the research output. So Bond has made his mark beyond the creative arts, even if Bond-themed titles do appear to be somewhat predictable (compare, for example, 2009’s “Compute? No, Mr. Bond, I Expect You to Die!” with our earlier-mentioned paper).

Figure 1: Advanced search in Dimensions to locate an exact phrase in the Dimensions full-text catalog (including more than 80 million articles at the time of writing). Note the 32 authors that need to be removed as a result of their name containing the string “James Bond”.

Using Dimensions’ advanced searching capabilities, we quickly find that James Bond’s impact on research discourse is much larger than the apparently meagre 674 articles from the basic search above would suggest. If we broaden the search to use Dimensions’ Exact Search (one of the advanced search tools that allows more powerful, fine-grained searches of the full-text corpus behind Dimensions), we can identify more than 28,000 articles that mention James Bond. Because this more advanced search includes full text, we need to be more careful with our methodology: the query must be modified to exclude all 32 authors who are fortunate (or indeed unfortunate) enough to have the name James Bond contained within their own names.
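As an illustration of that clean-up step, the author-name filter can be sketched as a simple post-processing pass over retrieved records. The record structure and helper names below are hypothetical – this is not the Dimensions export format or API, just the shape of the logic:

```python
# Hypothetical record structure for illustration: each result carries a
# title, an abstract, and a list of author names.

def mentions_bond_in_text(record):
    """True if 'James Bond' appears in the title or abstract text."""
    text = f"{record.get('title', '')} {record.get('abstract', '')}"
    return "james bond" in text.lower()

def has_author_named_bond(record):
    """True if any author's name contains the string 'James Bond'."""
    return any("james bond" in name.lower()
               for name in record.get("authors", []))

def true_mentions(records):
    # Keep genuine textual matches, but drop records that surface only
    # because an author happens to be called James Bond.
    return [r for r in records
            if mentions_bond_in_text(r) and not has_author_named_bond(r)]
```

In practice this exclusion was applied within the Dimensions query itself; the sketch simply shows why a full-text match needs the extra author check that a title-and-abstract search does not.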

In this expanded dataset, references to Bond can be more tangential – for example, as a cultural reference: Bond as a relatable example, a gateway or a framing for a set of ideas, or to quickly orient the reader to a specific era, or a set of values. Indeed, in this expanded dataset, ANZSRC FoR code 36 – Creative Arts and Writing – is no longer the dominant category, with code 47 – Language, Communication and Culture – taking the top spot. 

However, even this new dominant category only occupies 12% of the “Bondverse”, with a much greater diversity of topics playing a role, including FoR 44 – Human Society with 7.7%, 43 – History, Heritage and Archaeology 4.0%, 35 – Commerce, Management, Tourism and Services 3.7%, and 46 – Information and Computing Sciences at 3.5%. Indeed, articles in the Bondverse have been written on Gender Studies, Built Environment and Design, Political Science, Philosophy and Religion, Psychology, Marketing, Biomedical Sciences and Law, all of which are able to use James Bond as a gateway to help people relate to their topic.

The brand of Bond is so powerful that it is often mentioned through other affiliations, such as those with particular artists as in “Man vs the machine: The Struggle for Effective Text Anonymisation in the Age of Large Language Models”, where singer/songwriter Adele is the principal focus of the commentary, but where Bond receives a collateral mention; or where Bond’s connection to those wonderful gadgets and cars from the long-suffering Q means that he is a natural point of reference as in Automated Driving in Its Social, Historical and Cultural Contexts. Each year, a consistent 1000 or so articles refer to James Bond (approximately the output of a medium-sized research institution). Outlets that regularly publish articles referring to James Bond include SSRN, the Journal of Cold War Studies (MIT Press Direct), Lecture Notes in Computer Science (Springer Nature), The Historian (Taylor & Francis Online) and Nature. It is perhaps of little (Quantum of?) solace to the Journal of British Cinema and Television and Film Quarterly that they are some way down the list.

Figure 2: References in the research literature to well-known fictional characters. Source: Dimensions.

Of course, Bond is not alone as a fictional figure who has made his mark in the research literature; there are other prominent fictional characters that we use as a shorthand for cultural references. James Bond fares well in these stakes, beating more recent characters such as Ethan Hunt from the Mission: Impossible franchise, and Jason Bourne. But he has not yet attained the same level of cultural embeddedness as more established figures such as Sherlock Holmes (who even has his own adjectival form, “Holmesian”), Batman (for which our analysis, perhaps unfairly, also includes mentions of Bruce Wayne, but does remove authors with the name Bruce Wayne as well as publications from Batman University in Turkey), or indeed Mary Poppins. The one modern fictional character who seems to defy all the rules is Harry Potter, but that is for another article and a different anniversary.

Figure 2 prompts another question that goes beyond Bond: despite possessing cult status or serious literary impact, it seems that female characters are not getting their due as cultural gateways to support narratives in research literature. Searches for Elizabeth Bennet (from Jane Austen’s Pride and Prejudice) produce a mere 1,687 research outputs; Jane Eyre does a little better, being mentioned in almost 12,000 outputs; but Hermione Granger, a significant source of inspiration for many up-and-coming researchers, is mentioned in a mere 562 publications – not yet enjoying the same level of success as her literary school friend Harry, despite being the one who does all the research in the books! Anna Karenina, meanwhile, has given her name to an “effect”, “bias” or “principle” depending on the field, all of which have made the translation of her brand to the research environment successful.

This lack of reference to female characters from fiction in the research literature is not a surprise, but it is a loss: female characters are just as well drawn as male ones, often more relatable, and hence well suited to performing these key roles in research narrative, helping to render research itself more relatable. This is a complex sociological issue that deserves more research. At a high level, a simple explanation may be that the male-dominated media of the past is responsible for establishing male characters in the zeitgeist, and that a male-dominated research ecosystem (also of the past) is more apt to use male characters to make its points. However, the fact that these practices endure today is something that requires more analysis and attention, at least in the opinion of this author.

Whether or not this is “No time to die” for Bond is not in question from the perspective of research literature. It is, however, clear that references to Bond serve not only narrative or contextual use cases, but invite us instead to ask more challenging questions. In the final analysis, whether he will ultimately die another day or whether he will only live twice are questions only James Bond can answer.

About Dimensions

Part of Digital Science, Dimensions is the largest linked research database and data infrastructure provider, re-imagining research discovery with access to grants, publications, clinical trials, patents and policy documents all in one place. www.dimensions.ai

Symplectic at 20: Thoughts from Digital Science’s CEO https://www.digital-science.com/blog/2023/03/symplectic-at-20-thoughts-from-digital-science-ceo-daniel-hook/ Thu, 09 Mar 2023 09:50:28 +0000 https://www.digital-science.com/?p=61383 Digital Science CEO and co-founder of Symplectic Daniel Hook reflects on why Symplectic is a special partner within the research community.

The post Symplectic at 20: Thoughts from Digital Science’s CEO appeared first on Digital Science.


Daniel Hook, one of the co-founders of Symplectic and now CEO of Digital Science, reflects on the past 20 years of growth and change at Symplectic – and what makes it such a special partner within the research community.

Twenty years is a long time in tech but a short time in the world of research. There are other, perhaps more appropriate, measures of Symplectic’s age: in UK terms, Symplectic is ‘three REFs’ old; from a New Zealand perspective it is just two PBRFs; and in an Australian context it is four (and a bit) ERAs old. From a software development perspective, Symplectic is six major versions old. From a client perspective, it is more than 120 installations old. From a personal perspective, it is two CEOs old – indeed, around Christmas this year I will become the second-longest-serving CEO of Symplectic, having moved into the Digital Science leadership team in 2015 and handed the reins of Symplectic over into the capable hands of Jonathan Breeze.

As with almost any 20-year-old, this one – started by four friends who happened to share an office while doing their PhDs – has grown to be almost completely unrecognisable. And yet, things that were important to us when we founded the company 20 years ago remain at the heart of what we do now. I like to think that there are two guiding principles in what Symplectic does: firstly, whatever we do, we do it collaboratively; secondly, we want to save people time. Other things flow from these: bringing an academic perspective; helping people to make better decisions; ensuring that data are re-used; preserving key aspects of choice in how users of Elements are able to work with the data it contains; interoperability between systems; and so on. At its core, each of these is an expression of those two guiding principles.

Setting collaboration at the centre of Symplectic’s world has created a very special ethos in the company, as both those inside the company and those who work with Symplectic’s team will attest. Symplectic’s story is not just about those of us who founded the company or have been part of the team – it is a story shared with Symplectic’s wider community. There are simply too many people to name who have played pivotal roles in making Symplectic the company it is today. I know this because, in preparation for this blog, I tried to write such a list and found myself with more than 50 names simply from my time as CEO during Symplectic’s first ten years – and that list specifically did not include the many colleagues and friends who were actually part of the Symplectic team itself over that period. I can only imagine that Jonathan Breeze, my successor, has a list at least as long, as the company has expanded significantly under his tenure. All these contributors have made Symplectic what it is today.

Symplectic enjoys a special level of collaboration with its clients, partners, friends and colleagues. So many over the years have taken a long view – not focusing solely on their own project or installation but giving their time and knowledge generously. This has not only created a company and a piece of software, but also a shared store of deep domain knowledge. Every relationship has gone toward ‘paying it forward’, so that the broader Symplectic community benefits from the innovations and ideas of each participant. When, in the early phase of Symplectic’s development around 2008, a perceptive UK-based client observed, “You’re really just centralising development funding from many universities so that you can give us a great product and keep it moving forward in a way that we can afford”, they were not wrong.

Our second focus of saving people time sits as a key part of this collaborative relationship. In that regard, Symplectic has moved from serving a single institution in 2003 to being fortunate enough to collaborate with institutions around the world to help them save time for their researchers.

Symplectic’s work is trusted around the world, saving time every day for more than 500,000 academics and administrators in 18 countries. The clients of Symplectic hold more than 8.8m distinct publications sourced from different data sources, saving academic and administrative time every time an article is added to their Symplectic Elements system, full text is deposited, or data is reused in other systems to inform decisions, help annual reviews or advertise the expertise of colleagues to potential partners around the world. With the help of Dimensions, I estimate that:

  • Just over 7% of global annual output is recorded by organisations in a Symplectic Elements system in an automated way that minimises the need to rekey research metadata records.
  • 23% of global green open access articles are associated with at least one Symplectic Elements instance, saving time for academics to deposit their work into institutional repositories.
  • 17.5% of global citations land on articles stored in Symplectic Elements instances, while 15.5% of Nature papers are captured in Elements instances.
  • Approximately 64% of articles associated with Symplectic’s clients have an Altmetric mention (compared to a global average of 27%).
  • 72.5% of New Zealand’s research article output is captured in a Symplectic Elements system, as well as 74% of funder-acknowledging publications and almost 81% of New Zealand’s university-produced research.

It has been an honour to work with the Symplectic team over the last 20 years – to see their progress, their dedication, and their spirit. As you can see, they have carved out a unique path and make a real impact in the world with the people they support. Here’s to the next 20! 

And, of course, to borrow a phrase… Vive la Symplectic! 

This post was originally published on the Symplectic website here.

Will we only ever dream of endless energy? https://www.digital-science.com/blog/2023/01/will-we-only-ever-dream-of-endless-energy/ Thu, 12 Jan 2023 11:50:58 +0000 https://www.digital-science.com/?p=60318 Digital Science CEO Daniel Hook discusses the impact of recent developments in nuclear fusion, and why his 14-year-old self got it wrong.

The post Will we only ever dream of endless energy? appeared first on Digital Science.

The National Ignition Facility (NIF) has achieved fusion ignition using powerful laser systems and x-rays.
Image credit: NIF, Lawrence Livermore National Laboratory, US.

The recent nuclear fusion ignition event at the National Ignition Facility at the Lawrence Livermore National Laboratory in California is a triumph of modern science and of the persistence of scientists who continue to strive to solve some of the most difficult technical and engineering challenges of a generation. However, it is important to see this development in a broader context of global events as well as the research environment that has been created to support the nuclear energy developments upon which society is increasingly likely to depend in the coming years.

Did we vote for this?

It may be argued that geopolitics has been driven by an energy agenda since the late 19th century, when the industrial revolution had moved solidly beyond the borders of the UK and countries began competing for global resources to fuel their burgeoning industrial economies. As our economies have become larger so has our need for energy. Most recent wars (including the one in Ukraine) have been about control of energy resources – oil or gas. As supplies become more scarce or more expensive to extract, tensions will rise. While voters do not vote (in most cases) directly to support a specific energy-based geopolitical stance, in recent years energy has become a more overt topic in elections.

Even in countries where energy independence is a critical geopolitical issue, green parties do not command a large percentage of the vote, nor do mainstream political parties necessarily have well-articulated policies on energy independence. In Germany, a country whose significant dependency on foreign energy (63.7%) has been in the news this year, the Greens garnered 20.5% of the vote in the 2021 federal elections. Meanwhile, in The Netherlands and Belgium next door – countries with even higher dependencies on foreign energy (68.1% and 78% respectively) than Germany – green parties have begun to slowly gain ground.

This is perhaps due to the fact that our homes have, until this winter, remained warm at a reasonably affordable cost. However, the phase change that we have all experienced in 2022 (for some very painfully) is a sign of things to come. Indeed, if electorates were to cast their votes more directly based on the growing issues of energy dependence, we might see a significant change in the political landscape in the next few years. Trading blocs like the EU may become more robust in their energy policy – we have already seen the establishment of the EU Energy Platform to start to mitigate the effects of dependency on Russian gas. Being outside such a bloc in current times appears foolish at best.

Enter the apparent saviour of the day, courtesy of a nuclear fusion experiment from the National Ignition Facility (NIF) at Lawrence Livermore National Laboratory in California. Although a number of media outlets have hailed the result as a solution to our energy problems, we need to be careful about being overly optimistic. Anyone who has had an interest in nuclear fusion knows that we have been 30 years away from commercial nuclear fusion for the last 40 years. Indeed, it will come as a surprise to precisely no one who knows me that the seminar I gave in English class 31 years ago as a 14-year-old was on tokamak fusion. I clearly recall stating that nuclear fusion was 30 years away. Which just goes to show – I was wrong!

But, this all sounds a bit dangerous…

Perhaps unsurprisingly, some voters have been worried about the risks of developing nuclear solutions. Harnessing the energy source that, uncontrolled, underlies the most destructive weapons our species has ever produced – and which powers the Sun, and consequently our entire lives – is an elusive and sometimes perilous pursuit. Classic science fiction such as Asimov’s Robot series, and TV shows like the 1980s adaptation of Buck Rogers, have painted vivid post-apocalyptic atomic horrors in our minds alongside the promise of success. For many, fusion is not just a technology but a cultural phenomenon. It looms large in our collective consciousness partly because it has been in development for so long and holds so much power for both positive and negative outcomes. For a young researcher it is a beguiling field of study – some of the best minds on the planet, for several generations, have wrestled with taming nuclear fusion.

Figure 1: Timeline of the key developments in nuclear fusion research.

Our knowledge of both forms of nuclear energy – fission and fusion – originates in Einstein’s famous observation that energy and mass are equivalent: E = mc². In the case of nuclear fission (the process used in current nuclear power plants and in the earliest atomic weapons), heavy elements such as uranium and plutonium are used. A heavy element is one with many protons and neutrons in the nucleus of each atom. A configuration of many protons and neutrons (beyond 92 protons) is unstable, meaning that the energy required to keep the nucleus together is more than if the atom were to split into two (or more) lighter elements. Just a little interaction with, say, a free neutron is enough to break the nucleus of some heavy elements into the nuclei of two or more lighter elements. As this process takes place a little energy is given off, which can be converted to heat to turn a turbine. The downside of nuclear fission is that you end up with residual elements that, while more stable than the original atoms in the reaction, are still radioactive and remain so for many years. Such waste products require careful storage in locations where they cannot damage living organisms.

Figure 2: Nuclear Fission versus Nuclear Fusion processes. In the left pane, a heavy element is broken apart via interaction with a neutron into two smaller (but still radioactive) elements and an amount of energy. In the right pane, a deuterium nucleus (a proton and a neutron) and a tritium nucleus (a proton and two neutrons) are brought together to form helium (two protons and two neutrons), a “spare” neutron and energy. In both cases, the right side of each pane is “energetically favourable”, which is to say that the configuration of protons and neutrons on the right of the interaction requires less energy than the configuration on the left, which means that energy is released.
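To make the fusion pane of Figure 2 concrete, the energy released by the deuterium–tritium reaction can be recovered directly from the mass defect and E = mc². This is a back-of-envelope check using standard atomic masses (in unified atomic mass units), not figures taken from the infographic itself:

```latex
% D-T fusion: the products weigh slightly less than the reactants
{}^{2}_{1}\mathrm{D} + {}^{3}_{1}\mathrm{T} \;\longrightarrow\; {}^{4}_{2}\mathrm{He} + {}^{1}_{0}\mathrm{n} + E

\Delta m = (m_\mathrm{D} + m_\mathrm{T}) - (m_\mathrm{He} + m_\mathrm{n})
         = (2.014102 + 3.016049) - (4.002602 + 1.008665)
         = 0.018884\,\mathrm{u}

E = \Delta m\,c^{2} = 0.018884 \times 931.494\,\mathrm{MeV} \approx 17.6\,\mathrm{MeV}
```

The “missing” mass – roughly 0.4% of the fuel – is what emerges as energy, most of it carried away by the neutron.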

Nuclear fusion, however, is a process that takes place at the other end of the periodic table, with very light elements. Per unit of fuel mass, a fusion reaction produces several times more energy than a fission reaction. In addition, the by-products are not long-lived radioactive waste – just helium, some neutrons and energy (the neutrons can mildly activate reactor materials, but nothing comparable to fission’s waste problem). In essence, nuclear fusion is a clean energy source. Such is its promise that some of the best minds in physics have worked on nuclear fusion over the last century. Today, the best minds are supplemented by AIs, which help to optimise calculations and design the next generation of test reactors.

There are many approaches being developed as candidates for a commercial nuclear fusion reactor. The main ones include: magnetic confinement fusion (the type involving ring-style devices – probably the most famous until the recent announcement from NIF); inertial confinement fusion (the type reported on recently); laser-driven fusion; magnetised-target fusion; acoustic inertial confinement fusion; Z-pinch fusion; muon-catalysed fusion; and nuclear reaction control fusion. Each of these approaches has a different risk profile and different pros and cons, but a successful solution may well need learnings from several of these technologies.

While the experiment recently reported from the NIF is a significant step towards nuclear fusion, it is not actually a “break even” event – if you include all the energy used in creating the reaction, the reaction still didn’t produce more energy than was put in. There is still a long way to go, but there may be value in making something of this step. Returning science to the public consciousness in a positive way, especially in the face of recent developments in Ukraine and their fallout in the oil industry, may have its benefits. But it will be important not to overplay the hand – presenting fusion as being “just around the corner” can backfire badly.

OK, so when will we have it?

Given the increasing importance of this technology to the future of humanity, one would expect to see a significant amount of research funding going into the various routes to fusion. And while the amount is substantial, it is perhaps less than might be expected.

Global competitive grant funding for fusion research runs at around USD $800 million per year. For comparison, the US spends around USD $45 billion per year on the total budget of the National Institutes of Health (NIH), and the world spends around USD $32 billion annually on Sustainable Development Goal-related competitive research grants.

I contend neither that health research is not critical, nor that SDG-related research is not an excellent way to spend public money. However, one might expect that an effectively limitless, clean energy source – one that would reduce global dependency on fossil fuels, make a considerable contribution to cutting both greenhouse gases and the cost of living, and also ease global geopolitical tensions – might warrant more than 1.5% of the annual funding spent on these other worthy and critical initiatives.

I don’t want to address issues of lobbying in this piece, as the point is well known; rather, I want to finish by exploring two points closer to research: firstly, the observation that metrics are powerful drivers of behaviour and, secondly, that immediacy seems to be critical in decision-making.

Over the last few years, the global nuclear fusion community has consistently produced around 4,000-5,000 research papers per year. Over the same period, the biomedical research community has produced between 800k and 1.25m papers per year, and SDG communities have published between 400k and 1m articles per year. A naive argument would be that fusion papers look expensive relative to papers in either SDG-related research or biomedicine. While it is objectively clear that these areas of research are not comparable in their nature, the incentives in the research world are heavily skewed toward paper production, which will tend to disadvantage nuclear fusion research. Of course, papers are only one measure of research output. The recent announcement with which I started this blog is a very tangible output of research and its media coverage is positive, but such events are few and far between and hence don’t easily feed a faster-paced research narrative.
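The “fusion papers look expensive” intuition can be made concrete with a deliberately naive back-of-envelope calculation, using the figures quoted above (midpoints taken where ranges are given; note the NIH budget is US-only while the paper counts are global, which is part of what makes the comparison naive):

```python
# Figures quoted in the text above (midpoints of quoted ranges).
fusion_funding = 800e6       # USD/year, global competitive fusion grants
fusion_papers = 4_500        # papers/year (midpoint of 4,000-5,000)

nih_budget = 45e9            # USD/year, total US NIH budget
biomed_papers = 1_000_000    # papers/year (midpoint of 800k-1.25m)

# Naive "cost per paper" in each field
fusion_cost_per_paper = fusion_funding / fusion_papers
biomed_cost_per_paper = nih_budget / biomed_papers

print(f"Fusion:      ${fusion_cost_per_paper:,.0f} per paper")
print(f"Biomedicine: ${biomed_cost_per_paper:,.0f} per paper")
```

On these (unfair) terms a fusion paper “costs” several times a biomedical one – exactly the kind of arithmetic that paper-production incentives reward and that fusion’s long-horizon experiments do not.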

At a more fundamental level, immediacy plays a critical role in this discussion. It took the better part of 20 years to build momentum for SDG-related research and funding, but similar levels of research output and funding were achieved for COVID research in just 24 months. The threat of not understanding the SDGs is not immediately evident in the lives of those in established advanced economies, or in large continental territories that are not so directly at risk from rising water levels or energy challenges – it has not been a burning platform for them. While the threat of COVID is not as existential or as long-lived for humanity as either the SDGs or the emerging energy crisis, the immediacy of the issue in the G20 made the topic instantly appealing both for funding and for publication.

At its heart, nuclear fusion suffers from a perception problem – it is always 30 years away. Because we don’t associate everyday challenges such as energy prices, war and economic stagnation with the absence of nuclear fusion from our portfolio of power options, we don’t make research decisions or political choices based on funding and solving this problem. Achieving it will require long-term alignment across the political spectrum, with consistent funding and clear strategic intent.

If the NIF announcement leads to a broad realisation that we are getting closer, and if voters – and hence politicians – take note of the seriousness of our situation, then perhaps another 30 years will not be needed.

Funding levels and publication counts in this article are sourced from Dimensions.

The post Will we only ever dream of endless energy? appeared first on Digital Science.

Five measures that chart the rise of Chinese influence in global research https://www.digital-science.com/blog/2022/10/five-measures-that-chart-the-rise-of-chinese-influence-in-global-research/ Tue, 18 Oct 2022 13:32:10 +0000 https://www.digital-science.com/?p=59266 If China continues on its current path, within a decade it will be vying for pre-eminence in its ability to influence the global research conversation.


If the story of the 20th Century is one of the decline of the power and influence of the West, then the 21st Century tells the story of the ascent of Asia and more specifically China. Indeed, the era in which we live currently, with the cultural and economic dominance of the West, is something of a historical aberration.

A 2012 report from McKinsey points out that for the better part of the last 2,000 years, the centre of economic wealth in the world was firmly positioned in the East, with a period from the 1500s until 2000 during which the centre of mass moved and dwelt (for a while at least) in the West. The Enlightenment and the Industrial Revolution took place first in Europe, and our own work shows the movement of the centre of mass of research from the late 1600s to the present day – a journey that starts in the UK, moves west towards the US, reaching its westernmost point in the mid-1940s, before moving ever more quickly eastward towards China.

This week’s Communist Party Congress will see the party hand leader Xi Jinping an historic third term. If we were to assess China’s growth in non-research terms during his leadership, we might look at financial output such as GDP at purchasing power parity (PPP), in which case we would notice that China overtook the US to become the world’s largest economy in 2016. Rachman’s 2016 book Easternisation presents a framework for thinking about the rise of China. While Rachman looks at the world through the lens of the Thucydides Trap (the Greek-inspired notion that when the pre-eminent nation in the world changes, there must be war between the incoming power and the incumbent), we take a different view in the context of research. Research leads to knowledge that should be the property of all humankind. Hence – perhaps naively – we take the view that the more countries choose to invest in research, the better it is for all: so long as that research is done openly and made openly available, more innovation and advancement is possible.

In this short blog, we propose a set of research-based measures to track the rise of China.

Specifically, we propose five metrics by which to rank countries on how influential they are in the world of research, listed in increasing order of importance and difficulty to achieve:

1. Percentage of GDP spent on research: Theoretically easy to achieve for most countries, since this is largely within the hands of the government of the day. It may be a leading indicator of research success but does not take into account the size of the economy, which is obviously a determining factor in how much difference this investment can make.

2. Gold Open Access (OA) publication volume: Slightly harder to achieve than metric 1 since it requires a change in culture and understanding of incentives. This is a stronger leading indicator of the increasing power of a research base.

3. Total publication volume: Highly related to metrics 1 and 2 – to consistently publish a large body of research in recognised journals requires significant investment in both research infrastructure and capacity, and, as the current trend is toward OA publication, a significant investment in Open Access as well.

4. Proportion of global citations: Harder still – producing high research volume is not the same as producing research that is highly noticed, read and cited. Garnering citations is critical to demonstrating that work is being noted.

5. Relative global influence: Using Eigenvector centrality, a network measure that can be considered a proxy for influence in an ecosystem, we calculate this quantity on the co-author graph. Broadly speaking this metric expresses the likelihood that a given paper picked at random from all papers in a given year has an author from a given country on it. It is not a probability but can be thought of as a related quantity.
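A minimal sketch of how metric 5 can be computed: eigenvector centrality via power iteration on a country-level co-authorship matrix. The matrix below is illustrative toy data, not Dimensions output; in practice each entry would count papers co-authored between a pair of countries in a given year.

```python
import numpy as np

# Toy symmetric co-authorship matrix: entry (i, j) counts papers with
# authors from both country i and country j; the diagonal holds each
# country's own output. Values here are purely illustrative.
countries = ["US", "EU-27", "China", "UK"]
A = np.array([
    [900.0, 300.0, 200.0, 150.0],
    [300.0, 800.0, 150.0, 180.0],
    [200.0, 150.0, 850.0,  90.0],
    [150.0, 180.0,  90.0, 300.0],
])

def eigenvector_centrality(adj, iters=500):
    """Power iteration: repeatedly multiplying by the adjacency matrix
    converges to its dominant eigenvector, whose entries score each
    node by how connected it is to other well-connected nodes."""
    x = np.ones(adj.shape[0])
    for _ in range(iters):
        x = adj @ x
        x /= np.linalg.norm(x)
    return x / x.sum()   # normalise so the scores sum to 1

scores = eigenvector_centrality(A)
for name, score in sorted(zip(countries, scores), key=lambda t: -t[1]):
    print(f"{name:6s} {score:.3f}")
```

As the text notes, the result is not a probability, but the normalised scores behave like a related quantity: a country’s score rises both with its own volume and with the strength of its links to other influential countries.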

This is necessarily a reduced list and focuses on highly quantifiable metrics. It does not assess the policy environment in a detailed manner – ease of cross-border travel and ease of access to visas for the purposes of collaboration, study and academic work are obvious ways in which a country can have a disproportionate effect on the research economy. It is also clear that attracting overseas students to study increases diversity at many levels and helps to create networks that can later result in fruitful collaborations – again, this is not something that we’ve considered here.

However, these five metrics are chosen to show causal development. Funding makes it possible to develop infrastructure and a research population, together with the willingness to publish in Gold Open Access journals. This then leads to building a more substantial capacity that can produce the consistently high research volumes required for excellent research to be part of the overall mix. While citations are not a proxy for quality, high-quality work is often more noted and hence more cited. Ultimately, if there is funding and serious research worthy of note, then this makes a country a destination for collaboration.

1. Percentage of GDP spent on research

Looking at Figure 1, below, we can see that Chinese investment in Research and Development (R&D) has increased steadily since 2000 to reach 2.4% of GDP in 2020. However, since around 2012 the US has increased its own investment in R&D, a trend mirrored by the EU-27 countries. At the same time, GDP at PPP has increased significantly for China. In 2021, China’s GDP at PPP was $27.3tn USD, versus $23tn USD for the US, $21.6tn USD for the EU and $3.34tn USD for the UK. This means that while the US still outspends China in absolute terms, the gap between the two countries is narrowing, with China spending around 20% less than the US. If the Chinese and American economies continue to grow at their current rates (3.2% for China versus 1.6% for the US) for a sustained period, China would be spending more than the US on its research base by 2032 without needing to increase the percentage of GDP invested. Of course, China may see a slowdown given the current financial picture, but it has also been clear that the country is keen to invest in research, and it may well be that China chooses to increase the percentage of GDP committed to its rapidly growing research economy.

Figure 1: Percentage of GDP spent on research by country between 2000 and 2020. Source: World Bank.

2. Gold Open Access publication volume

Advanced research economies have typically invested heavily in open models of publishing and sharing research over the last decade. The reason we have focused on Gold rather than Green OA here is that Gold OA combines two political components: firstly, the willingness to adopt policies that make research broadly openly available and, secondly, the ability to fund OA. Green OA, which is frequently considered a more progressive form of OA, is often more difficult to track, since it is usually funded through infrastructure, which is harder to trace than Gold OA article processing charges. The UK has been a leader in Gold OA alongside countries such as Australia, Brazil and India. However, if we look at the main “blocs” – China, the EU and the US – we see that the EU (see Figure 2) has historically been the most committed to supporting Gold Open Access, as can be seen in its overall volume. In 2009, China only equalled the UK in Gold OA volume, but there is a clear inflection around that point, when China accelerated, overtaking the US around 2017 and looking to be on course to overtake the EU this year (based on an extrapolation of partial-year data).

Figure 2: Volume of Gold Open Access publications (article and conference proceedings) by country between 2000 and 2022 (partial year). Source: Dimensions from Digital Science.

3. Total publication volume

Another obvious marker of research development is the total volume of publications. This is harder to achieve than the previous two markers, as a sustained high level of production requires long-term development of infrastructures to support research, as well as feeder mechanisms such as training for undergraduates, PhD students and postdocs. It generally also requires a vibrant research community and opportunities to collaborate internationally (discussed further below). In Figure 3, we see that not only will China surpass the US this year but it also looks likely to leapfrog the EU in production volume.

Figure 3: Total volume of publications (article and conference proceedings) by country between 2000 and 2022 (partial year). Source: Dimensions from Digital Science.

4. Proportion of global citations

To be considered the pre-eminent research country, it is not merely about research volume but about whether research is noteworthy enough to be cited. Since 2000, the research world has diversified significantly on the global stage, with many countries beginning to play an active role in developing their research economies. As a result, the overall share of global citations garnered by established actors such as the US and UK has naturally decreased. The EU has grown its share (as the old eastern bloc countries began to invest and develop more seriously). But the big winner has been China, moving from a tiny single-digit share of global citations in 2000 to 13.5%, almost a full two percentage points ahead of the EU-27. While this is still dwarfed by the almost 31% attracted by the US, it is clear that China is producing a high level of very noteworthy research.

Figure 4: Proportion of global citations (as a percentage of overall global citation) by country between 2000 and 2022 (partial year). Source: Dimensions from Digital Science.

5. Relative global influence

Finally, we use a network measure called eigenvector centrality on the co-authorship graph to work out which countries are the preferred partners for collaboration (Figure 5). The EU-27 countries have continued to be the favoured research partners over the last two decades. This metric is heavily influenced not only by the large volume of papers produced by the EU-27 countries but also by their strong links with other collaborators such as the US, UK, China and beyond. Of course, each individual country in the EU-27 would look significantly weaker on its own; however, there is significant power to be gained from being part of the bloc, as can be seen from the network effect highlighted here. Thus, coordinated funding streams such as Horizon 2020 have built an excellent platform for the research influence of the EU-27.

Figure 5: Eigenvector centrality “influence” of countries based on the global co-authorship graph between 2000 and 2022 (partial year). Source: Dimensions from Digital Science.

The relatively smaller size of the US together with its stronger internal collaboration network places it second in the list. The UK outperforms on this measure due to a number of historical advantages – its strong global connections through the Commonwealth; its past relationship and general geographic proximity to the EU-27; its historically strong relationship with the US; and the establishment of English as the global language of research. These factors all mean that the UK is something of a destination for students, who then either stay and create connections to their home countries or return to their home countries and continue to collaborate with their UK-based colleagues.

China, by comparison, has not yet had time to build a large and complex network of global collaborations. At the same time, it is growing its research capacity so rapidly that few countries have the absorptive capacity to work with China at the scale that is possible. This tends to imply that China’s research collaborations are currently more internally focused than might have been the case had its research base grown to its current size more slowly. However, it is still clear that China is quickly developing into a highly collaborative global partner with scale.

This trend will be highly relevant for the scholarly communications industry. As the great and the good of academic publishing descend on Germany this week for the Frankfurt Book Fair, they are acutely aware of the rapid increase of Chinese research, and strategically one of their main challenges is how to attract authors to their books and journals and increase their market share of content. Given recent policy changes in China, traditional citation measures as represented by metric 4 and modern vectors such as metric 5 could combine to inform publishing strategies around, for example, how to encourage and facilitate global collaboration with Chinese authors.

Closing thoughts

Each of the five measures is successively harder to achieve pre-eminence in and, in some sense, one leads to the next. A country can decide to spend a large amount of its GDP on research if it values research and believes in its long-term effects for its people. Of course, there are two aspects to this – government spending and industry spending. A government can encourage industry spending on research with the local tax environment and other inducements, but, at the end of the day, this is also a cultural phenomenon. Those who believe in the value of research will generally invest. As a country becomes richer and levels of education increase, it often chooses to invest in research for future prosperity and the long-term benefit of its people. Increasingly, the richest companies appear to see things similarly.

Once a research economy is established, there is clear value in sharing results through open access to increase the volume of material available upon which to build, which leads to metric 2. Of course, if a country is wealthy then paying for open access is also within reach. More generally, it is important to have sufficient research volume to ensure that a proportion of that research is of high quality – an effect that only tends to happen at a certain scale of endeavour – and hence metric 3 becomes important. As metric 3 is achieved, the international community should begin to recognise the value of the research being produced and it should become more cited, leading to metric 4. Finally, in achieving high-quality research at scale, the country becomes a destination for collaboration and gains influence in the global social research network, which is metric 5.

Thus far, China has established itself sufficiently in metric 1 that it has been able to achieve pre-eminence in metrics 2 and 3. (This development may be surprising to some, as it was not obvious that China would overtake both the US and the EU-27 in the same year on metric 3!) Metric 4 tells a broader story – that citation patterns are diversifying. It is not merely that the proportion of citations to the US is dropping and switching to China, but rather that it is dropping in favour of greater geodiversity, of which China is one beneficiary. South America, and Asia in general, are developing significant research economies, which is a positive trend. Finally, China’s development on metric 5 is impressive.

Within just a few years China’s global influence has developed to a point where it is clear that, if it continues on its current path, within a decade it will be vying with the EU-27 for global pre-eminence in its ability to influence the global research conversation.

About Dimensions

Part of Digital Science, Dimensions is a modern, innovative, linked research data infrastructure and tool, re-imagining discovery and access to research: grants, publications, citations, clinical trials, patents and policy documents in one place. www.dimensions.ai

The post Five measures that chart the rise of Chinese influence in global research appeared first on Digital Science.

Research in the second Elizabethan era: A platinum age for the Commonwealth and the UK https://www.digital-science.com/blog/2022/09/research-in-the-second-elizabethan-era/ Thu, 29 Sep 2022 08:08:43 +0000 https://www.digital-science.com/?p=59085 What will be the legacy of Elizabeth II’s long-lasting reign for science and technology?

An era of astonishing change

From physical empires of the past to information and virtual empires in the modern era, the last 70 years have borne witness to astonishing change in both research and science and how they are communicated.

But where have these shifts occurred, and what do they tell us about the future? And what will be the legacy of Elizabeth II’s long-lasting reign for science and technology?

In the UK, Commonwealth countries, and around the world, people have mourned the passing of Queen Elizabeth II. For many, she will be remembered as a dedicated public servant who gave her support to charities and good causes, raising their profile and giving them a voice that helped them to be noticed in the world. For a good deal more than half a century, successive Prime Ministers have regarded her as a giver of context and a provider of a safe space to air concerns and discuss the challenges of the day that they could share with no one else. She has been widely recognised as a constant in our lives and, by some, as the embodiment of all things British in a changing age. An instantly recognisable figure, the Queen has played the role of an observer of that change, unable to be actively involved in politics but having privileged access to politicians, celebrities and changemakers globally during a period of great change.

Billboard in London’s Piccadilly Circus displaying a tribute to Queen Elizabeth II.
Image credit: Ocean Outdoor.

In this brief commemoration in honour of Her Majesty, we wanted to reflect on the changes in research – and in scientific research in particular – that have characterised the modern Elizabethan era. The Queen was crowned in 1953, the same year that Watson and Crick (with the significant help of Franklin) published their paper on the double-helix structure of DNA. Over the intervening 70 years, the UK has not only remained at the forefront of research but has made contributions that have been key to shaping the emerging exponential industrial revolution. From Sir Tim Berners-Lee’s pioneering work on the World Wide Web in the late 1980s to the London-based DeepMind team who solved the protein-folding problem with AlphaFold, the UK has been home to some of the greatest scientific advances of the late 20th and early 21st centuries.

Like her great-great-grandmother Queen Victoria, Queen Elizabeth married a man with a great interest in – and who was consequently a great supporter of – science and technology. The Victorian era is remembered not only for empire but also for innovation and its technological contribution to the world, and we suggest that the second Elizabethan era will be viewed similarly in this respect. In 1952, the UK produced around 2% of the world’s research output and the broader Commonwealth of Nations, with which the Queen was so intimately connected, around 3%. As of 2021, the UK produces around 4% of global research output and, together with other Commonwealth countries, around 14%.

The influence of the Commonwealth on the world stage often goes unseen, but network effects are powerful in today’s world. Figure 1 shows the relative global influence on research of each country or set of countries, using a technique based on the eigenvector centrality network measure that we have developed at Digital Science over the last few years. The core idea is that research volume and research citations show only a superficial picture of research strength. Research is becoming an increasingly collaborative pursuit, and success is born of finding the right people with whom to work. Eigenvector centrality calculated in this way mixes volume, citation and level of collaboration into one metric that, we argue, is an interesting proxy for influence. In the figure you can see that the global influence of the US was strong in the 1950s and 1960s, decreased in the 1970s and 1980s, stepped down once again in the 1990s, and has gradually waned since the turn of the millennium. On the other hand, China, which had little global research influence at all until the 1980s, has grown significantly in the last three decades, overtaking the UK in the last few years.

However, what is striking about Figure 1 is that the Commonwealth has not only established a solid foundation of global research influence (doubtless due to its large “surface area”, with 56 countries collaborating globally), but that it has also begun growing significantly in its influence in the last 20 years. Of course, when the Queen originally became the Head of the Commonwealth, there were just 8 member states and hence their influence would have been even less than shown in the diagrams here. Our analysis follows the influence of the full current 56 member states throughout the life of the Queen’s reign. While the influence of this group may be unsurprising as it counts highly developed research economies such as Australia, Canada and New Zealand amongst its number, it should also be recognised that it is a disparate collection of countries that includes small and developing economies as well as large ones.

The spread of internet technologies (an innovation that, as mentioned above, owes an important part of its lineage to the UK) has enabled smaller nations to play on the international stage of research – a distinct advantage for Commonwealth member countries, who share a common language, a common set of values and similar legal systems, all of which facilitate collaboration. In the age of the internet, distance is no longer a barrier to these countries working together on research projects. Hence, while the UK (taken separately from the Commonwealth) has generally waned in its international research influence, it is notable that it has been less susceptible to this decline than the US and the EU. We suggest that this is likely due to its strong ties with Commonwealth nations.

Over the modern Elizabethan era, the world has moved from an age of physical empires, through the space race and the computer revolution, to an age of information and virtual empires. From a political perspective, the UK relinquished its role as a global power and instead had to content itself with the role of influencer on the world stage. In an elegant parallel, the monarchy moved from being a "great Imperial family" to being influencers, both culturally and politically. The Queen's personal style transcended the geopolitical zeitgeist as she lent her personal brand to long-term projects such as the Commonwealth.

By all accounts the Queen believed in the Commonwealth as a group of nations that could make positive change, and she fought for that. In the case of research, as the influence of the US and EU gradually wanes, it may well be that the Commonwealth has the collaborative spirit, as well as the geographical and cultural diversity, to continue to influence the world positively. If true, this would be a worthy legacy for someone whose life was one of service.

About Dimensions

Part of Digital Science, Dimensions is a modern, innovative, linked research data infrastructure and tool, re-imagining discovery and access to research: grants, publications, citations, clinical trials, patents and policy documents in one place. www.dimensions.ai 

The post Research in the second Elizabethan era: A platinum age for the Commonwealth and the UK appeared first on Digital Science.

Inspiring dreams: The new James Webb Space Telescope
https://www.digital-science.com/blog/2022/07/inspiring-dreams-the-new-james-webb-space-telescope/
Wed, 20 Jul 2022

The new James Webb Space Telescope will continue to inspire new generations – but how does it compare to the Hubble?

“Cosmic Cliffs” in the Carina Nebula, approximately 7,600 light-years away from Earth. Image taken by the James Webb Space Telescope (JWST).
Image credit: NASA, ESA, CSA, and STScI.

As children we look up at the beauty of the night sky and are inspired to dream. I recall, as a small child, being fascinated by my father's books on astronomy and their beautiful pictures of now-familiar starscapes such as the Horsehead Nebula. That led me to join the astronomy club at school, to spend cold nights sleeping on the floor of the cricket pavilion, and to wake at the right hour with other similarly nerdy teens to peer through a telescope lens and see if we could locate the moons of Jupiter. How many of today's scientists (not just astronomers) are doing what they do in part because of some similar formative experience – a wonder about the universe and a desire to understand its mysteries?

A whole new generation of scientists may now have been inspired to dream and perhaps, one day, to pursue a career in research. The James Webb Space Telescope (JWST), launched at the end of last year, released its first images a week ago. While one does not need to be a scientist to find these shots breathtaking, it is humbling to think that we live in an age where "big science" events like this don't just happen once in a lifetime, but every few years. Indeed, the pace of discovery is accelerating, powered by the engineering and technology that public and private research ecosystems around the world are building.

It is said that familiarity breeds contempt, and there is perhaps a justifiable fear that the regularity of such advances may lead to a lack of anticipation or excitement, as happened with the US space program in the 1970s. But, as we will see below, at least in the global science community, JWST is already set to loom large in our collective psyche for some years to come and to bring huge value to our lives in so many ways, as Professor Monica Grady of the Open University has so eloquently set out.

The idea of going beyond our atmosphere to look at the stars was first suggested by American theoretical physicist Lyman Spitzer in 1946. His ideas led to a number of orbital observatories – the American Orbiting Astronomical Observatory OAO-2 in 1968 and the Soviet Orion 1 in 1971 – a lineage that eventually led to the most famous space telescope to date: Hubble. And it is Hubble that we use as a benchmark against which to compare the attention associated with JWST.

Riding on the success of the Moon landing, NASA put forward a paper in 1969 on the uses of a large space telescope, but it was not until 1977 that the ambitious project was funded. Six years later, in 1983, the project was given the name Hubble. While the terrible Challenger disaster of 1986 must have caused significant internal challenges at NASA, the project pressed on and Hubble was launched in 1990, with the first scientific paper based on its data being submitted on 1 October 1990.

In the subsequent 32 years, even with initial teething trouble, Hubble has not only gone on to profoundly advance our understanding of the universe in which we live – from helping to establish the existence of black holes, to detecting water vapour on Europa (one of those moons of Jupiter that I was searching for all those years ago) – but has also served as a platform for us to understand how to engineer devices that live in the vacuum of space.

When Hubble first returned results in 1990, I recall the media attention being massive. My perception is that the attention is similar today, but when I looked in Dimensions I was shocked to see that there were already almost 15,000 papers mentioning the JWST! I couldn't help but wonder whether JWST is already more famous than Hubble.

Figure 1: Scholarly mentions of Hubble versus JWST from Dimensions placed on a reference timeline zeroed to their first pictures. “T-0” is 1990 for Hubble and 2022 for JWST. Note that the name for JWST was announced around 20 years before the first pictures were released whereas Hubble was named just seven years prior to its first pictures being shared. Source: Dimensions.

It seems, at first glance, that the JWST is receiving significantly more attention than Hubble did at the same point in its existence. One might speculate as to the reasons for this – perhaps being named so far in advance of launch, or the controversy over the choice of name, increased attention on the telescope. However, the growth of research over the intervening years is not negligible either (indeed, Hubble itself has played an important role in the growth of astronomy and space science).

The field of Astronomical and Space Sciences (ANZSRC FoR 0201) has grown significantly in the last 30 years, and much of that growth stems from advances made possible by Hubble itself. Using the ANZSRC FoR (Field of Research) definition of the field, Dimensions suggests that around 6,600 papers, conference proceedings, preprints, monographs and edited books were produced in 1983, whereas in 2021 there were around 33,600 such outputs – a five-fold increase. (Book chapters are excluded from all the analyses here to remove peaks from astronomical encyclopaedia publications that skew specific years.)

Figure 2: Scholarly attention to HST and JWST in the academic literature – as in Figure 1, but with JWST mentions normalised based on factoring the growth in the field of Astronomy. Source: Dimensions.

Figure 2 shows the result of a simple approach to rebasing the JWST attention, to account for the different time periods over which that attention was received. We used the growth of the field defined by ANZSRC FoR code 0201 Astronomical and Space Sciences to create an inflation rate for the field from 1983 to the present day, then divided the JWST output number for each year from T-7 up to T-0 by the compound inflation rate for that year.
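As a sketch, the rebasing step might look like the following. The endpoint counts are those quoted above for FoR 0201 in Dimensions, but the assumption of a single constant compound growth rate between them (rather than Dimensions' actual year-by-year counts) is mine, made to keep the example self-contained.

```python
# Field "inflation" normalisation, assuming constant compound growth
# between the two endpoint counts quoted in the text for FoR 0201.
outputs_1983 = 6_600    # field outputs in 1983 (per Dimensions)
outputs_2021 = 33_600   # field outputs in 2021 (~5x growth)
years = 2021 - 1983     # 38 years between the two measurements

# Annual growth rate implied by the two endpoints
annual_rate = (outputs_2021 / outputs_1983) ** (1 / years)

def normalise(raw_count, years_since_1983):
    """Deflate a raw yearly mention count by compound field growth."""
    return raw_count / (annual_rate ** years_since_1983)

# A 2021 JWST mention count, for example, would be deflated by
# normalise(count, 38) before plotting it on a 1983-equivalent scale.
```

Deflating each JWST year in this way puts both telescopes' attention curves on roughly the same footing before comparing them on the zeroed timeline.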

Normalisation, however, must be handled with real care. There is one further edge effect which means we cannot trust the T-0 point on the JWST line: since 2022 (T-0 for the red line) is the current year, it is incomplete and cannot be compared with a full year. The same effect is visible in the dip at T+32 (2022 for the blue Hubble line). Thus, while 2022 looks to be a disappointing year for JWST, that is only because a partial year is being compared with full years. This will, I'm sure, turn out to be an amazing year for JWST publications, which are set for lift-off in the coming years if the example set by Hubble is followed.

Subjectively, it is often easy to recall the golden days of the past and how wonderful things were. In this case, however, we can see that the level of excitement around JWST, as measured through research publications, is entirely comparable with that around the launch of Hubble. Not only that: thanks to Hubble, this excitement has been sustained over more than three decades.

We at Digital Science wish the JWST team at NASA and around the world the very best for their coming data releases. This is the stuff of which dreams, and future scientists, are made.


The post Inspiring dreams: The new James Webb Space Telescope appeared first on Digital Science.
