Machine-First FAIR: Realigning Academic Data for the AI Research Revolution
https://www.digital-science.com/blog/2025/11/machine-first-fair-academic-data-for-the-ai-research-revolution/
Mon, 17 Nov 2025

The best way for humankind to benefit from research is to prioritize machines over people when sharing data. Here’s why.

We repeat the line that academic research needs to be Findable, Accessible, Interoperable and Reusable (FAIR) for humans and machines, which suggests the two should get equal priority. They should not: we should prioritize the machines, because machine-generated knowledge will accelerate knowledge discovery.

Humans can infer insights from sparse information in academic literature and datasets because we can seek out extra context online; machines currently cannot. To go further, faster, we need to move past human-powered knowledge discovery, and that means giving machines structure and pattern. Every research-generating organization should be prioritizing this.

Academia is Ignoring Decades of Advancement

Academic research generates more than 6.5 million papers and over 20 million datasets annually, each representing potential training signals for the artificial intelligence systems reshaping discovery. Yet most institutional data remains locked in formats optimized for human consumption rather than computational processing.

While most stakeholders recognize the theoretical merits of making data FAIR (Findable, Accessible, Interoperable, Reusable) for both humans and machines, the practical reality is starker: in an era where language models can process orders of magnitude more literature than any human researcher, we are still organizing our most valuable research assets for the wrong consumer.

The economic implications are substantial. Organizations like the Chan Zuckerberg Initiative (CZI) have committed over $3.4 billion toward AI-powered biology, funding projects ranging from their 1,024 GPU DGX SuperPOD cluster for computational biology research to the Virtual Cell Platform that aims to create predictive models of cellular behavior.

The Navigation Fund, with its $1.3 billion endowment, has invested in AI infrastructure through their Voltage Park subsidiary, while simultaneously funding open science initiatives focused on machine-actionable intelligence and metadata enhancement. Astera Institute has deployed portions of its $2.5 billion endowment to support projects like their $200 million investment in Imbue’s AI agent research and their Science Entrepreneur-in-Residence program specifically targeting scientific publishing infrastructure.

Meanwhile, the Allen Institute for AI demonstrates the practical returns on machine-first approaches through projects like their OLMo series of fully open language models, where complete training datasets, code, and methodologies are published in computational formats, and their Semantic Scholar platform, which processes millions of academic papers to extract structured, machine-readable knowledge graphs.


Yet the vast majority of academic institutions continue to publish their findings in PDFs or as poorly described datasets. While LLMs are getting better at ingesting multi-modal content, PDF is a format that remains surprisingly resistant to reliable automated extraction, despite decades of advancement in natural language processing. This is not merely a technical limitation. Modern large language models struggle with PDFs because these documents prioritize visual presentation over semantic structure. Critical information becomes trapped in figures, tables, and formatting that computational systems cannot reliably parse. A reaction scheme embedded as an image, a dataset described in paragraph form, or experimental parameters scattered across multiple tables represent precisely the kind of structured knowledge that could accelerate discovery if only machines could access it consistently.

The Architecture of Computational Research Infrastructure

The solution requires a fundamental reorientation toward machine-first data architecture. Rather than retrofitting human-readable outputs for computational consumption, we can take inspiration from pharma and industry writ large, which design their data flows to serve algorithms from the ground up, with human-friendly interfaces emerging as downstream products of that computational foundation.

Consider the transformation pathway implemented by teams working with Digital Science’s suite of computational research tools. We’re building workflows in our tools for automated knowledge extraction at scale. The extracted knowledge gains semantic coherence through integration into domain-specific knowledge graphs. Platforms like metaphacts (metaphactory) provide the infrastructure to align these signals with established ontologies while enforcing quality constraints through SHACL validation integrated into continuous deployment pipelines. The result is not merely a database of facts, but a queryable intelligence system that can answer novel questions through automated reasoning over validated relationships.

Simultaneously, the operational requirements of research continue through dedicated literature management systems. Tools like ReadCube maintain the audit trails and conflict resolution workflows that regulatory environments demand, while ensuring that every screening decision and data extraction connects to persistent identifiers. The curated evidence flows directly into the computational infrastructure rather than terminating in isolated spreadsheets.

The critical innovation lies in packaging. While human researchers expect PDFs and narrative summaries, machine learning pipelines require structured metadata that specifies exactly what each dataset contains, where to retrieve it, and how to interpret every field.

The Metadata Multiplier Effect on Repository Platforms

Academic data repositories like Figshare occupy a unique position in the machine-first FAIR ecosystem: they serve as the critical junction between human research practices and computational discovery. When researchers publish datasets with comprehensive, structured metadata, these platforms transform from simple storage services into computational assets that can feed directly into AI research pipelines. The difference lies entirely in how authors describe their work at the point of deposit.

The REAL (Real-world multi-center Endoscopy Annotated video Library) – colon dataset on Figshare: https://doi.org/10.25452/figshare.plus.22202866.v2

Consider two datasets published on the same platform: one uploaded with a generic title like “experiment_data_final.xlsx” and minimal description, the other with machine-readable field descriptions, standardized vocabulary terms, and explicit links to ontologies and methodologies. The first requires human interpretation before any computational system can make sense of its contents. The second can be discovered, validated, and integrated into training pipelines automatically. Figshare’s API can surface the rich metadata to computational systems, but only if researchers have provided it in the first place.

The platform infrastructure already supports the technical requirements for machine-first FAIR. Persistent DOIs ensure stable identifiers, while structured metadata fields can accommodate everything from ORCID researcher identifiers to detailed provenance information. When authors invest time in describing their data using controlled vocabularies, specifying units of measurement, documenting collection methodologies, and linking to relevant publications, they create computational assets rather than digital archives. The same dataset that might languish undiscovered with poor metadata becomes a valuable training resource when described with machine-readable precision.

This creates a powerful feedback loop. Datasets with excellent metadata get discovered and reused more frequently, driving citation counts and demonstrating impact. Meanwhile, poorly described data remains computationally invisible regardless of its scientific value. Platforms like Figshare could amplify this effect by providing better authoring tools that encourage structured metadata entry, perhaps even using AI to suggest appropriate ontology terms or validate metadata completeness before publication. The infrastructure for machine-first FAIR already exists; it simply requires researchers to embrace metadata as a first-class research output rather than an administrative afterthought. But this is an evolving field, and new standards are emerging that repositories need to engage with.

The Croissant format, a lightweight JSON-LD descriptor based on schema.org, provides this computational bridge. A single Croissant file enables any training pipeline to hydrate datasets without custom loaders while simultaneously supporting discovery through standard web infrastructure. 
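To make this concrete, a minimal Croissant-style descriptor can be assembled as plain JSON-LD. The sketch below is illustrative only: the field names loosely follow the schema.org vocabulary that Croissant builds on, the URLs are placeholders, and a real descriptor should be produced and validated with the official Croissant tooling rather than by hand.

```python
import json

def croissant_descriptor(name, description, license_url, content_url):
    """Build a minimal Croissant-style JSON-LD descriptor (illustrative sketch,
    not a spec-complete document)."""
    return {
        "@context": {"@vocab": "https://schema.org/"},
        "@type": "Dataset",
        "name": name,
        "description": description,
        "license": license_url,
        "distribution": [
            {
                # A FileObject tells a pipeline where the bytes live and
                # how to decode them, so no custom loader is needed.
                "@type": "FileObject",
                "contentUrl": content_url,
                "encodingFormat": "text/csv",
            }
        ],
    }

desc = croissant_descriptor(
    "endoscopy-annotations",
    "Frame-level polyp annotations for colonoscopy videos.",
    "https://creativecommons.org/licenses/by/4.0/",
    "https://example.org/data/annotations.csv",
)
print(json.dumps(desc, indent=2))
```

Because the descriptor is plain JSON-LD, it can be served from the dataset's landing page and indexed by standard web crawlers, which is what gives a single file both the "hydrate without custom loaders" and the discovery property described above.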

Practical Implementation in Institutional Contexts

The transition to machine-first FAIR follows a predictable arc when properly resourced. Initial implementations focus on proving the fundamental workflow with narrowly scoped pilot projects. A team might select a single dataset and one sharply defined outcome, perhaps drug-target interaction prediction or materials property modeling, and implement the complete pipeline from literature extraction through validated knowledge graph construction to machine-readable packaging.

The critical insight from successful implementations is the importance of automation as the second phase. Manual processes that work for pilot projects become bottlenecks at scale. The most effective teams invest heavily in converting their proven workflows into tested, continuous integration pipelines that enforce quality gates automatically. This includes SHACL validation for knowledge graphs, automated license checking, and provenance tracking.
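As a sketch of what an automated quality gate can look like, the check below rejects a deposit whose metadata is incomplete or whose license is not machine-verifiably open. The required fields and the license whitelist are illustrative assumptions, not an actual Digital Science pipeline; a production gate would also run graph-level SHACL validation.

```python
# Illustrative CI quality gate for dataset deposits (assumed field names).
REQUIRED_FIELDS = {"title", "description", "license", "doi", "fields"}
OPEN_LICENSES = {"CC-BY-4.0", "CC0-1.0", "MIT"}

def passes_quality_gate(metadata: dict) -> tuple[bool, list[str]]:
    """Return (ok, problems) for one metadata record."""
    problems = [f"missing field: {f}"
                for f in sorted(REQUIRED_FIELDS - metadata.keys())]
    if metadata.get("license") not in OPEN_LICENSES:
        problems.append("license is not machine-verifiably open")
    return (not problems, problems)

# A deposit with only a title and license fails: description, doi and
# field documentation are missing.
ok, problems = passes_quality_gate({"title": "Colon annotations",
                                    "license": "CC-BY-4.0"})
```

Run as a required step in a continuous integration pipeline, a gate like this turns metadata quality from a recommendation into an enforced contract.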

Production deployment requires infrastructure investments that many academic institutions are not yet considering. Successful implementations provide stable, resolvable URLs for every dataset and descriptor, enable content negotiation so that both machines and humans receive appropriate formats, and implement comprehensive monitoring of data quality trends and usage patterns. This is the stack that Digital Science can provide.
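The content-negotiation requirement can be sketched as a small dispatch on the HTTP Accept header: machine clients asking for JSON-LD receive the structured descriptor, while browsers receive the human landing page. This simplified Python sketch ignores the q-value ranking defined in RFC 9110, and the media types served are assumptions about a hypothetical repository:

```python
def negotiate(accept_header: str) -> str:
    """Pick a response format from an HTTP Accept header (simplified:
    no q-value weighting, first-match on media type)."""
    accepted = [part.split(";")[0].strip()
                for part in accept_header.split(",")]
    if "application/ld+json" in accepted or "application/json" in accepted:
        return "json-ld"   # structured descriptor for machines
    return "html"          # readable landing page for humans

negotiate("application/ld+json")                    # machine client
negotiate("text/html,application/xhtml+xml;q=0.9")  # typical browser
```

The point of the sketch is architectural: one persistent URL serves both audiences, so the identifier printed in a paper and the identifier used by a training pipeline are the same string.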

Quantifying Institutional Success

Organizations can assess their progress toward machine-first FAIR through several concrete indicators. Successful implementations demonstrate that every significant dataset resolves to a persistent identifier that returns structured JSON-LD for computational consumers while maintaining readable landing pages for human users. Knowledge graphs pass automated validation, maintain stable URI schemes, and support catalogued query patterns rather than requiring ad hoc exploration.

Literature workflows leave complete audit trails with PRISMA-compliant reporting that can be generated automatically rather than assembled manually. Licensing and provenance information becomes verifiable through computational means rather than requiring human interpretation. Most importantly, the time taken from initial hypothesis to trained model decreases as institutional infrastructure matures and teams spend more of their time on discovery rather than data preparation.

The research organizations that define the next decade will not necessarily be those with the largest datasets, but rather those whose data infrastructure works most effectively at computational scale. Every day spent optimizing publishing workflows for human-readable reports while leaving data computationally inaccessible represents lost ground in an increasingly competitive landscape.

The funders backing this transformation, from CZI’s investments in computational biology to Astera’s focus on AI-native research infrastructure, are betting that machine-first approaches will determine which institutions can effectively leverage artificial intelligence for discovery. The technical architecture exists today. The standards are stable. The remaining barrier is institutional commitment to prioritizing computational accessibility over familiar but inefficient human-centered workflows.

Academic research stands at yet another technology-driven inflection point. The institutions that embrace machine-first FAIR will find their research and their researchers having greater impact.

Digital Science launches new cutting-edge AI writing tools for 20+ million Overleaf users
https://www.digital-science.com/blog/2025/06/digital-science-launches-new-cutting-edge-ai-writing-tools-for-20-million-overleaf-users/
Tue, 24 Jun 2025

Overleaf’s AI Assist provides advanced language feedback and LaTeX code help

London, UK, Tuesday 24 June 2025

More than 20 million research writers worldwide now have immediate access to powerful new AI features from Digital Science through an optional add-on for Overleaf.

The add-on, called AI Assist, helps researchers write in LaTeX faster and smarter by combining the power of advanced language feedback with cutting-edge LaTeX AI tools.

Overleaf users can explore the new AI features with a limited number of free uses and upgrade at any time for unlimited access to AI Assist.

Overleaf is the world’s leading scientific and technical writing platform. A LaTeX editor, Overleaf was developed by researchers to make scientific and technical writing simpler and more collaborative. With the launch of AI Assist, Digital Science is bringing powerful AI features from its Writefull service to the global Overleaf community.

With the AI Assist add-on, Overleaf users can take advantage of:

Language and writing tools

  • AI-powered language feedback: Context-aware suggestions to improve grammar, spelling, word choice, and sentence structure, all tailored to the nuances of academic and research writing.
  • Contextual editing tools: Paraphrase selected text, summarize lengthy paragraphs, check synonyms in context, or even generate abstracts and titles with just a few clicks.

LaTeX tools

  • LaTeX error assistance: Instantly identify and fix LaTeX coding errors, to get documents compiling smoothly.
  • LaTeX code generation: Generate LaTeX code, including tables and equations, from simple prompts or even images, saving hours of manual coding.
  • TeXGPT: Ask TeXGPT to help with formatting, figure generation, custom commands, and much more.
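To give a sense of what LaTeX code generation produces, here is the kind of output a prompt such as "a table comparing three models by accuracy and F1" might yield. This is an illustrative sketch, not actual AI Assist output, and the numbers are placeholders:

```latex
% Illustrative generated table (placeholder data)
\begin{table}[ht]
  \centering
  \begin{tabular}{lcc}
    \hline
    Model      & Accuracy & F1   \\
    \hline
    Baseline   & 0.81     & 0.78 \\
    Fine-tuned & 0.88     & 0.86 \\
    Ensemble   & 0.90     & 0.89 \\
    \hline
  \end{tabular}
  \caption{Example comparison table.}
\end{table}
```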

Overleaf co-founder Dr John Lees-Miller, Senior VP of B2C Products at Digital Science, said: “The combination of language and writing tools within our AI Assist add-on means millions of Overleaf users can now write their research papers, theses, and technical documents more efficiently and effectively than ever before.

“These AI features will ensure they’ll spend less time wrestling with LaTeX code and perfecting their prose, and more time focusing on groundbreaking research. Users will be able to write with greater confidence, ensuring their documents are error-free, polished, and ready for publication, thanks to the AI Assist add-on.”

Digital Science CEO Dr Daniel Hook said: “Overleaf AI Assist is another example of how Digital Science is bringing tools to our community that save them time and help them to do more research. Responsibly developed AI tools are going to be at the core of giving time back to researchers over the next few years. We are pleased that users can now focus on the important tasks of communicating their research results to the world.”

Find out more about AI Assist and simplify your research writing today.

Overleaf’s AI Assist: Generate equations from simple prompts or images.

About Overleaf

Overleaf is the market-leading scientific and technical writing platform from Digital Science. It’s a LaTeX editor that’s easy enough for beginners and powerful enough for experts. Loved by over 20 million users, it’s trusted by top research institutions and Fortune 500 companies around the world. Users can collaborate easily with colleagues, track changes in real-time, write in LaTeX code or a visual editor, and work in the cloud or on-premises. With Overleaf, anyone can write smarter—creating complex, beautifully formatted documents with ease. Visit overleaf.com and follow Overleaf on X, or on LinkedIn.

About Writefull

Writefull is a Digital Science solution that helps researchers write better, faster, and with confidence, with AI tools that deliver everything from advanced English language edits to research-tailored paraphrasing. It also enables publishers to improve efficiencies across their submission, copy editing, and quality control workflows, and is trusted by some of the world’s leading scholarly publishers. Visit writefull.com and follow @Writefullapp on X.

About Digital Science

Digital Science is an AI-focused technology company providing innovative solutions to complex challenges faced by researchers, universities, funders, industry, and publishers. We work in partnership to advance global research for the benefit of society. Through our brands – Altmetric, Dimensions, Figshare, IFI CLAIMS Patent Services, metaphacts, OntoChem, Overleaf, ReadCube, Symplectic, and Writefull – we believe when we solve problems together, we drive progress for all. Visit digital-science.com and follow Digital Science on Bluesky, on X or on LinkedIn.

Media contact

David Ellis, Press, PR & Social Manager, Digital Science: Mobile +61 447 783 023, d.ellis@digital-science.com

NLP series: Speeding up academic writing
https://www.digital-science.com/blog/2020/04/nlp-series-nlp-in-academic-writing/
Tue, 21 Apr 2020

Discover how natural language processing (NLP) is transforming academic writing by enhancing readability, improving consistency, and supporting researchers in crafting clearer scientific texts.

In this week’s edition of our blog series on Natural Language Processing, we hear from two members of the team at Writefull, the academic writing support tool. Dr Hilde van Zeeland is Chief Applied Linguist at Writefull. After completing an MSc and PhD in Applied Linguistics at the University of Nottingham, UK, she worked for several years as a language testing consultant and a scientific information specialist before joining Writefull. Dr Juan Castro is one of the founders of Writefull. He completed his PhD in Artificial Intelligence at the University of Nottingham, UK, and held several postdoctoral positions at the same university before founding Writefull.

SEE MORE POSTS IN THIS NLP SERIES

Introducing Writefull

Writing is key to science. Whether it is journal articles, book chapters, reports or conference proceedings, most research is communicated through written texts. For most researchers however, writing takes up more time and effort than they would like. Fortunately, we now have Writefull: a tool that uses the latest Natural Language Processing (NLP) techniques to speed up the writing process.


Data, data, data, and models

NLP is a strand of Artificial Intelligence concerned with the automatic understanding and generation of human language. It can be applied to many purposes, such as predictive text, automatic translation, and text categorisation. Whatever the application, NLP techniques often rely on training models on vast amounts of data. As these models process batches of data, they acquire the knowledge needed for the task at hand. For predictive text, for example, they learn recurrent linguistic strings.

NLP models and Writefull

To help with academic writing, we need models to do three things:
1) to learn the recurrent patterns of academic texts;
2) to recognise when an author’s language does not follow these patterns, and;
3) to change such language so that it follows the expected patterns. 

Writefull suggests changes to academic writing based on the likelihood that a word or sentence is correct.

At Writefull we have spent the last few years developing and training models that do just that. We offer an editor in which researchers can write their text. They then get automatic feedback on their writing, and can accept or reject Writefull’s suggestions. The models that Writefull uses to give feedback have been trained on millions of journal articles. Thanks to this, they can spot when the author’s writing deviates from the norm – that is, from the expected language patterns as acquired from our dataset. In many cases, such deviations will be grammatical errors, but they can also include things like awkward wording or unnecessary commas.

Why AI beats grammar rules

Traditional language checking software uses grammar rules to check for fixed elements in a sentence. For example, they might ensure that the right prepositions precede certain nouns by coding rules such as: correct ‘at progress’ into ‘in progress’. 

Writing rules is certainly easier than training models. Once models work well, however, they are much more powerful. Rules are limited; even thousands of rules wouldn’t cover all of the mistakes that authors can make, whereas models can cope with any input: their knowledge is generalisable to any sentence. To give you an example, Writefull recently corrected ‘time of the day and day of the week’ into ‘time of day and day of the week’. Writefull knew that, in this context, ‘the’ precedes ‘week’, but not ‘day’. There are many of these usage-based norms, and it is impossible to cover all of them in a rule set, but a sufficiently trained model will eventually learn them.

Another downside of rules is their black-or-white nature. If an author’s sentence triggers a rule, it will be corrected regardless of the context, which can lead to false corrections. Models, on the other hand, look at the context to judge what suggestions are needed and can therefore give nuanced feedback. When Writefull spots that something is off in a text, it often gives the author the probability of their phrase and compares this to alternatives. For example, when writing ‘He is sitting on the sun’ in the Writefull editor, Writefull shows that ‘He is sitting in the sun’ is a more probable alternative, with 82% likelihood for the latter versus 18% for the former. In cases like this, Writefull does not give a harsh correction, but an insight into the likelihood of the author’s wording versus alternatives. Language correctness is, after all, not always black-or-white. Messiness and ambiguity, both inherent to language, are two key challenges in the field of NLP.
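The contrast can be sketched with a toy model. The hand-picked bigram table below is purely illustrative (Writefull’s actual models are vastly larger and trained on millions of journal articles), but it shows the principle: a suggestion expressed as relative likelihood rather than a hard rule.

```python
# Toy bigram "language model" with made-up probabilities.
BIGRAM_PROB = {
    ("sitting", "in"): 0.030, ("in", "the"): 0.40, ("the", "sun"): 0.010,
    ("sitting", "on"): 0.020, ("on", "the"): 0.35,
}

def phrase_score(words):
    """Product of bigram probabilities; unseen bigrams get a small floor."""
    score = 1.0
    for pair in zip(words, words[1:]):
        score *= BIGRAM_PROB.get(pair, 1e-4)
    return score

a = phrase_score(["sitting", "in", "the", "sun"])
b = phrase_score(["sitting", "on", "the", "sun"])
share = a / (a + b)  # relative likelihood of the "in" variant
```

Instead of a fixed rule like "correct 'on the sun' to 'in the sun'", the model reports that the "in" variant is the more probable of the two, leaving the final judgement, and any genuinely intended exception, to the author.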

The challenge of messy language

A challenge to Writefull – and to any NLP application – is noisy input. If an author writes sentences that are very different from the language that Writefull’s models know from training (i.e., from the journal articles), Writefull may fail to give accurate feedback. Think of an author messing up word order or making several serious grammar mistakes in one sentence. The challenge is therefore to identify those cases where it is best to not suggest anything, for a suggestion might turn out to be incorrect.

The possibilities are endless

At Writefull, we’re continuously exploring avenues to make our feedback even more accurate and complete. While Writefull currently gives feedback on many language features, including punctuation, prepositions, and subject-verb agreement, there are still plenty of science-specific features to cover. Academic writing may use virtually the same grammar as other genres, but it is highly specific in other respects, such as word use. We now have the technology in-house to expand, and in doing so we’re keeping a close eye on developments in the NLP field.
