Insight Generation

Slava Kurilyak

In this post, I discuss the release of the arXiv Dataset by Cornell University and three collaborators. I explain how the dataset provides a significant opportunity to derive technological insights from scientific papers or pre-print articles.

Cornell University and three collaborators (Joe Tricot, Devvret Rishi, Brian Maltzan) recently released (August 4, 2020) the arXiv Dataset on Kaggle.

Here's what you need to know.

For nearly 30 years, arXiv has served the public and research communities by providing open access to scholarly preprint articles across many fields, including computer science, mathematics, and physics. In academic publishing, a preprint is a version of a scholarly or scientific paper that precedes formal peer review and publication in a peer-reviewed scholarly or scientific journal.

According to the dataset's description:

To help make the arXiv more accessible, (Cornell University and collaborators) present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.

This dataset is a mirror of the original arXiv data. Because the full dataset is rather large (1.1 TB and growing), this dataset provides only a metadata file in JSON format. This file contains an entry for each paper, containing:
  1. id: arXiv ID
  2. submitter: Who submitted the paper
  3. authors: Authors of the paper
  4. title: Title of the paper
  5. comments: Additional info, such as number of pages and figures
  6. journal-ref: Information about the journal the paper was published in
  7. doi: Digital Object Identifier
  8. abstract: The abstract of the paper
  9. categories: Categories / tags in the arXiv system
  10. versions: A version history
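To make the metadata layout concrete, here is a minimal sketch of reading such a file in Python. It assumes the file stores one JSON object per line and that the categories field is a space-separated string; neither detail is stated above, so check both against the file you download. The two sample records are made up.

```python
import io
import json
from typing import Dict, Iterable, Iterator, List


def iter_papers(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one metadata record per non-empty line (assumes one JSON object per line)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def split_categories(record: Dict) -> List[str]:
    """Split the categories field (assumed to be a space-separated string)."""
    return record.get("categories", "").split()


# Two made-up records standing in for the real metadata file.
sample = io.StringIO(
    '{"id": "0000.00001", "title": "Example A", "categories": "cs.LG stat.ML"}\n'
    '{"id": "0000.00002", "title": "Example B", "categories": "math.CO"}\n'
)
papers = list(iter_papers(sample))
print(papers[0]["title"], split_categories(papers[0]))
```

Because iter_papers accepts any iterable of lines, an open file handle can be passed in directly, so the whole file never needs to fit in memory.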

arXiv preprints provide signals (or event triggers) of emerging trends and technologies. The arXiv dataset on Kaggle presents a significant opportunity to derive technological insights from scientific papers or preprint articles, at scale. This dataset allows anyone to better understand scientific and technological trends using manual, semi-automatic, and automatic insight generation.

Manual Insight Generation

Anyone can derive insights by reading free arXiv papers available on arXiv and indexed by Google. Generating or extracting insights this way requires individuals to read one research paper at a time. Let's call this manual insight generation.

Semi-automatic Insight Generation

Release of the arXiv dataset on Kaggle allows anyone to derive insights from existing research papers using knowledge graphs.

According to

A knowledge graph is a set of datapoints linked by relations that describe a domain, for instance a business, an organization, or a field of study. It is a powerful way of representing data because knowledge graphs can be built automatically and can then be explored to reveal new insights about the domain.

Knowledge graphs are secondary or derivative datasets: They are obtained by analyzing and filtering the original data. More specifically, the relations between data points are pre-calculated and become an important part of the dataset. This means that not only can each data point be analyzed fast and at scale, but so can each relation.

If the arXiv dataset is the primary dataset, a knowledge graph acts like a secondary dataset. While the knowledge graph is built automatically, insights are still derived manually. Let's call this semi-automatic insight generation.
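As a rough illustration of a knowledge graph acting as a secondary dataset, the sketch below derives two things from metadata records: (paper, relation, category) triples, and a pre-calculated co-occurrence count for every pair of categories that appear on the same paper. The relation name in_category, the helper build_graph, and the sample records are hypothetical, and the categories field is assumed to be a space-separated string.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def build_graph(papers: List[Dict]) -> Tuple[Set[Tuple[str, str, str]], Dict[Tuple[str, str], int]]:
    """Return (paper-category triples, category co-occurrence counts)."""
    triples = set()
    co_occurrence = defaultdict(int)
    for paper in papers:
        # Sort so each category pair has one canonical ordering.
        cats = sorted(set(paper["categories"].split()))
        for cat in cats:
            triples.add((paper["id"], "in_category", cat))
        for a, b in combinations(cats, 2):
            co_occurrence[(a, b)] += 1
    return triples, dict(co_occurrence)


# Made-up records for illustration only.
papers = [
    {"id": "p1", "categories": "cs.LG stat.ML"},
    {"id": "p2", "categories": "cs.LG stat.ML cs.CL"},
    {"id": "p3", "categories": "math.CO"},
]
triples, co_occurrence = build_graph(papers)
for pair, weight in sorted(co_occurrence.items()):
    print(pair, weight)
```

The graph is built automatically, but someone still has to look at the co-occurrence counts and decide what they mean, which is what makes this approach semi-automatic.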

Automatic Insight Generation

Today, data exploration is not necessary. Data exploration is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data. Since data exploration requires personal insight, it is unable to scale due to our limited human capabilities.

By following scientific principles and applying machine learning techniques, we can generate automated insights from data. Let's call this automatic insight generation.

How can we apply scientific principles to generate automatic insights from data?

"We define a set of question-action or hypothesis-action pairs. Then the question/hypothesis is applied to the dataset, providing an answer which is actionable insight."

Orion Talmi
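One way to read the quote above is as a list of (question, test, action) entries that are evaluated against the dataset, keeping the action whenever the test passes. The sketch below is my own minimal interpretation, not Orion Talmi's implementation; the class QuestionAction, the helper share_in_category, the 50% threshold, and the sample records are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class QuestionAction:
    question: str                        # the question or hypothesis
    test: Callable[[List[Dict]], bool]   # how to answer it from the data
    action: str                          # what to do if the answer is yes


def share_in_category(papers: List[Dict], category: str) -> float:
    """Fraction of papers tagged with the given category."""
    if not papers:
        return 0.0
    hits = sum(1 for p in papers if category in p["categories"].split())
    return hits / len(papers)


def generate_insights(papers: List[Dict], pairs: List[QuestionAction]) -> List[str]:
    """Apply every question to the dataset; keep the actions whose test passes."""
    return [f"{p.question} -> {p.action}" for p in pairs if p.test(papers)]


# Made-up records for illustration only.
papers = [
    {"id": "p1", "categories": "cs.LG stat.ML"},
    {"id": "p2", "categories": "cs.LG"},
    {"id": "p3", "categories": "math.CO"},
]
pairs = [
    QuestionAction(
        "Is more than half of the sample in cs.LG?",
        lambda ps: share_in_category(ps, "cs.LG") > 0.5,
        "prioritize a machine learning review",
    ),
    QuestionAction(
        "Is more than half of the sample in math.CO?",
        lambda ps: share_in_category(ps, "math.CO") > 0.5,
        "prioritize a combinatorics review",
    ),
]
insights = generate_insights(papers, pairs)
print(insights)
```

Once the pairs are defined, no one has to explore the data by hand: adding a new question means adding one entry to the list.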

How can we apply machine learning to achieve automatic insight generation?

With predictive modeling, we can automatically predict research trends and topic relationships. With generative modeling, we can generate titles, summaries and topics.
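As a deliberately naive stand-in for predictive modeling, the sketch below counts papers per year in one category and fits a least-squares line through the counts; a positive slope suggests a growing topic. The year field is hypothetical (it is not one of the metadata fields listed earlier and would have to be derived, for example from the version history), and the sample records are made up. A real trend model would need far more than a straight line.

```python
from collections import Counter
from typing import Dict, List


def yearly_counts(papers: List[Dict], category: str) -> Dict[int, int]:
    """Count papers per year for one category (years with no papers are omitted)."""
    counts = Counter(
        p["year"] for p in papers if category in p["categories"].split()
    )
    return dict(sorted(counts.items()))


def trend_slope(counts: Dict[int, int]) -> float:
    """Least-squares slope of papers per year; positive means growth."""
    xs = list(counts)
    ys = list(counts.values())
    n = len(xs)
    if n < 2:
        return 0.0  # not enough points to fit a line
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Made-up records; "year" is a hypothetical, pre-derived field.
papers = [
    {"categories": "cs.LG", "year": 2018},
    {"categories": "cs.LG", "year": 2019},
    {"categories": "cs.LG stat.ML", "year": 2019},
    {"categories": "cs.LG", "year": 2020},
    {"categories": "cs.LG", "year": 2020},
    {"categories": "cs.LG cs.CL", "year": 2020},
    {"categories": "math.CO", "year": 2020},
]
counts = yearly_counts(papers, "cs.LG")
slope = trend_slope(counts)
print(counts, slope)
```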

The arXiv dataset allows us to solve many natural language processing tasks, including: language modeling, question answering, sentiment analysis, natural language inference, named entity recognition, text classification, machine translation, data-to-text generation, text generation, common sense reasoning, text style transfer.

Next Steps

In times of unique global challenges, in the world of COVID-19, efficient extraction of insights from data is essential.

I believe the world must move beyond manual data exploration of scientific papers and replace it with automatic insight generation: algorithms that produce actionable insights without human exploration.

It's time to use the arXiv dataset to generate scientific insights!

It's time to leverage the arXiv dataset to develop software solutions for technological innovation!

Want to derive insights from arXiv papers? Contact me.

