Usage
Installation
To use novelpy, first install it using pip:
$ pip install novelpy
Format supported
The package currently supports JSON files, which should be located in Data/docs, or a MongoDB database. Here is a typical starting folder structure for running novelpy with JSON files:
project
├── demo.py
└── Data
    └── docs
        ├── Ref_Journals
        │   ├ 2001.json
        │   └ 2002.json
        │
        └── Meshterms
            ├ 2001.json
            └ 2002.json
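As a rough illustration of the layout above, the following sketch writes a few documents into one JSON file per year. It assumes that each year file holds a JSON list of document dictionaries; the helper name `write_yearly_json` and its arguments are hypothetical and not part of novelpy.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def write_yearly_json(docs, collection_name, root="Data/docs", year_var="year"):
    """Group documents by year and write one <year>.json file per group.

    Hypothetical helper (not part of novelpy). Assumption: each year file
    is a JSON list of document dictionaries.
    """
    by_year = defaultdict(list)
    for doc in docs:
        by_year[doc[year_var]].append(doc)

    folder = Path(root) / collection_name
    folder.mkdir(parents=True, exist_ok=True)
    written = []
    for year, year_docs in sorted(by_year.items()):
        path = folder / f"{year}.json"
        with open(path, "w", encoding="utf-8") as f:
            json.dump(year_docs, f)
        written.append(path.name)
    return written


# Made-up documents, written to a temporary folder so nothing is overwritten.
docs = [
    {"PMID": 1, "year": 2001, "c04_referencelist": [{"item": "0022-3751"}]},
    {"PMID": 2, "year": 2002, "c04_referencelist": [{"item": "0022-3751"}]},
]
with tempfile.TemporaryDirectory() as tmp:
    files = write_yearly_json(docs, "Ref_Journals", root=Path(tmp) / "Data" / "docs")
```

In a real project, `root` would stay at its default so that the files land in Data/docs next to demo.py.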
Sample
We made available a small sample of data so one can get familiar with the package and the data structure needed. To get this sample, one needs to run the following code in the “project” folder:
>>> from novelpy.utils.get_sample import download_sample
>>> # JSON files only
>>> download_sample()
>>> # Or, to also load the sample into MongoDB
>>> download_sample(client_name="mongodb://localhost:27017")
Note that the JSON files are downloaded in both cases. If you use MongoDB and do not want duplicates, delete the JSON files afterwards to free disk space.
More on the structure expected
Depending on the indicator you run, you will need different information, variables and formats. Here is a short summary of the indicators and the variables they require.
For Foster et al. [2015] [2], Lee et al. [2015] [1] and Wang et al. [2017] [3], you only need two pieces of information about a document: its year of creation and the entities it uses.
# Example of a single paper information
dict_Ref_Journals = {"PMID": 16992327, "year": 1896, "c04_referencelist": [{"item": "0022-3751"}]}
# OR
dict_Meshterms = {"PMID": 12255534, "year": 1902, "Mesh_year_category": [{"descUI": "D000830"}, {"descUI": "D001695"}]}
For Uzzi et al. [2013] [4], you need one more piece of information, the year of creation of each entity, in order to do the resampling.
# Example of a single paper information
dict_Ref_Journals = {"PMID": 16992327, "year": 1896, "c04_referencelist": [{"item": "0022-3751", "year": 1893}]}
# OR
dict_Meshterms = {"PMID": 12255534, "year": 1902, "Mesh_year_category": [{"descUI": "D000830", "year": 1999}, {"descUI": "D001695", "year": 1999}]}
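Before running an indicator, it can help to check that documents carry the fields described above. The sketch below is one possible check; `check_document` and its parameters are hypothetical names and not part of novelpy, and the entity year key is assumed to be "year" as in the examples.

```python
def check_document(doc, year_var, var, sub_var,
                   require_entity_year=False, entity_year_var="year"):
    """Return a list of problems found in one document (empty if none).

    Hypothetical helper, not part of novelpy.
    """
    problems = []
    if year_var not in doc:
        problems.append(f"missing '{year_var}'")
    entities = doc.get(var)
    if not isinstance(entities, list):
        problems.append(f"'{var}' should be a list")
        return problems
    for position, entity in enumerate(entities):
        if not isinstance(entity, dict):
            problems.append(f"entity {position} is not a dictionary")
            continue
        if sub_var not in entity:
            problems.append(f"entity {position} has no '{sub_var}'")
        if require_entity_year and entity_year_var not in entity:
            problems.append(f"entity {position} has no '{entity_year_var}'")
    return problems


doc = {"PMID": 16992327, "year": 1896, "c04_referencelist": [{"item": "0022-3751"}]}

# Enough for Foster et al., Lee et al. and Wang et al.
basic = check_document(doc, "year", "c04_referencelist", "item")
# Not enough for Uzzi et al.: the entity has no year of creation.
uzzi = check_document(doc, "year", "c04_referencelist", "item",
                      require_entity_year=True)
```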
For text embedding indicators, one needs different entities.
To run Shibayama et al. [2021] [5], one needs the citation network (i.e. the IDs of the papers the document cites) as well as the abstract and/or title of the papers.
# Example of a single paper information
dict_citation_net = {"PMID": 20793277, "year": 1850, "refs_pmid_wos": [20794613, 20794649, 20794685, 20794701, 20794789, 20794829]}
# AND
dict_title_abs = {"PMID": 20793277, "year": 1850, "ArticleTitle": "Here is the title", "a04_abstract":[{"AbstractText":"This is the abstract"}]}
# You can also have the following format for title abs. In this case leave the abstract_sub_variable argument empty
dict_title_abs = {"PMID": 20793277, "year": 1850, "ArticleTitle": "Here is the title", "a04_abstract":"This is the abstract"}
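The two abstract layouts shown above can be read with a single small function. This is only a sketch: `get_abstract` is a hypothetical helper, not part of novelpy, and joining several abstract parts with a space is an assumption made for illustration.

```python
def get_abstract(doc, abstract_variable="a04_abstract",
                 abstract_subvariable="AbstractText"):
    """Return the abstract text of a document in either layout.

    Hypothetical helper, not part of novelpy.
    """
    abstract = doc.get(abstract_variable)
    if abstract is None:
        return ""
    if isinstance(abstract, str):
        # Flat layout: the abstract is stored directly as a string.
        return abstract
    # Nested layout: a list of dictionaries, one per abstract part.
    return " ".join(part[abstract_subvariable] for part in abstract)


nested = {"PMID": 20793277, "year": 1850, "ArticleTitle": "Here is the title",
          "a04_abstract": [{"AbstractText": "This is the abstract"}]}
flat = {"PMID": 20793277, "year": 1850, "ArticleTitle": "Here is the title",
        "a04_abstract": "This is the abstract"}

text_nested = get_abstract(nested)
text_flat = get_abstract(flat)
```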
To run Pelletier and Wirtz [2022], you need the abstract and/or title of the papers, as well as the list of authors of each paper.
# Example of a single paper information
dict_authors_list = {"PMID": 20793277, "year": 1850, "a02_authorlist": [{"id":201645},{"id":51331354}]}
# AND
dict_title_abs = {"PMID": 20793277, "year": 1850, "ArticleTitle": "Here is the title", "a04_abstract":[{"AbstractText":"This is the abstract"}]}
# You can also have the following format for title abs. In this case leave the abstract_sub_variable argument empty
dict_title_abs = {"PMID": 20793277, "year": 1850, "ArticleTitle": "Here is the title", "a04_abstract":"This is the abstract"}
Finally, for disruptiveness indicators, one only needs the citation network.
# Example of a single paper information
dict_citation_net = {"PMID": 20793277, "year": 1850, "refs_pmid_wos": [20794613, 20794649, 20794685, 20794701, 20794789, 20794829]}
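Disruption-type measures generally need to know which papers cite a given paper, whereas the documents above only list the papers they cite. The sketch below inverts the reference lists into a "cited by" map. It illustrates the data structure only and is not how novelpy computes its indicators; `build_cited_by` is a hypothetical name and the PMIDs are made up.

```python
from collections import defaultdict


def build_cited_by(docs, id_variable="PMID", ref_variable="refs_pmid_wos"):
    """Map each cited ID to the list of IDs of the documents citing it.

    Hypothetical helper, not part of novelpy.
    """
    cited_by = defaultdict(list)
    for doc in docs:
        for ref in doc.get(ref_variable, []):
            cited_by[ref].append(doc[id_variable])
    return dict(cited_by)


# Made-up citation network: paper 3 cites 1 and 2, paper 2 cites 1.
docs = [
    {"PMID": 3, "year": 2002, "refs_pmid_wos": [1, 2]},
    {"PMID": 2, "year": 2001, "refs_pmid_wos": [1]},
    {"PMID": 1, "year": 2000, "refs_pmid_wos": []},
]
cited_by = build_cited_by(docs)
```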
Tutorial
This tutorial is built upon the sample available above in JSON format. The extension to MongoDB is straightforward: add the “client_name” and “db_name” arguments to each function call. Make sure to run the code in the “project” folder (see demo.py in the folder structure under “Format supported”).
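One way to switch between the JSON and MongoDB variants without duplicating every call is to keep the two extra arguments in a dictionary and merge them in when needed. The sketch below only builds the keyword arguments; `call_kwargs` is a hypothetical helper, and the connection string and database name are the ones used elsewhere in this tutorial, to be adapted to your own setup.

```python
# Arguments shared by the JSON and MongoDB variants of a call.
json_kwargs = {
    "collection_name": "Ref_Journals_sample",
    "year_var": "year",
    "var": "c04_referencelist",
    "sub_var": "item",
}

# The two extra arguments needed for MongoDB (values are examples).
mongo_kwargs = {
    "client_name": "mongodb://localhost:27017",
    "db_name": "novelty_sample",
}


def call_kwargs(base, use_mongo):
    """Return keyword arguments for the JSON or the MongoDB variant.

    Hypothetical helper, not part of novelpy.
    """
    return {**base, **mongo_kwargs} if use_mongo else dict(base)


kwargs_json = call_kwargs(json_kwargs, use_mongo=False)
kwargs_mongo = call_kwargs(json_kwargs, use_mongo=True)
```

The resulting dictionary would then be unpacked into a novelpy call together with that call's remaining arguments.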
Here is a straightforward implementation of the Foster et al. [2015] [2] novelty indicator. Currently, all available indicators are based on the idea that new knowledge is created by combining already existing pieces of knowledge. Because of this, one needs co-occurrence matrices. Element ij of the co-occurrence matrix is the number of times the combination of items i and j appeared in a given year. The co-occurrence matrices are saved in pickle format to save time when running different indicators:
# demo.py
import novelpy
ref_cooc = novelpy.utils.cooc_utils.create_cooc(
    collection_name = "Ref_Journals_sample",
    year_var = "year",
    var = "c04_referencelist",
    sub_var = "item",
    time_window = range(1995,2016),
    weighted_network = True, self_loop = True)
ref_cooc.main()
project
├── demo.py
└── Data
    ├── docs
    │   ├── Ref_Journals_sample
    │   │   ├ 1995.json
    │   │   ├ 1996.json
    │   │   ├ ...
    │   │   └ 2015.json
    │   │
    │   └── Meshterms_sample
    │       ├ 1995.json
    │       ├ 1996.json
    │       ├ ...
    │       └ 2015.json
    │
    └── cooc
        └── c04_referencelist
            └── weighted_network_self_loop
                ├ 1995.p
                ├ 1996.p
                ├ ...
                ├ 2015.p
                ├ index2name.p
                └ name2index.p
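To make the co-occurrence definition concrete, here is a minimal pure-Python sketch that counts, for the documents of one year, how often each pair of items appears together. It counts each distinct pair once per document and ignores self-pairs; the `weighted_network` and `self_loop` options of `create_cooc` are not modelled because their exact behaviour is not described here. The function name and the item labels are made up, and the sketch uses a counter keyed by item pairs rather than a matrix.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(docs, var, sub_var):
    """Count how many documents contain each unordered pair of items.

    Illustrative only, not the novelpy implementation.
    """
    counts = Counter()
    for doc in docs:
        # Distinct items of the document, sorted so that (i, j) and (j, i)
        # end up under the same key.
        items = sorted({entity[sub_var] for entity in doc.get(var, [])})
        counts.update(combinations(items, 2))
    return counts


# Made-up documents for a single year.
docs_1995 = [
    {"PMID": 1, "year": 1995,
     "c04_referencelist": [{"item": "A"}, {"item": "B"}, {"item": "C"}]},
    {"PMID": 2, "year": 1995,
     "c04_referencelist": [{"item": "A"}, {"item": "B"}]},
]
cooc = count_cooccurrences(docs_1995, "c04_referencelist", "item")
```

Here the pair ("A", "B") appears in both documents, so its count is 2, while ("A", "C") and ("B", "C") each have a count of 1.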
# demo.py
import novelpy
import tqdm
for focal_year in tqdm.tqdm(range(2000,2011), desc = "Computing indicator for window of time"):
    Foster = novelpy.indicators.Foster2015(
        collection_name = "Ref_Journals_sample",
        id_variable = 'PMID',
        year_variable = 'year',
        variable = "c04_referencelist",
        sub_variable = "item",
        focal_year = focal_year,
        starting_year = 1995,
        community_algorithm = "Louvain",
        density = True)
    Foster.get_indicator()
project
├── demo.py
├── Data
│   ├── docs
│   │   ├── Ref_Journals_sample
│   │   │   ├ 1995.json
│   │   │   ├ 1996.json
│   │   │   ├ ...
│   │   │   └ 2015.json
│   │   │
│   │   └── Meshterms_sample
│   │       ├ 1995.json
│   │       ├ 1996.json
│   │       ├ ...
│   │       └ 2015.json
│   │
│   └── cooc
│       └── c04_referencelist
│           └── weighted_network_self_loop
│               ├ 1995.p
│               ├ 1996.p
│               ├ ...
│               ├ 2015.p
│               ├ index2name.p
│               └ name2index.p
└── Results
    └── foster
        └── c04_referencelist
            ├ 2000.json
            ├ ...
            └ 2010.json
import novelpy
# Easy plot
dist = novelpy.utils.plot_dist(
    client_name = "mongodb://localhost:27017",
    db_name = "novelty_sample",
    doc_id = 20100198,
    doc_year = 2010,
    id_variable = "PMID",
    variables = ["c04_referencelist"],
    indicators = ["foster"])
dist.get_plot_dist()
# The data used for the plot can be found in dist.df
import novelpy
# Trend
trend = novelpy.utils.novelty_trend(
    year_range = range(2000,2011,1),
    variables = ["c04_referencelist"],
    id_variable = "PMID",
    indicators = ["foster"])
trend.get_plot_trend()
# demo.py
import novelpy
import tqdm
# All the remaining co-occurrence matrices, not including the one built above
ref_cooc = novelpy.utils.cooc_utils.create_cooc(
    collection_name = "Ref_Journals_sample",
    year_var = "year",
    var = "c04_referencelist",
    sub_var = "item",
    time_window = range(1995,2016),
    weighted_network = False, self_loop = False)
ref_cooc.main()
ref_cooc = novelpy.utils.cooc_utils.create_cooc(
    collection_name = "Meshterms_sample",
    year_var = "year",
    var = "Mesh_year_category",
    sub_var = "descUI",
    time_window = range(1995,2016),
    weighted_network = True, self_loop = True)
ref_cooc.main()
ref_cooc = novelpy.utils.cooc_utils.create_cooc(
    collection_name = "Meshterms_sample",
    year_var = "year",
    var = "Mesh_year_category",
    sub_var = "descUI",
    time_window = range(1995,2016),
    weighted_network = False, self_loop = False)
ref_cooc.main()
# Uzzi et al. (2013) Meshterms_sample
for focal_year in tqdm.tqdm(range(2000,2011), desc = "Computing indicator for window of time"):
    Uzzi = novelpy.indicators.Uzzi2013(
        collection_name = "Meshterms_sample",
        id_variable = 'PMID',
        year_variable = 'year',
        variable = "Mesh_year_category",
        sub_variable = "descUI",
        focal_year = focal_year,
        density = True)
    Uzzi.get_indicator()
# Uzzi et al. (2013) Ref_Journals_sample
for focal_year in tqdm.tqdm(range(2000,2011), desc = "Computing indicator for window of time"):
    Uzzi = novelpy.indicators.Uzzi2013(
        collection_name = "Ref_Journals_sample",
        id_variable = 'PMID',
        year_variable = 'year',
        variable = "c04_referencelist",
        sub_variable = "item",
        focal_year = focal_year,
        density = True)
    Uzzi.get_indicator()
# Foster et al. (2015) Meshterms_sample
for focal_year in tqdm.tqdm(range(2000,2011), desc = "Computing indicator for window of time"):
    Foster = novelpy.indicators.Foster2015(
        collection_name = "Meshterms_sample",
        id_variable = 'PMID',
        year_variable = 'year',
        variable = "Mesh_year_category",
        sub_variable = "descUI",
        focal_year = focal_year,
        starting_year = 1995,
        community_algorithm = "Louvain",
        density = True)
    Foster.get_indicator()
# Lee et al. (2015) Meshterms_sample
for focal_year in tqdm.tqdm(range(2000,2011), desc = "Computing indicator for window of time"):
    Lee = novelpy.indicators.Lee2015(
        collection_name = "Meshterms_sample",
        id_variable = 'PMID',
        year_variable = 'year',
        variable = "Mesh_year_category",
        sub_variable = "descUI",
        focal_year = focal_year,
        density = True)
    Lee.get_indicator()
# Lee et al. (2015) Ref_Journals_sample
for focal_year in tqdm.tqdm(range(2000,2011), desc = "Computing indicator for window of time"):
    Lee = novelpy.indicators.Lee2015(
        collection_name = "Ref_Journals_sample",
        id_variable = 'PMID',
        year_variable = 'year',
        variable = "c04_referencelist",
        sub_variable = "item",
        focal_year = focal_year,
        density = True)
    Lee.get_indicator()
# Wang et al. (2017) Meshterms_sample
for focal_year in tqdm.tqdm(range(2000,2011)):
    Wang = novelpy.indicators.Wang2017(
        collection_name = "Meshterms_sample",
        id_variable = 'PMID',
        year_variable = 'year',
        variable = "Mesh_year_category",
        sub_variable = "descUI",
        focal_year = focal_year,
        time_window_cooc = 3,
        n_reutilisation = 1,
        starting_year = 1995,
        density = True)
    Wang.get_indicator()
# Wang et al. (2017) Ref_Journals_sample
for focal_year in tqdm.tqdm(range(2000,2011)):
    Wang = novelpy.indicators.Wang2017(
        collection_name = "Ref_Journals_sample",
        id_variable = 'PMID',
        year_variable = 'year',
        variable = "c04_referencelist",
        sub_variable = "item",
        focal_year = focal_year,
        time_window_cooc = 3,
        n_reutilisation = 1,
        starting_year = 1995,
        density = True)
    Wang.get_indicator()
from novelpy.utils.embedding import Embedding
embedding = Embedding(
    year_variable = 'year',
    time_range = range(2000,2011),
    id_variable = 'PMID',
    references_variable = 'refs_pmid_wos',
    pretrain_path = 'en_core_sci_lg-0.4.0/en_core_sci_lg/en_core_sci_lg-0.4.0',
    title_variable = 'ArticleTitle',
    abstract_variable = 'a04_abstract',
    abstract_subvariable = 'AbstractText')
# articles
embedding.get_articles_centroid(
    collection_articles = 'Title_abs_sample',
    collection_embedding = 'embedding',
    year_range = range(2000,2011,1))
import novelpy
import tqdm
for focal_year in tqdm.tqdm(range(2000,2011), desc = "Computing indicator for window of time"):
    shibayama = novelpy.indicators.Shibayama2021(
        client_name = "mongodb://localhost:27017",
        db_name = "novelty_sample",
        collection_name = 'Citation_net_sample',
        collection_embedding_name = 'embedding',
        id_variable = 'PMID',
        year_variable = 'year',
        ref_variable = 'refs_pmid_wos',
        entity = ['title_embedding','abstract_embedding'],
        focal_year = focal_year,
        density = True)
    shibayama.get_indicator()
from novelpy.utils import Embedding
from novelpy.utils import create_authors_past
import novelpy
# First step: create a collection where each document contains an author ID and the list of documents that author coauthored
clean = create_authors_past(
    client_name = 'mongodb://localhost:27017',
    db_name = 'novelty_sample',
    collection_name = "authors_sample",
    id_variable = "PMID",
    variable = "a02_authorlist",
    sub_variable = "AID")
clean.author2paper()
clean.update_db()
embedding = Embedding(
    year_variable = 'year',
    id_variable = 'PMID',
    references_variable = 'refs_pmid_wos',
    pretrain_path = r'en_core_sci_lg-0.4.0\en_core_sci_lg\en_core_sci_lg-0.4.0',
    title_variable = 'ArticleTitle',
    abstract_variable = 'a04_abstract',
    abstract_subvariable = 'AbstractText',
    aut_id_variable = 'AID',
    aut_pubs_variable = 'doc_list')
"""
embedding.get_articles_centroid(
    collection_articles = 'Title_abs_sample',
    collection_embedding = 'embedding')
"""
embedding.feed_author_profile(
    aut_id_variable = 'AID',
    aut_pubs_variable = 'doc_list',
    collection_authors = 'authors_sample_cleaned',
    collection_embedding = 'embedding')
from novelpy.indicators.Author_proximity import Author_proximity
for year in range(2000,2011):
    author = Author_proximity(
        collection_name = 'authors_sample',
        id_variable = 'PMID',
        year_variable = 'year',
        aut_list_variable = 'a02_authorlist',
        aut_id_variable = 'AID',
        entity = ['title','abstract'],
        focal_year = year,
        windows_size = 5,
        density = True)
    author.get_indicator()
dist = novelpy.utils.plot_dist(
    doc_id = 20100198,
    doc_year = 2010,
    id_variable = "PMID",
    variables = ["c04_referencelist","Mesh_year_category"],
    indicators = ["foster","lee","uzzi","wang","shibayama"],
    time_window_cooc = [3],
    n_reutilisation = [1],
    embedding_entities = ["title","abstract"])
dist.get_plot_dist()
trend = novelpy.utils.novelty_trend(
    year_range = range(2000,2011,1),
    variables = ["c04_referencelist","a06_meshheadinglist"],
    id_variable = "PMID",
    indicators = ["foster","commonness"],
    time_window_cooc = [3],
    n_reutilisation = [1])
trend.get_plot_trend()
correlation = novelpy.utils.correlation_indicators(
    year_range = range(2000,2011,1),
    variables = ["c04_referencelist","Mesh_year_category"],
    indicators = ["foster","lee","wang","shibayama"],
    time_window_cooc = [3],
    n_reutilisation = [1],
    embedding_entities = ["title","abstract"])
correlation.correlation_heatmap(per_year = False)