Utils

cooc_utils

Most indicators hypothesise that new ideas are created by combining existing ones. They look at combinations of items (journals cited, keywords used, …). cooc_utils creates, for each year, an adjacency matrix that records the combinations made in that year.

create_cooc(var, sub_var, year_var, collection_name, time_window, dtype=np.uint32, weighted_network=False, self_loop=False, client_name=None, db_name=None)

Create a co-occurrence matrix of a field (e.g. authors, keywords, ref) by year. Matrices are stored in sparse CSR format and pickled for later use.

Parameters:
  • var (str) – The key of interest in the dict.

  • sub_var (str) – Name of the key which holds the ID of the variable of interest.

  • year_var (str) – Name of the key whose value is the year of creation of the document.

  • collection_name (str) – Name of the collection (either Mongo or Json) where the data is stored.

  • time_window (range) – Years for which the co-occurrence matrix is computed.

  • dtype (np.dtype) – The dtype for the co-occurrence matrix.

  • weighted_network (bool) – False if you want a combination that appears multiple times in a single paper to be counted as 1.

  • self_loop (bool) – True if you want the diagonal in the co-occurrence matrix.

  • client_name (str) – Name of the MongoDB client.

  • db_name (str) – Name of the MongoDB database.
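
To make the weighted_network and self_loop flags concrete, here is a minimal pure-Python sketch of co-occurrence counting for the documents of a single year. It is not the novelpy implementation: the helper name count_cooccurrences, the toy journal IDs, and the choice to weight a pair by the product of its occurrence counts are all assumptions made for illustration, and the result is a plain Counter rather than a sparse CSR matrix.

```python
from collections import Counter
from itertools import combinations_with_replacement


def count_cooccurrences(docs, weighted_network=False, self_loop=False):
    """Count item pairs over documents, each given as a list of item IDs.

    Hypothetical helper: one possible reading of the two flags, not novelpy code.
    """
    counts = Counter()
    for items in docs:
        occurrences = Counter(items)      # how often each item appears in this document
        distinct = sorted(occurrences)    # fixed order, so each pair is stored once
        for a, b in combinations_with_replacement(distinct, 2):
            if a == b and not self_loop:
                continue                  # skip the diagonal unless it is requested
            # Unweighted: a pair counts once per document, however often it repeats.
            # Weighted (assumption): a pair counts once per pairing of occurrences.
            weight = occurrences[a] * occurrences[b] if weighted_network else 1
            counts[(a, b)] += weight
    return counts


# Two toy documents from the same year; "J2" is cited twice in the first one.
docs_1995 = [["J1", "J2", "J2"], ["J1", "J3"]]

unweighted = count_cooccurrences(docs_1995)
weighted = count_cooccurrences(docs_1995, weighted_network=True, self_loop=True)

print(dict(unweighted))        # {('J1', 'J2'): 1, ('J1', 'J3'): 1}
print(weighted[("J1", "J2")])  # 2
```

With weighted_network=False the repeated citation of J2 is ignored, while with weighted_network=True the pair (J1, J2) is counted twice. The diagonal entries only appear when self_loop=True.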

Returns:

Raises:

embedding

To use the indicators of Shibayama et al. (2021) and the one on authors, you first need to embed the title or abstract of each document.

Embedding(year_variable, id_variable, references_variable, pretrain_path, time_range, title_variable=None, abstract_variable=None, keywords_variable=None, keywords_subvariable=None, abstract_subvariable=None, aut_id_variable=None, aut_pubs_variable=None, client_name=None, db_name=None)

Compute the semantic centroid for each paper (abstract and title). Compute an author profile of embedded articles per year and store it.

Parameters:
  • year_variable (str) – Key where value is the year of publication of the document.

  • id_variable (str) – Key where value is the document’s id.

  • time_range (range) – Create the embedding for papers published in time_range. If None, it iterates over all available years.

  • pretrain_path (str) – Path to the pretrained word2vec model, e.g. ‘your/path/to/en_core_sci_lg-0.4.0/en_core_sci_lg/en_core_sci_lg-0.4.0’.

  • title_variable (str) – Key where value is the document’s title.

  • abstract_variable (str) – Key where value is the abstract’s information for the document.

  • abstract_subvariable (str) – Key inside abstract_variable where value is the text of the abstract.

  • keywords_variable (str) – Key where value is the keywords’ information for the document.

  • keywords_subvariable (str) – Key inside keywords_variable where value is the actual keyword.

  • aut_id_variable (str) – In collection author key where value is the ID of an author.

  • aut_pubs_variable (str) – In collection author key where value is the document list for a given author.

  • client_name (str) – Name of the MongoDB client.

  • db_name (str) – Name of the MongoDB database.

Returns:

Raises:

from novelpy.utils.embedding import Embedding

embedding = Embedding(
      year_variable = 'year',
      id_variable = 'PMID',
      references_variable = 'refs_pmid_wos',
      pretrain_path = 'en_core_sci_lg-0.4.0/en_core_sci_lg/en_core_sci_lg-0.4.0',
      title_variable = 'ArticleTitle',
      abstract_variable = 'a04_abstract',
      abstract_subvariable = 'AbstractText')

The first step is to embed every paper’s abstract/title by using get_articles_centroid.

embedding.get_articles_centroid(
      collection_articles = 'Title_abs_sample',
      collection_embedding = 'embedding')
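
As a rough illustration of what a semantic centroid is, the sketch below averages the word vectors of a title component by component. It is only a toy: the three-dimensional vectors in toy_vectors and the helpers centroid and embed_text are hypothetical, and the whitespace tokenisation stands in for whatever preprocessing the pretrained model actually applies.

```python
def centroid(vectors):
    """Return the component-wise mean of a list of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector to compute a centroid")
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Hypothetical 3-dimensional word vectors; a real pretrained model has far more dimensions.
toy_vectors = {
    "gene": [1.0, 0.0, 2.0],
    "therapy": [3.0, 2.0, 0.0],
    "trial": [2.0, 4.0, 4.0],
}


def embed_text(text, vectors=toy_vectors):
    """Average the vectors of the known words in a text (naive whitespace tokenisation)."""
    tokens = [token for token in text.lower().split() if token in vectors]
    return centroid([vectors[token] for token in tokens])


title_centroid = embed_text("Gene therapy trial")
print(title_centroid)  # [2.0, 2.0, 2.0]
```

Words missing from the vocabulary are simply dropped here. A text with no known word raises a ValueError rather than returning an arbitrary vector.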

Once this is done you can run the Shibayama et al. [2021] [5] indicator.

plot_dist

Once you have computed multiple indicators, you can plot, for a given document, the distribution of the novelty scores of the item combinations it contains.

plot_dist(doc_id, doc_year, id_variable, variables, indicators, time_window_cooc=None, n_reutilisation=None, embedding_entities=None, shibayma_per=10, client_name=None, db_name=None)

Plot the distribution of novelty score for combinations of items in a document

Parameters:
  • doc_id (str/int) – The ID of the document whose distribution you want to plot.

  • doc_year (int) – Year of creation of the document.

  • id_variable (str) – Name of the key that contains the ID of the doc

  • variables (list) – List of variables you want the distribution of (e.g. [“references”, “meshterms”])

  • indicators (list) – List of indicator names you want the distribution of (e.g. [“foster”, “wang”])

  • time_window_cooc (list of int) – List of time_window_cooc values you want the distribution for, a parameter used in wang (e.g. [3, 5])

  • n_reutilisation (list) – List of n_reutilisation values you want the distribution for, a parameter used in wang (e.g. [1, 2])

  • embedding_entities (list) – List of entities you want the distribution for, a parameter used in shibayama (e.g. [“title”, “abstract”])

  • shibayma_per (int) – Percentile of the combination novelty scores to use for shibayama, which compares several percentiles (int between 0 and 100)

  • client_name (str) – Name of the MongoDB client

  • db_name (str) – Name of the MongoDB database
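
The percentile parameter can be read as picking one value out of the distribution of a document's combination scores. The sketch below uses the nearest-rank convention on made-up scores. Both the helper nearest_rank_percentile and the choice of convention are assumptions, since this page does not say how the percentile is computed internally.

```python
def nearest_rank_percentile(scores, per):
    """Smallest score such that at least `per` percent of the scores are at or below it."""
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0 <= per <= 100:
        raise ValueError("per must be an integer between 0 and 100")
    ordered = sorted(scores)
    # Integer ceiling of per * n / 100; rank 1 is the lowest possible rank.
    rank = max(1, (per * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Made-up novelty scores for the ten item combinations of one document.
combination_scores = [0.12, 0.40, 0.05, 0.33, 0.90, 0.27, 0.61, 0.08, 0.74, 0.19]

print(nearest_rank_percentile(combination_scores, 10))  # 0.05
print(nearest_rank_percentile(combination_scores, 90))  # 0.74
```

Other conventions (for example linear interpolation between neighbouring ranks) give slightly different values on small samples.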

Returns:

Raises:

novelty_trend

Once you have computed multiple indicators, you can plot the trend of each indicator’s mean novelty score per year, given the variables and hyperparameters.

novelty_trend(year_range, variables, indicators, id_variable, time_window_cooc=None, n_reutilisation=None, embedding_entities=None, shibayama_per=10, client_name=None, db_name=None)

Plot the novelty trend (mean per year) for an indicator given the variable

Parameters:
  • year_range (range) – Get the trend for each year in year_range.

  • variables (list) – List of variables you want the novelty trend of (e.g. [“references”, “meshterms”]).

  • indicators (list) – List of indicator names you want the novelty trend of (e.g. [“foster”, “wang”]).

  • id_variable (str) – Name of the key that contains the ID of the doc.

  • time_window_cooc (list of int) – List of time_window_cooc values you want the trend for, a parameter used in wang (e.g. [3, 5]).

  • n_reutilisation (list) – List of n_reutilisation values you want the trend for, a parameter used in wang (e.g. [1, 2]).

  • embedding_entities (list) – List of entities you want the trend for, a parameter used in shibayama (e.g. [“title”, “abstract”]).

  • shibayama_per (int) – Percentile of the combination novelty scores to use for shibayama, which compares several percentiles (int between 0 and 100).

  • client_name (str) – Name of the MongoDB client.

  • db_name (str) – Name of the MongoDB database.
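
The quantity being plotted is a mean per year. A minimal sketch of that aggregation, assuming each document is a dict with hypothetical keys "year" and "score" (the real collections and key names depend on your data), could look like this:

```python
from collections import defaultdict
from statistics import mean


def mean_novelty_per_year(records, year_range):
    """Group scores by year and return {year: mean score} for the years in year_range."""
    by_year = defaultdict(list)
    for record in records:
        if record["year"] in year_range:
            by_year[record["year"]].append(record["score"])
    # Years without any document are left out rather than reported as zero.
    return {year: mean(by_year[year]) for year in year_range if year in by_year}


# Made-up novelty scores for one indicator and one variable.
records = [
    {"year": 2000, "score": 0.25},
    {"year": 2000, "score": 0.75},
    {"year": 2001, "score": 0.125},
    {"year": 2001, "score": 0.375},
    {"year": 2005, "score": 1.0},  # outside the requested range, so ignored
]

trend = mean_novelty_per_year(records, range(2000, 2003))
print(trend)  # {2000: 0.5, 2001: 0.25}
```

Repeating this for every combination of variable, indicator, and hyperparameter value gives one trend line per combination.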

Returns:

Raises:

correlation_indicators

Once you have computed multiple indicators, you can plot the correlation heatmap of the novelty score, either per year or during the whole period, for each indicator, given the variables and hyperparameters.

correlation_indicators(year_range, variables, indicators, time_window_cooc=None, n_reutilisation=None, embedding_entities=None, shibayama_per=10, client_name=None, db_name=None)

Plot the correlation heatmap of the novelty scores between indicators, given the variables and hyperparameters

Parameters:
  • year_range (range) – Years over which the correlation is computed.

  • variables (list) – List of variables you want the correlation for (e.g. [“references”, “meshterms”]).

  • indicators (list) – List of indicator names you want to correlate (e.g. [“foster”, “wang”]).

  • time_window_cooc (list of int) – List of time_window_cooc values to include, a parameter used in wang (e.g. [3, 5]).

  • n_reutilisation (list) – List of n_reutilisation values to include, a parameter used in wang (e.g. [1, 2]).

  • embedding_entities (list) – List of entities to include, a parameter used in shibayama (e.g. [“title”, “abstract”]).

  • shibayama_per (int) – Percentile of the combination novelty scores to use for shibayama, which compares several percentiles (int between 0 and 100).

  • client_name (str) – Name of the MongoDB client.

  • db_name (str) – Name of the MongoDB database.
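
A heatmap of this kind is a matrix of pairwise correlations between the indicators' scores on the same documents. The sketch below computes such a matrix with the Pearson coefficient on made-up scores. Pearson is an assumption, because this page does not state which correlation measure is used, and the helper pearson is hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / sqrt(var_x * var_y)


# Made-up scores of three indicators on the same four documents.
scores = {
    "foster": [1.0, 2.0, 3.0, 4.0],
    "wang": [2.0, 4.0, 6.0, 8.0],
    "shibayama": [4.0, 3.0, 2.0, 1.0],
}

# Nested dict acting as the correlation matrix behind a heatmap.
matrix = {a: {b: pearson(scores[a], scores[b]) for b in scores} for a in scores}

print(matrix["foster"]["wang"])       # 1.0
print(matrix["foster"]["shibayama"])  # -1.0
```

Computing one such matrix per year, or a single one over all years pooled together, corresponds to the per-year and whole-period views.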

Returns:

Raises: