module documentation

Undocumented

Function calculate_similarity_articles: Calculate the cosine similarity between articles in a DataFrame using the sklearn library.
Function process_similarity_pairs: Process similarity pairs based on the given corpus dataframe and similarity dataframe.
def calculate_similarity_articles(df):

Calculate the cosine similarity between articles in a DataFrame using the sklearn library.

Args:
    df (pd.DataFrame): DataFrame containing articles with 'source', 'id', 'title', and 'text' columns.

Returns:
    pd.DataFrame: DataFrame containing the cosine similarity scores between articles.
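The docstring above does not say how the articles are vectorized or how the returned DataFrame is labelled. The following is a minimal sketch of one way such a function could work, not the module's actual implementation. It assumes TF-IDF features built from the concatenated 'title' and 'text' columns, and rows and columns labelled "<source>:<id>". The name sketch_similarity_articles and the sample data are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def sketch_similarity_articles(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for calculate_similarity_articles."""
    # Assumption: title and text are concatenated into one document per article.
    documents = (df["title"].fillna("") + " " + df["text"].fillna("")).tolist()
    # Assumption: TF-IDF features; the docstring only mentions cosine similarity.
    tfidf = TfidfVectorizer().fit_transform(documents)
    # Pairwise cosine similarity between every pair of articles (square matrix).
    scores = cosine_similarity(tfidf)
    # Assumption: rows and columns are labelled "<source>:<id>".
    labels = (df["source"].astype(str) + ":" + df["id"].astype(str)).tolist()
    return pd.DataFrame(scores, index=labels, columns=labels)


# Hypothetical sample data with the four columns named in the docstring.
articles = pd.DataFrame(
    {
        "source": ["reddit", "arxiv", "arxiv"],
        "id": ["r1", "a1", "a2"],
        "title": [
            "Graph neural networks",
            "Graph neural networks survey",
            "Protein folding",
        ],
        "text": [
            "A discussion of graph neural networks for molecules.",
            "We survey graph neural networks and their applications.",
            "We predict protein structure from amino acid sequences.",
        ],
    }
)
similarity_df = sketch_similarity_articles(articles)
print(similarity_df.round(2))
```

In this sketch the result is a symmetric matrix with ones on the diagonal, and the two graph-related articles score higher against each other than either does against the protein article.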

def process_similarity_pairs(df_corpus, similarity_df):

Process similarity pairs based on the given dataframe corpus and similarity dataframe.

Args:
    df_corpus (pandas.DataFrame): The dataframe corpus containing unique IDs.
    similarity_df (pandas.DataFrame): The similarity dataframe.

Returns:
    dict: A dictionary containing similarity pairs categorized by source and ID.
    The structure of the dictionary is as follows:

    {
        'reddit': {
            'source_id': [
                {
                    'similar_source': 'similar_source',
                    'similar_id': 'similar_id',
                    'similarity': similarity_value
                },
                ...
            ],
            ...
        },
        'arxiv': {
            'source_id': [
                {
                    'similar_source': 'similar_source',
                    'similar_id': 'similar_id',
                    'similarity': similarity_value
                },
                ...
            ],
            ...
        }
    }
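To make the nested return structure concrete, here is a minimal sketch that builds a dictionary of the documented shape. It is an illustration under stated assumptions, not the module's actual logic: the similarity dataframe is assumed to be a square matrix labelled "<source>:<id>", the corpus is assumed to have 'source' and 'id' columns, and the threshold parameter and the name sketch_similarity_pairs are hypothetical.

```python
import pandas as pd


def sketch_similarity_pairs(df_corpus, similarity_df, threshold=0.5):
    """Hypothetical stand-in for process_similarity_pairs."""
    pairs = {}
    for source, article_id in zip(df_corpus["source"], df_corpus["id"]):
        # Assumption: similarity_df is indexed by "<source>:<id>" labels.
        label = f"{source}:{article_id}"
        matches = []
        for other_label, score in similarity_df.loc[label].items():
            # Skip the article itself and anything below the (assumed) threshold.
            if other_label == label or score < threshold:
                continue
            similar_source, similar_id = other_label.split(":", 1)
            matches.append(
                {
                    "similar_source": similar_source,
                    "similar_id": similar_id,
                    "similarity": float(score),
                }
            )
        # Group results first by source, then by article ID.
        pairs.setdefault(source, {})[article_id] = matches
    return pairs


# Hypothetical inputs: three articles and a hand-written similarity matrix.
labels = ["reddit:r1", "arxiv:a1", "arxiv:a2"]
similarity_df = pd.DataFrame(
    [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]],
    index=labels,
    columns=labels,
)
df_corpus = pd.DataFrame(
    {"source": ["reddit", "arxiv", "arxiv"], "id": ["r1", "a1", "a2"]}
)
pairs = sketch_similarity_pairs(df_corpus, similarity_df)
print(pairs)
```

With these inputs, only the reddit:r1 and arxiv:a1 pair clears the assumed threshold, so each appears in the other's list while arxiv:a2 maps to an empty list.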