class documentation

class Corpus: (source)

Constructor: Corpus()

Class representing a corpus of documents.

Method __init__ Initialize a Corpus object.
Method __repr__ Returns: String representation of the Corpus object.
Method __str__ Returns a string representation of the Corpus object.
Method add Add a document to the corpus.
Method calculate_word_freq_per_year Calculates the word frequency per year for the given list of words.
Method docs_to_collection Get all documents in the corpus as a list. Used as a collection of all document objects in the corpus.
Method from_pkl_file Load the corpus from a pickle file.
Method get_corpus_contents Get a dictionary containing only the document texts.
Method get_document Get a document from the corpus.
Method get_stats Calculate statistics about the corpus.
Method search_text Search for passages in the documents containing the given keyword.
Method to_dataframe Convert the corpus to a DataFrame.
Instance Variable authors Undocumented
Instance Variable documents Undocumented
Method __clean_text Clean the text of a document.
Method __concat_data Concatenate all documents in the corpus if they have not already been concatenated.
Instance Variable __author_count Undocumented
Instance Variable __author_to_id Undocumented
Instance Variable __concated_text Undocumented
Instance Variable __document_count Undocumented
def __init__(self): (source)

Initialize a Corpus object.

def __repr__(self): (source)

Returns: String representation of the Corpus object.

def __str__(self): (source)

Returns a string representation of the Corpus object. The string representation includes the statistics of the Corpus object in a tabulated format.

Returns:
    str: A string representation of the Corpus object.

def add(self, doc, author): (source)

Add a document to the corpus.

Args:
    doc: The document to add.
    author: The author of the document.
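The bookkeeping that add implies can be sketched as follows. This is a minimal stand-in, not the library's implementation: the class name MiniCorpus, the dictionary layouts, and the integer ID scheme are all assumptions chosen to mirror the attribute names listed on this page (documents, authors, __author_to_id, __author_count, __document_count). Single-underscore names are used here for readability.

```python
class MiniCorpus:
    """Hypothetical stand-in for Corpus; the storage layout is assumed."""

    def __init__(self):
        self.documents = {}       # assumed: doc_id -> document text
        self.authors = {}         # assumed: author_id -> author name
        self._author_to_id = {}   # assumed: author name -> author_id
        self._author_count = 0
        self._document_count = 0

    def add(self, doc, author):
        # Assign a new author ID only the first time an author is seen.
        if author not in self._author_to_id:
            self._author_to_id[author] = self._author_count
            self.authors[self._author_count] = author
            self._author_count += 1
        # Store the document under the next free document ID.
        self.documents[self._document_count] = doc
        self._document_count += 1


corpus = MiniCorpus()
corpus.add("First text.", "Ada")
corpus.add("Second text.", "Ada")
corpus.add("Third text.", "Grace")
print(len(corpus.documents), len(corpus.authors))
```

In this sketch a repeated author reuses the existing ID, so the author count grows more slowly than the document count.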

def calculate_word_freq_per_year(self, words_to_track: List[str]) -> defaultdict: (source)

Calculates the word frequency per year for the given list of words.

Args:
    words_to_track (list): A list of words to track the frequency of.

Returns:
    defaultdict: A nested defaultdict containing the word frequency per year.
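A nested defaultdict of this kind might be built as in the sketch below. The function name word_freq_per_year, the (year, text) input pairs, the whitespace tokenization, and the word -> year -> count nesting order are assumptions made for illustration; the page does not say where the year comes from or how the result is keyed.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def word_freq_per_year(
    docs: Iterable[Tuple[int, str]], words_to_track: List[str]
) -> defaultdict:
    """Hypothetical helper: count tracked words per year.

    Assumes each document is given as a (year, text) pair.
    """
    tracked = {word.lower() for word in words_to_track}
    # Outer key: word, inner key: year, value: occurrence count (assumed order).
    freq = defaultdict(lambda: defaultdict(int))
    for year, text in docs:
        for token in text.lower().split():
            if token in tracked:
                freq[token][year] += 1
    return freq


sample_docs = [
    (1990, "war and peace"),
    (1990, "peace now"),
    (1995, "war"),
]
freq = word_freq_per_year(sample_docs, ["war", "peace"])
print(freq["peace"][1990], freq["war"][1995])
```

Because both levels are defaultdicts, looking up a word or year that never occurred yields 0 rather than raising KeyError, at the cost of inserting the missing key.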

def docs_to_collection(self) -> List[str]: (source)

Get all documents in the corpus as a list. Used as a collection of all document objects in the corpus.

Returns:
    list: A list of documents.

def from_pkl_file(self, path: str): (source)

Load the corpus from a pickle file.

Args:
    path (str): The path to the pickle file.
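For orientation, a pickle round trip with the standard library looks like the sketch below. The helpers save_pickle and load_pickle and the sample dictionary are hypothetical. The page does not state what object the pickle file holds, or whether from_pkl_file replaces the state of the instance it is called on.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Hypothetical helper: serialize obj to path."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


def load_pickle(path):
    """Hypothetical helper: deserialize the object stored at path."""
    # Only unpickle files you trust: loading a pickle can execute arbitrary code.
    with open(path, "rb") as fh:
        return pickle.load(fh)


documents = {0: "First text.", 1: "Second text."}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "corpus.pkl")
    save_pickle(documents, path)
    restored = load_pickle(path)
print(restored == documents)
```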

def get_corpus_contents(self): (source)

Get a dictionary containing only the document texts.

Returns:
    list: A list of documents.

def get_document(self, doc_id: int) -> str: (source)

Get a document from the corpus.

Args:
    doc_id (int): The ID of the document to get.

Returns:
    str: The document.

def get_stats(self) -> pd.DataFrame: (source)

Calculate statistics about the corpus.

Returns:
    pd.DataFrame: A DataFrame containing the statistics.

def search_text(self, keyword: str) -> List[Tuple[int, int, int]]: (source)

Search for passages in the documents containing the given keyword.

Args:
    keyword (str): The keyword to search for.

Returns:
    list: A list of tuples, each containing the start position, end position, and document ID of a keyword match.
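One way to produce (start, end, document ID) tuples is with re.finditer, as in the sketch below. The function name find_keyword, the doc_id -> text dictionary, the literal (escaped) matching, and the case-insensitive flag are assumptions; the actual matching rules of search_text are not documented here.

```python
import re
from typing import Dict, List, Tuple


def find_keyword(documents: Dict[int, str], keyword: str) -> List[Tuple[int, int, int]]:
    """Hypothetical helper: locate keyword matches across documents."""
    # re.escape treats the keyword as literal text rather than a regex pattern.
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    matches = []
    for doc_id, text in documents.items():
        for match in pattern.finditer(text):
            # Tuple order follows the description above: start, end, document ID.
            matches.append((match.start(), match.end(), doc_id))
    return matches


documents = {0: "The cat sat.", 1: "A cat and another cat."}
hits = find_keyword(documents, "cat")
print(hits)
```

Each tuple can be used to slice the matched passage back out, for example documents[doc_id][start:end].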

def to_dataframe(self) -> pd.DataFrame: (source)

Convert the corpus to a DataFrame.

Returns:
    pd.DataFrame: A DataFrame representation of the corpus.

authors = (source)

Undocumented

documents = (source)

Undocumented

def __clean_text(self, doc: str = None): (source)

Clean the text of a document.

Args:
    doc (str): The document to clean. If None, clean the entire corpus.

Returns:
    str: The cleaned document.
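The page does not say which cleaning steps are applied. As an illustration only, a typical pipeline lowercases the text, replaces punctuation with spaces, and collapses whitespace. The function clean_text below and every step in it are assumptions, not the behavior of __clean_text.

```python
import re


def clean_text(doc: str) -> str:
    """Hypothetical cleaning routine; the steps are assumed, not documented."""
    text = doc.lower()
    # Replace anything that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


cleaned = clean_text("Hello,   World!\nThis is   a TEST.")
print(cleaned)
```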

def __concat_data(self): (source)

Concatenate all documents in the corpus if they have not already been concatenated.
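The "only if not already done" behavior is a lazy-caching pattern, sketched below. The class LazyConcat, the single-space separator, and the use of None as the "not yet built" marker are assumptions for illustration.

```python
class LazyConcat:
    """Hypothetical illustration of build-once, reuse-afterwards caching."""

    def __init__(self, documents):
        self.documents = documents          # assumed: doc_id -> document text
        self._concatenated_text = None      # None marks "not built yet"

    def concat(self):
        # Build the joined text only on the first call; reuse it afterwards.
        if self._concatenated_text is None:
            self._concatenated_text = " ".join(self.documents.values())
        return self._concatenated_text


lazy = LazyConcat({0: "a b", 1: "c"})
first = lazy.concat()
second = lazy.concat()
print(first, first is second)
```

A cache like this goes stale if documents are added after the first call, so a real implementation would likely need to reset it whenever the corpus changes.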

__author_count: int = (source)

Undocumented

__author_to_id = (source)

Undocumented

__concated_text = (source)

Undocumented

__document_count: int = (source)

Undocumented
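The double-underscore members listed above are subject to Python's name mangling: inside the class body, an attribute written as __name is stored as _ClassName__name. The sketch below shows the effect with a hypothetical class Example. For Corpus the same rule would give names such as _Corpus__document_count, although relying on mangled names from outside the class is generally discouraged.

```python
class Example:
    """Hypothetical class used only to demonstrate name mangling."""

    def __init__(self):
        self.__document_count = 0   # stored as _Example__document_count

    def add(self):
        self.__document_count += 1


ex = Example()
ex.add()
# Outside the class, only the mangled name resolves.
mangled = ex._Example__document_count
has_plain = hasattr(ex, "__document_count")
print(mangled, has_plain)
```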