modules.corpus.Corpus

class documentation

class Corpus:

Constructor: Corpus()

Class representing a corpus of documents.

Method	`__init__`	Initialize a Corpus object.
Method	`__repr__`	Return a string representation of the Corpus object.
Method	`__str__`	Returns a string representation of the Corpus object.
Method	`add`	Add a document to the corpus.
Method	`calculate_word_freq_per_year`	Calculates the word frequency per year for the given list of words.
Method	`docs_to_collection`	Get all documents in the corpus as a list. Used as a collection of all document objects in the corpus.
Method	`from_pkl_file`	Load the corpus from a pickle file.
Method	`get_corpus_contents`	Get a dictionary with just the document texts.
Method	`get_document`	Get a document from the corpus.
Method	`get_stats`	Calculate statistics about the corpus.
Method	`search_text`	Search for passages in the documents containing the given keyword.
Method	`to_dataframe`	Convert the corpus to a DataFrame.
Instance Variable	`authors`	Undocumented
Instance Variable	`documents`	Undocumented
Method	`__clean_text`	Clean the text of a document.
Method	`__concat_data`	Concatenate all documents in the corpus if they have not already been concatenated.
Instance Variable	`__author_count`	Undocumented
Instance Variable	`__author_to_id`	Undocumented
Instance Variable	`__concated_text`	Undocumented
Instance Variable	`__document_count`	Undocumented

def __init__(self):

Initialize a Corpus object.

def __repr__(self):

Return a string representation of the Corpus object.

Returns:
    str: String representation of the Corpus object.

def __str__(self):

Returns a string representation of the Corpus object. The string representation includes the statistics of the Corpus object in a tabulated format.

Returns:
    str: A string representation of the Corpus object.

def add(self, doc, author):

Add a document to the corpus.

Args:
    doc: The document to add.
    author: The author of the document.
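The following is a minimal sketch of how `add` might register a document and its author, assuming the counters and the author-to-ID mapping listed in the summary table behave as their names suggest. The `MiniCorpus` class, its single-underscore attribute names, and the ID scheme are hypothetical and are not taken from the actual implementation.

```python
class MiniCorpus:
    """Hypothetical stand-in for Corpus; all internals are assumptions."""

    def __init__(self):
        self.documents = {}       # doc_id -> document
        self.authors = {}         # author_id -> author name
        self._author_to_id = {}   # author name -> author_id
        self._document_count = 0
        self._author_count = 0

    def add(self, doc, author):
        # Assign a new author ID the first time an author is seen.
        if author not in self._author_to_id:
            self._author_to_id[author] = self._author_count
            self.authors[self._author_count] = author
            self._author_count += 1
        # Store the document under the next free document ID.
        self.documents[self._document_count] = doc
        self._document_count += 1


corpus = MiniCorpus()
corpus.add("First text", "Ada")
corpus.add("Second text", "Ada")
print(len(corpus.documents), len(corpus.authors))  # 2 1
```

In this sketch a repeated author reuses the existing ID, so the author count grows only when a new name appears.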

def calculate_word_freq_per_year(self, words_to_track: List[str]) -> defaultdict:

Calculates the word frequency per year for the given list of words.

Args:
    words_to_track (list): A list of words to track the frequency of.

Returns:
    defaultdict: A nested defaultdict containing the word frequency per year.
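To illustrate what a nested `defaultdict` of per-year word frequencies can look like, here is a rough standalone sketch. It assumes documents are available as `(year, text)` pairs, that matching is case-insensitive, and that the nesting is `year -> word -> count`; none of these details are confirmed by the documentation above, and `word_freq_per_year` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def word_freq_per_year(
    docs: Iterable[Tuple[int, str]], words_to_track: List[str]
) -> defaultdict:
    """Count tracked words per year; docs are hypothetical (year, text) pairs."""
    tracked = {w.lower() for w in words_to_track}
    # freq[year][word] -> count; missing keys default to 0.
    freq = defaultdict(lambda: defaultdict(int))
    for year, text in docs:
        for token in text.lower().split():
            if token in tracked:
                freq[year][token] += 1
    return freq


docs = [(2020, "data science and data"), (2021, "science news")]
freq = word_freq_per_year(docs, ["data", "science"])
print(freq[2020]["data"], freq[2021]["science"])  # 2 1
```

Because both levels are `defaultdict` objects, looking up a year or word that never occurred yields 0 instead of raising `KeyError`.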

def docs_to_collection(self) -> List[str]:

Get all documents in the corpus as a list. Used as a collection of all document objects in the corpus.

Returns:
    list: A list of documents.

def from_pkl_file(self, path: str):

Load the corpus from a pickle file.

Args:
    path (str): The path to the pickle file.
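As a hedged illustration of the loading step, the sketch below round-trips a small object through a pickle file using only the standard library. The `load_pickle` helper and the dictionary payload are assumptions made for the example; the documentation does not state what object the real pickle file contains or how the method applies it to the instance.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path: str):
    """Load and return whatever object was pickled at `path`."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Round trip with a throwaway file to show the call shape.
with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "corpus.pkl"
    with open(pkl_path, "wb") as fh:
        pickle.dump({0: "First text"}, fh)
    restored = load_pickle(str(pkl_path))

print(restored)  # {0: 'First text'}
```

Unpickling can execute arbitrary code, so pickle files should only be loaded from trusted sources.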

def get_corpus_contents(self):

Get a dictionary with just the document texts.

Returns:
    list: A list of documents.

def get_document(self, doc_id: int) -> str:

Get a document from the corpus.

Args:
    doc_id (int): The ID of the document to get.

Returns:
    str: The document.

def get_stats(self) -> pd.DataFrame:

Calculate statistics about the corpus.

Returns:
    pd.DataFrame: A DataFrame containing the statistics.

def search_text(self, keyword: str) -> List[Tuple[int, int, int]]:

Search for passages in the documents containing the given keyword.

Args:
    keyword (str): The keyword to search for.

Returns:
    list: A list of tuples, each containing the start position and end position of the matched keyword and the document ID.
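The return shape described above can be sketched with the standard `re` module. This standalone function assumes documents are held in a `doc_id -> text` dictionary and that matching is case-insensitive and literal; both are assumptions for illustration, not confirmed behavior of the method.

```python
import re
from typing import Dict, List, Tuple


def search_text(documents: Dict[int, str], keyword: str) -> List[Tuple[int, int, int]]:
    """Return (start, end, doc_id) for each occurrence of keyword."""
    # re.escape treats the keyword as literal text, not as a regex.
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    matches = []
    for doc_id, text in documents.items():
        for m in pattern.finditer(text):
            matches.append((m.start(), m.end(), doc_id))
    return matches


documents = {0: "Data is data.", 1: "No match here."}
hits = search_text(documents, "data")
print(hits)  # [(0, 4, 0), (8, 12, 0)]
```

Each tuple can be used to slice the passage back out, for example `documents[doc_id][start:end]`.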

def to_dataframe(self) -> pd.DataFrame:

Convert the corpus to a DataFrame.

Returns:
    pd.DataFrame: A DataFrame representation of the corpus.

authors

Undocumented

documents

Undocumented

def __clean_text(self, doc: str = None):

Clean the text of a document.

Args:
    doc (str): The document to clean. If None, clean the entire corpus.

Returns:
    str: The cleaned document.
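The documentation does not say which cleaning rules are applied. As one plausible sketch, the function below lowercases the text, replaces punctuation with spaces, and collapses whitespace; these rules and the `clean_text` name are assumptions chosen for illustration only.

```python
import re


def clean_text(doc: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (assumed rules)."""
    text = doc.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()


print(clean_text("Hello,   World!\nNew line."))  # hello world new line
```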

def __concat_data(self):

Concatenate all documents in the corpus if they have not already been concatenated.
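The "only if not already done" behavior is a common lazy-caching pattern. Here is a minimal sketch of it, assuming the cached value lives in an attribute like the `__concated_text` variable listed in the summary table and that documents are joined with a single space. The `LazyConcat` class, the separator, and the `None` sentinel are hypothetical.

```python
class LazyConcat:
    """Hypothetical holder that joins documents once and caches the result."""

    def __init__(self, documents):
        self.documents = list(documents)
        self._concated_text = None  # cache; None means "not built yet"

    def concat(self) -> str:
        # Build the joined text only on first use.
        if self._concated_text is None:
            self._concated_text = " ".join(self.documents)
        return self._concated_text


holder = LazyConcat(["first text", "second text"])
print(holder.concat())  # first text second text
```

A cache like this would need to be reset whenever a document is added, otherwise later calls return stale text.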

__author_count: int

Undocumented

__author_to_id

Undocumented

__concated_text

Undocumented

__document_count: int

Undocumented