Class representing a corpus of documents.
Method | __init__ |
Initialize a Corpus object. |
Method | __repr__ |
Returns a string representation of the Corpus object. |
Method | __str__ |
Returns a string representation of the Corpus object. |
Method | add |
Add a document to the corpus. |
Method | calculate |
Calculates the word frequency per year for the given list of words. |
Method | docs |
Get all documents in the corpus as a list. Used as the collection of all document objects in the corpus. |
Method | from |
Load the corpus from a pickle file. |
Method | get |
Get a dictionary containing only the document texts. |
Method | get |
Get a document from the corpus. |
Method | get |
Calculate statistics about the corpus. |
Method | search |
Search for passages in the documents containing the given keyword. |
Method | to |
Convert the corpus to a DataFrame. |
Instance Variable | authors |
Undocumented |
Instance Variable | documents |
Undocumented |
Method | __clean |
Clean the text of a document. |
Method | __concat |
Concatenate all documents in the corpus if they have not already been concatenated. |
Instance Variable | __author |
Undocumented |
Instance Variable | __author |
Undocumented |
Instance Variable | __concated |
Undocumented |
Instance Variable | __document |
Undocumented |
Returns a string representation of the Corpus object. The string representation includes the statistics of the Corpus object in a tabulated format.
Returns:
    str: A string representation of the Corpus object.
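To illustrate what a tabulated string representation might look like, here is a minimal sketch. `TinyCorpus`, `format_stats`, and the statistic names `documents` and `tokens` are hypothetical stand-ins; the real Corpus class may compute different statistics and lay them out differently.

```python
def format_stats(stats):
    """Render a mapping of statistic name -> value as a two-column text table."""
    # Pad every name to the width of the longest one so the values line up.
    width = max(len(name) for name in stats)
    lines = [f"{name.ljust(width)}  {value}" for name, value in stats.items()]
    return "\n".join(lines)


class TinyCorpus:
    """Hypothetical stand-in for a corpus; holds plain text strings only."""

    def __init__(self, texts):
        self.texts = list(texts)

    def __str__(self):
        # Assumed statistics: document count and whitespace-separated token count.
        stats = {
            "documents": len(self.texts),
            "tokens": sum(len(text.split()) for text in self.texts),
        }
        return format_stats(stats)


corpus = TinyCorpus(["a b c", "d e"])
print(corpus)
```

Routing `__str__` through a statistics helper keeps the printed summary consistent with whatever the statistics method reports.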
Calculates the word frequency per year for the given list of words.
Args:
    words_to_track (list): A list of words whose frequency should be tracked.
Returns:
    defaultdict: A nested defaultdict containing the word frequency per year.
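The nested defaultdict described above can be sketched as follows. This is an illustration only: the function name `word_frequency_per_year`, the `(year, text)` input shape, the year-then-word nesting order, and the regex tokenization are all assumptions, not the Corpus class's actual implementation.

```python
import re
from collections import defaultdict


def word_frequency_per_year(documents, words_to_track):
    """Count occurrences of each tracked word, grouped by year.

    `documents` is an iterable of (year, text) pairs. This input shape is an
    assumption made for the sketch.
    """
    tracked = {word.lower() for word in words_to_track}
    # Outer key: year. Inner key: word. Value: occurrence count.
    freq = defaultdict(lambda: defaultdict(int))
    for year, text in documents:
        # Simple case-insensitive tokenization on word characters.
        for token in re.findall(r"\w+", text.lower()):
            if token in tracked:
                freq[year][token] += 1
    return freq


docs = [
    (1990, "The cat sat. The cat slept."),
    (1991, "A dog barked at the cat."),
]
freq = word_frequency_per_year(docs, ["cat", "dog"])
```

Because both levels are defaultdicts, looking up a year or word that never occurred yields a count of 0 instead of raising `KeyError`.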
Get all documents in the corpus as a list. Used as the collection of all document objects in the corpus.
Returns:
    list: A list of documents.
Get a document from the corpus.
Args:
    doc_id (int): The ID of the document to get.
Returns:
    str: The document.
Calculate statistics about the corpus.
Returns:
    pd.DataFrame: A DataFrame containing the statistics.
Search for passages in the documents containing the given keyword.
Args:
    keyword (str): The keyword to search for.
Returns:
    list: A list of tuples, each containing the start position and end position of the matched keyword and the document ID.
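A keyword search that returns start position, end position, and document ID could look roughly like the sketch below. The function name `search_keyword`, the ID-to-text dictionary, and the case-insensitive matching are assumptions chosen for illustration; the actual method may match differently (for example, on whole words only).

```python
import re


def search_keyword(documents, keyword):
    """Return a (start, end, doc_id) tuple for every occurrence of `keyword`.

    `documents` maps a document ID to its text. This mapping is a hypothetical
    stand-in for the corpus contents.
    """
    # Escape the keyword so characters such as "." or "+" match literally.
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    matches = []
    for doc_id, text in documents.items():
        for match in pattern.finditer(text):
            matches.append((match.start(), match.end(), doc_id))
    return matches


docs = {
    0: "Whales are mammals.",
    1: "The whale dived; another whale surfaced.",
}
hits = search_keyword(docs, "whale")

# Each tuple can be used to slice the passage back out of its document.
passages = [docs[doc_id][start:end] for start, end, doc_id in hits]
```

Returning positions instead of the matched text lets the caller extract as much surrounding context as it needs.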