nlp_data_py.commons package¶
Subpackages¶
Submodules¶
nlp_data_py.commons.bookdef module¶
class nlp_data_py.commons.bookdef.Book(chunk_splitter='(?<=[.!?]) +', chunks_per_page=5)[source]¶
Bases: object
For managing data splitting, contents are added to the Book class. The class handles splitting the contents on a delimiter and chunking them into pages. These pages can then be used to create train, test, and val sets.
Parameters: - chunk_splitter – regular expression. Pattern on which to split the text
- chunks_per_page – int: Number of chunks that make up a page
Example:
book_def: Book = Book(chunk_splitter='(?<=[.!?]) +', chunks_per_page=2)
book_def.text = "This is. A Simple. Book! That makes. No Sense?"
print(book_def.num_of_chunks)
>>> 5
print(book_def.num_of_pages)
>>> 3
read_page(page_number)[source]¶
Reads the content of the page.
Parameters: page_number – int: Number of the page to be read
Returns: Contents of the requested page
text¶
This is the content of the entire book and has to be set before reading pages. Once this property is set, the properties below will be available:
chunks: Array[str]: Actual chunks after splitting the text on chunk_splitter
num_of_chunks: Number of chunks in the book
num_of_pages: Number of pages in the book: num_of_chunks / chunks_per_page
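The chunking and paging behavior described above can be sketched as follows. This is an illustrative re-implementation, not the library's actual code; the function name `split_into_pages` is made up for the example.

```python
import re

# Sketch of Book's chunk/page logic as documented above:
# split text on the regex, then group chunks into pages.
def split_into_pages(text, chunk_splitter=r'(?<=[.!?]) +', chunks_per_page=2):
    chunks = re.split(chunk_splitter, text)
    pages = [' '.join(chunks[i:i + chunks_per_page])
             for i in range(0, len(chunks), chunks_per_page)]
    return chunks, pages

chunks, pages = split_into_pages("This is. A Simple. Book! That makes. No Sense?")
print(len(chunks))  # 5
print(len(pages))   # 3
print(pages[0])     # This is. A Simple.
```

Reading a page is then just indexing into the pages list, which matches the read_page(page_number) behavior described above.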
nlp_data_py.commons.splitter module¶
class nlp_data_py.commons.splitter.Splitter(split_ratios: List[float] = [0.8, 0.1, 0.1], dataset_names: List[str] = ['train', 'val', 'test'], shuffle=True)[source]¶
Bases: object
Splits the pages in a book into datasets. This class simply determines which page numbers make up each dataset.
Parameters: - num_of_pages – int: Number of pages in the book to be split (set via the num_of_pages property)
- split_ratios – ratios to split the book by. The default ratio is 80% train, 10% val and 10% test
- dataset_names – dataset names to be split to
- shuffle – shuffle pages
Properties:
- ds_to_pages: dict mapping each dataset name to the page numbers in that dataset.
Example:
splitter: Splitter = Splitter(split_ratios=[0.8, 0.1, 0.1], dataset_names=['train', 'val', 'test'], shuffle=True)
splitter.num_of_pages = 10
print(splitter.shuffled_pages)
>>> [4, 3, 1, 0, 8, 6, 9, 7, 2, 5]
print(splitter.ds_to_pages)
>>> {'train': [4, 3, 1, 0, 8, 6, 9, 7], 'val': [2], 'test': [5]}
logger = <Logger SplitBook (WARNING)>¶
static match_splitratios_and_datasetnames(split_ratios=[], dataset_names=[])[source]¶
If the split_ratios and dataset_names lists are of unequal length, this expands the shorter one. If dataset_names is shorter, default names of the form 'set_{position of missing item}' are created. If split_ratios is shorter, the missing ratios are set to 0 and no pages are assigned to them.
Parameters: - split_ratios – list of ratios for pages
- dataset_names – list of names for the datasets
Returns: Normalized ratios and dataset names
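The padding behavior described above can be sketched like this. This is not the library's actual code; the shortened function name `match_ratios_and_names` is illustrative.

```python
# Sketch of the normalization described above: pad the shorter of
# the two lists so that ratios and names line up one-to-one.
def match_ratios_and_names(split_ratios, dataset_names):
    n = max(len(split_ratios), len(dataset_names))
    # Missing ratios become 0, so no pages are assigned to them.
    ratios = list(split_ratios) + [0] * (n - len(split_ratios))
    # Missing names default to 'set_<position>'.
    names = list(dataset_names) + ['set_' + str(i)
                                   for i in range(len(dataset_names), n)]
    return ratios, names

print(match_ratios_and_names([0.8, 0.1, 0.1], ['train']))
# ([0.8, 0.1, 0.1], ['train', 'set_1', 'set_2'])
```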
num_of_pages¶
Number of pages for splitting. Once num_of_pages is set, the ds_to_pages dict will be available.
ds_to_pages: dict mapping each dataset name to the page numbers in that dataset.
pages_to_datasets()[source]¶
Creates a dict of dataset names and page numbers.
Example:
This returns something like:
{'train': [0, 1, 4, 8, 9, 3, 6], 'val': [2, 5], 'test': [7]}
In the above example, the train set will contain the pages in its list, and likewise for val and test.
shuffled_pages¶
List of shuffled page numbers if shuffle is true, else just ordered page numbers.
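Putting the pieces together, the page-to-dataset assignment described in this module can be sketched as follows. This is an illustrative re-implementation, not the library's actual code; the `seed` parameter is an assumption added for reproducibility.

```python
import random

# Sketch of pages_to_datasets: optionally shuffle the page numbers,
# then hand each dataset a contiguous slice sized by its ratio.
def pages_to_datasets(num_of_pages, split_ratios, dataset_names,
                      shuffle=True, seed=None):
    pages = list(range(num_of_pages))
    if shuffle:
        random.Random(seed).shuffle(pages)
    ds_to_pages = {}
    start = 0
    for name, ratio in zip(dataset_names, split_ratios):
        count = round(num_of_pages * ratio)
        ds_to_pages[name] = pages[start:start + count]
        start += count
    return ds_to_pages

# Deterministic example with shuffle disabled:
print(pages_to_datasets(10, [0.8, 0.1, 0.1], ['train', 'val', 'test'],
                        shuffle=False))
# {'train': [0, 1, 2, 3, 4, 5, 6, 7], 'val': [8], 'test': [9]}
```

With shuffle=True the same slicing applies, only over a shuffled page list, which is how a result like the ds_to_pages example above is produced.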