nlp_data_py.commons package

Submodules

nlp_data_py.commons.bookdef module

class nlp_data_py.commons.bookdef.Book(chunk_splitter='(?<=[.!?]) +', chunks_per_page=5)[source]

Bases: object

For managing data splitting, contents are added to the Book class. This class manages splitting the contents on a delimiter and chunking them into pages. These pages can then be used to create train, test, and val sets.

Parameters:
  • chunk_splitter – regular expression. Pattern on which to split the text
  • chunks_per_page – int: Number of chunks that make up a page

Example:

book_def: Book = Book(chunk_splitter='(?<=[.!?]) +', chunks_per_page=2)
book_def.text = "This is. A Simple. Book! That makes. No Sense?"

print(book_def.num_of_chunks)
>>> 5
print(book_def.num_of_pages)
>>> 3
read_page(page_number)[source]

Reads the content of the page.

Parameters:
  • page_number – int: Number of the page to be read

Returns:

Contents of the requested page
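
Example:

A minimal usage sketch, continuing the Book example above. That page numbers are 0-based and that chunks are joined back with a space are assumptions, not documented behavior:

first_page = book_def.read_page(0)  # assumed 0-based; with chunks_per_page=2, holds the first two chunks
print(first_page)
>>> This is. A Simple.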
text

This is the content of the entire book and has to be set before reading pages. Once this property is set, the properties below will be available.

chunks: Array[str]: Actual chunks after splitting text on chunk_splitter

num_of_chunks: Number of chunks in the book

num_of_pages: Number of pages in the book, i.e. num_of_chunks / chunks_per_page rounded up (5 chunks at 2 per page give 3 pages, as in the example above)
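
Example:

A minimal sketch of the chunks property, reusing the book from the earlier example; the exact printed representation of chunks is an assumption:

book_def.text = "This is. A Simple. Book! That makes. No Sense?"
print(book_def.chunks)
>>> ['This is.', 'A Simple.', 'Book!', 'That makes.', 'No Sense?']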

nlp_data_py.commons.splitter module

class nlp_data_py.commons.splitter.Splitter(split_ratios: List[float] = [0.8, 0.1, 0.1], dataset_names: List[str] = ['train', 'val', 'test'], shuffle=True)[source]

Bases: object

Splits pages in a book into datasets. This class simply determines which page numbers make up each dataset.

Parameters:
  • split_ratios – ratios to split the book into. Default is 80% train, 10% val and 10% test
  • dataset_names – names of the datasets to split into
  • shuffle – whether to shuffle pages before splitting
Properties:
ds_to_pages: Dict mapping each dataset name to the page numbers it contains.

Example:

splitter: Splitter = Splitter(split_ratios=[0.8, 0.1, 0.1], dataset_names=['train', 'val', 'test'], shuffle=True)
splitter.num_of_pages = 10

print(splitter.shuffled_pages)
>>> [4, 3, 1, 0, 8, 6, 9, 7, 2, 5]
print(splitter.ds_to_pages)
>>> {
        'train': [4, 3, 1, 0, 8, 6, 9, 7],
        'val': [2],
        'test': [5]
    }
logger = <Logger SplitBook (WARNING)>
static match_splitratios_and_datasetnames(split_ratios=[], dataset_names=[])[source]

If split_ratios and dataset_names are of unequal length, the shorter one is expanded. If dataset_names is shorter, a default dataset name of 'set_{position of missing item}' is created. If split_ratios is shorter, the missing ratio is set to 0 and no pages are assigned to that dataset.

Parameters:
  • split_ratios – list of ratios for pages
  • dataset_names – list of names for the datasets
Returns:

Normalized split ratios and dataset names
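
Example:

A minimal sketch of the expansion behavior described above. The exact default name ('set_2') and the (ratios, names) return order are assumptions:

# dataset_names is shorter: the missing name gets a default
ratios, names = Splitter.match_splitratios_and_datasetnames(
    split_ratios=[0.7, 0.2, 0.1], dataset_names=['train', 'val'])
print(names)
>>> ['train', 'val', 'set_2']

# split_ratios is shorter: the missing ratio is set to 0,
# so 'test' gets no pages
ratios, names = Splitter.match_splitratios_and_datasetnames(
    split_ratios=[0.9, 0.1], dataset_names=['train', 'val', 'test'])
print(ratios)
>>> [0.9, 0.1, 0]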

num_of_pages

Number of pages to be split. Once num_of_pages is set, the ds_to_pages dict will be available.

ds_to_pages: Dict mapping each dataset name to the page numbers it contains.

pages_to_datasets()[source]

Creates a dict mapping dataset names to the page numbers in each dataset.

Example:

This returns something like:

{
   "train": [0, 1, 4, 8, 9, 3, 6],
   "val": [2, 5],
   "test": [7]
}

In the above example, the train set contains the pages listed for it, and likewise for val and test.
shuffled_pages

List of shuffled page numbers if shuffle is True; otherwise just the ordered page numbers.
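
Example:

A minimal sketch of the shuffle=False case, following the class example above:

splitter = Splitter(shuffle=False)
splitter.num_of_pages = 5
print(splitter.shuffled_pages)
>>> [0, 1, 2, 3, 4]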

Module contents