nlp_data_py.dataset package

Submodules

nlp_data_py.dataset.command_line module

nlp_data_py.dataset.command_line.str2bool(v)[source]
nlp_data_py.dataset.command_line.wiki_dataset()[source]
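
str2bool is typically used as an argparse type converter for boolean command-line flags. The following is an assumed, typical implementation shown only for illustration; the module's actual body may differ:

    import argparse

    def str2bool(v):
        # Accept common textual spellings of booleans from the command line.
        if isinstance(v, bool):
            return v
        if v.lower() in ("yes", "true", "t", "y", "1"):
            return True
        if v.lower() in ("no", "false", "f", "n", "0"):
            return False
        raise argparse.ArgumentTypeError("Boolean value expected.")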

nlp_data_py.dataset.constants module

nlp_data_py.dataset.dataset module

class nlp_data_py.dataset.dataset.Dataset(name, scanned_pickle, match, save_dataset_path, book_def: nlp_data_py.commons.bookdef.Book, splitter: nlp_data_py.commons.splitter.Splitter)[source]

Bases: object

Abstract class to create datasets like train, test and val

Parameters:
  • scanned_pickle – Path to a pickle file tracking items that have already been read, which makes it possible to read items incrementally. The pickle file stores a dict, for example: {"item1": 1, "item2": 0, "item3": -1}. Here, item1 was read previously and hence won't be read again, item2 has not been read and will be considered in future reads, and item3 errored out in a previous read and will be attempted again

  • match – Regular expression given as a string. Only items matching the regular expression will be read when creating datasets
  • save_dataset_path – Path to folder where the datasets will be saved.
  • book_def – Book. This object defines a book. The default is 5 sentences per page, and each sentence is by default defined as a string ending in ., ! or ?
  • splitter – Splitter. Defines how to split datasets. The default is to create train, val and test sets in the ratio 80%, 10% and 10% respectively, with shuffle set to true; when shuffle is true, pages (as defined by book_def) are shuffled before the datasets are created

Once the datasets are created, the items that were covered are tracked in self.scanned, which is written to a pickle file. This makes it possible to continue updating the datasets at a later point in time.
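
Since Dataset is abstract, a concrete subclass supplies handle_contents. The following is a minimal sketch (not taken from the library; load_text_for is a hypothetical helper) of how a subclass might feed text into generate_datasets:

    from nlp_data_py.dataset.dataset import Dataset

    def load_text_for(seed):
        # Hypothetical helper: in a real subclass this would fetch the text
        # for `seed` from whatever source the dataset wraps.
        return "Text for " + seed + ". Another sentence! And one more?"

    class MyTextDataset(Dataset):
        """Hypothetical subclass illustrating the abstract handle_contents hook."""

        def handle_contents(self, seed):
            # Fetch raw text for the seed, then let the base class split it
            # into pages and write it out as train/val/test datasets.
            text = load_text_for(seed)
            self.generate_datasets(text)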

filter_scannable(items)[source]

Filters items that meet the criteria for creating this dataset. To meet the criteria, an item must match the specified regular expression and must be unread according to self.scanned.

Parameters: items – List of items to be considered for scanning.
Returns: items that meet the criteria.
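
A conceptual sketch of the filtering rule described above (an illustration only, not the library's exact code; whether the regular expression is anchored is an assumption):

    import re

    def filter_scannable(items, match, scanned):
        # Keep items that match the regular expression and that are not yet
        # marked as read (value 1) in the scanned dict; unread (0) and
        # previously errored (-1) items remain eligible.
        pattern = re.compile(match)
        return [item for item in items
                if pattern.search(item) and scanned.get(item, 0) != 1]
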
generate_datasets(text)[source]

Main method for creating datasets. This method takes care of splitting the text as defined by the book and splitter, and writing the contents into datasets such as train, test and val.
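
Conceptually, the default split described for Splitter (80%/10%/10% with shuffling) amounts to something like the following sketch (an illustration only, not the library's internals):

    import random

    def split_pages(pages, ratios=(0.8, 0.1, 0.1), shuffle=True):
        # Optionally shuffle the pages, then cut the list into train/val/test
        # portions according to the given ratios.
        pages = list(pages)
        if shuffle:
            random.shuffle(pages)
        n_train = int(len(pages) * ratios[0])
        n_val = int(len(pages) * ratios[1])
        return {"train": pages[:n_train],
                "val": pages[n_train:n_train + n_val],
                "test": pages[n_train + n_val:]}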

handle_contents(seed)[source]

Abstract method that handles the contents of items; this mainly involves creating the datasets.

load_scanned_tracker()[source]

Checks whether a scanned_pickle file is provided. If so, it is read and its contents are returned; otherwise an empty dict is returned.

Returns: dict of scanned items, or an empty dict.
write_scanned_tracker()[source]

Writes self.scanned, which tracks the items covered in this run, to a pickle file.
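
The tracker round-trip can be pictured with standard pickle usage (a sketch under the assumption that plain pickle load/dump is all that is involved):

    import os
    import pickle

    def load_scanned_tracker(scanned_pickle):
        # Return the tracked dict if a pickle path was given and exists,
        # otherwise start with an empty dict.
        if scanned_pickle and os.path.exists(scanned_pickle):
            with open(scanned_pickle, "rb") as fh:
                return pickle.load(fh)
        return {}

    def write_scanned_tracker(scanned, scanned_pickle):
        # Persist the dict of scanned items for the next run.
        with open(scanned_pickle, "wb") as fh:
            pickle.dump(scanned, fh)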

nlp_data_py.dataset.wiki module

class nlp_data_py.dataset.wiki.WikiDataset(book_def, splitter, seeds=[], match='', recursive=True, limit=20, scanned_pickle='./vars/scanned.pkl', save_dataset_path='./vars/')[source]

Bases: nlp_data_py.dataset.dataset.Dataset

Creates datasets such as train, test and val from Wikipedia. This is an implementation of the Dataset class.

Parameters:
  • book_def – Book. This object defines a book. The default is 5 sentences per page, and each sentence is by default defined as a string ending in ., ! or ?
  • splitter – Splitter. Defines how to split datasets. The default is to create train, val and test sets in the ratio 80%, 10% and 10% respectively, with shuffle set to true; when shuffle is true, pages (as defined by book_def) are shuffled before the datasets are created
  • seeds – List of Wikipedia pages to seed the datasets. If seeds are specified and recursive is False, only the items in seeds will be read. If seeds are specified and recursive is True, the seeds will be read first and then additional pages, up to limit, will be read
  • match – Regular expression given as a string. Only items matching the regular expression will be read when creating datasets
  • recursive – Boolean, default True. Indicates whether additional pages should be read and tracked, i.e. links in the wiki pages will be extracted and tracked in the scanned variable, which is then written to the pickle file
  • limit – int, default 20. Number of additional pages to read in addition to the seeds. These pages are taken from the self.scanned variable
  • scanned_pickle – Path to a pickle file tracking items that have already been read, which makes it possible to read items incrementally. The pickle file stores a dict, for example: {"item1": 1, "item2": 0, "item3": -1}. Here, item1 was read previously and hence won't be read again, item2 has not been read and will be considered in future reads, and item3 errored out in a previous read and will be attempted again

  • save_dataset_path – Path to folder where the datasets will be saved.
classmethod create_dataset_from_wiki(seeds=[], match='', recursive=True, limit=20, scanned_pickle='./vars/scanned.pkl', save_dataset_path='./vars/', book_def: nlp_data_py.commons.bookdef.Book = <nlp_data_py.commons.bookdef.Book object>, splitter: nlp_data_py.commons.splitter.Splitter = <nlp_data_py.commons.splitter.Splitter object>)[source]

Class method to read from Wikipedia and create datasets.

Parameters:
  • seeds – List of Wikipedia pages to seed the datasets. If seeds are specified and recursive is False, only the items in seeds will be read. If seeds are specified and recursive is True, the seeds will be read first and then additional pages, up to limit, will be read
  • match – Regular expression given as a string. Only items matching the regular expression will be read when creating datasets
  • recursive – Boolean, default True. Indicates whether additional pages should be read and tracked, i.e. links in the wiki pages will be extracted and tracked in the scanned variable, which is then written to the pickle file
  • limit – int, default 20. Number of additional pages to read in addition to the seeds. These pages are taken from the self.scanned variable
  • scanned_pickle – Path to a pickle file tracking items that have already been read, which makes it possible to read items incrementally. The pickle file stores a dict, for example: {"item1": 1, "item2": 0, "item3": -1}. Here, item1 was read previously and hence won't be read again, item2 has not been read and will be considered in future reads, and item3 errored out in a previous read and will be attempted again

  • save_dataset_path – Path to folder where the datasets will be saved.
  • book_def – Book. This object defines a book. The default is 5 sentences per page, and each sentence is by default defined as a string ending in ., ! or ?
  • splitter – Splitter. Defines how to split datasets. The default is to create train, val and test sets in the ratio 80%, 10% and 10% respectively, with shuffle set to true; when shuffle is true, pages (as defined by book_def) are shuffled before the datasets are created

Example

create_dataset_from_wiki(['Brain', 'Medulla_oblongata'])

In the above example (a runnable sketch follows this list):
  • Brain will be read from Wikipedia
  • its contents will be broken into pages as defined by the default book
  • the pages will be shuffled
  • the pages will be split as defined by the default splitter
  • links will be extracted from the page
  • links matching the pattern in match (in this case, all links) will be added to self.scanned if they are not already there
  • Brain will be set to 1 in self.scanned to indicate that this page has already been read
  • the same steps are repeated with 'Medulla_oblongata'
  • since recursive is set to True and limit is 20, the next 20 unread items from self.scanned will be read and their links will be tracked in self.scanned
  • finally, self.scanned is written to a pickle file
  • if the same code is run again, the pickle file will be read; since Brain and Medulla_oblongata have already been read, they will be skipped and the next 20 unread items from self.scanned will be read
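
A runnable version of the example above (the import path follows the module layout documented on this page; the package may also re-export WikiDataset at its top level):

    from nlp_data_py.dataset.wiki import WikiDataset

    # Read the two seed pages, follow up to 20 additional linked pages, and
    # write train/val/test files under ./vars/ while tracking progress in
    # ./vars/scanned.pkl (these are the documented defaults).
    WikiDataset.create_dataset_from_wiki(
        seeds=['Brain', 'Medulla_oblongata'],
        match='',                          # empty pattern: accept all links
        recursive=True,
        limit=20,
        scanned_pickle='./vars/scanned.pkl',
        save_dataset_path='./vars/',
    )
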
handle_contents(seed)[source]

This method is responsible for reading the contents of a page from Wikipedia, extracting links from the page, and adding any new links to self.scanned.
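
As a conceptual sketch only (the library's actual fetching code may differ), the behaviour described above could look like the following, using the third-party wikipedia package as a stand-in for the page-fetching step:

    import wikipedia

    def handle_contents_sketch(seed, scanned):
        # Fetch one page, record its links as unread, and mark the seed as
        # read; the returned text would then go to generate_datasets().
        page = wikipedia.page(seed)
        for link in page.links:
            scanned.setdefault(link, 0)   # 0 = not yet read
        scanned[seed] = 1                 # 1 = read
        return page.content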

Module contents