Welcome to ml-indie-tools’s documentation!¶
MLEnv object¶
Tools to configure the ML environment for Pytorch, MLX, or JAX, and optional notebook/Colab environments
- class env_tools.MLEnv(platform='pt', accelerator='fastest', old_disable_eager=False)[source]¶
Bases:
object
Initialize platform and accelerator.
This checks initialization and available accelerator hardware for the different ML platforms. On return, the following variables are set: self.is_pytorch, self.is_jax, and self.is_mlx indicate whether the ML environment is available for Pytorch, JAX, or MLX respectively. self.is_notebook and self.is_colab indicate whether the environment is a notebook or Colab environment. self.is_gpu indicates that a GPU is available, self.is_tpu indicates that a TPU is available, and self.is_cpu that no accelerator is available.
The logger MLEnv provides details about the hardware and ML environment.
- Parameters:
platform – Known platforms are: ‘pt’ (pytorch), ‘mlx’, and ‘jax’
accelerator – known accelerators are: ‘fastest’ (pick best available hardware), ‘cpu’, ‘gpu’, ‘tpu’.
old_disable_eager – default False; if True, the old v1 compatibility layer is used to disable eager mode. According to rumors, that might result in old code paths being used.
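A minimal usage sketch follows; the import path ml_indie_tools.env_tools is an assumption derived from the module name shown above, so adjust it to your installation:

# Hedged sketch: the import path is an assumption derived from the module name env_tools.
from ml_indie_tools.env_tools import MLEnv

ml_env = MLEnv(platform='pt', accelerator='fastest')
if ml_env.is_gpu:
    print("GPU available:", ml_env.gpu_type)
elif ml_env.is_tpu:
    print("TPU available:", ml_env.tpu_type)
else:
    print("No accelerator, running on CPU")
desc = ml_env.describe()   # returns a description string of the machine environment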
- describe()[source]¶
Prints a description of the machine environment.
- Returns:
str: description of the machine environment.
- gpu_type¶
GPU type, e.g. ‘Tesla V100’
- init_paths(project_name=None, model_name=None)[source]¶
Initializes the paths for the project.
Depending on if this is a Colab environment or not, persistent data will be stored in either project_path=’/content/drive/My Drive/Colab Notebooks/<project_name>’ or project_path=’.’.
If Google drive access is not available, data will be stored in project_path=’.’. This data is lost, once the Colab session is closed.
project_path/data                   # training data (cache)
project_path/model[/<model_name>]   # model state, weights, etc.
.logs                               # log files
Note that log_path is always local, since Colab's Google Drive caching prevents useful logging to Google Drive.
- Parameters:
project_name – name of the project. Only used for Colab environments; non-Colab environments always use the current directory.
model_name – name of the model. Optional name for model subdirectory to allow support for multiple models.
- Returns:
(root_path, project_path, model_path, data_path, log_path)
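A short sketch of consuming the returned path tuple, continuing the ml_env instance from the sketch above; the project and model names are placeholders:

# Hedged sketch: 'my_project' and 'my_model' are placeholder names.
root_path, project_path, model_path, data_path, log_path = ml_env.init_paths(
    project_name='my_project', model_name='my_model')
print("training data cache:", data_path)
print("model state/weights:", model_path)
print("logs (always local):", log_path)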
- is_colab¶
True if running in a colab notebook
- is_conda¶
True if running in a conda environment
- is_cpu¶
True if no accelerator is available
- is_gpu¶
True if a GPU is available
- is_jax¶
True if running on Jax
- is_mlx¶
True if running on MLX
- is_notebook¶
True if running in a notebook
- is_pytorch¶
True if running on Pytorch
- is_tpu¶
True if a TPU is available
- jax_version¶
Jax version, e.g. ‘0.1.0’
- os_type¶
Operating system type, e.g. ‘Linux’, ‘Darwin’
- pt_version¶
Pytorch version, e.g. ‘1.6.0’
- py_version¶
Python version, e.g. ‘3.7.3’
- tpu_type¶
TPU type, e.g. ‘TPU v2’
MLTuner object¶
- class tuner.MLTuner(search_space=None, progress_callback=None)[source]¶
Bases:
object
Simple hyper parameter tuner
Sample search_space:
param_space_minimal_prm = {
    "dense_layers": [4, 8, 12],
    "dense_neurons": [256, 512, 768],
    "learning_rate": [0.001, 0.002],
    "regu1": [1e-8, 1e-7]
}
- Parameters:
search_space – Dictionary defining the search space.
progress_callback – Callback function that is called after each iteration with updated search space as parameter.
- tune(param_space, eval_func)[source]¶
Tune hyper parameters
Example parameter space:
param_space = {
    "dense_layers": [4, 8, 12],
    "dense_neurons": [256, 512, 768],
    "learning_rate": [0.001, 0.002],
    "regu1": [1e-8, 1e-7]
}
eval_func is called with a dictionary of hyper parameters with exactly one value for each key, e.g.:
params = {
    "dense_layers": 8,
    "dense_neurons": 256,
    "learning_rate": 0.001,
    "regu1": 1e-8
}
- Parameters:
param_space – Dictionary defining the search space.
eval_func – Function that is called to evaluate the hyper parameters.
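A hedged usage sketch; the import path ml_indie_tools.tuner is an assumption derived from the module name above, and the assumption that eval_func returns a numeric score is illustrative, since the scoring contract is not stated here:

# Hedged sketch: import path and the scoring contract of eval_func are assumptions.
from ml_indie_tools.tuner import MLTuner

param_space = {
    "dense_layers": [4, 8, 12],
    "dense_neurons": [256, 512, 768],
    "learning_rate": [0.001, 0.002],
    "regu1": [1e-8, 1e-7]
}

def eval_func(params):
    # params contains exactly one value per key, e.g. {"dense_layers": 8, ...}
    # Train and evaluate a model here; a dummy stand-in score is returned instead.
    return 1.0 / (params["dense_layers"] * params["learning_rate"])

tuner = MLTuner(search_space=param_space)
tuner.tune(param_space, eval_func)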
Gutenberg_Dataset object¶
- class Gutenberg_Dataset.Gutenberg_Dataset(root_url='https://www.gutenberg.org/dirs', cache_dir='gutenberg')[source]¶
Bases:
object
A fuzzy, lightweight class to access, search and filter Project Gutenberg resources
Gutenberg_Dataset by default uses a mirror's root URL. Alternatively, you can specify a local directory containing a Gutenberg mirror. That mirror directory needs to contain a GUTINDEX.ALL file and typically has many sub-directories 0..n.
A mirror of project Gutenberg can be created by:
#!/bin/bash
rsync -zarv --dry-run --prune-empty-dirs --del --include="*/" --include='*.'{txt,pdf,ALL} --exclude="*" aleph.gutenberg.org::gutenberg ./gutenberg_mirror
You can remove the PDF files, since they are currently not used, and you need to review the --dry-run option.
Note: load_index() needs to be called before any other methods.
- Parameters:
root_url – url of Project Gutenberg or any mirror URL, or a local directory containing a Gutenberg mirror.
cache_dir – path to a directory that will be used to cache the Gutenberg index and already downloaded texts.
The cache directory is only used if a remote Gutenberg URL is used, not a local mirror.
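A minimal setup sketch; the import path ml_indie_tools.Gutenberg_Dataset is an assumption derived from the module name above, and the mirror path is a placeholder:

# Hedged sketch: import path and mirror location are assumptions.
from ml_indie_tools.Gutenberg_Dataset import Gutenberg_Dataset

gd = Gutenberg_Dataset(cache_dir='gutenberg')            # remote mirror with local cache
# gd = Gutenberg_Dataset(root_url='./gutenberg_mirror')  # or: local mirror directory
gd.load_index()                                          # must be called before other methods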
- filter_text(book_text, add_start_tokens=None, add_near_start_tokens=None, add_end_tokens=None)[source]¶
Heuristically remove header and trailer text that is not part of the actual book
Unfortunately, the formatting of Gutenberg books is an unbelievable mess. Using lists of tokens self.start_tokens (indicating the start of the actual book text), self.near_start_tokens (indicating possibly ambiguous tokens near a start_tokens token, further narrowing the start of the text), and self.end_tokens (indicating the end of the book text), this function tries to find the start and end of the book text. The user can either extend the lists of class member tokens, or provide temporary additional tokens as parameters to this function.
The list of start_tokens contains only tokens that are always significant as being part of header-cruft (e.g. ‘START OF THIS GUTENBERG’). near_start_tokens are tokens that might be ambiguous, but are still part of the header-cruft, (e.g. ‘produced by’). near_start_tokens are only used, if they are within self.NEAR bytes to the latest start_tokens token, to heuristically prevent false positives.
Note: Use logging via logging.basicConfig(level=logging.DEBUG) to analyze the filtering process.
- Parameters:
book_text – text of the book (string)
add_start_tokens – additional start tokens (list of strings)
add_near_start_tokens – additional near start tokens (list of strings)
add_end_tokens – additional end tokens (list of strings)
- Returns:
filtered text (string)
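A hedged sketch of the filtering workflow, continuing the gd instance from above; the ebook_id and the extra tokens are illustrative placeholders, not library defaults:

# Hedged sketch: ebook_id and extra tokens are placeholders.
raw_text = gd.load_book('1342')            # raw text, Gutenberg headers included
clean_text = gd.filter_text(
    raw_text,
    add_start_tokens=["*** START OF"],     # assumed additional start marker
    add_end_tokens=["*** END OF"])         # assumed additional end marker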
- find_keywords(*search_keys)[source]¶
Search for an arbitrary number of keywords in the book records
Note: load_index() needs to be called once before this function can be used.
- Returns:
list of records that contain all keywords in any field.
- get_book(ebook_id: str)[source]¶
Get a book record's metadata and filtered text by its ebook_id
This function returns a dictionary with metadata and filtered text. Use load_book() to get the raw unfiltered text.
Note: load_index() needs to be called once before this function can be used.
- Parameters:
ebook_id – ebook_id (String, since some IDs contain letters) of the book to be retrieved
- Returns:
book record (dictionary with metadata and filtered text)
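A short sketch, continuing the gd instance from above; the ebook_id is a placeholder, and the assumption that the filtered text is stored under the key text follows the convention documented for insert_book_texts():

# Hedged sketch: '1342' is a placeholder ebook_id; the 'text' key is an assumption.
record = gd.get_book('1342')
print(record['title'], record['author'], record['language'])
print(record['text'][:500])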
- get_record_keys()[source]¶
Get a list of all keys that are used within records. Standard keys are: ebook_id, author, language, title.
Note: load_index() needs to be called once before this function can be used.
- Returns:
list of all different keys that are somehow used.
- get_unique_record_values(key)[source]¶
Get a list of all unique values a given key has across all records.
Note: load_index() needs to be called once before this function can be used.
Example: get_unique_record_values('language') returns all languages in Gutenberg.
- Parameters:
key – key to search for.
- Returns:
list of all unique values for a given key.
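For example, listing all languages present in the index (continuing the gd instance from above):

languages = gd.get_unique_record_values('language')
print(languages)    # actual values depend on the loaded index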
- insert_book_texts(search_dict, download_count_limit=20, skip_ids=[])[source]¶
Inserts book texts into the records returned by search().
In order to prevent the download of too many books, the download count limit is set to download_count_limit. Downloaded books are cached, and cached books are not counted towards the download count limit. Calling this function again will download books that have not been downloaded yet. The filtered book content is inserted into the dictionary with the key text.
- Parameters:
search_dict – search array of dictionaries that at least contain the key ebook_id.
download_count_limit – maximum number of books to download, if no local mirror is used. No limits apply for local mirrors.
skip_ids – list of ebook_ids (string format!) to skip downloading.
- Returns:
list of records including the filtered book text in the text field.
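A hedged end-to-end sketch combining search() and insert_book_texts(), continuing the gd instance from above; the search terms are examples:

# Hedged sketch: search terms are examples.
records = gd.search({"author": ["kant"], "language": ["english"]})
records = gd.insert_book_texts(records, download_count_limit=10)
for rec in records:
    print(rec['title'], len(rec['text']))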
- load_book(ebook_id)[source]¶
Get the text of an ebook from Gutenberg by ebook_id
This function returns the unfiltered raw text including all Gutenberg headers and footers. Use get_book() to retrieve a dictionary with metadata and filtered text.
- Parameters:
ebook_id – Gutenberg id (Note: string, since this sometimes contains a character!)
- Returns:
book text as string, unfiltered. Can be filtered with filter_text().
- load_index(cache=True, cache_expire_days=30)[source]¶
This function loads the Gutenberg record index, either from cache, or from a website
This should be the first method being used, since many other methods rely on the index being loaded.
- Parameters:
cache – default True, use the cache directory to cache both the index and text files. The index expires after cache_expire_days; text files never expire. Should NOT be set to False, in order to prevent unnecessary re-downloading.
cache_expire_days – number of days after which the index is re-downloaded.
- search(search_dict)[source]¶
Search for book records with specific key values. For a list of valid keys, use get_record_keys(). Standard keys are: ebook_id, author, language, title.
Note: load_index() needs to be called once before this function can be used.
Example: search({"title": ["philosoph", "phenomen", "physic", "hermeneu", "logic"], "language": "english"}) finds all English-language books whose titles contain at least one of the keywords. Search keys can either be a single keyword (e.g. "english") or an array of keywords.
- Returns:
list of records
Calibre_Dataset object¶
- class Calibre_Dataset.Calibre_Dataset(library_path, verbose=True)[source]¶
Bases:
object
A class to access and search text documents from a Calibre library.
- Parameters:
library_path – Path to the Calibre library
- load_index(use_aliases=False, max_file_size=None, truncate_large=True)[source]¶
This function loads the Calibre library records that contain text-format books.
- Parameters:
use_aliases – If True, books are not referenced by title and author, but by their numeric aliases, thus providing privacy.
max_file_size – If not None, files larger than max_file_size bytes are ignored or truncated (see truncate_large).
truncate_large – If True, files larger than max_file_size are truncated instead of ignored; only applies if max_file_size is not None.
- search(search_dict)[source]¶
Search for book records with specific key values. For a list of valid keys, use get_record_keys(). Standard keys are: ebook_id, author, language, title.
Note: load_index() needs to be called once before this function can be used.
Example: search({"title": ["philosoph", "phenomen", "physic", "hermeneu", "logic"], "language": "english"}) finds all English-language books whose titles contain at least one of the keywords. Search keys can either be a single keyword (e.g. "english") or an array of keywords.
- Returns:
list of records
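A hedged usage sketch; the import path ml_indie_tools.Calibre_Dataset is an assumption derived from the module name above, and the library location is a placeholder:

# Hedged sketch: import path and library path are assumptions.
import os
from ml_indie_tools.Calibre_Dataset import Calibre_Dataset

cd = Calibre_Dataset(os.path.expanduser('~/Calibre Library'))
cd.load_index(max_file_size=10_000_000, truncate_large=True)
hits = cd.search({"title": ["physic"], "language": "english"})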
Folder_Dataset object¶
- class Folder_Dataset.Folder_Dataset[source]¶
Bases:
object
A class to access and search text documents from a folder.
- load_index(folder_path, file_extensions=['.txt', '.md', '.py', '.org'], max_file_size=None, truncate_large=True, default_language='English', default_author=None, use_aliases=False)[source]¶
This function loads the text files from the folder.
Load index is additive, new texts are added to already imported texts.
This loads the text files from a folder and creates some metadata from the filename. If parse_metadata is True, the filename format is expected to be: “Title - Author - Language.txt”
- Parameters:
folder_path – Path to a folder containing text files with valid extensions
file_extensions – List of file extensions that identify valid text files
max_file_size – If not None, files larger than max_file_size bytes are ignored or truncated (see truncate_large)
truncate_large – If True, files larger than max_file_size are truncated instead of ignored; only applies if max_file_size is not None
default_language – If language is not given via parse_metadata, use None or “English” etc.
default_author – If author is not given via parse_metadata, use None or a specific author that applies to all texts.
use_aliases – If True, documents are not referenced by filename (containing title and author), but by their numeric aliases, thus providing privacy.
- search(search_dict)[source]¶
Search for book records with specific key values. For a list of valid keys, use get_record_keys(). Standard keys are: ebook_id, author, language, title.
Note: load_index() needs to be called once before this function can be used.
Example: search({"title": ["philosoph", "phenomen", "physic", "hermeneu", "logic"], "language": "english"}) finds all English-language books whose titles contain at least one of the keywords. Search keys can either be a single keyword (e.g. "english") or an array of keywords.
- Returns:
list of records
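A hedged usage sketch; the import path ml_indie_tools.Folder_Dataset is an assumption derived from the module name above, and the folder path is a placeholder:

# Hedged sketch: import path and folder path are assumptions.
from ml_indie_tools.Folder_Dataset import Folder_Dataset

fd = Folder_Dataset()
fd.load_index('./my_texts', file_extensions=['.txt', '.md'],
              default_language='English', default_author=None)
hits = fd.search({"title": ["notes"]})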
Text_Dataset object¶
ALU_Dataset object¶
- class ALU_Dataset.ALU_Dataset(bit_count=31, pre_weight=False)[source]¶
Bases:
object
Generate training data for all ALU operations
The ALU takes two integers and applies one of the supported model_ops. E.g. op1=123, op2=100, op=’-’ -> result 23
The net is supposed to learn to ‘calculate’ the results for arbitrary op1, op2 (positive integers, 0..2**bit_count - 1) and the twelve supported ops: [“+”, “-”, “*”, “/”, “%”, “AND”, “OR”, “XOR”, “>”, “<”, “=”, “!=”]
- Parameters:
bit_count – number of bits for each of the two operands, default 31 (mult uses 15 bits)
pre_weight – if True, the model_dis will be reweighted to generate samples for ‘difficult’ ops
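A hedged sketch of generating ALU training data; the import path ml_indie_tools.ALU_Dataset is an assumption derived from the module name above, and the exact format of the generated samples is not specified in this section:

# Hedged sketch: import path and the returned data format are assumptions.
from ml_indie_tools.ALU_Dataset import ALU_Dataset

alu = ALU_Dataset(bit_count=31)
train = alu.create_training_data(samples=10000, valid_ops=["+", "-", "AND"],
                                 equal_distrib=True)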
- check_results(model, samples=1000, vector=False, positional_encoding=True, valid_ops=None, verbose=False)[source]¶
Run a number of tests on a trained model
- create_data_point(op1, op2, op_string, vector=False, positional_suffix=False)[source]¶
create training data from given ints op1, op2 and op_string
- create_training_data(samples=10000, valid_ops=None, equal_distrib=False, verbose=True, title=None)[source]¶
create a number of training samples
- create_vector_training_data(samples=10000, valid_ops=None, equal_distrib=False, verbose=True, title=None, positional_encoding=True)[source]¶
create a number of training samples
- decode_results(result_int_vects)[source]¶
take an array of 32-float results from the neural net and convert them to ints
- get_data_point(equal_distrib=False, valid_ops=None, vector=False, positional_encoding=False)[source]¶
Get a random example of an ALU operation for training
- Parameters:
equal_distrib – if False, more ‘difficult’ ops will be generated more often.
valid_ops – if not None, only the ops in valid_ops will be used
vector – if True, the result will be returned as an embedded encoded vector
positional_encoding – if True, the result will be returned as an embedded encoded vector with additional bits for positional encoding
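A hedged sketch of drawing a single random sample, continuing the alu instance from above; the structure of the returned sample is not documented here, so it is only printed:

# Hedged sketch: the structure of the returned sample is not documented here.
sample = alu.get_data_point(equal_distrib=True, valid_ops=["+", "*"], vector=True)
print(sample)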