AbstractPhil 
posted an update 25 days ago
The first set of geometrically aligned datasets is ready. Each dimensional variation is in its own repo, so there's no confusion with splits.
Current splits:
* wordnet (english)
* unicode
AbstractPhil/geometric-vocab-32d
[32, 64, 128, 256, 512, 768, 1024]
Swap the `32d` suffix for any dimension in the list to get the corresponding repo.
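The repo-per-dimension pattern above can be sketched as a small helper. This is a minimal sketch assuming the `AbstractPhil/geometric-vocab-{dim}d` naming from the post; the `repo_for_dim` helper is illustrative, not part of the project.

```python
# Supported dimensional variations listed in the post.
DIMS = [32, 64, 128, 256, 512, 768, 1024]

def repo_for_dim(dim: int) -> str:
    """Map a dimension to its dedicated Hugging Face repo id."""
    if dim not in DIMS:
        raise ValueError(f"unsupported dimension: {dim}")
    return f"AbstractPhil/geometric-vocab-{dim}d"

# Then each repo can be pulled with the standard datasets loader, e.g.:
# from datasets import load_dataset
# vocab = load_dataset(repo_for_dim(256))

print(repo_for_dim(256))  # AbstractPhil/geometric-vocab-256d
```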

Okay, so the purpose of these is to give solid anchors to the entire pentachora structure.
With that, I've formatted some very concise SentencePiece-esque vocabulary classes that can be saved and loaded as pretrained, but they'll need some tinkering to fully flesh out those behaviors.
For now, the geometric vocab itself can be queried from pretrained weights, but the canonical classes that help with regulation, integration, and special-token usage aren't fully tested yet.
https://github.com/AbstractEyes/lattice_vocabulary
They are available here, but I give no guarantees about their current state. I'm currently preparing the pip package and have prepared a series of experiments that use these with different models, including a new version of multimodal Beeper, a classifier set that can handle encodings as feature representations, and more.
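The save/load-as-pretrained behavior described above can be sketched with a minimal wrapper. This is a hypothetical illustration of the pattern only; the class name, file layout, and method names are assumptions, not the actual API of lattice_vocabulary.

```python
import json
import pathlib
import tempfile

class GeometricVocab:
    """Minimal sketch of a vocabulary that round-trips like a
    'pretrained' artifact: token -> geometric embedding vector."""

    def __init__(self, tokens: dict):
        self.tokens = tokens

    def save_pretrained(self, path: str) -> None:
        # Persist the token table as a single JSON file.
        p = pathlib.Path(path)
        p.mkdir(parents=True, exist_ok=True)
        (p / "vocab.json").write_text(json.dumps(self.tokens))

    @classmethod
    def from_pretrained(cls, path: str) -> "GeometricVocab":
        # Rebuild the vocabulary from the saved JSON.
        data = json.loads((pathlib.Path(path) / "vocab.json").read_text())
        return cls(data)

# Round-trip demo with toy 2D "embeddings".
vocab = GeometricVocab({"cat": [0.1, 0.2], "dog": [0.3, 0.4]})
with tempfile.TemporaryDirectory() as tmp:
    vocab.save_pretrained(tmp)
    restored = GeometricVocab.from_pretrained(tmp)
print(restored.tokens == vocab.tokens)  # True
```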

The current working variation I've been using is flow-matching, discrete-scheduled geometric diffusion - meaning I'm diffusing the GEOMETRY from the image, and then comparing the pentachoron created by flow matching to the actual representative tokenization structure. On average this is achieving 80% in later stages.
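For readers unfamiliar with the training objective named above, here is a generic conditional flow-matching sketch (linear interpolation path, velocity-matching loss). This is the standard textbook form, not the author's actual training code; the pentachoron coordinates here are a stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model, x0, x1, t):
    """Conditional flow matching with a linear path:
    x_t = (1 - t) * x0 + t * x1, target velocity u = x1 - x0.
    The model predicts the velocity at (x_t, t)."""
    xt = (1.0 - t)[:, None] * x0 + t[:, None] * x1
    target = x1 - x0
    pred = model(xt, t)
    return float(np.mean((pred - target) ** 2))

x0 = rng.normal(size=(8, 4))  # noise source
x1 = rng.normal(size=(8, 4))  # stand-in for target geometry (e.g. vertex coords)
t = rng.uniform(size=8)

# An oracle model that outputs the exact target velocity gives zero loss.
loss = flow_matching_loss(lambda xt, t: x1 - x0, x0, x1, t)
print(loss)  # 0.0
```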

This, combined with curating an indefinite number of special tokens to create manifests of unique vocabularies, enables the system to conform precisely to a use case.
There are some edge cases where the 1k reserved tokens still exist; however, that block is being replaced by an open-ended tokenization dictionary, allowing an indefinite number of tokens attached to an indefinite number of modules for solidity.
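The open-ended token dictionary idea can be sketched as a registry that assigns ids on demand instead of reserving a fixed block. The class and method names here are hypothetical illustrations, not the project's API.

```python
class TokenRegistry:
    """Sketch of an open-ended special-token dictionary: instead of a
    fixed 1k reserved block, tokens are registered per module and ids
    are assigned on demand, so the vocabulary can grow indefinitely."""

    def __init__(self):
        self._ids = {}  # "module:token" -> id

    def register(self, module: str, token: str) -> int:
        key = f"{module}:{token}"
        if key not in self._ids:
            self._ids[key] = len(self._ids)  # next free id
        return self._ids[key]

reg = TokenRegistry()
print(reg.register("beeper", "<bos>"))  # 0
print(reg.register("beeper", "<eos>"))  # 1
print(reg.register("beeper", "<bos>"))  # 0 (idempotent)
```

Registering the same module/token pair twice returns the same id, while distinct modules can reuse token names without collision.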

Experiments continue.

I've begun the task of properly tooling the lattice_vocabulary for future development and use with a multitude of geometric shapes - not just pentachora.

This experimental system will house a multitude of additional capabilities:
https://github.com/AbstractEyes/lattice_vocabulary/tree/master/src/geovocab

I plan to implement, in no particular order:

  • simplified state_dict dictionary setup for direct manipulation
  • ✅ full batching structure with iteration loops removed, utilizing the Hugging Face datasets columnar system
  • full transform callback for loading and curating pentachora losslessly and deterministically
  • ✅ a full experimental callback system for transforming crystallized repos into shapes other than pentachora
  • a simplified interface for converting large independent repos into geometric structure using transforms
  • a uniform configuration schema so any geometric repo can be loaded automatically
  • ✅ (ongoing) faster and more optimized load times for the default loaders
  • direct crystal training schemas for curating your own lattices from many different sources of information
  • a full task-by-task schema for multi-stage crystallization, so you can tune crystals precisely for the use case, with defined mathematics and callback hooks for research and use-case mathematics
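The uniform-configuration item above could take a shape like the following. Every field name here is an assumption made for illustration; the project's actual schema is not yet published.

```python
from dataclasses import dataclass, asdict

@dataclass
class GeometricConfig:
    """Hypothetical uniform config so any geometric repo could be
    loaded automatically. Field names are illustrative guesses."""
    dim: int                        # embedding dimension, e.g. 32..1024
    shape: str = "pentachoron"      # target geometric shape
    vocab_source: str = "wordnet"   # e.g. "wordnet" or "unicode"

cfg = GeometricConfig(dim=128)
print(asdict(cfg))  # {'dim': 128, 'shape': 'pentachoron', 'vocab_source': 'wordnet'}
```

A dataclass like this serializes trivially to JSON, so a loader could read one file per repo and dispatch on `shape` and `dim` without per-repo code.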

Since many systems struggle with allocating 4D, I'll implement deterministic 4D calculations that ensure solidity and calculation cohesion without straying too far into "unknown" territory or requiring full pretrained systems. I haven't approached 6D or beyond yet, so we'll see if the human race even has the formulas for that when I actually approach the topic.
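One well-established deterministic 4D calculation of the kind mentioned above is the volume of a 4-simplex (a pentachoron) via the Cayley-Menger determinant, which needs only pairwise distances. This is standard geometry, offered here as an example of closed-form 4D computation, not as the project's implementation.

```python
import math
import numpy as np

def simplex_volume(points: np.ndarray) -> float:
    """Volume of an n-simplex from its (n+1, d) vertex coordinates
    using the Cayley-Menger determinant (n = 4 -> pentachoron)."""
    n = points.shape[0] - 1
    # Squared pairwise distances between all vertices.
    d2 = np.square(points[:, None, :] - points[None, :, :]).sum(-1)
    # Cayley-Menger matrix: bordered with ones, zero in the corner.
    cm = np.ones((n + 2, n + 2))
    cm[0, 0] = 0.0
    cm[1:, 1:] = d2
    coeff = (-1) ** (n + 1) / (2 ** n * math.factorial(n) ** 2)
    return float(np.sqrt(coeff * np.linalg.det(cm)))

# Unit right 4-simplex: origin plus the four standard basis vectors.
# Its volume is 1/4! = 1/24.
pts = np.vstack([np.zeros(4), np.eye(4)])
vol = simplex_volume(pts)
print(vol)
```

Because the determinant depends only on distances, the same routine works after any rotation or translation of the vertices, which makes it a stable building block for geometric checks.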
