Current Splits;
* wordnet (english)
* unicode
AbstractPhil/geometric-vocab-32d
[32, 64, 128, 256, 512, 768, 1024]
Swap the 32d for the dimension within the list for the repo.
Okay, so the purpose of these; is to give solid anchors to the entire pentachora structure.
With that I've formatted some very concise sentencepiece-esque vocabulary classes that can be saved and loaded as pretrained, but it'll need some tinkering to fully flesh those behaviors out.
For now, the geometric vocab itself can be queried from pretrain but the canonical classes that help regulation, integration, special token usage, and integration aren't fully tested yet.
https://github.com/AbstractEyes/lattice_vocabulary
They are available here, but I give no guarantee on their current state. I'm currently preparing the pip package and have prepared a series of experiments to utilize these for different models including a new version of multimodal Beeper, a classifier set that can handle encodings as feature representations meant for utilization, and more.
The current working variation that I've been utilizing is Flow Matching Discreet Scheduled geometric diffusion - meaning I'm diffusing the GEOMETRY from the image, and then comparing that pentachora that is created from flow matching to the actual representative tokenization structure. On average this is achieving 80% in later stages.
This when curating an indefinite amount of special tokens to create manifests of unique vocabularies, enables the system to perfectly conform to use-cases.
There are some edge-cases where the 1k reserved tokens still exist; however this is currently replaced by an indefinite tokenization dictionary - allowing for an indefinite amount of tokens attached to an indefinite amount of modules for solidity.
Experiments continue.