---
datasets:
- ffurfaro/PixelBytes-Pokemon
language: en
library_name: pytorch
license: mit
pipeline_tag: text-to-image
tags:
- image-generation
- text-generation
- multimodal
---

# PixelBytes: Unified Multimodal Generation

Welcome to the **PixelBytes** repository! This project features models designed to generate text and images simultaneously, pixel by pixel, using a unified embedding. **Note:** the weights published here are for testing only.

## Overview

### Key Concepts
- **Image Transformer**: Generates images pixel by pixel.
- **Bi-Mamba+**: A bidirectional model for time series prediction.
- **MambaByte**: A token-free selective state-space model that operates directly on bytes.

The PixelBytes model generates mixed sequences of text and images. It uses line breaks to handle transitions and keeps image dimensions consistent.
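
As a rough illustration of what a mixed sequence could look like, here is a minimal sketch. It is not the project's actual encoding: it assumes text is represented as UTF-8 bytes, pixels as palette indices shifted by a hypothetical `PIXEL_OFFSET`, and a newline token closing the text segment and each image row. The function names are made up for this example.

```python
from typing import List, Sequence

NEWLINE = ord("\n")   # byte value 10, used here as the row/segment separator
PIXEL_OFFSET = 256    # hypothetical: pixel ids sit after the 256 possible byte values


def encode_text(text: str) -> List[int]:
    """Represent text as its raw UTF-8 byte values (no tokenizer)."""
    return list(text.encode("utf-8"))


def encode_image(rows: Sequence[Sequence[int]]) -> List[int]:
    """Flatten an image row by row, ending every row with a newline token.

    All rows must share the same width so that the image dimensions
    stay consistent within the sequence.
    """
    if not rows:
        raise ValueError("image must contain at least one row")
    width = len(rows[0])
    ids: List[int] = []
    for row in rows:
        if len(row) != width:
            raise ValueError("all image rows must have the same width")
        ids.extend(PIXEL_OFFSET + pixel for pixel in row)
        ids.append(NEWLINE)
    return ids


def build_sequence(text: str, rows: Sequence[Sequence[int]]) -> List[int]:
    """Concatenate text and image ids, separated by a newline token."""
    return encode_text(text) + [NEWLINE] + encode_image(rows)


# A two-character caption followed by a 2x2 image of palette indices.
seq = build_sequence("Hi", [[0, 1], [2, 3]])
print(seq)  # [72, 105, 10, 256, 257, 10, 258, 259, 10]
```

Because every row ends with the same separator, a decoder can recover the image width by counting the ids between two consecutive newline tokens.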

## Dataset

We use the **PixelBytes-Pokemon** dataset, available on Hugging Face: [PixelBytes-Pokemon](https://huggingface.co/datasets/ffurfaro/PixelBytes-Pokemon). It contains text and image sequences of Pokémon for training our model.
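
For a quick look at the data, the dataset can presumably be loaded with the Hugging Face `datasets` library. The sketch below assumes that library is installed and that network access is available; the split and column names are not documented here, so inspect the returned object rather than relying on specific fields. The helper name `load_pixelbytes_pokemon` is made up for this example.

```python
DATASET_ID = "ffurfaro/PixelBytes-Pokemon"


def load_pixelbytes_pokemon():
    """Download (or reuse the local cache of) the dataset from the Hugging Face Hub."""
    # Imported lazily so this snippet can be imported without `datasets` installed.
    from datasets import load_dataset

    # No split is given, so every split the repository provides is returned.
    return load_dataset(DATASET_ID)
```

Calling `print(load_pixelbytes_pokemon())` shows the available splits and columns.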

## Models Trained

- **10 LSTM models**: unidirectional and bidirectional, with 1, 2, or 3 layers (including a special configuration: p_embed + 3xhidden_state + 3xembedding_dim)
- **3 Mamba models**: bidirectional with 1 and 2 layers, unidirectional with 2 layers
- **2 Transformer models**: 1 and 2 layers

## Citation

Furfaro, F. (2024). PixelBytes: A Unified Multimodal Representation Learning Project. 


---

Thank you for exploring **PixelBytes**! We hope this model aids your multimodal generation projects.