Error tracing back to tokenizer.py: "Expected bytes, got a 'float' object"
Hello,
I have been able to successfully tokenize many single-cell sample count matrices into Arrow datasets for a handful of fine-tuning classification tasks. The process has been fairly simple up to this point, but I have encountered an error that I haven't been able to troubleshoot. I save individual count matrices to loom files, with my train and test sets being separate folders containing the libraries I want to use to fine-tune and evaluate my Geneformer classifier. However, a handful of my libraries are throwing the following error (full traceback below). I have pulled the updated Geneformer code, tried converting my count matrix from float to int, made sure that the 'n_count' column of both my .obs and .var dataframes is stored as integers and not floats, and do not see anything unique about these libraries compared to the other libraries that are tokenized without error. Might you have an idea where these libraries could be uniquely presenting float objects where bytes are expected?
Thank you very much for the upkeep of the repository!
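For reference, this is roughly the kind of dtype check I ran before writing the loom files (a minimal sketch with toy data; the attribute names besides 'n_count' and the helper function are just illustrative, not my actual pipeline):

```python
import numpy as np

def report_float_attrs(attrs):
    """Return names of attribute arrays stored with a floating-point dtype."""
    return [name for name, arr in attrs.items()
            if np.issubdtype(np.asarray(arr).dtype, np.floating)]

# Toy stand-ins for a loom file's column attributes (ds.ca in loompy).
col_attrs = {
    "n_count": np.array([1024, 2048, 512]),        # stored as int, as expected
    "percent_mito": np.array([0.05, 0.02, 0.11]),  # float on purpose
}

# Casting the count matrix itself from float to int before saving.
counts = np.array([[1.0, 0.0], [3.0, 2.0]])
counts = counts.astype(np.int64)

print(report_float_attrs(col_attrs))  # -> ['percent_mito']
```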
code -->
tk = TranscriptomeTokenizer(obs_dict, nproc=52)
tk.tokenize_data(
    Path("../../../data/PASC/loom_files/v3_test"),
    "../../../data/PASC/token_output",
    "v3_PASC_test",
)
output log -->
Tokenizing ../../../data/PASC/loom_files/v3_test/SC245.loom
Tokenizing ../../../data/PASC/loom_files/v3_test/SC230.loom
Tokenizing ../../../data/PASC/loom_files/v3_test/SSc_SSc15.loom
Tokenizing ../../../data/PASC/loom_files/v3_test/SC299.loom
Tokenizing ../../../data/PASC/loom_files/v3_test/SC314.loom
Tokenizing ../../../data/PASC/loom_files/v3_test/SC246.loom
Tokenizing ../../../data/PASC/loom_files/v3_test/SC215.loom
Tokenizing ../../../data/PASC/loom_files/v3_test/SSc_SSc6.loom
Tokenizing ../../../data/PASC/loom_files/v3_test/Mould_S8.loom
Tokenizing ../../../data/PASC/loom_files/v3_test/SSc_C3.loom
Tokenizing ../../../data/PASC/loom_files/v3_test/SC431.loom
Tokenizing ../../../data/PASC/loom_files/v3_test/SSc_C11.loom
Tokenizing ../../../data/PASC/loom_files/v3_test/SC329.loom
Tokenizing ../../../data/PASC/loom_files/v3_test/Mould_S5.loom
Tokenizing ../../../data/PASC/loom_files/v3_test/SC296.loom
Tokenizing ../../../data/PASC/loom_files/v3_test/SC303.loom
Tokenizing ../../../data/PASC/loom_files/v3_test/Mould_S3.loom
ArrowTypeError Traceback (most recent call last)
/tmp/ipykernel_257597/3369846383.py in
3 Path("../../../data/PASC/loom_files/v3_test"),
4 "../../../data/PASC/token_output",
----> 5 "v3_PASC_test"
6 )
/projects/b1038/Pulmonary/sfenske/projects/geneformer_experiments/code/venv_geneformer/lib/python3.7/site-packages/geneformer/tokenizer.py in tokenize_data(self, loom_data_directory, output_directory, output_prefix)
107 """
108 tokenized_cells, cell_metadata = self.tokenize_files(Path(loom_data_directory))
--> 109 tokenized_dataset = self.create_dataset(tokenized_cells, cell_metadata)
110
111 output_path = (Path(output_directory) / output_prefix).with_suffix(".dataset")
/projects/b1038/Pulmonary/sfenske/projects/geneformer_experiments/code/venv_geneformer/lib/python3.7/site-packages/geneformer/tokenizer.py in create_dataset(self, tokenized_cells, cell_metadata)
215
216 # create dataset
--> 217 output_dataset = Dataset.from_dict(dataset_dict)
218
219 # truncate dataset
/projects/b1038/Pulmonary/sfenske/projects/geneformer_experiments/code/venv_geneformer/lib/python3.7/site-packages/datasets/arrow_dataset.py in from_dict(cls, mapping, features, info, split)
897 arrow_typed_mapping[col] = data
898 mapping = arrow_typed_mapping
--> 899 pa_table = InMemoryTable.from_pydict(mapping=mapping)
900 if info is None:
901 info = DatasetInfo()
/projects/b1038/Pulmonary/sfenske/projects/geneformer_experiments/code/venv_geneformer/lib/python3.7/site-packages/datasets/table.py in from_pydict(cls, *args, **kwargs)
797 datasets.table.Table
798 """
--> 799 return cls(pa.Table.from_pydict(*args, **kwargs))
800
801 @classmethod
/projects/b1038/Pulmonary/sfenske/projects/geneformer_experiments/code/venv_geneformer/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pydict()
/projects/b1038/Pulmonary/sfenske/projects/geneformer_experiments/code/venv_geneformer/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib._from_pydict()
/projects/b1038/Pulmonary/sfenske/projects/geneformer_experiments/code/venv_geneformer/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.asarray()
/projects/b1038/Pulmonary/sfenske/projects/geneformer_experiments/code/venv_geneformer/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()
/projects/b1038/Pulmonary/sfenske/projects/geneformer_experiments/code/venv_geneformer/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._handle_arrow_array_protocol()
/projects/b1038/Pulmonary/sfenske/projects/geneformer_experiments/code/venv_geneformer/lib/python3.7/site-packages/datasets/arrow_writer.py in arrow_array(self, type)
187 else:
188 trying_cast_to_python_objects = True
--> 189 out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
190 # use smaller integer precisions if possible
191 if self.trying_int_optimization:
/projects/b1038/Pulmonary/sfenske/projects/geneformer_experiments/code/venv_geneformer/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()
/projects/b1038/Pulmonary/sfenske/projects/geneformer_experiments/code/venv_geneformer/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()
/projects/b1038/Pulmonary/sfenske/projects/geneformer_experiments/code/venv_geneformer/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
/projects/b1038/Pulmonary/sfenske/projects/geneformer_experiments/code/venv_geneformer/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowTypeError: Expected bytes, got a 'float' object
Thank you for your interest in Geneformer! I have not encountered this error, but I would have suggested checking to ensure the count matrix is uniformly formatted - it sounds like you already did this. Another place you may consider checking is whether any of the custom column attributes have mismatched types (for example, whether some classes are floats whereas others are ints or other types).
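One way this particular ArrowTypeError can arise is a string-typed column attribute in which some cells have a missing value: the missing entries come back as float NaN inside an otherwise string-valued array, and pyarrow then fails with "Expected bytes, got a 'float' object" when building the table. A minimal sketch of a scan for that pattern (the helper name is hypothetical; with loompy you would apply it to each file's `ds.ca`):

```python
import numpy as np

def find_mixed_type_attrs(attrs):
    """Find object-dtype attribute arrays mixing Python types (e.g. str and float NaN)."""
    suspects = {}
    for name, arr in attrs.items():
        arr = np.asarray(arr)
        if arr.dtype == object:
            types = {type(v).__name__ for v in arr}
            if len(types) > 1:
                suspects[name] = sorted(types)
    return suspects

# A label column where one cell is missing its annotation: the NaN is a
# float sitting inside a string column, which Arrow will reject.
col_attrs = {
    "cell_type": np.array(["T cell", float("nan"), "B cell"], dtype=object),
    "sample": np.array(["SC245", "SC245", "SC245"], dtype=object),
}

print(find_mixed_type_attrs(col_attrs))  # -> {'cell_type': ['float', 'str']}
```

Running this over only the failing loom files and comparing against a file that tokenizes cleanly should narrow down which attribute carries the stray floats.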
Thank you for the feedback. I have looked into this, but I will double-check to make sure there are no columns from these samples that may be of the wrong data type. Otherwise, I will just exclude these libraries from my train/test sets and proceed. Thanks!