nyuuzyou
AI & ML interests
Recent Activity
Organizations
- HD resolution - 1280×720 · 32 fps
- For each frame keyboard and mouse + world state (player position, velocity, weapon ...)
- HD Stereo audio
- All 10 players perspective
https://huggingface.co/collections/blanchon/opencs2
The pipeline adapts to the source, beginning with collecting target URLs from sitemaps or APIs into a text file to track progress. I fetch the content concurrently. Go with 50 to 200 goroutines handles large scrapes, while Python ThreadPoolExecutor works for smaller jobs. This stage requires retry logic, rate limiting, and checkpoint files to resume interrupted downloads. The custom work happens during parsing since every site structures its data differently. I extract the target data using BeautifulSoup or goquery for HTML and standard parsers for APIs. I then filter the output to drop binaries, validate UTF-8, and skip generated files using tools like go-enry. The clean data gets written to an intermediate JSONL format, appending with a file lock for thread safety. I convert the final JSONL files to Parquet using DuckDB, PyArrow, or parquet-go. These get compressed with Zstandard at level 19, using 10K to 100K row groups and 512MB to 1GB shards. Go handles the high-throughput scraping, Python manages the custom parsing, and DuckDB takes care of the format conversions.
Dataset: ajibawa-2023/Python-Code-Large
Python-Code-Large is a large-scale corpus of Python source code comprising more than 2 million rows of Python code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the Python ecosystem.
By providing a high-volume, language-specific corpus, Python-Code-Large enables systematic experimentation in Python-focused model training, domain adaptation, and downstream code understanding tasks.
Python-Code-Large addresses the need for a dedicated Python-only dataset at substantial scale, enabling focused research across data science, backend systems, automation, scientific computing, and AI-driven Python environments.
Thanks! Since the datasets vary so much in size and format, I write custom parsing and processing pipelines for almost every single one.
934,191 image records index Eastern Europe and Northern Asia. Temporal links map historical views at identical coordinates across nine years.
Key Stats:
- 905,940 unique images
- Coverage spanning 2016 to 2025
- Average 14.3 historical links per location
Geographic bounds span 20.49° E to 152.32° E. Urban centers show higher data density.
In short, the students won. They did so by fine-tuning LFM2. LFM2 is a foundation built by Liquid AI. Liquid AI is a $2 billion startup from MIT.
ajibawa-2023/JavaScript-Code-Large
JavaScript-Code-Large is a large-scale corpus of JavaScript source code comprising around 5 million JavaScript files. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the JavaScript ecosystem.
By providing a high-volume, language-specific corpus, JavaScript-Code-Large enables systematic experimentation in JavaScript-focused model training, domain adaptation, and downstream code understanding tasks.
JavaScript-Code-Large addresses the need for a dedicated JavaScript-only dataset at substantial scale, enabling focused research across frontend, backend, and full-stack JavaScript environments. .
nyuuzyou/casino-benchmark
nyuuzyou/casino-benchmark
14 models faced 1,400 simulations of heads-up Blackjack and European Roulette. Shared seeds locked identical cards and spins for each.
Key Stats:
- 14 models benchmarked
- 59,483 rows
- 35 MB compressed Parquet
- 35,000 scored decisions
- Full prompts, JSON responses, reasoning traces, latency
- Bankroll tracking from $1,000 start per run
Live leaderboard tracks bets, hits, stands, and risk management.
Gemini 3 Flash leads at +$3,396. Claude 4.5 Haiku at -$7,788.
Traces in the dataset. Leaderboard in the space.
This is a 9B parameter, 2B active MoE VLM with state of the art visual reasoning capabilities.
More details in the release blog post: https://moondream.ai/blog/moondream-3-preview
With local AI, I don’t have /fast CC switch, but I have /absurdlyfast:
- 100’499 tokens/second read, yeah 100k, not a typo | 811 tok/sec generation
- KV cache: 707’200 tokens
- Hardware: 5+ year old GPUs 4xA6K gen1; It’s not the car. It’s the driver.
Qwen3 Coder Next AWQ with cache at BF16. Scores 82.1% in C# on 29-years-in-dev codebase vs Opus 4.5 at only 57.5%. When your codebase predates Stack Overflow, you don't need the biggest model; you need the one that actually remembers Windows 95.
My current bottleneck is my 27" monitor. Can't fit all 20 Theos on screen without squinting.
SEO spam has also become a lot less noticeable. I'm hoping that the next step will be to crack down on storage and traffic abuse, and maybe that will mean more generous storage limits.
Following the strong response to the Google Code Archive nyuuzyou/google-code-archive (thanks!), this release preserves another major historical repository: the Microsoft CodePlex Archive.
CodePlex served as Microsoft’s primary open-source hosting platform from 2006 to 2017. This dataset captures the distinct .NET and Windows-centric development ecosystem that flourished before the industry standardizing on GitHub.
Key Stats:
- 5,043,730 files from 38,087 repositories
- 3.6 GB compressed Parquet
- 91 programming languages (Heavily featuring C#, ASP.NET, and C++)
- Cleaned of binaries, build artifacts, and vendor directories (node_modules, packages)
- Includes platform-specific license metadata (Ms-PL, Ms-RL)
Following Rain-100M, we’re scaling up. Rain-v2 features a larger training dataset.
We’ve published a comprehensive blog covering the end-to-end journey—from raw data collection to rigorous evaluation and safety testing.
HF Repo: 🤗 raincandy-u/Rain-v2
Blog: 📚
https://angelkawaii.xyz/2026/01/29/rain-v2/
Special thanks to the open-source community and the SmolLM2 team for their foundational work! 🚀
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model (2502.02737)
Here's something different from the code datasets: 20+ years of public discussion archives from NNTP newsgroups. Clean Parquet format, but this time it's conversations instead of code.
Key Stats:
- 386,629,949 messages from 159,345 newsgroups
- 191 GB compressed Parquet storage
- Spans 2002-2026
- Multilingual: English, German, French, Italian, Dutch, Polish, Russian, and others
- Email addresses redacted for privacy
The data is messy in the way real discussions are messy. Spam wasn't filtered out - you get the advertisements, the arguments, the off-topic threads, all of it. If you want sanitized text, this isn't it. If you want to see how people actually talked online before Discord and Reddit took over, here you go.
Processing kept it simple: convert everything to UTF-8, remove exact duplicates, strip binary attachments, redact emails. Legacy character encodings were a nightmare - had to handle Windows-1252, ISO-8859 variants, KOI8-R, Shift-JIS, GBK, and others just to get readable text. At least it was fun to do, and I think the result turned out pretty well. I hope someone else will also be able to have fun or gain something useful from this project.