Comments

  • vmykyt 14 hours ago
That is a good start.

    In the (big-)data area, the idea of data versioning has been flying around for decades. The current consensus is to treat information about your files, which is effectively data itself, as metadata.

    That said, while creating your own solution is always good, you might also look at existing solutions like Apache Iceberg [1] (free and open source).

    In particular, they have the concept of a Catalog.

    While from the documentation it may look like adopting Iceberg requires a lot of other moving parts, in reality you can start from Docker Compose [2] and then manage your data using plain old SQL syntax.

    It may look like overkill for your specific needs, but it is still a good source to steal some ideas from.

    P.S. There are plenty of such systems in various form factors.

    [1] https://iceberg.apache.org/
    [2] https://iceberg.apache.org/spark-quickstart/

    • aliefe04 8 hours ago
      Thanks for the feedback!

      Shodata aims to solve a different problem: lightweight versioning for small-to-medium datasets with zero infrastructure setup. Think "GitHub for CSV files" rather than a full data lakehouse. Iceberg is excellent for production data lakes with Spark/Trino, but it requires running catalogs, configuring S3/Glue, and SQL knowledge. For many ML teams working with <100GB datasets, that's overkill. Our sweet spot is teams who need:

      • Drag-and-drop versioning (no CLI/SDK required)
      • Instant previews and diff visualization
      • Collaboration features (comments, access control)
      • Public sharing (like the LLM hallucinations dataset)
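
      For what it's worth, the "GitHub for CSV files" idea can be sketched as content-addressed version metadata, where each upload is identified by the hash of its bytes. This is a hypothetical toy illustration (the `DatasetCatalog` class and its methods are invented for this sketch, not Shodata's actual implementation):

      ```python
      import hashlib
      from dataclasses import dataclass, field

      @dataclass
      class DatasetCatalog:
          """Toy catalog: maps dataset name -> ordered list of version records."""
          versions: dict = field(default_factory=dict)

          def commit(self, name: str, data: bytes, message: str = "") -> str:
              """Register a new version, identified by the SHA-256 of its contents."""
              digest = hashlib.sha256(data).hexdigest()
              history = self.versions.setdefault(name, [])
              # Skip the commit if the content is identical to the latest version.
              if history and history[-1]["hash"] == digest:
                  return digest
              history.append({"hash": digest, "message": message, "size": len(data)})
              return digest

          def latest(self, name: str) -> dict:
              """Return the most recent version record for a dataset."""
              return self.versions[name][-1]

      catalog = DatasetCatalog()
      v1 = catalog.commit("sales.csv", b"date,amount\n2024-01-01,100\n", "initial upload")
      v2 = catalog.commit("sales.csv",
                          b"date,amount\n2024-01-01,100\n2024-01-02,250\n",
                          "added Jan 2")
      ```

      Hashing the content rather than numbering versions makes re-uploads of identical files free, which matters for drag-and-drop workflows where users may upload the same file twice.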

      I'll definitely look at Iceberg's catalog design for inspiration on metadata management. Appreciate the pointer!