> title
bases with optional newlines
> title
bases with optional newlines
...
The author is talking about removing the non-semantic optional newlines (hard wrapping), not all the newlines in the file.
It makes a lot of sense that this would work: bacteria have many subsequences in common, but if you insert non-semantic newlines at effectively random offsets then compression tools will not be able to use the repetition effectively.
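For anyone who wants to try it, here is a minimal sketch of the unwrapping step in Python. The name `unwrap_fasta` is mine, not the article's, and it assumes every line starting with `>` is a title and everything else is sequence.

```python
from typing import Iterable, Iterator


def unwrap_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with each record's sequence joined onto one line."""
    seq_parts: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new title closes the previous record, if it had any sequence.
            if seq_parts:
                yield "".join(seq_parts)
                seq_parts = []
            yield line
        elif line:
            seq_parts.append(line)
    if seq_parts:
        yield "".join(seq_parts)


wrapped = [">seq1\n", "ACGT\n", "ACGT\n", ">seq2\n", "GG\n"]
unwrapped = list(unwrap_fasta(wrapped))
```

The titles are untouched, so only the hard wrapping is lost; re-wrapping at a fixed width afterwards is straightforward if a downstream tool needs it.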
You can also improve compression by reordering the sequences within the FASTA file, as long as you're using it as a dictionary and not a list of title-sequence pairs.
I've explored alternatives to FASTA and FASTQ, and in most cases I found that simply not storing sequence data is the best option of all. If I have to store it, columnar formats with compression are usually the best alternative given all of (my) constraints.
Using larger-than-default window sizes has the drawback of requiring that the same --long=xx argument be passed during decompression, reducing compatibility somewhat.
Interesting. Any idea why this can't be stored in the metadata of the compressed file?
It is stored in the metadata [1], but anything larger than 8 MiB is not guaranteed to be supported. So there has to be an out-of-band agreement between compressor and decompressor.
Thanks! So the --long essentially signals the decoder "I am willing to accept the potentially large memory requirements implied by the given window size"
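To make the "it is in the metadata" point concrete, here is a rough sketch of how the window size is derived from the Window_Descriptor byte of a frame header, following the formula in RFC 8878. The function names are mine, and frames with the Single_Segment_Flag set (which carry no Window_Descriptor) are not handled.

```python
RECOMMENDED_MAX = 8 * 1024 * 1024  # 8 MiB, the interoperability limit mentioned above


def window_size_from_descriptor(descriptor: int) -> int:
    """Decode a Window_Descriptor byte into a window size in bytes (RFC 8878)."""
    exponent = descriptor >> 3      # top 5 bits
    mantissa = descriptor & 0x07    # bottom 3 bits
    window_log = 10 + exponent
    window_base = 1 << window_log
    window_add = (window_base // 8) * mantissa
    return window_base + window_add


def needs_out_of_band_agreement(descriptor: int) -> bool:
    """True when a decoder is not guaranteed to accept this window size."""
    return window_size_from_descriptor(descriptor) > RECOMMENDED_MAX


eight_mib = window_size_from_descriptor(13 << 3)   # exponent 13, mantissa 0
long_27 = window_size_from_descriptor(17 << 3)     # exponent 17, i.e. a 2**27 window
```

So the decoder can always read the required size; the flag only tells it that allocating that much is acceptable.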
Seems useful for games marketplaces like Steam and Xbox. You control the CDN and client, so you can use tricky but effective compression settings all day long.
Sending a .zip filled with all zeroes, so it compresses extremely well, is a well-known DoS historically (zip bomb, making the server run out of space in trying to read the archive)
You always need resource limits when dealing with untrusted data. RAM is one of the obvious ones. They could introduce a memory limit parameter; require passing --long with a value equal to or greater than what the stream requires to successfully decompress; require seeking support for the input stream so they can look back that way (TMTO); fall back to using temp files; or interactively prompt the user if there's a terminal attached. Lots of options, each with pros and cons of course, that would all allow a scenario where the required information for the decoder is stored in the compressed data file
The decompressed output needn't be in-memory (or even on-disk; it could be directly streamed to analysis) all at the same time, at which point resource limits aren't a problem at all. And I believe --long already is a "greater than or equal to" value, and should also be effectively a memory limit (or pretty close to one at least).
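As an illustration of the resource-limit idea, here is a small sketch of streaming decompression with an output cap. It uses `zlib` from the Python standard library as a stand-in for zstd, and `bounded_decompress` and its parameters are names I made up for the example.

```python
import zlib


def bounded_decompress(data: bytes, limit: int, chunk: int = 64 * 1024) -> bytes:
    """Decompress a zlib stream, refusing to produce more than `limit` bytes."""
    d = zlib.decompressobj()
    out = bytearray()
    buf = data
    while not d.eof:
        # max_length caps how much output a single call may produce.
        piece = d.decompress(buf, chunk)
        if not piece and not d.unconsumed_tail:
            raise ValueError("truncated or stalled stream")
        out += piece
        if len(out) > limit:
            raise ValueError("decompressed size exceeds limit")
        buf = d.unconsumed_tail
    return bytes(out)


# Ten million zero bytes compress to a tiny payload, like a miniature zip bomb.
bomb = zlib.compress(b"\0" * 10_000_000)
```

In a real pipeline each piece would be handed to the consumer instead of being accumulated, so memory use stays bounded by the chunk size.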
Seeking back in the input might theoretically work, but I feel like that could easily get very bad (aka exponential runtime); never mind needing actual seeking.
Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)
I’ve worked with large genomic datasets on my own dime, and the default formats show their limits quickly. With FASTA, the first step for me is usually conversion: unzip headers from sequences, store them in Arrow-like tapes for CPU/GPU processing, and persist as Parquet when needed. It’s straightforward, but surprisingly underused in bioinformatics — most pipelines stick to plain text even when modern data tooling would make things much easier :(
Yes, when doing anything intensive with lots of sequences it generally makes sense to liberate them from FASTA as early as possible and index them somehow. But as an interchange format FASTA seems quite sticky. I find the pervasiveness of fastq.gz particularly unfortunate with Gzip being as slow as it is.
> Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)
I even confused myself about this while writing :-)
Basic text formats persist, because everyone supports them. Many tools have better file formats for internal purposes, but they are rarely flexible enough and robust enough for wider use. There are occasional proposals for better general purpose formats, but the people proposing them rarely agree which of the competing proposals should be adopted. And even if they manage to agree, they probably don't have the time and the money to make it actually happen.
Also for historical reasons I think, since Perl used to be the big bioinformatics language, and it is surprisingly hard to compete with in string handling.
When you know you're going to be compressing files of a particular structure, it's often very beneficial to tweak compression algorithm parameters. In one case when dealing with CSV data, I was able to find an LZMA2 compression level, dictionary size and compression mode that yielded a massive speedup, used 1/100th the memory and, surprisingly, even yielded better compression ratios, probably from the smaller dictionary size. That's in comparison to the library's default settings.
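For reference, this is roughly what such tuning looks like with Python's `lzma` module, which exposes the LZMA2 filter options directly. The preset and dictionary size below are placeholders, not the values from my case, and the CSV rows are synthetic.

```python
import lzma


def compress_tuned(data: bytes, dict_size: int = 1 << 20, preset: int = 6) -> bytes:
    """Compress with an explicit LZMA2 filter chain instead of library defaults."""
    filters = [
        {
            "id": lzma.FILTER_LZMA2,
            "preset": preset,          # baseline settings to start from
            "dict_size": dict_size,    # smaller dictionary, less memory
            "mode": lzma.MODE_FAST,    # faster match finding mode
        }
    ]
    return lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)


# Synthetic CSV-like rows, only to have something to compress.
rows = "".join(
    f"{1_700_000_000 + i},sensor-{i % 8},{20 + i % 5}.5\n" for i in range(5000)
).encode()

default_blob = lzma.compress(rows)
tuned_blob = compress_tuned(rows)
```

Which settings win depends entirely on the data, so it is worth measuring size, time and memory for a handful of combinations on a representative sample.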
I've also noticed this. Zstandard doesn't see very common patterns
For me it was an increasing number (think of unix timestamps in a data logger that stores one entry per second, so you are just counting up until there's a gap in your data), in the article it's a fixed value every 60 bytes
Of course, our brains are exceedingly good at finding patterns (to the point where we often find phantom ones). I was just expecting some basic checks like "does it make sense to store the difference instead of the absolute value for some of these bytes here". Seeing as the difference is 0 between every 60th byte in the submitted article, that should fix both our issues
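Here is a sketch of the kind of check I mean, using the timestamp case: store differences instead of absolute values before handing the bytes to a general-purpose compressor. It uses `zlib` rather than Zstandard to stay within the standard library, and all names are made up for the example.

```python
import struct
import zlib


def delta_encode(values: list[int]) -> list[int]:
    """Replace each value with its difference from the previous one."""
    prev = 0
    out = []
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(deltas: list[int]) -> list[int]:
    """Invert delta_encode with a running sum."""
    total = 0
    out = []
    for d in deltas:
        total += d
        out.append(total)
    return out


def pack_u32(values: list[int]) -> bytes:
    """Pack non-negative integers as little-endian 32-bit words."""
    return struct.pack(f"<{len(values)}I", *values)


# One entry per second, like a data logger without gaps.
timestamps = list(range(1_700_000_000, 1_700_010_000))
raw_size = len(zlib.compress(pack_u32(timestamps)))
delta_size = len(zlib.compress(pack_u32(delta_encode(timestamps))))
```

After the transform the stream is almost entirely the same 4-byte word repeated, which any LZ-style compressor handles trivially.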
Bzip2 performed much better for me but it's also incredibly slow. If it were only the compressor, that might be fine for many applications, but decompressing is also an exercise in patience, so I've moved to Zstandard as the standard thing to use
Yes I'd expect a dict-based approach to do better here. That's probably how it should be done. But --long is compelling for me because using it requires almost no effort, it's still very fast, and yet it can dramatically improve compression ratio.
From what I've read (although I haven't tested and I can't find my source from when I read it), dictionaries aren't very useful when the dataset is big, and just by using '--long' you can cover that improvement.
I don’t think the size of content matters, it’s all about patterns (and their repetitiveness) within, and FASTA is a great target, if I understand the format correctly
I tried building a zstd dictionary for something (compressing many instances of the same Mono (.NET) binary serialized class, mostly identical), and in this case it provided no real advantage. Honestly, I didn't dig into it too much, but will give --long a try shortly.
PS: what the author implicitly suggests cannot be replaced with zstd tweaks. It'll be interesting to look at the file in ImHex, especially if I can find an existing Pattern File.
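For readers unfamiliar with the dictionary approach, here is a small sketch of the idea using zlib's preset dictionary support (`zdict`) as a stand-in for a trained zstd dictionary. The records are invented, and concatenating a few samples is a crude substitute for real dictionary training.

```python
import zlib


def compress_with_dict(record: bytes, zdict: bytes) -> bytes:
    """Compress one small record against a shared preset dictionary."""
    c = zlib.compressobj(zdict=zdict)
    return c.compress(record) + c.flush()


def decompress_with_dict(blob: bytes, zdict: bytes) -> bytes:
    """Decompress a record; the same dictionary must be supplied."""
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()


# Hypothetical, mostly identical serialized objects.
samples = [
    b'{"class":"PlayerState","version":3,"name":"alice","hp":100}',
    b'{"class":"PlayerState","version":3,"name":"bob","hp":42}',
]
shared_dict = b"".join(samples)

record = b'{"class":"PlayerState","version":3,"name":"carol","hp":87}'
plain = zlib.compress(record)
with_dict = compress_with_dict(record, shared_dict)
```

The benefit shows up when each record is compressed on its own; if all instances go into one stream, the compressor already sees the earlier records and a dictionary adds little, which may be what happened in my case.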
There's a few options out there that have noticeably better compression, with the downside of being less widely-compatible with tools. zstd also has the benefit of being very fast (depending on your settings, of course).
CRAM compresses unmapped fastq pretty well, and can do even better with reference-based compression. If your institution is okay with it, you can see additional savings by quantizing quality scores (modern Illumina sequencers already do this for you). If you're aligning your data anyways, probably retaining just the compressed CRAM file with unmapped reads included is your best bet.
There are also other fasta/fastq specific tools like fqzcomp or MZPAQ. Last I checked, both of these could about halve the size of our fastq.gz files.
Removing the wrapping newline from the FASTA/FASTQ convention also dramatically improves parsing perf when you don't have to do as much lookahead to find record ends.
Unfortunately, when you write a program that doesn't wrap output FASTAs, you have a bunch of people telling you off because SOME programs (cough bioperl cough) have hard limits on line length :)
I've only tested this when writing my own parser where I could skip the record end checks, so idk if this improves perf on an existing parser. Excited to see what you find!
Ha, it gets worse.
Search engines or blacklist processors often use gigantic URL lists stored as plain ASCII, which are then fed into a perfect hash generator that accesses those URLs unordered. I.e. they need to create a second ordering index to access the URL list. The perfect hashing guys are mathematicians, so they don't care: their definition of an MPHF (minimal perfect hash function) is just a random ordering of unique indices, and they don't bother to store the ordering as well. So we have ASCII and no index.
BAM format is widely used but assemblies still tend to be generated and exchanged in FASTA text. BAM is quite a big spec and I think it's fair to say that none of the simpler binary equivalents to FASTA and FASTQ have caught on yet (XKCD competing standards etc.)
FASTA is a candidate for the stupidest file format ever invented and a testament to the massive gap in perceived vs actual programming ability of the average bioinformatician.
Spend a few years handling data in arcane, one-off, and proprietary file formats conceived by "brilliant" programmers with strong CS backgrounds and you might reconsider the conclusion you've come to here.
> a testament to the massive gap in perceived vs actual programming ability of the average bioinformatician.
This is not really a fair statement. Literally all of software bears the weight of some early poor choice that then keeps moving forward via weight of momentum. FASTA and FASTQ formats are exceptionally dumb though.
Other file formats that rival FASTA in stupidity include FASTQ, PDB, BED, SAM, CRAM and VCF. Further reading: [1]
> "intentionally or not, bioinformatics found a way to survive: obfuscation. By making the tools unusable, by inventing file format after file format, by seeking out the most brittle techniques"
I don't dislike the format, and it is much, much better than what it replaced, but SAM, and its binary sister-format BAM, does have some flaws:
- The original index format could not handle large chromosomes, so now there are two index formats: .bai and .csi
- For BAM, the CIGAR (alignment description) operation count is limited to 16 bits, which means that very long alignments cannot be represented. One workaround I've seen (but thankfully not used) is saving the CIGAR as a string in a tag
- SAM cannot unambiguously represent sequences with only a single base (e.g. after trimming), since a '*' in the quality column can be interpreted either as a single Phred score (9) or as a special value meaning "no qualities". BAM can represent such sequences unambiguously, but most tools output SAM
A parser to stream FASTA can be written in like 30 lines [0], much easier than say CSV where the edge cases can get hairy.
If you need something like fast random reads, use the FAIDX format [1], or even better just store it in an LMDB or SQLite embedded db.
People forget FASTA was from 1985, and it sticks around because (1) it's easy to parse and write (2) we have mountains of sequences in that format going back 4 decades.
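To back up the "30 lines" claim, here is a minimal streaming parser sketch. It is not the code from the linked gist, just my own illustration; `read_fasta` is a made-up name and it does no validation of the sequence alphabet.

```python
import io
from typing import Iterator, Optional, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Stream (title, sequence) pairs from a FASTA file, one record at a time."""
    title: Optional[str] = None
    chunks: list[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new title ends the previous record.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None:
            # Wrapped sequence lines are simply concatenated.
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


example = io.StringIO(">a some description\nACGT\nAC\n>b\nGG\n")
records = list(read_fasta(example))
```

Only one record is held in memory at a time, so it works on files much larger than RAM as long as individual sequences fit.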
It might be the stupidest, but stupid in the sense of "the simplest thing that could possibly work."
When FASTA was invented, Sanger sequencing reads would be around a thousand bases in length. Even back then, disk space wasn't so precious that you couldn't spend several kilobytes on the results of your experiment. Plus, being able to view your results with `more` is a useful feature when you're working with data of that size.
And, despite its simplicity, it has worked for forty years.
This might in general be a good preprocessing step to check for punctuation repeating in fixed intervals and remove it, and restore after decompression.
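A rough sketch of what such a reversible preprocessing step could look like for newlines at a fixed interval. Both function names are mine, and it only handles the simplest case: every line but the last has the same width.

```python
from typing import Optional, Tuple


def strip_periodic_newlines(data: bytes) -> Optional[Tuple[int, bool, bytes]]:
    """Return (width, had_trailing_newline, payload), or None if not periodic."""
    lines = data.split(b"\n")
    trailing = data.endswith(b"\n")
    if trailing:
        lines.pop()  # drop the empty piece after the final newline
    if len(lines) < 2:
        return None
    width = len(lines[0])
    if width == 0 or any(len(line) != width for line in lines[:-1]):
        return None
    if not 0 < len(lines[-1]) <= width:
        return None
    return width, trailing, b"".join(lines)


def restore_periodic_newlines(payload: bytes, width: int, trailing: bool) -> bytes:
    """Re-insert a newline after every `width` bytes of payload."""
    lines = [payload[i:i + width] for i in range(0, len(payload), width)]
    return b"\n".join(lines) + (b"\n" if trailing else b"")


data = b"ACGTAC\nGTACGT\nAC\n"
stripped = strip_periodic_newlines(data)
```

The width and the trailing-newline flag are the only side information needed to restore the original bytes exactly.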
That turns it into specialized compression, which DNA already has plenty of. Many forms of specialized compression even allow string-related queries directly on the compressed data.
What's the current way to accessibly process my 23andMe raw data? It was synthesized a decade ago, and SNPedia and Promethease seem abandoned, so what's the alternative, if there is one? And if there is none, how did we arrive at this?
> I speculated that this poor performance might be caused by the newline bytes (0x0A) punctuating every 60 characters of sequence, breaking the hashes used for long range pattern matching.
If the linefeeds were treated as semantic characters and not allowed to break the hash size, you would get similar results without pre-filtering and post-filtering. It occurs to me that this strategy is so obvious that there must be some reason it won't work.
As someone with an idle interest in data compression, is it possible to download the original dataset somewhere to play around with? Or rather something like a 20 GB subset of it.
The FASTA format stores nucleotides in text form... compression is used to make this tractable at genome sizes, but it's by no means perfect.
Depending on what you need to represent, you can get a 4x reduction in data size without compression at all, by just representing a GATC with 2 bits, rather than 8.
Compression on top of that "should" result in the same compressed size as the original text (after all, the "information" being compressed is the same), except that compression isn't perfect.
Newlines are an example of something that's "information" in the text format that isn't relevant, yet the compression scheme didn't know that.
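A quick sketch of the 2-bit idea, assuming the sequence contains only A, C, G and T (no N or other ambiguity codes, which is the "depending on what you need to represent" part). The code table and function names are arbitrary choices of mine.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(seq: str) -> bytes:
    """Pack four bases per byte, first base in the two highest bits."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODES[base] << (6 - 2 * (i % 4))
    return bytes(out)


def unpack_2bit(packed: bytes, length: int) -> str:
    """Recover `length` bases; the length must be stored alongside the bytes."""
    return "".join(
        BASES[(packed[i // 4] >> (6 - 2 * (i % 4))) & 3] for i in range(length)
    )


packed = pack_2bit("GATC")  # bits 10 00 11 01, i.e. the single byte 0x8D
```

The sequence length has to be kept separately because the last byte may be only partly used.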
I think one important factor you failed to account for is frameshifting. Compression algorithms work on bytes - 8 bits. Imagine that you have the exact same sequence occurring at different offsets mod 4. Then your encoding will give completely different results, and the compression algorithm will be unable to make use of the repetition.
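A tiny demonstration of the frameshift problem, using a hypothetical 2-bit packing (four bases per byte, A=0, C=1, G=2, T=3). The same bases shifted by one position produce entirely different bytes, while a shift by four bases keeps them intact.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_2bit(seq: str) -> bytes:
    """Pack four bases per byte, first base in the two highest bits."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODES[base] << (6 - 2 * (i % 4))
    return bytes(out)


shared = "GATTACAG"
aligned = pack_2bit(shared)                 # bytes 8F 12
shifted_by_1 = pack_2bit("T" + shared)      # bytes E3 C4 80
shifted_by_4 = pack_2bit("TTTT" + shared)   # bytes FF 8F 12

# A byte-oriented matcher can only find the repeat in the second case.
found_after_1 = aligned in shifted_by_1
found_after_4 = aligned in shifted_by_4
```

With text FASTA every base is one byte, so a repeat is found at any offset, which is one reason plain text plus a good compressor holds up better than expected.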
Some are bit-by-bit (e.g. the PPM family of compressors[1]), but the normal input granularity for most compressors is a byte. (There are even specialized ones that work on e.g. 32 bits at a time.)
[1] Many of the context models in a typical PPM compressor will be byte-by-byte, so even that isn't fully clear-cut.
They output a bitstream, yeah but I don't know of anything general purpose which effectively consumes anything smaller than bytes (unless you count various specialized handlers in general-purpose compression algorithms, e.g. to deal with long lists of floats)
This is a dataset of bacterial DNA. Any two related bacteria will have long strings of the same letters. But it won't be neatly aligned, so the line breaks will mess up pattern matching.
Exactly. The line breaks break the runs of otherwise identical bits in identical sequences. Unless two identical subsequences are exactly in phase with respect to their line breaks, the hashes used for long range matching are different for otherwise identical subsequences.
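This is easy to reproduce at small scale. The sketch below builds two synthetic "genomes" that share a long subsequence at different offsets and compresses them wrapped and unwrapped. It uses `zlib` instead of zstd with --long, and random bases instead of real bacterial DNA, so it only illustrates the phase effect, not the article's numbers.

```python
import random
import zlib

rng = random.Random(0)


def random_bases(n: int) -> str:
    """Generate n pseudo-random bases (synthetic stand-in for real DNA)."""
    return "".join(rng.choice("ACGT") for _ in range(n))


def wrap(seq: str, width: int = 60) -> str:
    """Hard-wrap a sequence at a fixed width, as FASTA writers usually do."""
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))


shared = random_bases(6000)
genome_a = random_bases(17) + shared   # shared part starts at offset 17
genome_b = random_bases(40) + shared   # same part, different line-break phase

unwrapped = f">a\n{genome_a}\n>b\n{genome_b}\n".encode()
wrapped = f">a\n{wrap(genome_a)}\n>b\n{wrap(genome_b)}\n".encode()

size_unwrapped = len(zlib.compress(unwrapped, 9))
size_wrapped = len(zlib.compress(wrapped, 9))
```

In the wrapped file the second copy can only be matched in short pieces between newlines, so it costs noticeably more than in the unwrapped file, where it is a handful of long matches.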
[1] https://datatracker.ietf.org/doc/html/rfc8878#name-window-de...
It feels like Benzene in some ways. Use it correctly and gdamn. Just don’t huff it - i mean - use it for your enterprise backend - and it’s worth it.
Have any of you tested it?
(I currently store a lot of data as FASTQ, and smaller file sizes could save us a bunch of money. But FASTQ + zstd is very good.)
e.g. https://github.com/ArcInstitute/binseq
1. https://madhadron.com/science/farewell_to_bioinformatics.htm...
[0] https://gist.github.com/jszym/9860a2671dabb45424f2673a49e4b5...
[1] https://seqan.readthedocs.io/en/main/Tutorial/InputOutput/In...