New File Format

☆ Yσɠƚԋσʂ ☆@lemmy.ml · 2 years ago

New File Format

unalivejoy@lemm.ee · 2 years ago

There are 3 types of files. Renamed txt, renamed zip, and exe

Toribor@corndog.social · edit-2 2 years ago

I’d argue with this, but it seems like image and video file extensions have become a lawless zone with no rules so I don’t even think they count.

Hauskrampf@ttrpg.network · 2 years ago

Looking at you, .webp

Gamma@programming.dev · edit-2 2 years ago

Video files are just a bunch of zip files in a trenchcoat.

fibojoly@sh.itjust.works · edit-2 2 years ago

Back in the day, when bandwidth was precious and porn sites would parcel a video into 10 second extracts, one per page, you could zip a bunch of these mpeg files together into an uncompressed zip, then rename it .mpeg and read it in VLC as a single video. Amazing stuff.

gazter@aussie.zone · 2 years ago

What’s it called when you logically expect something to work, but are totally surprised that it actually does?

fibojoly@sh.itjust.works · 2 years ago

Sounds an awful lot like a normal day at work as a dev.

CULT PONY@lemmy.blahaj.zone · 2 years ago

Don’t forget renamed tar.

Kit Sorens@lemmy.dbzer0.com · 2 years ago

It’s a folder that you put files into, but acts as a file itself. Not at all like zip.

Natanael@slrpnk.net · 2 years ago

Tar.gz is pretty much like zip. Technically tar mimics a file system more closely but like who makes use of that?

AVincentInSpace@pawb.social · edit-2 2 years ago

Tar mimics a filesystem more closely? Tf???

TAR stands for Tape ARchive. It’s called that because it’s designed to be written to (and read from) non-seekable magnetic tape, meaning it’s written linearly. The metadata for each file (name, mtime etc.) immediately precedes its contents. There is no global table of contents like you’d find on an actual filesystem. In fact, most implementations of tar don’t even put the separate files on gzip boundaries meaning you can’t decompress any given file without decompressing all of the files before it. With a tape backup system, you don’t care, but with a filesystem you absolutely do.
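To make the "written linearly, header before contents" point concrete, here is a minimal sketch using Python's standard `tarfile` module. The helper names `build_tar` and `walk_headers` are made up for this illustration. It builds a tiny uncompressed archive in memory and walks it front to back, showing that each member's header sits directly in front of its data and that the only way to find a later member is to step past the earlier ones.

```python
import io
import tarfile


def build_tar(files):
    """Return the bytes of an uncompressed ustar archive holding {name: data}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def walk_headers(tar_bytes):
    """Walk the archive from the start, yielding (name, header offset, data offset, size)."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:  # members are discovered one after another, in order
            yield member.name, member.offset, member.offset_data, member.size


archive = build_tar({"a.txt": b"hello", "b.txt": b"x" * 1000})
layout = list(walk_headers(archive))
for name, header_at, data_at, size in layout:
    print(f"{name}: header at byte {header_at}, data at byte {data_at}, {size} bytes")
```

There is no index to consult here: the reader learns where `b.txt` lives only after reading the header of `a.txt` and skipping its padded contents.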

PKZIP mimics the traditional filesystem structure much more closely. The table of contents is at the end instead of the beginning, which is a bit strange as filesystems go, but it is a table of contents consisting of a list of filenames and offsets into the file where they can be found. Each file in a zip archive is compressed separately, meaning you can pull out any given file from a ZIP archive without any prior state, and you can even use different compression algorithms on a per-file basis (few programs make use of this). For obvious reasons, the ZIP format prioritizes storage space over modification speed (the table of contents is a single centralized list and files must be contiguous), meaning if you tried to use it as a filesystem it would utterly suck – but you can very readily find software that will let you read, edit, and delete files in-place as though it were a folder without rewriting the entire archive. That’s not really possible with a .tar file.
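A rough sketch of the ZIP side, using Python's standard `zipfile` module (the function name `build_zip` and the file names are invented for the example). It stores two members with different compression methods, then reads the table of contents and pulls out the second member directly, without touching the first.

```python
import io
import zipfile


def build_zip():
    """Build a small in-memory ZIP whose two members use different compression."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("stored.txt", b"a" * 1000, compress_type=zipfile.ZIP_STORED)
        zf.writestr("deflated.txt", b"a" * 1000, compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


zip_bytes = build_zip()
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
    # The central directory at the end of the archive lists every member
    # together with the offset of its local header.
    toc = {
        info.filename: (info.header_offset, info.compress_type, info.compress_size)
        for info in zf.infolist()
    }
    # Random access: jump straight to one member and decompress only that one.
    second = zf.read("deflated.txt")

for name, (offset, method, stored_size) in toc.items():
    print(f"{name}: offset {offset}, method {method}, {stored_size} bytes in archive")
```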

You could make the argument that tar is able to more closely mimic a POSIX filesystem since it captures the UNIX permission bits and ZIP doesn’t (ustar was designed for UNIX and pkzip was designed for DOS) but that’s not a great metric.

YTG123@feddit.ch · 2 years ago

Isn’t the Windows exe also a renamed zip?

AVincentInSpace@pawb.social · edit-2 2 years ago

See, ZIP files are strange because unlike most other archive formats, they put the “header” and table of contents at the end, and all of the members (files within the zip file) are listed in that table of contents as offsets relative to the start of the file. There’s nothing that says that the first member has to begin at the start of the file, or that they have to be contiguous. This means you can concatenate an arbitrary amount of data at the beginning of a ZIP file (such as an exe that opens its argv[0] as a zip file and extracts it) and it will still be valid. (Fun fact! You can also concatenate up to 64KiB at the end and it will still be valid, after you do some finagling. This means that when a program opens a ZIP file it has to search through the last 64KiB to find the “header” with the table of contents. This is why writing a ZIP parser is really annoying.)

As long as whatever’s parsing the .exe doesn’t look past the end of its data, and whatever’s parsing the .zip doesn’t look past the beginning of its data, both can go about their business blissfully unaware of the other’s existence. Of course, there’s no real reason to concatenate an executable with a zip file that wouldn’t access the zip file, but you get the idea.

A common way to package software is to make a self-extracting zip archive in this manner. This is absolutely NOT to say that all .exe files are self extracting .zip archives.
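The prepend trick can be tried out in a few lines. This is only a sketch: `stub` is 512 bytes of filler standing in for an executable (it is not a real program), and `make_zip` is a helper invented for the example. Python's standard `zipfile` module locates the table of contents from the end of the data and tolerates the extra bytes in front; other tools may be stricter.

```python
import io
import zipfile


def make_zip(files):
    """Return the bytes of a ZIP archive holding {name: data}."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


# Filler bytes standing in for an executable stub (assumption: any prefix will do).
stub = b"MZ" + b"\x00" * 510
payload = make_zip({"readme.txt": b"hello from inside"})
combined = stub + payload  # plain concatenation, nothing else is rewritten

with zipfile.ZipFile(io.BytesIO(combined)) as zf:
    names = zf.namelist()
    content = zf.read("readme.txt")

print(names, content)
```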

Appoxo@lemmy.dbzer0.com · edit-2 2 years ago

No. But the Windows office suite is
You can rename a docx and extract it.
Don’t know how it is with ppt/x and xls/x

MonkderZweite@feddit.ch · 2 years ago

xls & co. (the older ones) are something custom. Only after standardization as OOXML (a shitshow btw, there’s a lengthy wiki article about it) did they become zip-based.

Appoxo@lemmy.dbzer0.com · 2 years ago

The whole Word and Libre/OO-Writer world is a shit show.
So complex and everyone decides to interpret it a bit differently.
Not even Libre and OO can be interoperable between the same file and feature.

The Ramen Dutchman@ttrpg.network · 2 years ago

docx are mostly markup language, actually. Much like SVGs and PDFs.

Appoxo@lemmy.dbzer0.com · 2 years ago

Aren’t they straight-up HTML with special formatting?

cymor@midwest.social · 2 years ago

The Ramen Dutchman@ttrpg.network · 2 years ago

And HTML is a lot like it, all of them are Markup Language.

fibojoly@sh.itjust.works · 2 years ago

No. The Office ???x files are archives. Inside them you can find folders with resources. Among those, you can find files written in markup languages.

Not quite the same thing.

Just rename your .docx file as .zip to check its contents.
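The same check can be done without renaming anything. The sketch below uses only Python's standard library and, since no real document is at hand, builds a stand-in archive with the same kind of layout (a `word/document.xml` entry inside a ZIP). The stand-in is an assumption for illustration and is not a complete, valid Word file; `make_fake_docx` and `extract_text` are hypothetical helper names.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def make_fake_docx(text):
    """Build a minimal stand-in archive laid out like a .docx (not a valid one)."""
    document_xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body><w:p><w:r>'
        f"<w:t>{text}</w:t></w:r></w:p></w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", document_xml)
    return buf.getvalue()


def extract_text(docx_bytes):
    """Open the archive, parse the markup file inside, and collect its text runs."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    return [node.text for node in root.iter(f"{{{W_NS}}}t")]


fake = make_fake_docx("hello")
print(zipfile.is_zipfile(io.BytesIO(fake)), extract_text(fake))
```

With a real file, the same `zipfile.ZipFile(path).namelist()` call should list the folders and resources mentioned above.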

The Ramen Dutchman@ttrpg.network · 2 years ago

Ah, last time I checked it was a kind of ML directly (XML, I’m guessing from the other reply), but that was back in Office 2016’s time, so things might have changed.

Thanks for the heads-up!

14th_cylon@lemm.ee · 2 years ago

unalivejoy@lemm.ee · edit-2 2 years ago

Just because you can open it with 7-zip doesn’t mean it’s a zip file. Some exes are also zip files.

𝒍𝒆𝒎𝒂𝒏𝒏@lemmy.one · 2 years ago

Ah, good ol’ Microsoft Office. Taken advantage of their documents being a renamed .zip format to send forbidden attachments to myself via email lol

On the flip side, there’s stuff like the Audacity app, that saves each audio project as an SQLite database 😳

Hexagon@feddit.it · 2 years ago

Also .jar files. And good ol’ winamp skins. And CBZ comics. And EPUB books. And Mozilla extensions. And APK apps. And…

neo (he/him)@lemmy.comfysnug.space · 2 years ago

cbz is literally just a renamed zip

MonkderZweite@feddit.ch · 2 years ago

Btw, you can create “chapters” by creating folders. Easy to automate with a loop.

neo (he/him)@lemmy.comfysnug.space · 2 years ago

a lot of the time i handle it manually since i try to pack things in “volumes” that most closely mimic physical releases, and writing the code to get that information would be slower than just looking it up manually

so, for example, the first volume of bleach has 7 chapters, so i’d pack those 7 chapters together into one cbz, the second volume in another cbz, etc.

beeb@lemm.ee · 2 years ago

an SQLite database

Genius! Why bother importing and exporting

xigoi@lemmy.sdf.org · 2 years ago

Minetest (an open-source Minecraft-like game) uses SQLite to save worlds.

𝒍𝒆𝒎𝒂𝒏𝒏@lemmy.one · 2 years ago

Mineclone2 is an absolute masterpiece of a game for Minetest IMO

xigoi@lemmy.sdf.org · 2 years ago

I prefer games that embrace the difference from Minecraft instead of trying to emulate it. My favorite is MeseCraft.

AVincentInSpace@pawb.social · 2 years ago

Scrap Mechanic (sandbox game that’s basically Space Engineers on the ground – or, more loosely, Minecraft but with physics and you can build cars) also uses SQLite to save worlds. It also uses uncompressed JSON files to store user creations.

Gamma@programming.dev · 2 years ago

It used to use project folders, but due to confusion/user error was changed in 3.0.

TherouxSonfeir@lemm.ee · 2 years ago

SQLite is amazing. Shush.

mogoh@lemmy.ml · 2 years ago

that saves each audio project as an SQLite database 😳

Is this a problem? I thought this would be a normal use case for SQLite.

fiah@discuss.tchncs.de · 2 years ago

doesn’t sqlite explicitly encourage this? I recall claims about storing blobs in a sqlite db having better performance than trying to do your own file operations

MNByChoice@midwest.social · edit-2 2 years ago

Thanks for the hint. I had to look that up. (The linked page is worth a read and has lots of details and caveats.)

The scope is narrow, and well documented. Be very wary of over generalizing.

The measurements in this article were made during the week of 2017-06-05 using a version of SQLite in between 3.19.2 and 3.20.0. You may expect future versions of SQLite to perform even better.

https://www.sqlite.org/fasterthanfs.html

SQLite reads and writes small blobs (for example, thumbnail images) 35% faster than the same blobs can be read from or written to individual files on disk using fread() or fwrite().

Furthermore, a single SQLite database holding 10-kilobyte blobs uses about 20% less disk space than storing the blobs in individual files.

Edit 5: consolidated my edits.

OrangeXarot@sh.itjust.works · 2 years ago

wait what

fibojoly@sh.itjust.works · 2 years ago

Civilisation (forget which) runs on an SQLite DB. I was rather surprised when I discovered this, back then.

space@lemmy.dbzer0.com · 2 years ago

Also renamed xml, renamed json and renamed sqlite.

TrustingZebra@lemmy.one · 2 years ago

Those sound fancy, I just use renamed txt files.

neo (he/him)@lemmy.comfysnug.space · 2 years ago

.ini is that you?

Natanael@slrpnk.net · 2 years ago

yaml

hemko@lemmy.dbzer0.com · 2 years ago

It’s everything

neo (he/him)@lemmy.comfysnug.space · 2 years ago

surprised pikachu face

TheAnonymouseJoker@lemmy.ml · 2 years ago

No, I am .nfo

HTTP_404_NotFound@lemmyonline.com · 2 years ago

Amateurs.

I have evolved from using file extensions, and instead, don’t use any extension!

H4mi@lemm.ee · 2 years ago

I don’t even use a file system on my storage drives. I just write the file contents raw and try to memorize where.

Ms. ArmoredThirteen@lemmy.ml · 2 years ago

Sounds tedious, I’ve just been keeping everything in memory so I don’t have to worry about where it is.

257m@lemmy.ml · 2 years ago

Sounds inefficient. You can only store 8 gigs and it goes away when you shut off your computer? I just put it on punch cards and feed it into my machine.

Björn@swg-empire.de · 2 years ago

So archaic. Real men just flap a butterfly’s wings so that they deflect incoming cosmic rays in such a way that they flip the desired bits in RAM.

257m@lemmy.ml · 2 years ago

Ah yes, good old M-x butterfly.

MonkderZweite@feddit.ch · edit-2 2 years ago

I use mime. Because magic bit.

dan@upvote.au · 2 years ago

Linux mostly doesn’t use file extensions… It relies on “magic bytes” in the file.

Same with the web in general - it relies purely on MIME type (e.g. text/html for HTML files) and doesn’t care about extensions at all.
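As an illustration of the idea, here is a toy content sniffer in Python. It is a hand-written table of a few well-known signatures, not how libmagic or any desktop environment actually implements detection, and the names `MAGIC` and `sniff` are made up for the example.

```python
# A few well-known leading byte signatures. Real detectors know far more
# formats and also look at offsets other than zero.
MAGIC = [
    (b"PK\x03\x04", "zip"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"%PDF-", "pdf"),
    (b"\x1f\x8b", "gzip"),
    (b"\x7fELF", "elf"),
    (b"SQLite format 3\x00", "sqlite"),
]


def sniff(data):
    """Guess a file type from its first bytes, ignoring any file name."""
    for signature, kind in MAGIC:
        if data.startswith(signature):
            return kind
    return "unknown"


print(sniff(b"%PDF-1.7 ..."))
print(sniff(b"just some text"))
```

Under this scheme a .docx, a .jar and a .cbz would all come back as "zip", which is the joke of the thread in one function.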

fibojoly@sh.itjust.works · 2 years ago

“Magic bytes”? We just called them headers, back in my day (even if sometimes they are at the end of the file)

dan@upvote.au · 2 years ago

The library that handles it is literally called “libmagic”. I’d guess the phrase “magic bytes” comes from the programming concept of a magic number?

fibojoly@sh.itjust.works · 2 years ago

I did not know about that one! It makes sense though, because a lot of headers would start with, well yeah, “magic numbers”. Makes sense.

fibojoly@sh.itjust.works · 2 years ago

You can just go in Folder View and uncheck “hide known file extensions” to fix that! ;)

dan@upvote.au · edit-2 2 years ago

SQLite explicitly encourages using it as an on-disk binary format. The format is well-documented and well-supported, backwards compatible (there’s been no major version changes since 2004), and the developers have promised to support it at least until the year 2050. It has quick seek times if your data is properly indexed, the SQLite library is distributed as a single C file that you can embed directly into your app, and it’s probably the most tested library in the world, with something like 500x more test code than library code.

Unless you’re a developer that really understands the intricacies of designing a binary data storage format, it’s usually far better to just use SQLite.
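A minimal sketch of what "just use SQLite" as a project file can look like, using Python's built-in `sqlite3` module. The schema, the `.myproj` extension in the usage note, and the function names are all invented for this example and are not taken from Audacity or any other app mentioned in the thread.

```python
import sqlite3


def save_project(path, name, clips):
    """Write a project file: one metadata row plus one BLOB row per (label, bytes) clip."""
    con = sqlite3.connect(path)
    with con:  # commits on success, rolls back if an exception is raised
        con.execute(
            "CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value TEXT)"
        )
        con.execute(
            "CREATE TABLE IF NOT EXISTS clips "
            "(id INTEGER PRIMARY KEY, label TEXT, samples BLOB)"
        )
        con.execute("INSERT OR REPLACE INTO meta VALUES ('name', ?)", (name,))
        con.executemany("INSERT INTO clips (label, samples) VALUES (?, ?)", clips)
    con.close()


def load_clip(path, label):
    """Fetch one clip's bytes by label without reading the rest of the file."""
    con = sqlite3.connect(path)
    row = con.execute(
        "SELECT samples FROM clips WHERE label = ?", (label,)
    ).fetchone()
    con.close()
    return row[0] if row else None
```

Usage would be along the lines of `save_project("song.myproj", "demo", [("intro", b"\x00\x01")])` followed by `load_clip("song.myproj", "intro")`. The resulting file is an ordinary SQLite database under a custom extension, so it starts with the `SQLite format 3` header.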

0x2d@lemmy.ml · 2 years ago

Use binwalk on those

ScrewdriverFactoryFactoryProvider [they/them]@hexbear.net · 2 years ago

Most of Adobe’s formats are just gzipped XML

Ineocla@lemmy.ml · 2 years ago

Microsoft office also is xml

observantTrapezium@lemmy.ca · 2 years ago

Nothing wrong with that… Most people don’t need to reinvent the wheel, and choosing a filename extension meaningful to the particular use case is better than leaving it as .zip or .db or whatever.

CoderKat@lemm.ee · 2 years ago

Totally depends on what the use case is. The biggest problem is that you basically always have to compress and uncompress the file when transferring it. It makes for a good storage format, but a bad format for passing around in ways that need to be constantly read and written.

Plus often we’re talking plain text files being zipped and those plain text formats need to be parsed as well. I’ve written code for systems where we had to do annoying migrations because the serialized format is just so inefficient that it adds up eventually.

AItoothbrush@lemmy.zip · 2 years ago

When I discovered as a little kid that apk files are actually zips I felt like a detective.

TheAnonymouseJoker@lemmy.ml · edit-2 2 years ago

Wait till you meet the real evil…

WEBP images. The worst image file format on earth to deal with metadata and timestamps. FFFFUUUUUCK WEBPOOP (and no AVIF please).

XNViewMP is a saviour on all OSes though, thankfully, being the only tool that can batch convert webpoops to any proper image format with preserved metadata.

At least with renamed ZIP files, I literally do not need to care as long as 7-Zip or PeaZip is installed, so I can just “open as * archive”. And for video/audio, have MediaInfo installed on any OS. You will thank me someday.

Ugly Bob@sh.itjust.works · 2 years ago

I’m curious, what’s wrong with webp?

TheAnonymouseJoker@lemmy.ml · 2 years ago

WEBP is very weird to convert to other formats and retain metadata. This is not a problem with JPG, PNG and other formats. And only one tool I mentioned solves that problem.

karlthemailman@sh.itjust.works · 2 years ago

Is that an issue with the format or the currently available tools though?

TheAnonymouseJoker@lemmy.ml · 2 years ago

Google is responsible for this problem. They created WEBP, which was not necessary to adopt, but shoved it down our throats via Chrome saving images as WEBP by default, and by making websites that use their cloud as a CDN serve WEBPs in general.

Marxism-Fennekinism@lemmy.ml · 2 years ago

Smh at least use 7z

Gamma@programming.dev · 2 years ago

zstd or leave

dan@upvote.au · 2 years ago

They both have their use cases. Zstandard is for compression of a stream of data (or a single file), while 7-Zip is actually two parts: A directory structure (like tar) plus a compression algorithm (like LZMA which it uses by default) in a single app.

7-Zip is actually adding zstd support: https://sourceforge.net/p/sevenzip/feature-requests/1580/

Daisy (she/her)@lemmy.ml · edit-2 2 years ago

Well when using zstd, you tar first, something like tar -I zstd -cf my_tar.tar.zst my_files/*. You almost never call zstd directly and always use some kind of wrapper.

dan@upvote.au · 2 years ago

Sure, you can tar first. That has various issues though, for example if you just want to extract one file in the middle of the archive, it still needs to decompress everything up to that point. Something like 7-Zip is more sophisticated in terms of how it indexes files in the archive, so I’m looking forward to them adding zstd support.

FWIW most of my uses of zstd don’t involve tar, but it’s in things like Borgbackup, database systems, etc.

Daisy (she/her)@lemmy.ml · 2 years ago

Yes, definitely. My biggest use is transparent filesystem compression, so I completely agree!

Derpgon@programming.dev · 2 years ago

I’ll gunzip you to oblivion!

AVincentInSpace@pawb.social · 2 years ago

zstd may be newer and faster but lzma still compresses more

Gamma@programming.dev · edit-2 2 years ago

Thought I’d check on the Linux source tree tar. zstd -19 vs lzma -9:

❯ ls -lh
total 1,6G
-rw-r--r-- 1 pmo pmo 1,4G Sep 13 22:16 linux-6.6-rc1.tar
-rw-r--r-- 1 pmo pmo 128M Sep 13 22:16 linux-6.6-rc1.tar.lzma
-rw-r--r-- 1 pmo pmo 138M Sep 13 22:16 linux-6.6-rc1.tar.zst

About +8% compared to lzma. Decompression time though:

zstd -d -k -T0 *.zst  0,68s user 0,46s system 162% cpu 0,700 total
lzma -d -k -T0 *.lzma  4,75s user 0,51s system 99% cpu 5,274 total

Yeah, I’m going with zstd all the way.

the_weez@midwest.social · 2 years ago

Nice data. Thanks for reminding me why I prefer zstd

AVincentInSpace@pawb.social · 2 years ago

damn I did not know zstd was that good. Never thought I’d hear myself say this unironically but thanks Facebook

Gamma@programming.dev · 2 years ago

*Thank you engineers who happen to be working at Facebook

AVincentInSpace@pawb.social · 2 years ago

Very true, good point

gamer@lemm.ee · 2 years ago

As always, you gotta know both so that you can pick the right tool for the job.