Did you ever saw a char and thought: “Damn, 1 byte for a single char is pretty darn inefficient”? No? Well I did. So what I decided to do instead is to pack 5 chars, convert each char to a 2 digit integer and then concat those 5 2 digit ints together into one big unsigned int and boom, I saved 5 chars using only 4 instead of 5 bytes. The reason this works is, because one unsigned int is a ten digit long number and so I can save one char using 2 digits. In theory you could save 32 different chars using this technique (the first two digits of an unsigned int are 42 and if you dont want to account for a possible 0 in the beginning you end up with 32 chars). If you would decide to use all 10 digits you could save exactly 3 chars. Why should anyone do that? Idk. Is it way to much work to be useful? Yes. Was it funny? Yes.

Anyone whos interested in the code: Heres how I did it in C: https://pastebin.com/hDeHijX6

Yes I know, the code is probably bad, but I do not care. It was just a funny useless idea I had.

    • AllNewTypeFace@leminal.space
      link
      fedilink
      arrow-up
      8
      ·
      2 months ago

      I did something like this once, in the course of a project whose purpose I don’t remember. Realising that 8-bit ASCII was wasted on the constrained alphabet of some kind of identifiers, I packed them into 6 bits, 1⅓ to a byte. I recall naming the code to do this “ShortASC”, pronounced “short-ass”

    • deltapi@lemmy.world
      link
      fedilink
      arrow-up
      1
      ·
      2 months ago

      The AD&D “Gold Box” games from SSI Inc. stored game text in 6-bit encoding. The first one of these I played was “Champions of Krynn” and the PC release came on 4 360k 5.25 dsdd floppy disks. They actually needed the packing in those days, and couldn’t afford to spent cpu cycles or ram on built in compression.
      I remember opening up the game data files in a file viewer (maybe pc-tools?) and being confounded by the lack of text in the files.

  • bandwidthcrisis@lemmy.world
    link
    fedilink
    arrow-up
    59
    ·
    2 months ago

    You would have done well with this kind of thinking in the mid-80s when you needed to fit code and data into maybe 16k!

    As long as you were happy to rewrite it in Z80 or 6502.

    Another alternative is arithmetic encoding. For instance, if you only needed to store A-Z and space, you code those as 0-26, then multiply each char by 1, 27, 27^2, 26^3 etc, the add them.

    To unpack them, divide by 27 repeatedly, the remainder each time is each character. It’s simply covering numbers to base-27.

    It wouldn’t make much difference from using 5 bits per char for a short run, though, but could be efficient for longer strings, or if encoding a smaller set of characters.

  • sunbeam60@lemmy.ml
    link
    fedilink
    arrow-up
    41
    ·
    edit-2
    2 months ago

    After all… Why not?

    Why shouldn’t I ignore the 100+ cultures whose character set couldn’t fit into this encoding?

  • BartyDeCanter@lemmy.sdf.org
    link
    fedilink
    arrow-up
    34
    arrow-down
    1
    ·
    2 months ago

    Oh god that switch statement. Here, let me come up with something better:

    if (pChar >= 'a' && pChar <= 'z') {
      return pChar - 'a' + 10;
    } else if (pChar == ' ') {
      return 36;
    } else if (pChar == '.'){
      return 37;
    }
    return 0;
    
      • Karyoplasma@discuss.tchncs.de
        link
        fedilink
        arrow-up
        10
        ·
        edit-2
        2 months ago

        Yes, it’s easy to understand, but this ifelse is much safer because it “handles” uppercase and digits by just returning 0. If you’d call OP’s function with something like ‘A’, you’ll get nonsense data because it doesn’t have a default. (I think it will return whatever value currently resides in eax)

      • BartyDeCanter@lemmy.sdf.org
        link
        fedilink
        arrow-up
        2
        ·
        edit-2
        2 months ago

        A single line comment would make it as easy to understand, and much more flexible if you wanted to add handling upper case letters or digits. Or even remap the values to a more standard 6 bit encoding.

  • solrize@lemmy.ml
    link
    fedilink
    arrow-up
    33
    ·
    2 months ago

    Back in the day those tricks were common. Some PDP-11 OS’s supported a “Radix-50” encoding (50 octal = 40 decimal) that packed 3 characters into a 16 bit word (40 codepoints=26 letters, 10 digits, and a few punctuation). So you could have a 6.3 filename in 3 words.

    • enumerator4829@sh.itjust.works
      link
      fedilink
      English
      arrow-up
      17
      ·
      2 months ago

      Lol, using RAM like last century. We have enough L3 cache for a full linux desktop in cache. Git gud and don’t miss it (/s).

      (As an aside, now I want to see a version of puppylinux running entirely in L3 cache)

      • BartyDeCanter@lemmy.sdf.org
        link
        fedilink
        arrow-up
        3
        ·
        2 months ago

        I decided to take a look and my current CPU has the same L1 as my high school computer had total RAM. And the L3 is the same as the total for the machine I built in college. It should be possible to run a great desktop environment entirely in L3.

    • DacoTaco@lemmy.world
      link
      fedilink
      arrow-up
      4
      ·
      edit-2
      2 months ago

      Cache man, its a fun thing. 32k 32 (derp, 32 not 32k) is a common cache line size. Some compilers realise that your data might be hit often and aligns it to a cache line start to make its access fast and easy. So yes, it might allocate more memory than it should need, but then its to align the data to something like a cache line.
      There is also a hardware reasons that might also be the case. I know the wii’s main processor communicates with the co processor over memory locations that should be 32k aligned because of access speed, not only because of cache. Sometimes, more is less :')

      Hell, might even be a cause of instruction speed that loading and handling 32k of data might be faster than a single byte :').

      Then there is also the minimum heap allocation size that might factor in. Though a 32k minimum memory block seems… Excessive xD

  • RiQuY@lemmy.zip
    link
    fedilink
    arrow-up
    28
    arrow-down
    1
    ·
    2 months ago

    Interesting idea but type conversion and parsing is much more slower than wasting 1 byte. Nowadays memory is “free” and the main issue is the execution speed.

  • traceur301@lemmy.blahaj.zone
    link
    fedilink
    English
    arrow-up
    26
    ·
    2 months ago

    I’m not sure if this is the right setting for technical discussion, but as a relative elder of computing I’d like to answer the question in the image earnestly. There’s a few factors squeezing the practicality out of this for almost all applications: processor architectures (like all of them these days) make operating on packed characters take more operations than 8 bit characters so there’s a speed tradeoff (especially considering cache and pipelining). Computers these days are built to handle extremely memory demanding video and 3d workloads and memory usage of text data is basically a blip in comparison. When it comes to actual storage and not in-memory representation, compression algorithms typically perform better than just packing each character into fewer bits. You’d need to be in a pretty specific niche for this technique to come in handy again, for better or for worse

    • da_cow (she/her)@feddit.orgOP
      link
      fedilink
      arrow-up
      6
      ·
      2 months ago

      This is 100% true. I never plan on actually using this. It might be useful for working on microcontrollers like an ESP32, but apart from that the trade of for more computational power is not worth the memory savings.

      • rustyricotta@lemmy.dbzer0.com
        link
        fedilink
        English
        arrow-up
        3
        ·
        2 months ago

        Having seen many of Kaze’s videos on N64 development, I’ve learned that the N64 has like 4x the processing power it needs compared to its memory. In hardware cases like that the trade-off of computational power and memory memory savings gets you some nice performance gains.

    • gusgalarnyk@lemmy.world
      link
      fedilink
      arrow-up
      6
      ·
      2 months ago

      I liked the technical discussion so thank you. Keep it up, I got into this career because there was always so much to learn.

    • nik9000@programming.dev
      link
      fedilink
      arrow-up
      2
      arrow-down
      1
      ·
      2 months ago

      There is still fun to be had! Just… Different fun!

      In database land lookup tables are pretty common. Prefix tries and the like are super common in search land. I’ve seen GCD, offset, delta-of-delta, and some funky bitwise floating point compression used. Sometimes just to save dist space. But usually to save working set space or IO or S3 cache space.

      And squeezing the most out of modern CPUs is its own art. Compilers are glorious. And modern CPUs are magic lightning rocks. But you can learn to sing to them just right to make them all happy.

  • drath@lemmy.world
    link
    fedilink
    arrow-up
    22
    ·
    edit-2
    2 months ago

    Oh god, please don’t. Just use utf8mb4 like a normal human being, and let the encoding issues finally die out (when microsoft kills code pages). If space is of consideration, just use compression, like gz or something.

  • anton@lemmy.blahaj.zone
    link
    fedilink
    arrow-up
    21
    ·
    2 months ago

    You could save 0.64 bit per char more if you actually treated you output as a binary number (using 6 bits per char) and didn’t go through the intermediary string (implicitly using base 100 at 6.64 bits per char).
    This would also make your life easier by allowing bit manipulation to slice/move parts and reducing work for the processor because base 100 means integer divisions, and base 64 means bit shifts. If you want to go down the road of a “complicated” base use base 38 and get similar drawbacks as now, except only 5.25 bits per char.

    • Redkey@programming.dev
      link
      fedilink
      arrow-up
      7
      ·
      2 months ago

      I was so triggered by the conversion from char-to-int-to-string-to-packedint that I had to write a bitwise version that just does char-to-packedint (and back again), with bitwise operators.

      https://pastebin.com/V2An9Xva

      As others have pointed out, there are probably better options for doing this today in most real-life situations, but it might make sense on old low-spec systems if not for all the intermediate conversion steps, which is why I wrote this.

  • Ethan@programming.dev
    link
    fedilink
    English
    arrow-up
    17
    ·
    2 months ago

    Does the efficiency of storage actually matter? Are you working on a constrained system like a microcontroller? Because if you’re working on regular software, supporting Unicode is waaaaaaaaaaay more valuable than 20% smaller text storage.

    • da_cow (she/her)@feddit.orgOP
      link
      fedilink
      arrow-up
      3
      ·
      2 months ago

      I do sometimes work with microcontrollers, but so far I have not encountered a condition where these minimal savings could ever be useful.

      • BartyDeCanter@lemmy.sdf.org
        link
        fedilink
        arrow-up
        2
        ·
        2 months ago

        Not that long ago I was working on TI’s C2000 DSPs. They have a 16 bit word size, which would make this a convenient way to pack three characters into a single word. Ther ewould then be a single leftover bit which could be used for signaling, parity, or something else.

      • magic_lobster_party@fedia.io
        link
        fedilink
        arrow-up
        9
        ·
        2 months ago

        Well it’s certainly possible to fit both uppercase and lowercase + 11 additional characters inside an int (26 + 26 + 11 = 63). The you need a null terminating char, which adds up to 64, which is 6 bits.

        So all you need is 6 bits per char. 6 * 5 = 30, which is less than 32.

        It’s easier to do this by thinking in binary rather than decimals. Look into bit shifting and other bitwise operations.

        • lad@programming.dev
          link
          fedilink
          arrow-up
          2
          ·
          2 months ago

          Depending on the use-case you might also want to add special case value like @Redkey@programming.dev did in their example, and get kind of UTF-8 pages. Then you can pack lowercase to 5 bits, and uppercase and some special symbols to 10 bits, and it will be smaller if uppercase are rare

      • Cethin@lemmy.zip
        link
        fedilink
        English
        arrow-up
        6
        ·
        edit-2
        2 months ago

        If you’re ever doing optimizations like this, always think in binary. You’re causing yourself more trouble by thinking in decimal. With n bits you can represent 2^n different results. Using this you can figure out how many bits you need to store however many different potential characters. 26 letters can be stored in 5 bits, with 6 extra possibilities. 52 letters can be stored in 6 bits, with 12 extra possibilities. Assuming you want an end of string character, you have 11 more options.

        If you want optimal packing, you could pack this into 48 bits, or 6 bytes/chars, for 8 characters.

  • hdsrob@lemmy.world
    link
    fedilink
    English
    arrow-up
    10
    ·
    2 months ago

    We have a binary file that has to maintain compatibility with a 16 bit Power Basic app that hasn’t been recompiled since '99 or '00. We have storage for 8 character strings in two ints , and 12 character string in two ints and two shorts.