Convert any file into an image

tutorial python cli image-processing

intro

Images are one of the most common forms of media on the net. Many sites offer free image hosting: imgur and gphotos, just to name a few.

While there are a ton of image hosting sites, there aren't nearly as many hosting sites for other file types. This project tries to take advantage of that by developing a way of storing other file types as images on these image hosting platforms.

Of course, there are many ways of putting your data into images, and this is definitely not the most efficient one, but it is easy to understand and write, so here it is.

This script uses Python's Pillow library to generate the image, so you will have to set that up first: pip install pillow.

part 1 - putting data into image

What I have in mind for this project is to just write all the bytes of the file into the image as the values of the different channels. Since we are using an image with 3 channels (RGB), each holding one byte, each pixel can be used to represent 3 bytes.

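Using this, we can calculate the size of the image required to fit all the data into a square image: a file of n bytes needs n / 3 pixels (rounded up), and the side length is the square root of that pixel count (rounded up again).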
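
To make the arithmetic concrete, here is a minimal sketch of that calculation. The function name square_side and the extra_pixels parameter are my own for illustration, not taken from the actual script.

```python
import math

def square_side(num_bytes, extra_pixels=0):
    """Smallest side length of a square RGB image that can hold num_bytes.

    extra_pixels is a hypothetical knob for reserving additional pixels
    (for example one pixel of metadata).
    """
    pixels = (num_bytes + 2) // 3 + extra_pixels  # 3 bytes per pixel, rounded up
    if pixels <= 1:
        return 1
    # Integer ceil(sqrt(pixels)) without floating point rounding issues.
    return math.isqrt(pixels - 1) + 1

print(square_side(10))    # 10 bytes -> 4 pixels -> 2x2 image
print(square_side(1000))  # 1000 bytes -> 334 pixels -> 19x19 image
```

Note that a square image almost always has more pixels than the data needs, so some pixels at the end are left unused.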

First, we split the data into blocks of 3 bytes each. Once that is done, we use Pillow's putpixel() function to color each pixel with those values, embedding the data into the image.

One additional detail here: when there aren't enough bytes left to form a full pixel, we simply pad that “pixel” with 0s. Once all of that is done, the image is ready to be saved.
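
Here is a rough sketch of this step, not the actual enc.py. The function name bytes_to_image is made up, and filling the pixels row by row from the top-left corner is my assumption (any fixed order works, as long as the decoder uses the same one).

```python
import math
from PIL import Image

def bytes_to_image(data):
    """Write data into a square RGB image, 3 bytes per pixel (hypothetical helper)."""
    pixels = (len(data) + 2) // 3                     # number of 3-byte blocks
    side = math.isqrt(max(pixels - 1, 0)) + 1         # ceil(sqrt(pixels)), at least 1
    padded = data + bytes(pixels * 3 - len(data))     # pad the last block with 0s
    img = Image.new("RGB", (side, side))              # unused pixels stay (0, 0, 0)
    for i in range(pixels):
        r, g, b = padded[3 * i : 3 * i + 3]
        # Assumed order: left to right, then top to bottom.
        img.putpixel((i % side, i // side), (r, g, b))
    return img

img = bytes_to_image(b"hello")
print(img.size)              # 5 bytes -> 2 pixels -> 2x2 image
print(img.getpixel((0, 0)))  # the bytes of "hel"
img.save("hello.png")
```

putpixel() is slow for big files, but it keeps the idea easy to follow.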

part 2 - file size checking

Many file types are sensitive to file size. This means that the additional padding bytes at the end of the image will cause the file to be “corrupted”. To solve this, we need to be able to store the file size in the image and extract it again later.

In the script, I am using the first pixel to store the file size. I convert the file size in bytes into a 3-digit base-255 number, then write these 3 digits into the first pixel. This also means that the pixel representing the first 3 bytes must be moved somewhere else.
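
The conversion could look something like this. It is only a sketch: the function names are made up, and storing the most significant digit in the first channel is my assumption about the digit order.

```python
def size_to_pixel(size):
    """Encode a file size as 3 base-255 digits (assumed order: most significant first)."""
    if not 0 <= size < 255 ** 3:
        raise ValueError("size does not fit in 3 base-255 digits")
    high, rest = divmod(size, 255 * 255)
    mid, low = divmod(rest, 255)
    return (high, mid, low)

def pixel_to_size(pixel):
    """Inverse of size_to_pixel."""
    high, mid, low = pixel
    return (high * 255 + mid) * 255 + low

print(size_to_pixel(1000))            # (0, 3, 235) since 3 * 255 + 235 = 1000
print(pixel_to_size((0, 3, 235)))     # 1000
```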

I moved that first data pixel all the way to the back, to the end of the image blocks. When we extract the file later, we must remember to move those bytes back to the front.
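
To illustrate the shuffling, here is a sketch that works on plain lists of (r, g, b) tuples instead of a real image. The helper names are hypothetical, and the exact layout (size pixel first, then blocks 2 onwards, then block 1 at the very end of the blocks) is my reading of the description above.

```python
def split_blocks(data):
    """Split data into 3-byte tuples, padding the last one with 0s."""
    padded = data + bytes(-len(data) % 3)
    return [tuple(padded[i : i + 3]) for i in range(0, len(padded), 3)]

def layout_pixels(blocks, size_pixel):
    """Put the size pixel first and move the first data block to the end."""
    if not blocks:
        return [size_pixel]
    return [size_pixel] + blocks[1:] + [blocks[0]]

def restore_bytes(pixels, size):
    """Undo layout_pixels; size is the file size decoded from pixels[0]."""
    count = (size + 2) // 3                        # number of data blocks
    if count == 0:
        return b""
    blocks = [pixels[count]] + pixels[1:count]     # move the last block back to the front
    flat = bytes(value for block in blocks for value in block)
    return flat[:size]                             # drop the padding bytes

data = b"hello world"
pixels = layout_pixels(split_blocks(data), (0, 0, len(data)))
print(restore_bytes(pixels, len(data)))            # b'hello world'
```

Because the decoder knows the file size, it can work out where the moved block sits even when the square image has unused pixels after it.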

part 3 - image formats

When I first started, I stored the image in JPEG format. For some reason, I couldn't get back the original values that I had placed in the image.

After digging for a bit, I found that JPEG uses lossy compression. This means that some of the detail is lost, which shows up as changed pixel values, and since the pixel values are our data, the stored file gets corrupted.

So I switched to PNG, which is lossless, and everything works great. I am able to convert any file into an image and then extract the file back, even after uploading the image to some hosting services.
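
If you want to see the difference for yourself, a small experiment like the one below should do it. This is just a sketch: the helper name survives_roundtrip and the test pattern are made up, and everything is saved to memory instead of disk.

```python
import io
from PIL import Image

def survives_roundtrip(img, fmt):
    """Save img in the given format (in memory) and check the pixels come back unchanged."""
    buf = io.BytesIO()
    img.save(buf, format=fmt)
    buf.seek(0)
    with Image.open(buf) as reloaded:
        return reloaded.convert("RGB").tobytes() == img.tobytes()

# A noisy 16x16 test image, standing in for arbitrary file bytes.
raw = bytes((i * 37) % 256 for i in range(16 * 16 * 3))
img = Image.frombytes("RGB", (16, 16), raw)

print("PNG :", survives_roundtrip(img, "PNG"))   # lossless, pixels should match
print("JPEG:", survives_roundtrip(img, "JPEG"))  # lossy, pixels are expected to differ
```

Keep in mind that this only tests the format itself. A hosting site that re-encodes or resizes uploads could still change the pixels, so it is worth checking with a real upload.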

part 4 - getting the file back with the names

There are some obvious flaws in this system. Firstly, the file size is limited to just under 255^3 bytes (about 16.6 MB). Secondly, there is no way to retrieve the name of the file. Even after decoding the file, we are unable to tell what format it is in.

That is not very helpful if the recipient has no idea what kind of file they are looking at. A quick solution is to just zip the files together before encoding them. The ZIP file format stores the file names, and there is an additional benefit of having less data to encode (thanks to compression).

The script doesn't currently support zipping files, so it must be done before running the script. A possible extension would be to add zipping to the script itself, so that files are automatically zipped when encoded and unzipped when decoded.
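
As a rough idea of what that extension might look like, here is a sketch using Python's built-in zipfile module. This is not part of the current scripts, and the function names zip_to_bytes and unzip_bytes are made up. The zipped bytes would be what gets encoded into the image, and the decoder would unzip whatever it gets back.

```python
import io
import zipfile
from pathlib import Path

def zip_to_bytes(paths):
    """Zip the given files in memory and return the archive as bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            path = Path(path)
            zf.write(path, arcname=path.name)  # keep the file name, drop the directory
    return buf.getvalue()

def unzip_bytes(blob, dest):
    """Extract an in-memory archive into dest and return the stored file names."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        zf.extractall(dest)
        return zf.namelist()
```

For example, zip_to_bytes(["asdf.txt"]) gives the bytes to encode, and unzip_bytes(decoded_bytes, "out") would recreate out/asdf.txt with its original name.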

final product

Here is the final version of the script. enc.py converts files into images: it takes in any file (e.g. asdf.txt) and outputs an image (asdf.txt.png).

The dec.py script is run in the following way: dec.py asdf.txt.png output.txt. The scripts are available as a GitHub gist.

note: I realized that after some files are converted into images, some visible patterns remain, like straight lines running through the image. These patterns could be studied to find relations between the patterns and the file formats.

I tried encoding an Opus audio file, and the image has lines that are regularly spaced apart. Those could come from the fact that the file format is designed to be streamed, and hence is split into several packets.