Background
Hello everyone! Recently I ran into a problem: for reasons I could not determine, my memory card began moving all files into the LOST.DIR folder and stripping their extensions. Over time, more than 500 files of different types accumulated there: pictures, video, audio, documents. It was impossible to tell the format of each file by hand, so I started looking for a way to solve the problem programmatically.
Looking for a solution
I did not want to use ready-made solutions in the form of web services or programs, so I came up with the idea of writing a console utility that would go through all the files and assign the extensions automatically. I chose Python for the job. My search for suitable modules and libraries was unsuccessful for several reasons:
- Lack of support from the developer
- Excessive functionality
- Lack of support for new versions of Python
- Excessive code complexity
Of the many libraries, python-magic is very popular (almost 1000 stars on GitHub). It's a wrapper for the libmagic library. However, python-magic cannot be used on Windows without a DLL build of the underlying Unix library, so this option wasn't good enough.
Solving the problem
Given the above, I decided to solve the problem without third-party libraries and modules. After a short search for information on how to implement this, I concluded that the only reliable way is to determine the format from the file's signature, also called a "magic number".
A file signature is a sequence of bytes that identifies the file format. In hexadecimal notation, a signature looks like this:
50 4D 4F 43 43 4D 4F 43
Fortunately, there are two good sites on the Internet with a lot of signatures for different formats. I focused on the most common formats. As it turned out, some signatures are shared by several file formats, as is the case with Microsoft Office files. For this reason, in some cases the utility has to return a list of matching file extensions.
print(get("D:\\some_ms_office_document")) # prints ['doc', 'ppt', 'xls']
Signatures also often sit at an offset from the beginning of the file, as in the 3GP multimedia container.
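To make the idea of a signature with an offset concrete, here is a minimal sketch that compares raw bytes directly. The helper name matches_signature and the sample headers are my own illustrations, not part of the utility described in this article; the "ftyp" marker at byte 4 is the usual layout for 3GP-style containers.

```python
def matches_signature(header: bytes, signature_hex: str, offset: int = 0) -> bool:
    """Return True if `header` contains the signature at byte position `offset`."""
    # bytes.fromhex ignores the spaces between the hex pairs
    signature = bytes.fromhex(signature_hex)
    return header[offset:offset + len(signature)] == signature


# Synthetic headers built in memory (illustrative data, not real files)
jpeg_header = bytes.fromhex("FF D8 FF E0 00 10 4A 46 49 46")
container_header = bytes.fromhex("00 00 00 14") + b"ftyp3gp5"

jpeg_ok = matches_signature(jpeg_header, "FF D8 FF E0")
container_ok = matches_signature(container_header, "66 74 79 70", offset=4)
container_at_zero = matches_signature(container_header, "66 74 79 70")

print(jpeg_ok, container_ok, container_at_zero)
```

The last check fails because the same bytes are searched for at position 0, which is why each signature has to be stored together with its offset.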
1. Compiling a list of data
To store the data, I decided to use a JSON file with a "data" object whose value is an array of objects of the following form:
{"format": "jpg", "offset": 0, "signature": ["FF D8 FF E0", "FF D8 FF E1", "FF D8 FF E2", "FF D8 FF E8"]}
Where:
- format: the file format;
- offset: the offset of the signature from the beginning of the file, in bytes;
- signature: an array of signatures that match the specified file format.
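As a rough sketch of how such a file could be loaded and sanity-checked, the snippet below parses an inline JSON string of the same shape. The parse_entries helper and the "3gp" entry are assumptions made for illustration; the actual data file may contain different entries.

```python
import json

# Inline stand-in for data.json; the "3gp" entry is illustrative only
RAW = """
{"data": [
  {"format": "jpg", "offset": 0,
   "signature": ["FF D8 FF E0", "FF D8 FF E1", "FF D8 FF E2", "FF D8 FF E8"]},
  {"format": "3gp", "offset": 4,
   "signature": ["66 74 79 70 33 67 70"]}
]}
"""


def parse_entries(text):
    """Parse the JSON text and check that every entry has the expected shape."""
    entries = json.loads(text)["data"]
    for entry in entries:
        if not isinstance(entry["format"], str):
            raise ValueError("format must be a string")
        if not isinstance(entry["offset"], int) or entry["offset"] < 0:
            raise ValueError("offset must be a non-negative integer")
        for signature in entry["signature"]:
            bytes.fromhex(signature)  # raises ValueError on malformed hex
    return entries


entries = parse_entries(RAW)
for entry in entries:
    print(entry["format"], entry["offset"], len(entry["signature"]))
```

Validating the signatures once at load time means a typo in the data file shows up immediately instead of silently never matching.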
2. Writing a utility
Import the necessary modules:
import os
import json
Read the list of data:
abspath = os.path.abspath(os.path.dirname(__file__))
with open(os.path.join(abspath, "data.json"), "r", encoding="utf-8") as data_file:
    data = json.load(data_file)["data"]
Great, the data list is loaded. Now we read the file as an array of bytes. We will only read the first 32 bytes, since detecting common formats doesn't require more, and reading a large file in full would take a long time.
file = open("path_to_the_file", "rb").read(32)
If you print the "file" variable, you will see something similar to this:
\x90\x00\x03\x00\x00\x00\x04
Now the bytes must be converted to a hexadecimal string:
hex_bytes = " ".join(['{:02X}'.format(byte) for byte in file])
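For illustration, here is the same conversion applied to a small hand-made header (the sample bytes are my own). The second form uses the separator argument of bytes.hex, which assumes Python 3.8 or newer.

```python
# A hand-made six-byte header, used only to show the conversion
header = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10])

# Same expression as in the article: two uppercase hex digits per byte
hex_bytes = " ".join("{:02X}".format(byte) for byte in header)
print(hex_bytes)  # FF D8 FF E0 00 10

# Equivalent built-in form (assumes Python 3.8+ for the separator argument)
hex_bytes_builtin = header.hex(" ").upper()
print(hex_bytes_builtin == hex_bytes)  # True
```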
Next, we create a list to which the matching formats will be added:
out = []
Now we write a loop that checks every signature of every format against the file:
for element in data:
    for signature in element["signature"]:
        offset = element["offset"]*2+element["offset"]
        if signature == hex_bytes[offset:len(signature)+offset].upper():
            out.append(element["format"])
About this line:
offset = element["offset"]*2+element["offset"]
Since our bytes are represented as a string in which each byte takes two characters followed by a space, we multiply the offset by 2 and add one space per skipped byte. In other words, byte number n starts at position 3n in the string.
All that remains is to output the list of matching formats, which is stored in the "out" variable.
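A small worked example of this arithmetic, using a hand-made hex string with a signature that starts at byte 4 (the sample values are illustrative):

```python
# Eight bytes written as a spaced hex string; the signature starts at byte 4
hex_bytes = "00 00 00 14 66 74 79 70"
signature = "66 74 79 70"
byte_offset = 4

# Each skipped byte occupies two hex digits plus one space in the string
string_offset = byte_offset * 2 + byte_offset
print(string_offset)  # 12

found = hex_bytes[string_offset:string_offset + len(signature)]
print(found)  # 66 74 79 70
print(found == signature)  # True
```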
print(out) # prints something like ['extension_1', 'extension_2']
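Putting the steps together, the whole procedure might look roughly like the sketch below. The function name guess_formats, the small inline SIGNATURES list and the temporary sample file are assumptions for the demo, not the published module's API. Unlike the loop above, this version stops at the first matching signature of a format so the same format is never reported twice.

```python
import os
import tempfile

# Small inline stand-in for the JSON data; the "3gp" entry is illustrative
SIGNATURES = [
    {"format": "jpg", "offset": 0, "signature": ["FF D8 FF E0", "FF D8 FF E1"]},
    {"format": "3gp", "offset": 4, "signature": ["66 74 79 70 33 67 70"]},
]


def guess_formats(path, data=SIGNATURES, header_size=32):
    """Return the formats whose signature matches the start of the file."""
    with open(path, "rb") as handle:
        header = handle.read(header_size)  # only the first bytes are needed
    hex_bytes = " ".join("{:02X}".format(byte) for byte in header)

    out = []
    for element in data:
        offset = element["offset"] * 3  # two hex digits plus a space per byte
        for signature in element["signature"]:
            if signature == hex_bytes[offset:offset + len(signature)]:
                out.append(element["format"])
                break  # one match per format is enough
    return out


# Demo on a temporary file that starts with a JPEG signature
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "unknown_file")
    with open(sample, "wb") as handle:
        handle.write(bytes.fromhex("FF D8 FF E0 00 10 4A 46 49 46"))
    result = guess_formats(sample)

print(result)  # ['jpg']
```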
Conclusion
As it turned out, many projects need to recognize file formats, so I decided to release my solution as an open-source Python module called fleep, available on GitHub. You can install the module using the standard Python tool "pip":
pip install fleep
Examples of usage and a complete list of supported file formats are also available on the project's GitHub page. I improve fleep every day, adding new features and formats. You can use it in your project :)
Thank you for your attention!
P.S. I would be glad to hear your opinion about my module.
P.P.S. English is not my native language, so please excuse any mistakes :)