
Why You Shouldn't Put a Filesystem on Top of an Object Store

by MinIO | November 14th, 2023

Too Long; Didn't Read

When large organizations need to store and access oceans of data for deep learning, AI, and other data intensive use cases, POSIX does not have the ability or scalability to meet these demands.

When purchasing storage, the emphasis is usually on the media, but it may be even more important to consider the access method. You will need to take storage protocols into account when designing and procuring infrastructure, especially when you leave legacy storage behind in order to migrate to cloud-native object storage. Object storage relies on the S3 API for communications, while legacy workloads rely on POSIX, the Portable Operating System Interface, a set of standards developed in the 1980s to allow applications to be portable between Unix operating systems. Chances are that most enterprises have applications developed to run on POSIX that have been in service for decades. Chances are also that engineers are already aware of POSIX’s poor performance.


That being said, when you have legacy systems that can only ingest data in a certain format or from a certain source, your options might be limited: you may have no choice but to implement an outdated protocol or rewrite your code. For example, if you can only ingest from a local disk with a filesystem, and not via RESTful APIs accessed over a network, then you must first make that data available on disk before your application can use it. However, using an object store as a filesystem has a number of serious negative implications when it comes to performance, compatibility, data integrity and security.


Let’s demonstrate this with some real-world testing using a small utility called s3fs-fuse, which allows you to mount an S3 bucket as a local filesystem. The name stands for S3 (Simple Storage Service) File System, FUSE (Filesystem in Userspace). It’s an open-source project that leverages the FUSE interface to present a filesystem-like interface to S3.


Once an S3 bucket is mounted using s3fs-fuse, you can interact with the bucket as if it were a local filesystem. This means you can use regular file operations (like read, write, move, etc.) on files in your bucket. This sounds amazingly convenient, and one could argue it simplifies application development. But object storage and filesystems have fundamental differences that inherently affect an S3 bucket mounted as a filesystem.
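
For example, once the bucket is mounted (we will use the /home/user/test-bucket mount point from the tests later in this post), ordinary file I/O works unchanged. Here is a minimal sketch in Python; the hello.txt file name is just a hypothetical example:

import os

mount_point = "/home/user/test-bucket"  # s3fs-fuse mount of the S3 bucket

# Write a file exactly as if it were on a local disk
with open(os.path.join(mount_point, "hello.txt"), "w") as f:
    f.write("hello from a mounted bucket\n")

# Read it back and list the mount point (really an object listing under the hood)
with open(os.path.join(mount_point, "hello.txt")) as f:
    print(f.read())
print(os.listdir(mount_point))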


Let’s take a moment to step back from the s3fs-fuse utility to discuss the real reasons why treating object storage as a filesystem is far from optimal. The problem is much bigger than s3fs-fuse and extends to other utilities such as the Rust-based Mountpoint for Amazon S3, a file client that translates local filesystem API calls into S3 object API calls. The first reason is that all of these utilities rely on POSIX for filesystem operations. POSIX is inefficient, and it was never intended for working with very large files over the network.


POSIX-based systems slow down as demand on them, especially concurrent demand, increases. When large organizations need to store and access oceans of data for deep learning, AI and other data-intensive use cases, POSIX does not have the ability or scalability to meet these demands. While all-flash arrays have kept POSIX in the game, scalability and RESTful APIs (the hallmarks of the cloud) are like kryptonite to it.


Because of this, running POSIX on top of an object store is suboptimal. Let’s take a look at some of the reasons why:


  1. Performance: The POSIX filesystem interface is inherently IOPS-centric. It is chatty, expensive and hard to scale. The RESTful S3 API addresses this by turning an IOPS problem into a throughput problem, and throughput is easier and cheaper to scale. This is why object storage is high-performance at massive scale. Layering POSIX over S3 will not scale, because POSIX is far too chatty to be performed over an HTTP RESTful interface.


  2. Semantics: Because object operations are atomic and immutable, there is no way to guarantee consistency and correctness. This means you can lose uncommitted data in the event of a crash, or run into corruption issues with shared mounts.


  3. Data Integrity: Writes or any other mutations to a file won’t appear in the namespace until the object is committed. This means concurrent access across shared mounts will not see the modifications, which makes it unsuitable for shared access.


  4. Access Control: POSIX permissions and ACLs are primitive and incompatible with the S3 API’s way of handling identity and access management policies. It is not possible to safely implement POSIX access management on top of the S3 API.


POSIX also lacks most of the functionality that developers love about S3, such as object-level encryption, versioning and immutability. These features simply have no equivalent in the POSIX world, and nothing is able to translate them.
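
To make that concrete, here is a minimal sketch with the MinIO Python SDK showing two of those features, bucket versioning and per-object server-side encryption, that a POSIX translation layer simply cannot express. The endpoint and credentials are placeholders matching the snippets later in this post; swap in your own deployment’s values:

from minio import Minio
from minio.commonconfig import ENABLED
from minio.sse import SseS3
from minio.versioningconfig import VersioningConfig

# Placeholder endpoint and credentials; match them to your deployment
client = Minio("play.min.io", access_key="minioadmin", secret_key="minioadmin", secure=True)

# Enable object versioning on the bucket; there is no POSIX equivalent for this
client.set_bucket_versioning("test-bucket", VersioningConfig(ENABLED))

# Upload an object with server-side encryption requested for that specific object
result = client.fput_object("test-bucket", "taxi-data.csv", "/home/user/taxi-data.csv", sse=SseS3())
print(result.object_name, result.version_id)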

POSIX Pain Points

The following examples illustrate the issue and its implications. To get started, we will use a CSV file that is approximately 10GB in size and has 112 million rows.


Note: We will assume that s3fs-fuse is already installed and that you have mounted one of the buckets from your object storage into your filesystem. If not, follow the instructions here.


In these examples, we will assume the bucket name is test-bucket, the file taxi-data.csv is in the /home/user/ directory, and the s3fs-fuse bucket is mounted at /home/user/test-bucket/.

Copy Operation

We’ll try something simple first: copy the CSV file to our test-bucket using the mc command and record the time taken.

time mc cp /home/user/taxi-data.csv minio/test-bucket/taxi-data.csv


This shouldn’t take a lot of time, and the file should be copied to our bucket. Now let’s try to do the same with s3fs-fuse:

time cp /home/user/taxi-data.csv /home/user/test-bucket/taxi-data.csv


Time taken during the testing


real	1m36.056s
user	0m14.507s
sys	0m31.891s


In my case, I was only able to copy the file partially to the bucket, and the operation failed with the following error:


cp: error writing '/home/user/test-bucket/taxi-data.csv': Input/output error
cp: failed to close '/home/user/test-bucket/taxi-data.csv': Input/output error


After multiple tries it succeeded


real	5m3.661s
user	0m0.053s
sys	2m35.234s


As you can see, because of the number of API calls the utility needs to make and the general overhead of those operations, the utility becomes unstable and most of the operations do not even finish.
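
For contrast, the same upload issued directly against the S3 API from Python completes as one logical operation instead of a long sequence of translated filesystem calls. Here is a minimal sketch using the s3fs package (which we will also use in the next section); the endpoint and credentials are placeholders, so match them to your deployment:

import s3fs

# Placeholder endpoint and credentials; match them to your deployment
fs = s3fs.S3FileSystem(
    key="minioadmin",
    secret="minioadmin",
    client_kwargs={"endpoint_url": "https://play.min.io"},
)

# One logical upload over the S3 API instead of many translated POSIX calls
fs.put("/home/user/taxi-data.csv", "test-bucket/taxi-data.csv")
print(fs.info("test-bucket/taxi-data.csv"))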

Pandas Example

We’ve shown you a simple cp example, which may or may not have been convincing, because let's face it, you might think time cp is quite rudimentary.


So, for the folks who need more empirical evidence, let’s write a Python snippet to test this. We’ll do a simple Pandas example with both s3fs-fuse and the Python s3fs package and see the performance impact.


import timeit
import os
import fsspec
import s3fs  # pandas reads and writes s3:// paths through s3fs via fsspec
import pandas as pd

# Configure fsspec so pandas can use `s3://` paths and access S3 buckets directly
fsspec.config.conf = {
    "s3": {
        "key": os.getenv("AWS_ACCESS_KEY_ID", "minioadmin"),
        "secret": os.getenv("AWS_SECRET_ACCESS_KEY", "minioadmin"),
        "client_kwargs": {
            "endpoint_url": os.getenv("S3_ENDPOINT", "https://play.min.io")
        }
    }
}

# Write a dummy CSV file to test-bucket
df = pd.DataFrame({"column1": ["new_value1"], "column2": ["new_value2"]})
df.to_csv("s3://test-bucket/data/test-data.csv", index=False)


def process_s3():
    for i in range(100):
        # Read the existing data
        print(i)
        df = pd.read_csv('s3://test-bucket/data/test-data.csv')
        # Append a new row
        new_df = pd.concat([df, pd.DataFrame([{"column1": f"value{i}", "column2": f"value{i}"}])], ignore_index=True)
        # Write the data back to the file
        new_df.to_csv('s3://test-bucket/data/test-data.csv', index=False)


execution_time = timeit.timeit(process_s3, number=1)
print(f"Execution time: {execution_time:.2f} seconds")


Time taken during the testing

Execution time: 8.54 seconds


Now let's try the same for s3fs-fuse


import timeit
import pandas as pd

# Write a dummy CSV file to test-bucket through the s3fs-fuse mount
df = pd.DataFrame({"column1": ["new_value1"], "column2": ["new_value2"]})
df.to_csv("/home/user/test-bucket/data/test-data.csv", index=False)


def process_s3fs():
    for i in range(100):
        # Read the existing data
        print(i)
        df = pd.read_csv('/home/user/test-bucket/data/test-data.csv')
        # Append a new row
        new_df = pd.concat([df, pd.DataFrame([{"column1": f"value{i}", "column2": f"value{i}"}])], ignore_index=True)
        # Write the data back to the file
        new_df.to_csv('/home/user/test-bucket/data/test-data.csv', index=False)


execution_time = timeit.timeit(process_s3fs, number=1)
print(f"Execution time: {execution_time:.2f} seconds")



Time taken during the testing

Execution time: 9.59 seconds


These examples demonstrate constant reads and writes to S3 files. Now imagine this being performed concurrently by multiple clients; the latency grows dramatically.
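
As a rough illustration rather than a rigorous benchmark, you could run the same read-append-write loop from several threads at once and watch the wall-clock time balloon. Here is a minimal sketch assuming the same s3fs-fuse mount at /home/user/test-bucket; each simulated client uses its own hypothetical test-data-<id>.csv file so the run measures concurrency overhead rather than clients overwriting each other’s data (which, per the semantics issues above, is its own failure mode):

import timeit
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

MOUNT = "/home/user/test-bucket/data"


def client_loop(worker_id, iterations=100):
    # Each simulated client works on its own file through the FUSE mount
    path = f"{MOUNT}/test-data-{worker_id}.csv"
    pd.DataFrame({"column1": ["new_value1"], "column2": ["new_value2"]}).to_csv(path, index=False)
    for i in range(iterations):
        df = pd.read_csv(path)
        new_df = pd.concat([df, pd.DataFrame([{"column1": f"value{i}", "column2": f"value{i}"}])], ignore_index=True)
        new_df.to_csv(path, index=False)


def run_concurrently(workers=4):
    # Launch several clients at once to simulate concurrent access
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(client_loop, worker_id) for worker_id in range(workers)]
        for future in futures:
            future.result()


execution_time = timeit.timeit(run_concurrently, number=1)
print(f"Execution time with 4 concurrent clients: {execution_time:.2f} seconds")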

Remove the Overhead!

As you can see, the difference between using POSIX translation to treat objects as files and using the direct API to work with objects is night and day. There is simply no comparison when it comes to security, performance, data integrity and compatibility. MinIO has SDKs to integrate with almost any popular programming language, and it can run on almost any platform such as Kubernetes, bare metal Linux, Docker containers – and much more.
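
As a minimal sketch of what the direct path looks like with the MinIO Python SDK (the endpoint and credentials are placeholders; use your own deployment’s values), downloading or streaming an object takes only a few lines and never touches a POSIX translation layer:

from minio import Minio

# Placeholder endpoint and credentials; match them to your deployment
client = Minio("play.min.io", access_key="minioadmin", secret_key="minioadmin", secure=True)

# Download an object straight to local disk
client.fget_object("test-bucket", "taxi-data.csv", "/home/user/taxi-data-copy.csv")

# Or stream it without materializing a local file at all
response = client.get_object("test-bucket", "taxi-data.csv")
try:
    first_chunk = response.read(1024)
    print(len(first_chunk), "bytes read")
finally:
    response.close()
    response.release_conn()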


MinIO secures objects with encryption at rest and in transit, combined with PBAC to regulate access and erasure coding to protect data integrity. You will achieve the best possible performance regardless of where you run MinIO, because it takes full advantage of the underlying hardware (see Selecting the Best Hardware for Your MinIO Deployment). We’ve benchmarked MinIO at 325 GiB/s (349 GB/s) on GETs and 165 GiB/s (177 GB/s) on PUTs with just 32 nodes of off-the-shelf NVMe SSDs.


There is simply no need for a filesystem utility between MinIO and your application! Any advantage legacy apps might receive will be offset by the pain of POSIX.


If you have any questions about using POSIX translation for your application, be sure to reach out to us on Slack!


Also published here.