CumulonimbusFS

30 July 2015

Introduction

It is incredible how many services on the Internet allow us to store our data. From holiday photos to a birthday video, one may literally fill the web with one's bytes. Some of these services require some kind of authentication or subscription, while others allow anonymous submission. But what if someone wanted to reuse that free space for their own type of data?

This post presents CumulonimbusFS, a FUSE-based Ruby library that may be used to turn any kind of web storage into your own file system.

Pastebin

We will start our exploration with one of the simplest kinds of data storage available on the web: the pastebin. A pastebin website allows a user to store some plain text. This is commonly used by developers to share code snippets or backtraces for debugging. Pastie.org is one example of such a service. It is trivial to use: one HTTP request stores our text and another retrieves its content. These two requests are going to be our primitives for the file system. An interesting feature of pastie.org is that it does not allow any modification of a paste. It is, however, possible to create a new paste based on an existing one. As we will see later, this constraint has a direct consequence on the file system design.

When inspecting the HTTP request and response for uploading, you may notice that a key is returned. This key is what we will use to reference our files and directories. In our case, this key is composed of two elements: a pastie id and the key to access the pastie.

Below are the two methods used to store and retrieve data from pastie.org. In this case, I have used the Turf library for the HTTP requests, but any other HTTP library should be fine.

class PastieKeyValueStore < TextKeyValueStore

  @@form = { "paste[authorization]" => "bananabread",
             "commit" => "Create Paste" }

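  # Upload the value as a new paste and return "<pastie id>_<key>".
  def store(value)
    f = @@form.clone
    f["paste[body]"] = value
    r = Turf::multipart("http://pastie.org/pastes", f)
    r.run
    # The pastie id is read from a cookie, the access key from the end of the Location header.
    pastie = r.response.cookies["pasties"]
    key = r.response.headers["Location"].split("/").last
    pastie + "_" + key
  end

  # Split the name back into pastie id and access key, then download the raw paste.
  def retrieve(name)
    pastie, key = name.split("_", 2)
    r = Turf::get("http://pastie.org/pastes/#{pastie}/download?key=#{key}")
    r.run
    r.response.content
  end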

end

At this stage, we have turned pastie.org into a convenient key-value store. The next step is to turn this store into a full file system.
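The same two primitives can be tried without touching the network. Below is a minimal sketch of an in-memory backend; `MemoryKeyValueStore` is a hypothetical name that is not part of the library, and in the real code such a backend would extend `TextKeyValueStore` like the pastie one does. As with pastie.org, a record is never modified once stored: every call to `store` returns a fresh key.

```ruby
require "securerandom"

# Hypothetical in-memory backend (an illustration, not part of CumulonimbusFS).
# It exposes the same two primitives as the pastie-backed store.
class MemoryKeyValueStore
  def initialize
    @records = {}
  end

  # Save a value under a newly generated key and return that key.
  # Existing records are never overwritten.
  def store(value)
    key = SecureRandom.hex(8)
    @records[key] = value.dup.freeze
    key
  end

  # Return the value saved under key; raises KeyError for an unknown key.
  def retrieve(key)
    @records.fetch(key)
  end
end

store = MemoryKeyValueStore.new
key = store.store("hello")
puts store.retrieve(key)   # prints "hello"
```

Such a backend is mostly useful for testing the file system logic before pointing it at a real service.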

Text key-value store

The question at this stage is how to fit a directory and file structure into some text. There are multiple options, and any of them is fine as long as you remain consistent and follow a pre-determined format. For a file, I decided to put a magic marker on the first line, followed by the base64-encoded content of the file:

#F
<content base64-encoded>

Straightforward. For the directories, we need some sort of structure. A directory may contain files and sub-directories, so we will keep track of the kind ("D" or "F"). Each entry has a name as well as an address (or key):

#D
F b6de49ab myfile1
D 56865325 mysubdir1

And that is about it. This format is neither optimised nor able to handle attributes (execute flags, user, group, etc.), but it works and is easy to debug. A separate class implements that description and is extended by PastieKeyValueStore:

require "base64"

class TextKeyValueStore

  # Turn a "#D" record into a hash: name => { key:, type: }.
  def parse_directory(content)
    if content.lines.first != "#D\n"
      warn "NOT A DIRECTORY"
    end
    Hash[ content.lines[1..-1].collect { |l|
      # Limit the split to three fields so that names may contain spaces.
      t, k, n = l.chomp.split(" ", 3)
      [n, {key: k, type: t}]
    }]
  end

  # Inverse of parse_directory: one "<type> <key> <name>" line per entry.
  def gen_directory(files)
    header = "#D\n"
    content = files.collect { |n, w|
      "#{w[:type]} #{w[:key]} #{n}"
    }.join("\n")
    header + content
  end

  # Drop the "#F" marker line and decode the remaining base64 payload.
  def parse_file(content)
    if content.lines.first != "#F\n"
      warn "NOT A FILE"
    end
    Base64.decode64(content.split("\n", 2).last)
  end

  def gen_file(content)
    "#F\n" + Base64.encode64(content)
  end

end

This class turns a text-encoded file into its binary form and a text-encoded directory into a hash that maps each entry name to its key and type. From this representation, we shall see how to implement the basic file system primitives.

FUSE and rfusefs

FUSE is a kernel module that allows a regular user to mount a virtual file system. This virtual file system is usually generated and managed by another process. Ruby has a few bindings for FUSE. I used rfusefs, which only requires a limited set of methods to be implemented. In particular, there is no need to buffer reads or writes as the library does this part for you. (rfusefs exposes an interface similar to that of fusefs, but is built on rfuse.)

As mentioned in the fusefs documentation, only six methods need to be implemented to create a read-only file system. Below is an excerpt of the code for the contents method (which retrieves the content of a directory). Note how we mainly rely on the methods defined earlier within the TextKeyValueStore and PastieKeyValueStore classes.

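# List the entry names of the directory at path.
def contents(path)
  dkey = get_key(path)
  d = @store.parse_directory(@store.retrieve(dkey))
  d.keys
end

private

# Resolve a path to the key of its record; the root key is kept in @origin.
def get_key(path)
  return @origin if path == "/"
  d = get_parent(path)
  d[scan_path(path).last][:key]
end

# Walk down from the root and return the parsed parent directory of path.
def get_parent(path)
  current = @origin
  d = @store.parse_directory(@store.retrieve(current))
  scan_path(path)[0..-2].each do |name|
    current = d[name][:key]
    d = @store.parse_directory(@store.retrieve(current))
  end
  d
end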

Although it might seem expensive to walk through all the parents from the root to retrieve the content of a directory, this can be sped up by caching some of the already retrieved values (e.g., using an LRU cache). As mentioned earlier, a record cannot be modified once stored, so a cached value never becomes stale and such a caching system is trivial to implement.
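As an illustration of that caching idea, here is a minimal LRU sketch wrapped around any object that responds to `retrieve`. The names `CachedStore` and `CountingStore` and the `capacity` parameter are assumptions made for this example and do not come from the library.

```ruby
# Minimal LRU cache in front of a key-value store (illustrative sketch).
class CachedStore
  def initialize(store, capacity = 128)
    @store = store
    @capacity = capacity
    @cache = {}   # Ruby hashes keep insertion order: the first key is the oldest
  end

  def retrieve(key)
    if @cache.key?(key)
      value = @cache.delete(key)                  # hit: removed so it is re-inserted as newest
    else
      value = @store.retrieve(key)                # miss: one round trip to the backend
      @cache.shift if @cache.size >= @capacity    # evict the least recently used entry
    end
    @cache[key] = value
  end

  # Writes pass straight through. Records are immutable, so no cached
  # entry ever needs to be invalidated.
  def store(value)
    @store.store(value)
  end
end

# Tiny fake backend that counts how often it is asked for a record.
class CountingStore
  attr_reader :fetches

  def initialize(data)
    @data = data
    @fetches = 0
  end

  def retrieve(key)
    @fetches += 1
    @data.fetch(key)
  end
end

backend = CountingStore.new("a" => "#D\n", "b" => "#F\n")
cached = CachedStore.new(backend, 1)
cached.retrieve("a")
cached.retrieve("a")
puts backend.fetches   # prints 1: the second lookup was served from the cache
```

Wrapping the store rather than the path lookup keeps the file system code unchanged: each directory fetched on the way down from the root simply becomes cheap the second time.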

For the complete implementation, please refer to the git repository. Our text-based file system is now ready to be used.

Image key-value store

The same approach we had with text may be used for images. We do need to be careful with the type, as a JPEG may be modified or recompressed by the host. In this case, a safe bet is to work with PNG, which is lossless. The chunky_png library provides a convenient way to generate PNG files as well as direct access to the pixels.

One realistic constraint I came across was that the image should look like a valid image when uploaded. More precisely, it should be machine-acceptable (i.e., pass any potential test on its size or dimensions), but it does not need to be human-acceptable (i.e., it does not need to contain a cat in the background). If human acceptability were a real constraint, one could use standard steganography techniques.

In our case, I decided to go for a square image, to avoid awkward dimensions that may easily be spotted. Once this is done, we may use the pixels to store our data. Each pixel holds up to 4 bytes of data (RGBA).

require "chunky_png"

def bytes_to_png(bytes)
  size = bytes.bytesize
  width = Math.sqrt((size.to_f / 4) + 1 + 1).ceil # one for the size, one for the padding
  bytes = pad_content([size].pack("l>") + bytes, 4)
  png = ChunkyPNG::Image.new(width, width, ChunkyPNG::Color::TRANSPARENT)
  # Read the padded buffer as unsigned big-endian 32-bit values, one per RGBA pixel.
  bytes.unpack("N*").each_with_index { |pixel, i|
    # ChunkyPNG indexes pixels as [x, y], so the data fills the image column by column.
    x, y = i / width, i % width
    png[x, y] = pixel
  }
  png.to_blob
end
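The reverse direction is not shown in this post. The sketch below illustrates the packing scheme on its own, without ChunkyPNG: bytes become an array of 32-bit pixel values and back again. The names `bytes_to_pixels` and `pixels_to_bytes` are hypothetical, zero padding is an assumption (the padding helper is not shown here), and the length is written as an unsigned value for simplicity.

```ruby
# Pack a byte string into 32-bit pixel values: one pixel for the length,
# then the payload, zero-padded to a multiple of 4 bytes (assumed padding).
def bytes_to_pixels(bytes)
  data = bytes.b                                    # binary copy of the input
  padding = "\0" * ((4 - data.bytesize % 4) % 4)
  ([data.bytesize].pack("N") + data + padding).unpack("N*")
end

# Inverse: the first pixel gives the length, which is used to trim the
# padding and any unused trailing pixels of the square image.
def pixels_to_bytes(pixels)
  size = pixels.first
  pixels.drop(1).pack("N*").byteslice(0, size)
end

pixels = bytes_to_pixels("hello")
puts pixels.length              # prints 3: one length pixel, "hell", "o" plus padding
puts pixels_to_bytes(pixels)    # prints "hello"
```

Storing the length up front is what makes the padding and the unused pixels harmless when reading the image back.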

You may have noticed that I am referring to bytes and not structures as in the case of the text key-value store. A simple trick here to link both is to reuse the text store and simply consider the text we generated as a byte array.

class ImageKeyValueStore < TextKeyValueStore

  def parse_directory(content)
    content = png_to_bytes content
    super content
  end

 [...]

Here is my GPG public key embedded within a PNG:

GPG public key image

Note the top-left black corner, which contains the length of the content, and the white column on the right-hand side, which is our padding. When I uploaded this image to an image sharing service, it got classified as abstract art: a good sign that we passed the machine-acceptable criterion.

Stacking up

We now have our very own file storage with multiple types of backend data supported. At this point, you might object that this is of little interest. Our data appears unencrypted on the server, and anybody who has access to the URL and source code will be able to access our files. Besides, if our file system gets uncovered and deleted, we will have lost all our data.

In fact, both issues may be addressed easily by simply reusing existing pieces of software. Remember that at this stage, we do have a file system, so other processes can be layered on top of it to bring additional features. For instance, EncFS allows one to create an encrypted file system using another file system as storage. Similarly, ChironFS may be used for replication. It behaves like a RAID-1 configuration, but at the file system level rather than the block level.

Limitations

Two limitations remain at this time. Firstly, the size of the files is not verified. This may bring unexpected errors if the files you are creating are too large (i.e., do not expect to upload gigabytes of ISOs). An elegant way to solve this limitation would be to have another file system layer that ensures no file is bigger than a certain size. A file exceeding that size would be split into smaller files.
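The splitting step itself is simple. Below is a minimal sketch, assuming a fixed maximum size per record; `split_chunks` and `join_chunks` are hypothetical helpers, and how the pieces would then be named and stored in the file system is left open.

```ruby
# Split a byte string into pieces of at most max_size bytes.
def split_chunks(bytes, max_size)
  data = bytes.b
  (0...data.bytesize).step(max_size).map { |offset| data.byteslice(offset, max_size) }
end

# Reassemble the pieces in order.
def join_chunks(chunks)
  chunks.join.b
end

parts = split_chunks("abcdefghij", 4)
p parts                  # prints ["abcd", "efgh", "ij"]
puts join_chunks(parts)  # prints "abcdefghij"
```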

Secondly, as you may have noticed, the design of the file system has the side effect of updating every parent once a child is modified. This includes the root node. While it is easy to keep track of this change within the code, it means the last known root must be reused when remounting the file system. This peculiarity also has a positive side effect: since every older version of a file is kept, we have a versioned file system. For instance, it is possible to share a read-only copy of a specific subtree of the file system.
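This behaviour can be sketched in a few lines. The example below is a simplification, not the library's code: directories are kept as plain Ruby hashes instead of the text encoding, the store is an in-memory stand-in, and `MemoryStore`, `write_file`, `update` and `read_file` are names made up for the illustration. It also assumes every intermediate path component is a directory.

```ruby
# Write-once in-memory store; keys are sequential numbers for readability.
class MemoryStore
  def initialize
    @records = []
  end

  def store(value)
    @records << value
    (@records.size - 1).to_s
  end

  def retrieve(key)
    @records.fetch(Integer(key))
  end
end

# Store content at path and return the key of the NEW root directory.
def write_file(store, root_key, path, content)
  update(store, root_key, path.split("/").reject(&:empty?), content)
end

# Re-store every directory along the path; existing records stay untouched.
def update(store, dir_key, names, content)
  dir = store.retrieve(dir_key).dup
  name, *rest = names
  if rest.empty?
    dir[name] = { type: "F", key: store.store(content) }
  else
    child_key = dir.key?(name) ? dir[name][:key] : store.store({})
    dir[name] = { type: "D", key: update(store, child_key, rest, content) }
  end
  store.store(dir)
end

# Follow the path from a given root and return the file content.
def read_file(store, root_key, path)
  names = path.split("/").reject(&:empty?)
  key = names.reduce(root_key) { |k, name| store.retrieve(k).fetch(name)[:key] }
  store.retrieve(key)
end

store = MemoryStore.new
v0 = store.store({})                                   # empty root directory
v1 = write_file(store, v0, "/docs/a.txt", "first")
v2 = write_file(store, v1, "/docs/a.txt", "second")
puts read_file(store, v1, "/docs/a.txt")   # prints "first": the old root still works
puts read_file(store, v2, "/docs/a.txt")   # prints "second"
```

Each write yields a different root key, which is why the last known root has to be remembered, and why handing out an older key gives a read-only snapshot.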

Conclusion

The main idea behind this post is to show how easy it is to repurpose available web services. Most of the tools used here are already available, and it is just a matter of gluing them together: the complete library is less than 300 lines of Ruby.

From a defender's perspective, if you are currently exposing a web service to the Internet, you should give further thought to potential abuses and how to stop or prevent them (e.g., heuristics for anomaly detection and the type of degraded service to offer once triggered). For example, one current standard to prevent abuse is to have a CAPTCHA on the registration process only. This step could easily be completed by hand by an attacker. The remaining sections of the application may then be easily automated and therefore repurposed.

Major social networks have already implemented multiple abuse protection mechanisms, mainly based on spam detection theory. Such mechanisms may leverage observables at different levels, from the network level (e.g., IP addresses of the client) up to the application level (e.g., number of files uploaded in the last hour). In between, the HTTP layer itself brings valuable information too (e.g., one may detect a POST request issued without the corresponding form ever having been requested). The end goal is always to model the user's behaviour accurately enough to differentiate regular usage from abuse.
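As an illustration of one such application-level observable, the sketch below counts uploads per client over a sliding time window. The class name `UploadRateMonitor` and its thresholds are assumptions chosen for the example; a real system would combine many signals of this kind.

```ruby
# Sliding-window counter: reports when a client exceeds `limit` uploads
# within the last `window` seconds. The default values are arbitrary.
class UploadRateMonitor
  def initialize(limit: 100, window: 3600)
    @limit = limit
    @window = window
    @uploads = Hash.new { |hash, client| hash[client] = [] }
  end

  # Record one upload and return true when the client is over the limit.
  def record(client, now = Time.now.to_f)
    times = @uploads[client]
    times << now
    times.shift while times.first <= now - @window   # forget uploads outside the window
    times.size > @limit
  end
end

monitor = UploadRateMonitor.new(limit: 2, window: 60)
puts monitor.record("10.0.0.1", 0.0)    # prints false
puts monitor.record("10.0.0.1", 1.0)    # prints false
puts monitor.record("10.0.0.1", 2.0)    # prints true: third upload within a minute
```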

Homework

  • Clone the code and run the binaries.
  • Anywhere on the web, find two requests (GET/POST) which allow you to store some data.
  • Write two methods to interact with these, based on the text or image representation.

Don't hesitate to be creative about where to store the data. Below is a capture of CumulonimbusFS using the chat messages of a social media web site. On the left-hand side of the screen is the CumulonimbusFS process. On the right-hand side, at the top is the web site that receives the text updates. At the bottom is a terminal interacting with the new file system. The file system is mounted, used and unmounted. The last known origin is necessary to remount the file system in its last state.