CloudFFS : Large scale, high performance files storage system
CloudFFS
So we built and deployed this new service not long ago at work, and @stelabouras suggested we document some parts of it for internal consumption. Given that I haven't blogged for months, I thought I 'd just pour those words here instead.
CloudFFS (yes, it is a funny name) is a file-system (but not in the traditional sense, it doesn't hook into the kernel VFS layer or anything ) that provides storage for unbound number of files and very fast access to them over HTTP. It can manage PB scale volumes and up to 2^64 files per namespace(see below).
We have hundreds of millions of static files(images, video, text files, you name it) stored across our storage devices; having to deal with those many files is not a pleasant task, for our sys.operators and developers alike. We wanted a solution that frees our developers from having to worry about storage and provides a very simple way to store and retrieve files, and at the same time help our systems guys deal with backups and management of those files efficiently.
There are many problems associated with the use of multiple files. Wasted inodes/disk blocks, slower access time (iterating a path components is not free, looking up a directory entity within a directory is not free either), difficulty in making backups, need and use of elaborate directory naming schemes in order to deal with large directories, and more. In addition that, accessing those files over a network filesystem (e.g NFS) is not efficient by any means. Developers need to be aware of those limitations and of the rules that are in place in order to deal with said limitations, which places an unnecessary burden on them.
None of the solutions we looked into really seemed all that great for us, so we went ahead and build our own. Though, to be fair, we almost always end up building our own anyway. This practice has worked great for us all those years and given that we are a technology company, it makes sense for us to disregard the 'not invented here' approach.
Data Model
Files are uniquely identified by a 64bit number. They belong in namespaces, for example 'blogs', or 'images, or 'mails'. A file can hold up to 1GB of data. Files can also be either public, or private. Public files can be accessed directly (e.g http://x.pstatic.gr/me/305896160755729.jpg), whereas private files require HTTP authentication. This makes it possible to, say, make everything accessible over the public Web, except files that should not be accessible in that fashion (e.g log files, archived content, emails, etc ).
On disk, each namespace is represented as a directory in the 'root' directory. Within each namespace directory, there are 2 types of files. Data files and indices. They are further subdivided into 'live' and 'immutable' files. Immutable files, once they are created, are never updated again. In practice, there is almost always a single live datafile/live index associated with a namespace (there are cases where there can be 2 for a few seconds, whenever a live datafile is converted to an immutable one) and that live datafile holds incoming updates. Whenever a SET or DEL operation is executed, a new record is appended into the live data file and its respective live index. Once the live data file size exceeds a threshold, it turns into an immutable one and a new live datafile is created and used instead. Whenever it is necessary, a compaction task kicks in that will merge immutable files, discard deleted files and duplicates and create a new set of immutable files out of them and delete the old ones. This happens fairly infrequently though ( depends on the volume of data, but usually once every few month ). A live or immutable data file can hold thousands, millions, or even billions files, packed one after the other. Maybe I will get to write more about the structure of those files and how they are used in a future blog post.
Operations
There are 4 operations that are all mapped to HTTP methods/verbs. Get(GET), Set(PUT), Stat(HEAD) and Delete(DEL). A few things that may be worth noting; Whenever you set a file you can specify if it is a public, or a private file. Whenever you Set or Delete a file, or Get or Stat a file that has been stored as private, you need to specify authentication credentials(username/password). Otherwise, the operation fails (authentication failed). The list of users is defined in the configuration file.
Internals
The service is implemented as a multi-threaded application. A single thread handles network I/O (asynchronous I/O multiplexing w/ vector I/O). There are also some threads for processing requests and 2 threads for processing system tasks (compactions and live to immutable data conversions). The network I/O thread also accepts incoming connections and parses in HTTP requests. Whenever such request is parsed, it is placed in the 'mailbox' queue of the first idle thread for processing. The network I/O thread also accepts RPC connections/requests (for management) and also talks to our message queue service ; whenever a file is stored or deleted a new event is published into the queue service so that we can replicate data or whatever else we choose to do whenever those events are created.
There are 2 types of caches in place. One for immutable datafile keys ranges(a keys range holds a series of file ids and their (offset, size) into their respective immutable datafile), and another one for compressed content (whenever the agent/browser supports gzip/deflate encoding and the file requests can be provided back to the client in compressed form, we compress and cache the contents into that cache for later use). They both operate in an LRU fashion are protected by spinlocks; the cached objects are ref.counted.
Whenever a Get request is executed, we iterate through the list of the live datafiles for the namespace(usually 1). If not found in there, we walk the list of the immutable datafiles(they are always sorted by creation time) until we get a match, or not. A Set operation appends the data into the active live data file for the namespace, syncs on disk and then lazily appends on the index.
Files on disk (both on live and immutable files) are stored as {header} data {footer}. The header holds the key/id, last modified timestamp, size and flags(private, public, etc). The footer currently holds a crc32 checksum which we consult whenever we pull data from disk for integrity checks. If for whatever reason any file (live or immutable, datafile or index file) is corrupt in any way, the system tries to rebuild it, if possible, or salvage as much as possible from it. XFS is our filesystem of choice for the locally attached storage that holds the CloudFFS datafiles. Each of those datafiles can hold millions of files ( each namespace has its own capacity / datafile threshold ); typically each immutable data file is around 1GB in size, but can be TB in size - there is no hard limit there. All those datafiles make up the namespace.
Accessing data
A file/object is accessible at http://domain/namespace/id. e.g http://x.pstatic.gr/me/305896160755729.jpg. The ID is a 64bit integer that identifies the file to be retrieved. An alternative way to access a file is by accessing http://domain/namespace/n/string. In this case 'string' is used to generate an 64bit identifier. e.g http://x.pstatic.gr/avatars/n/M/96/5/22/markp.jpg. In addition, alphanumerical characters can succeed the 64bit identifier - those are mostly ignored, though a filename extension, if present in that string of characters, is used for identifying any rules specified in the configuration file for special treatment of with said extension. For example, you can specify that 'css' files content type will be text/css, and that they expire within 1day since they were created and that they can compressed if the client supports compressed content. Those kind of options can be set on a per namespace basis and on a per namespace extension basis. The HTTP service will look for Last-Modified and ETag headers and will respond with HTTP 304 Not Modified if needed.
We are migrating existing static files into CloudFFS but so far it has worked great for us; very fast access to data (0.005 seconds for a typical file), a few dozen files to manage instead of millions, easy and efficient access to the files for our developers. Our forthcoming CloudFS project will provide more features (TB scale files, random read/write access to files, distributed storage and fault tolerance, etc) but this service is far more suitable for the kind of static files we have and keep creating every second. Maybe someday we will get to talk more about the services that we built and run in house.
Thursday, 27 January 2011 11:59 pm
Thoughts dump, for Jul-06-2010
Our large-scale, high-performance and highly available ( those were the goals anyway, I hope we attained them ) data store has been more or less ready for production for a few weeks now.
We have yet to actually deploy it, although two forthcoming (in-development) projects will be built on top of it. There are a few things here and there we could, and will, change, cleanup the client library API and all that, but as far as I can tell, there are no real issues left to resolve. During testing, we got unto 40K GET(value by key) and over 50K PUT(value by key) operations/second on a 3 nodes system (quorum arrangement). Adding nodes increased capacity and throughput which was one of the design goals.
We got a few more similar projects in the pipeline; more building blocks for our services stack. We are going to build two different file system (one will be optimized for very high performance access to files, another for availability and storage of files not limited in size), a MapReduce framework/infrastructure and a new distributed lock manager which will also replace ad-hoc solutions we currently rely on.
I am very proud of our team; they are smart and inventive, passionate and hard working. They let me toy around with ideas that do not always make sense and they always find ways to make me feel great about what we do. Good times ahead.
Tuesday, 6 July 2010 8:33 pm
mySQL, noSQL, and Key Value datastores
Monolithic RDBMs are losing ground to key-value data stores, particularly persistent distributed in nature. mySQL mounting problems was perhaps the key reason (pun intended) people looked elsewhere. Google's brilliant engineers realized that a key/value data model can satisfy the needs of almost every class of application that needs a datastore backend.
Key/value datastores are simple to build, easy to understand, easy to optimize, easy to scale. The, now famous, CAP theorem states that it is not practically possible to guarantee consistency, availability and partitioning resilience/tolerance all at the same time; one of those traits has to be sacrificed. Again, most applications really do not require all three to function. The CAP theorem is most likely derived from the Project Triangle mode.
Most web-based applications are built on simple data models. Most web-applications eventually suffer from service capacity and availability issues(i.e scalability woes). It is trivial to scale out(vertically) application logic processing(application servers), HTTP requests processing(web servers, load balancers).
It is not easy to scale out an RDBMS. Some expensive systems(Oracle, etc) provide ways to address those issues (e.g Oracle RAC) but its expensive to deploy them, and most of them rely on a shared everything setup which just doesn't work in the long run. (Shared nothing is really the way to go).
Google released a bunch of papers ( actually, a bazillion of papers ), many of them defining and shaping the development of future related technologies. Namely, the papers describing GFS, BigTable, MapReduce (and of course, the paper the changed everything, "The Anatomy of a Large-Scale Hypertextual Web Search Engine" ) steered everyone to the right direction.
In the datastores domain, Hadoop/HBase, Radix, Cassandra and others, based on BigTable and Amazon's Dynamo papers, all relying on the simple key/value datastore model, are gaining market share - rightly so. Coupled with Memmache and similar services(in-memory key/value stores) they are solving the problems of service capacity and availability. This is a paradigm shift. Its a downhill for heavy-footprint, complex and inflexible datastore systems. They wont go away but will not be such a valuable(pun intended) component in tomorrow's technology landscape.
We are going to gradually migrate from RDBMs - though, we are not relying that much on them nowadays - to a key/value datastore (we are currently building one, also based on BigTable and Dynamo ). If nothing else, those simple systems are both simple and beautiful (for the most part).
Saturday, 13 March 2010 8:14 pm