Simple is Beautiful | Technology, Programming, Video Games
This blog is about technology, programming, video games, books and other related topics. It is published by Mark Papadakis.

CloudFFS : Large scale, high performance files storage system

CloudFFS

So we built and deployed this new service not long ago at work, and @stelabouras suggested we document some parts of it for internal consumption. Given that I haven't blogged for months, I thought I 'd just pour those words here instead.

CloudFFS (yes, it is a funny name) is a file-system (but not in the traditional sense, it doesn't hook into the kernel VFS layer or anything ) that provides storage for unbound number of files and very fast access to them over HTTP. It can manage PB scale volumes and up to 2^64 files per namespace(see below).

We have hundreds of millions of static files(images, video, text files, you name it) stored across our storage devices; having to deal with those many files is not a pleasant task, for our sys.operators and developers alike. We wanted a solution that frees our developers from having to worry about storage and provides a very simple way to store and retrieve files, and at the same time help our systems guys deal with backups and management of those files efficiently.

There are many problems associated with the use of multiple files. Wasted inodes/disk blocks, slower access time (iterating a path components is not free, looking up a directory entity within a directory is not free either), difficulty in making backups, need and use of elaborate directory naming schemes in order to deal with large directories, and more. In addition that, accessing those files over a network filesystem (e.g NFS) is not efficient by any means. Developers need to be aware of those limitations and of the rules that are in place in order to deal with said limitations, which places an unnecessary burden on them.

None of the solutions we looked into really seemed all that great for us, so we went ahead and build our own. Though, to be fair, we almost always end up building our own anyway. This practice has worked great for us all those years and given that we are a technology company, it makes sense for us to disregard the 'not invented here' approach.

Data Model

Files are uniquely identified by a 64bit number. They belong in namespaces, for example 'blogs', or 'images, or 'mails'. A file can hold up to 1GB of data. Files can also be either public, or private. Public files can be accessed directly (e.g http://x.pstatic.gr/me/305896160755729.jpg), whereas private files require HTTP authentication. This makes it possible to, say, make everything accessible over the public Web, except files that should not be accessible in that fashion (e.g log files, archived content, emails, etc ).

On disk, each namespace is represented as a directory in the 'root' directory. Within each namespace directory, there are 2 types of files. Data files and indices. They are further subdivided into 'live' and 'immutable' files. Immutable files, once they are created, are never updated again. In practice, there is almost always a single live datafile/live index associated with a namespace (there are cases where there can be 2 for a few seconds, whenever a live datafile is converted to an immutable one) and that live datafile holds incoming updates. Whenever a SET or DEL operation is executed, a new record is appended into the live data file and its respective live index. Once the live data file size exceeds a threshold, it turns into an immutable one and a new live datafile is created and used instead. Whenever it is necessary, a compaction task kicks in that will merge immutable files, discard deleted files and duplicates and create a new set of immutable files out of them and delete the old ones. This happens fairly infrequently though ( depends on the volume of data, but usually once every few month ). A live or immutable data file can hold thousands, millions, or even billions files, packed one after the other. Maybe I will get to write more about the structure of those files and how they are used in a future blog post.

Operations

There are 4 operations that are all mapped to HTTP methods/verbs. Get(GET), Set(PUT), Stat(HEAD) and Delete(DEL). A few things that may be worth noting; Whenever you set a file you can specify if it is a public, or a private file. Whenever you Set or Delete a file, or Get or Stat a file that has been stored as private, you need to specify authentication credentials(username/password). Otherwise, the operation fails (authentication failed). The list of users is defined in the configuration file.

Internals

The service is implemented as a multi-threaded application. A single thread handles network I/O (asynchronous I/O multiplexing w/ vector I/O). There are also some threads for processing requests and 2 threads for processing system tasks (compactions and live to immutable data conversions). The network I/O thread also accepts incoming connections and parses in HTTP requests. Whenever such request is parsed, it is placed in the 'mailbox' queue of the first idle thread for processing. The network I/O thread also accepts RPC connections/requests (for management) and also talks to our message queue service ; whenever a file is stored or deleted a new event is published into the queue service so that we can replicate data or whatever else we choose to do whenever those events are created.
There are 2 types of caches in place. One for immutable datafile keys ranges(a keys range holds a series of file ids and their (offset, size) into their respective immutable datafile), and another one for compressed content (whenever the agent/browser supports gzip/deflate encoding and the file requests can be provided back to the client in compressed form, we compress and cache the contents into that cache for later use). They both operate in an LRU fashion are protected by spinlocks; the cached objects are ref.counted.

Whenever a Get request is executed, we iterate through the list of the live datafiles for the namespace(usually 1). If not found in there, we walk the list of the immutable datafiles(they are always sorted by creation time) until we get a match, or not. A Set operation appends the data into the active live data file for the namespace, syncs on disk and then lazily appends on the index.

Files on disk (both on live and immutable files) are stored as {header} data {footer}. The header holds the key/id, last modified timestamp, size and flags(private, public, etc). The footer currently holds a crc32 checksum which we consult whenever we pull data from disk for integrity checks. If for whatever reason any file (live or immutable, datafile or index file) is corrupt in any way, the system tries to rebuild it, if possible, or salvage as much as possible from it. XFS is our filesystem of choice for the locally attached storage that holds the CloudFFS datafiles. Each of those datafiles can hold millions of files ( each namespace has its own capacity / datafile threshold ); typically each immutable data file is around 1GB in size, but can be TB in size - there is no hard limit there. All those datafiles make up the namespace.

Accessing data

A file/object is accessible at http://domain/namespace/id. e.g http://x.pstatic.gr/me/305896160755729.jpg. The ID is a 64bit integer that identifies the file to be retrieved. An alternative way to access a file is by accessing http://domain/namespace/n/string. In this case 'string' is used to generate an 64bit identifier. e.g http://x.pstatic.gr/avatars/n/M/96/5/22/markp.jpg. In addition, alphanumerical characters can succeed the 64bit identifier - those are mostly ignored, though a filename extension, if present in that string of characters, is used for identifying any rules specified in the configuration file for special treatment of with said extension. For example, you can specify that 'css' files content type will be text/css, and that they expire within 1day since they were created and that they can compressed if the client supports compressed content. Those kind of options can be set on a per namespace basis and on a per namespace extension basis. The HTTP service will look for Last-Modified and ETag headers and will respond with HTTP 304 Not Modified if needed.

We are migrating existing static files into CloudFFS but so far it has worked great for us; very fast access to data (0.005 seconds for a typical file), a few dozen files to manage instead of millions, easy and efficient access to the files for our developers. Our forthcoming CloudFS project will provide more features (TB scale files, random read/write access to files, distributed storage and fault tolerance, etc) but this service is far more suitable for the kind of static files we have and keep creating every second. Maybe someday we will get to talk more about the services that we built and run in house.

Thursday, 27 January 2011 11:59 pm


iPhone CSRs, Digital Certificates, Encryption and Cryptography

I have been reading about cryptography, digital certificates and signatures, symmetric and asymmetric keys based encryption lately(well, yesterday). Developing iPhone applications (and even more so deploying them to devices or the AppStore) involves dealing with certificate signing requests(CSRs), digital certificates, provision profiles, and more.

Apple went into great lengths to simplify the processes involved - for the most part its trivial and they guide you through every step. (alas, it wasn't always so; back in the day, code signing and other related processes were the cause of much pain among iPhone Developers ).

Nevertheless, whenever I come across something I haven't been familiar enough with, I can't help not 'wasting' time to learn it(how it works, why it works, etc). This practice has deprived me of many hours I could have spent elsewhere (working on other stuff, spending time with loved ones, whatever) but I just can't help it; I "need" to learn how things work.

So here is simplified, basics mostly, overview of those concepts, in case someone needs to get started with them.

The fundamental problems encryption solve are those of data integrity validation and identity authentication. That is to say, to verify that data(messages, anything) sent from Alice to Bob (Alice and Bob) can be verified by Bob that indeed came from Alice and that the data were not tempered with/modified in any way while they were in 'transit'. That's about it.

Encryption is the process of generating new data from the original data. The new data is usually unintelligible to anyone but the intended recipient. Decryption is the process of transforming that data back to the original data.

Those processes are facilitated by a cryptographic algorithm(a cipher). Mostly, this involves the use of a key(a number) which is used with the algorithm to perform the encryption and decryption. The same key is used for both encryption and decryption. With symmetric-key encryption, the encryption key has to be kept secret by both parties. If someone else gains access to the key, he can not only decrypt the data but also encrypt new data and send them to Bob and Bob would assume they came from Alice. That clearly is not desirable. Enter Public-Key Encryption.

Public Key Encryption(asymmetric encryption) involves a pair of keys. The public and the private key. The public key is published and is freely available. The private key is kept secret. Alice never reveals the private key. Ever.

The fundamental idea is that data encrypted using the public key can only be decrypted using the private key. Bob, who has access to the public key much like everyone else, can encrypt his message using Alice's public key, send it over to Alice and only Alice can decrypt it, for it is only one who has the private key.

PKE is computational expensive though and not always suitable for large amounts of data. Often enough, a hybrid approach is employed. PKE is used to send a symmetric key, which then can be used(since both parties will know that secret key) to encrypt additional data using symmetric encryption. Using a symmetric key to encrypt and decrypt data is far less computational expensive. SSL and other protocols rely on this hybrid approach.

It is also possible to encrypt a piece of data using a private key which can only be decrypted using the public key. Given that Alice shared her public key with anyone interested, that wouldn't make much sense if Alice was to send data to Bob encrypted with her private key. Anyone could read it

Well, it does make sense, thanks to digital signatures. A digital signature can be used to verify that data sent from Alice - encrypted or not - were not modified in any way by the time Bob received them. In other words, it validates the authenticity of the data. It deals with tampering and impersonation.

Alice will use a hashing algorithm to generate a signature out of the message she wishes to send to Bob. That signature will then be encrypted with her secret private key. Then, she will send both the data she wishes to send to bob and the encrypted signature she generated from that data. It will also send Bob information about the hashing algorithm Alice used to generate the signature of her message.

Bob will use Alice's public key to decrypt the signature. Then, he will use the same hashing algorithm to generate a signature from the message he received. If the signature matches the signature Alice provided him with, it means the message is authentic. That is so because only Alice knows the private key and only Alice could have encrypted the signature like that.
Unless Alice lost her private key, it is 'impossible' for Alice to deny that she signed the data she sent to Bob, or for anyone to 'sign' anything, send it over to Bob and claim she is Alice.

A hashing function converts data into a single value (often a big integer). Hash functions are fundamental in the design and implementation of some of the most important data structures.

There is one last issue that needs to be addressed. Confirming identities. Digital Certificates solve this problem.

A certificate is an electronic document that is used to identify an entity(individual, company, anything) and associate that entity with a unique public key. Your passport identifies you and associates bits of information with you (your name, etc).
Public Key cryptography uses certificates to address impersonation problems.

Much like one would go about obtaining a driving license, by providing the authorities with whatever information and credentials required, so that they can verify the identity of the applicant and then issue her the driving license, Certificate Authorities(CAs) serve a similar purpose.
They will get Alice's application for a certificate (which includes her public key and information about her), they will validate the information she provided is correct and indeed represent her, and then issue her a Digital Certificate.
In essence, the Digital Certificate binds a public key to an entity. They help prevent the use of fake public keys. So, the digital certificate contains the public key of the entity, its name and other information (key/value pairs, e.g name=Steve, organization=Apple, Inc.) It also includes the digital signature of the CA. It is that digital signature that allows the certificate to function as a verified and trusted certificate, by users who know and trust CA (in other words, have the CA's public key and know that that public key belongs to that CA) and trust the CA but do not know the entity identified by the certificate.

Apple is a Certificate Authority. Before deploying your iPhone application to a device, you need to obtain a certificate and a provision profile. So you prepare a Certificate Signing Request(CSR). This contains information about you. It is the information you want Apple to verify. When you create the CSR, the public and private keys pair is also created. The public keys is included in the CSR. The private key is never sent to Apple. The private key is used for signing your binary.
Apple will get your CSR, create a digital certificate based on it. Then, you need to create a provisioning profile. The provisioning profile holds application IDs, device IDs and certificates.

You will need to submit CSRs and install Digital Certificates if you need to deploy and distribute iPhone applications, use Apple iPhone Notifications and In-App purchases. Hopefully this helps understanding why those are needed and what they are about.

Saturday, 27 March 2010 11:01 am


Update on CloudDS

Here is a progress update to my current main project (we call it 'CloudDS' which stands for cloud data store which is a silly name but it will have to do until we can find a replacement ).
I have been working on the data store component of the service. It has taken at least x4 as much time and effort as I thought it would. A prime reason for underestimating the time requirements is that the initial features list I wanted to implement doubled in size. In addition to that, testing for most of the possible logic paths that could result to failure also took a long time - even if some of that testing was automated, not all of it was and validating results is harder than setting up the test environment.

In such a service, it matters little if most of underlying components fail (I/O and tasks scheduler, garbage collector, cache subsystem, etc) as long as the data management component is not affected. Suffering from a service outage is bad, suffering data corruption and/or data loss is something that has to be prevented by any means necessary.

As it stands, that said component now deals fine with reads and writes, self-healing, caching and performs faster than I hoped it would. The data model is based on BigTable, Dynamo, Cassandra and some earlier prototypes/projects we toyed with in the past. It borrows Cassandra's ColumnFamily/SuperColumn/Column key value representation model. Data are pushed into MemTables and an append only commit log, memtables are flushed into SSTables to disk.

The GCollector merges SSTables whenever required to reclaim space, resolve conflicts and extract a single value out of multiple versions, etc. All operations supported by Cassandra are implemented (query by path, predicate, column names, key ranges, etc ) and CloudDS clients/users will also be able to use a scripting language to describe explicitly down to bytes what they need(i.e give me the first couple of bytes for those values, or gimme a concatenation of those values, etc etc).

Now that that component is out of the way, I can move on to the rest; those are relatively straight forward to implement ( the tasks scheduler and the network I/O subsystems are mostly done ).

Friday, 26 March 2010 9:19 pm

« Older Posts  
Powered by Pathfinder Blogs