Storing Media Files in MongoDB + Streaming Back the Chunks

In case you’re unaware, MongoDB is a database system that stores documents in collections, unlike an SQL relational database management system (RDBMS), which stores rows in tables.  And, unlike with an RDBMS, MongoDB developers do not declare a schema for the data that will go into each document.

Anyway, MongoDB is really good at sharding data across a bunch of servers and clustering them.  So if you’re working on a web application that’s going to run across a lot of front-end servers, it makes sense to store the media files uploaded by visitors (images, videos, whatever) directly in the database and let MongoDB worry about where to store them and how to get them back.

MongoDB GridFS is a specification for driver developers that lays out how large files should be stored in collections in the database.  GridFS says there needs to be a “files” collection, which holds one document per stored file, and a “chunks” collection, in which each document holds one chunk of a file.  Every chunk is chunkSize bytes long (256k by default here), except the last one, which holds whatever is left over.
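To make the two collections concrete, here is a small sketch of how one file maps onto chunk documents. The field names (length, chunkSize, files_id, n) come from GridFS; the describeChunks() helper, the "bytes" key, and all the values are made up for illustration, and a real chunk document carries the binary payload in its "data" field instead of a byte count.

```php
<?php
// Hypothetical helper: lists the chunk documents GridFS would hold for one file.
// "bytes" stands in for the size of each chunk's real "data" field.
function describeChunks(array $fileDoc)
{
    $count = (int) ceil($fileDoc["length"] / $fileDoc["chunkSize"]);
    $chunks = array();
    for ($n = 0; $n < $count; $n++) {
        $remaining = $fileDoc["length"] - $n * $fileDoc["chunkSize"];
        $chunks[] = array(
            "files_id" => $fileDoc["_id"],   // points back at the files document
            "n"        => $n,                // chunk sequence number, starting at 0
            "bytes"    => min($remaining, $fileDoc["chunkSize"]),
        );
    }
    return $chunks;
}

// A made-up files document: 600,000 bytes stored in 256k chunks.
$fileDoc = array(
    "_id"       => "example-id",   // normally a MongoId
    "filename"  => "holiday.jpg",
    "length"    => 600000,
    "chunkSize" => 262144,
);
$chunks = describeChunks($fileDoc);  // two full chunks plus a shorter last one
```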

A “driver” is an access method for MongoDB.  If you’re a PHP programmer like I am, then you’re using the MongoDB Native PHP driver, which defines a bunch of PHP classes for working with your database.  It also implements GridFS classes.

Storing a file in the database is easy: http://www.php.net/manual/en/class.mongogridfs.php

On that page you’ll see a note that they did not implement a way to stream chunks directly.  If you want to stream a file back to the web browser, then you have to either call the function that collects the whole file in memory as a string (MongoGridFSFile::getBytes()) or save the file to disk on the web server (MongoGridFSFile::write()) and use some old school method to send it.

The most efficient way to handle it would be to stream the chunks directly.   To do that, you have to access the files and chunks collections directly, not by using PHP’s MongoGridFSFile class.

The basic procedure: Look up the Files document representing your file, then determine how many chunks there are by dividing the length by the chunkSize and rounding up (the last chunk is usually only partly full).  Then go into a loop and ask for each chunk separately, sending its data before going back for the next one.

This isn’t a tutorial, but I’m guessing if you’ve bothered to find this you’re a programmer already.

Here’s some code that will help you write a chunk streamer.

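function sendFile( MongoId $id )
{
    // Look up the file's metadata document in the GridFS "files" collection.
    $collFiles = DB::get("fs.files");
    $file = $collFiles->findOne( array( "_id" => $id ) );

    // findOne() returns null when no document matches.
    if( $file === null )
    {
        header( 'HTTP/1.1 404 Not Found' );
        exit();
    }

    $mime = filepathToMime($file["filename"]);

    header( 'Content-Type: ' . $mime );

    // Number of chunks, rounded up because the last chunk may be partial.
    $length = $file["length"];
    $chunkSize = $file["chunkSize"];
    $chunks = (int) ceil( $length / $chunkSize );

    $collChunks = DB::get("fs.chunks");

    // Fetch and send one chunk at a time instead of building the whole file.
    for( $i=0; $i<$chunks; $i++ )
    {
        $chunk = $collChunks->findOne( array( "files_id" => $id, "n" => $i ) );
        echo $chunk["data"]->bin;
        flush(); // hand the data on; PHP output buffering must be off to truly stream
    }

    exit();
}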

The DB::get() function is one of mine. It connects to MongoDB (if not already connected), selects the database, and returns the MongoCollection object for that name.

The filepathToMime() function is mine too… it’s just a list of file extensions matched up with MIME types.
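If you need something to start from, a lookup like that might look roughly like this. This is a guess at the shape of such a helper, not my actual function; the extension list and the application/octet-stream fallback are assumptions you should adjust for the files you store.

```php
<?php
// Hypothetical stand-in for filepathToMime(): maps a file extension to a MIME type.
function filepathToMime($path)
{
    // A deliberately short, assumed list; extend it for your own media types.
    static $types = array(
        "jpg"  => "image/jpeg",
        "jpeg" => "image/jpeg",
        "png"  => "image/png",
        "gif"  => "image/gif",
        "mp3"  => "audio/mpeg",
        "mp4"  => "video/mp4",
        "pdf"  => "application/pdf",
    );

    // pathinfo() returns an empty string here when the path has no extension.
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    // Fall back to a generic binary type for anything not in the list.
    return isset($types[$ext]) ? $types[$ext] : "application/octet-stream";
}
```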

Of course, you’re going to want to do some checking to see if you have to send the file at all.  When the browser has the file cached, it asks for it again and sends the timestamp of its cached copy in the If-Modified-Since header… so you’re going to want to check that against the uploadDate stored in MongoDB and return 304 Not Modified if you don’t need to send the file again.
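The timestamp comparison itself can be sketched like this. The function name isNotModified() is made up, and it assumes you pass in the raw header value and the stored upload time as a Unix timestamp (with the PHP driver, the sec property of the MongoDate in uploadDate).

```php
<?php
// Hypothetical helper: true when the client's cached copy is at least as new
// as the stored file, meaning a 304 response is enough.
function isNotModified($ifModifiedSince, $uploadTimestamp)
{
    // No header means the client has nothing cached.
    if ($ifModifiedSince === null || $ifModifiedSince === "") {
        return false;
    }

    // strtotime() returns false when it cannot parse the date.
    $since = strtotime($ifModifiedSince);
    if ($since === false) {
        return false;
    }

    return $uploadTimestamp <= $since;
}

// Possible use inside sendFile(), before any chunk is sent (assumed wiring):
//   $header = isset($_SERVER["HTTP_IF_MODIFIED_SINCE"]) ? $_SERVER["HTTP_IF_MODIFIED_SINCE"] : null;
//   if (isNotModified($header, $file["uploadDate"]->sec)) {
//       header("HTTP/1.1 304 Not Modified");
//       exit();
//   }
//   header("Last-Modified: " . gmdate("D, d M Y H:i:s", $file["uploadDate"]->sec) . " GMT");
```

Sending Last-Modified along with the file is what gives the browser a timestamp to send back next time.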

My particular application isn’t browser based.  It’s going to store the MD5 of the file and ask for it by identifier and that MD5 value.   MongoDB stores the MD5 in the Files collection, so I just have to compare the hash values and back out from sending the file again if they match.
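The hash check is about as small as it sounds. A minimal sketch, assuming the client sends its MD5 as a hex string and that $file is the document read from the Files collection; clientCopyIsCurrent() is a made-up name.

```php
<?php
// Hypothetical helper: true when the client's MD5 matches the one GridFS stored,
// so the file does not need to be sent again.
function clientCopyIsCurrent($clientMd5, array $file)
{
    // Hex digests may differ in letter case, so compare case-insensitively.
    return isset($file["md5"]) && strcasecmp($clientMd5, $file["md5"]) === 0;
}
```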

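For those of you who are not programmers: An MD5 “hash” is a short fingerprint of some data… basically, it boils the data down, one way, into a 128-bit number, and you can’t get the data back out of it (so it isn’t encryption).  Two different files giving the same MD5 by accident is vanishingly unlikely, which makes it fine for spotting changed files, though collisions can be constructed on purpose, so don’t lean on it for security.  There are a bunch of hash algorithms out there, but MD5, designed by Ron Rivest of RSA fame, is ubiquitous.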

Happy programming.


Comments

2 responses to “Storing Media Files in MongoDB + Streaming Back the Chunks”

  1. You would need to write it to respond to the Range header, I think. The range would be in bytes, so you'd use those indexes to determine the first block and the last block, and then only send from the first block starting at the proper beginning offset, and only send enough of the last block up to the ending offset.

    If you find an example of serving HTTP progressive download with PHP, then you could use that, only get the data from MongoDB.

  2. Is it possible to stream particular chunks of a video rather than all of them? I mean streaming only chunks n=>50 to n=>100. Is partial streaming supported?
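The Range approach described in the first response can be sketched as follows. This only covers the arithmetic of mapping an inclusive byte range onto chunk numbers and the slice to send from each chunk; parsing the Range header and replying with 206 Partial Content and a Content-Range header are left out, and chunkSlicesForRange() is a made-up name.

```php
<?php
// Hypothetical helper: maps an inclusive byte range [$first, $last] onto GridFS
// chunk numbers, with the offset and length to send from each chunk.
function chunkSlicesForRange($first, $last, $chunkSize)
{
    $slices = array();
    $firstChunk = (int) floor($first / $chunkSize);
    $lastChunk  = (int) floor($last / $chunkSize);

    for ($n = $firstChunk; $n <= $lastChunk; $n++) {
        $chunkStart = $n * $chunkSize;                      // file offset of this chunk's first byte
        $offset = max($first - $chunkStart, 0);             // skip bytes before the range
        $end    = min($last - $chunkStart, $chunkSize - 1); // stop at the range end or chunk end
        $slices[] = array("n" => $n, "offset" => $offset, "length" => $end - $offset + 1);
    }
    return $slices;
}

// Inside a chunk loop like the one in sendFile(), each slice would then be sent as:
//   echo substr($chunk["data"]->bin, $slice["offset"], $slice["length"]);
```

Only the first and last chunks get trimmed; every chunk in between is sent whole.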
