Overview: I will go over sharding basics and how to overcome problems like calculating ranks and inbox.
Sharding is splitting your database into partitions and keeping each of the partitions on different servers.
The following figure will make it more clear.
What we did is, instead of keeping details of all user on one database server, we kept the users with odd userid on one server and others on second server. Lets say, if our system allowed us to make only 1000 queries on user, we can now make 2000 (1000 on each server more or less). You can split your data into any number of shards and it scales horizontally.
Q1: How many shards (partitions)?
The main effort in sharding is upfront while doing the partiions. So its better to plan for next 6 months or so. Lets say have only 2 servers, you can partition into 10 shards and place 5 of them on one server and other 5 on second. Later on, if add one more server, you can move 2 shards each from first and second server onto the third one. You might ask why 10, let like many other decisions in startup, this is more about gut feel then anything else.
Now, how to implement sharding in your code. There are a number of options like mysql proxy but the easiest method (both in effort and maintenance) change your database api layer(the class in your codebase which handles the actual query execution, maintaining connections etc) and pass an extra parameter the userid. The database layer can determine the server and database name to use based on the userid. So:
apps_mysql_query(query) => apps_mysql_query(userid, query)
The biggest issue with sharding is how to compute something at a global level. Lets say in above example, each user had a score. You want to know the top 25 leaders. Earlier it was one simple query:
“SELECT userid, name, photo FROM user WHERE user = ‘active’ ORDER BY totalscore DESC LIMIT 10″
Since now you have 10 different shards, you need to execute the same query on 10 different shards, merge the results and select top 25 from this combined list. Downright Ugly!
To get around this problem, what needs to be done is maintain two snapshots of database, one sharded version and one complete version (called central repository). So what we now have is:
Q2: Doesn’t this take away all advantage we had with sharding?
Not really. Actually far from it. You need to keep make a central repository of those tables on which you need to perform global queries. So User table is one. Tables like “User Permissions” or “User Actions Done” don’t need to be kept on central repository. Now why have a central repository for user table at all? Things like user complete details for user profile, user name and picture etc to show, all these will go to the corresponding user shard. But when you want to show things like leaderboard, only then you will query the central repository.
We need to change our database api so that if userid sent is null, it should query the central repository
Now, we come to the second problem. Let say, user 1 had sent a message to user 2, do we keep it on user1 shard or user2 shard. If we keep it on any one of them, to show messages recieved by user2 do we query all shards? The answer is, keep it on both shards. The one extra insert is much more efficient than 10 selects you will have to do each time user wants to see messages he has recieved. But doesn’t that increase data size? Isn’t disk space close to free :).
So how does our database api change to reflect the above:
apps_mysql_query(userid, query) => apps_mysql_query(userid1, userid2, query)
In case of write query, if userid2 is null, it will just write to shard 1, but in case userid2 is not null, it will write to both shards. In case of read, userid2 is ignored.
So thats it. Let me know in comments if you have used an alternative strategy or any feedback on current system.