Wednesday, January 06, 2010

A twitter data structure using linked lists


I recently did a guest post regarding data modelling on Chet's ORACLENERD blog. My original question was how would you normalize a tweet. Following the comments from there, I'm going to explore some an implementation in a Navigational Database rather than an SQL/RDBMS. Irrespective of whether or not you use a RDBMS for persistance, if you want to rapidly search a large amount of data you need to think about data structures.

Access Paths

A tweet requires different 'access paths' for several use cases. Firstly, I might want to see all the tweets made by an individual (such as oraclenerd). Secondly, I might want to see all the tweets that are replies to my user (syd_oracle) or that mention me. Finally, I may want tweets that relate to a particular subject (such as #oracle). I'll assume we want to access the most recent tweets first.

Thinking data structures, all of these start off with knowing a key, or a starting point for the query. Therefore they can simply be managed as a linked list.


Access by tweeting user

For the first scenario, the sending user is the starting key. The user oraclenerd has 3 tweets [this is just an example :) ]. For arguments sake, we will give them numbers 123, 179 and 201. The 'oraclenerd' user entry can have a 'sent' pointer set to tweet 201, and tweet 201 points to the sender's earlier tweet 179 which, in turn, points to 123 where the list ends. 

When he tweets "Hey, I've got EBS running", we create tweet 207 and link that to the earlier sent tweet, number 201, then repoint the 'oraclenerd' sent list pointer away from 201 to 207. If the case of a critical failure between the creation of 207 and the repointing of 'oraclenerd', upon restart we will have 'oraclenerd' still pointing at 201. While 207 may exist in the persistance layer, nothing points to it and it will never get accessed so it is effectively dead. In an ACID database, if we applied the change as a transaction, the creation of 207 would have been rolled back. If we don't really care about consistency we can forget it, and perhaps have a weekly cleanup job.

That all works easily because oraclenerd will only make one tweet at a time. We can scale by spreading our load, maybe alphabetically with node 1 having all the tweeters 'a-m' and node 2 having 'n-z'. More tweeters and we set up a new node and split 'a-m' into 'a-g' and 'h-m' (though realistically it could be some hash function for balance, or perhaps new tweeters on new nodes). We haven't had to worry about inter-node or concurrency issues...yet.


Replies and Mentions

The 'replies' are a bit more complicated. One approach is this:

'oraclenerd' tweets "@syd_oracle, what is the weather like ?". 

Firstly we create the tweet (number 211). Remember in the first example, I had tweet '207' point to 'oraclenerd's earlier tweet of '201' ?  Now '211' has to link to the previous tweet number '207' but it also has to have another pointer to the previous reply to syd_oracle because I want to see all the replies to me. The code needs to look up the latest reply to 'syd_oracle' (eg tweet number 150), and have the 'reply' pointer of 211 link to 150. Then we update my 'replies' linked list to include the tweet 211.

If we get a failure before that last step, then it can show up in his list but not mine. Untidy, but not a big issue. A worse scenario is that we are on different nodes and we get into a problem about having a single point of truth, and failure, for the tweet. If the node storing oraclenerd's tweets fails, then when I look at my replies, my user entry points to an unavailable tweet and, because that is the only point I have for my linked list, I can't see any of my earlier replies.

When we get into 'mentions' we see that solution design for 'replies' was all wrong, or at least it can't simply be extended to 'mentions'. While a tweet can only come from a single sender, and be a reply to one other user, it can have many other users mentioned (to a maximum of about 40 given the length of a tweet and the need for the separators, flags for indicating the mention and the username). Add in multiple nodes and we'd be lucky if anything works.

Rise of the Clones

So we throw away that 'replies' design. This time, when we create tweet 211 for oraclenerd, we also create a clone of that tweet for syd_oracle (say 215). syd_oracle's 'replies' list is pointed to 215 and 215 is pointed to its previous reply of 150. If oraclenerd's node goes down, my '215' tweet clone is still there on my node so I'm not affected. What's more, this extends easily to 'mentions'. Each user record simply has pointers to their last sent message, their last reply received and their last mention. It is pretty similar for topics too, though the 'topic' record only has one linked list.
 

Once you get into the design, you can see that sending a single tweet might create a number of tweet clone records. This report indicates that the count of 7 billion plus tweets is based on a tweet duplicated for each follower.

The concepts for modelling followers would be similar to those for mentions. Each user would have a linked list of followers. Every time they tweet, the tweet would be cloned for each follower in a similar manner to the other clones for mentions. The clone would be linked to the person following the tweeter so they would have a list of all tweets for users they are following.


[Note: You don't have to clone the full 140 character tweet. You can have a single version of the text and clone pointers to it. That just makes the example a bit more complicated though. You might have some duplicates of the text for node resilience and load balancing purposes.]


Locking

New tweets update the records of the sender, users mentioned, topics and follows. These updates, rather than the inserts, are where we get into concurrency issues. In the example, I had a reply of '150' and then the following happened.

  1. Process reads syd_oracle's latest reply (150).
  2. Process creates the tweet '215', pointing to 150.
  3. Process updates syd_oracle's latest reply to point to '215'.
What happens if I'm dealing with two replies at the same time ? Pretty unlikely for me as I'm not that busty and computers are pretty fast, but possible for a hot topic like a musician's demise.

What can happen is

  1. Process 1 reads #elvis_dead's latest tweet (150).
  2. Process 1 creates the tweet '215', pointing to 150.
  3. Process 2 reads #elvis_dead's latest tweet (150).
  4. Process 2 creates the tweet '218', pointing to 150.
  5. Process 1 updates #elvis_dead's latest tweet to '215'
  6. Process 2 updates #elvis_dead's latest tweet to '218'
Without some form of locking, tweet 215 has got lost in the rush. Is that important ? That's up to you, but that sort of race condition is near impossible to achieve repeatedly in a test run,  probably won't show up in production until you've got a heavy workload and will be tricky to reproduce.


For followers, there is also scope for lost tweets if you follow multiple people who tweet at the same time as they both try to insert their tweet into your list. There are also complications when you 'unfollow' someone, as you need to walk their list of followers (on their node) and get deleted, plus delete their tweets from your clone army. Deletes in linked lists are fairly simple. If 207 points to 201 which points to 179, then you can point 207 straight to 179 and delete 201. But again, without locking, you can get trapped if someone deletes 207 as you are trying to repoint it to 179.

Consistency

If one node has to be shutdown, or crashes, and is restored from a backup with some loss of information, then you will get inconsistencies. If you are using a persistance layer that doesn't have a locking mechanism, you will get inconsistencies. If you don't have a robust queuing mechanism to keep things running when nodes go down, you'll get inconsistencies.

An interesting overview on the tradeoff between consistency, availability and partitioning (or sharding/replication) is here (once you get past the first few paragraphs about punk rock) and even links to a blog post about Twitter and SQL


No comments: