Tuesday, July 17, 2007

Peer-to-peer database synchronization



Being a technologist, I often think of technological solutions before I think of an actual use case. I know from past experience, with both my own ideas and others', that this Doesn't Work.

My latest flash is the idea of having a community of peers be able to securely share a relational database, creating an opportunity for collaboration and dialog, without having to put the data on a central server. My motivation for this is that as soon as you put data on a central server, that central server becomes key. It puts a particular location in a situation of greater power, and that changes the entire dynamics of the model: socially, economically, politically.

I would like to see communities created where only the peers in the community are involved, and each is an equal: nobody holds a "lock" on the data and the community and is thus tempted to exercise control in various ways.

BitTorrent and Mercurial are existing examples that seem to fit this architecture. But BitTorrent works with binary data, and Mercurial works with text files (for the most part). Neither works with structured, relational data, so neither offers the advantages a relational database provides.

There are technical challenges. If you open yourself up to accepting connections, you open yourself to all sorts of trolls, worms, ogres and various evil creatures of the Dark Internet. So I don't like doing this, and neither should you (as a standard, regular Internet user). But how do you do peer-to-peer without doing this? You need to implement some very strong security, and strong security can easily mean a big, oafish, burdensome user interface to let someone join a community, which generally is a killer.

But that's not the only problem. The other problem is: who cares? Why would we want peer-to-peer database sharing among communities? What value does it provide? What is the "killer app?" Does anyone have any ideas? I'd rather test this idea out by building an actual useful solution than build it just because it's a "cool idea." Cool ideas don't amount to much if nobody cares and they don't do anything useful.

So, if you have thoughts, tell me, or point me to what others are doing. And if you think this is not useful, tell me why. I want to know.

7 comments:

Anonymous said...

Hi David
Links for a couple of papers I worked on for this topic below

I would be interested in collaborating with you on any project relating to this

http://www.inderscience.com/search/index.php?action=record&rec_id=17010&prevQuery=&ps=10&m=or

http://www.springerlink.com/content/4833w27742t3771r/

Regards

Phil Thompson

Unknown said...

Hi David! It's been a long time since we last spoke! Hope you are well.

I wanted to say that I think this notion of peer-to-peer databases is becoming quite interesting to many vertical markets and an area we are investing in a lot right now.

A few years ago, I looked at a startup that implemented a distributed hash table approach to partition data across nodes geographically. Although targeting low latency and scale-out on commodity hardware, it also supported geographically distributed use cases in telco, such as subscriber management for mobile networks. Think about turning on your cell phone after landing in some faraway city and the amount of data that needs to be found and managed.

We are looking at this right now from a multi-tier database perspective across database types. Analytic/columnar databases host historical data, while transaction-processing databases run the operational systems. Increasingly, in-memory databases, acting both autonomously and as caches, need to process streaming data or serve very low-latency access (< 1 ms) while still requiring access to operational and historical data.

So, while not a classic P2P app use case, I am beginning to strongly believe that some data & processing needs to be a lot more fluid ... hence peers ... to solve some of the more difficult problems in a number of verticals!
I'd love to get together to discuss 1-1 ... I'll shoot you an email.

Best,
Peter Thawley
CTO Group, Sybase, Inc.

Jon said...

Hi David,

I'm a few years late to this party, but nevertheless thought that a WIP project of mine might be of interest.

From my various research around the web, much of the work on "p2p databases" has employed Distributed Hash Tables to get different nodes to handle different parts of the data. Provided the data has sufficient redundancy, this works, albeit with slow query times.
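To make the partitioning idea concrete, here is a minimal sketch of how a key can be mapped to a small set of responsible nodes. It uses rendezvous (highest-random-weight) hashing as a simple stand-in for the key-to-node mapping a real DHT provides; the function name `owners`, the node names, and the `replicas` parameter are all hypothetical and chosen only for illustration.

```python
import hashlib


def owners(key: str, nodes: list[str], replicas: int = 2) -> list[str]:
    """Pick the `replicas` nodes responsible for `key` (rendezvous hashing).

    Assumption: this is an illustrative stand-in for a DHT lookup,
    not the scheme used by any particular product.
    """
    def score(node: str) -> str:
        # Every (node, key) pair gets a pseudo-random but repeatable score.
        return hashlib.sha256(f"{node}:{key}".encode()).hexdigest()

    # The highest-scoring nodes own the key; any peer can compute this locally.
    return sorted(nodes, key=score, reverse=True)[:replicas]


nodes = ["node-a", "node-b", "node-c", "node-d"]
print(owners("customer:42", nodes))
```

Storing each key on more than one node is what gives the redundancy mentioned above. The slow queries follow from the same design: a query that touches many keys has to contact many different nodes.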

My project, called Meshing, intends to change the social and political dynamics you mention by (in the main) giving everyone a full copy of the data. It requires an internet-based server, but - like Wordpress - the intention is that if the software is free, you install it yourself or pay someone to do it for you. Thus, it has a levelling and democratising effect even if it doesn't come with a Windows installer!

The project is written in PHP, will work on shared hosting, populates data between nodes relatively slowly, and versions all record updates. Storage will work with most modern databases.
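As a rough illustration of "versions all record updates" combined with slow node-to-node propagation, here is a minimal sketch in Python (not PHP, and not Meshing's actual code). Every update is kept as an immutable version, and a node catches up by pulling the versions it lacks from a peer. The class names, the `pull_from` method, and the tie-breaking rule in `current` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    record_id: str
    number: int    # per-record counter, as seen by the node that wrote it
    origin: str    # name of the node that created this version
    data: tuple    # immutable snapshot: sorted (field, value) pairs


class Node:
    def __init__(self, name: str):
        self.name = name
        self.versions = set()  # append-only: versions are never edited or removed

    def update(self, record_id: str, **fields) -> Version:
        """Record a change as a new version instead of overwriting."""
        latest = max((v.number for v in self.versions
                      if v.record_id == record_id), default=0)
        version = Version(record_id, latest + 1, self.name,
                          tuple(sorted(fields.items())))
        self.versions.add(version)
        return version

    def pull_from(self, other: "Node") -> int:
        """Copy every version the other node has and we lack; return the count."""
        new = other.versions - self.versions
        self.versions |= new
        return len(new)

    def current(self, record_id: str):
        """Latest version wins; ties are broken by origin name (an arbitrary choice)."""
        candidates = [v for v in self.versions if v.record_id == record_id]
        if not candidates:
            return None
        winner = max(candidates, key=lambda v: (v.number, v.origin))
        return dict(winner.data)


a, b = Node("a"), Node("b")
a.update("paper-1", title="Draft")
b.pull_from(a)                      # b now holds a's version
b.update("paper-1", title="Final")
a.pull_from(b)                      # a catches up whenever it next syncs
print(a.current("paper-1"))
```

Because nothing is ever overwritten, it does not matter how slowly or in what order nodes exchange versions: two nodes that have seen the same set of versions agree on the current state.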

I've written a fair bit about it on my blog, including potential use cases. The prototype is in progress; as of Nov 2011, the storage/versioning is nearly done, and XML transport will be next.

Unknown said...

Hi, Jon. Sounds interesting, but couldn't the same kind of thing be accomplished with CouchDB? How does what you're doing differ from that approach?

Thanks!

David

Jon said...

That's a mighty good question! I've heard of CouchDB but since I've not used it, the bidirectional replication feature comes as news to me. That is very similar to what I am trying to achieve.

However, CouchDB requires quite substantial server resources to run, from what I can tell. The niche I foresee for my project is predicated on my (quite unproven) theory that web-based software will only become popular for relatively small datasets if it is trivial to install on cheap, shared hosting. Though I am not familiar with the NoSQL world, it seems to me that document stores are mainly the preserve of corporate users having 100M+ records and plenty of cash to spend on infrastructure.

True, I could use a hosted document-store service; for the relatively low levels of usage I am aiming at, these are priced somewhere between free and 20USD/month. But it is (afaik) still not quite the upload tarball to host + unpack + web-based wizard UI that I'm aiming for. Hosted CouchDB is also quite rare, whereas there's nearly a shared/VPS host in every back yard these days.

That all said, your suggestion does give me pause for thought. I've blogged a few suggestions where a replicated dataset with no central/co-ordinating server might help achieve the critical mass necessary to shake up an existing market - such as publishing scientific papers or recruitment advertising. But my immediate thought is - if this is easy already - where are they? IMO, versioned replication can be considered "democratised" when individuals/groups with negligible resources can offer a link on their website to a dataset server (in whatever form that might take) and say "mirror us and help us build our data".

Unknown said...

Actually, CouchDB, including its bidirectional replication, can run on an Android phone. So I don't believe it requires server resources at the level you are thinking :) Definitely recommend checking it out; it's open source, so you are more than welcome to contribute!

Jon said...

Cheers, David. I wondered this morning what choices Couch would have in regard to letting other instances replicate to it. My approach is that nodes would manually peer, setting read/write/moderated trust in either direction or both.
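One possible reading of that peering model, sketched in Python: each node keeps its own table of the trust it grants to each peer, so trust is naturally per-direction. The semantics assumed here are hypothetical (read = the peer may copy our data but not change it, write = the peer's changes are applied immediately, moderated = the peer's changes wait for approval), as are all the names.

```python
from enum import Enum


class Trust(Enum):
    READ = "read"            # peer may copy our data, but cannot change it
    WRITE = "write"          # peer's changes are applied immediately
    MODERATED = "moderated"  # peer's changes are queued for approval


class PeerNode:
    def __init__(self, name: str):
        self.name = name
        self.trust = {}     # peer name -> Trust that WE grant to that peer
        self.applied = []   # changes accepted into our dataset
        self.pending = []   # changes awaiting a moderator's decision

    def grant(self, peer: str, level: Trust) -> None:
        self.trust[peer] = level

    def receive(self, peer: str, change: str) -> str:
        """Decide what to do with a change sent by `peer`."""
        level = self.trust.get(peer)
        if level is Trust.WRITE:
            self.applied.append(change)
            return "applied"
        if level is Trust.MODERATED:
            self.pending.append(change)
            return "pending"
        return "rejected"   # read-only or unknown peers cannot write

    def approve(self, change: str) -> None:
        """Moderator accepts a queued change."""
        self.pending.remove(change)
        self.applied.append(change)


alice, bob = PeerNode("alice"), PeerNode("bob")
alice.grant("bob", Trust.MODERATED)   # alice reviews what bob sends
bob.grant("alice", Trust.WRITE)       # bob accepts alice's changes directly
print(alice.receive("bob", "add record 7"))
print(bob.receive("alice", "add record 8"))
```

Since each side sets its own level, the relationship can be asymmetric, which matches "in either direction or both".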

So, I think you're right. Some R&D with Couch is definitely required - I will look into it.