Question
The Craigslist Dilemma: A case study for big data and NoSQL solutions
It's hard to imagine just how many postings Craigslist has handled over the years, and if you were in charge of archiving those posts and storing them for compliance, you'd need a 'big data' solution. So how does Craigslist manage all of their data, both the incoming stuff and the stuff that needs archiving? It's a beautiful combination of MySQL, NoSQL, and a little help from the people at 10gen.
With more than 1.5 million new ads posted every day, Craigslist users have generated over a billion records; some might even consider that big data. What's more, legislation demands that these records can't simply be erased or overwritten at the whim of the company: after a 60-day retention period in the live portion of the site, records must be migrated over to an archival space for legislative compliance.
And how does Craigslist manage this brobdingnagian volume of data? Prior to 2011, the archive consisted of a MySQL cluster that was part of the company's larger database infrastructure, which included over one hundred MySQL servers. Unfortunately, instead of making data persistence easy, the nature of MySQL created complexity, forcing Craigslist to explore NoSQL options that could handle a huge amount of incoming data, stream the archive process at the same time, and scale up easily over time.
The 'big data' challenge
As you can imagine, Craigslist faced several challenges due to the nature and volume of the data stored in its relational MySQL servers. For example, the structure of the data had changed several times over the years. That alone made any change to the database schema a costly, prolonged nightmare: changes often meant downtime, and any alteration carries the potential of unintended consequences. And if database alterations were a challenge, imagine how difficult introducing entirely new features became. What's more, each change to the live database schema required a corresponding change to the entire archive, a process that took months every time. During these updates, the archival process had to be put on hold, which meant stale data piled up in the live databases, slowing down the site's performance.
The NoSQL solution
Now don't get the impression that anyone at Craigslist is slamming MySQL. MySQL is still revered; it's a stellar relational database, and the people in charge didn't want to stop using it for data in active online postings. It was the dead postings that needed a better graveyard. So what was the NoSQL solution? Craigslist passed that baton to MongoDB for archiving posts and their accompanying metadata, storing these posts as documents instead of treating them as rows in a relational database table. And the process was relatively speedy: including the time needed to sanitize and prep the data, migrating 1.5 billion postings to the new archive database took only about three months.
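To picture what "a posting as a document" means, here is a minimal sketch in Python with pymongo. The database name, collection name, and every field are illustrative assumptions, not Craigslist's actual schema.

```python
from datetime import datetime, timezone
from pymongo import MongoClient

# Hypothetical connection and namespace; not Craigslist's real deployment.
client = MongoClient("mongodb://localhost:27017")
archive = client["craigslist_archive"]["postings"]

# One dead posting stored as a single self-contained document, rather than
# rows spread across several relational tables. All field names are invented.
dead_posting = {
    "posting_id": 123456789,
    "title": "Mid-century desk - $75",
    "body": "Solid wood, light wear, pickup only.",
    "category": "furniture",
    "city": "sfbay",
    "posted_at": datetime(2011, 3, 14, tzinfo=timezone.utc),
    "expired_at": datetime(2011, 5, 13, tzinfo=timezone.utc),
    "metadata": {"images": 3, "flags": 0, "source": "web"},
}

archive.insert_one(dead_posting)
```

Because the document carries its own structure, an older posting with fewer fields and a newer one with extra metadata can sit side by side in the same collection.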
Key Benefits of a NoSQL Solution like MongoDB for Big Data
Dynamic and flexible without being tied to a single schema
Auto-sharding for horizontal scalability
Built-in replication
High availability
Faster and less expensive than relational databases
Document-based queries (see the query sketch after this list)
Full index support
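As a rough illustration of the last two items, here is a hedged pymongo sketch of a document-based query and an index. The collection, fields, and values are assumptions carried over from the earlier example, not details from the case study.

```python
from pymongo import MongoClient, ASCENDING, DESCENDING

archive = MongoClient("mongodb://localhost:27017")["craigslist_archive"]["postings"]

# Full index support: a compound index on fields an archive might be
# searched by (an illustrative choice, not Craigslist's actual indexes).
archive.create_index([("city", ASCENDING), ("expired_at", DESCENDING)])

# Document-based query: match on fields inside the document, no JOINs needed.
for post in archive.find({"city": "sfbay", "category": "furniture"}).limit(5):
    print(post["title"], post["expired_at"])
```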
And of course, while there are obvious differences between a relational store and a NoSQL solution, there are similarities as well. After all, both systems are simply storing data for future retrieval. Jeremy Zawodny, a software engineer at Craigslist, appreciated this compatibility: "Coming from a relational background, specifically a MySQL background, a lot of the concepts carry over.... It makes it very easy to get started."
Craigslist was able to implement the NoSQL solution in both of its data centers, using servers in multi-node clusters that provide data replication and enhanced reliability. There is no single point of failure: the archive is replicated across nodes, so servers can fail over without losing any data. The whole system scales readily on commodity hardware, and new machines can be added without any downtime.
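A minimal sketch of what a failover-aware client connection could look like, assuming a three-node replica set named rs0 spread across two data centers; the hostnames, set name, and options are placeholders, not Craigslist's topology.

```python
from pymongo import MongoClient

# Hypothetical replica set spanning two data centers. If the primary node
# goes down, the driver discovers the newly elected primary and carries on.
client = MongoClient(
    "mongodb://archive1.dc1.example,archive2.dc1.example,archive3.dc2.example"
    "/?replicaSet=rs0&w=majority&readPreference=secondaryPreferred"
)

archive = client["craigslist_archive"]["postings"]
print(archive.estimated_document_count())
```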
Archiving now occurs seamlessly, even when the MySQL schema undergoes changes. Samantha Kosko describes how the process works: "Once a posting goes dead, MongoDB then reads into MySQL and writes that posting into a JSON-like document." By doing that, they gained a schema-less design with the flexibility to archive multiple years of postings without worrying about failures or being locked into a rigid structure.
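A hedged sketch of what that flow might look like in Python, using mysql-connector-python and pymongo. The table name, column names, credentials, and the 60-day cutoff are assumptions drawn from the description above, not Craigslist's production code.

```python
from datetime import datetime, timedelta

import mysql.connector
from pymongo import MongoClient

# Assumed live schema: a MySQL table `postings` with an `expired_at` column.
cutoff = datetime.utcnow() - timedelta(days=60)

mysql_conn = mysql.connector.connect(
    host="live-db.example", user="archiver", password="secret", database="craigslist"
)
archive = MongoClient("mongodb://localhost:27017")["craigslist_archive"]["postings"]

cursor = mysql_conn.cursor(dictionary=True)
cursor.execute(
    "SELECT posting_id, title, body, category, city, posted_at, expired_at "
    "FROM postings WHERE expired_at < %s",
    (cutoff,),
)

for row in cursor:
    # Each relational row becomes one JSON-like document in the archive.
    archive.replace_one({"posting_id": row["posting_id"]}, row, upsert=True)

cursor.close()
mysql_conn.close()
```

The upsert keeps the job idempotent: rerunning it after a failure simply overwrites documents that were already archived instead of creating duplicates.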
This is a real-world example of applying new technology to a well-known application. After reading the case study, find additional resources that support it and write about them.