Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

The Craigslist Dilemma: A case study for big data and NoSQL solutions It's hard to imagine just how many postings Craigslist has handled over the

The Craigslist Dilemma: A case study for big data and NoSQL solutions

It's hard to imagine just how many postings Craigslist has handled over the years, and if you were in charge of archiving those posts and storing them for compliance, you'd need a 'big data' solution. So how does Craigslist manage all of their data, both the incoming stuff and the stuff that needs archiving? It's a beautiful combination of MySQL, NoSQL and a little help from the people at 10Gen.

With more than 1.5 million new ads posting every day, Craigslist users have generated over a billion records some might even consider that big data. Whats more, legislation demands that these records cant simply be erased or overwritten at the whim of the company: after a 60 day retention period in the live portion of the site, records must be migrated over to an archival space for legislative compliance.

And how does Craigslist manage this brobdingnagian volume of data? Prior to 2011, the archive consisted of a MySQL cluster that was part of the companys larger database structure that included over one hundred MySQL servers. Unfortunately, instead of making the job of data persistence easy, the nature of MySQL created complexity, forcing Craigslist to start exploring NoSQL options that could handle a huge amount of incoming data, simultaneously stream the archive process, and all while scale up easily over time.

The 'big data' challenge

As you could imagine, Craigslist faced several challenges due to the nature and volume of data being stored in their relational, MySQL servers. For example, the structure of their data had changed several times over the years. This alone made any change to the database schema a costly, prolonged nightmare, as changes often meant downtime, and of course, any alteration comes with the potential of unintended consequences. And if database alterations were a challenge, just imagine how difficult introducing entirely new features became? Whats more, each change to the live database schema required a corresponding change to the entire archive a process that took months every time. And during these updates, the archival process had to be put on hold, which meant stale data piled up in the live databases, slowing down the sites performance.

The NoSQL solution

Now dont get the impression that anyone at Craigslist is slamming MySQL. MySQL is still revered, its a stellar relational database, and the people in charge didnt want to stop using it for data in active online postings. It was the dead postings that needed a better graveyard. So what was the NoSQL solution? Craigslist passed that baton to MongoDB for archiving posts and their accompanying meta-data, and they archived these posts as documents instead of treating them as rows in a relational database table. And the process was a relatively speedy one. Including the time needed to sanitize and prep the data, migrating 1.5 billion postings to the new archive database only took about three months.

Key Benefits of a NoSQL Solution like MongoDB for Big Data

Dynamic and flexible without being tied to a single schema

Auto-sharding for horizontal scalability

Built-in replication

High availability

Faster and less expensive than relational database

Document based queries

Full index support

And of course, while there are obvious differences between a relational store and a NoSQL solution, there are similarities as well. After all, both systems are simply storing data for future retrieval. Jeremy Zawodny, a software engineer at Craigslist, appreciated this compatibility: Coming from a relational background, specifically a MySQL background, a lot of the concepts carry over.... It makes it very easy to get started.

Craigslist was able to implement a NoSQL solution in both of its data centers using servers in multi-node clusters, providing data replication functions and enhanced reliability, ensuring there is no single point of failure since the entire archive exists in each shard as servers can fail over without losing any data. The whole system is readily scalable over commodity hardware and new machines can be added without any downtime.

Archiving now occurs seamlessly, even when the MySQL schema undergoes changes. Samantha Kosko describes how this process works, Once a posting goes dead, MongoDB then reads into MySQL and writes that posting into a JSON-like document. By doing that, they were able to provide a schema-less design that allowed them the flexibility to archive multiple years of files without worrying about failure or flexibility in design.

This is a real-world example of applying new technology to a well-known application. After reading the case study, find additional resources to support this case study and write about it

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image

Step: 3

blur-text-image

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

Modern Database Management

Authors: Jeff Hoffer, Ramesh Venkataraman, Heikki Topi

12th edition

133544613, 978-0133544619

More Books

Students also viewed these Databases questions

Question

Spinal is a noun form. True Fa'se

Answered: 1 week ago