I first got interested in NoSQL databases through a May 2010 article in MSDN Magazine. That article started me on a quest to learn more about the potential of databases other than relational ones like MS SQL Server, Oracle, or MySQL. Although I have worked mostly with MongoDB, I have been amazed at the growth of the NoSQL community. Here is some of what I have discovered:
Wikipedia has an overview of NoSQL. It says that NoSQL could be read as “Not only SQL” and that it differs from relational database management systems in that these data stores may not require fixed table schemas, usually avoid join operations, and typically scale horizontally (by adding more servers rather than using more powerful ones). I was particularly drawn to the idea of no fixed table schemas because I have found the total cost of changing database structures to be high, particularly for the type of business applications that I develop.
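To make the "no fixed table schemas" point concrete, here is a minimal sketch in Python. Plain dicts stand in for documents, no real database is involved, and the record fields and helper names are made up for illustration. It shows two records of different shapes living side by side, and the price paid for that: readers have to tolerate missing fields.

```python
# Hypothetical customer records, modelled as plain dicts standing in for
# documents in a schema-less store. No real database is involved.
customers = [
    {"_id": 1, "name": "Acme Ltd", "phone": "555-0100"},
    # A later record gains new fields without any ALTER TABLE step.
    {
        "_id": 2,
        "name": "Globex",
        "emails": ["a@globex.example", "b@globex.example"],
        "address": {"city": "Springfield", "country": "US"},
    },
]


def field_names(docs):
    """Return the set of every top-level key used across the documents."""
    names = set()
    for doc in docs:
        names.update(doc)
    return names


def get_city(doc, default=None):
    """Readers must tolerate missing fields, since nothing enforces them."""
    return doc.get("address", {}).get("city", default)


cities = [get_city(c, default="unknown") for c in customers]
all_fields = field_names(customers)
```

The flexibility moves the schema question from the database into the application code, which is why `get_city` needs a default.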
Roots of NoSQL
I think that NoSQL attempts to address (or accept and deal with) some of the data storage needs for large-scale distributed computing:
- Distributed Computing – arose from the evolution of the Internet and the ability to create systems “in the cloud”. (I like “The Eight Fallacies of Distributed Computing” attributed to Peter Deutsch, 1994, Sun Microsystems:
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn’t change
- There is one administrator
- Transport cost is zero
- The network is homogeneous)
- CAP theorem for distributed systems – Eric Brewer, 2000 (UC Berkeley)
– Consistency, Availability, Partition tolerance (pick any two)
- Decentralization – distributed systems
- Flexibility – elasticity
- Fault tolerance – how defective machines affect the system
- Consistency – relative levels and how they affect the system
- Frequency of reads and writes
- Eventual consistency
- Joins and relationships
- Stored procedures
- Transactions (ACID)
NoSQL and relational databases address these challenges in different ways, so they have strengths and weaknesses associated with the design decisions.
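As one illustration of the consistency trade-off in the list above, here is a minimal sketch of eventual consistency using a last-write-wins rule. Everything here is an assumption made for illustration: the `Replica` class, the integer version counter, and the `sync` function are hypothetical and in-memory only. Real stores use more careful schemes (vector clocks, for example) to order writes.

```python
class Replica:
    """One node holding key -> (version, value). Hypothetical, in-memory only."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, version):
        current = self.data.get(key)
        # Last-write-wins: keep whichever entry carries the higher version.
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Exchange entries both ways so the two replicas converge."""
    for key, (version, value) in list(a.data.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.data.items()):
        a.write(key, value, version)


east, west = Replica("east"), Replica("west")
east.write("cart:42", ["book"], version=1)
sync(east, west)

# This write is accepted by one replica only.
west.write("cart:42", ["book", "pen"], version=2)

stale = east.read("cart:42")      # east has not yet seen version 2
sync(east, west)
converged = east.read("cart:42")  # after syncing, both replicas agree
```

Between the second write and the second `sync`, a reader on `east` sees old data. That window is what "eventual" means, and it is the price of accepting writes without coordinating every replica first.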
There are many database products that are called NoSQL. (I was surprised at how many there are and the number is increasing.) Here are some of them, by category:
- Key Value (Entity/Attribute/Values)
e.g., Windows Azure Table Service, Amazon Web Services SimpleDB, Oracle Berkeley DB, Redis, MemcacheDB, Hibari, Project Voldemort, Basho Riak
- Document
e.g., Apache CouchDB, MongoDB, RavenDB
- Column Family
e.g., Google Bigtable, Apache Cassandra, Hypertable
- Graph
e.g., Microsoft Dryad, Neo Technology Neo4j
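To give a feel for how these product families differ, here is a rough sketch of one hypothetical order shaped four ways: as a key-value pair, a document, a column-family row, and a small graph. These are plain Python structures chosen for illustration, not any product's actual storage format or API, and the order data is made up.

```python
import json

# Key-value: an opaque value behind a single key.
key_value = {"order:1001": '{"customer": "Acme", "total": 250}'}

# Document: a nested, self-describing record.
document = {
    "_id": 1001,
    "customer": {"name": "Acme"},
    "lines": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}

# Column family: row key -> family -> column -> value; rows may be sparse.
column_family = {
    "order:1001": {
        "header": {"customer": "Acme", "total": 250},
        "lines": {"A1": 2, "B7": 1},
    }
}

# Graph: nodes plus labelled edges between them.
nodes = {"order:1001": {"total": 250}, "customer:acme": {"name": "Acme"}}
edges = [("customer:acme", "PLACED", "order:1001")]


def neighbours(node_id, label):
    """Follow outgoing edges with the given label from one node."""
    return [dst for src, lbl, dst in edges if src == node_id and lbl == label]


total_from_kv = json.loads(key_value["order:1001"])["total"]
skus_from_doc = [line["sku"] for line in document["lines"]]
qty_from_columns = column_family["order:1001"]["lines"]["A1"]
orders_for_acme = neighbours("customer:acme", "PLACED")
```

The shape decides which question is cheap to answer: the key-value form needs the whole value decoded, the document form keeps the order lines together, the column-family form reaches one cell directly, and the graph form answers "what is connected to this?".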
I wish I could say that I have worked with all of these products, but I can’t. It has been fun fiddling with many of them, though.
With all of these categories and products, what does “NoSQL” really mean? I have a few ideas:
- No tables – objects, collections, nodes
- No (or fewer) foreign keys and constraints
- No ACID – can’t have it all
- No sophisticated query planners: mostly REST
- No declarative query language (more procedural)
- More flexible, fluid designs (dynamic schemas)
- More natural (and richer?) data representations
- Highly scalable (horizontal scaling e.g., more machines, not bigger machines)
- Sparse data – optional/multi-value fields
- Large datasets (but small datasets too)
- Meaningful identifiers
- Access patterns (such as map-reduce)
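Two items in the list above, horizontal scaling and map-reduce access, fit together, and a small sketch may help. The code below spreads documents across shards by hashing their keys, then runs a map step per document and a reduce step per key. The shard count, the order documents, and the function names are all hypothetical. Everything runs in one process here, whereas a real cluster would run the map step on each machine.

```python
from collections import defaultdict
from functools import reduce
import zlib


def shard_for(key, shard_count):
    """Pick a shard with a stable hash (crc32, unlike hash(), is not salted)."""
    return zlib.crc32(key.encode("utf-8")) % shard_count


def map_phase(doc):
    """Emit (key, value) pairs for one document: here, total per customer."""
    yield doc["customer"], doc["total"]


def reduce_phase(values):
    """Combine all values emitted for one key."""
    return reduce(lambda a, b: a + b, values, 0)


def map_reduce(shards):
    grouped = defaultdict(list)
    for shard in shards:  # each shard could live on its own machine
        for doc in shard:
            for key, value in map_phase(doc):
                grouped[key].append(value)
    return {key: reduce_phase(values) for key, values in grouped.items()}


orders = [
    {"_id": "o1", "customer": "acme", "total": 100},
    {"_id": "o2", "customer": "globex", "total": 40},
    {"_id": "o3", "customer": "acme", "total": 25},
]

SHARD_COUNT = 3  # hypothetical cluster size
shards = [[] for _ in range(SHARD_COUNT)]
for order in orders:
    shards[shard_for(order["_id"], SHARD_COUNT)].append(order)

totals = map_reduce(shards)
```

The result does not depend on which shard each order lands in, which is what lets the work be spread over more machines rather than a bigger one.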
Why Use NoSQL?
NoSQL has made inroads into applications when:
- The cost of scaling up a relational database is too high (when compared with NoSQL).
- There are lots of temporary data that don’t need to be stored in a relational database.
- There are complex queries with large datasets that need to be optimized.
- Transactions don’t need to be very durable.
- Object models involve too many joins or have to be greatly denormalized.
- Large quantities of Large Objects (CLOB or BLOB) are stored in a database.
- There is a need for fast data reads (but maybe not writes).
Considerations for Using NoSQL
Here are some things to consider when evaluating a NoSQL database:
- What is the problem that needs to be solved? (I know this one seems obvious. 🙂 )
- Data storage growth requirements – scalability & Big Data
- Data structure changes – potential shoehorned tables and queries
- Object inter-dependencies and/or coupling
- Cardinalities of relationships
- Data access patterns
- Application structure
- Single collection opportunities
- Operating system(s)
- Drivers – availability, support
- File storage
- Map-reduce/path traversal
- Hybrid solution potential
Impact on InfoTrail
I have been experimenting with using MongoDB as the main data storage engine for InfoTrail modules. So far it has shown some significant benefits:
- Using a collection per entity has reduced the number of tables to deal with. I am looking into the potential to use a single collection for all entities that would benefit from caching and another single collection for all entities that are not cached. This could greatly reduce development time and cost as well as fit well into the software factory approach. In its relational form, basic InfoTrail has over 500 tables, and more tables are added for individual customizations.
- Dynamic schema saves time in modification/enhancement of the modules. I would like to let the user/admin add, update, or delete keys, values, and sub-collections from a system admin screen. This also would fit well with the need for change/version control.
- Data retrieval rates appear to be faster (by at least an order of magnitude), but this will have to be benchmarked.
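The single-collection and admin-editable-keys ideas above can be sketched roughly as follows. This is not InfoTrail code and it does not use the MongoDB driver. It is an in-memory stand-in, and the `_type` discriminator field, the sample entities, and the helper names are all assumptions made for illustration.

```python
# A single in-memory "collection" holding several entity types.
# The "_type" discriminator and the helper names are hypothetical.
collection = [
    {"_id": 1, "_type": "customer", "name": "Acme"},
    {"_id": 2, "_type": "invoice", "customer_id": 1, "total": 250},
    {"_id": 3, "_type": "customer", "name": "Globex"},
]


def find(entity_type, **criteria):
    """Return documents of one entity type whose fields match all criteria."""
    return [
        doc
        for doc in collection
        if doc["_type"] == entity_type
        and all(doc.get(k) == v for k, v in criteria.items())
    ]


def set_key(entity_type, key, default):
    """Admin-style change: add a key to every document of a type that lacks it."""
    for doc in find(entity_type):
        doc.setdefault(key, default)


def drop_key(entity_type, key):
    """Admin-style change: remove a key from every document of a type."""
    for doc in find(entity_type):
        doc.pop(key, None)


set_key("customer", "credit_limit", 0)
drop_key("invoice", "total")

customers = find("customer")
invoices = find("invoice", customer_id=1)
```

Adding or dropping a key touches only the documents of one entity type and needs no table change, which is the development-time saving described above. The cost is that every query has to filter on the discriminator.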
Here are some links that I found helpful so far: