Techno Blender
Digitally Yours.

Choose the Right Database Model & Free Your Data from Spreadsheets | by Kaleb Nyquist | May, 2023

0 19


Open up a blank spreadsheet and the seemingly unending rows and columns can appear to be a canvas of infinite possibilities. However, many data engineers and other digital knowledge workers increasingly see the criss-crossing grey lines of a spreadsheet as restrictive — ironically not unlike the horizontal and vertical metal bars of a jail cell!

Admittedly, for the visual metaphor of “spreadsheet jail” to work the spreadsheet has to be rotated 90°. But once you see the eerie resemblance, it becomes impossible to forget. Photo illustration by author. Photo by WWW PROD on Unsplash.

“Spreadsheet jail” happens when the decision to store data in spreadsheets bottlenecks an organization from achieving its goals efficiently. This is primarily because a spreadsheet’s individual cells (no pun intended) are restricted in their ability to be queried and governed: for example, today the cell K18 might refer to the quantity of some item on hand, but tomorrow if a row and column is added K18 could refer to something else entirely.

For larger projects, “spreadsheet jail” also happens when the amount of data being stored exceeds the maximum size of a spreadsheet (10 million cells for Google Sheets; 1,048,576 rows and 16,384 columns for Microsoft Excel). In one extreme example, a list of financial data was discovered to have been fabricated after an employee noticed it oddly was the exact same size as a maxed-out Excel spreadsheet, a deception that now has the company’s disgraced CEO facing decades of actual prison time.

The stakes of spreadsheet jail are less literal for those us not engaged in criminal enterprises, but nevertheless our inappropriate use of spreadsheets often is costly and cumbersome. Consider the following questions:

  • Is opening a spreadsheet that once was considered a reliable and efficient way of looking up information now a source of dread?
  • Does your organization have a staff person spending multiple hours a day answering questions where the answers are either (1) buried deep in a single spreadsheet or (2) spread wide across multiple spreadsheets?
  • Are there important analytical questions that have even your pivot table pros scratching their head? Similarly, are your cell formulas fragile, breaking constantly or requiring multiple manual steps to calculate?
  • Is vital information being relegated outside of cell values, for example as a “note” or as a “comment” on the spreadsheet?

If you answered “no” to all of these questions, congratulations! When data management isn’t keeping you from achieving your goals, you’re not in spreadsheet jail. There is no need to replace what works with an overpowered database solution.

However, if you answered “yes” to any of these questions, that unfortunately is a sign that you are in spreadsheet jail. The good news is that it is possible to break free by adopting one of the many database technologies that are more appropriate to your needs.

In what follows, I help narrow down the options by introducing three database model paradigms while also providing an example of a spreadsheet that particularly benefits by being converted into that model. You might wish to keep open a copy of your most problematic spreadsheet as you review these descriptions to try and diagnose the problem that landed you in spreadsheet jail and determine which database model is the best antidote.

These descriptions are written for project managers and executives with some tech-savvy trying to better understand their data needs. Data professionals whose companies have inadvertently stuck them in spreadsheet jail can also refer to this piece to find the language they need to advocate for the adoption of more robust database technologies.

Relational databases consist of “tables” that conceptually are similar to the two-dimensional grid of a spreadsheet, but with individual rows of data connected across tables by “keys.” A “primary key” is a data point that represents an individual row to other tables, whereas a “foreign key” stores which row(s) from other tables are meant to have a particular relationship with the row in a given table.

SQL — or “Structured Query Language” —is commonly used for interacting with data stored in a relational database, including asking questions that involve data scattered across multiple tables. Although it might sound intimidating at first, SQL is relatively easy and inexpensive to learn (there are numerous free SQL courses available online, and many community colleges offer introductory courses in SQL).

Notably, SQL has been a public good since 1986, when the American National Standards Institute first recognized a form of SQL that could be used interchangeably across corporations, government agencies, and academic institutions. If your goals call for importing open data or exporting internal data for academic research, having a database with a relational model that can be accessed via SQL is arguably the safest way to go.

Because of the ubiquity of relational databases and SQL, it can be overwhelming to try and determine what specific solution is best. The most popular databases, such as Oracle and SQL Server, are generally overpowered and too costly for most organizations that are simply shifting from spreadsheets and not (yet) running full-fledged apps. More appropriate are the free and open-source options including PostgreSQL which does not skimp on features and SQLite which is nimble and easy to deploy.

If a relational database model is compelling but the investment to setup and learn SQL is cost-prohibitive, Airtable is an excellent compromise. Instead of a query language, Airtable’s spreadsheet-like graphical interface works well for interacting with what (from an end-user perspective) functions like a relational database model.

Spreadsheet Diagnosis

If your data are currently stored within a spreadsheet, you should consider making the leap to a relational database if the spreadsheet has a significant number of “repeating sub-rows”.

This spreadsheet tracks participants in an afterschool program. Notice that Parent is a “repeating sub-row” and the Parent 2 cell values in particular are either all filled or completely empty. This gives the end of the spreadsheet a sort of “string cheese” appearance, where the rows are frayed at the end. Example data by author.

Say, for example, you run an after-school program. Each row in your spreadsheet represents data for a single kid (name, grade, allergies, etc.) and you have sub-rows that represent data for each of the kids’ parents (name, contact information, etc.). It is quite likely that multiple kids will have the same parents, but if you were to update contact information for that parent you would have to make sure to catch every sub-row the parent appears in.

This simple entity-relationship diagram, using crow’s foot notation, is an illustration to say that each child in the afterschool program’s database has a name, grade, a list of allergies (including an empty list), and one or more associated parents. Each of these parents has a name, email address, phone number, and one or more associated students. Diagram by author.

A relational database solves this issue by converting the “sub-rows” into a single “Foreign Key” cell that refers to data stored on a different “parent” table. If you update information on one parent, you simultaneously update the parental information for each kid associated with that parent. Because there is no need to make batch updates across duplicated data, the chore of data integrity becomes much easier when using a relational data model.

The afterschool program’s data migrated into Airtable. Although technically not a SQL database, Airtable embraces the relational data paradigm while maintaining a familiar spreadsheet interface. Example data by author.

Of course, most after-school programs are run by volunteers and other people who do not have the time and skills to manage a full-fledged SQL database. Airtable is a great no-code alternative for this use case, with the additional advantage of having built-in tools like “Forms” that allow for guided data entry.

A document-oriented database, or “document store”, is ideal for working with loosely structured data. If the paradigm of relational databases described above are “tables” consisting of “rows” and “columns”, then the paradigm of document-oriented databases are “collections” of “documents” that consist of “key” and “value” pairs. For example, a key of Length (meters)might be paired with a value of 18, or a key of Locations might be paired to a list consisting of ["Portland", "Omaha", "San Antonio", "Savannah"] . Furthermore, a value can even be an entirely new set of key-value pairs called a “sub-document”. Assuming each document has a unique id, it also is possible for a value to point towards another document and emulate the relationality of SQL databases.

People new to the jargon of digital data may be confused by the term “document.” I have to confess that when I first learned about document-oriented databases, I naïvely thought these systems were designed to store a bunch of PDF and Word documents (I later discovered that the term for that sort of technology is a “document management system”). In this context, however, a “document” actually refers to the lightweight and popular JSON format designed to be relatively easy for both humans and computers to read or write. Whereas each row in a relational database must conform to the schema (aka the “columns”) of a table, a JSON document can theoretically have any desired set of key-value pairs.

Because of this flexibility, document-oriented databases have a reputation as being the “adaptable” database model. Over time as you discover more data about what it is you’re collecting information on, you can add new fields without mucking up any old records that have a different set of key-value pairs. Furthermore, because the same key can be associated with different types of values across documents, this database model lends itself well to data heterogeneity.

One practical way to take advantage of this heterogeneity is to create “placeholder data” such as a text string that will be expanded into a subdocument later. It also makes it easier to distinguish between different types of blank data, for example by (1) excluding a key to indicate missing data versus (2) setting a value to null or None if there is nothing to observe, such as the geographic coordinates of an organization that operates entirely online, or the driver’s license number of a small child too young to drive.

Document stores are ideal for organizations that are in the beginning stages of trying to figure out what data to collect. Notably, shifting to a document-oriented database is part of what turned around the fortunes of the HealthCare.gov website and has allowed it to continue adapting to the constantly changing data inputs of multiple federal and state agencies. That said, if the database’s engineers lack discipline and are careless about adding fields, there is a real risk that over time the database will become increasingly messy and difficult to manage.

The most popular document-oriented database solution is MongoDB. With a license agreement that is almost-but-not-quite-considered-open source, their free 512MB cloud tier and locally-hosted community edition can meet the data needs for many of those transitioning from spreadsheets without incurring direct expenses. Furthermore, the basics of MongoDB querying (sometimes referred to as “MQL”) and other data navigation tools like Compass can be learned online for free. However, unlike SQL, “MQL” is not actually a standardized query language but rather a set of language-specific drivers and APIs. This means that although it is harder to hire talent that knows MQL, it should still be easy for someone who is already working in a specific programming language (i.e. Python, JavaScript, etc.) to pick up MQL and implement MongoDB.

Spreadsheet Diagnosis

If your data are currently stored within a spreadsheet, you should consider making the leap to a document-oriented database if you have a significant number of cells that are either completely empty or crammed with data (more than half the spreadsheet is a good rule of thumb). You also want to look out for “ad hoc columns” that exist just to input data on a few rows or a “miscellaneous” notes column.

This spreadsheet stores information for a local community directory. Notice the use of the `|` symbol to create “mini-spreadsheets” within cells. There also are a number of empty cells scattered about, giving this spreadsheet a sort of “swiss cheese” appearance. Example data by author.

The above example spreadsheet features an imagined list of local places, all of which have little in common with each other than the fact they have a name. Notice there is plenty of data on some places, like Joe’s Burgers, whereas some places like Alcove Speakeasy have sparse data. Many of the cells are empty, including the Restrictions column which contains data for a single row only. Under the Websites column it is unclear if the cell is blank because the data is missing (as is the case with Joe’s Burgers) or because the place doesn’t have a website (as is the case with the Centennial Obelisk). There are also a number of cells bloated with nested data, in particular under the Photos, Prices, and Scores columns. And of course there is a Miscellaneous column that is merging three potential columns into one to keep the spreadsheet from growing too unwieldly.

The same local community directory as a JSON file. This is similar to how the data would be stored within a document-oriented database. Full file is available to view here. Screenshot by author.

Looking at the same community directory stored as a JSON file some crucial advantages become apparent.

  • The documents can be different sizes depending on the amount of actual data that is being stored. This means instead of a Miscellaneous column to reduce the number of spreadsheet columns, an Awards property can be created for just a few documents.
  • Notice that for “Joe’s Burgers” there is no Website field whereas for “Centennial Obelisk.” This is one way to communicate that the website for Joe’s Burgers might exist but is missing in the data, whereas there is no website for the Centennial Obelisk. Compare this to the spreadsheet, where these two scenarios are indistinguishable!
  • There is a lot more room for the data breathe, metaphorically speaking. The Photos, Prices, and Scores fields are crammed in spreadsheet cells but can stretch further as their own sub-documents or, in the case of Scores, as a list.
  • A relational data structure has been created using an @id field as a “primary key”, whereas fields such as Type and User can look up documents by their @id.

Graph databases are designed to describe connections. This is accomplished by simultaneously managing two different orders of data: “nodes” and “relationships” (note that the terminology may differ depending on the specific graph database technology). Nodes have their own existence independent of any schema or other data, although nodes can be categorized using “labels” (similar to how relational databases use “tables” and document-oriented databases use “collections” for categorization). Relationships, as the second order of graph data, are defined in terms of a source node and a target node, categorized by a type of relationship. Similar to documents in document-oriented databases, both nodes and relationships can have any number of properties represented as key-value pairs.

Again, for people new to the jargon of digital data, the term “graph” can be a distraction from what is actually meaningful about this model type. Instead of common data visualizations such as “bar graph” or “line graph,” here the term graph refers to the fact that this data model type lends itself to a form of visual representation more akin to a mind-map of circles (as nodes) and lines (as relationships). It has been observed that this data model type is similar to how the human mind works, which not only has implications for neuroscience and artificial intelligence, but also means graph data models are relatively easy to understand for less tech-savvy stakeholders.

Ironically, graph databases are actually more relationship-oriented than relational databases. While relational databases connect related data through matching primary and foreign keys across tables, relationships in graph databases have their own “existence” so to speak and therefore can be loaded with additional properties. The properties can speak to the strength, quality, duration, or any other characteristic of the relationship.

The leading graph database solution is Neo4j. Their community edition is fully open source and can be used for local- or self-hosting, and the free cloud tier caps out at 200,000 nodes and 400,000 relationships. The Neo4j query language, called “Cypher”, is free to learn and has a unique visual aspect that makes it easy for non-developers to understand. At present, the standardization of a Cypher-inspired “Graph Query Language” (or GQL) is in the works and is even considered as a step towards the United Nations’ Sustainable Development Goal #9 of building resilient infrastructure, promoting inclusive and sustainable industrialization, and fostering innovation.

Spreadsheet Diagnosis

If your organization’s data are currently contained within a spreadsheet, you should consider making the leap to a graph database especially if you are frequently contextualizing the data through notes, comments, and additional columns that exist only to describe the data in other columns.

Many of the cells in this spreadsheet have a “note” with important contextual data which can be revealed through moving the cursor to the cell. In Microsoft Excel, notes with cells are demarcated by a red triangle in the top-right corner, giving the spreadsheet a pepperjack cheese appearance. File available for download here. Screen capture by author.

The above example shows a “power mapping” of an imaginary advocacy organization called Fayette Advocacy Network. A power map is used for tracking who is influential on a particular issue, their attitude towards that issue, and their relationship to the organization (however many degrees removed).

Note that the most interesting data in this graph isn’t in the cell values themselves but in the notes for each cell. For example: Councilwoman Nadia Carlson and editorial board member Julius Carlson are married. Scott Poor is not only the President of the Fayette Teachers Union but also an employee of Fayette Public Schools and an active member of Fayette Advocacy Network. Not only is it documented that Sheriff Ricardo Ortiz is against the Fayette Advocacy Network’s cause, but this is known because of a social media comment he made.

In this graph view, decision-makers are in blue, all other humans are in fuchsia, organizations are in green, stances on the issue are in grey, and the Fayette Advocacy Network gets it’s own big gold node. Relationships between these nodes are represented with arrows and are a valuable form of data in their own right. View in Arrows.app.

Because a graph database can store relationships with properties, data relegated to notes in a spreadsheet can be featured prominently in a graph database. The graph model type improves the organization and accessibility of this valuable contextual information.

When looking at all the data at once it may appear to be a jumble. It is possible to narrow down the data by querying more specific questions like: “Who are the members of Fayette Advocacy Network that have a connection to a decision-maker whose position on the issue is still unknown?” For reference, a corresponding Cypher query is available below. Notice that it is relatively easy for a human to read this code that is primarily written for a computer to parse:

MATCH (faynet:Organization)<-[:MEMBER_OF]-(member:Person)-[:CONNECTED*1..2]-(decisionmaker:Person)-[:STANCE]->(pos:Position)
WHERE faynet.name = "Fayette Advocacy Network"
AND pos.name = "Unknown"
RETURN member

This query returns a list of Persons who are a MEMBER_OF “Fayette Advocacy Network” who also are CONNECTED to (or connected to someone connected to) decisionmakers whose Position on the issue is “unknown.” Even if you do not know Cypher, you should be able to figure out how the query works after just a few minutes puzzling over what the various arrows and brackets mean.

For what it’s worth: to help make sense of the jumble, Neo4j also has a set of graph data science packages to help identify patterns and calculate common measures of node significance such as degree centrality and betweenness centrality. Increasingly, there also are more and more software solutions becoming available for visualizing these graph data, ranging from Neo4j’s simple in-house solution Bloom all the way to virtual reality headsets. Needless to say, such queries, calculations, and visualizations are next to impossible to do in a spreadsheet alone.

The motivating principle behind this guide is to help you identify the appropriate technology for your data needs. Many times spreadsheets will actually be that appropriate technology: for storing non-relational data in a simple format, to collaborate with less tech-savvy users, to crunch some quick numbers, and even as a creative place for data-driven brainstorming and project planning. Spreadsheets work because they combine data entry, engineering, analysis, collaboration, and sharing all in a single space.

But the convenience of having all these features bound together can also be what makes spreadsheets so restrictive for your data. The reality might be that you need the associations of a relational database, or the adaptability of a document-oriented database, or the contextualization of a graph database. This is true even if you have poured hundreds of hours into perfecting a single spreadsheet — in fact, the more time you invest in spreadsheet, the more likely it is the spreadsheet is holding you back.

“Spreadsheets look really good no matter what goes into them. Someone brings a spreadsheet into a meeting, it looks so precise. It’s got numbers and graphs and pretty fonts. It feels like it’s telling you exactly how the world is. It feels like truth.”

— David Kastenbaum on Spreadsheets!

Spreadsheet jail happens because spreadsheets feel safe. Unlike database technologies, tools like Microsoft Excel and Google Sheets come bundled in home office suites and have the same familiar interfaces as our word processors and slideshow creators. Many of us first learned how to use spreadsheets in school, and sometimes wrangling a spreadsheet is the tie that binds together colleagues. Even on TikTok, communities are popping up dedicated to mastering the art of spreadsheets.

Making the leap. Photo illustration by author. Photo by Sammie Chaffin on Unsplash.

To liberate our data from spreadsheet jail requires a leap of faith. We have to believe that the benefits of a particular database model will outweigh the risks that come with adopting an unfamiliar technology. The good news is that we have room for error: it is easier to fix a messy database through a few lines of code than it is to fix clunky spreadsheets through hundreds of mouse clicks. Furthermore, choosing the wrong database technology to start with is far from the end of the world: once a spreadsheet has been re-organized into tables (for relational), collections (for document-oriented), or labels (for graph) it is much easier to translate between the three paradigms.

However, there might be an even more important reason to escape spreadsheet jail. As we liberate our data from spreadsheets, we also liberate our imaginations about what data are and how data can be utilized throughout an organization. Looking up and entering data goes from being a time-suck chore to becoming a contribution to an increasingly exciting and valuable store of information. Our data goes from being a clumsily-held internal secret to a living center of collaboration for team members and external partners.

Take one more hard look at your data. If spreadsheet jail is keeping you from achieving your goals, it is time to take a deep breath and make a courageous leap into the world of databases!


Open up a blank spreadsheet and the seemingly unending rows and columns can appear to be a canvas of infinite possibilities. However, many data engineers and other digital knowledge workers increasingly see the criss-crossing grey lines of a spreadsheet as restrictive — ironically not unlike the horizontal and vertical metal bars of a jail cell!

Admittedly, for the visual metaphor of “spreadsheet jail” to work the spreadsheet has to be rotated 90°. But once you see the eerie resemblance, it becomes impossible to forget. Photo illustration by author. Photo by WWW PROD on Unsplash.

“Spreadsheet jail” happens when the decision to store data in spreadsheets bottlenecks an organization from achieving its goals efficiently. This is primarily because a spreadsheet’s individual cells (no pun intended) are restricted in their ability to be queried and governed: for example, today the cell K18 might refer to the quantity of some item on hand, but tomorrow if a row and column is added K18 could refer to something else entirely.

For larger projects, “spreadsheet jail” also happens when the amount of data being stored exceeds the maximum size of a spreadsheet (10 million cells for Google Sheets; 1,048,576 rows and 16,384 columns for Microsoft Excel). In one extreme example, a list of financial data was discovered to have been fabricated after an employee noticed it oddly was the exact same size as a maxed-out Excel spreadsheet, a deception that now has the company’s disgraced CEO facing decades of actual prison time.

The stakes of spreadsheet jail are less literal for those us not engaged in criminal enterprises, but nevertheless our inappropriate use of spreadsheets often is costly and cumbersome. Consider the following questions:

  • Is opening a spreadsheet that once was considered a reliable and efficient way of looking up information now a source of dread?
  • Does your organization have a staff person spending multiple hours a day answering questions where the answers are either (1) buried deep in a single spreadsheet or (2) spread wide across multiple spreadsheets?
  • Are there important analytical questions that have even your pivot table pros scratching their head? Similarly, are your cell formulas fragile, breaking constantly or requiring multiple manual steps to calculate?
  • Is vital information being relegated outside of cell values, for example as a “note” or as a “comment” on the spreadsheet?

If you answered “no” to all of these questions, congratulations! When data management isn’t keeping you from achieving your goals, you’re not in spreadsheet jail. There is no need to replace what works with an overpowered database solution.

However, if you answered “yes” to any of these questions, that unfortunately is a sign that you are in spreadsheet jail. The good news is that it is possible to break free by adopting one of the many database technologies that are more appropriate to your needs.

In what follows, I help narrow down the options by introducing three database model paradigms while also providing an example of a spreadsheet that particularly benefits by being converted into that model. You might wish to keep open a copy of your most problematic spreadsheet as you review these descriptions to try and diagnose the problem that landed you in spreadsheet jail and determine which database model is the best antidote.

These descriptions are written for project managers and executives with some tech-savvy trying to better understand their data needs. Data professionals whose companies have inadvertently stuck them in spreadsheet jail can also refer to this piece to find the language they need to advocate for the adoption of more robust database technologies.

Relational databases consist of “tables” that conceptually are similar to the two-dimensional grid of a spreadsheet, but with individual rows of data connected across tables by “keys.” A “primary key” is a data point that represents an individual row to other tables, whereas a “foreign key” stores which row(s) from other tables are meant to have a particular relationship with the row in a given table.

SQL — or “Structured Query Language” —is commonly used for interacting with data stored in a relational database, including asking questions that involve data scattered across multiple tables. Although it might sound intimidating at first, SQL is relatively easy and inexpensive to learn (there are numerous free SQL courses available online, and many community colleges offer introductory courses in SQL).

Notably, SQL has been a public good since 1986, when the American National Standards Institute first recognized a form of SQL that could be used interchangeably across corporations, government agencies, and academic institutions. If your goals call for importing open data or exporting internal data for academic research, having a database with a relational model that can be accessed via SQL is arguably the safest way to go.

Because of the ubiquity of relational databases and SQL, it can be overwhelming to try and determine what specific solution is best. The most popular databases, such as Oracle and SQL Server, are generally overpowered and too costly for most organizations that are simply shifting from spreadsheets and not (yet) running full-fledged apps. More appropriate are the free and open-source options including PostgreSQL which does not skimp on features and SQLite which is nimble and easy to deploy.

If a relational database model is compelling but the investment to setup and learn SQL is cost-prohibitive, Airtable is an excellent compromise. Instead of a query language, Airtable’s spreadsheet-like graphical interface works well for interacting with what (from an end-user perspective) functions like a relational database model.

Spreadsheet Diagnosis

If your data are currently stored within a spreadsheet, you should consider making the leap to a relational database if the spreadsheet has a significant number of “repeating sub-rows”.

This spreadsheet tracks participants in an afterschool program. Notice that Parent is a “repeating sub-row” and the Parent 2 cell values in particular are either all filled or completely empty. This gives the end of the spreadsheet a sort of “string cheese” appearance, where the rows are frayed at the end. Example data by author.

Say, for example, you run an after-school program. Each row in your spreadsheet represents data for a single kid (name, grade, allergies, etc.) and you have sub-rows that represent data for each of the kids’ parents (name, contact information, etc.). It is quite likely that multiple kids will have the same parents, but if you were to update contact information for that parent you would have to make sure to catch every sub-row the parent appears in.

This simple entity-relationship diagram, using crow’s foot notation, is an illustration to say that each child in the afterschool program’s database has a name, grade, a list of allergies (including an empty list), and one or more associated parents. Each of these parents has a name, email address, phone number, and one or more associated students. Diagram by author.

A relational database solves this issue by converting the “sub-rows” into a single “Foreign Key” cell that refers to data stored on a different “parent” table. If you update information on one parent, you simultaneously update the parental information for each kid associated with that parent. Because there is no need to make batch updates across duplicated data, the chore of data integrity becomes much easier when using a relational data model.

The afterschool program’s data migrated into Airtable. Although technically not a SQL database, Airtable embraces the relational data paradigm while maintaining a familiar spreadsheet interface. Example data by author.

Of course, most after-school programs are run by volunteers and other people who do not have the time and skills to manage a full-fledged SQL database. Airtable is a great no-code alternative for this use case, with the additional advantage of having built-in tools like “Forms” that allow for guided data entry.

A document-oriented database, or “document store”, is ideal for working with loosely structured data. If the paradigm of relational databases described above are “tables” consisting of “rows” and “columns”, then the paradigm of document-oriented databases are “collections” of “documents” that consist of “key” and “value” pairs. For example, a key of Length (meters)might be paired with a value of 18, or a key of Locations might be paired to a list consisting of ["Portland", "Omaha", "San Antonio", "Savannah"] . Furthermore, a value can even be an entirely new set of key-value pairs called a “sub-document”. Assuming each document has a unique id, it also is possible for a value to point towards another document and emulate the relationality of SQL databases.

People new to the jargon of digital data may be confused by the term “document.” I have to confess that when I first learned about document-oriented databases, I naïvely thought these systems were designed to store a bunch of PDF and Word documents (I later discovered that the term for that sort of technology is a “document management system”). In this context, however, a “document” actually refers to the lightweight and popular JSON format designed to be relatively easy for both humans and computers to read or write. Whereas each row in a relational database must conform to the schema (aka the “columns”) of a table, a JSON document can theoretically have any desired set of key-value pairs.

Because of this flexibility, document-oriented databases have a reputation as being the “adaptable” database model. Over time as you discover more data about what it is you’re collecting information on, you can add new fields without mucking up any old records that have a different set of key-value pairs. Furthermore, because the same key can be associated with different types of values across documents, this database model lends itself well to data heterogeneity.

One practical way to take advantage of this heterogeneity is to create “placeholder data” such as a text string that will be expanded into a subdocument later. It also makes it easier to distinguish between different types of blank data, for example by (1) excluding a key to indicate missing data versus (2) setting a value to null or None if there is nothing to observe, such as the geographic coordinates of an organization that operates entirely online, or the driver’s license number of a small child too young to drive.

Document stores are ideal for organizations that are in the beginning stages of trying to figure out what data to collect. Notably, shifting to a document-oriented database is part of what turned around the fortunes of the HealthCare.gov website and has allowed it to continue adapting to the constantly changing data inputs of multiple federal and state agencies. That said, if the database’s engineers lack discipline and are careless about adding fields, there is a real risk that over time the database will become increasingly messy and difficult to manage.

The most popular document-oriented database solution is MongoDB. With a license agreement that is almost-but-not-quite-considered-open source, their free 512MB cloud tier and locally-hosted community edition can meet the data needs for many of those transitioning from spreadsheets without incurring direct expenses. Furthermore, the basics of MongoDB querying (sometimes referred to as “MQL”) and other data navigation tools like Compass can be learned online for free. However, unlike SQL, “MQL” is not actually a standardized query language but rather a set of language-specific drivers and APIs. This means that although it is harder to hire talent that knows MQL, it should still be easy for someone who is already working in a specific programming language (i.e. Python, JavaScript, etc.) to pick up MQL and implement MongoDB.

Spreadsheet Diagnosis

If your data are currently stored within a spreadsheet, you should consider making the leap to a document-oriented database if you have a significant number of cells that are either completely empty or crammed with data (more than half the spreadsheet is a good rule of thumb). You also want to look out for “ad hoc columns” that exist just to input data on a few rows or a “miscellaneous” notes column.

This spreadsheet stores information for a local community directory. Notice the use of the `|` symbol to create “mini-spreadsheets” within cells. There also are a number of empty cells scattered about, giving this spreadsheet a sort of “swiss cheese” appearance. Example data by author.

The above example spreadsheet features an imagined list of local places, all of which have little in common with each other than the fact they have a name. Notice there is plenty of data on some places, like Joe’s Burgers, whereas some places like Alcove Speakeasy have sparse data. Many of the cells are empty, including the Restrictions column which contains data for a single row only. Under the Websites column it is unclear if the cell is blank because the data is missing (as is the case with Joe’s Burgers) or because the place doesn’t have a website (as is the case with the Centennial Obelisk). There are also a number of cells bloated with nested data, in particular under the Photos, Prices, and Scores columns. And of course there is a Miscellaneous column that is merging three potential columns into one to keep the spreadsheet from growing too unwieldly.

The same local community directory as a JSON file. This is similar to how the data would be stored within a document-oriented database. Full file is available to view here. Screenshot by author.

Looking at the same community directory stored as a JSON file some crucial advantages become apparent.

  • The documents can be different sizes depending on the amount of actual data that is being stored. This means instead of a Miscellaneous column to reduce the number of spreadsheet columns, an Awards property can be created for just a few documents.
  • Notice that for “Joe’s Burgers” there is no Website field whereas for “Centennial Obelisk.” This is one way to communicate that the website for Joe’s Burgers might exist but is missing in the data, whereas there is no website for the Centennial Obelisk. Compare this to the spreadsheet, where these two scenarios are indistinguishable!
  • There is a lot more room for the data breathe, metaphorically speaking. The Photos, Prices, and Scores fields are crammed in spreadsheet cells but can stretch further as their own sub-documents or, in the case of Scores, as a list.
  • A relational data structure has been created using an @id field as a “primary key”, whereas fields such as Type and User can look up documents by their @id.

Graph databases are designed to describe connections. This is accomplished by simultaneously managing two different orders of data: “nodes” and “relationships” (note that the terminology may differ depending on the specific graph database technology). Nodes have their own existence independent of any schema or other data, although nodes can be categorized using “labels” (similar to how relational databases use “tables” and document-oriented databases use “collections” for categorization). Relationships, as the second order of graph data, are defined in terms of a source node and a target node, categorized by a type of relationship. Similar to documents in document-oriented databases, both nodes and relationships can have any number of properties represented as key-value pairs.

Again, for people new to the jargon of digital data, the term “graph” can be a distraction from what is actually meaningful about this model type. Instead of common data visualizations such as “bar graph” or “line graph,” here the term graph refers to the fact that this data model type lends itself to a form of visual representation more akin to a mind-map of circles (as nodes) and lines (as relationships). It has been observed that this data model type is similar to how the human mind works, which not only has implications for neuroscience and artificial intelligence, but also means graph data models are relatively easy to understand for less tech-savvy stakeholders.

Ironically, graph databases are actually more relationship-oriented than relational databases. While relational databases connect related data through matching primary and foreign keys across tables, relationships in graph databases have their own “existence” so to speak and therefore can be loaded with additional properties. The properties can speak to the strength, quality, duration, or any other characteristic of the relationship.

The leading graph database solution is Neo4j. Their community edition is fully open source and can be used for local- or self-hosting, and the free cloud tier caps out at 200,000 nodes and 400,000 relationships. The Neo4j query language, called “Cypher”, is free to learn and has a unique visual aspect that makes it easy for non-developers to understand. At present, the standardization of a Cypher-inspired “Graph Query Language” (or GQL) is in the works and is even considered as a step towards the United Nations’ Sustainable Development Goal #9 of building resilient infrastructure, promoting inclusive and sustainable industrialization, and fostering innovation.

Spreadsheet Diagnosis

If your organization’s data are currently contained within a spreadsheet, you should consider making the leap to a graph database especially if you are frequently contextualizing the data through notes, comments, and additional columns that exist only to describe the data in other columns.

Many of the cells in this spreadsheet have a “note” with important contextual data which can be revealed through moving the cursor to the cell. In Microsoft Excel, notes with cells are demarcated by a red triangle in the top-right corner, giving the spreadsheet a pepperjack cheese appearance. File available for download here. Screen capture by author.

The above example shows a “power mapping” of an imaginary advocacy organization called Fayette Advocacy Network. A power map is used for tracking who is influential on a particular issue, their attitude towards that issue, and their relationship to the organization (however many degrees removed).

Note that the most interesting data in this graph isn’t in the cell values themselves but in the notes for each cell. For example: Councilwoman Nadia Carlson and editorial board member Julius Carlson are married. Scott Poor is not only the President of the Fayette Teachers Union but also an employee of Fayette Public Schools and an active member of Fayette Advocacy Network. Not only is it documented that Sheriff Ricardo Ortiz is against the Fayette Advocacy Network’s cause, but this is known because of a social media comment he made.

In this graph view, decision-makers are in blue, all other humans are in fuchsia, organizations are in green, stances on the issue are in grey, and the Fayette Advocacy Network gets it’s own big gold node. Relationships between these nodes are represented with arrows and are a valuable form of data in their own right. View in Arrows.app.

Because a graph database can store relationships with properties, data relegated to notes in a spreadsheet can be featured prominently in a graph database. The graph model type improves the organization and accessibility of this valuable contextual information.

When looking at all the data at once it may appear to be a jumble. It is possible to narrow down the data by querying more specific questions like: “Who are the members of Fayette Advocacy Network that have a connection to a decision-maker whose position on the issue is still unknown?” For reference, a corresponding Cypher query is available below. Notice that it is relatively easy for a human to read this code that is primarily written for a computer to parse:

MATCH (faynet:Organization)<-[:MEMBER_OF]-(member:Person)-[:CONNECTED*1..2]-(decisionmaker:Person)-[:STANCE]->(pos:Position)
WHERE faynet.name = "Fayette Advocacy Network"
AND pos.name = "Unknown"
RETURN member

This query returns a list of Persons who are a MEMBER_OF “Fayette Advocacy Network” who also are CONNECTED to (or connected to someone connected to) decisionmakers whose Position on the issue is “unknown.” Even if you do not know Cypher, you should be able to figure out how the query works after just a few minutes puzzling over what the various arrows and brackets mean.

For what it’s worth: to help make sense of the jumble, Neo4j also has a set of graph data science packages to help identify patterns and calculate common measures of node significance such as degree centrality and betweenness centrality. Increasingly, there also are more and more software solutions becoming available for visualizing these graph data, ranging from Neo4j’s simple in-house solution Bloom all the way to virtual reality headsets. Needless to say, such queries, calculations, and visualizations are next to impossible to do in a spreadsheet alone.

The motivating principle behind this guide is to help you identify the appropriate technology for your data needs. Many times spreadsheets will actually be that appropriate technology: for storing non-relational data in a simple format, to collaborate with less tech-savvy users, to crunch some quick numbers, and even as a creative place for data-driven brainstorming and project planning. Spreadsheets work because they combine data entry, engineering, analysis, collaboration, and sharing all in a single space.

But the convenience of having all these features bound together can also be what makes spreadsheets so restrictive for your data. The reality might be that you need the associations of a relational database, or the adaptability of a document-oriented database, or the contextualization of a graph database. This is true even if you have poured hundreds of hours into perfecting a single spreadsheet — in fact, the more time you invest in spreadsheet, the more likely it is the spreadsheet is holding you back.

“Spreadsheets look really good no matter what goes into them. Someone brings a spreadsheet into a meeting, it looks so precise. It’s got numbers and graphs and pretty fonts. It feels like it’s telling you exactly how the world is. It feels like truth.”

— David Kastenbaum on Spreadsheets!

Spreadsheet jail happens because spreadsheets feel safe. Unlike database technologies, tools like Microsoft Excel and Google Sheets come bundled in home office suites and have the same familiar interfaces as our word processors and slideshow creators. Many of us first learned how to use spreadsheets in school, and sometimes wrangling a spreadsheet is the tie that binds together colleagues. Even on TikTok, communities are popping up dedicated to mastering the art of spreadsheets.

Making the leap. Photo illustration by author. Photo by Sammie Chaffin on Unsplash.

To liberate our data from spreadsheet jail requires a leap of faith. We have to believe that the benefits of a particular database model will outweigh the risks that come with adopting an unfamiliar technology. The good news is that we have room for error: it is easier to fix a messy database through a few lines of code than it is to fix clunky spreadsheets through hundreds of mouse clicks. Furthermore, choosing the wrong database technology to start with is far from the end of the world: once a spreadsheet has been re-organized into tables (for relational), collections (for document-oriented), or labels (for graph) it is much easier to translate between the three paradigms.

However, there might be an even more important reason to escape spreadsheet jail. As we liberate our data from spreadsheets, we also liberate our imaginations about what data are and how data can be utilized throughout an organization. Looking up and entering data goes from being a time-suck chore to becoming a contribution to an increasingly exciting and valuable store of information. Our data goes from being a clumsily-held internal secret to a living center of collaboration for team members and external partners.

Take one more hard look at your data. If spreadsheet jail is keeping you from achieving your goals, it is time to take a deep breath and make a courageous leap into the world of databases!

FOLLOW US ON GOOGLE NEWS

Read original article here

Denial of responsibility! Techno Blender is an automatic aggregator of the all world’s media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, all materials to their authors. If you are the owner of the content and do not want us to publish your materials, please contact us by email – [email protected]. The content will be deleted within 24 hours.
Leave a comment