Did you know we sometimes use the DSL approach for data storage?
Pieter Olivier
Many legacy applications store data in binary files by using generic database systems like Microsoft Access, proprietary libraries or homegrown binary formats. All these methods have one thing in common: the data is stored in an opaque binary format that only the application can decipher.
If such an application is used in-house to edit data whose correctness is vital to the workings of the organization, solid version management practices are essential. For instance, inspection of changesets, peer review of changes, and reliable merging of multiple versions should be used to guard the integrity of the data. However, given the binary nature of the data, this is difficult at best, because every version management operation needs an application-specific implementation to be effective. In practice, version management is therefore limited and often relies on complicated procedures to avoid version mayhem.
In this post, we offer a solution to this problem by introducing domain-specific languages as a way to store declarative descriptions of non-code artefacts. This unlocks the full power of generic version control tools.
Surveys-R-us: introduction
We will exemplify some of the points we want to make in this post by using a hypothetical company that is loosely based on a real project: Surveys-R-us! The company offers surveying services for market research. They have developed several applications. One of these is their survey editor, which stores the forms in an off-the-shelf key/value database.
Generic version control systems and binary data
For software source code, the version management problem is mostly solved: from the inception of SCCS around 1972 onwards, source code version control systems have steadily increased in capabilities and prowess. Version control tools can merge different versions of text files, often automatically or with a minimum of supervision. Sometimes, changes clash, and then human insight is needed to decide how to merge two versions.
Nowadays, Git has more or less won the battle and is the most used version control system worldwide. But, although most version control systems can handle binary data, they cannot effectively show differences or perform a merge between two versions of binary files because that requires knowledge about what the binary data actually represents.
In some cases, application-specific version control tools are implemented that understand the application data and can work directly with the binary data. However, taking full advantage of the benefits that good version management can offer requires a whole stack of version control tools to be implemented and maintained. Whenever the binary data format changes, the version control tools also need to be updated, often with backwards-compatibility support built in to also support the old format. Such version control tools run a high risk of becoming a maintenance nightmare themselves.
Surveys-R-us: version trouble
At a certain point, thousands of surveys have been created for clients of the company. Evolution of these surveys is common: based on feedback from their clients, surveys are updated regularly and often multiple versions of the same survey are in use at the same time.
Company policy requires that each change be checked by an employee other than the author. Yet because it is not always clear what has changed, errors often go undetected before publication.
Moreover, the survey editor also evolved, so the data schema used by the application has changed over time, mostly getting richer. Care has been taken to ensure that old survey data is migrated each time the data structure changes.
Version issues crop up regularly and version management has become a major headache. It is infeasible to compare survey definitions from database snapshots from the past with current versions.
DSLs to the rescue
Instead of writing data to an opaque binary format, a good solution can be to write the data to text files in a domain-specific language (DSL). A good quality DSL captures business data in a way that is self-contained: a domain expert should be able to understand the content of a file written in a DSL without the help of application-specific tools.
With the introduction of such a DSL, standard version control tools become a viable option for version management.
Surveys-R-us: QL
After a careful analysis of the problem, the Surveys-R-us management contracted Swat.engineering to help solve their version management issues.
Note that in this post, we will use an existing questionnaire language called QL, whereas, for the actual client, we developed a brand new language that better suited their needs.
Here is a simple example of a survey written in QL:
    form job-hopping {
      "Are you currently employed?" employed: boolean
      if (employed) {
        "Are you happy in your current job?" happy: boolean
      }
      "Are you looking for a job?" looking: boolean
    }
With relatively minor changes to the existing codebase, the old key/value database was replaced with storage in QL text files and all the old data was migrated to QL files while ensuring no information was lost in the process. These QL files were then brought under version control.
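One way to gain confidence that a migration like this loses nothing is a round-trip check: write each record as text, read the text back, and compare the result with the original. The sketch below is only an illustration in Python, not the tooling used in the project. The shape of `LEGACY_RECORD` is a hypothetical stand-in for a key/value entry, and the helpers `to_ql`, `from_ql` and `migrate` are our own names that only handle flat forms without `if` blocks or quotes inside labels.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    label: str
    name: str
    type: str


# Hypothetical shape of one survey as read from the legacy key/value store.
LEGACY_RECORD = {
    "name": "job-hopping",
    "questions": [
        ("Are you currently employed?", "employed", "boolean"),
        ("Are you looking for a job?", "looking", "boolean"),
    ],
}

# Matches one flat question line such as: "Label?" name: type
QUESTION_RE = re.compile(
    r'^\s*"(?P<label>[^"]*)"\s+(?P<name>\w+):\s*(?P<type>\w+)\s*$'
)


def to_ql(form_name: str, questions: list[Question]) -> str:
    """Render a flat form as QL-like text, one question per line."""
    lines = [f"form {form_name} {{"]
    lines += [f'  "{q.label}" {q.name}: {q.type}' for q in questions]
    lines.append("}")
    return "\n".join(lines) + "\n"


def from_ql(text: str) -> tuple[str, list[Question]]:
    """Parse the text produced by to_ql back into a name and questions."""
    lines = text.strip().splitlines()
    header = re.fullmatch(r"form\s+(\S+)\s*\{", lines[0].strip())
    if header is None or lines[-1].strip() != "}":
        raise ValueError("not a flat form")
    questions = []
    for line in lines[1:-1]:
        match = QUESTION_RE.match(line)
        if match is None:
            raise ValueError(f"cannot parse line: {line!r}")
        questions.append(Question(**match.groupdict()))
    return header.group(1), questions


def migrate(record: dict) -> str:
    """Convert one legacy record to text and verify the round trip."""
    questions = [Question(*fields) for fields in record["questions"]]
    text = to_ql(record["name"], questions)
    if from_ql(text) != (record["name"], questions):
        raise ValueError("round trip lost information")
    return text


if __name__ == "__main__":
    print(migrate(LEGACY_RECORD))
```

Running the same check over every record in the old database gives a simple safety net before the text files replace the original storage.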
Diff readability
Readability of files is one thing, readability of differences is another. Diff readability has a profound impact on the quality of change reviews and productivity while reviewing, and even more so when merging changes from two different versions.
A “version control friendly” DSL should be designed in such a way that version control tools can generate readable differences between two versions. This helps not only when inspecting or reviewing differences but also when merging two versions.
The crucial aspect here is that version control tools tend to be line-oriented. Algorithms that calculate the difference between two versions (“diffs”) usually work per line. Therefore, heavy use of block delimiters can severely limit the readability of these diffs. Examples of DSLs whose design constrains the readability of diffs are JSON, XML, and their close cousin HTML. The choice between indentation-based and curly-brace-based blocks has a similar effect on diffs.
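To make the line-oriented behaviour concrete, here is a small sketch that uses Python's standard `difflib` module as a stand-in for the diff engine of a version control tool. It compares the same one-word change in two layouts of the example survey: one statement per line, and everything squeezed onto a single line. The helper names `changed_lines` and `flatten` are ours, and the edited question text is made up for the illustration.

```python
import difflib

OLD = """form job-hopping {
  "Are you currently employed?" employed: boolean
  if (employed) {
    "Are you happy in your current job?" happy: boolean
  }
  "Are you looking for a job?" looking: boolean
}"""

# A one-word change to a single question label.
NEW = OLD.replace("looking for a job?", "looking for a new job?")


def changed_lines(old: str, new: str) -> list[str]:
    """Return only the removed ('-') and added ('+') lines of a unified diff."""
    diff = list(
        difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="", n=0)
    )
    # The first two entries are the '---'/'+++' file headers;
    # hunk headers start with '@@' and are filtered out below.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


def flatten(text: str) -> str:
    """Put the whole form on one line, as a compact serializer might."""
    return " ".join(text.split())


if __name__ == "__main__":
    print("One statement per line:")
    for line in changed_lines(OLD, NEW):
        print(line)
    print()
    print("Everything on one line:")
    for line in changed_lines(flatten(OLD), flatten(NEW)):
        print(line)
```

In both layouts the diff reports one removed and one added line. With one statement per line, those two lines contain just the edited question. In the single-line layout, each of them contains the entire form, and the reviewer has to hunt for the changed word.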
Surveys-R-us: code reviews
With the introduction of the QL DSL and the use of off-the-shelf review tools, reviewing changes is now a breeze. For instance, if the example survey needs a more complex answer to one of the questions, the diff could look like this:
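Old:

    form job-hopping {
      "Are you currently employed?" employed: boolean
      if (employed) {
        "Are you happy in your current job?" happy: boolean
      }
      "Are you looking for a job?" looking: boolean
    }

New:

    form job-hopping {
      enum Happy = { yes, no, sometimes }
      "Are you currently employed?" employed: boolean
      if (employed) {
        "Are you happy in your current job?" happy: Happy
        if (happy != yes) {
          "Are you looking for a job?" looking: boolean
        }
      }
    }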
A reviewer could check and sign off on this change in seconds. Merging changes from two different versions can be done automatically if the changes are independent. Conflicting changes require human intervention, as only a human can decide how to resolve such conflicts. However, here, readable diffs will help enormously in understanding the conflict.
Adapting the processes within an organization to work with standard version control tools can be a challenging step. The tremendous advantages obtained from solid version management can help build support for this transition.
Key takeaways
- Writing data as text documents in a correctly designed DSL can offer enormous value for version management.
- Homegrown version management tooling is expensive and hard to maintain.
- The introduction of a high-quality DSL enables the use of off-the-shelf version control tools, which can often deliver huge value for a relatively small investment.
In future posts of this series, we will take a closer look at what it takes to replace a legacy database with DSLs to substantiate this investment claim.
What to expect next
The advantages of using a DSL to store your data are clear, but how do we reach this stage from an existing application? We will continue this series with two posts that describe an evolutionary approach to replacing binary data with textual DSLs.
Get in touch
Does this post strike a nerve? If your application(s) could benefit from a DSL-centric approach to storing data then reach out to us. We can help you analyze and solve the problems you’re facing.