Effective development of parsers for binary data

Bird is a declarative binary parser generator. I will introduce its features and demonstrate the VS Code extension we developed.

by Rodin Aarssen on 26 Feb 2025



Writing and maintaining parsers for the binary data that surrounds us is a tough challenge. Enter Bird, a domain-specific language developed by Swat.engineering. It provides a declarative approach to describing a binary file format and fully automates the process of creating and maintaining a parser for it. Bird is an enabler for building tools for network and performance monitoring, cybersecurity, and digital forensics.

Why is parsing binary data relevant?

Binary data is everywhere: network packets entering and leaving our laptops enable our Internet connection; Word documents, PDFs, photos, and videos are omnipresent in daily life. Binary data is fine as long as it is used via tailored applications that effectively hide its binary nature. But sometimes, there is a need to look behind the scenes and monitor or search binary data. For example: Is there a DDoS attack on this network? Can we detect any intrusions? Is there illegal content (such as CSAM) hidden in this file? Are people sending malformed files to trigger exploits in our system? Can we reconstruct the contents of this confiscated, partially erased disk? Each of these cybersecurity tasks requires parsing (decomposing) binary data at the bit level.

Writing parsers for binary data is hard: it is technically intricate, many different binary file formats exist, and these formats are continuously evolving. This is a development and maintenance nightmare that requires a solution.

Why is it so hard to parse binary data?

When we consider the challenges of “parsing”, we typically think about the obstacles we need to tackle for textual languages, such as resolving ambiguity and operator precedence. While these are difficult enough in their own right, parsing binary data also entails other considerations. Here are some of these challenges.

  • Jumping around through the input. While textual languages are usually parseable from top to bottom, parsing binary data might involve non-linear parsing, e.g., TCP packets arriving out of order, serialized binary trees, or fragmented file systems (linked lists).
  • Poorly documented file formats. Binary data is usually described by a data format. While some formats are described nicely (e.g., PNG), many file formats are poorly documented or proprietary. Furthermore, internal data formats of mobile applications, for instance, may change often and without announcement.
  • Data-dependent parsing. Part of the binary data might influence how another part of the input should be interpreted. For example, the header of an IPv4 packet contains a field (IHL) that gives the length of the header itself, because the header may or may not carry options.
  • Low-level encoding intricacies, such as endianness, (un)signedness, and character encoding, are sometimes mixed within the same stream.
  • Nested parsing by combining multiple file formats (e.g., parsing a JPEG in a ZIP that is contained in a FAT32 image transmitted over TCP, which was sent over IP).
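
Two of the challenges above, data-dependent parsing and endianness, can be made concrete with a small hand-written sketch in Python. This is not Bird and not what Bird generates; the function name parse_ipv4_header and the keys of the returned dictionary are made up for illustration. The sketch reads the IPv4 header-length field (IHL, counted in 32-bit words) to decide where the options end, and reads the total-length field as a big-endian 16-bit integer.

```python
import struct


def parse_ipv4_header(packet: bytes) -> dict:
    """Split an IPv4 packet into options and payload (illustrative sketch)."""
    if len(packet) < 20:
        raise ValueError("an IPv4 header is at least 20 bytes long")

    version = packet[0] >> 4        # high nibble of the first byte
    ihl = packet[0] & 0x0F          # low nibble: header length in 32-bit words
    header_len = ihl * 4            # data-dependent: decides where options end

    if version != 4 or ihl < 5 or len(packet) < header_len:
        raise ValueError("malformed IPv4 header")

    # Total length is stored in network byte order (big-endian), hence ">H".
    (total_length,) = struct.unpack_from(">H", packet, 2)

    return {
        "header_len": header_len,
        "total_length": total_length,
        "options": packet[20:header_len],        # empty when ihl == 5
        "payload": packet[header_len:total_length],
    }
```

Reading the same two length bytes as little-endian would yield a different number, which is why a format description has to state the byte order explicitly.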

Bird: a declarative solution for parsing binary data

Bird is a domain-specific language developed by Swat.engineering in collaboration with the Netherlands Forensic Institute (NFI). It is well documented and freely available on GitHub. Bird separates the technical aspects of parsing from the structural and functional details of the binary data to be parsed. Bird enables users to declaratively describe the format of binary data. Such a specification is automatically translated to Java code interfacing with Nest, Bird’s runtime parsing engine. Complex aspects of binary data such as endianness, signedness, and character encoding are first-class citizens of the Bird language as well. Furthermore, Bird specifications are fully modular and reusable. Thanks to its declarative nature and the parser generation capabilities of its supporting tools, Bird makes developing and maintaining parsers for a wide range of data formats easier and cheaper.

A specification for portable network graphics (PNG)

PNG is a widely used image format for raster graphics and is extensively described in ISO/IEC 15948:2004 (80 pages). Its specification is one of the examples that are packaged with Bird. Let us dissect the PNG specification to get an impression of Bird’s capabilities. A PNG file starts with a file header (called Signature), followed by a number of Chunks, and finally, an end Chunk (called the IEND chunk). This is encoded in Bird as follows:

struct PNG@(encoding = US_ASCII, endianness = BIG) {
    Signature signature
    Chunk[] chunks
    IEND end
}

Each line states a type, an identifier (where “_” can be used to omit an identifier), and optionally a side condition starting with “?”. The character encoding and endianness of the PNG specification are specified by tags on the PNG struct. The PNG signature is specified as follows:

struct Signature {
    u8 _ ?(== 0x89)
    byte[] _[3] ?(== "PNG")
    byte[] _[4] ?(== <0x0d, 0x0a, 0x1a, 0x0a>)
}

The PNG signature is fixed: it starts with a single byte 0x89 (type u8, “8 unsigned bits”, is shorthand for a single byte, i.e., byte[] id[1]), followed by the 3 bytes that represent the string PNG in ASCII and four fixed bytes (designed to let text editors know it is not a textual file).
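
For comparison, here is a minimal hand-written check of the same eight bytes in Python. It is only a sketch of what the Signature struct expresses, not the code Bird generates; the names PNG_SIGNATURE and has_png_signature are chosen for this example.

```python
# The eight fixed bytes, assembled in the same three parts as the Bird struct.
PNG_SIGNATURE = bytes([0x89]) + b"PNG" + bytes([0x0D, 0x0A, 0x1A, 0x0A])


def has_png_signature(data: bytes) -> bool:
    """Return True if data starts with the fixed PNG signature."""
    return data[:8] == PNG_SIGNATURE
```

In the hand-written version the three side conditions collapse into one opaque byte comparison; the Bird struct keeps each part and its expected value visible.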

Now, let us look at chunks that represent the actual binary data in the PNG file:

struct Chunk {
    u32 length
    byte[] type[4] ?(!= "IEND")
    byte[] data[length.as[int]]
    u32 crc ?(this.as[int] == crc32(type ++ data))
}

Each chunk starts with its length (4 bytes, type u32), followed by its type (a byte array of 4 bytes), which cannot be IEND (the type of the end chunk). The chunk data is next, and this showcases data-dependent parsing: the number of bytes is not fixed but is bound to the length encountered a few bytes back (length.as[int] is an expression to map the byte-typed length field to an integer). A chunk ends with a checksum over the type and data.
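
To see what this struct saves you from writing, the following Python sketch parses one such chunk by hand. It is an illustration under assumptions, not Bird’s generated code: the function name parse_chunk and its return shape are invented here, and zlib.crc32 from the Python standard library stands in for the crc32 function used in the Bird specification.

```python
import struct
import zlib


def parse_chunk(data: bytes, offset: int = 0):
    """Parse one non-IEND PNG chunk; return (type, body, next_offset)."""
    # u32 length, big-endian as declared on the PNG struct.
    (length,) = struct.unpack_from(">I", data, offset)

    chunk_type = data[offset + 4 : offset + 8]
    if len(chunk_type) != 4 or chunk_type == b"IEND":
        raise ValueError("not a regular chunk")

    # Data-dependent parsing: the body size comes from the length field.
    body_end = offset + 8 + length
    body = data[offset + 8 : body_end]
    if len(data) < body_end + 4:
        raise ValueError("chunk is truncated")

    # The checksum covers the type and the data, not the length.
    (crc,) = struct.unpack_from(">I", data, body_end)
    if crc != zlib.crc32(chunk_type + body):
        raise ValueError("checksum mismatch")

    return chunk_type, body, body_end + 4
```

The returned offset lets a caller parse chunks one after another; every bounds check and byte-order decision in this sketch is implicit in the four-line Bird struct.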

The end chunk is structured similarly, except that its length is 0. It therefore has no data field, and its checksum, computed over the type alone, is fixed:

struct IEND {
    u32 length ?(this.as[int] == 0)
    byte[] type[4] ?(== "IEND")
    u32 crc ?(== <0xae, 0x42, 0x60, 0x82>)
}
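
The fixed checksum in the IEND struct can be reproduced with a short Python sketch. The helper name iend_chunk is made up for this example, and zlib.crc32 from the Python standard library is assumed to compute the same CRC-32 as the checksum in the specification.

```python
import zlib


def iend_chunk() -> bytes:
    """Build the 12-byte IEND chunk: zero length, type, CRC over the type."""
    chunk_type = b"IEND"
    crc = zlib.crc32(chunk_type)          # no data, so only the type is hashed
    return (0).to_bytes(4, "big") + chunk_type + crc.to_bytes(4, "big")


# The last four bytes match the constant in the Bird struct: ae 42 60 82.
assert iend_chunk()[-4:] == bytes([0xAE, 0x42, 0x60, 0x82])
```

Because the chunk carries no data, these twelve bytes are identical at the end of every well-formed PNG file.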

These few, but admittedly technically detailed, lines of Bird specification are enough to generate a parser for binary data in PNG format. This small example also nicely demonstrates Bird’s solutions for many of the problems encountered in binary parsing that we discussed above.

Developing Bird specifications

Bird has been developed using the Rascal metaprogramming language. Bird language support is available for Visual Studio Code, where the following features are provided.

  • Syntax highlighting
  • Jump to definition
  • Type checker
  • Visualization of the specified data formats
  • Code lenses
  • Inlay hints
  • Automatic translation to Java

The Bird repository contains several example specifications, including communication-related file formats (TCP/IP, UDP, IPv4, IPv6, DNS, PCAP) and image file formats (BMP, JPEG, PNG). These specifications serve not only as inspiration but can also be used on real-world data.

The following screenshot shows a snippet of the Bird specification for PNG in the VS Code IDE. Several of these IDE features are visible in the image, including syntax highlighting, code lenses, and inlay hints.

VS Code Bird IDE features

Bird’s type checker reports meaningful messages when a specification contains mistakes. Refer to Typepal (to be published) for more information about the type checker.

IDE showing a type checker error

The visualization feature for Bird in VS Code provides a fast bird’s-eye (pun intended) view of definitions and uses of structs. The following screenshot shows the structure of an Image struct that is either a PNG or a JPEG; this is generated on the fly from the source code by the tooling. For more information, see lang::bird::LanguageServer.

VS Code jumping to a definition via a graph visualization

Key takeaways

  • Developing and maintaining binary parsers is hard.
  • Bird is a domain-specific language tailored towards the quick development of parsers for binary data. It can make a crucial difference in developing and maintaining binary parsers when many different binary file formats are involved or when the file formats rapidly evolve.
  • Thanks to Bird’s integration in VS Code, developing binary parsers has become a joy, and such parsers can be used more widely for many purposes (cybersecurity or otherwise).

Get in touch

Do you face DSL design and implementation issues or other language engineering challenges? Then reach out to us. We look forward to discussing how our solutions could help you.



Header image: photo by cottonbro studio
