From logs to metrics

Why and how to parse syslog messages

Leonardo Di Donato
7 min readFeb 12, 2019
Photo by Ivan Stern on Unsplash

This blog post series is comprised of two parts.

The former (this one) overviews why parsing logs is necessary and difficult, what syslog standards are and finally presenting an amazingly fast syslog parser built by me with Ragel and Go.

The second blog post will address how to use the TICK stack to transform logs into metrics.

In case you have only few minutes, the slides for a talk about this topic are available here.

Nowadays we generate a huge quantity of information: it’s so big that we are not able to exactly quantify it. And it keeps growing at an astonishing rate.

Everything around us generates logs: our code, servers, containers, infrastructure components etc.
Much of this entropy is text containing information (with various degrees of depth and detail) about the life cycle, the status, and the health of our applications.

We often forget that the purpose of this information is not just to support developers during the development phase but it could go much further.

To exploit such unstructured (or semi-structured) information, and turn it into something useful even during the life of the application from which it is issued, it is necessary to give it a structured form, to have a way to search, maybe understand, manipulate, synthesize and transform it into valuable insights.

During the development of our platforms at InfluxData the need to struggle against this entropy (almost totally in vain) emerged clearly, forcing us to address such topic first of all for ourselves.

The primary need was to create a blazing fast log parser to process and extract data from the super abundant quantity of logs we have. However, we soon realized that, while being an issue, such quantity was not the main problem. Rather, the most problematic aspects of this scenario are the diversity of the logs and sources, and the lack of a single standard format and protocol.

Syslog standards

During our initial research we encountered various common log standards and formats. For example, there is the so called BSD syslog format (RFC
3164
). It is dated 2001, widely used, with a lot of opensource tools claiming to implement it.

But since it is not very formal and does not present a well-defined grammar and rule, nor it discusses how to transport and frame the syslog messages, during years various customization have been created and used.

Therefore we shifted our focus towards a more recent (2009) standard with a unique and very clear definition, the RFC 5424. It standardizes the syslog protocol and obsoletes the one defined by RFC 3164.

Here the ABNF of the syslog protocol.

SYSLOG-MSG      = HEADER SP STRUCTURED-DATA [SP MSG]

HEADER = PRI VERSION SP TIMESTAMP SP HOSTNAME SP APP-NAME SP
PROCID SP MSGID
PRI = "<" PRIVAL ">"
PRIVAL = 1*3DIGIT ; range 0 .. 191
VERSION = NONZERO-DIGIT 0*2DIGIT
HOSTNAME = NILVALUE / 1*255PRINTUSASCII

APP-NAME = NILVALUE / 1*48PRINTUSASCII
PROCID = NILVALUE / 1*128PRINTUSASCII
MSGID = NILVALUE / 1*32PRINTUSASCII

TIMESTAMP = NILVALUE / FULL-DATE "T" FULL-TIME
FULL-DATE = DATE-FULLYEAR "-" DATE-MONTH "-" DATE-MDAY
DATE-FULLYEAR = 4DIGIT
DATE-MONTH = 2DIGIT ; 01-12
DATE-MDAY = 2DIGIT ; 01-28, 01-29, 01-30, 01-31 based on mm/yy
FULL-TIME = PARTIAL-TIME TIME-OFFSET
PARTIAL-TIME = TIME-HOUR ":" TIME-MINUTE ":" TIME-SECOND
[TIME-SECFRAC]
TIME-HOUR = 2DIGIT ; 00-23
TIME-MINUTE = 2DIGIT ; 00-59
TIME-SECOND = 2DIGIT ; 00-59
TIME-SECFRAC = "." 1*6DIGIT
TIME-OFFSET = "Z" / TIME-NUMOFFSET
TIME-NUMOFFSET = ("+" / "-") TIME-HOUR ":" TIME-MINUTE


STRUCTURED-DATA = NILVALUE / 1*SD-ELEMENT
SD-ELEMENT = "[" SD-ID *(SP SD-PARAM) "]"
SD-PARAM = PARAM-NAME "=" %d34 PARAM-VALUE %d34
SD-ID = SD-NAME
PARAM-NAME = SD-NAME
PARAM-VALUE = UTF-8-STRING ; chars ", \, and ] MUST be escaped
SD-NAME = 1*32PRINTUSASCII ; except =, SP, ], "

MSG = MSG-ANY / MSG-UTF8
MSG-ANY = *OCTET ; not starting with BOM
MSG-UTF8 = BOM UTF-8-STRING
BOM = %xEF.BB.BF
UTF-8-STRING = *OCTET ; UTF-8 string as specified in RFC 3629
OCTET = %d00-255
SP = %d32
PRINTUSASCII = %d33-126
NONZERO-DIGIT = %d49-57
DIGIT = %d48 / NONZERO-DIGIT
NILVALUE = "-"

It states that a syslog message is composed by a header, followed by a white space, a nullable structured data section, and an optional message body (prefixed with a white space when provided).

Nullable parts are fragments that can be substituted with a dash.

The header is composed by:

  • a priority value, a value between 0 and 191 which synthesizes the facility [0,23] and the severity codes [0, 7] — i.e., facility * 8 + severity
  • the protocol version, in the range [1,999]
  • a nullable RFC3339 (with microseconds precision) timestamp
  • a nullable hostname
  • a nullable application name
  • a nullable process identifier
  • a nullable message identifier

The structured data is an indexed list of hash maps, while message can contain UTF-8.

There is no upper limit to the size of the syslog message. Yet, the size limit is dictated by the transport mapping you will use.

What follows is an example of a syslog message.

<85>1 2018-10-11T22:14:15.003Z leodido - 31932 - [ex@31932 iut="3"] An auth token...

So it tells us that the it concerns authorization (facility is 10) and it is a notice (severity is 5) — i.e., 10 * 8 + 5. Its application name and message id are null. It has the timestamp information, the hostname, the structured data section, and a message.

In addition to RFC 5424, there is RFC 5425 which describes the use of TLS (the recommended transport) to provide a secure connection for the transport of RFC 5424 syslog messages. It also defines and recommends a framing technique, called octet-counting, which does not have the inherent problems of the non-transparent framing techniques (described into RFC 6587) observed with legacy syslog messages.

This simple but effective technique allows us to transport syslog messages across various packets.

It consists in prefixing the entire syslog message with a number (and a white space), indicating to the receiver the number of bytes from which the message is composed.

84 <85>1 2018-10-11T22:14:15.003Z leodido - 31932 - [ex@31932 iut="3"] An auth token...

After this overview on the syslog protocols, it is time to have a look at the library built to parse such log messages.

A blazingly fast syslog parser

We chose Ragel to create a golang syslog parser strictly and robustly following the RFC 5424 format. It also provides the pieces to parse streams of syslog messages transported following various RFCs.

For example:

  • TLS with octet-counting (RFC 5425)
  • TCP with non-transparent framing or with octet-counting (RFC 6587)
  • UDP carrying one message per packet (RFC 5426)

Why Ragel?

It is a state machine compiler. It generates parsers from regular languages for various host languages with almost inline code (thus really fast). Furthermore it is able to execute code at arbitrary points during the recognition of the regular language, and it provides guarded operators to control any non-determinism, plus other operators to influence the execution of the state machine jumps between nodes and edges.

You can find such library at github.com/influxdata/go-syslog.
It also provides a builder to generate valid syslog messages.

I also plan to add a parser for legacy syslog messages (RFC 3164), which is in the making at the moment but not top priority.

Parsing syslog messages

Since its usage is really simple, here is an example.

As you can see in the previous example, the parser provides a modality that I called “best effort”.

It allows the user to obtain partial information from not completely valid syslog messages. Using it the parser will stop at the point where the syslog message is not valid anymore and if it contained at least priority and version data return back the extracted fields.

As a consequence, this mode is useful to obtain information (albeit partial) instead of losing it completely. It turns to be very helpful when the likelihood of receiving partially invalid syslog message is high. Okay, you will lose the structured data section maybe, or something else, but not the event (point in time) that a syslog message represents.

Now suppose a syslog message (maybe partially valid) comes every second and we want to be able to parse them as they arrives.

Thus we need to launch our parser in a goroutine. As soon it parses the received syslog message we want to define a mechanism to send the parsing result into a channel.

This custom mechanism is a simple function that describes how we want to handle result. We have to pass it to the syslog.WithListener(...) option of our parser.

Performances

The library provides benchmarks you can run on your machine cloning the repository and executing make bench.

However on an Intel Core i7–7600U CPU @ 2.80GHz it usually takes:

  • ~250ns to parse the smallest legal message
  • ~2µs to parse an average legal message
  • ~4µs to parse a very long legal message
Reasonable performances.

A table containing benchmarks for various kind of syslog messages can be found here.

In the next blog post we will see how to put some InfluxData pieces together to parse Kubernetes logs.

See you there! :)

--

--

Leonardo Di Donato

Crafting software 💻 ▪ I make Falco 🦅 ▪ In love with OpenSource and Linux 🧝🧞‍♂️ ▪ Currently @ Sysdig ⛏