Developing Notes of ASTJSON Library

After reading about compilers, I found myself intrigued by the many design and implementation details they involve. It's an exciting topic for me. Inspired by the API exposed by the Haskell Aeson library, I wrote a toy library named xieyuschen/astjson that parses a JSON string into an AST in a slightly functional way. This page collects my development notes and reflections.

Basic Concepts

Lexer and Token

Tokens are the valid combinations of characters defined by the specification, such as {, [, :, "string", and so forth.

For example, the JSON specification defines Whitespace to include space, linefeed, carriage return, and horizontal tab; all four are interchangeable as far as JSON is concerned.

A lexer is required to produce tokens from the raw bytes. It reads the bytes and processes them as a state machine.
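The byte-driven state machine can be sketched as a dispatch loop over the input. This is a minimal illustration only, not the library's actual code; the `lexer` struct, `next` method, and the reduced token set here are assumptions for the example (structural tokens only, whitespace skipped inline).

```go
package main

import "fmt"

type Type int

const (
	tkObjectStart Type = iota
	tkObjectEnd
	tkColon
	tkComma
	tkEOF
	tkUnknown
)

// lexer is a minimal sketch: a single cursor over the raw bytes.
// The real implementation keeps richer state.
type lexer struct {
	bs  []byte
	pos int
}

// next dispatches on the current byte like a state machine:
// structural characters map directly to tokens, whitespace is skipped.
func (l *lexer) next() Type {
	for l.pos < len(l.bs) {
		b := l.bs[l.pos]
		l.pos++
		switch b {
		case ' ', '\n', '\r', '\t':
			continue // whitespace separates tokens but produces none here
		case '{':
			return tkObjectStart
		case '}':
			return tkObjectEnd
		case ':':
			return tkColon
		case ',':
			return tkComma
		default:
			return tkUnknown
		}
	}
	return tkEOF
}

func main() {
	l := &lexer{bs: []byte(" { } ")}
	fmt.Println(l.next() == tkObjectStart, l.next() == tkObjectEnd, l.next() == tkEOF)
}
```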

AST and Parser

The AST is the structure that carries the value and its type; it no longer cares about tokens. Once analyzed, the AST represents the hierarchical relations of the data but says nothing about what the original bytes looked like. In other words, the AST extracts the core information from a series of bytes and discards the trivial details. Hence, we could convert an AST into any other configuration format, for example yaml, xml, etc.

The parser is the component that turns tokens into an AST. It reads the tokens in order and validates them against the grammar defined in the specification.

Grammar

Grammar focuses on the rules for converting tokens into an AST. It ignores individual characters, because those are the lexer's responsibility; instead, it stresses the validity of compositions of multiple tokens. For example, after a { token, the grammar expects a String token, with any number of Whitespace tokens between them. It is indifferent to which character produced a Whitespace token, but it cares whether the composition of tokens ({ and String, with any whitespace between them) satisfies the grammar.
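The "{ then String, with whitespace allowed between" rule can be expressed as a tiny predicate over token types. This is a toy illustration of a single grammar rule, not the library's parser; the `afterObjectStartOK` function name and the reduced token set are assumptions for the example.

```go
package main

import "fmt"

type Type int

const (
	tkObjectStart Type = iota
	tkObjectEnd
	tkString
	tkWhiteSpace
)

// afterObjectStartOK is a toy grammar rule: once tkObjectStart has been
// consumed, the next significant token must be tkString (a key) or
// tkObjectEnd (an empty object); whitespace tokens in between are skipped.
func afterObjectStartOK(toks []Type) bool {
	for _, t := range toks {
		if t == tkWhiteSpace {
			continue
		}
		return t == tkString || t == tkObjectEnd
	}
	return false
}

func main() {
	fmt.Println(afterObjectStartOK([]Type{tkWhiteSpace, tkString})) // valid
	fmt.Println(afterObjectStartOK([]Type{tkObjectStart}))         // invalid
}
```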

Lexer Implementation

Terminals(Tokens)

We need to define the terminals (tokens) before writing the lexer. The tokens in JSON are quite simple, as the language allows neither variables nor complex syntax.

// token represents a json token.
// currently, token only supports a limited set of json values and is
// restricted to primitive types.
type token struct {
    tp Type

    // the token value is [ leftPos, rightPos)
    // index starts at 0
    leftPos, rightPos int

    // hasDash and isFloat only make sense for tkNumber because we don't
    // want to lose precision
    hasDash, isFloat bool
}
Type Definition
const (
  tkWhiteSpace Type = iota
  tkString
  tkNumber
  tkBool
  tkNull
  tkEOF
  tkObjectStart
  tkObjectEnd
  tkArrayStart
  tkArrayEnd
  tkComma
  tkColon
)

An optimization is made for numbers because we suffered a precision loss while parsing the maximum uint64. Before the optimization, I used float64 to store all numbers, and it worked well in several cases. However, when the maximum uint64 (\(2^{64} - 1\)) is assigned to a float64 and cast back to uint64, it overflows.

Hence, I defined the token with more details to benefit the Number type. While parsing a number, I check whether a dot or a negative sign exists. Moreover, I treat Number as the default case when no other case matches while scanning the bytes.
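The dash/dot detection can be sketched as follows. The `scanNumber` function is a hypothetical, simplified stand-in for the library's number scanning (it also folds exponents into `isFloat` and does not validate the full number grammar):

```go
package main

import "fmt"

// scanNumber is a sketch of recording the details the number token needs:
// whether a leading minus or a dot/exponent appeared, so the parser can
// later pick uint64, int64, or float64 without losing precision.
func scanNumber(bs []byte) (hasDash, isFloat bool, width int) {
	i := 0
	if i < len(bs) && bs[i] == '-' {
		hasDash = true
		i++
	}
	for i < len(bs) {
		switch b := bs[i]; {
		case b >= '0' && b <= '9':
			// digits extend the number
		case b == '.' || b == 'e' || b == 'E':
			isFloat = true
		default:
			return hasDash, isFloat, i
		}
		i++
	}
	return hasDash, isFloat, i
}

func main() {
	d, f, _ := scanNumber([]byte("-12.5"))
	fmt.Println(d, f) // prints: true true
}
```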

Lexer

A lexer receives a series of bytes and produces a stream of tokens. Hence a lexer keeps two cursors (the left and right positions of the current token) as it proceeds through the bytes. Predictive scanning is used when parsing the bool and null types, as the only valid tokens starting with the characters t, f, and n are true, false, and null.

The lexer has some details while scanning the string and number.

While scanning the string type, we must handle some special cases:

  • the backslash escape \.
  • hex digits such as \ucdef

Adding additional checks solves them. Moreover, predictive scanning is helpful here, as the cases are very limited.
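Those escape checks can be sketched as a small validator. The `validEscape` helper is hypothetical, written only to illustrate the two cases above (a single escape character, or \u followed by exactly four hex digits):

```go
package main

import "fmt"

// validEscape sketches the extra checks string scanning needs: after a
// backslash, either one legal escape character or a \uXXXX sequence with
// exactly four hex digits must follow.
func validEscape(bs []byte) bool {
	if len(bs) < 2 || bs[0] != '\\' {
		return false
	}
	switch bs[1] {
	case '"', '\\', '/', 'b', 'f', 'n', 'r', 't':
		return true
	case 'u':
		if len(bs) < 6 {
			return false
		}
		for _, h := range bs[2:6] {
			if !isHex(h) {
				return false
			}
		}
		return true
	}
	return false
}

func isHex(b byte) bool {
	return b >= '0' && b <= '9' || b >= 'a' && b <= 'f' || b >= 'A' && b <= 'F'
}

func main() {
	fmt.Println(validEscape([]byte(`\ucdef`)), validEscape([]byte(`\x`))) // prints: true false
}
```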

When scanning the number type, the float and negative flags are stored for the AST parser.

AST Definition

JSON has a limited set of node types: Array, Object, Null, Bool, String, and Number. Hence, their definitions are simple, but there are still some considerations when defining them.

Facade of AST Node

Because the Go language doesn't support sum types, such as type Number = uint64 | int64 | float64, a wrapper that can represent all types is required. Inside the astjson lib, I defined Value as that wrapper:

// NodeType represents an AST node type
//
//go:generate stringer -type=NodeType
type NodeType uint

const (
    Number NodeType = iota
    Null
    String
    Bool
    Object
    Array
)

// Value is the concrete AST representation
type Value struct {
    NodeType

    // AstValue stores the real value of an AST Value
    // the literal type(Number, String, Bool and Null) stores them by value
    // the Object and Array stores them by pointer
    AstValue interface{}
}

With the meta information, we can cast the AstValue inside Value to the corresponding type. What's more, the library provides some utility functions to cast the AstValue to a concrete type. These functions panic when the cast fails, matching the behavior of the reflect.Value methods such as Int(), Bool(), and so forth.
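One such panicking accessor might look like the sketch below. The `Bool` method here is an assumption written against the Value definition above, not the library's actual implementation; only the panic-on-mismatch behavior is the point.

```go
package main

import "fmt"

type NodeType uint

const (
	Number NodeType = iota
	Null
	String
	Bool
)

// Value mirrors the definition shown above.
type Value struct {
	NodeType
	AstValue interface{}
}

// Bool sketches a utility cast: like reflect.Value.Bool, it panics when
// the node is not a Bool instead of returning an extra error value.
func (v Value) Bool() bool {
	if v.NodeType != Bool {
		panic("astjson: not a Bool node")
	}
	return v.AstValue.(bool)
}

func main() {
	v := Value{NodeType: Bool, AstValue: true}
	fmt.Println(v.Bool()) // prints: true
}
```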

Why Reflect Casts Panic on a Type Mismatch

The reflect.Value holds both the type and the value inside it. It provides utility methods to cast the wrapper to a concrete type. However, what if a method is called with the wrong desired type? The reflect package chooses to panic instead of returning an error, which keeps the return signature clean.

// Bool returns v's underlying value.
// It panics if v's kind is not Bool.
func (v Value) Bool() bool {
  // panicNotBool is split out to keep Bool inlineable.
  if v.kind() != Bool {
    v.panicNotBool()
  }
  return *(*bool)(v.ptr)
}

Number AST Node Definition

As mentioned above, to solve the precision loss, a number token carries more details.

Number AST Definition

type numberType uint

const (
  floatNumber numberType = iota
  // unsignedInteger means the value can be stored inside a uint64
  unsignedInteger
  // integer means the value can be stored inside an int64
  integer
)

type NumberAst struct {
  nt numberType
  f  float64
  u  uint64
  i  int64
}

It provides utilities to retrieve the value as whichever type you need.

func (n NumberAst) GetInt64() int64
func (n NumberAst) GetUint64() uint64
func (n NumberAst) GetFloat64() float64
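A getter like GetInt64 can dispatch on the stored numberType and convert from whichever field was populated. This is a sketch built against the NumberAst definition above; the library's real conversion and overflow rules may differ.

```go
package main

import "fmt"

type numberType uint

const (
	floatNumber numberType = iota
	unsignedInteger
	integer
)

// NumberAst mirrors the definition shown above.
type NumberAst struct {
	nt numberType
	f  float64
	u  uint64
	i  int64
}

// GetInt64 sketches how a getter converts from the populated field:
// the numberType tag tells it which of f, u, or i is authoritative.
func (n NumberAst) GetInt64() int64 {
	switch n.nt {
	case unsignedInteger:
		return int64(n.u)
	case integer:
		return n.i
	default:
		return int64(n.f)
	}
}

func main() {
	fmt.Println(NumberAst{nt: integer, i: -7}.GetInt64()) // prints: -7
}
```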

The Other AST Node Definitions

An Object in JSON is a map with String keys and values of any type. Initially, I defined a map[*Value]*Value to store the key-value pairs. However, as the key is always a String, I use string directly as the key type.

type NullAst struct{}
type BoolAst bool
type StringAst string

type ObjectAst struct {
    KvMap map[string]Value
}

type ArrayAst struct {
    Values []Value
}

The concern for the literal types Null, Bool, and String is whether to define a type alias or a new type. I chose to define a new type:

  • option 1 (chosen): define a new type: type BoolAst bool
  • option 2: define a type alias: type BoolAst = bool

I decided to make all types consistent and defined by the library, so option 1 was picked.
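The practical difference between the two options shows up at call sites: a new type requires an explicit conversion, while an alias would accept a plain bool silently. A small sketch (the `describe` function is made up for illustration):

```go
package main

import "fmt"

// option 1: a distinct type; a plain bool does not satisfy it implicitly,
// so every AST leaf type stays owned by the library
type BoolAst bool

func describe(b BoolAst) string {
	if bool(b) {
		return "true node"
	}
	return "false node"
}

func main() {
	// an explicit conversion is required here; with a type alias
	// (type BoolAst = bool) this conversion would be unnecessary
	fmt.Println(describe(BoolAst(true))) // prints: true node
}
```

With option 2, `describe(true)` would compile as well, blurring the boundary between library-defined AST values and ordinary Go values.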

Todo Topics

  • Parser Implementation
  • Decoder Implementation
  • Walker Implementation