# Datalog

Datalog is a relational language for deductive reasoning. In the last lesson, we write our first equality saturation program in egglog, but you can also write rules for deductive reasoning a la Datalog. In this lesson, we will write several classic Datalog programs in egglog. One of the benifits of egglog being a language for program optimization is that it can talk about terms natively, so in egglog we get Datalog with terms for free.

In this lesson, we will define multiple relations with the same name, so we use the (push) and (pop) commands to clone and reset the database. Under the hood, (push) clones the current database and pushes it onto a stack, and (pop) resets the database to the top of the stack.

(push)

Let's first define relations edge and path. We use (edge a b) to mean the tuple (a, b) is in the edge relation. (edge a b) means there are directed edges from a to b, and we will use it to compute the path relation, where (path a b) means there is a (directed) path from a to b.

(relation edge (i64 i64))
(relation path (i64 i64))

We can insert edges into our relation by asserting facts:

(edge 1 2)
(edge 2 3)
(edge 3 4)

Fact definitions are similar to definitions using (let ..) definitions in the last lesson, in that facts are immediately added to relations.

Now let's tell egglog how to derive the path relation.

First, if an edge from a to b exists, then it is already a proof that there exists a path from a to b.

(rule ((edge a b))
      ((path a b)))

A rule has the form (rule (atom1 atom2 ..) (action1 action2 ..)).

For the rule we have just defined, the only atom is (path a b), which asks egglog to search for possible a and bs such that (path a b) is a fact in the database.

We call the first part the "query" of a rule, and the second part the "body" of a rule. In Datalog terminology, confusingly, the first part is called the "body" of the rule while the second part is called the "head" of the rule. This is because Datalog rules are usually written as head :- body. To avoid confusion, we will refrain from using Datalog terminology.

The rule above defines the base case of the path relation. The inductive case reads as follows: if we know there is a path from a to b, and there is an edge from b to c, then there is also a path from a to c. This can be expressed as egglog rule below:

(rule ((path a b) (edge b c))
      ((path a c)))

Again, defining a rule does not mean running it in egglog, which may be a surprise to those familiar with Datalog. The user still needs to run the program. For instance, the following check would fail at this point.

(fail (check (path 1 4)))

But it passes after we run our rules for 10 iterations.

(run 10)
(check (path 1 4)) ; this succeeds

For many deductive rules, we do not know the number of iterations needed to reach a fixpoint. The egglog language provides the saturate scheduling primitive to run the rules until fixpoint.

(run-schedule (saturate (run)))

We will cover more details about schedules later in the tutorial.

Exercises:

Consider the variant of our last rule: (rule ((path a b) (path b c)) ((path a c))) Does this rule compute the same relation as the original rule? How does this rule compare to the original rule? Hint: it's slow, why?


(pop) ;; resets egglog
(push)

Our last example determines whether there is a path from one node to another, but we don't know the details about the path. Let's slightly extend our program to obtain the length of the shortest path between any two nodes.

(function edge (i64 i64) i64 :no-merge)
(function path (i64 i64) i64 :merge (min old new))

Here, we use a new keyword called function to define a table that respects the functional dependency. A relation is just a function with output domain Unit. By defining edge and path with function, we can associate a length to each path.

What happens it the same tuple of a function is mapped to two values? In the case of relation, this is easy: Unit only has one value, so the two values must be identical. But in general, that would be a violation of functional dependency, the property that a = b implies f(a) = f(b). Egglog allows us to specify how to reconcile two values that are mapped from the same tuple using merge expressions.

For instance, for path, the merge expression is (min old new); old and new are two special values in an expression that denotes the current output of the tuple and the output of the new, to-be-inserted value. The merge expression for path says that, when there are two paths from a to b with lengths old and new, we keep the shorter one, i.e., (min old new).

For edge, we can define the merge expression the same as path, which means that we only keep the shortest edge if there are multiple edges. But we can also assert that edge does not have a merge expression using no-merge. This means we don't expect there will be multiple edges between two nodes. More generally, it is the user's responsibility to ensure no tuples with conflicting output values exist. If a conflict happens, egglog will panic.

Now let's insert the same edges as before, but we will assign a length to each edge. This is done using the set action, which takes a tuple and an output value:

(set (edge 1 2) 10)
(set (edge 2 3) 10)
(set (edge 1 3) 30)

Let us define the reflexive rule and transitive rule for the path function. In this rule, we use the set action to set the output value of the path function. On the query side, we use = to bind the output value of a function.

(rule ((= (edge x y) len))
      ((set (path x y) len)))
(rule ((= (path x y) xy) 
       (= (edge y z) yz))
      ((set (path x z) (+ xy yz))))

Let's run our rules and check we get the desired shortest path.

(run-schedule (saturate (run)))
(check (= (path 1 3) 20))

(pop)
(push)

Now let us combine the knowledge we have learned in lessons 1 and 2 to write a program that combines both equality saturation and Datalog.

We reuse our path example, but this time the nodes are terms constructed using the mk constructor,

We start by defining a new, union-able sort.

(sort Node)

We can then define a new constructor for our sort.

(constructor mk (i64) Node)

Note: We could have equivalently written (datatype Node (mk i64)) datatype is a syntactic sugar for sorts and constructors.

(relation edge (Node Node))
(relation path (Node Node))

(rule ((edge x y))
      ((path x y)))
(rule ((path x y) (edge y z))
      ((path x z)))

(edge (mk 1) (mk 2))
(edge (mk 2) (mk 3))
(edge (mk 3) (mk 1))
(edge (mk 5) (mk 6))

Because we defined our nodes using a sort, we can "union" two nodes. This makes them indistinguishable to rules in egglog.

(union (mk 3) (mk 5))

union is a new keyword here, but it is our old friend: rewrites are implemented as rules whose actions are unions. For instance, (rewrite (Add a b) (Add b a)) is lowered to the following rule: (rule ((= e (Add x y))) ((union e (Add y x))))

(run-schedule (saturate (run)))

(check (edge (mk 3) (mk 6)))
(check (path (mk 1) (mk 6)))

We can also give a new meaning to equivalence by adding the following rule.

(rule ((path x y)
       (path y x))
      ((union x y)))

This rule says that if there is a path from x to y and from y to x, then x and y are equivalent. This rule allows us to tell if a and b are in the same connected component by checking (check (= (mk a) (mk b))).

(run-schedule (saturate (run)))
(check (= (mk 1) (mk 2))
       (= (mk 1) (mk 3))
       (= (mk 2) (mk 3)))

Exercises:

(Parametrized rewrite rule) In our lesson 1, it is tedious to write associativity and commutativity (AC) rules for every AC operators (e.g., Add, Mul, ...). One way to save us from this repetitive work is to parameterize over the operators.

In the following program, instead of Mul and Add, Expr has a Bop operator that takes an operator kind and two operands.

 (datatype BopKind (Add) (Mul) (Div))
 (datatype Expr (Num i64) (Var String) (Bop BopKind Expr Expr))

First, add a relation AcBopKind that holds BopKinds which are associative and commutative. Second, insert the appropriate BopKinds into that relation. Third, add a single rule that implements commutativity for all Bops with an ACBopKind. Finally, add a single rule that implements associativity for all Bops with an ACBopKind.

(Universe relation) One of the rewrite rules that we cannot make a birewrite in lesson 1 is the rule (rewrite (Add x (Num 0)) x). This is because it is not clear what x is bound to in the inverse rule, (rewrite x (Add x (Num 0))). Therefore, defining this rule in egglog causes an error. One way to fix this is to introduce a "universe" relation, which contains every term in the e-graph. So we can write our rules as follows:

 (rule ((universe x))
       ((union x (Add x (Num 0)))))

Please define such a universe relation so that the above rule works.

Side Note: The above rule can also be written using the shorthand rewrite with a :when condition: (rewrite x (Add x (Num 0)) :when ((universe x))), and similarly the birewrite rule (birewrite (Add x (Num 0)) x :when ((universe x))), which introduces a bit of overhead for the forward direction, but is more concise.


Everything is a function

We have talked about relations being a special case of functions, but in fact, constructors like Add and Mul are also function tables under the hood! For instance, the Add constructor

(constructor Add (Expr Expr) Expr)

is a function with two Expr inputs and one Expr output. Its merge expression is simply to union the two output values. This corresponds to functional dependency: If x1 = x2, and y1 =y2, then it follows that (Add x1 y1) and (Add x2 y2) are equivalent. In other words, functions, functional dependencies, and merge expressions are the fundamental unifying concepts in our framework. This allows us to implement egglog as a database system and using advanced database techniques.

There is one last caveat about constructors though: Different from function but similar to relation, constructor have default values, so expressions like (let e1 (Num 2)) succeeds even when (Num 2) is not already in the e-graph, in which case it creates a new id for the output. In contrast, since relation are functions with output domain Unit, its default value is the only value of type Unit.