# Datalog
Datalog is a relational language for deductive reasoning. In the last lesson, we write our first equality saturation program in egglog, but you can also write rules for deductive reasoning a la Datalog. In this lesson, we will write several classic Datalog programs in egglog. One of the benifits of egglog being a language for program optimization is that it can talk about terms natively, so in egglog we get Datalog with terms for free.
In this lesson, we will define multiple relations with the same name,
so we use the (push)
and (pop)
commands to clone and reset the database.
Under the hood, (push)
clones the current database and pushes it onto a stack,
and (pop)
resets the database to the top of the stack.
(push)
Let's first define relations edge and path.
We use (edge a b)
to mean the tuple (a, b) is in the edge
relation.
(edge a b)
means there are directed edges from a
to b
,
and we will use it to compute the path
relation,
where (path a b)
means there is a (directed) path from a
to b
.
(relation edge (i64 i64))
(relation path (i64 i64))
We can insert edges into our relation by asserting facts:
(edge 1 2)
(edge 2 3)
(edge 3 4)
Fact definitions are similar to definitions using (let ..)
definitions in the last lesson,
in that facts are immediately added to relations.
Now let's tell egglog how to derive the path
relation.
First, if an edge from a to b exists, then it is already a proof that there exists a path from a to b.
(rule ((edge a b))
((path a b)))
A rule has the form (rule (atom1 atom2 ..) (action1 action2 ..))
.
For the rule we have just defined, the only atom is (path a b)
, which asks egglog to search
for possible a
and b
s such that (path a b)
is a fact in the database.
We call the first part the "query" of a rule, and the second part the "body" of a rule.
In Datalog terminology, confusingly, the first part is called the "body" of the rule
while the second part is called the "head" of the rule. This is because Datalog rules
are usually written as head :- body
. To avoid confusion, we will refrain from using
Datalog terminology.
The rule above defines the base case of the path relation. The inductive case reads as follows:
if we know there is a path from a
to b
, and there is an edge from b
to c
, then
there is also a path from a
to c
.
This can be expressed as egglog rule below:
(rule ((path a b) (edge b c))
((path a c)))
Again, defining a rule does not mean running it in egglog, which may be a surprise to those familiar with Datalog. The user still needs to run the program. For instance, the following check would fail at this point.
(fail (check (path 1 4)))
But it passes after we run our rules for 10 iterations.
(run 10)
(check (path 1 4)) ; this succeeds
For many deductive rules, we do not know the number of iterations
needed to reach a fixpoint. The egglog language provides the saturate
scheduling primitive to run the rules until fixpoint.
(run-schedule (saturate (run)))
We will cover more details about schedules later in the tutorial.
Exercises:
Consider the variant of our last rule:
(rule ((path a b) (path b c))
((path a c)))
Does this rule compute the same relation as the original rule? How does this rule
compare to the original rule? Hint: it's slow, why?
(pop) ;; resets egglog
(push)
Our last example determines whether there is a path from one node to another, but we don't know the details about the path. Let's slightly extend our program to obtain the length of the shortest path between any two nodes.
(function edge (i64 i64) i64 :no-merge)
(function path (i64 i64) i64 :merge (min old new))
Here, we use a new keyword called function
to define a table that respects the functional dependency.
A relation is just a function with output domain Unit
.
By defining edge
and path
with function
, we can associate a length to each path.
What happens it the same tuple of a function is mapped to two values?
In the case of relation, this is easy: Unit
only has one value, so the two values must be identical.
But in general, that would be a violation of functional dependency, the property that a = b
implies f(a) = f(b)
.
Egglog allows us to specify how to reconcile two values that are mapped from the same tuple using merge expressions.
For instance, for path
, the merge expression is (min old new)
; old
and new
are two special values in an expression
that denotes the current output of the tuple and the output of the new, to-be-inserted value.
The merge expression for path
says that, when there are two paths from a
to b
with lengths old
and new
,
we keep the shorter one, i.e., (min old new)
.
For edge
, we can define the merge expression the same as path
, which means that we only keep the shortest edge
if there are multiple edges. But we can also assert that edge
does not have a merge expression using no-merge
.
This means we don't expect there will be multiple edges between two nodes. More generally, it is the user's
responsibility to ensure no tuples with conflicting output values exist. If a conflict happens, egglog will
panic.
Now let's insert the same edges as before, but we will assign a length to each edge. This is done using the set
action,
which takes a tuple and an output value:
(set (edge 1 2) 10)
(set (edge 2 3) 10)
(set (edge 1 3) 30)
Let us define the reflexive rule and transitive rule for the path
function.
In this rule, we use the set
action to set the output value of the path
function.
On the query side, we use =
to bind the output value of a function.
(rule ((= (edge x y) len))
((set (path x y) len)))
(rule ((= (path x y) xy)
(= (edge y z) yz))
((set (path x z) (+ xy yz))))
Let's run our rules and check we get the desired shortest path.
(run-schedule (saturate (run)))
(check (= (path 1 3) 20))
(pop)
(push)
Now let us combine the knowledge we have learned in lessons 1 and 2 to write a program that combines both equality saturation and Datalog.
We reuse our path example, but this time the nodes are terms constructed using the mk
constructor,
We start by defining a new, union-able sort.
(sort Node)
We can then define a new constructor for our sort.
(constructor mk (i64) Node)
Note: We could have equivalently written
(datatype Node
(mk i64))
datatype
is a syntactic sugar for sort
s and constructor
s.
(relation edge (Node Node))
(relation path (Node Node))
(rule ((edge x y))
((path x y)))
(rule ((path x y) (edge y z))
((path x z)))
(edge (mk 1) (mk 2))
(edge (mk 2) (mk 3))
(edge (mk 3) (mk 1))
(edge (mk 5) (mk 6))
Because we defined our nodes using a sort
, we can "union" two nodes.
This makes them indistinguishable to rules in egglog
.
(union (mk 3) (mk 5))
union
is a new keyword here, but it is our old friend: rewrite
s are implemented as rules whose
actions are union
s. For instance, (rewrite (Add a b) (Add b a))
is lowered to the following
rule:
(rule ((= e (Add x y)))
((union e (Add y x))))
(run-schedule (saturate (run)))
(check (edge (mk 3) (mk 6)))
(check (path (mk 1) (mk 6)))
We can also give a new meaning to equivalence by adding the following rule.
(rule ((path x y)
(path y x))
((union x y)))
This rule says that if there is a path from x
to y
and from y
to x
, then
x
and y
are equivalent.
This rule allows us to tell if a
and b
are in the same connected component by checking
(check (= (mk a) (mk b)))
.
(run-schedule (saturate (run)))
(check (= (mk 1) (mk 2))
(= (mk 1) (mk 3))
(= (mk 2) (mk 3)))
Exercises:
(Parametrized rewrite rule) In our lesson 1, it is tedious to write associativity and commutativity (AC) rules for every AC operators (e.g., Add, Mul, ...). One way to save us from this repetitive work is to parameterize over the operators.
In the following program, instead of Mul
and Add
, Expr
has a Bop
operator
that takes an operator kind and two operands.
(datatype BopKind (Add) (Mul) (Div))
(datatype Expr (Num i64) (Var String) (Bop BopKind Expr Expr))
First, add a relation AcBopKind
that holds BopKind
s which are associative and commutative.
Second, insert the appropriate BopKind
s into that relation.
Third, add a single rule that implements commutativity for all Bop
s with an ACBopKind
.
Finally, add a single rule that implements associativity for all Bop
s with an ACBopKind
.
(Universe relation)
One of the rewrite rules that we cannot make a birewrite
in lesson 1 is the rule
(rewrite (Add x (Num 0)) x)
. This is because it is not clear what x
is bound to
in the inverse rule, (rewrite x (Add x (Num 0)))
. Therefore, defining this rule in egglog
causes an error. One way to fix this is to introduce a "universe" relation, which contains
every term in the e-graph. So we can write our rules as follows:
(rule ((universe x))
((union x (Add x (Num 0)))))
Please define such a universe
relation so that the above rule works.
Side Note: The above rule can also be written using the shorthand rewrite
with a :when
condition: (rewrite x (Add x (Num 0)) :when ((universe x)))
, and similarly the birewrite
rule (birewrite (Add x (Num 0)) x :when ((universe x)))
, which introduces a bit of
overhead for the forward direction, but is more concise.
Everything is a function
We have talked about relations being a special case of functions, but in fact, constructor
s like Add
and Mul
are also function tables under the hood!
For instance, the Add
constructor
(constructor Add (Expr Expr) Expr)
is a function with two Expr
inputs and one Expr
output. Its merge expression is simply to union the two output values.
This corresponds to functional dependency: If x1 = x2
, and y1 =y2
, then it follows that
(Add x1 y1)
and (Add x2 y2)
are equivalent.
In other words, functions, functional dependencies, and merge expressions are the fundamental unifying concepts in our framework.
This allows us to implement egglog as a database system and using advanced database techniques.
There is one last caveat about constructors though: Different from function
but similar to relation
, constructor have default values, so expressions
like (let e1 (Num 2))
succeeds even when (Num 2)
is not already in the e-graph, in which case it creates a new id for the output.
In contrast, since relation
are functions with output domain Unit
, its default value is the only value of type Unit
.