
Datalog in Rust

Over the Memorial Day weekend Kris Micinski hosted a logic programming workshop in the delightful Minnowbrook Conference Center in upstate New York. I, a notorious villain, was invited for what I was half sure was my long-due comeuppance. As it turned out it was delightful, with lots of friendly and supportive people with shared goals.

We had one requirement, which was that we write a blog post reflecting on the event, which Kris would collect and share out with the wider world. You are going to get a bunch of thoughtful, considerate takes on the workshop, so I thought I'd go in a different direction. As great as the time was, it struggled with a standard academic problem: smart people working on challenging problems that don't always directly connect to the challenges at hand. Outputs without outcomes, as they say in Product-speak.

After a few days of Datalog (and other logic programming languages) talks, the one that stood out was the last talk by Denis Bueno on ctdal. It's program analysis in Datalog! But .. boy was it a struggle, to hear it. Many tools didn't work; Soufflé did, but you have to hold it right. Many challenges, and .. to be honest the bad news is the best news if you want to make progress.

I figured I should put my money-time where my mouth-brain is, and see what it is like to build a thing that is meant to be usable as well as "performant". Plus, I figure everyone benefits from seeing what it's like to build such a thing too, and you'll even get to learn about logic programming!

Let's build an interactive Datalog, in Rust!

I've built a few things like Datalog in my day, but one thing I haven't done is try and make a thing that is all three of simple, usable, and performant. I thought we might try and do that here, with Datalog! You can follow along in the datatoad repository.

I have previously built much of datafrog (with help!), which provides the guts of a Datalog engine that you are then invited to wire together your own damned self. Understandably, this was not the Datalog renaissance you might imagine. We aren't going to start from datafrog, but we'll use a lot of the same algorithmic ideas. If you are familiar with it you should be able to follow along, and if you are not familiar with it you're doing just fine.

I had an outline and post that started by building a bad Datalog, a slow and awkward Datalog, that we would then improve together as we learned things. It was sufficiently bad that I think we should just start with the good version, and we can talk through the load-bearing moments in more detail as we go.

By the end of this post, we'll have an interactive Datalog that we can bulk load facts into, add rules on the go, and have it keep up with pretty good performance. Here's an example that does a nullability analysis on an httpd dataflow (in the PL sense) graph;

> .list
    e:	9905624
    n:	138331
30.708µs

> m(?b, ?a) :- n(?a, ?b) .
52.027834ms

> .list
    e:	9905624
    m:	138331
    n:	138331
45.916µs

> m(?c, ?a) :- m(?b, ?a), e(?b, ?c) .
8.360353834s

> .list
    e:	9905624
    m:	9393283
    n:	138331
49.042µs

> 

This takes about 2s in datafrog, vs 8.3s here. The datafrog example uses (u32, u32) data, though, where here with Vec<String> data and no compiled queries we take only about 4x longer. I hope to get that down in the future.

I haven't confirmed that this does a great job on all problems yet. I haven't even confirmed that it is generally correct, though on these reachability problems it produces the same number of output tuples as the datafrog implementation. So for the moment, I'm not too worried about performance, and more worried about ergonomics, explainability, and future extensibility.

Here's an overview of my plan for what we'll cover. I'm aspirationally describing each of the sections having not yet done all of the work, but I have high hopes! Much of this derives from prior work on the datafrog project, but we'll start a bit slower and in more depth, and end up going further as well.

My plan is to go top-down, starting with "what is Datalog", and building the framework for an interactive evaluator.

  1. Introducing Datalog

Once we have an overview in place, we'll start to build up what are currently the three loci of functionality.

  1. Parsing Datalog
  2. Representing and maintaining sets of facts
  3. Planning and evaluating Datalog rules

There are some very natural next steps. I haven't done these yet, but I'm certain they can be done without a great deal of effort. I'll put them here, and if you are especially excited about them and for some reason I haven't already done them, use this paragraph to compel me to do so.

  1. Spilling everything to disk.
  2. Scaling out to multiple workers and processes.
  3. Streaming and worst-case optimal joins.

I have a few more tricks planned, as the project is also partly about unlocking an ability to experiment with new data-parallel computation patterns. It turns out most of what we'll be doing is mechanically very similar to differential dataflow, though substantially simpler (and less capable).

Introducing Datalog

Datalog is a language in which you describe simple logical rules, and it derives all of the consequences that can be reached from them. The rules are each Horn clauses, which is a grim name for rules that take a set of input facts to an output fact. These logical statements have placeholder variables, and part of the fun of Datalog is that it will fill in all of the possible settings of those variables.

Datalog rules are shaped as a "head" which is the thing (or things) that could be inferred, followed by :- and then a "body" which is a list of sufficient conditions. For example, we might write a rule that says that a triangle (a, b, c) exists if each of three edges exist:

tri(a, b, c) :- edge(a, b), edge(b, c), edge(a, c).

The names tri and edge are relations, and the variables a, b, and c are free variables. The rule is implicitly universally quantified over the variables, meaning that for any setting of the variables, if the body is true for those settings then the head is true for those settings as well. We are going to exclude rules that have variables in a head atom that are not present in the body.

We often forget when writing Datalog, but there are also "facts" that are like rules with an empty body, meaning they are true unconditionally. For example, we might state that the following three edges exist through the facts

edge(1, 2) :- .
edge(1, 3) :- .
edge(2, 3) :- .

We could also write this using "multiple heads" as:

edge(1, 2), edge(1, 3), edge(2, 3) :- .

When rules and facts combine, more facts are inferred. With these facts, we would expect to infer that tri(1, 2, 3) is true. This now becomes a fact itself, and might lead to more facts if we had rules that relied on tri.

Datalog has an appealing monotonicity property: as you add more rules (including facts), the set of things that are true only increases. Moreover, Datalog is confluent, in that no matter what order you put these rules in, you'll end up at the same set of facts for any set of input rules.

At this point, I hope you can imagine typing rule after rule into a Datalog shell, and watching the consequences of these rules spill back out at you. And, I hope just after imagining this you are imagining them spilling out so quickly because we have done such a good job making this work well, that actually you want to read from and write to files, and use this for all sort of real (or at least plausible) applications.

How to represent facts and rules

Let's start with rules, as facts end up as a special but different case of rules.

A rule is two lists of atoms, the list before the turnstile called the "head" and the list after the turnstile the "body". Whenever all atoms in the body evaluate to true, the atoms of the head must also be true.

struct Rule {
    head: Vec<Atom>,
    body: Vec<Atom>,
}

An atom is a named relation, as well as a number of terms that matches the arity of the relation.

struct Atom {
    name: String,
    terms: Vec<Term>,
}

Each term is either a named variable or, a simplifying choice we'll make, a string literal.

enum Term {
    Var(String),
    Lit(String),
}

There's no strong reason to make it a string literal. Any types can work, and often folks use integers or other compact types. In fact, it's an outright lie and we'll really use Vec<u8> as the type, and if you'd like those bytes to mean String or (u32,u32) or something else, you'll be welcome to do so. The only properties we'll use are literal equality and an arbitrary order, and using Vec<u8> intentionally doesn't close any doors.

Facts are rules that have an empty body, and heads with no variables. We can treat facts as data, recording the list of literal strings in each position. Although not where we will end up, you can think of them for now as

type Fact = Vec<String>;

We'll also collect lists of facts and associate them with names, as our representation of relations. Again, not exactly where we'll end up, but a great starting intuition.

type Facts = BTreeMap<String, Vec<Fact>>;

Finally, we'll bundle our facts and rules into our Datalog interpreter state.

struct State {
    /// Rules that are not facts.
    rules: Vec<Rule>,
    /// Rules with an empty body and only literals in the head.
    facts: facts::Facts, // as yet undisclosed
}

Up next, what do you even do with all these things?

An overview of operation

We're going to write a fairly simple skeleton of our interactive Datalog shell. It will set up some State, repeatedly read rules from the input, and apply them to our facts until no new facts emerge. If you type the .list command rather than a rule, we'll print out the names and sizes of the relations.

The code uses a few methods on State we haven't defined yet. We'll unpack each of these in coming sections.

fn main() {

    let mut state = State::default();

    use std::io::Write;
    println!("");
    print!("> ");
    let _ = std::io::stdout().flush();

    let mut text = String::new();
    while let Ok(_size) = std::io::stdin().read_line(&mut text) {

        if let Some(parsed) = parse::datalog(&text) {
            state.extend(parsed);
            state.update();
        }

        else {
            match text.trim() {
                ".list" => {
                    for (name, facts) in state.facts.iter() {
                        println!("\t{}:\t{:?}", name, facts.len());
                    }    
                }
                _ => {
                    println!("Parse failure: {:?}", text);
                }
            }
        }

        println!("");
        print!("> ");
        let _ = std::io::stdout().flush();
        text.clear();
    }

}

There are three methods in there we haven't seen yet: parse::datalog, State::extend, and State::update. The Datalog parsing is its own self-contained bit, going from lines of text to lists of rules; we'll cover it in the next section. Before heading to that, let's see the extend method and discuss the update method which is the beating heart of Datalog.

impl State {

The extend method mostly just looks for facts, which are rules with an empty body whose heads have only literals in them. We store them explicitly as lists of data, in the facts member variable, keyed by the name of the head atom (the name of the relation).

    /// Adds new rules to the state.
    pub fn extend(&mut self, rules: impl IntoIterator<Item=Rule>) {
        for rule in rules.into_iter() { self.push(rule); }
    }

    /// Adds a new rule to the state.
    pub fn push(&mut self, rule: Rule) {
        if rule.body.is_empty() {
            for atom in rule.head.iter() {
                let mut lits = Vec::with_capacity(atom.terms.len());
                for term in atom.terms.iter() {
                    if let Term::Lit(text) = term {
                        lits.push(text.to_string().into_bytes());
                    }
                    else { continue; }
                }
                let mut builder = facts::FactBuilder::default();
                builder.push(lits);
                self.facts
                    .entry(atom.name.to_owned())
                    .or_default()
                    .add_set(builder);
            }
        }
        else {
            let rule_plan = crate::join::JoinPlan::from(&rule);
            join::implement_plan(&rule, &rule_plan, self.rules.len(), true, &mut self.facts);
            self.rules.push(rule);
        }
    }
}

There are some join:: things going on here. Hold your breath for a bit; they are a necessary step to kick off some work when a new rule shows up. We'll get there, but for the moment just read them as "start the derivations for that new rule".

The update method is the beating heart of Datalog, where we repeatedly derive facts, continuing as long as any new fact has been derived. Once we fail to derive any new facts, well, we won't derive any more by just trying again, so we are done for the moment.

/// Applies all rules to all facts, repeating until no new facts emerge.
pub fn update(&mut self) {
    self.advance();
    while self.active() {
        for (index, rule) in self.rules.iter().enumerate() {
            let rule_plan = crate::join::JoinPlan::from(rule);
            join::implement_plan(rule, &rule_plan, index, false, &mut self.facts);
        }
        self.advance();
    }
}

There are some helper methods in there that we use to move our understanding of what we've learned forward. Informally, active tells us whether there are new facts still to process, and advance checks all new facts against the old facts for novelty and gets us ready to go again.

/// True iff any relation has new facts
fn active(&self) -> bool {
    self.facts
        .values()
        .any(|x| x.active())
}

/// Checks new facts against old facts.
pub fn advance(&mut self) {
    for facts in self.facts.values_mut() {
        facts.advance();
    }
}

It turns out there is a lot of secret stuff going on in facts.advance(), and we'll get there once we unpack more about how we plan to represent sets of facts. Before that, a segue into parsing Datalog.

How to parse rules from text

We are going to write a Datalog parser. If I took an undergraduate class on lexing and parsing, I can't remember it, so this is going to be a not-great way to get Datalog rules from lines of text.

The code that follows is currently parse.rs in the project.

This is also the moment where I reveal the actual grammar, which is lifted from Soufflé. In particular, variables start with ? to distinguish them from literals. I could imagine preferring the other way around, using " to indicate string literals, but I figured tracking Soufflé might make everyone else's lives easier.

Tokenization

The first step in many a parser is "tokenization": turning the sequence of characters into a sequence of more meaningful symbols.

For Datalog, we have just a few significant tokens: ., ,, (, ), :-, and ? (used to distinguish variables from literals). Everything that is not one of these will be a string token, describing an atom or term name.

/// Text translated to a sequence of tokens.
#[derive(Debug, PartialEq)]
enum Token {
    Period,
    Comma,
    LParen,
    RParen,
    Turnstile,
    Question,
    Text(String),
}

We will need to add Exclamation whenever we get to adding negation to our rules. Not today!

The tokenizer I wrote removes all whitespace, and converts :- into ← to have a single symbol to scan for. More ambitious individuals could find a better way to write this.

fn tokenize(text: &str) -> Option<Vec<Token>> {

    let mut text = text.replace(":-", "←");
    text.retain(|c| !c.is_whitespace());

    let mut result = Vec::new();

    let pattern = ['.', ',', '(', ')', '←', '?'];
    for token in text.split_inclusive(&pattern) {
        let mut split = token.split(&pattern);
        let prev = split.next().unwrap();
        if !prev.is_empty() {
            result.push(Token::Text(prev.to_owned()));
        }
        let next = token.chars().rev().next().unwrap();
        result.push (
            match next {
                '.' => Token::Period,
                ',' => Token::Comma,
                '(' => Token::LParen,
                ')' => Token::RParen,
                '←' => Token::Turnstile,
                '?' => Token::Question,
                _ => { None? } 
            }
        );
    }
    
    Some(result)
}

As you can see, some horror around finding the key moments for these symbols and breaking things apart into tokens. The challenge I had with Rust was convincing it to show me these locations, without also consuming the symbols. Again, I can imagine a better implementation.

Parsing

We'll parse a token sequence by repeatedly peeking at the next token, and then attempting to extract one of our types: rules, made of atoms, made of terms.

fn parse(tokens: Vec<Token>) -> Option<Vec<Rule>> {
    let mut tokens = tokens.into_iter().peekable();
    let mut rules = Vec::new();
    while tokens.len() > 0 {
        rules.push(parse_rule(&mut tokens)?);
    }

    Some(rules)
}

Rule parsing extracts atoms until a turnstile, then atoms until a period.

fn parse_rule<I: Iterator<Item=Token>>(tokens: &mut Peekable<I>) -> Option<Rule> {
    let mut head = Vec::new();
    while &Token::Turnstile != tokens.peek()? {
        if &Token::Comma == tokens.peek()? { tokens.next(); }
        head.push(parse_atom(tokens)?);
    }
    let Token::Turnstile = tokens.next()? else { None? };
    let mut body = Vec::new();
    while &Token::Period != tokens.peek()? {
        if &Token::Comma == tokens.peek()? { tokens.next(); }
        body.push(parse_atom(tokens)?);
    }
    let Token::Period = tokens.next()? else { None? };

    Some(Rule { head, body })
}

Atom parsing extracts a name and left parenthesis, terms as long as there are commas, then expects a right parenthesis.

fn parse_atom<I: Iterator<Item=Token>>(tokens: &mut Peekable<I>) -> Option<Atom> {
    let Token::Text(name) = tokens.next()? else { None? };
    let Token::LParen     = tokens.next()? else { None? };

    let mut terms = Vec::new();
    terms.push(parse_term(tokens)?);
    while let Token::Comma = tokens.peek()? {
        tokens.next();
        terms.push(parse_term(tokens)?);
    }
    let Token::RParen     = tokens.next()? else { None? };
    Some(Atom { name, terms })
}

Term parsing checks for an optional question mark, and then expects a text name.

fn parse_term<I: Iterator<Item=Token>>(tokens: &mut Peekable<I>) -> Option<Term> {
    if let Token::Question = tokens.peek()? {
        tokens.next()?;
        let Token::Text(term) = tokens.next()? else { None? };
        Some(Term::Var(term))
    }
    else {
        let Token::Text(term) = tokens.next()? else { None? };
        Some(Term::Lit(term))
    }
}

Malformed rules result in a None result, with no great effort made to clearly communicate what went wrong. As you might expect by this point, we could have tried harder and made this better.
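We never saw the parse::datalog entry point that main calls. It isn't shown in this post, but it is presumably just the composition of the two stages; a minimal sketch:

/// Parses lines of text into rules: tokenize, then parse.
pub fn datalog(text: &str) -> Option<Vec<Rule>> {
    parse(tokenize(text)?)
}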

How to represent sets of facts

We said above that facts will be Vec<String>, and sets of facts will be Vec<Fact>. That wasn't entirely accurate.

The problem is that Vec<Vec<String>> is a memory management nightmare. Allocations of allocations of allocations. Memory everywhere .. else; not where you were just looking. We are going to use a different physical representation of the same information.

This section corresponds to facts.rs in the project.

Data oriented design / columns

Our project has (at this moment) one external dependency: columnar, a crate that converts Rust types (like Vec<Fact>) into a flat layout of only a few linear allocations. Informally, it packs all the strings together in one massive Vec<u8>, and then it records the offsets that delineate String boundaries, and then again offsets in that list that delineate Fact boundaries. But it does this all for you with the magical incantation

<Fact as Columnar>::Container

This is the type we will use to represent sets of facts. It's such a good idea, we'll even give it a name.

/// A sorted list of distinct facts.
struct FactContainer {
    ordered: <Fact as Columnar>::Container,
}

impl std::ops::Deref for FactContainer {
    type Target = <Fact as Columnar>::Container;
    fn deref(&self) -> &Self::Target {
        &self.ordered
    }
}

We are following a Rust idiom of "wrapper types". The FactContainer type wraps a list of facts, but also indicates that the sequence is sorted and deduplicated. Any time we have one of these, we'll be assured of these properties (as long as we build them carefully; implementation coming up in a moment).
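To make "build them carefully" concrete, here is the idiom in miniature, with a plain Vec<Fact> standing in for the columnar container. Construction is the only way in, and it establishes the invariant:

struct SortedFacts {
    /// Sorted and deduplicated, by construction.
    ordered: Vec<Fact>,
}

impl SortedFacts {
    /// Sorts and deduplicates on the way in, so the invariant holds.
    fn new(mut facts: Vec<Fact>) -> Self {
        facts.sort();
        facts.dedup();
        Self { ordered: facts }
    }
}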

A key downside of columnar containers is that they are basically append only. In principle with a Vec you could swap a fact around halfway through a list, or change a string literal it contains. We aren't going to want to do that though, so we are mostly good.

One thing we'll want to be able to do is add facts to a list, and if we need to keep it sorted and deduplicated we now have a bit of a problem. We can't change the list, other than by appending to it, and appending may not preserve the sorted order. We'll need a different technique to accumulate sets of facts than just the FactContainer.

We're going to use a log-structured merge-tree to represent our sets of facts. This may sound complicated, but it really just means "a list of FactContainers".

/// A list of sets of facts that double in length.
#[derive(Clone, Default)]
pub struct FactLSM {
    pub layers: Vec<FactContainer>,
}

The key property with an LSM is that the "layers" grow in size geometrically. To add some facts, you just put your FactContainer into the list, and then we tidy the list by merging layers whose sizes are within 2x of each other. Here's one way you can do that.

impl FactLSM {
    /// Adds a layer and restores the invariant.
    fn push(&mut self, layer: FactContainer) {
        self.layers.push(layer);
        self.tidy();
    }

    /// Ensures layers double in size.
    fn tidy(&mut self) {
        self.layers.sort_by_key(|x| x.len());
        self.layers.reverse();
        while let Some(pos) = (1..self.layers.len()).position(|i| self.layers[i-1].len() < 2 * self.layers[i].len()) {
            while self.layers.len() > pos + 1 {
                let x = self.layers.pop().unwrap();
                let y = self.layers.pop().unwrap();
                self.layers.push(x.merge(y));
                self.layers.sort_by_key(|x| x.len());
                self.layers.reverse();
            }
        }
    }
}

There is a FactContainer::merge function we use here, which does the expected thing on two sorted lists of distinct facts: it merges the two into one sorted list of distinct facts.
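Here is a sketch of that merge, again pretending the containers are plain Vec<Fact> rather than columnar storage; the real version performs the same dance over column references.

/// Merges two sorted, deduplicated lists into one sorted, deduplicated list.
fn merge(xs: Vec<Fact>, ys: Vec<Fact>) -> Vec<Fact> {
    let mut result = Vec::with_capacity(xs.len() + ys.len());
    let mut xs = xs.into_iter().peekable();
    let mut ys = ys.into_iter().peekable();
    while let (Some(x), Some(y)) = (xs.peek(), ys.peek()) {
        match x.cmp(y) {
            std::cmp::Ordering::Less    => result.push(xs.next().unwrap()),
            std::cmp::Ordering::Greater => result.push(ys.next().unwrap()),
            std::cmp::Ordering::Equal   => {
                // A fact present in both inputs appears once in the output.
                result.push(xs.next().unwrap());
                ys.next();
            }
        }
    }
    result.extend(xs);
    result.extend(ys);
    result
}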

The LSM is a handy way to maintain a set of facts as you work with it, especially if you do not need it in its final form right yet. We'll eventually want to convert all of those layers into one FactContainer, but we can put that off until we are ready. It's such a useful idiom that we'll also take a moment to introduce a helper type we'll use to do the building for us:

/// Staging area for collecting fact sets.
pub struct FactBuilder {
    /// Un-sorted, possibly repeated facts.
    active: <Fact as Columnar>::Container,
    /// Sorted, deduplicated facts.
    layers: FactLSM,
}

This type has space in active for disorderly facts, but once they get numerous enough it will convert them into a FactContainer and add them to its layers.
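A sketch of that policy, with Vec<Fact> again standing in for the columnar staging area, and a threshold I've invented for illustration:

struct SimpleFactBuilder {
    /// Un-sorted, possibly repeated facts.
    active: Vec<Fact>,
    /// Sorted, deduplicated layers, as in the LSM.
    layers: Vec<Vec<Fact>>,
}

impl SimpleFactBuilder {
    fn push(&mut self, fact: Fact) {
        self.active.push(fact);
        // Once the staging area is large enough, seal it into a sorted,
        // deduplicated layer. The threshold is arbitrary here.
        if self.active.len() >= (1 << 20) {
            let mut batch = std::mem::take(&mut self.active);
            batch.sort();
            batch.dedup();
            self.layers.push(batch);
        }
    }
}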

As a final bit of fact-organization, we'll want a way to reflect the fact "lifecycle", from a new but potentially repeated fact, to a recent novel fact, to facts that we have fully processed. We'll do that with three distinct collections, .. ah .. two of which are LSMs for some reason, and not the third.

pub struct FactSet {
    /// Newly arrived facts that may not be novel.
    pub to_add: FactLSM,
    /// Distinct facts that are due to be processed.
    pub recent: FactContainer,
    /// Distinct facts that have been fully processed.
    pub stable: FactLSM,
}

In fact, FactSet::advance is one of the missing functions from the initial section. It promised to do something about deduplicating facts, and "getting us ready to go again". Let's see what that actually means.

/// Moves recent into stable, then to_add into recent.
pub fn advance(&mut self) {
    // Move recent into stable
    if !self.recent.is_empty() {
        self.stable.push(std::mem::take(&mut self.recent));
    }

    // Convert the LSM to one list of facts.
    if let Some(to_add) = self.to_add.flatten() {

        // Tidy stable for the work we are about to do.
        self.stable.tidy_through(2 * to_add.len());

        // Remove from to_add any facts already in stable.
        let mut starts = vec![0; self.stable.layers.len()];
        let stable = &self.stable;
        self.recent = to_add.filter(move |x| {
            starts.iter_mut().zip(&stable.layers).all(|(start, layer)| {
                crate::join::gallop::<Fact>(layer.borrow(), start, |y| y < x);
                *start >= layer.borrow().len() || layer.borrow().get(*start) != x
            })
        });
    }
}

That last closure is a mess and you should complain. It's checking for each element of to_add whether it can be found anywhere in the stable LSM. It's just .. kinda hard to read.
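Unpacked, the test applied to each element of to_add is morally the following, with sorted Vec<Fact> layers standing in for the columnar ones. The real closure keeps a cursor per layer and gallops it forward instead of binary searching from scratch, exploiting that to_add is itself sorted.

/// True iff `fact` appears in any stable layer.
fn in_stable(layers: &[Vec<Fact>], fact: &Fact) -> bool {
    layers.iter().any(|layer| layer.binary_search(fact).is_ok())
}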

The lifecycle of facts, going from to_add to recent to stable, is what will guide our hand as we work to derive new facts, without redoing work we've already done for existing facts. That's what's coming up next.

How to apply rules to facts

Our last (finally!) step is to put down the logic that takes us from rules and facts, to even more facts. It's here that several seemingly arbitrary choices come together. But it's also pretty heady stuff, as we try and understand what Datalog rules even mean.

This is the subject of the join.rs file in the project.

Datalog rules are Joins

When we have a rule like our triangle rule, updated with the new ? notation,

tri(?a, ?b, ?c) :- edge(?a, ?b), edge(?b, ?c), edge(?a, ?c).

we'll want to first try to apply it with the facts we have, from edge. This might seem overwhelming at first, but we can approach it a few ways.

First, observe that there are only so many settings of ?a, ?b, and ?c to actual literal values. Each of the literal values needs to come from a fact involving edge. The ?a values need to come from the first coordinate, and the ?c values need to come from the second coordinate. We could plausibly just consider all settings of these values. It's a lot, but not infinite or anything.

Second, observe that while all possible settings of these values are not infinite, they are really too many, so we'll have to be smarter. Let's try taking a smaller bite at a time, and start by finding the values a, b, and c that satisfy the last two edge constraints. Let's give that the name vee, and rewrite our rule in terms of it:

vee(?a, ?b, ?c) :- edge(?b, ?c), edge(?a, ?c).
tri(?a, ?b, ?c) :- edge(?a, ?b), vee(?a, ?b, ?c).

I chose last two rather than first two, and it's totally arbitrary but the code works out better if we collapse towards the head. If you want it the other way around, you can call body.reverse() just after you parse the rule.

We now have two rules, each of which has a body with a pair of relations. I assure you that this produces the same results, and now we can think about just solving the problem for bodies containing only two atoms. Specifically, let's try and find literal triples that work for vee.

We are looking at a thing called an "equijoin" in relational databases. We have two relations, some coordinates of which must match, and we'd like to find all assignments of values where that can happen. The unlock with equijoins is noticing that ?c is present in both atoms, and we can start from it and explore outwards to find ?a and ?b that work.

For any two relations, how do we find the facts where certain key columns match? By sorting the relations by those key columns, and merging them! (That's why we sorted these things, in addition to helping to deduplicate).
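As a self-contained illustration for the vee rule, using (String, String) pairs rather than our fact containers: sort (c, x) pairs by key, and within each run of equal keys pair up the values.

/// All (a, b, c) satisfying edge(b, c) and edge(a, c).
fn vee(edges: &[(String, String)]) -> Vec<(String, String, String)> {
    // Key each edge (x, c) by its second coordinate, and sort by key.
    let mut by_c: Vec<(String, String)> = edges.iter().map(|(x, c)| (c.clone(), x.clone())).collect();
    by_c.sort();
    let mut result = Vec::new();
    let mut lower = 0;
    while lower < by_c.len() {
        // Find the run of entries sharing the key `by_c[lower].0`.
        let run = by_c[lower..].iter().take_while(|(c, _)| c == &by_c[lower].0).count();
        // Emit the cross-product of the run with itself: every (b, a) pairing.
        for (c, b) in by_c[lower..lower + run].iter() {
            for (_, a) in by_c[lower..lower + run].iter() {
                result.push((a.clone(), b.clone(), c.clone()));
            }
        }
        lower += run;
    }
    result
}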

Generalizing

We need to figure out how to do this in general, rather than just for triangles based on edges.

Our plan of attack will be to go through the body right-to-left, turning the last pair of relations into a join, and placing their result at the end of the list. Once we have only two relations remaining, we'll do a special-cased join that puts the results in the form the head requires. If we only have one relation to start with, we skip all this and just transform that relation into the shape the heads require.

The concrete parts we'll need to bring together are:

  1. For each atom in the body, a transformation of the facts so that they start with the columns we'll need to equate. Also, if the atom requires filtering (e.g. literal columns or repeated variable names), we would do this here.
  2. For each join we will perform, the arity (number of columns) of the key, to know which prefix of each fact to equate. Also for each join other than the last, the order of columns it should produce, so that the relation it deposits is ready to be joined. We won't have to worry about applying filtering here.
  3. For each head atom, the "layout" facts need to be in to be inserted into the relation. Each coordinate may be either a bound variable, or a literal from the atom itself.

Here's the shape of the join plan we'll use. I should stress, this is .. a very naive plan, and there are cooler ones I hope to get to in the future. This is a "right linear" join plan, and they don't get much simpler (maybe left linear is). Simpler doesn't mean easy in this case, though.

/// A description of a linear join plan for a Datalog rule.
///
/// The plan goes right to left, joining pairs of atoms until there are at most two left.
/// With one or two body atoms, we either:
/// 1. map the single body atom onto the head atoms, or
/// 2. join the two body atoms and map the result onto the head atoms.
#[derive(Debug)]
pub struct JoinPlan {
    /// For each body atom, the pre-work to put it in the right shape.
    bodys: Vec<Plan>,
    /// For each join other than the leftmost, the key arity and projection to apply to its output.
    ///
    /// This should land the results keyed appropriately, and pass along all demanded bindings.
    joins: Vec<(usize, Vec<usize>)>,
    /// For each head atom, the action to perform in the final join to correctly shape the result.
    ///
    /// An `Ok(index)` indicates a coordinate to copy, and an `Err(lit)` indicates a literal to copy.
    /// If `joins` is empty there is no final join, and each are applied to the rows of the body atom.
    heads: Vec<Vec<Result<usize, String>>>,

    /// The join arity for the final head-producing join, if such a join exists.
    arity: Option<usize>,
}

The two hard steps are now 1. create this join plan from a Rule, and 2. implement this join plan using joins.

Create a JoinPlan from a Rule

This is unfortunately just a wall of text.

We can start off on the right foot by noticing for each variable the leftmost atom and rightmost atom the variable appears in. The leftmost occurrence is the last moment (right-to-left) the bound variable needs to be retained; after this moment we can discard the values. The rightmost occurrence tells us when a variable first enters play, which is important for other body atoms that mention the variable to know if they must plan to include it in their key columns.
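As a sketch of that first step (my reconstruction, not necessarily the repo's code), we can sweep over the body atoms and record, for each variable, the first and last atom index in which it appears:

use std::collections::BTreeMap;

/// For each variable, the leftmost and rightmost body atom (by index) containing it.
fn variable_spans(rule: &Rule) -> BTreeMap<String, (usize, usize)> {
    let mut spans: BTreeMap<String, (usize, usize)> = BTreeMap::new();
    for (index, atom) in rule.body.iter().enumerate() {
        for term in atom.terms.iter() {
            if let Term::Var(name) = term {
                // First sighting sets both ends; later sightings extend the right end.
                spans.entry(name.clone())
                     .and_modify(|(_, max)| *max = index)
                     .or_insert((index, index));
            }
        }
    }
    spans
}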

With this information, each body atom can look at its own variables and the leftmost and rightmost numbers to partition its columns into

  1. dead columns that no one wants to hear about,
  2. key columns that will need to be equated with the incoming relation,
  3. value columns that must be presented and carried on, but are not equated.

Similarly, for each join we can partition the columns of the two participating relations:

  1. dead columns that no one wants to hear about (perhaps they were just joined),
  2. key columns that will need to be equated with the incoming relation,
  3. value columns that must be presented and carried on, but are not equated.

This thinking gives us the bodys and joins members, as well as the awkward arity member (a special case because of the leftmost join).

The final member, heads, is just about reading through the head atoms, recording either a literal or the location of the named variable. There is also some chaos in dealing with the possibility that there is no join at all, and what we need is to map the contents of the single body atom.
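That construction might look something like this sketch, where locations is a hypothetical map from each variable name to its column in the final join's output:

use std::collections::BTreeMap;

/// For each head atom, per-column instructions: a column to copy, or a literal.
fn plan_heads(rule: &Rule, locations: &BTreeMap<String, usize>) -> Vec<Vec<Result<usize, String>>> {
    rule.head.iter().map(|atom| {
        atom.terms.iter().map(|term| match term {
            Term::Var(name) => Ok(locations[name]),
            Term::Lit(text) => Err(text.clone()),
        }).collect()
    }).collect()
}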

The code that does all of this, likely incorrectly, is in join.rs, nicely packaged up in a From<&Rule> implementation for JoinPlan:

impl<'a> From<&'a Rule> for JoinPlan {
    fn from(rule: &'a Rule) -> Self {
        ...
    }
}

And that's all I'm going to show you of it here!

Implement a JoinPlan using Joins

We are getting down to business. Up until this point everything was either pretty abstract, or pretty concrete but just involving some text or rules or things. We are now heading into the performance sensitive part of things.

Join plan implementation starts through a method that looks like this:

/// Implements a provided rule by way of a provided plan.
///
/// Additionally, a rule-unique identifier allows the logic to create new relation names.
/// The `stable` argument indicates whether we need to perform a full evaluation or only the changes.
pub fn implement_plan(
    rule: &Rule, 
    plan: &JoinPlan, 
    pos: usize, 
    stable: bool, 
    facts: &mut Facts) 
{
    ...
}

There are a few secrets being revealed here. At least, you may have some questions about the stable argument and I'll now have to explain.

When stable is set we are going to join together the relations' contents, including stable plus recent. Both of those are valid distinct facts, and we'll want pretty much everything. This is the mode we use when we introduce a new rule. Even if facts are stable, the rule is new and we have to do everything just to get started.

When stable is unset, we want the join results that don't derive exclusively from stable facts. We have already seen anything that can be rederived from stable facts, so don't waste our time! It's not yet clear how we are going to accomplish this, though.

Joins, and binary joins in particular, which is what we are implementing, are "bi-linear", meaning in this case that they distribute over addition. If we have fact sets A and B, each of which have had some facts a and b "recently" added, we can write their join as:

  (A + a) ⋈ (B + b)
=
   A ⋈ B + A ⋈ b + a ⋈ B + a ⋈ b

The first term in the second line is A ⋈ B which is exactly the join results from stable facts. Yuck! We don't want that!

What we really want is the join results including recent facts, minus the join results on stable facts. We can rewrite the above into that:

  (A + a) ⋈ (B + b) - A ⋈ B
=
   A ⋈ b + a ⋈ B + a ⋈ b

This is great news! To get what we want, only the new derivations, we can perform three joins that each always involve a small letter, a "recent" addition. With that stable flag in mind, we'll implement our join with the ability to toggle between "full" and "incremental" join.

Apply the body plans

The first step in our join implementation is to apply the plans in JoinPlan::bodys. These are the things that tell us for each input how to re-lay out the columns, and how to filter them if necessary.

// New name for handy use.
let name = format!(".temp-{:?}", pos);
// Stash names of relations to draw facts out of.
let mut names = Vec::with_capacity(rule.body.len());

// For each body plan apply the plan to a newly named relation.
for (index, (plan, atom)) in plan.bodys.iter().zip(rule.body.iter()).enumerate() {

    if plan.identity() {
        facts.entry(atom.name.clone()).or_default();
        names.push(atom.name.clone());
    }
    else {
        let new_name = format!("{}-{:?}-in", name, index);
        let mut builder = FactBuilder::default();
        // Read from the atom's source relation (ensuring it exists first).
        let found = facts.entry(atom.name.clone()).or_default();
        if stable {
            for layer in found.stable.layers.iter() {
                plan.apply(layer, &mut builder);
            }
        }
        plan.apply(&found.recent, &mut builder);

        facts.entry(new_name.clone()).or_default().add_set(builder);
        names.push(new_name);
    }
}

There is a cute moment here where if the plan is a no-op, and we try to make them a no-op if at all possible, we can just skip the step. In that case, we'll just use the source fact list as reported!

Otherwise, we use that trusty FactBuilder we described earlier, and apply the plan to recent and if asked to each layer of stable. The builder is handed off to the FactSet, which knows how to integrate the FactLSM into its to_add member.

Apply the joins: right to left

We now have prepared body inputs, each in the sorted order they need to be in to perform a merge join. The next step is to apply these merge joins until there are at most two relations left. Each time we apply a join we need to indicate an arity, and what to do with the matching results. We will just put them into a FactBuilder, and stash that associated with a new temporary name.

let mut r_name = names.pop().unwrap();
if names.len() > 1 {
    // Burn down the joins until there are at most two collections left.
    for (index, (arity, projection)) in plan.joins.iter().enumerate().rev() {

        let l_name = names.pop().unwrap();
        let mut builder = FactBuilder::default();
        join_with(facts.get(&l_name).unwrap(), facts.get(&r_name).unwrap(), stable, *arity, |v1, v2| {
            builder.push(projection.iter().copied().map(|i| {
                if i < v1.len() { v1.get(i) } else { v2.get(i - v1.len()) }
            }));
        });

        // Stash the results under the name we'll use next.
        r_name = format!("{}-{:?}-mid", name, index + 1);
        facts.entry(r_name.clone())
             .or_default()
             .add_set(builder);
    }
}

Clearly the join magic is in another castle. The castle named join_with. We'll finally get there, but it's basically just ripped off of datafrog, so don't get too excited. Before then, one last step in this method.

Apply the last join: populate the heads

This last part is just horrible, at least in proportion to how little is going on.

If there are two body atoms to join, we are going to do the same thing as above but we'll use different instructions to lay out the facts. We'll also use multiple builders, because there could be multiple heads. If there is only one body atom we'll do the same thing but without a join, just by reading the relation.

Somehow I couldn't write this code very cleanly, so it's the same thing repeated three times. You'll have to check out the repo to read it, I'm sorry.
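The per-row action is at least easy to sketch in isolation. Over plain byte-vector rows (the real code works on column references), each head's instructions either copy a bound column from the matched rows or emit a literal:

/// Lays out one output fact following a head's instructions, given two matched rows.
fn apply_head(head: &[Result<usize, String>], v1: &[Vec<u8>], v2: &[Vec<u8>]) -> Vec<Vec<u8>> {
    head.iter().map(|action| match action {
        Ok(index) if *index < v1.len() => v1[*index].clone(),
        Ok(index) => v2[index - v1.len()].clone(),
        Err(lit) => lit.clone().into_bytes(),
    }).collect()
}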

Getting down to joining

Let's knock out that join_with function and be done!

/// Joins `body1` and `body2` using the first `arity` columns.
///
/// Matching elements are subjected to `action`.
/// When `stable` is set, we join the stable plus recent elements of each input;
/// when it is unset we exclude pairs of terms that are both stable.
pub fn join_with(
    body1: &FactSet,
    body2: &FactSet,
    stable: bool,
    arity: usize,
    mut action: impl FnMut(<Fact as Columnar>::Ref<'_>, <Fact as Columnar>::Ref<'_>),
)
{
    // Compare elements by their first `arity` columns only.
    // This is a moment where compile-time information about types would help us; perhaps column introspection can recover.
    let order = |x: <Fact as Columnar>::Ref<'_>, y: <Fact as Columnar>::Ref<'_>| { (0 .. arity).map(|i| x.get(i)).cmp((0 .. arity).map(|i| y.get(i))) };

    if stable {
        for layer1 in body1.stable.layers.iter() {
            for layer2 in body2.stable.layers.iter() {
                join::<Fact>(layer1, layer2, order, &mut action);
            }
        }
    }

    for stable2 in body2.stable.layers.iter() {
        join::<Fact>(&body1.recent, stable2, order, &mut action);
    }

    for stable1 in body1.stable.layers.iter() {
        join::<Fact>(stable1, &body2.recent, order, &mut action);
    }

    join::<Fact>(&body1.recent, &body2.recent, order, &mut action);
}

Ok this was apparently all a lie and now there is some other join function?

So all we are doing here is handling that stable bit. We are performing the three joins that involve recent facts, and if asked the one join that involves only stable facts. We don't pass stable along to the join function, which really is going to join things this time, I swear.

/// Match keys in `input1` and `input2` and act on matches.
fn join<'a, T: Columnar<Ref<'a> : Ord+std::fmt::Debug>> (
    input1: &'a T::Container,
    input2: &'a T::Container,
    mut order: impl FnMut(T::Ref<'a>, T::Ref<'a>) -> std::cmp::Ordering,
    mut action: impl FnMut(T::Ref<'a>, T::Ref<'a>),
) {
    use std::cmp::Ordering;

    let input1 = input1.borrow();
    let input2 = input2.borrow();

    let mut index1 = 0;
    let mut index2 = 0;

    // continue until either input is exhausted.
    while index1 < input1.len() && index2 < input2.len() {
        // compare the keys at this location.
        let pos1 = input1.get(index1);
        let pos2 = input2.get(index2);
        match order(pos1, pos2) {
            Ordering::Less => {
                // advance `index1` while strictly less than `pos2`.
                gallop::<T>(input1, &mut index1, |x| order(x, pos2) == Ordering::Less);
            },
            Ordering::Equal => {
                // Find *all* matches and increment indexes.
                let count1 = (index1..input1.len()).take_while(|i| order(input1.get(*i), pos1) == Ordering::Equal).count();
                let count2 = (index2..input2.len()).take_while(|i| order(input2.get(*i), pos2) == Ordering::Equal).count();

                for i1 in 0 .. count1 {
                    let row1 = input1.get(index1 + i1);
                    for i2 in 0 .. count2 {
                        let row2 = input2.get(index2 + i2);
                        action(row1, row2);
                    }
                }

                index1 += count1;
                index2 += count2;
            },
            std::cmp::Ordering::Greater => {
                // advance `index2` while strictly less than `pos1`.
                gallop::<T>(input2, &mut index2, |x| order(x, pos1) == Ordering::Less);
            },
        }
    }
}

Straight out of datafrog, which took it straight out of somewhere else. I didn't invent joins.

This is a merge join, in that it is moving sequentially through sorted inputs and looking for keys that match. When it finds matching keys, it calls the action function, which as we know just packs facts into a FactBuilder. When it doesn't find matching keys, it calls the gallop function, which I definitely stole from the EmptyHeaded folks, and which is a bit like a binary search to move forward quickly to the next possible match.

You want to see gallop, you say?

/// Increments `index` until just after the last element of `input` to satisfy `cmp`.
///
/// The method assumes that `cmp` is monotonic, never becoming true once it is false.
#[inline(always)]
pub(crate) fn gallop<'a, T: Columnar>(input: <T::Container as Container<T>>::Borrowed<'a>, index: &mut usize, mut cmp: impl FnMut(<T as Columnar>::Ref<'a>) -> bool) {
    let upper = input.len();
    // if empty input, or already >= element, return
    if *index < upper && cmp(input.get(*index)) {
        let mut step = 1;
        while *index + step < upper && cmp(input.get(*index + step)) {
            *index += step;
            step <<= 1;
        }

        step >>= 1;
        while step > 0 {
            if *index + step < upper && cmp(input.get(*index + step)) {
                *index += step;
            }
            step >>= 1;
        }

        *index += 1;
    }
}

That is literally the very end of join.rs. We are now done with the code tour.

Experimentation

I have some data from the Graspan project, acquired at the time but still available on Google Drive. If you want to follow along, put this fragment at the top of the main.rs file.

for arg in std::env::args().skip(1) {

    // Read input data from a handy file.
    use std::fs::File;
    use std::io::{BufRead, BufReader};

    let mut dict: BTreeMap<String, facts::FactBuilder> = BTreeMap::default();
    let filename = arg;
    let file = BufReader::new(File::open(filename).unwrap());
    for readline in file.lines() {
        let line = readline.expect("read error");
        if !line.is_empty() && !line.starts_with('#') {
            let mut elts = line.split_whitespace().rev();
            if let Some(name) = elts.next() {
                dict.entry(name.to_string())
                    .or_default()
                    .push(elts.rev().map(|x| x.as_bytes()));
            }
        }
    }
    for (name, facts) in dict { 
        state.facts
             .entry(name)
             .or_default()
             .add_set(facts); 
    }
}
state.update();

There are a few different inputs and different program analyses to look at. We'll break them down by analyses (dataflow and points-to) and go from there.

Nullability analysis

I'm starting things up this way, with the httpd dataflow input and the .list command once we are in the shell to see what we are looking at.

mcsherry@gallustrate datatoad % cargo run --release -- ~/Projects/datasets/graspan-dataflow/httpd_df
    Finished `release` profile [optimized] target(s) in 0.00s
     Running `target/release/datatoad /Users/mcsherry/Projects/datasets/graspan-dataflow/httpd_df`

> .list
        e:      9905624
        n:      138331
39.458µs

> 

Two relations, e and n, corresponding to "edges" and "nodes" respectively. The same will be true for the psql and linux inputs as well, for the dataflow analysis.

What is this dataflow analysis?

  • The n(?a, ?b) relation describes the potential write of some value ?a to a location ?b.
  • The e(?a, ?b) relation describes the movement of the value at one location ?a to another location ?b, as through assignment in a program.

We're doing a "reachability" query to see which values written to some location can end up at any other through a path of edges. We can write this in Datalog as:

n(?a, ?c) :- n(?a, ?b), e(?b, ?c) .

In words, "if value ?a may be written to ?b, and ?b may be copied to ?c, then the value ?a may be written to ?c".

Let's key this rule in to the shell and see what happens!


Narrator: What happened was an error. Always test your code, kids.


With the error fixed, and pushed to the repo, the result is

> n(?a, ?c) :- n(?a, ?b), e(?b, ?c) .
15.494418333s

> .list
        .temp-0-0-in:   9393283
        e:      9905624
        n:      9393283
45.375µs

> 

We took about 15s here, which .. is a number. It's quite a lot faster than what Graspan reports for the same problem (684s). But it's not the end of the story.

There's also this .temp-0-0-in up there, which isn't clearly anything we care about. And it has .. roughly as many records as n, and that number isn't small. What's going on?

When we look at the join, we want to order the columns of n first by ?b and then by ?a, so that they can be joined with e(?b, ?c), which already has ?b up front. As such, we need to make an intermediate collection that just re-orders n and re-sorts it; it won't need to be re-deduplicated, but we'll do that as well. So .temp-0-0-in is there only because we need to reshape n to make it work for the join.

We could try and load the data in differently, or sort n by its second coordinate, or something else complicated, but let's not. A simpler thing we can do, by virtue of n being small initially, is to reshape it once and develop the reshaped variable instead. Here's what that looks like:

m(?b, ?a) :- n(?a, ?b) .
m(?c, ?a) :- m(?b, ?a), e(?b, ?c) .

Pretty easy, with a bit of care taken to wiggle the variables around correctly. In fact, screw this, let's use proper names to make this clearer.

m(?loc, ?val) :- n(?val, ?loc) .
m(?loc, ?val) :- m(?mid, ?val), e(?mid, ?loc) .

I'd love to give e, n, and m better names, but for the moment that would involve copying the whole of the data around. Could do, but it's not the performance no-op that using better variable bindings is. For now.

Let's see what happens now.

> m(?loc, ?val) :- n(?val, ?loc) .
58.964458ms

> m(?loc, ?val) :- m(?mid, ?val), e(?mid, ?loc) .
8.434536333s

> .list
        .temp-0-0-in:   138331
        e:      9905624
        m:      9393283
        n:      138331
23.959µs

> 

This is now substantially faster, but it does still have a .temp-0-0-in in there. This is my bad. It corresponds to the first rule, remapping n to m. There is no reason to have a non-identity map there, and I'll just need to tighten the logic for that (relatively untested) special case.

The good news is that we don't have a .temp relation for the join. Both e and m are laid out the way the join wants them laid out, and we don't have to make or maintain copies of the data. Other than that glitch in the copying. But we can ignore the glitch for now.


Narrator: He could not ignore the glitch.


The glitch went unnoticed because even for single body atoms, we create a plan that orders the columns alphabetically, because it uses a BTreeSet to hold the names. When we switched from ?a and ?b to instead use ?val and ?loc, for narrative reasons, that order changed. It reveals a larger glitch, spanning joins as well: body atoms have the set of their key columns dictated to them, but they could choose the order of those columns, and prefer their natural order when a compatible one exists.

We'll fix this, but not right now.

What we've learned here is that there is a bit of query planning and optimization that is up to the user. The initial shape of the data determines how efficient the plan can be, in a way that is tricky to fix without deferring computation until we see the ways the data will be used. That's not the intent of our current Datalog interpreter, which eagerly performs work, but it's a reasonable thing to think about supporting.

At this point we can load up other datasets and hit them with the same analysis. There is a psql dataset and several linux datasets. My understanding of the linux datasets is that they are meant to be analyzed in isolation: they all use identifiers starting at 0 and going up, but they don't correspond to the same code locations across files. The largest by far is lnx_kernel_df, and I'll just use that one.

I put the results together, reporting the time to perform the rule evaluation. The time to load the data is not insubstantial, though it could/should be instantaneous (more on this when we do "spill to disk").

| dataflow | httpd | psql   | lnx_kernel |
|----------|-------|--------|------------|
| graspan  | 684s  | 8640s  | 42840s*    |
| datatoad | 8.43s | 24.33s | 55.01s     |
| datafrog | 1.30s | 4.06s  | 8.03s      |

I had previously forgotten to subtract the loading time out for the datafrog numbers, which led me to feeling more smug than I currently do about the datatoad numbers. We will unpack where the time goes in a future update, but the answer is "dealing with the uncertainty of variable-length facts and literals". Also, datafrog hasn't had as many eyes on it; perhaps there is some low-hanging fruit others can see!

I put a * next to the Graspan lnx_kernel number because they report the times for everything together, but clearly not together like "run at the same time" because of the colliding identifiers. The lnx_kernel input is twice the size of all the other inputs combined, so it's not directionally misleading, although absolutely factually inaccurate.

We have some overhead over datafrog we need to explain, but are already comfortably non-trivial with respect to Graspan. The right point of comparison is probably Soufflé, which is the tool it seems like practicing professionals reach for, and which is certainly better than Graspan.

Aliasing analysis

A second analysis the Graspan folks consider, taken from Zheng and Rugina, is about analyzing the potential "aliasing" of values and memory locations. Their formulation works with expressions of two flavors: values e and locations in memory *e. They start from two input relations:

Assignment:     A(?val, ?loc)   // for each `?loc <- ?val`
Dereference:    D(?val, ?loc)   // for each `?loc` as `*?val`.

From these input data, they would like to determine two classes of aliasing (quoted from Zheng and Rugina):

  • Memory (or location) aliases: two lvalue expressions are memory aliases if they might denote the same memory location;
  • Value aliases: two expressions are value aliases if they might evaluate to the same pointer value.

If these are in any way confusing (they are to me) we can also just move forward with the data and the expressed analyses. They move forward, describing memory and value aliasing thusly, using terms we'll need to unpack.

Memory aliases:  M ::= D^T V D
Value aliases:   V ::= F^T M^? F
Value flows:     F ::= (A M^?)^*

Ok. What do these weird things mean?

The right hand sides are sequences of binary relations, where the intent of Q ::= X Y Z is the existence of intermediate variables that link up the relations into a path. In Datalog we might write instead:

Q(?a, ?d) :- X(?a, ?b), Y(?b, ?c), Z(?c, ?d).

There are some additional weird symbols that we don't see in Datalog:

  • The ^T means "transpose", and it turns the relation around.
  • The ^? means "optionally", and the term can either be present or absent.
  • The ^* means "repeatedly", and allows any number of repetitions (including zero) of the term.

For example, we could write the M rule in Datalog as

M(?l1, ?l2) :- D(?v1, ?l1), V(?v1, ?v2), D(?v2, ?l2).

Notice how we handled the ^T just by changing the order of the variables in the first D atom. Informally, the rule says that if values ?v1 and ?v2 are potentially aliased, then the memory locations that result from dereferencing them may be "memory aliased". That seems like it could make sense.

If we try and do the other two rules we run into the ^? and ^* terms, and .. they are annoying for Datalog. Datalog doesn't really have a "yeah, that .. or you could just skip it".

The ^? operator is relatively easy to accommodate, just by repeating the rule with and without the term. For value aliases, which may or may not involve a memory alias, we can write the two rules

V(?v1, ?v2) :- F(?v3, ?v1), F(?v3, ?v2).
V(?v1, ?v2) :- F(?l1, ?v1), M(?l1, ?l2), F(?l2, ?v2).

The ^* operator is a bit harder, because one "alternative" rule would have an empty body. Informally, this kinda means that it is unioned with the identity operator, which we may have to represent explicitly. We'll do that for now, but there is another fix that involves dramatically expanding out the set of rules. We'll discuss that if this approach ends up terrible.

The value flow term is any number (including zero) of steps along a A M^? path, which we can write as:

F(?val, ?unk) :- A(?val, ?loc), F(?loc, ?unk).
F(?val, ?unk) :- A(?val, ?loc), M(?loc, ?loc2), F(?loc2, ?unk).

I put ?unk as the right hand variable of F because .. we don't really know what it is yet. The rules say that you get whatever ?unk you found from the existing ?unk, but where does it all start from?

Informally, you start from the identity operator on the combination of all values and locations. We'll need values because F's first argument results from A's first argument, a value. We'll need locations because F's first argument joins with locations. We don't have permission to do anything less, so let's put all values and locations we have ever heard about into F.

F(?v, ?v), F(?l, ?l) :- A(?v, ?l).
F(?v, ?v), F(?l, ?l) :- D(?v, ?l).

These last rules are the unlock that both complete the definition of F and start the iteration. Let's do that now.

While all of this was running, I went and read the Zheng and Rugina paper. It's maybe a good moment to point out that while the Graspan paper took the analysis from Zheng and Rugina, the whole point of their paper was to avoid doing the sort of bottom-up analysis we are doing. The Zheng and Rugina paper is literally "Demand-Driven Alias Analysis for C", where "demand-driven" means it tries to prove certain specific target facts, rather than all facts. We'll come back to that in a bit.

> F(?v, ?v), F(?l, ?l) :- a(?v, ?l).
686.574892166s

> F(?v, ?v), F(?l, ?l) :- d(?v, ?l).
736.344523208s

> .list
        -a:     362799
        -d:     1147612
        .temp-0-1-in:   361947256
        .temp-0-1-mid:  172922754
        .temp-2-1-in:   92806768
        .temp-2-1-mid:  173310916
        .temp-3-0-in:   362799
        .temp-4-0-in:   362799
        .temp-4-1-in:   92806768
        .temp-4-1-mid:  173310916
        .temp-5-0-in:   362799
        .temp-6-0-in:   1147612
        F:      2669647
        M:      92806768
        V:      361947256
        a:      362799
        d:      1147612
145.333µs

> 

Oh boy that took a while. My, the process is also sitting at 50.13GB, on my laptop, so .. that spilling-to-disk thing seems to already be happening. Let's look a bit closer and see if we can't dial in the numbers a bit more by writing the query more carefully. Then we'll discuss the whole "demand-driven" thing.

Let's start by looking at these temporary relations. Why do they each exist?

  • The .temp-0-1-in term is V "backwards" in the production of M.
  • The .temp-2-1-in term is M "backwards" in the production of V.
  • The .temp-3-0-in term is a "backwards" in the production of F.
  • The .temp-4-0-in term is a "backwards" in the production of F.
  • The .temp-4-1-in term is M "backwards" in the production of F.
  • The .temp-5-0-in term is a "backwards" in the production of F.
  • The .temp-6-0-in term is d "backwards" in the production of F.

The two big relations here are V and M, which are each "backwards" in all of their uses. The a and d relations are pretty small, but we can replace their uses with -a and -d, which are the Graspan names for the transposes (loaded as part of the input data). Let's instead produce and use -V and -M, following the Graspan nomenclature, flipping the order of their columns and making our rules now:

-M(?l2, ?l1) :- d(?v1, ?l1), -V(?v2, ?v1), d(?v2, ?l2) .

-V(?v2, ?v1) :- F(?v3, ?v1), F(?v3, ?v2).
-V(?v2, ?v1) :- F(?l1, ?v1), -M(?l2, ?l1), F(?l2, ?v2).

F(?val, ?unk) :- -a(?loc, ?val), F(?loc, ?unk).
F(?val, ?unk) :- -a(?loc, ?val), -M(?loc2, ?loc), F(?loc2, ?unk).
F(?v, ?v), F(?l, ?l) :- -a(?l, ?v).
F(?v, ?v), F(?l, ?l) :- -d(?l, ?v).

As we key in the final two rules, the system kicks into action again

> F(?v, ?v), F(?l, ?l) :- -a(?l, ?v).
418.523364167s

> F(?v, ?v), F(?l, ?l) :- -d(?l, ?v).
397.395456458s

> .list
        -M:     92806768
        -V:     361947256
        -a:     362799
        -d:     1147612
        .temp-0-1-mid:  172922754
        .temp-2-1-mid:  173310916
        .temp-4-1-mid:  173310916
        F:      2669647
        a:      362799
        d:      1147612
531.042µs

> 

We've removed all of the -in temporary relations, and the process is now at 31.96GB. The numbers of tuples in the -M, -V, and F relations match their counterparts in the previous execution, so we've almost certainly produced the same thing. It might even be correct, but let's not get too unrealistic. We do still have the temporary intermediate results for the three-way joins, but we'll have to wait to learn about streaming joins before we can remove them.

Our total ends up at 815.92s, or 13.6 minutes, which compares favorably with Graspan's 8.3 hours. Still not sure if we are computing the right thing, so let's only be lightly pleased with ourselves. But pleased nonetheless.


Let's do a next unlock, and read the Zheng and Rugina paper more carefully. When they get to explaining their solution to the problem, relatively abstract up until this point, we learn a very important detail:

We now present the demand-driven algorithm for answering memory may-alias queries. Given two lvalue expressions e1 and e2, the algorithm determines whether M (e1, e2) holds.

They only need to produce M as an output! And really they only need to be able to probe M, which is what that "demand-driven" term means, but let's talk about M first. If you only need to produce M, you don't need to produce V, which is great news because V is enormous. But don't we need V to produce M? No, not really.

The V relation contains all value aliasing, but M is only interested in value aliasing for addresses: values that appear as the first column in d. Why the hell are we computing all of V then? Let's inline the definition of V into M (make sure to choose fresh variable names; I sure didn't the first time!).

-M(?l2, ?l1) :- d(?v1, ?l1), F(?v3, ?v1), F(?v3, ?v2), d(?v2, ?l2) .
-M(?l2, ?l1) :- d(?v1, ?l1), F(?l3, ?v1), -M(?l4, ?l3), F(?l4, ?v2), d(?v2, ?l2) .
F(?val, ?unk) :- -a(?loc, ?val), F(?loc, ?unk).
F(?val, ?unk) :- -a(?loc, ?val), -M(?loc2, ?loc), F(?loc2, ?unk).
F(?v, ?v), F(?l, ?l) :- -a(?l, ?v).
F(?v, ?v), F(?l, ?l) :- -d(?l, ?v).

Keyed in, this results in

> F(?v, ?v), F(?l, ?l) :- -a(?l, ?v).
234.626393458s

> F(?v, ?v), F(?l, ?l) :- -d(?l, ?v).
98.101901291s

> .list
        -M:     92806768
        -a:     362799
        -d:     1147612
        .temp-0-1-mid:  17768284
        .temp-0-2-in:   2669647
        .temp-0-2-mid:  1859886
        .temp-1-1-mid:  172280896
        .temp-1-2-mid:  73474947
        .temp-1-3-in:   2669647
        .temp-1-3-mid:  1859886
        .temp-3-1-mid:  173310916
        F:      2669647
        a:      362799
        d:      1147612
86.75µs

> 

The -M and F counts have stayed the same, so probably still the right answer. The -V count is gone, which was the whole point. The process is now sitting at 18.96GB, so that was clearly a bit of a savings.

We do have some new join plan temporaries now. Notice in particular that the sizes of .temp-0-2-mid and .temp-1-3-mid are identical. They both correspond to the join of the two atoms at the end of the first two rules F(?v3, ?v2), d(?v2, ?l2).

Also, if you turn your head upside down and inside out, you can see that these are the same as the first two atoms in each of these joins. Why don't we just give them a name and use that? We'll call it Fd to evoke the sequencing of the two.

Fd(?v3, ?l2) :- F(?v3, ?v2), d(?v2, ?l2).

Rewritten, the -M productions are now a fair bit shorter.

-M(?l2, ?l1) :- Fd(?v3, ?l1), Fd(?v3, ?l2) .
-M(?l2, ?l1) :- Fd(?l3, ?l1), -M(?l4, ?l3), Fd(?l4, ?l2) .
F(?val, ?unk) :- -a(?loc, ?val), F(?loc, ?unk).
F(?val, ?unk) :- -a(?loc, ?val), -M(?loc2, ?loc), F(?loc2, ?unk).
F(?v, ?v), F(?l, ?l) :- -a(?l, ?v).
F(?v, ?v), F(?l, ?l) :- -d(?l, ?v).

Again, keying in the final two rules kicks things off

> F(?v, ?v), F(?l, ?l) :- -a(?l, ?v).
163.471007292s

> F(?v, ?v), F(?l, ?l) :- -d(?l, ?v).
61.675481125s

> .list
        -M:     92806768
        -a:     362799
        -d:     1147612
        .temp-0-0-in:   2669647
        .temp-2-1-mid:  73474947
        .temp-4-1-mid:  173310916
        F:      2669647
        Fd:     1859886
        a:      362799
        d:      1147612
103.333µs

> 

Again, a pretty nice improvement.

One last thing to do for now. The only thing we do with F is put a d on the end. Also Fd is a bit smaller than F, so it seems like we didn't need everything we produced. We can just compute Fd directly, starting from d and prepending any number (including zero) of (a M^?) steps. This also solves the problem we had with the identity relation: Fd always has a d in it.

-M(?l2, ?l1) :- Fd(?v3, ?l1), Fd(?v3, ?l2) .
-M(?l2, ?l1) :- Fd(?l3, ?l1), -M(?l4, ?l3), Fd(?l4, ?l2) .
Fd(?val, ?unk) :- -a(?loc, ?val), Fd(?loc, ?unk).
Fd(?val, ?unk) :- -a(?loc, ?val), -M(?loc2, ?loc), Fd(?loc2, ?unk).
Fd(?val, ?loc) :- -d(?loc, ?val).

Now we only have to key in that last rule, and get

> Fd(?val, ?loc) :- -d(?loc, ?val).
162.082759041s

> .list
        -M:     92806768
        -a:     362799
        -d:     1147612
        .temp-1-1-mid:  73474947
        .temp-3-1-mid:  73474947
        Fd:     1859886
        a:      362799
        d:      1147612
34.333µs

> 

We're seeing exactly the same thing, which is that the intermediate result -M Fd is computed twice, and is pretty chonky. Let's fix that. For reasons that are fundamentally unfair to me, we'll want MFd without a - in front of it.

MFd(?l1, ?l2) :- -M(?l3, ?l1), Fd(?l3,?l2).
-M(?l2, ?l1) :- Fd(?v3, ?l1), Fd(?v3, ?l2) .
-M(?l2, ?l1) :- Fd(?l3, ?l1), MFd(?l3, ?l2).
Fd(?val, ?unk) :- -a(?loc, ?val), Fd(?loc, ?unk).
Fd(?val, ?unk) :- -a(?loc, ?val), MFd(?loc, ?unk).
Fd(?val, ?loc) :- -d(?loc, ?val).

As usual, the last rule kicks off the work

> Fd(?val, ?loc) :- -d(?loc, ?val).
119.34187225s

> .list
        -M:     92806768
        -a:     362799
        -d:     1147612
        Fd:     1859886
        MFd:    73474947
        a:      362799
        d:      1147612
31.666µs

> 

And we're sitting at 5.32GB. That's about a 10x improvement in memory from our first attempt at the plan, and nearly a 10x improvement in running time as well. I have no idea if we've computed anything of value, but I'm pretty confident that we've computed the original thing, only a fair bit faster. In fact, I think we've largely gone backwards through time, performing the technique that Zheng and Rugina departed from: explicit points-to sets that we intersect.


This was a not-unpleasant afternoon for me!

We did a bit of stress testing the tool, and managed to explore query optimization without needing to rewire the tool at all (except for a few bug fixes). We saw a series of program transformations that were mostly mechanical, and you could totally imagine a query optimizer doing these for you. An order of magnitude improvement from naively stated rules is an amazing testament to the potential of query optimizers.

We also learned that you can get any bushy-tree join plan you like by naming and forming the indexed intermediate results, like we did with Fd. A lot of "declarative" languages have the frustrating defect that you can't always get the plan you want, but that doesn't seem to be the case here. We also saw the corresponding downside, that if you name a thing (e.g. V) that you don't actually need we'll still produce it for you, potentially at great cost.

All in all, I'm pleased with this experiment so far. Up next, I'd like to grab more examples of program analysis in Datalog and see how they work out. Maybe collect a bit more experience with the space of bespoke optimizations before trying to automate any of them.


Update: I forgot to talk about "demand-driven" queries. We'll get there in time, but the answer is to a first approximation "magic sets", a query transformation that embeds target literals into the query. Think for example of starting Fd not from all of d, just the tuples you are interested in. That would be wrong, it turns out, but there is a general way to transform queries to only explore facts that are "demanded", hence "demand-driven". It turns out that magic sets are not the optimal answer; several papers explain how they could be more efficient. I'll be reading them looking for the simplest way to do things, but in the future.

Planning and Optimizing

Let's write a very simple, and really quite bad, join optimizer!

Wait, wait!

I need to get something out of my system about the motivation for all of this. And by "all of this" I mean Datalog. Why all the work for what seems like an esoteric logic programming language? Are you truly excited about programs built out of Horn clauses, their minimal Herbrand models, and extensions to non-monotonic logics? No?

Me neither. Sorry not sorry.

I find Datalog fascinating because it is for me the purest form of the main problem in data-parallel computation: data rendezvous. Essentially all of data-parallel computation reduces down to "I have subproblems that are easy enough to solve once I have brought the right data in the same place".

When you write a rule like

h(x, y, z) :- b1(x, y), b2(y, z) .

What you are saying, to me at least, is:

For each y, I need to bring together its x and z for some reason.

I don't know what your reason is, but this is 100% the sort of thing you do over and over and over in data-parallel computations. Yes of course if you want to pair up a customer's orders and account status you'll need a join like this, but it goes so much further. If you want to compute a PageRank step you'll need to bring together the ranks x of pages y with pages z they reference (and not shown: the out-degree of each page y). If you want to find friends x of yours y who might know a work acquaintance z, the first step is to collect your list of friends and coworkers.

The primitive operation in data-parallel computation is to group records by some key, and present the grouped list up to user logic (the reduce in MapReduce). Joins are the first way you can say "ok, ok, but not everything". They let you selectively route information to its ultimate destination h(x, y, z), whatever that is about.
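To make that concrete, here is a minimal hash-join sketch (plain Rust, unrelated to the datatoad internals) that brings each y's xs and zs together:

use std::collections::HashMap;

/// For each shared key y, rendezvous the xs from b1 with the zs from b2.
fn join(b1: &[(u32, u32)], b2: &[(u32, u32)]) -> Vec<(u32, u32, u32)> {
    // Group b1's x values by their key y.
    let mut by_key: HashMap<u32, Vec<u32>> = HashMap::new();
    for &(x, y) in b1 {
        by_key.entry(y).or_default().push(x);
    }
    // For each (y, z) in b2, emit h(x, y, z) for every matching x.
    let mut h = Vec::new();
    for &(y, z) in b2 {
        if let Some(xs) = by_key.get(&y) {
            for &x in xs {
                h.push((x, y, z));
            }
        }
    }
    h
}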

A data-parallel IR

Let's write a dead simple intermediate representation (IR) for data-parallel computation. It's going to be too simple, but .. already more expressive than most systems out there. It's specialized to joins at the moment, but we'll generalize it soon enough.

There are three opcodes in this IR

pub enum Op {
    /// A collection of data by name.
    Var(String),
    /// Applies an action to each tuple.
    /// The action can filter, permute, and project tuples.
    Map(Action),
    /// Brands the first so-many columns as keys.
    Key(usize),
    /// Brings together multiple collections with the same key length.
    /// For each key, produces the Cartesian product (cross join).
    Mul(usize),
}

I know I said three opcodes, and there are clearly four in that list. We'll end up fusing Map and Key into one operator, to make our lives easier. As it turns out it is really powerful to allow the sharing and reuse of the outputs of Key and Mul expressions, but nothing great happens from reusing Map expressions.

The Action type could do a great many things, but for the moment we'll have it do something roughly like our Plan from way up above:

/// Filters rows, re-orders columns, and groups by a prefix.
pub struct Action {
    /// Columns that must equal some literal.
    lit_filter: Vec<(usize, String)>,
    /// Lists of columns that must all be equal.
    var_filter: Vec<Vec<usize>>,
    /// The order of input columns presented as output.
    projection: Vec<usize>,
    /// The prefix length by which the rows are grouped.
    key_arity: usize,
}

With this type, and in particular the key_arity field, we can just drop the Key variant above.

This IR is already able to express any of the right-to-left join plans we have produced so far. Each binary join is implemented by keying its inputs by the columns being equated, and then passing any subsequent actions on to the join output as the action argument to our join function.
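As a concrete sketch, a single binary join Q(?a, ?c) :- X(?a, ?b), Y(?b, ?c) might render in prefix notation as follows. The exact Action field values here are my guesses rather than the code's output, and I'm assuming Action derives Default.

// A hypothetical plan: key both inputs by ?b, join, then project.
vec![
    // The join output is the concatenation (b, a, b, c); emit (a, c).
    Op::Map(Action { projection: vec![1, 3], key_arity: 0, ..Default::default() }),
    Op::Mul(2),
    // Key X(?a, ?b) by ?b: reorder to (b, a), with one key column.
    Op::Map(Action { projection: vec![1, 0], key_arity: 1, ..Default::default() }),
    Op::Var("X".to_string()),
    // Key Y(?b, ?c) by ?b: already (b, c), with one key column.
    Op::Map(Action { projection: vec![0, 1], key_arity: 1, ..Default::default() }),
    Op::Var("Y".to_string()),
];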

At the same time, the IR can also represent things much smarter, and also much less smart. Let's start with less smart, and build up to smarter.

Something much less smart

Let's parse an arbitrary rule into a program in our opcodes!

I'm not sure how much less smart you were thinking, but I'm thinking we just cross-join all of the atoms in the body, and for each head put a filter and projection atop the cross product. Nothing wrong with Map using a key arity of zero, and Mul applying to any number of these. Nothing wrong other than performance, of course. We'll get there.

/// Converts a multi-head rule into assembly instructions.
pub fn rule_to_ir(rule: &Rule) -> Vec<ENode<Op>> {

    let mut head = Action::default();

    // Columns introduced with the same names (equality constraints).
    let mut bind = BTreeMap::<&str, Vec<usize>>::default();
    let mut prog = vec![Op::Mul(rule.body.len())];

    // Each atom is introduced with a no-op action.
    // Head rules will introduce filters and projections.
    for atom in rule.body.iter() {
        prog.extend([
            Op::nop(atom.terms.len()),
            Op::Var(atom.name.clone()),
        ]);
        for term in atom.terms.iter() {
            let column = head.projection.len();
            head.projection.push(column);
            match term {
                // Variable terms record the name used.
                Term::Var(name) => { 
                    bind.entry(name)
                        .or_default()
                        .push(column); 
                },
                // Literal terms record the filter.
                Term::Lit(data) => { 
                    head.lit_filter
                        .push((column, data.to_string())); 
                },
            }
        }
    }
    // Add the equated columns as a filter constraint.
    // Retain `bind` to support name resolution for heads.
    head.var_filter
        .extend(bind.values()
                    .filter(|l| l.len() > 1)
                    .map(|l| l.clone()));

    unimplemented!("Head atoms next!")
}

This puts together a list of Op instructions that I claim will perform the work we need. We are using Polish notation, aka "prefix notation", where something like

vec![
    Op::Mul(2), 
    Op::nop(3), 
    Op::Var("potato".to_string()),
    Op::nop(1),
    Op::Var("salad".to_string()),
];

says to perform a binary (2) join between the keyless three columns of potato and the keyless single column of salad. That Op::nop function takes a number of columns, and does nothing to the data but project those columns out (key arity of zero).
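We haven't actually shown Op::nop, but a plausible definition (a sketch, assuming Action derives Default) would be:

impl Op {
    /// A Map that passes all `cols` columns through unchanged:
    /// no filters, identity projection, key arity of zero.
    fn nop(cols: usize) -> Self {
        Op::Map(Action {
            projection: (0..cols).collect(),
            ..Default::default()
        })
    }
}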

We have the body of the rule ready to go. If we wanted to see everything, we'd just put Op::Map(head) at the front of our list of commands. Instead, we have a bunch of different head atoms we'll want to tailor head to support.

    // Each head atom should personalize `act` and apply it to `join`.
    // The personalization projects out only the distinct names.
    let heads = rule.head.iter().map(|atom| {
        let mut head_names = BTreeSet::<&str>::default();
        let mut head = head.clone();
        head.projection.clear();
        for term in atom.terms.iter() {
            // Ignore literals in head atoms.
            if let Term::Var(name) = term {
                if head_names.insert(name) {
                    // Use the first equivalent column.
                    let col = bind.get(name.as_str()).unwrap()[0];
                    head.projection.push(col);
                }
            }
        }
        head
    }).collect::<Vec<_>>();

The list heads now contains an Action for each head atom, and if that action is used in an Op::Map and applied to prog we'll compute the right result. Again, incredibly inefficiently with lots of cross joining. But it should be the right result. Just oh so very slowly.

Optimizing our IR

I hope you like e-graphs. I sure do.

If you don't know about e-graphs you might want to check out the egg webpage. I've also written a post about e-graphs, which describes the code we are about to use next. I've written up an e-graph implementation, borrowing poorly from the egg project, which is what you should use if you'd like professional e-graphs. I was hoping to learn by doing, so I did some doing. Learning then ensued.

The first step to learning about e-graphs is to learn about term graphs. A term graph can be seen as a map from identifiers Id to operations that only reference identifiers.

struct ENode<T> {
    op: T,
    args: Vec<Id>,
}

type TermGraph<T> = BTreeMap<Id, ENode<T>>;

This is a way to represent an expression tree, where each tree node has its own Id. A term graph allows you to go further and express sharing: if two ENode instances are identical, same operator and same arguments, then you can use the same identifier for both. This sharing allows programs to be more concise, but also to reveal which data can be reused.

It may seem like there's not likely a lot of helpful sharing to pull out of our current plans, but that's only because you don't know what we are about to do next. We are going to make a LOT of new e-nodes. Before we get there, let's introduce the expressions we already have, from prog (the body) and heads.

    // Houses our options for the rule implementation.
    let mut e_graph: EGraph<Op> = Default::default();

    // Insert and capture a name for `join`.
    let body = e_graph.insert(prog.clone());
    // Insert but ignore each head atom.
    for head in heads.iter() {
        e_graph.ensure(ENode::new(Op::Map(head.clone()), [body]));
    }

The e_graph.insert method omnomnoms the supplied sequence of operators and rebuilds them as e-nodes internally. It returns the name associated with the root of the program, in our case the Op::Mul instruction. The e_graph.ensure method is similar, but it works with a raw e-node: you specify the operator and the list of argument identifiers. We could copy our head action atop many copies of prog, but that's just extra work.


Let's introduce the reference example I've been using, which tracks the "clever" observation from the implementation section above.

head(?a, ?b) :- a(?x, ?a), b(?y, ?x), b(?y, ?z), a(?z, ?b) .

This rule walks backwards through a then b, then forwards through b then a. In term graph representation, this looks like:

0 -> [ Var("a") ]
1 -> [ Map(Action { proj: [0, 1], arity: 0 })[0] ]
2 -> [ Var("b") ]
3 -> [ Map(Action { proj: [0, 1], arity: 0 })[2] ]
4 -> [ Mul(4)[1, 3, 3, 1] ]
5 -> [ Map(Action { var_filter: [[0, 3], [2, 4], [5, 6]], proj: [1, 7], arity: 0 })[4] ]

We've cleverly noticed that although we use a and b twice, because we do nothing to them yet (identity projection and no key), it's super easy to re-use them. We'll still be murdered by the Mul(4) cross join, but the program is more concise now.

Equality Saturation

Time to optimize the e-graph! But optimize how? We just have our list of instructions.

    // Time to optimize the e-graph!
    e_iterate(&mut e_graph, &[
        &MulPermute, 
        &MulPartition,
        &MapPushdown,
    ]);

Not very clear what happened here.

The tl;dr is that each of MulPermute, MulPartition, and MapPushdown are rules that are able to introduce new e-nodes, and equate them to other e-nodes. For example, MulPermute says that you can equate

    [ Map1, Mul(k), A, B, .. Z ]
=   
    [ Map2, Mul(k), B, Z, .. Q ]

as long as Map1 and Map2 are the same other than the appropriate column permutation. There are a lot of permutations, and .. so lots of equivalent e-nodes. Syntactically distinct e-nodes, so we'll write them all down, and then call them equal.

How do e-graphs "call things equal"? Rather than map from Id to one ENode<T>, they'll map to a list Vec<ENode<T>>. Each of the e-nodes in the list are equivalent, and any could be used in place of the other.
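In sketch form (my paraphrase of the structure, not necessarily the code's exact shape, reusing the ENode and Id types from above), that looks like:

use std::collections::BTreeMap;

/// A sketch: each e-class is a set of equivalent e-nodes under one Id.
type EClass<T> = Vec<ENode<T>>;

struct EGraph<T> {
    /// Map from canonical identifiers to their equivalence classes.
    classes: BTreeMap<Id, EClass<T>>,
    /// Redirects from retired identifiers to their canonical replacements.
    redirects: BTreeMap<Id, Id>,
}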

Our three e-graph rules will respectively do:

  1. MulPermute : equate all permutations of inputs for Mul(k).
  2. MulPartition : equate all ways you can break Mul(k) into nested Mul operators over partitions of its inputs.
  3. MapPushdown : equate a Map atop a Mul(2) with a Mul atop a Map that keys equated columns across the inputs, and has input-local filters and projections.

The code for these is a bit gnarly, and honestly I'm hoping not to use it for very long. It doesn't take an advanced degree to grok that a Mul(k) can produce at least 2^k * k! equivalent terms, just from the permutation and partitioning.


If we run the MulPermute rule only, repeatedly until we stop introducing new information, we end up with not all that many e-nodes

0:  -> [ Var("a") ]
1:  -> [ Map(Action { proj: [0, 1], arity: 0 })[0] ]
2:  -> [ Var("b") ]
3:  -> [ Map(Action { proj: [0, 1], arity: 0 })[2] ]
4:  -> [ Mul(4)[1, 3, 3, 1] ]
6:  -> [ Mul(4)[1, 3, 1, 3] ]
10: -> [ Mul(4)[1, 1, 3, 3] ]
13: -> [ Mul(4)[3, 1, 3, 1] ]
15: -> [ Mul(4)[3, 1, 1, 3] ]
17: -> [ Mul(4)[3, 3, 1, 1] ]

Those are the only distinct ways to use two copies of a and two copies of b. Admittedly, I concealed the e-class for the head. That's the one with all the equivalent expressions, so what does it look like?

EClass #5:  -> [ 
    Map(Action { var_filter: [[0, 3], [2, 4], [5, 6]], proj: [1, 7], arity: 0 })[4],
    Map(Action { var_filter: [[0, 3], [2, 6], [4, 7]], proj: [1, 5], arity: 0 })[6],
    Map(Action { var_filter: [[0, 5], [2, 4], [3, 6]], proj: [1, 7], arity: 0 })[4],
    Map(Action { var_filter: [[0, 5], [4, 6], [2, 7]], proj: [1, 3], arity: 0 })[10],
    Map(Action { var_filter: [[0, 7], [2, 6], [3, 4]], proj: [1, 5], arity: 0 })[6],
    Map(Action { var_filter: [[0, 7], [4, 6], [2, 5]], proj: [1, 3], arity: 0 })[10],
    Map(Action { var_filter: [[1, 2], [0, 4], [5, 6]], proj: [3, 7], arity: 0 })[13],
    Map(Action { var_filter: [[1, 2], [0, 6], [4, 7]], proj: [3, 5], arity: 0 })[15],
    Map(Action { var_filter: [[1, 4], [0, 2], [3, 6]], proj: [5, 7], arity: 0 })[17],
    Map(Action { var_filter: [[1, 4], [0, 6], [2, 7]], proj: [5, 3], arity: 0 })[15],
    Map(Action { var_filter: [[1, 6], [0, 2], [3, 4]], proj: [7, 5], arity: 0 })[17],
    Map(Action { var_filter: [[1, 6], [0, 4], [2, 5]], proj: [7, 3], arity: 0 })[13],
    Map(Action { var_filter: [[2, 5], [0, 4], [1, 6]], proj: [3, 7], arity: 0 })[13],
    Map(Action { var_filter: [[2, 5], [4, 6], [0, 7]], proj: [3, 1], arity: 0 })[10],
    Map(Action { var_filter: [[2, 7], [0, 6], [1, 4]], proj: [3, 5], arity: 0 })[15],
    Map(Action { var_filter: [[2, 7], [4, 6], [0, 5]], proj: [3, 1], arity: 0 })[10],
    Map(Action { var_filter: [[3, 4], [0, 2], [1, 6]], proj: [5, 7], arity: 0 })[17],
    Map(Action { var_filter: [[3, 4], [2, 6], [0, 7]], proj: [5, 1], arity: 0 })[6],
    Map(Action { var_filter: [[3, 6], [0, 2], [1, 4]], proj: [7, 5], arity: 0 })[17],
    Map(Action { var_filter: [[3, 6], [2, 4], [0, 5]], proj: [7, 1], arity: 0 })[4],
    Map(Action { var_filter: [[4, 7], [0, 6], [1, 2]], proj: [5, 3], arity: 0 })[15],
    Map(Action { var_filter: [[4, 7], [2, 6], [0, 3]], proj: [5, 1], arity: 0 })[6],
    Map(Action { var_filter: [[5, 6], [0, 4], [1, 2]], proj: [7, 3], arity: 0 })[13],
    Map(Action { var_filter: [[5, 6], [2, 4], [0, 3]], proj: [7, 1], arity: 0 })[4],
]

Haha. Ok. I got you.

For each Mul(4) above there are 2 x 2 Map here, corresponding to the two ways you could choose the a and b inputs. We could write a smarter MulPermute that notices when the argument Ids are equivalent and avoids the redundancy. But strictly speaking we haven't taught our e-graph that these terms are all equivalent. Perhaps the hash of var_filter is very important, for some reason. We are pretty sure that isn't the case, but still.

The next rule, MulPartition, takes the number of distinct identifiers up to 43, and doubles the number of equivalent head expressions up to 48. Everything else is pretty tidy though. There just aren't that many distinct subterms you can form. Enough that I don't want to copy/paste them here, but small enough that a computer does a fine job with them.

The last rule, MapPushdown, allows filters and projections to move across Mul operators, descending into their children where they can introduce a non-zero key_arity. There are now 114 distinct identifiers. There are for some reason 68 equivalent e-nodes corresponding to the root of the query. Several of them look like this, though

Map(Action { proj: [1, 3], arity: 0 })[239],

That is, they are simply a projection applied to some Mul term we can't see here. The numbers are not densely packed; they seem to go as high as 890 in this case.

So how do we find a plan that we like? Are any of them good?

Eggstraction

After saturating an e-graph with equivalences, the next step is "extraction".

This is where we take a peek at the e-graph and assign costs to each identifier. This is where we get to express our opinions about what is good and what is bad. For example, we should make sure that a four-way cross-join is bad.

We'll use a naive heuristic that costs each relation by its number of uncorrelated columns. For a Map this is the number of columns it produces. For a Mul this is the number of key columns, plus the sum of non-key columns across all inputs (the additional key columns are equated to the first, and needn't really be there). For a Var we just use zero because we have to use it and don't know how many columns it has. If an e-node has a higher cost than another it is frighteningly worse, and we won't use it.

For a set of identifiers we'll use the maximum cost among them. If there are ties in this cost metric, and there will be, we'll break them by minimizing the number of Map operators, then Mul operators. The Map operators need to write their data down, whereas the Mul operators just need to visit the data and do some compute.
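As a sketch (the signature and the inputs shape are mine, not the code's), the per-operator cost might look like:

/// A sketch of the cost heuristic: the number of uncorrelated columns.
/// `inputs` lists (key_arity, total_columns) for each argument of a Mul.
fn op_cost(op: &Op, inputs: &[(usize, usize)]) -> usize {
    match op {
        // Inputs must be used regardless; charge them nothing.
        Op::Var(_) => 0,
        // A Map costs the number of columns it produces.
        Op::Map(action) => action.projection.len(),
        // A Mul costs one copy of the key columns, plus each input's
        // non-key columns (the repeated keys are equated to the first).
        Op::Mul(_) => {
            let key = inputs.first().map_or(0, |(k, _)| *k);
            key + inputs.iter().map(|(k, c)| c - k).sum::<usize>()
        }
        // Key was fused into Map earlier, and should not appear.
        Op::Key(_) => unreachable!(),
    }
}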

There is some nuance here. It turns out there are a lot of distinct sets of identifiers we could use to activate the equivalent e-nodes of any one identifier. This input is small enough that you can just enumerate them all, though I took some care to suppress sets of identifiers that are strictly worse (a superset of) some other option. You could be smarter and proceed in waves from 0 upward, simply rejecting all e-nodes with a cost greater than the current wave. Once you establish viability at some cost, you could thin the options down to those, and do the more expensive thing.

The last thing to do, having done all of this mostly correctly, I imagine, is to find the specific sequence of operations to perform. More concretely, we'll return a sequence of ENode<Op> renumbered to use identifiers from zero onward referencing their list position. Let's see what happens on our reference input.

> head(?a, ?b) :- a(?x, ?a), b(?y, ?x), b(?y, ?z), a(?z, ?b) .

EPlan:
step 0: ENode { op: Var("a"), args: [] }
step 1: ENode { op: Var("b"), args: [] }
step 2: ENode { op: Map(Action { proj: [0, 1], arity: 1 }), args: [0] }
step 3: ENode { op: Map(Action { proj: [1, 0], arity: 1 }), args: [1] }
step 4: ENode { op: Mul(2), args: [3, 2] }
step 5: ENode { op: Map(Action { proj: [1, 3], arity: 1 }), args: [4] }
step 6: ENode { op: Mul(2), args: [5, 5] }
step 7: ENode { op: Map(Action { proj: [1, 3], arity: 0 }), args: [6] }
40.2635ms

> 

This gets found in wave two, at most two uncorrelated columns in any operator, and has four Map operators. Two of these are for the input and one of them is for the output. There is just one additional intermediate collection we form, corresponding to the b(?y, ?z), a(?z, ?b) fragment.

You might also notice that 40ms is not insubstantial for a release build.

What's next

I haven't done those clever things I mentioned that might speed things up. The time is almost entirely in the equivalence saturation, where a great deal of self-inflicted uncleverness occurs.

But there is also room to produce substantially fewer naive equivalences. For example, a fused MulPermute and MulPartition could realize that it's only worth partitioning the input into parts where each part is still connected by variable equivalences. If you partition any other way you'll have a cross join, and while that's sometimes unavoidable, it usually means something is wonky about your query.

One important next step is that all of the above has been for a single rule. We have many rules, and when the same term occurs in multiple heads it is the union of the inputs, which suggests an Op::Sum opcode that might learn about distributivity. It's also a bit awkward in that we can interactively update rules, so a term that may look like another term now may become unequal if we add a new rule to target it. We'll need to not accidentally equate terms with their definitions, as appealing as that would be.

For the moment, my plan is to hook up the outputs of this optimization and use it. If it sucks, I'll have to figure out why and fix it. Using things is a great forcing function to make them better.

But actually, the thing I'd like to do most is get away from binary joins as our plans. As fundamental as they are to data-parallel computation, we can all see the worst-case optimal train coming down the tunnel at us. There are fundamentally better ways to bring relevant data together, the atoms of which will still look like Map and Mul, but we'll need to generalize them first.

Did I mention that equivalence saturation and extraction are nearly Datalog programs? They fit the data-parallel paradigm, and are mostly about rendezvousing the right data.

Update 29-06-2025

I implemented the execution of the optimized plans. It took more work than I expected.

When last we met, our planning emerged with a Vec<ENode<Op>>, with the instructions to apply these operations in this sequence, and use the ordinal positions of each e-node to identify it. This is all well and good, but we don't actually want to execute each node in isolation.

We have a mix of Var, Map, and Mul operators, and our actual execution intent is

  1. For each Var, for each Map that depends on it, in one scan through the external collection apply each of the maps.
  2. For each Mul, for each Map that depends on it, in one scan through the several inputs, perform the join and apply each of the maps.

This means that Op::Map(action) specifically isn't an operation we "perform", as much as queue up for the operation on which it depends. We'll need to do more prep work to put together an actionable execution plan.

To make our lives easier, I've started to use an augmented Action type, evocatively called

struct TempAction {
    lit_filter: Vec<(usize, String)>,
    var_filter: Vec<Vec<usize>>,
    projection: Vec<Result<usize, String>>,
}

The projection here can be either a column reference or a string literal. That's the only difference. The name is actually a bit of a misnomer, as I hope to start using this everywhere, but I didn't want to have to update the optimization code, and all of the bugs I might introduce by changing it.

The extended pre-execution planning swings through our e-nodes and has each Op::Map variant enqueue itself in a map keyed by the input on which it depends. It does something different for the body and the head, only the latter of which can introduce literals, and which we have ignored in our optimization.

// The head needs planning assistance from its atoms, wrt literals and repeats.
let (body, head) = self.plan.split_at(self.plan.len() - self.rule.head.len());

With the head and body teased apart, we can enqueue the map actions by the identifier of the map input.

let mut steps = BTreeMap::<Id,Vec<(Id, TempAction)>>::default();

// For all map actions in the body, place in the step they derive from.
for (index, node) in body.iter().enumerate() {
    if let Op::Map(action) = &node.op {
        steps.entry(node.args[0])
             .or_insert_with(Vec::new)
             .push((index, TempAction::from_action(action)));
    }
}
// For all map actions in the head, place in the step they derive from.
for ((index, node), atom) in head.iter().enumerate().zip(self.rule.head.iter()) {
    if let Op::Map(action) = &node.op {
        steps.entry(node.args[0])
             .or_insert_with(Vec::new)
             .push((body.len() + index, TempAction::from_head(atom, action)));
    }
    else { panic!("Malformed plan; head atom is not a map!"); }
}

The steps map now contains the actual steps we need to perform in the order we should perform them. What follows is a re-writing of our much, much earlier much, much more horrible logic for evaluating a bespoke join plan. It's a bit tidier, but if you don't want to read the whole thing, I understand.

We'll start with the set-up: we'll still need names for each collection we materialize. Some of these names will be input names, some will be output names, and others will be entirely temporary. But names are how we access our Facts to get a concrete list of stable and recent facts, and to add candidate new facts.

let mut names = BTreeMap::<Id, String>::default();
for (idx, actions) in steps {
    let node = &self.plan[idx];
    match &node.op {

        // Per-op logic, shown below.
        _ => unimplemented!()

    }
}

To show you just how serious about names we are, here is the logic for determining the right name to use just when we load data from an input, no joins:

    Op::Var(name) => {

        let mut todos = Vec::default();
        for (id, action) in actions.iter() {
            // Head productions should write at their own name.
            if *id >= body.len() {
                names.insert(*id, self.rule.head[id-body.len()].name.clone());
                todos.push((*id, action));
            }
            // Each action that is an identity can repeat the input name.
            else if action.is_ident(*arities.get(name).unwrap()) {
                names.insert(*id, name.clone());
            }
            // Each action that is not an identity will create a new name,
            // and deposit the loading results behind that name.
            else {
                let new_name = format!(".temp-{}-{}", pos, id);
                names.insert(*id, new_name);
                todos.push((*id, action));
            }
        }

The three cases are

  1. a head rule, which must write at its own name,
  2. a no-op transformation that can re-use the existing name, and
  3. a non-trivial non-head transform, which will need a temporary name.

With names in hand, we load up the input facts, if we can find them, and apply all actions. We'll accumulate the result in a list of FactBuilder things appropriately called builders.

        if !todos.is_empty() {
            let mut builders = vec![FactBuilder::default(); todos.len()];
            if let Some(found) = facts.get(name) {
                for layer in found.stable.contents().filter(|_| stable).chain(Some(&found.recent)) {
                    for ((_id, action), builder) in todos.iter().zip(builders.iter_mut()) {
                        for item in layer.borrow().into_index_iter() {
                            let lit_match = action.lit_filter.iter().all(|(i,l)| item.get(*i).as_slice() == l.as_bytes());
                            let var_match = action.var_filter.iter().all(|idxs| idxs.iter().all(|i| item.get(*i) == item.get(0)));
                            if lit_match && var_match {
                                builder.push(action.projection.iter().map(|i|
                                    match i { Ok(col) => item.get(*col).as_slice(),
                                                Err(lit) => lit.as_bytes() }));
                            }
                        }
                    }
                }
            }

And finally, we commit each builder to its corresponding FactSet.

            for ((id, _action), builder) in todos.iter().zip(builders.into_iter()) {
                let new_name = names.get(id).unwrap();
                facts.entry(new_name.clone()).or_default().add_set(builder);
            }

This was substantially longer in the original code, I think, because all of these cases were split apart and handled separately. The code is a bit shorter now; I won't say simpler.

The join side of the house is a bit simpler. We'll need to pull some information about the inputs, their key arity and names and get references to their fact sets.

        Op::Mul(2) => {

            // Peek at an input to see the intended join key arity.
            let arity = if let Op::Map(action) = &self.plan[node.args[0]].op { action.key_arity } else { panic!("malformed plan") };

            // Ensure collections exist, and then (borrow checker) fetch them.
            facts.entry(names.get(&node.args[0]).unwrap().clone()).or_default();
            facts.entry(names.get(&node.args[1]).unwrap().clone()).or_default();
            let facts0 = facts.get(names.get(&node.args[0]).unwrap()).unwrap();
            let facts1 = facts.get(names.get(&node.args[1]).unwrap()).unwrap();

            let mut builders = vec![FactBuilder::default(); actions.len()];

            join_with(facts0, facts1, stable, arity, |v1, v2| {
                for ((_id, action), builder) in actions.iter().zip(builders.iter_mut()) {
                    builder.push(action.projection.iter().map(|i| {
                        match i { Ok(col) => if *col < v1.len() { v1.get(*col).as_slice() } else { v2.get(*col - v1.len()).as_slice() },
                                    Err(lit) => lit.as_bytes() }}));
                }
            });

            // Commit the results back to their `FactSet`s.
            for ((id, _action), builder) in actions.iter().zip(builders.into_iter()) {
                let name = if *id >= body.len() { self.rule.head[id-body.len()].name.clone() } else { format!(".temp-{}-{}", pos, id) };
                facts.entry(name.clone()).or_default().add_set(builder);
                names.insert(*id, name);
            }
        }

That's the whole thing! There's an additional match case where we complain and panic if we see any other instruction, for example a Mul of some other arity or a stray Map(_), neither of which can be executed directly.

I did have to add a new e-analysis that learned to remove Mul(1), which were introduced by unary rules. I had a bunch of other bugs as well, but it runs on our complicated example.

> Fd(?val, ?loc) :- -d(?loc, ?val).
114.284011209s

> .list
        -M:     92806768
        -a:     362799
        -d:     1147612
        Fd:     1859886
        MFd:    73474947
        a:      362799
        d:      1147612
109.333µs

> 

It's even five seconds faster, for no reason that I can explain. I haven't tried it on more complicated plans, for example more naively stated versions of this workload. Mostly, I'm scared and would prefer to make smoothing out the optimization into a background task I improve in spare time. Also, we never got around to teaching it about multi-rule optimization, which might require a re-think of how we approach planning.

Streamlining our fact representation

Teaser: despite taking 50GB down to 5GB, we're still using about 10x more memory than we need to. Before we work to spill data to disk, let's at least make sure we aren't going to be spilling pure waste. We'll also need to look around for a harder problem, because doing spill-to-disk with less than 1GB feels a bit performative.


If we look more carefully at where the memory goes, we'll find out that most of it is waste.

Having run our workhorse computation to completion, we have the following relations with their associated fact counts:

> .list
        -M:     92806768
        -a:     362799
        -d:     1147612
        Fd:     1859886
        MFd:    73474947
        a:      362799
        d:      1147612
109.333µs

>

The large one here is -M, though MFd is also pretty chonky. Both of them will tell roughly the same story, so we'll focus on -M for now.

Recall that each of our collections is actually a log-structured merge-thing (a list rather than a tree). That is, each is represented by a sequence of lists whose lengths at least double over the previous one. If we crack open the representation of -M and print out these lengths, we get the report

Layer size: 57289225
Layer size: 24951369
Layer size: 6406719
Layer size: 2738555
Layer size: 1040087
Layer size: 351521
Layer size: 24612
Layer size: 4550
Layer size: 130

Let's focus on the largest one, with 57,289,225 facts, as again the other layers tell a similar story.

Each of these layers is a columnar representation of Vec<Vec<Vec<u8>>>, a list of facts, where a fact is a list of byte lists. So many lists. The columnar layout uses three columns:

Vecs {
    bounds: Vec<usize>,
    values: Vecs {
        bounds: Vec<usize>,
        values: Vec<u8>,
    }
}

The outer bounds indicates, for each fact, how many terms (each a Vec<u8>) it contains. The inner bounds indicates, for each term, how many bytes it contains. The final values contains all of the bytes for all of the terms for all of the facts.

The largest layer requires just over two gigabytes:

total:  2,098,253,766
        = 458,313,800
        + 916,627,600
        + 723,312,366

You can see the contribution in order there, with the actual byte data at the very end. There is a lot of information that is not the actual bytes, and we're going to get that all down to zero, in some cases. Our case, in particular.

A sequence of optimizations

We'll discuss several optimizations that can apply in sequence. They conclude with a new datastructure that I want to try out, and which I've implemented. At the same time, most of the optimizations apply even to the original representation, and I'd like to start with them to avoid conflating the gains of the new representation with easier optimizations.

Now is also a great time to admit that I haven't implemented all of the optimizations yet. I have implemented them in other systems, and they should be pretty easy to do here, but as you'll see it requires a bit of tomfoolery. Totally unobjectionable tomfoolery in my eyes, but I wanted to get to a datastructure I like before the shenanigans. With that caveat in place, let's begin.

Optimization: specialize the bounds

The first large number above (458,313,800) tells us the boundaries between facts. What are the offsets in our list of terms where one fact stops and another fact starts? Our facts all have exactly two columns, and generally all of the facts for any one relation will have the same number of columns.

This collection is exactly (0 .. len+1).step_by(arity).

We don't need 458,313,800 bytes to keep track of the arity and the length. We can instead use a container for usize that optimistically assumes the data will be strided, like so, and record the arity and length that would make it so. If that ever stops being the case, the container cuts over to recording explicit offsets.
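Here's a sketch of such a container; the names and exact shape are mine, not datatoad's.

/// A sketch of an offset container that exploits strided bounds.
enum Bounds {
    /// Offsets are exactly stride, 2*stride, ..., length*stride.
    Strided { stride: u64, length: u64 },
    /// The general case: every offset written down explicitly.
    Explicit(Vec<u64>),
}

impl Bounds {
    /// The offset at which item `index` ends.
    fn bound(&self, index: u64) -> u64 {
        match self {
            Bounds::Strided { stride, .. } => stride * (index + 1),
            Bounds::Explicit(offsets) => offsets[index as usize],
        }
    }
    /// Record the next offset, degrading to explicit offsets if the stride
    /// pattern breaks. (The real thing would establish `stride` from the
    /// first offset pushed.)
    fn push(&mut self, offset: u64) {
        match self {
            Bounds::Strided { stride, length } => {
                if offset == *stride * (*length + 1) {
                    *length += 1;
                } else {
                    let mut offsets: Vec<u64> = (1..=*length).map(|i| i * *stride).collect();
                    offsets.push(offset);
                    *self = Bounds::Explicit(offsets);
                }
            }
            Bounds::Explicit(offsets) => offsets.push(offset),
        }
    }
}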

This optimization effectively zeros out the first large number.

Optimization: make the terms the same length

The second large number above (916,627,600) records the boundaries between terms. What are the offsets in our list of bytes where one term stops and another term starts? Our terms have .. varying numbers of bytes because they are the textual representations of numbers. The numbers go up to seven digits, and many of them are actually seven digits.

While we might need 916,627,600 bytes to record all of the varying lengths of each of the terms, we don't actually need to have terms of varying length. We can just use seven bytes for all terms, reformatting the text to include leading zeros. At this point, the same striding optimization kicks in, and we would record a length and the stride of seven.
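A sketch of that reformatting, assuming terms are decimal text that fits in seven digits:

// Pad a decimal text term to exactly seven bytes with leading zeros,
// so that lexicographic byte order still matches numeric order.
fn pad_term(term: &[u8]) -> [u8; 7] {
    let n: u32 = std::str::from_utf8(term).unwrap().parse().unwrap();
    format!("{:07}", n).as_bytes().try_into().unwrap()
}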

This optimization effectively zeros out the second large number.

It does potentially increase the number of bytes, but we'll reduce these in just a moment.

Optimization: represent terms in binary

The remaining large number (723,312,366) is the concatenation of all bytes from all terms from all facts. There are roughly seven bytes per term, two terms per fact, and ~57.2M facts, which gets us to this number.

We've already discussed that these are numbers up to seven digits, and as such they fit great in a u32 which takes only four bytes. If we go down from seven bytes per term to four bytes per term, we end up with

   57,289,225
          * 2
          * 4
    ---------
= 458,313,800

many bytes. This is already a 1.57x reduction from the original size, but there is one last trick to pull.

While u32 is a fine Rust type, we know the numbers are at most seven digits, which means they fit great in three bytes. And we just have lists of bytes we are working with. Three is a non-standard number of bytes, but .. we don't care. We might care eventually when we look at cache line alignment, but .. for the moment let's not worry.
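A sketch of the three-byte packing; big-endian, so that sorted byte order continues to match numeric order:

// Pack a number known to fit in 24 bits into three big-endian bytes.
fn pack3(n: u32) -> [u8; 3] {
    debug_assert!(n < 1 << 24);
    [(n >> 16) as u8, (n >> 8) as u8, n as u8]
}

// Recover the number from its three bytes.
fn unpack3(b: [u8; 3]) -> u32 {
    ((b[0] as u32) << 16) | ((b[1] as u32) << 8) | (b[2] as u32)
}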

A further 4/3x reduction compounds to a 2.10x reduction, and gets us down to 343,735,350 bytes. All told, we have gone from two gigabytes down to ~350 megabytes, a roughly 6.10x reduction. At the same time, we've probably made our computation easier, because all of the various offsets are directly computed rather than the result of sniffing around in lots of data.

Getting to 10x

I claimed we were going to get a 10x reduction in memory footprint, and .. I wrote that before I looked at the data. But we're going to do it.

The last remaining improvement comes from the observation that our lists of sorted facts have a dramatic amount of repetition in their first term. This makes sense: we've sorted them, and as many facts as there are that start with the same term, we'll have that many copies of the first term in a row. Not much point writing these first terms down explicitly, is there?

The large layer we are discussing has 57,289,225 distinct facts, but only 1,147,612 distinct first terms. We could rewrite our facts from a collection of (Term, Term) into a collection of (Term, [Term]): a list of pairs of distinct first terms, and the list of each's second terms. If we do this naively in columns, using usize for the bounds, it would take

term0.values:    1,147,612 * 3 =   3,442,836
term1.bounds:    1,147,612 * 8 =   9,180,896
term1.values:   57,289,225 * 3 = 171,867,675
                                 -----------
total bytes                      184,491,407

This hits an 11.37x reduction from our initial 2GB, for this the largest layer of the largest relation. If the pattern of reduction holds across our 5GB of total state, we'd expect to get down below 500MB. It probably won't be quite as strong a reduction for the smaller layers, because the lower density of facts might result in less repetition, but it is close for the largest of layers, which are the ones that matter.
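In sketch form, the (Term, [Term]) layout for a sorted binary relation might look like this; names and exact types are mine.

/// A sketch of the layered layout: distinct first terms, each with
/// the sorted run of its second terms.
struct Layer {
    /// The distinct first terms, in sorted order.
    term0: Vec<[u8; 3]>,
    /// For each first term, the end offset of its run in `term1`.
    bounds: Vec<usize>,
    /// All second terms, grouped by first term.
    term1: Vec<[u8; 3]>,
}

impl Layer {
    /// The second terms associated with the `i`th distinct first term.
    fn seconds(&self, i: usize) -> &[[u8; 3]] {
        let lower = if i == 0 { 0 } else { self.bounds[i - 1] };
        &self.term1[lower..self.bounds[i]]
    }
}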

This is where we will leave our announcements of improvements for the moment.

Up next will be actually doing these things.

Update 02-07-2025

We have the first wave of optimizations in place and running. That largest batch now takes exactly 343,735,382 bytes vs a theorized 343,735,350. The additional 32 bytes are two copies of (stride, length) pairs of u64 numbers, which were the two instances of "effectively zero" in the text above.

The computation takes less time now! The time hovers around 95s, dipping below and rising above, slightly. This compares to the ~115s that it used to be, so knocking off about 20%. And there should be more optimization to do; we're pretty casual about the implementation, and haven't even looked at a profiler to see what it is doing.

The second wave of optimizations, recording the data as a tree rather than a list, is also in place for measurement purposes, but not yet driving the engine. The breakdown of the largest batch of -M previously and now looks like:

old bytes: 343,735,382
         =          16    // every fact has 2 terms.
         +          16    // every term has 3 bytes.
         + 343,735,350    // 2 x 3 x 57289225 bytes.

new bytes: 184,491,255
         =          16    // only one set of first terms.
         +          16    // every term has 3 bytes.
         +   3,442,836    // 3 x 1147612 (first terms) bytes.
         +   9,180,696    // 8 x 1147587 (first terms) bounds.
         +          16    // every term has 3 bytes.
         + 171,867,675    // 3 x 57289225 bytes.

The new representation uses two lists of byte lists, one for the single list of all distinct first terms, in order, and a second for a number of lists equal to the distinct first terms of the corresponding second terms. Mostly we are recording bytes, but we also need to take some space now to record the lengths of the lists in the second collection: not all first terms have the same number of second terms. It appears we were lightly lucky, and the first 25 lists had the same length, which explains why the (first terms) numbers don't match.

The next step is deploying the layered trie in anger, to pilot the joins, merging, and deduplication. Once this is in, and running not-terribly, we can turn off the old representation and have a further reduced memory footprint. The layered trie is exciting to me personally, because I want to understand more about how to use this representation to represent relations that are more complex than our stolid binary relations.

Spilling facts to disk

Teaser: our columnar stores that hold our facts are backed by relatively few but large allocations. When creating them we can write them to files instead of to memory, and then memory map the results back in and use them.

Scaling rule evaluation out

Teaser: Our joins, deduplication, and distinctness checks are all based on equality of keys. We can distribute the keys, and the data attached to those keys, across multiple workers. We even have the right library at hand to do this, and scale out to multiple processes (timely_communication).

Streaming facts through rules

Teaser: Our joins all use binary joins with materialized outputs. If the right indexes already exist, it's possible to use the multi-linearity of joins to create plans that materialize no internal state. I think this was where I was going to do worst-case-optimal joins also, for some reason.

Specializing to compiled code

Teaser: We'll dive deeper into the moments in our code that exist because we didn't have compile time access to something that datafrog did have. I'm cautiously optimistic that we can get a fair bit of that performance back, by noticing that our data is laid out the same way, and we just need to help Rust notice that some implementations are still ok. And if not, we'll learn something.


Our memory optimization around using the same size for all terms leads to an impressive compute optimization.

The memory optimization was that if all terms have the same length B in bytes, and all facts in a relation have the same number T of terms, then we can think of the Vec<u8> backing the relation as Vec<[[u8; B]; T]>. This removes all of the "mystery" about how the data are shaped, and gives Rust a lot of time back that it was previously spending carefully checking bounds and lengths and such.

Comparisons, in particular, become substantially cheaper.

Where do we do comparisons in datatoad? Pretty much everywhere.

  1. To form batches of facts we sort and deduplicate them.
  2. To merge batches we linearly scan each comparing items as we go.
  3. To join two relations we merge their keys, skipping over areas we miss.
  4. To filter new facts against old we scan the new facts and exponential search the old facts.

These are in order of fact lifecycle, but also I think in order of decreasing "comparison density". Sorting requires a lot of comparisons; merging performs linear comparisons, and joining and antijoining (filtering) skips through larger datasets without looking at everything.

What sort of amazing thing will happen if we make sorting faster?

Baseline

Let's get baseline measurements for each of the two tasks, and three datasets. In doing this, we've had to back off of the "three bytes" take for each term, and go up to four bytes. Some of the inputs have values that go above 16M, unfortunately. But it also feels a bit more responsible, which feels good.

Copy/pasting from above, my understanding of the valid comparisons for the "dataflow" task is:

| dataflow | httpd | psql | lnx_kernel |
| --- | --- | --- | --- |
| graspan | 684s | 8640s | 42840s* |
| datatoad | 7.44s | 17.26s | 42.25s |
| datafrog | 1.30s | 4.06s | 8.03s |

For the "aliasing" task, I'm less clear that we are doing the same thing, and I've also hand-optimized the logic. The psql input has multiple files, not unlike the linux input, and I'll use the biggest one. Also, we don't have a datafrog program for this problem, because it is a pain in the ass to write datafrog programs. In fact that's kinda why we are doing all of this.

| aliasing | httpd | psql | lnx_kernel |
| --- | --- | --- | --- |
| graspan | 8.4h | 6.0h* | 1.7h* |
| datatoad | 101.24s | 96.36s | 20.20s |
| datafrog | UNK | UNK | UNK |

Again, let's not get too excited about the speed-up over the Graspan numbers. I'm not sure why they are so large, and the point isn't that they are or are not a soft starting point, as much as evidence that we aren't way out in left field performance-wise.

Optimizing sorting

Facts start life as a Vec<Term> where each Term is itself a Vec<u8>. Our first action, once we've collected one million facts (an arbitrary threshold, perhaps worth revisiting), is to sort and deduplicate that list of facts.

Here's how we would normally sort and deduplicate our facts.

fn from(facts: &mut InternalContainer) -> Self {

    let mut ordered = InternalContainer::default();
    let borrowed = <InternalContainer as Container<Fact>>::borrow(facts);

    let mut items = borrowed.into_index_iter().collect::<Vec<_>>();
    items.sort();
    items.dedup();
    ordered.extend(items);
    use columnar::Clear;
    facts.clear();

    Self { ordered }
}

Informally, we collect references to the facts, and then sort and deduplicate the collected list. It's a bit awkward performance-wise, because unlike most good sort algorithms we aren't relocating the data to increase locality. At the same time, this is currently general columnar facts, terms, and bytes, all with varying lengths, and it isn't clear how to efficiently re-layout the data (radix sort?).

But, there's a special case we've been discussing, where we happen to know a common arity for facts, and a common byte length for terms. If we see that happen, and .. we've pre-baked the constants into the program, good things can happen. Let's just insert this bit of code at the top of the method there:

    if let (Some(2), Some(4)) = (facts.bounds.strided(), facts.values.bounds.strided()) {
        // Two terms of four bytes each: every fact is exactly eight bytes.
        let len = unsafe {
            // Reinterpret the backing Vec<u8> as a Vec<[u8; 8]>; its length
            // field still counts bytes, so we rescale it around the sort.
            let temp: &mut Vec<[u8; 8]> = std::mem::transmute(&mut facts.values.values);
            temp.set_len(temp.len() / 8);
            temp.sort();
            temp.dedup();
            temp.set_len(temp.len() * 8);
            temp.len() as u64
        };
        // Deduplication may have dropped facts; restate the fact and term
        // counts to match the new byte length.
        facts.bounds.bounds[1] = len / 8;
        facts.values.bounds.bounds[1] = len / 4;
        return Self { ordered: std::mem::take(facts) }
    }

Oh sweet baby Jesus, someone invited unsafe to the party.

What we are doing here is recasting a Vec<u8> as a Vec<[u8; 8]>, because we "know" that the bytes come in blocks of eight. I don't know a better way to sort and deduplicate than this transmute, though I'd love to hear one. You can sort safely with [T]::as_chunks_mut, but you still want to deduplicate, and that seems to require a Vec. We could probably write a safe dedup through a bunch of overwrites on the slice followed by a truncate on the Vec<u8>; I'll surely try that to avoid the unsafe.
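
For the curious, here's roughly what that safe version could look like: sort via as_chunks_mut, deduplicate by compacting retained chunks toward the front, then truncate. This is a sketch of the plan just described, assuming the byte length is a multiple of eight; it is not (yet) datatoad's code.

    fn sort_dedup_chunks(bytes: &mut Vec<u8>) {
        // Sort the eight-byte chunks in place; no unsafe required.
        let (chunks, tail) = bytes.as_chunks_mut::<8>();
        assert!(tail.is_empty(), "byte length must be a multiple of 8");
        chunks.sort_unstable();
        // Deduplicate by sliding each newly seen chunk leftward over duplicates.
        let mut kept = if chunks.is_empty() { 0 } else { 1 };
        for i in 1..chunks.len() {
            if chunks[i] != chunks[kept - 1] {
                chunks[kept] = chunks[i];
                kept += 1;
            }
        }
        // The trailing bytes are now garbage; the truncate discards them.
        bytes.truncate(8 * kept);
    }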

But for the moment, let's see what happens! We'll look at the original datatoad numbers and their improvements due to better sorting. First the dataflow use case:

| dataflow | httpd | psql | lnx_kernel |
|:---------|------:|-----:|-----------:|
| dt-orig  | 7.44s | 17.26s | 42.25s |
| dt-sort  | 4.99s | 13.55s | 32.15s |
| datafrog | 1.30s | 4.06s | 8.03s |

And the aliasing use case

| aliasing | httpd | psql | lnx_kernel |
|:---------|------:|-----:|-----------:|
| dt-orig  | 101.24s | 96.36s | 20.20s |
| dt-sort  | 52.99s | 53.19s | 11.20s |
| datafrog | UNK | UNK | UNK |

So that's a pretty big improvement. If we can repeatedly knock off large fractions of the runtime like this, we'll have something great before too long. The effect is more pronounced in the more complex aliasing analysis, which seems to move more data around.

Optimizing merging

We repeatedly merge lists of facts.

We do this mostly to maintain a limited number of sorted lists, so that when we need to perform a join, or filter against existing records, we only need a few passes over data.
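
For flavor, here is a sketch of one common policy for keeping the number of sorted lists small, in the LSM spirit: merge the newest batch into its predecessor whenever it has grown to at least half the predecessor's size, so at most logarithmically many batches survive. This illustrates the shape of the idea over plain vectors, and is not necessarily datatoad's exact rule.

    fn maintain(batches: &mut Vec<Vec<[u8; 8]>>) {
        while batches.len() > 1 {
            let n = batches.len();
            // Merge while the newest batch is at least half its predecessor.
            if 2 * batches[n - 1].len() >= batches[n - 2].len() {
                let upper = batches.pop().unwrap();
                let mut lower = batches.pop().unwrap();
                // Stand-in merge: concatenate, sort, deduplicate.
                lower.extend(upper);
                lower.sort_unstable();
                lower.dedup();
                batches.push(lower);
            } else {
                break;
            }
        }
    }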

The original merging code is not wildly complicated, and I won't copy it here. We'll do a similar thing near the head of the method, though, to see if we can spy some properly aligned inputs.

    if let (Some(2), Some(4)) = (self.ordered.bounds.strided(), self.ordered.values.bounds.strided()) {
        if let (Some(2), Some(4)) = (other.ordered.bounds.strided(), other.ordered.values.bounds.strided()) {
            // Both inputs are eight-byte facts: concatenate their bytes, then
            // sort and deduplicate exactly as in the previous section.
            let mut ordered = InternalContainer::default();
            ordered.values.values = self.ordered.values.values.iter().chain(other.ordered.values.values.iter()).copied().collect::<Vec<_>>();
            let len = unsafe {
                let temp: &mut Vec<[u8; 8]> = std::mem::transmute(&mut ordered.values.values);
                temp.set_len(temp.len() / 8);
                temp.sort();
                temp.dedup();
                temp.set_len(temp.len() * 8);
                temp.len() as u64
            };
            // The output is freshly built, so we must also install its strides:
            // two terms per fact, four bytes per term.
            ordered.bounds.clear();
            ordered.bounds.bounds[0] = 2;
            ordered.bounds.bounds[1] = len / 8;
            ordered.values.bounds.clear();
            ordered.values.bounds.bounds[0] = 4;
            ordered.values.bounds.bounds[1] = len / 4;
            return Self { ordered };
        }
    }

This case is a bit more complicated, in that we need to form an output, which means setting its strides and lengths correctly. Otherwise, this implements "merge" by "concatenate, then sort and deduplicate", which is much less sophisticated than our original merge. But it works.
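
For contrast, here's a sketch of the single-pass linear merge we're bypassing: walk both sorted inputs together, emitting the smaller head and collapsing ties. This is illustrative Rust over eight-byte chunks, not datatoad's actual merge code.

    fn merge_sorted(a: &[[u8; 8]], b: &[[u8; 8]]) -> Vec<[u8; 8]> {
        use std::cmp::Ordering;
        let mut out = Vec::with_capacity(a.len() + b.len());
        let (mut i, mut j) = (0, 0);
        while i < a.len() && j < b.len() {
            match a[i].cmp(&b[j]) {
                Ordering::Less    => { out.push(a[i]); i += 1; }
                Ordering::Greater => { out.push(b[j]); j += 1; }
                // Equal heads are duplicates across the inputs: emit once.
                Ordering::Equal   => { out.push(a[i]); i += 1; j += 1; }
            }
        }
        // At most one input has elements remaining; both tails are sorted.
        out.extend_from_slice(&a[i..]);
        out.extend_from_slice(&b[j..]);
        out
    }

If the concatenate-and-sort approach ever shows up in profiles, this is the shape we'd reach for.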

First the dataflow use case:

| dataflow | httpd | psql | lnx_kernel |
|:---------|------:|-----:|-----------:|
| dt-orig  | 7.44s | 17.26s | 42.25s |
| dt-sort  | 4.99s | 13.55s | 32.15s |
| dt-both  | 3.71s | 11.23s | 23.58s |
| datafrog | 1.30s | 4.06s | 8.03s |

And the aliasing use case

| aliasing | httpd | psql | lnx_kernel |
|:---------|------:|-----:|-----------:|
| dt-orig  | 101.24s | 96.36s | 20.20s |
| dt-sort  | 52.99s | 53.19s | 11.20s |
| dt-both  | 31.32s | 30.08s | 8.56s |
| datafrog | UNK | UNK | UNK |

These are some pretty decent improvements, for not all that much code. They do strongly rely on the common length of every term, which makes sense with typed primitive data, but won't hold up if we start using strings everywhere again.

Optimizing other things

There are two more uses of comparison: joining and antijoining (filtering out old facts).

Looking at profiles, these might account for another 30% of the time, and that cost is absolutely worth chasing down. Or it would be, except that we are just about to rewrite a bunch of this logic to move to trie-based data structures. When we do, we'll keep an eye out for the same optimization opportunity, which certainly exists, but applies on a column-by-column basis, a condition more likely to hold than our "all terms have the same length" requirement.

We didn't quite get to the compiled performance of datafrog, and I think we'll probably return to this subject in the future, as there's no obvious (to me) reason that we shouldn't be able to hit the same performance when allowed the cheeky sorts of nonsense we've just pulled in this section. Some profiling should reveal what it is that datatoad is doing that datafrog does not do, and it seems like all of our computational kernels can be brought into line with some unsafe.

I'm personally going to next see if I can remove the unsafe with some safe byte shuffling, followed by a safe (but unjustified) vector truncate. Unsafe code is the worst, folks; avoid it at all costs.

Specializing to custom representations

Teaser: I have a bee in my bonnet about detecting "transitive closure" computations and specializing them to the strongly connected component decomposition. You can do similar things with equivalence relations, which you can also read out of the rules themselves, and then use e.g. a union-find data structure. I also thought we might discuss bddbddb, a database built around binary decision diagrams, which leads us to factorized databases if we haven't already hit those yet.
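
As a tiny taste of the union-find idea, a minimal sketch with path halving; illustrative only, and nothing to do with datatoad's current code.

    struct UnionFind { parent: Vec<usize> }

    impl UnionFind {
        fn new(n: usize) -> Self {
            Self { parent: (0..n).collect() }
        }
        // Find the representative of `x`, halving paths as we go.
        fn find(&mut self, mut x: usize) -> usize {
            while self.parent[x] != x {
                self.parent[x] = self.parent[self.parent[x]];
                x = self.parent[x];
            }
            x
        }
        // Merge the equivalence classes of `a` and `b`.
        fn union(&mut self, a: usize, b: usize) {
            let (a, b) = (self.find(a), self.find(b));
            if a != b { self.parent[a] = b; }
        }
    }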

Searching for relevant facts

The demand transform, which I still can't claim to 100% understand. But we should understand it, because you'll want to use it if you are interactively exploring data with Datalog.