Launch HN: Lume (YC W23) – Generate custom data integrations with AI

wefarrell · on March 20, 2023

One area where I think AI would be super useful is interpreting enterprise data dictionaries and companion guides, for example:

https://www.cms.gov/files/document/cclf-file-data-elements-r...

Currently I have to write validations based off of that definition and then write code to transform it to another standardized claim format. The work is king of mind numbing and it seems like it would be possible to use AI to streamline the process.

nmachado · on March 20, 2023

If you have the desired standardized claim format, Lume supports this use case. We also have a pdf parser in the roadmap to parse documents exactly like the one you linked, to then transform and pipe the data accordingly.

wefarrell · on March 20, 2023

How does Lume support this today without a pdf parser? Do you have the option to use a preexisting claim format or does the format have to be specified another way?

nmachado · on March 21, 2023

Our V1 supports json and csv formats for manual imports, and we’re quickly expanding to other formats (like pdf).

So, to clarify - Lume supports this today only if you provide the linked claim data in json or csv format, and in the near future will support direct pdf formats. All of our users so far provide custom data through their data warehouse, json, or csv.

wefarrell · on March 21, 2023

Just to be clear, the pdf does not contain the data. The pdf contains the data dictionary that describes the structure of the data such as the type of field, whether it's required, etc... the actual claim data is sent in a csv.

The objective is be to parse the csv based on the data dictionary described in the pdf.

nmachado · on March 21, 2023

Gotcha! In that case, we do not yet support an end-to-end experience for this, but would be willing to prioritize building it for clients if we get strong demand.

Avicebron · on March 20, 2023

Cool, so are you actually using a LLM? If so, is it yours or are you borrowing someone else's (you mentioned that recent improvements in LLM's being a catalyst as the right time to tackle it)?

If not, I'd definitely like to hear more about your specific AI model.

nmachado · on March 20, 2023

Yes, we are using an LLM for some parts of the code generation, specifically GPT-4. In the medium-term, we plan to go lower in the stack and have our own AI model. We broke down the process into modular steps to only leverage LLMs where it's most needed, and use rule-based methods in other parts of the process (e.g. in fixing compilation errors). This maximizes the accuracy of the transformations.

btbuildem · on March 21, 2023

Modular use of an LLM over a problem-specific workflow skeleton is the winning ticket. Nicely conceptualized!

Avicebron · on March 20, 2023

Do you have some sort of automatic test suite for what's generated by the LLM prior to release? Just to ensure what it returns won't break downstream?

robert-te-ross · on March 20, 2023

Yes, internally, we have separate models that produce tests the final data has to pass before being presented to the user. In addition, you can define your own tests on the platform, and we will ensure transformations produced will pass those tests before deployment. We also have helpful versioning and backtesting features.

jxnlco · on March 20, 2023

looks like it probs passes the source and target schema throught an LLM that generates a sql create statement. similar to https://magic.jxnl.co/data

and make a request like 'write me sql to map the existing tables to a new table with this schema'

margorczynski · on March 20, 2023

Considering you're using an nondeterministic way of generating the transformation (LLM) what sort of guarantee do I get that it will work correctly and do what I want?

Is my proprietary data stored on your servers (database schema, rows, etc.)? If so what safety guarantees do I get?

nmachado · on March 20, 2023

Regarding guarantee that it will work correctly, there are ways to reduce the ambiguity in the task given. One way is to input very detailed descriptions of your end schema. This limits the amount of assumptions our model has to make. In addition, you can define tests either by writing SQL code on Lume, or by explaining in plain English the tests the final data has to pass (and edit them, of course). Our models make sure the end data passes these tests, guaranteeing your desired outcomes. We also offer versioning and backtesting capabilities, so you can have more confidence in your deployments. You can also review the sample data + the sql used to guarantee Lume drafted the integration you desired.

With regards to where your data is stored, technically we only need your schema information for our models and everything is run on your cloud, which some customers prefer for privacy / safety. That being said, the ability to sample source data or test the end schema, which does require some data read access, will improve your experience with Lume. In these cases, we of course have contractual agreements with our customers.

dustymcp · on March 20, 2023

Is this really much faster than just writing these things? My latest integration with 4 endpoints took around 3-4 hours with tests ? I feel most of the work comes from your business model and making the fitting which you would still need to do unless im missing something entirely?

robert-te-ross · on March 20, 2023

In most cases, we build these transformations in a matter of seconds. Furthermore, we can detect changes from either source or destination and change the transformation accordingly, reducing maintenance burden as well.

tyingq · on March 21, 2023

I see some evidence that it handles complex transformations, but there's so many corner cases in the real world, like...

- Different ranges, where the source is, say "size 0-10", and the destination is "S/M/L".

- Various flattening or exploding needs. Like an array of namespaced tags driving a flat list of boolean fields. Or a source with 2 tables and a foreign key being transformed into tags, or flat fields, or a 3-level nesting.

- Encoding/Decoding things. Transforming windows-1252 into utf-8. Decoding base64 (or json, or xml, or...) and storing as fields in the destination.

- Compound transforms, both directions, two fields into one, or vice-versa with splitting on a delimeter.

- Appending a unique suffix/count to some field because the source doesn't enforce uniqueness on the field, but the destination does. Or going the other direction.

- Hundreds of similar patterns.

It's fairly easy to see the breadth if you look at all the dials and knobs on any popular ETL tool.

I'm curious if the idea is to pull all these into scope, or if it's to ignore it, and focus on a deliberately smaller market.

Nebyou · on March 21, 2023

We've observed that our system performs really well handling most corner cases as long as the context required can be interpolated from its inputs (either in the schemas and their descriptions or in the underlying data we sample from). In the worst case, the most you'd have to do is edit schema descriptions on our platform to include the necessary context (For example, specifying the encoding that you expect the field in your end schema to have).

For the compound transform scenario, since we optimize for modularity in the transformations we build, our systems prioritize defining these transformations unless it makes no sense to do so.

gregw2 · on March 20, 2023

The demo shows you transforming As-Is data to As-Is data.

Is this a full refresh each time or incremental (do you have to tell it the incremental columns or can it "tell"?)

Can you create audit timestamps in the target which track when rows were inserted or updated (or soft-deleted) in the target?

Can you take sources which contain "current state" table information and transform them into tables that have record start-effective/end-effective date (+ a current record indicator flag?) that support As-Was querying for a given primary key and/or which tracks soft deletes over time as an additional target table column?

Nebyou · on March 21, 2023

This example is a full refresh, but incremental is usually the norm, especially for our supported connectors and continuous syncs. Our models can detect incremental columns.

Audit timestamps (usually tables) are typically created in intermediary stages (whose materializations you would have access to in your database) before getting pruned out to fit your destination schema. Of course, if the destination schema expects these audit tables or columns, they would be included in the target.

To your last question, if you include these tables or columns in the end schema you specify to Lume (or create a separate flow with a new end schema with these fields), what you described is definitely possible.

gregw2 · on March 21, 2023

How does Lume know what to fill in on the target table if a record effective end date isn’t in the source system but is a property of when the data was fetched? How does Lume know to update the record effective termination date of a target row when a new row comes along?

(This type of data modeling is described here: https://www.kimballgroup.com/data-warehouse-business-intelli... )

Nebyou · on March 21, 2023

Apologies, I misread your earlier question. Lume can only ascertain information from what is given directly or from what assumptions it can make with reasonable confidence. So, in this case, this will not be possible unless there is any information in the start schema where this can be interpolated. If these prior transformations were done in a dbt project, we could extract the information needed for this easily (a dbt project can be integrated and represented as a separate flow within your workspace).

mosseater · on March 20, 2023

Wow! How did you get 300(+) data connections with such a small team?

nmachado · on March 20, 2023

We leveraged Airbyte - it makes supporting that many connections much more seamless ... and a lot of coding!

mritchie712 · on March 21, 2023

Have you check Airbyte's license[0]? Much of it is Elastic License 2.0, which I don't think allows what you're doing.

0 - https://github.com/airbytehq/airbyte/blob/master/LICENSE

danpalmer · on March 21, 2023

> You may not provide the software to third parties as a hosted or managed service, where the service provides users with access to any substantial set of the features or functionality of the software.

The screenshots of Airbyte and Lume even look nearly the same. It looks like it's just a hosted Airbyte instance with GPT generating the SQL/config.

Smart for an MVP, not much of a moat for the long term, and it's a shame that software licencing is such a blindspot for so many SV startups.

mritchie712 · on March 21, 2023

I personally wouldn't do this even for an MVP. The product is competitive with Airbyte and they're breaking Airbyte's license.

Definitely cases where it's ok to do something similar for an MVP, but I wouldn't touch this product knowing they can't continue to operate it this way (could get shut down at any moment).

I'm a bit surprised someone at YC didn't flag this.

swyx · on March 21, 2023

for what its worth airbyte's connectors themselves are MIT https://airbyte.com/blog/a-new-license-to-future-proof-the-c... arguably this is "much of it" rather than ELv2

Nebyou · on March 21, 2023

Thanks for raising this. We share your respect for the intellectual property rights of others. We are aware of Airbyte's license structure and use connectors per its terms.

danpalmer · on March 21, 2023

Nice, if the connectors are MIT licenced but you've got your own server, that's great.

It might be worth differentiating the product further. Right now, looking at screenshots, it looks like a re-skin. I realise there's only so much you can differentiate an ETL service, and that the LLM feature is the main differentiating factor, but I do worry that it's very close right now.

jeanlaf · on March 22, 2023

Hi team! I'm one of the Airbyte co-founders. I think it might be worth chatting regarding the license indeed :). Don't hesitate to reach out to me on our Slack "John (Airbyte)"

Nebyou · on March 21, 2023

This is not a simply a hosted airbyte instance. We use airbyte's connectors for its common standards and the active community behind them. That being said, our use of the project is both limited, customized, and deeply embedded under our app. We do not use any UI components from Airbyte.

adv0r · on March 20, 2023

stupid feedback: The Loom video started with "Hi this is lume", which in my head is pronounced exactly like "loom" itself. My brain farted for a couple of seconds until I saw the Logo of "Lume" in the "loom" itself

nmachado · on March 20, 2023

Thanks! It is a funny meta moment to be using a similarly-named tool.

dgudkov · on March 20, 2023

That's a good problem to solve, but I wish it would be solved using standards, not with yet another service. Anyway, good luck to the founders!

ya1sec · on March 21, 2023

Will this functionality be available only as a no-code tool? I'd love to see a python library or something of the sort.

nmachado · on March 21, 2023

Right Lume is a low to no code tool (see demo), but we have gotten requests for an SDK. Creating a library / SDK is in our radar! If anyone has personal or company use cases for a library / SDK, please email founders@lume.ai, we'd love to learn more.

liminal · on March 20, 2023

Hi, how do you position yourself relative to products like Workato, Tray, AppConnect, etc.?

Nebyou · on March 20, 2023

It's true that our platform can be used for the same use cases as some of those products. However, the main difference is in the customizability we offer. These products focus on and support the most common integrations and offer them as an automation service. For most custom integrations, users still have to write custom code within these products if possible, or build them out in-house. With Lume, this would not be necessary.

alexchantavy · on March 21, 2023

Any plans for a graph schema destination e.g. Neo4j?

Nebyou · on March 21, 2023

Definitely high up on the priority list! We've actually been experimenting with this internally.

SMAAART · on March 20, 2023

Commenting so that I can remember it in the future.

doomain · on March 20, 2023

Is this a direct competitor to Zapier?

nmachado · on March 21, 2023

Another user asked this about other IPaaS’, such as Workato. This is our response: https://news.ycombinator.com/item?id=35238714

In short, Lume can be used for the same use cases as Zapier. However, Zapier focuses on and supports the most common trigger integrations and offer them as an automation service. For most custom integrations, users still have to write custom code within these products if possible, or build them out in-house. With Lume, this would not be necessary.

bodhi_mind · on March 20, 2023

Are you letting users prompt the llm?

robert-te-ross · on March 20, 2023

Our system only uses LLMs at particular points of the process, so we do not expect letting users do this to have much value. However, descriptions we generate and/or take in as input for both end and start schema columns have a significant effect on the generation of your transformations. Therefore, the ability to edit these descriptions can be a powerful way to experiment with our models.

tough · on March 20, 2023

It's also a way to prompt engineer/hack your stuff too keep in mind

bodhi_mind · on March 20, 2023

Yes, I’m curious how they’re handling sandboxing for this effectively untrusted code.

Nebyou · on March 20, 2023

Our transformations are executed in a staging database/schema before deployment. We also have versioning and backtesting capabilities. In addition, you will have complete visibility of the code we produce before and after deployment.

nmachado · on March 20, 2023

Yep - we do not expose any sort of prompting. We use the LLM only at specific parts of the process, and the user has no access to it.

tough · on March 21, 2023

Doesn't the user provide the input that's feed to that function calling the LLM tho? Prompt hacking is a bit like sql injection in my mind but we don't have ORM's yet

nmachado · on March 21, 2023

This would be a concern if we are feeding the raw user input and feed it directly into an LLM. In our case, we are not simply a wrapper over an LLM.

There are multiple parsing and rule-based steps done to the input schemas - we extract specific pieces from the schemas and convert them to our internal format before feeding it our models. Thus, it mitigates such malicious behavior.

tough · on March 23, 2023

Thanks for the answer, I just found out about kor on twitter and made me think back of this thread, sharing in case it's of your interest https://eyurtsev.github.io/kor/

towndrunk · on March 20, 2023

You have the same company name as a deodorant company. https://lumedeodorant.com/

brap · on March 20, 2023

Coming up with original company names at this point is nearly impossible (and somewhat overrated)

MisterBastahrd · on March 20, 2023

They also have a name that sounds the same as a video conferencing solution:

https://loom.com