Launch HN: Lume (YC W23) – Generate custom data integrations with AI
115 points by nmachado on March 20, 2023 | 56 comments
Hi HN, we’re Nicolas, Nebyou, and Robert and we’re building Lume (https://lume.ai). Lume uses AI to generate custom data integrations. We transform data between any start and end schema and pipe the data directly to your desired destination. There’s a demo video here: https://www.loom.com/share/bed137eb38884270a2619c71cebc1213.

Companies spend countless engineering hours manually transforming data for custom integrations, or pay large amounts to consulting firms to do it for them. Engineers have to work through massive data schemas and write hacky scripts to transform data. Dynamic schemas from different clients or apps require custom integration pipelines. Many non-tech companies still rely on schemas delivered as csv and pdf files. Days, weeks, and even months are spent just building integrations.

We ran into this problem first-hand as engineers: Nebyou during his time as an ML engineer at Opendoor, where he spent months manually creating data transformations, and Nicolas during his time at Apple Health, where he did the same. Talking to other engineers, we learned this problem was everywhere. Because of the dynamic and one-off nature of different data integrations, it has been a challenging problem to automate. We believe that with recent improvements in LLMs (large language models), automation has become feasible, and now is the right time to tackle it.

Lume solves this problem head-on by generating data transformations, which makes the integration process 10x faster. This is provided through a self-serve managed platform where engineers can manage and create new data integrations.

How it works: users specify their data source and data destination, each of which defines a desired data format, a.k.a. schema. Sources and destinations can be specified through our 300+ app connectors, or custom schemas can be connected either by providing access to your data warehouse or by a manual file upload (csv, json, etc.) of your end schema. Lume, which combines AI and rule-based models, creates the desired transformation under the hood by drafting the necessary SQL code, and deploys it to your destination.
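To give a rough sense of what that means in practice, a generated transformation ultimately boils down to ordinary SQL against your warehouse. Here is a minimal, made-up sketch (the table and column names are hypothetical, not real Lume output):

    -- Hypothetical source: raw_contacts(first_name, last_name, email_addr, created)
    -- Hypothetical destination schema: customers(full_name, email, signup_date)
    CREATE TABLE customers AS
    SELECT
        first_name || ' ' || last_name AS full_name,
        LOWER(email_addr)              AS email,
        CAST(created AS DATE)          AS signup_date
    FROM raw_contacts;

The point is that you don't hand-write this mapping; you review it and edit it only if needed.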

At the same time, engineers don’t want to rely on low- or no-code tools without visibility under the hood. Thus, we also provide features to ensure visibility, confidence, and editability for each integration: Data Preview lets you view samples of the transformed data; SQL Editor lets you see the SQL used to create the transformation and change the assumptions made by Lume’s model, if needed (most of the time, you don’t!). In addition, Lineage Graph (launching soon) shows you the dependencies of your new integration, giving more visibility for maintenance.

Our clients have two primary use cases. One common use case is to transform data source(s) into one unified ontology. For example, you can create a unified schema between Salesforce, Hubspot, Quickbooks, and Pipedrive in your data warehouse. Another common use case is to create data integrations between external apps, such as custom syncs between your SaaS apps. For example, you can create an integration directly between your CRM and BI tools.
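For the unified-ontology case, the underlying SQL would typically amount to normalizing each source into the shared shape and unioning the results. A simplified, hypothetical sketch (the unified schema and table names are invented for illustration):

    -- Hypothetical unified schema: crm_contacts(source, contact_id, name, email)
    CREATE VIEW crm_contacts AS
    SELECT 'salesforce' AS source, Id AS contact_id, Name AS name, Email AS email
    FROM salesforce_contacts
    UNION ALL
    SELECT 'hubspot' AS source, vid AS contact_id,
           CONCAT(firstname, ' ', lastname) AS name, email AS email
    FROM hubspot_contacts;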

The most important thing about our solution is our generative system: our model ingests and understands your schemas, and uses that to generate transformations that map one schema to another. Other integration tools, such as Mulesoft and Informatica, ask users to manually map columns between schemas—which takes a long time. Data transformation tools such as dbt have improved the data engineering process significantly (we love dbt!) but still require extensive manual work to understand the data and to program. We abstract all of this and do all the transformations for our customers under the hood - which reduces the time taken to manually map and engineer these integrations from days/weeks to minutes. Our solution handles the truly dynamic nature of data integrations.

We don’t have a public self-serve option yet (sorry!) because we’re at the early stage of working closely with specific customers to get their use cases into production. If you’re interested in becoming one of those, we’d love to hear from you at https://lume.ai. Once the core feature set has stabilized, we’ll build out the public product. In the meantime, our demo video shows it in action: https://www.loom.com/share/bed137eb38884270a2619c71cebc1213.

We currently charge a flat monthly fee that varies based on the number of data integrations. In the future, we plan on having more transparent pricing made up of a fixed platform fee + compute-based charges. To avoid surprise charges, we currently run the compute in your data warehouse.

We’re looking forward to hearing any of your comments, questions, ideas, experiences, and feedback!




One area where I think AI would be super useful is interpreting enterprise data dictionaries and companion guides, for example:

https://www.cms.gov/files/document/cclf-file-data-elements-r...

Currently I have to write validations based off of that definition and then write code to transform it to another standardized claim format. The work is kind of mind-numbing and it seems like it would be possible to use AI to streamline the process.


If you have the desired standardized claim format, Lume supports this use case. We also have a pdf parser on the roadmap to parse documents exactly like the one you linked, and then transform and pipe the data accordingly.


How does Lume support this today without a pdf parser? Do you have the option to use a preexisting claim format or does the format have to be specified another way?


Our V1 supports json and csv formats for manual imports, and we’re quickly expanding to other formats (like pdf).

So, to clarify - Lume supports this today only if you provide the linked claim data in json or csv format, and in the near future will support direct pdf formats. All of our users so far provide custom data through their data warehouse, json, or csv.


Just to be clear, the pdf does not contain the data. The pdf contains the data dictionary that describes the structure of the data such as the type of field, whether it's required, etc... the actual claim data is sent in a csv.

The objective is to parse the csv based on the data dictionary described in the pdf.


Gotcha! In that case, we do not yet support an end-to-end experience for this, but would be willing to prioritize building it for clients if we get strong demand.


Cool, so are you actually using an LLM? If so, is it yours or are you borrowing someone else's? (You mentioned recent improvements in LLMs being a catalyst and the reason now is the right time to tackle this.)

If not, I'd definitely like to hear more about your specific AI model.


Yes, we are using an LLM for some parts of the code generation, specifically GPT-4. In the medium-term, we plan to go lower in the stack and have our own AI model. We broke down the process into modular steps to only leverage LLMs where it's most needed, and use rule-based methods in other parts of the process (e.g. in fixing compilation errors). This maximizes the accuracy of the transformations.


Modular use of an LLM over a problem-specific workflow skeleton is the winning ticket. Nicely conceptualized!


Do you have some sort of automatic test suite for what's generated by the LLM prior to release? Just to ensure what it returns won't break downstream?


Yes, internally, we have separate models that produce tests the final data has to pass before being presented to the user. In addition, you can define your own tests on the platform, and we will ensure transformations produced will pass those tests before deployment. We also have helpful versioning and backtesting features.
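To illustrate, a user-defined test can be as simple as a dbt-style check: a query that has to return zero rows before deployment goes ahead. The table and column names below are made up for the example:

    -- Hypothetical test: deployment is blocked if any transformed row violates this
    SELECT *
    FROM customers
    WHERE email IS NULL
       OR signup_date > CURRENT_DATE;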


looks like it probs passes the source and target schema through an LLM that generates a sql create statement. similar to https://magic.jxnl.co/data

and makes a request like 'write me sql to map the existing tables to a new table with this schema'


Considering you're using a nondeterministic way of generating the transformation (an LLM), what sort of guarantee do I get that it will work correctly and do what I want?

Is my proprietary data stored on your servers (database schema, rows, etc.)? If so what safety guarantees do I get?


Regarding the guarantee that it will work correctly, there are ways to reduce the ambiguity in the task. One way is to input very detailed descriptions of your end schema, which limits the number of assumptions our model has to make. In addition, you can define tests either by writing SQL code on Lume or by explaining in plain English the checks the final data has to pass (and edit them, of course). Our models make sure the end data passes these tests, guaranteeing your desired outcomes. We also offer versioning and backtesting capabilities, so you can have more confidence in your deployments. Finally, you can review the sample data and the SQL used, to confirm Lume drafted the integration you desired.

With regards to where your data is stored, technically we only need your schema information for our models and everything is run on your cloud, which some customers prefer for privacy / safety. That being said, the ability to sample source data or test the end schema, which does require some data read access, will improve your experience with Lume. In these cases, we of course have contractual agreements with our customers.


Is this really much faster than just writing these things? My latest integration with 4 endpoints took around 3-4 hours including tests. I feel most of the work comes from understanding the business model and making things fit, which you would still need to do, unless I'm missing something entirely?


In most cases, we build these transformations in a matter of seconds. Furthermore, we can detect changes from either source or destination and change the transformation accordingly, reducing maintenance burden as well.


I see some evidence that it handles complex transformations, but there are so many corner cases in the real world, like...

- Different ranges, where the source is, say "size 0-10", and the destination is "S/M/L".

- Various flattening or exploding needs. Like an array of namespaced tags driving a flat list of boolean fields. Or a source with 2 tables and a foreign key being transformed into tags, or flat fields, or a 3-level nesting.

- Encoding/Decoding things. Transforming windows-1252 into utf-8. Decoding base64 (or json, or xml, or...) and storing as fields in the destination.

- Compound transforms, both directions, two fields into one, or vice-versa with splitting on a delimiter.

- Appending a unique suffix/count to some field because the source doesn't enforce uniqueness on the field, but the destination does. Or going the other direction.

- Hundreds of similar patterns.

It's fairly easy to see the breadth if you look at all the dials and knobs on any popular ETL tool.

I'm curious if the idea is to pull all these into scope, or if it's to ignore it, and focus on a deliberately smaller market.


We've observed that our system handles most corner cases really well as long as the required context can be inferred from its inputs (either from the schemas and their descriptions or from the underlying data we sample). In the worst case, the most you'd have to do is edit schema descriptions on our platform to include the necessary context (for example, specifying the encoding you expect a field in your end schema to have).

For the compound transform scenario, since we optimize for modularity in the transformations we build, our systems prioritize defining these transformations unless it makes no sense to do so.
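To make a couple of those concrete, corner cases like range bucketing or compound splits usually reduce to ordinary SQL expressions once the context is clear. A hypothetical sketch (column names invented; SPLIT_PART as in Postgres/Snowflake):

    SELECT
        CASE WHEN size_num <= 3 THEN 'S'      -- "size 0-10" mapped to S/M/L
             WHEN size_num <= 7 THEN 'M'
             ELSE 'L' END             AS size_bucket,
        SPLIT_PART(full_name, ' ', 1) AS first_name,   -- compound field split on a delimiter
        SPLIT_PART(full_name, ' ', 2) AS last_name
    FROM source_items;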


The demo shows you transforming As-Is data to As-Is data.

Is this a full refresh each time or incremental (do you have to tell it the incremental columns or can it "tell"?)

Can you create audit timestamps in the target which track when rows were inserted or updated (or soft-deleted) in the target?

Can you take sources which contain "current state" table information and transform them into tables that have record start-effective/end-effective date (+ a current record indicator flag?) that support As-Was querying for a given primary key and/or which tracks soft deletes over time as an additional target table column?
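To make that last one concrete, the As-Was target I have in mind is a Kimball Type 2 style table, roughly like this (names made up):

    -- One row per (customer_id, effective period); history is preserved
    CREATE TABLE customer_history (
        customer_id     INT,
        status          TEXT,
        effective_from  TIMESTAMP,
        effective_to    TIMESTAMP,   -- NULL while the row is current
        is_current      BOOLEAN
    );

On each load, the previous row for a changed key gets its effective_to closed out and a new current row is inserted.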


This example is a full refresh, but incremental is usually the norm, especially for our supported connectors and continuous syncs. Our models can detect incremental columns.

Audit timestamps (usually tables) are typically created in intermediary stages (whose materializations you would have access to in your database) before getting pruned out to fit your destination schema. Of course, if the destination schema expects these audit tables or columns, they would be included in the target.

To your last question, if you include these tables or columns in the end schema you specify to Lume (or create a separate flow with a new end schema with these fields), what you described is definitely possible.


How does Lume know what to fill in on the target table if a record effective end date isn’t in the source system but is a property of when the data was fetched? How does Lume know to update the record effective termination date of a target row when a new row comes along?

(This type of data modeling is described here: https://www.kimballgroup.com/data-warehouse-business-intelli... )


Apologies, I misread your earlier question. Lume can only ascertain information that is given directly or that it can assume with reasonable confidence. So, in this case, this will not be possible unless there is information in the start schema from which it can be inferred. If these prior transformations were done in a dbt project, we could extract the information needed for this easily (a dbt project can be integrated and represented as a separate flow within your workspace).


Wow! How did you get 300(+) data connections with such a small team?


We leveraged Airbyte - it makes supporting that many connections much more seamless ... and a lot of coding!


Have you checked Airbyte's license[0]? Much of it is Elastic License 2.0, which I don't think allows what you're doing.

0 - https://github.com/airbytehq/airbyte/blob/master/LICENSE


> You may not provide the software to third parties as a hosted or managed service, where the service provides users with access to any substantial set of the features or functionality of the software.

The screenshots of Airbyte and Lume even look nearly the same. It looks like it's just a hosted Airbyte instance with GPT generating the SQL/config.

Smart for an MVP, not much of a moat for the long term, and it's a shame that software licensing is such a blind spot for so many SV startups.


I personally wouldn't do this even for an MVP. The product is competitive with Airbyte and they're breaking Airbyte's license.

There are definitely cases where it's ok to do something similar for an MVP, but I wouldn't touch this product knowing they can't continue to operate it this way (it could get shut down at any moment).

I'm a bit surprised someone at YC didn't flag this.


for what it's worth, airbyte's connectors themselves are MIT https://airbyte.com/blog/a-new-license-to-future-proof-the-c... - arguably this is "much of it" rather than ELv2


Thanks for raising this. We share your respect for the intellectual property rights of others. We are aware of Airbyte's license structure and use connectors per its terms.


Nice, if the connectors are MIT licensed but you've got your own server, that's great.

It might be worth differentiating the product further. Right now, looking at screenshots, it looks like a re-skin. I realise there's only so much you can differentiate an ETL service, and that the LLM feature is the main differentiating factor, but I do worry that it's very close right now.


Hi team! I'm one of the Airbyte co-founders. I think it might be worth chatting regarding the license indeed :). Don't hesitate to reach out to me on our Slack "John (Airbyte)"


This is not simply a hosted Airbyte instance. We use Airbyte's connectors for their common standards and the active community behind them. That said, our use of the project is limited, customized, and deeply embedded under our app. We do not use any UI components from Airbyte.


stupid feedback: the Loom video started with "Hi, this is Lume", which in my head is pronounced exactly like "loom" itself. My brain farted for a couple of seconds until I saw the logo of "Lume" in the Loom itself.


Thanks! It is a funny meta moment to be using a similarly-named tool.


That's a good problem to solve, but I wish it would be solved using standards, not with yet another service. Anyway, good luck to the founders!


Will this functionality be available only as a no-code tool? I'd love to see a python library or something of the sort.


Right now, Lume is a low- to no-code tool (see demo), but we have gotten requests for an SDK. Creating a library / SDK is on our radar! If anyone has personal or company use cases for a library / SDK, please email founders@lume.ai - we'd love to learn more.


Hi, how do you position yourself relative to products like Workato, Tray, AppConnect, etc.?


It's true that our platform can be used for the same use cases as some of those products. However, the main difference is in the customizability we offer. These products focus on and support the most common integrations and offer them as an automation service. For most custom integrations, users still have to write custom code within these products if possible, or build them out in-house. With Lume, this would not be necessary.


Any plans for a graph schema destination e.g. Neo4j?


Definitely high up on the priority list! We've actually been experimenting with this internally.


Commenting so that I can remember it in the future.


Is this a direct competitor to Zapier?


Another user asked this about other IPaaS’, such as Workato. This is our response: https://news.ycombinator.com/item?id=35238714

In short, Lume can be used for the same use cases as Zapier. However, Zapier focuses on and supports the most common trigger integrations and offers them as an automation service. For most custom integrations, users still have to write custom code within these products if possible, or build them out in-house. With Lume, this would not be necessary.


Are you letting users prompt the LLM?


Our system only uses LLMs at particular points of the process, so we do not expect letting users do this to have much value. However, descriptions we generate and/or take in as input for both end and start schema columns have a significant effect on the generation of your transformations. Therefore, the ability to edit these descriptions can be a powerful way to experiment with our models.


It's also a way to prompt-engineer/hack your stuff too, keep in mind


Yes, I’m curious how they’re handling sandboxing for this effectively untrusted code.


Our transformations are executed in a staging database/schema before deployment. We also have versioning and backtesting capabilities. In addition, you will have complete visibility of the code we produce before and after deployment.
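For intuition, the staging flow is conceptually something like the following (schema and table names invented, Postgres-style syntax; a sketch rather than our exact mechanics):

    -- Generated transform runs against a staging schema first
    CREATE SCHEMA IF NOT EXISTS staging;
    CREATE SCHEMA IF NOT EXISTS analytics;

    CREATE TABLE staging.customers AS
    SELECT id, LOWER(email) AS email
    FROM raw_contacts;

    -- Published only after the validation queries pass
    ALTER TABLE staging.customers SET SCHEMA analytics;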


Yep - we do not expose any sort of prompting. We use the LLM only at specific parts of the process, and the user has no access to it.


Doesn't the user provide the input that's fed to the function calling the LLM tho? Prompt hacking is a bit like SQL injection in my mind, but we don't have ORMs yet


This would be a concern if we were feeding raw user input directly into an LLM. In our case, we are not simply a wrapper over an LLM.

There are multiple parsing and rule-based steps applied to the input schemas - we extract specific pieces from the schemas and convert them to our internal format before feeding it to our models. This mitigates such malicious behavior.


Thanks for the answer. I just found out about kor on Twitter and it made me think back to this thread; sharing in case it's of interest: https://eyurtsev.github.io/kor/


You have the same company name as a deodorant company. https://lumedeodorant.com/


Coming up with original company names at this point is nearly impossible (and somewhat overrated)


They also have a name that sounds the same as a video conferencing solution:

https://loom.com



