Q: Run SQL Directly on CSV Files (harelba.github.io)
284 points by devy on Nov 14, 2018 | 96 comments



“Any file is a database if you awk hard enough.” —Bryan Horstmann-Allen

https://twitter.com/neilkod/status/914217352564137984

Furthermore, no thread on CSV files can be complete without mentioning this infamous bit of fact-trolling: the ASCII standard has had unit- and record-delimiters baked into it from the beginning.

https://ronaldduncan.wordpress.com/2009/10/31/text-file-form...


> the ASCII standard has had unit- and record-delimiters baked into it from the beginning.

It's mad that I've never heard of this before.

I'm amazed this isn't used more often!


They don’t really get used IMO because if you’re storing ASCII or Unicode text, you have to be prepared for those characters to be in your data. And if you’re storing binary data, delimiters don’t really cut it in the first place.


No, it's fine. Just base64 encode the binary data. It's a common approach and it streams, or you can capture and decode the entire field at once if you need to seek around in the binary data.

https://en.wikipedia.org/wiki/Base64
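
For illustration, here's a minimal Python sketch of that idea, using the ASCII unit/record separators from upthread; the exact field layout is just an assumption for the example, not any standard:

  import base64

  US, RS = "\x1f", "\x1e"  # ASCII unit and record separators

  def encode_record(fields):
      # base64 output only uses [A-Za-z0-9+/=], so it can never
      # collide with the US/RS delimiter bytes
      return US.join(base64.b64encode(f).decode("ascii") for f in fields)

  def decode_record(record):
      return [base64.b64decode(f) for f in record.split(US)]

  records = [[b"hello", b"\x00\x1e\x1f raw bytes"], [b"second", b"row"]]
  blob = RS.join(encode_record(r) for r in records)
  assert [decode_record(r) for r in blob.split(RS)] == records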


That only works if the thing consuming your CSV can be configured to decode whatever encoding you picked. Uploading CSVs of data to various SQL servers is an example case where that is often not possible.


I didn't mean for CSV. There are plenty of use cases for tabular data that aren't hand coded that this could be used for. Plenty of times I've just wanted to STGDD without messing around with a serialization library, or having to pick delimiters and encapsulators, do escaping, etc.


Also, they can't be found on keyboards, if you need to write CSV by hand.


You can type them at a terminal the same way you can write other low-ASCII codes. For example, record separator (30) is ctrl-^ - you can type it in the terminal the usual literal-control-char way, with ctrl-v [1].

So for example at a Python prompt:

  >>> ord('^^') # typed ctrl-v ctrl-^ here
  30
(The unprintable low ASCII codes correspond to certain ctrl combinations because ctrl-x is just the ASCII code of x with the high bits masked off, i.e. & 0x1f.)

[1] Customizable in your terminal settings; see stty lnext ("literal next")
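
To make the & 0x1f relationship concrete for the two separators discussed here (regular printable Python, no literal control characters needed):

  >>> hex(ord('^') & 0x1f)   # what ctrl-^ sends: record separator
  '0x1e'
  >>> hex(ord('_') & 0x1f)   # what ctrl-_ sends: unit separator
  '0x1f'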


Can't be found, but you can enter them by hand by holding down alt while entering the ASCII code on the numeric keypad. I wonder if they'll survive posting a comment: 28∟ 29↔ 30▲ 31▼ Edit: looks like they do survive. Hmmm... Now I wonder about extended ASCII. └┐│└└╚╦╝


Also, you can leverage the ISO/IEC 2022 extension and represent them with caret notation. Note: I'd recommend an escaping mechanism depending on the data. REF: https://en.wikipedia.org/wiki/C0_and_C1_control_codes#C0_(AS... An editor such as vim or a pager like 'less' will display them. You can also 'cat -v' a file.


The ability to eyeball it in a terminal is huge though :-(

Never underestimate ergonomics like screen printability!


If people used them then the characters would almost certainly be represented in terminals as table spacing characters...


Usability suffers if you can't type the delimiters easily.


I can’t stand it when I’m banging out a CSV by hand and have to manually escape the field and record separators. Happens all the time.


I use TSV for that reason. Most of the time you don't need to escape anything, and from code you only need to escape/unescape 4 chars: \t, \n, \NULL and \\.
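
As a rough sketch of what that escaping looks like in Python (assuming \NULL means the NUL byte; the escape sequences here are just one possible convention):

  ESCAPE   = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\0": "\\0"}
  UNESCAPE = {"\\": "\\", "t": "\t", "n": "\n", "0": "\0"}

  def escape(field):
      return "".join(ESCAPE.get(ch, ch) for ch in field)

  def unescape(field):
      out, chars = [], iter(field)
      for ch in chars:
          # a backslash always introduces a two-char escape sequence
          out.append(UNESCAPE[next(chars)] if ch == "\\" else ch)
      return "".join(out)

  row = ["plain", "tab\there", "multi\nline", "back\\slash"]
  line = "\t".join(escape(f) for f in row)   # one record per line
  assert [unescape(f) for f in line.split("\t")] == row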


In a terminal, Record Separator is CTRL-^, Unit Separator is CTRL-_. Vi[m] will accept the former literally, and will accept the latter if escaped with a CTRL-V.


oh, I love that! I'm going to use these delimiters from now on.


And they're literally named thus too. US, FS.

We just need an rcat to print them clearly.


If you're on Windows, you've had the ability to do this for Quite Some Time®: https://support.microsoft.com/en-us/help/850320/creating-an-...


Ah, I thought you were going to link to the fascinating skunkworks Log Parser tool https://www.microsoft.com/en-us/download/details.aspx?id=246... Wonder if it still runs on Windows 10?


If you need a UI, don't forget the tool: Log Parser Lizard ^^

http://www.lizard-labs.com/log_parser_lizard.aspx


Good of you to also accommodate the disabled.


Is this a joke or is the tool actually accessible?


Yeah, still works fine.


Been on Windows for 18 years now, never heard of this before. Sometimes I wonder if Microsoft really sucks at marketing dev tools.


Probably because ODBC is old technology; if you had been developing Windows software in the late '90s you would probably have known about it.


SQLite virtual tables. You can literally query any data source. All you need is a module implementing the interface SQLite exposes for the data source of interest. I once wrote a module that could query protocol-buffer-encoded data.


Fun fact: MySQL has actually supported CSV files as a storage backend for quite some time now [1]

1: https://dev.mysql.com/doc/refman/8.0/en/csv-storage-engine.h...


Came here to post the same. This can be used in very interesting ways, either for quick ad-hoc queries or for import/export functionality (ETL), where the core of your data is in InnoDB but you keep a CSV version for one of the phases of your processing, to interface with other tools.

This may look like a dirty hack compared to a clean API, but if you have performance considerations or need to interface with some legacy thing that happens to understand CSV, it can be a good choice.


I've never heard of this before, but it reminded me of this other command-line tool for wrangling CSV files. q makes use of SQL syntax much more than xsv does, so there isn't 100% overlap between the two.

https://github.com/BurntSushi/xsv


There is also a similar TSV manipulation tool suite written in D: https://github.com/eBay/tsv-utils


Thanks! xsv is new to me. I love Andrew Gallant's ripgrep (rg) command-line grep tool.


You should definitely consider using xsv under the hood. It's dramatically faster than any other tool of its kind that I've tried, with a low-level implementation in Rust.


His name is Burnt Sushi I think.


Another trick along these lines is to cut out the middleman (middleprogram? middleware?) and use SQLite to do it:

    $ sqlite3
    sqlite> .mode csv
    sqlite> .import foo.csv foo
    sqlite> SELECT * FROM foo WHERE bar = 'baz'; -- you get the gist...
    (a bunch of rows)
Q seems to be much easier to use (and certainly easier to remember), but I've always found it handy to have the full power of SQL at my fingertips when needing to do a bunch of CSV manipulations.


I think the csvkit CLI tool, csvsql, provides this functionality, too.


The perl CPAN module DBD::CSV also lets you do this. It was started in 1998 and still gets updates.


Text::CSV_XS is my go-to module for CSV. Just use that and whack it into MySQL or Postgres would be my advice.


nah sqlite is what you want when you want a step up from CSV.


Well, all I want to do is ingest some Google Analytics data and combine it with the data from SEMrush.


I just came to say this too. Turns out it was last updated in October 2018 :o


Real-world CSV files generally contain some or all of the following horrors:

- some strings enclosed in speechmarks, but some not

- empty fields

- speechmarks within strings

- commas within strings

- carriage returns within strings

How does Q hold up against a CSV file with those traits?


All of your "horrors" seem...correct? It's comma delimited, so anything that is between two commas should be parsed without issue; if it's a string with a comma in it, and unquoted, you simply have a broken CSV file. If it's quoted, then anything until the next (unescaped) quote is fine, including commas.

Unless you're trying to parse CSV files with regexes, none of those should be difficult, or even unexpected, to handle with a PEG parser, or any equivalent device.

Ofc if you're accepting ambiguity then it's just arbitrary how you handle it, but none of your examples afaict present any ambiguity (I'm assuming strings are either quoted or unquoted, with the former primarily allowing commas/newlines in strings; escaping exists as well; comma-delimited columns, newline-delimited rows).


If you need to import CSV from someone else, there are tons of ambiguities. Do you interpret an empty field as an empty string or a NULL? If you've treated a column of unquoted digits as numbers so far, do you parse the first row with a non-number in that column as a NaN, NULL or string? If string, do you reinterpret all the previous column values as strings?

Many people are not in the position to just return the file to the client/boss and tell them they have a "broken csv file". (They'll tell you they saved it in Excel and it reads back fine, so the problem must be on your end. E.g.: https://stackoverflow.com/questions/43273976/escaping-quotes...)
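
A toy sketch of just the numeric-column ambiguity (not how any particular tool resolves it - the empty-field-as-NULL choice here is purely a convention):

  def infer_column(values):
      typed = []
      for v in values:
          if v == "":
              typed.append(None)      # or ""? the file doesn't say
              continue
          try:
              typed.append(int(v))
          except ValueError:
              # a non-number showed up: reinterpret the whole column as strings
              return [None if v == "" else v for v in values]
      return typed

  print(infer_column(["1", "2", ""]))       # [1, 2, None]
  print(infer_column(["1", "2", "N/A"]))    # ['1', '2', 'N/A']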


Yes, it would be valid CSV. I suppose my point is that naive attempts to roll-your-own CSV parsers tend to fail on the points I listed. Hopefully Q does not do that.


They do? Commas and quotes are the two basic features of CSV, so it seems very strange to forget to implement half.


There is a lot of inconsistency out there. I have seen CSV files saved in Excel that couldn't be imported by Access because the latter doesn't handle line breaks in fields correctly. I've seen CSVs saved from various systems, such as the SQL Server Management Studio grid view and Wufoo exports, that don't generate CSV correctly. There are many lazy attempts at CSV generators out there that just throw line breaks between records and commas between fields and call it a day.


And even if they do wrap all fields in double quotes, it is very common to forget to escape double quotes within fields; then it depends on the parser whether it can determine the proper structure of the record.
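
On the generating side the fix is usually just to use a real CSV writer instead of string concatenation; for example, Python's stdlib csv module doubles embedded quotes and quotes fields containing commas or newlines automatically:

  import csv, io

  buf = io.StringIO()
  csv.writer(buf).writerow(['he said "hi"', 'a,comma', 'two\nlines'])
  print(buf.getvalue())
  # "he said ""hi""","a,comma","two
  # lines"

  # and it round-trips cleanly:
  print(next(csv.reader(io.StringIO(buf.getvalue()))))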


More interesting: any kind of delimiter, including chars from UTF-8 and windows-1252, and you need to detect the encoding too. And CSV embedded in CSV, a result of flattening an XML source. And fixed-width files, which aren't CSV but which you may be expected to support anywhere you see CSV. And let's not get into date parsing or other typed data, and type inference over sample files.


Hi, q's creator here,

Any kind of input/output delimiter is supported (-d <delim> and -D <delim>), and also multiple encodings (-e <encoding>). Also, q performs automatic type inference over the actual data.

Encoding autodetection and fixed width files are not supported though.


The company I work for also does delimiter autodetection, quote character inference (from a limited set), and encoding inference (which is mostly limited to utf8 / windows-1252 / iso-8859-15, but it can't reliably differentiate the latter two).


Without diving into the source code I can only say that q pops up a couple of times a year, either as posts or in CLI recommendation threads, so I suspect it's at least reasonably robust.


For those reasons I much prefer tab-delimited files. Does anyone know if Q supports that?


q supports any kind of input and output delimiter (-d <input-delim> and -D <output-delim> respectively).

harelba (creator of q)


Just this weekend I was filtering commas out of unquoted dollar values in a CSV.


PowerShell will clean up a gross CSV:

  (Import-CSV .\file.csv) | Export-CSV .\file.csv -NoTypeInformation -Encoding UTF8


Excel has a lot of this stuff built in as well, in the PowerQuery editor [0], which supports CSV, JSON and XML data sources. Albeit not SQL, it allows you to do almost everything SQL can do, but in a GUI.

After you're done preparing your data in PowerQuery you can run PowerPivot on it for aggregations.

[0]: https://cdn-5a6cb102f911c811e474f1cd.closte.com/wp-content/u...


Looks like PostgreSQL has a foreign data wrapper for CSV.

https://wiki.postgresql.org/wiki/Foreign_data_wrappers

In fact, seems it has several.


Yes. It's also trivially easy and lightning fast to import a CSV into a table using the COPY command in Postgres.


Perl had this in 1998: https://metacpan.org/pod/DBD::CSV Okay, 2002 - it took a while before anyone bothered to write a shell: https://metacpan.org/pod/release/TLOWERY/DBI-Shell-11.9/lib/...

And there are drivers for everything from PostgreSQL and Oracle to text files and the Azure WMI.



csvkit includes csvsql, which does this. I’ve used it and liked it. https://csvkit.readthedocs.io/en/1.0.3/tutorial/3_power_tool...


Came here to mention csvkit and the csvsql that comes with it. Highly recommended. It is one of the first things I install on a new system.


Discovered csvkit last week and it was a lifesaver. Had to remove a bunch of columns from a large CSV file with many columns - csvcut was way easier than trying to do it with awk or cut.


So is this a wrapper around https://www3.sqlite.org/csv.html ? It mentions sqlite.


Looks like it is implemented in Python (2) using the stdlib sqlite3 module: https://github.com/harelba/q/blob/master/bin/q


We've had this for ages now: the F# CSV type provider gives strongly typed access to CSV files, and the resulting object hierarchies are LINQ compatible. There's even support for applying units of measure.

Veeeery nice for data munging :)

http://fsharp.github.io/FSharp.Data/library/CsvProvider.html



>It is proprietary software, commercialized by Kx Systems

q is available i'm thinking


What? kdb, K and Q are part of a renowned array processing toolkit/environment widely used by data scientists for decades.

Calling this command-line tool Q, which happens to be used for the same purpose as the Q programming language (i.e. querying data) begs for confusion and misunderstanding...


I think tools like this don’t exist much because it typically makes more sense to just put it into a database or at least SQLite ... then do whatever you want from there.

Meanwhile, for all your streaming, filtering and aggregating needs there is awk.

I've never been bored enough to write my own little SQL library for awk, but I'd be surprised if it doesn't exist.


That's exactly what q (the linked tool) and almost all other tools like it do - they load the data into an SQLite database and provide some handy wrappers, that's it.
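
Roughly that approach, as a minimal stdlib-only sketch (this is not q's actual code - q also does type inference, header detection, configurable delimiters, etc.):

  import csv, sqlite3

  def query_csv(path, sql, table="t"):
      # read the CSV, dump it into an in-memory SQLite table, run the query
      with open(path, newline="") as f:
          rows = list(csv.reader(f))
      header, data = rows[0], rows[1:]
      con = sqlite3.connect(":memory:")
      cols = ", ".join('"%s"' % c for c in header)
      con.execute('CREATE TABLE "%s" (%s)' % (table, cols))
      marks = ", ".join("?" for _ in header)
      con.executemany('INSERT INTO "%s" VALUES (%s)' % (table, marks), data)
      return con.execute(sql).fetchall()

  # e.g. query_csv("foo.csv", "SELECT count(*) FROM t WHERE bar = 'baz'")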


Sqawk is an Awk-like program that uses SQL and can combine data from multiple files. It is powered by SQLite.

https://github.com/dbohdan/sqawk


There's also this:

  https://github.com/dkogan/vnlog
This is effectively a set of wrappers around awk, sort, join, etc that knows about field names, and lets you use the vanilla shell tools in a much nicer way. It's exactly what you want for the 99% use case. If your problem is complicated enough that you feel like sql is what you want, maybe leaving the shell is ok at that point.


Many of the applications at my workplace use flat text files in CSV format for logs and configuration. For new development I've been using SQLite to replace some of these usages. Sometimes when I need to analyze legacy log files I import the CSV data into SQLite tables. After doing this a few times I hit upon the idea: why not skip the import (which balloons my otherwise-small DB files) and write an SQLite plugin that treats the flat CSV file as a virtual table? It's neat to see there's some existing work here!


If you need to work with CSV, Excel or tab files from the command line I suggest having a look at alasql. It can be installed via npm and gives you a good list of export options.


Eep, no Python 3 support? :(


I was about to just recommend pandas, but then realized it doesn't natively support running SQL on dataframes - pyspark does, though!
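
For reference, a minimal pyspark sketch (assumes pyspark is installed, data.csv has a header row, and col_a is a hypothetical column name):

  from pyspark.sql import SparkSession

  # register the CSV as a temp view, then query it with plain SQL
  spark = SparkSession.builder.appName("csv-sql").getOrCreate()
  df = spark.read.csv("data.csv", header=True, inferSchema=True)
  df.createOrReplaceTempView("data")
  spark.sql("SELECT col_a, count(*) AS n FROM data GROUP BY col_a").show()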


I've done a similar thing, but using jq syntax on any object-like file: https://github.com/jzelinskie/faq

Ya know how some people want everything to be vim after they use it? I'm surprised more people haven't done that with other tools.


> faq is pronounced "fah queue".

So... "fuck you"?


> Have you ever stared at a text file on the screen, hoping it would have been a database so you could ask anything you want about it?

Not really. I could see this being helpful for debugging, but at that point you can get by with some simple bash string operations. Doing that same operation w/SQL seems like overkill.


Maybe my bash just isn't good enough anymore, but I do this all the time. In fact, I was just doing some work this morning where I needed to load a CSV into SQLite for analysis.

Since I still had the file, I ran some queries on the same file via Q to test it out. And it works great! (With some handholding to get quoting and delimiters right). The major downside was it's pretty slow, presumably because it reloads the file into a SQLite database each time you run a command. So, I'll probably stick with loading CSVs into SQLite myself, but I could see this being a useful tool for running one-off analysis on data from stdin.


The devil you know; I find myself with the opposite problem on occasion. "If only I could just grep and awk out what I wanted instead of using some contrived query and arcane data connectors"


> you can get by with some simple bash string operations. Doing that same operation w/SQL seems like overkill.

Coding bash seems like overkill to plenty of people. Bash is scary, while SQL is intuitive by design.


Not sure how they compare, but this reminded me of a thread about textql from earlier this year:

https://news.ycombinator.com/item?id=16781294


Piling on with pre-existing tools: The demo for calcite is a CSV to SQL wrapper.

https://calcite.apache.org


This would make a nice complement to xsv and csvkit but really needs to support python3. That is the main downside to using it.


csvkit actually supports SQL queries on csv files. But performance is pretty poor.


Is there any use for this not handled easily by lnav?

https://lnav.org


Great! I just wonder why it took so long for somebody to finally invent this...


It didn't. See the other comments for about a dozen already existing similar tools.


You can do something very similar to this using R's sqldf package.


We are doing this with Apache Drill on MapR and it works wonderfully.


What is the performance like compared to say Oracle External Tables?


Is there something similar for JSON? Thanks


There's jq, which doesn't support SQL but does allow you to apply filters and reductions to data in JSON files: https://stedolan.github.io/jq/



