Useless Use of "dd" (2015) (vidarholen.net)
395 points by RaoulP on May 23, 2022 | 252 comments



In the bad old Linux days, if you read a huge amount of data off the disk (like if you were doing a backup) Linux would try to take all that data and jam it in the page cache. This would push out all the useful stuff in your cache, and sometimes even cause swapping as Linux helpfully swaps out stuff to make room for more useless page cache.

One of the great things about `dd` is that you have a lot of control over how the input and output files are opened. You can bypass the page cache when reading data by using iflag=direct, which stops this from happening.
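A minimal sketch of that kind of cache-bypassing read (the device and destination paths are placeholders):

  # read the whole disk without dragging it through the page cache (GNU dd)
  dd if=/dev/sda of=/mnt/backup/sda.img bs=1M iflag=direct status=progress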


Moreover, flash drives (and all flash media) have a favorite page size, generally 4kB or 512kB. By choosing a common multiple like 1024kB (with bs=1024kB), you can keep your flash drive fed with enough backlog to write, so it can perform at its peak write speed without churning. That means faster writes and lower write amplification, which is a win-win.
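For the write side, that boils down to something like this sketch (bs=1M is the 1 MiB spelling; oflag=direct is an extra assumption here, to keep the page cache out of the way):

  # /dev/sdX is a placeholder for the flash drive - double-check it first
  dd if=image.iso of=/dev/sdX bs=1M oflag=direct status=progress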


Man, this brings back memories of a few years ago when I tried to dd a Linux install image onto a USB drive.

It would take forever and always end up with an I/O error. I figured the new PC had somehow wonky USB ports or something. But when that happened on another "known good" box, I figured it was the flash drive. But Disk Utility in macOS worked well.

Then I tried increasing the block size to 1M and everything went smoothly. It even took less time to write the whole image correctly than it took it to error out before.


(In theory at least,) the kernel should take care of aggregating write blocks, and they will be plenty big by the time they reach the target drive, all thanks to the very same page cache GP is talking about - unless you specify "oflag=direct" to dd.

That being said, probably don't use too small a block size - that will eat up CPU in system call overhead and slow down the copy regardless of the target media type.


I don't know how the internals of "cp" and the related machinery interact with the target drive, but if you don't provide bs=1024kB, dd writes in extremely small units (1 byte at a time IIRC), which overwhelms the flash controller and creates a high CPU load at the same time.

I always used dd since it provides more direct control over the transfer stream and how it's transported. I also sometimes call dd "direct-drive" because of these capabilities of the tool.


From the blog post:

"By specification, its default 512 block size has had to remain unchanged for decades. Today, this tiny size makes it CPU bound by default. A script that doesn’t specify a block size is very inefficient, and any script that picks the current optimal value may slowly become obsolete — or start obsolete if it’s copied from "


While I remembered the default wrong (because I never use the defaults, and I was too lazy to look it up while writing the comment), it's possible for a script to get a correct block size every time.

There are ways to get the block size of a device. Multiply it by 2 to 4 (or more), open the device directly, and keep your device busy.
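For instance, on Linux something along these lines reports what the device advertises (assuming the util-linux tools):

  blockdev --getpbsz /dev/sdX              # physical sector size in bytes
  lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sdX   # logical vs physical sector sizes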

The blog post is oblivious to the nuances of the issue and to the usefulness of "dd" in general.


Please forgive the nit-picking, I'm not attacking this (excellent) article, or your entirely sensible inclination to dig up some "physical" number, but...

With modern SSDs, "sector/block size" is rapidly approaching the vagueness of the cylinder/head/sector addressing scheme used a couple of decades ago on venerable spinning magnetic disks.

That is - it is definitely a thing, somewhere deep down, but software running on the host CPU trying to address those wouldn't necessarily end up addressing the same thing the user had in mind.

If you want a concrete example - look no further than the "SLC mode" cache, where the drive has a number of identical flash chips, but some of them (or even a dynamically allocated fractional number of chips) are run at a lower bits-per-cell count for higher speed and endurance. However, the erase and write block size for the chip is expressed in cells, not bits. What that means is that the cache and the main storage of the very same SSD have different block sizes (in bits/bytes).


> Please forgive the nit-picking, I'm not attacking this (excellent) article, or your entirely sensible inclination to dig up some "physical" number, but...

I don't think it's nitpicking. We're having a discussion here. We're technical people, and we tend to point out different aspects/perspectives of a problem and offer our opinions. That's something I love when it's done in a civilized manner.

Regarding the remaining part of your comment (I didn't quote it, to avoid making this look crowded), I kindly disagree.

The beauty of SSDs is that they have a controller which fits the definition of black magic, and all the flash is abstracted behind it - but not completely. Hard drives are almost in the same realm, too.

Running a simple "smartctl -a /dev/sdX" returns a line like the following:

    Sector Sizes:     512 bytes logical, 4096 bytes physical
This means I can bang it with 512-byte packets and it'll handle them fine, but the physical cell (or sector) size is different: 4kB. I have another SSD, again from the same manufacturer, which reports:

    Sector Size:      512 bytes logical/physical
So, I can dd it and it'll handle it just fine, but the first one needs bs=4kB to minimize write amplification and maximize speed.

This is exactly the same with USB flash drives. Higher-end drives will provide full SMART (since they're bona-fide SSDs), but lower-end ones are not that talkative. Nevertheless, a common-multiple block size (like 1024kB, because drives can also be composed of huge 512kB cells) allows any drive to divide the chunk into its physical flash sector size optimally and push the data out.

In the SLC/xLC hybrid drives' case, the controller does the math to minimize write amplification, but again, having a perfect multiple of the reported physical sector size makes the controller's work much easier and makes things way smoother. Either the reported physical size is for the SLC part, which is what you're hitting in most cases, or the controller is already handling the multi-level logistics inside the flash array (but thinking in terms of block sizes, since that is how it works on the bus side regardless of what's inside).


It's great to have control over this, but I suspect most users never knew this was happening, had no idea dd could bypass that behavior, nor knew which argument to pass to dd to accomplish it. It's like saying 'what makes 3D printers so great is you can make anything!' when you'd be way better off with an industrially forged object than the 3D-printed one.


Right, and good luck doing injection or vacuum molding for a few copies of a niche design.

Edit: correcting for nitpicker


You can do vacuum forming at home pretty easily. We used to do it back in the day for one-off handheld versions of home game consoles.


That's cool, I didn't know you could do that. How do you make the mold? If (as I suspect) the answer is handcrafting, 3d printing still offers a considerable advantage in labor.


Oh gosh yes. Though for certain finishes, vacuum forming still has some upsides. What's neat is you can use 3D printing to make the mould for vacuum forming over.

This gives you the best of both worlds, in terms of the shapes/finishes that forming can achieve, but with the workflow benefits of 3D printing.

Similarly, the new hotness is compression/cold forging using 3D printed moulds and carbon fibre (and other materials). Very very cool.


For a few copies of niche design you would do vacuum molding.


The main point wasn't the problems with 3d printers but actually looking at dd from a human centered design perspective.


Pretty much, and understanding what is going on "under the hood" as it were can be informative. Had the author done a 'cp myfile.foo /dev/sdb' on a UNIX system they would have found they now had a regular file named '/dev/sdb' with the contents of myfile.foo, and their SD card would have remained untouched. But you would only know that if you realized that cp would check to see if the file existed in the destination, unlink it[1], and then create a new file to copy into.

The subtlety of opening the destination file first, and then writing into it, was what made dd 'special' (and it would open things in RAW mode so there wasn't any translation going on for, say, terminals), but that is lost on people. Bypassing the page cache and thus not killing directory and file operations for other users of the system is a level even below that. Only the few remaining who have done things "poorly" and incurred the wrath of the other users of the system sitting in the same room really get a good feel for that :-). Fortunately, nearly everybody these days will never have to experience that social embarrassment. :-)

[1] Well unless you had noclobber set in which case it would error out.


Still the bad old days for copying files from iOS to Linux. It seems to make an internal copy on the device of everything you copy before sending it, which leads to running out of free space just trying to copy things off :(


Amusingly, this would occur with writes as well.

There was some heuristic in there that tried to prevent it, but it wasn't very good.


In my experience, Windows NT (now just Windows) is very fond of its file cache and large copies can blow up into memory paging as well.

Early Windows NT was awful with this, pegging the system with a cascade of disk IO at unpredictable times, often for ten seconds or more.

Can anyone suggest ways to avoid blowing the file cache on Windows with large copies? Is this even a problem anymore?


On macOS, you can also use the `--nocache` flag for the `ditto` command.

Please keep in mind that `ditto` is a file copy and archive utility, not a block copy utility like `dd` (which is also available on macOS).

An online man page for ditto: https://ss64.com/osx/ditto.html


I think using a small bs also determines the size of the cache you use, as it's the buffer.


GP is talking about the Linux kernel's buffer cache. Unless you tell the kernel to operate directly on the disk, your reads come from and your writes go to pages within the kernel's buffer cache. Using a small bs probably results in a buffer of only bs bytes inside dd's address space, but the buffer cache is completely different and resides in the kernel.

That is, without iflag=direct, dd will repeatedly ask the kernel to copy bs bytes from the kernel's buffer cache into dd's address space, and then ask the kernel to copy bs bytes from its address space into the kernel's buffer cache.


I'd have thought dd always avoided the page cache. For what dd use case is that the desired behavior?


Linux absolutely still does this FWIW; it's one of the reasons that swap is a net negative.


Useful uses of dd(1).

There remain some useful applications of dd. These may of course be achieved by other mechanisms, but typically less conveniently.

1. Read a specific number of blocks or bytes from a source:

  dd if=/dev/hda of=/root/mbr bs=512 count=1
This will make a copy of, say, your master boot record (first 512 bytes of your first disk drive) and stash it in your /root directory.

2. Read from specific bytes of a file

  dd if=mydata skip=1k bs=32 count=1
Reads 32 bytes after the first 1024 (1k) bytes of "mydata".

3. Write to specific bytes of a file

  dd if=source of=target seek=10k bs=512 count=1 conv=notrunc
That should write 512 bytes from "source" beginning 10k into "target". (I've not tested this, you should verify.)

4. Create a sparse file. Sparse files appear to have a nonzero size, but take up no space on disk, until data is actually written to them. These are often used as "inflating" dynamic filesystem images for virtual machines.

  dd if=/dev/zero of=sparsefile bs=1 count=0 seek=20000M # Create 20 GB sparse file
5. Case conversions. Sure, you could use tr(1), but where's the sport?

  dd if=MixEdCaSE of=lcase conv=lcase   # Convert to lower case
  dd if=MixEdCaSE of=ucase conv=ucase   # Convert to upper case
6. ASCII / EBCDIC conversions

  dd if=ebcdic of=ascii conv=ascii   # ebcdic -> ascii
  dd if=ascii of=ebcdic conv=ebcdic  # ascii -> ebcdic

When reading from or writing to IBM data tapes, you might find the blocking/unblocking conversions useful. I've done this, but it's so long ago that I don't trust my memory on it any more. Odds are good you'll not have to worry about this.

There are other useful applications as well, though these are not typically encountered very often. Do feel free to explore and attempt these on safe media.


Careful, your #2 and #3 are incorrect -- skip and seek operate with blocks, not bytes. So your #2 would copy 32 bytes after the first 32kB of data, and #3 would write 512 bytes at position 5120k.


Btw, there are iflags in GNU dd to work in bytes as well. I have ddbytes aliased to dd iflag=count_bytes,skip_bytes oflag=seek_bytes.


Thanks. Most of that was from memory and not tested.

The specific recipes should be vetted. The stated goals can be achieved with proper invocation.


> Read from specific bytes of a file

Especially handy when you've fed a huge amount of JSON (sometimes all on one line because, y'know, why not) into jq and you get the inscrutable output:

    parse error: Invalid numeric literal at line 1, column 236162512


Tail can do this, too:

  tail --bytes=+236162512


Handy, but `dd` is much better at giving you a limited context to look at - `dd bs=1 skip=236162504 count=20` just gives you 8 bytes before and 12 after.

(edit: I do have to concede that the `tail|head` version is much faster than the `dd` - ~11s vs ~65s in my quick test with that ^skip)


I suspect you'd get better performance increasing the blocksize, and if necessary trimming the output in a second pass.

Large blocks -> efficient I/O. Within reason.


A note here: if you're using GNU dd, you can also use iflag=count_bytes and then set the block-size to whatever you want. That'll give you the best of both worlds.
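Something like this sketch should do it (huge.json is a placeholder file name; skip_bytes is the companion GNU flag for a byte-accurate offset):

  dd if=huge.json bs=1M iflag=skip_bytes,count_bytes skip=236162504 count=20 2>/dev/null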


Ah, good to know. With a block size of 1M, that brings the `dd` version down to ~15s.


Ugh!

I'll keep that in mind, json-parsing being among my current hobbies...


Apologies. Tangent:

What does the (1) mean beside dd? I see this with man pages. Is it a version identifier?

Edit: thank you both for taking the time to share. I appreciate the quick response.


The “section”[0] or “subpage.” Programs are section 1, hence ‘dd(1)’. The idea is that it can be possible for a library function (section 3) to have the same name as an executable. In which case, you’d type into your shell:

    man 1 <name> # executable: <name>(1)
    man 3 <name> # function:   <name>(3)
Without the distinction argument, man will throw its arms up in the air and give up.

However, if there’s no ambiguity (such as with ‘dd’ where only the executable exists), you can drop the section number parameter when running man. man will then search all the sections, see that ‘dd’ only exists in section 1, and go from there.

[0]: https://linux.die.net/man/


Something like `read` gives a better example. `man 1 read` is "read — read from standard input into shell variables", "read [-r] var...", a shell builtin. And `man 2 read` is "read - read from a file descriptor", "ssize_t read(int fd, void *buf, size_t count);". You can also get into `man 8 catman`, "catman - create or update the pre-formatted manual pages", "catman [-d?V] [-M path] [-C file] [section] ...". The `catman` is a mirror-like hierarchy of man pages (usually in troff format using the 'an' macro package) pre-formatted for the standard terminal or whatever, to shave off a bit of time running roff on the source files all the time.

This is all ancient knowledge and well in place back in the mid 1980's long before Linux or any of that stuff. The `man` page sections used to be different 3-ring binders all printed out sitting on a table in the computer lab. The 'sections' were just different binders of documentation. Get off my lawn!!!


Correct.

I also use the notation as a convention to indicate that I'm referring to a Unix command (or library function, etc.). E.g., "cat" might refer to a feline, but "cat(1)" should more clearly refer to the Unix / Linux command.

(Where I've got proper markup, I'll typically set commands or references in monospace using backtick notation: `dd`, `cat`, etc.)


What do you do if you need to unambiguously refer to cat-as-in-feline?


cat (Felis catus) ;)


I show claws.


> Where I've got proper markup, I'll typically set commands or references in monospace

You can do that here by putting them on a separate line (two newlines, separate paragraph) and indenting by a couple characters.


> Without the distinction argument, man will throw its arms up in the air and give up.

At least in man-db and FreeBSD implementations, it actually shows you the first page it found by default.


You're correct; I thought that might've been wrong...

I just tested, and `man man` opened man(1) despite man(7) existing.


Or config-file formats in section 5, which you're perhaps more likely to come across as a user.


Why is it random numbers rather than readable words?

    man exe <name>
    man lib <name>


As I understand it, because it descends from printed manuals with numbered sections.


Yep. Executables are section 1 simply because they came first in the binders.

There's nothing necessarily preventing man from allowing string codes as a replacement, but man interprets (fully) non-numeric arguments as pages to open. So `man cat dd` (on Ubuntu) will first open cat(1), and when you exit (with 'q'), will prompt if you want to continue. If you say yes, it'll open dd(1).

That has the side effect that `man exe dd` would be interpreted as opening exe(#) (printing "No manual entry for exe") followed by dd(1).


Man pages are divided into sections as follows (GNU & Linux):

  0  Header files (usually found in /usr/include)
  1  Executable programs or shell commands
  2  System calls (functions provided by the kernel)
  3  Library calls (functions within program libraries)
  4  Special files (usually found in /dev)
  5  File formats and conventions, e.g. /etc/passwd
  6  Games
  7  Miscellaneous (including macro packages and conventions), e.g. man(7), groff(7)
  8  System administration commands (usually only for root)
  9  Kernel routines [Non standard]


As explained in `man man`, of course.


[flagged]


Er… what?

GP’s comment should apply to most if not all systems with a few caveats / divergences.

FreeBSD certainly mentions and lists standard sections at the top of `man man`. OpenBSD mentions the concept of categories / sections early on, though it only lists the specific section when it comes around to documenting the corresponding filter.

GNU invented info(1), not man(1).


It is just a pet peeve of mine that a lot of younger pros and devs talk about Linux as though it were groundbreaking, when all of GNU/Linux is a copy of a copy of a copy of a copy of the actual earth-shaking developments, SysV (arbitrary) and BSD. Everything in Linux was there before Linux was a twinkle in Linus' eye. I remember when most webservers ran NetBSD, and 5 years later Linux took over the data center. Was NetBSD really so intolerable and Linux really that superior? No, it's just that with no memory of the past, one can't know any better. Linux really brought nothing new, no new advances, nothing that wasn't there before, and yet it took over like a jihad. That isn't accurate... not like a jihad... it was a jihad. We can thank fanaticism for Linux in the datacenter. I'm not unfaithful, just system-agnostic. Linux was a solution to a problem already solved many times. And since Linux is not original, it sort of gets under my skin when it is insinuated as such. When talking about any software, one should refer to its original development. When we talk about web browsers, we don't talk as though, say, Microsoft invented the web browser, just because it has one; instead we talk about Tim Berners-Lee, not a copy of a copy of a copy of his work.


The main reason Linux outcompeted the BSD-descendants is the GPL license. Instead of competing with proprietary extensions on a free base, fragmenting the ecosystem, upstreaming as much work as possible makes economic sense.

We can still observe that effect in the Android ecosystem which started out really bad and fragmented and slowly drifts into a more coherent whole, instead of the other way around.

Many saw what the UNIX wars led to and tended to avoid similar situations. I too liked what the BSD homogeneous distribution led to on a technical basis, but the GPL makes sense as long as business cases can be made to fit.


Nah the main reason Linux outcompeted the BSD is that it arrived at exactly the worst moment for the BSDs: USL v. BSDi.

This case put a severe pall on the attractiveness of BSDs as they were suddenly in legal jeopardy just at the outset of the UNIX wars and as they were coming into their own.

And at the same moment, a cleanroom unix arrived on the market, limited in many ways but safe.

The GPL was at best neutral for most users, as can be seen from its adoption (or lack thereof). However the GPL was nowhere near as problematic as “AT&T might get our OS declared illegal”.


Yes, there was that, too. Now I did not mean that the license is of crucial importance to most end users (it might be for some, but likely the other way around as some will prefer the simpler BSD clauses). But it was decidedly important for the business and consulting side to form, and that was hugely where Linux won.

Red Hat, Cygnus and the IBM service group took early big bets on Linux, which could not have happened on a product where vendors based their respective offerings on proprietary lock-ins. That drove adoption in banking and defense whose existing UNIX stacks looked increasingly old, which drove a huge industry shift that took the better part of a decade.

It used to be quite common to find people arguing that BSD was the more "business friendly" license, which may be true in some specific ways but tends to miss the bigger picture. The adoption of a mainstream system under a GPL license was important.

Then the situation was probably different in the web hosting business, in academia, and in other sectors where other factors dominate.


What NetBSD did not have was drivers for any old commodity PC hardware which everyone had laying around. That’s it. That’s why people ran Linux on their stuff, and then continued running Linux in the data center.


386BSD, the predecessor of Net/Free/OpenBSD, was published under a BSD license in 1992, 6 months after Linus posted his kernel sources on Usenet. Linux was free and open source and arrived no later than the "free" BSDs that are still available today. I'd say that is a reason why it is successful.

Edit: "free" in quotatio marks to not confuse with FreeBSD.


Not to mention the GNU project had been around for nearly a decade previous


I fail to see how the comment you are so vehemently responding to in any way implied that GNU/Linux was somehow groundbreaking. The only mention of GNU & Linux was to clarify that the sections that follow are the man sections on GNU/Linux systems, which is appropriate since not all UNIX flavors contain the same sections or the same order of sections. GNU/Linux systems mostly use the same sections and order as BSD-based systems. SysV-based systems usually have some differing sections and a differing order of sections.


> We can thank fanaticism for Linux in the datacenter. I'm not unfaithful, just system-agnostic. Linux was a solution to a problem already solved many times

And outside the datacenter? Was the problem of an Open Common Desktop OS - for those people who are radically *not* system-agnostic - solved at the time?

What could have been the effects on the trends building the scenario until today and beyond, had Linux not appeared but keeping the rest of the chessboard intact?


While we're on a tangent. Here's a quick way to list all man pages available on your system:

    $ man -k ''
And if you just want to see the ones for high-level documentation (i.e. section 7), then

    $ man -s 7 -k ''
does the trick. Sections 7 and 5, in particular, are full of hidden gems.


On the topic of hidden gems: the info directory is frequently populated by default with manuals when you install a piece of software. These tend to be more in-depth and complete manuals than the man pages for the same tool. Just type `info` in a terminal and be amazed. Emacs has a convenient info viewer too, under `C-h i`.


Usually `info` only works on a GNU-derived OS. `man -k` has worked on every flavor of Unix I've met.

edit: `man -k ''` might be a GNUism too. Just tried on a BSD derived OS and got back nothing.


Perhaps more directly to your question: Yes, as siblings note it is a manpage section number, but writing it like that is just a way to refer to programs; "cat" could be a feline, but "cat(1)" is a unix program. Oh, and it can disambiguate; printf(1) is a program you run from the shell, printf(3) is a C library function. IMO it's as much a cultural convention as anything.


Precisely this, esp. the cat vs. cat(1) distinction (which I'd just addressed in another response).


The other answers are good, but haven't quite covered the topic.

The Unix manual was commonly printed out, and these section numbers were a necessity for looking things up. Collation was by section number, then alphabetical.

When you were first introduced to Unix, you sat down with the manual and read it.


For the first option a simple

    head -b 512
will also copy the first 512 bytes, in case you want to avoid dd for clarity. I have actually used that for moving MBRs around.


Though "-b" is a gnuism, and isn't there at all on many unixy OSes, or is "-c" on others.


What drives whoever's the second person to implement such a flag to make it different? An explicit desire to inconvenience other "tribes" of computer users and use the flag as a symbol of an in-group?


Following a convention already established in the local ecosystem, presumably.

Though in this case I think previous posters are mistaken. My GNU implementation of head supports -c and not -b.

Perhaps it's the fact that the longopt is --bytes that caused the confusion.


How do you change, improve, or extend a standard if no changes may be permitted?

What's keeping other implementations from adding these features?


Changes with a good reason are fine, but I don't see a good reason to make the same flag a different letter.


Apparently a typo in the post I was replying to... there is no "-b" switch; it's "-c" or "--bytes" for GNU head. Though there are versions of head without one or both of "-c" and "--bytes".


As noted below my post above is wrong. It should be -c.

I simply misremembered.


create a sparse file

Note that this can also be done using truncate(1).


Also fallocate (Linux only though).


dd can do it on an existing file or stream, transforming the sequences of zeroes it contains into "holes".


caveat operator: to ensure the conv=sparse operand achieves the desired outcome, be sure to use an output blocksize equal to st_blksize of the output filesystem.
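A minimal sketch of that, assuming GNU dd and a filesystem whose st_blksize is the common 4096:

  # seek over all-zero 4 KiB blocks instead of writing them, leaving holes
  dd if=vm.img of=vm-sparse.img bs=4096 conv=sparse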


or just use `cp --sparse=always` if it's from one file to another


That is not so portable, so I recommend sticking with dd.


Yup, `dd` helps in a lot of situations when you need portability, not just this one.

Commands like `cp --sparse=always` aren't worth such a blanket disrecommendation though; if you are working directly at a console as opposed to scripting you typically don't need portability.


Again, most of the consoles I work at don't use GNU coreutils.


Then you obviously can't use this, but many people reading my suggestion can.


My most common use for `dd` is using it with `sudo` to direct the output of a unprivileged pipeline to a root-owned file. Instead of running `echo hello >/root/test.txt`, which will fail, I use `echo hello | sudo dd of=/root/test.txt`.


A note: I'd recommend using tee instead of dd for that job, or add iflag=fullblock if your dd supports it.

The thing is that dd issues a read() for each block, but it doesn't actually care how many bytes it gets back in response (unless you turn on fullblock mode).

This isn't really a problem when you're reading from a block device, because it's pretty uncommon to get back less data than you requested. But when you're reading from a pipe, it can and does happen sometimes. So you might ask for five 32-byte chunks and get [32, 32, 30, 32, 32]-sized chunks instead. This has the effect of messing up the contents of the file you're writing, with possibly destructive effects.

To avoid it, use `tee` or something else. Or use iflag=fullblock to ensure that you get every byte you request (up to EOF or count==N).
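Roughly, the two safer spellings of the earlier example would look like this (iflag=fullblock and status=none assume GNU dd):

  echo hello | sudo tee /root/test.txt >/dev/null
  echo hello | sudo dd of=/root/test.txt iflag=fullblock status=none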


I've never had any trouble, but good to know.


7. Write a new MBR to a disk, keeping the partition table:

    dd bs=440 count=1 if=/usr/lib/syslinux/bios/mbr.bin of=/dev/sda
Even in the age of EFI/GPT, that still gets used often (usually VM providers that only offer MBR boot).


  head -b 440 /usr/lib/syslinux/bios/mbr.bin > /dev/sda


I suspect you meant -c 440; I can’t find a variant of head(1) that has a -b operand on any Unix. Note that -c is not POSIX but does have widespread support. Notably missing on Solaris.

Fun fact, the -c usage comes from ksh, where head is a shell builtin.


Oops, yes, -c indeed! Thanks!


Fun fact: 1-4 don’t work in the context of short reads — and GNU’s fullblock extension isn’t specified in POSIX.


While there’s a lot of truth here, there are times when you can do much better than cat. A while back I tweeted:

Today I found the magical dd command that causes an NVMe drive to run at almost full speed:

  # dd if=/dev/nvme3n1 bs=4096k iflag=direct of=/dev/null status=progress
  959090524160 bytes (959 GB, 893 GiB) copied, 178 s, 5.4 GB/s
The trick to getting this throughput is telling Linux to do an insanely large IO (4 MiB). The drive can't do 4 MiB reads - the largest IO it can handle is 2 MiB.

  # nvme id-ctrl /dev/nvme3 | grep mdts
  mdts      : 9
  # echo '4 * 2^9' | bc -l
  2048
More in the thread starting here:

https://twitter.com/OMGerdts/status/1514376206082269191?s=20...


My SSD - apparently one of the fastest on the market, a KINGSTON SKC3000D2048G rated 7GB/7GB read/write - stopped being very fast after writing data to it. No idea if it's something related to how NVMe/SSDs work, or maybe I have a broken unit.

  $ sudo dd if=/dev/nvme0n1 bs=4096k iflag=direct of=/dev/null status=progress
  ...
  9667870720 bytes (9,7 GB, 9,0 GiB) copied, 20,2714 s, 477 MB/s
But when reading from empty space (as in, not-yet-written blocks), it goes at full PCIe x4 Gen4 speeds.

  $ sudo dd if=/dev/nvme0n1 bs=4096k iflag=direct of=/dev/null status=progress skip=450000
  ...
  16710107136 bytes (17 GB, 16 GiB) copied, 2,42802 s, 6,9 GB/s
I have another NVMe drive - a Force MP510 - and it doesn't care whether the data was previously written or not. When reading from it, I get ~full x4/Gen3 speeds of 3.5GB/s.

PS: nvme smart-log shows 100% available spare, and 0% percentage_used, so it doesn't seem to be wear-related.


SSDs are extremely complex internally.

They write in small blocks (eg, 4K), but can only erase in very large blocks (eg, 2M). Writing to an empty device is easy, but eventually you have to overwrite something, and you can't just replace a 4K block with another. You have to take a contiguous block of 2M, wipe the entire thing, rewrite whatever part of it was useful, then do the write you actually wanted to.

This effect is known as "write amplification", and it means that in bad cases you need to do many times more work than the host system requested.

Modern high end SSDs have various ways of dealing with this like a RAM cache, a SLC cache and extra reserved space to always have some spare room, but there are still limits.

This is what TRIM (confusingly also known as 'discard' in some contexts) is for -- the SSD operates in blocks and has no way to know that some chunk you've written to before is now useless to you because you deleted the file it belonged to, until you give it the command to overwrite that block. TRIM tells the drive "these parts can be recycled", and allows it to create empty blocks in advance. So make sure to TRIM once in a while.

Sadly TRIM wasn't well specified initially in regards to what performance characteristics it should have -- some old drives can get stuck on it for a while. So while many filesystems support emitting TRIM automatically, it can cause severe performance issues on some drives and the recommendation is to do it as a maintenance task on a timer instead.

TL;DR: Run `fstrim /mountpoint`. Wait a while before testing if it changed anything. The drive isn't obligated to do the work immediately.

It gets a bit more complex than that due to layers. LVM and dmcrypt can filter out TRIM requests. You can use `lsblk -D` to check the support status.


> You have to take a contiguous block of 2M, wipe the entire thing, rewrite whatever part of it was useful,

This is called garbage collection. It may be happening at any time in the background but becomes more frequent and perhaps in the write path as the drive fills. 2M is an example size - the actual size will vary by drive model and it is rarely disclosed.

> Modern high end SSDs have various ways of dealing with this like a RAM cache, a SLC cache and extra reserved space to always have some spare room, but there are still limits.

There are multiple types of SLC cache as well. Client drives may have a small number of gigabytes of SLC that can absorb a small burst of writes. Client drives may also have pseudo SLC (pSLC) that is called pSLC, TurboWrite (Samsung), etc. With pSLC, when there's about 30% of the drive's NAND that is erased, the drive will use that as SLC. So, a 1 TB drive will use about 300 GB of space as a 100 GB SLC cache.

How performance degrades as the drive fills varies widely between drive models. Some drives start to have significant read and write performance degradation long before the drive is half written. Others will maintain fairly consistent read performance (maybe within 90% of that which is advertised) regardless of how full the drive is. For instance, the original version of the Samsung 980 Pro maintains close-to-spec read speeds regardless of how full it is but write performance drops from about 5200 MB/s to about 1300(?) MB/s the moment it hits 70% allocated.

Datacenter and enterprise class drives tend to have lower peak performance than client drives, but their performance is much more consistent regardless of how full they are.

If you are buying a client NVMe drive for speed, buy one that is larger than you need and set aside at least 30% of it in unpartitioned (or unused partition) space. This will prevent the OS from writing to 30% of the drive thus keeping plenty of space for pSLC and similar optimizations. This will also increase the life of the drive as garbage collection is likely to have to rewrite the same data less frequently, resulting in a lower write amplification factor.[1]

1. See Over Provisioning at https://semiconductor.samsung.com/consumer-storage/magician/

> This is what TRIM (confusingly also known as 'discard' in some contexts)

But wait, there are more terms for the same concept: trim (ATA), unmap (SCSI), and deallocate (NVMe) are the interface-specific ways that Linux performs discard.

> Run `fstrim /mountpoint`

Or if there is no filesystem, `blkdiscard /dev/nvmeXn1[pY]`.


I've found that the write pattern can have an effect on future reads. For instance, I've seen random 4k writes followed by 1m reads result in a somewhat lower read rate than if the writes were originally done as 1m writes.

The degradation you are seeing is rather extreme - it seems to be performing worse than a cheap SATA SSD. I'd say this is worth having a conversation with Kingston about.


Copying zeroes is much less work for the controller and the SLC cache.

Is it possible that the controller is overheated? Do you have the NVMe drive under a GPU?


> copying zeroes is much less work

Yes, reading "evicted" or "trimmed" space [0] specifically, is much less work - only flags are read, not actual media.

[0] https://en.wikipedia.org/wiki/Trim_(computing)


And on FreeBSD (and perhaps others) you can also specify the speed limit.

    $ echo 'like this' | dd bs=1 speed=10


Oh thank you for submitting this to HN. I’ve been telling people not to use dd for years and everyone looks at me like I just gave birth to a full grown dinosaur or something.

“Well why does the entire internet say to use dd then?”

Because they copy from each other just like you copied from them. Just use cat.


> Just use cat.

We’ve gone full circle.

https://en.m.wikipedia.org/wiki/Cat_(Unix)#Useless_use_of_ca...


The reason to use cat is so you don't risk accidentally overwriting the source file by messing up a command.

Anything after the "cat file |" can't overwrite the file.


That seems like a bad reason. You can always overwrite the file by messing up the command.

  cat file | cat > file
(This won't, as you might hope, leave the file unchanged.)

Re-read what you typed before hitting enter, and keep backups of important things.


That's an unlikely mistake.

More likely is you are messing around with some new/infrequently used utility and you pass in the arguments incorrectly and specify "file" as the output instead of the input as intended.


The one I typed is unlikely (I wrote it in response to your "Anything after the `cat file |` can't overwrite the file", which is factually incorrect as demonstrated), but there are plenty of other simple mistakes. Understand what you're running, and keep backups, otherwise you'll screw up eventually. Selecting your tools this way doesn't move the needle much.


Removing cat and specifying the file directly to the utility doesn't move the needle in any tangible way either.

At the end of the day do whatever pattern works for you - for me it's doing "cat file |" at the start of a pipeline and "> outfile" at the end.

I also avoid globbing inside a pipeline as it can be dangerous too.


You can replace "cat file |" with "<file".


Well, yes, but then it wouldn't have been a very good demonstration of why the statement "anything after the `cat file |` can't overwrite the file" is incorrect.


>“Well why does the entire internet say to use dd then?” Because they copy from each other just like you copied from them.

Great statement. This brings me some much needed internal clarity on my own thoughts and actions.


This is basically the parable of "Grandma's Ham"

https://www.executiveforum.com/cutting-off-the-ends-of-the-h...

tl;dr nobody in 2 generations knows why they cut the ends off the ham before cooking it, until they talked to grandma, who said her pan was too small

A Unix thing that's been posted to HN for a decade, that's almost literally the same story:

Understanding the bin, sbin, usr/bin , usr/sbin split

https://news.ycombinator.com/item?id=3519952

tl;dr /usr/bin is separate from /bin because someone had a small hard disk once


So Grandma's Ham is a variant of Chesterton's Fence... When you do or don't do something for reasons of tradition, find the real reasons for doing / not doing it.


You’d be amazed how many people I’ve found that do ‘tar xzf filename’ without knowing what xzf is doing.


I'll admit to being one of those people. My brain still spells it out as eXtract Ze Files every time :-P


You're 2/3 correct:

x - extract

z - gzip format

f - file (must be last so it parses the filename arg correctly)

I also add "vv" in the middle so it lists every file as it goes, so I can see it work instead of just waiting with no output.


It's not often I actually unzip stuff on the command line these days, but saying "extract ze files" in a terrible French accent in my head when I do is a highlight.


Passing "z" is pretty pointless for extractions.


I think it was required until about 2010 or so for GNU tar. (I still use it from habit.)

There was a problem with integrating support for different compression applications, with each one needing to get a new letter in the tar command!


Yup, I remember when you had to pass `z` if it was gzip, and I remember my surprise when I missed it once and it still worked (apparently about 5 years after it was no longer needed!)


Won't this fail when the file is not gzipped but, for example, zstd compressed?


It doesn't assume gzip, it detects the compression format.
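For example, a reasonably recent GNU tar works out the compression on its own when extracting (assuming the matching decompressor is installed):

  tar xf archive.tar.gz    # gzip detected automatically
  tar xf archive.tar.zst   # zstd detected automatically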


I don't tend to amaze myself.


What initially mystified me into thinking "I guess I have to use `dd` for block devices" was learning that `rsync` doesn't work with block devices.

I just tried to copy an ISO to a USB drive with `rsync`, and it didn't work.

Looking it up, I read these comments[0]:

"rsync operates on files which are on a filesystem. It doesn’t do comparisons between blocks. If you want to back up a block device to a file, use dd"

"I completely understood the use case. rsync (still) does not operate on block devices, so it’s not the solution here."

"Agreed, rsync has never operated on block devices."

So I thought, "Ah. There's File Data, which is what `rsync` deals with, and RAW BLOCK data, which is what scary hardcore tools like `dd` deal with."

Then the notion that `cp` and `cat` can deal with both species of data is confusing. :p

0: https://old.reddit.com/r/linuxadmin/comments/eappzm/rsync_bl...


"Well why does the entire internet say to use dd then?"

Because you usually need root to write to drives, and you can't `sudo` redirect. `sudo dd if=whatever.img of=/dev/whatever` is nicer and easier than `sudo sh -c "cat whatever.img >/dev/whatever"`.


In the cases where you need root, it's often even easier to use the 'tee' command instead of spawning another shell, especially if the command has symbols which need to be escaped.

Your example would then become something like...

    cat whatever.img | sudo tee /dev/whatever > /dev/null  # https://stackoverflow.com/questions/82256

Just like in household plumbing, the 'tee' command basically takes the input and sends it to more than one place. Naturally, running 'sudo tee' will let you send things all over but as another user.

All that said, I won't speak to the comment "why does the entire internet say to use dd". I've never actually found the "whole internet" to agree on much of anything (-:


Well, I’d like to thank the author for writing this. I always enjoy having the veil of “magic” lifted from computers.

UNIX’s simplicity of “everything is a file” continues to surprise me, even though it shouldn’t any more. I think a lack of confidence in my understanding of the basics, leaves me tempted to copy from others as you mention.


Author here. You're welcome ^^


What if you need to test direct IO, or test various partial-page, partial-block, or full block/page accesses to the page cache? Sometimes you might actually want to use direct IO too: if buffered IO is not giving you sufficient control or reporting of errors from the device, or continuation from part-way through the copy, and in some cases possibly because you don't want to consume memory for the page cache (although dd isn't good with direct IO, so that's probably unlikely).

That said if you don't know whether or not you need it, you don't. And those who do would know not to listen to your advice (for what they need to do) anyway, so it's not bad general advice to give.


Direct I/O is notoriously finicky. Depending on what filesystem you're using there are apparently alignment restrictions and other corner cases to think about. I don't think it's that `dd` isn't good with Direct I/O, I just think its hard to use.


Direct IO is not finicky (on Linux, since 2.4 era kernels). It's just that at least with dd, if you don't know what it does then it probably does not do what you want (synchronous, single queue depth). If you know what you want and know what it does, then it's fine to use, and can be useful (I used it today to measure block device interrupt latency in a VM).


I've been taught at my uni to use dd...


> Usage of dd in this context is so pervasive that it’s being hailed as the magic gatekeeper of raw devices.

That's the thing; it isn't. Author forgot to explain (if he knows that at all) that /dev/sda2 on Linux is not a raw device. It's a block device.

So if dd is hailed as something to use on /dev/sda, that's not an example of being hailed for a raw device.

dd's capability to control the read/write size is needed for classic raw devices on Unix, which require transfers to follow certain sizes.

E.g. if a classic Unix tape needs 512-byte blocks but you do 1024-byte writes, you lose half the data; each write creates a block.

The raw/block terminology comes from Unix. You have a raw device and a block device representing the same device. The block device allows arbitrarily sized reads and writes, doing the re-blocking underneath. That overhead costs something, which you can avoid by using the raw device (and doing so correctly).
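A sketch of what that classic raw-tape discipline looked like (the device name is a placeholder; it varied between Unixes):

  # each write() becomes exactly one 512-byte tape block
  dd if=backup.tar of=/dev/rmt/0 bs=512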


At least this used to be the case. Nowadays FreeBSD doesn’t implement block devices at all - there are only raw disk devices.


Something that, IIRC, came from Linux and its allowance on at least some block devices to support "character"-style access.

Tapes are still annoying on both in their block-ness, iirc?


In a way, but in Linux block devices can still be accessed as block devices, while in FreeBSD (since around 1999) they can't - there's no caching at that level anymore; raw devices (which Linux got a few years before that) are the only kind of devices. (If you do "ls -al /dev", you'll still see block devices, but it's maintained only to pretend to userspace, just like major/minor numbers.)

Tapes aren't block devices at all - that's why you can't mount them :-)


Mounting and a device being a block device or not had zero relation in classic unix though :)

Tapes by nature are block devices, due to only being accessible in block increments, and writing them in a transparent way is a bit more problematic than with disks, as being able to just re-read a block to do a read-update-write cycle isn't guaranteed - or necessarily easy.

And yes, there used to be a time where you could mount tape as filesystem ;)


>cat /dev/cdrom > myfile.iso

Heh, I remember finding this out in the good old days when I was figuring out how to rip CDs, and shitting bricks. I mean, I read it on a BBS and thought "that can't be right"; I was expecting to find something like the Nero suite back on Windows.

Much more recently, I enjoyed the same kind of amazement with the bash TCP pseudo-devices.


I was much more confused by your post than I should've been because I overlooked that ">" is how HackerNews prefixes quoted text.


I think this has a bit of bad advice about using cp or shell redirection to read/write raw block devices, but dd isn't necessarily the best either. Personally, any time I'm trying to image some hard drive, damaged or otherwise, I'll just about always jump straight to ddrescue (not dd_rescue). It's similar to dd, surprise surprise, but it keeps a log of which parts of the input and output have been copied / had errors / been skipped. Nothing is more annoying than waiting an hour for some large copy to make progress and then running into an error or getting interrupted for whatever reason. Because ddrescue keeps a log of the status of the operation, you can resume it with the same command and it will pick back up right where it left off instead of having to start all over. It's also intelligent enough to skip over a bad region of the disk on error rather than failing outright, and to come back and reattempt it using various strategies once it's already copied the low-hanging fruit.

There's very little reason not to use it, even if it's just to get a nice progress view instead of just the current amount of data copied.
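A sketch of the typical GNU ddrescue invocation (device, image, and map file names are placeholders):

  ddrescue /dev/sdX disk.img rescue.map       # first pass: grab the easy data
  ddrescue -r3 /dev/sdX disk.img rescue.map   # rerun with the same map to retry bad areas up to 3 times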


Came here to mention 'ddrescue' as well.

It's been invaluable as an intuitive tool to recover data from failing disks/drives.

Never thought to use it as a day to day dd, but will give that a try. Thanks for the idea!


Also, ddrescue is in Ubuntu, as gddrescue.


Just a note -- despite the title, the article eventually presents a nuanced view and points out that "dd" has some uses.

> If an alias specifies -a, cp might try to create a new block device rather than a copy of the file data. If using gzip without redirection, it may try to be helpful and skip the file for not being regular. Neither of them will write out a reassuring status during or after a copy.

> dd, meanwhile, has one job*: copy data from one place to another. It doesn’t care about files, safeguards or user convenience. It will not try to second guess your intent, based on trailing slashes or types of files.

> However, when this is no longer a convenience, like when combining it with other tools that already read and write files, one should not feel guilty for leaving dd out entirely.


One thing dd does for me, that cp and cat do not, is that it forces me to read and check the command at least three times before pressing enter, which is a very good thing when messing with raw devices.

When I first learned that dd was not magical, I started using cp, but I made some mistakes with partition numbers and whatnot (nothing serious).

Maybe it's just the weird syntax or the fact that I treat dd differently; I'm just more cautious and don't press enter automatically. Of course it's silly, but to me at least, that's a good reason to keep using dd.


Back in the Dark Ages (1968) we had pre-written JCL scripts. I don't think many people knew JCL. We just appended a script to the front of our Fortran card decks. A DD card was the final card. After the frustrating work of finally getting the JCL right, the DD card was always just the "Do it, Damn it" card in my mind.


  > dd if=/dev/sda | gzip > image.gz
This actually serves at least two approximately legitimate purposes: firstly, it ensures that reads from sda are aligned to a (at least nominal, i.e. 512-byte) disk block, which doesn't matter for normal, kernel-supported drives like IDE/SATA/most USB (which is almost certainly what sda is), but avoids bespoke devices (or their drivers) trying to do something clever when gzip asks for 1 or 7 or 17 bytes. (And writes to poorly-designed devices/drivers can be even worse.)

More importantly, like useless use of cat, it prevents gzip from trying to delete sda when it's done, which is something it will in fact do:

  $ echo test > /tmp/sda
  $ gzip /tmp/sda
  $ cat /tmp/sda
  cat: /tmp/sda: No such file or directory
(For gzip specifically, you can also prevent this by writing `gzip </tmp/sda`, but I've occasionally run into tools that try to 'intelligently' handle stdin file descriptors that point at 'real' files, so I feel better having a separate process blocking the way.)


gzip doesn't do a stat(2) and detect that /dev/sda is a device node? looks at gzip source code -- sure looks like it will delete device nodes. And seems like the wrong thing for it to be doing.


The underlying question, when dealing with stupid DWIM logic, isn't "Does it do that?"; it's "Can I be sure that it does that, and will continue to do that, even on old versions on legacy systems that a unknown amount of random crap depends on to not fall over because I accidentally the hard drive?".

And there's also the fact that deleting the source file by default is always the wrong thing for it to be doing. If I want to deal with corner cases like being almost out of disk space, I can pass --delete-source explicitly.


It’s funny (given that the name likely is inspired by Useless Use of ‘cat’) that examples here have actual useless uses of ‘cat’. ‘cat file | pv > disk’ can just be ‘pv < file > disk’.

But whatever. I’m sure 90% of code stems from something someone read and copied or came most easily to them. It works.

‘dd’ was really useful for finicky media like tapes and doing EBCDIC translation. It’s still great when you combine bs and count. Blow away an MBR, make an xGB file, etc.

It’s a Swiss Army knife. It can do a lot, but isn’t the best tool for most things. I still love it. Probably just muscle memory.


Or just `pv file > disk` ... and then pv will give you a progress bar automatically


Oh, good point. When pv sees the actual file vs stdin, it can give % progress without hints. Which is nice.

And it's nicer than the recent 'dd status=progress' (if I remember that option right).


Original author here. The intended point was that `dd if=.. | something | dd of=..` is just as useless as the two cats in `cat .. | something | cat > ..`. People sometimes mock the latter while still believing that the former is necessary.


Sorry, that came off snarkier than I meant rereading it.

You’re 100% correct. I’ve grown more tolerant of these ceremonial uses. There really aren’t many folks that understand shell. I see it in CICD, etc all the time, too.

The one left that kills me that I see in vendor scripts all the time is ‘command; ret=$?; if [ “$ret” -ne 0 ]; then foo; fi’

If you’re handling return code cases, then cool. But folks don’t realize ‘test’ aka ‘[‘ is just another command.
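For instance, since the if statement already tests the command's exit status, the vendor pattern collapses to something like:

  command; ret=$?; if [ "$ret" -ne 0 ]; then foo; fi   # the roundabout version
  if ! command; then foo; fi                           # equivalent
  command || foo                                       # or simply this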

But then again, so much Java, etc. that I read is the same copy/paste. Those can be worse, because example code isn't prod-ready, whereas bad shell usually does the right thing, just awkwardly or inefficiently.

But as you point out, so much isn’t thinking about what you’re doing, just mimicking.


For those not familiar, there is a "useless use of cat" (https://porkmail.org/era/unix/award) which I have a feeling this title is referring to


In Linux, raw disk devices like /dev/sda1 are cached by kernel (unless opened with O_DIRECT flag).

In FreeBSD (and presumably other UNIX implementations) they aren't: https://docs.freebsd.org/en/books/arch-handbook/driverbasics...

So, in FreeBSD "dd if=/dev/ada0p1 of=/dev/null bs=1 count=1" will fail: disk driver will return EINVAL from read(2) because I/O size (1) is not divisible by physical sector size (usually 512). "cat" with buffer size X (which depends on the implementation) will either work or not depending on divisibility of X by physical sector size, and other random factors, like short file I/O caused by delivery of a signal.

Summary: dd(1) still has its place, and the author of the original article is getting it wrong.


My best useless use of dd was in the Ubuntu 16.04 days.

I was at a satellite office with all Windows PCs for the day, so I used a live disk to get a decent environment to get things done. The only problem was that the DVD drive kept spinning down, and every time I did something that was not cached it made me wait forever.

nohup + while loop + sleep 4s + raw dd read from CD for the win :)
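Something like this sketch, roughly (the device name and the use of iflag=direct to dodge the cache are assumptions on my part):

  nohup sh -c 'while :; do
    dd if=/dev/sr0 of=/dev/null bs=2048 count=1 iflag=direct 2>/dev/null
    sleep 4
  done' &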

EDIT: Reading this article it sounds like dd has no "special" ability to access the disk in a raw way. But surely that's what the nocache option is for...


I’ve only used dd once in my life (not much of a hacker!) but it was mostly useless.

Dropbox had some promotion for their new photo storage service, where they were giving away free extra space up to 10gb to encourage you to store your photos.

Some clever cookie told me you could make 10gb of ‘empty’ jpeg files and put them in your Dropbox folder.

The Dropbox app would compress this 10gb of files for upload (usual behaviour for the app in those days) and increase your storage permanently for free.

I can’t remember the command but it was something like

dd if=/dev/zero of=/path/to/dropbox/1.jpg bs=10000000

Bam! One 10GB jpeg file full of zeros that compresses down to a few bytes. Instant 10GB free. No uploading required!

I think I still have that dropbox account, I learnt about dd and dev/{null,zero,one} that day and never used them again.


So I should stop using dd as a text editor, is that what you're trying to tell me?


  yes 'y' | dd of=response bs=1 count=1
  yes 'e' | dd of=response bs=1 count=1 seek=1
  yes 's' | dd of=response bs=1 count=1 seek=2


> using dd as a text editor

Bravo, you've found something (marginally) more upsetting than using ed:)


You move like them. Are you the chosen one?


Occasionally I run dd in a loop with increasing block sizes to see what is actually fastest. I regularly see instructions on the web saying you should use "1M" or even "4M", but in my tests, smaller block sizes are often faster.

A few years ago, "128K" was usually the fastest choice. Today, on faster systems, "512K" has a slight edge.

I could not tell you why, though. Try it for yourself.
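
Something along these lines (a rough sketch; /dev/sdX is a placeholder, GNU dd assumed, and iflag=direct keeps the page cache from skewing the numbers):

    for bs in 64K 128K 256K 512K 1M 4M; do
        echo "== bs=$bs =="
        # read 1 GiB per pass; dd prints the throughput line on stderr
        dd if=/dev/sdX of=/dev/null bs="$bs" iflag=direct,count_bytes count=1G 2>&1 | tail -n 1
    done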


If you write to USB sticks / SD cards you should use (a multiple of) the erase block size in order not to wear out the flash. I don't know what current sizes are, but 1M sounds likely to be a reasonable multiple.

For USB also make sure to power it off before removing otherwise you might lose data. Eg. by

   udisksctl power-off -b /dev/sdX


Huh. I always run `sync` (twice, because Tradition™) before yanking a USB stick, and I've never noticed issues; any idea if powering off is better?


Sync just syncs but it doesn't prevent anything from starting to write while you go to pull it.

"udisksctl power-off" checks nothing is using the drive, commits buffers to storage, deconfigures the drive, and powers it off. Unfortunately it may kill more devices than you want in some scenarios due to the way killing the port works. Also for normal USB drives I don't think® poweroff is any different than an immediate physical pull post unmounting. The upside is it'll be completely disabled the instance the command completes, i.e. it can't even be written to raw as it is no longer powered even though plugged in. Not sure how helpful that is in reality though.

I usually just umount the drive which syncs and prevents further writes (well, to the mount at least) but doesn't try to kill it for me. Most likely sync is "good enough" for the vast majority of use though.


Wait, what about eject?


Eject is supposed to be like the udisksctl option but I've had it act wonky on me depending on the type of drive being ejected so I've avoided it myself. If it works though it shouldn't be any different.


I learned in the 1980s that it's

    sync; sync; sync
I feel more comfortable with just unmounting and powering off nowadays. But I have no hard evidence.

Hardly ever use more than one USB drive at the same time, so not sure whether there could be side effects with powering off one of them. Well, USB being USB, nothing would surprise me too much.


I understand his point, but using 'dd' allows you to set a buffer size which can make cloning a bunch faster. It also has great progress reporting (status=progress) which is really useful for the things dd is usually used for.

And even if you use it without it being needed, it's not a big deal. It doesn't add much overhead, if any.


By the way, on versions that don't support status=progress (like busybox and the BSD versions), you can periodically send a USR1 signal to get a progress update.
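
For example, with pgrep/pkill available (and assuming your dd was built with that signal handling):

    pkill -USR1 -x dd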


Not sure about other BSDs, but FreeBSD and MacOS already support status=progress. Also ^T is much more convenient than SIGUSR1.


I thought FreeBSD didn't, but it was actually Alpine (which uses busybox instead of GNU coreutils).

What I normally do is just run a 'watch -n 30 killall -USR1 dd' in another window, which triggers regular progress updates :) That's why I don't use ^T.

I also use FreeBSD, and indeed it supports it now.


On BSDs (and IIRC macOS) you want a SIGINFO not SIGUSR1, which can be sent by ^T.


On the left side of a typical Linux hobbyist experience graph you probably have "this is a disk image file, and this is a disk, I'll just copy the disk image file onto the disk!", then you have a period of "I'll use the cool dd tool (without oflag=direct and/or sync because by this point you know people use it but you don't know why people use it) like I saw on the internet!", then when you understand that everything is a file you have "this is a disk image file, and this is a disk, I'll just copy the disk image file onto the disk!" again.

I personally suggest recommending the Disks application included with Ubuntu Desktop and Fedora/CentOS Workstation to people. It shows icons representing internal disks, SD cards or flash drives, so they know what device they want to work with. If they want to take their time they can see all the information about the drives and partitions, and they can start discovering, asking questions and reading up on how computers use disks right from there. And if they don't want to, it's just an extra confirmation that it's the correct disk or DVD they want to put their image onto. Then, when they're sure about the device, they can create a disk image and restore a disk image in that same application!


gparted is also nice, although Disks seems to display more information by default.


dd lets you specify the write block size. This is essential when writing to 9 track tape. Everything is a byte stream… except for the things which are not.


I don’t think the author is intending to tell everyone to always use something else, I think the author is trying to tell most people who use dd that there are easier ways to do what they are trying to do, and that things are files, which is something that a lot of people here seem to be forgetting.


Which is like saying vim is useless because 90% of what you'll do is covered by nano.

Well, it pays off a lot to know a complex tool, and using it for easy stuff keeps you in the habit of using it, reminds you of the arguments, etc.

dd is really really useful.


yes, it is. and if you know you need it, you know you need it. and if you don't know that, you don't need it.


I had almost managed to completely forget about 9 track tapes. Now you've made my head hurt.


I use it for the O_DIRECT flag (oflag=direct). Better progress report, and I know when to pull the stick.

Edit: Looks like O_SYNC (oflag=sync) is needed, too. Should update my 'sddd' alias.


Is it? I thought O_DIRECT implied O_SYNC, in practical terms?


So did I, but my man page (open(2)) cast doubts:

"The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred."

While I don't think this is relevant to block devices, I see no harm in including the flag in my alias, either.
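
For the record, the alias ends up being something like this (a sketch; GNU dd assumed, and the block size is just whatever suits the stick):

    alias sddd='dd bs=4M oflag=direct,sync status=progress'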


Not exactly dd, but in my NOC tech days, ddrescue was _indispensable_ for cloning drives with pending sectors/errors.


> Want to simulate a lseek+execve? Use dd!

I can't find anything about it in the man page - does anyone know the options? I'm assuming it doesn't mean simply running dd with seek to output another file and then running that.


Original author here. An example would be `{ dd bs=1 count=0 skip=1234; myprogram; } < file`. This is similar to C `lseek(0, 1234, SEEK_CUR); execve("myprogram", ...);`

It's not something you'd realistically use, but it's better to be limited by imagination than tooling.


Ah, I get it! What I initially thought you meant was creating a process from an ELF embedded further into the file. For example lseek-ing into a tar file and somehow execve-ing an included binary. Now I wonder if Linux would allow me to run /dev/stdin...


I used to use dd to convert EBCDIC to ASCII when reading tapes from a 1/2" reel-to-reel tape drive. The "convert" capability of dd is what differentiates it from utilities such as cat.
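
It looked something along these lines (the tape device name is only illustrative, and the input block size has to match how the tape was written):

    # conv=ascii translates EBCDIC input to ASCII on the fly
    dd if=/dev/rmt0 of=records.txt conv=ascii ibs=800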



dd is one of THE most common ways people accidentally wipe disks. All it takes is typing sda when you meant sdb.

The by-label trick helps a bit, but I still don't like it. Etcher exists. Etcher will give an alert when it's done flashing. Etcher will verify what it wrote. Etcher will predict how much time is remaining.

The raspi imaging utility goes even farther and gives you configuration options. There are many other special purpose flasher utilities like that.

Where dd really shines is in a script, for making empty images of a certain size. But even then... there are tools like truncate to make a sparse file instead.
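
E.g. either of these creates a sparse 1 GiB image file (just a sketch; the filename is arbitrary):

    truncate -s 1G disk.img
    dd if=/dev/zero of=disk.img bs=1 count=0 seek=1G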

The only other time I ever need the CLI is for directly creating a compressed image, but that's a somewhat uncommon task for me. And I would not be surprised if one of the GUIs had it by now.


Downloading a giant electron application which bundles Chrome and Xbox joystick drivers just to write to a flash drive feels downright criminal when there's a 200kb program that will do the same job already on your computer.


I personally use dd for such cases as well, but I don't mind downloading a "giant electron application which bundles Chrome and Xbox joystick divers" (I just downloaded Etcher and it takes 40MB on my machine) if that's what it takes to lower the risk of accidentally wiping my hard drive when I want to write an ISO to a USB drive.


Electron makes the problem of flashing totally solved. I have a 1TB SSD for exactly this reason, so I don't have to worry about what's light and what isn't.


I mostly use `gzip -dc image.gz | dd of=/dev/sda bs=1M iflag=fullblock oflag=direct status=progress` to write a compressed disk image to a slow USB stick with progress bar and no disk caching (iflag=fullblock so piped input doesn't produce short, unaligned direct writes). This avoids the lengthy wait at the end of a typical `cp` for the sync before unplug.


> in the same way we ended up with a Window system called X

There actually was a Window system named “W”; it was the window system for the “V” operating system (IIRC). “X”, therefore, was the successor to “W”.


I always assumed that dd was preferred when writing disk images to flash media to ensure each hardware block is written to exactly once, extending the life of the device and increasing performance substantially. I never checked whether that's actually a problem with general-purpose file IO; nothing stops the kernel from noticing when a pipe is connected to a block device and configuring an appropriate buffer.

status=progress is also quite handy. You can splice a "| dd status=progress |" in the middle of a pipeline to get a sense of what's happening.
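
E.g. something like this (paths are placeholders):

    tar cf - /some/dir | dd bs=1M status=progress | gzip > dir.tar.gz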


Fun fact: dd outputs its progress when it receives a USR1 signal via kill too.


And on BSD systems you can just press ctrl-t.


use of cat needs to be discussed too:

  cat < file | somecmd | cat > file2
  #vs  
  somecmd < file > file2


That particular example is overly silly, but I often write

   cat file | grep ...
   or
   cat file | awk '{blah}'
for ad-hoc stuff, even though I am fully aware that I could easily avoid the cat, simply because having the regex or awk code at the very end makes it easier to read, edit or extend with additional pipeline elements.


FWIW redirection is not positional:

    ~$ <frongle.txt grep fringle >frungle.txt


We all do. Well, probably not Rob Pike.


The difference with cat is that at least it looks nicer in some cases. For example, if I want to put the input file at the start of the pipeline, the cat-less version looks weird and confusing.

    <file1 somecmd >file2
I wish that shells would just optimize the useless cat behind the scenes, so I could use it without soliciting complaints :)


it's a short walk from shells that optimize "known commands" to signed apps and an ecosystem you can't contribute to unless you've paid the appropriate gatekeeper


Shells already implement many commands as shell builtins. For example, echo, true, and so on.


I prefer the first way, as you can read it left to right and each step is meaningful. "Take the contents of file, process it through somecmd, then save the result to file2."

I would hope both commands boil down to exactly the same action. Any overhead from invoking cat (isn't that a shell built-in?) should be negligible.


The issue is that the author makes a lot of use of pipes, i.e. "|", which have a relatively small buffer and immediately make the task pretty much CPU bound (because of context switches etc.). This is why people use "dd" - it does not need pipes or the associated context switches to stream data in large chunks from a device to a file or vice versa.


https://web.archive.org/web/20220523020952/https://www.vidar...

Archive link of this; at least for me, the original site isn't responding.


Ironically, the post contains a useless use of `cat`:

    cat /dev/sda | pv | cat > /dev/sdb
This could be replaced by just:

    pv < /dev/sda > /dev/sdb
Or in this case even:

    pv /dev/sda > /dev/sdb


If you want to write the input file at the beginning and write the command later, then you may use:

  </dev/sda pv >/dev/sdb


I see some useless use of pv there. dd status=progress is neat.


That's a relatively new feature, btw. The likes of RHEL 6 did not have it, for example, afaicr.


Two years ago, I had to clone a disk image from one UEFI notebook over to another identical one. Using gzip with output redirection to the device file (might also have been zcat or cat with gzip) just did not work. After days of unsuccessful attempts, I fell back to dd with gzip. And believe it or not: only with dd was the disk clone successful.

I got as far as comparing the image files, which were identical, but I did not have the time to compare the disks. So this is still a mystery to me.

On a side note: If you ever need to do a full disk clone nowadays, I can only recommend clonezilla. Using FAI (Fully Automatic Installation) would be even better instead of doing image clones, but sometimes, that's not an option.


Good time to mention "sdd" which can handle faulty disks.

It can be set to skip over bad blocks to copy as many healthy blocks as possible, and then go back and retry the bad blocks over and over until they either read successfully or you stop it.


It has a more advanced brother: GNU ddrescue, ddrescue for short. It can nudge and prod disks with severe bad sectors into giving up the secrets they hold.

I used it once to recover a 1TB NTFS drive which couldn't be mounted by Windows. It only failed to recover 12kB out of 1TB, which corrupted an MP3 file that could easily be replaced.
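
The invocation was roughly this (device and file names from memory):

    # first pass: grab everything readable, keeping a map of the bad areas
    ddrescue -d /dev/sdX rescue.img rescue.map
    # second pass: retry the bad areas a few more times
    ddrescue -d -r3 /dev/sdX rescue.img rescue.map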


Ahh nice, it's been a while since I've had to recover data from a disk, will keep ddrescue in mind.


There was a time when "dd" was the only game in town for deep copies on Unix systems. E.g. HPUX in the 1980s.

It always had issues of course: e.g. if the source was a 100MB drive and you dd'ed to a blank 200MB one, you'd get a new 100MB drive and the rest of the space was simply unused, because it was a low-level copy. Similarly, any bad sectors on the old drive became bad sectors on the new, and any bad sectors on the new were simply used as if they were OK.


I don't think so; this command could be significantly slower:

    cp myfile.iso /dev/sdb
compared with this one:

    dd if=myfile.iso of=/dev/sdb bs=32M
because implementations of cp have a fixed buffer, so if the amount of data is big and the disks are fast, cp ends up making more read() and write() syscalls than necessary, slowing down the copy process.


> dd if=/dev/sda | gzip > image.gz

Last time I researched this, a long time ago: `lz4` is (was?) your best friend.
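
Something like this, if I remember right (device and file names are placeholders):

    dd if=/dev/sda bs=1M | lz4 -c > image.lz4
    lz4 -dc image.lz4 | dd of=/dev/sda bs=1M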

Oh, but - to follow the submitted article's "rant" - if one wants to use `cat` instead of `dd`, let there be freedom... Sometimes people like some kind of form-aided "readability and clarity and method", which leads you to forms like `cat filename | grep`.


That's interesting about the blocksize. I was never impacted by that because I mostly use dd for writing large files, and a large block size always made more sense to me. For example, if I wanted to test writing a 1 GB file to a flash drive, I'd do something like:

dd if=/dev/zero of=/mnt/test.data bs=1M count=1024

Overall, a very informative article though!


I didn't think 'cat' itself could force a sync to disk?

When writing to a USB SD card reader, sometimes the data wouldn't be written entirely without calling 'sync', but dd can perform the sync itself with 'conv=fsync'.


Okay, so what's the best way to image an SD card then? I use the following on a 64gb SD card -- is there a better way?

    dd if=/dev/sda | gzip > img.gz
    dd if=img.gz  | gunzip | dd of=/dev/sda


Do these not work?:

    gzip < /dev/sda > img.gz
    gunzip < img.gz > /dev/sda
Those should give you better block sizes than default dd without bs=128k or whatever.


For your first command:

    gzip -c /dev/sda > img.gz
For your second command:

    gunzip -c img.gz > /dev/sda
are what I suspect the blog post would recommend.


I would install pv (optional) and do

  cat /dev/mmcblk0 | pv -abtrN Processed | gzip > img.gz
  cat img.gz | gunzip | pv -abtrN Processed > /dev/mmcblk0


this is not the point of the article, but you may want to specify blocksize in those commands? ymmv tho


lovingly referred to as "disk destroyer"


Still my fav way to create a Linux boot USB and a poor man's disk-benchmark tool :)


Useless use of dd: dd if=/path/to/file of=/dev/stdout


hmmm ... i have to admit, i really don't get this article, imho. it's not very well written:

why use these "oldschool" if / of parameters?

how it's meant to be used:

    dd < /dev/zero > /dev/sdx

like others already mentioned: parameter "bs" ~ block-size

    dd < /dev/zero bs=1M > /dev/sdx

or parameter "count" ~ number of blocks of specified size

    dd < /dev/zero bs=1M count=100 > /dev/sdx

show progress ~ utility "pv"

    dd < /dev/zero bs=1M | pv > /dev/sdx

etc.etc.

why do people write such articles w/o at least consulting the manpages!?

and why is such a mediocre article on the frontpage of HN!?

just my 0.02€


'dd' over ssh:

  mysqldump -u mysql db | ssh user@rsync.net "dd of=db_dump"

  pg_dump -U postgres db | ssh user@rsync.net "dd of=db_dump"
(not useless, though ...)


Why not use `tee`?

I regularly do `cat ~/.ssh/id_rsa.pub | ssh foo@bar tee -a ~/.ssh/authorized_keys` without repercussion. For larger files I usually use rsync as you can resume interrupted transfers


why not ssh-copy-id?


You can use cat for that:

  mysqldump -u mysql db | ssh user@rsync.net "cat > db_dump"


Is there a tool other than dd that would allow me to skip n bytes from a file and then read m bytes? I mean one tool, not a combination of head and tail, which is too elaborate for my taste.
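
For reference, what I do today with dd is something like this (GNU dd; N and M stand for the byte offset and length):

    dd if=file bs=1M iflag=skip_bytes,count_bytes skip=N count=M status=none

I'm hoping for something less verbose than that.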


i'm not 100% on this, but i seem to have foggy memories of seeing it used in bizarre ways like this by scripts written to be ultraportable (output of early versions of autotools maybe?).


> The fact of the matter is, dd is not a disk writing tool. Neither “d” is for “disk”, “drive” or “device”.

Where does dd's name come from then? Its man page does not tell.


See footnote:

> dd actually has two jobs: Convert and Copy. A post on comp.unix.misc (incorrectly) claimed that the intended name “cc” was taken by the C compiler, so the letters were shifted in the same way we ended up with a Window system called X. A more likely explanation is given in that thread as pointed out by Paweł and Bruce in the comments: the name, syntax and purpose is almost identical to the JCL “Dataset Definition” command found in 1960s IBM mainframes.


rw is a nice dd like tool with a unix syntax https://sortix.org/rw/


dd found fame about 17 years ago, when supposedly low-level GUI copying applications would consistently fail in some cases for unknown reasons. dd worked where fancier copy applications were hit or miss. So while other methods will work, dd consistently also works. UNIX was intentionally designed with redundancy. And I'm pretty sure the U stands for 'useless.' (yeah, jk)


Indeed, most users are better off launching GNOME Disks, going to the kebab menu, choosing Restore Disk Image & selecting the image and target disk interactively. Or running:

    $ gnome-disks --restore-disk-image=/etc/motd 
Behind the scenes it calls udisks, which uses polkit to ensure that the user is authorized to write to the target disk.

Overall, the chances of a user typoing and wiping out the wrong disk are greatly reduced. They also get to see progress etc.


I first used it to dual boot windows (98 or NT4 I don't remember) and slackware.

I can't remember specifics, but I dd'ed the first 512B to a file and pointed the Windows OS loader to that file. The specifics I don't remember are why I wanted to use the Windows loader instead of LILO.


dd is pretty old software. I just meant it entered the mainstream consciousness of self-techs (non-pros) and pirates simultaneously ~2004-2007.


Did anyone else expect something along the lines of

    dd if=/dev/urandom of=/dev/null

?


dd indeed has advantages: skip, offset, being able to ignore errors to name a few.


dd if=/dev/zero of=/proc/$(pidof nginx)/fd/0


cat /dev/zero > /proc/$(pidof nginx)/fd/0


BuT YoU ArE UsInG cat AnD dd wRoNg!

So?


The article might educate, but that doesn't appear to be its primary purpose.

The problem with this article and the useless-cat article is that 1) they create a false right vs wrong dichotomy and 2) they get amplified and used as a tool by the small-minded as a citation for "the correct way".

A post on learning your tools is fine. But calling something useless and needlessly adding branches to one's decision tree doesn't empower anyone except jerks.

Stay useless.

Were any of the cited examples harming anything? No they weren't.


you know you’re a web dev if `dd` makes you think of "definition definition", a child of `dl` and sibling to `dt` :)

