Cloud Run quietly swaps HOME env var in Docker (chanind.github.io)
73 points by chanind on Sept 27, 2021 | 31 comments



Cloud Run PM here. Sorry for the inconvenience. This is an issue for which we have a fix, but the fix still needs to roll out (internal bug 154368323 for googlers reading, we're still evaluating the best rollout strategy). Until then, I'll add it to https://cloud.google.com/run/docs/issues.


I only say this half in jest: Google, please give this person a raise. I'm not sure I've ever seen a Google PM be this up-front and straightforward about a future roadmap item, even on HN, where direct communication from PMs or engineers at other companies is more common.

While you're at it, if there is a way you can get the Google Identity Platform PM to let us know when some form of multi-factor auth besides SMS will be supported, that would be great!


I feel that employees of big corporations sometimes can't win.

If they respond here, it's as you mentioned: they're clearly monitoring media for mentions.

If they don't, people (probably) can't get their issues fixed as fast.

But responding here creates the impression that issues raised here get higher priority, which is sometimes "ridiculed" (I don't know a better term for it).

Either way, thanks to the PM for an update about this.


I guess the main problem is the second part of the subOP comment:

>While you're at it, if there is a way you can get the Google Identity Platform PM to let us know when some form of multi-factor auth besides SMS will be supported, that would be great!

When they respond, they get inundated with random requests and complaints that they have exactly zero control over.


You're absolutely correct.


Out of interest, what's the trickiness in the rollout? Breaking previously deployed containers that depend on this behavior?


Cool beans. Thanks for the response.


For those not familiar with Cloud Run, it's a Google Cloud service that lets you run Docker containers without managing the servers they run on.

https://cloud.google.com/run


The solution here is to use `$XDG_CACHE_HOME` instead of hardcoding `$HOME/.cache`, which may not be where a user wants their cache to live: https://specifications.freedesktop.org/basedir-spec/basedir-...


$XDG_CACHE_HOME is usually not set; it's an override, and software should default to $HOME/.cache.

As per your link:

> If $XDG_CACHE_HOME is either not set or empty, a default equal to $HOME/.cache should be used.

So even if Huggingface is aware of that variable (and it probably is), that won't help at all.
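
In shell terms, the fallback the spec asks for is a one-liner, sketched here:

  # Use $XDG_CACHE_HOME if set and non-empty, otherwise fall back to $HOME/.cache
  cache_dir="${XDG_CACHE_HOME:-$HOME/.cache}"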


It does appear to[1], and it does help, because you can set XDG_CACHE_HOME in your Dockerfile so it'll be the same across all runs. I think that's what the GP was getting at.
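
For example, a single hypothetical line in the Dockerfile would pin it (the path is arbitrary):

  # Same cache path at build time and runtime, regardless of what $HOME becomes
  ENV XDG_CACHE_HOME=/app/.cache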

[1] https://huggingface.co/transformers/v4.3.3/installation.html...


This has always seemed sort of passive-aggressive on the part of the spec. "No, memorizing our various custom env var names is not sufficient. Developers must also implement our custom logic, and immediately update whenever the logic changes!" That seems a bit much to ask. Why not just trust the user, who will set the env vars if she really wants to use this scheme and who will do something else if she wants something else?


> will do something else if she wants something else

How should that be communicated by the user? A prompt when opening every application asking where you'd like the data, config, state, and cache dirs to live? And if so, how would the application figure out where that config is stored?

These env vars are explicitly only (with the exception of XDG_RUNTIME_DIR) for when the user cares to override them.


Somehow lots of software functioned before the advent of XDG. Lots of software that knows nothing about XDG still functions.

> These env vars are explicitly only (with the exception of XDG_RUNTIME_DIR) for when the user cares to override them.

Yeah, sure, that's what the spec says, and that's what is passive-aggressive. There is a complicated system and it is opt-out, which is considered abusive whenever anyone else does it.

I actually like XDG, and I use it even when I have to specifically tell packages to do so. I have contributed patches to unrelated software for XDG support. However, I don't agree with the common online strategy of shitting on software that doesn't follow a spec that is deliberately more complicated than it needs to be.


How is an optional override "opt-out"? Who is "shitting on software"? This rant is inscrutable.


Is that variable set to a consistent value at both build & run in Cloud Run?


As an alternative, don't use root for Dockerfiles. Ever. It only takes a few lines to create a user and group and switch to that user, and doing so closes a whole class of security issues.
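
For the record, a minimal sketch of those lines (base image, names, and IDs are arbitrary):

  FROM debian:10-slim
  # Create an unprivileged group and user
  RUN groupadd --gid 1000 app && \
      useradd --uid 1000 --gid app --create-home app
  # Everything from here on runs as the unprivileged user
  USER app
  WORKDIR /home/app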


Closes security loopholes, increases operational burden (if only slightly). A tale as old as time.


Agree. It only takes a few lines. But how long does it take to learn what those lines are?


Can confirm, using a non-root user also solves the issue (which I should have been doing anyway). I guess Cloud Run assumes $HOME is set to /home/something, and since root uses /root it gets confused?


Thanks for confirming! I was starting to doubt myself a bit after some of the other responses.


Are you sure Cloud Run only overrides $HOME when the container runs as root? Because if it always overrides $HOME, then what user the container runs as is irrelevant (for this particular issue).


The problem with root vs. all other users is that root's $HOME is literally /root/, while other users have their $HOME directory in /home/username/.

~ is a BASH tilde expansion for $HOME. So in this case Cloud Run would have been looking at /home/root/.cache/ which doesn’t exist (/root/.cache/ is what was being built), whereas another user would have /home/username/.cache/ and run as expected.

PS. I was initially going to call ~ an alias, but I checked myself and found it's actually considered Bash tilde expansion. While ~ alone operates like an alias for $HOME, I learned there are all kinds of other uses for variations of ~, which hopefully someone will find useful:

http://www.gnu.org/software/bash/manual/html_node/Tilde-Expa...
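
A few of those variations, for the curious (interactive Bash; comments show what each expands to):

  $ echo ~        # the current user's $HOME
  $ echo ~root    # root's home directory, regardless of $HOME
  $ echo ~+       # the value of $PWD
  $ echo ~-       # the value of $OLDPWD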


>So in this case Cloud Run would have been looking at /home/root/.cache/

Firstly, the article says that Cloud Run is setting $HOME to /home, which means Huggingface would've been looking at /home/.cache, not /home/root/.cache. $HOME is not a base path to which the username is appended to get the homedir. It's the homedir itself.

I also assume this isn't the article author using a shorthand (i.e. writing "/home" when they mean "/home/someuser"), because the SO post they link to also says the same thing. So, as busted as it is, it does sound like Cloud Run is setting HOME to /home.

Secondly, and more importantly, my point is that if Cloud Run is setting HOME=/root at container build time and HOME=/home at container runtime, then any path rooted to $HOME is going to be different at build time vs runtime, regardless of what user the process in the container is running as.

    $ docker run -it --rm -e HOME=/foo debian:10-slim sh -c 'echo $HOME/.cache'
    /foo/.cache

    $ docker run -it --rm -e HOME=/bar --user 1000 debian:10-slim sh -c 'echo $HOME/.cache'
    /bar/.cache
So as good as it is to not run containerized processes as root, I don't think it makes any difference to this particular issue.


True! That would also have been a smart move.


Why some people do it: there are problems with bind mounts when using an unprivileged user inside a container. Not only that, but many container images either don't have instructions for running with custom users, make bad assumptions about which users will be used (chmod/chown), or emit cryptic error messages when you try to run them with custom users.

In short, it's a pain for the average person just trying to get something running.

For example, you want to mount /docker/my-app/mysql-container/var/lib/mysql into /var/lib/mysql of the container. Maybe you're doing this so you can view the files on the host system and not use the Docker Volumes abstraction, maybe you want to do backups on a per directory basis and Docker doesn't let you move separate volumes into separate directories, or any other reason. Some orchestrators like Docker Swarm won't create the local directory for you, so you just end up doing that yourself, with UID:GID of 1004:1004 (or whatever your local server user has).
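
To make that concrete, the host-side setup for that example might look like this (paths and IDs as above):

  > mkdir -p /docker/my-app/mysql-container/var/lib/mysql
  > chown -R 1004:1004 /docker/my-app/mysql-container/var/lib/mysql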

Now, if you run the container with the default settings, which indeed use the root user, there is a good chance of it working (illustrated here with Docker):

  > docker run --rm -e MYSQL_ROOT_PASSWORD="something-hopefully-long" -v "/docker/my-app/mysql-container/var/lib/mysql:/var/lib/mysql" mysql:5.7
  ... lots of output here
  [Note] mysqld: ready for connections.
Because by default, even MySQL uses root inside of the container:

  > docker run --rm mysql:5.7 id
  uid=0(root) gid=0(root) groups=0(root)
When you change it to another user without knowing which one you need, which is pretty common, it breaks:

  > docker run --rm -u 1010:1010 -e MYSQL_ROOT_PASSWORD="something-hopefully-long" -v "/docker/my-app/mysql-container/var/lib/mysql:/var/lib/mysql" mysql:5.7
  [ERROR] InnoDB: The innodb_system data file 'ibdata1' must be writable
  [ERROR] InnoDB: The innodb_system data file 'ibdata1' must be writable
  [ERROR] InnoDB: Plugin initialization aborted with error Generic error
  [ERROR] Plugin 'InnoDB' init function returned error.
  [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
  [ERROR] Failed to initialize builtin plugins.
  [ERROR] Aborting
As you can see, the error messages don't come from Docker or any other system that knows what's actually happening, but from the piece of software itself, in this case, MySQL. Not only that, but it doesn't give you actionable steps on its own and "Generic error" will only serve to confuse many of the newer users, who'll fail to understand what "The innodb_system data file 'ibdata1' must be writable" means.

How many users do you think will understand what the error message means and will know how to solve it? How many container images out there won't give user friendly error messages and instead will just crash?

How many users do you think will notice the "Running as an arbitrary user" under the instructions for that particular container image [0]? How many container images will even support running as arbitrary users?

How many users do you think will be able to find the documentation for the parameters that they need for either Docker Compose [1] or specifying a user in Docker [2]?

To avoid that problem, you need to:

  - know about users/groups and permissions management in GNU/Linux
  - know what your user/group is if you're creating anything like bind mounts
  - know how to set the user/group that's going to run inside of the container
  - know whether the container image you use will handle them properly (and also how it's initialized, e.g. whether additional config is needed, since some badly containerized pieces of software will start additional processes with UID/GIDs set in some configuration file somewhere in the container)
And with all of that, how many users do you think will decide that none of the above is worth the hassle and just run their containers as root, regardless of any other potential risks in the future, as opposed to having problems now? If something is hard to do, it simply won't be done in practice, especially if things work without doing it.
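
For completeness, here's a sketch of the "correct" invocation of the earlier MySQL example once you know all of that (the -u value has to match the owner of the bind-mounted directory, and the image has to support arbitrary users [0]):

  > chown -R 1010:1010 /docker/my-app/mysql-container/var/lib/mysql
  > docker run --rm -u 1010:1010 -e MYSQL_ROOT_PASSWORD="something-hopefully-long" -v "/docker/my-app/mysql-container/var/lib/mysql:/var/lib/mysql" mysql:5.7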

Though I'm not sure what can be done to improve the DX here, without making Docker aware of the attempts to access files inside of the container. Plus, managing users and groups has far too many approaches to begin with [3].

Links:

  [0] https://hub.docker.com/_/mysql
  [1] https://docs.docker.com/compose/compose-file/compose-file-v3/
  [2] https://docs.docker.com/engine/reference/run/#user
  [3] https://blog.giovannidemizio.eu/2021/05/24/how-to-set-user-and-group-in-docker-compose/


All fair points! Docker could certainly make it easier, and it sounds like Cloud Run could do better to accommodate this type of case. AFAIK these types of issues have been the main motivation for the creation of alternative container runtimes such as Podman.


That sucks.

My only advice is, when using this or Cloud Functions, always start by creating a function/image that prints all environment variables.

For Cloud Functions these change tremendously between Python versions. Logging also changes completely between some Python versions, to the point where upping the runtime causes logs to stop being saved to Cloud Logging.


But if you have secrets in your environment variables, make sure to filter them out.
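
A minimal sketch of that kind of filter, assuming secrets follow the usual naming conventions (the pattern list is a guess; extend as needed):

  # Print the environment, redacting anything that looks like a secret
  $ env | sort | sed -E 's/^([^=]*(KEY|TOKEN|SECRET|PASSWORD)[^=]*)=.*/\1=<redacted>/'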


Does Huggingface not have a way to override the cache directory? I always specify directories like this manually when using Docker, specifically to avoid issues like the one in TFA.
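
For example (per the transformers docs; the path is arbitrary):

  # Pin the Huggingface cache somewhere explicit, independent of $HOME
  ENV TRANSFORMERS_CACHE=/app/.cache/transformers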




