Git for Windows accidentally creates NTFS alternate data streams
Jul 20, 2016
As part of the small minority of devs at my company who primarily run Windows, I’m accustomed to working around occasional Unix-specific behaviors in our build and deployment systems. Cygwin makes most stuff just work, I can fix simple incompatibilities myself, and as a last resort I can always boot into OS X for a while if needed.
One oddity that took me quite some time to diagnose, though, was Git’s strange behavior when dealing with files in our repo whose names contained a colon.
What happens when you sync a file with a colon in the filename?
Besides the initial drive prefix (e.g. C:\), Windows does not permit the colon character in file or directory names. Unix has no such restriction. So what happens if a Git repo of Unix origin contains a file with a colon in the name, and that repo is cloned on a Windows machine?
I’ve created a sample repo that contains a single file foo:bar with the content hello. Cloning the repo with a default installation of Git for Windows produces no errors or warnings:
C:\src
> git clone https://github.com/latkin/filetest.git
Cloning into 'filetest'...
remote: Counting objects: 3, done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 3
Unpacking objects: 100% (3/3), done.
Checking connectivity... done.
Instead of a file named foo:bar, though, you get a file named foo, with nothing in it:
C:\src
> cd .\filetest\
C:\src\filetest
> dir -force
Directory: C:\src\filetest
Mode LastWriteTime Length Name
---- ------------- ------ ----
d--h-- 7/17/2016 5:53 PM .git
-a---- 7/17/2016 5:47 PM 0 foo
That’s kind of strange on its own, but even more peculiar is that Git has a different opinion of what things look like:
C:\src\filetest
> git status
On branch master
Your branch is up-to-date with 'origin/master'.
Untracked files:
(use "git add <file>..." to include in what will be committed)
foo
nothing added to commit but untracked files present (use "git add" to track)
Git notices the untracked file foo, but seems to think foo:bar is present and contains the expected content. How strange…
Confusing matters further, when you enable the Git config option core.fscache (which is enabled by default in version 2.8.2 and later), the working set suddenly changes, and foo:bar is now reported as missing:
C:\src\filetest
> git config core.fscache true
C:\src\filetest
> git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
(use "git add/rm <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
deleted: foo:bar
Untracked files:
(use "git add <file>..." to include in what will be committed)
foo
no changes added to commit (use "git add" and/or "git commit -a")
That’s what we originally would have expected, given that only the foo file was created when we cloned the repository. But why does it differ from the result with core.fscache = false? And why was this empty file foo created in the first place?
Alternate data streams
The root cause of all this is a relatively obscure NTFS feature called alternate data streams.
Briefly, files in NTFS are not simple buckets of data, but rather a collection of one or more data streams. What we normally think of as a file’s contents is really the contents of the primary, unnamed stream. One can also create and add data to alternate, named streams. These streams are directly addressable by appending :streamname to the normal file path. For example, the stream MyStream in file qwerty.txt can be accessed via the path qwerty.txt:MyStream.
So although foo:bar is not a legal Windows file name, Windows file APIs are nonetheless happy to accept it for read and write operations because it is indeed legal as a path to something in the filesystem, namely the bar alternate stream of the file foo.
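To make the path syntax concrete, here is a minimal Python sketch of how a path like foo:bar can be read as "file foo, stream bar". The function name split_stream is made up for this illustration, and the logic is a simplification, not how Windows itself parses paths.

```python
def split_stream(path):
    """Split a Windows-style path into (file_path, stream_name).

    stream_name is None when the final component names no alternate stream.
    Hypothetical helper for illustration only.
    """
    # Only the final path component can carry a stream suffix.
    cut = max(path.rfind("\\"), path.rfind("/")) + 1
    head, leaf = path[:cut], path[cut:]
    # A leading drive prefix such as "C:" is not a stream separator.
    if not head and len(leaf) >= 2 and leaf[0].isalpha() and leaf[1] == ":":
        head, leaf = leaf[:2], leaf[2:]
    # The first remaining colon separates the file name from the stream name.
    name, sep, stream = leaf.partition(":")
    return head + name, (stream if sep else None)


print(split_stream("qwerty.txt:MyStream"))
print(split_stream("C:\\src\\filetest\\foo:bar"))
print(split_stream("C:\\src\\filetest\\foo"))
```

Read this way, the colon in foo:bar is not part of a file name at all; it is the separator between a file and one of its streams.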
What Git does
Once you are aware of alternate data streams, Git’s behavior starts to make sense.
When cloning, Git naively blasts content into the path foo:bar. That is a 100% legal path, so no errors are raised by the OS. The result is a file foo with no content in the primary data stream (hence reported as length 0), but 6 bytes in an alternate stream bar:
C:\src\filetest
> Get-Item .\foo -Stream * | ft Stream,Length
Stream Length
------ ------
:$DATA 0
bar 6
C:\src\filetest
> cat .\foo:bar
hello
When checking the status of the working set, Git uses different algorithms depending on whether core.fscache is enabled.
When core.fscache is false, file metadata checks are done one at a time, ultimately invoking GetFileAttributesEx for each path. Git has no clue it’s even dealing with an alternate stream, because these file APIs behave exactly the same as they would with a normal file path. Does foo:bar exist? Yep! Does the last modified time on foo:bar match what Git expects? Yep! Is the content of foo:bar what Git expects? Yep! Well alright, that file must be unchanged.
When core.fscache is true, Git pre-caches file metadata per directory, then reads it from the cache instead of invoking file APIs directly. This leads to a different view of the world: when enumerating files in the containing directory, Windows only mentions foo, since that’s the only file present. Thus the cache, when asked for the metadata of foo:bar, believes this file does not exist.
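The difference between the two lookups can be mimicked with a toy model. The Python sketch below is not Git's code and does not touch a real filesystem; ToyVolume, status_uncached, and status_cached are hypothetical names invented here. It assumes only what is described above: per-path queries resolve stream syntax, while directory enumeration lists files and never their streams.

```python
class ToyVolume:
    """Toy stand-in for one NTFS directory: file name -> {stream name: bytes}."""

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        # Like the Windows file APIs, treat "name:stream" as a stream of "name".
        # Creating a named stream also creates the file with an empty main stream.
        name, _, stream = path.partition(":")
        self.files.setdefault(name, {"": b""})[stream] = data

    def exists(self, path):
        # Per-path lookup: resolves stream syntax, so "foo:bar" is found.
        name, _, stream = path.partition(":")
        return stream in self.files.get(name, {})

    def listdir(self):
        # Directory enumeration reports files only, never their streams.
        return sorted(self.files)


def status_uncached(volume, tracked):
    """Ask the volume about each tracked path, one at a time."""
    return {p: volume.exists(p) for p in tracked}


def status_cached(volume, tracked):
    """Enumerate the directory once, then answer from that listing."""
    listing = set(volume.listdir())
    return {p: p in listing for p in tracked}


vol = ToyVolume()
vol.write("foo:bar", b"hello\n")  # what the clone effectively did

print(vol.listdir())
print(status_uncached(vol, ["foo:bar"]))
print(status_cached(vol, ["foo:bar"]))
```

In this model the listing contains only foo, the per-path check reports foo:bar as present, and the listing-based check reports it as missing, which matches the two git status results shown earlier.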
Conclusion
In my opinion, this is all rather silly and should never have been allowed to happen in the first place. Git should simply detect the bogus filename, issue an error, and never even attempt to write the file to disk. This is how other valid-in-Unix-but-invalid-in-Windows filenames are handled already (e.g. a file named \Windows\System32\crypt32.dll will be blocked). Such files would then (correctly) be reported as missing from the working set, regardless of core.fscache setting.
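The kind of check I have in mind can be sketched in a few lines of Python. This is only an illustration of the idea, not the fix in my PR and not Git's actual logic; the names RESERVED, is_valid_windows_name, and rejected_paths are hypothetical, and the check is not exhaustive (it ignores reserved device names such as CON and NUL, control characters, and trailing dots or spaces).

```python
# Characters that Windows reserves and does not allow inside a file name.
RESERVED = set('<>:"/\\|?*')


def is_valid_windows_name(component):
    """Return True if a single path component avoids the reserved characters."""
    return bool(component) and not (set(component) & RESERVED)


def rejected_paths(repo_paths):
    """Return the repo paths (slash-separated) that have an unusable component."""
    return [
        p for p in repo_paths
        if not all(is_valid_windows_name(c) for c in p.split("/"))
    ]


# Hypothetical list of paths from a repo of Unix origin.
print(rejected_paths(["foo:bar", "src/main.c", "\\Windows\\System32\\crypt32.dll"]))
```

A checkout that ran a test like this before writing anything would refuse foo:bar up front, so no stray foo file or hidden stream would ever be created.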
I opened a bug against Git for Windows to track this issue, and provided a PR with a fix, but these have sat dormant with no feedback for the past 4 months. This week I’m making noise again on the PR; hopefully that will spur some action by the maintainers.