Git for Windows accidentally creates NTFS alternate data streams
Jul 20, 2016
As part of the small minority of devs at my company who primarily run Windows, I’m accustomed to working around occasional Unix-specific behaviors in our build and deployment systems. Cygwin makes most stuff just work, I can fix simple incompatibilities myself, and as a last resort I can always boot into OSX for a while if needed.
One oddity that took me quite some time to diagnose, though, was Git’s strange behavior when dealing with files in our repo whose names contained a colon.
What happens when you sync a file with a colon in the filename?
Besides the initial drive prefix (e.g. C:\), Windows does not permit the colon
character in file or directory paths. Unix has no such restriction. So what
happens if a Git repo of Unix origin contains a file with a colon in the name,
and that repo is cloned on a Windows machine?
I’ve created a sample repo that contains a single file foo:bar with the content
hello. If you clone the repo with a default installation of Git for Windows,
you get no errors or warnings:
    C:\src > git clone https://github.com/latkin/filetest.git
    Cloning into 'filetest'...
    remote: Counting objects: 3, done.
    remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 3
    Unpacking objects: 100% (3/3), done.
    Checking connectivity... done.
Instead of a file named foo:bar, though, you get a file named foo with
nothing in it:
    C:\src > cd .\filetest\
    C:\src\filetest > dir -force
        Directory: C:\src\filetest
    Mode          LastWriteTime         Length Name
    ----          -------------         ------ ----
    d--h--        7/17/2016  5:53 PM           .git
    -a----        7/17/2016  5:47 PM         0 foo
That’s kind of strange on its own, but even more peculiar is that Git has a different opinion of what things look like:
    C:\src\filetest > git status
    On branch master
    Your branch is up-to-date with 'origin/master'.
    Untracked files:
      (use "git add <file>..." to include in what will be committed)
            foo
    nothing added to commit but untracked files present (use "git add" to track)
Git notices the untracked file foo, but since it reports no changes to
foo:bar, it seems to think foo:bar is present and contains the expected
content. How strange…
Confusing matters further, when you enable the Git config option core.fscache
(which is enabled by default in version 2.8.2 and later), the working set
suddenly changes - now foo:bar is reported as missing:
    C:\src\filetest > git config core.fscache true
    C:\src\filetest > git status
    On branch master
    Your branch is up-to-date with 'origin/master'.
    Changes not staged for commit:
      (use "git add/rm <file>..." to update what will be committed)
      (use "git checkout -- <file>..." to discard changes in working directory)
            deleted:    foo:bar
    Untracked files:
      (use "git add <file>..." to include in what will be committed)
            foo
    no changes added to commit (use "git add" and/or "git commit -a")
That’s what we originally would have expected, given that only the file foo
was created when we cloned the repository, but why is it different from
core.fscache = false? And why was this empty file foo created in the first
place?
Alternate data streams
The root cause of all this is a relatively obscure NTFS feature called alternate data streams. Some good summary links here and here.
Briefly, files in NTFS are not simple buckets of data, but rather a collection
of one or more data streams. What we normally think of as a file’s contents is
really the contents of the primary, unnamed stream. One can also create and
add data to alternate, named streams. These streams are directly addressable by
appending :streamname to the normal file path, e.g. a named stream of the file
qwerty.txt can be accessed via a path of the form qwerty.txt:streamname.
So although foo:bar is not a legal Windows file name, Windows file APIs are
nonetheless happy to accept it for read and write operations because it is
indeed legal as a path to something in the filesystem, namely the bar
alternate stream of the file foo.
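To make the addressing rule concrete, here is a minimal Python sketch that splits a path such as foo:bar into its file part and its stream name. The helper name split_stream is hypothetical and is not part of Windows or Git. The sketch is deliberately simplified: it only looks at the last path component, and it ignores the optional stream-type suffix (such as :$DATA) that full stream paths may carry.

```python
import ntpath


def split_stream(path):
    """Split a Windows-style path into (file_path, stream_name).

    stream_name is None when the last path component names no
    alternate stream. Simplified illustration only.
    """
    # Set the drive prefix ("C:") aside so its colon is not mistaken
    # for a stream separator.
    _drive, rest = ntpath.splitdrive(path)
    leaf = ntpath.basename(rest)
    name, sep, stream = leaf.partition(":")
    if not sep:
        return path, None
    # Everything before the last component, plus the bare file name.
    return path[: len(path) - len(leaf)] + name, stream


if __name__ == "__main__":
    print(split_stream(r"C:\src\filetest\foo:bar"))
    print(split_stream(r"C:\src\filetest\foo"))
```

Seen this way, foo:bar is less a strange file name than a two-part address: a file, then a stream inside it.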
What Git does
Once you are aware of alternate data streams, Git’s behavior starts to make sense.
When cloning, Git naively blasts content into the path foo:bar. That is
a 100% legal path, so no errors are raised by the OS. The result is a file
foo with no content in the primary data stream (hence reported as length 0),
but 6 bytes in an alternate stream bar:
    C:\src\filetest > Get-Item .\foo -Stream * | ft Stream,Length
    Stream Length
    ------ ------
    :$DATA      0
    bar         6
    C:\src\filetest > cat .\foo:bar
    hello
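The same effect can be reproduced outside Git. The Python sketch below writes six bytes to a path ending in foo:bar inside a temporary directory, reads them back through the same path, and lists the directory. The function name write_colon_path is made up for illustration, and the content hello plus a newline is an assumption based on the 6-byte length shown above. On an NTFS volume I would expect the listing to contain only foo; on a typical Unix filesystem it should contain a regular file literally named foo:bar.

```python
import os
import tempfile


def write_colon_path(directory, name="foo:bar", data=b"hello\n"):
    """Write data to directory/name, then report what the directory lists.

    Returns (sorted directory entries, bytes read back via the same path).
    """
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(data)
    with open(path, "rb") as f:
        read_back = f.read()
    return sorted(os.listdir(directory)), read_back


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        entries, content = write_colon_path(tmp)
        print("directory entries:", entries)
        print("read back:", content)
```

Either way, the write and the read-back succeed without complaint, which is exactly why nothing looks wrong at clone time.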
When checking the status of the working set, Git uses different algorithms
depending on whether
core.fscache is enabled.
When core.fscache is false, file metadata checks are done one at a time,
ultimately invoking GetFileAttributesEx
for each path. Git has no clue it’s even dealing with an alternate stream,
because these file APIs behave exactly the same as they would with a normal
file. Does foo:bar exist? Yep! Does the last modified time on foo:bar match
what Git expects? Yep! Is the content of foo:bar what Git expects? Yep! Well
alright, that file must be unchanged.
When core.fscache is true, Git pre-caches file metadata per directory,
then reads it from the cache
instead of invoking file APIs directly. This leads to a different view of the
world - when enumerating files in the containing directory, Windows only
returns foo, since that’s the only file present. Thus the cache, when asked
for the metadata of foo:bar, believes this file does not exist.
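The difference between the two strategies can be illustrated with a toy model. The Python sketch below is not Git’s code and calls no Windows APIs; ToyNtfsDir, status_uncached and status_cached are hypothetical names. It only mimics the two behaviors described above: a per-path lookup that resolves stream paths, and a directory enumeration that lists files but never their streams.

```python
class ToyNtfsDir:
    """Toy stand-in for one NTFS directory: file name -> {stream name: bytes}.

    The unnamed primary stream is stored under the key "".
    """

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        name, _, stream = path.partition(":")
        # Writing to any stream creates the file with an empty primary stream.
        self.files.setdefault(name, {"": b""})[stream] = data

    def stat(self, path):
        """Per-path lookup: resolves stream paths. Returns a length or None."""
        name, _, stream = path.partition(":")
        streams = self.files.get(name)
        if streams is None or stream not in streams:
            return None
        return len(streams[stream])

    def list_dir(self):
        """Directory enumeration: file names and primary-stream lengths only."""
        return {name: len(streams[""]) for name, streams in self.files.items()}


def status_uncached(fs, tracked):
    # One lookup per tracked path, as when the cache is disabled.
    return {p: "present" if fs.stat(p) is not None else "deleted" for p in tracked}


def status_cached(fs, tracked):
    # Enumerate the directory once, then answer only from that listing.
    cache = fs.list_dir()
    return {p: "present" if p in cache else "deleted" for p in tracked}


if __name__ == "__main__":
    fs = ToyNtfsDir()
    fs.write("foo:bar", b"hello\n")
    print(status_uncached(fs, ["foo:bar"]))
    print(status_cached(fs, ["foo:bar"]))
```

In this model the uncached check finds foo:bar and the cached check does not, even though both are looking at the same directory.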
In my opinion, this is all rather silly and should never have been allowed to
happen in the first place. Git should simply detect the bogus filename, issue
an error, and never even attempt to write the file to disk. This is how other
valid-in-Unix-but-invalid-in-Windows filenames are handled already (e.g. a file
named \Windows\System32\crypt32.dll will be blocked). Such files would then
(correctly) be reported as missing from the working set, regardless of
core.fscache.
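As a rough illustration of the kind of check being argued for, the Python sketch below flags path components containing characters that Windows does not allow in a file or directory name. It is not the fix from my PR and not Git’s actual validation logic; the names WINDOWS_RESERVED_CHARS and invalid_windows_components are made up for this example. It is also incomplete: it ignores reserved device names and trailing dots or spaces.

```python
# Characters that Windows rejects inside a single file or directory name.
WINDOWS_RESERVED_CHARS = set('<>:"/\\|?*')


def invalid_windows_components(repo_path):
    """Return the components of a '/'-separated repo path that Windows rejects.

    A component is flagged if it contains a reserved character or an
    ASCII control character (code points below 32).
    """
    bad = []
    for part in repo_path.split("/"):
        if any(ch in WINDOWS_RESERVED_CHARS or ord(ch) < 32 for ch in part):
            bad.append(part)
    return bad


if __name__ == "__main__":
    for candidate in ["foo:bar", "src/main.c", "\\Windows\\System32\\crypt32.dll"]:
        print(candidate, "->", invalid_windows_components(candidate))
```

A checkout could refuse to write any path for which such a check returns a non-empty list, which would leave the file consistently reported as missing.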
I opened a bug against Git for Windows to track this issue, and provided a PR with a fix, but these have sat dormant with no feedback for the past 4 months. This week I’m making noise again on the PR; hopefully that will spur some action by the maintainers.