[Rd] Comments requested on "changedFiles" function

Duncan Murdoch murdoch.duncan at gmail.com
Fri Sep 6 11:17:03 CEST 2013


On 13-09-06 2:46 AM, Karl Millar wrote:
> Comments inline:
>
>
> On Wed, Sep 4, 2013 at 6:10 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
>>
>> On 13-09-04 8:02 PM, Karl Millar wrote:
>>>
>>> Hi Duncan,
>>>
>>> I think this functionality would be much easier to use and understand if
>>> you split it up the functionality of taking snapshots and comparing them
>>> into separate functions.
>>
>>
>> Yes, that's another possibility.  Some more comment below...
>>
>>
>>
>>> In addition, the 'timestamp' functionality
>>> seems both confusing and brittle to me.  I think it would be better to
>>> store file modification times in the snapshot and use those instead of
>>> an external file.  Maybe:
>>
>>
>> You can do that, using file.info = "mtime", but the file.info snapshots are quite a bit slower than using the timestamp file (when looking at a big recursive directory of files).
>
>
> Sorry, I completely failed to explain what I was thinking here.  There
> are a number of issues here, but the biggest one is that you're
> implicitly assuming that files that get modified will have mtimes that
> come after the timestamp file was created.  This isn't always true,
> with the most notable exception being if you download a package from
> CRAN and untar it, the mtimes are usually well in the past (at least
> with GNU tar on a Linux system), so this code won't notice that the
> files have changed.
>
> It may be a good idea to store the file sizes as well, which would
> help prevent false negatives in the (rare IIRC) cases where the
> contents have changed but the mtime values have not.  Since you
> already need to call file.info() to get the mtime, this shouldn't
> increase the runtime, and the extra memory needed is fairly modest.

If we need to use file.info(), then I store the complete result, so I 
have the size whenever I have the mtime.
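
A quick sketch of the pitfall (illustrative only, base R): a file written with an mtime in the past -- as untarring a CRAN package can do -- compares as "older" than the timestamp file, so a timestamp-based check misses the change, while a snapshot keeping the full file.info() result has both mtime and size available:

```r
# A file restored with an old mtime slips past a timestamp-file check.
stamp <- tempfile("timestamp"); file.create(stamp)
f <- tempfile(); writeLines("freshly extracted content", f)
Sys.setFileTime(f, Sys.time() - 86400)   # mimic tar restoring an old mtime

file.info(f)$mtime > file.info(stamp)$mtime   # FALSE: the change is invisible

# Storing the complete file.info() result keeps size alongside mtime
# at no extra cost, since one call returns both.
snap <- file.info(f)[, c("size", "mtime")]
```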
>
>>>
>>> # Take a snapshot of the files.
>>> takeFileSnapshot(directory, file.info = TRUE, md5sum = FALSE,
>>>                  full.names = FALSE, recursive = TRUE, ...)
>>>
>>> # Take a snapshot using the same options as used for snapshot.
>>> retakeFileSnapshot(snapshot, directory = snapshot$directory) {
>>>     takeFileSnapshot(directory, file.info = snapshot$file.info,
>>>                      md5sum = snapshot$md5sum, etc)
>>> }
>>>
>>> compareFileSnapshots(snapshot1, snapshot2)
>>> - or -
>>> getNewFiles(snapshot1, snapshot2)       # These names are probably too generic
>>> getDeletedFiles(snapshot1, snapshot2)
>>> getUpdatedFiles(snapshot1, snapshot2)
>>> -or-
>>> setdiff(snapshot1, snapshot2)  # Unclear how this should treat updated files
>>>
>>>
>>> This approach does have the difficulty that users could attempt to
>>> compare snapshots that were taken with different options and that can't
>>> be compared, but that should be an easy error to detect.
>>
>>
>> I don't want to add too many new functions.  The general R style is to have functions that do a lot, rather than have a lot of different functions to achieve different parts of related tasks.  This is better for interactive use (fewer functions to remember, a simpler help system to navigate), though it probably results in less readable code.
>
>
> This is somewhat more nuanced and not particular to interactive use
> IMHO.  Having functions that do a lot is good, _as long as the
> semantics are always consistent_.  For example, lm() does a huge
> amount and has a wide variety of ways that you can specify your data,
> but it basically does the same thing no matter how you use it.  On the
> other hand, if you have a function that does different things
> depending on how you call it (e.g. reshape()) then it's easy to
> remember the function name, but much harder to remember how to call it
> correctly, harder to understand the documentation and less readable.
>
>>
>> I can see an argument for two functions (a get and a compare), but I don't think there are many cases where doing two gets and comparing the snapshots would be worth the extra runtime.  (It's extra because file.info is only a little faster than list.files, and it would be unavoidable to call both twice in that version.  Using the timestamp file avoids one of those calls, and replaces the other with file_test, which takes a similar amount of time.  So overall it's about 20-25% faster.)  It also makes the code a bit more complicated, i.e. three calls (get, get, compare) instead of two (get, compare).
>
>
> I think a 'snapshotDirectory' and 'compareDirectoryToSnapshot'
> combination might work well.

I have split it into two functions.  The compare function takes two 
snapshot arguments, but if only the "before" snapshot is given, it will 
compute the "after" snapshot from the current state of the file system.  
This makes for a cleaner design; thanks for the suggestion.
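
A rough sketch of how that pairing might work (hypothetical bodies; the real functions differ in detail):

```r
# Hypothetical sketch of the two-function design: take a snapshot,
# then compare two snapshots.
fileSnapshot <- function(path = ".", ...) {
  files <- list.files(path, full.names = TRUE, ...)
  list(path = path, info = file.info(files))
}

# If only "before" is supplied, "after" defaults to a fresh snapshot
# taken from the current state of the file system.
changedFiles <- function(before, after = fileSnapshot(before$path)) {
  pre    <- rownames(before$info)
  post   <- rownames(after$info)
  common <- intersect(pre, post)
  modified <- common[before$info[common, "mtime"] != after$info[common, "mtime"] |
                     before$info[common, "size"]  != after$info[common, "size"]]
  list(added   = setdiff(post, pre),
       deleted = setdiff(pre, post),
       changed = modified)
}
```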

About the function names:  selection of files for the snapshot is done 
by list.files, and that function's "path" argument can be a vector, so 
multiple directories can be recorded at once.  I've chosen 
"fileSnapshot" and "changedFiles" so far, but those aren't perfect.
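
For the multi-directory case, a small illustration (made-up temporary directories):

```r
# list.files() takes a vector of paths, so one snapshot can cover
# several directories at once.
d1 <- tempfile(); d2 <- tempfile()
dir.create(d1); dir.create(d2)
file.create(file.path(d1, "a.txt"), file.path(d2, "b.txt"))

list.files(c(d1, d2))
# both "a.txt" and "b.txt" appear in a single listing
```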

I need to do a little more cleanup and testing, then I'll put the new 
version online somewhere.

Duncan Murdoch
