
Gary Sieling is a software developer interested in dev-ops, database technologies, and machine learning. He has a computer science degree from the Rochester Institute of Technology and has worked on many products in the legal and regulatory industries, including several data warehousing applications. Gary is a DZone MVB.

Case Study: 10x File Copy Performance with Robocopy

05.03.2013

Source data:

  • ~500,000 folders (court cases)
  • ~2.5-3 million documents
  • Source drive is replicated x2 with RAID
  • Copying to a NAS over gigabit Ethernet
  • Initial un-tuned copy was on pace to take ~2 weeks (and that was after switching to Robocopy – before, it was painful just to do an ls)
  • Final copy took ~24 hours

Monitoring:

  • Initially I saw 20-40 Kbps of traffic in DD-WRT, clearly too low. After some changes this is still generally low, but with spikes up to 650 Kbps.
  • CPU use – 4/8 cores in use, even with >8 threads assigned to Robocopy
  • In Computer Management -> Performance Monitor, the source disk is reading as fast as it can (the counter is pegged at 100 the entire time)
  • The “Split IO/sec” counter is very high much of the time. Research indicates this could be improved by defragmenting (though that might take me months to complete).

Filesystem Lessons:

  • NTFS can hold large numbers of files in a single folder, but enumerating them takes forever
  • When you enumerate a directory in NTFS (e.g. by opening it in Windows Explorer), Windows appears to lock the folder(!) which pauses any copy/ls operations
  • The copy does not appear to be CPU bound – even with Robocopy set to use many threads, only 4/8 cores are in use at 5-15% each.
  • ext4 (the destination filesystem) allows at most ~64,000 subdirectories per folder; creating any more produces an error.
  • I split all 500k items into 256*256 buckets using the md5 of the folder names (for instance, opening \36\0f might show a half dozen items) – basically this uses the filesystem as a tree map.
  • One nice consequence of this is that you can estimate how far along the process is by looking at how many top-level folders have been copied (85/256 ≈ 33%, etc.)
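The bucketing scheme above can be sketched in a few lines. This is a minimal illustration, not the author's actual script – the function names (`bucket_path`, `estimate_progress`) are hypothetical, and it assumes two hex characters of the md5 per directory level, which yields the 256*256 buckets described:

```python
import hashlib

def bucket_path(folder_name: str) -> str:
    """Map a folder name to a two-level bucket like '36/0f' via its md5."""
    digest = hashlib.md5(folder_name.encode("utf-8")).hexdigest()
    return f"{digest[0:2]}/{digest[2:4]}"

def estimate_progress(top_level_buckets_done: int) -> float:
    """Rough copy progress, assuming buckets fill evenly (85/256 ~ 33%)."""
    return top_level_buckets_done / 256

# Example: distribute case folders across the buckets.
for case in ["case-1001", "case-1002", "case-1003"]:
    print(f"{bucket_path(case)}/{case}")
```

Because md5 is effectively uniform over its output, each bucket receives roughly the same share of folders, which is what makes the top-level folder count a usable progress estimate.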

Robocopy Options:

  • Robocopy lets you redirect console logging to a file with /LOG:output.txt
  • Robocopy lets you set the number of threads it uses with /MT:n. The default is 8; it seemed to run faster with more than 8, but only the first few extra threads made any difference.
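Putting those options together, an invocation along these lines matches the setup described; the paths and thread count are illustrative, not the author's actual command:

```shell
:: /E copies subdirectories (including empty ones), /MT:16 uses 16 threads,
:: /LOG writes output to a file, /NP suppresses per-file progress, and
:: /NFL /NDL suppress file/directory name logging to keep the log small.
:: /R:1 /W:1 limit retries so one bad file can't stall the copy.
robocopy D:\cases \\nas\cases /E /MT:16 /LOG:output.txt /NP /NFL /NDL /R:1 /W:1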

To investigate:

  • Ways of using virtual filesystems – it’d be nice to continue using wget to download, but split up large folders into batches for scraping.
  • One possibility is to run wget inside VirtualBox, since there are more Linux-based virtual filesystems – not sure about the performance overhead.



Published at DZone with permission of Gary Sieling, author and DZone MVB. (source)
