What is the largest file transfer you have ever done?

data1701d (He/Him)@startrek.website to Linux@lemmy.ml – 215 points –

I'm writing a program that wraps around dd to try and warn you if you are doing anything stupid. I have thus been giving the man page a good read. While doing this, I noticed that dd supports sizes all the way up to quettabytes, a unit orders of magnitude larger than all the data on the entire internet.
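
For anyone curious what those units look like inside a wrapper, here is a minimal sketch of parsing dd-style size suffixes, going by my reading of the GNU dd man page: a bare prefix letter (or the "iB" form) is a power of 1024, and the "B" form is a power of 1000. `parse_dd_size` is a made-up helper name, not part of dd, and real dd accepts more forms than this (for example `x` multiplication), so treat it as an illustration only.

```python
import re

# Prefix letters from kilo up to quetta, in increasing order.
_PREFIXES = "KMGTPEZYRQ"

def parse_dd_size(text: str) -> int:
    """Hypothetical helper: convert a dd-style size like '4M' or '1QB' to bytes."""
    match = re.fullmatch(r"(\d+)([A-Za-z]*)", text)
    if match is None:
        raise ValueError(f"not a size: {text!r}")
    number, suffix = int(match.group(1)), match.group(2)
    if suffix == "":
        return number
    # Special single-letter units: c = 1 byte, w = 2 bytes, b = 512 bytes.
    if suffix in ("c", "w", "b"):
        return number * {"c": 1, "w": 2, "b": 512}[suffix]
    letter, rest = suffix[0].upper(), suffix[1:]
    if letter not in _PREFIXES or rest not in ("", "B", "iB"):
        raise ValueError(f"unknown suffix: {suffix!r}")
    power = _PREFIXES.index(letter) + 1
    base = 1000 if rest == "B" else 1024   # 'MB' is decimal, 'M' and 'MiB' are binary
    return number * base ** power

print(parse_dd_size("4M"))              # 4194304
print(parse_dd_size("1QB") == 10**30)   # True
```

Python integers are unbounded, so the quettabyte case just works here; a wrapper in a language with fixed-width integers would need to think about overflow.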

This has caused me to wonder what the largest storage operation you guys have done is. I've taken a couple images of hard drives that were a single terabyte large, but I was wondering if the sysadmins among you have had to do something with e.g. a giant RAID 10 array.

Not that big by today's standards, but I once downloaded the Windows 98 beta CD from a friend over dialup, 33.6k at best. Took about a week as I recall.

I remember downloading the scene on American Pie where Shannon Elizabeth strips naked over our 33.6 link and it took like an hour, at an amazing resolution of like 240p for a two minute clip 😂

Yep, downloaded XP over 33.6k modem, but I'm in NZ so 33.6 was more advertising than reality, it took weeks.

I obviously downloaded a car after seeing that obnoxious anti-piracy ad.

I'm currently backing up my /dev folder to my unlimited cloud storage. The backup of the file /dev/random has been running for two weeks.

No wonder. That file is super slow to transfer for some reason. But wait till you get to /dev/urandom. That file has TBs to transfer at whatever pipe you can throw at it...

I’m guessing this is a joke, right?

/dev/random and other "files" in /dev are not really files, they are interfaces which can be used to interact with virtual or hardware devices. /dev/random spits out cryptographically secure random data. Another example is /dev/zero, which spits out only zero bytes.

Both are infinite.

Not all "files" in /dev are infinite, for example hard drives can (depending on which technology they use) be accessed under /dev/sda /dev/sdb and so on.
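
To make the "interface, not a file" point concrete, here is a small sketch that reads from these device nodes. It assumes a Linux-like system where /dev/zero and /dev/urandom exist; `read_exactly` is just an illustrative helper name. The point is that the reader has to decide when to stop, because the device itself never reports end-of-file.

```python
import os

def read_exactly(path: str, count: int) -> bytes:
    """Read up to `count` bytes from a device node or file, then stop."""
    chunks = []
    remaining = count
    # buffering=0 so Python does not try to read ahead on an endless device.
    with open(path, "rb", buffering=0) as device:
        while remaining > 0:
            chunk = device.read(remaining)
            if not chunk:          # only happens on finite devices or regular files
                break
            chunks.append(chunk)
            remaining -= len(chunk)
    return b"".join(chunks)

# Guarded so the example is a no-op on systems without these device nodes.
if os.path.exists("/dev/zero") and os.path.exists("/dev/urandom"):
    print(read_exactly("/dev/zero", 8))          # eight zero bytes
    print(read_exactly("/dev/urandom", 8).hex()) # different every run
```

A backup tool that walks /dev and copies each entry until end-of-file would never finish on either of these, which is the joke above.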

I’m aware of that. I was quite sure the author was joking, with the slightest bit of concern of them actually making the mistake.

In grad school I worked with MRI data (hence the username). I had to upload ~500GB to our supercomputing cluster. Somewhere around 100,000 MRI images, and wrote 20 or so different machine learning algorithms to process them. All said and done, I ended up with about 2.5TB on the supercomputer. About 500MB ended up being useful and made it into my thesis.

Don't stay in school, kids.

Entire drive/array backups will probably be by far the largest file transfer anyone ever does. The biggest I've done was a measly 20TB over the internet which took forever.

Outside of that the largest "file" I've copied was just over 1TB which was a SQL file backup for our main databases at work.

+1

From an order of magnitude perspective, the max is terabytes. No "normal" users are dealing with petabytes. And if you are dealing with petabytes, you're not using some random poster's program from reddit.

For a concrete cap, I'd say 256 tebibytes...

I work in cinema content so hysterical laughter

Interesting! Could you give some numbers? And what do you use to move the files? If you can disclose obvs

A small dcp is around 500GB. But that's like basic film shizz, 2d, 5.1 audio. For comparison, the 3D Deadpool 2 teaser was 10GB.

Aspera's commonly used for transmission due to the way it multiplexes. It's the same protocolling behind Netflix and other streamers, although we don't have to worry about preloading chunks.

My laughter is mostly because we're transmitting to a couple thousand clients at once, so even with a small dcp that's around a PB dropped without blinking

Eh, what's a dcp?

Digital Cinema Package; basically the movie file you're watching when you're in a movie theater.

Digital Cinema Package. Films come out in a buncha files that rather resemble a dvd rip. You got your video files (still called reels!) and your audio files, maybe some subtitle files and other bits and pieces and your assetmap (list of files) all in a big fat folder collectively called a DCP

In the early 2000s I worked on an animated film. The studio was in the southern part of Orange County CA, and the final color grading / print (still not totally digital then) was done in LA. It was faster to courier a box of hard drives than to transfer electronically. We had to do it a bunch of times because of various notes/changes/fuck ups. Then the results got courier'd back because the director couldn't be bothered to travel for the fucking million dollars he was making.

You legally have to tell us if that movie was Shrek.

Hah, nope. Shrek was made in Glendale, so they probably had everything on site or right next door.

Oh yeah I worked in animation for a bit too. Those 4K master files are no joke lol

Fucking hell the raws woulda been gigantic

I used to work in the same industry. We transferred several PBs from West US to Australia using Aspera via thick AWS pipes. Awesome software.

Hahahah did you enjoy Australian Internet? It's wonderfully archaic

(MPS, Delux, Gofilex or Qubewire?)

Ahhh thanks for the reply! Makes sense! We also use Aspera here at work (videogames) but don't move that amount, not even close.

I’ve done a 1PB sync between a pair of 8-node SAN clusters as one was being physically moved, since it’d be faster to seed the data and start a delta sync rather than try to do it all over a 10Gb pipe.

It was something around 40 TB x2. We were doing a terrain analysis of the entire Earth. Every morning for 25 days I would install two fresh drives in the cluster doing the data crunching and migrate the filled drives to our file server rack.

The drives were about 80% full and our primary server was mirrored to two other 50 drive servers. At the end of the month the two servers were then shipped to customer locations.

In the middle of something like 200TB for my Plex server, going from a 12 bay system to a 36 LFF system. But I've also literally driven servers across the desert because it was faster than trying to move data from one datacenter to another.

That's some RFC 2549 logic, right there.

Just thinking about how much data you could transfer using this. MicroSD cards make it a decent amount. Latency would be horrible, but throughput could be pretty good I think.

Which desert? I've lived in the desert my entire life.

I’ve migrated petabytes from one GPFS file system to another. More than once, in fact. I’ve also migrated about 600TB of data from D3 tape format to 9940.

I once abused an SMTP relay (my own) by emailing Novell a 400+ MB memory dump. Their FTP site kept timing out.

After all that, and them swearing they had to have it, the OS team said "Nope, we're not going to look at it". Guess how I feel about Novell after that?

This was in the mid-90's.

I don't remember how many files, but typically these geophysical recordings clock in at 10-30 GB. What I do remember, though, was the total transfer size: 4TB. It was kind of like a bunch of .segd, and they were stored in this server cluster that was mounted in a shipping container for easy transport and lifting onboard survey ships. Some geophysics processors needed it on the other side of the world. There was nobody physically heading in the same direction as the transfer, so we figured it would just be easier to rsync it over 4G. It took a little over a week to transfer.

Normally when we have transfers of a substantial size going far, we ship it on LTO. For short distance transfers we usually run a fiber, and I have no idea how big the largest transfer job has been that way. Must be in the hundreds of TB. The entire cluster is 1.2PB, but I can't recall ever having to transfer everything in one go, as the receiving end usually has a lot less space.

4G?! That strikes fear into my heart!

The alternative was 5mbit/s VSAT. 4G was a luxury at that time.

At the rates I'm paying for 4G data, there are very few places in the world where it wouldn't be cheaper for me to get on a plane and sneakernet that much data

A few years back I worked at a home. They organised the whole data structure but needed to move to another provider. My colleagues and I moved roughly 15.4 TB. I don't know how long it took because honestly we didn't have much to do when the data was moving so we just used the downtime for some nerd time. Nerd time in the sense that we just started gaming and doing a mini LAN party with our Raspberry and Banana Pis.

Surprisingly the data contained information on lots of long dead people, which is quite scary because it wasn't being deleted.

No idea about which specific type of business it is, but keeping that history long term can have some benefits, especially to outside people. Some government agencies require companies to keep records for a certain number of years. It could also help out in legal investigations many years in the future and show any auditors you keep good records. From a historical perspective, it can be matched to census, birth, and death certificates. A lot of generational history gets lost.

Companies also just hoard data. Never know what will be useful later. shrug

I worked at a niche factory some 20 years ago. We had a tape robot with 8 tapes at some 200GB each. It'd do a full backup of everyone's home directories and mailboxes every week, and incremental backups nightly.

We'd keep the weekly backups on-site in a safe. Once a month I'd do a run to another plant one town over with a full backup.

I guess at most we'd need five tapes. If they still use it, and with modern tapes, it should scale nicely. Today's LTO-tapes are 18TB. Driving five tapes half an hour would give a nice bandwidth of 50GB/s. The bottleneck would be the write speed to tape at 400MB/s.
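
The back-of-the-envelope numbers above check out. A quick sketch of the arithmetic, using only the figures from this comment (five 18TB tapes, a half-hour drive, 400MB/s tape write speed) and decimal units:

```python
TB = 10**12
GB = 10**9
MB = 10**6

def courier_bandwidth(tapes: int, tape_bytes: int, drive_seconds: float) -> float:
    """Bytes per second achieved by physically driving the tapes somewhere."""
    return tapes * tape_bytes / drive_seconds

def hours_to_write(tape_bytes: int, write_bytes_per_second: float) -> float:
    """Hours needed to fill one tape at the given sustained write speed."""
    return tape_bytes / write_bytes_per_second / 3600

print(courier_bandwidth(5, 18 * TB, 30 * 60) / GB)  # 50.0 (GB/s on the road)
print(hours_to_write(18 * TB, 400 * MB))            # 12.5 (hours per tape)
```

So the drive itself is the fast part: filling a single tape takes about 12.5 hours at that write speed, versus 30 minutes in the car.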

~15TB over the internet via 30Mbps uplink without any special considerations. Syncthing handled any and all network and power interruptions. I did a few power cable pulls myself.

I think it's crazy that not that long ago 30Mbps was still pretty good, we now have 1Gbps+ at residential addresses and it's fairly common too

I’ve got symmetrical gigabit in my apartment, with the option to upgrade to 5 or 8. I’d have to upgrade my equipment to use those speeds, but it’s nice to know I have the option.

Yeah, I also moved from 30Mb upload to 700Mb recently and it's just insane. It's also insane thinking I had a symmetric gigabit connection in Eastern Europe in the 2000s for fairly cheap. It was Ethernet though, not fiber. Patch cables and switches all the way to the central office. 🫠

Most people in Canada today have 50Mb upload at the most expensive connection tiers - on DOCSIS 3.x. Only over the last few years fiber began becoming more common but it's still fairly uncommon as it's the most expensive connection tier if at all available.

We might pay for some of the most expensive internet in the world in Canada, but at least we can't fault them for providing an unstable or unperformant service. Downloading llama models is where 1Gbps really shines. You see a 7GB model? It's done before you are even back from the toilet. Crazy times.

I should have known that the person on the internet noting 30Mbps was pretty good till recently is a fellow Canadian. 🍁 #ROBeLUS

BTW, TekSavvy recently started offering fiber seemingly on Bell's last mile.

How long did that take? A month or two? I've backfilled my NAS with about 40 TB before over a 1 gig fiber pipe in about a week or so of 24/7 downloading.

Yeah, something like that. I verified it with rsync after that, no errors.
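
For reference, the best-case numbers for both transfers in this exchange, as a rough sketch. It assumes decimal units and a link that stays saturated the whole time with no protocol overhead, so real transfers will take longer.

```python
TB = 10**12

def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Ideal transfer time in days for a fully saturated link."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / 86400

print(round(transfer_days(15 * TB, 30e6), 1))  # 46.3 days: 15TB over 30Mbps
print(round(transfer_days(40 * TB, 1e9), 1))   # 3.7 days: 40TB over 1Gbps
```

That lines up with "a month or two" for the 30Mbps uplink and "about a week or so" for the gigabit download once real-world overhead is added.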

I once moved ~5TB of research data over the internet. It took days and unfortunately it also turned out that the data was junk :/

I think 16 terabytes? Might have been twelve. I was consolidating a bunch of old drives and data into a nas for a friend. He just didn't have the time, between working and school and brought me all the hardware and said "go" lol.

Largest one I ever did was around 4.something TB. New off-site backup server at a friend's place. Took me 4 months due to data limits and an upload speed that maxed out at 3MB/s.

My Chia crypto farm at its peak had about 1.5 PB of plots, each plot was I think about 100ish gigs? I'd plot them on a dedicated machine and then move them to storage for farming. I think I'd move around 10TB per night.

It was done with a combination of PowerShell and bash scripts on Windows, Linux, and the built-in Windows Subsystem for Linux.

Today I've migrated my data from my old zfs pool to a new bigger one, the rsync of 13.5TiB took roughly 18 hours. It's slow spinning disks storage so that's fine.

The second and third runs of the same rsync took like 5 seconds, blazing fast.

Upgraded a NAS for the office. It was reaching capacity, so we replaced it. Transfer was maybe 30 TB. Just used rsync. That local transfer was relatively fast. What took longer was for the NAS to replicate itself with its mirror located in a DC on the other side of the country.

Yeah it's kind of wild how fast (and stable) rsync is, especially when you grew up with the extremely temperamental Windows copying thing, which I've seen fuck up a 50mb transfer before.

The biggest one I've done in one shot with rsync was only about 1tb, but I was braced for it to take half a day and cause all sorts of trouble. But no, it just sent it across perfectly first time, way faster than I was expecting.

Never dealt with Windows. rsync just makes sense. I especially like that it's idempotent, so I can just run it twice or three times and it'll be near instant on the subsequent run.
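
The reason repeat runs are near instant is that rsync, by default, skips any file whose size and modification time already match on the destination. Here is a toy sketch of that idea in Python. It is not rsync's actual implementation (no delta transfer, no deletions, no checksums), and `needs_copy` and `sync_tree` are made-up names.

```python
import os
import shutil

def needs_copy(src: str, dst: str) -> bool:
    """Quick check: copy only if dst is missing or differs in size or mtime."""
    if not os.path.exists(dst):
        return True
    s, d = os.stat(src), os.stat(dst)
    # Compare whole seconds to tolerate filesystems with coarse timestamps.
    return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)

def sync_tree(src_root: str, dst_root: str) -> int:
    """Copy changed files from src_root to dst_root; return how many were copied."""
    copied = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.join(dst_root, rel)
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            if needs_copy(src, dst):
                shutil.copy2(src, dst)  # copy2 preserves the modification time
                copied += 1
    return copied
```

Calling `sync_tree` a second time on an unchanged tree only stats the files and copies nothing, which is why the second and third runs finish in seconds.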

I downloaded that 200GB leak from National Public Data the other day, maybe not the biggest total but certainly the largest single text file I've ever messed with

Currently pushing about 3-5 TB of images to AI/ML scanning per day. Max we've seen through the system is about 8 TB.

Individual file? Probably 660 GB of backups before a migration at a previous job.

We have DBs in the dozens of TB at work so probably one of them

Back in the late 90’s I worked for an internet search company, long before Google was a thing. We would regularly physically drive a dozen SCSI drives from a RAID array between two datacenters about 20 miles apart.

I transferred my entire NAS storage, which includes all of my backups, cloud files, my family’s backups, and my… Linux ISOs. That was about 12TB.

When I was in high school we toured the local EPA office. They had the most data I've ever seen accessible in person. I'm going to guess how much.

It was a dome with a robot arm that spun around and grabbed tapes. It was 2000 so I'm guessing 100gb per tape. But my memory on the shape of the tapes isn't good.

Looks like tapes were four inches tall. Let's round up to six inches for housing and easier math. The dome was taller than me. Let's go with 14 shelves.

Let's guess a six foot shelf diameter. So, like 20 feet circumference. Tapes were maybe .8 inches a pop. With space between for robot fingers and stuff, let's guess 240 tapes per shelf.

That comes out to about 300 terabytes. Oh. That isn't that much these days. I mean, it's a lot. But these days you could easily get that in spinning disks. No robot arm seek time. But with modern hardware it'd be 60 petabytes.
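
The estimate above, spelled out as a sketch. Every input is a guess from this comment (14 shelves, 240 tapes per shelf, 100GB per tape in 2000); the 18TB figure for a modern tape is the one quoted elsewhere in the thread.

```python
GB = 10**9
TB = 10**12
PB = 10**15

def library_capacity(shelves: int, tapes_per_shelf: int, tape_bytes: int) -> int:
    """Total bytes held by a tape library with identical shelves and tapes."""
    return shelves * tapes_per_shelf * tape_bytes

then = library_capacity(14, 240, 100 * GB)
now = library_capacity(14, 240, 18 * TB)

print(then / TB)  # 336.0 TB, i.e. "about 300 terabytes"
print(now / PB)   # 60.48 PB with modern tapes
```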

I'm not sure how you'd transfer it these days. A truck, presumably. But you'd probably want to transfer a copy rather than disassemble it. That sounds slow too.

This was your local EPA? Do you mean at the state level (often referred to as "DEP")? Or is this the federal EPA?

Because that seems like quite the expense in 2000, and I can't imagine my state's DEP ever shelling out that kind of cash for it. Even nowadays.

Sounds cool though.

I think it was the EPA's National Compute Center. I'm guessing based on location though.

Tape robots are fun, but tape isn't as popular today.

Yes, it's a truck. It's always been a truck, as the bandwidth is insane.

I did 100TB, 100 streams of 1TB, all simultaneous with rsync

When I was moving from a Windows NAS (God, fuck windows and its permissions management) on an old laptop to a Linux NAS I had to copy about 10TB from some drives to some other drives so I could re-format the drives as a Linux friendly format, then copy the data back to the original drives.

I was also doing all of this via terminal, so I had to learn how to copy in the background, then write a script to check and display the progress every few seconds. I'm shocked I didn't lose any data to be completely honest. Doing shit like that makes me marvel at modern GUIs.

Took about 3 days in copying files alone. When combined with all the other NAS setup stuff, ended up taking me about a week just in waiting for stuff to happen.

I cannot reiterate enough how fucking difficult it was to set up the Windows NAS vs the Ubuntu Server NAS. I had constant issues with permissions on the Windows NAS. I've had about 1 issue in 4 months on the Linux NAS, and it was much more easily solved.

The reason the laptop wasn't a Linux NAS is due to my existing Plex server instance. It's always been on Windows and I haven't yet had a chance to try to migrate it to Linux. Some day I'll get around to it, but if it ain't broke... Now the laptop is just a dedicated Plex server and serves files from the NAS instead of local. It has much better hardware than my NAS, otherwise the NAS would be the Plex server.

so I had to learn how to copy in the background, then write a script to check and display the progress every few seconds

I hope you learned about terminal multiplexers in the meantime... They make your life much easier in cases like this.

30 years with Linux and I know I still haven't. Maybe this year? :-D

Manually transferred about 7TBs to my new Rpi4 powered NAS. It took a couple of days because I was lazy and transferred 15 GBs at a time which slowed down the speed for some reason. It could handle small sub 1 GB files in half a minute otherwise.

Could the slowdown be down to HDDs that cache on a section of - I think it's single layer? - and slowly rewrite that cache onto the denser (compound layer?) storage?

Rsynced 4.2TB of data from one server to another but with multiple files

~340GB, more than a million small files (~10KB or less each one). It took like one week to move because the files were stored in a hard drive and it was struggling to read that many files.

My cousin once stuffed an ISO through my mail server in '98. His connection up in Bella Bella restricted non-batched comms back then, so he jammed it through the server as email to get on the batched quota.

It took the data and passed it along without error, albeit with some constipation!

Multiple TB when setting up a new server to mirror an existing one. (Did an initial copy with both together in the same room, before moving the clone to a physically separate location. Doing that initial copy would saturate the network connection for a week or more otherwise)

I synced to the BSV shitcoin which is 11+ terabytes. So large I had to turn on throwing away the rest of what I downloaded because it wouldn't fit on all of the storage media I own. I feel sorry for the people running an archive node.

i've transferred 10's of ~300 GB files via manual rsyncs. it was a lot of binary astrophysical data, most of which was noise. eventually this was replaced by an automated service that bypassed local firewalls with internet-based transfers and aws stuff.

You should ping CERN or Fermilab about this. Or maybe the Event Horizon Telescope team but I think they used sneakernet to image the M87 black hole.

Anyway, my answer is probably just a SQL backup like everyone else.

4 TB over my home network. 800GB download from an external server.

Local file transfer?

I cloned a 1TB+ system a couple of times.

As the Anaconda installer of Fedora Atomic is broken (yes, ironic) I have one system originally meant for tweaking as my "zygote" and just clone, resize, balance and rebase that for new systems.

Remote? 10GB MicroWin 11 LTSC IOT ISO, the least garbage that OS can get.

Also, some leaked stuff 50GB over Bittorrent

Probably ~15TB through file-level syncing tools (rsync or similar; I forget exactly what I used), just copying my internal RAID array to an external HDD. I've done this a few times, either for backup purposes or to prepare to reformat my array. I originally used ZFS on the array, but converted it to something with built-in kernel support a while back because it got troublesome when switching distros. Might switch it to bcachefs at some point.

With dd specifically, maybe 1TB? I've used it to temporarily back up my boot drive on occasion, on the assumption that restoring my entire system that way would be simpler in case whatever I was planning blew up in my face. Fortunately never needed to restore it that way.

I mean dd claims it can handle a quettabyte, but how can we be sure?

As a single file? Likely 20GB iso.
As a collective job, 3TB of videos between hard drives for Jellyfin.

I'm currently in the process of transferring about 50 TB from one zpool to another (locally), so I can destroy and recreate it.

I've downloaded a few torrents that were around 5 TB each, they're PS4 and Xbox 360 game collections.

I routinely do 1-4TB images of SSDs before making major changes to the disk. Run fstrim on all partitions and pipe dd output through zstd before writing to disk and they shrink to actually used size or a bit smaller. Largest ever backup was probably ~20T cloned from one array to another over 40/56GbE, the deltas after that were tiny by comparison.
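
A rough illustration of why the trim-then-compress trick shrinks images so much. This sketch assumes that trimmed blocks read back as zeros, which depends on the drive and is not guaranteed, and it uses Python's built-in zlib as a stand-in for zstd, so the exact ratios will differ from a real `dd | zstd` pipeline.

```python
import os
import zlib

BLOCK = 1024 * 1024  # 1 MiB

def compressed_size(data: bytes) -> int:
    """Size in bytes after compressing with zlib (stand-in for zstd here)."""
    return len(zlib.compress(data))

trimmed_block = bytes(BLOCK)     # assumption: a trimmed region reads as all zeros
used_block = os.urandom(BLOCK)   # worst case: already-compressed or encrypted data

print(compressed_size(trimmed_block) / BLOCK)  # tiny fraction of the original
print(compressed_size(used_block) / BLOCK)     # roughly 1.0, no savings
```

Free space that reads as zeros costs almost nothing in the image, so the compressed result ends up close to the size of the data actually in use.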

Why would dd have a limit on the amount of data it can copy? Afaik dd doesn't check anything, nor does it do anything fancy. If it can copy one bit it can copy infinite.

Even if it did any sort of validation, if it can do anything larger than RAM it needs to be able to do it in chunks.

No, it can't copy infinite bits, because it has to store the current address somewhere. If they implement unbounded integers for this, they are still limited by your RAM, as that number can't infinitely grow without infinite memory.

Not looking at the man page, but I expect you can limit it if you want and the parser for the parameter knows about these names. If it were me it'd be one parser for byte size values and it'd work for chunk size and limit and sync interval and whatever else dd does.

Also probably limited by the size of the number tracking. I think dd reports the number of bytes copied at the end even in unlimited mode.
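
To put a number on the "size of the number tracking" point: a quick sketch comparing a quettabyte with the largest value an unsigned 64-bit counter can hold. That dd's byte counter is 64 bits wide is my assumption about typical builds, not something stated in this thread.

```python
QUETTABYTE = 10**30        # decimal quettabyte, in bytes
U64_MAX = 2**64 - 1        # largest value of an unsigned 64-bit counter

print(QUETTABYTE.bit_length())   # 100 bits needed to count that many bytes
print(QUETTABYTE > U64_MAX)      # True
print(U64_MAX / 10**18)          # about 18.4 exabytes before a u64 wraps
```

So under that assumption the unit name can be parsed, but a byte counter of that width would overflow around 18.4 exabytes, long before an actual quettabyte had been copied.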

It’s less about dd’s limits and more a laugh at the fact that it supports units it might take us decades or more to reach.

Well they do nickname it disk destroyer, so if it was unlimited and someone messed it up, it could delete the entire simulation that we live in. So it's for our own good really.

I recently copied ~1.6T from my old file server to my new one. I think that may be my largest non-work related transfer.

20TB (out of 21TB usable), a second 6x6TB zfs raidz2 server as my send target.

I think it would be my whole broken manjaro install, I just used dd to make a copy so I could work on it later lol. About 500 gigs

While I haven't personally had to move a data center I imagine that would be a pretty big transfer. Probably not dd though.

I can't imagine how nerve-wracking it would be to run dd on something like that lol. I still don't trust myself to copy a USB stick with my unimportant bullshit on it with dd, let alone a server with anything important on it!

Probably some video game that is ~150-200 GiB. Does that count?