Slax author's Blog
27 January 2009
Build Slax - resumable download
There were several suggestions regarding the resume functionality of Build Slax. I will write a bit about the possibilities here.
One of the ideas was to save the generated ISO or TAR file on the server's hard disk and send it from there using Apache's standard functionality (sendfile can seek if a resume operation is requested). Unfortunately, storing the generated files on disk is in no way possible; imagine 1000 users who build their customized Slax at the same time, each adding 1GB of modules ... I don't have such capacity. Moreover, it would have other side effects - the user would have to wait until his ISO is fully created, while with the current method his download starts immediately.
Another idea was to use bittorrent technologies. This is, in my opinion, of no use. The advantage of bittorrent is that many users download the same file, and thus can actually upload some of its parts to other clients. This can't be applied to Build Slax, since every single build is different; there are rarely two people who download exactly the same ISO or TAR.
One last opinion was that the resume operation would consume a lot of disk space (if ISO files are stored on disk) or CPU (if we read the files again and just drop the initial data). If my understanding is correct, the former is true, but the latter is not. Imagine a surfer who wants to download his build. Without the resume functionality, he will download it again and again, until he gives up or until it finally works out. On the other hand, when the resume functionality is implemented (even in the way I described in my previous post), the server has to process the same amount of data as it would without resume, but the advantage is that it doesn't have to send all the data to the client, only the requested part.
So from my point of view, even the questionable method of resume operation is better than nothing (it actually saves bandwidth). Still, I would like to find a way that would be even better, so the server doesn't have to read the data it doesn't need, but that may be impossible...
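To make the idea concrete, here is a minimal shell sketch of that approach (not the actual Build Slax code; build_iso_stream and OFFSET are placeholders): the server regenerates the whole stream, but only sends the part past the resume offset.

# "build_iso_stream" stands for whatever command writes the ISO to stdout;
# OFFSET is the first byte the client still needs (taken from the Range header).
OFFSET=123456789
build_iso_stream | tail -c +$((OFFSET + 1))   # tail -c +N starts output at the Nth byte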

User comments


@Fabian
There is still the issue of hard disk space, and the time needed for the ISO/TAR to be generated.
@Tomas M
I am fairly sure that the method of ignoring the output will be the most efficient method possible, unless there is a simple way of predicting the data structure of the .tar and .iso files.
To elaborate on the previous statement:
It may be possible to have a system that understands that the first 4.1 out of 6 files have already been downloaded into the .tar or .iso, based on the seek location the download client requests (assuming there is a good way of determining that).
Then the server could start creating another .iso or .tar at the partially downloaded file.
It would discard the headers and continue to discard bytes until it reaches the seek location.
It would then stream the rest of the file to the user.
I do not know too much about the structure of an iso file system, but I am fairly sure that this can be done with a tar file.
This approach (if valid) would allow for the resume, while reading and packaging the fewest files, but as I said before, I am not sure if this option is feasible.
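A rough sketch of that offset prediction for a tar stream, assuming a plain ustar/GNU tar layout (one 512-byte header per member, data padded to a multiple of 512, no long-name extension entries); FILES, SEEK and the module names are placeholders, not anything Build Slax actually uses:

SEEK=123456789                                 # byte offset the client resumes from
FILES=(module1.lzm module2.lzm module3.lzm)    # files in the exact order they are archived
OFFSET=0
for f in "${FILES[@]}"; do
    SIZE=$(stat -c%s "$f")
    NEXT=$(( OFFSET + 512 + ((SIZE + 511) / 512) * 512 ))   # header + padded data
    if [ "$SEEK" -lt "$NEXT" ]; then
        echo "resume point falls inside $f, whose member starts at byte $OFFSET"
        break
    fi
    OFFSET=$NEXT
done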

It seems to me that it would be better just to make a "Build SLAX" application and put the workload on the user end. Then downloads come from a static set of files, and don't have to be re-transferred to make a different SLAX build containing several of the same files.
Perhaps the existing system can be left as it is, but with an added option to instead download a generated Perl script (written to work on Linux or with ActivePerl) that handles the downloads and file generation.
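For illustration only, the client-side build step could be as small as a single genisoimage call (the comment suggests Perl; this is just a bash sketch, ~/slax-build/ stands for a locally prepared Slax tree with the chosen modules already in place, and the boot options needed for an actually bootable CD are omitted):

# build the custom image locally, so the server never has to stream it
genisoimage -J -R -o slax-custom.iso ~/slax-build/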

Tomas, you may want to check out the "HTTP_Download" package at pear.php.net; it may be able to do what you want.
Here is its description:
Provides an interface to easily send hidden files or any arbitrary data to HTTP clients. HTTP_Download can gain its data from variables, files or stream resources.
It features:
- Basic caching capabilities
- Basic throttling mechanism
- On-the-fly gzip-compression
*- Ranges (partial downloads and resuming)
- Delivery of on-the-fly generated archives through Archive_Tar and Archive_Zip
- Sending of PgSQL LOBs without the need to read all data in prior to sending
Hope it will do what you want. If I find anything else I will let you know.

I don't understand the structure of iso files, but I guess an iso file is assembled from the raw data of the files it contains. So maybe you don't have to generate a whole iso file.
Process:
1. Generate the header of the iso instantaneously. (You can cache this part or not.)
2. Find the lzm module and send the output. (raw data?)
So it is possible to resume the iso file without making a WHOLE iso image. You don't even have to store any piece locally. Seek to the breakpoint and start output.
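One hedged way to check that guess: isoinfo (shipped alongside genisoimage in cdrkit) can list the directory records of an image, and its -l listing should show each file's starting extent in brackets, so a module's raw data would begin at extent * 2048 bytes inside the image (worth verifying; slax.iso is a placeholder name):

# list directory records; look for the bracketed starting extent of each .lzm
isoinfo -l -i slax.iso | grep -i lzm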

There is always the obvious solution. Store the files on a public server and delete them after a few days.

@Fabian:
Yes, that sounds very no-nonsense and pragmatic. However, one should estimate the load of the Slax build service and see if there is a public server willing to take it.
Anyway, at this point I'm taking the issue as inspiration for a proof-of-concept thing, and to learn new things. :)

garyzyg's idea is essentially correct. An easy approach might be to create a hacked version of mkisofs that zero-fills files that occur before the resume point. It still has a bit of overhead for seeking through the zero-filled regions, but it saves all of the disk I/O.
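An untested sketch of getting that effect without patching mkisofs itself: build a shadow tree in which every file lying entirely before the resume point is replaced by a sparse file of the same size, so genisoimage keeps the identical layout but reads only zeros for the data that will be discarded anyway. The paths, OFFSET and the list of "early" files below are all made up.

OFFSET=123456789                                    # resume offset requested by the client
mkdir -p /tmp/shadow
for f in core.lzm kde.lzm; do                       # files known to precede the resume point
    SIZE=$(stat -c%s /root/toiso/"$f")
    dd if=/dev/zero of=/tmp/shadow/"$f" bs=1 count=0 seek="$SIZE" 2>/dev/null   # sparse placeholder
done
for f in /root/toiso/*; do                          # share the remaining files via symlinks
    [ -e /tmp/shadow/"$(basename "$f")" ] || ln -s "$f" /tmp/shadow/
done
genisoimage -f -J -quiet /tmp/shadow/ | tail -c +$((OFFSET + 1))   # -f follows the symlinks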

UPDATE (I'm in an industrious mood :) )
Progress:
1) Added a -g option to pass parameters through to genisoimage. So I managed to create and test a bootable Slax CD.
2) Successfully compared a whole ISO against an ISO assembled from two halves produced via isondemand's -s and -e parameters. Their contents don't differ :)
The thing seems to do its job; I'm getting serious about it :)
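One way the comparison in 2) can be done, assuming half1.part and half2.part are whatever the -s and -e runs produced (the names are invented here):

genisoimage -f -J -quiet ~/toiso/ > whole.iso
cat half1.part half2.part | cmp - whole.iso && echo "halves match the whole image"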

Thank you very much for your tests Marco!
> real 6m20.566s <----- THIS IS PRE-PROCESSING TIME
More than 6 minutes to prepare? Sorry, but that's far too long.
> The bottom line is that one wastes time on a one-time
> pre-process overhead on first run, and earns time when
> partially writing isos, because there's no iso-seek
> overhead.
Did you test by using strace, to see the actual system calls?

Ok, Tomas, thanks for your reply.
First, the script was initially written as a proof of concept, that is, without focusing much on speed but rather on feasibility.
I already stated that the script needs MD5 hashes of the files to be included in the ISO, in both hex and base64 form.
The timings you saw include on-the-fly generation of both of them, a real waste of computing resources, since they could be stored in a separate, permanent file.
So I slightly changed the script, adding the possibility to use pre-calculated MD5 hashes, thus reducing computation to the really necessary steps, namely
1) jigdo template files generation
2) actual iso generation
As I already said, step 1) is moreover skipped on subsequent generations of the same ISO.
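The exact format the script expects isn't shown here, so the following is only a guess at how such pre-calculated hashes could be generated (one hex and one base64 digest per module); /root/md5 and /root/md5b64 are the file names used in the tests below:

cd /root/toiso
for f in *.lzm; do
    md5sum "$f" >> /root/md5                                               # hex digests
    printf '%s  %s\n' "$(openssl dgst -md5 -binary "$f" | base64)" "$f" >> /root/md5b64   # base64 digests
done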
Here I post new timing tests using pre-made MD5 hashes, with separate times for steps 1) and 2).
My system's only storage device is a USB pen; if somebody else wants to run tests on a system with faster storage, he's welcome (I uploaded the new script to the same place).
Regarding strace, I didn't even know of it; I'm reading some tutorials and going to try it.
A last reflection: the really critical computing phase is actually 1), because 2) has speeds comparable to mkisofs, so it produces output at a rate usually quite a bit larger than client bandwidth. Since output (step 2)) cannot start before the jigdo template files exist, 1) is a bottleneck.
Anyway, I just had an idea to overcome this impasse so that there is no overhead at all; I'll let you know if it is successful.
For the time being, you can have a look at this.
Ciao,
Marco
root@slax:~# du -hL ~/toiso/
1.1G /root/toiso/
root@slax:~# time genisoimage -f -J -o /dev/null -quiet ~/toiso/
real 0m35.174s
user 0m0.116s
sys 0m1.144s
root@slax:~# time ~/current.sh -d 8 /root/toiso/ > /dev/null
doin jigdo
real 1m30.657s <--- THIS IS STEP 1), CRITICAL (BOTTLENECK) TIME
user 0m20.965s
sys 0m3.028s
root@slax:~# time ~/current.sh -d 1 /root/toiso/ > /dev/null
producing output
real 1m14.320s <--- THIS IS STEP 2) TIME, NOT SO CRITICAL
user 0m44.535s
sys 0m1.276s

One and a half minutes is still a lot of time. If we can get below two seconds, I will use it.
Regarding strace, one simply starts
root@slax:~# strace -o log.txt -f command.sh
and it produces a log.txt file where all system calls are logged. Further examination may show which files are actually opened and read.

We are now below the two-second limit Tomas posed; I'd say we are virtually at ZERO seconds of waiting.
Here's how it works now: since it was indeed quite stupid to launch a dry genisoimage run just to create the template files, I simply take advantage of it to concurrently generate the ISO we need.
The idea sounds straightforward, but in a server environment you have to manage concurrent access to the same file, so I needed to develop a rough semaphore-file arrangement.
Basically, the plan is:
1) An ISO file is requested
2) genisoimage is run, and produces both the template files and the requested ISO
3) From then on, if there is a partial download request, you can use the template files generated in 2)
On top of that, I had to make sure that just one instance actually wrote to the template files, which was the trickiest part.
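This is not Marco's actual code, just a rough sketch of how such a semaphore-file arrangement can look in shell; mkdir is used as the lock because it is atomic, $HASH stands for the per-build identifier (as in the /var/tmp/isondemand/<hash>/ paths visible in the strace output below), and generate_templates_and_iso is a placeholder for the real work:

WORKDIR=/var/tmp/isondemand/$HASH
mkdir -p "$WORKDIR"
if mkdir "$WORKDIR/lock" 2>/dev/null; then
    # we won the race: this instance runs genisoimage, producing both the
    # template files and the ISO stream
    generate_templates_and_iso "$WORKDIR"
    touch "$WORKDIR/templates.ready"            # tell the others the templates are complete
    rmdir "$WORKDIR/lock"
else
    # another instance is already writing the templates; just wait for them
    while [ ! -e "$WORKDIR/templates.ready" ]; do sleep 1; done
fi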
I successfully tested it, and include in the footer some tests like those in previous posts, plus a subset of the strace printout (though I'm not sure which data matters and which doesn't; I hope it is meaningful to some extent).
I have a further major improvement to be assessed (not sure if it is feasible); Tomas, tell me if I should put some effort into it.
All that remains is to decide how the script should access the pre-made MD5 hashes: I need to know whether the server already stores such a thing, and in that case how the text file is formatted; if not, I have a separate script to generate it the way I need.
root@slax:~# du -Lh ~/toiso/
1.6G /root/toiso/
root@slax:~# time genisoimage -J -f -r -quiet -o /dev/null ~/toiso/
real 0m57.774s <--- A WHOLE 1.6G ISO IMAGE IS PRODUCED ON STDOUT
user 0m0.288s
sys 0m1.672s
root@slax:~# time ~/isondemand.sh -l "" -f "/root/md5" -m "/root/md5b64" ~/toiso/ > /dev/null
real 1m38.423s <--- A WHOLE 1.6G ISO IMAGE IS PRODUCED ON STDOUT
user 0m0.388s
sys 0m2.112s
root@slax:~# time ~/isondemand.sh -l "" -m "/root/md5b64" -s 1500000000 ~/toiso/ > /dev/null
real 0m0.597s <--- A ~100MB ISO PIECE IS PRODUCED ON STDOUT
user 0m0.044s
sys 0m0.348s
root@slax:~# strace -f -o ~/strace.txt ~/isondemand.sh -l "" -m "/root/md5b64" -s 1500000000 ~/toiso/ > /dev/null ; cat ~/strace.txt | grep -i open
12265 open("/etc/ld.so.cache", O_RDONLY) = 3
12265 open("/lib/libtermcap.so.2", O_RDONLY) = 3
12265 open("/lib/libdl.so.2", O_RDONLY) = 3
12265 open("/lib/libc.so.6", O_RDONLY) = 3
12265 open("/dev/tty", O_RDWR|O_NONBLOCK|O_LARGEFILE) = 3
12265 open("/proc/meminfo", O_RDONLY) = 3
12265 open("/root/isondemand.sh", O_RDONLY|O_LARGEFILE) = 3
12266 open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 3
12266 open("/etc/ld.so.cache", O_RDONLY) = 3
12266 open("/lib/libc.so.6", O_RDONLY) = 3
12265 open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 3
12265 open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 3
12268 open("/etc/ld.so.cache", O_RDONLY) = 3
12268 open("/lib/libc.so.6", O_RDONLY) = 3
12265 open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 3
12265 open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 3
12265 open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 3
12265 open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 3
12265 open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 3
12269 open("/etc/ld.so.cache", O_RDONLY) = 3
12269 open("/lib/libc.so.6", O_RDONLY) = 3
12265 open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 3
12270 open("/etc/ld.so.cache", O_RDONLY) = 3
12270 open("/lib/libc.so.6", O_RDONLY) = 3
12272 open("/etc/ld.so.cache", O_RDONLY) = 3
12272 open("/lib/libc.so.6", O_RDONLY) = 3
12272 open(".", O_RDONLY|O_LARGEFILE) = 3
12272 open("/root", O_RDONLY|O_LARGEFILE) = 4
12272 open("toiso", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|0x80000) = 4
12272 open("toiso", O_RDONLY|O_LARGEFILE|O_NOFOLLOW) = 4
12272 open(".", O_RDONLY|O_LARGEFILE|O_NOFOLLOW) = 4
12273 open("/etc/ld.so.cache", O_RDONLY) = 3
12273 open("/lib/libm.so.6", O_RDONLY) = 3
12273 open("/lib/librt.so.1", O_RDONLY) = 3
12273 open("/lib/libc.so.6", O_RDONLY) = 3
12273 open("/lib/libpthread.so.0", O_RDONLY) = 3
12273 open("/proc/meminfo", O_RDONLY) = 3
12273 open("/proc/meminfo", O_RDONLY) = 3
12274 open("/etc/ld.so.cache", O_RDONLY) = 3
12274 open("/lib/libc.so.6", O_RDONLY) = 3
12275 open("/etc/ld.so.cache", O_RDONLY) = 3
12275 open("/lib/libc.so.6", O_RDONLY) = 3
12265 open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 3
12276 open("/etc/ld.so.cache", O_RDONLY) = 3
12276 open("/lib/libc.so.6", O_RDONLY) = 3
12265 open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 3
12265 open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 3
12265 open("/dev/null", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 3
12277 open("/dev/null", O_RDONLY|O_LARGEFILE) = 3
12277 open("/dev/null", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 3
12278 open("/etc/ld.so.cache", O_RDONLY) = 3
12278 open("/lib/libc.so.6", O_RDONLY) = 3
12265 open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 3
12279 open("/dev/null", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 3
12279 open("/etc/ld.so.cache", O_RDONLY) = 3
12279 open("/usr/lib/libz.so.1", O_RDONLY) = 3
12279 open("/lib/libbz2.so.1", O_RDONLY) = 3
12279 open("/lib/libc.so.6", O_RDONLY) = 3
12279 open("/root/md5b64", O_RDONLY|O_LARGEFILE) = 3
12279 open("/var/tmp/isondemand/7f035f455d7c24f4b483c7ebf770b7603744c768bbd6148bdc3bfa79fb03bd57/jigdo", O_RDONLY) = 4
12279 open("/var/tmp/isondemand/7f035f455d7c24f4b483c7ebf770b7603744c768bbd6148bdc3bfa79fb03bd57/template", O_RDONLY|O_LARGEFILE) = 4
12279 open("/root/toiso/netpbm-10.26.58_compiled_under_slax-6.0.9-ok.lzm", O_RDONLY|O_LARGEFILE) = 5
... A (incomplete?!) list of LZMs follows

Re-reading my previous post made me want to make a point clear: the genisoimage run timing is reported just for comparison's sake; all you need to do to produce the ISO on stdout is invoke isondemand, and it works from the first run, so the bottleneck is gone in the new version.

Why not make a script that will wget the build and send it to the user, so he can run it whenever he wants?
Something like this list (a quick sketch follows below):
1) wget -c whatever-module
2) save it to a folder
3) repeat for the rest of the modules
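In bash, such a generated list could be as simple as the sketch below; the URLs are placeholders, and wget's -c option makes an interrupted download continue where it stopped instead of starting over:

mkdir -p ~/slax-modules && cd ~/slax-modules
for url in \
    http://example.com/modules/core.lzm \
    http://example.com/modules/kde.lzm
do
    wget -c "$url"    # -c resumes a partial file instead of restarting it
done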

I want to point out that, without the feature being discussed, your Slax is absolutely useless.
Yesterday I spent about 3 hours examining all the modules listed. After that, I spent one more hour including all the required modules in a ~700 MB image (special thanks for the user-friendly handling of the dependency tree). Since then I have tried to download the ISO image more than 10 times, but each time the download broke.
Was it really so difficult to warn users about download problems on the main page of your site?
Now I am already downloading another Linux distribution, and I will never again get involved with Slax, nor recommend it to other people.

Well, if you use bittorrent technologies, the file gets split into many pieces and all you would have to do is seed the torrent. This would give you resume capabilities as well. And since only one user is downloading it, as soon as your ratio becomes 1.0 you can have the server delete the generated iso/tar file.