Download recursively with wget

32

I have a problem with the following wget command:

wget -nd -r -l 10 http://web.archive.org/web/20110726051510/http://feedparser.org/docs/

It should recursively download all of the documents linked from the original site, but it downloads only two files (index.html and robots.txt).

How can I get a recursive download of this site?

wget

— xralf

40

By default, wget honors the robots.txt standard when crawling pages, just as search engines do, and archive.org's robots.txt disallows the entire /web/ subdirectory. To override this, use -e robots=off:

wget -nd -r -l 10 -e robots=off http://web.archive.org/web/20110726051510/http://feedparser.org/docs/

— Ulrich Schwarz
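
Before overriding robots.txt, it can help to see what the site actually disallows. The following is only a minimal sketch: show_disallowed is a hypothetical helper name, not part of wget, and it assumes robots.txt is served from the site root.

```shell
# Hypothetical helper: print the Disallow rules from a site's robots.txt.
# $1 is the site root, for example http://web.archive.org
show_disallowed() {
    # -q silences wget's progress output; -O - writes the fetched file to stdout.
    # ${1%/} strips one trailing slash so the URL does not end up with "//".
    wget -q -O - "${1%/}/robots.txt" | grep -i '^[[:space:]]*Disallow:'
}

# Usage (needs network access):
#   show_disallowed http://web.archive.org
```

If the listing contains a rule that covers the path you want, that is the rule -e robots=off tells wget to ignore.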

Thank you. Is there an option to save each link only once? Maybe I should lower 10 to a smaller number, but it's hard to guess. Right now there are files introduction.html, introduction.html.1, and introduction.html.2, so I ended the process instead.

— xralf

And the links point to the web. Is --mirror the option for making links point to the file system?

— xralf

1

@xralf: well, you're using -nd, so different index.html files are put in the same directory, and without -k you won't get the links rewritten.

— Ulrich Schwarz
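
The two points in this comment can be sketched as a variant of the command from the answer: drop -nd so wget recreates the remote directory layout, and add -k so links are rewritten to the local copies. The wrapper name fetch_docs is hypothetical, and whether this removes every .1/.2 duplicate for a given site is an assumption rather than something tested here.

```shell
# Hypothetical wrapper around the command from the answer, with two changes
# taken from the comment: -nd is dropped and -k is added.
fetch_docs() {
    # -r -l 10        recurse up to 10 levels deep
    # -k              rewrite links in saved pages to point at the local copies
    # -e robots=off   ignore robots.txt, as in the answer
    wget -r -l 10 -k -e robots=off "$1"
}

# Usage (needs network access):
#   fetch_docs http://web.archive.org/web/20110726051510/http://feedparser.org/docs/
```

Without -nd, files with the same name from different remote directories land in different local directories, so they no longer collide.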

12

$ wget --random-wait -r -p -e robots=off -U Mozilla \
    http://web.archive.org/web/20110726051510/http://feedparser.org/docs/

Recursively downloads the content of the URL.

--random-wait - vary the delay between requests from 0.5 to 1.5 times the interval set with --wait.
-r - turn on recursive retrieving.
-p - download the page requisites (images, stylesheets, and so on) needed to display each page.
-e robots=off - ignore robots.txt.
-U Mozilla - set the "User-Agent" header to "Mozilla". A better choice, though, is a real User-Agent such as "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729)".

Some other useful options are:

--limit-rate=20k - limit the download speed to 20 KB/s (kilobytes per second).
-o logfile.txt - log the downloads.
-l 0 - remove the recursion depth limit (the default maximum depth is 5).
--wait=1h - be sneaky, download one file every hour.

— Nikhil Mulley
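
As a rough sketch of how the throttling options fit together: --random-wait scales the interval given by --wait, so the two are normally used as a pair. The function name polite_fetch, the 2-second interval, and the log file name wget.log are all illustrative assumptions.

```shell
# Hypothetical wrapper that combines the throttling options listed above.
polite_fetch() {
    # --wait=2          base delay of 2 seconds between requests (assumed value)
    # --random-wait     vary each delay between 0.5x and 1.5x of --wait
    # --limit-rate=20k  cap bandwidth at about 20 KB/s
    # -o wget.log       write wget's messages to wget.log (hypothetical name)
    wget -r -e robots=off --wait=2 --random-wait --limit-rate=20k \
         -o wget.log "$1"
}

# Usage (needs network access):
#   polite_fetch http://web.archive.org/web/20110726051510/http://feedparser.org/docs/
```

With these values each pause falls somewhere between 1 and 3 seconds.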

"-l 0 - remove recursion depth (which is 5 by default)": +1

— Dani