Saya memiliki ribuan dokumen dan beberapa di antaranya dipindai. Jadi saya perlu skrip untuk menguji semua file PDF milik direktori. Apakah ada cara sederhana untuk melakukan itu?
- Sebagian besar PDF adalah laporan. Dengan demikian mereka memiliki banyak teks.
Mereka sangat berbeda, tetapi yang dipindai seperti yang disebutkan di bawah ini dapat menemukan beberapa teks karena proses OCR genting ditambah dengan pemindaian.
Usulan dari Sudodus dalam komentar di bawah ini tampaknya sangat menarik. Lihatlah perbedaan antara PDF yang dipindai dan yang tidak dipindai:
Dipindai:
grep --color -a 'Image' AR-G1002.pdf
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 340615/Name/Obj13/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 40452/Name/Obj18/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 41680/Name/Obj23/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 41432/Name/Obj28/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 59084/Name/Obj33/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 472681/Name/Obj38/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 469340/Name/Obj43/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 371863/Name/Obj48/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 344092/Name/Obj53/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 59416/Name/Obj58/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 48308/Name/Obj63/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 51564/Name/Obj68/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 63184/Name/Obj73/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 40824/Name/Obj78/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 23320/Name/Obj83/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 31504/Name/Obj93/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 18996/Name/Obj98/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 292932/Name/Obj103/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 27720/Name/Obj108/Subtype/Image/Type/XObject/Width 1698>>stream
<rdf:li xml:lang="x-default">Image</rdf:li>
<rdf:li xml:lang="x-default">Image</rdf:li>
Tidak dipindai:
grep --color -a 'Image' AR-G1003.pdf
<</Lang(en-US)/MarkInfo<</Marked true>>/Metadata 167 0 R/Pages 2 0 R/StructTreeR<</Contents 4 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F3 9 0 R/F4 11 0 R/F5 13 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI]>>/StructParents 0/Tabs/S/Type/<</Filter/FlateDecode/Length 5463>>stream
<</BaseFont/Times#20New#20Roman,Bold/Encoding/WinAnsiEncoding/FirstChar 32/FontD<</Ascent 891/AvgWidth 427/CapHeight 677/Descent -216/Flags 32/FontBBox[-558 -216 2000 677]/FontName/Times#20New#20Roman,Bold/FontWeight 700/ItalicAngle 0/Leadi<</BaseFont/Times#20New#20Roman/Encoding/WinAnsiEncoding/FirstChar 32/FontDescri<</Ascent 891/AvgWidth 401/CapHeight 693/Descent -216/Flags 32/FontBBox[-568 -216 2000 693]/FontName/Times#20New#20Roman/FontWeight 400/ItalicAngle 0/Leading 42<</BaseFont/Arial,Bold/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 10 0<</Ascent 905/AvgWidth 479/CapHeight 728/Descent -210/Flags 32/FontBBox[-628 -210 2000 728]/FontName/Arial,Bold/FontWeight 700/ItalicAngle 0/Leading 33/MaxWidth<</BaseFont/Times#20New#20Roman,Italic/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 12 0 R/LastChar 118/Name/F4/Subtype/TrueType/Type/Font/Widths 164 0 <</Ascent 891/AvgWidth 402/CapHeight 694/Descent -216/Flags 32/FontBBox[-498 -216 1333 694]/FontName/Times#20New#20Roman,Italic/FontWeight 400/ItalicAngle -16.4<</BaseFont/Arial/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 14 0 R/La<</Ascent 905/AvgWidth 441/CapHeight 728/Descent -210/Flags 32/FontBBox[-665 -210 2000 728]/FontName/Arial/FontWeight 400/ItalicAngle 0/Leading 33/MaxWidth 2665<</Contents 16 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R>>/ProcSet[<</Filter/FlateDecode/Length 7534>>streamarents 1/Tabs/S/Type/Page>>
<</Contents 18 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R>>/ProcSet[<</Filter/FlateDecode/Length 6137>>streamarents 2/Tabs/S/Type/Page>>
<</Contents 20 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R/F6 21 0 R><</Filter/FlateDecode/Length 6533>>stream>>/StructParents 3/Tabs/S/Type/Page>>
<</BaseFont/Times#20New#20Roman/DescendantFonts 22 0 R/Encoding/Identity-H/Subty<</BaseFont/Times#20New#20Roman/CIDSystemInfo 24 0 R/CIDToGIDMap/Identity/DW 100<</Ascent 891/AvgWidth 401/CapHeight 693/Descent -216/Flags 32/FontBBox[-568 -216 2000 693]/FontFile2 160 0 R/FontName/Times#20New#20Roman/FontWeight 400/Italic<</Contents 27 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</ExtGState<</GS28 28 0 R/GS29 29 0 R>>/Font<</F1 5 0 R/F2 7 0 R/F3 9 0 R/F5 13 0 R/F6 21 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC<</Filter/FlateDecode/Length 5369>>streamge>>
Jumlah gambar per halaman jauh lebih besar (sekitar satu per halaman)!
pdffile berisi gambar (dimasukkan dalam dokumen di samping teks atau sebagai seluruh halaman, 'pdf yang dipindai'), file tersebut sering (mungkin selalu) berisi string /Image/, yang dapat ditemukan dengan baris perintah grep --color -a 'Image' filename.pdf. Ini akan memisahkan file yang hanya berisi teks dari yang berisi gambar (gambar halaman penuh serta halaman teks dengan logo kecil dan gambar ilustrasi berukuran sedang).