rga: ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz



rga is a line-oriented search tool that lets you search for a regex in a multitude of file types. rga wraps the awesome ripgrep and enables it to search in pdf, docx, sqlite, jpg, zip, tar.*, movie subtitles (mkv, mp4), and more.



Examples

PDFs

Say you have a large folder of papers or lecture slides, and you can’t remember which one of them mentioned GRUs. With rga, you can just run this:

~$ rga "GRU" slides/
slides/2016/winter1516_lecture14.pdf
Page 34:   GRU                            LSTM
Page 35:   GRU                            CONV
Page 38:     - Try out GRU-RCN! (imo best model)

slides/2018/cs231n_2018_ds08.pdf
Page  3: ●   CNNs, GANs, RNNs, LSTMs, GRU
Page 35: ● 1) temporal pooling 2) RNN (e.g. LSTM, GRU)

slides/2019/cs231n_2019_lecture10.pdf
Page 103:   GRU [Learning phrase representations using rnn
Page 105:    - Common to use LSTM or GRU

and it will recursively find the string in PDFs, even if some of them are zipped up.

You can do mostly the same thing with pdfgrep -r, but you will miss content in other file types and it will be much slower:

Searching in 65 pdfs with 93 slides each

[bar chart: run time in seconds (lower is better) for pdfgrep, rga (first run), and rga (subsequent runs)]

On the first run rga is mostly faster because of multithreading, but on subsequent runs (with the same files but any regex query) rga will cache the text extraction, so it becomes almost as fast as searching in plain text files. All runs were done with a warm FS cache.
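To reproduce a comparison like this on your own folder, the two tools can be invoked interchangeably; the directory and the queries below are just placeholders:

~$ time pdfgrep -r "GRU" slides/   # extracts text from every PDF on every run
~$ time rga "GRU" slides/          # first run: extracts text and fills the cache
~$ time rga "LSTM" slides/         # later runs reuse the cached text, even for a different regex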

Other files

rga will recursively descend into archives and match text in every file type it knows.

Here is an example directory with different file types:

demo
├── greeting.mkv
├── hello.odt
├── hello.sqlite3
└── somearchive.zip
    ├── dir
    │   ├── greeting.docx
    │   └── inner.tar.gz
    │       └── greeting.pdf
    └── greeting.epub

(see the actual directory here)

~$ rga "hello" demo/

demo/greeting.mkv
metadata: chapters.chapter.0.tags.title="Chapter 1: Hello"
00:08.398 --> 00:11.758: Hello from a movie!

demo/hello.odt
Hello from an OpenDocument file!

demo/hello.sqlite3
tbl: greeting='hello', from='sqlite database!'

demo/somearchive.zip
dir/greeting.docx: Hello from a MS Office document!
dir/inner.tar.gz: greeting.pdf: Page 1: Hello from a PDF!
greeting.epub: Hello from an E-Book!

It can even search jpg / png images and scanned pdfs using OCR, though this is disabled by default since it is not useful that often and pretty slow.

~$ # find screenshot of crates.io
~$ rga crates ~/screenshots --rga-adapters=+pdfpages,tesseract
screenshots/2019-06-14-19-01-10.png
crates.io I Browse All Crates  Docs v
Documentation Repository Dependent crates

~$ # there it is!

Setup

Linux, Windows and OSX binaries are available in GitHub releases. See the readme for more information.

For Arch Linux, I have packaged rga in the AUR: yay -S ripgrep-all

Technical details

The code and a few more details are here: https://github.com/phiresky/ripgrep-all

rga simply runs ripgrep (rg) with some options set, especially --pre=rga-preproc and --pre-glob.
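Conceptually, an rga call is close to invoking rg yourself with the preprocessor hooked in via --pre. The glob list below is only illustrative; the real one is generated from the enabled adapters:

~$ # a rough manual equivalent of `rga "GRU" slides/` (glob list is an assumption, not what rga actually passes)
~$ rg --pre rga-preproc --pre-glob '*.{pdf,docx,odt,epub,sqlite3,zip,tar.gz,mkv,mp4}' "GRU" slides/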

rga-preproc [fname] will match an “adapter” to the given file based on either its filename or its mime type (if --rga-accurate is given). You can see all adapters currently included in src/adapters.
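For example, --rga-accurate helps when a file has a misleading or missing extension, since the adapter is then chosen from the detected mime type rather than the file name (reusing the demo directory from above):

~$ rga --rga-accurate "hello" demo/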

Some rga adapters run external binaries to do the actual work (such as pandoc or ffmpeg), usually by writing to stdin and reading from stdout. Others use a Rust library or bindings to achieve the same effect (like sqlite or zip).
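To get a feel for what such an external-binary adapter produces, here are rough manual equivalents; the file names are placeholders and the exact flags rga passes are defined in its adapter sources:

~$ # docx -> plain text via pandoc
~$ pandoc --from docx --to plain greeting.docx
~$ # mkv subtitles -> srt on stdout via ffmpeg
~$ ffmpeg -loglevel error -i greeting.mkv -map 0:s:0 -f srt pipe:1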

To read archives, the zip and tar libraries are used, which work fully in a streaming fashion, so RAM usage stays low and no data is ever actually extracted to disk!

Most adapters read the files from a Read, so they work completely on streamed data (which can come from anywhere, including from within nested archives).

During extraction, rga-preproc compresses the data with ZSTD into a memory cache while simultaneously writing it uncompressed to stdout. After completion, if the memory cache is smaller than 2 MByte, it is written to an rkv cache. The cache is keyed by (adapter, filename, mtime), so if a file changes, its content is extracted again.
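A quick way to see the (adapter, filename, mtime) keying in action is to bump a file's mtime and watch only that file being re-extracted on the next search (paths reuse the demo directory from above):

~$ rga "hello" demo/      # first run: extracts and caches
~$ rga "hello" demo/      # second run: served from the rkv cache
~$ touch demo/hello.odt   # mtime changes, so this file's cache entry no longer matches
~$ rga "hello" demo/      # hello.odt is extracted again; the other files still hit the cache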

Future Work

  • I wanted to add an image adapter (based on object classification / detection) for fun, so you could grep for “mountain” and it would show photos of mountains, like in Google Images. It worked with YOLO, but integrating something more polished and state-of-the-art proved very hard.
  • 7z adapter (couldn’t find a nice-to-use Rust library with streaming support)
  • Allow per-adapter configuration options (probably via environment variables, e.g. RGA_ADAPTERXYZ_CONF=json)
  • Maybe use a different on-disk kv-store as a cache instead of rkv, because I had some weird problems with it. SQLite is great. All the other Rust alternatives I could find don’t allow writing from more than one process.
  • Tests!
  • There are some more (mostly technical) todos in the code that I don’t know how to fix. Help wanted.
  • Other starting points:
    • pdfgrep
    • this gist has my proof-of-concept version of a caching extractor to use ripgrep as a replacement for pdfgrep
    • this gist is a more extensive preprocessing script by @ColonolBuendia
    • lesspipe is a tool that makes the less pager work with many different file types. Different use case, but related in what it does.
