rga is a line-oriented search tool that allows you to look for a regex in a multitude of file types. rga wraps the awesome ripgrep and enables it to search in pdf, docx, sqlite, jpg, zip, tar.*, movie subtitles (mkv, mp4), etc.
Say you have a large folder of papers or lecture slides, and you can't remember which one of them mentioned
GRUs. With rga, you can just run this:
~$ rga "GRU" slides/
slides/2016/winter1516_lecture14.pdf
Page 34: GRU LSTM
Page 35: GRU CONV
Page 38: - Try out GRU-RCN! (imo best model)
slides/2018/cs231n_2018_ds08.pdf
Page 3: ● CNNs, GANs, RNNs, LSTMs, GRU
Page 35: ● 1) temporal pooling 2) RNN (e.g. LSTM, GRU)
slides/2019/cs231n_2019_lecture10.pdf
Page 103: GRU [Learning phrase representations using rnn
Page 105: - Common to use LSTM or GRU
and it will recursively find a string in pdfs, including when some of them are zipped up.
You can do mostly the same thing with pdfgrep -r, but you will miss content in other file types, and it will be much slower:
[Benchmark chart: searching in 65 pdfs with 93 slides each – run time in seconds, lower is better]
On the first run rga is mostly faster because of multithreading, but on subsequent runs (with the same files but any regex query) rga will cache the text extraction, so it becomes almost as fast as searching in plain text files. All runs were done with a warm FS cache.
rga will recursively descend into archives and match text in every file type it knows.
Here is an example directory with different file types:
demo
├── greeting.mkv
├── hello.odt
├── hello.sqlite3
└── somearchive.zip
    ├── dir
    │   ├── greeting.docx
    │   └── inner.tar.gz
    │       └── greeting.pdf
    └── greeting.epub
(see the actual directory here)
~$ rga "hello" demo/
demo/greeting.mkv
metadata: chapters.chapter.0.tags.title="Chapter 1: Hello"
00:08.398 --> 00:11.758: Hello from a movie!
demo/hello.odt
Hello from an OpenDocument file!
demo/hello.sqlite3
tbl: greeting='hello', from='sqlite database!'
demo/somearchive.zip
dir/greeting.docx: Hello from a MS Office document!
dir/inner.tar.gz: greeting.pdf: Page 1: Hello from a PDF!
greeting.epub: Hello from an E-Book!
It can even search jpg / png images and scanned pdfs using OCR, though this is disabled by default since it is not useful that often and pretty slow.
~$ # find screenshot of crates.io
~$ rga crates ~/screenshots --rga-adapters=+pdfpages,tesseract
screenshots/2019-06-14-19-01-10.png
crates.io I Browse All Crates Docs v
Documentation Repository Dependent crates
~$ # there it is!
Linux, Windows and OSX binaries are available in GitHub releases. See the readme for more information.
For Arch Linux, I have packaged rga in the AUR:
yay -S ripgrep-all
The code and a few more details are here: https://github.com/phiresky/ripgrep-all
rga simply runs ripgrep (rg) with some options set, especially --pre=rga-preproc and --pre-glob.
rga-preproc [fname] will match an "adapter" to the given file based on either its filename or its mime type (if
--rga-accurate is given). You can see all adapters currently included in src/adapters.
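A minimal Rust sketch of that matching step. The extension table and adapter names below are illustrative only, not rga's actual registry:

```rust
use std::collections::HashMap;
use std::path::Path;

// Hypothetical sketch of adapter selection: extension match first,
// mime type as fallback (as when --rga-accurate is given).
fn match_adapter(fname: &str, mime: Option<&str>) -> Option<&'static str> {
    // Illustrative extension -> adapter table.
    let by_ext: HashMap<&str, &'static str> = [
        ("pdf", "poppler"),
        ("docx", "pandoc"),
        ("zip", "zip"),
        ("sqlite3", "sqlite"),
    ]
    .into_iter()
    .collect();

    // Filename (extension) match comes first.
    if let Some(ext) = Path::new(fname).extension().and_then(|e| e.to_str()) {
        if let Some(adapter) = by_ext.get(ext) {
            return Some(*adapter);
        }
    }
    // Fall back to the detected mime type.
    match mime? {
        "application/pdf" => Some("poppler"),
        "application/zip" => Some("zip"),
        _ => None,
    }
}

fn main() {
    assert_eq!(match_adapter("slides.pdf", None), Some("poppler"));
    assert_eq!(match_adapter("noext", Some("application/pdf")), Some("poppler"));
    assert_eq!(match_adapter("noext", None), None);
    println!("ok");
}
```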
Some rga adapters run external binaries to do the actual work (such as pandoc or ffmpeg), usually by writing to stdin and reading from stdout. Others use a Rust library or bindings to achieve the same effect (like sqlite or zip).
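The stdin/stdout pattern looks roughly like this in Rust; `cat` stands in for a real extractor such as pandoc so the sketch stays runnable:

```rust
use std::io::{Read, Write};
use std::process::{Command, Stdio};

// Pipe the file's bytes to the child's stdin and read the extracted
// text back from its stdout. Fine for small inputs; real code would
// write from a separate thread to avoid pipe deadlocks on large files.
fn run_through(cmd: &str, input: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut child = Command::new(cmd)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    // The ChildStdin is dropped right after the write,
    // closing the pipe so the child sees EOF.
    child.stdin.take().unwrap().write_all(input)?;
    let mut out = Vec::new();
    child.stdout.take().unwrap().read_to_end(&mut out)?;
    child.wait()?;
    Ok(out)
}

fn main() {
    let out = run_through("cat", b"Hello from a PDF!").unwrap();
    assert_eq!(out, b"Hello from a PDF!".to_vec());
    println!("ok");
}
```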
To read archives, the tar and zip libraries are used, which work fully in a streaming fashion – this means that RAM usage is low and no data is ever actually extracted to disk!
Most adapters read the files from a Read, so they work fully on streamed data (which can come from anywhere, including within nested archives).
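A sketch of why the `Read` abstraction matters: an adapter written against any `Read` source works identically whether the bytes come from a file on disk or an entry streamed out of a nested archive (the `Cursor` below stands in for such a stream):

```rust
use std::io::{BufRead, BufReader, Cursor, Read};

// An adapter written against any `Read` works on streamed data:
// the source can be a file, a decompressor, or a nested archive entry.
fn lines_matching<R: Read>(source: R, needle: &str) -> Vec<String> {
    BufReader::new(source)
        .lines()
        .filter_map(Result::ok)
        .filter(|line| line.contains(needle))
        .collect()
}

fn main() {
    // A Cursor over an in-memory buffer stands in for a streamed archive entry.
    let entry = Cursor::new("greeting\nhello world\ngoodbye\n");
    let hits = lines_matching(entry, "hello");
    assert_eq!(hits, vec!["hello world"]);
    println!("ok");
}
```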
During extraction, rga-preproc will compress the data with ZSTD to a memory cache while simultaneously writing it uncompressed to stdout. After completion, if the memory cache is smaller than 2 MB, it is written to an rkv cache. The cache is keyed by (adapter, filename, mtime), so if a file changes, its content is extracted again.
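The cache policy can be sketched like this – an in-memory map instead of rkv, and no actual ZSTD compression, since neither is in the Rust standard library; only the keying and the size cap are illustrated:

```rust
use std::collections::HashMap;

// Cache key: (adapter, filename, mtime). A changed mtime produces a
// new key, so stale extractions are never served.
type CacheKey = (String, String, u64);

// Real rga compresses with ZSTD and persists to rkv; this in-memory
// map only illustrates the keying and the ~2 MB size cap.
struct ExtractionCache {
    max_bytes: usize,
    entries: HashMap<CacheKey, Vec<u8>>,
}

impl ExtractionCache {
    fn put(&mut self, key: CacheKey, extracted: Vec<u8>) {
        // Only extractions under the cap are persisted.
        if extracted.len() <= self.max_bytes {
            self.entries.insert(key, extracted);
        }
    }
    fn get(&self, key: &CacheKey) -> Option<&Vec<u8>> {
        self.entries.get(key)
    }
}

fn main() {
    let mut cache = ExtractionCache {
        max_bytes: 2 * 1024 * 1024,
        entries: HashMap::new(),
    };
    let key: CacheKey = ("poppler".into(), "slides.pdf".into(), 1_560_000_000);
    cache.put(key.clone(), b"Page 1: GRU".to_vec());
    assert!(cache.get(&key).is_some());

    // Same file, newer mtime: a different key, hence a cache miss,
    // and the file gets extracted again.
    let newer: CacheKey = ("poppler".into(), "slides.pdf".into(), 1_560_000_001);
    assert!(cache.get(&newer).is_none());
    println!("ok");
}
```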
- I wanted to add an image adapter (based on object classification / detection) for fun, so you could grep for "mountain" and it would show pictures of mountains, like in Google Images. It worked with YOLO, but something more useful and state-of-the-art like this proved very hard to integrate.
- 7z adapter (couldn't find a usable Rust library with streaming)
- Allow per-adapter configuration options (probably via env (RGA_ADAPTERXYZ_CONF=json))
- Maybe use a different disk kv-store as a cache instead of rkv, because I had some weird problems with it. SQLite is great. All other Rust alternatives I could find don't allow writing from multiple processes.
- There are some more (mostly technical) todos in the code I don't know how to fix. Help wanted.
- Other starting points:
- this gist has my proof-of-concept version of a caching extractor to use ripgrep as a replacement for pdfgrep.
- this gist is a more extensive preprocessing script by @ColonolBuendia
- lesspipe is a tool to make less work with many different file types. A different use case, but related in what it does.