Extracting information from a lot of images on disk using find

If you need to extract information from a large number of images on disk (and you’re using a *nix system), you could do worse than using find with ImageMagick’s command-line tools.

If you’re unfamiliar with find, I’d recommend reading the beginner’s guide on Linux.ie. find has a terse and initially daunting syntax, but it is one of the most powerful tools available to *nix users, and proficiency with it is massively useful, especially for sysadmins and developers.

Here’s how you’d go about finding all jpg, gif, png and bmp images in a directory, excluding anything in a “thumbs” directory, getting their dimensions, image format and file size, separating each piece of information with a comma and writing it out to a file:

find . -path "*/thumbs/*" -prune -o -type f \(\
 -iname "*.jp*g" -o -iname "*.gif" -o -iname "*.png" -o -iname "*.bmp"  \)\
  -exec identify -format "%i,%wx%h,%m,%[size]\n" {} + > /tmp/images.info

Broken down:

find .

Searches in the current directory (.) – you can specify a path just as easily (find /path/to/directory/)

-path "*/thumbs/*" -prune

Excludes (prunes) any path that matches the preceding pattern, so nothing inside a “thumbs” directory is processed. You can specify this multiple times (or not at all).
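
To see the effect of -prune in isolation, here is a small sketch that runs against a throwaway directory. The directory and file names are made up for the demo, and I’ve used -print as the action in place of -exec so it doesn’t need any images:

```shell
# Build a throwaway tree (all names here are hypothetical, just for the demo).
demo=$(mktemp -d)
mkdir -p "$demo/gallery/thumbs"
touch "$demo/gallery/a.jpg" "$demo/gallery/thumbs/a_small.jpg"

# Without -prune: both files are listed.
all=$(cd "$demo" && find . -type f | sort)

# With -prune: anything whose path matches */thumbs/* is skipped.
# An explicit action (-print here) is needed on the right-hand side of -o.
kept=$(cd "$demo" && find . -path "*/thumbs/*" -prune -o -type f -print | sort)

printf 'all:\n%s\nkept:\n%s\n' "$all" "$kept"

# Clean up the scratch directory.
rm -rf "$demo"
```

The second listing should contain only ./gallery/a.jpg.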

-o

This is the OR operator. If no operator is given between two expressions, AND is implied, and AND binds more tightly than OR.

-type f

Specifies that we’re looking for a file (a directory would be -type d)

\(\
 -iname "*.jp*g" -o -iname "*.gif" -o -iname "*.png" -o -iname "*.bmp" 
 \)\

( opens a group, ) closes it. The backslashes escape the parentheses and newline (I’ve just used the newline to make it more readable). The -iname directive specifies a case-insensitive filename, in this case matching file extensions. The usage of the -o operator is more obvious here, as without it we’d be asking that each file match .jpg AND .png AND .gif – which wouldn’t really work.
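
The grouping matters more than it might look. As a rough sketch (again on a scratch directory with made-up names), compare what happens with and without the parentheses when a directory happens to have an image-like name:

```shell
# Scratch directory; every name below is hypothetical.
demo=$(mktemp -d)
touch "$demo/photo.JPG" "$demo/notes.txt"
mkdir "$demo/icons.png"    # a directory whose name looks like an image

# Grouped: -type f applies to every -iname alternative.
grouped=$(cd "$demo" && find . -type f \( -iname "*.jp*g" -o -iname "*.png" \) | sort)

# Ungrouped: parsed as (-type f AND -iname "*.jp*g") OR (-iname "*.png"),
# so the directory slips through.
ungrouped=$(cd "$demo" && find . -type f -iname "*.jp*g" -o -iname "*.png" | sort)

printf 'grouped:\n%s\nungrouped:\n%s\n' "$grouped" "$ungrouped"

# Clean up the scratch directory.
rm -rf "$demo"
```

The grouped version returns only ./photo.JPG, while the ungrouped one also returns the ./icons.png directory.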

-exec ... {} +

This executes a command on the items found, with the {} placeholder standing in for the found path(s). + is the terminator in this case. \; can also be used (again, backslash as escape), and it runs the command once per file. The + terminator instead passes the results to the command in batches, which performs much better with large numbers of files. This is roughly equivalent to piping into xargs, which is the fallback on older systems that may not have the + terminator available (pre-2005 builds).
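
A quick way to convince yourself of the difference is to count how many times the command actually runs. This is only a sketch: the three .png files are empty placeholders, and the small sh -c command just prints how many arguments it was given.

```shell
# Scratch directory with three placeholder files (hypothetical names).
demo=$(mktemp -d)
touch "$demo/1.png" "$demo/2.png" "$demo/3.png"

# "\;" runs the command once per file: three invocations, one argument each.
per_file=$(find "$demo" -type f -exec sh -c 'echo "$#"' sh {} \; | wc -l | tr -d ' ')

# "+" appends as many paths as fit onto one command line: one invocation here.
batched=$(find "$demo" -type f -exec sh -c 'echo "$#"' sh {} + | wc -l | tr -d ' ')

# The xargs route; -print0 and -0 keep filenames containing spaces intact.
# Both are common extensions, so check they exist on very old systems.
via_xargs=$(find "$demo" -type f -print0 | xargs -0 sh -c 'echo "$#"' sh | wc -l | tr -d ' ')

printf 'per file: %s, batched: %s, xargs: %s\n' "$per_file" "$batched" "$via_xargs"

# Clean up the scratch directory.
rm -rf "$demo"
```

With only three files, both the + form and xargs need a single invocation, against three for \;.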

identify -format "%i,%wx%h,%m,%[size]\n"

In this case, the command we’re executing is ImageMagick’s identify tool. There’s quite a lot of information available here, so it’s prudent to use the -format option to limit the output to what you need. Helpfully, there’s a list of escape characters to let you know what can be extracted.
Here, I’m getting the file path (%i), the width (%w) and the height (%h), with a literal ‘x’ between the two dimensions. After that, there’s the image format (%m) and the file size (%[size]), reported with a unit suffix such as KB. I separate each value with a literal comma and end each line with a newline (\n).

> /tmp/images.info

Finally, rather than printing this information to the screen (the default), we redirect the output into a file in the /tmp directory. If there are a lot of files to process, you won’t immediately see data start to pour in here, as it’ll be batched by the + terminator mentioned before. You’ll probably see it populate in lumps of several thousand lines.
You should get a file containing results that look something like this:

./images/3tm9wzz4z9kzd51168cef0a9cc77ca616916128aaa3d.JPG,640x480,JPEG,22.8KB
./images/226te3jc3m85519d6348418bdde11ee08d77ffd338ff.JPG,626x639,JPEG,44.6KB
./images/2s9262f4uix2e26113b8007a2a3dfadb6aa3fa7aa0ee.JPG,384x288,JPEG,36.6KB
./images/3572wcuya3pi3fb0f68eff3d6104a7b94d5725b2b526.jpg,480x640,JPEG,50.9KB
./images/5wby49rxay9lcc890e914b4d52e9909700f8d5227bb9.jpg,354x142,JPEG,11.9KB
./images/1c6cf3icti8v9c2b997592c0c7c51c25e900969eaec4.JPG,478x640,JPEG,41.4KB
./images/53h1y0x1q37q22d65cc682f6d7994db2510cab013ddf.JPG,478x640,JPEG,28.1KB
./images/4r8ck3kn1ezi809f7d4a63c0fb95b4f07053641bd8d3.JPG,478x640,JPEG,33.5KB
./images/156m118zdn7n4a10fef7d6c88067482f0803db2837e6.JPG,478x640,JPEG,25.5KB
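
Once you have the file, ordinary text tools can take it from there. As a minimal sketch, assuming none of your filenames contain a comma (which would break the naive field splitting), awk can summarise the results. Three lines copied from the sample above stand in for /tmp/images.info here:

```shell
# A temporary stand-in for /tmp/images.info, using lines from the sample output.
info=$(mktemp)
cat > "$info" <<'EOF'
./images/3tm9wzz4z9kzd51168cef0a9cc77ca616916128aaa3d.JPG,640x480,JPEG,22.8KB
./images/1c6cf3icti8v9c2b997592c0c7c51c25e900969eaec4.JPG,478x640,JPEG,41.4KB
./images/53h1y0x1q37q22d65cc682f6d7994db2510cab013ddf.JPG,478x640,JPEG,28.1KB
EOF

# Field 2 is WIDTHxHEIGHT; count how many images share each size.
counts=$(awk -F, '{ n[$2]++ } END { for (d in n) print d "," n[d] }' "$info" | sort)

# Split field 2 on "x" and list the images that are taller than they are wide.
portrait=$(awk -F, '{ split($2, wh, "x"); if (wh[2] + 0 > wh[1] + 0) print $1 }' "$info")

printf 'sizes:\n%s\nportrait:\n%s\n' "$counts" "$portrait"

# Remove the temporary file.
rm -f "$info"
```

For these three lines, the size summary is 478x640,2 and 640x480,1, and the two 478x640 images are listed as portrait.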

If you spot any typos, mistakes or ways you think this might be improved, feel free to let me know.
