Had to identify certain files in a site consisting of 12,000+ pages. The files I was looking for were ones that did not have a Dreamweaver template applied to them. grep and gawk to the rescue…
First I did a grep, going recursively through the site, using the -c option. This counts the number of lines in each file that match the expression you're looking for (not every individual occurrence). For every file searched it prints the filename followed by a colon and the count, so files with no match show up as filename:0. I saved the results into a file.
grep -cr "searchstring" . > path/to/outputfile.txt
Since I was after files that did NOT have the search string, I used this:
grep -cr "searchstring" . | grep :0 > path/to/outputfile.txt
I then ran gawk against the results file to produce a new file listing every file for which grep found no matches.
gawk "/:0/ {print}" resultsfile.txt > newfile.txt
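The count-and-filter step can be tried on a throwaway directory first. This is only a sketch: the directory, file names and file contents are made up, and it assumes a POSIX shell with grep and awk available (Cygwin, for example). It anchors the match to the end of the line and strips the :0 suffix so that only the paths remain.

```shell
# Build a small throwaway site (hypothetical file names and contents).
demo=$(mktemp -d)
mkdir -p "$demo/site/sub"
printf '<!-- InstanceBegin template="/Templates/main.dwt" -->\n' > "$demo/site/index.html"
printf '<html><body>no template here</body></html>\n' > "$demo/site/sub/plain.html"
printf '<html><body>none here either</body></html>\n' > "$demo/site/sub/about.html"

# Count matching lines per file, keep only files whose count is exactly 0,
# then strip the trailing ":0" so the output is a plain list of paths.
untemplated=$(cd "$demo/site" && grep -cr "InstanceBegin" . \
  | awk '/:0$/ { sub(/:0$/, ""); print }' | sort)

printf '%s\n' "$untemplated"

# Remove the throwaway directory.
rm -rf "$demo"
```

Anchoring with $ means only a count of exactly 0 at the end of the line is accepted, rather than a :0 that happens to appear somewhere inside a path.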
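The results in newfile contained paths to binary files such as images, as well as the various other files that Dreamweaver creates when using Design Notes or Check In/Check Out. To remove these, I ran gawk again. Note that the output has to go to a different file (filtered.txt here), because redirecting onto the input file empties it before gawk gets to read it. The dots are escaped so they match a literal dot rather than any character:
gawk "!/images|\.JPG|\.mno|_notes|\.LCK|\.css|\.xml/ {print}" newfile.txt > filtered.txt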
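A slightly stricter version of that exclusion filter might look like the sketch below. It assumes the list holds bare paths, one per line, with any :0 suffix already removed. The sample paths are invented for illustration. It uses grep -E -i -v instead of gawk, which is a swapped-in choice and not what I actually ran.

```shell
# Hypothetical list of candidate paths, one per line.
candidates='./index.html
./images/logo.JPG
./_notes/index.html.mno
./about.asp.LCK
./css/site.css
./docs/xmlguide.html
./news/rss.xml'

# Drop anything inside an images or _notes directory, plus any file whose
# extension is jpg, mno, lck, css or xml (matched case-insensitively).
pages=$(printf '%s\n' "$candidates" \
  | grep -Eiv '(^|/)(images|_notes)/|\.(jpg|mno|lck|css|xml)$')

printf '%s\n' "$pages"
```

Anchoring the extensions to the end of the line keeps a page such as docs/xmlguide.html in the list. An unescaped .xml would throw it out, because the dot there matches the slash before "xml".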
If I were more proficient with regular expressions, I might have been able to deal with that a little better, for example by running the initial grep against only the asp, htm and html files to find the occurrences of InstanceBegin. Perhaps a visitor to this site who comes across this post might have a more efficient routine. I did try using xargs, but I kept getting errors.
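One possible shortcut, sketched here on a made-up directory tree, is to let grep do the whole job. It assumes a GNU grep recent enough to support --include, which the UnxUtils build may or may not be, and the -print0 / -0 options of GNU find and xargs. File names and contents are hypothetical. I can only guess at why xargs failed for me, but file names containing spaces are a common cause, and the null-separated form avoids that.

```shell
# Build a small throwaway site (hypothetical file names and contents).
demo=$(mktemp -d)
mkdir -p "$demo/site/images" "$demo/site/_notes"
printf '<!-- InstanceBegin template="/Templates/main.dwt" -->\n' > "$demo/site/index.html"
printf '<html>no template</html>\n' > "$demo/site/old.htm"
printf 'plain asp page\n' > "$demo/site/form.asp"
printf 'not really an image\n' > "$demo/site/images/logo.JPG"
printf 'design note\n' > "$demo/site/_notes/index.html.mno"

# Variant 1: -r recurses, -L lists only files with no match, and
# --include limits the search to files whose names match the patterns.
missing=$(cd "$demo/site" && grep -rL \
  --include='*.asp' --include='*.htm' --include='*.html' \
  "InstanceBegin" . | sort)

# Variant 2: find selects the files, xargs hands them to grep.
# Null separators keep file names with spaces intact.
missing_via_find=$(cd "$demo/site" && find . -type f \
  \( -name '*.asp' -o -name '*.htm' -o -name '*.html' \) -print0 \
  | xargs -0 grep -L "InstanceBegin" | sort)

printf '%s\n' "$missing"

# Remove the throwaway directory.
rm -rf "$demo"
```

Either variant skips the images and Dreamweaver housekeeping files up front, so no second filtering pass is needed.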
Did all this at the office with UnxUtils (and the UnxUpdates applied) on a Windows 2000 workstation. At home, I use Cygwin (base installation) on XP Pro x64.