Friday, June 29, 2012
Friday, June 22, 2012
I had a bunch of email (EML) files scattered around my hard drive. Some of them, I noticed, were displaying a lot of HTML codes. For example, when I opened one (using Thunderbird as the default EML opener), it began with this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> <HTML> <HEAD> <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> <META NAME="Generator" CONTENT="MS Exchange Server version 6.5.7036.0"> <TITLE>RE: Scholar Program</TITLE> </HEAD> <BODY> <!-- Converted from text/rtf format -->
Finding the Offending Files
findstr /r /m /s "<!DOCTYPE HTML PUBLIC" D:\*.eml > D:\findlist.txt
It produced a dozen "Cannot open" error messages. The reason seemed to be that the filenames for those files had funky characters (e.g., #, §). Also, Findlist.txt contained the names of files that did not seem to have the DOCTYPE text specified in the command. DOCTYPE may have appeared in attachments to those files, but I didn't want to be flagging that sort of EML file. So despite a number of variations with FINDSTR and several Google searches, I gave up. I returned to Copernic, searched for the DOCTYPE text (in quotation marks, as shown above), and moved them manually. Copernic had a convenient right-click Move to Folder option, so that helped a little. So now, anyway, despite the imperfections of the process, I apparently had the desired EMLs in a single folder. I would just have to re-sort them back to where they belonged manually.
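For what it's worth, FINDSTR treats a quoted search string containing spaces as several alternative search terms unless the /c: switch is used, and that may account for some of the extra matches. A scripted search is another way to look only at the message portion. The following is a minimal Python sketch, not something used in this project. It assumes that a file is "offending" when the DOCTYPE text appears before any "Content-Disposition: attachment" line; that rule and the names body_has_marker and find_html_emls are illustrative assumptions.

```python
import os

# Both markers are lowercase because the data is lowercased before searching.
MARKER = b"<!doctype html public"
ATTACHMENT = b"content-disposition: attachment"


def body_has_marker(data, marker=MARKER):
    """True if the marker appears before any attachment part begins."""
    lower = data.lower()
    pos = lower.find(marker)
    if pos == -1:
        return False
    att = lower.find(ATTACHMENT)
    return att == -1 or pos < att


def find_html_emls(root):
    """Walk root and return paths of .eml files that pass body_has_marker."""
    hits = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith(".eml"):
                continue
            path = os.path.join(folder, name)
            try:
                with open(path, "rb") as fh:
                    data = fh.read()
            except OSError:
                continue  # unreadable file: skip it rather than stop
            if body_has_marker(data):
                hits.append(path)
    return hits
```

Calling find_html_emls("D:\\") would return a list of paths that could then be written to a file like findlist.txt.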
But I still wasn't sure that everything in that folder was problematic. Basically, I needed to see what the EMLs looked like when they were opened up. Ideally, I would have just clicked a button at this point to convert them to PDF and merge them into a single document, so I could just flip through and identify the problem emails. But I was having problems in my efforts to print EMLs as PDFs. As a poor second-best, I manually opened them all (again, using Thunderbird as my default EML opener), selected the ones needing repair in Windows Explorer, and moved them to a separate folder. To open them, I just did a "DIR /b /a-d > Opener.bat" and modified its contents, using Excel, so that each one started and ended with a quotation mark (actually, CHAR(34)) -- no other command needed -- and then ran Opener.bat. Somehow, this failed to crash my system.
Cleaning Up the Files
After verifying that most of them looked bad (and removing the others), I made copies in another folder, and renamed the copies to .TXT extensions using Bulk Rename Utility. Now I could edit them as text files. My plan was to store up a set of standard search-and-replace items, mostly replacing HTML codes with nothing at all, so as to clean up these files.
I had previously decided on Emacs as my default hard-core text editor, and had taken some first steps in re-learning how to use it. The task at hand was to find advice on how to set up before-and-after lists of text strings to be replaced. It was probably something I could have done in Excel, but I might have had to cook up a separate spreadsheet for each file, and here I was wanting to modify multiple files -- dozens, possibly hundreds -- in one operation. Now, unfortunately, it was looking like Emacs was not going to be as naturally adapted to this task as I had assumed. After a couple of tries, I found a search that did bring up a couple of solutions to related problems. But those solutions still looked pretty manual. Was there some more tried-and-true tool or method for replacing multiple text strings in multiple files?
A different search led to HotHotSoftware, which offered a tool for this purpose for $30. A video seemed to demonstrate that it would work. But, you know, $30 was more than the files were worth. Besides, I wouldn't learn anything useful that way. ReplacePioneer ($39, 21-day trial) looked like it might also do the job. A thread offered a way to do something like it in an unspecified language, presumably Visual Basic. Another thread offered an approach in sed. Another way to not learn anything, but also not to spend $30, was to try the free TexFinderX. Other free options included Nodesoft Search and Replace and Replace Text.
I tried TexFinderX. In its File > Add Folder menu pick, I added the list of files to be changed. I clicked the Replacement Table button, but did not see the Open Table Folder button shown on the webpage. The ReadMe file seemed to say that a new replacement table would appear in the list only after being manually created in the TFXTables subfolder. They advised using an existing table to create a new one. As I viewed their "Accented to None - UTF8.txt" replacement table, I recalled looking into character replacement using Excel formulas. The specific point of comparison was that I had discovered, in that process, that people had invented various character conversion tables that might be suitably implemented with TexFinderX.
But for my own immediate purposes, I needed to see if a TexFinderX replacement table would accept a whole string of characters, to be replaced by nothing or, say, a single space. I was hoping that what I was seeing, there in that "Accented to None" replacement table, was that the "before" and "after" columns were tab-delimited -- that, in other words, I could enter a whole long string, hit the tab key, and then hit the spacebar. I tried that, first saving the "Accented to None" table under the name of "Remove HTML Codes," and then entering "<!DOCTYPE HTML PUBLIC "-//W3C//DTD W3 HTML//EN">" (without the outside quotation marks, of course) and hitting Tab and then Space. I did this on what appeared to be the first replacement line in that "Accented to None" file, right after the line that said /////true/////, as guided by the ReadMe. I hit Enter at the end of that line, and deleted everything after it, removing all the commands they had provided. I also changed the top lines, the ones that explained what the file was about. I saved the file, went into the program's Replacement Table button, and there it was. I selected it and clicked Apply. On second thought, I decided to try it on just one or two files, so I emptied out the list and added back just a couple of files. Then I ran it. It looked like it worked.
I proceeded to add all kinds of other HTML codes to my new Remove HTML Codes replacement table, testing and running and removing more unwanted stuff. I found that it was not necessary to hit Tab and then Space at the end of each line that I wanted to remove; it would remove anything that was on a line by itself, where no other tab-delimited text followed it on the same line. So, basically, I could copy and paste whole chunks of unwanted text into the replacement table, and it would be removed from any files on the list that happened to contain it. It seemed best not to add too many chunks at once, lest I be repeating the same lines: run a few, after eyeballing them for duplication, and then see what was left. It appeared that I could add comments, on these lines in the replacement table, by again hitting Tab after the "replace" value on the line.
I added back some of their original items (modified) to the replacement table. These included the replacement of three spaces with two (which I might run several times to be thorough); the replacement of a Space-CR (Carriage Return) combination with a simple CR (using space-<13> tab <13> to achieve that, and apparently doing the same thing also with <10> in place of <13>). I tried replacing three CRs with two, using <13><13><13> on the same line, but it didn't work. The answer to that seemed to be to replace three pairs of <13><10> with two. I discovered that the conversion process that had mangled these files originally had placed different parts of HTML code sequences on different lines, so I had to break them up into smaller pieces -- but not too small, because I didn't want to be accidentally deleting real text from my emails that happened to look similar to these HTML codes.
I basically worked through all the codes that appeared in one email, and then started in on those that remained in the next after applying my accumulated rules to it, and so forth. After working through the first half-dozen files in the list, I skipped down and ran the accumulated corrections against some others. Running it repeatedly seemed to clear up some issues; possibly it was able to process only one change per line per run. I realized that it would probably not produce perfect results across all cases. It was succeeding, however, in giving me readable text that had previously been concealed beneath a mountain of HTML codes.
I had noticed that the program took a little longer to run as I added more rules to its replacement table. But this did not seem to be due to file processing time: the time did not grow far longer when I added far more files to the list. It was still done within a minute or so in any case. Apparently it was just reading the instructions into memory.
The excess (now blank) lines in the files were the slowest to remove. I ran TexFinderX against the whole list of files at least a half-dozen times, adding a few more codes with the aid of additional spot checks. Unless I was going to check every individual file for additional lingering codes, that appeared to be about as far as TexFinderX was going to take me in this project.
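The replacement-table idea can also be sketched in a few lines of Python. This is only an approximation of the table layout as described above (rules after the /////true///// line, tab-delimited before and after values, an optional comment after a second tab, and codes like <13> standing for character numbers). It is not TexFinderX's actual parser, and the function names are made up for illustration. The repeat-until-stable loop mirrors the experience of having to run the table several times.

```python
import re


def decode(value):
    """Turn codes such as <13> or <10> into the characters they stand for."""
    return re.sub(r"<(\d+)>", lambda m: chr(int(m.group(1))), value)


def parse_table(table_text):
    """Return (before, after) pairs from the lines after /////true/////."""
    rules = []
    started = False
    for line in table_text.splitlines():
        if line.startswith("/////"):
            started = True
            continue
        if not started or not line:
            continue
        parts = line.split("\t")
        before = decode(parts[0])
        # No tab on the line means "replace with nothing".
        after = decode(parts[1]) if len(parts) > 1 else ""
        rules.append((before, after))
    return rules


def apply_rules(text, rules, max_passes=10):
    """Apply every rule repeatedly until nothing changes (or passes run out)."""
    for _ in range(max_passes):
        new = text
        for before, after in rules:
            if before:
                new = new.replace(before, after)
        if new == text:
            break
        text = new
    return text
```

Each file on the list would be read, passed through apply_rules, and written back; keeping backups first would be wise.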
Cleaning Up the Starts and Ends of Files
I had previously used Emacs (http://raywoodcockslatest.blogspot.com/2012/03/choosing-emacs-as-text-editor-with.html) to eliminate unwanted ending material from files. Now I wanted to use a similar process on these files. I also wanted to see if I could adapt that process to remove unwanted material elsewhere in the files.
I had not previously noticed that most if not all of these emails had originally included attachments. As such, they included certain lines after their text, apparently announcing the beginning of the attachment portion. These lines included indications of Content-Type, Content-Transfer-Encoding, and Content-Disposition. These seemed like good places to identify the start of ending material to delete, for purposes of printing a cleaned-up message portion by itself. I now saw that I had made things more difficult for myself by including references to some Content-Type and Content-Transfer-Encoding lines in my list of items to remove in TexFinderX. I had not removed Content-Disposition lines, however, so -- as in the previous use of Emacs -- those would be my focus.
Having already done the initial setup of GNU Emacs as described in the previous post, I set forth to modify the process that I had used previously. After making a backup, the summary version of those steps, as modified, went like this:
- Start Emacs. Open one of the post-TexFinderX emails. Hit F3 to start macro recording. C-End (that is, Ctrl-End, in Emacs-speak) to go to the file's end. Hit C-r and type "Content-Disposition" to back up to its last occurrence of Content-Disposition.
- At this point, modify the previous approach to back up a bit further, in search of the boundary line just preceding the Content-Disposition line. I could have done this by hitting C-r and typing "----------" to find that boundary line, but now I saw that my TexFinderX replacements had deleted that, too, from some of these emails. So instead, I just hit the Up arrow three times, hoping that that would take me to a point before most of the ending material.
- Hit C-space to set the mark. C-End. Del.
- C-s to search for Message-ID. Then C-e to go to the end of that line, and right-arrow to go to the start of the next line. C-Space to set the mark, C-Home, and then Del. That was as much as I could do with this particular email; it was clean, though not ideally formatted.
- C-x C-s to save the file. F4 to end the macro recording. C-x C-k n Macro1 Enter (to name the macro to be Macro1). C-x C-k b 1 (to bind the macro to key 1).
- C-x C-f ~/ Enter (to find my Emacs Home directory). In my case, Home was C:\Users\Ray\AppData\Roaming\.emacs.d. I went there in Windows Explorer and created a new text file named _emacs, with no extension. This was my init file.
- From the Emacs menu: File > Open File > navigate to the new _emacs init file > select and open _emacs. Using the Meta (i.e., Alt) key, I used M-x insert-kbd-macro Enter Macro1 Enter. This hopefully saved my macro to my init file. C-x C-c to save and quit Emacs. A quick look with Notepad confirmed that there was something in _emacs.
- Restart Emacs. Open another of these text emails. Test my macro by typing C-x C-k 1. I got "C-x C-k 1 is undefined." I killed Emacs and, following advice, in Windows Explorer I renamed _emacs to be init.el and tried again. Still undefined. Since _emacs had worked in my previous session, I decided that the advice about init.el might be oriented toward Unix rather than Windows systems, so I changed it back to _emacs. In the Emacs menu, I went to File > Open File > navigate to _emacs > open _emacs. I used C-x 2 to split the window. _emacs appeared in both panes. In the top pane, I went to Buffers > select the text file to be changed. (Apparently it was listed as one of the available buffers because I had already opened it.) So now I was viewing the macro in the bottom pane and the email file in the top pane. I selected the top pane and tried C-x C-k 1 again; still undefined. I found other advice to just use M-x Macro1. That worked. The macro ran in the top pane.
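For comparison, the trimming that the macro performs can be approximated outside Emacs. The Python sketch below is an assumption-laden equivalent, not a translation of the recorded macro: it cuts everything from three lines above the last Content-Disposition line to the end of the file, and then everything from the start of the file through the first Message-ID line. The function name trim_email is hypothetical.

```python
def trim_email(text):
    """Drop the header block through Message-ID and the attachment tail."""
    lines = text.splitlines(keepends=True)

    # Tail: find the last Content-Disposition line, back up three lines
    # (like pressing the Up arrow three times), and cut from there.
    last = None
    for i, line in enumerate(lines):
        if "Content-Disposition" in line:
            last = i
    if last is not None:
        lines = lines[:max(last - 3, 0)]

    # Head: cut everything up to and including the first Message-ID line.
    for i, line in enumerate(lines):
        if "Message-ID" in line:
            lines = lines[i + 1:]
            break

    return "".join(lines)
```

A file containing neither marker comes back unchanged, which is easier to spot-check than a macro that stops partway through.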
Converting Emails to PDF
I had previously used "Notepad /p" to convert a set of TXT files, like these emails, to a set of PDFs. The basic idea was to make a list of files and then use Excel to convert those file paths and names (as needed) to batch commands. I used that same approach here, making sure to set the PDF printer to operate with minimal dialog interruptions. This produced PDFs with "Notepad" at the end of their names. For some reason, Bulk Rename Utility was not able to remove that; I had to use Advanced Renamer instead.
Word had problems printing a number of these Word docs. It crashed repeatedly, during this process, whereas it had sailed right through other stacks of docs that I had converted to PDFs by using the same techniques. It did produce some PDFs. I looked at those, to make sure they turned out OK, and then I had to do a DIR /a-d /b *.pdf > successlist.txt in the output folder to see which docs had been successfully PDFed, and then convert successlist.txt into a batch file full of commands to delete the corresponding DOCs, so that I could try again with the DOCs that didn't convert properly the first time. Before re-running the doc-to-pdf conversion batch file, I opened one of the failed DOCs and printed it to PDF. That went fine, as a manual process. So apparently it was not, in every case, a problem with the file. Ultimately, I used OpenOffice Writer 3.2 and was able to print the remainder manually, using just a few keystrokes per file, with no problems.
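The step of turning successlist.txt into delete commands could also be scripted. This Python sketch assumes that each PDF has the same base name as the DOC it came from, which may not hold for every PDF printer; delete_commands is a hypothetical helper, and it only builds the command text rather than deleting anything.

```python
import os


def delete_commands(pdf_names):
    """Build one 'del' line per PDF, pointing at the DOC with the same base name."""
    commands = []
    for name in pdf_names:
        base, ext = os.path.splitext(name.strip())
        if ext.lower() == ".pdf":
            commands.append(f'del "{base}.doc"')
    return commands
```

The resulting lines could be saved as a batch file and reviewed before running.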
Other extracted attachments were text files. At this point, I had two ways of dealing with these. On one hand, I could have used the same process as I had just used with the Word docs, after changing the command used for .doc files to refer instead to .txt files. I did start to use this approach, but ran into dialogs and potential problems. On the other hand, I could have used the approach of printing to Notepad, as I had used with the emails themselves (above). Before I got too far into this task, though, I noticed that every one of these text files had names like ATT3245657.txt. They also all originated from the same source. I examined a handful of these attachments and decided I could delete them all.
Some extracted attachments were image files -- JPG, GIF, PNG, BMP. I also had a dozen attachments without extensions. I opened the latter in IrfanView. I believe there was an IrfanView setting that allowed it to recognize, as it did, that some of these were actually image files, and to offer to rename them (as PNGs or whatever) accordingly. On the other hand, as I looked through these files, I saw that some of the GIFs were animations. Excluding those, I now had a list of what appeared to be all the attachments that should be treated as image files. I used IrfanView's File > Batch Conversion/Rename option to convert these to PDF.
There were a few miscellaneous file types. For videos, I just took a screenshot in the middle and used that as an indication of what the original attachment had been. One alternative would have been to use something like Shotshooter.bat to produce multiple images conveying a sense of the direction of the images in the video, and then combine those images in a single PDF.
Combining Email and Attachment PDFs
Now I had everything in PDF format. I used Bulk Rename Utility to rename emails and attachments so that, when combined into one folder, each email would come before its associated attachments (if any), and the difference between the two would be readily visible. I combined the files and attachments into one folder and made a list of the files using DIR (above).
Now the goal was to combine the emails that did have attachments with their accompanying attachments. There were probably too many of these to combine them manually, one set at a time, using Acrobat or something like it. I had previously worked out a convoluted approach for automating the merger of multiple PDFs (produced from multiple JPGs), using pdfSAM. Discussion on a SuperUser webpage and elsewhere suggested that pdftk and Ghostscript were alternatives. The instructions for Ghostscript looked more complex than those for pdftk, so I decided to start with pdftk.
I downloaded and unzipped pdftk. As advised, I copied the two files from its bin folder (pdftk.exe and libiconv2.dll) into C:\Windows\System32. I opened a command prompt in some other folder, at random, and typed "pdftk --help." This was supposed to give me the documentation. Instead, it gave me an error:
pdftk.exe - System Error The program can't start because libiconv2.dll is missing from your computer. Try reinstalling the program to fix this problem.
I moved the two files to C:\Windows and tried again. That worked: I got documentation. It scrolled on past the point of recovery. Typing "pdftk --help > documentation.txt" solved the problem, but ultimately it didn't seem to give me anything more than already existed in pdftk's docs subfolder. The next step was to put pdftk to work. It would apparently allow me to specify the files to combine, using a command of this form:
pdftk 1.pdf 2.pdf 3.pdf cat output 123.pdf
My problem was that, at least in some cases, the filenames I was working with were too long to fit on a single line like that, one after the other. I decided a solution would be to take a directory listing, put it into Excel, and use it to create commands for a batch file that would rename the emails and their accompanying attachments, with names like 0001.pdf. I would need to keep the spreadsheet for a while, so as to know what the original filenames were. The original filenames were my guide as to what files needed to be combined together. For this purpose, with one of the original filenames in spreadsheet cell A1, I put the ascending file numbers in cells B1, B2 ... (i.e., 1, 2, ...) and then, in cell C1, I put =REPT("0",4-LEN(B1))&B1&".pdf". Finally, in cell D1, I put ="ren "&CHAR(34)&A1&CHAR(34)&" "&C1. Then I copied the formulas from column D into Notepad, saved them as Renamer.bat, and ran it.
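The same renaming commands can be produced without a spreadsheet. Here is a short Python sketch of what the column C and D formulas do, with rename_commands as an illustrative name; like the spreadsheet, the returned list doubles as the record of which number belongs to which original filename.

```python
def rename_commands(names, width=4):
    """Build 'ren' lines that map each original name to 0001.pdf, 0002.pdf, ..."""
    commands = []
    for number, name in enumerate(names, start=1):
        numbered = str(number).zfill(width) + ".pdf"  # same result as the REPT formula
        commands.append(f'ren "{name}" {numbered}')
    return commands
```

Writing the returned lines to Renamer.bat, one per line, would give the same batch file.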
After doing that renaming, I went back to the spreadsheet for guidance on which of these numbers needed to be combined. Each original filename began with date and time. With few exceptions, this was sufficient to distinguish one email and its attachments from another. So I used =LEFT to extract that identifying information from column A. Then, in the next columns, I used IF statements to compare the extract from one line to the next, concatenate the appropriate filenames with a space between them, and choose which concatenations I would be using. Finally, I added a column to create the appropriate command for the batch file. Instead of the 123.pdf output shown in the example above, I used the original email filename. Where there were no attachments, pdftk would thus just convert the numbered PDF (e.g., 0001.pdf) back to its original name.
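The grouping logic handled by those IF statements can be sketched in Python as well. This version assumes that the list is already sorted so that each email comes just before its attachments, and that the first 16 characters of the original name (a date and time such as "2012-01-02 10.30") identify the group; both the prefix length and the name merge_commands are assumptions for illustration.

```python
from itertools import groupby


def merge_commands(mapping, prefix_len=16):
    """mapping: (original_name, numbered_name) pairs, sorted by original name.

    Returns one pdftk command per email, combining it with its attachments
    and writing the result under the email's original filename.
    """
    commands = []
    for _key, group in groupby(mapping, key=lambda pair: pair[0][:prefix_len]):
        group = list(group)
        inputs = " ".join(numbered for _original, numbered in group)
        output = group[0][0]  # the email itself sorts first in its group
        commands.append(f'pdftk {inputs} cat output "{output}"')
    return commands
```

An email with no attachments forms a group of one, so its command simply writes the numbered PDF back out under the original name.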
I finished with spot checks of various files, with and without attachments, to verify that they had come through the process OK. I was not happy with the remaining junk in the emails themselves, but at least I could tell what they were about now, and they had their attachments with them. Pdftk had proved to be a much easier tool for this project than pdfSAM. This had been an awful lot of work for not terribly much achievement on some not very important files, but at least I had finally worked through all of the steps in the PDF conversion process for Thunderbird emails with attachments.
In a previous post, I looked at replacements for Windows Explorer ("WinEx"), including especially FreeCommander. The runner-up, at that point, was Explorer++. Further experience with FreeCommander prompted me to take a closer look at Explorer++ after all. This post provides further information on these two utilities.
As I used FreeCommander, I was surprised to find that a few right-click (context menu) options were missing. For example, I often used LockHunter to find out why Windows was not letting me move or delete a certain file or folder. But in FreeCommander, I was no longer seeing the context menu question, "What is locking this file?" That option did continue to appear in Explorer++, as it had appeared in WinEx. One possible explanation was that FreeCommander did not offer a 64-bit version, whereas Explorer++ did, and I was using the 64-bit version of LockHunter.
Another problem in both FreeCommander and Explorer++ was that I no longer had the option to create a new text file in a specified folder. That option had been available in WinEx, as I recalled, via File > New > Text File. I was pretty sure there was a way to create a new text file in FreeCommander. It seemed to me that I had done so by accident, once or twice, while trying to do something else with a familiar command from WinEx. But I was not seeing that option on the menu nor in the list of shortcuts, and likewise in Explorer++. Workarounds in either program were to open a command window in the selected folder and type one of these options:
- copy con filename.txt Enter. Then type the text. End with F6 or Ctrl-Z.
- echo [a line of text to put into new text file] > filename.txt
- notepad filename.txt Enter
Unlike in FreeCommander, it was not necessary to display a toolbar listing all drives in Explorer++, because the navigation pane already showed all drives, as in WinEx. Also like WinEx, Explorer++ allowed me to customize the toolbar area by right-clicking on it. By contrast, FreeCommander required me to go to Extras > Settings > View > Toolbar; and once there, I had to save changes to each segment of the toolbar separately. Explorer++ offered more toolbar icons that I was likely to find useful, including Back, Forward, and Up buttons.
Explorer++ did not offer the dual panel option. But in recent weeks, I had not found myself using that option very often in FreeCommander. I tended to prefer to keep my windows to half-screen width (using the half-screen snap available in Windows 7 via WinKey - left- (or right-) arrow), and a half-screen was too narrow for many filenames. Moving from one tab to another was an easier way to work among multiple folders. Explorer++ (unlike FreeCommander) further aided that by offering the option of bookmarking folders. A bookmark would not create a new tab; it would change the focused folder in the already focused tab.
Unlike FreeCommander, Explorer++ offered the option of being treated as a replacement for WinEx. This meant that my Start Menu icon (and other menu picks in various programs) that previously would have opened a Windows Explorer session were now opening an Explorer++ session instead. That option was available via Tools > Options > General tab > Default File Manager. I still had the option of opening Windows Explorer by typing "explorer" in a command box; hence, batch commands designed to open WinEx to a particular folder would still do so.
FreeCommander appeared to offer more command-line options. The options in Explorer++ appeared to be limited to (a) the possibility of listing multiple directories to open when Explorer++ started up, each opening in its own tab and (b) the possibility of opening virtual folders by using their names (e.g., explorer++.exe "control panel"). I did not think I would need the latter. The former would be useful only when dealing with relatively short pathnames; Windows might balk at a command listing several long paths. I obtained information about these options by typing "explorer++.exe /?" at the command prompt. That seemed to work only in the folder where explorer++.exe was located.
Other points of comparison: Both Explorer++ and FreeCommander seemed to remember their window positions better than WinEx had done. Even more so than FreeCommander, Explorer++ displayed much more information onscreen than WinEx: 51 rows, in my configuration. Regrettably, unlike FreeCommander, the status bar in Explorer++ did not state both the number of items selected and the total number of items in the folder. Like FreeCommander, Explorer++ did not offer an Undo option, in case I had accidentally moved or deleted the wrong file or folder. Using Explorer++ or FreeCommander did not stop the annoying "This folder is shared with other people" messages.
As these remarks probably suggest, I found myself gravitating toward Explorer++ shortly after I began using FreeCommander in earnest as my WinEx replacement. There would surely be many more contrasts between the two. But I wasn't sure how many of them I would detect, since by this point it seemed that I would mostly just be using Explorer++.
Thursday, June 21, 2012
I guess I have assumed that almost everybody loves Google, and those who don't are the bad guys. Microsoft, for example. Maybe it takes a huge corporation to stand up to another huge corporation. If so, Google is a champion for those who have disliked various things about how Microsoft got its start, what it did to increase its power, and what it has done with that power.
There comes a point, however, when the good guy turns bad. Maybe it doesn't have to happen. But power tends to corrupt. And even when it doesn't actually corrupt, it tends to create an impression of corruption. That impression may be able, by itself, to make people more or less as miserable as they would be in case of actual corruption and abuse.
Case in point. I have been blogging for years, here in Blogger. I wasn't necessarily eager to see Google acquire Blogger. But they were welcome to do so, for my purposes, as long as they left me alone. The deal was that I got to use their free blogging platform to put out various things that I wanted to write, and they got to use my work, my viewers, etc. to make money from advertising and whatnot.
Gifts can make people resentful when they stop. I would be unhappy with Google if they pulled the plug on my blogging enterprise, even though they're not charging me for it. I have spent years putting stuff here, linking one post to another and so forth. It would take a lot of work -- work that I might never do -- if they were suddenly to just shut it down or screw it up. I would feel that, after all, Google does have competitors, notably WordPress. If nothing else, I'd sooner be paying for a hosted website than to do all this work and then watch it get messed up.
What's sad is that I have been warned that they are quite capable of doing exactly that. It has already happened. Circa 2000, many people were using DejaNews as a convenient gateway to Usenet. Usenet newsgroups contained tons of free, helpful information on a vast array of subjects -- especially but not only computer-related, like this blog. Google acquired DejaNews. Evidently they felt that all that information would interfere with their desire to sell advertising related to webpages. For whatever reason, they basically destroyed Deja. That was a shame, for all those people who could have continued to use it to obtain useful information. And it was irritating to me, because all the things I had put out there, thinking I would always be able to access them, were removed from access as a practical matter, by me and most everyone else.
I was pretty unhappy with Google about that. That was the first big chink in their claim that they would "do no evil," as their corporate motto ("Don't be evil") has been widely reported. They had obviously ruined something useful, for purposes of increasing profits.
That stuff would not be coming back to mind now if I weren't having an off day with Google today. Here I am, working away on my blog, and suddenly it is no longer very functional in Internet Explorer. I have a nice little desktop arrangement, with various browsers, but now Blogger has suddenly ceased to work properly when I try to post or edit. Google lets me know that, instead, I should be using its own browser, Chrome, for this purpose.
That part happened several days ago. So, OK, I have been trying to post in Chrome instead. But I am finding that Chrome is not yet up to speed for this purpose. Google was eager enough to move me over to its browser -- the statements and signals have been out there for some time -- but, lo, it develops that Chrome is inserting white backgrounds. Whole chunks of my post are whited out. Why? I don't know. Probably they don't know either. I am having to go in and manually remove white backgrounds that I didn't put there. Why not just leave me alone, free to work on my blog in Internet Explorer, until Chrome gets its act together?
That seemed like a fair question, so I tried to present it to Google. Problem is, their "Contact Us" webpage is a lie. You cannot contact them through their webpage. Or at least I cannot. I tried today. I tried once before, with a problem so obvious and banal that it pained me to have to bring it to their attention. In that case, I gave up and wrote them a letter. It seemed ironic, and yet telling, that I had to use the U.S. Post Office to communicate a simple thought to one of the world's largest software corporations.
Like most people, I don't like being lied to. If you're not going to let me contact you, don't give me a "Contact Us" webpage. Call it "FAQs" or whatever. It's great that you can hire the best and the brightest, but that can backfire: you can create the impression that you think you're too good for the rest of us. It wouldn't be terribly smart to generate unnecessary resentment, would it?
It had never occurred to me, until today, to search for something that I have now searched for and found. Yes, as it turns out, there does exist something called IHateGoogle.org. I'm not really sure what it's about. I'm not resentful enough to dig into it. But, Google, keep it up: maybe someday I will be. You seem to be making a good start at it: today you tell me that as many as 1.4 million webpages convey that sort of feeling toward you and your actions.
Obviously, I am not the only person who has attempted to communicate with Google along these lines. People rarely get resentful when they feel they are being respected. If Google cannot make its own programs work together -- Chrome and Blogger, in this case -- it is welcome to keep them in beta. But forcing me to use them when I don't want to: at this point, that is a problem. Not just a software problem. As presented in this post, it is an indication of larger and more worrisome things.
Tuesday, June 19, 2012
I was using Thunderbird 11.0.1 in Windows 7. I had accumulated some emails that I wanted to export as individual EML files. An EML would still be readable in Thunderbird, and it would carry any attachments along with it. I had attacked this problem on several previous occasions. As before, I was not sure I would get all the way through from Thunderbird to EML to PDF. This post provides another contribution in the slog toward that outcome.
First Step: From Thunderbird to EML Format
Some of my previous efforts to export to EML and then convert to PDF had produced something of a mess. Exporting, itself, was easy enough. I was using ImportExportTools. It would give me EMLs with names containing some, but not all, of the information that I wanted in file names. Specifically, it would provide the date and time, the sender, and the subject; but it did not include the recipient. I could get it to produce a separate Index.csv file that would contain the full information, but that would just be a spreadsheet file. I could use that spreadsheet file to give me nice names for files; but which file was supposed to get which name? Matching them up had required a surprising amount of manual effort, last time around. I was hoping to make the process smoother, if I could.
It wouldn't help to print a PDF directly from Thunderbird. As far as I knew, that would require me to enter PDF filenames manually. I was looking for a mass-production kind of solution. About.com recommended mbx2eml, but it seemed to have some disadvantages, notably a very limited set of options for the resulting EML filenames -- which was the main problem. Generally, it did not seem that any solution had broken through into prominence, in either the T-bird to EML or T-bird to PDF category.
In my first try at this problem, I had tried Total Thunderbird Converter and Birdie EML to PDF Converter, but for various reasons had not been impressed with either. I did like Attachment Extractor, for when I got to that part of the project. My notes seemed to favor Universal Document Converter (UDC) ($69), if I wanted a direct T-bird-to-PDF solution. As I reviewed the struggles I'd had in that first try at this problem, and also in the second and third tries, I wondered if I should have focused more seriously on UDC. But it did not seem to have command-line capability or other automation features. It was basically a glorified PDF printer. Moreover, its default filenames did not include all the information I wanted.
My previous notes did not seem to mention that Thunderbird messages were apparently already in EML format, stored in Thunderbird subfolders. For instance, I had moved the messages that I was now seeking to export to a Local Folders subfolder called Export, and I could see that folder in Windows Explorer as Mail\Local Folders\Export.mozmsgs. But this was confusing: the number of EML files in that folder was not very close to the number of messages in the Export subfolder in Thunderbird. Anyway, the EMLs in Export.mozmsgs had seemingly random names that would be useless for my purposes.
So I went ahead with ImportExportTools. My first step was to eliminate duplicates. For this, I used Remove Duplicate Messages (Alternate). Then, in Thunderbird, I went to Tools > ImportExportTools > Export all messages in the folder > EML format. The first time around, this produced undesirable results (see below). But I didn't know that until I was partway through the second step.
Second Step: Adding Recipient to the EML File Name
I had my EMLs. But as noted above, I wanted to add the name of the Recipient to the filename, in the format Date-From-To-Subject. As a first step, I thought I would just try to append the Recipient's name to the end of the filename. Then I would figure out how to shuffle the words around to the desired order.
Given my limited knowledge of programming and such, I decided to try to achieve this with a Windows batch file. I struggled to figure out how to write a suitable one, and finally posted a question on it. One of the early answers to that question led to a separate pursuit -- a one-line batch file that would convert Word and WordPerfect documents to PDF.
The answers that I had received, at the point when I was writing up these notes, fell into two categories. One, which I found easier to understand (and, predictably, seemed less popular among the knowledgeable respondents), involved a simple loop that would call an external process. Basically, in plain English, it went like this:
FOR each EML file, run Process.
When list of files is exhausted, quit.
Process starts here.
Do various things.
End of process.
By contrast, the approach preferred by most of the answering individuals would put all the steps inside the loop, instead of having a separate process afterwards. It seemed to be a matter of style. A second difference was that, in discussing the specific steps, they seemed divided between two general possibilities: with, or without, delayed expansion. Delayed expansion was apparently a response to a complication in how the FOR command worked. As I understood it, the computer would read the entire contents of a FOR command as soon as it hit the word FOR, and at that moment it would replace every ordinary variable reference (e.g., %VAR%) inside the loop with whatever value the variable had right then. So assigning a value to a variable inside a FOR loop would be too late; the computer would already have decided what value that variable had. Delayed expansion would postpone that substitution until the line in question actually ran. A variable would be marked for delayed expansion by surrounding it with exclamation marks (e.g., !VAR!). I wasn't familiar with delayed expansion, so I was in accord with some advisors' feeling that it would be better to proceed without it if possible. What they (especially Aacini) suggested was:
@ECHO OFF
IF EXIST fullnames.txt DEL fullnames.txt
FOR %%f IN (*.eml) DO (
FOR /F "delims=" %%l IN ('findstr /B /C:"To: " "%%f"') DO (
IF NOT DEFINED firstfind SET firstfind=now & ECHO %%f%%l >> fullnames.txt
)
REM Clear the flag so the next file's first "To: " line is captured
SET firstfind=
)
I have double-spaced the lines for clarity, anticipating that Blogger will wrap some long lines. I haven't indented the way a programmer would, because of apparent limitations in the formatting options here in Blogger. Basically, this batch file said, give me a fresh output file called Fullnames.txt; and on each line in Fullnames.txt, type the contents of two variables. The first variable, %%f, was the name of the EML file under consideration, in all its Date-Sender-Subject glory. There would be one such filename assignment for each EML file in the folder; hence a FOR loop. The batch file would loop through all EML files in the folder.
Inside that FOR loop, there would be an examination of the contents of each individual EML. This examination would use FINDSTR to locate the first line beginning with "To: ." The contents of that line would be assigned to the %%l variable. (That's an L, not a one.) I wasn't sure why this had to be done inside a second, inner loop, and I also didn't know how the "now" part worked. But I was an openminded individual. I was interested in new ideas. The point is, I was willing to plow ahead and give it a try.
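For anyone who finds batch syntax as opaque as I did, the same idea can be sketched in Python. This is an illustration only, not what was used here: the function name collect_first_to_lines is made up, and the sketch assumes that all the EMLs sit in one folder and that the first line starting with "To: " is the header wanted.

```python
from pathlib import Path


def collect_first_to_lines(folder, output_name="fullnames.txt"):
    """Write '<filename><first To: line>' for each EML in folder.

    Mirrors the batch logic: an outer loop over *.eml files and an inner
    scan that stops at the first line beginning with "To: ".
    """
    folder = Path(folder)
    results = []
    for eml in sorted(folder.glob("*.eml")):
        # latin-1 can decode any byte, so odd encodings do not abort the scan
        with eml.open("r", encoding="latin-1") as handle:
            for line in handle:
                if line.startswith("To: "):
                    results.append(eml.name + line.rstrip("\r\n"))
                    break  # keep only the first match, like the firstfind flag
    # One combined entry per line, as in Fullnames.txt
    text = "".join(entry + "\n" for entry in results)
    (folder / output_name).write_text(text, encoding="utf-8")
    return results
```

Calling collect_first_to_lines on the export folder would produce one combined line per EML that has such a header. An EML with no line starting with "To: " contributes nothing in this sketch, which may be one way a list like this ends up shorter than the folder listing.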
So I copied the foregoing lines of script, beginning with @ECHO OFF and continuing to the last closed parenthesis (")"), and pasted them into a file in Notepad. I saved that file as EMLNamer.bat, and put it into the folder containing the EMLs that I had exported from Thunderbird (above). There, I ran it (either double-click it or highlight and hit Enter). The command window displayed nothing, which was a bit disconcerting; but, viewing the folder in Windows Explorer, I could see Fullnames.txt spring into existence and grow larger.
When it was done, the command window disappeared, and Fullnames.txt stopped getting bigger. I put EMLNamer.bat into a folder where I could find it later. I opened Fullnames.txt and pasted its contents into Excel. Some lines seemed to be missing: not many, but the list was shorter than the total number of files shown in the Windows Explorer folder minus two (for EMLNamer.bat and Fullnames.txt). I guessed that the names of a few EMLs had presented complications for the script. I would have to process the rest and see what remained.
Third Step: Improving the EML File Name
I looked at the new Excel spreadsheet. Spot checks, supplemented by previous experience with ImportExportTools, yielded the following observations:
- The first 13 characters in each filename seemed to match the date and time (in 24-hour format) shown in Thunderbird for the email in question -- the time, that is, when the email was sent or received.
- The next characters indicated the sender. This string ended, in some cases, with three characters (namely, "_-_") and in other cases with just one (namely, "-"). It seemed that ImportExportTools would surround some senders' names with underscores ("_") but would not do so for others. The reason seemed to be that those senders' names appeared within brackets. For instance, I had emails from "[Wordpress.com]" that now appeared as "_WordPress_com_." So at least in these situations, the underscore seemed to be something that I could replace with a space, which would then be removed by an Excel TRIM command if it appeared at the start or end of a string.
- Some senders' names ended with "_com." Ordinarily, the preceding note would suggest replacing that with ".com," and likewise for ".org," ".edu," and so forth. But I decided that step would come later, if at all: instead, I would start by identifying full names (e.g., "Yahoo_com") that I might want to replace with simpler names (e.g., "Yahoo").
- Hyphens were not always a reliable indicator of the end of a sender's name. For example, an email from some "Pan-European" organization came through the ImportExportTools process unchanged.
- ImportExportTools seemed to replace apostrophes with underscores. So instead of "Miller's" I would get "Miller_s_." Likewise for other uses of the apostrophe (e.g., "Don't" became "Don_t_"). It seemed that, before doing any sweeping replacement of underscores, I might want to look for those sorts of special cases.
- Due to the EMLNamer process, the end of the Subject field and the beginning of the Recipient field were marked by ".emlTo:" -- which was certainly recognizable.
- Subject fields often began with things like "Fwd_" and "Re_" -- which, I had decided in a previous use of ImportExportTools, would best be deleted.
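The observations above suggest how one combined line might be pulled apart in code. Here is a rough Python sketch of that, offered as an illustration rather than as the method used in this post. The function names and the sample line are hypothetical, and it assumes that the 13-character stamp looks like YYYYMMDD-HHMM and is followed by a single separator character.

```python
import re

# Leading "Re_", "Fw_" or "Fwd_" markers, possibly repeated
REPLY_PREFIX = re.compile(r"^(?:(?:Re|Fwd?)_\s*)+", re.IGNORECASE)


def split_combined(line):
    """Split '<stamp>-<sender and subject>.emlTo: <recipient>' into parts."""
    original, found, recipient = line.partition(".emlTo:")
    if not found:
        raise ValueError("no .emlTo: marker in line")
    return {
        "original": original,               # filename without the .eml extension
        "stamp": original[:13],             # date and time
        "remainder": original[14:].strip(), # sender and subject, still joined
        "recipient": recipient.strip(),
    }


def strip_reply_prefix(subject):
    """Drop leading Re_/Fw_/Fwd_ markers from a subject."""
    return REPLY_PREFIX.sub("", subject).strip()


parts = split_combined("20120619-1432-Jane Doe-Re_ Lunch.emlTo: Ray W")
print(parts["stamp"], "|", parts["remainder"], "|", parts["recipient"])
```

Separating the sender from the subject is deliberately left out, since (as noted above) hyphens are not a reliable field delimiter.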
When I ran that, I got an index.html file listing relevant information about each file: its subject, from, to, date, and an indication of whether it had attachments. This did not appear likely to be helpful, given its HTML format. In the output folder, there was the right number of files. I ran EMLNamer.bat again. This time, the command window gave me some error messages. Preliminarily, it seemed they were produced by the length of the filenames. I could not save them before the command window closed. There was probably a way to modify EMLNamer.bat to save those messages to a file, but I did not tinker with that at this point. These messages appeared to be in addition to the unknown problems that had prevented Fullnames.txt from containing a complete list of all EMLs: there were now about 20 filenames missing from the output that I pasted into Excel. So, again, those would have to be dealt with manually.
This time around, when I pasted the results from Fullnames.txt into Excel, I saw that the output filenames had characteristics largely similar to, but in some regards different from, those noted above. There were fewer underscores, which meant that it would probably be simpler to develop rules to translate them into more useful characters. Hyphens were still not reliable field-end indicators.
Manipulating the File Information in a Spreadsheet
In Excel, after a couple of false starts not detailed here, I took the following steps:
- Insert row 1 for column headings. Label column A as "Combined." These entries contained the combined original filename plus the "To:" information added by EMLNamer.bat.
- In column B (heading: "Original"), use =LEFT(A2,FIND(".emlTo: ",A2)-1) to obtain the original filename as exported from Thunderbird. I would need this to remain unchanged: my ultimate goal, a batch command indicating how the original filename should be changed, would need this information to tell the command processor what file was being renamed. As with all other columns discussed below, I copied the formula down the column to all rows in use.
- In column C (heading: "Find & Replace"), use =A2. Fix the values in this column -- that is, make them permanent by highlighting them all and using the Edit - Copy, Edit - Paste Special - Values sequence. The shortcut key sequence for Excel 2003 -- which I believed would work in ribbon versions like Excel 2007 and 2010 -- was Alt-E C, Alt-E S V Enter Enter. Now column C contained values rather than formulas.
- Move the values from column C to a new worksheet. Don't rearrange them. I needed a new worksheet because I was going to be using global find-and-replace (Ctrl-H) commands, and I didn't want to have to try to protect columns A and B from being affected by these commands.
- In that new worksheet, I made changes to the list that I had just brought over from column C in the first worksheet. The first thing I did was to search, in Excel, for an unusual character that did not already appear anywhere in the list. The caret ("^") was one such character. I would use this as my field delimiter. I didn't want any of my Subject field entries to begin with "Re" or "Fwd," so I started by replacing "-Re_" and "-Fw_" and "-Fwd_" with carets, gambling (on the basis of previous experience) that there would be few instances where this would prove inadvisable.
- I also replaced the "-_" and "_-" and "-[" combinations with carets. To reduce the number of underscores potentially requiring manual attention, I did one or two additional find-and-replace operations in obvious cases; for example, "Woodcock_s " (ending with a space) became "Woodcock's ." It could have been counterproductive to go too far with this, though. For example, I did not try to remove underscores from every version of my name and email address, because that could have created additional variations on my name, somewhere down in the list, potentially complicating the number of things I would have to look for later. It was better to leave the underscore as a flag for some purposes. Then I cut and pasted that modified list back into column C in the main worksheet.
- Back in the main worksheet, in column D, I set up a Date and Time column, using =LEFT(C2,13). I didn't parse that column for the various year, month, day, hour, and minute components at this point; that could wait until I needed that information.
- In column E, I created my first Remainder column. The purpose of the Remainder columns was to show what was left from the modified values appearing in column C, after removing whatever I had just separated out (in this case, the date and time). The formula was =TRIM(MID(C2,15,LEN(C2))).
- I used column F for the Recipient (i.e., "To") value, from the end of the string appearing in the Remainder column (E). The reason was that this was a fairly obvious entry, and its removal would simplify the next steps. The formula in column F was =TRIM(MID(E2,FIND(".emlTo: ",E2)+7,LEN(E2))).
- Column G could be another Remainder column: =TRIM(LEFT(E2,FIND(".emlTo: ",E2)-1)).
- In column H (heading: "Left 1"), I entered =LEFT(G2,1). The reason was that ImportExportTools had failed to export the names of some senders, notably those appearing in angle brackets ("< >"), and I couldn't identify them by just sorting on the Remainder column because Excel would irritatingly overlook those characters when doing a sort. But now I could sort on column H and make manual entries of those senders' names in the appropriate column. I had not yet created that column, nor made those manual entries, because there was something else I needed to do first:
- In column I, under a "Hyphen" heading, I entered =IF(ISERROR(FIND("-",G2)),"",FIND("-",G2)). In column J (heading: "Caret"), I entered =IF(ISERROR(FIND("^",G2)),"",FIND("^",G2)). Finally, in column K (heading: "Best"), I entered =IF(J2="",I2,J2). Column I would look for the first occurrence of a hyphen in the Remainder (column G). Column J would do likewise for a caret. It was necessary to use both because, at this point, either one might have been the delimiter indicating the end of the Sender field. Column K would favor carets over hyphens, so as to reduce the number of problems with senders with hyphenated names.
- In column L ("Sender"), I used =TRIM(LEFT(G2,K2-1)). This produced good Sender names in most cases. It was not yet time to deal with the exceptions.
- In column M ("Subject"), I used =TRIM(MID(G2,LEN(L2)+1,LEN(G2))). This produced good Subject names in most cases. Now it was time to deal with the exceptions.
- I went back and sorted on column H to identify those rows where I would have to make manual entries of Sender names because none was provided by ImportExportTools. I put those entries in column L as needed, replacing whatever the automatic calculation had put there. To assist in my process of looking up those that I didn't recognize, I sorted the From column in Thunderbird, for the Export folder, to gather all those senders at the top of the list for easier reference; I moved these items into a separate subfolder, sorted by Subject; I maximized the viewable space for that list; and once I had dealt with them, I moved them to another subfolder, so as to reduce the size of the list that I would have to page through. The objective here was just to make sure I had a coherent division of information between the Sender and Subject columns -- to prevent some Sender data from appearing in the Subject column, or vice-versa. Cleaning them up or otherwise improving them at this point would have been premature. Changing Sender names worked best if I made the changes back in column G, or if I altered or removed numbers in columns I and J. Just making a change in the Sender column would leave a problem in column M. It helped, for this purpose, to fix the values in column G (that is, to replace formulas with values; see the procedure described in connection with column C, above).
- I sorted on column M ("Subject") and cleaned up the entries there. I found that I wanted to do find-and-replace operations on multiple entries. I decided at this point that I could safely fix the values of the entire spreadsheet. It seemed that I would want to sort and re-sort these Subject values to get similar ones together. To preserve the original order, I added an Index column, indicating the original numerical order of entries. (Enter 1 and 2 in the first two rows; highlight all rows to be numbered; then hit Alt-E I S Enter.) Then I moved the Subject and Index columns to a separate temporary worksheet, where I could do these sweeping changes without affecting other columns. There, I reversed these two columns, putting Index on the left, to keep it out of harm's way. My changes here included LEFT and RIGHT commands to sort by first and last characters of Subjects (supplemented, on the left, with CODE comparisons, to identify unwanted lowercasing), as well as FIND and Ctrl-H searches and replacements for underscores (doing many replaces to eliminate most instances) and other text that I wanted to change across multiple Subjects. To identify undesirable characters (e.g., exclamation marks and others whose presence in filenames might mess up batch commands and other applications), I used SUBSTITUTELIST. SUBSTITUTELIST would remove the characters listed in a separate worksheet (generated with a series of numbers 1-255 in column A and a corresponding =CHAR(A1) in column B). I could have had it remove characters that looked unwanted, but to be cautious I decided instead to have it remove everything that I knew was normal (i.e., 0-9 and a-z and A-Z, plus a few others) and show me what was left.
- I deleted columns that were unnecessary, now that I had fixed the values. I also moved some columns, and inserted a few new ones. My arrangement was now as follows: Index (column A), Original (B), Date & Time (C), Sender (D), NewSender (E), Recipient (F), NewRecipient (G), and Subject (H).
- I copied values from columns D (Sender) and F (Recipient) to a separate worksheet. There, I did a unique filter. This gave me a list of names that I might want to change or simplify. I put the original (unique) names in column A in that separate worksheet, sorted on that column, and entered the desired replacement names in column B. I named this worksheet Names; I planned to keep it for future Thunderbird EML exports. I named the main worksheet Data. I sped up the process of developing replacement names by using various functions (e.g., FIND, MID) to distinguish first and last names of individuals. When I had my completed list of preferred names for Senders and Recipients, I went back to the main (Data) worksheet. In column E (NewSender), I entered =VLOOKUP(D2,Names!$A$2:$B$869,2,FALSE). (There were 869 rows in the Names worksheet.) I copied that formula over to column G (NewRecipient); it provided a similar replacement for the Recipient values.
- I inserted columns to figure out the date and time. In column D ("Y"), I used =LEFT(C2,4). In column E ("M"), I used =MID(C2,5,2). In column F ("D"), I used =MID(C2,7,2). In column G ("H"), I used =MID(C2,10,2). In column H ("M"), I used =MID(C2,12,2). Finally, in column I ("NewDate"), I used =D2&"-"&E2&"-"&F2&" "&G2&"."&H2.
- I added column O ("New Name"). There, I used =CHAR(34)&I2&" Email from "&K2&" to "&M2&" re "&N2&".eml"&CHAR(34). This produced a new name for the EML file. I sorted on this column to identify instances where my formulas had failed, and made corrections as needed.
- I added column P ("Batch"). There, I used ="ren "&CHAR(34)&B2&".eml"&CHAR(34)&" "&O2. This produced a batch command to rename the EML file to my preferred new name. I copied the command down the column and then copied all those commands, one from each row, to Notepad. I saved the Notepad file as Renamer.bat and put it into the folder where the EMLs were located. I ran Renamer.bat. The renamed files sorted conspicuously in Windows Explorer, so I didn't need to work up a modification of these REN commands in a new column Q, using MOVE instead of REN, to move the newly renamed files to another folder. Instead, I could just cut them from the folder in Windows Explorer and put them aside.
- Now I had a couple dozen EMLs remaining. They had not renamed properly. I probably should have added something like " 2>> errorlist.txt" at the end of each batch command, so that any error messages from REN would accumulate in one file and show me whether I was trying to give the same name to two different files. I did a DIR of the files remaining, saved its output to dirlist.txt, copied the contents of dirlist.txt into Excel, and compared them against my main spreadsheet. To my surprise, none of these files appeared in the original list of files shown there. I'd had some problems not described in this post; had I somehow dropped some EMLs somewhere along the line? Was I not doing this comparison properly? I did not have a clear answer. I worked up another set of new file names for these EMLs, substantially following the steps presented above, and renamed them. It looked like, somehow, at least some of them were duplicates after all. So I was not understanding something there. Others were apparently not renaming properly because the original filenames contained characters like ®.
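The last few steps (reformatting the date, assembling the new name, and renaming) could also be sketched in Python, with a check for the name collisions that I only wondered about afterwards. This is a hedged illustration, not what was run here: new_date, build_name, and rename_all are made-up names, and the stamp is assumed to have the form YYYYMMDD-HHMM, in line with the MID positions used above.

```python
from pathlib import Path


def new_date(stamp):
    """Turn 'YYYYMMDD-HHMM' into 'YYYY-MM-DD HH.MM'."""
    return f"{stamp[0:4]}-{stamp[4:6]}-{stamp[6:8]} {stamp[9:11]}.{stamp[11:13]}"


def build_name(stamp, sender, recipient, subject):
    """Assemble a name in the 'Email from ... to ... re ...' pattern."""
    return f"{new_date(stamp)} Email from {sender} to {recipient} re {subject}.eml"


def rename_all(folder, plan):
    """Rename files per plan, a list of (old_name, new_name) pairs.

    Returns a list of (old_name, reason) for every rename that was skipped,
    so missing sources and duplicate targets are reported instead of lost.
    """
    folder = Path(folder)
    problems = []
    for old_name, new_name in plan:
        source = folder / old_name
        target = folder / new_name
        if not source.exists():
            problems.append((old_name, "source not found"))
        elif target.exists():
            problems.append((old_name, "target name already in use"))
        else:
            source.rename(target)
    return problems
```

The list returned by rename_all plays the role of the error list mentioned above: anything in it would need the same sort of manual attention as the couple dozen leftover EMLs.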