Monday, April 28, 2008

Batch Processing Word Documents using PowerShell

A friend of mine was organizing a public event, the event has a website and they were asking for registration through a registration webpage or filling a word document and send it by email.

My friend got hundreds of documents on his email, and hence came the problem of processing those documents.

Let us see how we can automate processing those documents using PowerShell.

Firstly, we need to de-attach all the documents from the emails, and save them into a folder. Although this is an interesting thing to accomplish using PowerShell, but my friend had this already done for me using Attachment Extractor <Okay … he was using ThunderBird and this post is not intended for arguing that I will prefer Outlook>

Now, we have all the documents in one folder; or many folders. (We can simply gather them from the whole drive using only 2 lines of PowerShell)

Secondly, let us process all those documents and get all the data out of them into a CSV file ready for Excel, or importing into a database.

1. Getting the folder path and open Word in Hidden mode

$docPath = $args[0]

$all_docs = Get-ChildItem $docPath -filter "*.docx"

$word = New-Object -comobject "Word.Application"

$word.Visible = $False

We are getting the documents folder as a parameter for our script, then we will get all the documents by calling Get-ChildItem and filter that to files with .docx extension, then we created an instance from Word and setting its Visibility to false so we are now working silently.

2. Open each document and list the ContentControls

foreach ( $doc in $all_docs)

{

Write-Host "Processing :" $doc.FullName

$doc = $word.Documents.Open($doc.FullName);

$controls = $doc.ContentControls.Count

Here we are using the foreach cmdlet to enumerate the documents we found in the folder, and then open each of them in Word using the Open method in the Documents collection.

My friend used the Content Controls to restrict users to edit certain fields in the document so he can process the document latter, and this is better than a plain word document to pull your hair trying to develop a parser for it.

Word object model provides a ContentControls collection, which will be holding all the content controls and their properties.

We put a reference to all the content controls in a variable so we can use it afterwards.

3. Create a collection of custom objects holding our data

$item = New-Object System.Object

foreach ( $control in $doc.ContentControls )

{

$item | Add-Member -type NoteProperty -name $control.Title -value $control.Range.Text

}

$all_items += $item

Here we are using the New-Object cmdlet to create a custom object, and using the Add-Member cmdlet to create properties on the fly.

Each property name is the ContentControl’s Title, and the property value equals to the text inside this control. After that we add the new object to a collection

4. Save the collection to a CSV file

$all_items | Export-CSV "Data.CSV"

The PowerShell guys had done a great job here, if we just pipe our collection to the Export-CSV cmdlet, we now has a reflection based enumeration of our objects dumped to the CSV file with an automatic header.

Finally, you got served my friend.

The whole script and sample document can be downloaded here:

To run the script, call it and pass the folder path:

>> ./Process-Documents.ps1 "path-to-the-documents"

It is reminding me with DOS days : Beep Beep Beep Beep


I was playing around with PowerShell and I tried to dump the content of a text file, but used a word document instead (by mistake), surprisingly; my machine started to Beep like the old DOS days.

Try it yourself, open PowerShell and Type:

Get-Content "path to a word document"

Wednesday, April 16, 2008

How to backup all your documents with 2 lines of PowerShell script

I (and I think all of us) have documents scattered every where in data drive.

Today I wanted to get all the word documents on my D: drive and put them on a shared folder for archiving.

Since I was playing with PowerShell recently, I thought it will be nice to write a script that do that, and the following is what I have written:

$files = Get-ChildItem -recurse -filter "*.doc*"
$files | foreach {copy $_.fullname -Destination E:\Docs}

I saved those 2 lines into a file [backup-docs.ps1] and opened PowerShell and navigate to my drive and called this script,and in seconds I found all my documents in the E:\Docs folder ; Thanks to PowerShell


Monday, April 14, 2008

Un-maintainable Code

Paul Stovell (MVP from Australia) had a nice post on his blog; it seems like a fictional dialogue between 2 developers one of them is a new member ask questions to his elder colleague, this dialogue whill light the pulp on how some decisions we take during development, will lead to confusions and hard times to maintain the source code.

Enjoy: http://www.paulstovell.com/blog/the-inline-sql-application-six-months-later