Monday, April 28, 2008

Batch Processing Word Documents using PowerShell

A friend of mine was organizing a public event, the event has a website and they were asking for registration through a registration webpage or filling a word document and send it by email.

My friend got hundreds of documents on his email, and hence came the problem of processing those documents.

Let us see how we can automate processing those documents using PowerShell.

Firstly, we need to de-attach all the documents from the emails, and save them into a folder. Although this is an interesting thing to accomplish using PowerShell, but my friend had this already done for me using Attachment Extractor <Okay … he was using ThunderBird and this post is not intended for arguing that I will prefer Outlook>

Now, we have all the documents in one folder; or many folders. (We can simply gather them from the whole drive using only 2 lines of PowerShell)

Secondly, let us process all those documents and get all the data out of them into a CSV file ready for Excel, or importing into a database.

1. Getting the folder path and open Word in Hidden mode

$docPath = $args[0]

$all_docs = Get-ChildItem $docPath -filter "*.docx"

$word = New-Object -comobject "Word.Application"

$word.Visible = $False

We are getting the documents folder as a parameter for our script, then we will get all the documents by calling Get-ChildItem and filter that to files with .docx extension, then we created an instance from Word and setting its Visibility to false so we are now working silently.

2. Open each document and list the ContentControls

foreach ( $doc in $all_docs)

{

Write-Host "Processing :" $doc.FullName

$doc = $word.Documents.Open($doc.FullName);

$controls = $doc.ContentControls.Count

Here we are using the foreach cmdlet to enumerate the documents we found in the folder, and then open each of them in Word using the Open method in the Documents collection.

My friend used the Content Controls to restrict users to edit certain fields in the document so he can process the document latter, and this is better than a plain word document to pull your hair trying to develop a parser for it.

Word object model provides a ContentControls collection, which will be holding all the content controls and their properties.

We put a reference to all the content controls in a variable so we can use it afterwards.

3. Create a collection of custom objects holding our data

$item = New-Object System.Object

foreach ( $control in $doc.ContentControls )

{

$item | Add-Member -type NoteProperty -name $control.Title -value $control.Range.Text

}

$all_items += $item

Here we are using the New-Object cmdlet to create a custom object, and using the Add-Member cmdlet to create properties on the fly.

Each property name is the ContentControl’s Title, and the property value equals to the text inside this control. After that we add the new object to a collection

4. Save the collection to a CSV file

$all_items | Export-CSV "Data.CSV"

The PowerShell guys had done a great job here, if we just pipe our collection to the Export-CSV cmdlet, we now has a reflection based enumeration of our objects dumped to the CSV file with an automatic header.

Finally, you got served my friend.

The whole script and sample document can be downloaded here:

To run the script, call it and pass the folder path:

>> ./Process-Documents.ps1 "path-to-the-documents"

4 comments:

Silver said...

Great job! This saved me a lot of time. I've just used the script with no modifications

Dean Dahle said...

Essam, this looks great and just what I'm looking for. I downloaded your script and sample docs which execute perfectly. But I'm having a problem with my own forms. Can you offer some advice? The errors I get are....(followed by each field and value)...
Add-Member : Cannot bind argument to parameter 'Name' because it is null.
At C:\Temp\TestDocs2\Process-Documents.ps1:24 char:46
+ $item | Add-Member -type NoteProperty -name <<<< $control.Title -val
ue $control.Range.Text
+ CategoryInfo : InvalidData: (:) [Add-Member], ParameterBindingV
alidationException
+ FullyQualifiedErrorId : ParameterArgumentValidationErrorNullNotAllowed,M
icrosoft.PowerShell.Commands.AddMemberCommand

Add-Member : Cannot bind argument to parameter 'Name' because it is null.
At C:\Temp\TestDocs2\Process-Documents.ps1:24 char:46
+ $item | Add-Member -type NoteProperty -name <<<< $control.Title -val
ue $control.Range.Text
+ CategoryInfo : InvalidData: (:) [Add-Member], ParameterBindingV
alidationException
+ FullyQualifiedErrorId : ParameterArgumentValidationErrorNullNotAllowed,M
icrosoft.PowerShell.Commands.AddMemberCommand

Add-Member : Cannot bind argument to parameter 'Name' because it is null.
At C:\Temp\TestDocs2\Process-Documents.ps1:24 char:46
+ $item | Add-Member -type NoteProperty -name <<<< $control.Title -val
ue $control.Range.Text
+ CategoryInfo : InvalidData: (:) [Add-Member], ParameterBindingV
alidationException
+ FullyQualifiedErrorId : ParameterArgumentValidationErrorNullNotAllowed,M
icrosoft.PowerShell.Commands.AddMemberCommand

Legal Last Name : ZzWoolastname
Preferred Last Name : ZzWoolastname
Legal Middle Initial : U
Preferred Middle Initial : Click here to enter text.
Legal First Name : ZYName
Preferred First Name : ZYName
Last 4 SSN # : xxxx
Job Title : workerHere
Employee ID # : Click here to enter text.
Supervisor Name : WallName
Supervisor Phone # : 555 555 1212
Work Location : SOC
Effective Date : 8/13/12
Date Form Submitted : 8/5/12
Department/Domain : Workers Department
Status : Choose an item.
Comments : Click here to enter text.

I even went so far as to name one of the content fields "Name" ugg. did not help! Do you have an idea what's with my form? thank you.

Chad Smith said...

It's an old post, but it's exactly what I needed to pull some data from contentControls. Thank you!

Anonymous said...

When using this with the original docx it wors like a charm, but when the contents of docx changes (other documents) it doesn´t!
Is it possible to add contentdeviders (Tab, colon; semicolon; etc.) to devide and select contents.

Thank you.