A friend of mine was organizing a public event, the event has a website and they were asking for registration through a registration webpage or filling a word document and send it by email.
My friend got hundreds of documents on his email, and hence came the problem of processing those documents.
Let us see how we can automate processing those documents using PowerShell.
Firstly, we need to de-attach all the documents from the emails, and save them into a folder. Although this is an interesting thing to accomplish using PowerShell, but my friend had this already done for me using Attachment Extractor <Okay … he was using ThunderBird and this post is not intended for arguing that I will prefer Outlook>
Now, we have all the documents in one folder; or many folders. (We can simply gather them from the whole drive using only 2 lines of PowerShell)
Secondly, let us process all those documents and get all the data out of them into a CSV file ready for Excel, or importing into a database.
1. Getting the folder path and open Word in Hidden mode
$docPath = $args[0]
$all_docs = Get-ChildItem $docPath -filter "*.docx"
$word = New-Object -comobject "Word.Application"
$word.Visible = $False
We are getting the documents folder as a parameter for our script, then we will get all the documents by calling Get-ChildItem and filter that to files with .docx extension, then we created an instance from Word and setting its Visibility to false so we are now working silently.
2. Open each document and list the ContentControls
foreach ( $doc in $all_docs)
{
Write-Host "Processing :" $doc.FullName
$doc = $word.Documents.Open($doc.FullName);
$controls = $doc.ContentControls.Count
Here we are using the foreach cmdlet to enumerate the documents we found in the folder, and then open each of them in Word using the Open method in the Documents collection.
My friend used the Content Controls to restrict users to edit certain fields in the document so he can process the document latter, and this is better than a plain word document to pull your hair trying to develop a parser for it.
Word object model provides a ContentControls collection, which will be holding all the content controls and their properties.
We put a reference to all the content controls in a variable so we can use it afterwards.
3. Create a collection of custom objects holding our data
$item = New-Object System.Object
foreach ( $control in $doc.ContentControls )
{
$item | Add-Member -type NoteProperty -name $control.Title -value $control.Range.Text
}
$all_items += $item
Here we are using the New-Object cmdlet to create a custom object, and using the Add-Member cmdlet to create properties on the fly.
Each property name is the ContentControl’s Title, and the property value equals to the text inside this control. After that we add the new object to a collection
4. Save the collection to a CSV file
$all_items | Export-CSV "Data.CSV"
The PowerShell guys had done a great job here, if we just pipe our collection to the Export-CSV cmdlet, we now has a reflection based enumeration of our objects dumped to the CSV file with an automatic header.
Finally, you got served my friend.
The whole script and sample document can be downloaded here:
To run the script, call it and pass the folder path:
>> ./Process-Documents.ps1 "path-to-the-documents"