OOXML Text Scanning & XPath Guide
Overview: Common Structure
All OOXML files are ZIP containers: *.docx, *.xlsx, *.pptx.
- [Content_Types].xml: content type declarations
- _rels/.rels: package root relationships
- docProps/: document properties/metadata
- App-specific text stacks:
  - Word: WordprocessingML (w:)
  - PowerPoint / Excel shapes & charts: DrawingML (a:)
  - Excel cells: SpreadsheetML (s:) + sharedStrings
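Because every package is a plain ZIP archive, a part can be read without unpacking the file to disk. The sketch below shows one way to do that with the .NET System.IO.Compression classes. The function name Get-OoxmlPart, its parameters, and the path in the usage comment are made up for illustration.

```powershell
Add-Type -AssemblyName System.IO.Compression.FileSystem

# Hypothetical helper: returns one package part as an XmlDocument, or $null if the part is absent.
function Get-OoxmlPart {
    param(
        [Parameter(Mandatory)] [string] $PackagePath,  # absolute path to a .docx/.xlsx/.pptx
        [Parameter(Mandatory)] [string] $PartName      # e.g. word/document.xml (no leading slash)
    )
    # Pass an absolute path: .NET resolves relative paths against the process
    # directory, which can differ from the current PowerShell location.
    $zip = [System.IO.Compression.ZipFile]::OpenRead($PackagePath)
    try {
        $entry = $zip.GetEntry($PartName)
        if ($null -eq $entry) { return $null }
        $stream = $entry.Open()
        try {
            $xml = New-Object System.Xml.XmlDocument
            $xml.Load($stream)   # honors the encoding declared in the part
            return $xml
        }
        finally { $stream.Dispose() }
    }
    finally { $zip.Dispose() }
}

# Usage (placeholder path):
# $doc = Get-OoxmlPart -PackagePath 'C:\temp\report.docx' -PartName 'word/document.xml'
```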
What to Scan & Caveats
- Collect only character nodes: w:t, a:t, SpreadsheetML string nodes; avoid numeric <v> unless typed as string.
- Preserve spaces when declared: xml:space="preserve" (written on w:t in Word and on t in SpreadsheetML).
- Text is split across runs → merge at paragraph/sentence level.
- Scan sub-parts: hyperlinks, field codes, headers/footers, comments/notes, shapes/charts, tables/threaded comments, etc.
DOCX (Word)
Core Text Part Paths
- Main body: word/document.xml
- Headers/Footers: word/header*.xml, word/footer*.xml
- Footnotes/Endnotes: word/footnotes.xml, word/endnotes.xml
- Comments/Review: word/comments.xml, word/commentsExtended.xml
- Text in drawings: w:drawing → DrawingML (a:) → a:t; Word text boxes usually keep ordinary w:p/w:r/w:t inside w:txbxContent, so their text is found as w:t
- Field codes/captions: inline (w:fldSimple, w:instrText)
Tags to Read
- Plain text: w:t
- Field code text: w:instrText
- Drawing text (inside Word): a:t
Representative XPath
w = http://schemas.openxmlformats.org/wordprocessingml/2006/main
a = http://schemas.openxmlformats.org/drawingml/2006/main
- Body text: //w:t
- Shape text: //w:drawing//a:t
- Field codes: //w:instrText
Practice Tips
- Tree: w:p → w:r → w:t. Join all w:t under each w:p.
- Convert w:br and w:tab to newline and tab while merging.
- Hyperlink display text is in w:hyperlink → w:t; the URL is resolved via the relationship (r:id).
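The first two tips can be sketched as follows. This is a minimal example that assumes the part is already loaded as an XmlDocument; the inline XML is an invented stand-in for word/document.xml, and nested paragraphs (for example inside text boxes) are not handled specially.

```powershell
$wNs = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

# Inline sample standing in for word/document.xml (illustrative content only).
[xml]$xml = @"
<w:document xmlns:w="$wNs"><w:body>
  <w:p><w:r><w:t xml:space="preserve">Hello </w:t></w:r><w:r><w:t>world</w:t><w:br/><w:t>next</w:t></w:r></w:p>
  <w:p><w:r><w:t>A</w:t><w:tab/><w:t>B</w:t></w:r></w:p>
</w:body></w:document>
"@

$ns = New-Object System.Xml.XmlNamespaceManager($xml.NameTable)
$ns.AddNamespace('w', $wNs)

# One merged string per w:p.
$paragraphs = foreach ($p in $xml.SelectNodes('//w:p', $ns)) {
    $sb = New-Object System.Text.StringBuilder
    # The union is returned in document order, so text, breaks and tabs stay in sequence.
    foreach ($n in $p.SelectNodes('.//w:t | .//w:br | .//w:tab', $ns)) {
        switch ($n.LocalName) {
            't'   { [void]$sb.Append($n.InnerText) }
            'br'  { [void]$sb.Append("`n") }
            'tab' { [void]$sb.Append("`t") }
        }
    }
    $sb.ToString()
}

# $paragraphs[0] is "Hello world", a newline, then "next"; $paragraphs[1] is "A", a tab, then "B".
```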
PPTX (PowerPoint)
Core Text Part Paths
- Slides: ppt/slides/slide*.xml
- Notes: ppt/notesSlides/notesSlide*.xml
- Masters/Layouts: ppt/slideMasters/slideMaster*.xml, ppt/slideLayouts/slideLayout*.xml
- Charts: ppt/charts/chart*.xml (rich text within)
- Shapes/Tables: p:spTree → p:sp → p:txBody → a:p → a:r → a:t (table cells use a:txBody instead of p:txBody)
Tags to Read
- General text (shapes/tables/charts): a:t
- Table cells: a:tbl → a:tr → a:tc → a:txBody → a:p/a:r/a:t
- Chart titles/axes/labels: c:tx/c:rich//a:t, c:dLbls//a:t
Representative XPath
p = http://schemas.openxmlformats.org/presentationml/2006/main
a = http://schemas.openxmlformats.org/drawingml/2006/main
c = http://schemas.openxmlformats.org/drawingml/2006/chart
- All slide text: //a:t
- Table cell text: //a:tbl//a:tc//a:txBody//a:t
- Chart text: //c:chart//a:t (or //c:tx//a:t)
Practice Tips
- Placeholder text may inherit from master/layout — include them if needed.
- Merge at a:p level: join all a:r/a:t in a paragraph.
- Scan notes slides (notesSlide*.xml) for sensitive info.
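A possible shape for the a:p merge is shown below. Because slides, notes, charts and Excel drawings all carry their text in a:p, the same helper can be pointed at any of those parts. The function name Get-DrawingMLParagraph is hypothetical, the sample slide is invented, and including a:fld text and mapping a:br to a newline are choices made for this sketch.

```powershell
# Hypothetical helper: emits one merged string per non-empty a:p in the given part.
function Get-DrawingMLParagraph {
    param([Parameter(Mandatory)] [System.Xml.XmlDocument] $Xml)

    $ns = New-Object System.Xml.XmlNamespaceManager($Xml.NameTable)
    $ns.AddNamespace('a', 'http://schemas.openxmlformats.org/drawingml/2006/main')

    foreach ($p in $Xml.SelectNodes('//a:p', $ns)) {
        # Runs, field text and line breaks, in document order.
        $parts = foreach ($n in $p.SelectNodes('a:r/a:t | a:fld/a:t | a:br', $ns)) {
            if ($n.LocalName -eq 'br') { "`n" } else { $n.InnerText }
        }
        $text = -join $parts
        if ($text.Length -gt 0) { $text }   # skip empty paragraphs
    }
}

# Inline sample standing in for ppt/slides/slide1.xml (illustrative content only).
[xml]$slide = @'
<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"
       xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
  <p:cSld><p:spTree><p:sp><p:txBody>
    <a:p><a:r><a:t>Q3 </a:t></a:r><a:r><a:t>rev</a:t></a:r><a:r><a:t>enue</a:t></a:r></a:p>
    <a:p/>
  </p:txBody></p:sp></p:spTree></p:cSld>
</p:sld>
'@

$lines = @(Get-DrawingMLParagraph -Xml $slide)   # one entry: "Q3 revenue"
```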
XLSX (Excel)
Core Text Part Paths
- Worksheets: xl/worksheets/sheet*.xml
- Shared strings: xl/sharedStrings.xml
- Legacy comments: xl/comments*.xml
- Threaded comments: xl/threadedComments/threadedComment*.xml
- Header/Footer (per sheet): inside <headerFooter> of each sheet
- Shapes/Charts/Text boxes: xl/drawings/drawing*.xml, xl/charts/chart*.xml
String Storage Rules
- Shared string cell: <c t="s"><v> holds an index into sharedStrings.xml.
- In sharedStrings.xml:
  - Simple: <si><t>
  - Rich: <si><r><t> (join all r/t)
- Inline string cell: <c t="inlineStr"><is><t> or <is><r><t>
- Numeric <v> isn't text unless typed as string; formatting changes display only.
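The storage rules above could be applied roughly like this. It is a sketch over invented inline samples that stand in for xl/sharedStrings.xml and one worksheet part. The helper Join-NodeText is hypothetical, and the t="str" branch (a formula result stored as a string in <v>) is an addition not spelled out in the list.

```powershell
$sNs = 'http://schemas.openxmlformats.org/spreadsheetml/2006/main'

# Inline samples (illustrative content only).
[xml]$sst = @"
<sst xmlns="$sNs">
  <si><t>Name</t></si>
  <si><r><t>Acme </t></r><r><t>Corp</t></r></si>
</sst>
"@
[xml]$sheet = @"
<worksheet xmlns="$sNs"><sheetData><row r="1">
  <c r="A1" t="s"><v>0</v></c>
  <c r="B1" t="s"><v>1</v></c>
  <c r="C1" t="inlineStr"><is><t>inline text</t></is></c>
  <c r="D1"><v>42</v></c>
</row></sheetData></worksheet>
"@

$ns = New-Object System.Xml.XmlNamespaceManager (New-Object System.Xml.NameTable)
$ns.AddNamespace('s', $sNs)

# Hypothetical helper: join the text of every node matched by $XPath under $Node.
function Join-NodeText {
    param($Node, [string] $XPath)
    -join ($Node.SelectNodes($XPath, $ns) | ForEach-Object { $_.InnerText })
}

# Lookup table: one joined string per <si>, in index order.
$shared = @(foreach ($si in $sst.SelectNodes('//s:si', $ns)) {
    Join-NodeText $si 's:t | s:r/s:t'
})

$cells = foreach ($c in $sheet.SelectNodes('//s:c', $ns)) {
    $text = $null
    switch ($c.GetAttribute('t')) {
        's'         { $text = $shared[[int](Join-NodeText $c 's:v')] }
        'inlineStr' { $text = Join-NodeText $c 's:is/s:t | s:is/s:r/s:t' }
        'str'       { $text = Join-NodeText $c 's:v' }
    }
    # Untyped or numeric cells (D1 here) are skipped.
    if ($null -ne $text) { [pscustomobject]@{ Ref = $c.GetAttribute('r'); Text = $text } }
}
```

With the sample data, $cells holds A1 = Name, B1 = Acme Corp and C1 = inline text.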
Tags to Read
- SharedStrings: //si/t and //si/r/t
- Inline strings (per sheet): //is/t and //is/r/t
- Headers/Footers: //headerFooter/*[text()]
- Comments: //comment//t
- Shapes/Charts: //a:t
Representative XPath
s = http://schemas.openxmlformats.org/spreadsheetml/2006/main
a = http://schemas.openxmlformats.org/drawingml/2006/main
c = http://schemas.openxmlformats.org/drawingml/2006/chart
- Shared strings: //s:si/s:t | //s:si/s:r/s:t
- Inline strings: //s:is/s:t | //s:is/s:r/s:t
- Header/Footer: //s:headerFooter/*[text()]
- Comments: //s:comments//s:comment//s:t
- Shapes/Charts: //a:t
Practice Tips
- Parse sharedStrings.xml first; then inline strings, comments, headers, drawings/charts.
- Merged cells repeat the same value → consider de-duplication.
- “Numeric-looking” IDs may still be stored as strings.
Parser Tips (PowerShell/XPath)
Namespace Strategies
1) Namespace Manager (canonical)
# $xml is one part loaded as an XmlDocument (example path from an unpacked package)
$xml = New-Object System.Xml.XmlDocument
$xml.Load((Resolve-Path 'word/document.xml').Path)
$ns = New-Object System.Xml.XmlNamespaceManager($xml.NameTable)
$ns.AddNamespace('w','http://schemas.openxmlformats.org/wordprocessingml/2006/main')
$ns.AddNamespace('a','http://schemas.openxmlformats.org/drawingml/2006/main')
$ns.AddNamespace('s','http://schemas.openxmlformats.org/spreadsheetml/2006/main')
# Typical text nodes across parts
$nodes = $xml.SelectNodes('//w:t | //a:t | //s:si/s:t | //s:si/s:r/s:t', $ns)
2) local-name() + namespace-uri() (quick & robust)
# Word plain text
$xml.SelectNodes('//*[local-name()="t" and namespace-uri()="http://schemas.openxmlformats.org/wordprocessingml/2006/main"]')
# DrawingML text (Word/PPTX/Excel drawings)
$xml.SelectNodes('//*[local-name()="t" and namespace-uri()="http://schemas.openxmlformats.org/drawingml/2006/main"]')
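Both strategies are needed because OOXML parts declare their elements in a namespace, so an XPath written with bare element names matches nothing. A small check on an invented stand-in for xl/sharedStrings.xml illustrates the difference:

```powershell
# Inline sample (illustrative content only); the elements live in the default namespace.
[xml]$sst = '<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"><si><t>Name</t></si></sst>'

$ns = New-Object System.Xml.XmlNamespaceManager($sst.NameTable)
$ns.AddNamespace('s', 'http://schemas.openxmlformats.org/spreadsheetml/2006/main')

$unprefixed = $sst.SelectNodes('//si/t').Count                  # 0: bare names mean "no namespace"
$prefixed   = $sst.SelectNodes('//s:si/s:t', $ns).Count         # 1: prefix bound via the manager
$byLocal    = $sst.SelectNodes('//*[local-name()="t"]').Count   # 1: namespace ignored entirely
```

The local-name() form without a namespace-uri() test is the loosest of the three: it would also match a t element from any other namespace in the same part.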
Merging & Cleanup
- Merge per paragraph:
  - Word: join all w:t under each w:p.
  - PPTX: join all a:r/a:t under each a:p.
  - XLSX shared string item: join all <si><t> and <r><t>.
- Line breaks:
  - Word: w:br → \n
  - DrawingML: a:br → \n
- Whitespace rules:
  - Honor xml:space="preserve".
  - Otherwise trim outer whitespace after merge.
High-Risk Auxiliary Parts
- Word: headers/footers, footnotes/endnotes, comments, drawing text.
- PowerPoint: notes slides, chart labels/axes, master/layout placeholders.
- Excel: legacy & threaded comments, headers/footers, chart/shape text boxes.
Minimum Scan Checklist
DOCX
- word/document.xml → //w:t, //w:instrText, //w:drawing//a:t
- word/header*.xml, word/footer*.xml
- word/footnotes.xml, word/endnotes.xml, word/comments*.xml
PPTX
- ppt/slides/slide*.xml → //a:t
- ppt/notesSlides/notesSlide*.xml → //a:t
- Optional: ppt/slideMasters/*, ppt/slideLayouts/*, ppt/charts/chart*.xml → //a:t
XLSX
- xl/sharedStrings.xml → //s:si/s:t | //s:si/s:r/s:t
- xl/worksheets/sheet*.xml → inline strings //s:is//s:t, headers/footers //s:headerFooter/*[text()]
- xl/comments*.xml → //s:comment//s:t; xl/threadedComments/* → the text element of each threadedComment
- xl/drawings/*.xml, xl/charts/*.xml → //a:t
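One way to drive a scanner from this checklist is a lookup table from part-name patterns to XPath expressions. The sketch below is only an illustration: the names $scanMap and Get-ScanXPath are hypothetical, and three things are assumptions rather than checklist items. These are the XPath used for Word parts other than the main body, the explicit prefixes on the comment query, and the local-name() query for threaded comments.

```powershell
$wordXPath = '//w:t | //w:instrText | //w:drawing//a:t'

# Wildcard patterns (as used by -like) mapped to the XPath to run on matching parts.
$scanMap = [ordered]@{
    'word/document.xml'               = $wordXPath
    'word/header*.xml'                = $wordXPath   # assumption: same query as the body
    'word/footer*.xml'                = $wordXPath
    'word/footnotes.xml'              = $wordXPath
    'word/endnotes.xml'               = $wordXPath
    'word/comments*.xml'              = $wordXPath
    'ppt/slides/slide*.xml'           = '//a:t'
    'ppt/notesSlides/notesSlide*.xml' = '//a:t'
    'xl/sharedStrings.xml'            = '//s:si/s:t | //s:si/s:r/s:t'
    'xl/worksheets/sheet*.xml'        = '//s:is//s:t | //s:headerFooter/*[text()]'
    'xl/comments*.xml'                = '//s:comment//s:t'
    'xl/threadedComments/*'           = '//*[local-name()="text"]'   # assumption: body text sits in a text element
    'xl/drawings/*.xml'               = '//a:t'
    'xl/charts/*.xml'                 = '//a:t'
}

# Hypothetical helper: first matching XPath for a part name, or $null if the part is not scanned.
function Get-ScanXPath {
    param([Parameter(Mandatory)] [string] $PartName)
    foreach ($pattern in $scanMap.Keys) {
        if ($PartName -like $pattern) { return $scanMap[$pattern] }
    }
    return $null
}

# Get-ScanXPath -PartName 'ppt/slides/slide12.xml' returns '//a:t'.
```

The prefixes w, a and s still have to be bound in a namespace manager before the returned expression is passed to SelectNodes.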
Common Pitfalls
- PPTX run splitting: one visible line may be multiple a:r/a:t → merge at a:p.
- XLSX string sources: handle both shared and inline strings.
- DOCX field codes: important text may be in w:instrText.
- PPTX masters/layouts/notes: sensitive text often lives outside main slides.
- Headers/Footers (Word/Excel): confidentiality notices may exist only here.