OOXML Text Scanning & XPath Guide

Overview: Common Structure

All OOXML files (*.docx, *.xlsx, *.pptx) are ZIP containers that share a common layout:

  • [Content_Types].xml: content type declarations
  • _rels/.rels: package root relationships
  • docProps/: document properties/metadata
  • App-specific text stacks:
    • Word: WordprocessingML (w:)
    • PowerPoint / Excel shapes & charts: DrawingML (a:)
    • Excel cells: SpreadsheetML (s:) + sharedStrings
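Because every package is an ordinary ZIP archive, any of the parts listed above can be read without Office installed. The following is a minimal sketch, assuming PowerShell 5.1 or later with .NET's System.IO.Compression available; the function name Get-OoxmlPartXml is hypothetical and not part of any library.

```powershell
Add-Type -AssemblyName System.IO.Compression, System.IO.Compression.FileSystem

# Hypothetical helper: load one part of an OOXML package as an XmlDocument.
function Get-OoxmlPartXml {
    param(
        [Parameter(Mandatory)][string]$PackagePath,  # a .docx/.xlsx/.pptx file
        [Parameter(Mandatory)][string]$PartName      # e.g. 'word/document.xml'
    )
    $zip = [System.IO.Compression.ZipFile]::OpenRead($PackagePath)
    try {
        $entry = $zip.GetEntry($PartName)            # exact, case-sensitive entry name
        if ($null -eq $entry) { return $null }       # part not present in this package
        $stream = $entry.Open()
        try {
            $xml = New-Object System.Xml.XmlDocument
            $xml.PreserveWhitespace = $true          # keep whitespace-only text nodes
            $xml.Load($stream)
            return $xml
        } finally { $stream.Dispose() }
    } finally { $zip.Dispose() }
}
```

A call such as Get-OoxmlPartXml -PackagePath $path -PartName 'word/document.xml' returns the main Word part, or $null when the package has no such entry ($path stands for any package on disk).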

What to Scan & Caveats

  • Collect only character nodes: w:t, a:t, SpreadsheetML string nodes; avoid numeric <v> unless typed as string.
  • Preserve spaces when declared: xml:space="preserve" (the attribute Word sets on w:t and Excel on <t> when leading/trailing spaces matter).
  • Text is split across runs → merge at paragraph/sentence level.
  • Scan sub-parts too: hyperlinks, field codes, headers/footers, comments (legacy and threaded), notes, shapes/charts, tables, etc.

DOCX (Word)

Core Text Part Paths

  • Main body: word/document.xml
  • Headers/Footers: word/header*.xml, word/footer*.xml
  • Footnotes/Endnotes: word/footnotes.xml, word/endnotes.xml
  • Comments/Review: word/comments.xml, word/commentsExtended.xml
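  • Text boxes in shapes: w:drawing → w:txbxContent → w:p → w:r → w:t (already matched by //w:t); DrawingML text (a:t) mostly appears in chart and SmartArt content
  • Field codes/captions: inline (w:fldSimple with its w:instr attribute, w:instrText)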

Tags to Read

  • Plain text: w:t
  • Field code text: w:instrText
  • Drawing text (inside Word): a:t

Representative XPath

w = http://schemas.openxmlformats.org/wordprocessingml/2006/main
a = http://schemas.openxmlformats.org/drawingml/2006/main
  • Body text: //w:t
  • Shape text: //w:drawing//a:t
  • Field codes: //w:instrText

Practice Tips

  • Tree: w:p → w:r → w:t. Join all w:t under each w:p.
  • Convert w:br/w:tab to newline/tab while merging.
  • Hyperlink display text is in w:hyperlink → w:r → w:t; the URL is stored in the part's relationships file (e.g. word/_rels/document.xml.rels) and is looked up by r:id.
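The paragraph-level merge from the tips above can be sketched as follows. This assumes the part is already loaded as an XmlDocument; the function name Get-WordParagraphText is hypothetical. Only run-level w:tab elements are converted, so tab-stop definitions under w:pPr are ignored. Paragraphs nested inside another paragraph (for example in a text box) are reported both within the outer paragraph and on their own.

```powershell
# Hypothetical helper: return one merged string per w:p in a WordprocessingML part.
function Get-WordParagraphText {
    param([Parameter(Mandatory)][System.Xml.XmlDocument]$Xml)

    $ns = New-Object System.Xml.XmlNamespaceManager($Xml.NameTable)
    $ns.AddNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main')

    foreach ($p in $Xml.SelectNodes('//w:p', $ns)) {
        $sb = New-Object System.Text.StringBuilder
        # Text, breaks and tabs inside runs, visited in document order.
        foreach ($n in $p.SelectNodes('.//w:r/w:t | .//w:r/w:br | .//w:r/w:tab', $ns)) {
            switch ($n.LocalName) {
                't'   { [void]$sb.Append($n.InnerText) }
                'br'  { [void]$sb.Append("`n") }
                'tab' { [void]$sb.Append("`t") }
            }
        }
        $sb.ToString()   # emit the merged paragraph text
    }
}
```

Wrapping the call in @( ) keeps the result an array even when the part contains a single paragraph.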

PPTX (PowerPoint)

Core Text Part Paths

  • Slides: ppt/slides/slide*.xml
  • Notes: ppt/notesSlides/notesSlide*.xml
  • Masters/Layouts: ppt/slideMasters/slideMaster*.xml, ppt/slideLayouts/slideLayout*.xml
  • Charts: ppt/charts/chart*.xml (rich text within)
  • Shapes/Tables: p:spTree → p:sp → p:txBody → a:p → a:r → a:t (table cells use a:txBody instead of p:txBody)

Tags to Read

  • General text (shapes/tables/charts): a:t
  • Table cells: a:tbl → a:tr → a:tc → a:txBody → a:p/a:r/a:t
  • Chart titles/axes/labels: c:tx/c:rich//a:t, c:dLbls//a:t

Representative XPath

p = http://schemas.openxmlformats.org/presentationml/2006/main
a = http://schemas.openxmlformats.org/drawingml/2006/main
c = http://schemas.openxmlformats.org/drawingml/2006/chart
  • All slide text: //a:t
  • Table cell text: //a:tbl//a:tc//a:txBody//a:t
  • Chart text: //c:chart//a:t (or //c:tx//a:t)

Practice Tips

  • Placeholder text may be inherited from the master/layout; include those parts if needed.
  • Merge at a:p level: join all a:r/a:t in a paragraph.
  • Scan notes slides (notesSlide*.xml) for sensitive info.
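A comparable sketch for DrawingML paragraphs, usable on slides, notes slides and chart parts alike. It assumes an already loaded XmlDocument; the function name Get-DrawingParagraphText is hypothetical. Besides a:r/a:t it also reads a:fld/a:t (field text such as slide numbers) and turns a:br into a newline.

```powershell
# Hypothetical helper: return one merged string per a:p in a DrawingML-bearing part.
function Get-DrawingParagraphText {
    param([Parameter(Mandatory)][System.Xml.XmlDocument]$Xml)

    $ns = New-Object System.Xml.XmlNamespaceManager($Xml.NameTable)
    $ns.AddNamespace('a', 'http://schemas.openxmlformats.org/drawingml/2006/main')

    foreach ($p in $Xml.SelectNodes('//a:p', $ns)) {
        $sb = New-Object System.Text.StringBuilder
        # Direct children of the paragraph only, in document order.
        foreach ($n in $p.SelectNodes('a:r/a:t | a:fld/a:t | a:br', $ns)) {
            if ($n.LocalName -eq 'br') { [void]$sb.Append("`n") }
            else                       { [void]$sb.Append($n.InnerText) }
        }
        $sb.ToString()   # one string per paragraph, empty for empty paragraphs
    }
}
```

Because the XPath starts at //a:p rather than at p:spTree, the same function covers table cells and chart rich text without extra cases.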

XLSX (Excel)

Core Text Part Paths

  • Worksheets: xl/worksheets/sheet*.xml
  • Shared strings: xl/sharedStrings.xml
  • Legacy comments: xl/comments*.xml
  • Threaded comments: xl/threadedComments/threadedComment*.xml
  • Header/Footer (per sheet): inside <headerFooter> of each sheet
  • Shapes/Charts/Text boxes: xl/drawings/drawing*.xml, xl/charts/chart*.xml

String Storage Rules

  • Shared string cell: <c t="s"><v> holds the zero-based index of an <si> item in sharedStrings.xml.
  • In sharedStrings.xml:
    • Simple: <si><t>
    • Rich: <si><r><t> (join all r/t)
  • Inline string cell: <c t="inlineStr"><is><t> or <is><r><t>
  • A numeric <v> is not text unless the cell is typed as a string (t="s", t="inlineStr", or t="str" for a formula result); number formatting changes the display only.
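The storage rules above can be sketched as a lookup that resolves shared-string indexes and inline strings for one worksheet. It assumes both parts are already loaded as XmlDocument objects; the function name Get-SheetStringCells and its output shape are hypothetical. Only t="s" and t="inlineStr" cells are handled here.

```powershell
# Hypothetical helper: map string-typed cells of one worksheet to their text.
function Get-SheetStringCells {
    param(
        [Parameter(Mandatory)][System.Xml.XmlDocument]$SheetXml,
        [System.Xml.XmlDocument]$SharedStringsXml   # optional: a workbook may have no shared strings
    )
    $uri = 'http://schemas.openxmlformats.org/spreadsheetml/2006/main'
    $ns = New-Object System.Xml.XmlNamespaceManager($SheetXml.NameTable)
    $ns.AddNamespace('s', $uri)

    # Shared-string table: one joined string per <si>, in index order.
    $shared = @()
    if ($null -ne $SharedStringsXml) {
        $sns = New-Object System.Xml.XmlNamespaceManager($SharedStringsXml.NameTable)
        $sns.AddNamespace('s', $uri)
        $shared = @(foreach ($si in $SharedStringsXml.SelectNodes('//s:si', $sns)) {
            $texts = @($si.SelectNodes('s:t | s:r/s:t', $sns) | ForEach-Object { $_.InnerText })
            [string]::Join('', $texts)
        })
    }

    foreach ($c in $SheetXml.SelectNodes('//s:c[@t="s" or @t="inlineStr"]', $ns)) {
        if ($c.GetAttribute('t') -eq 's') {
            $v = $c.SelectSingleNode('s:v', $ns)
            if ($null -eq $v) { continue }            # malformed cell without an index
            $text = $shared[[int]$v.InnerText]        # <v> is a zero-based index
        } else {
            $texts = @($c.SelectNodes('s:is/s:t | s:is/s:r/s:t', $ns) | ForEach-Object { $_.InnerText })
            $text = [string]::Join('', $texts)
        }
        [pscustomobject]@{ Ref = $c.GetAttribute('r'); Text = $text }
    }
}
```

Each output object pairs a cell reference such as A1 with its resolved text, which keeps later reporting independent of whether the string was shared or inline.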

Tags to Read

  • SharedStrings: //si/t and //si/r/t
  • Inline strings (per sheet): //is/t and //is/r/t
  • Headers/Footers: //headerFooter/*[text()]
  • Comments: //comment//t
  • Shapes/Charts: //a:t

Representative XPath

s = http://schemas.openxmlformats.org/spreadsheetml/2006/main
a = http://schemas.openxmlformats.org/drawingml/2006/main
c = http://schemas.openxmlformats.org/drawingml/2006/chart
  • Shared strings: //s:si/s:t | //s:si/s:r/s:t
  • Inline strings: //s:is/s:t | //s:is/s:r/s:t
  • Header/Footer: //s:headerFooter/*[text()]
  • Comments: //s:comments//s:comment//s:t
  • Shapes/Charts: //a:t

Practice Tips

  • Parse sharedStrings.xml first; then inline strings, comments, headers, drawings/charts.
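  • Merged cells keep their value only in the top-left cell of the range; repeated text usually comes from many cells pointing at the same shared-string index → consider de-duplication.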
  • “Numeric-looking” IDs may still be stored as strings.

Parser Tips (PowerShell/XPath)

Namespace Strategies

1) Namespace Manager (canonical)

# $xml must hold one loaded part; 'document.xml' is a placeholder for a part extracted from the package.
$xml = New-Object System.Xml.XmlDocument
$xml.Load((Resolve-Path 'document.xml').Path)

# Prefixes are local to this manager; they only need to match the namespace URIs.
$ns = New-Object System.Xml.XmlNamespaceManager($xml.NameTable)
$ns.AddNamespace('w','http://schemas.openxmlformats.org/wordprocessingml/2006/main')
$ns.AddNamespace('a','http://schemas.openxmlformats.org/drawingml/2006/main')
$ns.AddNamespace('s','http://schemas.openxmlformats.org/spreadsheetml/2006/main')

# Typical text nodes across parts
$nodes = $xml.SelectNodes('//w:t | //a:t | //s:si/s:t | //s:si/s:r/s:t', $ns)

2) local-name() + namespace-uri() (quick & robust)

# Word plain text (no namespace manager needed)
$xml.SelectNodes('//*[local-name()="t" and namespace-uri()="http://schemas.openxmlformats.org/wordprocessingml/2006/main"]')

# DrawingML text (Word/PPTX/Excel drawings)
$xml.SelectNodes('//*[local-name()="t" and namespace-uri()="http://schemas.openxmlformats.org/drawingml/2006/main"]')

Merging & Cleanup

  • Merge per paragraph:
    • Word: join all w:t under each w:p.
    • PPTX: join all a:r/a:t under each a:p.
    • XLSX shared string item: within each <si>, join <t> and all <r>/<t>.
  • Line breaks:
    • Word: w:br → \n
    • DrawingML: a:br → \n
  • Whitespace rules:
    • Honor xml:space="preserve" on text elements.
    • Otherwise trim outer whitespace after merge.

High-Risk Auxiliary Parts

  • Word: headers/footers, footnotes/endnotes, comments, drawing text.
  • PowerPoint: notes slides, chart labels/axes, master/layout placeholders.
  • Excel: legacy & threaded comments, headers/footers, chart/shape text boxes.

Minimum Scan Checklist

DOCX

  • word/document.xml//w:t, //w:instrText, //w:drawing//a:t
  • word/header*.xml, word/footer*.xml
  • word/footnotes.xml, word/endnotes.xml, word/comments*.xml

PPTX

  • ppt/slides/slide*.xml//a:t
  • ppt/notesSlides/notesSlide*.xml//a:t
  • Optional: ppt/slideMasters/*, ppt/slideLayouts/*, ppt/charts/chart*.xml//a:t

XLSX

  • xl/sharedStrings.xml//s:si/s:t | //s:si/s:r/s:t
  • xl/worksheets/sheet*.xml → inline strings //s:is//s:t, headers/footers //s:headerFooter/*[text()]
  • xl/comments*.xml//s:t; xl/threadedComments/*.xml → comment body in <text> elements (a Microsoft extension namespace, not s:)
  • xl/drawings/*.xml, xl/charts/*.xml//a:t
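The checklist above can also be kept as data, so that a scanner decides per ZIP entry which XPath to run. The sketch below assumes the w, a and s prefixes are registered as shown under Namespace Strategies. The names $ScanRules and Get-ScanXPath are hypothetical, and //w:t for headers, footers, notes and comments is an assumption, since the checklist lists those parts without an XPath. The optional master/layout parts are left out.

```powershell
# Checklist as data: wildcard pattern for the part name, XPath to run on that part.
$ScanRules = @(
    @{ Pattern = 'word/document.xml';               XPath = '//w:t | //w:instrText | //w:drawing//a:t' }
    @{ Pattern = 'word/header*.xml';                XPath = '//w:t' }   # assumed XPath
    @{ Pattern = 'word/footer*.xml';                XPath = '//w:t' }   # assumed XPath
    @{ Pattern = 'word/footnotes.xml';              XPath = '//w:t' }   # assumed XPath
    @{ Pattern = 'word/endnotes.xml';               XPath = '//w:t' }   # assumed XPath
    @{ Pattern = 'word/comments*.xml';              XPath = '//w:t' }   # assumed XPath
    @{ Pattern = 'ppt/slides/slide*.xml';           XPath = '//a:t' }
    @{ Pattern = 'ppt/notesSlides/notesSlide*.xml'; XPath = '//a:t' }
    @{ Pattern = 'ppt/charts/chart*.xml';           XPath = '//a:t' }
    @{ Pattern = 'xl/sharedStrings.xml';            XPath = '//s:si/s:t | //s:si/s:r/s:t' }
    @{ Pattern = 'xl/worksheets/sheet*.xml';        XPath = '//s:is//s:t | //s:headerFooter/*[text()]' }
    @{ Pattern = 'xl/comments*.xml';                XPath = '//s:t' }
    # Assumption: threaded comment bodies sit in <text> elements of an extension namespace.
    @{ Pattern = 'xl/threadedComments/*.xml';       XPath = '//*[local-name()="text"]' }
    @{ Pattern = 'xl/drawings/*.xml';               XPath = '//a:t' }
    @{ Pattern = 'xl/charts/*.xml';                 XPath = '//a:t' }
)

# Hypothetical helper: return the XPath of the first rule matching a part name, or $null.
function Get-ScanXPath {
    param([Parameter(Mandatory)][string]$PartName)
    foreach ($rule in $ScanRules) {
        if ($PartName -like $rule.Pattern) { return $rule.XPath }
    }
    return $null
}
```

Relationship files (*.rels) and VML drawings fall through to $null because -like must match the whole entry name, so they are skipped rather than parsed with the wrong query.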

Common Pitfalls

  • PPTX run splitting: one visible line may be multiple a:r/a:t → merge at a:p.
  • XLSX string sources: handle both shared and inline strings.
  • DOCX field codes: important text may be in w:instrText.
  • PPTX masters/layouts/notes: sensitive text often lives outside main slides.
  • Headers/Footers (Word/Excel): confidentiality notices may exist only here.