A GitHub Action that scans repository files and sends them to Azure Purview for data governance and compliance tracking.
- 🔐 Secure Authentication: Uses GitHub OIDC for passwordless Azure authentication
- 📁 Smart File Processing: Automatically detects and processes changed files; binary files are skipped
- 🔄 Resilient API Integration: Built-in retry logic with exponential backoff
- 📊 Comprehensive Logging: Detailed execution logs with sensitive data redaction
- 🚀 Enterprise Ready: Handles large repositories with chunking and streaming
- Azure AD Application with federated credentials configured for GitHub OIDC
- Purview account with API access enabled
- GitHub repository with OIDC permissions
- Create an app registration in Microsoft Entra with the following permissions:
- ContentActivity.Write (Application)
- Content.Process.User (Application)
- ProtectionScopes.Compute.All (Application)
- (Optional, used for user ID lookup) User.Read.All (Application)
- Grant admin consent to those permissions
- In that app registration, click the "Certificates & secrets" tab, then click the "Federated credentials" tab, and click "Add credential"
- Choose "Other issuer" from the "Federated credential scenario" dropdown.
- Set Issuer to https://token.actions.githubusercontent.com
- Set Type to "Claims matching expression"
- Set "Value" to the following expression, replacing the sections in curly braces with the values for your repo:
  ```
  claims['sub'] matches 'repo:{your-user-or-org-name}/{your-repo-name}:*'
  ```
- Set Name and Description.
- Set "Audience" to api://AzureADTokenExchange if not already set.
- Click "Add".
```yaml
name: Scan with Purview
on:
  push:
    branches: [main]
  pull_request:
  workflow_dispatch: # Allow manual triggering for full scans
permissions:
  id-token: write
  contents: read
  pull-requests: write
  actions: read
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: microsoft/purview-github-action@v1
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          client-certificate: ${{ secrets.AZURE_CLIENT_CERTIFICATE }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          users-json-path: 'users.json'
          file-patterns: '**'
          debug: true
```

Instead of passing a single Azure AD user ID, the action resolves user IDs from a `users.json` file placed in your workflow-definition repo. When the workflow-definition repo differs from the target repo being scanned (a cross-repo workflow), the action automatically fetches `users.json` from the workflow-definition repo via the GitHub API using the `state-repo-token`. When the workflow repo is the same as the target repo, the file is read from the local filesystem (`$GITHUB_WORKSPACE`).
The file maps commit author emails to Azure AD user IDs:
```json
{
  "users": [
    { "email": "alice@contoso.com", "userId": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa" },
    { "email": "bob@contoso.com", "userId": "bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb" }
  ],
  "defaultUserId": "00000000-0000-0000-0000-000000000000"
}
```

For each commit author, the action checks the email against the `users` array. If a match is found, that user ID is used; otherwise the `defaultUserId` is used. The chosen value is logged for every commit.
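The lookup described above can be sketched in a few lines of TypeScript. This is only an illustration, not the action's actual code: the `UsersFile` type and `resolveUserId` function are hypothetical names, the case-insensitive email comparison is an assumption, and the optional Microsoft Graph lookup is left out.

```typescript
// Hypothetical types mirroring the users.json layout shown above.
interface UserMapping {
  email: string;
  userId: string;
}

interface UsersFile {
  users: UserMapping[];
  defaultUserId: string;
}

// Returns the mapped user ID for a commit author, or defaultUserId when no
// entry matches. Assumption: emails are compared case-insensitively.
function resolveUserId(authorEmail: string, usersFile: UsersFile): string {
  const needle = authorEmail.trim().toLowerCase();
  const match = usersFile.users.find(
    (u) => u.email.trim().toLowerCase() === needle
  );
  return match ? match.userId : usersFile.defaultUserId;
}

const sampleUsers: UsersFile = {
  users: [
    { email: "alice@contoso.com", userId: "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa" },
  ],
  defaultUserId: "00000000-0000-0000-0000-000000000000",
};

console.log(resolveUserId("alice@contoso.com", sampleUsers));
console.log(resolveUserId("unknown@contoso.com", sampleUsers));
```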
| Input | Description | Required | Default |
|---|---|---|---|
| `client-id` | Azure AD application client ID | Yes | - |
| `client-certificate` | PEM containing private key + certificate for certificate-based auth. If omitted, uses GitHub OIDC federated credentials. | No | - |
| `client-secret` | Azure AD application client secret for secret-based auth. If omitted, uses GitHub OIDC federated credentials. | No | - |
| `tenant-id` | Azure AD tenant ID | Yes | - |
| `users-json-path` | Path to `users.json` in the workflow-definition repo (relative to repo root). In cross-repo workflows the file is fetched via the GitHub API using `state-repo-token`. | No | `users.json` |
| `purview-account-name` | Name of the Purview account | No | - |
| `purview-endpoint` | Purview API endpoint URL | No | `https://graph.microsoft.com/v1.0` |
| `file-patterns` | Comma-separated file patterns to scan | No | `**` |
| `exclude-patterns` | Comma-separated file patterns to exclude from scanning | No | `**/.git/**` |
| `max-file-size` | Maximum file size in bytes | No | `10485760` (10MB) |
| `debug` | Enable debug logging | No | `false` |
| `state-repo-branch` | Branch in the workflow-definition repo where the state marker is written | No | repo default branch |
| `state-repo-token` | Token with `contents:write` access to the workflow-definition repo. Used for first-run state tracking and for fetching `users.json` in cross-repo workflows. | No | empty |
Patterns are standard glob patterns and should use / as the path separator.
Scan only specific extensions:

```yaml
with:
  file-patterns: "**/*.md,**/*.yml,**/*.yaml,**/*.json"
```

Scan a single folder (and everything under it):

```yaml
with:
  file-patterns: "src/**"
```

Exclude common folders (even if included by `file-patterns`):

```yaml
with:
  file-patterns: "**/*"
  exclude-patterns: "**/node_modules/**,**/dist/**,**/build/**,**/.git/**"
```

Exclude a specific folder and file type:

```yaml
with:
  file-patterns: "**/*"
  exclude-patterns: ".github/**,**/*.lock"
```

When `state-repo-token` is provided, the action stores a marker file (`.purview/state/<owner>-<repo>.json`) in the workflow-definition repo. On the first run it performs a full repository scan; subsequent runs only process changed files. The scanned repository only needs `contents: read`; the action never writes files back into it.
If state repo tracking is not configured, the action queries the repo's workflow history to check if it has been run before. If the action has not been run before, or if previous runs have all failed, it will perform a full scan.
You can trigger a complete repository scan by running the workflow manually via workflow_dispatch. This is useful when:
- You want to re-scan all files after updating Purview policies
- You need to ensure full compliance after security changes
- You're troubleshooting issues and want to reprocess everything
Simply add workflow_dispatch to your workflow triggers and run it manually from the GitHub Actions tab:
```yaml
on:
  push:
    branches: [main]
  pull_request:
  workflow_dispatch: # Enables manual triggering for full scans
```

When triggered via `workflow_dispatch`, the action will automatically perform a full repository scan regardless of state tracking.
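The rules for choosing between a full and an incremental scan can be summarized as a small decision function. The sketch below is an illustration under stated assumptions, not the action's implementation: `ScanContext`, its fields, and `decideScanMode` are hypothetical names that stand in for the trigger event, the state marker check, and the workflow-history check described above.

```typescript
type ScanMode = "full" | "incremental";

// Hypothetical inputs to the decision; none of these names come from the action.
interface ScanContext {
  // Name of the triggering event, e.g. the value of GITHUB_EVENT_NAME.
  eventName: string;
  // Defined only when state repo tracking is configured (state-repo-token set).
  stateMarkerExists?: boolean;
  // Used when state tracking is not configured: did any earlier run succeed?
  hasSuccessfulPriorRun?: boolean;
}

function decideScanMode(ctx: ScanContext): ScanMode {
  // Manual runs always trigger a full scan, regardless of state tracking.
  if (ctx.eventName === "workflow_dispatch") {
    return "full";
  }
  // With state tracking, a missing marker means this is the first run.
  if (ctx.stateMarkerExists !== undefined) {
    return ctx.stateMarkerExists ? "incremental" : "full";
  }
  // Without state tracking, fall back to workflow history: no successful
  // earlier run (never run, or all runs failed) means a full scan.
  return ctx.hasSuccessfulPriorRun ? "incremental" : "full";
}

console.log(decideScanMode({ eventName: "push", stateMarkerExists: false }));
```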
| Output | Description |
|---|---|
| `processed-files` | Number of files successfully processed |
| `failed-requests` | Number of files that failed processing |
| `blocked-files` | JSON array of file paths that were blocked by data security policies |
The action follows a modular architecture with clear separation of concerns:
- Authentication Service: Handles OIDC token exchange, certificate-based, and client-secret authentication via MSAL, with token caching and refresh
- File Processor: Manages file discovery, content extraction, binary detection, and diff computation (LCS-based)
- Purview Client: Implements API communication with retry logic and exponential backoff for processContent, processContentAsync, contentActivities, and protection scope endpoints
- Payload Builder: Constructs optimized payloads with chunking (content ≤ 3 MB, request ≤ 3.7 MB, max 64 items per batch)
- Full Scan Service: Orchestrates first-run full repository scans including state tracking, tenant/user protection scope resolution, and commit processing
- Block Detector: Identifies `blockAccess` and `restrict → block` policy actions from processContent responses
- PR Comment Service: Posts PR review comments listing blocked files when data security policies trigger block actions
- User Resolver: Maps commit author emails to Azure AD user IDs via `users.json` mappings and Microsoft Graph API lookups with caching
- State Service: Manages first-run state markers (`.purview/state/<owner>-<repo>.json`) in the workflow-definition repository
- Retry Handler: Provides exponential backoff retry strategy with jitter for transient failures (429, 5xx, network errors)
- Logger: Provides structured logging with sensitive data redaction
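To make the Payload Builder limits concrete, here is a minimal sketch of greedy batching under the stated caps (request ≤ 3.7 MB, max 64 items per batch). It is not the action's code: `Item`, `BatchLimits`, and `buildBatches` are hypothetical names, 1 MB is assumed to be 1024 × 1024 bytes, items are assumed to be already split to the 3 MB content limit, and request envelope overhead is ignored.

```typescript
// Hypothetical shape of a unit of content waiting to be sent.
interface Item {
  path: string;
  sizeBytes: number;
}

interface BatchLimits {
  maxItems: number;
  maxRequestBytes: number;
}

const MB = 1024 * 1024; // Assumption: binary megabytes.
const DEFAULT_LIMITS: BatchLimits = {
  maxItems: 64,
  maxRequestBytes: Math.floor(3.7 * MB),
};

// Greedily packs items into batches, starting a new batch whenever adding
// the next item would exceed the item-count or byte limit.
function buildBatches(items: Item[], limits: BatchLimits = DEFAULT_LIMITS): Item[][] {
  const batches: Item[][] = [];
  let current: Item[] = [];
  let currentBytes = 0;

  for (const item of items) {
    const wouldOverflow =
      current.length >= limits.maxItems ||
      currentBytes + item.sizeBytes > limits.maxRequestBytes;
    if (current.length > 0 && wouldOverflow) {
      batches.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(item);
    currentBytes += item.sizeBytes;
  }
  if (current.length > 0) {
    batches.push(current);
  }
  return batches;
}
```

Greedy packing preserves input order, which keeps files from the same commit together, but it does not minimize the number of requests.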
- All authentication tokens are handled securely and never logged
- Sensitive data is automatically redacted from error messages
- API communications use TLS and follow zero-trust principles
- File contents are validated before processing
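As an illustration of what redaction in error messages can look like, the sketch below masks bearer tokens and a few secret-bearing `key=value` pairs before a message is logged. The patterns and the `redact` function are assumptions chosen for this example; the action's actual redaction rules are not documented here.

```typescript
// Hypothetical redaction helper; the patterns are illustrative only.
function redact(message: string): string {
  return message
    // Mask the credential part of "Bearer <token>" headers.
    .replace(/Bearer\s+[A-Za-z0-9\-._~+\/]+=*/g, "Bearer [REDACTED]")
    // Mask values of a few well-known secret-carrying keys.
    .replace(/\b(client_secret|password|token)=[^&\s]+/gi, "$1=[REDACTED]");
}

console.log(redact("Request failed: Authorization: Bearer abc.def-123"));
```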
The action implements comprehensive error handling:
- Network failures trigger automatic retries with exponential backoff
- Rate limiting is respected with proper delay handling
- File processing errors are isolated and don't stop the entire scan
- All errors include actionable context for debugging
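The retry behavior can be sketched as follows. This is a generic exponential-backoff-with-jitter pattern, not the action's Retry Handler: the function names, the base and maximum delays, the attempt count, the "full jitter" variant, and the convention that an error without a `status` field is a network error are all assumptions. Honoring a server-provided delay on rate limiting is not covered.

```typescript
// Treats 429, 5xx, and errors without an HTTP status (assumed to be
// network errors) as transient.
function isTransient(status: number | undefined): boolean {
  return status === undefined || status === 429 || (status >= 500 && status <= 599);
}

// Exponential backoff with full jitter: a random delay in [0, min(max, base * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs: number = 1000,
  maxMs: number = 30000,
  random: () => number = Math.random
): number {
  const exponential = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * exponential);
}

// Runs op, retrying transient failures up to maxAttempts total attempts.
async function withRetry<T>(
  op: () => Promise<T>,
  maxAttempts: number = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      const status = (err as { status?: number } | null)?.status;
      if (attempt + 1 >= maxAttempts || !isTransient(status)) {
        throw err;
      }
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

Injecting `random` and `sleep` keeps the delay calculation deterministic in tests.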
```bash
# Install dependencies
npm install
# Build TypeScript
npm run build
# Package for distribution
npm run package
# Run tests
npm run test
# Lint code
npm run lint
```

- Ensure the Azure AD app has proper federated credentials or a valid client certificate
- Verify OIDC permissions (`id-token: write`) are granted in the workflow
- Check that the `users.json` file exists and has a valid `defaultUserId`
- Verify the Purview endpoint URL is correct
- Ensure the service principal has proper Purview permissions
- Check for rate limiting in debug logs
- Review file patterns match your repository structure
- Check file size limits for large files
- Ensure files are UTF-8 encoded
- Binary files (images, executables, etc.) are automatically detected and skipped
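When a file is unexpectedly skipped, it can help to know how binary detection typically works. The sketch below uses a common heuristic (a NUL byte within the first few thousand bytes marks the file as binary) together with the `max-file-size` default. It is an assumption for illustration; the action's real detection logic may differ, and `looksBinary` and `shouldProcess` are hypothetical names.

```typescript
// Heuristic: text files almost never contain NUL bytes, so a NUL within the
// sampled prefix is taken as a sign of binary content.
function looksBinary(bytes: Uint8Array, sampleSize: number = 8000): boolean {
  const limit = Math.min(bytes.length, sampleSize);
  for (let i = 0; i < limit; i++) {
    if (bytes[i] === 0) {
      return true;
    }
  }
  return false;
}

// Combines the size limit (default 10485760 bytes, as in max-file-size)
// with the binary check.
function shouldProcess(bytes: Uint8Array, maxFileSize: number = 10485760): boolean {
  return bytes.length <= maxFileSize && !looksBinary(bytes);
}

console.log(shouldProcess(new Uint8Array([104, 105, 10])));
```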
This project is licensed under the MIT License.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.
When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repositories using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.