What is Data Processing?
Data Processing is the bridge between raw extracted data and polished, actionable output. When you scrape a website or pull data from an API, the result is rarely ready to use directly. Duplicates, inconsistent formats, missing fields, and irrelevant rows are common. Data Processing gives you the tools to clean, transform, and enrich that data — all within the same automation pipeline, without needing a separate ETL tool or spreadsheet.
Autonoly offers two approaches that work together: no-code transforms for common operations like filtering and deduplication, and full Python execution for custom logic, statistical analysis, and machine learning. Both run in secure, isolated cloud environments and integrate seamlessly with every other Autonoly feature.
Why Process Data Inside Your Automation?
Many teams extract data with one tool, clean it in a spreadsheet, and then manually upload it somewhere else. This creates manual steps, introduces errors, and doesn't scale. By processing data inside the automation pipeline, you get a fully hands-off workflow from extraction to delivery.
No-Code Transforms
For the most common data operations, no code is needed. Autonoly provides built-in transforms that you can apply through the AI Agent Chat or the Visual Workflow Builder:
Deduplication
Remove duplicate rows based on one or more key fields. Useful when scraping overlapping pages, merging data from multiple sources, or cleaning up datasets where items appear more than once.
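As a minimal sketch of key-based deduplication with pandas (the field names here are illustrative, not a fixed schema):

```python
import pandas as pd

# Rows scraped from overlapping pages; "sku" is the key field (illustrative).
rows = pd.DataFrame([
    {"sku": "A1", "name": "Widget", "price": 9.99},
    {"sku": "B2", "name": "Gadget", "price": 14.50},
    {"sku": "A1", "name": "Widget", "price": 9.99},  # duplicate row
])

# Keep the first occurrence of each key; pass several fields for a composite key.
deduped = rows.drop_duplicates(subset=["sku"], keep="first").reset_index(drop=True)
```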
Filtering and Sorting
Keep only the rows that match your criteria — filter by price range, date, status, keyword presence, or any custom condition. Sort results by any field in ascending or descending order.
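A filter-then-sort step can be sketched in pandas like this (column names and thresholds are illustrative):

```python
import pandas as pd

products = pd.DataFrame([
    {"name": "Desk", "price": 120.0, "status": "in_stock"},
    {"name": "Lamp", "price": 35.0, "status": "sold_out"},
    {"name": "Chair", "price": 75.0, "status": "in_stock"},
])

# Keep in-stock items within a price range, then sort by price descending.
result = (
    products[products["price"].between(50, 200) & (products["status"] == "in_stock")]
    .sort_values("price", ascending=False)
    .reset_index(drop=True)
)
```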
Format Conversion
Standardize messy data:
Dates — convert between formats (MM/DD/YYYY to ISO 8601, relative dates like "2 days ago" to absolute)
Currencies — normalize currency symbols, convert between formats
Phone numbers — standardize to international format
Text — trim whitespace, fix capitalization, remove HTML tags
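The date and text conversions above can be sketched with the standard library alone (the reference date and patterns are assumptions for the example):

```python
import re
from datetime import datetime, timedelta

def to_iso(date_str, today=None):
    """Convert MM/DD/YYYY or relative dates like '2 days ago' to ISO 8601."""
    today = today or datetime(2024, 6, 15)  # fixed reference date for the example
    m = re.match(r"(\d+) days? ago", date_str)
    if m:
        return (today - timedelta(days=int(m.group(1)))).strftime("%Y-%m-%d")
    return datetime.strptime(date_str, "%m/%d/%Y").strftime("%Y-%m-%d")

def clean_text(raw):
    """Trim whitespace, remove HTML tags, fix capitalization."""
    no_tags = re.sub(r"<[^>]+>", "", raw)
    return no_tags.strip().capitalize()
```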
Text Manipulation
Apply regex patterns, split strings into fields, join multiple values, and use templates to construct new fields from existing data. This is particularly useful when extracted data needs restructuring before it reaches its destination.
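A short sketch of splitting, regex extraction, and template-based field construction (the record layout is illustrative):

```python
import re

record = {"full_name": "Doe, Jane", "phone_raw": "tel: 555-0100 ext. 12"}

# Split "Last, First" into separate fields.
last, first = [part.strip() for part in record["full_name"].split(",")]

# Apply a regex to pull the number out of a noisy string.
phone = re.search(r"\d{3}-\d{4}", record["phone_raw"]).group()

# Use a template to construct a new field from existing data.
display = f"{first} {last} <{phone}>"
```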
JSON Parsing and Restructuring
When working with API responses or complex nested data, you can parse JSON structures, extract specific nested fields, and flatten hierarchies into tabular formats suitable for spreadsheets and databases.
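Flattening a nested API response into tabular rows might look like this (the payload shape is illustrative):

```python
import json

# A nested API-style response (structure is illustrative).
payload = json.loads("""
{"orders": [
  {"id": 1, "customer": {"name": "Ada", "city": "London"}, "total": 42.0},
  {"id": 2, "customer": {"name": "Lin", "city": "Taipei"}, "total": 17.5}
]}
""")

# Flatten each nested order into a flat row suitable for a spreadsheet.
rows = [
    {
        "id": o["id"],
        "customer_name": o["customer"]["name"],
        "customer_city": o["customer"]["city"],
        "total": o["total"],
    }
    for o in payload["orders"]
]
```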
Combine no-code transforms with Data Extraction to build complete scrape-and-clean pipelines.
Python Execution
When built-in transforms aren't enough, switch to Python. Autonoly provides a full Python 3 environment with popular libraries pre-installed:
pandas — dataframe operations, groupby, pivot tables, merges
numpy — numerical computation, statistical functions
requests — make HTTP calls to external APIs for data enrichment
scikit-learn — machine learning, clustering, classification
BeautifulSoup — additional HTML parsing if needed
You can also install any package with pip at runtime. Need a specialized library for geocoding, NLP, or financial calculations? Just include the pip install in your script.
How Python Scripts Work
- Your script receives input data from the previous step (extracted data, API response, or file contents)
- You process it using any Python logic — from a three-line dedup to a 200-line ML pipeline
- The script outputs results that flow to the next step in the workflow
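Conceptually, the three steps above amount to a single function: data in, transformed data out. The function name and dedup key below are illustrative, not Autonoly's actual script interface:

```python
def process(input_data):
    """Receives rows from the previous step, returns rows for the next step."""
    seen, output = set(), []
    for row in input_data:
        key = row["email"]  # dedup key (illustrative)
        if key not in seen:
            seen.add(key)
            output.append(row)
    return output

result = process([
    {"email": "a@x.com", "name": "A"},
    {"email": "a@x.com", "name": "A duplicate"},
    {"email": "b@x.com", "name": "B"},
])
```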
This runs in a secure, isolated environment. Your scripts can't affect other users or access anything outside the designated input and output channels.
Common Python Use Cases
Custom scoring models — score leads, rank products, or classify items using business-specific logic
Statistical analysis — calculate averages, medians, standard deviations, correlations across extracted datasets
Data enrichment — call external APIs to add geocoding, company info, or market data to your records
Machine learning — run classification, clustering, or prediction models on collected data
Custom formatting — generate complex reports, build structured outputs, or prepare data for specific downstream systems
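A custom scoring model of the kind described above is just ordinary Python; the weights and field names here are invented for illustration:

```python
def score_lead(lead):
    """Business-specific scoring logic (weights are illustrative)."""
    score = 0
    score += 30 if lead.get("company_size", 0) >= 50 else 10
    score += 25 if lead.get("title", "").lower() in {"cto", "vp engineering"} else 0
    score += 20 if lead.get("visited_pricing_page") else 0
    return score

leads = [
    {"name": "Ada", "company_size": 200, "title": "CTO", "visited_pricing_page": True},
    {"name": "Bob", "company_size": 5, "title": "Intern", "visited_pricing_page": False},
]
ranked = sorted(leads, key=score_lead, reverse=True)
```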
Building ETL Pipelines
Data Processing is most powerful when chained with other steps to create full ETL (Extract, Transform, Load) pipelines. Here's a real example:
- Extract — Browser Automation visits 50 competitor websites and Data Extraction scrapes current product prices
- Transform — Data Processing deduplicates the results, calculates average price per product category, and flags items where the price changed more than 10%
- Load — Results push to Google Sheets for the team to review, and a summary alert fires to Slack
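The Transform step in this pipeline could be sketched in pandas along these lines (the data, including the stored previous prices, is illustrative):

```python
import pandas as pd

# Today's scraped prices alongside yesterday's stored prices (illustrative data).
scraped = pd.DataFrame([
    {"product": "Desk", "category": "furniture", "price": 130.0, "prev_price": 115.0},
    {"product": "Desk", "category": "furniture", "price": 130.0, "prev_price": 115.0},  # dup
    {"product": "Lamp", "category": "lighting", "price": 35.0, "prev_price": 34.0},
])

# Deduplicate, average price per category, flag items that moved more than 10%.
clean = scraped.drop_duplicates(subset=["product"])
avg_by_category = clean.groupby("category")["price"].mean()
flagged = clean[(clean["price"] - clean["prev_price"]).abs() / clean["prev_price"] > 0.10]
```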
You design these pipelines visually in the Visual Workflow Builder or let the AI Agent Chat build them from a natural language description.
Variable Passing Between Steps
Each processing step can output data that the next step consumes. This variable passing happens automatically — the output of a Python script becomes the input of the next transform, which feeds into the export step. Use Logic & Flow to add conditional branches (e.g., "if the dataset has more than 1000 rows, split into batches").
Data Validation
Before data reaches its destination, you can add validation rules:
Type checking — ensure numeric fields contain numbers, dates are valid, URLs are properly formatted
Required fields — flag or remove rows with missing critical data
Range constraints — prices must be positive, dates must be in the future, quantities within expected bounds
Custom rules — any validation logic you can express in a Python condition
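Together, these rule types can be expressed as a small validator function; the specific rules below are examples, not a built-in rule set:

```python
def validate(row):
    """Return a list of rule violations for one record (rules are illustrative)."""
    errors = []
    if not row.get("email"):                            # required field
        errors.append("missing email")
    if not isinstance(row.get("price"), (int, float)):  # type check
        errors.append("price is not numeric")
    elif row["price"] <= 0:                             # range constraint
        errors.append("price must be positive")
    return errors

good = {"email": "a@x.com", "price": 9.99}
bad = {"email": "", "price": -1}
```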
Catching data quality issues inside the pipeline prevents bad data from reaching your spreadsheets, databases, or downstream systems.
Explore the templates library for pre-built data processing pipelines, or check the pricing page for processing limits on each plan.
Best Practices
Clean data processing is the backbone of reliable automation. Follow these tips to build robust data pipelines:
Validate data at the entry point, not just at the exit. Most teams add validation before exporting data to its destination. But validating earlier — immediately after extraction or API response — saves processing time and prevents errors from cascading through multiple downstream steps. Add a validation node right after your data source that checks for required fields, expected data types, and reasonable value ranges. Records that fail validation can be routed to a quarantine path via Logic & Flow for manual review.
Deduplicate aggressively, but choose the right key. Deduplication is one of the most common processing operations, but choosing the wrong dedup key produces bad results. For product data, the URL or SKU is usually a better key than the product name (which may vary slightly across sources). For contact data, email is more reliable than name. For listings, combine multiple fields (address + listing date) for a composite key. Our web scraping best practices guide covers deduplication strategies for common data types.
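The listings example above, sketched with a composite key in pandas (data is illustrative):

```python
import pandas as pd

listings = pd.DataFrame([
    {"address": "12 Elm St", "listing_date": "2024-06-01", "price": 500000},
    {"address": "12 Elm St", "listing_date": "2024-06-01", "price": 500000},  # true duplicate
    {"address": "12 Elm St", "listing_date": "2024-07-15", "price": 485000},  # relisted later
])

# Address alone would wrongly drop the relisting; address + listing date keeps it.
deduped = listings.drop_duplicates(subset=["address", "listing_date"])
```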
Use no-code transforms for standard operations; switch to Python only when needed. The built-in no-code transforms (filter, sort, deduplicate, format conversion) are faster to configure, easier to maintain, and less error-prone than custom Python scripts for straightforward operations. Reserve Python for complex logic — statistical analysis, ML inference, custom scoring algorithms, or multi-step transformations that cannot be expressed as simple filter/sort/map operations.
Chain small processing steps rather than building one massive transform. A 200-line Python script that does everything is hard to debug and impossible to partially reuse. Instead, break processing into focused steps: one node deduplicates, the next normalizes dates, the next filters by criteria, and the last calculates derived fields. Each step is independently testable and reusable across different workflows. The Visual Workflow Builder makes this modular approach easy to visualize and manage.
Save intermediate results for debugging. For complex ETL pipelines, add checkpoint saves at key stages. When something goes wrong, you can inspect intermediate datasets to identify exactly where the problem occurred. Read our automate Google Sheets guide for strategies on using Sheets as intermediate checkpoints.
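A checkpoint step can be as simple as writing the current dataset to a named file; this is a generic sketch, not an Autonoly API (the helper name and layout are invented):

```python
import json
import tempfile
from pathlib import Path

def checkpoint(name, rows, directory):
    """Write an intermediate dataset to disk so failures can be inspected later."""
    path = Path(directory)
    path.mkdir(exist_ok=True)
    out = path / f"{name}.json"
    out.write_text(json.dumps(rows, indent=2))
    return out

rows = [{"id": 1, "status": "clean"}]
saved = checkpoint("after_dedup", rows, tempfile.mkdtemp())
```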
Security & Compliance
Data processing often involves the most sensitive step in an automation pipeline — where raw data is transformed, enriched, or combined before reaching its final destination. Autonoly handles this securely at every level.
All data processing runs in isolated execution environments that are destroyed after each run. Python scripts execute in sandboxed containers with no network access except through explicitly configured API calls. This prevents scripts from making unauthorized network connections or accessing data outside the designated input and output channels. Processing results are encrypted at rest (AES-256) and in transit (TLS 1.3), and access is governed by the same role-based permissions that apply to all workspace data.
For teams processing personal data (PII), data processing nodes are the ideal place to anonymize or pseudonymize records before they reach external destinations — hash email addresses, truncate phone numbers, or generalize location data. The execution log captures the processing operations performed without logging actual data values. For comprehensive security details, visit the Security feature page.
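The hash/truncate/generalize pattern described above might look like this in a processing node (the field names and truncation choices are illustrative):

```python
import hashlib

def pseudonymize(record):
    """Hash email, truncate phone, generalize location before export (illustrative)."""
    out = dict(record)
    out["email"] = hashlib.sha256(record["email"].lower().encode()).hexdigest()[:16]
    out["phone"] = record["phone"][:6] + "XXXX"                   # keep area/prefix only
    out["location"] = record["location"].split(",")[-1].strip()   # city -> country
    return out

safe = pseudonymize({"email": "Jane@Example.com", "phone": "555-010-0123",
                     "location": "Lyon, France"})
```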
Common Use Cases
Data processing is the bridge between raw data and actionable intelligence. Here are detailed real-world examples:
Price Comparison and Competitive Analysis
An e-commerce company extracts pricing data from 10 competitor sites using Data Extraction. Raw data comes in inconsistent formats: some prices include tax, some use different currencies, some include shipping. A processing pipeline normalizes all prices to a common format, deduplicates products across sites, calculates average prices, and flags items where the company's price exceeds the market average by more than 15%. Results push to Google Sheets with conditional formatting. See our ecommerce price monitoring guide for a detailed approach.
Lead Data Enrichment and Scoring
A sales team collects leads from multiple sources: Data Extraction from business directories, API responses from enrichment services, and CSV imports. Data processing merges these sources, deduplicates by email address, fills missing fields by combining data across sources, and calculates a lead score using a Python script with pandas. Scored leads push to Airtable for the sales team. Learn more in our automating lead generation guide.
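The merge-and-fill step of this use case can be sketched with pandas (sources, columns, and values are illustrative):

```python
import pandas as pd

directory = pd.DataFrame([
    {"email": "ada@acme.com", "name": "Ada", "company": None},
])
enrichment = pd.DataFrame([
    {"email": "ada@acme.com", "company": "Acme Corp", "employees": 120},
])

# Merge the sources on email, then fill missing fields from the enrichment side.
merged = directory.merge(enrichment, on="email", how="left", suffixes=("", "_enriched"))
merged["company"] = merged["company"].fillna(merged["company_enriched"])
leads = merged.drop(columns=["company_enriched"]).drop_duplicates(subset=["email"])
```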
Survey Response Analysis
A research team collects survey responses via API requests. Data processing cleans responses (trimming whitespace, standardizing dates, handling nulls), filters incomplete submissions, and aggregates by demographic group. Python scripts calculate statistical measures. AI Content classification tags open-ended responses by theme and sentiment. The dataset exports to a multi-tab Excel file and uploads to Google Drive.
ETL Pipeline for Database Migration
A company migrates data from a legacy system to a new platform. Browser Automation extracts data from the legacy web interface (which has no API), and data processing handles transformation: mapping old field names to new schema fields, converting date formats, normalizing addresses, and splitting combined fields. Records that fail validation are quarantined for manual review. Successfully processed records upload to the new system via API requests. The migration runs in batches with checkpointing for safe resumption.