SciDataCopilot: An Agentic Data Preparation Framework for AGI-driven Scientific Discovery

Shanghai Artificial Intelligence Laboratory
[Figure 1: SciDataCopilot pipeline]

Figure 1. The Scientific AI-Ready data paradigm is designed to meet the fundamental requirements of scientific discovery. Under this paradigm, raw data are not only machine-readable but are explicitly structured to support task customization and expert guidance, enabling autonomous workflows. Instantiating this paradigm, we propose SciDataCopilot, a staged agentic framework that transforms heterogeneous raw data into Scientific AI-Ready data that can be directly consumed by scientific analyses.

Abstract

The current landscape of AI for Science (AI4S) is predominantly anchored in large-scale textual corpora, where generative AI systems excel at hypothesis generation, literature search, and multi-modal reasoning. However, a critical bottleneck for accelerating closed-loop scientific discovery remains the utilization of raw experimental data. Characterized by extreme heterogeneity, high specificity, and deep domain-expertise requirements, raw data possess neither direct semantic alignment with linguistic representations nor the structural homogeneity needed for a unified embedding space. This disconnect prevents the emerging class of Artificial General Intelligence for Science (AGI4S) from effectively interfacing with the physical reality of experimentation. In this work, we extend the text-centric AI-Ready concept to a Scientific AI-Ready paradigm, explicitly formalizing how scientific data are specified, structured, and composed within a computational workflow. To operationalize this, we propose SciDataCopilot, an autonomous agentic framework designed to handle data access, scientific intent parsing, and multi-modal integration in an end-to-end manner. By positioning data readiness as a core operational primitive, the framework provides a principled foundation for reusable, transferable systems, enabling the transition toward experiment-driven scientific general intelligence. Extensive evaluations across three heterogeneous scientific domains show that SciDataCopilot improves efficiency, scalability, and consistency over manual pipelines, with up to a 30× speedup in data preparation.

Scientific AI-Ready Data Paradigm

🎯 Task-conditioned principle: The Scientific AI-Ready data paradigm adopts scientific tasks as the primary organizing principle, translating scientific intent into the required data units, variables, and constraints. This shifts task-conditioned data specification from inefficient manual collection to automated, reusable workflows.

⛓️ Downstream compatibility: The Scientific AI-Ready data paradigm prioritizes direct compatibility with downstream scientific analysis, ensuring that prepared data satisfy model-specific input constraints and enabling composable, executable workflows beyond standalone inference.

🧠 Cross-integration ability: The Scientific AI-Ready data paradigm emphasizes principled cross-modal and cross-disciplinary alignment, enabling systematic data association, retrieval, alignment, and composition across heterogeneous scientific domains.

Framework

[Figure 2: SciDataCopilot framework]

Figure 2. Architecture of SciDataCopilot. The framework integrates four collaborative agents (Data Access, Intent Parsing, Data Processing, and Data Integration) to autonomously align user intents with complex data resources. Ultimately, SciDataCopilot bridges heterogeneous scientific data with specific models, re-defining task-guided data customization and cross-disciplinary integration to empower diverse scientific research tasks.

01 · Data Access Agent

  • Data perception
  • Knowledge base construction K={D,T,C}
  • Reproducible metadata grounding

02 · Intent Parsing Agent

  • Requirement analysis
  • Case retrieval adaptation
  • Plan generation and verification

03 · Data Processing Agent

  • Plan checking
  • Self-repairing execution loop
  • Scientific report trace generation

04 · Data Integration Agent

  • Integration strategy analysis
  • Integration pipeline generation
  • Data unit integration
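
To make the staged flow concrete, the following is a minimal orchestration sketch of how the four agents hand results to one another. It is purely illustrative: the class names, method names, and data structures are hypothetical and do not correspond to SciDataCopilot's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class DataUnit:
    """A prepared, schema-consistent piece of data together with its provenance."""
    name: str
    payload: Any
    provenance: dict[str, Any] = field(default_factory=dict)


class DataAccessAgent:
    def build_knowledge_base(self, sources: list[str]) -> dict[str, Any]:
        # Perceive the raw sources and ground their metadata reproducibly.
        return {src: {"metadata": f"parsed metadata for {src}"} for src in sources}


class IntentParsingAgent:
    def parse(self, instruction: str, kb: dict[str, Any]) -> list[dict[str, Any]]:
        # Translate the scientific intent into a step-wise, verifiable plan.
        return [
            {"step": 1, "action": "load", "targets": list(kb)},
            {"step": 2, "action": "transform", "instruction": instruction},
        ]


class DataProcessingAgent:
    def execute(self, plan: list[dict[str, Any]]) -> list[DataUnit]:
        # Check each plan step, run it, and (in the real system) retry or repair on failure.
        return [
            DataUnit(name=f"step_{s['step']}_output", payload=None, provenance={"plan_step": s})
            for s in plan
        ]


class DataIntegrationAgent:
    def integrate(self, units: list[DataUnit]) -> DataUnit:
        # Combine the per-step outputs into a single AI-ready deliverable.
        return DataUnit(
            name="integrated_output",
            payload=[u.payload for u in units],
            provenance={"sources": [u.name for u in units]},
        )


def run_pipeline(instruction: str, sources: list[str]) -> DataUnit:
    kb = DataAccessAgent().build_knowledge_base(sources)
    plan = IntentParsingAgent().parse(instruction, kb)
    units = DataProcessingAgent().execute(plan)
    return DataIntegrationAgent().integrate(units)
```

Calling run_pipeline("compute daily averages", ["station_A.csv"]) would walk a request through all four stages; in the real framework, these stubs are replaced by LLM-driven reasoning and actual execution with self-repair.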

Use Cases

Representative SciDataCopilot use cases are shown below.

INSTRUCTION

Download all enzyme catalysis data, including enzyme sequences and substrate–product reaction information.

Statistics

Basic dataset statistics (overall):

Total records: 214,104
Unique enzymes: 168,576
Unique substrates: 4,485
Unique products: 5,754
Unique RHEA IDs: 9,221
Unique organisms: 5,893
Unique enzyme types: 10

Example record
UniProt accession: Q96IZ6
Entry name: MET2A_HUMAN
Protein: tRNA N(3)-cytidine methyltransferase METTL2A
Sequence length: 378
Sequence: MAGSYPEGAPAVLADKRQQFGSRFLRDPARVFHHNAWDNVEWSEEQAAAAERKVQENSIQRVCQEKQVDYEINAHKYWNDFYKIHENGFFKDRHWLFTEFPELAPSQNQNHLKDWFLENKSEVPECRNNEDGPGLIMEEQHKCSSKSLEHKTQTLPVEENVTQKISDLEICADEFPGSSATYRILEVGCGVGNTVFPILQTNNDPGLFVYCCDFSSTAIELVQTNSEYDPSRCFAFVHDLCDEEKSYPVPKGSLDIIILIFVLSAIVPDKMQKAINRLSRLLKPGGMMLLRDYGRYDMAQLRFKKGQCLSGNFYVRGDGTRVYFFTQEELDTLFTTAGLEKVQNLVDRRLQVNRGKQLTMYRVWIQCKYCKPLLSSTS
Reaction: RHEA:50960
cytidine(32) in tRNA(Thr) + S-adenosyl-L-methionine = N(3)-methylcytidine(32) in tRNA(Thr) + S-adenosyl-L-homocysteine + H(+)
Substrate: S-adenosyl-L-methionine (CHEBI:59789)
Product: S-adenosyl-L-homocysteine (CHEBI:57856)
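
As an illustration of the kind of source such an instruction draws on, the sketch below retrieves the example entry above from the public UniProt REST API and extracts its catalytic-activity annotation (Rhea reaction and ChEBI participants). This is a hand-written fragment, not SciDataCopilot's Data Access Agent, and the JSON field names (comments, commentType, reactionCrossReferences) should be verified against the live UniProt schema.

```python
import requests

# Fetch the example UniProtKB entry above as JSON from the public UniProt REST API.
url = "https://rest.uniprot.org/uniprotkb/Q96IZ6.json"
entry = requests.get(url, timeout=30).json()

sequence = entry.get("sequence", {}).get("value", "")
print("Sequence length:", len(sequence))

# Catalytic-activity comments carry the reaction string plus Rhea/ChEBI cross-references.
for comment in entry.get("comments", []):
    if comment.get("commentType") != "CATALYTIC ACTIVITY":
        continue
    reaction = comment.get("reaction", {})
    print("Reaction:", reaction.get("name", ""))
    for xref in reaction.get("reactionCrossReferences", []):
        print(f"  {xref.get('database')}: {xref.get('id')}")
```

Scaling this to the full corpus summarized above would replace the single accession with a paginated UniProt search query; that part is omitted here.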
INSTRUCTION

Using the MNE fieldtrip_cmc MEG dataset, perform Spatio-Spectral Decomposition (SSD) by resampling the raw data to 250 Hz and cropping it to a 60-second window (50–110s). Fit the SSD estimator using OAS regularization and zero-phase FIR filters, defining the signal band at 9–12 Hz and the noise band at 8–13 Hz (both with 1 Hz transition bandwidths) to isolate alpha oscillations from 1/f background activity. Validate the results by inverting the spatial filters to plot patterns, transforming the data to source space, and comparing the Welch PSD of the maximum and minimum SNR components to verify oscillatory enhancement.

Plan
Data preparation
  1. Acquire the MNE fieldtrip_cmc raw MEG dataset, mark bad channels, and keep a single MEG sensor type (the CTF gradiometers) to avoid sensor-type scaling bias.
  2. Resample the continuous raw data to 250 Hz with anti-aliasing.
  3. Optionally apply artifact mitigation (SSP/ICA for EOG/ECG, SSS/Maxwell if available) and a 50/60 Hz notch.
  4. Define zero-phase FIR filters with 1 Hz transitions and adequate padding for 250 Hz.
Band definition & filtering
  1. Set the SSD signal band to 9–12 Hz (zero-phase FIR, 1 Hz transitions).
  2. Define flank noise bands as 8–9 Hz and 12–13 Hz (separate zero-phase FIR filters).
  3. Apply signal + flank filters to continuous data before cropping, using adequate padding.
  4. Crop filtered datasets to the 60 s window (50–110 s) with identical spans and sensors.
SSD estimation
  1. Compute signal-band covariance from 9–12 Hz data.
  2. Compute flank-band covariances and combine them into the noise covariance.
  3. Apply OAS regularization to both covariances.
  4. Solve the generalized eigenvalue problem to obtain filters W and SNR scores.
Validation & reporting
  1. Prepare broadband data (same sensors and window; optional 1–45 Hz bandpass).
  2. Compute broadband covariance C_data for interpretable patterns.
  3. Derive spatial patterns A = C_data × W and visualize the top-ranked components (see the numerical sketch after this plan).
  4. Project to SSD source space and compute Welch PSDs (≤ 0.5 Hz resolution).
  5. Compare max/min SNR components to confirm alpha enhancement.
  6. Quantify alpha SNR ratios and save filters, patterns, PSDs, and reports.
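
The covariance-level core of this plan (OAS-regularized covariances, the generalized eigenvalue problem, and patterns A = C_data × W) can be sketched as follows. This is a minimal illustration assuming scikit-learn's OAS estimator and SciPy's generalized eigensolver; the function and variable names are hypothetical, and band filtering is assumed to have been done beforehand.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.covariance import OAS


def ssd_from_band_data(x_signal, x_noise, x_broadband):
    """Minimal SSD estimation from arrays of shape (n_channels, n_times).

    x_signal / x_noise are assumed to be already filtered to the 9-12 Hz signal
    band and the 8-9 / 12-13 Hz flank bands, respectively.
    """
    # OAS-regularized covariances (sklearn expects samples in rows, hence the transpose).
    c_s = OAS().fit(x_signal.T).covariance_
    c_n = OAS().fit(x_noise.T).covariance_

    # Generalized eigenvalue problem C_s w = lambda C_n w; eigh returns ascending eigenvalues.
    eigvals, eigvecs = eigh(c_s, c_n)
    order = np.argsort(eigvals)[::-1]              # highest SNR first
    snr, filters = eigvals[order], eigvecs[:, order]

    # Spatial patterns from the broadband covariance, A = C_data W (as in step 3 above).
    c_data = OAS().fit(x_broadband.T).covariance_
    patterns = c_data @ filters
    return filters, patterns, snr
```
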
Analysis
Summary of execution
  • Loads the MNE fieldtrip_cmc dataset and keeps the CTF axial-gradiometer set (MNE uses meg="mag" for this sensor group).
  • Resamples to 250 Hz, applies zero-phase notch at 50/100 Hz, and band-pass filters for SSD signal (9–12 Hz) and flanks (8–9, 12–13 Hz).
  • Crops to 50–110 s, builds OAS-regularized covariances, and solves Cs w = λ Cn w for the SSD filters and SNR scores.
  • Computes patterns A = C_data W from the broadband covariance, then projects the data to SSD source space and compares Welch PSDs.
  • Saves figures and JSON report with metrics and topographies.
Scientific interpretation
  • SSD maximizes 9–12 Hz power relative to adjacent 8–9 and 12–13 Hz bands.
  • Expected spectrum: pronounced 10–11 Hz peak for max-SNR, flat spectrum for min-SNR.
  • Patterns A are intended for neurophysiological interpretation; spatial plausibility is a key check.
  • Validation emphasizes rank-order consistency, alpha enhancement, and spatial plausibility.
Method notes
  • Channel picking uses pick_meg_grad with meg="mag".
  • OAS is appropriate for covariance estimation.
  • Zero-phase FIR prevents temporal shifts; flank concatenation is standard practice.
  • Cholesky whitening and pattern scaling are acceptable.
  • PSD settings are reasonable for ≤ 0.5 Hz resolution.
Outputs to review
  • reports/ssd_analysis_report.json
  • figures/ssd_psd_max_min.png
  • figures/ssd_top_patterns.png
Potential improvements
  • Evaluate broader signal bands or time-window shifts.
  • Test alternative pattern scaling.
  • Consider beta-band CMC adjustments if needed.

The implementation meets the requirement; use the report and figures to confirm SSD characteristics.
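
For reference, a condensed MNE-Python sketch of the workflow above, based on the mne.decoding.SSD estimator. Preprocessing details (bad channels, notch filtering, figure/report export) are omitted, and the channel selection and PSD settings are simplified assumptions rather than the exact executed pipeline.

```python
import numpy as np
import mne
from mne.datasets.fieldtrip_cmc import data_path
from mne.decoding import SSD
from scipy.signal import welch

# Load the fieldtrip_cmc CTF recording, keep the MEG channels, resample, and crop.
raw = mne.io.read_raw_ctf(data_path() / "SubjectCMC.ds", preload=True)
raw.pick("meg").resample(250).crop(50.0, 110.0)

# Signal band 9-12 Hz and noise band 8-13 Hz with 1 Hz transitions (zero-phase FIR defaults).
filt_signal = dict(l_freq=9.0, h_freq=12.0, l_trans_bandwidth=1.0, h_trans_bandwidth=1.0)
filt_noise = dict(l_freq=8.0, h_freq=13.0, l_trans_bandwidth=1.0, h_trans_bandwidth=1.0)

ssd = SSD(info=raw.info, reg="oas",
          filt_params_signal=filt_signal, filt_params_noise=filt_noise)
sources = ssd.fit_transform(raw.get_data())        # project the data to SSD source space
spec_ratio, _ = ssd.get_spectral_ratio(sources)    # per-component signal/noise power ratio
i_max, i_min = int(np.argmax(spec_ratio)), int(np.argmin(spec_ratio))

# Welch PSDs of the best and worst components; a 2 s window gives 0.5 Hz resolution.
freqs, psd = welch(sources, fs=raw.info["sfreq"], nperseg=int(2 * raw.info["sfreq"]))
alpha = (freqs >= 9) & (freqs <= 12)
print("Mean alpha power, max-SNR component:", psd[i_max, alpha].mean())
print("Mean alpha power, min-SNR component:", psd[i_min, alpha].mean())
```
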

INSTRUCTION

Process polar tabular data: merge header and records, compute daily averages from hourly values, then split outputs by month.

Plan
Plan metadata
Data requirement
  • Polar tabular dataset containing header metadata and hourly records with timestamps.
Data tags: tabular · hourly · header + records
Execution step (step_1 · code)
  1. Load header and data files, then enforce canonical columns defined by the header metadata.
  2. Build UTC datetime from Year/Month/Day and three-hourly observation time fields.
  3. Normalize the merged dataset with consistent schema and typed values.
  4. Compute daily averages over numeric measurement columns, excluding calendar/time keys.
  5. Export normalized, daily-averaged, and monthly-split outputs as .xlsx with validation/progress logs.
Code · step_1.py (full listing at static/code/step_1.py)
Input & outputs
Input:
  • polar_tabular_dataset_containing_header_metadata_and_hourly_records_with_timestamps
Outputs:
  • normalized_merged_dataset_header_plus_records_combined
  • daily_averaged_dataset_derived_from_hourly_measurements
  • monthly_split_files_daily_averaged_dataset
Integration description

A single code step performs ingestion, canonical header application, datetime construction, daily aggregation, and month-based splitting with required directory structure and .xlsx deliverables.
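
A rough pandas illustration of what such a single step involves is shown below. It is not the actual step_1.py: the file names, the Year/Month/Day/Hour column names, and the hourly (rather than three-hourly) time field are assumptions.

```python
import pandas as pd

# Assumed inputs: a header file listing canonical column names and a records file whose
# columns follow that header (file and column names here are hypothetical).
columns = pd.read_csv("header.csv", header=None).iloc[:, 0].tolist()
df = pd.read_csv("records.csv", names=columns)

# Build a timestamp from the calendar and observation-time fields (interpreted as UTC,
# kept timezone-naive so the Excel export accepts it).
df["datetime"] = pd.to_datetime(dict(year=df["Year"], month=df["Month"],
                                     day=df["Day"], hour=df["Hour"]))
df.to_excel("normalized_merged.xlsx", index=False)

# Daily averages over the numeric measurement columns, excluding the calendar/time keys.
daily = (df.drop(columns=["Year", "Month", "Day", "Hour"])
           .set_index("datetime")
           .resample("1D")
           .mean(numeric_only=True))
daily.to_excel("daily_averaged.xlsx")

# One workbook per calendar month (daily_averaged_YYYY-MM.xlsx).
for period, chunk in daily.groupby(daily.index.to_period("M")):
    chunk.to_excel(f"daily_averaged_{period}.xlsx")
```
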

Analysis
Processing summary
  1. Ingestion and merge of multiple source observations into one unified time-aligned table.
  2. Normalization and standardization to produce normalized_merged.xlsx.
  3. Temporal aggregation from hourly (or sub-daily) measurements to daily values.
  4. Monthly partitioning of daily outputs into separate Excel workbooks.
Data transformations (inferred)
  • Timestamp normalization to build a consistent datetime field for grouping.
  • Resampling/grouping by calendar day to derive daily statistics.
  • Partitioning by year-month for monthly deliverables.
  • Potential variable scaling/normalization in parallel with aggregation, as indicated by normalized_merged.xlsx.
Generated outputs
  • daily_averaged_YYYY-MM.xlsx (24 files)
  • daily_averaged.xlsx (full-period daily product)
  • normalized_merged.xlsx (normalized merged product)
Coverage: 2005-01 to 2006-12 · Processing success: True
Monthly files generated
  • Count: 24 monthly Excel files.
  • Date range: 2005-01 through 2006-12 (inclusive).
  • Organization: one file per month with daily aggregated records for that month.
  • Companion outputs: one full-period daily file and one normalized merged dataset.
  • This structure supports both full-series analysis and monthly subset access.
Aggregation verification (hourly → daily)
Evidence from outputs confirms execution success, but row-level completeness metrics are not included in the report and still need explicit verification.
  1. Daily continuity by month: each monthly file should contain exactly the expected day count for that calendar month (for example, both 2005-02 and 2006-02 should have 28 days).
  2. No duplicated dates: each day should appear once per station/site or sensor key.
  3. Valid daily statistic definitions: use variable-appropriate rules (mean for temperature/pressure, sum for accumulation variables, circular mean for wind direction); see the sketch after this list.
  4. Time zone/day-boundary consistency: confirm UTC (or explicitly declared local time) and no month-edge shifts that create missing/extra bins.
  5. Temporal consistency: verify monotonic time order in every file.
  6. Cross-file consistency: the full-period daily file should match the concatenation of all monthly files.
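
For check 3, a minimal sketch of variable-appropriate daily rules together with a per-day observation count is shown below. The column names and synthetic data are hypothetical, and the 18-of-24 coverage threshold echoes the recommendation further down.

```python
import numpy as np
import pandas as pd


def circular_mean_deg(deg: pd.Series) -> float:
    """Mean of angles in degrees (e.g. wind direction), robust to the 0/360 wrap."""
    rad = np.deg2rad(deg.dropna())
    return float(np.rad2deg(np.arctan2(np.sin(rad).mean(), np.cos(rad).mean())) % 360)


# Synthetic two-day hourly table standing in for the real polar records.
idx = pd.date_range("2005-01-01", periods=48, freq="h")
rng = np.random.default_rng(0)
hourly = pd.DataFrame({"temperature": rng.normal(-20, 3, 48),
                       "precipitation": rng.exponential(0.1, 48),
                       "wind_dir": rng.uniform(0, 360, 48)}, index=idx)

# Variable-appropriate daily rules: mean for state variables, sum for accumulations,
# circular mean for directions; plus a per-day observation count for coverage QC.
daily = hourly.resample("1D").agg({"temperature": "mean",
                                   "precipitation": "sum",
                                   "wind_dir": circular_mean_deg})
daily["n_obs"] = hourly["temperature"].resample("1D").count()
daily["complete"] = daily["n_obs"] >= 18   # e.g. require at least 18 of 24 hourly values
print(daily)
```
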
Quality assessment
Metric: baseline → current
Intrinsic: 0.8950 → 0.7072
Distributional: NaN → 0.5213
Utility: 1.0000 → 1.0000
Total: NaN → 0.7393
Improvement is not directly comparable because baseline distributional/total metrics are NaN.
Interpretation
  • Structural integrity is good: required files are generated and pipeline execution succeeded.
  • Intrinsic quality decreased (0.8950 → 0.7072), indicating potential missingness or consistency issues after aggregation.
  • Distributional score is moderate (0.5213), possibly reflecting expected smoothing from hourly-to-daily aggregation or potential transformation artifacts.
  • Utility remains 1.0, so outputs are still operationally usable for downstream tasks.
Data integrity after processing
  • Structural integrity appears intact: expected files are produced and processing completed successfully.
  • Scientific integrity still requires further checks on missingness policy, day definition, and variable-specific aggregation behavior.
Recommendations
  1. Add explicit variable-wise aggregation rules and persist them in metadata/README (for example, mean vs sum vs circular mean).
  2. Enforce and report hourly coverage thresholds (for example, >= 18/24 observations) and add per-day n_obs.
  3. Validate time handling rigorously: UTC declaration, day boundaries, and month-edge inclusion logic.
  4. Investigate and resolve baseline distributional/total NaN so before-vs-after quality comparisons are meaningful.
  5. Strengthen QC deliverables: per-file summary (date range, records, missingness, flagged days), outlier flags, and provenance (processing date/software/method).
  6. After QC hardening, run seasonal climatology, interannual comparison (2005 vs 2006), extreme-event analysis, and reanalysis/satellite coupling.
Next action

Attach one representative monthly file schema (or normalized_merged.xlsx schema) to generate a variable-specific validation checklist.

BibTeX

@article{scidatacopilot2026,
  title={SciDataCopilot: An Agentic Data Preparation Framework for AGI-driven Scientific Discovery},
  author={Shanghai Artificial Intelligence Laboratory},
  journal={arXiv},
  year={2026},
  url={https://arxiv.org/abs/2602.09132}
}