SciDataCopilot: An Agentic Data Preparation Framework for AGI-driven Scientific Discovery

Shanghai Artificial Intelligence Laboratory
[Figure 1: SciDataCopilot pipeline]

Figure 1. The Scientific AI-Ready data paradigm is designed to meet the fundamental requirements of scientific discovery. Under this paradigm, raw data are not only machine-readable but are explicitly structured to support task customization and expert guidance, enabling autonomous workflows. Instantiating this paradigm, we propose SciDataCopilot, a staged agentic framework that transforms heterogeneous raw data into Scientific AI-Ready data that can be directly consumed by scientific analyses.

Abstract

The current landscape of AI for Science (AI4S) is predominantly anchored in large-scale textual corpora, where generative AI systems excel at hypothesis generation, literature search, and multi-modal reasoning. However, a critical bottleneck for accelerating closed-loop scientific discovery remains the utilization of raw experimental data. Characterized by extreme heterogeneity, high specificity, and deep domain-expertise requirements, raw data possess neither direct semantic alignment with linguistic representations nor the structural homogeneity needed for a unified embedding space. This disconnect prevents the emerging class of Artificial General Intelligence for Science (AGI4S) from effectively interfacing with the physical reality of experimentation. In this work, we extend the text-centric AI-Ready concept to a Scientific AI-Ready paradigm, explicitly formalizing how scientific data are specified, structured, and composed within a computational workflow. To operationalize this, we propose SciDataCopilot, an autonomous agentic framework designed to handle data access, scientific intent parsing, and multi-modal integration in an end-to-end manner. By positioning data readiness as a core operational primitive, the framework provides a principled foundation for reusable, transferable systems, enabling the transition toward experiment-driven scientific general intelligence. Extensive evaluations across three heterogeneous scientific domains show that SciDataCopilot improves efficiency, scalability, and consistency over manual pipelines, with up to a 30× speedup in data preparation.

Scientific AI-Ready Data Paradigm

🎯 Task-conditioned principle: The Scientific AI-Ready data paradigm adopts scientific tasks as the primary organizing principle, translating scientific intent into the required data units, variables, and constraints. This shifts task-conditioned data specification from inefficient manual collection to automated, reusable workflows.

⛓️ Downstream compatibility: The Scientific AI-Ready data paradigm prioritizes direct compatibility with downstream scientific analysis, ensuring that prepared data satisfy model-specific input constraints and enabling composable, executable workflows beyond standalone inference.

🧠 Cross-integration ability: The Scientific AI-Ready data paradigm emphasizes principled cross-modal and cross-disciplinary alignment, enabling systematic data association, retrieval, alignment, and composition across heterogeneous scientific domains.

Framework

[Figure 2: SciDataCopilot framework]

Figure 2. Architecture of SciDataCopilot. The framework integrates four collaborative agents (Data Access, Intent Parsing, Data Processing, and Data Integration) to autonomously align user intents with complex data resources. Ultimately, SciDataCopilot bridges heterogeneous scientific data with specific models, re-defining task-guided data customization and cross-disciplinary integration to empower diverse scientific research tasks.

01 · Data Access Agent

  • Data perception
  • Knowledge base construction K={D,T,C}
  • Reproducible metadata grounding

02 · Intent Parsing Agent

  • Requirement analysis
  • Case retrieval adaptation
  • Plan generation and verification

03 · Data Processing Agent

  • Plan checking
  • Self-repairing execution loop
  • Scientific report trace generation

04 · Data Integration Agent

  • Integration strategy analysis
  • Integration pipeline generation
  • Data unit integration
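
To make the staged flow concrete, the following is a minimal orchestration sketch of how the four agents hand results to one another. It is purely illustrative: the class names, method names, and data structures are hypothetical and do not correspond to SciDataCopilot's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class DataUnit:
    """A prepared, schema-consistent piece of data together with its provenance."""
    name: str
    payload: Any
    provenance: dict[str, Any] = field(default_factory=dict)


class DataAccessAgent:
    def build_knowledge_base(self, sources: list[str]) -> dict[str, Any]:
        # Perceive the raw sources and ground their metadata reproducibly.
        return {src: {"metadata": f"parsed metadata for {src}"} for src in sources}


class IntentParsingAgent:
    def parse(self, instruction: str, kb: dict[str, Any]) -> list[dict[str, Any]]:
        # Translate the scientific intent into a step-wise, verifiable plan.
        return [
            {"step": 1, "action": "load", "targets": list(kb)},
            {"step": 2, "action": "transform", "instruction": instruction},
        ]


class DataProcessingAgent:
    def execute(self, plan: list[dict[str, Any]]) -> list[DataUnit]:
        # Check each plan step, run it, and (in the real system) retry or repair on failure.
        return [
            DataUnit(name=f"step_{s['step']}_output", payload=None, provenance={"plan_step": s})
            for s in plan
        ]


class DataIntegrationAgent:
    def integrate(self, units: list[DataUnit]) -> DataUnit:
        # Combine the per-step outputs into a single AI-ready deliverable.
        return DataUnit(
            name="integrated_output",
            payload=[u.payload for u in units],
            provenance={"sources": [u.name for u in units]},
        )


def run_pipeline(instruction: str, sources: list[str]) -> DataUnit:
    kb = DataAccessAgent().build_knowledge_base(sources)
    plan = IntentParsingAgent().parse(instruction, kb)
    units = DataProcessingAgent().execute(plan)
    return DataIntegrationAgent().integrate(units)
```

Calling run_pipeline("compute daily averages", ["station_A.csv"]) would walk a request through all four stages; in the real framework, these stubs are replaced by LLM-driven reasoning and actual execution with self-repair.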

Use Cases

Representative SciDataCopilot use cases are shown below.

INSTRUCTION

Download all enzyme catalysis data, including enzyme sequences and substrate–product reaction information.

Statistics

Basic dataset statistics (overall):

Total records: 214,104
Unique enzymes: 168,576
Unique substrates: 4,485
Unique products: 5,754
Unique RHEA IDs: 9,221
Unique organisms: 5,893
Unique enzyme types: 10

Example record
UniProt accession: Q96IZ6
Entry name: MET2A_HUMAN
Protein: tRNA N(3)-cytidine methyltransferase METTL2A
Sequence length: 378
Sequence: MAGSYPEGAPAVLADKRQQFGSRFLRDPARVFHHNAWDNVEWSEEQAAAAERKVQENSIQRVCQEKQVDYEINAHKYWNDFYKIHENGFFKDRHWLFTEFPELAPSQNQNHLKDWFLENKSEVPECRNNEDGPGLIMEEQHKCSSKSLEHKTQTLPVEENVTQKISDLEICADEFPGSSATYRILEVGCGVGNTVFPILQTNNDPGLFVYCCDFSSTAIELVQTNSEYDPSRCFAFVHDLCDEEKSYPVPKGSLDIIILIFVLSAIVPDKMQKAINRLSRLLKPGGMMLLRDYGRYDMAQLRFKKGQCLSGNFYVRGDGTRVYFFTQEELDTLFTTAGLEKVQNLVDRRLQVNRGKQLTMYRVWIQCKYCKPLLSSTS
Reaction: RHEA:50960
cytidine(32) in tRNA(Thr) + S-adenosyl-L-methionine = N(3)-methylcytidine(32) in tRNA(Thr) + S-adenosyl-L-homocysteine + H(+)
Substrate: S-adenosyl-L-methionine (CHEBI:59789)
Product: S-adenosyl-L-homocysteine (CHEBI:57856)
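
As an illustration of the kind of source such an instruction draws on, the sketch below retrieves the example entry above from the public UniProt REST API and extracts its catalytic-activity annotation (Rhea reaction and ChEBI participants). This is a hand-written fragment, not SciDataCopilot's Data Access Agent, and the JSON field names (comments, commentType, reactionCrossReferences) should be verified against the live UniProt schema.

```python
import requests

# Fetch the example UniProtKB entry above as JSON from the public UniProt REST API.
url = "https://rest.uniprot.org/uniprotkb/Q96IZ6.json"
entry = requests.get(url, timeout=30).json()

sequence = entry.get("sequence", {}).get("value", "")
print("Sequence length:", len(sequence))

# Catalytic-activity comments carry the reaction string plus Rhea/ChEBI cross-references.
for comment in entry.get("comments", []):
    if comment.get("commentType") != "CATALYTIC ACTIVITY":
        continue
    reaction = comment.get("reaction", {})
    print("Reaction:", reaction.get("name", ""))
    for xref in reaction.get("reactionCrossReferences", []):
        print(f"  {xref.get('database')}: {xref.get('id')}")
```

Scaling this to the full corpus summarized above would replace the single accession with a paginated UniProt search query; that part is omitted here.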
INSTRUCTION

Using the MNE fieldtrip_cmc MEG dataset, perform Spatio-Spectral Decomposition (SSD) by resampling the raw data to 250 Hz and cropping it to a 60-second window (50–110s). Fit the SSD estimator using OAS regularization and zero-phase FIR filters, defining the signal band at 9–12 Hz and the noise band at 8–13 Hz (both with 1 Hz transition bandwidths) to isolate alpha oscillations from 1/f background activity. Validate the results by inverting the spatial filters to plot patterns, transforming the data to source space, and comparing the Welch PSD of the maximum and minimum SNR components to verify oscillatory enhancement.

Plan
Data preparation
  1. Acquire the MNE fieldtrip_cmc raw MEG dataset, mark bad channels, and keep a single MEG sensor type (the CTF gradiometers) to avoid sensor-type scaling bias.
  2. Resample the continuous raw data to 250 Hz with anti-aliasing.
  3. Optionally apply artifact mitigation (SSP/ICA for EOG/ECG, SSS/Maxwell if available) and a 50/60 Hz notch.
  4. Define zero-phase FIR filters with 1 Hz transitions and adequate padding for 250 Hz.
Band definition & filtering
  1. Set the SSD signal band to 9–12 Hz (zero-phase FIR, 1 Hz transitions).
  2. Define flank noise bands as 8–9 Hz and 12–13 Hz (separate zero-phase FIR filters).
  3. Apply signal + flank filters to continuous data before cropping, using adequate padding.
  4. Crop filtered datasets to the 60 s window (50–110 s) with identical spans and sensors.
SSD estimation
  1. Compute signal-band covariance from 9–12 Hz data.
  2. Compute flank-band covariances and combine them into the noise covariance.
  3. Apply OAS regularization to both covariances.
  4. Solve the generalized eigenvalue problem to obtain filters W and SNR scores.
Validation & reporting
  1. Prepare broadband data (same sensors and window; optional 1–45 Hz bandpass).
  2. Compute broadband covariance C_data for interpretable patterns.
  3. Derive spatial patterns A = C_data × W and visualize the top-ranked components (see the numerical sketch after this plan).
  4. Project to SSD source space and compute Welch PSDs (≤ 0.5 Hz resolution).
  5. Compare max/min SNR components to confirm alpha enhancement.
  6. Quantify alpha SNR ratios and save filters, patterns, PSDs, and reports.
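
The covariance-level core of this plan (OAS-regularized covariances, the generalized eigenvalue problem, and patterns A = C_data × W) can be sketched as follows. This is a minimal illustration assuming scikit-learn's OAS estimator and SciPy's generalized eigensolver; the function and variable names are hypothetical, and band filtering is assumed to have been done beforehand.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.covariance import OAS


def ssd_from_band_data(x_signal, x_noise, x_broadband):
    """Minimal SSD estimation from arrays of shape (n_channels, n_times).

    x_signal / x_noise are assumed to be already filtered to the 9-12 Hz signal
    band and the 8-9 / 12-13 Hz flank bands, respectively.
    """
    # OAS-regularized covariances (sklearn expects samples in rows, hence the transpose).
    c_s = OAS().fit(x_signal.T).covariance_
    c_n = OAS().fit(x_noise.T).covariance_

    # Generalized eigenvalue problem C_s w = lambda C_n w; eigh returns ascending eigenvalues.
    eigvals, eigvecs = eigh(c_s, c_n)
    order = np.argsort(eigvals)[::-1]              # highest SNR first
    snr, filters = eigvals[order], eigvecs[:, order]

    # Spatial patterns from the broadband covariance, A = C_data W (as in step 3 above).
    c_data = OAS().fit(x_broadband.T).covariance_
    patterns = c_data @ filters
    return filters, patterns, snr
```
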
Analysis
Summary of execution
  • Loads the MNE fieldtrip_cmc dataset and keeps the CTF axial-gradiometer set (MNE uses meg="mag" for this sensor group).
  • Resamples to 250 Hz, applies zero-phase notch at 50/100 Hz, and band-pass filters for SSD signal (9–12 Hz) and flanks (8–9, 12–13 Hz).
  • Crops to 50–110 s, builds OAS-regularized covariances, and solves Cs w = λ Cn w for the SSD filters and SNR scores.
  • Computes patterns A = C_data W from the broadband covariance, then projects the data to SSD source space and compares Welch PSDs.
  • Saves figures and JSON report with metrics and topographies.
Scientific interpretation
  • SSD maximizes 9–12 Hz power relative to adjacent 8–9 and 12–13 Hz bands.
  • Expected spectrum: pronounced 10–11 Hz peak for max-SNR, flat spectrum for min-SNR.
  • Patterns A are intended for neurophysiological interpretation; spatial plausibility is a key check.
  • Validation emphasizes rank-order consistency, alpha enhancement, and spatial plausibility.
Method notes
  • Channel picking uses pick_meg_grad with meg="mag".
  • OAS is appropriate for covariance estimation.
  • Zero-phase FIR prevents temporal shifts; flank concatenation is standard practice.
  • Cholesky whitening and pattern scaling are acceptable.
  • PSD settings are reasonable for ≤ 0.5 Hz resolution.
Outputs to review
  • reports/ssd_analysis_report.json
  • figures/ssd_psd_max_min.png
  • figures/ssd_top_patterns.png
Potential improvements
  • Evaluate broader signal bands or time-window shifts.
  • Test alternative pattern scaling.
  • Consider beta-band CMC adjustments if needed.

The implementation meets the requirement; use the report and figures to confirm SSD characteristics.
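
For reference, a condensed MNE-Python sketch of the workflow above, based on the mne.decoding.SSD estimator. Preprocessing details (bad channels, notch filtering, figure/report export) are omitted, and the channel selection and PSD settings are simplified assumptions rather than the exact executed pipeline.

```python
import numpy as np
import mne
from mne.datasets.fieldtrip_cmc import data_path
from mne.decoding import SSD
from scipy.signal import welch

# Load the fieldtrip_cmc CTF recording, keep the MEG channels, resample, and crop.
raw = mne.io.read_raw_ctf(data_path() / "SubjectCMC.ds", preload=True)
raw.pick("meg").resample(250).crop(50.0, 110.0)

# Signal band 9-12 Hz and noise band 8-13 Hz with 1 Hz transitions (zero-phase FIR defaults).
filt_signal = dict(l_freq=9.0, h_freq=12.0, l_trans_bandwidth=1.0, h_trans_bandwidth=1.0)
filt_noise = dict(l_freq=8.0, h_freq=13.0, l_trans_bandwidth=1.0, h_trans_bandwidth=1.0)

ssd = SSD(info=raw.info, reg="oas",
          filt_params_signal=filt_signal, filt_params_noise=filt_noise)
sources = ssd.fit_transform(raw.get_data())        # project the data to SSD source space
spec_ratio, _ = ssd.get_spectral_ratio(sources)    # per-component signal/noise power ratio
i_max, i_min = int(np.argmax(spec_ratio)), int(np.argmin(spec_ratio))

# Welch PSDs of the best and worst components; a 2 s window gives 0.5 Hz resolution.
freqs, psd = welch(sources, fs=raw.info["sfreq"], nperseg=int(2 * raw.info["sfreq"]))
alpha = (freqs >= 9) & (freqs <= 12)
print("Mean alpha power, max-SNR component:", psd[i_max, alpha].mean())
print("Mean alpha power, min-SNR component:", psd[i_min, alpha].mean())
```
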

INSTRUCTION

Process polar tabular data: merge header and records, compute daily averages from hourly values, then split outputs by month.

Plan
Plan metadata
Data requirement
  • Polar tabular dataset containing header metadata and hourly records with timestamps.
Data tags: tabular · hourly · header + records
Execution step (step_1 · code)
  1. Load header and data files, then enforce canonical columns defined by the header metadata.
  2. Build UTC datetime from Year/Month/Day and three-hourly observation time fields.
  3. Normalize the merged dataset with consistent schema and typed values.
  4. Compute daily averages over numeric measurement columns, excluding calendar/time keys.
  5. Export normalized, daily-averaged, and monthly-split outputs as .xlsx with validation/progress logs.
Code · step_1.py (full listing at static/code/step_1.py)
Input & outputs
Input:
  • polar_tabular_dataset_containing_header_metadata_and_hourly_records_with_timestamps
Outputs:
  • normalized_merged_dataset_header_plus_records_combined
  • daily_averaged_dataset_derived_from_hourly_measurements
  • monthly_split_files_daily_averaged_dataset
Integration description

A single code step performs ingestion, canonical header application, datetime construction, daily aggregation, and month-based splitting with required directory structure and .xlsx deliverables.
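
A rough pandas illustration of what such a single step involves is shown below. It is not the actual step_1.py: the file names, the Year/Month/Day/Hour column names, and the hourly (rather than three-hourly) time field are assumptions.

```python
import pandas as pd

# Assumed inputs: a header file listing canonical column names and a records file whose
# columns follow that header (file and column names here are hypothetical).
columns = pd.read_csv("header.csv", header=None).iloc[:, 0].tolist()
df = pd.read_csv("records.csv", names=columns)

# Build a timestamp from the calendar and observation-time fields (interpreted as UTC,
# kept timezone-naive so the Excel export accepts it).
df["datetime"] = pd.to_datetime(dict(year=df["Year"], month=df["Month"],
                                     day=df["Day"], hour=df["Hour"]))
df.to_excel("normalized_merged.xlsx", index=False)

# Daily averages over the numeric measurement columns, excluding the calendar/time keys.
daily = (df.drop(columns=["Year", "Month", "Day", "Hour"])
           .set_index("datetime")
           .resample("1D")
           .mean(numeric_only=True))
daily.to_excel("daily_averaged.xlsx")

# One workbook per calendar month (daily_averaged_YYYY-MM.xlsx).
for period, chunk in daily.groupby(daily.index.to_period("M")):
    chunk.to_excel(f"daily_averaged_{period}.xlsx")
```
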

Analysis
Processing summary
  1. Ingestion and merge of multiple source observations into one unified time-aligned table.
  2. Normalization and standardization to produce normalized_merged.xlsx.
  3. Temporal aggregation from hourly (or sub-daily) measurements to daily values.
  4. Monthly partitioning of daily outputs into separate Excel workbooks.
Data transformations (inferred)
  • Timestamp normalization to build a consistent datetime field for grouping.
  • Resampling/grouping by calendar day to derive daily statistics.
  • Partitioning by year-month for monthly deliverables.
  • Potential variable scaling/normalization in parallel with aggregation, as indicated by normalized_merged.xlsx.
Generated outputs
  • daily_averaged_YYYY-MM.xlsx (24 files)
  • daily_averaged.xlsx (full-period daily product)
  • normalized_merged.xlsx (normalized merged product)
Coverage: 2005-01 to 2006-12 · Processing success: True
Monthly files generated
  • Count: 24 monthly Excel files.
  • Date range: 2005-01 through 2006-12 (inclusive).
  • Organization: one file per month with daily aggregated records for that month.
  • Companion outputs: one full-period daily file and one normalized merged dataset.
  • This structure supports both full-series analysis and monthly subset access.
Aggregation verification (hourly → daily)
Evidence from outputs confirms execution success, but row-level completeness metrics are not included in the report and still need explicit verification.
  1. Daily continuity by month: each monthly file should contain exactly the expected day count for that calendar month (for example, both 2005-02 and 2006-02 should have 28 days).
  2. No duplicated dates: each day should appear once per station/site or sensor key.
  3. Valid daily statistic definitions: use variable-appropriate rules (mean for temperature/pressure, sum for accumulation variables, circular mean for wind direction); see the sketch after this list.
  4. Time zone/day-boundary consistency: confirm UTC (or explicitly declared local time) and no month-edge shifts that create missing/extra bins.
  5. Temporal consistency: verify monotonic time order in every file.
  6. Cross-file consistency: the full-period daily file should match the concatenation of all monthly files.
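
For check 3, a minimal sketch of variable-appropriate daily rules together with a per-day observation count is shown below. The column names and synthetic data are hypothetical, and the 18-of-24 coverage threshold echoes the recommendation further down.

```python
import numpy as np
import pandas as pd


def circular_mean_deg(deg: pd.Series) -> float:
    """Mean of angles in degrees (e.g. wind direction), robust to the 0/360 wrap."""
    rad = np.deg2rad(deg.dropna())
    return float(np.rad2deg(np.arctan2(np.sin(rad).mean(), np.cos(rad).mean())) % 360)


# Synthetic two-day hourly table standing in for the real polar records.
idx = pd.date_range("2005-01-01", periods=48, freq="h")
rng = np.random.default_rng(0)
hourly = pd.DataFrame({"temperature": rng.normal(-20, 3, 48),
                       "precipitation": rng.exponential(0.1, 48),
                       "wind_dir": rng.uniform(0, 360, 48)}, index=idx)

# Variable-appropriate daily rules: mean for state variables, sum for accumulations,
# circular mean for directions; plus a per-day observation count for coverage QC.
daily = hourly.resample("1D").agg({"temperature": "mean",
                                   "precipitation": "sum",
                                   "wind_dir": circular_mean_deg})
daily["n_obs"] = hourly["temperature"].resample("1D").count()
daily["complete"] = daily["n_obs"] >= 18   # e.g. require at least 18 of 24 hourly values
print(daily)
```
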
Quality assessment
Metric: baseline → current
Intrinsic: 0.8950 → 0.7072
Distributional: NaN → 0.5213
Utility: 1.0000 → 1.0000
Total: NaN → 0.7393
Improvement is not directly comparable because baseline distributional/total metrics are NaN.
Interpretation
  • Structural integrity is good: required files are generated and pipeline execution succeeded.
  • Intrinsic quality decreased (0.8950 → 0.7072), indicating potential missingness or consistency issues after aggregation.
  • Distributional score is moderate (0.5213), possibly reflecting expected smoothing from hourly-to-daily aggregation or potential transformation artifacts.
  • Utility remains 1.0, so outputs are still operationally usable for downstream tasks.
Data integrity after processing
  • Structural integrity appears intact: expected files are produced and processing completed successfully.
  • Scientific integrity still requires further checks on missingness policy, day definition, and variable-specific aggregation behavior.
Recommendations
  1. Add explicit variable-wise aggregation rules and persist them in metadata/README (for example, mean vs sum vs circular mean).
  2. Enforce and report hourly coverage thresholds (for example, >= 18/24 observations) and add per-day n_obs.
  3. Validate time handling rigorously: UTC declaration, day boundaries, and month-edge inclusion logic.
  4. Investigate and resolve baseline distributional/total NaN so before-vs-after quality comparisons are meaningful.
  5. Strengthen QC deliverables: per-file summary (date range, records, missingness, flagged days), outlier flags, and provenance (processing date/software/method).
  6. After QC hardening, run seasonal climatology, interannual comparison (2005 vs 2006), extreme-event analysis, and reanalysis/satellite coupling.
Next action

Attach one representative monthly file schema (or normalized_merged.xlsx schema) to generate a variable-specific validation checklist.

BibTeX

@article{scidatacopilot2026,
  title={SciDataCopilot: An Agentic Data Preparation Framework for AGI-driven Scientific Discovery},
  author={Shanghai Artificial Intelligence Laboratory},
  journal={arXiv},
  year={2026},
  url={https://arxiv.org/abs/2602.09132}
}