Abstract
The current landscape of AI for Science (AI4S) is predominantly anchored in large-scale textual corpora, where generative AI systems excel at hypothesis generation, literature search, and multi-modal reasoning. However, a critical bottleneck for accelerating closed-loop scientific discovery remains the utilization of raw experimental data. Characterized by extreme heterogeneity, high specificity, and deep domain-expertise requirements, raw data possess neither direct semantic alignment with linguistic representations nor structural homogeneity suitable for a unified embedding space. This disconnect prevents the emerging class of Artificial General Intelligence for Science (AGI4S) from effectively interfacing with the physical reality of experimentation. In this work, we extend the text-centric AI-Ready concept to a Scientific AI-Ready paradigm, explicitly formalizing how scientific data are specified, structured, and composed within a computational workflow. To operationalize this, we propose SciDataCopilot, an autonomous agentic framework designed to handle data access, scientific intent parsing, and multi-modal integration in an end-to-end manner. By positioning data readiness as a core operational primitive, the framework provides a principled foundation for reusable, transferable systems, enabling the transition toward experiment-driven scientific general intelligence. Extensive evaluations across three heterogeneous scientific domains show that SciDataCopilot improves efficiency, scalability, and consistency over manual pipelines, with up to a 30× speedup in data preparation.
Scientific AI-Ready Data Paradigm
Task-conditioned principle: The Scientific AI-Ready data paradigm adopts scientific tasks as the primary organizing principle, translating scientific intent into the required data units, variables, and constraints. This shifts task-conditioned data specification from inefficient manual collection to automated, reusable workflows.
Downstream compatibility: The Scientific AI-Ready data paradigm prioritizes direct compatibility with downstream scientific analysis, ensuring that prepared data satisfy model-specific input constraints and enabling composable, executable workflows beyond standalone inference.
Cross-integration ability: The Scientific AI-Ready data paradigm emphasizes principled cross-modal and cross-disciplinary alignment, enabling systematic data association, retrieval, alignment, and composition across heterogeneous scientific domains. (An illustrative sketch of a task-conditioned specification follows.)
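To make the paradigm concrete, the following is a purely illustrative Python sketch of what a task-conditioned, machine-readable data specification could look like. The class and field names are hypothetical and do not correspond to SciDataCopilot's actual schema, which is not shown on this page.

```python
from dataclasses import dataclass, field


@dataclass
class TaskConditionedDataSpec:
    """Hypothetical specification of a scientific data-preparation task."""
    task: str                     # scientific intent, e.g. "isolate alpha oscillations via SSD"
    data_units: list[str]         # required data units (datasets, recordings, tables)
    variables: list[str]          # variables/fields/channels the task operates on
    constraints: dict[str, str] = field(default_factory=dict)  # downstream model input constraints
    output_format: str = "xlsx"   # serialization the downstream analysis expects


# Example instance mirroring the MEG use case shown below.
spec = TaskConditionedDataSpec(
    task="spatio-spectral decomposition of alpha oscillations",
    data_units=["MNE fieldtrip_cmc raw MEG recording"],
    variables=["MEG gradiometer channels"],
    constraints={"sfreq": "250 Hz", "window": "50-110 s", "signal_band": "9-12 Hz"},
    output_format="fif",
)
```

Such a specification would let an agent translate scientific intent into concrete acquisition and preparation steps while recording the downstream constraints the prepared data must satisfy.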
Use Cases
Download all enzyme catalysis data, including enzyme sequences and substrate–product reaction information.
Overall
Basic dataset statistics.
Using the MNE fieldtrip_cmc MEG dataset, perform Spatio-Spectral Decomposition (SSD) by resampling the raw data to 250 Hz and cropping it to a 60-second window (50–110 s). Fit the SSD estimator using OAS regularization and zero-phase FIR filters, defining the signal band at 9–12 Hz and the noise band at 8–13 Hz (both with 1 Hz transition bandwidths) to isolate alpha oscillations from 1/f background activity. Validate the results by inverting the spatial filters to plot patterns, transforming the data to source space, and comparing the Welch PSD of the maximum and minimum SNR components to verify oscillatory enhancement.
Plan
- Acquire the MNE fieldtrip_cmc raw MEG dataset, mark bad channels, and keep only the CTF MEG gradiometers (a single sensor type) to avoid sensor-type scaling bias.
- Resample the continuous raw data to 250 Hz with anti-aliasing.
- Optionally apply artifact mitigation (SSP/ICA for EOG/ECG, SSS/Maxwell if available) and a 50/60 Hz notch.
- Define zero-phase FIR filters with 1 Hz transitions and adequate padding for 250 Hz.
- Set the SSD signal band to 9–12 Hz (zero-phase FIR, 1 Hz transitions).
- Define flank noise bands as 8–9 Hz and 12–13 Hz (separate zero-phase FIR filters).
- Apply signal + flank filters to continuous data before cropping, using adequate padding.
- Crop filtered datasets to the 60 s window (50–110 s) with identical spans and sensors.
- Compute signal-band covariance from 9–12 Hz data.
- Compute flank-band covariances and combine them into the noise covariance.
- Apply OAS regularization to both covariances.
- Solve the generalized eigenvalue problem to obtain filters W and SNR scores.
- Prepare broadband data (same sensors and window; optional 1–45 Hz bandpass).
- Compute broadband covariance C_data for interpretable patterns.
- Derive spatial patterns A = C_data × W and visualize top ranks.
- Project to SSD source space and compute Welch PSDs (≤ 0.5 Hz resolution).
- Compare max/min SNR components to confirm alpha enhancement.
- Quantify alpha SNR ratios and save filters, patterns, PSDs, and reports (an MNE-based sketch of this plan follows the list).
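A minimal sketch of this plan using MNE-Python's built-in SSD estimator (mne.decoding.SSD). The estimator applies the signal- and noise-band zero-phase FIR filters and the OAS regularization internally at fit time, so the explicit filter-then-crop ordering above is folded into a single fit call; the snippet is illustrative rather than the exact pipeline executed by SciDataCopilot.

```python
import mne
from mne.decoding import SSD
from mne.time_frequency import psd_array_welch

# Load the CTF recording; CTF gradiometers are typed as "mag" in MNE.
fname = mne.datasets.fieldtrip_cmc.data_path() / "SubjectCMC.ds"
raw = mne.io.read_raw_ctf(fname)
raw.pick("mag").load_data()
raw.resample(250)            # anti-aliased resampling to 250 Hz
raw.crop(50.0, 110.0)        # 60-second analysis window

# SSD with OAS-regularized covariances; band-pass filtering happens inside fit().
ssd = SSD(
    info=raw.info,
    reg="oas",
    filt_params_signal=dict(l_freq=9, h_freq=12,
                            l_trans_bandwidth=1, h_trans_bandwidth=1),
    filt_params_noise=dict(l_freq=8, h_freq=13,
                           l_trans_bandwidth=1, h_trans_bandwidth=1),
)
X = raw.get_data()
ssd.fit(X)

# Source-space projection and Welch PSDs (n_fft=500 at 250 Hz -> 0.5 Hz resolution).
sources = ssd.transform(X)
psds, freqs = psd_array_welch(sources, sfreq=raw.info["sfreq"], n_fft=500)

# Spatial patterns for neurophysiological interpretation (n_components x n_channels).
patterns = ssd.patterns_
```

Comparing the Welch PSDs of the highest- and lowest-ranked sources, and plotting a few columns of `patterns` as topographies, covers the validation steps described in the analysis below.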
Analysis
- Loads the MNE fieldtrip_cmc dataset and keeps the CTF axial-gradiometer set (MNE uses meg="mag" for this sensor group).
- Resamples to 250 Hz, applies a zero-phase notch at 50/100 Hz, and band-pass filters for the SSD signal (9–12 Hz) and flanks (8–9, 12–13 Hz).
- Crops to 50–110 s, builds OAS-regularized covariances, and solves C_s w = λ C_n w for SSD filters and SNR scores (a covariance-level sketch follows this analysis).
- Computes patterns A = C_data W from the broadband covariance, then projects to source space and compares Welch PSDs.
- Saves figures and JSON report with metrics and topographies.
- SSD maximizes 9–12 Hz power relative to adjacent 8–9 and 12–13 Hz bands.
- Expected spectrum: pronounced 10–11 Hz peak for max-SNR, flat spectrum for min-SNR.
- Patterns A are intended for neurophysiological interpretation; spatial plausibility is a key check.
- Validation emphasizes rank-order consistency, alpha enhancement, and spatial plausibility.
- Channel picking keeps the CTF gradiometers via meg="mag" (for example with mne.pick_types).
- OAS is appropriate for covariance estimation.
- Zero-phase FIR prevents temporal shifts; flank concatenation is standard practice.
- Cholesky whitening and pattern scaling are acceptable.
- PSD settings are reasonable for ≤ 0.5 Hz resolution.
- Evaluate broader signal bands or time-window shifts.
- Test alternative pattern scaling.
- Consider beta-band CMC adjustments if needed.
The implementation meets the requirement; use the report and figures to confirm SSD characteristics.
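For reference, a minimal NumPy/SciPy sketch of the covariance-level step analyzed above: OAS-regularized signal and flank covariances, the generalized eigenvalue problem C_s w = λ C_n w, and patterns A = C_data W. The inputs are assumed to be already band-filtered and cropped (n_channels, n_times) arrays, and the function name is illustrative.

```python
import numpy as np
from scipy import linalg
from sklearn.covariance import OAS


def ssd_from_arrays(x_signal, x_noise, x_broadband):
    """Illustrative SSD core. Inputs are (n_channels, n_times) arrays that are already
    band-filtered (9-12 Hz signal; concatenated 8-9 / 12-13 Hz flanks) and cropped."""

    def oas_cov(x):
        # sklearn expects samples in rows, so transpose to (n_times, n_channels)
        return OAS().fit(x.T).covariance_

    c_s = oas_cov(x_signal)   # signal-band covariance
    c_n = oas_cov(x_noise)    # flank-band (noise) covariance

    # Generalized eigenvalue problem C_s w = lambda C_n w; eigenvalues act as SNR scores.
    snr, w = linalg.eigh(c_s, c_n)
    order = np.argsort(snr)[::-1]          # sort filters by descending SNR
    snr, w = snr[order], w[:, order]

    # Patterns from the broadband covariance, A = C_data W; an alternative (Haufe-style)
    # scaling would additionally normalize by W^T C_data W.
    a = np.cov(x_broadband) @ w
    return w, a, snr
```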
Process polar tabular data: merge header and records, compute daily averages from hourly values, then split outputs by month.
Plan
- Polar tabular dataset containing header metadata and hourly records with timestamps.
- Load header and data files, then enforce canonical columns defined by the header metadata.
- Build UTC datetime from Year/Month/Day and three-hourly observation time fields.
- Normalize the merged dataset with consistent schema and typed values.
- Compute daily averages over numeric measurement columns, excluding calendar/time keys.
- Export normalized, daily-averaged, and monthly-split outputs as .xlsx with validation/progress logs.
Code: static/code/step_1.py
A single code step performs ingestion, canonical header application, datetime construction, daily aggregation, and month-based splitting with required directory structure and .xlsx deliverables.
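The actual step_1.py is not reproduced here; below is a minimal pandas sketch of the described step. File names (header.csv, records.csv), column names (Year/Month/Day/Hour, column_name), and output names are hypothetical stand-ins for the dataset's real layout.

```python
from pathlib import Path
import pandas as pd

# Hypothetical inputs: the real file and column names come from the dataset's header metadata.
header = pd.read_csv("header.csv")                  # canonical column definitions
records = pd.read_csv("records.csv", header=None)   # hourly (or 3-hourly) observation records
records.columns = header["column_name"].tolist()    # enforce the canonical schema

# Build a datetime from calendar and observation-time fields (assumed names);
# timestamps are treated as UTC but kept tz-naive because Excel cannot store tz-aware values.
records["datetime"] = pd.to_datetime(
    dict(year=records["Year"], month=records["Month"],
         day=records["Day"], hour=records["Hour"])
)

# Daily means over numeric measurement columns, excluding calendar/time keys.
# NOTE: a blanket mean is used here; variable-specific rules (sum for accumulations,
# circular mean for wind direction) are discussed in the analysis below.
time_keys = ["Year", "Month", "Day", "Hour"]
measurements = records.drop(columns=time_keys).set_index("datetime").select_dtypes("number")
daily = measurements.resample("1D").mean()

# Deliverables: normalized merge, full-period daily file, and one workbook per month.
out = Path("outputs")
out.mkdir(exist_ok=True)
records.to_excel(out / "normalized_merged.xlsx", index=False)
daily.to_excel(out / "full_period_daily.xlsx")
for month, chunk in daily.groupby(daily.index.strftime("%Y-%m")):
    chunk.to_excel(out / f"daily_{month}.xlsx")
```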
Analysis
- Ingestion and merge of multiple source observations into one unified time-aligned table.
- Normalization and standardization to produce normalized_merged.xlsx.
- Temporal aggregation from hourly (or sub-daily) measurements to daily values.
- Monthly partitioning of daily outputs into separate Excel workbooks.
- Timestamp normalization to build a consistent datetime field for grouping.
- Resampling/grouping by calendar day to derive daily statistics.
- Partitioning by year-month for monthly deliverables.
- Potential variable scaling/normalization in parallel with aggregation, as indicated by normalized_merged.xlsx.
- Count: 24 monthly Excel files.
- Date range: 2005-01 through 2006-12 (inclusive).
- Organization: one file per month with daily aggregated records for that month.
- Companion outputs: one full-period daily file and one normalized merged dataset.
- This structure supports both full-series analysis and monthly subset access.
- Daily continuity by month: each monthly file should contain exactly the expected day count for that calendar month (for example, both 2005-02 and 2006-02 should have 28 days).
- No duplicated dates: each day should appear once per station/site or sensor key.
- Valid daily statistic definitions: use variable-appropriate rules (mean for temperature/pressure, sum for accumulation variables, circular mean for wind direction).
- Time zone/day-boundary consistency: confirm UTC (or explicitly declared local time) and no month-edge shifts that create missing/extra bins.
- Temporal consistency checks: verify monotonic time order in every file.
- Cross-file consistency checks: the full-period daily file should match the concatenation of all monthly files (a minimal check script is sketched after this list).
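A minimal sketch of how several of these checks could be automated, assuming the hypothetical output layout from the ingestion sketch above (outputs/daily_YYYY-MM.xlsx monthly workbooks plus full_period_daily.xlsx). It covers day counts, duplicate dates, monotonic ordering, and cross-file consistency, but not the variable-specific statistic definitions.

```python
import calendar
from pathlib import Path
import pandas as pd

out = Path("outputs")
monthly_files = sorted(out.glob("daily_*.xlsx"))     # names sort chronologically: daily_2005-01, ...
assert len(monthly_files) == 24, "expected 24 monthly files (2005-01 through 2006-12)"

frames = []
for path in monthly_files:
    df = pd.read_excel(path, index_col=0, parse_dates=True)
    year, month = map(int, path.stem.split("_")[1].split("-"))
    expected_days = calendar.monthrange(year, month)[1]      # 28 for 2005-02 and 2006-02, etc.
    assert len(df) == expected_days, f"{path.name}: {len(df)} days, expected {expected_days}"
    assert not df.index.duplicated().any(), f"{path.name}: duplicated dates"
    assert df.index.is_monotonic_increasing, f"{path.name}: non-monotonic time order"
    frames.append(df)

# Cross-file consistency: concatenated monthly files must equal the full-period daily file.
full = pd.read_excel(out / "full_period_daily.xlsx", index_col=0, parse_dates=True)
pd.testing.assert_frame_equal(pd.concat(frames), full)
```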
- Structural integrity is good: required files are generated and pipeline execution succeeded.
- Intrinsic quality decreased (0.8950 → 0.7072), indicating potential missingness or consistency issues after aggregation.
- Distributional score is moderate (0.5213), possibly reflecting expected smoothing from hourly-to-daily aggregation or potential transformation artifacts.
- Utility remains 1.0, so outputs are still operationally usable for downstream tasks.
- Structural integrity appears intact: expected files are produced and processing completed successfully.
- Scientific integrity still requires further checks on missingness policy, day definition, and variable-specific aggregation behavior.
- Add explicit variable-wise aggregation rules and persist them in metadata/README (for example, mean vs sum vs circular mean).
- Enforce and report hourly coverage thresholds (for example, >= 18/24 observations) and add a per-day n_obs column (see the aggregation sketch after this list).
- Validate time handling rigorously: UTC declaration, day boundaries, and month-edge inclusion logic.
- Investigate and resolve the NaN baseline distributional/total quality scores so that before-vs-after comparisons are meaningful.
- Strengthen QC deliverables: per-file summary (date range, records, missingness, flagged days), outlier flags, and provenance (processing date/software/method).
- After QC hardening, run seasonal climatology, interannual comparison (2005 vs 2006), extreme-event analysis, and reanalysis/satellite coupling.
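A sketch of the recommended variable-wise aggregation rules, per-day n_obs, and coverage threshold. Variable names are hypothetical, and a tiny synthetic hourly table stands in for the real polar records; the 18/24 threshold follows the example above.

```python
import numpy as np
import pandas as pd


def circular_mean_deg(deg: pd.Series) -> float:
    """Circular mean for wind direction, in degrees within [0, 360)."""
    rad = np.deg2rad(deg.dropna().to_numpy())
    return float(np.rad2deg(np.arctan2(np.sin(rad).mean(), np.cos(rad).mean())) % 360)


# Hypothetical variable names; the real rule set should be persisted in metadata/README.
agg_rules = {
    "temperature": "mean",
    "pressure": "mean",
    "precipitation": "sum",              # accumulation variable
    "wind_direction": circular_mean_deg,
}

# Tiny synthetic hourly table standing in for the merged polar records.
rng = np.random.default_rng(0)
idx = pd.date_range("2005-01-01", periods=48, freq="h")
hourly = pd.DataFrame(
    {
        "temperature": -20 + rng.standard_normal(48),
        "pressure": 1010 + rng.standard_normal(48),
        "precipitation": rng.random(48),
        "wind_direction": rng.random(48) * 360,
    },
    index=idx,
)

grouped = hourly.resample("1D")
daily = grouped.agg(agg_rules)
daily["n_obs"] = grouped.size()          # per-day observation count for QC reporting

# Coverage threshold: mask days with fewer than 18 of 24 hourly observations.
daily.loc[daily["n_obs"] < 18, list(agg_rules)] = np.nan
print(daily)
```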
Attach one representative monthly file schema (or normalized_merged.xlsx schema) to generate a variable-specific validation checklist.
BibTeX
@article{scidatacopilot2026,
title={SciDataCopilot: An Agentic Data Preparation Framework for AGI-driven Scientific Discovery},
author={Shanghai Artificial Intelligence Laboratory},
journal={arXiv},
year={2026},
url={https://arxiv.org/abs/2602.09132}
}