Predictive maintenance in heavy industry: how machine learning reduces downtime and cost

Lavrov Igor Andreevich

Introduction

Unplanned equipment failure is the defining cost driver in asset-intensive heavy industry, yet its financial scale is routinely underestimated until deployment-phase data forces a revision. A single integrated steel plant deploying an AI-based early warning system documented the prevention of more than 1,000 potential equipment failures in a pilot phase, generating verified savings of approximately $20 million and projecting an additional $47.5 million in avoided losses over five years [10]. That figure, while plant-specific, is consistent with the broader industry trajectory: the National Institute of Standards and Technology has documented cost reductions of up to 56 % on maintenance expenditure when plants transition from schedule-driven to condition-based regimes, and sector surveys report a 15–25 % improvement in equipment utilization with predictive approaches in place [1]. The gap between what traditional maintenance recovers and what predictive systems recover is the difference between reactive cost absorption and systematic failure prevention across rolling mills, blast furnaces, conveyors, and drive motors that collectively determine whether a continuous-process facility runs or stands idle.

The primary obstacle to ML deployment in heavy-industry maintenance is the absence of the organizational conditions under which any algorithm produces operational value. Condition-based maintenance research established the vocabulary for this transition over two decades ago [4], and prognostics-and-health-management frameworks extended it to rotating machinery with a structured five-stage design methodology [5]. Both paradigms degrade in performance precisely where heavy industry is most demanding: environments characterized by variable material properties, label scarcity, sensor noise under thermal and electromagnetic interference, and legacy control architectures that cannot be retrofitted with continuous data streams at low cost. The systematic review by Zonta et al. [13], covering 38 studies published through mid-2020, confirmed that data quality, heterogeneous equipment, and the absence of standardized fault labeling were the binding constraints on ML deployment in Industry 4.0 manufacturing contexts. Carvalho et al. [2], reviewing 30 papers, found that hybrid ML methods combining feature-engineering pipelines with ensemble classifiers outperformed single-algorithm approaches but generalized inconsistently across operating regimes, a failure mode that intensifies in blast furnace and rolling mill environments where load profiles shift seasonally and with material grade. At the model architecture level, the trajectory from classical SVM-based multi-classifier designs [11] to deep CNN–LSTM hybrids [12] improved benchmark accuracy substantially, while introducing a failure mode that benchmark metrics cannot detect: a model trained on controlled, fully-labelled datasets that produces high laboratory scores and fails on the noisy, imbalanced, partially-labelled sensor records of an actual production deployment.

Literature Review

Treating classifier selection as the central engineering problem in predictive maintenance is the foundational assumption that subsequent research inherited without examining it. Susto et al. [11] introduced a multiple-classifier framework for semiconductor manufacturing in which Support Vector Machines, trained on process and logistic variables capturing the degradation footprint, predicted fault conditions at the part level without direct measurement of component wear. The framework established that data-driven PdM could generate predictions without sensor access to the degrading component itself, and it simultaneously framed subsequent research questions around which classifier to select rather than what organizational conditions enable any classifier to generate value. Jardine et al. [4] had earlier structured condition-based maintenance around three constitutive stages (data acquisition, signal processing, and decision algorithms), a taxonomy that positioned the field as a signal-processing and modeling discipline rather than a socio-technical deployment challenge. Lee et al. [5] extended this logic into the PHM framework with a five-stage design methodology for rotary machinery that assumed sensor infrastructure was already in place, failure labels were available, and engineers had computational resources to implement the recommended algorithms: three assumptions that eliminate precisely the conditions under which heavy-industry PdM actually fails.

When deep learning entered the field, it raised the performance ceiling without questioning what was holding deployment back. CNN, RNN, autoencoder, and deep belief network architectures achieve lower remaining-useful-life estimation error and higher fault classification accuracy than SVM-based and shallow ensemble methods on the CWRU bearing dataset, the PRONOSTIA bearing dataset, and the NASA turbofan degradation data. Zhao et al. [12] surveyed these benchmarks comprehensively, and Lei et al. [6] used them as the empirical spine of a developmental roadmap from classical ML through deep learning to transfer learning. The roadmap argues that transfer learning reduces dependence on large labeled datasets in the target domain by allowing diagnosis knowledge acquired in one task to be applied to related tasks. What the roadmap does not quantify is the distance between the benchmark domains and the target domains it proposes to bridge: label completeness on the CWRU and PRONOSTIA datasets approaches 100 %, class balance between fault and non-fault observations is controlled by dataset construction, and sensor drift, electromagnetic interference, and maintenance-log gaps are absent by design. In a blast furnace or continuous hot-strip mill, all three of these conditions are reversed. Transfer learning is a technically sound response to labeled-data scarcity, but the roadmap leaves unspecified the organizational infrastructure that generates labeled data in the target domain: specifically, the structure through which operator judgments about model outputs become recorded training signal.
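The effect of class balance alone can be illustrated with a short calculation. The sketch below assumes a hypothetical detector with fixed sensitivity and specificity and applies Bayes' rule to show how the share of alerts that correspond to real faults changes when fault prevalence drops from a balanced benchmark to a rare-fault production record. The 0.95 rates and both prevalence values are illustrative assumptions, not figures taken from the cited benchmarks or from any deployment.

```python
def alert_precision(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Fraction of raised alerts that correspond to real faults (Bayes' rule)."""
    true_alerts = sensitivity * prevalence
    false_alerts = (1.0 - specificity) * (1.0 - prevalence)
    return true_alerts / (true_alerts + false_alerts)


# Illustrative assumptions only: the same detector in both settings.
SENSITIVITY = 0.95
SPECIFICITY = 0.95

# Balanced benchmark: half of the observations are faults.
balanced = alert_precision(SENSITIVITY, SPECIFICITY, prevalence=0.50)
# Rare-fault production record: one observation in a hundred is a fault.
rare = alert_precision(SENSITIVITY, SPECIFICITY, prevalence=0.01)

print(f"balanced benchmark (50% faults): precision = {balanced:.3f}")
print(f"rare-fault record (1% faults):   precision = {rare:.3f}")
```

Under these assumed values the detector is unchanged, yet precision falls from 0.95 on the balanced set to roughly 0.16 on the rare-fault record, which is one reason a benchmark score says little about the alert stream a maintenance engineer would actually see.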

Four systematic reviews (Carvalho et al. [2], Zonta et al. [13], Achouch et al. [1], and Serradilla et al. [8]) aggregate several hundred studies and document the state of ML-based PdM more comprehensively than any primary research program, yet none can answer whether reported accuracy metrics predict operational outcomes in production environments, because production environments are not in their evidence base. Carvalho et al. [2] found that hybrid methods combining feature engineering with ensemble classifiers produced more consistent accuracy and F1 scores across application domains than standalone SVM or decision-tree classifiers evaluated without feature construction; the review cannot establish whether that consistency advantage survives sensor drift, load-profile variability, and the class imbalance of continuous-production fault records. Zonta et al. [13] catalogued data quality, equipment heterogeneity, and absent fault labeling as the primary deployment obstacles, correctly identifying the constraints but positioning them as inputs to better algorithm design, a framing that directs engineering effort toward model sophistication rather than toward the workflow integration and authority structures that production failures actually implicate. Achouch et al. [1] named financial and organizational barriers alongside technical ones (budget constraints, data source fragmentation, repair-planning integration difficulty) but treated all three categories as parallel and separable, which permits the inference that solving technical barriers independently of organizational ones constitutes deployment progress. Serradilla et al. [8] achieved the most architecturally specific diagnosis: comparing CNN, LSTM, autoencoder, and self-organizing map designs across industrial use cases, they identified data variability handling, concept-drift adaptability, and ensemble design as the properties distinguishing deployable from laboratory-only systems. The prescription that follows from that diagnosis keeps the solution entirely within the model. It leaves outside its scope the question of what organizational conditions allow the model’s outputs to reach a maintenance engineer and be acted upon.

Surveying 219 steel-industry articles, Jakubowski et al. [3] found that research concentrates disproportionately on blast furnaces and hot rolling, that deep learning has become the dominant methodology, and that the central unresolved problem is not algorithmic performance on controlled data but implementation in production environments, integration into maintenance plans, and reproducibility across deployments. The reproducibility failure Jakubowski et al. identify is the same structural gap that the U.S. steel deployment documented by Shargaev [10] addresses from the other side: the $20 million in pilot-phase savings from preventing over 1,000 failures was generated not by a model with higher classification accuracy than existing approaches but by a deployment architecture in which sensor-based alerts were designed for direct interpretability by maintenance engineers and routed into the existing maintenance workflow, making organizational integration the variable that converted model output into maintenance action. The 219 studies Jakubowski et al. surveyed optimized the model; the deployment Shargaev [10] documents optimized the pathway from model output to engineer action, and the pathway produced the larger and more measurable outcome.

At the boundary where algorithm-first research runs out of explanatory power, the mechanism by which AI predictions become maintenance actions becomes the central design question. That mechanism has three parts: the interface through which an operator interprets a prediction, the authority level at which the system acts on it, and the structure through which operator corrections re-enter the model as training data. Ran et al. [7] survey PdM system architectures across a wide field and recommend deep reinforcement learning for maintenance decision support in complex dynamic environments, treating the decision support interface and the authority structure around it as implementation details rather than as design variables that determine whether the system is used at all. The authority structure is not an implementation detail: a conditional-autonomy assignment, in which the system acts unless its uncertainty exceeds a defined threshold, at which point the operator handles the exception, produces fundamentally different operational outcomes than a shared-control assignment requiring operator confirmation for every alert, even when the underlying model is identical. A four-level shared-autonomy framework specifies assisted action, shared control, conditional autonomy, and full task autonomy with oversight, with authority-level assignment determined by the consequence of error, the reversibility of the action, and the gap between operator domain knowledge and model uncertainty at the decision point. It provides the vocabulary the architecture-comparison literature lacks for diagnosing why identically specified models produce different operational outcomes across deployment environments [9]. The four levels correspond to empirically distinct decision structures, from quality inspection tasks where the operator retains final authority to stable repetitive operations where the AI acts and the human audits, and the assignment criteria make the authority-level decision tractable rather than arbitrary.
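A minimal sketch can make the point about identical models concrete. It routes the same set of alerts under each of the four authority levels and counts how many decisions land on the operator. The level names follow the framework in [9]; everything else is a hypothetical illustration: the `Alert` fields, the uncertainty score, the 0.3 threshold, the sample alerts, and the simplification that assisted action and shared control both leave every decision with the operator.

```python
from dataclasses import dataclass
from enum import Enum


class AuthorityLevel(Enum):
    ASSISTED_ACTION = 1
    SHARED_CONTROL = 2
    CONDITIONAL_AUTONOMY = 3
    FULL_AUTONOMY_WITH_OVERSIGHT = 4


@dataclass(frozen=True)
class Alert:
    asset: str
    fault_probability: float  # model output
    uncertainty: float        # hypothetical uncertainty score in [0, 1]


def decision_owner(alert: Alert, level: AuthorityLevel, threshold: float = 0.3) -> str:
    """Return who takes the maintenance decision for one alert."""
    # Assumption: at the two lowest levels the operator decides on every alert.
    if level in (AuthorityLevel.ASSISTED_ACTION, AuthorityLevel.SHARED_CONTROL):
        return "operator"
    # Conditional autonomy: the system acts unless uncertainty exceeds the threshold.
    if level is AuthorityLevel.CONDITIONAL_AUTONOMY:
        return "operator" if alert.uncertainty > threshold else "system"
    # Full autonomy with oversight: the system acts, the human audits afterwards.
    return "system"


# Hypothetical alerts produced by one and the same model.
alerts = [
    Alert("rolling-mill-drive", 0.91, 0.05),
    Alert("conveyor-motor", 0.64, 0.45),
    Alert("blast-furnace-blower", 0.88, 0.10),
    Alert("gearbox", 0.55, 0.60),
]

for level in AuthorityLevel:
    operator_load = sum(decision_owner(a, level) == "operator" for a in alerts)
    print(f"{level.name}: operator decides {operator_load} of {len(alerts)} alerts")
```

With these assumed inputs the operator handles all four alerts under assisted action and shared control, two under conditional autonomy, and none under full autonomy, although the model outputs never change.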

Discussion

Two explanatory models compete to account for why AI-based PdM underperforms relative to laboratory benchmarks, and the choice between them determines where engineering effort should concentrate. The algorithmic model locates the problem in model design: insufficient labeled data, poor generalization across operating regimes, and the gap between benchmark and industrial sensor characteristics. The governance model locates the problem in the structures that determine who acts on model output, under what conditions, with what authority, and through what feedback mechanism.

The multiple-classifier framework Susto et al. [11] established carried an embedded causal claim: health factors (quantitative indicators of system status derived from process variables) drive maintenance scheduling decisions, and better classifiers generate more accurate health factors, which produce better maintenance outcomes. On this account, more accurate prediction leads straightforwardly to better maintenance scheduling. The steel deployment Shargaev [10] documents contradicts the mechanism, not merely the result. A sensor-and-alert infrastructure that prevented over 1,000 failures and generated $20 million in verified savings operated in an environment harder for ML than the semiconductor context Susto et al. studied (higher noise, more variable sensor quality, fewer labeled fault records), yet it produced larger and more measurable operational gains. The difference was not classification accuracy; the deployment communicated predictions through interpretable alerts designed for maintenance engineers and integrated them into an existing maintenance workflow, making organizational integration the active variable. Classifier design improvements have diminishing returns once predictions are accurate enough to trigger appropriate responses, and that sufficiency threshold is considerably lower than benchmark performance metrics suggest. Most of the accuracy investment the algorithmic literature recommends is therefore being spent above the threshold where it produces operational returns.

CNN-LSTM architectures achieving 96.1 % accuracy and F1-scores above 0.95 on industrial machine datasets under controlled conditions are real results, and Serradilla et al.'s [8] architectural guidance, that simultaneous modeling of spatial and temporal structure outperforms single-architecture designs on time-series sensor data when training sets are sufficiently large and balanced, is valid within the conditions it states. Jakubowski et al. [3] found that most steel-industry research relies on laboratory experiments or historical data with controlled quality rather than on the continuously-drifting, partially-labeled sensor records that blast furnaces and hot rolling mills generate. The accuracy gap between controlled-experiment and production-deployment performance is the central empirical quantity the field has not measured, because measuring it requires access to instrumented industrial deployments that most research groups cannot obtain. Serradilla et al. prescribe training for data variability as the architectural solution to deployability, implying that a sufficiently robust model can operate without structured human involvement at uncertainty boundaries, that model resilience substitutes for governance design. The deployment evidence does not support this substitution: a model that handles concept drift gracefully still requires an authority structure that routes its outputs to engineers, at the right decision threshold, with the right interface, before it generates any operational outcome at all.

Conditional autonomy, the authority level at which the system acts unless its uncertainty exceeds a defined threshold, reverting to operator control at the exception boundary, is the assignment the three-criteria framework produces for predictive maintenance, because PdM errors are consequential, maintenance actions are partially reversible, and operator domain knowledge exceeds model confidence in the high-uncertainty cases that matter most [9]. A deployment configured instead at shared control, requiring operator confirmation for every alert regardless of model confidence, degrades into alert fatigue: operators habituate to confirming alerts without evaluation, and the effective sensitivity of the system falls toward zero regardless of the model’s classification accuracy. A deployment configured at full autonomy, acting on every alert without review, produces over-trust failures when model confidence is high but the operating context has shifted outside the training distribution, a condition that load-profile changes, material grade variation, and seasonal thermal cycles produce routinely in heavy industry. The authority-level specification is therefore not a configuration choice made after deployment; it is a precondition for whether the deployment generates value or generates noise. The feedback loop this structure enables is what Lei et al.'s [6] transfer-learning roadmap requires but does not specify: operator interventions at conditional-autonomy exception boundaries produce labeled target-domain data, because the operator’s resolution of a case the model flagged as uncertain is a supervised datapoint recording a production-environment fault pattern the training set did not contain. Without the conditional-autonomy structure, no systematic mechanism generates this data; the transfer-learning model remains dependent on the benchmark datasets whose distance from industrial sensor records the roadmap identifies as the central problem. The technical roadmap and the governance framework address the same constraint from different directions, and the technical solution is inoperable without the organizational one.
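The label-generating role of the exception boundary can be sketched in a few lines. The example below assumes a hypothetical `FeedbackLog` that lets the system act when model uncertainty is at or below a threshold and otherwise defers to an operator callback, storing the operator's resolution as a labeled example. The class, its fields, the 0.3 threshold, and the stand-in operator rule are all assumptions made for illustration; they are not taken from [6], [9], or any cited deployment.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ExceptionRecord:
    features: list[float]     # sensor features at the time of the alert
    model_uncertainty: float  # hypothetical uncertainty score in [0, 1]
    operator_label: str       # operator's resolution, e.g. "fault" or "no_fault"


@dataclass
class FeedbackLog:
    threshold: float = 0.3    # illustrative value, not a recommended setting
    records: list[ExceptionRecord] = field(default_factory=list)

    def handle(self, features: list[float], model_uncertainty: float,
               ask_operator: Callable[[list[float]], str]) -> str:
        """Act autonomously at or below the threshold; otherwise defer and record."""
        if model_uncertainty <= self.threshold:
            return "system_acted"
        label = ask_operator(features)
        self.records.append(ExceptionRecord(list(features), model_uncertainty, label))
        return "operator_resolved"

    def training_set(self) -> tuple[list[list[float]], list[str]]:
        """Return the operator-labeled exceptions as (X, y) for later retraining."""
        X = [r.features for r in self.records]
        y = [r.operator_label for r in self.records]
        return X, y


def operator(features: list[float]) -> str:
    # Stand-in for a human judgment; a real operator would inspect the asset.
    return "fault" if features[0] > 0.7 else "no_fault"


log = FeedbackLog()
log.handle([0.2, 0.1], 0.05, operator)  # confident: system acts, nothing logged
log.handle([0.9, 0.4], 0.55, operator)  # uncertain: operator resolves, label stored
log.handle([0.3, 0.8], 0.40, operator)  # uncertain: operator resolves, label stored

X, y = log.training_set()
print(len(X), y)
```

Only the two uncertain cases reach the operator, and both come back as labeled target-domain examples. Under shared control or full autonomy as described above, the same three alerts would not produce this selectively labeled set.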

Multidisciplinary approaches to PdM adoption are operationally underspecified in the Industry 4.0 and Industry 5.0 literature. Achouch et al. [1] identify financial and organizational barriers alongside technical ones but organize them as parallel categories, implying that each can be addressed on its own track. Zonta et al. [13] observe that computer science is increasingly displacing engineering as the dominant expertise in industrial maintenance and call for multidisciplinary integration without specifying what integration requires at the level of authority structures, interface design, and workforce skill architecture. Workforce skill architecture and interface transparency are not additions made after authority levels are assigned; they are preconditions for making the assignment at all, because conditional autonomy requires an operator capable of handling exceptions and an interface that communicates model uncertainty in a form the operator can act on. The authority level, the interface that makes model uncertainty legible to the operator, the feedback structure that records operator resolutions as training data, and the workforce skill profile that supports exception handling must be determined together, not sequentially. Shargaev [9] specifies this by treating them as co-designed elements of the governance framework rather than as a stack of additions to a technical core. Whether this specific framework is the right operationalization is open to empirical test; that some operationalization of governance architecture is a precondition for deployment-level performance is, given the accumulated evidence, no longer seriously in doubt.

Conclusion

Governance architecture is the binding constraint on PdM deployment performance, and the field’s investment in algorithmic sophistication has been concentrated above the accuracy threshold at which governance determines outcomes. This conclusion could not be recovered from Jardine et al. [4], Carvalho et al. [2], or Zonta et al. [13], because it requires production-deployment data of the kind that Shargaev [10] documents and Jakubowski et al. [3] identify as structurally absent from the published research base: data that records the rate at which model outputs become maintenance actions under operational conditions. The counter-intuitive implication is directional: the next marginal improvement in heavy-industry PdM outcomes is more likely to come from specifying the authority level at which a model of current accuracy operates than from improving that accuracy further.

What the evidence cannot yet resolve is whether the benchmark-to-production accuracy gap is closable through the feedback mechanisms that governance design enables, or whether it reflects sensor-environment constraints that no feedback loop can overcome. If the gap is closable, the conditional-autonomy structure generates a measurable convergence trajectory: target-domain labels accumulate at exception boundaries, transfer-learning models retrain on production-environment fault patterns, and deployment accuracy approaches benchmark performance over a horizon that longitudinal instrumentation can track. If the gap is not closable, if blast-furnace and hot-rolling sensor records are sufficiently different from any available training domain that structured feedback cannot produce adequate labeled coverage, then benchmark accuracy is the wrong evaluation standard for heavy-industry PdM entirely, and the field requires performance criteria defined against the baseline of the maintenance regime the AI system replaces rather than against the controlled-dataset ceiling it approaches in the laboratory.

References:

1. Achouch, M., Dimitrova, M., Ziane, K., Sattarpanah Karganroudi, S., Dhouib, R., Ibrahim, H., & Adda, M. (2022). On predictive maintenance in Industry 4.0: Overview, models, and challenges. Applied Sciences, 12(16), Article 8081. https://doi.org/10.3390/app12168081

2. Carvalho, T. P., Soares, F. A. A. M. N., Vita, R., Francisco, R. da P., Basto, J. P., & Alcalá, S. G. S. (2019). A systematic literature review of machine learning methods applied to predictive maintenance. Computers & Industrial Engineering, 137, Article 106024. https://doi.org/10.1016/j.cie.2019.106024

3. Jakubowski, J., Wojak-Strzelecka, N., Ribeiro, R. P., Pashami, S., Bobek, S., Gama, J., & Nalepa, G. J. (2024). Artificial intelligence approaches for predictive maintenance in the steel industry: A survey. arXiv.

4. Jardine, A. K. S., Lin, D., & Banjevic, D. (2006). A review on machinery diagnostics and prognostics implementing condition-based maintenance. Mechanical Systems and Signal Processing, 20(7), 1483–1510. https://doi.org/10.1016/j.ymssp.2005.09.012

5. Lee, J., Wu, F., Zhao, W., Ghaffari, M., Liao, L., & Siegel, D. (2014). Prognostics and health management design for rotary machinery systems: Reviews, methodology and applications. Mechanical Systems and Signal Processing, 42(1–2), 314–334. https://doi.org/10.1016/j.ymssp.2013.06.004

6. Lei, Y., Yang, B., Jiang, X., Jia, F., Li, N., & Nandi, A. K. (2020). Applications of machine learning to machine fault diagnosis: A review and roadmap. Mechanical Systems and Signal Processing, 138, Article 106587. https://doi.org/10.1016/j.ymssp.2019.106587

7. Ran, Y., Zhou, X., Wen, Y., & Zhu, T. (2019). A survey of predictive maintenance: Systems, purposes and approaches. arXiv.

8. Serradilla, O., Zugasti, E., Rodriguez, J., & Zurutuza, U. (2022). Deep learning models for predictive maintenance: A survey, comparison, challenges and prospects. Applied Intelligence, 52(10), 10934–10964. https://doi.org/10.1007/s10489-021-03004-y

9. Shargaev, V. (2025). Future of manufacturing: Human and machine collaboration. Lambert.

10. Shargaev, V. G. (2026). AI-based predictive maintenance in steel industry. In Proceedings of the LXXVIII International Multidisciplinary Conference «Recent Scientific Investigation». Primedia E-launch LLC.

11. Susto, G. A., Schirru, A., Pampuri, S., McLoone, S., & Beghi, A. (2015). Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3), 812–820. https://doi.org/10.1109/TII.2014.2349359

12. Zhao, R., Yan, R., Chen, Z., Mao, K., Wang, P., & Gao, R. X. (2019). Deep learning and its applications to machine health monitoring. Mechanical Systems and Signal Processing, 115, 213–237. https://doi.org/10.1016/j.ymssp.2018.05.050

13. Zonta, T., da Costa, C. A., da Rosa Righi, R., de Lima, M. J., da Trindade, E. S., & Li, G. P. (2020). Predictive maintenance in the Industry 4.0: A systematic literature review. Computers & Industrial Engineering, 150, Article 106889. https://doi.org/10.1016/j.cie.2020.106889

Молодой учёный
