Machine Learning Modeling of a Steam Methane Reforming Process
This project formulates a machine learning regression problem for a steam methane reforming (SMR) plant integrated with a palladium membrane reactor (MR). The goal is to predict key process and economic indicators from operating inputs. The context, variables and relevance follow the SMR–MR study presented in this article.
Schematic of the SMR-MR Process
In this study, an optimal design and operating setpoints for key process variables were found by performing a large-scale, nonlinear constrained optimization of the plant, based on a techno-economic analysis of a palladium membrane reactor (Pd-MR). The flowsheet of the novel design is shown in the figure above. The economics were modeled through the plant's total annual cost (TAC), which comprises the annuitized total module cost of the plant (CTM) and the cost of manufacturing. The operating costs were expressed as direct manufacturing costs (DMC): the sum of the annual raw material, utility and wastewater treatment costs, plus membrane replacement costs where applicable.
Description of Variables and Dataset
Inputs (features):
Natural Gas Feed (kmol/h) — main hydrocarbon feed rate
Steam Feed (kmol/h) — steam co-feedrate (affects S/C ratio)
Temperature (°C) — reactor/operating temperature
Outputs (targets / labels):
Total Annual Cost (TAC) \([\$/h]\) — total annualized cost proxy
Hydrogen Production \([Nm^3/h]\) — \(H_2\) production rate
CO in Tail Gas [fraction] — CO mol fraction in the tail gas
Data file: smr_mr_dataset.xlsx (single sheet named “data”).
All units are as provided in the dataset. For definitions and context on TAC and the SMR–MR performance indicators, see this article.
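A first look at the file described above could be sketched as follows. This is a minimal illustration, not a prescribed workflow: the helper names `load_dataset` and `inspect` are hypothetical, the file is assumed to sit in the working directory, and the column headers are not assumed at all (every numeric column in the sheet is summarized as it comes).

```python
from pathlib import Path

import pandas as pd

DATA_FILE = "smr_mr_dataset.xlsx"  # file name from the project description
SHEET_NAME = "data"                # single sheet named "data"


def load_dataset(path=DATA_FILE, sheet_name=SHEET_NAME):
    """Read the sheet into a DataFrame (hypothetical helper)."""
    # Reading .xlsx files with pandas requires the openpyxl package.
    return pd.read_excel(path, sheet_name=sheet_name)


def inspect(df):
    """Return summary statistics, missing-value counts and correlations."""
    summary = df.describe().T          # count, mean, std, min, quartiles, max
    missing = df.isna().sum()          # missing entries per column
    corr = df.corr(numeric_only=True)  # pairwise Pearson correlation
    return summary, missing, corr


# Only runs when the file is present in the working directory (assumption).
if Path(DATA_FILE).exists():
    df = load_dataset()
    summary, missing, corr = inspect(df)
    print(summary)
    print(missing)
    print(corr.round(2))
```

The min and max columns of the summary give a quick unit check: for example, a CO mole fraction should lie between 0 and 1, and temperatures should fall in a range that is plausible for °C.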
Using the provided dataset, build three separate regression models (one per target: TAC, CO, \(H_2\)) to predict outputs from inputs. Use scikit-learn and a machine learning model of your own choosing for each target.
You should:
Formalize the ML problem (features, targets, assumptions).
Load and inspect the dataset; perform basic analysis of the data (distributions, correlations, unit checks).
Define and justify your data split (train/test).
Build pipelines (preprocessing + regressor).
Train and validate one model per target (TAC, CO, \(H_2\)). You may choose a different algorithm for each target.
Evaluate with MAE, RMSE, and \(R^2\) on the held‑out test set; include residual and predicted‑vs‑true plots.
Provide a short discussion: model choice rationale, performance, physical sanity‑checks, limitations.
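The split, pipeline, training and evaluation steps listed above can be sketched in a few lines of scikit-learn. The sketch below is only one possible arrangement: it runs on synthetic placeholder data (not the SMR-MR dataset), the response functions are made up, and the choice of a random forest, the 80/20 split and the short target keys `"TAC"`, `"H2"` and `"CO"` are all illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data, NOT the SMR-MR dataset: three inputs and three
# made-up smooth responses, used only so that the sketch runs on its own.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 3))
Y = {
    "TAC": 2.0 * X[:, 0] + X[:, 1] + 0.5 * X[:, 2],
    "H2": X[:, 0] * (1.0 + X[:, 2]),
    "CO": 0.1 * X[:, 1] + 0.05 * X[:, 2] ** 2,
}

# One random split of the row indices, reused for every target so that all
# three models are scored on the same held-out rows.
idx_train, idx_test = train_test_split(
    np.arange(len(X)), test_size=0.2, random_state=0
)

results = {}
for name, y in Y.items():
    # Pipeline = preprocessing + regressor. Scaling is not required by a
    # random forest; it is kept here to show where preprocessing goes.
    model = make_pipeline(
        StandardScaler(),
        RandomForestRegressor(n_estimators=200, random_state=0),
    )
    model.fit(X[idx_train], y[idx_train])

    y_true = y[idx_test]
    y_pred = model.predict(X[idx_test])
    residuals = y_true - y_pred  # basis for the residual plot

    results[name] = {
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "R2": r2_score(y_true, y_pred),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```

With the real data, `X` and each `y` would come from the spreadsheet columns instead. The `residuals` array and the `(y_true, y_pred)` pairs inside the loop are the quantities the residual and predicted-vs-true plots would be drawn from.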