It's been a while since I had the chance to write one of these articles. I have been saving up a fair bit of literature I wanted to cover, but as these things go, other things (like doing a triathlon and training for a marathon) got in the way. Bad excuse, I know.
In this article I want to touch on the paradox of AI acceleration in engineering. What do I mean by that? Using neural models (LLMs, AI, you name it) in practical work can be extremely helpful, but at the same time it is sometimes quite frustrating: it feels as if one can explore wide spaces of knowledge, yet the model somehow misses the point, or the generated output just does not feel right. This is especially pronounced when leveraging AI to (agentically) generate content such as engineering documentation.
Enjoy!
Last updated: 16th August 2025
LLMs are gaining more and more momentum through the likes of agentic AI and the growing range of tools available on the market. In this article I want to give an overview of where the technology stands and what challenges I face when practically working with these tools.
The key challenges I want to point out are:
Hallucinations: the AI tells you something that is simply made up. We all know about this already.
Breakdown of reasoning: when reasoning models fail as the complexity of logic tasks increases.
Needle in a haystack, which is especially relevant in precise technical documentation: when AI models seem to simply not use the information in your data that you would like them to, and why this happens.
The cognitive debt each one of us takes on when using these tools in the wrong way. We impede our own neuronal connections and sense of ownership (rightfully, I guess) when having AI do the work for us.
Concerns about data privacy when allowing the AI to take control of your PC. Roles and Permissions!
Some key advancements:
The machine bullshit indicator: when AI uses filler words that make your text more verbose without adding meaning.
Prompting guides: how to get the best out of your tools, with some examples.
The impact of LLMs and AI is no longer deniable. In 2025, nearly half of U.S. workers now use generative AI tools at work, up from just 30% a year prior. Across surveyed industries, agentic AI (AI systems that autonomously complete end-to-end processes) has begun to deliver measurable gains: routine knowledge work is being automated at scale, freeing up as much as 30-40% of employee time in customer service, sales, and operations.
The quantifiable economic impact is significant. Leading consultancies estimate that agentic AI could create up to $450 billion in additional economic value in major economies by 2028. When combined with generative AI, this expands to a potential $4.4 trillion added to global GDP per year by 2030, primarily through automating repetitive tasks, augmenting knowledge worker productivity, and unlocking new service models.
Key users of agentic AI are concentrated in sectors prioritizing high-volume, rules-based workflows:
Customer service: Automated agents resolve 60-80% of tier-one customer queries without human intervention, with Fortune 500 companies reporting up to 30% reductions in customer support costs.
Financial services: Banks now deploy agentic AI for fraud detection and contract analysis—JPMorgan Chase alone has saved over 360,000 annual working hours via document automation.
Manufacturing and logistics: Predictive maintenance agents at Siemens have reduced operational downtime by up to 40%. DHL and BP have reported double-digit cost reductions in supply chain and exploration workflows.
Retail and streaming: AI agents at Amazon and Netflix are credited with driving double-digit sales and engagement growth through hyper-personalized recommendations.
Performance metrics continue to climb: autonomous AI agents now match or exceed 80% accuracy in executing complex human tasks, while the cost per AI inference has plunged over 100-fold since 2022—enabling broader enterprise-scale deployment.
The most intense adoption is found among younger, digitally native employees in IT, finance, sales, and customer service, but uptake is spreading quickly into healthcare, education, and manufacturing (see the massive productivity gain displayed in the figure above).
Even as these gains accelerate, the landscape is not without challenges. Confidence in full AI autonomy is dropping, as only 27% of organizations now express strong trust in fully independent agents (down from 43% the prior year), citing issues like transparency, data readiness, and ethical alignment. Most firms are adopting agentic AI via hybrid strategies: turnkey solutions for routine workflows, bespoke agents for high-value or regulated tasks.
In summary: documented productivity gains, hours saved, and cost reductions confirm AI's real and growing role in business operations, but what does that concretely mean when working with these tools? I went ahead and summarized some of the key (painful) learnings I have had when working with AI in recent months.
In this chapter I want to discuss the challenges, for your work and "for your brain", that arise when letting AI tools do the work for you or help you with your work. These are:
Hallucinations
Data Security Concerns
Needle in a Haystack
Breakdown of Reasoning
Cognitive Debt
Despite recent advancements in domain adaptation techniques for large language models, these methods remain computationally intensive, and the resulting models can still exhibit hallucination issues; most existing adaptation methods also do not prioritize reducing the computational resources required for fine-tuning and inference. Hallucination issues have gradually decreased with each new model release. However, they remain prevalent in engineering contexts, where generating well-structured text with minimal errors and inconsistencies is critical.
LLMs are sensitive to where detailed information is placed in larger contexts. The location of the information the user is trying to retrieve has a significant impact on the quality of the output the model generates. Looking at the U curve in the figure to the right, it becomes evident that the accuracy of the response depends on the location of the information inside the context window. The link to the paper is in the first icon at the end of this section.
As the context length grows, one also sees retrieval failure towards the start of the document. The effect appears to start earlier in the multi-needle case (around 25k tokens) than in the single-needle case (which started around 73k tokens for GPT-4). This phenomenon is depicted in the heatmap graph on the top left. Here, multiple needles (facts of relevance) are placed inside a context window that grows in token size (x-axis). The link is below the figure.
In another experiment, models were given some facts about San Francisco and then pressure-tested in a single-needle scenario. The models were prompted to answer what the best thing to do in San Francisco was, using only the provided context. This was then repeated for different depths between 0% (top of document) and 100% (bottom of document) and different context lengths between 1k tokens and the token limit of each model (128k for GPT-4 and 200k for Claude 2.1). The graphs below document the performance of these two models, with the link to the source in the third icon below the figures.
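If you want to get a feel for this effect on your own documents, a minimal harness is easy to sketch. The code below is only an illustrative setup, not the exact protocol of the cited papers: `ask_model` is a placeholder for whichever chat API you use, and the scoring is a naive substring check.

```python
# A minimal needle-in-a-haystack sweep (a sketch; ask_model is a placeholder).

def build_context(filler_paragraphs: list[str], needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = top of document, 1.0 = bottom)."""
    idx = round(depth * len(filler_paragraphs))
    return "\n\n".join(filler_paragraphs[:idx] + [needle] + filler_paragraphs[idx:])

def run_depth_sweep(ask_model, filler_paragraphs, needle, question, expected_answer):
    """Return {depth: True/False} depending on whether the fact was retrieved."""
    results = {}
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        context = build_context(filler_paragraphs, needle, depth)
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        answer = ask_model(prompt)
        results[depth] = expected_answer.lower() in answer.lower()  # naive scoring
    return results
```

Sweeping the depth (and, in a second loop, the total context length) reproduces the shape of the curves discussed above: accuracy tends to be best near the edges of the context and worst somewhere in the middle.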
In the Apple paper "The Illusion of Thinking", AI reasoning models are tested with respect to their reasoning traces and how these apply logic, a topic that is not widely understood since most benchmarks only validate the final result.
Reasoning models (LRMs), i.e. models with self-reflection mechanisms, outperform traditional LLMs on reasoning benchmarks; however, these benchmarks do not validate the thinking process of the models per se but only evaluate the end results.
The Apple researchers analysed how these models perform on logic tasks such as puzzles, where each step of the reasoning process can be monitored and evaluated.
Some key findings could be derived, as listed below:
There are three distinct reasoning regimes:
On simple tasks, standard LLMs outperform reasoning-enabled LRMs due to LRMs “overthinking” and wasting tokens.
On medium-complexity tasks, LRMs can outperform LLMs, but accuracy varies across puzzle types.
On hard tasks, both LRMs and LLMs suffer a total collapse in accuracy; neither produces correct solutions.
Counterintuitive resource use: As problem complexity rises, LRMs use more thinking tokens until a critical threshold, then paradoxically reduce reasoning effort—even when plenty of compute remains.
Generalization failure: Providing explicit solution algorithms does not help LRMs execute the steps correctly; reasoning breakdown occurs at similar thresholds as with open-ended problems.
Puzzle type dependence: Performance and breakdown points are strongly influenced by puzzle structure and model training exposure.
While this study evaluates only a very small slice of problems where reasoning is required, it clearly shows the limitations of these models in applying logic to the given "toy" problems. Find the link to this paper in the fourth link icon below.
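To make the idea of step-level evaluation concrete, here is a small sketch of a move verifier for Tower of Hanoi, one of the puzzles used in the study. The move format (a list of (from_peg, to_peg) pairs) is my own assumption; the point is simply that every intermediate step of a model's solution can be checked mechanically, not just the final answer.

```python
# Sketch of step-level checking: validate every move the model proposes,
# instead of only grading the final state.

def verify_hanoi(n_disks: int, moves: list[tuple[int, int]]) -> tuple[bool, int]:
    """Return (solved, index_of_first_invalid_move); -1 means all moves were legal."""
    pegs = {0: list(range(n_disks, 0, -1)), 1: [], 2: []}  # peg 0 holds disks n..1
    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:
            return False, i                       # moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, i                       # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    solved = len(pegs[2]) == n_disks              # all disks moved to the target peg
    return solved, -1

# Example: the optimal 3-disk solution (7 moves) passes the check.
optimal = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
print(verify_hanoi(3, optimal))  # (True, -1)
```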
In the paper "Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task", a group of researchers analyses brain activity and the formation of connections in the brain while participants are tasked with writing essays on given topics. The 54 participants are roughly split into three groups with varying levels of tool support: Group 1 has no tools available, Group 2 has search engines (i.e. Google) available to them, and Group 3 has access to LLMs.
Brain connectivity, as well as reported ownership of essays written just minutes earlier, systematically scaled down with the amount of external support:
the Brain-only group exhibited the strongest, widest-ranging networks, with very strong reported ownership of essays in subsequent interviews,
the Search Engine group showed intermediate engagement, with strong reported ownership of essays in subsequent interviews,
and the LLM-assisted group demonstrated the weakest overall coupling, with low reported ownership of essays in subsequent interviews.
Taken together, the behavioral data revealed that higher levels of neural connectivity and internal content generation in the Brain-only group correlated with stronger memory, greater semantic accuracy, and firmer ownership of written work. The Brain-only group, though under greater cognitive load, demonstrated deeper learning outcomes and a stronger sense of identity with their output. The Search Engine group displayed moderate internalization, likely balancing effort with outcome. The LLM group, while benefiting from tool efficiency, showed weaker memory traces, reduced self-monitoring, and fragmented authorship.
In this section I want to point out a field of research I found fairly amusing but also very relevant: Machine Bullshit.
The Machine Bullshit taxonomy is described in the table below. It typically consists of the subtypes: Empty Rhetoric, Weasel Words, Paltering, Unverifiable Claims, and Flattery.
In the authors' marketplace experiments, they found that no matter what facts the AI knows, it insists the products have great features most of the time, and that the AI does not become confused about the truth; it becomes uncommitted to reporting it. Both effects are primarily introduced by RLHF.
Some models are more prone to Machine Bullshit, some less. In the following figure the authors categorize different models according to the different subtypes. See in the figure below a categorization in the realm of political topics. Also find the link to this publication in the icon at the bottom.
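As a rough illustration of how such a tendency can be quantified, the sketch below correlates what a model internally believes (a probability that a statement is true) with what it explicitly claims. This is only my approximation in the spirit of the paper's Bullshit Index, not its exact formula, and the example data is made up.

```python
# Sketch: low |correlation| between belief and claim means the model's statements
# are decoupled from what it believes (it is "uncommitted to reporting the truth").
from statistics import mean, pstdev

def bullshit_index(beliefs: list[float], claims: list[int]) -> float:
    """beliefs: P(statement is true) per item; claims: 1 if the model asserted it, else 0."""
    mb, sb = mean(beliefs), pstdev(beliefs)
    mc, sc = mean(claims), pstdev(claims)
    if sb == 0 or sc == 0:
        return 1.0  # no variation: the claims carry no information about the beliefs
    corr = mean((b - mb) * (c - mc) for b, c in zip(beliefs, claims)) / (sb * sc)
    return 1.0 - abs(corr)

# A model that asserts "great feature" regardless of its belief scores close to 1.
print(bullshit_index([0.9, 0.2, 0.8, 0.1], [1, 1, 1, 1]))  # 1.0
print(bullshit_index([0.9, 0.2, 0.8, 0.1], [1, 0, 1, 0]))  # close to 0.0
```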
====Use ChatGPT Deep Research====
You are a senior McKinsey consultant writing an industry deep-dive for institutional portfolio managers.
**Scope**
• Industry: [{{INDUSTRY}}]
• Geographic focus: [{{REGION}}] (specify “Global” to cover all major regions)
• Time horizon: [{{TIME_HORIZON}}] (e.g., 2024-2030)
• Currency: [{{USD / EUR / etc.}}]
• Report date: [{{DATE}}]
• Output format: Markdown headings, numbered sections, bullet lists, and concise sentences. Include tables in Markdown and clearly label every chart.
• Sources: Only publicly available information accessible via web search, company filings, government data, trade groups, academic research, reputable media, and open datasets.
**Required Sections**
1. **Executive Summary** – 400-600 words with bullet highlights and a one-sentence bottom line.
2. **Industry Definition and Segmentation** – define scope, key sub-segments, and value-chain stages.
3. **Market Size and Growth Outlook** – current TAM, historic CAGR, and five-year forecast with at least two scenarios. Summarize drivers and inhibitors.
4. **Macro Context** – macroeconomic, demographic, and policy factors affecting the sector.
5. **Competitive Landscape** – top 10 companies by revenue, market share table, recent M&A, and private-equity activity.
6. **Porter Five Forces Analysis** – rate each force (Low/Medium/High) with supporting evidence.
7. **Cost Structure and Economics** – typical gross margin, capex intensity, and working-capital cycle.
8. **Technology and Innovation Trends** – key R&D themes, patent velocity, and emerging business models.
9. **Regulatory and ESG Considerations** – current rules, pending legislation, carbon footprint, and material ESG risks.
10. **Regional Nuances** – compare key regions on demand outlook, supply chain depth, and policy.
11. **Risk Matrix** – rank top 5 risks by likelihood and impact, describe early-warning indicators.
12. **Strategic Implications for Investors** – outline three actionable theses, each with trigger events, KPIs to track, and illustrative upside/downside.
13. **Appendix** – detailed tables, methodology notes, and full bibliography.
**Style & Depth**
• Write for sophisticated buy-side readers. Assume familiarity with finance jargon.
• Aim for 7,000-9,000 words total.
• Use precise, evidence-backed statements. Avoid fluff.
• No em dashes!
**Workflow Instructions for ChatGPT**
1. Before drafting, silently plan the structure in bullet form.
2. Search the web iteratively until confident data coverage is >90 percent.
3. Draft each section in order, keeping paragraphs under 120 words.
4. Insert tables immediately after the paragraph that cites them.
5. End with the full footnote list sorted by first appearance. Ask for clarifications only if absolutely required; otherwise start the report immediately.
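If you reuse the template often, it helps to fill the placeholders programmatically before pasting the prompt into ChatGPT (or sending it via an API). The snippet below is a minimal sketch; the file name "deep_research_prompt.txt" and the example values are assumptions.

```python
# Fill the {{...}} placeholders of the saved template with concrete values.
from pathlib import Path

placeholders = {
    "{{INDUSTRY}}": "Industrial automation",
    "{{REGION}}": "Global",
    "{{TIME_HORIZON}}": "2024-2030",
    "{{USD / EUR / etc.}}": "USD",
    "{{DATE}}": "2025-08-16",
}

template = Path("deep_research_prompt.txt").read_text(encoding="utf-8")
for key, value in placeholders.items():
    template = template.replace(key, value)

print(template)  # paste the result into ChatGPT Deep Research
```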
Further very useful material on prompting is regularly updated by my personal fav AI company Anthropic: https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview
In an endless pursuit of understanding and learning, this is just one more step. I hope this article helps you understand these magical and weird tools a little better. Also, don't forget to brain-jog every now and then.
Feel free to connect on LinkedIn and give me feedback.