Lessons with LLMs
Over six months on our BNY Capstone project on risk management, I have come to appreciate the intricate details involved in managing and applying LLMs. It is easy to get swept away by the popular discourse on agentic approaches or large language model intelligence, but my experience on this project has stressed that the details still matter. Although our team had a usable front-end in place back in November, we are still iterating on how to channel LLMs to populate that front-end with valuable results.
I arrived at three realizations as our team grappled with applying LLMs extensively in our final solution: LLMs remain limited for quantitative purposes, they are slow to process large context windows, and they are unable to assess the relevance of news events without extensive prompting.
LLMs Remain Limited In Quantitative Purposes
In our experiments on scoring each piece of news by its risk to our client, we found that the resulting scores did not align with the expected impact on the client, despite prompting that both offered examples of rated events and requested a justification for the assigned score.
In December, we ran this experiment across multiple models and looked for agreement between them on the score for a given news article. The models returned varying scores even though they received the same example articles and corresponding scores. Subsequently, to move beyond a score based on example articles, we supplied the LLMs with a formula based on the presence or absence of factors contributing to risk, such as a high volatility score or a credit downgrade. Even with a set formula, the models did not consistently recognize and categorize events (0/1) in the news, which calls the viability of this approach into question.
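To make the formula idea concrete, here is a minimal sketch of a score built from 0/1 factor flags, plus a simple measure of how often two models agree on those flags. The factor names, the weights, and the two example flag sets are assumptions made up for illustration; they are not our actual formula or results.

```python
# Hypothetical factors and weights, for illustration only.
FACTOR_WEIGHTS = {
    "high_volatility": 3.0,
    "credit_downgrade": 4.0,
    "regulatory_action": 2.0,
}


def risk_score(flags):
    """Return the weighted sum of 0/1 factor flags (missing factors count as 0)."""
    for name, value in flags.items():
        if name not in FACTOR_WEIGHTS:
            raise KeyError(f"unknown factor: {name}")
        if value not in (0, 1):
            raise ValueError(f"flag for {name} must be 0 or 1, got {value!r}")
    return sum(weight * flags.get(name, 0) for name, weight in FACTOR_WEIGHTS.items())


def flag_agreement(flags_a, flags_b):
    """Return the share of factors on which two sets of flags agree."""
    matches = sum(flags_a.get(n, 0) == flags_b.get(n, 0) for n in FACTOR_WEIGHTS)
    return matches / len(FACTOR_WEIGHTS)


# Made-up flags that two different models might return for the same article.
model_a = {"high_volatility": 1, "credit_downgrade": 1, "regulatory_action": 0}
model_b = {"high_volatility": 1, "credit_downgrade": 0, "regulatory_action": 0}

score_a = risk_score(model_a)
score_b = risk_score(model_b)
agreement = flag_agreement(model_a, model_b)
```

The arithmetic itself is deterministic, so any disagreement in the final score traces back to the flags the model assigns. In this toy case a single differing flag moves the score from 7.0 to 3.0, which is the kind of inconsistency described above.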
Large Context Windows Can Slow The Workflow
Given our goal of interpreting the risk associated with news and offering a hedging strategy, speed is of the utmost importance to ensure the hedge remains effective. However, although individual calls are quick, we have found that the combination of LLM processing and machine learning modeling takes a long time and consequently erases any value that could have been gained from reacting efficiently to the news.
Ultimately, the goal is to imitate a risk analyst faced with specific news, and embedding the full text of an article takes significantly longer than a risk analyst would need to process it. Similarly, when there are many articles with extensive text, the LLM takes considerable time to evaluate and score each article or to assess its relevance to risk.
Moreover, though this is independent of the LLM, using text as model input adds considerable time, because the features often number in the thousands. We have therefore moved towards keyword identifiers to reduce the number of variables involved in modeling and predicting the intensity of risk. To supplement these variables, we have sought to include historical metrics, such as traded volume or stock price, to add context when predicting risk intensity, measured by variables such as volatility or the change in traded indices.
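As a rough illustration of this shift, the sketch below replaces thousands of text features with a handful of keyword indicators and appends two historical metrics. The keyword list, the function names, and the sample values are hypothetical; our actual keywords and metrics are not listed in this post.

```python
import re

# Hypothetical keyword identifiers; the real list would be chosen by the team.
KEYWORDS = ("downgrade", "default", "tariff", "rate hike")


def keyword_features(text):
    """Return one 0/1 indicator per keyword, matched on whole words, case-insensitively."""
    lowered = text.lower()
    return [
        int(re.search(r"\b" + re.escape(keyword) + r"\b", lowered) is not None)
        for keyword in KEYWORDS
    ]


def build_feature_row(text, traded_volume, stock_price):
    """Combine keyword indicators with historical metrics into one numeric row."""
    indicators = [float(flag) for flag in keyword_features(text)]
    return indicators + [float(traded_volume), float(stock_price)]


row = build_feature_row(
    "Agency issues credit downgrade after new tariff announcement",
    traded_volume=1_200_000,
    stock_price=101.5,
)
```

Each article becomes a row of six numbers here instead of a vector with thousands of entries, which is what makes the downstream model faster to fit and to run.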
Relevance Requires Careful Task Design
Most recently, we have attempted to use the LLM to determine whether a given news event is relevant to the risk metric of interest. For instance, we are currently evaluating whether a news piece affects the dollar’s value relative to other currencies, as this could inform hedging strategy. To validate its accuracy, we compare the LLM’s response to historical news against the actual movement in foreign exchange rates that day.
However, initial testing shows that local LLMs, which are typically limited in reasoning due to fewer parameters, struggle to accurately assess the direction of the change prompted by news. For instance, for three of the 15 historical articles tested, the LLM expected a decrease in the dollar’s value when the actual change was an increase.
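The validation described above can be sketched as a comparison between the LLM's predicted direction and the sign of that day's exchange-rate change. The function names and the five data points below are toy assumptions for illustration, not our test set or results.

```python
def actual_direction(fx_change):
    """Map a same-day exchange-rate change to a direction label."""
    if fx_change > 0:
        return "up"
    if fx_change < 0:
        return "down"
    return "flat"


def evaluate_directions(predicted, fx_changes):
    """Count correct calls and outright reversals (up predicted as down, or vice versa)."""
    if len(predicted) != len(fx_changes):
        raise ValueError("predicted and fx_changes must have the same length")
    actual = [actual_direction(change) for change in fx_changes]
    correct = sum(p == a for p, a in zip(predicted, actual))
    reversed_calls = sum({p, a} == {"up", "down"} for p, a in zip(predicted, actual))
    return {"n": len(actual), "correct": correct, "reversed": reversed_calls}


# Toy data only: LLM direction calls and the dollar's same-day change.
predicted = ["up", "down", "down", "up", "flat"]
fx_changes = [0.4, 0.2, -0.1, 0.3, -0.5]
summary = evaluate_directions(predicted, fx_changes)
```

Tracking reversals separately from ordinary misses seems useful here, since hedging in the wrong direction is likely more costly than not reacting at all.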
Therefore, we have divided the LLM’s task into separate components, which has helped it classify the impact of news more accurately. Instead of inundating the LLM with multiple questions (does a news event impact foreign exchange rates and, if so, in which direction and with what intensity?), we ask it to classify the news event into one of twenty-five event types that could impact an exchange rate. This reduces the dependence on the LLM’s reasoning capabilities and recasts the task as a classification problem, which the LLM can perform more effectively. Moreover, the question of response intensity has been transferred to a model with keyword indicators and technical metrics, as quantitative operations are better performed by an ML model.
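A minimal sketch of the classification step might look like the following. The label names are an illustrative subset (the twenty-five event types are not listed in this post), and `ask_llm` is a hypothetical callable standing in for whatever local model is used; the stub at the bottom only imitates a response.

```python
# Illustrative subset of event types; the actual list has twenty-five entries.
EVENT_LABELS = (
    "interest_rate_decision",
    "inflation_release",
    "trade_policy",
    "geopolitical_conflict",
    "other",
)


def build_prompt(article_text):
    """Ask for exactly one label from the fixed list, with no free-form reasoning."""
    options = "\n".join(f"- {label}" for label in EVENT_LABELS)
    return (
        "Classify the news article into exactly one of these event types.\n"
        f"{options}\n"
        "Answer with the label only.\n\n"
        f"Article:\n{article_text}"
    )


def parse_label(response):
    """Normalize the response and fall back to 'other' if it is not a known label."""
    cleaned = response.strip().lower()
    return cleaned if cleaned in EVENT_LABELS else "other"


def classify_event(article_text, ask_llm):
    """Run one classification call; ask_llm is any function mapping prompt -> text."""
    return parse_label(ask_llm(build_prompt(article_text)))


def fake_llm(prompt):
    """Stub standing in for a real model call."""
    return " Interest_Rate_Decision\n"


label = classify_event("Central bank raises its policy rate.", fake_llm)
```

Constraining the answer to a closed label set also makes bad outputs easy to detect: anything outside the list is mapped to "other" rather than passed downstream.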
Takeaway
Overall, this project has taught me the value of a human in the loop. This is perhaps skeptical and may reflect my lack of knowledge in the science of LLM prompting, but there still appears to be significant room for machine learning where the LLM cannot reason or integrate math well enough to replace a human. Over the previous six months, and the last three in particular, despite more effective prompting to derive value from the LLMs, I have grown considerably more skeptical of the AI revolution, especially as it relates to research.