This blog post gives a first look at how to use open-source NLP models and large language models (LLMs) with Python.
Transformer models have revolutionized natural language processing (NLP) by enabling models to handle sequential data more efficiently than traditional architectures. Introduced in the seminal paper Attention Is All You Need by Vaswani et al. in 2017, transformers rely on a mechanism called self-attention to process input data.
Transformers are neural network architectures designed to handle sequences of data by weighing the relevance of different elements in the input. This is achieved through the self-attention mechanism, which allows the model to consider the entire input sequence simultaneously, capturing long-range dependencies more effectively.
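To make the idea concrete, here is a minimal sketch of scaled dot-product self-attention in plain NumPy. The projection matrices and the toy input sequence are random values chosen purely for illustration, not part of any real model:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # X has shape (seq_len, d_model): one embedding per token
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # project into queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # similarity of every token with every other token
    weights = softmax(scores, axis=-1)       # one attention distribution per token
    return weights @ V                       # each output mixes information from the whole sequence

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 8)

Because every token's output depends on all other tokens through a single matrix operation, the whole sequence can be processed at once rather than step by step.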
Key components:
Self-Attention: lets every token attend to every other token in the sequence and weigh its relevance.
Positional Encoding: injects word-order information, since attention by itself is order-agnostic.
Feed-Forward Layers: position-wise transformations applied after attention, combined with residual connections and layer normalization.
Transformer models can be categorized based on their architecture:
Encoder-only models (e.g., BERT, RoBERTa): read the full input at once; well suited to understanding tasks such as classification.
Decoder-only models (e.g., GPT-style models): generate text token by token; well suited to text generation.
Encoder-decoder models (e.g., T5, FLAN-T5): map an input sequence to an output sequence; well suited to translation and summarization.
Transformer models offer several advantages: all tokens are processed in parallel rather than sequentially, long-range dependencies are captured directly through attention, and large pre-trained models transfer well to new tasks with relatively little fine-tuning.
By understanding the roles of encoders and decoders within transformer models, you can select the appropriate architecture for your NLP tasks, leveraging the power of modern language models to achieve state-of-the-art results.
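As a rough illustration of how this choice shows up in code (a sketch using the Hugging Face transformers library introduced in the next section; the model names are just common examples), each architecture family has its own model class:

from transformers import (
    AutoModelForSequenceClassification,  # encoder-only models, e.g. BERT/RoBERTa classifiers
    AutoModelForCausalLM,                # decoder-only models, e.g. GPT-style text generators
    AutoModelForSeq2SeqLM,               # encoder-decoder models, e.g. T5/FLAN-T5
)

encoder_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
decoder_model = AutoModelForCausalLM.from_pretrained("gpt2")
seq2seq_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

We will use the first and third of these classes in the examples below.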
Hugging Face is a leading open-source platform that provides tools and resources for natural language processing (NLP) and machine learning. It hosts a vast collection of pre-trained models, datasets, and libraries that make it easier for developers and researchers to leverage state-of-the-art models for various NLP tasks.
At its core, Hugging Face is a community-driven hub that facilitates the sharing and use of machine learning models. The platform offers:
Transformers Library: A popular library that provides APIs and tools to easily download and use pre-trained models for tasks like text classification, translation, summarization, question answering, and more.
Model Hub: A repository where developers can share and access thousands of pre-trained models contributed by the community.
Datasets Library: A collection of ready-to-use datasets for NLP tasks.
Tokenizers Library: Efficient implementations of tokenizers, which are essential for preparing text data for models.
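As a quick taste of the Transformers library, the high-level pipeline API can run a pre-trained model in a couple of lines. A minimal sketch (on first use it downloads a default sentiment model, currently the DistilBERT SST-2 model discussed below; the exact default may change between library versions):

from transformers import pipeline

# Downloads and caches a default pre-trained sentiment model on first use
classifier = pipeline("sentiment-analysis")
print(classifier("Open-source NLP models are surprisingly easy to use!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.9998}]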
Several open-source large language models (LLMs) are available on Hugging Face. Some of the most popular families include Meta's Llama models, Mistral, Falcon, BLOOM, and Google's Flan-T5 (an instruction-tuned encoder-decoder family that we will use later in this post).
In this section, we’ll walk through setting up a sentiment analysis pipeline using Hugging Face’s Transformers library in Python. We’ll cover installing the necessary packages, explain each step of the code, and demonstrate how to run a simple sentiment analysis.
First, ensure you have Python installed on your system. Hugging Face's Transformers library requires a reasonably recent Python 3 (current releases require Python 3.8 or higher). You can check your Python version by running:
python --version
We need to install transformers and torch (PyTorch) or tensorflow:
pip install transformers
pip install torch
This can of course also be installed with any virtual environment tool, such as poetry.
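For example, with poetry the packages would be added to an existing project roughly like this:

poetry add transformers torch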
The default sentiment analysis model in the transformers library is distilbert-base-uncased-finetuned-sst-2-english. This model is trained on the Stanford Sentiment Treebank v2 (SST-2) dataset, which contains only two sentiment labels, Positive and Negative, so it can only classify input as one or the other; there is no Neutral class. Therefore, I will instead use the cardiffnlp/twitter-roberta-base-sentiment-latest model, a RoBERTa-base model trained on ~124M tweets (posted between 2018 and 2021) and fine-tuned for sentiment analysis on the TweetEval benchmark, which adds a Neutral label.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np
from scipy.special import softmax
# Preprocess text (username and link placeholders; this is how the model was trained)
def preprocess(text):
    new_text = []
    for t in text.split():
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
# Load the tokenizer and model
model_name = 'cardiffnlp/twitter-roberta-base-sentiment-latest'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Labels mapping
labels = ['Negative', 'Neutral', 'Positive']
def analyze_sentiment(text):
    # Preprocess the text
    text = preprocess(text)
    # Encode the text
    encoded_input = tokenizer(text, return_tensors='pt')
    # Get model outputs
    with torch.no_grad():
        output = model(**encoded_input)
    # Convert logits to probabilities
    scores = output.logits[0].numpy()
    scores = softmax(scores)
    # Get the label with the highest score
    ranking = np.argsort(scores)
    ranking = ranking[::-1]
    top_label = labels[ranking[0]]
    top_score = scores[ranking[0]]
    # Return the result
    return {
        'label': top_label,
        'score': float(top_score),
        'scores': {labels[i]: float(scores[i]) for i in range(len(scores))}
    }
# Test the function
proposal_text = "I am thrilled with the new updates to our DAO governance model!"
sentiment = analyze_sentiment(proposal_text)
print(f"Text: {proposal_text}")
print(f"Sentiment: {sentiment['label']}, Score: {sentiment['score']:.4f}")
print("All scores:", sentiment['scores'])
Below are some examples where the model successfully classifies a positive, a negative, and a neutral text.
Text: I am thrilled with the new updates to our DAO governance model!
Sentiment: Positive, Score: 0.9901
All scores: {'Negative': 0.0016886675730347633, 'Neutral': 0.008168383501470089, 'Positive': 0.9901429414749146}
Text: I am not thrilled with the new updates to our DAO governance model!
Sentiment: Negative, Score: 0.9132
All scores: {'Negative': 0.9132303595542908, 'Neutral': 0.07846872508525848, 'Positive': 0.008300881832838058}
Text: My name is Magnus
Sentiment: Neutral, Score: 0.8236
All scores: {'Negative': 0.03530899062752724, 'Neutral': 0.8236108422279358, 'Positive': 0.14108018577098846}
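If you need to score many texts at once (for example a batch of governance proposals), the tokenizer and model also accept lists. A minimal sketch that reuses the tokenizer, model, labels, and preprocess defined above; padding=True is needed so all sequences in the batch get the same length:

texts = [
    "Great proposal, I fully support it!",
    "This is a terrible idea.",
    "The vote takes place on Tuesday.",
]
encoded = tokenizer([preprocess(t) for t in texts], return_tensors='pt',
                    padding=True, truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**encoded).logits, dim=-1)
for text, p in zip(texts, probs):
    print(f"{text} -> {labels[int(p.argmax())]} ({float(p.max()):.4f})")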
When using these models in papers, it is important to cite the underlying work correctly, to give credit where it is due. Citation information can be found on the project page.
@inproceedings{loureiro-etal-2022-timelms,
title = "{T}ime{LM}s: Diachronic Language Models from {T}witter",
author = "Loureiro, Daniel and
Barbieri, Francesco and
Neves, Leonardo and
Espinosa Anke, Luis and
Camacho-collados, Jose",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-demo.25",
doi = "10.18653/v1/2022.acl-demo.25",
pages = "251--260"
}
In this section, we’ll explore how to use Large Language Models (LLMs) to evaluate a CV. We’ll walk through extracting text from a PDF CV, feeding it into an LLM, and interpreting the results.
There are a few open-source Python libraries for extracting text from PDFs, such as PyMuPDF, pdfplumber, and pypdf, as well as paid options such as Azure Document Intelligence. In this blog post, we will use pymupdf4llm. This package converts the pages of a PDF to Markdown text using PyMuPDF: standard text and tables are detected, put into the correct reading order, and then converted to GitHub-compatible Markdown.
The choice of language model is of course very important. For this example, we will use google/flan-t5-base and run it on an NVIDIA GeForce GTX 1060 6GB (an old and not particularly powerful GPU).
pip install pymupdf4llm
pip install torch
import pymupdf4llm
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Load the CV PDF and convert it to a Markdown string
cv_file = "/home/magnus/Git/HanssonMagnus/research/latex/cv/professional/cv_magnus.pdf"
md_text = pymupdf4llm.to_markdown(cv_file)
# Evaluate the CV with an LLM
def evaluate_cv_with_hf(cv_text):
    """
    Uses an open-source LLM to evaluate a CV.

    Args:
        cv_text (str): The text content of the CV.

    Returns:
        str: The evaluation response from the model.
    """
    model_name = 'google/flan-t5-base'
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Use GPU if available
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)

    prompt = (
        "You are a professional career advisor. Please evaluate the following CV:\n\n"
        f"{cv_text}\n\n"
    )

    inputs = tokenizer(prompt, return_tensors='pt', truncation=True,
                       max_length=512).to(device)

    outputs = model.generate(
        **inputs,
        max_length=512,
        num_beams=5,
        early_stopping=True
    )

    evaluation = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return evaluation
evaluation = evaluate_cv_with_hf(md_text)
print("CV Evaluation:\n")
print(evaluation)
The following is the result:
CV Evaluation:
**WORK EXPERIENCE** **Stockholm Business School and Swedish House of Finance** August
2023 - Ongoing _Assistant Professor (Quantitative Finance)_ _Stockholm, Sweden_
Quantitative finance, market microstructure, and decentralized finance. __ **Independent
Consultant** September 2023 - Ongoing _Data Scientist_ _Remote_ Risk modelling and data
science. __ **Combine Control Systems** September 2017 - August 2018 _Data Scientist_
_Gothenburg, Sweden_ Artificial neural network for virtual engine testing. __ **Nordea
Bank** 2012, 2013, 2014 _Summer Analyst_ _Gothenburg, Sweden_ Summer analyst in
corporate retail. __ **EDUCATION** **University of Gothenburg** September 2018 - June
2023 _PhD Economics (Quantitative Finance)_ _Gothenburg, Sweden_ Thesis: “Decentralized
Finance and Central Bank Communication”. __ Visiting PhD student at Stockholm School of
Economics. __ **Lund University** September 2016 - June 2017 _MSc Financial Economics_
_Lund, Sweden_ Thesis: “On Stock Return Prediction with LSTM Networks”. __ **Lund
University** September 2011 - June 2014 _BSc Mathematics_ _Lund, Sweden_ Exchange
semester at University of St.Gallen.
It is noteworthy how “bad” this result is compared to more advanced models. The exact same prompt gives the following result from ChatGPT-4o:
# CV Evaluation
## Strengths:
1. **Concise and Clear**: The CV highlights your background in quantitative finance,
data science, and decentralized finance concisely. The focus on methods and
programming is strong.
2. **Relevant Experiences**: Your roles in academia, industry, and consulting show
versatility and expertise in both research and practical applications.
3. **Clear Technical Competence**: Listing programming languages and tools clearly
conveys your proficiency. It's good that you highlight Linux, Git, Vim, and Ethereum
nodes, which shows depth in technical knowledge.
4. **PhD and Research**: Your education section clearly outlines advanced academic
achievements, which is a major strength, especially for research or quant-focused
roles.
## Areas for Improvement:
1. **Formatting**: The CV could benefit from more consistent formatting. For example,
bullet points or detailed descriptions under each role would make it easier to scan
for specific responsibilities or achievements. Right now, the CV reads more like a
timeline.
2. **Achievements and Impact**: For each role, especially in consultancy and industry,
add concrete accomplishments or measurable impacts, e.g., "Developed and implemented
a risk model that reduced X by Y%." This will help quantify your contributions and give
the reader a sense of your practical results.
3. **Work Experience Details**: The 'Assistant Professor' and 'Independent Consultant'
roles could be expanded. Describe key projects, teaching focus, or consulting
deliverables, especially if they are relevant to future career moves (e.g., towards
quantitative finance roles).
4. **Education Clarity**: The education section is strong, but you could mention any
specific coursework or collaborations, particularly during your PhD or visiting
roles, if they involved notable researchers or institutions.
5. **Tools & Methods Section**: This section could be broken down more clearly. Consider
categorizing into **Languages**, **Methods**, and **Tools** instead of a long list.
Additionally, you might want to include some context for how you've applied these skills
in your work.
6. **Professional Website**: Including your website is good, but ensure that it is
up-to-date and relevant. It should showcase any publications, projects, or further
details about your work.
## Suggestions for Changes:
- **Assistant Professor**: Add a short bullet list of 2-3 key projects, lectures, or
research contributions. For example: "Developed curriculum on decentralized finance,
focusing on market microstructure."
- **Independent Consultant**: Provide more detail on your consulting work, especially if
it involves specific tools, companies, or data science projects that can show industry
relevance.
- **Combine Control Systems**: Describe the type of neural networks used, the impact on
the virtual engine testing process, and any collaborative aspects of the work.
---
In sum, emphasize achievements and measurable results, improve formatting consistency,
and make the experience section more detailed to highlight your impact in each role.
Thus, for any project it is important to evaluate what you need and what level of sophistication you require from the LLM.
However, to run a model like GPT-4 locally, you’d likely need 350GB+ of GPU memory, which could be achieved by using multiple high-end GPUs, such as the NVIDIA A100 (40GB or 80GB versions). For smaller or quantized versions, 40-80GB of GPU memory might suffice, which could be handled by 1-2 GPUs like the NVIDIA A100 or RTX 3090. A full setup for GPT-4’s largest variant could easily exceed $50,000 depending on the number of GPUs needed.
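For open-weight models, one common way to shrink these memory requirements is quantization. A minimal sketch with the transformers library, assuming the bitsandbytes and accelerate packages are installed and using an example open model (swap in any causal LM you have access to); loading the weights in 8-bit roughly halves the footprint compared to fp16:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # example open model, not GPT-4 (whose weights are not public)

quant_config = BitsAndBytesConfig(load_in_8bit=True)  # 8-bit weights to reduce GPU memory
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",  # spread layers across available GPUs/CPU
)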