Project Overview
This project provides an in-depth analysis of human rights violations in 2024, based on news data from Human Rights Watch. I've processed 401 articles to extract key information on accusations, risks, and victims, using Large Language Models (LLMs) to convert qualitative data into quantitative insights.
Methodology
1. Data Collection
I collected links to 401 Human Rights Watch articles using the GDELT API, focusing on publications from January 1, 2024, to August 28, 2024. I then scraped each link to extract the article's content and metadata.
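For reference, the collection step looks roughly like the sketch below. It is a minimal illustration rather than my exact scraper: the GDELT query string, date window, and HTML parsing are simplified assumptions, and it relies on the `requests` and `beautifulsoup4` packages.

```python
# Minimal sketch of the collection step, assuming the GDELT DOC 2.0 API
# with a "domain:hrw.org" query. The JSON field names follow the public
# GDELT documentation; the HTML parsing is illustrative only.
import requests
from bs4 import BeautifulSoup

GDELT_DOC_API = "https://api.gdeltproject.org/api/v2/doc/doc"

def fetch_article_links(start="20240101000000", end="20240828000000", max_records=250):
    """Ask GDELT for HRW article links published in the given window."""
    params = {
        "query": "domain:hrw.org",
        "mode": "ArtList",
        "format": "json",
        "startdatetime": start,
        "enddatetime": end,
        "maxrecords": max_records,  # one request; split the window to page past this limit
    }
    response = requests.get(GDELT_DOC_API, params=params, timeout=30)
    response.raise_for_status()
    return [article["url"] for article in response.json().get("articles", [])]

def scrape_article(url):
    """Download one article and pull out its title and paragraph text."""
    page = requests.get(url, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    body = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))
    return {"url": url, "title": title, "text": body}
```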
2. Data Processing
I employed Large Language Models (LLMs), specifically OpenAI's GPT-4o-mini model, to convert qualitative textual data into structured quantitative data. This approach allowed me to systematically categorize and quantify complex human rights information.
LLM-Powered Qualitative to Quantitative Conversion:
- Accusation Categories: I used the LLM to identify and categorize specific types of human rights violations mentioned in each article. The model was prompted to extract and classify accusations, enabling me to quantify the frequency and distribution of different violation types.
- Risk Analysis: The LLM analyzed articles to identify potential long-term consequences of human rights violations. By structuring the model's output, I converted qualitative assessments of risks into quantifiable data points.
- Victim Demographics: I leveraged the LLM to extract and categorize information about affected groups. This allowed me to transform descriptive text into structured data on victim demographics.
For each category, I used carefully crafted prompts to guide the LLM in extracting relevant information and formatting it in a consistent, quantifiable manner. This process, sketched in the code example after the list below, involved:
- Feeding article text to the LLM with specific instructions.
- Parsing the LLM's structured output into a format suitable for data analysis.
- Aggregating results across all articles to generate quantitative datasets.
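The sketch below illustrates this loop for a single article, plus the aggregation across articles. The prompt wording and the JSON keys (accusations, risks, victims) are simplified stand-ins for my actual prompts, and the code assumes the openai>=1.x Python SDK.

```python
# Minimal sketch of the extraction step: send each article to GPT-4o-mini,
# request a JSON reply, and flatten the results for later analysis.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are analyzing a human rights news article. Return a JSON object with "
    "three keys: 'accusations' (list of violation categories), 'risks' (list of "
    "potential long-term consequences), and 'victims' (list of affected groups)."
)

def extract_structured_data(article_text: str) -> dict:
    """Send one article to GPT-4o-mini and parse its structured JSON reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": article_text},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

def aggregate(articles: list[dict]) -> list[dict]:
    """Run the extractor over every scraped article and flatten the output
    into per-category records suitable for a pandas DataFrame."""
    records = []
    for article in articles:
        data = extract_structured_data(article["text"])
        for category in ("accusations", "risks", "victims"):
            for label in data.get(category, []):
                records.append({"url": article["url"], "category": category, "label": label})
    return records
```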
3. Data Analysis and Visualization
I used Python libraries such as pandas, matplotlib, and seaborn to process the quantified data and create visual representations. My analysis focused on identifying patterns, trends, and correlations within and across the three main categories: accusations, risks, and victims.
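As an example, the snippet below builds an accusation-frequency bar chart from the flattened records produced in the extraction step; the column names and figure styling are illustrative, not the exact code behind the published charts.

```python
# Sketch of the analysis step: count accusation labels across all articles
# and plot the most frequent categories.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame(records)  # records from aggregate(): columns url, category, label

accusations = df[df["category"] == "accusations"]
counts = accusations["label"].value_counts().head(15)

plt.figure(figsize=(8, 6))
sns.barplot(x=counts.values, y=counts.index)
plt.xlabel("Number of articles")
plt.title("Most frequent accusation categories")
plt.tight_layout()
plt.savefig("accusation_frequencies.png")
```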
Visualizations
My analysis is divided into three main categories, each represented by a series of data visualizations:
- Accusations: Bar charts and heatmaps showing the frequency and co-occurrence of different types of human rights violations, including political repression, arbitrary detention, and discrimination against minorities (see the co-occurrence sketch after this list).
- Risks: Network graphs and impact matrices illustrating the potential long-term consequences of violations, such as cultural erasure, economic instability, and erosion of civil liberties.
- Victims: Demographic breakdowns and geographical distributions of affected groups, including children, ethnic minorities, and journalists, presented through pie charts and choropleth maps.
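The co-occurrence heatmap mentioned above can be reproduced roughly as follows, reusing the per-article DataFrame from the analysis sketch; this is a simplified version of the actual figure code.

```python
# Sketch of a co-occurrence heatmap for accusation categories: one-hot encode
# labels per article, then count how often pairs of labels appear together.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

accusations = df[df["category"] == "accusations"]
onehot = pd.crosstab(accusations["url"], accusations["label"]).clip(upper=1)
cooccurrence = onehot.T @ onehot  # diagonal = per-label article counts

plt.figure(figsize=(10, 8))
sns.heatmap(cooccurrence, cmap="Reds")
plt.title("Co-occurrence of accusation categories")
plt.tight_layout()
plt.savefig("accusation_cooccurrence.png")
```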
Explore each category in detail using the navigation menu above to view the full set of visualizations and their interpretations.
Limitations and Future Work
While my LLM-based approach allows for efficient processing of large volumes of qualitative data, it's important to note potential limitations such as model biases and the need for human verification. Future work could involve refining my prompts, incorporating multiple LLMs for cross-validation, and integrating expert human review to further enhance the accuracy and reliability of my findings.