极速揭示见解

在短短9分钟内使用ChatGPT和Python对Makeover Monday数据集进行分析

对于本文，我使用的是 ChatGPT-4（5月24日版），没有插件或代码解释器，并在 VSCode 中的 Jupyter Notebook 中使用 Python 执行代码。

我的目标是使用ChatGPT尽快分析一个简单数据集，从头开始构建所有代码。

简述- 9分钟⚡?

数据集：2023 / W24：全球谷物产量

提示1：

You are the perfect Python data analyst and possess a combination of technical skills, analytical thinking, and domain knowledge to effectively manipulate and interpret data.

Write a Python program that contains the following functions
# 1. Ingest data handling for dynamic file types
def ingest_data(file_name):

# 2. Update the dataframe to ensure that the column data types are correct and all integers are stored as floats
def convert_dtypes(df):

# 3. Dynamically scan any dataframe and apply standard cleaning and at least 10 data quality tests 
def clean_and_validate(df):
# Standard cleaning 
# Drop duplicates
# Data quality tests (All tests must check if a coulmn is an int and handle invalid type errors)
return df, tests

# 4. provide an abridged summary of the dataset
def summary(df):

# Test with a file
df = ingest_data(r'filename.csv')
print(df)
df = convert_dtypes(df)
df, tests = clean_and_validate(df)
print("Data Quality Tests:")
for test, result in tests.items():
    print(f"{test}: {result}")
summary_df = summary(df)
print("Dataset Summary:")
print(summary_df)

回答1：

import pandas as pd
import os

def ingest_data(file_name):
    """Read data from csv, excel, or json file"""
    _, file_extension = os.path.splitext(file_name)
    
    if file_extension == '.csv':
        df = pd.read_csv(file_name)
    elif file_extension == '.xlsx':
        df = pd.read_excel(file_name)
    elif file_extension == '.json':
        df = pd.read_json(file_name)
    else:
        raise ValueError(f'Unsupported file type {file_extension}')
    
    return df

def convert_dtypes(df):
    """Convert integer types to float"""
    for col in df.select_dtypes(include='integer').columns:
        df[col] = df[col].astype(float)
    return df

def clean_and_validate(df):
    """Standard cleaning and quality tests"""
    # Standard cleaning 
    df = df.drop_duplicates()

    # Data quality tests
    tests = {}

    for col in df.columns:
        try:
            # Check if column can be converted to int
            df[col].astype(int)
            tests[col] = "Passed"
        except ValueError:
            tests[col] = "Failed"

    return df, tests

def summary(df):
    """Provide a brief summary of the dataset"""
    return df.describe(include='all')

# Test with a file
df = ingest_data(r'filename.csv')
print(df)
df = convert_dtypes(df)
df, tests = clean_and_validate(df)
print("Data Quality Tests:")
for test, result in tests.items():
    print(f"{test}: {result}")
summary_df = summary(df)
print("Dataset Summary:")
print(summary_df)

更新文件名为filename.csv并执行

执行1：（为了易读而截短）

            Entity Code  Year  Area harvested (hectares)  Production (tonnes)   
0      Afghanistan  AFG  1961                  3313500.0           3695000.00  \
1      Afghanistan  AFG  1962                  3425500.0           3696000.00   
2      Afghanistan  AFG  1963                  3426500.0           3378000.00   
3      Afghanistan  AFG  1964                  3446500.0           3732000.00   
4      Afghanistan  AFG  1965                  3444500.0           3785000.00   
...            ...  ...   ...                        ...                  ...   
17768     Zimbabwe  ZWE  2017                  1452971.0           1748338.90   
17769     Zimbabwe  ZWE  2018                  1419684.0           1780775.60   
17770     Zimbabwe  ZWE  2019                  1368117.0           1023357.94   
17771     Zimbabwe  ZWE  2020                  1446029.0           1660855.20   
17772     Zimbabwe  ZWE  2021                  1322342.0           2043297.60   

       Yield (tonnes per hectare)  Population (historical estimates)  
0                          1.1151                          8790140.0  
1                          1.0790                          8969055.0  
2                          0.9858                          9157463.0  
3                          1.0828                          9355510.0  
4                          1.0989                          9565154.0  
...                           ...                                ...  
17768                      1.2033                         14751101.0  
17769                      1.2543                         15052191.0  
17770                      0.7480                         15354606.0  
17771                      1.1486                         15669663.0  
17772                      1.5452                         15993525.0  
...
25%                          3.253180e+05  
50%                          4.365712e+06  
75%                          1.875828e+07  
max                          7.909295e+09

提示2：（将完整回复1复制粘贴在[]内）

You are the perfect Python data analyst and possess a combination of technical skills, analytical thinking, and domain knowledge to effectively manipulate and interpret data.

I will provide you with a the summary of a [dataframe] 

[
]

you will respond with code to run in a jupyter notebook that will provide 30 different analytical and statistical figures and print() the description of these figures before the values.
Also, please extract any other information you may need to conduct and analysis on the dataset.
All in one block of code please.
Only Print() text and values, no plots

回复2：

import pandas as pd

# assuming df is your dataframe
# df = your_dataframe 

# Clean up data: drop NaNs and duplicates
df_clean = df.dropna().drop_duplicates()

# Calculate different metrics

# 1. Total number of unique countries/entities
print("1. Total number of unique countries/entities:")
print(df_clean['Entity'].nunique())

# 2. Total number of unique years
print("\n2. Total number of unique years:")
print(df_clean['Year'].nunique())

# 3. Total area harvested across all entities and years
print("\n3. Total area harvested across all entities and years:")
print(df_clean['Area harvested (hectares)'].sum())

# 4. Average area harvested per year
print("\n4. Average area harvested per year:")
print(df_clean.groupby('Year')['Area harvested (hectares)'].mean())

# 5. Total production across all entities and years
print("\n5. Total production across all entities and years:")
print(df_clean['Production (tonnes)'].sum())

# 6. Average production per year
print("\n6. Average production per year:")
print(df_clean.groupby('Year')['Production (tonnes)'].mean())

# 7. Maximum yield across all entities and years
print("\n7. Maximum yield across all entities and years:")
print(df_clean['Yield (tonnes per hectare)'].max())

# 8. Minimum yield across all entities and years
print("\n8. Minimum yield across all entities and years:")
print(df_clean['Yield (tonnes per hectare)'].min())

# 9. Average yield per year
print("\n9. Average yield per year:")
print(df_clean.groupby('Year')['Yield (tonnes per hectare)'].mean())

# 10. Total population across all entities and years
print("\n10. Total population across all entities and years:")
print(df_clean['Population (historical estimates)'].sum())

# 11. Average population per year
print("\n11. Average population per year:")
print(df_clean.groupby('Year')['Population (historical estimates)'].mean())

# 12-16. 10th, 25th, 50th, 75th, 90th percentiles of the area harvested
print("\n12-16. 10th, 25th, 50th, 75th, 90th percentiles of the area harvested:")
print(df_clean['Area harvested (hectares)'].quantile([0.1, 0.25, 0.5, 0.75, 0.9]))

# 17-21. 10th, 25th, 50th, 75th, 90th percentiles of the production
print("\n17-21. 10th, 25th, 50th, 75th, 90th percentiles of the production:")
print(df_clean['Production (tonnes)'].quantile([0.1, 0.25, 0.5, 0.75, 0.9]))

# 22-26. 10th, 25th, 50th, 75th, 90th percentiles of the yield
print("\n22-26. 10th, 25th, 50th, 75th, 90th percentiles of the yield:")
print(df_clean['Yield (tonnes per hectare)'].quantile([0.1, 0.25, 0.5, 0.75, 0.9]))

# 27-30. 10th, 25th, 50th, 75th, 90th percentiles of the population
print("\n27-30. 10th, 25th, 50th, 75th, 90th percentiles of the population:")
print(df_clean['Population (historical estimates)'].quantile([0.1, 0.25, 0.5, 0.75, 0.9]))

# Extract years for further analysis
unique_years = df_clean['Year'].unique()
print("\nUnique years for further analysis:")
print(unique_years)

# Extract entity names for further analysis
unique_entities = df_clean['Entity'].unique()
print("\nUnique entities for further analysis:")
print(unique_entities)

执行2：（为了易读性而截断）

1. Total number of unique countries/entities:
191

2. Total number of unique years:
61

3. Total area harvested across all entities and years:
84851339650.0

4. Average area harvested per year:
Year
1961.0    8.455053e+06
1962.0    8.530493e+06
1963.0    8.600838e+06
1964.0    8.744941e+06
1965.0    8.704869e+06
              ...     
2017.0    8.003554e+06
2018.0    7.980563e+06
2019.0    7.937625e+06
2020.0    8.072692e+06
2021.0    8.170379e+06
Name: Area harvested (hectares), Length: 61, dtype: float64

5. Total production across all entities and years:
...
 'Trinidad and Tobago' 'Tunisia' 'Turkey' 'Turkmenistan' 'USSR' 'Uganda'
 'Ukraine' 'United Arab Emirates' 'United Kingdom' 'United States'
 'Uruguay' 'Uzbekistan' 'Vanuatu' 'Venezuela' 'Vietnam' 'World' 'Yemen'
 'Yugoslavia' 'Zambia' 'Zimbabwe']

提示3：

You are a Python Data Visualization Specialist: You are in charge of creating data visualizations using Python libraries.

Expert Skills: Proficiency in Python, expertise in Python visualization libraries (Matplotlib, Seaborn), a deep understanding of data analysis, and highly skilled in designing visually appealing and informative representations of data.

I will provide you with the [key features] of a dataset, this dataset is already in memory in the dataframe df 

[ 
 ]


Please prepare code for a jupyter notebook that will:
1. print(a detailed paragraph summary of the dataset in natural language)
2. Analyse the dataset and describe the 5 most interesting and useful insights
3. Use your creativity to come up with the most effective charts to represent these insights, please use no Donut or Pie charts
4. Please provide 2 additional mind-blowing, out-of-the-box insights and charts

Notes: 
Please ensure that legends are hidden for categories with more than 10 unique items
Please ensure there are no more than 10 unique categories on any axis

无论 Response 3 是什么，我们总是运行解析器提示。

提示4：(解析器)

You are a resolver tasked with 
1) finding any errors or potential flaws in the last response
2) improving that answer, and 
3) Printing the improved answer in full.
4) ensure that all required libraries are in the response

Let's work this out in a step by step way to be sure we have the right answer

回复4：

在本次运行中，出现了一个有趣的 bug，在第 3 次回应和第 4 次回应之间，ChatGPT 丢掉了对于洞见 1 到 4 的 plt.show()，因此我已经编辑了下面的代码。

# Importing required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
# df = pd.read_csv('your_data.csv')  # Uncomment this line to load data from a csv file
# This data loading step depends on your data source and format

# Ensure pandas won't cut off our chart labels
pd.set_option('display.max_colwidth', None)

# Generate a summary paragraph for the data
print(f"This dataset consists of agricultural data from {df['Entity'].nunique()} unique countries/entities spanning {df['Year'].nunique()} years from 1961 to 2021. \
The dataset contains information on the area harvested, production, and yield of various crops, as well as historical population estimates. \
Over the time period, a total area of {df['Area harvested (hectares)'].sum():.2f} hectares was harvested, resulting in a total production of {df['Production (tonnes)'].sum():.2f} tonnes of crops. \
Yields have varied significantly, with an average yield of {df['Yield (tonnes per hectare)'].mean():.2f} tonnes per hectare, ranging from {df['Yield (tonnes per hectare)'].min():.4f} to {df['Yield (tonnes per hectare)'].max():.2f} tonnes per hectare. \
The total population across all entities and years is approximately {df['Population (historical estimates)'].sum():.2f}.")

# Identify the five most interesting insights and represent them visually
# Note: These are just example insights. The specific insights you can extract from your data will depend on its characteristics

# Insight 1: Overall trends in area harvested, production, and yield over time
df.groupby('Year')[['Area harvested (hectares)', 'Production (tonnes)', 'Yield (tonnes per hectare)']].mean().plot()
plt.title('Overall trends in area harvested, production, and yield over time')
plt.ylabel('Average')
plt.xlabel('Year')
plt.show() # Not originally in response 4

# Insight 2: Top 10 countries with the highest total production
top_producing_countries = df.groupby('Entity')['Production (tonnes)'].sum().nlargest(10)
top_producing_countries.plot(kind='barh')
plt.title('Top 10 countries with highest total production')
plt.ylabel('Country')
plt.xlabel('Total production (tonnes)')
plt.show() # Not originally in response 4

# Insight 3: Yield over time for top 5 producing countries
top_countries = top_producing_countries.index.tolist()
df_top_countries = df[df['Entity'].isin(top_countries)]
sns.lineplot(x='Year', y='Yield (tonnes per hectare)', hue='Entity', data=df_top_countries)
plt.title('Yield over time for top 5 producing countries')
plt.show() # Not originally in response 4

# Insight 4: Correlation between area harvested, yield, and production
sns.heatmap(df[['Area harvested (hectares)', 'Yield (tonnes per hectare)', 'Production (tonnes)']].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation between area harvested, yield, and production')
plt.show() # Not originally in response 4

# Insight 5: Distribution of population for the top 5 most populous countries
top_populated_countries = df.groupby('Entity')['Population (historical estimates)'].max().nlargest(5).index.tolist()
df_top_populated = df[df['Entity'].isin(top_populated_countries)]
sns.boxplot(x='Entity', y='Population (historical estimates)', data=df_top_populated)
plt.title('Distribution of population for the top 5 most populous countries')
plt.ylabel('Population')
plt.xlabel('Country')
plt.show()

执行4：

这个数据集包含来自301个独特的国家/实体的农业数据，跨越了61年，从1961年到2021年。该数据集包含有关各种作物的收获面积、生产量和产量的信息，以及历史人口估计数据。在该时间段内，总共收获了291356015846.00公顷的面积，产出了789506968554.36吨农产品。产量差异很大，平均收成为每公顷2.53吨，从每公顷0.0001吨到36.76吨不等。跨所有实体和年份的总人口约为1686757048817.00。