Python | pandas | How to use "describe"

January 9, 2024 / September 12, 2024

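Python is a high-level, general-purpose programming language known for its readability and simplicity. It is widely used in web development, data science, artificial intelligence, machine learning, and many other fields. Here, "level" refers to how close a language is to the hardware: the lower the level, the closer the language is to the hardware.

pandas is a popular open-source data manipulation and analysis library for Python. In addition to tools for reading and writing data in different formats, it provides data structures for storing data efficiently and for manipulating large datasets.

pandas has two main classes: "Series", which handles one-dimensional data, and "DataFrame", which handles two-dimensional data. "Attributes" and "Methods" are used to manipulate and analyze this data.

"Attributes" are properties or characteristics of a "Series" or "DataFrame" and provide information about the data structure. "Methods", on the other hand, are functions that perform specific operations on the data, such as calculations and transformations.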
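The difference between the two can be seen in a minimal sketch. The small DataFrame below is hypothetical sample data used only for illustration; "shape" and "columns" are attributes, while "mean" is a method.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({"Age": [25, 30, 22], "Height": [160, 180, 175]})

# Attributes: accessed without parentheses, they describe the data structure
shape = df.shape              # (number of rows, number of columns)
columns = list(df.columns)    # column labels

# Method: called with parentheses, it performs a computation on the data
mean_age = df["Age"].mean()

print(shape)
print(columns)
print(mean_age)
```

An attribute only reads information that the object already holds, whereas a method runs an operation and returns a new result.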

This article describes how to use "describe", one of the "Methods".

1. Environment
2. Description of "describe"
- 2.1. Items for numeric data
- 2.2. Items for string and other data
3. How to use "describe"
4. References

Environment

How to check each version is described here.

OS: Windows11
VS Code: 1.85.1
Python 3.12.0
Pandas 2.1.4
Numpy 1.26.2

Description of "describe"

"describe" is one of the "Methods" and generates summary statistics for each column of the data.

pandas | pandas.DataFrame.describe

Items for numeric data

When the data is numeric, the following items are output.

count: number of non-null values
mean: mean
std: standard deviation
min: minimum
25%: 25th percentile (first quartile)
50%: median
75%: 75th percentile (third quartile)
max: maximum
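To make the meaning of each item concrete, the following sketch recomputes them one by one with individual Series methods and compares the result with "describe". The variable names "manual", "described", and "max_diff" are chosen here for illustration only.

```python
import pandas as pd

age = pd.Series([25, 30, 22, 25, 35], name="Age")

# Recompute each item of describe() with the corresponding Series method
manual = pd.Series({
    "count": float(age.count()),   # number of non-null values
    "mean": age.mean(),
    "std": age.std(),              # sample standard deviation (ddof=1 by default)
    "min": float(age.min()),
    "25%": age.quantile(0.25),     # linear interpolation by default
    "50%": age.median(),
    "75%": age.quantile(0.75),
    "max": float(age.max()),
}, name="Age")

described = age.describe()

# Largest absolute difference between the two results
max_diff = (manual - described).abs().max()
print(manual)
print(max_diff)
```

Note that "std" is the sample standard deviation (divided by n - 1), which is the default of "Series.std".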

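Items for string and other data

When the data is strings or similar, "count" is output together with the following items. The other numeric items such as "mean" and "std" are not calculated for these columns and appear as NaN when all columns are output together.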

unique: number of distinct values
top: most frequent value
freq: frequency of the most frequent value
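As a rough illustration of these items, the sketch below derives them by hand from "value_counts" and "nunique" for a string Series. The dictionary "manual" is a name introduced here for illustration.

```python
import pandas as pd

city = pd.Series(
    ["New York", "San Francisco", "Los Angeles", "New York", "Seattle"],
    name="City",
)

# value_counts() sorts by frequency, so the first entry is the most frequent value
counts = city.value_counts()

manual = {
    "count": int(city.count()),     # number of non-null values
    "unique": int(city.nunique()),  # number of distinct values
    "top": counts.index[0],         # most frequent value
    "freq": int(counts.iloc[0]),    # how often the most frequent value occurs
}

described = city.describe()
print(manual)
print(described)
```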

How to use "describe"

Output for numeric data

Using "df.describe()" outputs the statistics for the numeric columns in the DataFrame.

Enter the following in a .py file and run it.

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David'],
        'Age': [25, 30, 22, 25, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles', 'New York', 'Seattle'],
        'Height': [160, 180, 175, 165, 190]}

df = pd.DataFrame(data)

print(df.describe())

■ Output

             Age      Height
count   5.000000    5.000000
mean   27.400000  174.000000
std     5.128353   11.937336
min    22.000000  160.000000
25%    25.000000  165.000000
50%    25.000000  175.000000
75%    30.000000  180.000000
max    35.000000  190.000000
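The 25%, 50%, and 75% rows are the defaults. As a hedged sketch, the "percentiles" parameter of "describe" can be used to request other percentiles; the values 0.1 and 0.9 below are arbitrary choices for illustration, and 0.5 is listed explicitly so that the median row is always present.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David'],
        'Age': [25, 30, 22, 25, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles', 'New York', 'Seattle'],
        'Height': [160, 180, 175, 165, 190]}

df = pd.DataFrame(data)

# percentiles takes values between 0 and 1
custom_stats = df.describe(percentiles=[0.1, 0.5, 0.9])
print(custom_stats)
```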

Output for all data

Using "df.describe(include='all')" outputs the statistics for all columns in the DataFrame, including string columns.

Enter the following in a .py file and run it.

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David'],
        'Age': [25, 30, 22, 25, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles', 'New York', 'Seattle'],
        'Height': [160, 180, 175, 165, 190]}

df = pd.DataFrame(data)

print(df.describe(include='all'))

■ Output

         Name        Age      City      Height
count       5   5.000000         5    5.000000
unique      4        NaN         4         NaN
top     Alice        NaN  New York         NaN
freq        2        NaN         2         NaN
mean      NaN  27.400000       NaN  174.000000
std       NaN   5.128353       NaN   11.937336
min       NaN  22.000000       NaN  160.000000
25%       NaN  25.000000       NaN  165.000000
50%       NaN  25.000000       NaN  175.000000
75%       NaN  30.000000       NaN  180.000000
max       NaN  35.000000       NaN  190.000000
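If only the non-numeric columns are of interest, the NaN rows can be avoided. The following is a minimal sketch using the "exclude" parameter of "describe" with the "number" dtype selector; the variable name "text_stats" is chosen here for illustration.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David'],
        'Age': [25, 30, 22, 25, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles', 'New York', 'Seattle'],
        'Height': [160, 180, 175, 165, 190]}

df = pd.DataFrame(data)

# Exclude numeric columns so that only Name and City are summarized
text_stats = df.describe(exclude=['number'])
print(text_stats)
```

Because only string columns remain, the result contains just the "count", "unique", "top", and "freq" rows.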

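Output for a single specific column

To output the data for the "Age" column, use "df['Age'].describe()", which outputs the statistics for the "Age" column in the DataFrame.

Enter the following in a .py file and run it.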

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David'],
        'Age': [25, 30, 22, 25, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles', 'New York', 'Seattle'],
        'Height': [160, 180, 175, 165, 190]}

df = pd.DataFrame(data)

print(df['Age'].describe())

■ Output

count     5.000000
mean     27.400000
std       5.128353
min      22.000000
25%      25.000000
50%      25.000000
75%      30.000000
max      35.000000
Name: Age, dtype: float64
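The result of "describe" on a single column is itself a Series, so individual statistics can be picked out by label. The sketch below shows one possible use; the derived values "age_range" and "iqr" are illustrative additions, not part of the output of "describe".

```python
import pandas as pd

age = pd.Series([25, 30, 22, 25, 35], name="Age")

stats = age.describe()   # the result is a Series indexed by statistic name

mean_age = stats["mean"]
age_range = stats["max"] - stats["min"]   # range of the values
iqr = stats["75%"] - stats["25%"]         # interquartile range

print(mean_age, age_range, iqr)
```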

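Output for multiple specific columns

To output the data for the "Age" and "City" columns, use "df[['Age', 'City']].describe(include='all')", which outputs the statistics for the "Age" and "City" columns in the DataFrame.

Enter the following in a .py file and run it.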

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David'],
        'Age': [25, 30, 22, 25, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles', 'New York', 'Seattle'],
        'Height': [160, 180, 175, 165, 190]}

df = pd.DataFrame(data)

selected_columns = df[['Age', 'City']].describe(include='all')
print(selected_columns)

■ Output

              Age      City
count    5.000000         5
unique        NaN         4
top           NaN  New York
freq          NaN         2
mean    27.400000       NaN
std      5.128353       NaN
min     22.000000       NaN
25%     25.000000       NaN
50%     25.000000       NaN
75%     30.000000       NaN
max     35.000000       NaN
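When "describe" is applied to a DataFrame, the result is also a DataFrame, so a single cell can be read with "loc" and the table can be transposed with "T". This is a minimal sketch; the variable names are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 22, 25, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'New York', 'Seattle'],
})

summary = df.describe(include='all')   # rows: statistics, columns: Age and City

mean_age = summary.loc['mean', 'Age']
top_city = summary.loc['top', 'City']
# Statistics that do not apply to a column are stored as NaN
city_mean_missing = pd.isna(summary.loc['mean', 'City'])

# Transpose so that each original column becomes one row
by_column = summary.T

print(mean_age, top_city, city_mean_missing)
print(by_column)
```

The transposed form can be easier to read when the DataFrame has many columns.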

References

pandas | pandas.DataFrame.describe

That is all.


Posted by クマガイ
